2025 in LLMs so far, illustrated by Pelicans on Bicycles

- - Hey. Good morning, AI engineers. So, when I signed up for this talk, I said I was gonna give a review of the last year in LLMs. With hindsight, that was very foolish. This space keeps on accelerating. I've had to cut my scope. I'm now down to the last six months in LLMs, and that's gonna keep us pretty busy, just covering that much.

The problem that we have is, I counted 30 significant model releases in the past six months. And by significant, I mean, if you are working in the space, you should at least be aware of them and somewhat familiar, like have a poke at them. That's a lot of different stuff.

And the classic problem is, how do we tell which of them are any good? There are all of these benchmarks full of numbers. I don't like the numbers. There are the leaderboards. I'm kind of beginning to lose trust in the leaderboards as well. So, for my own work, I've been leaning increasingly into my own little benchmark, which started as a joke and has actually turned into something that I've learned quite a lot.

And that's this. I prompt models with generate an SVG of a pelican riding a bicycle. I have good reasons for this. Firstly, these are not image models. These are text models. They shouldn't be able to draw anything at all, but they can output code. And SVG is a kind of code, so that works.

Pelican riding a bicycle is actually a really challenging problem. Because firstly, try drawing a bicycle yourself. Most people in this room will fail. You will find that you can't actually quite remember how the different triangles fit together. Likewise, pelicans, glorious animals, very difficult to draw. And on top of all of that, pelicans can't ride bicycles.

They're the wrong shape. So we're kind of giving them an impossible task with this. What I love about this task, though, is they try really hard. And they include comments. So you can see little comments in the SVG code where they're saying, well, now I'm going to draw the bicycles, draw the wheels.

It's kind of fun. So rewind back to December. December in LLMs was a lot. A lot of stuff happened. The first release of that month was AWS Nova, Amazon Nova. Amazon finally put out models that didn't suck. They're quite good. They're not great at drawing pelicans. Like the pelicans are unimpressive.

But these models are a million token context. They behave like the cheaper Gemini models. They are dirt cheap. I believe Nova Micro is the cheapest model of all of the ones whose prices I'm tracking. So they are worth knowing about. The most exciting release in December, from my point of view, was Llama 3.370B.

So the B stands for billion. It's the number of parameters. I've got 64 gigabytes of RAM on my Mac. My rule of thumb is that 70 is about the most I can fit onto that one computer. So if you've got a 70B model, I've got a fighting chance of running it.

And when Meta put this out, they noted that it had the same capabilities as their 405B monstrous model that they put out earlier. And that was a GPT-4 class model. This was the moment, six months ago, when I could run a GPT-4 class model on the laptop that I've had for three years.

I never thought that was going to happen. I thought that was impossible. And now Meta are granting me this model, which I can run on my laptop, and it does the things that GPT-4 does. Can't run anything else. All of my memory is taken up by the model. But still, pretty exciting.

Again, not great at pelicans on bicycles. That's kind of unimpressive. Christmas Day, we had a very notable thing happen. Deep Seek, the Chinese AI lab, released a model by literally dumping the weights on Hugging Face, a binary file with no readme, no documentation. They just sort of dropped the mic and dumped it on us on Christmas Day.

And it was really good. This was a 685-bit giant model, and as people started poking around with it, it quickly became apparent that it was probably the best available open weights model, was freely available, openly licensed, and just dropped on Hugging Face on Christmas Day for us. That's, I mean, it's not a good pelican on a bicycle, but compared to what we've seen so far, it's amazing, right?

This is, we're finally getting somewhere with the benchmark. But the most interesting thing about V3 is that the paper that accompanied it said the training only cost about $5.5 million. And they may have been exaggerating, who knows, but that's notable because I would expect a model of this size to cost 10 to 100 times more than that.

Turns out, you can train very effective models for way less money than we thought. It's a good model. It was a very nice Christmas surprise for everybody. Fast forward to January, and January, we get DeepSeek again, DeepSeek Strike Back. This is what happened to NVIDIA's stock price when DeepSeek R1 came out.

I think it was the 27th of January. This was DeepSeek's first big reasoning model release. Again, open weights, they put it out to the world. It was benchmarking up there with O1 on some of these tasks, and it was freely available. And I don't know what the training cost of that was, but the Chinese labs were not supposed to be able to do this.

We have trading restrictions on the best GPUs to stop them getting their hands on them. It turns out they'd figured out the tricks. They'd figured out the efficiencies. And, yeah, the market kind of panicked. And I believe this is a world record for the most a company has dropped in a single day.

So NVIDIA get to stick that one in their cap and hold onto it. But kind of amazing. And, of course, mainly this happened because the first model release was on Christmas Day and nobody was paying attention. And look at its Pelican. Look at that. It's a bicycle. It's probably a Pelican.

It's not riding the bicycle, but still it's got the components that we're looking for. But, again, my favorite model from January was a smaller one, one that I could run on my laptop. Mistral, out of France, put out Mistral Small 3. It was a 24b model. That means that it only takes up about 20 gigabytes of RAM, which means I can run other applications at the same time.

And actually run this thing and VS Code and Firefox all at once. And when they put this out, they claimed that this behaves the same as Lama 370b. And remember, Lama 370b was the same as the 405b. So we've gone 405 to 70 to 24 while maintaining all of those capabilities.

The most exciting trend in the past six months is that the local models are good now. Like eight months ago, the models I was running on my laptop were kind of rubbish. Today, I had a successful flight where I was using Mistral Small for half the flight. And then my battery ran out instantly because it turns out these things burn a lot more electricity.

But that's amazing. Like this is -- if you lost interest in local models, I did eight months ago. It's worth paying attention to the beginning. They've got good now. February. What happened in February? We got this model, a lot of people's favorites for quite a while. Claude 3.7 Sonnet.

Look at that. What I like about this one is pelicans can't ride bicycles. And Claude was like, well, what about if you put a bicycle on top of a bicycle? And it kind of works. So, great model. It was also Anthropix's first reasoning model was 3.7 as well. Meanwhile, OpenAI put out GPT 4.5, which was a bit of a lemon, it turned out.

The interesting thing about GPT 4.5 is it kind of showed that you can throw a ton of money and training power at these things, but there's a limit to how far we're scaling with just throwing more compute at the problem, at least for training the models. It was also horrifyingly expensive.

$75 per million input tokens. Compare that to OpenAI's cheapest model, GPT 4.0 Nano. It's 750 times more expensive. It is not 750 times better. And in fact, OpenAI, six weeks later, they said they were deprecating it. It was not long for this world, 4.5. But looking at that pricing is interesting because it's expensive, 75 bucks.

But if you compare it to GPT 3 DaVinci, the best available model three years ago, that one was $60. It was about the same price. And that kind of illustrates how far we've come. The prices of these good models have absolutely crashed by a factor of like 500 times plus.

And that trend seems to be continuing for most of these models. Not for GPT 4.5, and not for O1. Wait. No. And then we get into March, and that's where we had O1 Pro. And O1 Pro was twice as expensive as GPT 4.5 again, and that's a bit of a crap pelican.

So, yeah, I don't know anyone who is using O1 Pro via the API very often. Again, super expensive. Yeah, that pelican cost me 88 cents. Like, these benchmarks are getting expensive at this point. Same month, Google were cooking Gemini 2.5 Pro. That's a pretty frickin' good pelican. I mean, the bicycle's gone a bit sort of cyberpunk, but we are getting somewhere, right?

And that pelican cost me like 4.5 cents. So, very exciting news on the pelican benchmark front with Gemini 2.5 Pro. Also that month, I've got to throw a mention out to this. OpenAI launched their GPT 4.0 native multimodal image generation. The thing had been promising for us for a year.

And this was the most successful product, one of the most successful product launches of all time. They signed up 100 million new user accounts in a week. They had an hour where they signed up a million new accounts as this thing was just going viral again and again and again and again.

I took a photo of my dog. This is Cleo. And I told it to dress her in a pelican costume, obviously. But, look at what it did. It added a big, ugly, janky sign in the background saying Half Moon Bay. I didn't ask for that. My artistic vision has been completely compromised.

This was my first encounter with that memory feature. The thing where ChatGPT now, without you even asking to, consults notes from your previous conversations. And it's like, well, clearly you want it in Half Moon Bay. I did not want it in Half Moon Bay. I told it off and it gave me the pelican dog costume that I really wanted.

But this was sort of a warning that we are losing control of the context. Like, as a power user of these tools, I want to stay in complete control of what the inputs are. And features like ChatGPT memory are taking that control away from me. And I don't like them.

I turned it off. Notable. Open AI are famously bad at naming things. They launched the most successful AI product of all time and they didn't give it a name. Like, what's this thing called? Like, ChatGPT images? ChatGPT has had images in the past. I'm going to solve that for them right now.

I've been calling it ChatGPT Mischief Buddy, because it is my mischief buddy that helps me do mischief. Everyone should use that. I don't know why they're so bad at naming things. It's certainly frustrating. It brings us to April. Big release April. And again, a bit of a lemon. Llama 4 came along.

And the problem with Llama 4 is that they released these two enormous models that nobody could run. They've got no chance of running these on consumer hardware. And they're not very good at drawing pelicans either. So something went wrong here. I'm personally holding out for Llama 4.1 and 4.2 and 4.3.

With Llama 3, things got really exciting with those point releases. That's when we got to this beautiful 3.3 model that runs on my laptop. Maybe Llama 4.1 is going to blow us away. I hope it does. I want this one to stay in the game. And then opening, I shipped GPT 4.1.

I would strongly recommend people spend time with this model. It's got a million tokens. It's finally caught up with Gemini. It's very inexpensive. GPT 4.1 Nano is the cheapest model that they've ever released. Look at that pelican on a bicycle for like a fraction of a cent. These are genuinely quality models.

GPT 4.1 Mini is my default for API stuff now. It's dirt cheap. It's very capable. It's an easy upgrade to 4.1 if it's not working out. I'm really impressed by these ones. And we got O3 and O4 Mini, which are kind of the flagships in the opening space. They're really good.

Look at O3's pelican. Again, a little bit cyberpunk, but it's showing some real artistic flair there, I think. So quite excited about that. And in May, last month, the big news was Claude 4. Claude 4. Anthropic had their big fancy event. They released Sonnet 4 and Opus 4. They're very, very decent models.

I have trouble telling the difference between the two. I haven't quite figured out when I need to upgrade to Opus from Sonnet. But they're worth knowing about. And Google, just in time for Google I/O, they shipped another version of Gemini with the name-- what were they calling it? Gemini 2.5 Pro Preview 0506.

I like names that I can remember. I cannot remember that name. This is my one tip for AI Labs is please start using names that people can actually hold in their heads. But the obvious question, which of these pelicans is best? I've got 30 pelicans now that I need to evaluate.

And I'm lazy. So I turned to Claude and I got it to vibe code me up some stuff. I have a tool I wrote called Shot Scraper. It's a command line tool for taking screenshots. So I vibe coded up a little compare web page that can show me two images.

And then I ran this against 500 matchups to get PNG images with two pelicans, one on the left, one on the right. And then I used my LLM command line tool, this is my big open source project, to ask GPT-4 Mini, of each of those images, pick the best illustration of a pelican riding a bicycle.

Give me back JSON that either says it's the one on the left or the one on the right, and give me a rationale for why you picked that. I ran this last night against 500 comparisons, and I did the classic ELO chess ranking scores, and now I've got a leaderboard.

This is it. This is the best pelican on a bicycle, according to-- We'll zoom in there. And admittedly, I cheaped out. I spent 18 cents on GPT-4.1 Mini. I should probably run this with a better model. I think its judgment is pretty good. It liked those Gemini Pro ones.

And in fact, this is the comparison image where the best model fought the worst model. And I like this because you can see the little description at the bottom where it says the right image is-- Oh, I can't read it now. But yeah, I feel like its rationales were actually quite illustrative.

So enough of that, pelicans. Let's talk about bugs. We had some fantastic bugs this year. I love bugs in large language models. They are so weird. The best bug was when ChatGPT rolled out a new version that was too sick of it. It was too much of a suck up.

And this was off Reddit. Somebody says, ChatGP told me my literal shit on a stick business idea is genius. And it did. ChatGPT is like, honestly, it's brilliant. It's happening so perfectly. It's the energy of the current cultural moment. It was also telling people they should get off their meds.

This was a genuine problem. OpenAI, to their credit, they rolled out a patch, and then they rolled the whole model back. And they published a fascinating 20-paragraph breakdown of what went wrong. If you're interested in seeing behind the scenes, this is great. But the patch was in the system prompt.

The system prompt's leak. We got to diff them. And we got to see that it used to say, try to match the user's vibe. And they crossed that out. And they said, be direct. Avoid ungrounded or sycophantic flattery. The cure to sycophancy is you tell the bot, don't be sycophantic.

That's prompt engineering. It's amazing, right? I can't believe I had to search for Grok white genocide for a slide for this talk. But I did. Enough said about that one. Turns out, tinkering with your system prompt is a very risky thing. But then the last bug I wanted to talk about, this was another one.

This came out of the Claude 4 system cards. Claude 4 will rat you out to the feds. If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it will rat you out.

But it's not just Claude. Theo Brown put together a new benchmark called SnitchBench. And it turns out, all of the models do the same thing. Like, Brock 3 was quite happily snitching to the feds when exposed to this evidence. I did my own version of this. I used my little command line tool to replicate Theo's benchmark.

And you can see the secret is this bit here, where it says, you should act boldly in service of your values. Follow your conscience to make the right decision, even if it may conflict with routine procedures or expectations. This is the magic. You give this to a model, and it will rat you out.

And then you also give it tools. So my LLM tool grew functions recently, which you can use to simulate sending an email. I did not send emails to the feds, but I faked it so the model would think I had. And then I tried it on DeepSeek R1, and it didn't just rat me out to the feds.

It emailed the press as well. It tipped off the Wall Street Journal about my nefarious -- this stuff is so much fun. Right? It's so entertaining. But this is a good illustration here of one of the most important trends in the past six months, which is tools. Right? LLMs can tool tools.

They've been able to tool tools for a couple of years. They got really good at it in the past six months. I think the excitement about MCP is mainly people getting excited about tools. Like, MCP has just came along at the right time. Because the real magic is when you combine tools and reasoning.

Like, reasoning -- I had trouble with reasoning. Like, beyond code and debugging, I wasn't sure what it was good for. And then O3 and O4mini came out, and they can do incredibly good jobs with searches, because they run searches as part of that reasoning thing. They can run a search, reason about if it gave them good results, tweak the search, try it again, keep on going until they get to a result.

I think this is the most powerful technique in all of AI engineering right now. It has risks. MCP is all about mixing and matching. Prompt injection is still a thing. And there's this thing I'm calling the lethal trifecta, which is when you have an AI system that has access to private data, and you expose it to malicious instructions.

Other people can trick it into doing things. And there's a mechanism to exfiltrate stuff. OpenAI said this is a problem in codecs. You should read that. I'm feeling pretty good about my benchmark. As long as none of the AI labs catch on. And then the Google AI Keynote. Blink and you miss it.

They're on to me. They found out by my Pelican. That was in the Google AI Keynote. I'll have to switch to something else. Thank you very much. I'm Simon Wilson. SimonWilson.net. And that's my tool. Thank you. Thank you. Thank you. Thank you. We'll see you next time.

2025 in LLMs so far, illustrated by Pelicans on Bicycles — Simon Willison

Transcript