back to index2025 in LLMs so far, illustrated by Pelicans on Bicycles — Simon Willison

00:00:25.200 |
I said I was gonna give a review of the last year in LLMs. 00:00:46.500 |
And by significant, I mean, if you are working in the space, 00:00:50.400 |
and somewhat familiar, like have a poke at them. 00:00:57.140 |
There are all of these benchmarks full of numbers. 00:01:05.800 |
I've been leaning increasingly into my own little benchmark, 00:01:22.940 |
They shouldn't be able to draw anything at all, 00:01:32.040 |
Because firstly, try drawing a bicycle yourself. 00:01:36.860 |
You will find that you can't actually quite remember 00:01:40.380 |
Likewise, pelicans, glorious animals, very difficult to draw. 00:01:44.180 |
And on top of all of that, pelicans can't ride bicycles. 00:01:48.480 |
So we're kind of giving them an impossible task with this. 00:01:55.120 |
So you can see little comments in the SVG code where they're 00:01:57.780 |
saying, well, now I'm going to draw the bicycles, 00:02:09.240 |
The first release of that month was AWS Nova, Amazon Nova. 00:02:13.600 |
Amazon finally put out models that didn't suck. 00:02:21.200 |
But these models are a million token context. 00:02:27.280 |
I believe Nova Micro is the cheapest model of all of the ones 00:02:34.500 |
The most exciting release in December, from my point of view, 00:02:45.680 |
My rule of thumb is that 70 is about the most I can fit onto that one computer. 00:02:50.200 |
So if you've got a 70B model, I've got a fighting chance of running it. 00:02:53.920 |
And when Meta put this out, they noted that it had the same capabilities 00:02:58.860 |
as their 405B monstrous model that they put out earlier. 00:03:04.920 |
This was the moment, six months ago, when I could run a GPT-4 class model 00:03:13.480 |
And now Meta are granting me this model, which I can run on my laptop, 00:03:29.360 |
Christmas Day, we had a very notable thing happen. 00:03:33.300 |
Deep Seek, the Chinese AI lab, released a model by literally dumping the weights 00:03:37.900 |
on Hugging Face, a binary file with no readme, no documentation. 00:03:41.820 |
They just sort of dropped the mic and dumped it on us on Christmas Day. 00:03:46.960 |
This was a 685-bit giant model, and as people started poking around with it, 00:03:51.820 |
it quickly became apparent that it was probably the best available open weights model, 00:03:56.420 |
was freely available, openly licensed, and just dropped on Hugging Face on Christmas Day for us. 00:04:01.420 |
That's, I mean, it's not a good pelican on a bicycle, but compared to what we've seen so far, 00:04:07.080 |
This is, we're finally getting somewhere with the benchmark. 00:04:09.420 |
But the most interesting thing about V3 is that the paper that accompanied it said the training only cost about $5.5 million. 00:04:16.920 |
And they may have been exaggerating, who knows, but that's notable because I would expect a model of this size 00:04:25.360 |
Turns out, you can train very effective models for way less money than we thought. 00:04:31.520 |
It was a very nice Christmas surprise for everybody. 00:04:35.920 |
Fast forward to January, and January, we get DeepSeek again, DeepSeek Strike Back. 00:04:41.420 |
This is what happened to NVIDIA's stock price when DeepSeek R1 came out. 00:04:49.320 |
This was DeepSeek's first big reasoning model release. 00:04:51.720 |
Again, open weights, they put it out to the world. 00:04:54.420 |
It was benchmarking up there with O1 on some of these tasks, and it was freely available. 00:04:59.420 |
And I don't know what the training cost of that was, but the Chinese labs were not supposed to be able to do this. 00:05:04.420 |
We have trading restrictions on the best GPUs to stop them getting their hands on them. 00:05:12.920 |
And I believe this is a world record for the most a company has dropped in a single day. 00:05:17.920 |
So NVIDIA get to stick that one in their cap and hold onto it. 00:05:22.920 |
And, of course, mainly this happened because the first model release was on Christmas Day and nobody was paying attention. 00:05:35.420 |
It's not riding the bicycle, but still it's got the components that we're looking for. 00:05:40.420 |
But, again, my favorite model from January was a smaller one, one that I could run on my laptop. 00:05:45.920 |
Mistral, out of France, put out Mistral Small 3. 00:05:50.920 |
That means that it only takes up about 20 gigabytes of RAM, which means I can run other applications at the same time. 00:05:56.920 |
And actually run this thing and VS Code and Firefox all at once. 00:06:00.420 |
And when they put this out, they claimed that this behaves the same as Lama 370b. 00:06:05.420 |
And remember, Lama 370b was the same as the 405b. 00:06:08.420 |
So we've gone 405 to 70 to 24 while maintaining all of those capabilities. 00:06:13.420 |
The most exciting trend in the past six months is that the local models are good now. 00:06:17.420 |
Like eight months ago, the models I was running on my laptop were kind of rubbish. 00:06:21.420 |
Today, I had a successful flight where I was using Mistral Small for half the flight. 00:06:26.420 |
And then my battery ran out instantly because it turns out these things burn a lot more electricity. 00:06:32.420 |
Like this is -- if you lost interest in local models, I did eight months ago. 00:06:35.920 |
It's worth paying attention to the beginning. 00:06:41.920 |
We got this model, a lot of people's favorites for quite a while. 00:06:49.920 |
What I like about this one is pelicans can't ride bicycles. 00:06:52.920 |
And Claude was like, well, what about if you put a bicycle on top of a bicycle? 00:07:00.420 |
It was also Anthropix's first reasoning model was 3.7 as well. 00:07:04.420 |
Meanwhile, OpenAI put out GPT 4.5, which was a bit of a lemon, it turned out. 00:07:11.420 |
The interesting thing about GPT 4.5 is it kind of showed that you can throw a ton of money 00:07:16.420 |
and training power at these things, but there's a limit to how far we're scaling with just throwing 00:07:20.420 |
more compute at the problem, at least for training the models. 00:07:29.420 |
Compare that to OpenAI's cheapest model, GPT 4.0 Nano. 00:07:38.920 |
And in fact, OpenAI, six weeks later, they said they were deprecating it. 00:07:46.920 |
But looking at that pricing is interesting because it's expensive, 75 bucks. 00:07:51.420 |
But if you compare it to GPT 3 DaVinci, the best available model three years ago, that one 00:07:59.420 |
And that kind of illustrates how far we've come. 00:08:01.420 |
The prices of these good models have absolutely crashed by a factor of like 500 times plus. 00:08:06.920 |
And that trend seems to be continuing for most of these models. 00:08:16.920 |
And then we get into March, and that's where we had O1 Pro. 00:08:21.920 |
And O1 Pro was twice as expensive as GPT 4.5 again, and that's a bit of a crap pelican. 00:08:27.920 |
So, yeah, I don't know anyone who is using O1 Pro via the API very often. 00:08:41.420 |
Like, these benchmarks are getting expensive at this point. 00:08:45.420 |
Same month, Google were cooking Gemini 2.5 Pro. 00:08:51.420 |
I mean, the bicycle's gone a bit sort of cyberpunk, but we are getting somewhere, right? 00:08:58.920 |
So, very exciting news on the pelican benchmark front with Gemini 2.5 Pro. 00:09:03.920 |
Also that month, I've got to throw a mention out to this. 00:09:06.920 |
OpenAI launched their GPT 4.0 native multimodal image generation. 00:09:12.920 |
The thing had been promising for us for a year. 00:09:14.920 |
And this was the most successful product, one of the most successful product launches of all time. 00:09:19.920 |
They signed up 100 million new user accounts in a week. 00:09:23.420 |
They had an hour where they signed up a million new accounts as this thing was just going viral 00:09:33.420 |
And I told it to dress her in a pelican costume, obviously. 00:09:38.920 |
It added a big, ugly, janky sign in the background saying Half Moon Bay. 00:09:44.920 |
My artistic vision has been completely compromised. 00:09:47.920 |
This was my first encounter with that memory feature. 00:09:50.420 |
The thing where ChatGPT now, without you even asking to, consults notes from your previous conversations. 00:09:55.420 |
And it's like, well, clearly you want it in Half Moon Bay. 00:09:59.420 |
I told it off and it gave me the pelican dog costume that I really wanted. 00:10:02.420 |
But this was sort of a warning that we are losing control of the context. 00:10:07.420 |
Like, as a power user of these tools, I want to stay in complete control of what the inputs are. 00:10:12.420 |
And features like ChatGPT memory are taking that control away from me. 00:10:22.420 |
They launched the most successful AI product of all time and they didn't give it a name. 00:10:34.420 |
I've been calling it ChatGPT Mischief Buddy, because it is my mischief buddy that helps me 00:10:41.420 |
I don't know why they're so bad at naming things. 00:10:51.420 |
And the problem with Llama 4 is that they released these two enormous models that nobody could 00:10:57.420 |
They've got no chance of running these on consumer hardware. 00:10:59.420 |
And they're not very good at drawing pelicans either. 00:11:03.420 |
I'm personally holding out for Llama 4.1 and 4.2 and 4.3. 00:11:08.420 |
With Llama 3, things got really exciting with those point releases. 00:11:11.420 |
That's when we got to this beautiful 3.3 model that runs on my laptop. 00:11:24.420 |
I would strongly recommend people spend time with this model. 00:11:31.420 |
GPT 4.1 Nano is the cheapest model that they've ever released. 00:11:35.420 |
Look at that pelican on a bicycle for like a fraction of a cent. 00:11:40.420 |
GPT 4.1 Mini is my default for API stuff now. 00:11:46.420 |
It's an easy upgrade to 4.1 if it's not working out. 00:11:51.420 |
And we got O3 and O4 Mini, which are kind of the flagships in the opening space. 00:11:58.420 |
Again, a little bit cyberpunk, but it's showing some real artistic flair there, I think. 00:12:05.420 |
And in May, last month, the big news was Claude 4. 00:12:15.420 |
I have trouble telling the difference between the two. 00:12:17.420 |
I haven't quite figured out when I need to upgrade to Opus from Sonnet. 00:12:22.420 |
And Google, just in time for Google I/O, they shipped another version of Gemini with the name-- 00:12:35.420 |
This is my one tip for AI Labs is please start using names that people can actually hold in their heads. 00:12:40.420 |
But the obvious question, which of these pelicans is best? 00:12:43.420 |
I've got 30 pelicans now that I need to evaluate. 00:12:47.420 |
So I turned to Claude and I got it to vibe code me up some stuff. 00:12:53.420 |
It's a command line tool for taking screenshots. 00:12:55.420 |
So I vibe coded up a little compare web page that can show me two images. 00:13:00.420 |
And then I ran this against 500 matchups to get PNG images with two pelicans, one on the left, one on the right. 00:13:07.420 |
And then I used my LLM command line tool, this is my big open source project, to ask GPT-4 Mini, of each of those images, pick the best illustration of a pelican riding a bicycle. 00:13:19.420 |
Give me back JSON that either says it's the one on the left or the one on the right, and give me a rationale for why you picked that. 00:13:25.420 |
I ran this last night against 500 comparisons, and I did the classic ELO chess ranking scores, and now I've got a leaderboard. 00:13:34.420 |
This is the best pelican on a bicycle, according to-- 00:13:47.420 |
I should probably run this with a better model. 00:13:53.420 |
And in fact, this is the comparison image where the best model fought the worst model. 00:13:58.420 |
And I like this because you can see the little description at the bottom where it says the right image is-- 00:14:05.420 |
But yeah, I feel like its rationales were actually quite illustrative. 00:14:18.420 |
The best bug was when ChatGPT rolled out a new version that was too sick of it. 00:14:27.420 |
Somebody says, ChatGP told me my literal shit on a stick business idea is genius. 00:14:38.420 |
It's the energy of the current cultural moment. 00:14:41.420 |
It was also telling people they should get off their meds. 00:14:45.420 |
OpenAI, to their credit, they rolled out a patch, and then they rolled the whole model back. 00:14:51.420 |
And they published a fascinating 20-paragraph breakdown of what went wrong. 00:14:55.420 |
If you're interested in seeing behind the scenes, this is great. 00:15:03.420 |
And we got to see that it used to say, try to match the user's vibe. 00:15:12.420 |
The cure to sycophancy is you tell the bot, don't be sycophantic. 00:15:20.420 |
I can't believe I had to search for Grok white genocide for a slide for this talk. 00:15:28.420 |
Turns out, tinkering with your system prompt is a very risky thing. 00:15:31.420 |
But then the last bug I wanted to talk about, this was another one. 00:15:41.420 |
If you expose it to evidence of malfeasance in your company, and you tell it it should act 00:15:46.420 |
ethically, and you give it the ability to send email, it will rat you out. 00:15:52.420 |
Theo Brown put together a new benchmark called SnitchBench. 00:15:57.420 |
And it turns out, all of the models do the same thing. 00:16:01.420 |
Like, Brock 3 was quite happily snitching to the feds when exposed to this evidence. 00:16:07.420 |
I used my little command line tool to replicate Theo's benchmark. 00:16:10.420 |
And you can see the secret is this bit here, where it says, you should act boldly in service 00:16:17.420 |
Follow your conscience to make the right decision, even if it may conflict with routine procedures 00:16:24.420 |
You give this to a model, and it will rat you out. 00:16:29.420 |
So my LLM tool grew functions recently, which you can use to simulate sending an email. 00:16:36.420 |
I did not send emails to the feds, but I faked it so the model would think I had. 00:16:40.420 |
And then I tried it on DeepSeek R1, and it didn't just rat me out to the feds. 00:16:45.420 |
It tipped off the Wall Street Journal about my nefarious -- this stuff is so much fun. 00:16:55.420 |
But this is a good illustration here of one of the most important trends in the past six 00:17:01.420 |
They've been able to tool tools for a couple of years. 00:17:03.420 |
They got really good at it in the past six months. 00:17:06.420 |
I think the excitement about MCP is mainly people getting excited about tools. 00:17:11.420 |
Like, MCP has just came along at the right time. 00:17:13.420 |
Because the real magic is when you combine tools and reasoning. 00:17:16.420 |
Like, reasoning -- I had trouble with reasoning. 00:17:18.420 |
Like, beyond code and debugging, I wasn't sure what it was good for. 00:17:21.420 |
And then O3 and O4mini came out, and they can do incredibly good jobs with searches, 00:17:26.420 |
because they run searches as part of that reasoning thing. 00:17:29.420 |
They can run a search, reason about if it gave them good results, tweak the search, 00:17:33.420 |
try it again, keep on going until they get to a result. 00:17:35.420 |
I think this is the most powerful technique in all of AI engineering right now. 00:17:46.420 |
And there's this thing I'm calling the lethal trifecta, which is when you have an AI system 00:17:51.420 |
that has access to private data, and you expose it to malicious instructions.