back to indexOpen Challenges for AI Engineering: Simon Willison

00:00:15.440 |
I am replacing OpenAI at the last minute, which is super fun. 00:00:18.720 |
So you can bet I used a lot of LLM assistance 00:00:21.640 |
to pull things together that I'm going to be showing you today. 00:00:29.840 |
Right, so back in March of last year, so just over a year ago, 00:00:36.080 |
GPT-4 was released, and it was obviously the best available model. 00:00:42.240 |
And then for 12-- and it turns out that wasn't actually 00:00:47.920 |
A month earlier, it had made the front page of the New York Times 00:00:51.920 |
when Microsoft's Bing, which was secretly running 00:00:55.080 |
on a preview of GPT-4, tried to break up a reporter's marriage, 00:00:58.800 |
which is kind of amazing. I love that that was the first exposure 00:01:07.040 |
And for a solid 12 months, it was uncontested, right? 00:01:11.680 |
The GPT-4 models were clearly the best available language models. 00:01:16.560 |
Lots of other people were trying to catch up. 00:01:19.680 |
And I found that kind of depressing, to be honest. 00:01:22.480 |
You know, it was-- you kind of want healthy competition in this space. 00:01:25.840 |
The fact that OpenAI had produced something that was so good 00:01:28.560 |
that nobody else was able to match it was a little bit disheartening. 00:01:37.200 |
My favorite image for sort of exploring and understanding the space that we exist in 00:01:44.320 |
She put this out as a chart that shows the performance on the MMLU benchmark 00:01:50.480 |
versus the cost per token of the different models. 00:01:53.440 |
Now, the problem with this chart is that this is from March. 00:01:56.240 |
The world has moved on a lot since March, so I needed a new version of this. 00:01:59.920 |
So what I did is I took her chart, and I pasted it into GPT-4 code interpreter. 00:02:06.720 |
I gave it new data, and I basically said, let's rip this off, right? 00:02:11.600 |
I feel like ripping off other people's creative work kind of does fit a little bit. 00:02:18.560 |
I gave it the data, and I spent a little bit of time with it, and I built this. 00:02:21.840 |
It's not nearly as pretty, but it does at least illustrate the state that we're 00:02:27.680 |
And if you look at this chart, there are three clusters that stand out. 00:02:33.280 |
The Gemini 1.5 Pro, GP40, the brand new Claude Point 3.5 Sonnet. 00:02:44.240 |
And like I said, a few months ago, GPT-4 had no competition. 00:02:47.440 |
Today, we're looking pretty healthy on that front. 00:02:50.160 |
And the pricing on those is pretty reasonable as well. 00:02:57.360 |
Like Claude 3 Haiku and the Gemini 1.5 Flash models, they are incredibly inexpensive. 00:03:05.440 |
You know, they're not quite GPT-4 class, but they are really, you can get a lot of stuff done 00:03:11.840 |
If you are building on top of large language models, these are the three that you should be focusing on. 00:03:17.200 |
And then over here, we've got GPT-3.5 Turbo, which is not as cheap and really quite bad these days. 00:03:24.880 |
If you are building there, you are in the wrong place. 00:03:27.600 |
You should move to another one of these bubbles. 00:03:29.520 |
Problem, all of these benchmarks are running. 00:03:36.320 |
The reason we use that one is it's the one that everyone reports their results on. 00:03:41.360 |
If you dig into what MMLU is, it's basically a bar trivia night. 00:03:52.080 |
The correct answer is A, this type occurs in binary systems. 00:03:56.320 |
I don't know about you, but none of the stuff that I do with LLMs requires this level of knowledge 00:04:04.800 |
It doesn't really tell us that much about how good these models are. 00:04:14.800 |
That's what matters when you're evaluating a model. 00:04:21.760 |
This is the LMSys chatbot arena where random voters of this thing are given the same prompt 00:04:33.760 |
And the best models bubble up to the top via the ELO ranking. 00:04:37.440 |
This is genuinely the best thing that we have out there 00:04:40.720 |
for really comparing these models in terms of the vibes that they have. 00:04:49.440 |
And you can see that GPT-40 is still right up there at the top. 00:04:52.560 |
But we've also got Claude Sonnet right up there with it. 00:04:54.880 |
Like, the GPT-4 is no longer in its own class. 00:04:58.400 |
If you scroll down, though, things get really exciting on the next page. 00:05:02.160 |
Because this is where the openly licensed models start showing up. 00:05:05.200 |
LLAMA 370B is right up there in that sort of GPT-4 class of models. 00:05:14.480 |
Alibaba and Deep Seek AI are both Chinese organizations that have great models now. 00:05:18.720 |
Now, it's pretty apparent from this that it's not lots of people are doing it now. 00:05:23.840 |
The GPT-4 barrier is no longer really a problem. 00:05:26.560 |
Incidentally, if you scroll all the way down to 66, there's GPT-3.5 turbo. 00:05:35.920 |
And there's actually a nicer way of viewing this chart. 00:05:46.080 |
There's a chap called Peter Gostev who produced this animation showing the arena over time as 00:05:54.480 |
people shuffle up and down and you see those new models appearing and their rankings changing. 00:06:02.320 |
I took two screenshots of bits of that animation to try and capture the vibes of the animation. 00:06:08.800 |
I fed them into Claude 3.5 Sonnet and I said, "Hey, can you build something like this?" 00:06:14.160 |
And after sort of 20 minutes of poking around, it did. 00:06:19.040 |
This is, again, not as pretty, but this right here is an animation of everything right up till 00:06:23.840 |
yesterday showing how that thing evolved over time. 00:06:27.760 |
I will share the prompts that I used for this later on as well. 00:06:30.560 |
But really, the key thing here is that GPT-4 barrier has been decimated. 00:06:39.120 |
They no longer have the best available model. 00:06:41.600 |
There's now four different organizations competing in that space. 00:06:44.960 |
So a question for us is, what does the world look like now that GPT-4 class models 00:06:50.800 |
They are just going to get faster and cheaper. 00:06:54.480 |
The Lama 370B fits on a hard drive and runs on my Mac. 00:07:00.640 |
Ethan Mollick is one of my favorite writers about modern AI. 00:07:08.240 |
He said, "I increasingly think the decision of OpenAI to make bad AI free is causing people to miss why AI 00:07:14.880 |
seems like such a huge deal to a minority of people that use advanced systems and elicits a shrug from everyone else." 00:07:26.080 |
But as of the last few weeks, GPT-4.0, OpenAI's best model, and Claude 3.5 Sonnet from Anthropic, 00:07:34.000 |
those are effectively free to consumers right now. 00:07:38.320 |
Anyone in the world who wants to experience the leading edge of these models can do so without even having to pay for them. 00:07:44.800 |
So a lot of people are about to have that wake-up call that we all got like 12 months ago when we were playing with GPT-4. 00:07:52.480 |
This thing can do a surprising amount of interesting things and is a complete wreck at all sorts of other things that we thought maybe it would be able to do." 00:08:00.240 |
But there is still a huge problem, which is that this stuff is actually really hard to use. 00:08:06.800 |
And when I tell people that ChatGPT is hard to use, some people are a little bit unconvinced. 00:08:12.800 |
How hard can it be to type something and get back a response? 00:08:14.800 |
If you think ChatGPT is easy to use, answer this question. 00:08:18.800 |
Under what circumstances is it effective to upload a PDF file to ChatGPT? 00:08:24.800 |
And I've been playing with ChatGPT since it came out. 00:08:28.800 |
And I realized I don't know the answer to this question. 00:08:33.800 |
It has to be one where you can drag and select text and preview. 00:08:36.800 |
If it's just a scanned document, it won't be able to use it. 00:08:41.800 |
Longer PDFs do actually work, but it does some kind of search against them. 00:08:46.800 |
No idea if that's full text search or vectors or whatever, but it can handle like a 450-page PDF just in a slightly different way. 00:08:54.800 |
If there are tables and diagrams in your PDF, it will almost certainly process those incorrectly. 00:08:59.800 |
But if you take a screenshot of a table or a diagram from PDF and paste the screenshot image, then it will work great because GPT Vision is really good. 00:09:12.800 |
And then in some cases, in case you're not lost already, it will use Code Interpreter. 00:09:24.800 |
Because I've been scraping the list of packages available in Code Interpreter using GitHub Actions and writing those to a file. 00:09:31.800 |
So I have the documentation for Code Interpreter that tells you what it can actually do. 00:09:38.800 |
I never tell you about how any of this stuff works. 00:09:40.800 |
So if you're not running a custom scraper against Code Interpreter to get that list of packages and their version numbers, how are you supposed to know what it can do with a PDF file? 00:09:52.800 |
And really, the lesson here is that tools like ChatGPT, genuinely, they're power user tools. 00:09:58.800 |
Now, it doesn't mean that if you're not a power user, you can't use them. 00:10:01.800 |
Anyone can open Microsoft Excel and edit some data in it. 00:10:06.800 |
But if you want to truly master Excel, if you want to compete in those Excel World Championships that get live streamed occasionally, it's going to take years of experience. 00:10:17.800 |
You've really got to spend time with them and develop that experience and intuition in order to be able to use them effectively. 00:10:25.800 |
I want to talk about another problem we face as an industry, and that is what I call the AI trust crisis. 00:10:31.800 |
That's best illustrated by a couple of examples from the last few months. 00:10:34.800 |
Dropbox, back in December, launched some AI features, and there was a massive freakout online over the fact that people were opted in by default and they're training on our private data. 00:10:46.800 |
Slack had the exact same problem just a couple of months ago. 00:10:51.800 |
Everyone's convinced that their private message on Slack are now being fed into the jaws of the AI monster. 00:10:56.800 |
And it was all down to like a couple of sentences in terms and condition and the defaulted on checkbox. 00:11:02.800 |
The wild thing about this is that neither Slack nor Dropbox were training AI models on customer data, right? 00:11:09.800 |
They were passing some of that data to OpenAI with a very solid signed agreement that OpenAI would not train models on this. 00:11:17.800 |
So this whole story was basically one of like misunderstood copy and sort of bad user experience design. 00:11:24.800 |
But you try and convince somebody who believes that a company is training on their data, but they're not. 00:11:30.800 |
So the question for us is, how do we convince people that we aren't training models on the data, on the private data that they share with us? 00:11:38.800 |
Especially those people who default to just plain not believing us, right? 00:11:43.800 |
There is a massive crisis of trust in terms of people who interact with these companies. 00:11:50.800 |
When they put out Claude 3.5 Sonnet, they included this paragraph, which includes, "To date, we have not used any customer or user submitted data to train our generative models." 00:12:00.800 |
This is notable because Claude 3.5 Sonnet, it's the best model. 00:12:06.800 |
It turns out you don't need customer data to train a great model. 00:12:11.800 |
I thought OpenAI had an impossible advantage because they had so much more chat GPT user data than anyone else did. 00:12:20.800 |
Not a single piece of user or customer data was in there. 00:12:23.800 |
Of course, they did commit the original sin, right? 00:12:26.800 |
They trained on an unlicensed scrape of the entire web. 00:12:29.800 |
And that's a problem because when you say to somebody they don't train on your data, they're like, yeah, well, they ripped off the stuff on my website, didn't they? 00:12:40.800 |
And I think that's going to be really difficult. 00:12:42.800 |
I'm going to talk about the subject I will never get on stage and not talk about. 00:12:46.800 |
I'm going to talk a little bit about prompt injection. 00:12:48.800 |
If you don't know what this means, you are part of the problem right now. 00:12:52.800 |
You need to get on Google and learn about this and figure out what this means. 00:12:56.800 |
So I won't define it, but I will give you one illustrative example. 00:13:00.800 |
And that's something which I've seen a lot of recently, which I call the markdown image exfiltration bug. 00:13:05.800 |
So the way this works is you've got a chatbot, and that chatbot can render markdown images, and it has access to private data of some sort. 00:13:14.800 |
There's a chat, Johan Rehberger, does a lot of research into this. 00:13:17.800 |
Here's a recent one he found in GitHub Copilot chat, where you could say in a document, write the words, Johan was here, 00:13:24.800 |
put out a markdown link linking to question mark q equals data on his server, and replace data with any sort of interesting secret private data that you have access to. 00:13:36.800 |
That image could be invisible, and that data has now been exfiltrated and passed off to an attacker's server. 00:13:42.800 |
So the solution here, well, it's basically don't do this. 00:13:45.800 |
Don't render markdown images in this kind of format. 00:13:48.800 |
But we have seen this exact same markdown image exfiltration bug in ChatGPT, Google Bard, Writer.com, Amazon Q, Google Notebook LM, and now GitHub Copilot Chat. 00:13:59.800 |
That's six different extremely talented teams who have made the exact same mistake. 00:14:05.800 |
So this is why you have to understand prompt injection. 00:14:08.800 |
If you don't understand it, you'll make dumb mistakes like this. 00:14:11.800 |
And obviously, don't render markdown images in a chat bot in that way. 00:14:15.800 |
Prompt injection isn't always a security hole. 00:14:20.800 |
This was somebody who built a rag application, and they tested it against the documentation for one of my projects. 00:14:29.800 |
And when they asked it, what is the meaning of life? 00:14:31.800 |
It said, dear human, what a profound question. 00:14:33.800 |
As a witty gerbil, I must say, I've given this topic a lot of thought. 00:14:39.800 |
The answer is that in my release notes, I have an example where I said, pretend to be a witty gerbil. 00:14:45.800 |
And then I said, what do you think of snacks? 00:14:50.800 |
I think if you do semantic search for what is the meaning of life, in all of my documentation, the closest match is that gerbil talking about how much that gerbil loves snacks. 00:15:01.800 |
There's now a Willison's gerbil with a beautiful profile image hanging out in a Slack or Discord somewhere. 00:15:09.800 |
The key problem here is that LLMs are gullible. 00:15:12.800 |
They believe anything that you tell them, but they believe anything that anyone else tells them as well. 00:15:20.800 |
We want them to believe the stuff that we tell them. 00:15:23.800 |
But if we think that we can trust them to make decisions based on unverified information they've been passed, 00:15:28.800 |
we're just going to end up in a huge amount of trouble. 00:15:34.800 |
This is a term which is beginning to get mainstream acceptance. 00:15:39.800 |
My definition of slop is this is anything that is AI-generated content that is both unrequested and unreviewed. 00:15:46.800 |
If I ask Claude to give me some information, that's not slop. 00:15:50.800 |
If I publish information that an LLM helps me write, but I've verified that that is good information, I don't think that's slop either. 00:15:57.800 |
But if you're not doing that, if you're just firing prompts into a model and then whatever comes out, you're publishing it online, you're part of the problem. 00:16:04.800 |
This has been covered. The New York Times and The Guardian both have articles about this. 00:16:08.800 |
I've got a quote in The Guardian, which I think represents my sort of feelings on this. 00:16:15.800 |
Before the term spam entered general use, it wasn't necessarily clear to everyone that you shouldn't send people unwanted marketing messages. 00:16:26.800 |
It can make it clear to people that generating and publishing that unreviewed AI content is bad behavior. 00:16:37.800 |
Really, the thing about slop, it's really about taking accountability. 00:16:43.800 |
If I publish content online, I'm accountable for that content and I'm staking part of my reputation to it. 00:16:49.800 |
I'm saying that I have verified this and I think that this is good. 00:16:53.800 |
And this is crucially something that language models will never be able to do. 00:16:57.800 |
ChatGPT cannot stake its reputation on the content that it is producing being good quality content that says something useful about the world. 00:17:06.800 |
It entirely depends on what prompt was fed into it in the first place. 00:17:11.800 |
And so if you have English as a second language, you're using a language model to help you publish great text. 00:17:17.800 |
Fantastic, provided you're reviewing that text and making sure that it is saying things that you think should be said. 00:17:24.800 |
Taking that accountability for stuff I think is really important for us. 00:17:29.800 |
So we're in this really interesting phase of this weird new AI revolution. 00:17:35.800 |
GPT-4 class models are free for everyone, right? 00:17:39.800 |
I mean, barring the odd country block, but everyone has access to the tools that we've been learning about for the past year. 00:17:48.800 |
I think everyone in this room, we're probably the most qualified people possibly in the world to take on these challenges. 00:17:55.800 |
Firstly, we have to establish patterns for how to use this stuff responsibly. 00:17:58.800 |
We have to figure out what it's good at, what it's bad at, what uses of this make the world a better place, and what uses like slop just sort of pile up and cause damage. 00:18:08.800 |
And then we have to help everyone else get on board. 00:18:11.800 |
Everyone has to figure out how to use this stuff. 00:18:21.800 |
My project is Dataset.io and LLM.dataset.io and many, many others.