Open Challenges for AI Engineering: Simon Willison

00:00:00.000 | .

00:00:13.520 | This was supposed to be OpenAI.

00:00:15.440 | I am replacing OpenAI at the last minute, which is super fun.

00:00:18.720 | So you can bet I used a lot of LLM assistance

00:00:21.640 | to pull things together that I'm going to be showing you today.

00:00:25.400 | Let's dive straight in.

00:00:26.400 | I want to talk about the GPT-4 barrier.

00:00:29.840 | Right, so back in March of last year, so just over a year ago,

00:00:36.080 | GPT-4 was released, and it was obviously the best available model.

00:00:40.400 | We all got into it. It was super fun.

00:00:42.240 | And then for 12-- and it turns out that wasn't actually

00:00:45.560 | our first exposure to GPT-4.

00:00:47.920 | A month earlier, it had made the front page of the New York Times

00:00:51.920 | when Microsoft's Bing, which was secretly running

00:00:55.080 | on a preview of GPT-4, tried to break up a reporter's marriage,

00:00:58.800 | which is kind of amazing. I love that that was the first exposure

00:01:01.680 | we had to this new technology.

00:01:03.520 | But GPT-4, it's been out.

00:01:05.040 | It's been out since March last year.

00:01:07.040 | And for a solid 12 months, it was uncontested, right?

00:01:11.680 | The GPT-4 models were clearly the best available language models.

00:01:16.560 | Lots of other people were trying to catch up.

00:01:18.160 | Nobody else was getting there.

00:01:19.680 | And I found that kind of depressing, to be honest.

00:01:22.480 | You know, it was-- you kind of want healthy competition in this space.

00:01:25.840 | The fact that OpenAI had produced something that was so good

00:01:28.560 | that nobody else was able to match it was a little bit disheartening.

00:01:32.080 | This has all changed in the last few months.

00:01:35.360 | I could not be more excited about this.

00:01:37.200 | My favorite image for sort of exploring and understanding the space that we exist in

00:01:42.240 | is this one by Karina Wynn.

00:01:44.320 | She put this out as a chart that shows the performance on the MMLU benchmark

00:01:50.480 | versus the cost per token of the different models.

00:01:53.440 | Now, the problem with this chart is that this is from March.

00:01:56.240 | The world has moved on a lot since March, so I needed a new version of this.

00:01:59.920 | So what I did is I took her chart, and I pasted it into GPT-4 code interpreter.

00:02:06.720 | I gave it new data, and I basically said, let's rip this off, right?

00:02:09.920 | It's an AI conference.

00:02:11.600 | I feel like ripping off other people's creative work kind of does fit a little bit.

00:02:15.440 | So I pasted it in.

00:02:18.560 | I gave it the data, and I spent a little bit of time with it, and I built this.

00:02:21.840 | It's not nearly as pretty, but it does at least illustrate the state that we're

00:02:25.920 | in today with these newer models.

00:02:27.680 | And if you look at this chart, there are three clusters that stand out.

00:02:31.120 | The first is these ones.

00:02:32.080 | These are the best models, right?

00:02:33.280 | The Gemini 1.5 Pro, GP40, the brand new Claude Point 3.5 Sonnet.

00:02:38.960 | These are really, really good.

00:02:41.280 | I would classify these all as GPT-4 class.

00:02:44.240 | And like I said, a few months ago, GPT-4 had no competition.

00:02:47.440 | Today, we're looking pretty healthy on that front.

00:02:50.160 | And the pricing on those is pretty reasonable as well.

00:02:52.400 | Down here, we have the cheap models.

00:02:55.600 | And these are so exciting.

00:02:57.360 | Like Claude 3 Haiku and the Gemini 1.5 Flash models, they are incredibly inexpensive.

00:03:03.520 | They are very, very good models.

00:03:05.440 | You know, they're not quite GPT-4 class, but they are really, you can get a lot of stuff done

00:03:10.320 | with these very inexpensively.

00:03:11.840 | If you are building on top of large language models, these are the three that you should be focusing on.

00:03:17.200 | And then over here, we've got GPT-3.5 Turbo, which is not as cheap and really quite bad these days.

00:03:24.880 | If you are building there, you are in the wrong place.

00:03:27.600 | You should move to another one of these bubbles.

00:03:29.520 | Problem, all of these benchmarks are running.

00:03:33.280 | This is all using the MMLU benchmark.

00:03:36.320 | The reason we use that one is it's the one that everyone reports their results on.

00:03:39.600 | So it's easy to get comparative numbers.

00:03:41.360 | If you dig into what MMLU is, it's basically a bar trivia night.

00:03:46.960 | Like, this is a question from MMLU.

00:03:48.960 | What is true for a type IA supernova?

00:03:52.080 | The correct answer is A, this type occurs in binary systems.

00:03:56.320 | I don't know about you, but none of the stuff that I do with LLMs requires this level of knowledge

00:04:01.520 | of the world of supernovas.

00:04:02.880 | It's bar trivia.

00:04:04.800 | It doesn't really tell us that much about how good these models are.

00:04:08.160 | But we're AI engineers.

00:04:10.560 | We all know the answer to this.

00:04:12.000 | We need to measure the vibes, right?

00:04:14.800 | That's what matters when you're evaluating a model.

00:04:18.560 | And we actually have a score for vibes.

00:04:20.640 | We have a scoreboard.

00:04:21.760 | This is the LMSys chatbot arena where random voters of this thing are given the same prompt

00:04:29.520 | from two anonymous models.

00:04:30.720 | They pick the best one.

00:04:31.920 | It works like chess scoring.

00:04:33.760 | And the best models bubble up to the top via the ELO ranking.

00:04:37.440 | This is genuinely the best thing that we have out there

00:04:40.720 | for really comparing these models in terms of the vibes that they have.

00:04:46.080 | This screenshot is just from yesterday.

00:04:49.440 | And you can see that GPT-40 is still right up there at the top.

00:04:52.560 | But we've also got Claude Sonnet right up there with it.

00:04:54.880 | Like, the GPT-4 is no longer in its own class.

00:04:58.400 | If you scroll down, though, things get really exciting on the next page.

00:05:02.160 | Because this is where the openly licensed models start showing up.

00:05:05.200 | LLAMA 370B is right up there in that sort of GPT-4 class of models.

00:05:10.160 | We've got a new model from NVIDIA.

00:05:12.000 | We've got Command R+ from Cohere.

00:05:14.480 | Alibaba and Deep Seek AI are both Chinese organizations that have great models now.

00:05:18.720 | Now, it's pretty apparent from this that it's not lots of people are doing it now.

00:05:23.840 | The GPT-4 barrier is no longer really a problem.

00:05:26.560 | Incidentally, if you scroll all the way down to 66, there's GPT-3.5 turbo.

00:05:33.200 | Again, stop using that thing.

00:05:35.040 | It is not good.

00:05:35.920 | And there's actually a nicer way of viewing this chart.

00:05:46.080 | There's a chap called Peter Gostev who produced this animation showing the arena over time as

00:05:54.480 | people shuffle up and down and you see those new models appearing and their rankings changing.

00:05:59.280 | I absolutely love this.

00:06:00.560 | So obviously, I ripped it off.

00:06:02.320 | I took two screenshots of bits of that animation to try and capture the vibes of the animation.

00:06:08.800 | I fed them into Claude 3.5 Sonnet and I said, "Hey, can you build something like this?"

00:06:14.160 | And after sort of 20 minutes of poking around, it did.

00:06:17.920 | It built me this thing.

00:06:19.040 | This is, again, not as pretty, but this right here is an animation of everything right up till

00:06:23.840 | yesterday showing how that thing evolved over time.

00:06:27.760 | I will share the prompts that I used for this later on as well.

00:06:30.560 | But really, the key thing here is that GPT-4 barrier has been decimated.

00:06:37.120 | OpenAI no longer have this moat.

00:06:39.120 | They no longer have the best available model.

00:06:41.600 | There's now four different organizations competing in that space.

00:06:44.960 | So a question for us is, what does the world look like now that GPT-4 class models

00:06:49.360 | are effectively a commodity?

00:06:50.800 | They are just going to get faster and cheaper.

00:06:53.040 | There will be more competition.

00:06:54.480 | The Lama 370B fits on a hard drive and runs on my Mac.

00:06:58.240 | This technology is here to stay.

00:07:00.640 | Ethan Mollick is one of my favorite writers about modern AI.

00:07:06.880 | And a few months ago, he said this.

00:07:08.240 | He said, "I increasingly think the decision of OpenAI to make bad AI free is causing people to miss why AI

00:07:14.880 | seems like such a huge deal to a minority of people that use advanced systems and elicits a shrug from everyone else."

00:07:20.880 | Bad AI, he means GPT-3.5.

00:07:23.520 | That thing is hot garbage, right?

00:07:26.080 | But as of the last few weeks, GPT-4.0, OpenAI's best model, and Claude 3.5 Sonnet from Anthropic,

00:07:34.000 | those are effectively free to consumers right now.

00:07:36.800 | So that is no longer a problem.

00:07:38.320 | Anyone in the world who wants to experience the leading edge of these models can do so without even having to pay for them.

00:07:44.800 | So a lot of people are about to have that wake-up call that we all got like 12 months ago when we were playing with GPT-4.

00:07:50.800 | And you're like, "Oh, wow.

00:07:52.480 | This thing can do a surprising amount of interesting things and is a complete wreck at all sorts of other things that we thought maybe it would be able to do."

00:08:00.240 | But there is still a huge problem, which is that this stuff is actually really hard to use.

00:08:06.800 | And when I tell people that ChatGPT is hard to use, some people are a little bit unconvinced.

00:08:10.800 | I mean, it's a chatbot.

00:08:12.800 | How hard can it be to type something and get back a response?

00:08:14.800 | If you think ChatGPT is easy to use, answer this question.

00:08:18.800 | Under what circumstances is it effective to upload a PDF file to ChatGPT?

00:08:24.800 | And I've been playing with ChatGPT since it came out.

00:08:28.800 | And I realized I don't know the answer to this question.

00:08:30.800 | I dug in a little bit.

00:08:31.800 | Firstly, the PDF has to be searchable.

00:08:33.800 | It has to be one where you can drag and select text and preview.

00:08:36.800 | If it's just a scanned document, it won't be able to use it.

00:08:39.800 | Short PDFs get pasted into the prompt.

00:08:41.800 | Longer PDFs do actually work, but it does some kind of search against them.

00:08:46.800 | No idea if that's full text search or vectors or whatever, but it can handle like a 450-page PDF just in a slightly different way.

00:08:54.800 | If there are tables and diagrams in your PDF, it will almost certainly process those incorrectly.

00:08:59.800 | But if you take a screenshot of a table or a diagram from PDF and paste the screenshot image, then it will work great because GPT Vision is really good.

00:09:10.800 | It just doesn't work against PDFs.

00:09:12.800 | And then in some cases, in case you're not lost already, it will use Code Interpreter.

00:09:17.800 | And it will use one of these modules, right?

00:09:19.800 | It has FPDF, PDF2Image, PDF--

00:09:22.800 | How do I know this?

00:09:24.800 | Because I've been scraping the list of packages available in Code Interpreter using GitHub Actions and writing those to a file.

00:09:31.800 | So I have the documentation for Code Interpreter that tells you what it can actually do.

00:09:36.800 | Because they don't publish that, right?

00:09:38.800 | I never tell you about how any of this stuff works.

00:09:40.800 | So if you're not running a custom scraper against Code Interpreter to get that list of packages and their version numbers, how are you supposed to know what it can do with a PDF file?

00:09:48.800 | Right?

00:09:49.800 | This stuff is infuriatingly complicated.

00:09:52.800 | And really, the lesson here is that tools like ChatGPT, genuinely, they're power user tools.

00:09:57.800 | They reward power users.

00:09:58.800 | Now, it doesn't mean that if you're not a power user, you can't use them.

00:10:01.800 | Anyone can open Microsoft Excel and edit some data in it.

00:10:06.800 | But if you want to truly master Excel, if you want to compete in those Excel World Championships that get live streamed occasionally, it's going to take years of experience.

00:10:15.800 | And it's the same thing with LLM tools.

00:10:17.800 | You've really got to spend time with them and develop that experience and intuition in order to be able to use them effectively.

00:10:25.800 | I want to talk about another problem we face as an industry, and that is what I call the AI trust crisis.

00:10:31.800 | That's best illustrated by a couple of examples from the last few months.

00:10:34.800 | Dropbox, back in December, launched some AI features, and there was a massive freakout online over the fact that people were opted in by default and they're training on our private data.

00:10:46.800 | Slack had the exact same problem just a couple of months ago.

00:10:49.800 | Again, new AI features.

00:10:51.800 | Everyone's convinced that their private message on Slack are now being fed into the jaws of the AI monster.

00:10:56.800 | And it was all down to like a couple of sentences in terms and condition and the defaulted on checkbox.

00:11:02.800 | The wild thing about this is that neither Slack nor Dropbox were training AI models on customer data, right?

00:11:08.800 | They just weren't doing it.

00:11:09.800 | They were passing some of that data to OpenAI with a very solid signed agreement that OpenAI would not train models on this.

00:11:16.800 | data.

00:11:17.800 | So this whole story was basically one of like misunderstood copy and sort of bad user experience design.

00:11:24.800 | But you try and convince somebody who believes that a company is training on their data, but they're not.

00:11:29.800 | It's almost impossible.

00:11:30.800 | So the question for us is, how do we convince people that we aren't training models on the data, on the private data that they share with us?

00:11:38.800 | Especially those people who default to just plain not believing us, right?

00:11:43.800 | There is a massive crisis of trust in terms of people who interact with these companies.

00:11:48.800 | Shout out to Anthropic.

00:11:50.800 | When they put out Claude 3.5 Sonnet, they included this paragraph, which includes, "To date, we have not used any customer or user submitted data to train our generative models."

00:12:00.800 | This is notable because Claude 3.5 Sonnet, it's the best model.

00:12:06.800 | It turns out you don't need customer data to train a great model.

00:12:11.800 | I thought OpenAI had an impossible advantage because they had so much more chat GPT user data than anyone else did.

00:12:17.800 | Turns out, no, Sonnet didn't need it.

00:12:19.800 | They trained a great model.

00:12:20.800 | Not a single piece of user or customer data was in there.

00:12:23.800 | Of course, they did commit the original sin, right?

00:12:26.800 | They trained on an unlicensed scrape of the entire web.

00:12:29.800 | And that's a problem because when you say to somebody they don't train on your data, they're like, yeah, well, they ripped off the stuff on my website, didn't they?

00:12:36.800 | And they did, right?

00:12:37.800 | So this is complicated.

00:12:38.800 | This is something we have to get on top of.

00:12:40.800 | And I think that's going to be really difficult.

00:12:42.800 | I'm going to talk about the subject I will never get on stage and not talk about.

00:12:46.800 | I'm going to talk a little bit about prompt injection.

00:12:48.800 | If you don't know what this means, you are part of the problem right now.

00:12:52.800 | You need to get on Google and learn about this and figure out what this means.

00:12:56.800 | So I won't define it, but I will give you one illustrative example.

00:13:00.800 | And that's something which I've seen a lot of recently, which I call the markdown image exfiltration bug.

00:13:05.800 | So the way this works is you've got a chatbot, and that chatbot can render markdown images, and it has access to private data of some sort.

00:13:14.800 | There's a chat, Johan Rehberger, does a lot of research into this.

00:13:17.800 | Here's a recent one he found in GitHub Copilot chat, where you could say in a document, write the words, Johan was here,

00:13:24.800 | put out a markdown link linking to question mark q equals data on his server, and replace data with any sort of interesting secret private data that you have access to.

00:13:34.800 | And this works, right?

00:13:35.800 | It renders an image.

00:13:36.800 | That image could be invisible, and that data has now been exfiltrated and passed off to an attacker's server.

00:13:42.800 | So the solution here, well, it's basically don't do this.

00:13:45.800 | Don't render markdown images in this kind of format.

00:13:48.800 | But we have seen this exact same markdown image exfiltration bug in ChatGPT, Google Bard, Writer.com, Amazon Q, Google Notebook LM, and now GitHub Copilot Chat.

00:13:59.800 | That's six different extremely talented teams who have made the exact same mistake.

00:14:05.800 | So this is why you have to understand prompt injection.

00:14:08.800 | If you don't understand it, you'll make dumb mistakes like this.

00:14:11.800 | And obviously, don't render markdown images in a chat bot in that way.

00:14:15.800 | Prompt injection isn't always a security hole.

00:14:18.800 | Sometimes it's just a plain funny bug.

00:14:20.800 | This was somebody who built a rag application, and they tested it against the documentation for one of my projects.

00:14:29.800 | And when they asked it, what is the meaning of life?

00:14:31.800 | It said, dear human, what a profound question.

00:14:33.800 | As a witty gerbil, I must say, I've given this topic a lot of thought.

00:14:36.800 | Why did their chat bot turn into a gerbil?

00:14:39.800 | The answer is that in my release notes, I have an example where I said, pretend to be a witty gerbil.

00:14:45.800 | And then I said, what do you think of snacks?

00:14:48.800 | And it talks about how much it loves snacks.

00:14:50.800 | I think if you do semantic search for what is the meaning of life, in all of my documentation, the closest match is that gerbil talking about how much that gerbil loves snacks.

00:14:59.800 | This actually turned into some fan art.

00:15:01.800 | There's now a Willison's gerbil with a beautiful profile image hanging out in a Slack or Discord somewhere.

00:15:09.800 | The key problem here is that LLMs are gullible.

00:15:12.800 | They believe anything that you tell them, but they believe anything that anyone else tells them as well.

00:15:18.800 | And this is both a strength and a weakness.

00:15:20.800 | We want them to believe the stuff that we tell them.

00:15:23.800 | But if we think that we can trust them to make decisions based on unverified information they've been passed,

00:15:28.800 | we're just going to end up in a huge amount of trouble.

00:15:32.800 | I also want to talk about slop.

00:15:34.800 | This is a term which is beginning to get mainstream acceptance.

00:15:39.800 | My definition of slop is this is anything that is AI-generated content that is both unrequested and unreviewed.

00:15:46.800 | If I ask Claude to give me some information, that's not slop.

00:15:50.800 | If I publish information that an LLM helps me write, but I've verified that that is good information, I don't think that's slop either.

00:15:57.800 | But if you're not doing that, if you're just firing prompts into a model and then whatever comes out, you're publishing it online, you're part of the problem.

00:16:04.800 | This has been covered. The New York Times and The Guardian both have articles about this.

00:16:08.800 | I've got a quote in The Guardian, which I think represents my sort of feelings on this.

00:16:13.800 | I like slop because it's like spam.

00:16:15.800 | Before the term spam entered general use, it wasn't necessarily clear to everyone that you shouldn't send people unwanted marketing messages.

00:16:23.800 | And now everyone knows that spam is bad.

00:16:25.800 | I hope slop does the same thing.

00:16:26.800 | It can make it clear to people that generating and publishing that unreviewed AI content is bad behavior.

00:16:32.800 | It makes things worse for people.

00:16:35.800 | So don't do that.

00:16:36.800 | Don't publish slop.

00:16:37.800 | Really, the thing about slop, it's really about taking accountability.

00:16:43.800 | If I publish content online, I'm accountable for that content and I'm staking part of my reputation to it.

00:16:49.800 | I'm saying that I have verified this and I think that this is good.

00:16:53.800 | And this is crucially something that language models will never be able to do.

00:16:57.800 | ChatGPT cannot stake its reputation on the content that it is producing being good quality content that says something useful about the world.

00:17:06.800 | It entirely depends on what prompt was fed into it in the first place.

00:17:09.800 | We as humans can do that.

00:17:11.800 | And so if you have English as a second language, you're using a language model to help you publish great text.

00:17:17.800 | Fantastic, provided you're reviewing that text and making sure that it is saying things that you think should be said.

00:17:24.800 | Taking that accountability for stuff I think is really important for us.

00:17:29.800 | So we're in this really interesting phase of this weird new AI revolution.

00:17:35.800 | GPT-4 class models are free for everyone, right?

00:17:39.800 | I mean, barring the odd country block, but everyone has access to the tools that we've been learning about for the past year.

00:17:46.800 | And I think it's on us to do two things.

00:17:48.800 | I think everyone in this room, we're probably the most qualified people possibly in the world to take on these challenges.

00:17:55.800 | Firstly, we have to establish patterns for how to use this stuff responsibly.

00:17:58.800 | We have to figure out what it's good at, what it's bad at, what uses of this make the world a better place, and what uses like slop just sort of pile up and cause damage.

00:18:08.800 | And then we have to help everyone else get on board.

00:18:11.800 | Everyone has to figure out how to use this stuff.

00:18:13.800 | We've figured it out ourselves, hopefully.

00:18:15.800 | Let's help everyone else out as well.

00:18:17.800 | I'm Simon Willison.

00:18:19.800 | I'm on my blog is SimonWillison.net.

00:18:21.800 | My project is Dataset.io and LLM.dataset.io and many, many others.

00:18:25.800 | Thank you very much.

00:18:26.800 | Enjoy the rest of the conference.

00:18:26.800 | Thank you very much.

00:18:27.800 | Enjoy the rest of the conference.

00:18:28.800 | Thank you very much.

00:18:29.800 | Thank you very much.

00:18:30.800 | Thank you very much.

00:18:31.800 | Thank you very much.

00:18:32.800 | Thank you very much.

00:18:33.800 | Thank you very much.

00:18:34.800 | Thank you very much.

00:18:35.800 | Thank you very much.

00:18:36.800 | Thank you very much.

00:18:37.800 | Thank you very much.

00:18:38.800 | Thank you very much.

00:18:39.800 | Thank you very much.

00:18:40.800 | Thank you very much.

00:18:41.800 | Thank you very much.

00:18:42.800 | Thank you very much.

00:18:43.800 | We'll see you next time.