New OpenAI Model 'Imminent' and AI Stakes Get Raised (plus MedGemini, GPT-2 Chatbot and Scale AI)


00:00:00.000 | It has been a somewhat surreal few days in AI for so many reasons and the month of May
00:00:06.400 | promises to be yet stranger.
00:00:08.980 | And according to this under-the-radar article, company insiders and government officials
00:00:15.000 | tell of an imminent release of new OpenAI models.
00:00:18.800 | And yes, of course, the strangeness at the end of April was amplified by the GPT-2 chatbot,
00:00:26.120 | a mystery model showcased and then withdrawn within days, but which I did get to test.
00:00:31.880 | I thought testing it would be a slightly more appropriate response than doing an all-caps
00:00:36.960 | video claiming that AGI has arrived.
00:00:39.880 | I also want to bring in two papers released in the last 24 hours, 90 pages in total and
00:00:45.680 | read in full.
00:00:46.960 | They might be more significant than any rumor you have heard.
00:00:51.080 | First things first, though, that article from Politico that I mentioned and the context
00:00:55.440 | of this article is this.
00:00:56.960 | There was an AI safety summit in Bletchley last year, near to where I live actually in
00:01:01.920 | Southern England.
00:01:03.000 | Some of the biggest players in AI, like Meta and OpenAI, promised the UK government that
00:01:08.080 | it could safety-test their latest frontier models before they were released.
00:01:13.640 | There's only one slight problem.
00:01:15.200 | They haven't done it.
00:01:16.200 | Now you might say that's just par for the course for big tech, but the article also
00:01:20.300 | revealed some interesting insider gossip.
00:01:22.920 | Politico spoke to a host of company insiders, consultants, lobbyists, and government officials.
00:01:29.600 | They spoke anonymously over several months.
00:01:32.360 | And not only did we learn that Google DeepMind is the only lab to have given the government
00:01:36.920 | early access, we also learned that OpenAI, specifically, didn't.
00:01:40.560 | Now somewhat obviously that tells us that they have a new model and that it's very
00:01:45.040 | near to release.
00:01:46.480 | Now I very much doubt they're going to call it GPT-5, and you can find more of my reasons
00:01:51.100 | for that in the video on screen.
00:01:54.320 | But I think it's more likely to be something like GPT-4.5 optimized for reasoning and planning.
00:02:01.120 | Now some of you might be thinking, is that all the evidence you've got that a GPT-4.5
00:02:05.840 | will be coming before GPT-5?
00:02:07.440 | Well, not quite.
00:02:08.840 | How about this MIT Technology Review interview conducted with Sam Altman in the last few
00:02:13.600 | days?
00:02:14.600 | In a private discussion, Sam Altman was asked if he knew when the next version of GPT is
00:02:19.260 | slated to be released.
00:02:20.600 | And he said calmly, "Yes."
00:02:22.440 | Now think about it.
00:02:23.440 | If the model had months and months more of uncertain safety testing, he couldn't be
00:02:27.800 | that confident about a release date.
00:02:30.080 | Think about what happened to Google Gemini Ultra, which was delayed and delayed and delayed.
00:02:34.360 | That again points to a more imminent release.
00:02:37.680 | Then another bit of secondhand evidence, this time from an AI insider on Patreon.
00:02:43.040 | We have a wonderful Discord, and this insider at a Stanford event put a question directly
00:02:48.600 | to Sam Altman very recently.
00:02:50.460 | This was a different Stanford event to the one I'm about to also quote from.
00:02:54.560 | And in this response, Sam Altman confirmed that he's personally using the unreleased
00:02:58.680 | version of their new model.
00:03:00.440 | But enough of secondhand sources.
00:03:02.100 | What about another direct quote from Sam Altman?
00:03:04.720 | Well, here's some more evidence released yesterday that rather than drop a bombshell
00:03:08.840 | GPT-5 on us, which I predict to come somewhere between November and January, they're going
00:03:13.440 | to give us an iterative GPT-4.5 first.
00:03:16.980 | He doesn't want to surprise us.
00:03:19.920 | "It does kind of suck to ship a product that you're embarrassed about, but it's much
00:03:23.040 | better than the alternative.
00:03:25.080 | And in this case in particular, where I think we really owe it to society to deploy iteratively.
00:03:31.280 | One thing we've learned is that AI and surprise don't go well together.
00:03:33.920 | People don't want to be surprised.
00:03:34.920 | People want a gradual rollout and the ability to influence these systems.
00:03:38.920 | That's how we're going to do it."
00:03:39.920 | Now, he might want to tell that to OpenAI's recently departed head of developer relations.
00:03:45.660 | He now works at Google and said, "Something I really appreciate about Google's culture
00:03:49.980 | is how transparent things are.
00:03:51.960 | 30 days in, I feel like I have a great understanding of where we are going from a model perspective.
00:03:57.780 | Having line of sight on this makes it so much easier to start building compelling developer
00:04:01.860 | products."
00:04:02.860 | It almost sounds like the workers at OpenAI often don't have a great understanding of
00:04:07.820 | where they're going from a model perspective.
00:04:09.940 | Now, in fairness, Sam Altman did say that the current GPT-4 will be significantly dumber
00:04:15.780 | than their new model.
00:04:17.220 | "ChatGPT is not phenomenal, like ChatGPT is mildly embarrassing at best.
00:04:22.300 | GPT-4 is the dumbest model any of you will ever, ever have to use again by a lot.
00:04:27.060 | But you know, it's like important to ship early and often, and we believe in iterative
00:04:30.980 | deployment."
00:04:31.980 | So an agency- and reasoning-focused GPT-4.5 coming soon, but GPT-5 not until the end of the year
00:04:39.260 | or early next: those are my predictions.
00:04:41.860 | Now, some people were saying that the mystery GPT-2 chatbot could be GPT-4.5.
00:04:48.380 | It was released on a site used to compare the different outputs of language models.
00:04:53.820 | And look, here it is creating a beautiful unicorn, which Llama 3 couldn't do.
00:04:58.780 | Now, I frantically got ready a tweet saying that superintelligence had arrived, but quickly
00:05:03.600 | had to delete it.
00:05:04.940 | And not just because other people were reporting that they couldn't get decent unicorns.
00:05:10.260 | And not just because that exact unicorn could be found on the web.
00:05:13.940 | But the main reason was that I was one of the lucky ones to get in and test GPT-2 chatbot
00:05:19.020 | on the arena before it was withdrawn.
00:05:21.220 | I could only do eight questions, but I gave it my standard handcrafted (so not on the
00:05:26.060 | web) set of test questions, spanning logic, theory of mind, mathematics, coding, and more.
00:05:31.980 | Its performance was pretty much identical to GPT-4 Turbo.
00:05:36.660 | There was one question that it would get right more often than GPT-4 Turbo, but that could
00:05:41.980 | have been noise.
00:05:42.980 | So if this was a sneak preview of GPT-4.5, I don't think it's going to shock and stun
00:05:49.380 | the entire industry.
00:05:50.820 | So tempting as it was to bang out a video saying that AGI has arrived in all caps, I
00:05:55.860 | resisted the urge.
00:05:57.480 | Since then, other testers have found broadly the same thing.
00:06:01.320 | On language translation, the mystery GPT-2 chatbot massively underperforms Claude Opus
00:06:07.100 | and still underperforms GPT-4 Turbo.
00:06:09.780 | On an extended test of logic, it does about the same as Opus and GPT-4 Turbo.
00:06:15.580 | Of course, that still does leave the possibility that it is an OpenAI model, a tiny one, and
00:06:20.580 | one that they might even release with open weights, meaning anyone could use it.
00:06:25.300 | And in that case, the impressive thing would be how well it's performing despite its
00:06:29.420 | small size.
00:06:30.420 | So if GPT-2 chatbot is a smaller model, how could it possibly be even vaguely competitive?
00:06:36.060 | The secret sauce is the data.
00:06:38.860 | As James Betker of OpenAI said, "It's not so much about tweaking model configurations
00:06:43.720 | and hyperparameters, nor is it really about architecture or optimizer choices.
00:06:48.460 | Behavior is determined by your dataset.
00:06:51.780 | It is the dataset that you are approximating to an incredible degree."
00:06:56.620 | In a later post, he referred to the flaws of DALL-E 3 and GPT-4, and also flaws in video,
00:07:02.220 | probably referring to the, at the time, unreleased Sora, and said they arise from a lack of data
00:07:08.060 | in a specific domain.
00:07:09.700 | And in a more recent post, he said that while compute efficiency was still super important,
00:07:14.300 | anything can be state-of-the-art with enough scale, compute, and eval hacking.
00:07:19.220 | Now we'll get to evaluation and benchmark hacking in just a moment, but it does seem
00:07:23.660 | to me that there are more and more hints that you can brute-force performance with enough
00:07:28.380 | compute and, as mentioned, a quality dataset.
00:07:31.540 | In other words, it seems increasingly clear that you can pay your way to top performance.
00:07:37.100 | Unless OpenAI reveal something genuinely shocking, the performance of Meta's Llama 3 (8 billion,
00:07:43.100 | 70 billion, and soon 400 billion parameters) shows that they have less of a secret sauce than many
00:07:48.500 | people had thought.
00:07:49.900 | But as Mark Zuckerberg hinted recently, it could just come down to which company blinks
00:07:54.300 | first.
00:07:55.300 | Who among Google, Meta, and Microsoft, which provides the compute for OpenAI, is willing
00:08:00.140 | to continue to spend tens or hundreds of billions of dollars on new models?
00:08:05.300 | If the secret is simply the dataset, that would make less and less sense.
00:08:09.620 | "Over the last few years, I think there's this issue of GPU production, right?
00:08:15.160 | So even companies that had the money to pay for the GPUs couldn't necessarily get as many
00:08:19.420 | as they wanted because there were all these supply constraints.
00:08:22.660 | Now I think that's sort of getting less.
00:08:25.780 | So now I think you're seeing a bunch of companies think about, 'Wow, we should just like really
00:08:30.980 | invest a lot of money in building out these things.'
00:08:33.420 | And I think that will go for some period of time.
00:08:36.780 | There is a capital question of like, 'Okay, at what point does it stop being worth it
00:08:41.940 | to put the capital in?'
00:08:43.180 | But I actually think before we hit that, you're going to run into energy constraints."
00:08:47.460 | Now if you're curious about energy and data centre constraints, check out my "Why Does
00:08:52.180 | OpenAI Need a Stargate Supercomputer" video released four weeks ago.
00:08:56.520 | But before we leave data centres and datasets, I must draw your attention to this paper released
00:09:02.580 | in the last 24 hours.
00:09:04.580 | It's actually a brilliant paper from Scale AI.
00:09:07.920 | What they did was create a new and refined version of a benchmark that's used all the
00:09:12.560 | time to test the mathematical reasoning capabilities of AI models.
00:09:17.380 | And there were at least four fascinating findings relevant to all new models coming out this
00:09:22.360 | year.
00:09:23.360 | The first is the context: they worried that many of the latest models had seen the benchmark
00:09:28.100 | questions in their training data.
00:09:30.380 | That's called contamination because of course it contaminates the results on the test.
00:09:34.620 | The original test had 8,000 questions, but what they did was create 1,000 new questions
00:09:39.360 | of similar difficulty.
00:09:40.860 | Now if contamination wasn't a problem, then models should perform just as well with the
00:09:44.740 | new questions as with the old.
00:09:47.220 | And for several model families, that didn't happen.
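To make that comparison concrete, here's a minimal sketch in Python of the gap check the paper is effectively running. The model names, accuracies, and threshold below are illustrative assumptions, not the paper's actual figures.

```python
# Sketch: flag likely benchmark contamination from the gap between a
# model's accuracy on the original questions and on the new ones.
# Model names and accuracies are made-up placeholders, not real results.

accuracies = {
    # model: (accuracy_on_original_questions, accuracy_on_new_questions)
    "model_a": (0.82, 0.81),  # small gap: little sign of contamination
    "model_b": (0.79, 0.64),  # large gap: results may reflect memorisation
}

GAP_THRESHOLD = 0.05  # assumed cutoff for a "suspicious" drop

for model, (orig_acc, new_acc) in accuracies.items():
    gap = orig_acc - new_acc
    verdict = "suspicious drop" if gap > GAP_THRESHOLD else "consistent"
    print(f"{model}: original={orig_acc:.0%} new={new_acc:.0%} gap={gap:+.0%} -> {verdict}")
```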
00:09:49.300 | For the Mistral and Phi family of models, performance notably lagged on the new test
00:09:53.860 | compared to the old one.
00:09:55.180 | Whereas, fair's fair, for GPT-4 and Claude, performance was the same or better on the
00:10:00.540 | new, fresh test.
00:10:01.760 | But here's the thing, the authors figured out that that wasn't just about which models
00:10:05.580 | had seen the questions in their training data.
00:10:07.980 | They say that Mistral Large, which performed just as well on both tests, was just as likely to have
00:10:12.480 | seen those questions as Mixtral Instruct, which significantly underperformed.
00:10:16.900 | So what could explain the difference?
00:10:18.620 | Well, the bigger models generalize.
00:10:21.020 | Even if they have seen the questions, they learn more from them and can generalize to
00:10:25.460 | new questions.
00:10:26.680 | And here's another supporting quote: they lean toward the hypothesis that sufficiently
00:10:30.580 | strong large language models learn elementary reasoning ability during training.
00:10:35.500 | You could almost say that benchmarks get more reliable when you're talking about the very
00:10:39.700 | biggest models.
00:10:40.700 | Next, and this seems to be a running theme in popular ML benchmarks, GSM-8K, designed
00:10:46.580 | around grade-school math, has a few errors.
00:10:49.540 | They didn't say how many, but the answers were supposed to be positive integers and some
00:10:53.140 | of them weren't.
00:10:54.140 | The new benchmark, however, passed through three layers of quality checks.
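As an illustration of that kind of check: since every answer is meant to be a positive integer, a first-pass filter could look something like the sketch below. The record layout and the example items are assumptions for illustration, not the benchmark's actual schema.

```python
# Sketch: flag benchmark items whose final answer is not a positive
# integer, the kind of error described above. The "question" /
# "final_answer" layout and the items themselves are assumptions.

def is_positive_integer(answer: str) -> bool:
    """True if the answer string parses as an integer greater than zero."""
    try:
        return int(answer.strip().replace(",", "")) > 0
    except ValueError:
        return False

dataset = [
    {"question": "Sally has 3 apples and buys 4 more. How many...", "final_answer": "7"},
    {"question": "A questionable item that slipped through...", "final_answer": "-2.5"},
]

for item in dataset:
    if not is_positive_integer(item["final_answer"]):
        print("Quality issue:", item["question"][:45], "->", item["final_answer"])
```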
00:10:58.220 | Third, they provide extra theories as to why models might overperform on benchmarks compared
00:11:03.500 | to the real world.
00:11:04.760 | That's not just about data contamination.
00:11:06.820 | It could be that model builders design datasets that are similar to test questions.
00:11:11.980 | After all, if you were trying to bake reasoning into your model, what kind of data would you
00:11:16.180 | collect?
00:11:17.180 | Plenty of exams and textbooks.
00:11:18.980 | So the more similar your dataset is in nature to the benchmarks, not just in exact matches, the more
00:11:24.700 | your benchmark performance will be elevated compared to simple real-world use.
00:11:29.600 | Think about it: it could be an inadvertent thing, where enhancing the overall smartness
00:11:33.860 | of the model has the side effect of inflating benchmark scores.
00:11:38.740 | And whatever you think about benchmarks, that does seem to work.
00:11:42.140 | Here's Sébastien Bubeck, lead author of the Phi series of models.
00:11:45.620 | I've interviewed him for AI Insiders and he said this: "Even on those 1,000 never-before-seen
00:11:51.580 | questions, Phi-3 Mini, which is only 3.8 billion parameters, performed within about 8 or 9%
00:11:58.700 | of GPT-4 Turbo."
00:12:00.380 | Now we don't know the parameter count of GPT-4 Turbo, but it's almost certainly orders
00:12:04.260 | of magnitude bigger.
00:12:05.580 | So training on high quality data, as we have seen, definitely works, even if it slightly
00:12:10.260 | skews benchmark performance.
00:12:12.180 | But one final observation from me about this paper: I read almost all the examples that
00:12:16.900 | the paper gave from this new benchmark.
00:12:19.580 | And as the paper mentions, they involve basic addition, subtraction, multiplication, and
00:12:24.100 | division.
00:12:25.100 | After all, the original test was designed for youngsters.
00:12:27.620 | You can pause and try the questions yourself, but despite being wordy, they
00:12:31.580 | aren't actually hard at all.
00:12:32.860 | So my question is this: why are models like Claude 3 Opus still getting any of these questions
00:12:38.900 | wrong?
00:12:39.900 | They're scoring around 60% on graduate-level expert reasoning, the GPQA.
00:12:45.260 | If Claude 3 Opus, for example, can get questions right that PhDs struggle to get right with
00:12:51.420 | Google and 30 minutes, why on earth, with five short examples, can it not get these
00:12:57.940 | basic grade-school questions right?
00:12:59.620 | Either there are still flaws in the test, or these models do have a limit in terms of
00:13:04.420 | how much they can generalize.
00:13:06.020 | Now, if you like this kind of analysis, feel free to sign up to my completely free newsletter.
00:13:11.540 | It's called Signal to Noise, and the link is in the description.
00:13:15.220 | And if you want to chat in person about it, the regional networking on the AI Insiders
00:13:20.960 | Discord server is popping off.
00:13:23.340 | There are meetings being arranged not only in London, but also in Germany, the Midwest, Ireland,
00:13:29.100 | San Francisco, Madrid, and Brazil, and the list goes on.
00:13:32.700 | Honestly, I've been surprised and honored by the number of spontaneous meetings being
00:13:37.180 | arranged across the world, but it's time, arguably, for the most exciting development
00:13:42.460 | of the week: MedGemini from Google.
00:13:45.740 | It's a 58-page paper, but the TLDR is this.
00:13:49.380 | The latest series of Gemini models from Google are more than competitive with doctors at
00:13:55.580 | providing medical answers.
00:13:57.300 | And even in areas where they can't quite perform, like in surgery, they can be amazing
00:14:02.420 | assistants.
00:14:03.420 | In a world in which millions of people die due to medical errors, this could be a tremendous
00:14:08.780 | breakthrough.
00:14:09.780 | MedGemini contains a number of innovations; it wasn't just a case of rerunning the same tests
00:14:14.380 | on a new model.
00:14:15.380 | For example, you can inspect how confident a model is in its answer.
00:14:19.460 | By trawling through the raw outputs of a model, called the logits, you can see how much
00:14:24.880 | probability it assigns to its answers.
00:14:26.480 | If the model gives a confident answer, you submit that as the answer.
00:14:29.860 | They used this technique, by the way, for the original Gemini launch, where they claimed
00:14:32.980 | to beat GPT-4, but that's another story.
00:14:35.020 | Anyway, if the model is not confident, you can get the model to generate search queries
00:14:40.020 | to resolve those conflicts, train it, in other words, to use Google.
00:14:43.980 | Seems appropriate.
00:14:44.980 | Then you can feed that additional context provided by the web back into the model to
00:14:49.180 | see if it's confident now.
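Here's a minimal sketch of that uncertainty-guided search loop as described, not Google's actual implementation. The helpers `ask_model`, `answer_confidence`, and `web_search` are hypothetical stubs, and the confidence threshold is an assumption.

```python
# Sketch of the uncertainty-guided search loop described above.
# The three helpers are hypothetical stubs standing in for a real
# model call, a logit-derived confidence score, and a search API.

def ask_model(prompt: str, context: str = "") -> str:
    return "placeholder answer"  # stub: call your LLM here

def answer_confidence(answer: str) -> float:
    return 0.5  # stub: read probability off the model's logits

def web_search(query: str) -> list[str]:
    return ["placeholder search result"]  # stub: call a search API

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff for "confident enough"
MAX_ROUNDS = 3              # assumed cap on search rounds

def answer_with_search(question: str) -> str:
    context, answer = "", ""
    for _ in range(MAX_ROUNDS):
        answer = ask_model(question, context=context)
        if answer_confidence(answer) >= CONFIDENCE_THRESHOLD:
            break  # confident: submit this answer as-is
        # Not confident: have the model write a search query, retrieve
        # results, and feed them back in as extra context.
        query = ask_model(f"Write a search query to verify: {question}")
        context += "\n" + "\n".join(web_search(query))
    return answer

print(answer_with_search("What is the first-line treatment for sepsis?"))
```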
00:14:50.820 | But that was just one innovation.
00:14:52.460 | What about this fine tuning loop?
00:14:54.180 | To oversimplify, they get the model to output answers again using the help of search.
00:14:59.140 | And then the outputs that had correct answers were used to fine tune the models.
00:15:04.060 | Now that's not perfect, of course, because sometimes you can get the right answer with
00:15:07.340 | the wrong logic, but it worked.
00:15:09.300 | Up to a certain point, at least.
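A minimal sketch of that self-training loop, with the same caveat: `generate_with_search`, `is_correct`, and `fine_tune` are hypothetical stubs, and matching only on the final answer is exactly the imperfect filter just described.

```python
# Sketch of the self-training loop described above: generate answers
# with the help of search, keep only the outputs that reach the correct
# final answer, and fine-tune on those. All three helpers are stubs.

def generate_with_search(question: str) -> str:
    return "step-by-step reasoning... final answer: 42"  # stub: model + search

def is_correct(output: str, gold: str) -> bool:
    return output.strip().endswith(gold)  # stub: compares final answers only

def fine_tune(examples: list[dict]) -> None:
    print(f"Fine-tuning on {len(examples)} examples")  # stub: training job

training_pairs = [("What is 6 * 7?", "42")]  # (question, gold answer)

kept = []
for question, gold in training_pairs:
    output = generate_with_search(question)
    # Caveat from above: a right final answer can hide wrong reasoning,
    # so answer-matching is an imperfect filter.
    if is_correct(output, gold):
        kept.append({"prompt": question, "completion": output})

fine_tune(kept)
```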
00:15:10.740 | Just last week, by the way, on Patreon, I described how this reinforced in-context learning
00:15:15.740 | can be applied to multiple domains.
00:15:18.460 | Other innovations come from the incredible long-context abilities of the Gemini 1.5 series
00:15:23.760 | of models.
00:15:24.760 | With that family of models, you can trawl through a 700,000-word electronic health record.
00:15:31.180 | Now imagine a human doctor trying to do the same thing.
00:15:34.300 | I remember on the night of Gemini 1.5's release calling it the biggest news of the day, even
00:15:39.380 | more significant than Sora, and I still stand by that.
00:15:42.380 | So what were the results?
00:15:43.660 | Well, of course, state-of-the-art performance on MedQA, which assesses the ability to diagnose
00:15:49.780 | diseases.
00:15:50.780 | The doctor pass rate, by the way, is around 60%.
00:15:53.620 | And how about this for a mini-theme of the video?
00:15:56.320 | When they carefully analyzed the questions in the benchmark, they found that 7.4% of
00:16:02.300 | the questions have quality issues.
00:16:04.500 | Things like lacking key information, incorrect answers, or multiple plausible interpretations.
00:16:10.180 | So just in this video alone, we've seen multiple benchmark issues, and I collected a thread
00:16:15.460 | of other benchmark issues on Twitter.
00:16:18.020 | The positive news, though, is just how good these models are getting at things like medical
00:16:22.260 | note summarization and clinical referral letter generation.
00:16:25.860 | But I don't want to detract from the headline, which is just how good these models are getting
00:16:30.500 | at diagnosis.
00:16:31.500 | Here you can see MedGemini with search way outperforming expert clinicians with search.
00:16:37.540 | By the way, when errors from the test were taken out, performance bumped up even further.
00:16:42.980 | And the authors can't wait to augment their models with additional data.
00:16:47.280 | Things like data from consumer wearables, genomic information, nutritional data, and
00:16:51.820 | environmental factors.
00:16:53.180 | And as a quite amusing aside, it seems like Google and Microsoft are in a tussle to throw
00:16:58.380 | shade at each other's methods in a positive spirit.
00:17:01.640 | Google contrast their approach with MedPrompt from Microsoft, saying that their own approach
00:17:06.220 | is principled and can be easily extended to more complex scenarios beyond MedQA.
00:17:11.100 | Now you might say that's harsh, but Microsoft earlier had said that their MedPrompt approach
00:17:15.940 | shows GPT-4's ability to outperform Google's model that was fine-tuned specifically for
00:17:21.540 | medical applications.
00:17:22.540 | It outperforms on the same benchmarks by a significant margin.
00:17:26.340 | Well, Google have obviously one-upped them by reaching new state-of-the-art performances
00:17:30.620 | on 10 of 14 benchmarks.
00:17:32.860 | Microsoft had also said that their approach uses simple prompting and doesn't need more
00:17:37.180 | sophisticated and expensive methods.
00:17:39.380 | Google shot back saying they don't need complex, specialized prompting and their approach
00:17:43.720 | is best.
00:17:44.720 | Honestly, this is competition that I would encourage.
00:17:47.620 | May they long compete for glory in this medical arena.
00:17:51.340 | In case you're wondering, because Gemini is a multimodal model, it can see images too.
00:17:56.620 | It can interact with patients and ask them to provide images.
00:17:59.940 | The model can also interact with primary care physicians and ask for things like X-rays.
00:18:04.620 | And most surprisingly to me, it can also interact with surgeons to help boost performance.
00:18:10.260 | Yes, that's video assistance during live surgery.
00:18:13.640 | Of course, they haven't yet deployed this for ethical and safety reasons, but Gemini
00:18:18.120 | is already capable of assessing a video scene and helping with surgery.
00:18:23.060 | For example, by answering whether the critical view of safety criteria are being met.
00:18:27.780 | Do you have a good view of the gallbladder, for example?
00:18:31.260 | And MedGemini could potentially guide surgeons in real time during these complex procedures,
00:18:35.760 | improving not only accuracy but also patient outcomes.
00:18:39.840 | Notice the nuance of the response from Gemini: the lower third of the gallbladder is
00:18:44.560 | not dissected off the cystic plate.
00:18:47.240 | And the authors list many improvements that they could have made; they just didn't have
00:18:51.120 | the time.
00:18:52.120 | For example, the models were searching the wild web.
00:18:55.520 | And one option would be to restrict the search results to just more authoritative medical
00:19:00.920 | sources.
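A sketch of what that restriction could look like; the domain allowlist here is illustrative, not anything the paper specifies.

```python
# Sketch: filter search results down to an allowlist of medical domains
# before feeding them back to the model. The allowlist is illustrative.

from urllib.parse import urlparse

ALLOWED_DOMAINS = {"nih.gov", "who.int", "nice.org.uk", "cochrane.org"}

def is_authoritative(url: str) -> bool:
    host = urlparse(url).netloc.lower()
    # Match the domain itself or any subdomain of it.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

results = [
    "https://www.nih.gov/health-information",
    "https://randomblog.example.com/miracle-cure",
]

trusted = [url for url in results if is_authoritative(url)]
print(trusted)  # only the nih.gov link survives
```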
00:19:01.920 | The catch, though, is that this model is not open-sourced and isn't widely available due to,
00:19:06.240 | they say, the safety implications of unmonitored use.
00:19:10.200 | I suspect the commercial implications of open-sourcing Gemini also had something to do with it.
00:19:16.080 | But here's the question I would set for you.
00:19:17.360 | We know that hundreds of thousands or even millions of people die due to medical mistakes
00:19:21.860 | around the world.
00:19:22.860 | So if and when MedGemini 2, 3 or 4 becomes unambiguously better than all clinicians at
00:19:29.640 | diagnosing diseases, then at what point is it unethical not to at least deploy them in
00:19:35.640 | assisting clinicians?
00:19:36.640 | That's definitely something at least to think about.
00:19:39.400 | Overall, this is exciting and excellent work.
00:19:42.000 | So many congratulations to the team.
00:19:44.640 | And in a world that is seeing some stark misuses of AI, as well as increasingly autonomous
00:19:51.240 | deployment of AI, like this autonomous tank, the fact that we can get breakthroughs like
00:19:56.440 | this is genuinely uplifting.
00:19:58.640 | Thank you so much for watching to the end and have a wonderful day.