Gemini 1.5 and The Biggest Night in AI


Transcript

You could call tonight the triumph of the transformers or maybe the battle for attention with two thunderous developments within a handful of hours of each other. One from Google DeepMind and then OpenAI. But ultimately, the bigger picture is this. The exponential advance of artificial intelligence and its applications shows no sign of slowing.

If you were hoping for a quiet year, tonight just shattered that peaceful vision. Of course, I could have done a video on either Gemini 1.5, which is arguably as of tonight the most performant language model in the world, or Sora, the text-to-video model OpenAI released shortly after Google. Possibly as a purposeful grab of the spotlight, possibly coincidentally.

Both developments are game-changing, but with Sora's technical paper due out later today, that gives us a chance to give Gemini 1.5 its due attention. Yes, I've read all 58 pages of the technical paper, as well as four papers linked in the appendices, in my search for how it got the results it did.

I've got 62 notes, so let's dive in. Here is the big development: Gemini 1.5 can recall and reason over information across millions of tokens of context. In other examples, that meant 22 hours of audio, three hours of low-frame-rate video, or six to eight minutes of normal-frame-rate video.

I don't want to bury the headline. We're talking about near perfect retrieval of facts and details up to at least 10 million tokens. Performance did not dip at 10 million tokens. Indeed, the trend line got substantially better. For reference, in text, 10 million tokens would be about 7.5 million words.

To wrap your head around how many words 10 million tokens is, that's around 0.2% of all of English language Wikipedia. And again, just for emphasis, we're talking at least 10 million tokens. Now, admittedly, in the blog post and paper, they do talk about latency trade-offs with that many tokens.
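To make those back-of-the-envelope numbers concrete, here's the rough arithmetic; the 0.75 words-per-token ratio and the Wikipedia word count are assumptions on my part, not figures from the paper:

```python
# Rough arithmetic behind the context window comparisons above.
# Assumptions (mine, not the paper's): ~0.75 English words per token,
# and roughly 4 billion words in English-language Wikipedia.
TOKENS = 10_000_000
WORDS_PER_TOKEN = 0.75
WIKIPEDIA_WORDS = 4_000_000_000  # assumed, order-of-magnitude figure

words = TOKENS * WORDS_PER_TOKEN
print(f"{TOKENS:,} tokens ~ {words:,.0f} words")                        # ~ 7,500,000 words
print(f"~ {100 * words / WIKIPEDIA_WORDS:.2f}% of English Wikipedia")   # ~ 0.19%
```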

And no, in case you're wondering, Gemini 1.5 isn't currently widely available; it's out only to a limited group of developers and enterprise customers. In case that puts you off, though, Google promised this: "Significant improvements in speed are also on the horizon." But before we get back to the paper, how about a demo so you can see Gemini 1.5 Pro in action?

This is a demo of long context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through a screen recording of example prompts using a 402 page PDF of the Apollo 11 transcript, which comes out to almost 330,000 tokens. We started by uploading the Apollo PDF into Google AI Studio and asked, "Find three comedic moments.

List quotes from this transcript and emoji." This screen capture is sped up. This timer shows exactly how long it took to process each prompt, and keep in mind that processing times will vary. The model responded with three quotes, like this one from Michael Collins, "I'll bet you a cup of coffee on it." If we go back to the transcript, we can see the model found this exact quote and extracted the comedic moment accurately.

Then we tested a multimodal prompt. We gave it this drawing of a scene we were thinking of and asked, "What moment is this?" The model correctly identified it as Neil's first steps on the moon. I'll feature more demos later in the video, but notice that the inference time in this sped up video wasn't that bad.
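The demo uses the Google AI Studio web app, but for anyone wondering what the equivalent API call might look like, here's a minimal sketch using the google-generativeai Python SDK. Treat it as illustrative: the model id, the File API upload step for PDFs, and access to the long-context preview are assumptions on my part rather than details from the video or paper.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# Assumes your account has access to a long-context Gemini 1.5 Pro preview
# and that the File API accepts this PDF; both are assumptions here.
apollo_pdf = genai.upload_file("apollo_11_transcript.pdf")

model = genai.GenerativeModel("gemini-1.5-pro-latest")  # assumed model id
response = model.generate_content([
    apollo_pdf,
    "Find three comedic moments. List quotes from this transcript and emoji.",
])
print(response.text)
```

The demo itself is done entirely in the AI Studio interface; the SDK route would just be the programmatic equivalent.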

And yes, for anyone paying attention, this is Gemini 1.5 Pro. What that means is that any results you're seeing will soon be improved upon by Gemini 1.5 Ultra. Remember that we have Gemini Nano, then Gemini Pro, the medium-sized model, and finally Gemini Ultra. Well, how did Google DeepMind achieve this feat?

They say in the introduction that the model incorporates a novel mixture of experts architecture, as well as major advances in training and serving infrastructure that allows it to push the boundaries of efficiency, reasoning, and long context performance. In case you're wondering, that long context again refers to that huge volume of text, image, video, and audio data that Gemini 1.5 can ingest.

But I will admit that when I first read this introduction, I thought they had used the Mamba architecture. That was billed as the successor architecture to the Transformer, and I did a video on it on the 1st of January. It too achieved amazing results in long context tasks and outperformed the Transformer.

The Transformer is, of course, the T in GPT. However, by the time I finished the paper, it was pretty clear that it wasn't based on Mamba. It took me a little while, and quite a few of the papers cited in the appendices, to figure out, but I think I've got a pretty good guess as to what the architecture is based on.

Anyway, what is the next interesting point from the paper? Well, Google does confirm that Gemini 1.5 Pro requires significantly less compute to train than 1.0 Ultra. So it's arguably better than Ultra, and we'll see the benchmarks in a moment, but requires significantly less compute to train. That is maybe why Google DeepMind were able to get out Gemini 1.5 Pro so soon after announcing Gemini 1.0.

And don't forget that Gemini 1.0 Ultra as part of Gemini Advanced was only released to the public a week ago. That's when my review video came out. Google, by the way, gives you a two month free trial of Gemini Advanced, and I now think part of the reason for that is to give them time to incorporate Gemini 1.5 Pro before that free trial ends.

And here is the first bombshell graphic in the paper. This is the task of finding a needle in a haystack across text, video, and audio modalities. A fact or passcode might be buried at varying depths of sequences of varying lengths. As you can see for text, these lengths went up to 10 million tokens.

For audio, it was up to 22 hours and video up to three hours. The models would then be tested on that fact. As you can see, the performance was incredible with just five missed facts. For reference, I went back to the original benchmark, which I've cited in two previous videos to compare.

Here is GPT-4's performance at up to only 128,000 tokens. As you can see, as the sequence length gets to around 80,000 words, or 100,000 tokens, the performance, especially midway through the sequence, degrades. At the time, Anthropic's Claude 2.1 performed even worse, although they did subsequently come up with a prompt engineering hack to reduce most of these incorrect recalls.
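For anyone who wants a feel for how this benchmark is constructed, here's a minimal sketch of the needle-in-a-haystack setup. It is not Google's or the original benchmark's actual harness, and `ask_model` is a placeholder for whichever API you'd call:

```python
import random

def build_haystack(filler_sentences: list[str], needle: str,
                   target_tokens: int, depth: float) -> str:
    """Pad filler text to roughly target_tokens and bury the needle
    at a fractional depth (0.0 = start, 1.0 = end)."""
    approx_tokens_per_sentence = 15  # rough assumption
    n = target_tokens // approx_tokens_per_sentence
    sentences = [random.choice(filler_sentences) for _ in range(n)]
    sentences.insert(int(depth * len(sentences)), needle)
    return " ".join(sentences)

def run_grid(ask_model, filler, lengths, depths):
    """Sweep context length x needle depth, scoring 1 if the model recalls the needle."""
    results = {}
    for length in lengths:
        for depth in depths:
            secret = str(random.randint(100000, 999999))
            needle = f"The special magic number is {secret}."
            prompt = build_haystack(filler, needle, length, depth)
            prompt += "\n\nWhat is the special magic number mentioned above?"
            answer = ask_model(prompt)  # placeholder: call your model here
            results[(length, depth)] = secret in answer
    return results
```

The published charts come from exactly this kind of grid, with retrieval accuracy plotted by sequence length and needle depth.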

But are you ready for the next bombshell? They see Gemini 1.5 Pro outperforming all competing models across all modalities, even when these models are augmented with external retrieval methods. In the lingo, that's RAG. In layman's terms, that's grabbing relevant text to assist them in answering the questions. Of course, with a long context, Gemini 1.5 Pro is just simply ingesting the entire document.
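To spell out that contrast, here's a toy sketch of the two approaches. The keyword-overlap retriever stands in for a real embedding-based RAG pipeline, and `ask_model` is again a placeholder, so read it as an illustration of the idea rather than anything from the paper:

```python
def retrieve_top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Toy RAG retriever: score chunks by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)
    return scored[:k]

def answer_with_rag(ask_model, question: str, document: str) -> str:
    # Classic RAG: chunk the document, retrieve a few pieces, prompt with those.
    chunks = [document[i:i + 2000] for i in range(0, len(document), 2000)]
    context = "\n\n".join(retrieve_top_k(question, chunks))
    return ask_model(f"{context}\n\nQuestion: {question}")

def answer_with_long_context(ask_model, question: str, document: str) -> str:
    # The long-context approach: no retrieval step, just hand over everything.
    return ask_model(f"{document}\n\nQuestion: {question}")
```

The trade-off is that the retrieval step can miss the relevant chunk entirely, which is exactly the failure mode a multi-million-token context sidesteps.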

And now for one of the key tables in the paper. You might be wondering, and indeed Google DeepMind were wondering, if this extra performance in long context tasks would mean a trade-off for other types of tasks. Text, vision, and audio tasks that didn't require long context. The answer was no.

Gemini 1.5 Pro is better on average compared to 1.0 Pro across text, vision, and audio. In other words, it hasn't just got better at long context, it's got better at a range of other tasks. It beats Gemini 1.0 Pro 100% of the time in text benchmarks and most of the time in vision and audio benchmarks.

And wait, there's more. It also beats Gemini 1.0 Ultra most of the time in text benchmarks. Of course, at this point, if you only care about standard benchmarks and not about long context, it's more or less a draw with Gemini 1.0 Ultra, with a win rate of 54.8%, and it would also be pretty much a tie with GPT-4.

But I think it's fair to say that once you bring in long context capabilities, it is now indisputably the best language model in the world. I should caveat that: it's the best language model that is accessible to at least some people out there. Of course, behind the scenes, Google DeepMind have Gemini 1.5 Ultra, which would surely be better than 1.5 Pro.

And I know what you're thinking: this is Google, so maybe they used some kooky prompting method to get these results. Well, I've read the entire paper and it doesn't seem so to me. These are genuine like-for-like comparisons to 1.0 Pro and 1.0 Ultra. Now, I have looked in the appendices, and the exact wording of the prompts may have been different from those used for, say, GPT-4.

It does seem borderline impossible to get perfectly like-for-like comparisons. But from everything I can see, this does seem to be a genuine result. Now, before you get too excited, as you can see, the benchmark results for 1.5 Pro in non-long-context tasks are pretty impressive, but we're not completely changing the game here.

They haven't come up with some sort of architecture that just crushes in every task. We're still dealing with a familiar language model for most tasks. But before I give you my architecture speculations, time for another demo, this time analyzing a one frame per second, 44-minute movie. This is a demo of long context understanding, an experimental feature in our newest model, Gemini 1.5 Pro.

We'll walk through a screen recording of example prompts using a 44-minute Buster Keaton film, which comes out to over 600,000 tokens. In Google AI Studio, we uploaded the video and asked, "Find the moment when a piece of paper is removed from the person's pocket and tell me some key information on it with the time code." This screen capture is sped up, and this timer shows exactly how long it took to process each prompt.

And keep in mind that processing times will vary. The model gave us this response, explaining that the piece of paper is a pawn ticket from Goldman & Co. pawnbrokers with the date and cost. And it gave us this time code, 12:01. When we pulled up that time code, we found it was correct.

The model had found the exact moment the piece of paper is removed from the person's pocket, and it extracted text accurately. Next, we gave it this drawing of a scene we were thinking of and asked, "What is the time code when this happens?" The model returned this time code, 15:34.

We pulled that up and found that it was the correct scene. Like all generative models, responses vary and won't always be perfect. But notice how we didn't have to explain what was happening in the drawing. Gemini being multimodal from the ground up is really shining here, and I do think we have to take a step back and say, "Wow." At the moment, this might mean 6 to 8 minutes of a 24 or 30 frames per second video on YouTube.

But still, successfully picking out that minor detail in that short a time is truly groundbreaking. Given that Google owns YouTube, you will soon be querying, say, AI Explained videos. Okay, time for my guess about how they managed it in terms of architecture. Well, first they say simply, "It's a sparse mixture-of-experts Transformer-based model." Those are fairly standard terms described in other videos of mine.

But then here's the key sentence, "Gemini 1.5 Pro also builds on the following research and the language model research in this broader literature." Now, I had a look at most of these papers, particularly the recent ones, and one stood out. This paper, by Jiang et al., came out just over a month ago.

And remember, Google says that Gemini 1.5 Pro builds on this work. Now, building on something that came out that recently is pretty significant. Of course, Google have their own massive body of literature on sparse mixture of experts and indeed invented the transformer architecture. But this tweet tonight from one of the key authors of Gemini 1.5 does point to things developing more rapidly recently.

Pranav Shyam said: "Just a few months ago, Nikolai, Dennis and I were exploring ways to dramatically increase our context lengths. Little did we know that our ideas would ship in production so quickly." So yes, Google has work going back years on sparse mixture-of-experts models. And yes, too many people underestimated the years of innovation going on quietly at Google, in this case for inference.

But for the purposes of time, this is the paper I'm going to focus on: the one by Jiang et al., released around a month ago. It's, of course, the Mixtral of Experts paper from Mistral AI. That's that brand new French outfit with a multi-billion-euro valuation. And no, I don't just think it's relevant because of the date and the fact it's sparse and a mixture of experts.

Mixture of experts, in a nutshell, is when you have a bigger model comprised of multiple smaller blocks, or experts. When the tokens come in, they are dynamically routed to just two (in most cases) relevant experts or blocks, so the entire model isn't active during inference. It's lightweight and effective, but no, that's not the reason why I focused on this paper.
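As an aside, here's what that top-2 routing looks like as a minimal numpy sketch. This is the generic sparse mixture-of-experts pattern, not Google's or Mistral's actual implementation, and the sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2  # made-up sizes

# Each "expert" here is just a small feed-forward weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router                      # router scores this token against all experts
    top = np.argsort(logits)[-top_k:]            # pick the top-2 experts for this token
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen two
    # Only the selected experts do any work; the other six are never touched.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.standard_normal(d_model))
```

Only a couple of the experts are active for any given token, which is the lightweight-but-effective property just described. Back to why this particular paper stood out to me.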

It's because of this section, 3.2, on long-range performance: Mixtral managed to achieve 100% retrieval accuracy regardless of the context length, and also regardless of the position or depth of the passkey. Of course, Mistral only proved that up to 32,000 tokens, and Google, I believe, have taken it much further.

That's my theory. Let me know in the comments if you think I'm right. Google do say they also made improvements in terms of data optimization and systems, but if you're looking for more info on compute or the training dataset, good luck. Other than saying that the compute is significantly less than Gemini 1.0 Ultra and that Gemini 1.5 Pro is trained on a variety of multimodal and multilingual data, they don't really give us anything.

Well, I tell a lie. They do say that it was trained across multiple data centers. Given that a single data center maxes out at around 32,000 TPUs, and we know Google uses TPUs, that still gives us a sense of the sheer scale of Gemini's compute. And there is one more task that Google DeepMind really want us to focus on.

Admittedly, it is very impressive. They gave Gemini 1.5 Pro a grammar book and dictionary, 250,000 tokens in total from a super obscure, low resource language. The language is Kalamang and I had never heard of it. They take pains to point out that none of that language was in the training dataset.

And so what was the result? Well, not only did Gemini 1.5 Pro crush GPT-4, it also did as well as a human who had learned from the same materials. Now, we're not talking about someone from that region of New Guinea. The reason a human was used for comparison was to make that underlying point.

Models are starting to approach the learning rate, at least in terms of language, of a human being. And don't forget, this factors in data efficiency: same amount of data, similar result. Next up is what I believe to be a fascinating graphic. It shows what happens as a model (in blue, Gemini 1.5 Pro) is fed more and more of a long document and of a code database.

And the lower the curves go, the more accurate the model is getting at predicting the next word. What happens to December's Gemini Pro as you feed it more and more tokens? Well, it starts to get overwhelmed, both for code and for long documents. As the paper says, that older model (and I hesitate to call it older, because it's just two months old) is unable to effectively use information from the previous context and deteriorates in terms of prediction quality.

But with Gemini 1.5 Pro, the more it's fed, the better it gets. Even for a sequence of length a million for documents or 10 million for code. It's quote, remembering things from millions of lines of code ago to answer questions now. I think it's significant that when we get to sequence lengths of around five to 10 million, the curve actually dips downward.

It no longer follows the power law trend. That would suggest to me that if we went up to a hundred million, the results would be even more impressive. Here's what Google have to say. The results above suggest that the model is able to improve its predictions by finding useful patterns, even if they occurred millions of tokens in the past, as in the case of code.

And to summarize this, we already knew that lower loss could be gotten from more compute. It's a very similar curve, but what's new is that the power law is holding between loss and context length, as shown above. They say that from inspecting longer code token predictions, closer to 10 million, we see a phenomenon of the increased context occasionally providing outsized benefit.
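To make that parallel concrete, the relationship they're describing has the same functional form as the familiar compute scaling laws, just with context length on the x-axis. Here's a minimal curve-fitting sketch, with made-up numbers standing in for the paper's actual measurements:

```python
import numpy as np

# Hypothetical, illustrative numbers only; not the paper's data.
context_lengths = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
nll = np.array([1.10, 0.95, 0.83, 0.72, 0.60])  # cumulative average next-token loss

# A power law NLL(n) = a * n^b is a straight line in log-log space,
# so a linear fit on the logs recovers the exponent b (negative here).
b, log_a = np.polyfit(np.log(context_lengths), np.log(nll), 1)
fitted = np.exp(log_a) * context_lengths ** b

print(f"fitted exponent b ~ {b:.3f}")
print("below the fitted line?", nll < fitted)  # which measurements dip under the power law
```

The interesting part is where the real measurements fall below that fitted line at the longest code contexts, which is the occasional outsized benefit Google themselves flag.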

That could be due to repetition of code blocks. They think this deserves further study and may be dependent on the exact data set used. So even Google aren't fully sure what's causing that dip. Now we all know that OpenAI kind of trolled Google tonight by releasing Sora so soon after Gemini 1.5 Pro.

But on this page, I feel Google were doing a little bit of trolling of OpenAI. First, we have this comparison, again of retrieval, and they say they got API errors after 128,000 tokens. Well, of course they knew that would happen, because GPT-4 Turbo only supports 128,000 tokens. I think they kind of wanted to say: after this length, we crush it, and with them, you just get an error code.

And the next bit of trolling comes here, in these haystack challenges, where they secrete a phrase like this: "The special magic {city} number is: {number}." With this, the model has to retrieve the correct city and number, which are randomized. But that phrase could have been hidden in any long text, and they chose the essays of Paul Graham.

Now, yes, this is almost certainly coincidental, but Paul Graham was the guy who fired Sam Altman at Y Combinator. Sam Altman disputes that it was a firing. For audio, it's the same thing. Even when they break down long audio into segments that Whisper can digest, which are then transcribed and fed to GPT-4 Turbo, the difference is stark.

Before you think, though, that Gemini 1.5 Pro is perfect at retrieval, what happens when you feed in 100 needles into a massive haystack? Well, in that case, it still massively outperforms GPT-4 Turbo, but can recall, as you can see, 60, 70, 80% of those needles. It is not a perfect model and no, we don't have AGI.

And at this point, Google does state that retrieval is not the same as reasoning. They basically beg for harder benchmarks, ones that require integrating disparate facts, drawing inferences, or resolving inconsistencies, essentially reasoning. If you want to know more about how reasoning, some would say, is the final holy grail of large language models, do check out my Patreon AI Insiders.

I have around a dozen videos and podcasts up as of today. In fact, it was just today that I released this video on my Patreon. It's a 14-minute tour of deepfakes, and it features interviews, exclusives, and more. If you are a student or retired, do email me about a potential small discount.

Now for the final demo, in coding. We'll walk through some example prompts using the three.js example code, which comes out to over 800,000 tokens. We extracted the code for all of the three.js examples and put it together into this text file, which we brought into Google AI Studio over here.

We asked the model to find three examples for learning about character animation. The model looked across hundreds of examples and picked out these three. Next, we asked, "What controls the animations on the Littlest Tokyo demo?" As you can see here, the model was able to find that demo, and it explained that the animations are embedded within the glTF model.

Next, we wanted to see if it could customize this code for us. So we asked, "Show me some code to add a slider to control the speed of the animation. Use that kind of GUI the other demos have." This is what it looked like before on the original three.js site.

And here's the modified version. It's the same scene, but it added this little slider to speed up, slow down, or even stop the animation on the fly. Again, with audio, Gemini crushes Whisper: it has a significantly lower word error rate. And for video, it was pretty funny that they had to invent their own benchmarks because the other ones were too easy.

Or, in the paper's formal language, "to bridge this evaluation gap," they introduced a new benchmark testing that incredible feat we saw earlier of picking out key details from long videos. Now, to be clear, despite the demos looking good and beating GPT-4V, we're still not close to perfect. Just because Gemini 1.5 Pro can see across long context and watch long videos doesn't mean it's perfect at answering questions.

Remember that recalling facts is not the same as reasoning or getting 100% on multiple-choice questions. I also found this part of the paper quite funny, where they tried to highlight the extent of any trade-offs from switching architecture, if they exist. The problem was, they couldn't find any. Across the board, 1.5 Pro was just better than 1.0 Pro.

Whether that was math, science, coding, multilinguality, instruction following, image understanding, video understanding, speech recognition, or speech translation. Of course, it's obligatory at this point for me to ding them about the accuracy level of their MMLU benchmark test for Gemini 1.5 Pro. They say for math and science, it's 1.8% behind Gemini 1.0 Ultra.

But how meaningful is that with this many errors just in the college chemistry section of the MMLU? Buried deep is one admission that 1.5 Pro doesn't seem quite as good at OCR. That's optical character recognition, in other words, recognizing text from an image. But Google Cloud Vision is state of the art anyway at OCR and soon enough, surely, they're going to integrate that.

So I don't see OCR being a long-term weakness for the Gemini series. And it's hard to tell, but it seems like Google found some false negatives in other benchmarks. And so the performance there was lower bounding the model's true performance. And they complain, as I did in my original Smart GPT video, that maybe we need to rely more on human evaluations for these datasets and that maybe we should deviate from strict string matching.

And there was this quite cute section in the impact assessment part of the paper. So what are the impacts of Gemini 1.5 Pro going to be? Well, they say the ability to understand longer content enhances the efficiency of individual and commercial users in processing various multimodal inputs, but that, besides efficiency, the model enables societally beneficial downstream use cases.

And they foresee Gemini 1.5 Pro being used to explore archival content that might potentially benefit journalists and historians. Suffice to say, I think this is somewhat underplaying the impact of Gemini 1.5 Pro. Just for one, I think it could transform how YouTube works. Or another obvious one. What about long term "relationships" with chatbots?

ChatGPT's new memory feature, which seems to me like only a slightly more advanced custom instruction, pales in comparison to Gemini 1.5's potential. You could have discussions lasting for months with Gemini, and it might remember a detail you said back, say, six months ago. That seems to me like true memory, and it might encourage a kind of companionship for some people with these models.

On safety, without giving too much detail, they just say it's safer than 1.0 Pro and 1.0 Ultra. But later they do admit two things. First, Gemini 1.5 Pro does seem a little bit more biased. It's probably a bit harder for the model to be anti-stereotypical when it remembers so much.

Also, and I know this is going to annoy quite a few people, it has a higher refusal rate. That's both on questions that should legitimately be refused and on ones that shouldn't have been; in other words, questions that should have been answered. Of course, by the time the model actually comes out, we'll have to see if this is still the case.

But you just have to take a look at my Gemini Ultra review to see that at the moment the refusals are pretty extreme. This could honestly be a key sticking point for a lot of people. We're drawing to an end here, but just a quick handful of further observations.

Remember that trick with ChatGPT where you submit the letter A with a space, say, 500 times, and it sometimes regurgitates its training data? Well, apparently that also works on Gemini 1.5 Pro. The thing is, you have to manually repeat that character many more times, up to a million times. But with those long prompts, they do admit that it becomes easier to obtain its memorized data.

I presume that's the kind of thing that Google DeepMind are working on before they release Gemini 1.5 Pro. And one more key detail from the blog post that many people might have missed. When Gemini 1.5 Pro is released to the public, it's going to start with just a 128,000-token context window.

I say just, but that's still pretty impressive. And it seems to me, based on the wording of the following sentence, that even that basic version won't be free. They say we plan to introduce pricing tiers that start at the standard 128,000 context window. So anyone hoping to get Gemini 1.5 for free seems to have misplaced hope.

And then there are going to be tiers going up to 1 million tokens. I'm not sure how expensive that 1-million-token tier will be, but I'll probably be on it. Notice that we probably won't be able to buy access going all the way up to 10 million tokens. But I do want to end on a positive note for Google.

There was one thing I missed out from my review of Google Gemini Ultra, and I want to make amends. And that is its creative writing ability. Gemini 1.0 Ultra is simply better at creative writing than GPT-4. And of course, we're not even talking about 1.5 Ultra. How so? Well, Gemini varies its sentence length.

We have short sentences like this: "Dibbons never really listened." We also get far more dialogue, which is just much closer to real creative writing. There's a bit more humor in there. Whatever they did with their writing dataset, they did better than OpenAI. GPT-4 stories tend to be far more wordy, a lot more tell, not show.

And I'm actually going to go one further and prove that to you. When you put a GPT-4 story into two state-of-the-art AI text detectors, GPTZero and Binoculars (the latter a new tool), both of them say most likely AI-generated. GPTZero puts it at 97%. For Claude, we also get most likely AI-generated.

Although GPTZero erroneously says it's only 12% likely AI-generated; that's Claude's story. It's way too much to get into in this video now, but remember, Binoculars is state-of-the-art compared to GPTZero. But here's the punchline. This is Google Gemini's story. From GPTZero, we get a 0% chance of being AI-generated. And even the state-of-the-art Binoculars gives it most likely human-generated.

And I think this proves two points. First, Gemini is definitely better at creative writing (and at making marketing copy, by the way, but that's too long to get into here). And second, don't put your faith in AI text detectors, especially not in the age of Gemini. If you want to learn more about detecting AI and deepfakes, of course, I refer you back to my deepfakes video on my Patreon, AI Insiders.

So that is Gemini 1.5 Pro. And yes, this does seem the most significant night for AI since the release of GPT-4. As I said in my video on January the 1st, AI is still on an exponential curve. 2024 will not be a slow year in AI. And for as long as I can, I will be here to cover it all.

Thank you so much for watching and have a wonderful day.