
Gemini 1.5 and The Biggest Night in AI



00:00:00.000 | You could call tonight the triumph of the transformers or maybe the battle for attention
00:00:06.720 | with two thunderous developments within a handful of hours of each other.
00:00:11.920 | One from Google DeepMind and then OpenAI. But ultimately, the bigger picture is this.
00:00:17.760 | The exponential advance of artificial intelligence and its applications shows no sign of slowing.
00:00:24.720 | If you were hoping for a quiet year, tonight just shattered that peaceful vision.
00:00:29.680 | Of course, I could have done a video on either Gemini 1.5, which is arguably as of tonight the
00:00:36.160 | most performant language model in the world, or Sora, the text-to-video model OpenAI released
00:00:43.280 | shortly after Google. Possibly as a purposeful grab of the spotlight, possibly coincidentally.
00:00:49.440 | Both developments are game-changing, but with Sora's technical paper due out later today,
00:00:56.240 | we can give Gemini 1.5 its due attention. Yes, I've read all 58 pages of the technical paper
00:01:04.160 | as well as four papers linked in the appendices in my search for how it got the results it did.
00:01:10.240 | I've got 62 notes, so let's dive in. Here is the big development. Gemini 1.5 can recall
00:01:17.760 | and reason over information across millions of tokens of context. Or, in other examples they give:
00:01:23.200 | 22 hours of audio, or three hours of low-frame-rate video, or six to eight minutes of
00:01:30.560 | normal-frame-rate video. I don't want to bury the headline. We're talking about
00:01:34.720 | near perfect retrieval of facts and details up to at least 10 million tokens. Performance did
00:01:40.800 | not dip at 10 million tokens. Indeed, the trend line got substantially better. For reference,
00:01:46.240 | in text, 10 million tokens would be about 7.5 million words. To wrap your head around how many
00:01:51.840 | words 10 million tokens is, that's around 0.2% of all of English language Wikipedia.
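To make that arithmetic concrete, here's the back-of-the-envelope version. The words-per-token ratio and the Wikipedia word count below are rule-of-thumb assumptions on my part, not figures from the paper:

```python
# Back-of-the-envelope: how much text is 10 million tokens?
# Assumptions (mine): ~0.75 English words per token, and roughly
# 4 billion words across all of English-language Wikipedia.
tokens = 10_000_000
words_per_token = 0.75
wikipedia_words = 4_000_000_000

words = tokens * words_per_token     # 7,500,000 words
share = words / wikipedia_words      # ~0.0019
print(f"{words:,.0f} words ≈ {share:.2%} of English Wikipedia")
# -> 7,500,000 words ≈ 0.19% of English Wikipedia
```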
00:01:58.320 | And again, just for emphasis, we're talking at least 10 million tokens. Now, admittedly,
00:02:03.280 | in the blog post and paper, they do talk about latency trade-offs with that many tokens. And
00:02:08.240 | no, in case you're wondering, Gemini 1.5 isn't currently widely available, just to a limited
00:02:13.440 | group of developers and enterprise customers. In case that puts you off, though, Google promised
00:02:17.920 | this: significant improvements in speed are also on the horizon. But before we get back to the
00:02:23.360 | paper, how about a demo so you can see Gemini 1.5 Pro in action? This is a demo of long context
00:02:30.640 | understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through
00:02:36.560 | a screen recording of example prompts using a 402-page PDF of the Apollo 11 transcript,
00:02:42.320 | which comes out to almost 330,000 tokens. We started by uploading the Apollo PDF into Google
00:02:49.760 | AI Studio and asked, "Find three comedic moments. List quotes from this transcript and emoji."
00:02:56.960 | This screen capture is sped up. This timer shows exactly how long it took to process each prompt,
00:03:03.600 | and keep in mind that processing times will vary.
00:03:08.000 | The model responded with three quotes, like this one from Michael Collins,
00:03:12.480 | "I'll bet you a cup of coffee on it." If we go back to the transcript,
00:03:16.080 | we can see the model found this exact quote and extracted the comedic moment accurately.
00:03:20.080 | Then we tested a multimodal prompt. We gave it this drawing of a scene we were thinking of and
00:03:24.960 | asked, "What moment is this?" The model correctly identified it as Neil's first steps on the moon.
00:03:33.680 | I'll feature more demos later in the video, but notice that the inference time in this
00:03:37.200 | sped-up video wasn't that bad. And yes, for anyone paying attention, this is Gemini 1.5 Pro.
00:03:43.600 | What that means is that any results you're seeing will soon be improved upon by Gemini 1.5 Ultra.
00:03:50.240 | Remember that we have Gemini Nano, then Gemini Pro, the medium-sized model,
00:03:54.480 | and finally Gemini Ultra. Well, how did Google DeepMind achieve this feat?
00:03:58.640 | They say in the introduction that the model incorporates a novel
00:04:02.480 | mixture of experts architecture, as well as major advances in training and serving
00:04:07.760 | infrastructure that allows it to push the boundaries of efficiency, reasoning,
00:04:11.760 | and long context performance. In case you're wondering, that long context again refers to
00:04:16.480 | that huge volume of text, image, video, and audio data that Gemini 1.5 can ingest.
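For a rough sense of how those media durations map to token counts, here's a sketch. The per-frame and per-second rates are assumptions for illustration (low-frame-rate video sampled at one frame per second, a few hundred tokens per frame, a few dozen tokens per second of audio), not numbers confirmed in this video:

```python
# Illustrative media-to-token arithmetic. The rates below are
# assumptions, not figures quoted in this video.
TOKENS_PER_VIDEO_FRAME = 258  # assumed tokens per sampled frame
VIDEO_FPS = 1                 # assumed low-frame-rate sampling
TOKENS_PER_AUDIO_SECOND = 32  # assumed audio token rate

def video_tokens(hours: float) -> int:
    return int(hours * 3600 * VIDEO_FPS * TOKENS_PER_VIDEO_FRAME)

def audio_tokens(hours: float) -> int:
    return int(hours * 3600 * TOKENS_PER_AUDIO_SECOND)

print(f"3 h video  ≈ {video_tokens(3):,} tokens")   # ≈ 2,786,400
print(f"22 h audio ≈ {audio_tokens(22):,} tokens")  # ≈ 2,534,400
```

Under those assumed rates, hours of media land in the low millions of tokens, which is consistent with the durations quoted above.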
00:04:23.200 | But I will admit that when I first read this introduction, I thought they had
00:04:27.200 | used the Mamba architecture. That was billed as the successor architecture to the Transformer,
00:04:33.120 | and I did a video on it on the 1st of January. It too achieved amazing results in long context
00:04:38.960 | tasks and outperformed the Transformer. The Transformer is of course the T in GPT.
00:04:45.440 | However, by the time I finished the paper, it was pretty clear that it wasn't based on Mamba,
00:04:51.200 | and it took me a little while, and reading quite a few of the papers cited in the
00:04:55.840 | appendices, to figure it out. But I think I've got a pretty good guess as to what the architecture is based on.
00:05:01.360 | Anyway, what is the next interesting point from the paper?
00:05:04.480 | Well, Google does confirm that Gemini 1.5 Pro requires significantly less compute to train
00:05:11.760 | than 1.0 Ultra. So it's arguably better than Ultra, and we'll see the benchmarks in a moment,
00:05:17.760 | but requires significantly less compute to train. That is maybe why Google DeepMind were able to
00:05:23.840 | get out Gemini 1.5 Pro so soon after announcing Gemini 1.0. And don't forget that Gemini 1.0 Ultra
00:05:31.680 | as part of Gemini Advanced was only released to the public a week ago. That's when my review video
00:05:37.440 | came out. Google, by the way, gives you a two-month free trial of Gemini Advanced, and I now
00:05:42.400 | think part of the reason for that is to give them time to incorporate Gemini 1.5 Pro before that
00:05:49.200 | free trial ends. And here is the first bombshell graphic in the paper. This is the task of finding
00:05:55.360 | a needle in a haystack across text, video, and audio modalities. A fact or passcode might be
00:06:01.840 | buried at varying depths of sequences of varying lengths. As you can see for text, these lengths
00:06:08.880 | went up to 10 million tokens. For audio, it was up to 22 hours and video up to three hours. The
00:06:15.360 | models would then be tested on that fact. As you can see, the performance was incredible with just
00:06:20.960 | five missed facts. For reference, I went back to the original benchmark, which I've cited in
00:06:26.160 | two previous videos to compare. Here is GPT-4's performance at up to only 128,000 tokens. As you
00:06:34.080 | can see, as the sequence length gets to around 80,000 words or 100,000 tokens, the performance,
00:06:40.240 | especially midway through the sequence, degrades. At the time, Anthropic's Claude 2.1 performed even
00:06:47.520 | worse, although they did subsequently come up with a prompt engineering hack to reduce most of
00:06:52.560 | these incorrect recalls. But are you ready for the next bombshell? They see Gemini 1.5 Pro outperforming
00:07:00.000 | all competing models across all modalities, even when these models are augmented with external
00:07:06.160 | retrieval methods. In the lingo, that's RAG. In layman's terms, that's grabbing relevant text to
00:07:11.280 | assist them in answering the questions. Of course, with a long context, Gemini 1.5 Pro is simply
00:07:16.320 | ingesting the entire document. And now for one of the key tables in the paper. You might be wondering,
00:07:22.480 | and indeed Google DeepMind were wondering, if this extra performance in long context tasks
00:07:28.640 | would mean a trade-off for other types of tasks. Text, vision, and audio tasks that didn't require
00:07:34.800 | long context. The answer was no. Gemini 1.5 Pro is better on average compared to 1.0 Pro across text,
00:07:43.280 | vision, and audio. In other words, it hasn't just got better at long context, it's got better at a
00:07:48.560 | range of other tasks. It beats Gemini 1.0 Pro 100% of the time in text benchmarks and most of the
00:07:55.440 | time in vision and audio benchmarks. And wait, there's more. It also beats Gemini 1.0 Ultra
00:08:00.800 | most of the time in text benchmarks. Of course, at this point, if you only care about standard
00:08:06.960 | benchmarks and not about long context, it's more or less a draw with Gemini 1.0 Ultra. It has a
00:08:13.200 | win rate of 54.8% and it would also be pretty much a tie with GPT-4. But I think it's fair to say
00:08:21.520 | that once you bring in long context capabilities, it is now indisputably the best language model
00:08:28.720 | in the world. I should caveat that with it's the best language model that is accessible to some
00:08:34.160 | people out there. Of course, behind the scenes, Google DeepMind have Gemini 1.5 Ultra, which would
00:08:39.280 | surely be better than 1.5 Pro. And I know what you're thinking, this is Google, so maybe they
00:08:44.320 | use some kooky prompting method to get these results. Well, I've read the entire paper and
00:08:49.760 | it doesn't seem so to me. These are genuine like-for-like comparisons to 1.0 Pro and 1.0 Ultra.
00:08:56.800 | Now, I have looked in the appendices and the exact wording of the prompts may have been different to,
00:09:02.560 | for example, GPT-4. It does seem borderline impossible to get perfectly like-for-like
00:09:08.160 | comparisons. But from everything I can see, this does seem to be a genuine result.
00:09:13.680 | Now, before you get too excited, as you can see, the benchmark results for 1.5 Pro
00:09:18.720 | in non-long-context tasks are pretty impressive, but we're not completely changing the game here.
00:09:25.520 | They haven't come up with some sort of architecture that just crushes in every task. We're still
00:09:29.760 | dealing with a familiar language model for most tasks. But before I give you my architecture
00:09:35.680 | speculations, time for another demo, this time analyzing a one-frame-per-second, 44-minute movie.
00:09:42.800 | This is a demo of long context understanding, an experimental feature in our newest model,
00:09:48.800 | Gemini 1.5 Pro. We'll walk through a screen recording of example prompts using a 44-minute
00:09:54.960 | Buster Keaton film, which comes out to over 600,000 tokens. In Google AI Studio, we uploaded
00:10:01.600 | the video and asked, "Find the moment when a piece of paper is removed from the person's pocket
00:10:07.120 | and tell me some key information on it with the time code." This screen capture is sped up,
00:10:13.360 | and this timer shows exactly how long it took to process each prompt. And keep in mind that
00:10:18.240 | processing times will vary. The model gave us this response, explaining that the piece of paper
00:10:24.160 | is a pawn ticket from Goldman & Co. pawnbrokers with the date and cost. And it gave us this time
00:10:30.240 | code, 12:01. When we pulled up that time code, we found it was correct. The model had found the
00:10:37.040 | exact moment the piece of paper is removed from the person's pocket, and it extracted text accurately.
00:10:42.880 | Next, we gave it this drawing of a scene we were thinking of and asked, "What is the time code
00:10:47.680 | when this happens?" The model returned this time code, 15:34. We pulled that up and found that it
00:10:54.160 | was the correct scene. Like all generative models, responses vary and won't always be perfect. But
00:11:00.240 | notice how we didn't have to explain what was happening in the drawing. Gemini being multimodal
00:11:05.200 | from the ground up is really shining here, and I do think we have to take a step back and say,
00:11:11.040 | "Wow." At the moment, this might mean 6 to 8 minutes of a 24 or 30 frames per second video
00:11:16.640 | on YouTube. But still, successfully picking out that minor detail in that short of a time
00:11:21.920 | is truly groundbreaking. Given that Google owns YouTube, you will soon be querying,
00:11:26.960 | say, AI Explained videos. Okay, time for my guess about how they managed it in terms of
00:11:32.160 | architecture. Well, first they say simply, "It's a sparse mixture of expert transformer-based model."
00:11:38.160 | Those are fairly standard terms described in other videos of mine. But then here's the key
00:11:43.120 | sentence, "Gemini 1.5 Pro also builds on the following research and the language model research
00:11:51.040 | in this broader literature." Now, I had a look at most of these papers, particularly the recent ones,
00:11:56.640 | and one stood out. This paper by Jiang et al. came out just over a month ago. And remember,
00:12:02.000 | Google says that Gemini 1.5 Pro builds on this work. Now, building on something that came out
00:12:08.800 | that recently is pretty significant. Of course, Google have their own massive body of literature
00:12:13.840 | on sparse mixture of experts and indeed invented the transformer architecture. But this tweet
00:12:18.960 | tonight from one of the key authors of Gemini 1.5 does point to things developing more rapidly
00:12:25.280 | recently. Pranav Shyam said this: "Just a few months ago, Nikolai, Dennis and I were exploring ways to
00:12:31.120 | dramatically increase our context lengths. Little did we know that our ideas would ship in production
00:12:36.480 | so quickly." So yes, Google has work going back years on sparse mixture of expert models. And yes,
00:12:42.880 | too many people underestimated the years of innovations going on quietly at Google,
00:12:48.000 | in this case for inference. But for the purposes of time, this is the paper I'm going to focus on.
00:12:53.360 | This is the one by Jiang et al., released around a month ago. It's, of course, Mixtral of Experts from
00:12:59.520 | Mistral AI. That's that brand new French outfit with a multi-billion euro valuation. And no,
00:13:05.680 | I don't just think it's relevant because of the date and the fact it's sparse and a mixture of
00:13:10.240 | experts. Mixture of experts in a nutshell being when you have a bigger model comprised of multiple
00:13:16.000 | smaller blocks or experts. When the tokens come in, they are dynamically routed to just two,
00:13:22.160 | in most cases, relevant experts or blocks. So the entire model isn't active during inference.
00:13:28.160 | It's lightweight and effective, but no, that's not the reason why I focused on this paper.
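As a quick aside, here's what that top-2 routing idea looks like in code. This is a generic sketch of a sparse MoE layer assuming simple softmax gating; it is not Gemini's or Mixtral's actual implementation:

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Sparse mixture-of-experts for one token vector x.

    Generic sketch of top-k gating, not any production implementation.
    gate_w: (d_model, n_experts) router weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ gate_w                   # router score for each expert
    top = np.argsort(logits)[-top_k:]     # pick the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax over just the chosen k
    # Only the selected experts run, so most of the model stays inactive.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Tiny usage example with 4 random linear "experts":
rng = np.random.default_rng(0)
d, n = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(n)]
gate_w = rng.normal(size=(d, n))
out = moe_layer(rng.normal(size=d), gate_w, experts)
print(out.shape)  # (8,)
```

The point is simply that only the selected experts run for each token, which is what keeps inference lightweight. Anyway, back to why I actually focused on this paper.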
00:13:32.960 | It's because of this section, 3.2, Long Range Performance. Mixtral managed to achieve
00:13:38.880 | 100% retrieval accuracy regardless of the context length and also regardless of
00:13:44.640 | the position or depth of the passkey. Of course, Mistral only proved that up to 32,000 tokens and
00:13:51.280 | Google, I believe, have taken it much further. That's my theory. Let me know in the comments
00:13:55.840 | if you think I'm right. Google do say they also made improvements in terms of data, optimization,
00:14:01.440 | and systems, but if you're looking for more info on compute or the training dataset, good luck.
00:14:07.360 | Other than saying that the compute is significantly less than Gemini 1.0 Ultra and that Gemini 1.5 Pro
00:14:14.240 | is trained on a variety of multimodal and multilingual data, they don't really give us
00:14:19.520 | anything. Well, I tell a lie. They do say that it was trained across multiple data centers. So
00:14:24.800 | given that a data center maxes out at around 32,000 TPUs, and I know Google uses TPUs, that still
00:14:31.920 | gives us a sense of the sheer scale of Google Gemini's compute. And there is one more task that
00:14:37.920 | Google DeepMind really want us to focus on. Admittedly, it is very impressive. They gave
00:14:43.200 | Gemini 1.5 Pro a grammar book and dictionary, 250,000 tokens in total from a super obscure,
00:14:50.480 | low-resource language. The language is Kalamang, and I had never heard of it. They take pains to
00:14:55.520 | point out that none of that language was in the training dataset. And so what was the result?
00:15:01.360 | Well, not only did Gemini 1.5 Pro crush GPT-4, it also did as well as a human who had learned
00:15:09.360 | from the same materials. Now we're not talking about someone from that region of Papua New Guinea.
00:15:14.560 | The reason a human was used for comparison was to make that underlying point. Models are starting
00:15:19.840 | to approach the learning rate, at least in terms of language, of a human being. And don't forget,
00:15:24.400 | this factors in data efficiency: same amount of data, similar result. Next up is what I believe
00:15:29.920 | to be a fascinating graphic. It shows what happens as a model (in blue, Gemini 1.5 Pro) is fed more and
00:15:36.880 | more of a long document and of a code database. And the lower the curves go, the more accurate
00:15:43.600 | the model is getting at predicting the next word. What happens to December's Gemini Pro
00:15:49.120 | as you feed it more and more tokens? Well, it starts to get overwhelmed both in terms of code
00:15:54.960 | and for long documents. As the paper says, those older models, and I hesitate to call them older
00:15:59.760 | because they came out just two months ago, are unable to effectively use information from the previous
00:16:05.120 | context and are deteriorating in terms of prediction quality. But with Gemini 1.5 Pro,
00:16:10.960 | the more it's fed, the better it gets, even for a sequence length of a million for documents
00:16:17.040 | or 10 million for code. It's, quote, "remembering things from millions of lines of code ago to
00:16:23.520 | answer questions now." I think it's significant that when we get to sequence lengths of around
00:16:28.320 | five to 10 million, the curve actually dips downward. It no longer follows the power law
00:16:34.320 | trend. That would suggest to me that if we went up to a hundred million, the results would be
00:16:38.960 | even more impressive. Here's what Google have to say. The results above suggest that the model is
00:16:44.000 | able to improve its predictions by finding useful patterns, even if they occurred millions of tokens
00:16:49.120 | in the past, as in the case of code. And to summarize this, we already knew that lower loss
00:16:54.000 | could be gained from more compute. It's a very similar curve, but what's new is that the power
00:16:59.840 | law is holding between loss and context length, as shown above: roughly, loss keeps falling as a power of the context length. They say from inspecting longer code
00:17:06.160 | token predictions closer to 10 million, we see a phenomenon of the increased context occasionally
00:17:12.400 | providing outsized benefit. That could be due to repetition of code blocks. They think this deserves
00:17:17.600 | further study and may be dependent on the exact data set used. So even Google aren't fully sure
00:17:23.120 | what's causing that dip. Now we all know that OpenAI kind of trolled Google tonight by releasing
00:17:29.120 | Sora so soon after Gemini 1.5 Pro. But on this page, I feel Google were doing a little bit of
00:17:35.760 | trolling to OpenAI. First, we have this comparison again of retrieval and they say they got API
00:17:42.800 | errors after 128,000 tokens. Well, of course, they knew that because GPT-4 Turbo only supports 128,000
00:17:50.720 | tokens. I think they kind of wanted to say that after this length, we crush it and with them,
00:17:56.160 | you just get an error code. And the next bit of trolling comes here. These haystack challenges
00:18:00.880 | where they secrete a phrase like this: "The special magic {city} number is: {number}." With this,
00:18:06.880 | the model has to retrieve the correct city and number, which are randomized. But that phrase could
00:18:11.840 | have been hidden in any long text and they chose the essays of Paul Graham. Now, yes, this is almost
00:18:17.520 | certainly coincidental, but Paul Graham was the guy who fired Sam Altman at Y Combinator. Sam
00:18:22.960 | Altman disputes that it was a firing. For audio, it's the same thing. Even when they break down
00:18:28.640 | long audio into segments that Whisper can digest, which are then transcribed and fed to GPT-4 Turbo,
00:18:35.280 | the difference is stark. Before you think, though, that Gemini 1.5 Pro is perfect at retrieval,
00:18:41.440 | what happens when you feed 100 needles into a massive haystack? Well, in that case, it still
00:18:47.600 | massively outperforms GPT-4 Turbo, but can recall, as you can see, 60, 70, 80% of those needles.
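If you're curious how this kind of needle-in-a-haystack eval is typically constructed, here's a minimal sketch. The template mirrors the "special magic number" phrasing above, but the helper names and scoring are my own illustrative choices, not the paper's exact harness:

```python
import random

def build_haystack(filler_paragraphs, needles, seed=0):
    """Hide key-value 'needles' at random depths in filler text.

    Illustrative sketch only; the paper's exact template differs.
    """
    rng = random.Random(seed)
    docs = list(filler_paragraphs)
    for city, number in needles.items():
        fact = f"The special magic {city} number is: {number}."
        docs.insert(rng.randrange(len(docs) + 1), fact)
    return "\n\n".join(docs)

def score_recall(model_answer: str, needles: dict) -> float:
    """Fraction of the hidden numbers the model managed to surface."""
    found = sum(str(num) in model_answer for num in needles.values())
    return found / len(needles)

needles = {"Lisbon": 48291, "Osaka": 17364}  # hypothetical values
haystack = build_haystack(["filler text"] * 1000, needles)
prompt = haystack + "\n\nWhat are the special magic numbers?"
# send `prompt` to the model under test, then:
# print(score_recall(model_output, needles))
```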
00:18:54.880 | It is not a perfect model and no, we don't have AGI. And at this point, Google does state that
00:19:01.120 | retrieval is not the same as reasoning. They basically beg for harder benchmarks,
00:19:06.960 | ones that require integrating disparate facts, drawing inferences, or resolving inconsistencies,
00:19:12.400 | essentially reasoning. If you want to know more about how reasoning, some would say,
00:19:16.320 | is the final holy grail of large language models, do check out my Patreon AI Insiders.
00:19:22.080 | I have around a dozen videos and podcasts up as of today. In fact, it was just today that I
00:19:27.600 | released this video on my Patreon. It's a 14-minute tour of deepfakes, and features interviews,
00:19:33.920 | exclusives, and more. If you are a student or retired, do email me about a potential small
00:19:39.520 | discount. Now for the final demo in coding. We'll walk through some example prompts using
00:19:45.040 | the three.js example code, which comes out to over 800,000 tokens. We extracted the code for all
00:19:51.040 | of the three.js examples and put it together into this text file, which we brought into Google AI Studio
00:19:56.160 | over here. We asked the model to find three examples for learning about character animation.
00:20:01.280 | The model looked across hundreds of examples and picked out these three.
00:20:04.400 | Next, we asked, what controls the animations on the Littlest Tokyo demo?
00:20:09.040 | As you can see here, the model was able to find that demo,
00:20:13.760 | and it explained that the animations are embedded within the glTF model.
00:20:19.760 | Next, we wanted to see if it could customize this code for us. So we asked,
00:20:23.280 | show me some code to add a slider to control the speed of the animation.
00:20:26.800 | Use that kind of GUI the other demos have. This is what it looked like before
00:20:30.400 | on the original 3JS site. And here's the modified version. It's the same scene,
00:20:35.120 | but it added this little slider to speed up, slow down, or even stop the animation on the fly.
00:20:40.080 | Again, with audio, Gemini crushes Whisper: it has a significantly lower word error rate.
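For reference, word error rate is just the word-level edit distance between the model's transcript and a reference transcript, divided by the number of reference words. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat in the mat"))
# -> 0.1667 (one substitution out of six reference words)
```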
00:20:46.640 | And for video, it was pretty funny they had to invent their own benchmarks because the other
00:20:52.240 | ones were too easy. Or in formal language, to bridge this evaluation gap, we introduced a
00:20:57.840 | new benchmark that was testing that incredible feat we saw earlier of picking out key details
00:21:03.040 | from long videos. Now, to be clear, despite the demos looking good and beating GPT-4V,
00:21:09.040 | we're still not close to perfect. Just because Gemini 1.5 Pro can see across long context and
00:21:15.760 | watch long videos doesn't mean it's perfect at answering questions. Remember that recalling
00:21:21.040 | facts is not the same as reasoning or getting 100% on multiple choice questions. I also found this
00:21:27.200 | part of the paper quite funny where they tried to highlight the extent of trade-offs of switching
00:21:32.400 | architecture if it exists. And the problem was they couldn't find any. Across the board, 1.5 Pro
00:21:38.480 | was just better than 1.0 Pro. Whether that was math, science, coding, multilinguality,
00:21:44.800 | instruction following, image understanding, video understanding, speech recognition,
00:21:49.360 | or speech translation. Of course, it's obligatory at this point for me to ding them about the
00:21:54.880 | accuracy level of their MMLU benchmark test for Gemini 1.5 Pro. They say for math and science,
00:22:01.760 | it's 1.8% behind Gemini 1.0 Ultra. But how meaningful is that with this many errors just
00:22:08.400 | in the college chemistry section of the MMLU? Buried deep is one admission that 1.5 Pro doesn't
00:22:14.960 | seem quite as good at OCR. That's optical character recognition, in other words, recognizing text from
00:22:21.280 | an image. But Google Cloud Vision is state of the art anyway at OCR and soon enough, surely,
00:22:26.560 | they're going to integrate that. So I don't see OCR being a long-term weakness for the Gemini
00:22:31.680 | series. And it's hard to tell, but it seems like Google found some false negatives in other
00:22:36.800 | benchmarks. And so the performance there was lower bounding the model's true performance.
00:22:41.920 | And they complain, as I did in my original SmartGPT video, that maybe we need to rely more on
00:22:47.760 | human evaluations for these datasets and that maybe we should deviate from strict string matching.
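To see why strict string matching can lower-bound performance, here's a toy comparison of exact-match scoring against a lightly normalized match; the normalization steps are my own illustrative choices:

```python
import re

def exact_match(prediction: str, gold: str) -> bool:
    return prediction == gold

def normalized_match(prediction: str, gold: str) -> bool:
    """Forgive case, punctuation, and articles before comparing."""
    def norm(s: str) -> str:
        s = s.lower()
        s = re.sub(r"[^\w\s]", "", s)          # strip punctuation
        s = re.sub(r"\b(a|an|the)\b", " ", s)  # drop articles
        return " ".join(s.split())             # collapse whitespace
    return norm(prediction) == norm(gold)

pred, gold = "The answer is: 42.", "answer is 42"
print(exact_match(pred, gold))       # False -> marked wrong
print(normalized_match(pred, gold))  # True  -> actually correct
```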
00:22:53.520 | And there was this quite cute section in the impact assessment part of the paper.
00:22:58.160 | So what are the impacts going to be of Gemini 1.5 Pro? Well, they say the ability to understand
00:23:04.000 | longer content enhances the efficiency of individual and commercial users in processing
00:23:09.440 | various multimodal inputs. But also that, besides efficiency, the model enables societally
00:23:14.800 | beneficial downstream use cases. And they foresee Gemini 1.5 Pro being used to explore archival
00:23:21.520 | content that might potentially benefit journalists and historians. Suffice to say, I think this is
00:23:26.560 | somewhat underplaying the impact of Gemini 1.5 Pro. Just for one, I think it could transform
00:23:32.720 | how YouTube works. Or another obvious one. What about long-term "relationships" with chatbots?
00:23:38.480 | GPT-4's new memory feature, which seems to me like an only slightly more advanced custom
00:23:43.120 | instruction, pales in comparison to Gemini 1.5's potential. You could have discussions lasting for
00:23:49.520 | months with Gemini and it might remember a detail you said back, say, six months ago.
00:23:55.280 | That seems to me true memory and might encourage a kind of companionship for some people with
00:24:00.800 | these models. On safety, without giving too much detail, they just say it's safer than 1.0 Pro
00:24:07.200 | and 1.0 Ultra. But later they do admit two things. First, Gemini 1.5 Pro does seem a little bit more
00:24:14.960 | biased. It's probably a bit harder for the model to be anti-stereotypical when it remembers so
00:24:20.480 | much. Also, and I know this is going to annoy quite a few people, it has a higher refusal rate.
00:24:26.800 | That's both on questions that should legitimately be refused and on ones refused not legitimately. In
00:24:32.240 | other words, questions that should have been answered. Of course, by the time the model actually comes out,
00:24:36.160 | we'll have to see if this is still the case. But you just have to take a look at my Gemini Ultra
00:24:41.120 | review to see that at the moment the refusals are pretty extreme. This could honestly be a key
00:24:46.880 | sticking point for a lot of people. We're drawing to an end here, but just a quick handful of
00:24:51.840 | further observations. Remember that trick with ChatGPT where you submit the letter A with a
00:24:56.320 | space, say 500 times, and it sometimes regurgitates its training data. Well, apparently that works
00:25:02.320 | also on Gemini 1.5 Pro. Thing is, you have to manually repeat that character many more times,
00:25:08.000 | up to a million times. But with those long prompts, they do admit that it becomes easier to obtain
00:25:13.840 | its memorized data. I presume that's the kind of thing that Google DeepMind are working on before
00:25:18.400 | they release Gemini 1.5 Pro. And one more key detail from the blog post that many people might
00:25:24.480 | have missed. When Gemini 1.5 Pro is released to the public, it's going to start at just a 128,000
00:25:31.360 | token context window. I say just, but that's still pretty impressive. And it seems to me,
00:25:35.840 | based on the wording of the following sentence, that even that basic version won't be free.
00:25:41.520 | They say we plan to introduce pricing tiers that start at the standard 128,000 context window. So
00:25:48.640 | anyone hoping to get Gemini 1.5 for free seems to have misplaced hope. And then there's going to be
00:25:54.880 | tiers going up to 1 million tokens. I'm not sure how expensive that 1 million token tier will be,
00:26:00.960 | but I'll probably be on it. Notice that we probably won't be able to buy a tier going up to
00:26:05.600 | 10 million tokens. But I do want to end on a positive note for Google. There was one thing
00:26:11.440 | I missed out from my review of Google Gemini Ultra, and I want to make amends. And that is
00:26:17.280 | its creative writing ability. Gemini 1.0 Ultra is simply better at creative writing than GPT-4.
00:26:24.400 | And of course, we're not even talking about 1.5 Ultra. How so? Well, Gemini varies its
00:26:29.760 | sentence length. We have short sentences like this, "Dibbons never really listened." We also
00:26:34.640 | get far more dialogue, which is just much more realistic to real creative writing. There's a
00:26:39.520 | bit more humor in there. Whatever they did with their writing data set, they did better than open
00:26:45.120 | AI. GPT-4 stories tend to be far more wordy, a lot more tell not show. And I'm actually going to go
00:26:52.000 | one further and prove that to you. When you put GPT-4 story into two state-of-the-art AI text
00:26:58.640 | detectors, that's GPT-0 and Binoculars, which is a new tool, both of them say most likely AI
00:27:05.360 | generated. GPT-0 puts it at 97%. For Claude, we also get most likely AI generated. Although GPT-0
00:27:13.440 | erroneously says it's only 12% likely AI generated. That's Claude's story. It's way too much to get
00:27:19.920 | into this video now, but remember Binoculars is state-of-the-art compared to GPT-0. But here's
00:27:26.320 | the punchline. This is Google Gemini's story. We get from GPT-0, 0% chance of being AI generated.
00:27:34.640 | And even the state-of-the-art Binoculars gives it most likely human generated. And I think this
00:27:40.880 | proves two points. First, Gemini is definitely better at creative writing and making marketing
00:27:46.320 | copy, by the way, but that's too long to get into here. And second, don't put your faith in AI text
00:27:51.600 | detectors, especially not in the age of Gemini. If you want to learn more about detecting AI and
00:27:57.360 | deepfakes, of course, I refer you back to my deepfakes video on my Patreon, AI Insiders.
00:28:02.720 | So that is Gemini 1.5 Pro. And yes, this does seem the most significant night for AI since
00:28:09.520 | the release of GPT-4. As I said in my video on January the 1st, AI is still on an exponential
00:28:16.240 | curve. 2024 will not be a slow year in AI. And for as long as I can, I will be here to cover it all.
00:28:24.800 | Thank you so much for watching and have a wonderful day.