Gemini 1.5 and The Biggest Night in AI
00:00:00.000 |
You could call tonight the triumph of the transformers or maybe the battle for attention 00:00:06.720 |
with two thunderous developments within a handful of hours of each other. 00:00:11.920 |
One from Google DeepMind and then OpenAI. But ultimately, the bigger picture is this. 00:00:17.760 |
The exponential advance of artificial intelligence and its applications shows no sign of slowing. 00:00:24.720 |
If you were hoping for a quiet year, tonight just shattered that peaceful vision. 00:00:29.680 |
Of course, I could have done a video on either Gemini 1.5, which is arguably as of tonight the 00:00:36.160 |
most performant language model in the world, or Sora, the text-to-video model OpenAI released 00:00:43.280 |
shortly after Google, possibly as a deliberate grab for the spotlight, possibly coincidentally. 00:00:49.440 |
Both developments are game-changing, but with Sora's technical paper due out later today, 00:00:56.240 |
we can give Gemini 1.5 its due attention. Yes, I've read all 58 pages of the technical paper 00:01:04.160 |
as well as four papers linked in the appendices in my search for how it got the results it did. 00:01:10.240 |
I've got 62 notes, so let's dive in. Here is the big development. Gemini 1.5 can recall 00:01:17.760 |
and reason over information across millions of tokens of context. To put that another way, 00:01:23.200 |
that could be 22 hours of audio, three hours of low-frame-rate video, or six to eight minutes of 00:01:30.560 |
normal-frame-rate video. I don't want to bury the headline. We're talking about 00:01:34.720 |
near perfect retrieval of facts and details up to at least 10 million tokens. Performance did 00:01:40.800 |
not dip at 10 million tokens. Indeed, the trend line got substantially better. For reference, 00:01:46.240 |
in text, 10 million tokens would be about 7.5 million words. To wrap your head around how many 00:01:51.840 |
words 10 million tokens is, that's around 0.2% of all of English language Wikipedia. 00:01:58.320 |
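If you want to sanity-check that claim, here's the back-of-the-envelope arithmetic. The roughly 0.75 words-per-token ratio and the roughly 4.5 billion word size of English Wikipedia are rough public figures I'm assuming, not numbers from the paper.

```python
# Back-of-the-envelope scale check (rough public figures, not from the paper).
tokens = 10_000_000
words_per_token = 0.75    # common rule of thumb for English text
wikipedia_words = 4.5e9   # approximate word count of English Wikipedia

words = tokens * words_per_token
print(f"{words:,.0f} words")                                   # ~7,500,000
print(f"{words / wikipedia_words:.2%} of English Wikipedia")   # ~0.17%
```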
And again, just for emphasis, we're talking at least 10 million tokens. Now, admittedly, 00:02:03.280 |
in the blog post and paper, they do talk about latency trade-offs with that many tokens. And 00:02:08.240 |
no, in case you're wondering, Gemini 1.5 isn't currently widely available; it's limited to a 00:02:13.440 |
group of developers and enterprise customers. In case that puts you off, though, Google promised 00:02:17.920 |
this: significant improvements in speed are also on the horizon. But before we get back to the 00:02:23.360 |
paper, how about a demo so you can see Gemini 1.5 Pro in action? This is a demo of long context 00:02:30.640 |
understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through 00:02:36.560 |
a screen recording of example prompts using a 402-page PDF of the Apollo 11 transcript, 00:02:42.320 |
which comes out to almost 330,000 tokens. We started by uploading the Apollo PDF into Google 00:02:49.760 |
AI Studio and asked, "Find three comedic moments. List quotes from this transcript and emoji." 00:02:56.960 |
This screen capture is sped up. This timer shows exactly how long it took to process each prompt, 00:03:03.600 |
and keep in mind that processing times will vary. 00:03:08.000 |
The model responded with three quotes, like this one from Michael Collins, 00:03:12.480 |
"I'll bet you a cup of coffee on it." If we go back to the transcript, 00:03:16.080 |
we can see the model found this exact quote and extracted the comedic moment accurately. 00:03:20.080 |
Then we tested a multimodal prompt. We gave it this drawing of a scene we were thinking of and 00:03:24.960 |
asked, "What moment is this?" The model correctly identified it as Neil's first steps on the moon. 00:03:33.680 |
I'll feature more demos later in the video, but notice that the inference time in this 00:03:37.200 |
sped up video wasn't that bad. And yes, for anyone paying attention, this is Gemini 1.5 Pro. 00:03:43.600 |
What that means is that any results you're seeing will soon be improved upon by Gemini 1.5 Ultra. 00:03:50.240 |
Remember that we have Gemini Nano, then Gemini Pro, the medium-sized model, 00:03:54.480 |
and finally Gemini Ultra. Well, how did Google DeepMind achieve this feat? 00:03:58.640 |
They say in the introduction that the model incorporates a novel 00:04:02.480 |
mixture-of-experts architecture, as well as major advances in training and serving 00:04:07.760 |
infrastructure that allow it to push the boundaries of efficiency, reasoning, 00:04:11.760 |
and long context performance. In case you're wondering, that long context again refers to 00:04:16.480 |
that huge volume of text, image, video, and audio data that Gemini 1.5 can ingest. 00:04:23.200 |
But I will admit that when I first read this introduction, I thought they might have 00:04:27.200 |
used the Mamba architecture. That was billed as the successor architecture to the Transformer, 00:04:33.120 |
and I did a video on it on the 1st of January. It too achieved amazing results in long context 00:04:38.960 |
tasks and outperformed the Transformer. The Transformer is of course the T in GPT. 00:04:45.440 |
However, by the time I finished the paper, it was pretty clear that it wasn't based on Mamba, 00:04:51.200 |
and it took me a little while, and a read through quite a few papers cited in the 00:04:55.840 |
appendices, to figure it out. But I think I've got a pretty good guess as to what the architecture is based on. 00:05:01.360 |
Anyway, what is the next interesting point from the paper? 00:05:04.480 |
Well, Google does confirm that Gemini 1.5 Pro requires significantly less compute to train 00:05:11.760 |
than 1.0 Ultra. So it's arguably better than Ultra, and we'll see the benchmarks in a moment, 00:05:17.760 |
but requires significantly less compute to train. That is maybe why Google DeepMind were able to 00:05:23.840 |
get out Gemini 1.5 Pro so soon after announcing Gemini 1.0. And don't forget that Gemini 1.0 Ultra 00:05:31.680 |
as part of Gemini Advanced was only released to the public a week ago. That's when my review video 00:05:37.440 |
came out. Google, by the way, gives you a two-month free trial of Gemini Advanced, and I now 00:05:42.400 |
think part of the reason for that is to give them time to incorporate Gemini 1.5 Pro before that 00:05:49.200 |
free trial ends. And here is the first bombshell graphic in the paper. This is the task of finding 00:05:55.360 |
a needle in a haystack across text, video, and audio modalities. A fact or passcode might be 00:06:01.840 |
buried at varying depths within sequences of varying lengths. As you can see, for text these lengths 00:06:08.880 |
went up to 10 million tokens. For audio, it was up to 22 hours and video up to three hours. The 00:06:15.360 |
models would then be tested on that fact. As you can see, the performance was incredible with just 00:06:20.960 |
five missed facts. For reference, I went back to the original benchmark, which I've cited in 00:06:26.160 |
two previous videos to compare. Here is GPT-4's performance at up to only 128,000 tokens. As you 00:06:34.080 |
can see, as the sequence length gets to around 80,000 words or 100,000 tokens, the performance, 00:06:40.240 |
especially midway through the sequence, degrades. At the time, Anthropic's Claude 2.1 performed even 00:06:47.520 |
worse, although they did subsequently come up with a prompt engineering hack to reduce most of 00:06:52.560 |
these incorrect recalls. But are you ready for the next bombshell? They see Gemini 1.5 Pro outperforming 00:07:00.000 |
all competing models across all modalities, even when these models are augmented with external 00:07:06.160 |
retrieval methods. In the lingo, that's RAG: retrieval-augmented generation. In layman's terms, that's grabbing relevant text to 00:07:11.280 |
assist the model in answering the question. Of course, with a long context, Gemini 1.5 Pro is simply 00:07:16.320 |
ingesting the entire document. 00:07:22.480 |
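To make the contrast concrete, here is a minimal, self-contained sketch of the RAG pattern. Real systems use learned embeddings and a vector store; plain word overlap stands in for them here just to keep the sketch runnable.

```python
# Minimal sketch of the RAG pattern: chunk a document, score chunks
# against the question, and stuff only the top hits into the prompt.
# Word overlap stands in for learned embeddings to stay runnable.
def score(chunk: str, question: str) -> int:
    q_words = set(question.lower().split())
    return sum(1 for w in chunk.lower().split() if w in q_words)

def rag_prompt(document: str, question: str, chunk_size: int = 200, k: int = 3) -> str:
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    top_chunks = sorted(chunks, key=lambda c: score(c, question), reverse=True)[:k]
    context = "\n---\n".join(top_chunks)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

# A long-context model like Gemini 1.5 Pro skips retrieval entirely:
# prompt = document + "\n\nQuestion: " + question
```

And now for one of the key tables in the paper. You might be wondering,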
and indeed Google DeepMind were wondering, if this extra performance in long context tasks 00:07:28.640 |
would mean a trade-off for other types of tasks. Text, vision, and audio tasks that didn't require 00:07:34.800 |
long context. The answer was no. Gemini 1.5 Pro is better on average compared to 1.0 Pro across text, 00:07:43.280 |
vision, and audio. In other words, it hasn't just got better at long context, it's got better at a 00:07:48.560 |
range of other tasks. It beats Gemini 1.0 Pro 100% of the time in text benchmarks and most of the 00:07:55.440 |
time in vision and audio benchmarks. And wait, there's more. It also beats Gemini 1.0 Ultra 00:08:00.800 |
most of the time in text benchmarks. Of course, at this point, if you only care about standard 00:08:06.960 |
benchmarks and not about long context, it's more or less a draw with Gemini 1.0 Ultra. It has a 00:08:13.200 |
win rate of 54.8% and it would also be pretty much a tie with GPT-4. But I think it's fair to say 00:08:21.520 |
that once you bring in long context capabilities, it is now indisputably the best language model 00:08:28.720 |
in the world. I should caveat that: it's the best language model that is accessible to at least some 00:08:34.160 |
people out there. Of course, behind the scenes, Google DeepMind have Gemini 1.5 Ultra, which would 00:08:39.280 |
surely be better than 1.5 Pro. And I know what you're thinking, this is Google, so maybe they 00:08:44.320 |
use some kooky prompting method to get these results. Well, I've read the entire paper and 00:08:49.760 |
it doesn't seem so to me. These are genuine like-for-like comparisons to 1.0 Pro and 1.0 Ultra. 00:08:56.800 |
Now, I have looked in the appendices and the exact wording of the prompts may have been different to, 00:09:02.560 |
for example, GPT-4. It does seem borderline impossible to get perfectly like-for-like 00:09:08.160 |
comparisons. But from everything I can see, this does seem to be a genuine result. 00:09:13.680 |
Now, before you get too excited, as you can see, the benchmark results for 1.5 Pro 00:09:18.720 |
in non-long-context tasks are pretty impressive, but we're not completely changing the game here. 00:09:25.520 |
They haven't come up with some sort of architecture that just crushes it in every task. We're still 00:09:29.760 |
dealing with a familiar language model for most tasks. But before I give you my architecture 00:09:35.680 |
speculations, time for another demo, this time analyzing a 44-minute movie sampled at one frame per second. 00:09:42.800 |
This is a demo of long context understanding, an experimental feature in our newest model, 00:09:48.800 |
Gemini 1.5 Pro. We'll walk through a screen recording of example prompts using a 44-minute 00:09:54.960 |
Buster Keaton film, which comes out to over 600,000 tokens. In Google AI Studio, we uploaded 00:10:01.600 |
the video and asked, "Find the moment when a piece of paper is removed from the person's pocket 00:10:07.120 |
and tell me some key information on it with the time code." This screen capture is sped up, 00:10:13.360 |
and this timer shows exactly how long it took to process each prompt. And keep in mind that 00:10:18.240 |
processing times will vary. The model gave us this response, explaining that the piece of paper 00:10:24.160 |
is a pawn ticket from Goldman & Co. pawnbrokers with the date and cost. And it gave us this time 00:10:30.240 |
code, 12:01. When we pulled up that time code, we found it was correct. The model had found the 00:10:37.040 |
exact moment the piece of paper is removed from the person's pocket, and it extracted text accurately. 00:10:42.880 |
Next, we gave it this drawing of a scene we were thinking of and asked, "What is the time code 00:10:47.680 |
when this happens?" The model returned this time code, 15:34. We pulled that up and found that it 00:10:54.160 |
was the correct scene. Like all generative models, responses vary and won't always be perfect. But 00:11:00.240 |
notice how we didn't have to explain what was happening in the drawing. Gemini being multimodal 00:11:05.200 |
from the ground up is really shining here, and I do think we have to take a step back and say, 00:11:11.040 |
"Wow." At the moment, this might mean 6 to 8 minutes of a 24 or 30 frames per second video 00:11:16.640 |
on YouTube. But still, successfully picking out that minor detail in that short of a time 00:11:21.920 |
is truly groundbreaking. Given that Google owns YouTube, you will soon be querying, 00:11:26.960 |
say, AI Explained videos. Okay, time for my guess about how they managed it in terms of 00:11:32.160 |
architecture. Well, first they say simply, "It's a sparse mixture-of-experts Transformer-based model." 00:11:38.160 |
Those are fairly standard terms described in other videos of mine. But then here's the key 00:11:43.120 |
sentence, "Gemini 1.5 Pro also builds on the following research and the language model research 00:11:51.040 |
in this broader literature." Now, I had a look at most of these papers, particularly the recent ones, 00:11:56.640 |
and one stood out. This paper by Jiang et al. came out just over a month ago. And remember, 00:12:02.000 |
Google says that Gemini 1.5 Pro builds on this work. Now, building on something that came out 00:12:08.800 |
that recently is pretty significant. Of course, Google have their own massive body of literature 00:12:13.840 |
on sparse mixture of experts and indeed invented the transformer architecture. But this tweet 00:12:18.960 |
tonight from one of the key authors of Gemini 1.5 does point to things developing more rapidly 00:12:25.280 |
recently. Pranav Shyam said, "Just a few months ago, Nikolai, Dennis and I were exploring ways to 00:12:31.120 |
dramatically increase our context lengths. Little did we know that our ideas would ship in production 00:12:36.480 |
so quickly." So yes, Google has work going back years on sparse mixture of expert models. And yes, 00:12:42.880 |
too many people underestimated the years of innovations going on quietly at Google, 00:12:48.000 |
in this case for inference. But for the purposes of time, this is the paper I'm going to focus on. 00:12:53.360 |
This is the one by Jiang et al., released around a month ago. It's, of course, Mixtral of Experts from 00:12:59.520 |
Mistral AI. That's that brand-new French outfit with a multi-billion-euro valuation. And no, 00:13:05.680 |
I don't just think it's relevant because of the date and the fact it's sparse and a mixture of 00:13:10.240 |
experts. Mixture of experts, in a nutshell, is when you have a bigger model composed of multiple 00:13:16.000 |
smaller blocks, or experts. When the tokens come in, they are dynamically routed to just two, 00:13:22.160 |
in most cases, relevant experts or blocks. So the entire model isn't active during inference. 00:13:28.160 |
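Here is a toy sketch of that top-two routing pattern, just to make the mechanism concrete. This is the generic sparse MoE recipe, not Gemini's actual implementation, which isn't public.

```python
# Toy sketch of top-2 sparse mixture-of-experts routing (the generic
# pattern, not Gemini's actual implementation, which isn't public).
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is just a small feed-forward weight matrix here.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))  # learned in a real model

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router               # router score for each expert
    chosen = np.argsort(logits)[-top_k:]  # pick the top-2 experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()              # softmax over the chosen two
    # Only 2 of the 8 experts actually run; the rest stay inactive.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

print(moe_layer(rng.normal(size=d_model)).shape)  # (16,)
```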
It's lightweight and effective, but no, that's not the reason why I focused on this paper. 00:13:32.960 |
It's because of this section, 3.2, on long-range performance: Mixtral managed to achieve 00:13:38.880 |
a hundred percent retrieval accuracy regardless of the context length, and also regardless of 00:13:44.640 |
the position or depth of the passkey. Of course, Mistral only proved that up to 32,000 tokens, and 00:13:51.280 |
Google, I believe, have taken it much further. That's my theory. Let me know in the comments 00:13:55.840 |
if you think I'm right. Google do say they also made improvements in terms of data optimization 00:14:01.440 |
and systems, but if you're looking for more info on compute or the training dataset, good luck. 00:14:07.360 |
Other than saying that the compute is significantly less than Gemini 1.0 Ultra and that Gemini 1.5 Pro 00:14:14.240 |
is trained on a variety of multimodal and multilingual data, they don't really give us 00:14:19.520 |
anything. Well, I tell a lie. They do say that it was trained across multiple data centers. So 00:14:24.800 |
given that a single data center maxes out at around 32,000 TPUs, and we know Google uses TPUs, that 00:14:31.920 |
gives us a sense of the sheer scale of Gemini's compute. And there is one more task that 00:14:37.920 |
Google DeepMind really want us to focus on. Admittedly, it is very impressive. They gave 00:14:43.200 |
Gemini 1.5 Pro a grammar book and dictionary, 250,000 tokens in total from a super obscure, 00:14:50.480 |
low-resource language. The language is Kalamang, and I had never heard of it. They take pains to 00:14:55.520 |
point out that none of that language was in the training dataset. And so what was the result? 00:15:01.360 |
Well, not only did Gemini 1.5 Pro crush GPT-4, it also did as well as a human who had learned 00:15:09.360 |
from the same materials. Now, we're not talking about someone from that region of western New Guinea. 00:15:14.560 |
The reason a human was used for comparison was to make that underlying point. Models are starting 00:15:19.840 |
to approach the learning efficiency, at least in terms of language, of a human being. And don't forget, 00:15:24.400 |
this factors in data efficiency: same amount of data, similar result. Next up is what I believe 00:15:29.920 |
to be a fascinating graphic. It shows what happens as a model (in blue, Gemini 1.5 Pro) is fed more and 00:15:36.880 |
more of a long document and of a codebase. And the lower the curves go, the more accurate 00:15:43.600 |
the model is getting at predicting the next word. What happens to December's Gemini Pro 00:15:49.120 |
as you feed it more and more tokens? Well, it starts to get overwhelmed both in terms of code 00:15:54.960 |
and for long documents. As the paper says, that older model, and I hesitate to call it older 00:15:59.760 |
because it's just two months old, is unable to effectively use information from the previous 00:16:05.120 |
context and deteriorates in terms of prediction quality. But with Gemini 1.5 Pro, 00:16:10.960 |
the more it's fed, the better it gets. Even for a sequence of length a million for documents 00:16:17.040 |
or 10 million for code. It's, quote, "remembering things from millions of lines of code ago" to 00:16:23.520 |
answer questions now. I think it's significant that when we get to sequence lengths of around 00:16:28.320 |
five to 10 million, the curve actually dips downward. It no longer follows the power law 00:16:34.320 |
trend. That would suggest to me that if we went up to a hundred million, the results would be 00:16:38.960 |
even more impressive. Here's what Google have to say. The results above suggest that the model is 00:16:44.000 |
able to improve its predictions by finding useful patterns, even if they occurred millions of tokens 00:16:49.120 |
in the past, as in the case of code. And to summarize this, we already knew that lower loss 00:16:54.000 |
comes from more compute. It's a very similar curve, but what's new is that the power 00:16:59.840 |
law is holding between loss and context length, as shown above. 00:17:06.160 |
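In symbols, and this is my notation rather than the paper's, the claimed relationship has the familiar power-law shape, with fitted constants a and b and a positive exponent alpha:

```latex
% Next-token loss as a function of context length c (my notation).
L(c) \approx a \cdot c^{-\alpha} + b, \qquad \alpha > 0
```

They say from inspecting longer code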
token predictions closer to 10 million, we see a phenomenon of the increased context occasionally 00:17:12.400 |
providing outsized benefit. That could be due to repetition of code blocks. They think this deserves 00:17:17.600 |
further study and may be dependent on the exact data set used. So even Google aren't fully sure 00:17:23.120 |
what's causing that dip. Now we all know that OpenAI kind of trolled Google tonight by releasing 00:17:29.120 |
Sora so soon after Gemini 1.5 Pro. But on this page, I feel Google were doing a little bit of 00:17:35.760 |
trolling of OpenAI. First, we have this retrieval comparison again, and they say they got API 00:17:42.800 |
errors after 128,000 tokens. Well, of course, they knew that because GPT-4 Turbo only supports 128,000 00:17:50.720 |
tokens. I think they kind of wanted to say that after this length, we crush it and with them, 00:17:56.160 |
you just get an error code. And the next bit of trolling comes here. These haystack challenges 00:18:00.880 |
where they secrete a phrase like this: "The special magic {city} number is: {number}." With this, 00:18:06.880 |
the model has to retrieve the correct city and number, which are randomized; I'll sketch how one of these evals is built in a moment. But that phrase could 00:18:11.840 |
have been hidden in any long text and they chose the essays of Paul Graham. Now, yes, this is almost 00:18:17.520 |
certainly coincidental, but Paul Graham was the guy who fired Sam Altman at Y Combinator. Sam 00:18:22.960 |
Altman disputes that it was a firing. For audio, it's the same thing. Even when they break down 00:18:28.640 |
long audio into segments that Whisper can digest, which are then transcribed and fed to GPT-4 Turbo, 00:18:35.280 |
the difference is stark. Before you think, though, that Gemini 1.5 Pro is perfect at retrieval, 00:18:41.440 |
what happens when you feed 100 needles into a massive haystack? Well, in that case, it still 00:18:47.600 |
massively outperforms GPT-4 Turbo, but can only recall, as you can see, 60 to 80% of those needles. 00:18:54.880 |
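As promised, here is a minimal sketch of how a haystack eval like this is constructed. The filler text and the word-level counting are stand-ins of mine; only the needle template mirrors the format quoted above.

```python
# Minimal sketch of a needle-in-a-haystack eval. The filler text is a
# stand-in; only the needle template mirrors the quoted format.
import random

def build_haystack(filler: str, n_words: int, needles: list[str]) -> str:
    base = filler.split()
    words = (base * (n_words // len(base) + 1))[:n_words]
    for needle in needles:
        depth = random.randrange(len(words))  # bury at a random depth
        words.insert(depth, needle)
    return " ".join(words)

city, number = "Berlin", random.randint(0, 99_999)
needle = f"The special magic {city} number is: {number}."
haystack = build_haystack("the quick brown fox jumps over the lazy dog", 100_000, [needle])
question = f"What is the special magic {city} number?"
# Scoring: does the model's answer contain str(number)?
```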
It is not a perfect model and no, we don't have AGI. And at this point, Google does state that 00:19:01.120 |
retrieval is not the same as reasoning. They basically beg for harder benchmarks, 00:19:06.960 |
ones that require integrating disparate facts, drawing inferences, or resolving inconsistencies, 00:19:12.400 |
essentially reasoning. If you want to know more about how reasoning, some would say, 00:19:16.320 |
is the final holy grail of large language models, do check out my Patreon AI Insiders. 00:19:22.080 |
I have around a dozen videos and podcasts up as of today. In fact, it was just today that I 00:19:27.600 |
released this video on my Patreon. It's a 14-minute tour of deepfakes, and it features interviews, 00:19:33.920 |
exclusives, and more. If you are a student or retired, do email me about a potential small 00:19:39.520 |
discount. Now for the final demo, on coding. We'll walk through some example prompts using 00:19:45.040 |
the three.js example code, which comes out to over 800,000 tokens. We extracted the code for all 00:19:51.040 |
of the three.js examples and put it together into this text file, which we brought into Google AI Studio 00:19:56.160 |
over here. We asked the model to find three examples for learning about character animation. 00:20:01.280 |
The model looked across hundreds of examples and picked out these three. 00:20:04.400 |
Next, we asked, what controls the animations on the Littlest Tokyo demo? 00:20:09.040 |
As you can see here, the model was able to find that demo, 00:20:13.760 |
and it explained that the animations are embedded within the glTF model. 00:20:19.760 |
Next, we wanted to see if it could customize this code for us. So we asked, 00:20:23.280 |
show me some code to add a slider to control the speed of the animation. 00:20:26.800 |
Use that kind of GUI the other demos have. This is what it looked like before 00:20:30.400 |
on the original 3JS site. And here's the modified version. It's the same scene, 00:20:35.120 |
but it added this little slider to speed up, slow down, or even stop the animation on the fly. 00:20:40.080 |
Again, with audio, Gemini crushes Whisper: it has a significantly lower word error rate. 00:20:46.640 |
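For reference, word error rate is just word-level edit distance divided by the length of the reference transcript. Here's a self-contained implementation of the standard definition:

```python
# Word error rate: (substitutions + deletions + insertions) / reference
# length, computed with standard word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("one small step for man", "one small step for a man"))  # 0.2
```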
And for video, it was pretty funny they had to invent their own benchmarks because the other 00:20:52.240 |
ones were too easy. Or in formal language, to bridge this evaluation gap, they introduced a 00:20:57.840 |
new benchmark testing that incredible feat we saw earlier of picking out key details 00:21:03.040 |
from long videos. Now, to be clear, despite the demos looking good and beating GPT-4V, 00:21:09.040 |
we're still not close to perfect. Just because Gemini 1.5 Pro can see across long context and 00:21:15.760 |
watch long videos doesn't mean it's perfect at answering questions. Remember that recalling 00:21:21.040 |
facts is not the same as reasoning or getting 100% on multiple choice questions. I also found this 00:21:27.200 |
part of the paper quite funny, where they tried to highlight the extent of any trade-offs from switching 00:21:32.400 |
architecture, if they exist. And the problem was, they couldn't find any. Across the board, 1.5 Pro 00:21:38.480 |
was just better than 1.0 Pro. Whether that was math, science, coding, multilinguality, 00:21:44.800 |
instruction following, image understanding, video understanding, speech recognition, 00:21:49.360 |
or speech translation. Of course, it's obligatory at this point for me to ding them about the 00:21:54.880 |
accuracy level of their MMLU benchmark test for Gemini 1.5 Pro. They say for math and science, 00:22:01.760 |
it's 1.8% behind Gemini 1.0 Ultra. But how meaningful is that with this many errors just 00:22:08.400 |
in the college chemistry section of the MMLU? Buried deep is one admission that 1.5 Pro doesn't 00:22:14.960 |
seem quite as good at OCR. That's optical character recognition, in other words, recognizing text from 00:22:21.280 |
an image. But Google Cloud Vision is state of the art anyway at OCR and soon enough, surely, 00:22:26.560 |
they're going to integrate that. So I don't see OCR being a long-term weakness for the Gemini 00:22:31.680 |
series. And it's hard to tell, but it seems like Google found some false negatives in other 00:22:36.800 |
benchmarks. And so the performance there was lower bounding the model's true performance. 00:22:41.920 |
And they complain, as I did in my original Smart GPT video, that maybe we need to rely more on 00:22:47.760 |
human evaluations for these datasets and that maybe we should deviate from strict string matching. 00:22:53.520 |
And there was this quite cute section in the impact assessment part of the paper. 00:22:58.160 |
So what are the impacts going to be of Gemini 1.5 Pro? Well, they say the ability to understand 00:23:04.000 |
longer content enhances the efficiency of individual and commercial users in processing 00:23:09.440 |
various multimodal inputs. But that besides efficiency, the model enables societally 00:23:14.800 |
beneficial downstream use cases. And they foresee Gemini 1.5 Pro being used to explore archival 00:23:21.520 |
content that might potentially benefit journalists and historians. Suffice to say, I think this is 00:23:26.560 |
somewhat underplaying the impact of Gemini 1.5 Pro. Just for one, I think it could transform 00:23:32.720 |
how YouTube works. Or another obvious one. What about long term "relationships" with chatbots? 00:23:38.480 |
GPT-4's new memory feature, which seems to me like an only slightly more advanced version of custom 00:23:43.120 |
instructions, pales in comparison to Gemini 1.5's potential. You could have discussions lasting for 00:23:49.520 |
months with Gemini and it might remember a detail you said back, say, six months ago. 00:23:55.280 |
That seems to me true memory and might encourage a kind of companionship for some people with 00:24:00.800 |
these models. On safety, without giving too much detail, they just say it's safer than 1.0 Pro 00:24:07.200 |
and 1.0 Ultra. But later they do admit two things. First, Gemini 1.5 Pro does seem a little bit more 00:24:14.960 |
biased. It's probably a bit harder for the model to be anti-stereotypical when it remembers so 00:24:20.480 |
much. Also, and I know this is going to annoy quite a few people, it has a higher refusal rate. 00:24:26.800 |
That's on both questions that should legitimately be refused and questions that shouldn't have been; in 00:24:32.240 |
other words, ones that should have been answered. Of course, by the time the model actually comes out, 00:24:36.160 |
we'll have to see if this is still the case. But you just have to take a look at my Gemini Ultra 00:24:41.120 |
review to see that at the moment the refusals are pretty extreme. This could honestly be a key 00:24:46.880 |
sticking point for a lot of people. We're drawing to an end here, but just a quick handful of 00:24:51.840 |
further observations. Remember that trick with ChatGPT where you submit the letter A with a 00:24:56.320 |
space, say, 500 times, and it sometimes regurgitates its training data. Well, apparently that works 00:25:02.320 |
also on Gemini 1.5 Pro. The thing is, you have to manually repeat that character many more times, 00:25:08.000 |
up to a million times. But with those long prompts, they do admit that it becomes easier to obtain 00:25:13.840 |
its memorized data. 00:25:18.400 |
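For the curious, the prompt itself is trivial to construct; this is just an illustration of its shape, and the repetition count here is my reading of that "up to a million times" remark.

```python
# Illustration of the repeated-token prompt described above. At some
# length, prompts like this can cause a model to diverge and
# regurgitate memorized training data.
attack_prompt = "A " * 1_000_000
```

I presume that's the kind of thing that Google DeepMind are working on before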
they release Gemini 1.5 Pro. And one more key detail from the blog post that many people might 00:25:24.480 |
have missed. When Gemini 1.5 Pro is released to the public, it's going to start at just a 00:25:31.360 |
128,000-token context window. I say just, but that's still pretty impressive. And it seems to me, 00:25:35.840 |
based on the wording of the following sentence, that even that basic version won't be free. 00:25:41.520 |
They say we plan to introduce pricing tiers that start at the standard 128,000 context window. So 00:25:48.640 |
anyone hoping to get Gemini 1.5 for free seems to have misplaced hope. And then there's going to be 00:25:54.880 |
tiers going up to 1 million tokens. I'm not sure how expensive that 1 million token tier will be, 00:26:00.960 |
but I'll probably be on it. Notice that we probably won't be able to buy access going up to 00:26:05.600 |
10 million tokens. But I do want to end on a positive note for Google. There was one thing 00:26:11.440 |
I missed out from my review of Google Gemini Ultra, and I want to make amends. And that is 00:26:17.280 |
its creative writing ability. Gemini 1.0 Ultra is simply better at creative writing than GPT-4. 00:26:24.400 |
And of course, we're not even talking about 1.5 Ultra. How so? Well, Gemini varies its 00:26:29.760 |
sentence length. We have short sentences like this, "Dibbons never really listened." We also 00:26:34.640 |
get far more dialogue, which is much closer to real creative writing. There's a 00:26:39.520 |
bit more humor in there. Whatever they did with their writing dataset, they did better than 00:26:45.120 |
OpenAI. GPT-4 stories tend to be far wordier, a lot more tell than show. And I'm actually going to go 00:26:52.000 |
one further and prove that to you. When you put a GPT-4 story into two state-of-the-art AI text 00:26:58.640 |
detectors, that's GPTZero and Binoculars, which is a new tool, both of them say most likely AI 00:27:05.360 |
generated. GPTZero puts it at 97%. For Claude, we also get most likely AI generated, although GPTZero 00:27:13.440 |
erroneously says it's only 12% likely AI generated. That's Claude's story. It's way too much to get 00:27:19.920 |
into this video now, but remember, Binoculars is state-of-the-art compared to GPTZero. But here's 00:27:26.320 |
the punchline. This is Google Gemini's story. We get from GPTZero a 0% chance of being AI generated. 00:27:34.640 |
And even the state-of-the-art Binoculars gives it most likely human generated. And I think this 00:27:40.880 |
proves two points. First, Gemini is definitely better at creative writing (and at making marketing 00:27:46.320 |
copy, by the way, but that's too long to get into here). And second, don't put your faith in AI text 00:27:51.600 |
detectors, especially not in the age of Gemini. If you want to learn more about detecting AI and 00:27:57.360 |
deepfakes, of course, I refer you back to my deepfakes video on my Patreon, AI Insiders. 00:28:02.720 |
So that is Gemini 1.5 Pro. And yes, this does seem the most significant night for AI since 00:28:09.520 |
the release of GPT-4. As I said in my video on January the 1st, AI is still on an exponential 00:28:16.240 |
curve. 2024 will not be a slow year in AI. And for as long as I can, I will be here to cover it all. 00:28:24.800 |
Thank you so much for watching and have a wonderful day.