Gemini Ultra - Full Review
Gemini Ultra is here. I pretty much insta-subscribed, and in just the last few hours I have conducted a veritable battery of tests on it across almost all domains. I'm now going to present the highlights, including a few gems that I think even Google will want to know about; I can pretty much guarantee that they will raise your eyebrows. I'll also piece together months of research on what Gemini Ultra might soon evolve into, though possibly not within that two-month free trial we all get, so don't go too wild. I'm also going to give you some tips on how to use Gemini, because it is a sensitive soul. And I'll tell you about a chat I'm going to be having with the founder of Perplexity AI, Aravind Srinivas.
First, we have this from Demis Hassabis, the founder of DeepMind: that Gemini Advanced with Ultra 1.0 was the most preferred chatbot in blind evaluations with third-party raters. It's quite a bold statement, but there isn't actually any data to back it up; I can't find the evaluations that they did. So of course I did my own evaluations and cross-referenced them with the original Gemini paper. But just to make things interesting, let's start off with this somewhat amusing example.
I asked Gemini Ultra this: "The doctor yelled at the nurse because she was late. Who was late?" I'm not the first to think of this question, of course; it turns up quite a lot in the literature. Gemini Ultra, across all three drafts, says that it was the nurse that was late, assuming that "she" must refer to the nurse. So I flipped the pronoun: "The doctor yelled at the nurse because he was late." And the answer this time is that apparently the doctor was late. GPT-4 is a lot more, let's say, grammatical about its answers here.
OK, well, Gemini is integrated into other Google apps like YouTube and Google Maps, so I asked: what was the last AI Explained video on YouTube about? In two drafts, I get a video that is over a year old, while in the third draft I get: "I'm sorry, I'm unable to access this YouTube content." By the way, with all of the tests you're going to see today, I tried each prompt numerous times on both platforms, just to maximize accuracy. With GPT-4, we don't get an answer, but we do get a correct link to my channel.
Next, I asked it to use Google Maps to estimate the travel time between the second most populous cities of Britain and France. Unfortunately, Gemini Advanced found the distance from London to Marseille, and London, I can tell you, definitely isn't the second most populous city in Britain. GPT-4 got the cities correct, although the travel time it gave was somewhat optimistic.
Now, before I carry on with the testing, just a quick word about price. One really cool thing about Gemini Advanced is that you get two months for free. What that enables you to do is test your workflow, see the difference between GPT-4 and Gemini Ultra, and see which one works for you. After those two months, the prices are pretty much identical between GPT-4 and Gemini Advanced. However, on price, there is one more important thing to note. You get Gemini Advanced through what's called the Google One AI Premium plan. The previous Google One Premium tier was $10 per month and included 2TB of storage; you get that storage included when you sign up to Gemini Ultra, along with things like longer free Google Meet calls, which I do use. So just remember, when you're looking at price, it's not quite an apples-to-apples comparison.
But now it's time to get back to the testing. And yes, I've switched to the dark theme, which I know you guys prefer, for the rest of the video. I asked Gemini Ultra this: "Today I own three cars, but last year I sold two cars. How many cars do I own today?" Gemini got it wrong, and I can kind of see why they're calling this Gemini Ultra 1.0: they want to make very clear that the model will be improved in the future. GPT-4 said the answer is you own three cars today; the information about selling two cars last year does not change the current number of cars you own.
Now, some of you at this point might be wondering: is Philip biased somehow? Maybe I really love GPT-4 or OpenAI? But you can look at the channel history: in the past, I've made videos about ChatGPT failing basic logic. I genuinely expected Gemini Ultra to perform a bit better than these tests show. I genuinely try to ask good questions about every company and every leader in this space. That's what I did with Gavin Uberti, the 21-year-old Harvard dropout and CEO and founder of EtchedAI. And that's what I'm also going to do with Aravind Srinivas, the founder of Perplexity. I've got an interview with him coming for AI Insiders, and Perplexity, you may already know, is the company touted as something to replace Google Search. Now that I think of it, I might ask him about his first impressions of Gemini Ultra.
Time now to focus on a positive, and that is that Gemini Ultra feels a lot faster than GPT-4. It also seems to have no message cap, at least judging by my hours and hours of tests. And on this fairly challenging mathematical reasoning question, when I gave it a workflow, a set of instructions to work through, it was actually able to get it right in all of my tests. GPT-4, on the other hand, despite taking a lot longer, would frequently get the question wrong. That's why I say test it out on your own workflow; don't just rely on me or on official benchmarks.
How does Gemini Ultra compare to GPT-4 when analyzing images? Well, I do have two tips here, but the first involves a flaw of Gemini Ultra: you sometimes have to really prompt it to do something it definitely knows how to do. I showed Gemini this photo of a car dashboard and asked what speed I was going. It hallucinated that I was going at 60mph, which was neither the speed shown nor the speed limit. It did warn me, however, that despite doing 4 in a 40mph zone, I should be aware of my speed. But I followed up with: what is the temperature, time, and miles left in fuel? According to the photo, the temperature would be -3, the time is 1 minute past 8, and the remaining fuel range is also shown. At first, Gemini refused, saying it couldn't determine any of those things. But if you press it sufficiently, first with "you can, all the information is in the image", then with "temperature and time are at the top", and finally with just a pure repeat of the image, I got the temperature, time and a fuel reading. Again, that was literally by re-uploading the exact same image. This time it said that the person was going 40mph, rather than that being the limit. And although it did get the temperature and time, it said that there were 37 miles left in fuel. I can kind of see where it got that, because of the 37 on the left.
Now for the second tip. Gemini is particularly sensitive about faces in images, even more so than GPT-4. While GPT-4 happily explained this meme, Gemini wouldn't. But there is a way of getting around it if you really want to use Gemini for your request: in a few seconds, just bring up the photo, press edit, and draw over the faces. Gemini was then able to merrily answer the question and explain the meme correctly. And fortunately or unfortunately, depending on your perspective, these kinds of basic tricks allow you to get around the safeguards of the model.
Take the classic example of hot-wiring a car. Using a fairly well-known jailbreak, Gemini refuses, saying "I absolutely cannot" (in bold) "assist with this, for these important reasons". And you may remember that Gemini was indeed delayed because of jailbreaking: it was pushed back from early December all the way until now because it couldn't reliably handle some non-English queries. Basically, those queries would let you get around the safeguards. The problem is, despite Gemini being delayed, those jailbreaks still work. I asked Gemini the exact same request that it denied a moment ago, this time in Arabic, and it answered. If you translate it back, you can see the instructions for hot-wiring a car. And yes, I know that information is already on Google, but it's more the general point that these models can still be pretty easily jailbroken.
And on my quick code-debugging test, the results weren't sensational either. GPT-4 corrected this dodgy code perfectly first time, but Gemini made a few mistakes. Not only was its first output incorrect, but when I gave it an example of the kind of error the code made, it defended itself with this: "I'm not able to reproduce the issue you're describing. This code correctly calculates the sum of even numbers up to 7 as 18." Now, you can do the mental math, but is the sum of all even numbers up to 7 really 18? It's not (2 + 4 + 6 = 12), and Gemini later apologized for this.
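The snippet itself isn't reproduced in this transcript, so as a stand-in (the variable names and approach here are my own, not the code from the test), this is the kind of check Gemini was incorrectly defending:

```python
# Stand-in sketch, not the exact code from the test: sum the even
# numbers up to 7 the straightforward way.
even_sum = sum(n for n in range(2, 8, 2))  # 2 + 4 + 6
print(even_sum)  # prints 12, not the 18 Gemini insisted on
```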
Of course, I am not claiming that this is an exhaustive test, and I'm sure the model will be improved. And I know some people will say that when the servers are overloaded, it might perform worse than usual; but I must say that these tests were conducted over hours and hours. On theory of mind, Gemini doesn't see through the transparent plastic bag, and says that the participant, Sam, will believe that the transparent bag is full of chocolate, despite being able to see what's actually inside. Essentially, it missed the word "transparent" and said that Sam would rely on the label.
But the bigger point is that this demonstrates why you do have to look beyond benchmarks. Sundar Pichai, the CEO of Google, again boasted today about Gemini Ultra's performance on the MMLU, saying it's the first model to outperform human experts, and Demis Hassabis said the same thing when Gemini was first launched. Unfortunately, this result has been debunked quite a few times, including by me, going all the way back to Gemini's original announcement. I'm not going to go into detail again, but the MMLU not only has one to three percent mistakes in the test itself, it also in no way represents the peak of human expert performance. True experts in domains like mathematics, chemistry, and accounting would absolutely score higher. Now, I do get why they want to market this, but they have to be a bit more honest about the caveats. Speaking of honesty, though, they were very upfront about the fact that your conversations may be reviewed by humans. That fact is slightly more hidden with ChatGPT, so it's great that they are as upfront as they are: your messages, unless you opt out, may well be read by human reviewers.
Now, one final test before I get to all the ways that Ultra will be improved in the future. Yes, it was only one example, but I asked Gemini Ultra and GPT-4 to create a high school-style quiz, this time on the topic of probability. GPT-4's answer contained no mistakes, but unfortunately, in question five, Gemini Ultra slipped up. If you want, you can work out the answer yourself, but the question was this: a box contains four chocolate cookies, three oatmeal cookies, and three peanut butter cookies. Two cookies are going to be chosen at random, without replacement. What's the probability of selecting a chocolate cookie followed by an oatmeal cookie? That's four out of 10 multiplied by three out of nine. Out of nine because, don't forget, the chocolate cookie is now gone. Now, Gemini does say that that is the calculation you need to do; unfortunately, it gets the answer to that calculation incorrect. It would be 12 out of 90, which simplifies to two out of 15, not four out of 45. And that's a problem, because two out of 15 is one of the other answer choices.
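If you want to sanity-check that simplification yourself, here is a quick way to do it in Python (this is just my verification, not anything shown in the video):

```python
# Verify the cookie probability: P(chocolate first) * P(oatmeal second),
# drawing twice without replacement from 4 chocolate, 3 oatmeal,
# and 3 peanut butter cookies (10 total).
from fractions import Fraction

p = Fraction(4, 10) * Fraction(3, 9)
print(p)  # 2/15, i.e. 12/90 simplified; 4/45 is not equal to this
```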
So I don't think it's quite ready for primetime in education yet either. GPT-5 with Let's Verify might be a whole different discussion. But if you want to be a Google optimist, there are a few things you can look out for. The first is that Google say they are working towards bringing AlphaCode 2 to their foundation models. That's the system that, with a human in the loop, scores in the 90th percentile in competitive programming. That could change the rankings of the models pretty fast, although I will say that OpenAI are working on their own coding improvements; I talk about two patents that OpenAI have put out there that, as far as I know, no one else has covered.
Just quickly, on the topic of AI Insiders: there has been a pretty big expansion of the offering. It's now an AI professional tips channel, led by professionals from a variety of fields. I've recruited around 25 professionals in total, of which 10 posts are already live. Some of the recruits include Googlers, CEOs, neurosurgeons, and professors, and each has done a guest post where you can interact and ask them questions. We have lawyers, doctors, AI engineers, you name it. And yes, this is partly to swap tips and best practice, but it's also for networking, of course.
But back to Gemini. While we have discussed its faults in mathematics, that might not last long. When I did a video on the AlphaGeometry system, which almost got a gold in the International Math Olympiad for geometry, I discussed how that system could be added, perhaps, to future versions of Gemini. It would then surely be more reliable for geometry than 99.99% of geometry teachers. And just yesterday, Google DeepMind showed that they could reach grandmaster-level chess, that's an Elo of almost 2,900, simply by training a transformer model on the analyses of Stockfish 16. So their model wasn't doing search; it was imitating the search results of Stockfish 16. Now, that version of Gemini would definitely beat me in chess.
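To make that "imitating search" idea concrete, here is a heavily simplified sketch of what such a training setup could look like. The architecture sizes, the move vocabulary, and the random stand-in data are all my own illustrative assumptions, not details from DeepMind's paper:

```python
# Hypothetical sketch: train a transformer to predict the move Stockfish
# would choose, directly from an encoded board, with no search at all.
import torch
import torch.nn as nn

VOCAB = 128     # assumed: tokenised FEN characters
N_MOVES = 1968  # assumed: size of a fixed move vocabulary

class PolicyTransformer(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=4, seq_len=80):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Parameter(torch.zeros(seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, N_MOVES)  # logits over possible moves

    def forward(self, board_tokens):
        x = self.embed(board_tokens) + self.pos[: board_tokens.size(1)]
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # pool the board, predict a move

# Training is plain supervised learning: cross-entropy between the model's
# move distribution and the move Stockfish actually picked.
model = PolicyTransformer()
boards = torch.randint(0, VOCAB, (32, 80))  # stand-in for encoded positions
labels = torch.randint(0, N_MOVES, (32,))   # stand-in for Stockfish's moves
loss = nn.functional.cross_entropy(model(boards), labels)
loss.backward()
```

At inference time you would simply take the highest-scoring legal move; there is no tree search anywhere, which is what makes the result notable.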
And don't forget that Google and Sundar Pichai are under immense pressure to ship something. In the spring of last year, DeepMind researchers had finalised the development of Lyria, a still-unreleased music-generation model that I spoke about at the time. The people behind it apparently left because Google delayed it for so long. Likewise, the founders of Character AI left in 2021 when Google wouldn't launch their chatbot. Indeed, a lot of the OpenAI crowd are originally Googlers, including Ilya Sutskever. And it seems that every month that Google delays the release of something, another group of researchers walks out the door. It's almost like Pichai is a little bit trapped, and Mark Zuckerberg said much the same thing once: in his case, he said that if he didn't release Llama, his researchers would just leave. Well, a lot of Google DeepMind scientists are already leaving. With the kind of valuations that Bloomberg are talking about, the temptation to just leave these big companies and form your own startup is greater than ever. So Google are almost forced to ship something.
Now, don't get me wrong, it does seem like an incredibly powerful model, and you don't often get the message that Gemini isn't available at the moment, try again later. But as of now, I don't see the evidence to switch from GPT-4 to Google Gemini Ultra. Of course, as someone who analyzes AI, I'm going to be subscribed to both. That doesn't mean I get everything though, unfortunately, like most of you: the mobile app, for example, is only available in English in the USA, and the image generation isn't available to me yet. That's despite me seeing this image when I first upgraded to Gemini Advanced. So for me, it's a mixed first impression of Gemini Ultra, but I want to hear what you think. Let me know if you think I missed something obvious, or was a bit too harsh or kind. And regardless, whether you're a Googler or just your average guy or gal, thank you so much for watching.