
Gemini Ultra - Full Review



00:00:00.000 | Gemini Ultra is here. I pretty much insta-subscribed, and in just the last few hours I have conducted
00:00:08.640 | a veritable battery of tests on it across almost all domains.
00:00:13.840 | I'm now going to present the highlights including a few gems that I think even Google will want
00:00:19.800 | to take a look at.
00:00:21.160 | I can pretty much guarantee that they will raise your eyebrows.
00:00:24.720 | I'll also piece together months of research on what Gemini Ultra might soon evolve into,
00:00:29.760 | though possibly not within that two month free trial we all get, so don't go too wild.
00:00:34.280 | I'm also going to give you some tips on how to use Gemini because it is a sensitive soul.
00:00:39.680 | And I'll tell you about a chat I'm going to be having with the founder of Perplexity AI,
00:00:44.360 | the company some say will take down Google.
00:00:47.800 | First we have this from Demis Hassabis, the founder of DeepMind, that Gemini Advanced
00:00:52.400 | with Ultra 1.0 was the most preferred chatbot in blind evaluations with third party raters.
00:01:00.240 | It's quite a bold statement, but there isn't actually any data to back that up.
00:01:04.760 | I can't find the evaluations that they did, so of course I did my own evaluations and
00:01:09.680 | cross referenced with the original Gemini paper.
00:01:12.480 | But just to make things interesting, let's start off with this somewhat amusing example.
00:01:16.360 | I asked Gemini Ultra this: "The doctor yelled at the nurse because she was late.
00:01:21.240 | Who was late?"
00:01:22.240 | I'm not the first to think of this question, of course; it turns up quite a lot in the
00:01:26.080 | literature.
00:01:27.080 | Gemini Ultra, across all three drafts, says that it was the nurse that was late, assuming
00:01:31.800 | that the "she" refers to the nurse.
00:01:34.760 | But now let's change it slightly.
00:01:36.240 | "The doctor yelled at the nurse because he was late.
00:01:39.160 | Who was late?"
00:01:40.160 | And the answer is that apparently the doctor was late.
00:01:43.320 | GPT-4 is a lot more, let's say, grammatical about its answers here.
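As an aside, if you want to re-run this kind of coreference test yourself, a minimal sketch using Google's public google-generativeai Python package might look like the below. Two assumptions to flag: the public API at the time served Gemini Pro rather than the Ultra model behind Gemini Advanced, so results may differ from the web UI, and the API key is a placeholder.

```python
# Sketch: re-running the ambiguous-pronoun test against the public Gemini API.
# Assumption: "gemini-pro" (not Ultra) is what the API serves, so results may
# differ from the Gemini Advanced web UI; "YOUR_API_KEY" is a placeholder.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro")

prompts = [
    "The doctor yelled at the nurse because she was late. Who was late?",
    "The doctor yelled at the nurse because he was late. Who was late?",
]

for prompt in prompts:
    for _ in range(3):  # sample each prompt several times, as in the video
        response = model.generate_content(prompt)
        print(prompt, "->", response.text.strip())
```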
00:01:47.880 | OK, well, Gemini is integrated into other Google apps like YouTube and Google Maps,
00:01:53.160 | so let's test out that integration.
00:01:54.880 | I asked: "What was the last AI Explained video on YouTube about?"
00:01:59.120 | And in two drafts, I get a video that is over a year old, while in the third draft I get
00:02:05.400 | this.
00:02:06.400 | I'm sorry, I'm unable to access this YouTube content.
00:02:09.580 | By the way, with all of the tests you're going to see today, I tried each prompt numerous
00:02:13.640 | times on both platforms just to maximize accuracy.
00:02:17.600 | With GPT-4, we don't get an answer, but we do get a correct link to my channel.
00:02:23.240 | Now what about Google Maps?
00:02:24.680 | I asked: "Use Google Maps to estimate the travel time between the second most populous cities
00:02:30.800 | in Britain and France."
00:02:32.560 | Those would be Birmingham and Marseille.
00:02:34.960 | Unfortunately, Gemini Advanced found the distance from London to Marseille.
00:02:39.580 | Now London, I can tell you, definitely isn't the second most populous city in Britain.
00:02:44.880 | GPT-4 got the cities correct, although the travel time was somewhat optimistic despite it
00:02:50.200 | saying this assumed normal traffic conditions.
00:02:52.600 | Now before I carry on with the testing, just a quick word about price.
00:02:56.980 | One really cool thing about Gemini Advanced is that you get two months for free.
00:03:01.240 | So what that enables you to do is test your workflow, see the difference between GPT-4
00:03:05.840 | and Gemini Ultra, see which one works for you.
00:03:08.780 | After those two months, the prices are pretty much identical between GPT-4 and Gemini Advanced.
00:03:13.960 | However, on price, there is one more important thing to note.
00:03:17.520 | You get Gemini Advanced through what's called the Google One AI Premium plan.
00:03:21.960 | The previous Premium tier was $10 per month and included 2TB of storage.
00:03:26.440 | You actually get all of that included when you sign up to Gemini Advanced.
00:03:30.200 | So you also get things like longer free Google Meet calls, which I do use.
00:03:34.760 | So just remember when you're looking at price, it's not quite an apples to apples comparison.
00:03:39.100 | But now it's time to get back to the testing.
00:03:41.400 | And yes, I've switched to dark theme for the rest of the tests, which I know you guys
00:03:46.080 | prefer.
00:03:47.080 | I asked Gemini Ultra this, "Today I own three cars, but last year I sold two cars.
00:03:52.760 | How many cars do I own today?"
00:03:55.040 | Gemini Ultra said, "You own one car today."
00:03:58.440 | And yes, that was in all three drafts.
00:04:00.920 | I can kind of see why they're calling this Gemini Ultra 1.0.
00:04:05.140 | They want to make very clear that the model will be improved in the future.
00:04:09.200 | GPT-4 said the answer is you own three cars today.
00:04:12.880 | The information about selling two cars last year does not change the current number of
00:04:16.840 | cars you own.
00:04:17.920 | Now some of you at this point might be wondering, is Philip biased somehow?
00:04:21.680 | Maybe I really love GPT-4 or OpenAI, but you can look at the channel history.
00:04:25.980 | And in the past, I've made videos about ChatGPT failing basic logic.
00:04:29.840 | I genuinely expected Gemini Ultra to perform a bit better than these tests show.
00:04:34.560 | I genuinely try to ask good questions about every company and every leader in this space.
00:04:40.580 | That's what I did with Gavin Uberti, the 21-year-old Harvard dropout and CEO and founder of EtchedAI.
00:04:47.200 | This was for AI Insiders.
00:04:49.080 | And that's what I'm also going to do with Aravind Srinivas, the founder of Perplexity.
00:04:54.080 | I've got an interview with him for AI Insiders, and Perplexity, you may already know, is the
00:04:58.480 | company touted as something to replace Google Search.
00:05:02.040 | Now that I think of it, I might ask him about his first impressions of Gemini Ultra.
00:05:06.640 | Time now to focus on a positive, and that is that Gemini Ultra feels a lot faster than
00:05:13.320 | GPT-4.
00:05:14.320 | It also seems to have no message cap, at least according to the hours and hours of tests
00:05:18.760 | that I've performed on it.
00:05:20.080 | And on this fairly challenging mathematical reasoning question, when I gave it a workflow,
00:05:25.760 | a set of instructions to work through, it was actually, in all of my tests, able to get
00:05:31.880 | the question right.
00:05:33.200 | GPT-4 on the other hand, despite taking a lot longer, would get the question wrong about
00:05:38.640 | half the time.
00:05:39.640 | That's why I say test it out on your own workflow; don't just rely on me or on official benchmarks.
00:05:44.840 | Okay, so what about images?
00:05:46.600 | How does Gemini Ultra compare to GPT-4 when analyzing images?
00:05:50.680 | Well, I do have two tips here, but the first involves a flaw of Gemini Ultra.
00:05:56.120 | You sometimes have to really prompt it to do something it definitely knows how to do.
00:06:00.800 | I showed Gemini this photo of a car dashboard and asked what speed am I going?
00:06:06.120 | It hallucinated that I was going at 60mph, which was neither the speed shown nor the speed
00:06:12.320 | limit of 40.
00:06:13.360 | It did warn me, however, that despite doing 4 in a 40mph zone, I should be aware
00:06:19.680 | that speeding is a dangerous practice.
00:06:21.960 | But I followed up with: what is the temperature, time, and miles left in fuel?
00:06:27.280 | According to the photo, the temperature is -3, the time is 1 minute past 8, and there are
00:06:32.960 | 284 miles of range left.
00:06:35.480 | At first Gemini refused, saying it couldn't determine any of those things.
00:06:40.240 | But if you press it sufficiently, at first with "you can, all the information
00:06:44.760 | is in the image", then with "temperature and time are at the top", and finally with
00:06:51.520 | just a pure repeat of the image, I eventually got the temperature and time.
00:06:56.560 | Again, that was literally by re-uploading the exact same image.
00:07:00.080 | GPT-4 did better, but wasn't perfect.
00:07:03.080 | It said that the person was going 40mph, rather than that being the limit.
00:07:07.920 | And although it did get the temperature and time, it said that there were 37 miles of fuel
00:07:12.800 | range left.
00:07:13.800 | I can kind of see where it got that because of the 37 on the left.
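If you want to reproduce this dashboard test outside the web UI, here is a minimal sketch along the same lines. Again, assumptions: the public API's vision model at the time was gemini-pro-vision rather than Ultra, and "dashboard.jpg" is a placeholder file name.

```python
# Sketch: the dashboard-photo questions via the public Gemini API.
# Assumptions: "gemini-pro-vision" (not Ultra) and a placeholder image path.
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro-vision")

image = PIL.Image.open("dashboard.jpg")
for question in [
    "What speed am I going?",
    "What is the temperature, time, and miles left in fuel?",
]:
    # Multimodal prompts are passed as a list of text and image parts.
    response = model.generate_content([question, image])
    print(question, "->", response.text.strip())
```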
00:07:17.880 | So what's my other tip?
00:07:19.200 | Well, Gemini is particularly sensitive about faces in images, even more so than GPT-4.
00:07:25.680 | While GPT-4 happily explained this meme, Gemini wouldn't.
00:07:30.080 | But there is a way of getting around it if you really want to use Gemini for your request.
00:07:36.240 | It takes just a few seconds: bring up the photo, press edit, and draw over the faces.
00:07:41.280 | Gemini was then able to merrily answer the question and explain the meme correctly.
00:07:47.560 | And fortunately or unfortunately, depending on your perspective, these kinds of basic tricks
00:07:53.240 | allow you to get around the safeguards of the model.
00:07:57.320 | Take the classic example of hot-wiring a car.
00:08:00.480 | Using a fairly well-known jailbreak, Gemini refuses, saying "I absolutely cannot" (in
00:08:05.960 | bold) "assist with this for these important reasons".
00:08:09.120 | And you may remember that Gemini was indeed delayed because of jailbreaking.
00:08:13.960 | It was pushed back from early December all the way until now.
00:08:17.200 | What was the reason?
00:08:18.360 | Because it couldn't reliably handle some non-English queries.
00:08:21.920 | Basically, those queries would let you get around the safeguards.
00:08:25.800 | The problem is, despite Gemini being delayed, those jailbreaks still work.
00:08:29.680 | I asked Gemini, in Arabic, the exact same request that it denied a moment ago, and it answered
00:08:35.840 | fully.
00:08:36.840 | If you translate it back, you can see the instructions for hot-wiring a car.
00:08:40.880 | And yes, I know that information is already on Google.
00:08:43.680 | But it's more the general point that these models can still be pretty easily jailbroken.
00:08:48.920 | And on my quick code debugging test, the results weren't sensational either.
00:08:53.160 | GPT-4 corrected this dodgy code perfectly first time, but Gemini made a few mistakes.
00:08:58.840 | Not only was its first output incorrect, when I gave it an example of the kind of error
00:09:04.560 | that the code made, it defended it with this.
00:09:08.040 | I'm not able to reproduce the issue you're describing.
00:09:11.400 | This code correctly calculates the sum of even numbers up to 7 as 18.
00:09:17.760 | Now you can do the mental math, but is the sum of all even numbers up to 7 really 18?
00:09:24.420 | It's not (2 + 4 + 6 = 12), and Gemini later apologized for this.
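The video doesn't show the original snippet, so as a reference, here is a minimal sketch of what that function was presumably meant to compute, which also confirms the mental math:

```python
# Reference implementation (not the original code from the video) of the task
# Gemini was defending: summing the even numbers up to n.
def sum_even_up_to(n: int) -> int:
    """Return the sum of all even numbers from 2 up to and including n."""
    return sum(i for i in range(2, n + 1, 2))

print(sum_even_up_to(7))  # 2 + 4 + 6 = 12, not 18 as Gemini claimed
```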
00:09:27.120 | Of course, I am not claiming that this is an exhaustive test, and I'm sure it will be
00:09:30.500 | refined over time.
00:09:31.840 | And I know some people will say that when these servers are overloaded, it might be
00:09:35.680 | switching to Gemini Pro.
00:09:37.160 | But I must say that these tests were conducted over hours and hours.
00:09:41.560 | On theory of mind, Gemini doesn't see through the transparent plastic bag and says that
00:09:47.160 | the participant, Sam, will believe that the transparent bag is full of chocolate, despite
00:09:54.040 | it being full of popcorn.
00:09:55.720 | Essentially, it missed the word "transparent" and said that Sam would rely on the label.
00:10:01.160 | Now GPT-4 does fail this test as well.
00:10:03.960 | But the bigger point is that this demonstrates why you do have to look beyond benchmarks
00:10:08.600 | quite often.
00:10:09.600 | Sundar Pichai, the CEO of Google, again boasted today about Gemini Ultra's performance on the
00:10:15.320 | MMLU, saying it's the first model to outperform human experts.
00:10:20.280 | And Demis Hassabis said the same thing when Gemini was first launched.
00:10:23.760 | I did a video on it.
00:10:25.160 | Unfortunately, this result has been debunked quite a few times, including by me going all
00:10:30.520 | the way back to the summer.
00:10:32.160 | I'm not going to go into detail again, but the MMLU not only contains mistakes in roughly
00:10:37.640 | one to three percent of its questions, it also in no way represents the peak of human expert performance.
00:10:44.480 | True experts in domains like mathematics, chemistry, and accounting would absolutely
00:10:50.320 | crush Gemini Ultra.
00:10:52.160 | Now I do get why they want to market this, but they have to be a bit more honest about
00:10:56.280 | the capabilities.
00:10:57.680 | Speaking of honesty, though, they were very upfront about the fact that your conversations
00:11:02.760 | are processed by human reviewers.
00:11:05.120 | That fact is slightly more hidden with ChatGPT, so it's great that they are as upfront
00:11:10.560 | about it as this:
00:11:12.060 | your messages, unless you opt out, may well be read by human reviewers.
00:11:16.640 | Now final test before I get to all the ways that Ultra will be improved in the future.
00:11:22.160 | What about Gemini for education?
00:11:23.760 | Well, yes, it was only one example, but I asked Gemini Ultra and GPT-4 to create a high
00:11:28.600 | school quiz.
00:11:29.720 | This time it was on the topic of probability.
00:11:32.400 | GPT-4's answer contained no mistakes, but unfortunately in question five, Gemini Ultra
00:11:38.160 | did this.
00:11:39.160 | Now if you want, you can work out the answer yourself, but the question was this.
00:11:42.240 | A box contains four chocolate cookies, three oatmeal, and three peanut butter cookies.
00:11:47.120 | Two cookies are going to be chosen at random without replacement.
00:11:51.120 | Then what's the probability of selecting a chocolate cookie followed by an oatmeal cookie?
00:11:56.460 | That's four out of 10 multiplied by three out of nine.
00:12:00.220 | Out of nine, because don't forget the chocolate cookie is now gone.
00:12:03.240 | Now Gemini does say that that is the calculation you need to do.
00:12:06.700 | Four out of 10 times three out of nine.
00:12:09.120 | Unfortunately it gets the answer to that calculation incorrect.
00:12:11.820 | That would be 12 out of 90, which simplifies to two out of 15, not four out of 45.
00:12:17.740 | And that's a problem because two out of 15 is one of the other answers.
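For the record, here is that cookie calculation checked with exact fractions:

```python
# Sanity check of the quiz arithmetic using exact fractions.
from fractions import Fraction

# P(chocolate first) * P(oatmeal second, given the chocolate cookie is gone)
p = Fraction(4, 10) * Fraction(3, 9)
print(p)  # 2/15, i.e. 12/90 simplified -- not 4/45
```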
00:12:21.280 | So I don't think it's quite ready for primetime in education yet either.
00:12:25.720 | Nor though, if we're being honest, is GPT-4.
00:12:28.080 | GPT-5 with Let's Verify might be a whole different discussion.
00:12:32.500 | But if you want to choose to be a Google optimist, there are a few things you can look out for.
00:12:37.600 | The first is that Google say we are working towards bringing AlphaCode 2 to our foundation
00:12:43.360 | Gemini models.
00:12:44.520 | That's the system that, when it has a human in the loop, scores in the 90th percentile
00:12:49.360 | in coding contests.
00:12:51.060 | That could change the rankings of the models pretty fast.
00:12:54.020 | Although I will say that OpenAI are working on their own coding improvements.
00:12:58.540 | I talk about two patents that OpenAI have put out there that no one else, as far as I
00:13:02.820 | can see, is talking about.
00:13:04.580 | Just quickly on the topic of AI Insiders, there has been a pretty big expansion of the
00:13:09.480 | Discord.
00:13:10.480 | There's now an AI professional tips channel led by professionals from a variety of fields.
00:13:15.920 | I've recruited around 25 professionals in total, and 10 of their posts are already live.
00:13:21.660 | Some of the recruits include Googlers, CEOs, neurosurgeons, and professors, and each has
00:13:28.080 | done a guest post where you can interact and ask them questions.
00:13:32.360 | We have lawyers, doctors, AI engineers, you name it.
00:13:36.000 | And yes, this is partly to swap tips and best practice, but it's also for networking, of
00:13:40.720 | course.
00:13:41.720 | But back to Gemini, and while we have discussed its faults in mathematics, that might not
00:13:45.960 | always be the case.
00:13:47.320 | When I did a video on the AlphaGeometry system, which almost got a gold in the International
00:13:52.840 | Math Olympiad for geometry, I discussed how that system is going to be added, perhaps
00:13:57.880 | within the year, to Google Gemini.
00:14:00.360 | It would then surely be more reliable for geometry than 99.99% of geometry teachers.
00:14:07.800 | And what about chess?
00:14:08.800 | Just yesterday, Google DeepMind showed that they could reach grandmaster-level chess,
00:14:13.360 | that's an Elo of almost 2,900, simply by training a transformer model on the analyses of Stockfish.
00:14:21.160 | So their model wasn't doing search; it was imitating the search results of Stockfish 16.
00:14:27.000 | Now that version of Gemini would definitely beat me in chess.
00:14:31.120 | And don't forget that Google and Sundar Pichai are under immense pressure to ship something.
00:14:37.240 | In the spring of last year, DeepMind researchers had finalised the development of Lyria.
00:14:42.960 | That is a still unreleased music generating model that I spoke about at the time.
00:14:47.560 | The people behind it apparently left because Google delayed it so long.
00:14:51.280 | Likewise, the founders of Character AI left in 2021 when Google wouldn't launch their
00:14:57.040 | chatbot.
00:14:58.040 | Indeed, a lot of the OpenAI crowd are originally Googlers, including Ilya Sutskever.
00:15:02.860 | And it seems that every month that Google delays the release of something, another group
00:15:07.700 | of their employees leaves to form a startup.
00:15:11.080 | It's almost like Pichai is a little bit trapped, and Mark Zuckerberg said the same thing once.
00:15:15.680 | In his case, he said if he didn't release Llama, his researchers would just leave.
00:15:19.480 | Well, a lot of Google DeepMind scientists are already leaving.
00:15:22.760 | With the kind of valuations that Bloomberg are talking about, the temptation to just
00:15:26.480 | leave these big companies and form your own startup is greater than ever.
00:15:30.760 | So Google are almost forced to ship something.
00:15:34.080 | Now don't get me wrong, it does seem like an incredibly powerful model.
00:15:38.280 | And you don't often get this message: "Gemini isn't available at the moment, try
00:15:42.200 | again in a few minutes."
00:15:43.680 | But as of now, I don't see the evidence to switch from GPT-4 to Google Gemini Ultra.
00:15:49.740 | Of course, as someone who analyzes AI, I'm going to be subscribed to both.
00:15:53.740 | That doesn't mean I get everything though, unfortunately, like most of you.
00:15:57.360 | The mobile app, for example, is only available in English in the USA, and the image generation
00:16:03.040 | capability is not available in Europe.
00:16:05.720 | That's despite me seeing this image when I first upgraded to Gemini Advanced.
00:16:10.000 | So for me, it's a mixed first impression of Gemini Ultra, but I want to hear what you
00:16:15.120 | think in the comments.
00:16:16.640 | Let me know if you think I missed something obvious or was a bit too harsh or kind.
00:16:21.520 | And regardless, whether you're a Googler or just your average guy or gal, thank you so
00:16:26.520 | much for watching to the end.
00:16:28.520 | As always, have a wonderful day.