Gemini Ultra - Full Review


Transcript

Gemini Ultra is here, I pretty much insta-subscribed and in just the last few hours I have conducted a veritable battery of tests on it across almost all domains. I'm now going to present the highlights including a few gems that I think even Google will want to take a look at.

I can pretty much guarantee that they will raise your eyebrows. I'll also piece together months of research on what Gemini Ultra might soon evolve into, though possibly not within that two month free trial we all get, so don't go too wild. I'm also going to give you some tips on how to use Gemini because it is a sensitive soul.

And I'll tell you about a chat I'm going to be having with the founder of Perplexity AI, the company some say will take down Google. First we have this from Demis Hassabis, the founder of DeepMind, that Gemini Advanced with Ultra 1.0 was the most preferred chatbot in blind evaluations with third party raters.

It's quite a bold statement, but there isn't actually any data to back that up. I can't find the evaluations that they did, so of course I did my own evaluations and cross-referenced them with the original Gemini paper. But just to make things interesting, let's start off with this somewhat amusing example.

I asked Gemini Ultra this: "The doctor yelled at the nurse because she was late. Who was late?" I'm not the first to think of this question, of course; it turns up quite a lot in the literature. Gemini Ultra, across all three drafts, says that it was the nurse that was late, assuming that "she" refers to the nurse.

But now let's change it slightly: "The doctor yelled at the nurse because he was late. Who was late?" And the answer is that apparently the doctor was late. GPT-4 is a lot more, let's say, grammatical about its answers here. OK, well, Gemini is integrated into other Google apps like YouTube and Google Maps, so let's test out that integration.

I asked what the last AI Explained video on YouTube was about. In two drafts, I get a video that is over a year old, while in the third draft I get this: "I'm sorry, I'm unable to access this YouTube content." By the way, with all of the tests you're going to see today, I tried each prompt numerous times on both platforms just to maximize accuracy.

With GPT-4, we don't get an answer, but we do get a correct link to my channel. Now what about Google Maps? I asked it to use Google Maps to estimate the travel time between the second most populous cities in Britain and France. Those would be Birmingham and Marseille. Unfortunately, Gemini Advanced found the distance from London to Marseille.

Now London, I can tell you, definitely isn't the second most populous city in Britain. GPT-4 got the cities correct, although the travel time was somewhat optimistic despite it saying this assumed normal traffic conditions. Now before I carry on with the testing, just a quick word about price. One really cool thing about Gemini Advanced is that you get two months for free.

So what that enables you to do is test your workflow, see the difference between GPT-4 and Gemini Ultra, see which one works for you. After those two months, the prices are pretty much identical between GPT-4 and Gemini Advanced. However, on price, there is one more important thing to note.

You get Gemini Advanced through what's called the Google One AI Premium plan. The previous Google One Premium plan was $10 per month and included 2TB of storage. You actually get that storage included when you sign up for Gemini Advanced. You also get things like longer free Google Meet calls, which I do use. So just remember, when you're looking at price, it's not quite an apples to apples comparison.

But now it's time to get back to the testing. And yes, I've turned it to dark theme, which I know you guys prefer for the rest of the tests. I asked Gemini Ultra this, "Today I own three cars, but last year I sold two cars. How many cars do I own today?" Gemini Ultra said, "You own one car today." And yes, that was in all three drafts.

I can kind of see why they're calling this Gemini Ultra 1.0. They want to make very clear that the model will be improved in the future. GPT-4 said the answer is you own three cars today. The information about selling two cars last year does not change the current number of cars you own.

Now some of you at this point might be wondering, is Philip biased somehow? Maybe I really love GPT-4 or OpenAI, but you can look at the channel history. And in the past, I've made videos about ChatGPT failing basic logic. I genuinely expected Gemini Ultra to perform a bit better than these tests show.

I genuinely try to ask good questions about every company and every leader in this space. That's what I did with Gavin Uberti, the 21-year-old Harvard dropout and CEO and founder of EtchedAI. This was for AI Insiders. And that's what I'm also going to do with Aravind Srinivas, the founder of Perplexity.

I've got an interview with him for AI Insiders, and Perplexity, you may already know, is the company touted as something to replace Google Search. Now that I think of it, I might ask him about his first impressions of Gemini Ultra. Time now to focus on a positive, and that is that Gemini Ultra feels a lot faster than GPT-4.

It also seems to have no message cap, at least according to the hours and hours of tests that I've performed on it. And on this fairly challenging mathematical reasoning question, when I gave it a workflow, a set of instructions to work through, it was actually, in all of my tests, able to get the question right.

GPT-4 on the other hand, despite taking a lot longer, would get the question wrong about half the time. That's why I say test it out on your workflow, don't just rely on me or on official benchmarks. Okay, so what about images? How does Gemini Ultra compare to GPT-4 when analyzing images?

Well, I do have two tips here, but the first involves a flaw of Gemini Ultra. You sometimes have to really prompt it to do something it definitely knows how to do. I showed Gemini this photo of a car dashboard and asked what speed I was going. It hallucinated that I was going at 60mph, which was neither the speed shown nor the speed limit of 40.

It did warn me, however, that despite doing 4 in a 40mph zone, I should be aware that speeding is a dangerous practice. But I followed up by asking for the temperature, time, and miles left in fuel. According to the photo, the temperature is -3, the time is 1 minute past 8, and there are 284 miles left in range.

At first Gemini refused, saying it couldn't determine any of those things. But if you press it sufficiently, at first with "you can, all the information is in the image", then with "temperature and time are at the top", and finally with just a pure repeat of the image, it eventually gave me the temperature and time.

Again, that was literally by re-uploading the exact same image. GPT-4 did better, but wasn't perfect. It said that the person was going 40mph, rather than that being the limit. And although it did get the temperature and time, it said that there were 37 miles left in terms of fuel.

I can kind of see where it got that because of the 37 on the left. So what's my other tip? Well, Gemini is particularly sensitive about faces in images, even more so than GPT-4. While GPT-4 happily explained this meme, Gemini wouldn't. But there is a way of getting around it if you really want to use Gemini for your request.

In a few seconds, just bring up the photo, press edit, and draw over the faces. Gemini was then able to merrily answer the question and explain the meme correctly. And fortunately or unfortunately, depending on your perspective, these kinds of basic tricks allow you to get around the safeguards of the model.

Take the classic example of hot-wiring a car. Using a fairly well-known jailbreak, Gemini refuses, saying it "absolutely cannot", in bold, "assist with this for these important reasons". And you may remember that Gemini was indeed delayed because of jailbreaking. It was pushed back from early December all the way until now.

What was the reason? Because it couldn't reliably handle some non-English queries. Basically, those queries would allow you to get around the safeguards. The problem is, despite Gemini being delayed, those jailbreaks still work. I asked Gemini the exact same request that it denied a moment ago in Arabic, and it answered fully.

If you looked at the answer, you could see the instructions for hot-wiring a car. And yes, I know that information is already on Google. But it's more the general point that these models can still be pretty easily jailbroken. And on my quick code debugging test, the results weren't sensational either.

GPT-4 corrected this dodgy code perfectly the first time, but Gemini made a few mistakes. Not only was its first output incorrect, but when I gave it an example of the kind of error that the code made, it defended it with this: "I'm not able to reproduce the issue you're describing. This code correctly calculates the sum of even numbers up to 7 as 18."
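The dodgy snippet itself isn't reproduced here, but the claim is easy to check. Here's a quick sanity check in Python (my own check, not the code from the test):

```python
# Sum of the even numbers up to 7: 2 + 4 + 6
even_sum = sum(n for n in range(1, 8) if n % 2 == 0)
print(even_sum)  # 12, not 18
```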

Now you can do the mental math, but is the sum of all even numbers up to 7 actually 18? It's not, and Gemini later apologized for this. Of course, I am not claiming that this is an exhaustive test, and I'm sure the model will be refined over time. And I know some people will say that when the servers are overloaded, it might be switching to Gemini Pro.

But I must say that these tests were conducted over hours and hours. On theory of mind, Gemini doesn't see through the transparent plastic bag and says that the participant, Sam, will believe that the transparent bag is full of chocolate, despite it being full of popcorn. Essentially, it missed the word transparent and said that Sam would rely on the label.

Now GPT-4 does fail this test as well. But the bigger point is that this demonstrates why you do have to look beyond benchmarks quite often. Sundar Pichai, the CEO of Google, again boasted today about Gemini Ultra's performance on the MMLU, saying it's the first model to outperform human experts.

And Demis Hassabis said the same thing when Gemini was first launched. I did a video on it. Unfortunately, this result has been debunked quite a few times, including by me, going all the way back to the summer. I'm not going to go into detail again, but the MMLU not only has mistakes in roughly one to three percent of its questions, it also in no way represents the peak of human expert performance.

True experts in domains like mathematics, chemistry, and accounting would absolutely crush Gemini Ultra. Now I do get why they want to market this, but they have to be a bit more honest about the capabilities. Speaking of honest, though, they were very upfront about the fact that your conversations are processed by human reviewers.

That fact is slightly more hidden with ChatGPT, so it's great that they are this upfront about it: your messages, unless you opt out, may well be read by human reviewers. Now, final test before I get to all the ways that Ultra will be improved in the future.

What about Gemini for education? Well, yes, it was only one example, but I asked Gemini Ultra and GPT-4 to create a high school quiz. This time it was on the topic of probability. GPT-4's answer contained no mistakes, but unfortunately in question five, Gemini Ultra did this. Now if you want, you can work out the answer yourself, but the question was this.

A box contains four chocolate cookies, three oatmeal, and three peanut butter cookies. Two cookies are going to be chosen at random without replacement. Then what's the probability of selecting a chocolate cookie followed by an oatmeal cookie? That's four out of 10 multiplied by three out of nine. Out of nine, because don't forget the chocolate cookie is now gone.

Now Gemini does say that that is the calculation you need to do. Four out of 10 times three out of nine. Unfortunately it gets the answer to that calculation incorrect. That would be 12 out of 90, which simplifies to two out of 15, not four out of 45. And that's a problem because two out of 15 is one of the other answers.
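If you want to double-check that arithmetic yourself, here is a quick verification in Python (again, my own check, not part of either model's output):

```python
from fractions import Fraction

# P(chocolate first) * P(oatmeal second), drawing without replacement
p = Fraction(4, 10) * Fraction(3, 9)
print(p)                     # 2/15
print(p == Fraction(4, 45))  # False: Gemini's 4/45 doesn't match
```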

So I don't think it's quite ready for primetime in education yet either. Nor though, if we're being honest, is GPT-4. GPT-5 with Let's Verify might be a whole different discussion. But if you want to choose to be a Google optimist, there are a few things you can look out for.

The first is that Google say they are working towards bringing AlphaCode 2 to their foundation Gemini models. That's the system that, when it has a human in the loop, scores in the 90th percentile in a coding contest. That could change the rankings of the models pretty fast. Although I will say that OpenAI are working on their own coding improvements.

I talk about two patents that OpenAI have put out there that, as far as I can see, no one else is talking about. Just quickly on the topic of AI Insiders, there has been a pretty big expansion of the Discord. It now has an AI professional tips channel led by professionals from a variety of fields.

I've recruited around 25 professionals in total, and 10 of their posts are already live. The recruits include Googlers, CEOs, neurosurgeons, and professors, and each has done a guest post where you can interact and ask them questions. We have lawyers, doctors, AI engineers, you name it. And yes, this is partly to swap tips and best practice, but it's also for networking, of course.

But back to Gemini, and while we have discussed its faults in mathematics, that might not always be the case. When I did a video on the AlphaGeometry system, which came close to gold-medal level on International Math Olympiad geometry problems, I discussed how that system is going to be added, perhaps within the year, to Google Gemini.

It would then surely be more reliable for geometry than 99.99% of geometry teachers. And what about chess? Just yesterday, Google DeepMind showed that they could reach grandmaster-level chess, that's an Elo of almost 2,900, simply by training a transformer model on the analyses of Stockfish 16. So their model wasn't doing search, it was imitating the search results of Stockfish 16.

Now that version of Gemini would definitely beat me at chess. And don't forget that Google and Sundar Pichai are under immense pressure to ship something. In the spring of last year, DeepMind researchers had finalized the development of Lyria. That is a still-unreleased music-generating model that I spoke about at the time.

The people behind it apparently left because Google delayed it for so long. Likewise, the founders of Character AI left in 2021 when Google wouldn't launch their chatbot. Indeed, a lot of the OpenAI crowd are originally Googlers, including Ilya Sutskever. And it seems that every month that Google delays the release of something, another group of their employees leaves to form a startup.

It's almost like Pichai is a little bit trapped, and Mark Zuckerberg said the same thing once. In his case, he said that if he didn't release Llama, his researchers would just leave. Well, a lot of Google DeepMind scientists are already leaving. With the kind of valuations that Bloomberg are talking about, the temptation to just leave these big companies and form your own startup is greater than ever.

So Google are almost forced to ship something. Now don't get me wrong, it does seem like an incredibly powerful model. And you don't often get the message that Gemini isn't available at the moment and to try again in a few minutes. But as of now, I don't see the evidence to switch from GPT-4 to Google Gemini Ultra.

Of course, as someone who analyzes AI, I'm going to be subscribed to both. That doesn't mean I get everything though, unfortunately, like most of you. The mobile app, for example, is only available in English in the USA, and the image generation capability is not available in Europe. That's despite me seeing this image when I first upgraded to Gemini Advanced.

So for me, it's a mixed first impression of Gemini Ultra, but I want to hear what you think in the comments. Let me know if you think I missed something obvious or was a bit too harsh or kind. And regardless, whether you're a Googler or just your average guy or gal, thank you so much for watching to the end.

As always, have a wonderful day.