
Never Browse Alone? Gemini 2 Live and ChatGPT Vision


Chapters

0:00 Introduction
0:38 Live Interaction
3:43 Gemini 2.0 Flash Benchmarks
5:10 Audio and Image Output
6:38 Project Mariner (+ WebVoyager Bench)
8:49 But Progress Slowing Down?
10:43 OpenAI Announcements + Games

Transcript

First Google and then tonight OpenAI. Both companies have released fun, useful and pretty impressive tools that can see what you see, listen and even browse alongside you. Google's is free, OpenAI's is paid, but OpenAI have included parts of their tool in iPhone 16s. I'm going to give you just my highlights from the literally dozens of announcements in just the last 24 hours.

We'll take a one or two-minute demo tour and then I'll give you a hint about why these models make more mistakes than you might think. Plus of course some benchmarks, can't resist that. Now I could save the best till last but I won't. I'm just going to give you the coolest tool in my opinion straight up front.

You can access it for free at aistudio.google.com/live. We're going to have a conversation with Gemini 2.0 from Google DeepMind. I'm not going to say it's their smartest model and I'll get back to that later but it is their newest. Just for a brief moment though let's not care too much about the smartness of the model.

Look at what you can do. If you're on mobile, what you'll want to do is click "Show Gemini" and then it can gain permission to look through your camera. Now you could already send videos to Gemini, but this is live interaction. Maybe ask it about that building you're walking by or that plant you can see next to the path.
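
For anyone who would rather script that kind of turn-by-turn exchange than use the AI Studio page, here is a minimal sketch, assuming the google-genai Python SDK and its Live API; the model string, API key placeholder and text-only setup are assumptions on my part, not something shown in the video, and exact details may differ between SDK versions.

# Minimal sketch: a text-only live session with Gemini 2.0 Flash.
# Assumes the google-genai Python SDK (pip install google-genai) and a valid API key.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

async def main():
    # "gemini-2.0-flash-exp" was the experimental model name at the time; adjust as needed.
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model="gemini-2.0-flash-exp", config=config) as session:
        await session.send(input="Hi Gemini, how are you?", end_of_turn=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())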

Because we're using this for entertainment and curiosity I don't mind too much that it makes the occasional mistake. Obviously do not rely on the answers. I'm going to click share screen in just a moment but first I have to give you a bit of context. Here is another tool announced by Google yesterday called Deep Research and it's only available via Gemini Advanced.

It's a decent tool. It researches the web, collates from like 20 results and presents them to you. I asked it to tell me everything about Simple Bench, featured on AI Explained; it gave me a plan, I approved it by saying start research, and it gave me this result.

It's definitely comprehensive research, but you can't really rely on it, as exemplified by this sentence. Now though, let's investigate that interactively with Gemini. This is free, so bear that in mind. Okay, share your screen and then pick the screen you want. Hi Gemini, how are you? I am doing great, thanks for asking.

How are you today? I'm good. Can you read out the sentence highlighted in yellow? Of course, the highlighted sentence reads: Llama 3.1 outperformed other models like GPT-4 Turbo, Gemini 1.5 Pro and Claude 3.5 Sonnet on the Simple Bench. Close but good enough. Now can you look at the leaderboard and tell me, is that correct?

Yes, that's correct. The leaderboard shows that Llama 3.1 405B Instruct is in eighth place with a score of 23%, while the first place is O1 Preview with a score of 41.7%. This is honestly really fun to speak to you, and you sound like a cool dude, but did Llama 3.1 really outperform Claude 3.5 Sonnet, which I'm going to highlight?

You are correct, it seems that Llama 3.1 did not outperform Claude 3.5 Sonnet, as Claude 3.5 Sonnet is ranked second with a score of 41.4% while Llama 3.1 is ranked eighth with a score of 23%. Thank you for your honesty, Gemini. By the way, this is the channel that came up with Simple Bench.

Any feedback on the aesthetics of the thumbnails? Sure the AI Explained channel on YouTube has a consistent style with its thumbnails. They typically use a combination of text and images and they have a very clear and legible text overlay. The colors are bright and contrasting which makes the thumbnails stand out.

Thank you very much, Gemini. Of course, you will have noticed some basic mistakes there, because even though this is Gemini 2.0 Flash of the whole new Gemini 2 era, Gemini 2 Flash actually got around 20% on Simple Bench. Simple Bench is all about testing basic reasoning and seeing the question behind the question.

Other new entrants include Llama 3.3 at 19.9% and, interestingly, Gemini Experimental at 31.1%. That's not the model that powers that interaction. Why then did Google pick Gemini 2 Flash for this tool? Well, because it's much cheaper and faster. Some of you who have been following the announcements will say, but wait Philip, I thought Gemini 2 Flash got crazy results in all the benchmarks.

Well yes, it got crazy results in all the benchmarks they picked to show you. Notice also that they only compared the results to their own models. O1, for example, let alone O1 Pro, gets close to 20 points higher on this benchmark, for example, testing biology, physics and chemistry. But in fairness to Google, they aren't really pretending that Gemini 2.0 Flash is the best model around.

Perhaps they learned their lesson from one of their previous releases, where Demis Hassabis said that, I think it was, Gemini 1.5 Pro is now better than experts in a range of subjects. Scoring highly on certain tests doesn't mean, for now, that you can match the breadth of human expertise. Luckily then, Hassabis didn't make that claim this time around.

But there are other things that Gemini 2 Flash can do. It can speak in whispers and different tones and swap between languages. It can also output and, kind of, edit images, and more on why I say kind of in a moment. Actually, I was fortunate enough to get early access to that tool, and even though I can't show you any of the results, I can say it's still a work in progress.

Others though have shown that you can upload an image like this one of a tic-tac-toe or noughts and crosses board and ask: it's X's turn to play, what should the next move be? I must say, as anyone who has seen this channel before will know, I am quite the expert at tic-tac-toe, and so I would agree when Gemini 2 Flash says the move should be this one, and then it gives us the image.
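
If you're curious how that kind of image prompt looks outside the demo UI, here is a rough sketch of sending a board screenshot to Gemini 2.0 Flash through the google-genai Python SDK and asking for the next move as text; the file name, API key and model string are my own placeholders, and because the image output side was still early-access at the time, this sketch only asks for a text answer.

# Rough sketch: send a tic-tac-toe screenshot to Gemini 2.0 Flash and ask for the next move.
# Assumes the google-genai SDK and Pillow; "board.png" is a hypothetical local screenshot.
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
board = Image.open("board.png")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # experimental model name at the time; adjust as needed
    contents=[board, "It's X's turn to play. What should the next move be?"],
)
print(response.text)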

That's pretty cool. One more cool thing that Gemini 2 Flash can do before we get to a pretty big caveat. Google demoed Project Mariner where Gemini 2 actually takes over your computer. Well at least it can click things, browse the web and perform actions. Even purchase things. Not yet available mind you but I am trying to get access.

Here is a 20-second demo. You can ask it to do things like this that require multiple steps: doing research to look up the artist, then finding the painting and then shopping for the right supplies. The agent uses the web to take action and find what you're looking for.

It plans and reasons at each step, always staying in your control and completing your task. Google DeepMind are not the first company to allow a model to take over your device. But the reason I didn't talk about computer use from Anthropic, powered by Claude at the time, was because it just really wasn't very good.

I tried to get it to do countless things and it simply couldn't. It took ages and rarely could achieve the goal. But just as an aside, bear in mind this benchmark called WebVoyager, testing how well a model can navigate the web with visual inputs. Claude computer use gets 52%. This particular agent scaffold got almost 67%.

Project Mariner, as you could see, powered by Gemini 2 Flash: 90.5% with tree search. And in an interview released just three hours ago, I picked out this highlight, spoken by Oriol Vinyals. He is the technical lead on Gemini and will tell you what's coming soon. I could ask a model, a language model or a visual language model, to learn to play the game of StarCraft.

That's a very different approach to, say, create one agent that does play the game. In this other example, right, it could go online, watch videos about the game. It could, of course, download the game to start interacting with it, to learn, "Oh yeah, like I know how to, you know, I get it." Do research online, go to forums, read the forums, go play and figure out that it's weak at this thing and improve and so on.

And after literally, it could be weeks, it sends you an email that says, "I now know how to play the game. Let's play." Right? That's not a reality that's that far away. Frankly, that's quite enough buzz time to bring us all back down to earth. Progress, according to Google's own CEO, might be slowing down.

In his words, "The low-hanging fruit is gone." The current generation of LLM models are roughly, you know, a few companies have converged at the top, but I think they're all working on our next versions too. I think the progress is going to get harder when I look at '25.

The low-hanging fruit is gone. You know, the curve, the hill is steeper. You think it's slowing down though? So that's interesting because I will tell you that your competitor/peer in Sam Altman said, "There is no wall. There is no wall." You think it's maybe slowing down in terms of how quickly it scales?

I'm very confident there'll be a lot of progress in '25. I think the models are definitely going to get better at reasoning, completing a sequence of actions more reliably, you know, more agentic, if you will. That is a pretty stark contrast with what Sam Altman, the CEO of OpenAI, said, quote, "There is no wall," end quote, or with Dario Amodei, the CEO of Anthropic, who predicted transformative AI by 2026.

That's AI that can automate whole swathes of the economy. You can let me know in the comments who you think is right. In just a few moments, of course, I will get to the announcements from OpenAI over the last couple of days, but there's one model that we have right now.

Forget the future. It comes from the sponsor of today's video, AssemblyAI, and it's their Universal-2 speech-to-text model. I've shown you plenty of comparisons before on the channel, but here are another few, for example on proper nouns. Naturally, models have to get better at understanding what you said before they can get better at replying intelligently.
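
If you want to try that speech-to-text model yourself, a minimal sketch with AssemblyAI's Python SDK looks roughly like this; the file name and API key are placeholders of mine, and I'm assuming Universal-2 is the model the SDK serves by default at the time of this announcement.

# Minimal sketch: transcribe an audio file with AssemblyAI's Python SDK (pip install assemblyai).
# "interview.mp3" and the API key are placeholders.
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("interview.mp3")  # local path or a public URL

if transcript.status == aai.TranscriptStatus.error:
    print("Transcription failed:", transcript.error)
else:
    print(transcript.text)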

Links in the description. Some people criticize me for being too harsh on OpenAI, and do check out my previous video for more of what I think, but I will say they are great at smooth interfaces. You could almost say Apple-like UX. So, appropriately, ChatGPT is now inside the iPhone 16.

Yes, that means it's inside Apple Intelligence, but it's made by OpenAI. Anyway, no, getting an iPhone 16 isn't a way to circumvent paying for Plus or Pro, because even though you do get ChatGPT Vision, it can, as of yet, only analyze an image from within a video. Not like a live interaction.

That's something separate. Still cool though, because you can invoke ChatGPT even if you don't have a ChatGPT account. So I can imagine hundreds of millions of "normies" using this, even though they don't follow AI. But for true video interaction, available in advanced voice mode, you'll need to get Plus or Pro.

That's $20 or $200 a month, respectively. And though it's not free, just as with Google AI Studio, you've got to admit it's a pretty compelling tool. I think this is the best of AI, where you don't always care too much about super accurate answers, but satisfying your curiosity. As of tonight, you can even speak to Santa via voice mode and ask terribly rude questions.

Again, as with Google, you will find that the underlying intelligence of the model you're using isn't the very best. The Vision and advanced voice mode are based on GPT-4o, announced quite a while back. Obviously, that makes sense; they just don't have the GPUs to give everyone the best model at all times, interactively.

Gemini 2 Flash is for us currently heavily rate-limited, so we're struggling to get the 5 runs necessary to calculate an average, but on the runs we've done so far, it's scored around 20%. When the API does become available, I do expect O1 Pro Mode to sit comfortably at the top of this leaderboard.

I should say, at the top of the models, not above the human baseline. That baseline, O1 Pro Mode at least, will not beat. So those were my highlights from the dozens and dozens of announcements that came out over the last 24 hours: AI models that can see what you see, listen, browse alongside you, even play games alongside you.

Actually, that gets me thinking, I can imagine in like 2-3 years, kids or even adults, when they're losing a game online, being like "you're using Gemini 4". Man, actually, that could be a pretty serious problem for competitive gameplay, unless you had like LAN tournaments. Anyway, I'm getting too distracted.

Those were the announcements I found the most interesting, let me know which ones you found the most compelling, and as always, have a wonderful day.