Never Browse Alone? Gemini 2 Live and ChatGPT Vision
Chapters
0:00 Introduction
0:38 Live Interaction
3:43 Gemini 2.0 Flash Benchmarks
5:10 Audio and Image Output
6:38 Project Mariner (+ WebVoyager Bench)
8:49 But Is Progress Slowing Down?
10:43 OpenAI Announcements + Games
00:00:00.000 |
First Google and then tonight OpenAI. Both companies have released fun, useful and pretty 00:00:07.200 |
impressive tools that can see what you see, listen and even browse alongside you. Google's is free, 00:00:14.400 |
OpenAI's is paid, but OpenAI have included parts of their tool in iPhone 16s. I'm going to give you 00:00:21.600 |
just my highlights from the literally dozens of announcements in just the last 24 hours. 00:00:27.520 |
We'll take a one or two-minute demo tour and then I'll give you a hint about why these models make 00:00:33.120 |
more mistakes than you might think. Plus of course some benchmarks, can't resist that. Now I could 00:00:38.240 |
save the best till last but I won't. I'm just going to give you the coolest tool in my opinion 00:00:43.040 |
straight up front. You can access it for free at aistudio.google.com/live. We're going to have a 00:00:49.200 |
conversation with Gemini 2.0 from Google DeepMind. I'm not going to say it's their smartest model 00:00:55.840 |
and I'll get back to that later but it is their newest. Just for a brief moment though let's not 00:01:01.760 |
care too much about the smartness of the model. Look at what you can do. If you're on mobile what 00:01:06.160 |
you'll want to do is click Show Gemini and then give it permission to look through your camera. 00:01:12.640 |
Now you could already send videos to Gemini but this is live interaction. Maybe ask it about that 00:01:18.640 |
building you're walking by or that plant you can see next to the path. Because we're using this for 00:01:24.160 |
entertainment and curiosity I don't mind too much that it makes the occasional mistake. 00:01:29.280 |
Obviously do not rely on the answers. I'm going to click share screen in just a moment but first I 00:01:35.360 |
have to give you a bit of context. Here is another tool announced by Google yesterday called Deep 00:01:41.280 |
Research and it's only available via Gemini Advanced. It's a decent tool. It researches the 00:01:46.400 |
web, collates from like 20 results and presents them to you. I asked tell me everything about 00:01:51.600 |
Simple Bench featured on AI Explained and it gave me a plan and then I approved it by saying start 00:01:57.280 |
research and it gave me this result. It's definitely comprehensive research but you can't 00:02:01.760 |
really rely on it as exemplified by this sentence. Now though let's investigate that interactively 00:02:07.760 |
with Gemini. This is free so bear that in mind. Okay share your screen and then pick the screen 00:02:13.680 |
you want. Hi Gemini how are you? I am doing great thanks for asking. How are you today? I'm good. 00:02:20.000 |
Can you read out the sentence highlighted in yellow? Of course the highlighted sentence reads 00:02:24.880 |
Llama 3.1 outperformed other models like GPT-4 Turbo, Gemini 1.5 Pro and Claude 3.5 Sonnet on 00:02:32.720 |
SimpleBench. Close but good enough. Now can you look at the leaderboard and tell me, is that 00:02:38.960 |
correct? Yes that's correct. The leaderboard shows that Llama 3.1 405B Instruct is in eighth place 00:02:46.800 |
with a score of 23% while the first place is o1-preview with a score of 41.7%. This is honestly 00:02:55.200 |
really fun to speak to you and you sound like a cool dude but did Llama 3.1 really outperform 00:03:02.560 |
Claude 3.5 Sonnet, which I'm going to highlight? You are correct, it seems that Llama 3.1 did not 00:03:08.880 |
outperform Claude 3.5 Sonnet, as Claude 3.5 Sonnet is ranked second with a score of 41.4% 00:03:16.080 |
while Llama 3.1 is ranked eighth with a score of 23%. Thank you for your honesty, Gemini. By the 00:03:22.640 |
way, this is the channel that came up with SimpleBench. Any feedback on the aesthetics of the 00:03:27.360 |
thumbnails? Sure the AI Explained channel on YouTube has a consistent style with its thumbnails. 00:03:32.960 |
They typically use a combination of text and images and they have a very clear and legible 00:03:37.360 |
text overlay. The colors are bright and contrasting which makes the thumbnails stand out. Thank you 00:03:42.080 |
very much, Gemini. Of course, you will have noticed some basic mistakes there, because even though this 00:03:46.720 |
is Gemini 2.0 Flash, part of the whole new Gemini 2 era, Gemini 2 Flash actually got around 20% on 00:03:54.640 |
SimpleBench. SimpleBench is all about testing basic reasoning and seeing the question behind the 00:03:59.600 |
question. Other new entrants include Llama 3.3 at 19.9% and, interestingly, Gemini Experimental at 00:04:07.360 |
31.1%. That's not the model that powers that interaction, though. Why then did Google pick Gemini 2 00:04:14.000 |
Flash for this tool? Well because it's much cheaper and faster. Some of you who have been 00:04:19.840 |
following the announcements will say but wait Philip I thought Gemini 2 Flash got crazy results 00:04:25.440 |
in all the benchmarks. Well yes it got crazy results in all the benchmarks they picked to 00:04:30.160 |
show you. Notice also that they only compared the results to their own models. o1, for example, 00:04:35.360 |
let alone o1 Pro, gets close to 20 points higher on this benchmark testing biology, 00:04:42.080 |
physics and chemistry. But in fairness to Google, they aren't really pretending that Gemini 2.0 Flash 00:04:47.440 |
is the best model around. Perhaps they learned their lesson from one of their previous releases, 00:04:52.320 |
where Demis Hassabis said of what I think was Gemini 1.5 Pro that it was now better than experts in a range of 00:04:57.920 |
subjects. Scoring highly on certain tests doesn't mean, for now, that you can match the breadth of 00:05:03.760 |
human expertise. Luckily, then, Hassabis didn't make that claim this time around. But there are 00:05:09.520 |
other things that Gemini 2 Flash can do. It can speak in whispers and different tones. 00:05:32.160 |
It can also output and, kind of, edit images, and more on why I say "kind of" in a moment. 00:05:58.880 |
Actually I was fortunate enough to get early access to that tool and even though I can't 00:06:03.600 |
show you any of the results I can say it's still a work in progress. Others though have shown that 00:06:09.280 |
you can upload an image like this one of a tic-tac-toe or noughts and crosses board and ask 00:06:14.320 |
it's X's turn to play, what should the next move be? I must say, as anyone who has seen this channel 00:06:21.280 |
before will know, I am quite the expert at tic-tac-toe, and so I would agree when Gemini 2 Flash 00:06:28.560 |
says the move should be this one and then it gives us the image. That's pretty cool. 00:06:33.040 |
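For anyone who wants to sanity-check a suggested move like that without taking the model's word for it, a brute-force search over the 3x3 board is tiny. This is just an illustrative Python sketch with a board encoding I've assumed for the example; it is not how Gemini itself handles the image.

```python
# Illustrative only: verify a suggested tic-tac-toe move by exhaustive search.
# The board is a list of 9 characters ('X', 'O' or ' '), row-major.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def negamax(board, to_move):
    """Score the position for `to_move`: +1 forced win, 0 draw, -1 forced loss."""
    opponent = 'O' if to_move == 'X' else 'X'
    if winner(board) == opponent:   # the previous move already ended the game
        return -1
    if ' ' not in board:
        return 0
    best = -2
    for i, cell in enumerate(board):
        if cell == ' ':
            board[i] = to_move
            best = max(best, -negamax(board, opponent))
            board[i] = ' '
    return best

def best_moves(board, to_move):
    """All optimal squares (0-8, row-major) for `to_move` on the given board."""
    opponent = 'O' if to_move == 'X' else 'X'
    scored = {}
    for i, cell in enumerate(board):
        if cell == ' ':
            board[i] = to_move
            scored[i] = -negamax(board, opponent)
            board[i] = ' '
    top = max(scored.values())
    return [i for i, score in scored.items() if score == top]

# Hypothetical position, not the one in the video: X on squares 0 and 1, O on 3 and 4.
# X's only correct move is square 2, completing the top row before O completes the middle row.
print(best_moves(list("XX OO    "), 'X'))  # -> [2]
```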
One more cool thing that Gemini 2 Flash can do before we get to a pretty big caveat. 00:06:38.320 |
Google demoed Project Mariner where Gemini 2 actually takes over your computer. Well at least 00:06:44.080 |
it can click things, browse the web and perform actions. Even purchase things. Not yet available 00:06:49.520 |
mind you but I am trying to get access. Here is a 20 second demo. 00:06:54.320 |
You can ask it to do things like this that require multiple steps. Doing research to 00:06:58.560 |
look up the artist, then finding the painting and then shopping for the right supplies. 00:07:03.520 |
The agent uses the web to take action and find what you're looking for. 00:07:07.600 |
It plans and reasons at each step, always staying in your control and completing your task. 00:07:15.280 |
Google DeepMind are not the first company to allow a model to take over your device. 00:07:21.760 |
But the reason I didn't talk about computer use from Anthropic, powered by Claude at the time, 00:07:27.280 |
was because it just really wasn't very good. I tried to get it to do countless things 00:07:31.440 |
and it simply couldn't. It took ages and rarely could achieve the goal. 00:07:34.880 |
But just as an aside, bear in mind this benchmark called WebVoyager, 00:07:39.040 |
testing how well a model can navigate the web with visual inputs. 00:07:42.160 |
Claude computer use gets 52%. This particular agent scaffold got almost 67%. 00:07:48.800 |
Project Mariner, as you could see, powered by Gemini 2 Flash, gets 90.5% with tree search. 00:07:55.360 |
And in an interview released just three hours ago, I picked out this highlight, 00:08:00.240 |
spoken by Oriol Vinyals. He is the technical lead on Gemini and will tell you what's coming soon. 00:08:07.120 |
I could ask a model, a language model or a visual language model, to 00:08:10.960 |
learn to play the game of StarCraft. That's a very different approach to, say, 00:08:16.160 |
create one agent that does play the game. In this other example, right, it could go online, 00:08:22.320 |
watch videos about the game. It could, of course, download the game to start interacting with it, 00:08:27.840 |
to learn, "Oh yeah, like I know how to, you know, I get it." Do research online, go to forums, 00:08:33.600 |
read the forums, go play and figure out that it's weak at this thing and improve and so on. 00:08:39.760 |
And after literally, it could be weeks, it sends you an email that says, "I now know how to play 00:08:45.520 |
the game. Let's play." Right? That's not a reality that's that far away. 00:08:49.680 |
Frankly, that's quite enough buzz; time to bring us all back down to earth. Progress, 00:08:54.320 |
according to Google's own CEO, might be slowing down. In his words, "The low-hanging fruit is gone." 00:09:01.520 |
The current generation of LLM models are roughly, you know, a few companies have converged at the 00:09:05.760 |
top, but I think they're all working on our next versions too. I think the progress is going to get 00:09:11.600 |
harder when I look at '25. The low-hanging fruit is gone. You know, the curve, the hill is steeper. 00:09:19.840 |
You think it's slowing down though? So that's interesting because I will tell you that 00:09:23.040 |
your competitor/peer in Sam Altman said, "There is no wall. There is no wall." You think it's 00:09:30.880 |
maybe slowing down in terms of how quickly it scales? I'm very confident there'll be a lot 00:09:37.840 |
of progress in 25. I think the models are definitely going to get better at reasoning, 00:09:45.680 |
completing a sequence of actions more reliably, you know, more agentic, if you will. 00:09:51.840 |
That is a pretty stark contrast with what Sam Altman said as the CEO of OpenAI when he said, 00:09:57.760 |
quote, "There is no wall," end quote, or Dario Amadei, the CEO of Anthropic, 00:10:03.280 |
who predicted transformative AI by 2026. That's AI that can automate whole swathes of the economy. 00:10:10.480 |
You can let me know in the comments who you think is right. 00:10:13.040 |
In just a few moments, of course, I will get to the announcements from OpenAI 00:10:16.960 |
over the last couple of days, but there's one model that we have right now. Forget the future. 00:10:22.640 |
It comes from the sponsor of today's video, AssemblyAI, and it's their Universal-2 00:10:27.440 |
speech-to-text model. I've shown you plenty of comparisons before on the channel, 00:10:31.680 |
but here's another few. For example, proper nouns. 00:10:34.800 |
Naturally, models have to get better at understanding what you said 00:10:38.240 |
before they can get better at replying intelligently. Links in the description. 00:10:42.320 |
Some people criticize me for being too harsh on OpenAI and do check out my previous video 00:10:47.840 |
for more of what I think, but I will say they are great at smooth interfaces. You could almost say 00:10:54.000 |
Apple-like UX. So, appropriately, ChatGPT is now inside the iPhone 16. Yes, that means it's inside 00:11:00.720 |
Apple Intelligence, but it's made by OpenAI. Anyway, no, getting an iPhone 16 isn't a way 00:11:06.560 |
to circumvent paying for Plus or Pro, because even though you do get ChatGPT Vision, it can, 00:11:13.200 |
as of yet, only analyze an image, not a live video interaction. That's 00:11:18.320 |
something separate. Still cool though, because you can invoke ChatGPT even if you don't have 00:11:22.400 |
a ChatGPT account. So I can imagine hundreds of millions of "normies" using this, even though 00:11:28.560 |
they don't follow AI. But for true video interaction, available in advanced voice mode, 00:11:34.560 |
you'll need to get Plus or Pro. That's $20 or $200 a month, respectively. And though it's not free, 00:11:41.600 |
just as with Google AI Studio, you've got to admit it's a pretty compelling tool. 00:11:46.480 |
I think this is the best of AI, where you don't always care too much about super accurate answers, 00:12:12.560 |
but about satisfying your curiosity. As of tonight, you can even speak to Santa via voice mode and 00:12:18.480 |
ask terribly rude questions. Again, as with Google, you will find that the underlying 00:12:23.120 |
intelligence of the model you're using isn't the very best. The Vision and advanced voice 00:12:28.160 |
mode is based on GPT-4o, announced quite a while back. Obviously, that makes sense: 00:12:32.960 |
they just don't have the GPUs to give everyone the best model at all times, interactively. 00:12:38.560 |
Gemini 2 Flash is for us currently heavily rate-limited, so we're struggling to get the 00:12:43.840 |
5 runs necessary to calculate an average, but on the runs we've done so far, it's scored around 20%. 00:12:49.360 |
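As an aside on methodology: the averaging itself is trivial, the pain is just getting runs through the rate limits. Below is a rough, hypothetical Python sketch of the kind of retry-with-backoff loop you end up writing; `call_model` is a stand-in for whatever client you actually use, and nothing here is our real harness or real numbers.

```python
import random
import time

def call_with_backoff(call_model, prompt, max_retries=6, base_delay=2.0):
    """Call a rate-limited model, retrying with exponential backoff plus jitter.

    `call_model` is a placeholder for whatever SDK call you are using; in
    practice you would catch that client's specific rate-limit exception
    rather than a bare Exception.
    """
    for attempt in range(max_retries):
        try:
            return call_model(prompt)
        except Exception as err:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited ({err}); retrying in {delay:.1f}s")
            time.sleep(delay)

def average_score(per_run_scores):
    """Mean accuracy across runs, e.g. the 5 runs behind a leaderboard entry."""
    return sum(per_run_scores) / len(per_run_scores)
```

Nothing fancy, but it is why a heavily rate-limited preview makes a 5-run average slow to collect.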
When the API does become available, I do expect o1 Pro mode to sit comfortably at the top of this 00:12:55.840 |
leaderboard. I should say, at the top of the models, not above the human baseline. 00:13:00.160 |
That baseline, o1 Pro mode at least, will not beat. 00:13:03.600 |
So those were my highlights from the dozens and dozens of announcements that came out over the last 24 hours. 00:13:09.200 |
AI models that can see what you see, listen, browse alongside you, even play games alongside 00:13:15.360 |
you. Actually, that gets me thinking, I can imagine in like 2-3 years, kids or even adults, 00:13:20.800 |
when they're losing a game online, being like "you're using Gemini 4". Man, actually, 00:13:25.840 |
that could be a pretty serious problem for competitive gameplay, unless you had like LAN 00:13:29.840 |
tournaments. Anyway, I'm getting too distracted. Those were the announcements I found the most 00:13:33.440 |
interesting; let me know which ones you found the most compelling and, as always, have a wonderful day.