
Never Browse Alone? Gemini 2 Live and ChatGPT Vision


Chapters

0:00 Introduction
0:38 Live Interaction
3:43 Gemini 2.0 Flash Benchmarks
5:10 Audio and Image Output
6:38 Project Mariner (+ WebVoyager Bench)
8:49 But Progress Slowing Down?
10:43 OpenAI Announcements + Games

Whisper Transcript

00:00:00.000 | First Google and then tonight OpenAI. Both companies have released fun, useful and pretty
00:00:07.200 | impressive tools that can see what you see, listen and even browse alongside you. Google's is free,
00:00:14.400 | OpenAI's is paid, but OpenAI have included parts of their tool in iPhone 16s. I'm going to give you
00:00:21.600 | just my highlights from the literally dozens of announcements in just the last 24 hours.
00:00:27.520 | We'll take a one or two-minute demo tour and then I'll give you a hint about why these models make
00:00:33.120 | more mistakes than you might think. Plus of course some benchmarks, can't resist that. Now I could
00:00:38.240 | save the best till last but I won't. I'm just going to give you the coolest tool in my opinion
00:00:43.040 | straight up front. You can access it for free at aistudio.google.com/live. We're going to have a
00:00:49.200 | conversation with Gemini 2.0 from Google DeepMind. I'm not going to say it's their smartest model
00:00:55.840 | and I'll get back to that later but it is their newest. Just for a brief moment though let's not
00:01:01.760 | care too much about the smartness of the model. Look at what you can do. If you're on mobile what
00:01:06.160 | you'll want to do is click show Gemini and then it can gain permission to look out of your camera.
00:01:12.640 | Now you could already send videos to Gemini but this is live interaction. Maybe ask it about that
00:01:18.640 | building you're walking by or that plant you can see next to the path. Because we're using this for
00:01:24.160 | entertainment and curiosity I don't mind too much that it makes the occasional mistake.
00:01:29.280 | Obviously do not rely on the answers. I'm going to click share screen in just a moment but first I
00:01:35.360 | have to give you a bit of context. Here is another tool announced by Google yesterday called Deep
00:01:41.280 | Research and it's only available via Gemini Advanced. It's a decent tool. It researches the
00:01:46.400 | web, collates from like 20 results and presents them to you. I asked, "Tell me everything about
00:01:51.600 | Simple Bench featured on AI Explained," and it gave me a plan. I approved it by saying "start
00:01:57.280 | research" and it gave me this result. It's definitely comprehensive research, but you can't
00:02:01.760 | really rely on it as exemplified by this sentence. Now though let's investigate that interactively
00:02:07.760 | with Gemini. This is free so bear that in mind. Okay share your screen and then pick the screen
00:02:13.680 | you want. Hi Gemini how are you? I am doing great thanks for asking. How are you today? I'm good.
00:02:20.000 | Can you read out the sentence highlighted in yellow? Of course the highlighted sentence reads
00:02:24.880 | Llama 3.1 outperformed other models like GPT-4 Turbo, Gemini 1.5 Pro and Claude 3.5 Sonnet on
00:02:32.720 | the Simple Bench. Close but good enough. Now can you look at the leaderboard and tell me is that
00:02:38.960 | correct? Yes that's correct. The leaderboard shows that Llama 3.1 405B Instruct is in eighth place
00:02:46.800 | with a score of 23% while the first place is o1-preview with a score of 41.7%. This is honestly
00:02:55.200 | really fun to speak to you and you sound like a cool dude but did Llama 3.1 really outperform
00:03:02.560 | Claude 3.5 Sonnet which I'm going to highlight? You are correct it seems that Llama 3.1 did not
00:03:08.880 | outperform Claude 3.5 Sonnet as Claude 3.5 Sonnet is ranked second with a score of 41.4%
00:03:16.080 | while Llama 3.1 is ranked eighth with a score of 23%. Thank you for your honesty Gemini. By the
00:03:22.640 | way this is the channel that came up with Simple Bench. Any feedback on the aesthetics of the
00:03:27.360 | thumbnails? Sure the AI Explained channel on YouTube has a consistent style with its thumbnails.
00:03:32.960 | They typically use a combination of text and images and they have a very clear and legible
00:03:37.360 | text overlay. The colors are bright and contrasting which makes the thumbnails stand out. Thank you
00:03:42.080 | very much Gemini. Of course you will have noticed some basic mistakes there because even though this
00:03:46.720 | is Gemini 2.0 Flash, the whole new Gemini 2 era, actually Gemini 2 Flash got around 20% on Simple
00:03:54.640 | Bench. Simple Bench is all about testing basic reasoning and seeing the question behind the
00:03:59.600 | question. Other new entrants include Llama 3.3 at 19.9% and Gemini Experimental, interestingly, at
00:04:07.360 | 31.1%. That's not the model that powers that interaction. Why then did Google pick Gemini 2
00:04:14.000 | Flash for this tool? Well because it's much cheaper and faster. Some of you who have been
00:04:19.840 | following the announcements will say but wait Philip I thought Gemini 2 Flash got crazy results
00:04:25.440 | in all the benchmarks. Well yes it got crazy results in all the benchmarks they picked to
00:04:30.160 | show you. Notice also they only compared the results to their own models. o1, for example,
00:04:35.360 | let alone o1 Pro, gets close to 20 points higher on this benchmark, which tests biology,
00:04:42.080 | physics, chemistry. But in fairness to Google they aren't really pretending like Gemini 2.0 Flash
00:04:47.440 | is the best model around. Perhaps they learned their lesson from one of their previous releases
00:04:52.320 | where Demis Hassabis said, of I think it was Gemini 1.5 Pro, that it's now better than experts in a range of
00:04:57.920 | subjects. Scoring highly on certain tests doesn't mean for now that you can match the breadth of
00:05:03.760 | human expertise. Luckily then Hassabis didn't make that claim this time around. But there are
00:05:09.520 | other things that Gemini 2 Flash can do. It can speak in whispers and different tones
00:05:14.480 | and swap between languages.
00:05:32.160 | It can also output and kind of edit images and more on why I say kind of in a moment.
00:05:58.880 | Actually I was fortunate enough to get early access to that tool and even though I can't
00:06:03.600 | show you any of the results I can say it's still a work in progress. Others though have shown that
00:06:09.280 | you can upload an image like this one of a tic-tac-toe or noughts and crosses board and ask
00:06:14.320 | it, it's X's turn to play, what should the next move be? I must say, as anyone who has seen this channel
00:06:21.280 | before will know I am quite the expert at tic-tac-toe and so I would agree when Gemini 2 Flash
00:06:28.560 | says the move should be this one and then it gives us the image. That's pretty cool.
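As an aside not spoken in the video: the "best next move" being asked of the model here is also exactly solvable classically, since tic-tac-toe's game tree is tiny. A minimal minimax sketch in Python, using a hypothetical 9-cell board encoding of my own (nothing here comes from Google's tooling):

```python
# Minimal minimax for tic-tac-toe: given a board, return the best move.
# Board encoding (hypothetical, for illustration): a list of 9 cells,
# each "X", "O", or None, indexed row by row from the top left.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return "X" or "O" if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move): +1 means X wins with perfect play, -1 O wins, 0 draw."""
    w = winner(board)
    if w:
        return (1 if w == "X" else -1), None
    moves = [i for i, cell in enumerate(board) if cell is None]
    if not moves:
        return 0, None  # draw
    best = None
    for m in moves:
        board[m] = player                       # try the move
        score, _ = minimax(board, "O" if player == "X" else "X")
        board[m] = None                         # undo it
        if (best is None
                or (player == "X" and score > best[0])
                or (player == "O" and score < best[0])):
            best = (score, m)
    return best

# X has two in a row along the top; the winning square is index 2.
board = ["X", "X", None,
         "O", "O", None,
         None, None, None]
score, move = minimax(board, "X")  # score == 1 (X wins), move == 2
```

The exhaustive search is affordable because a 3x3 board has at most 9! move orderings; for anything larger you would prune or cache positions.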
00:06:33.040 | One more cool thing that Gemini 2 Flash can do before we get to a pretty big caveat.
00:06:38.320 | Google demoed Project Mariner where Gemini 2 actually takes over your computer. Well at least
00:06:44.080 | it can click things, browse the web and perform actions. Even purchase things. Not yet available
00:06:49.520 | mind you but I am trying to get access. Here is a 20 second demo.
00:06:54.320 | You can ask it to do things like this that require multiple steps. Doing research to
00:06:58.560 | look up the artist, then finding the painting and then shopping for the right supplies.
00:07:03.520 | The agent uses the web to take action and find what you're looking for.
00:07:07.600 | It plans and reasons at each step, always staying in your control and completing your task.
00:07:15.280 | Google DeepMind are not the first company to allow a model to take over your device.
00:07:21.760 | But the reason I didn't talk about computer use from Anthropic, powered by Claude at the time,
00:07:27.280 | was because it just really wasn't very good. I tried to get it to do countless things
00:07:31.440 | and it simply couldn't. It took ages and rarely could achieve the goal.
00:07:34.880 | But just as an aside, bear in mind this benchmark called Web Voyager,
00:07:39.040 | testing how well agents can navigate the web with visual inputs.
00:07:42.160 | Claude computer use gets 52%. This particular agent scaffold got almost 67%.
00:07:48.800 | Project Mariner, as you could see, powered by Gemini 2 Flash, 90.5% with TreeSearch.
00:07:55.360 | And in an interview released just three hours ago, I picked out this highlight,
00:08:00.240 | spoken by Oriol Vinyals. He is the technical lead on Gemini and will tell you what's coming soon.
00:08:07.120 | I could ask a model, a language model or a visual language model, to
00:08:10.960 | learn to play the game of StarCraft. That's a very different approach to, say,
00:08:16.160 | create one agent that does play the game. In this other example, right, it could go online,
00:08:22.320 | watch videos about the game. It could, of course, download the game to start interacting with it,
00:08:27.840 | to learn, "Oh yeah, like I know how to, you know, I get it." Do research online, go to forums,
00:08:33.600 | read the forums, go play and figure out that it's weak at this thing and improve and so on.
00:08:39.760 | And after literally, it could be weeks, it sends you an email that says, "I now know how to play
00:08:45.520 | the game. Let's play." Right? That's not a reality that's that far away.
00:08:49.680 | Frankly, that's quite enough buzz time to bring us all back down to earth. Progress,
00:08:54.320 | according to Google's own CEO, might be slowing down. In his words, "The low-hanging fruit is gone."
00:09:01.520 | The current generation of LLM models are roughly, you know, a few companies have converged at the
00:09:05.760 | top, but I think they're all working on our next versions too. I think the progress is going to get
00:09:11.600 | harder when I look at '25. The low-hanging fruit is gone. You know, the curve, the hill is steeper.
00:09:19.840 | You think it's slowing down though? So that's interesting because I will tell you that
00:09:23.040 | your competitor/peer in Sam Altman said, "There is no wall. There is no wall." You think it's
00:09:30.880 | maybe slowing down in terms of how quickly it scales? I'm very confident there'll be a lot
00:09:37.840 | of progress in '25. I think the models are definitely going to get better at reasoning,
00:09:45.680 | completing a sequence of actions more reliably, you know, more agentic, if you will.
00:09:51.840 | That is a pretty stark contrast with what Sam Altman said as the CEO of OpenAI when he said,
00:09:57.760 | quote, "There is no wall," end quote, or Dario Amodei, the CEO of Anthropic,
00:10:03.280 | who predicted transformative AI by 2026. That's AI that can automate whole swathes of the economy.
00:10:10.480 | You can let me know in the comments who you think is right.
00:10:13.040 | In just a few moments, of course, I will get to the announcements from OpenAI
00:10:16.960 | over the last couple of days, but there's one model that we have right now. Forget the future.
00:10:22.640 | It comes from the sponsor of today's video, Assembly AI, and it's their Universal 2
00:10:27.440 | speech-to-text model. I've shown you plenty of comparisons before on the channel,
00:10:31.680 | but here's another few. For example, proper nouns.
00:10:34.800 | Naturally, models have to get better at understanding what you said
00:10:38.240 | before they can get better at replying intelligently. Links in the description.
00:10:42.320 | Some people criticize me for being too harsh on OpenAI and do check out my previous video
00:10:47.840 | for more of what I think, but I will say they are great at smooth interfaces. You could almost say
00:10:54.000 | Apple-like UX. So appropriately, ChatGPT is now inside iPhone 16. Yes, that means it's inside
00:11:00.720 | Apple Intelligence, but it's made by OpenAI. Anyway, no, getting an iPhone 16 isn't a way
00:11:06.560 | to circumvent paying for Plus or Pro, because even though you do get ChatGPT Vision, it can,
00:11:13.200 | as of yet, only analyze an image from within a video. Not like a live interaction. That's
00:11:18.320 | something separate. Still cool though, because you can invoke ChatGPT even if you don't have
00:11:22.400 | a ChatGPT account. So I can imagine hundreds of millions of "normies" using this, even though
00:11:28.560 | they don't follow AI. But for true video interaction, available in advanced voice mode,
00:11:34.560 | you'll need to get Plus or Pro. That's $20 or $200 a month, respectively. And though it's not free,
00:11:41.600 | just as with Google AI Studio, you've got to admit it's a pretty compelling tool.
00:11:46.480 | I think this is the best of AI, where you don't always care too much about super accurate answers,
00:12:12.560 | but satisfying your curiosity. As of tonight, you can even speak to Santa via voice mode and
00:12:18.480 | ask terribly rude questions. Again, as with Google, you will find that the underlying
00:12:23.120 | intelligence of the model you're using isn't the very best. The vision and advanced voice
00:12:28.160 | mode are based on GPT-4o, announced quite a while back. Obviously, that makes sense,
00:12:32.960 | they just don't have the GPUs to give everyone the best model at all times, interactively.
00:12:38.560 | Gemini 2 Flash is for us currently heavily rate-limited, so we're struggling to get the
00:12:43.840 | 5 runs necessary to calculate an average, but on the runs we've done so far, it's scored around 20%.
00:12:49.360 | When the API does become available, I do expect o1 Pro mode to sit comfortably at the top of this
00:12:55.840 | leaderboard. I should say, at the top of the models, not above the human baseline.
00:13:00.160 | That baseline, o1 Pro mode at least, will not beat.
00:13:03.600 | So those were my highlights from the dozens and dozens that came out over the last 24 hours.
00:13:09.200 | AI models that can see what you see, listen, browse alongside you, even play games alongside
00:13:15.360 | you. Actually, that gets me thinking, I can imagine in like 2-3 years, kids or even adults,
00:13:20.800 | when they're losing a game online, being like "you're using Gemini 4". Man, actually,
00:13:25.840 | that could be a pretty serious problem for competitive gameplay, unless you had like LAN
00:13:29.840 | tournaments. Anyway, I'm getting too distracted. Those were the announcements I found the most
00:13:33.440 | interesting, let me know which ones you found the most compelling, and as always, have a wonderful day.