Never Browse Alone? Gemini 2 Live and ChatGPT Vision
Chapters
0:00 Introduction
0:38 Live Interaction
3:43 Gemini 2.0 Flash Benchmarks
5:10 Audio and Image Output
6:38 Project Mariner (+ WebVoyager Bench)
8:49 But Is Progress Slowing Down?
10:43 OpenAI Announcements + Games
00:00:00.000 |
First Google and then tonight OpenAI. Both companies have released fun, useful and pretty 00:00:07.200 |
impressive tools that can see what you see, listen and even browse alongside you. Google's is free, 00:00:14.400 |
OpenAI's is paid, but OpenAI have included parts of their tool in iPhone 16s. I'm going to give you 00:00:21.600 |
just my highlights from the literally dozens of announcements in just the last 24 hours. 00:00:27.520 |
We'll take a one or two-minute demo tour and then I'll give you a hint about why these models make 00:00:33.120 |
more mistakes than you might think. Plus of course some benchmarks, can't resist that. Now I could 00:00:38.240 |
save the best till last but I won't. I'm just going to give you the coolest tool in my opinion 00:00:43.040 |
straight up front. You can access it for free at aistudio.google.com/live. We're going to have a 00:00:49.200 |
conversation with Gemini 2.0 from Google DeepMind. I'm not going to say it's their smartest model 00:00:55.840 |
and I'll get back to that later but it is their newest. Just for a brief moment though let's not 00:01:01.760 |
care too much about the smartness of the model. Look at what you can do. If you're on mobile what 00:01:06.160 |
you'll want to do is click Show Gemini and then give it permission to look through your camera. 00:01:12.640 |
Now you could already send videos to Gemini but this is live interaction. Maybe ask it about that 00:01:18.640 |
building you're walking by or that plant you can see next to the path. Because we're using this for 00:01:24.160 |
entertainment and curiosity I don't mind too much that it makes the occasional mistake. 00:01:29.280 |
Obviously do not rely on the answers. I'm going to click share screen in just a moment but first I 00:01:35.360 |
have to give you a bit of context. Here is another tool announced by Google yesterday called Deep 00:01:41.280 |
Research and it's only available via Gemini Advanced. It's a decent tool. It researches the 00:01:46.400 |
web, collates from like 20 results and presents them to you. I asked tell me everything about 00:01:51.600 |
Simple Bench featured on AI Explained and it gave me a plan and then I approved it by saying start 00:01:57.280 |
research and it gave me this result. It's definitely comprehensive research but you can't 00:02:01.760 |
really rely on it as exemplified by this sentence. Now though let's investigate that interactively 00:02:07.760 |
with Gemini. This is free so bear that in mind. Okay share your screen and then pick the screen 00:02:13.680 |
you want. Hi Gemini how are you? I am doing great thanks for asking. How are you today? I'm good. 00:02:20.000 |
Can you read out the sentence highlighted in yellow? Of course the highlighted sentence reads 00:02:24.880 |
Llama 3.1 outperformed other models like GPT-4 Turbo, Gemini 1.5 Pro and Claude 3.5 Sonnet on 00:02:32.720 |
SimpleBench. Close but good enough. Now can you look at the leaderboard and tell me, is that 00:02:38.960 |
correct? Yes that's correct. The leaderboard shows that Llama 3.1 405B Instruct is in eighth place 00:02:46.800 |
with a score of 23% while the first place is o1-preview with a score of 41.7%. This is honestly 00:02:55.200 |
really fun to speak to you and you sound like a cool dude but did Llama 3.1 really outperform 00:03:02.560 |
Claude 3.5 Sonnet, which I'm going to highlight? You are correct, it seems that Llama 3.1 did not 00:03:08.880 |
outperform Claude 3.5 Sonnet, as Claude 3.5 Sonnet is ranked second with a score of 41.4% 00:03:16.080 |
while Llama 3.1 is ranked eighth with a score of 23%. Thank you for your honesty, Gemini. By the 00:03:22.640 |
way, this is the channel that came up with SimpleBench. Any feedback on the aesthetics of the 00:03:27.360 |
thumbnails? Sure the AI Explained channel on YouTube has a consistent style with its thumbnails. 00:03:32.960 |
They typically use a combination of text and images and they have a very clear and legible 00:03:37.360 |
text overlay. The colors are bright and contrasting which makes the thumbnails stand out. Thank you 00:03:42.080 |
very much, Gemini. Of course, you will have noticed some basic mistakes there, because even though this 00:03:46.720 |
is Gemini 2.0 Flash, part of the whole new Gemini 2 era, Gemini 2 Flash actually got around 20% on 00:03:54.640 |
SimpleBench. SimpleBench is all about testing basic reasoning and seeing the question behind the 00:03:59.600 |
question. Other new entrants include Llama 3.3 at 19.9% and, interestingly, Gemini Experimental at 00:04:07.360 |
31.1%. That's not the model that powers that interaction, though. Why then did Google pick Gemini 2 00:04:14.000 |
Flash for this tool? Well because it's much cheaper and faster. Some of you who have been 00:04:19.840 |
following the announcements will say but wait Philip I thought Gemini 2 Flash got crazy results 00:04:25.440 |
in all the benchmarks. Well yes it got crazy results in all the benchmarks they picked to 00:04:30.160 |
show you. Notice also that they only compared the results to their own models. o1, for example, 00:04:35.360 |
let alone o1 Pro, gets close to 20 points higher on this benchmark testing biology, 00:04:42.080 |
physics and chemistry. But in fairness to Google, they aren't really pretending that Gemini 2.0 Flash 00:04:47.440 |
is the best model around. Perhaps they learned their lesson from one of their previous releases, 00:04:52.320 |
where Demis Hassabis said of what I think was Gemini 1.5 Pro that it was now better than experts in a range of 00:04:57.920 |
subjects. Scoring highly on certain tests doesn't mean, for now, that you can match the breadth of 00:05:03.760 |
human expertise. Luckily, then, Hassabis didn't make that claim this time around. But there are 00:05:09.520 |
other things that Gemini 2 Flash can do. It can speak in whispers and different tones. 00:05:32.160 |
It can also output and, kind of, edit images, and more on why I say "kind of" in a moment. 00:05:58.880 |
Actually I was fortunate enough to get early access to that tool and even though I can't 00:06:03.600 |
show you any of the results I can say it's still a work in progress. Others though have shown that 00:06:09.280 |
you can upload an image like this one of a tic-tac-toe or noughts and crosses board and ask 00:06:14.320 |
it's X's turn to play, what should the next move be? I must say, as anyone who has seen this channel 00:06:21.280 |
before will know, I am quite the expert at tic-tac-toe, and so I would agree when Gemini 2 Flash 00:06:28.560 |
says the move should be this one and then it gives us the image. That's pretty cool. 00:06:33.040 |
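For anyone who wants to sanity-check a suggested move like that without taking the model's word for it, a brute-force search over the 3x3 board is tiny. This is just an illustrative Python sketch with a board encoding I've assumed for the example; it is not how Gemini itself handles the image.

```python
# Illustrative only: verify a suggested tic-tac-toe move by exhaustive search.
# The board is a list of 9 characters ('X', 'O' or ' '), row-major.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def negamax(board, to_move):
    """Score the position for `to_move`: +1 forced win, 0 draw, -1 forced loss."""
    opponent = 'O' if to_move == 'X' else 'X'
    if winner(board) == opponent:   # the previous move already ended the game
        return -1
    if ' ' not in board:
        return 0
    best = -2
    for i, cell in enumerate(board):
        if cell == ' ':
            board[i] = to_move
            best = max(best, -negamax(board, opponent))
            board[i] = ' '
    return best

def best_moves(board, to_move):
    """All optimal squares (0-8, row-major) for `to_move` on the given board."""
    opponent = 'O' if to_move == 'X' else 'X'
    scored = {}
    for i, cell in enumerate(board):
        if cell == ' ':
            board[i] = to_move
            scored[i] = -negamax(board, opponent)
            board[i] = ' '
    top = max(scored.values())
    return [i for i, score in scored.items() if score == top]

# Hypothetical position, not the one in the video: X on squares 0 and 1, O on 3 and 4.
# X's only correct move is square 2, completing the top row before O completes the middle row.
print(best_moves(list("XX OO    "), 'X'))  # -> [2]
```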
One more cool thing that Gemini 2 Flash can do before we get to a pretty big caveat. 00:06:38.320 |
Google demoed Project Mariner where Gemini 2 actually takes over your computer. Well at least 00:06:44.080 |
it can click things, browse the web and perform actions. Even purchase things. Not yet available 00:06:49.520 |
mind you but I am trying to get access. Here is a 20 second demo. 00:06:54.320 |
You can ask it to do things like this that require multiple steps. Doing research to 00:06:58.560 |
look up the artist, then finding the painting and then shopping for the right supplies. 00:07:03.520 |
The agent uses the web to take action and find what you're looking for. 00:07:07.600 |
It plans and reasons at each step, always staying in your control and completing your task. 00:07:15.280 |
Google DeepMind are not the first company to allow a model to take over your device. 00:07:21.760 |
But the reason I didn't talk about computer use from Anthropic, powered by Claude at the time, 00:07:27.280 |
was because it just really wasn't very good. I tried to get it to do countless things 00:07:31.440 |
and it simply couldn't. It took ages and rarely could achieve the goal. 00:07:34.880 |
But just as an aside, bear in mind this benchmark called WebVoyager, 00:07:39.040 |
testing how well a model can navigate the web with visual inputs. 00:07:42.160 |
Claude computer use gets 52%. This particular agent scaffold got almost 67%. 00:07:48.800 |
Project Mariner, as you could see, powered by Gemini 2 Flash, gets 90.5% with tree search. 00:07:55.360 |
And in an interview released just three hours ago, I picked out this highlight, 00:08:00.240 |
spoken by Oriol Vinyals. He is the technical lead on Gemini and will tell you what's coming soon. 00:08:07.120 |
I could ask a model, a language model or a visual language model, to 00:08:10.960 |
learn to play the game of StarCraft. That's a very different approach to, say, 00:08:16.160 |
create one agent that does play the game. In this other example, right, it could go online, 00:08:22.320 |
watch videos about the game. It could, of course, download the game to start interacting with it, 00:08:27.840 |
to learn, "Oh yeah, like I know how to, you know, I get it." Do research online, go to forums, 00:08:33.600 |
read the forums, go play and figure out that it's weak at this thing and improve and so on. 00:08:39.760 |
And after literally, it could be weeks, it sends you an email that says, "I now know how to play 00:08:45.520 |
the game. Let's play." Right? That's not a reality that's that far away. 00:08:49.680 |
Frankly, that's quite enough buzz; time to bring us all back down to earth. Progress, 00:08:54.320 |
according to Google's own CEO, might be slowing down. In his words, "The low-hanging fruit is gone." 00:09:01.520 |
The current generation of LLM models are roughly, you know, a few companies have converged at the 00:09:05.760 |
top, but I think they're all working on our next versions too. I think the progress is going to get 00:09:11.600 |
harder when I look at '25. The low-hanging fruit is gone. You know, the curve, the hill is steeper. 00:09:19.840 |
You think it's slowing down though? So that's interesting because I will tell you that 00:09:23.040 |
your competitor/peer in Sam Altman said, "There is no wall. There is no wall." You think it's 00:09:30.880 |
maybe slowing down in terms of how quickly it scales? I'm very confident there'll be a lot 00:09:37.840 |
of progress in 25. I think the models are definitely going to get better at reasoning, 00:09:45.680 |
completing a sequence of actions more reliably, you know, more agentic, if you will. 00:09:51.840 |
That is a pretty stark contrast with what Sam Altman said as the CEO of OpenAI when he said, 00:09:57.760 |
quote, "There is no wall," end quote, or Dario Amadei, the CEO of Anthropic, 00:10:03.280 |
who predicted transformative AI by 2026. That's AI that can automate whole swathes of the economy. 00:10:10.480 |
You can let me know in the comments who you think is right. 00:10:13.040 |
In just a few moments, of course, I will get to the announcements from OpenAI 00:10:16.960 |
over the last couple of days, but there's one model that we have right now. Forget the future. 00:10:22.640 |
It comes from the sponsor of today's video, AssemblyAI, and it's their Universal-2 00:10:27.440 |
speech-to-text model. I've shown you plenty of comparisons before on the channel, 00:10:31.680 |
but here's another few. For example, proper nouns. 00:10:34.800 |
Naturally, models have to get better at understanding what you said 00:10:38.240 |
before they can get better at replying intelligently. Links in the description. 00:10:42.320 |
Some people criticize me for being too harsh on OpenAI and do check out my previous video 00:10:47.840 |
for more of what I think, but I will say they are great at smooth interfaces. You could almost say 00:10:54.000 |
Apple-like UX. So, appropriately, ChatGPT is now inside the iPhone 16. Yes, that means it's inside 00:11:00.720 |
Apple Intelligence, but it's made by OpenAI. Anyway, no, getting an iPhone 16 isn't a way 00:11:06.560 |
to circumvent paying for Plus or Pro, because even though you do get ChatGPT Vision, it can, 00:11:13.200 |
as of yet, only analyze an image, not a live video interaction. That's 00:11:18.320 |
something separate. Still cool though, because you can invoke ChatGPT even if you don't have 00:11:22.400 |
a ChatGPT account. So I can imagine hundreds of millions of "normies" using this, even though 00:11:28.560 |
they don't follow AI. But for true video interaction, available in advanced voice mode, 00:11:34.560 |
you'll need to get Plus or Pro. That's $20 or $200 a month, respectively. And though it's not free, 00:11:41.600 |
just as with Google AI Studio, you've got to admit it's a pretty compelling tool. 00:11:46.480 |
I think this is the best of AI, where you don't always care too much about super accurate answers, 00:12:12.560 |
but about satisfying your curiosity. As of tonight, you can even speak to Santa via voice mode and 00:12:18.480 |
ask terribly rude questions. Again, as with Google, you will find that the underlying 00:12:23.120 |
intelligence of the model you're using isn't the very best. The Vision and advanced voice 00:12:28.160 |
mode is based on GPT-4o, announced quite a while back. Obviously, that makes sense: 00:12:32.960 |
they just don't have the GPUs to give everyone the best model at all times, interactively. 00:12:38.560 |
Gemini 2 Flash is for us currently heavily rate-limited, so we're struggling to get the 00:12:43.840 |
5 runs necessary to calculate an average, but on the runs we've done so far, it's scored around 20%. 00:12:49.360 |
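As an aside on methodology: the averaging itself is trivial, the pain is just getting runs through the rate limits. Below is a rough, hypothetical Python sketch of the kind of retry-with-backoff loop you end up writing; `call_model` is a stand-in for whatever client you actually use, and nothing here is our real harness or real numbers.

```python
import random
import time

def call_with_backoff(call_model, prompt, max_retries=6, base_delay=2.0):
    """Call a rate-limited model, retrying with exponential backoff plus jitter.

    `call_model` is a placeholder for whatever SDK call you are using; in
    practice you would catch that client's specific rate-limit exception
    rather than a bare Exception.
    """
    for attempt in range(max_retries):
        try:
            return call_model(prompt)
        except Exception as err:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited ({err}); retrying in {delay:.1f}s")
            time.sleep(delay)

def average_score(per_run_scores):
    """Mean accuracy across runs, e.g. the 5 runs behind a leaderboard entry."""
    return sum(per_run_scores) / len(per_run_scores)
```

Nothing fancy, but it is why a heavily rate-limited preview makes a 5-run average slow to collect.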
When the API does become available, I do expect o1 Pro mode to sit comfortably at the top of this 00:12:55.840 |
leaderboard. I should say, at the top of the models, not above the human baseline. 00:13:00.160 |
That baseline, o1 Pro mode at least, will not beat. 00:13:03.600 |
So those were my highlights from the dozens and dozens of announcements that came out over the last 24 hours. 00:13:09.200 |
AI models that can see what you see, listen, browse alongside you, even play games alongside 00:13:15.360 |
you. Actually, that gets me thinking, I can imagine in like 2-3 years, kids or even adults, 00:13:20.800 |
when they're losing a game online, being like "you're using Gemini 4". Man, actually, 00:13:25.840 |
that could be a pretty serious problem for competitive gameplay, unless you had like LAN 00:13:29.840 |
tournaments. Anyway, I'm getting too distracted. Those were the announcements I found the most 00:13:33.440 |
interesting; let me know which ones you found the most compelling and, as always, have a wonderful day.