
Gemini 2 Multimodal and Spatial Awareness in Python


Chapters

0:00 Gemini 2 Multimodal
0:41 Gemini Focus on Agents
1:53 Running the Code
3:08 Asking Gemini to Describe Images
9:29 Gemini Image Bounding Boxes
21:06 Gemini Spatial Awareness Example 2
23:29 Gemini Spatial Awareness Example 3
26:52 Gemini Spatial Awareness Example 4
29:09 Gemini Image-to-Text
30:50 Google Gemini vs OpenAI GPTs


00:00:00.000 | Today, we're going to be taking a look at Google's new Gemini 2 model.
00:00:04.240 | Now, Gemini 2, I think it's probably one of the most impressive LM releases out there.
00:00:10.240 | And although I can't form a full opinion on it quite yet, I still need to work with it a little more.
00:00:17.840 | It does seem to be something that could actually get me to stop using OpenAI. Potentially. Again, I need to test it more, but if it's not that model,
00:00:30.040 | it's the closest I think I've ever seen to getting to that point where I'm like, oh, I don't necessarily need OpenAI for a lot of stuff anymore.
00:00:39.440 | So that is really interesting.
00:00:41.280 | And one thing I really like that they're doing is they're focusing on the agents use case.
00:00:46.320 | I think agents are, like, the short-term future of LLMs and AI in general.
00:00:52.720 | And I mean, their announcement here is literally like it is the new AI model for the agentic era.
00:01:00.720 | Like they're really focusing on that.
00:01:02.280 | And one thing I have noticed is that this model produces a structured output very well.
00:01:07.480 | So we're going to be focusing on the text to image and image to text modalities.
00:01:11.760 | And we're going to be jumping into this multi-modal example here.
00:01:17.280 | So this is in the Aurelio Labs cookbook repo, Gen AI, Google AI, Gemini 2, and then multi-modal under here.
00:01:24.440 | We have a load more Gemini 2 examples coming as well.
00:01:29.000 | So there'll be more in here very soon.
00:01:31.720 | Now we're going to be working towards something kind of like this.
00:01:36.160 | And there's probably actually a few other really interesting examples where we're just taking some images that are not necessarily super clear.
00:01:43.440 | And we're just going to be able to see, you know, what does it do?
00:01:45.840 | Where does Gemini 2 not work very well?
00:01:47.920 | Where does it work incredibly well?
00:01:50.120 | And yeah, we'll see a few of those examples.
00:01:53.200 | So in that notebook, you will be able to open it in Colab.
00:01:57.360 | And that's probably the easiest way of like running through alongside me in this example.
00:02:02.720 | But I'm going to run it locally and you can run it locally as well.
00:02:06.680 | There's some set up instructions here, but easiest is just Colab.
00:02:12.360 | So if we're running locally, the first thing we need to do is just select our environment.
00:02:17.000 | Of course, in Colab, you don't really need to do that.
00:02:20.880 | So we run this.
00:02:22.080 | I have these four images here.
00:02:25.080 | OK, we're going to see them here, actually.
00:02:27.480 | So we run this.
00:02:29.280 | OK, so we'll just see each of the four images that we're going to ask for.
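(For anyone following along locally, loading and previewing the images is roughly this; a minimal sketch, and the folder and file names below are placeholders rather than the notebook's actual paths.)

```python
from pathlib import Path

from IPython.display import display
from PIL import Image

# Placeholder paths; the notebook ships its own four dive screenshots.
image_paths = sorted(Path("images").glob("*.jpg"))
images = [Image.open(path) for path in image_paths]

for img in images:
    img.thumbnail((800, 800))  # shrink in place so previews (and requests) stay small
    display(img)               # `display` is available inside a notebook
```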
00:02:34.880 | So these are just screen grabs from like some diving videos.
00:02:39.160 | And, you know, there's like there's a lot going on in here.
00:02:42.520 | I think, you know, a lot of the examples I've seen from Google are like really clean images.
00:02:49.960 | So we have these images kind of blurry in a lot of cases, a lot of motion
00:02:56.400 | and a lot of noise is just a lot going on in all these pictures.
00:02:59.920 | So they're not easy images to work with here.
00:03:04.520 | So we'll see how Gemini performs against them.
00:03:08.080 | First thing, we're just going to ask Gemini to describe what is in these images.
00:03:13.800 | So we're going with the first image where you have like these anemones
00:03:19.760 | and then the clownfish or anemone fish in the front here.
00:03:24.280 | There's a cleaner wrasse or something.
00:03:28.720 | I'm not entirely sure what you call it.
00:03:31.000 | I think these are table corals.
00:03:33.000 | Like there's a lot of stuff in this image that we can say, OK, what is in the image?
00:03:38.560 | It's also there is also clownfish over there in the background.
00:03:41.720 | I haven't managed to get Gemini to label that one.
00:03:44.800 | Probably it won't again.
00:03:46.080 | But we'll see. See how things go.
00:03:48.400 | Interestingly, running this notebook today compared to yesterday,
00:03:53.800 | the results change quite significantly.
00:03:56.720 | I'm not sure if I just got lucky one time, but the results seem a lot better.
00:03:59.640 | So I think maybe Google or maybe they're doing something.
00:04:03.200 | I don't know. Anyway.
00:04:06.240 | So first thing, we are going to need a Google AI Studio API key.
00:04:13.320 | And OK, we get that from Google AI Studio.
00:04:18.080 | So I'm going to go and open that.
00:04:19.800 | Right. So you will need to just go and create an account.
00:04:24.200 | So you create an account and then you just go through.
00:04:26.520 | And, where is it... I think it's maybe under settings, then API plan information.
00:04:33.880 | You come in here and then you want to open this up.
00:04:38.040 | So this is just going to open a window in GCP.
00:04:41.840 | And basically GCP generates this project for you, Gemini API.
00:04:46.480 | You can go ahead and just create a credential here.
00:04:49.920 | I think. OK, so it will ask you for an API key.
00:04:53.120 | This is the one you want.
00:04:54.080 | So just click API key.
00:04:56.320 | You get your API key and you just want to paste it in.
00:04:59.440 | A little box should pop up, or you can just pop it straight into a string
00:05:03.200 | here. It's up to you.
00:05:05.080 | I'm going to go ahead and do that right.
00:05:07.680 | And that will just initialize the connection to your client.
00:05:09.960 | All right. So fairly straightforward.
00:05:13.680 | Nothing, nothing complicated so far.
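(As a rough sketch of what that cell does, assuming the new google-genai Python SDK that the Gemini 2 examples use; the environment-variable name is just a convention.)

```python
import os
from getpass import getpass

from google import genai

# Prompt for the key if it isn't already exported as an environment variable.
api_key = os.environ.get("GOOGLE_API_KEY") or getpass("Google AI Studio API key: ")

# One client object; everything else in the notebook goes through this.
client = genai.Client(api_key=api_key)
```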
00:05:16.080 | OK, we're going to be using this Gemini 2 Flash.
00:05:19.960 | The "exp" here means experimental.
00:05:22.720 | Again, as I said, it's not really intended to be used in production.
00:05:26.560 | I'm not sure if they allow you to at the moment.
00:05:29.000 | I haven't read too much into that.
00:05:30.880 | But yes, experimental.
00:05:32.800 | So the actual sort of production ready model will hopefully come soon.
00:05:37.160 | Although Google are very slow with that sort of thing.
00:05:40.040 | So who knows?
00:05:42.040 | But anyway, so we have the Flash model.
00:05:45.840 | So it should be pretty fast.
00:05:47.480 | And this is basically how we're using the API,
00:05:53.120 | or how we're using the model.
00:05:54.600 | So we are generating the content.
00:05:57.320 | Right. So we have content.
00:05:59.120 | And these here basically, at least the way that I understand this,
00:06:04.400 | this here is going to be like both of these independently,
00:06:08.920 | I believe, are being transformed into user messages.
00:06:11.920 | This is a text based user message.
00:06:14.080 | And this is a user message with just an image.
00:06:17.600 | I could be wrong there, though.
00:06:19.360 | Maybe they both get combined.
00:06:20.680 | I'm not super sure.
00:06:22.720 | But yeah, we have both those.
00:06:24.120 | We haven't set a system prompt.
00:06:25.760 | Oh, sorry. Did we?
00:06:27.360 | Oh, no, we did set a system prompt. Sorry.
00:06:29.720 | So the system prompt we defined here.
00:06:31.840 | OK, so we define this config object here.
00:06:35.360 | And we pass in our system instruction or system prompt in there.
00:06:40.240 | All right. So the system prompt is just describe what you see in this image,
00:06:42.960 | identify any fish or coral species in the image
00:06:46.320 | and tell us how many of each you can see.
00:06:48.200 | Right. Fairly simple.
00:06:50.600 | We set the safety settings, which are basically just saying
00:06:54.400 | don't be too strict.
00:06:57.320 | That's kind of what we're doing here.
00:06:58.440 | I don't think we really need this, to be honest, for this.
00:07:01.280 | Maybe they're just fish, but who knows?
00:07:03.600 | Then also, yeah, temperature is pretty low, 0.1.
00:07:09.000 | I did find that it tends to...
00:07:11.240 | at least for this task, a slightly higher than zero
00:07:16.400 | temperature setting seemed to actually get better results, which is interesting.
00:07:20.480 | So usually I would not expect that for a task
00:07:25.760 | where the agent needs to produce such structured output.
00:07:29.160 | OK, so, yeah, we have that.
00:07:32.880 | So we pass that to our model.
00:07:35.440 | Then we pass in those contents.
00:07:37.200 | So the one thing with Gemini 2 is it will generally output everything
00:07:42.120 | in Markdown without you even telling it to, which is fine.
00:07:45.640 | I like Markdown.
00:07:46.840 | So I'm just basically showing everything in Markdown here.
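(Putting that together, the call looks roughly like this. It's a sketch rather than the notebook verbatim: the system prompt is paraphrased from the video, and the exact safety category and threshold values here are assumptions.)

```python
from google.genai import types
from IPython.display import Markdown, display

model_id = "gemini-2.0-flash-exp"  # the experimental Gemini 2 Flash model

describe_config = types.GenerateContentConfig(
    system_instruction=(
        "Describe what you see in this image. Identify any fish or coral "
        "species in the image and tell us how many of each you can see."
    ),
    temperature=0.1,  # low, but slightly above zero seemed to help for this task
    safety_settings=[
        # Loosen the filter a little; probably unnecessary for reef photos.
        types.SafetySetting(
            category="HARM_CATEGORY_DANGEROUS_CONTENT",
            threshold="BLOCK_ONLY_HIGH",
        ),
    ],
)

response = client.models.generate_content(
    model=model_id,
    contents=["What do you see in this image?", images[0]],  # text part plus a PIL image
    config=describe_config,
)

# Gemini 2 tends to answer in Markdown anyway, so render it as Markdown.
display(Markdown(response.text))
```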
00:07:51.840 | So we have the overall scene.
00:07:54.040 | It's describing it as an underwater scene, coral reef,
00:07:57.160 | marine life, coral formations, it's daytime, natural light
00:08:02.280 | filtering through the water.
00:08:04.000 | So that's pretty accurate, I think.
00:08:07.520 | They have the clownfish, right?
00:08:08.960 | So this is an interesting thing.
00:08:11.080 | So when I'm asking it to describe the scene,
00:08:14.240 | it actually seems to identify things very well.
00:08:16.760 | But then when you're asking it to draw the bounding box, it doesn't.
00:08:20.000 | Or at least before it wasn't quite doing as well.
00:08:22.240 | So we have the clownfish.
00:08:24.360 | There's two clownfish it didn't notice, the one in the back.
00:08:27.320 | All right. The one over here.
00:08:30.640 | So interesting.
00:08:32.440 | And then there's this wrasse.
00:08:33.880 | So I don't know how you pronounce it, by the way,
00:08:36.320 | but I think it's wrasse.
00:08:37.440 | So this wrasse, it's this little thing on the left here,
00:08:41.320 | which is pretty tiny.
00:08:43.400 | I think also, if you haven't been diving much,
00:08:47.680 | you probably wouldn't even notice this fish.
00:08:50.320 | The only reason you notice, like, if you've been diving a little bit,
00:08:52.960 | it's because they'll try and kind of bite you and stuff
00:08:57.520 | and just be generally annoying.
00:08:59.080 | So that fish is just kind of like ingrained in your memory.
00:09:01.920 | So kind of cool that it identified that.
00:09:05.200 | You have the coral species, the anemone,
00:09:06.960 | which is what the clownfish live in.
00:09:09.120 | You have the hard corals, branching corals,
00:09:11.800 | plate corals, and some massive corals.
00:09:14.280 | But that's kind of cool.
00:09:17.480 | And then there's some counts.
00:09:18.920 | All right. So, yeah, that's pretty cool.
00:09:21.880 | So that is just describing an image, right?
00:09:24.960 | So image to text or text to text and image to text.
00:09:29.040 | Now, what we want to do is we're going to draw these bounding boxes.
00:09:34.120 | And so this prompt here, it's mostly adapted from an example
00:09:40.360 | that Google provided where they're doing the same thing, right?
00:09:43.760 | They're drawing the bounding boxes.
00:09:46.000 | And what I found is, like, modifying this much
00:09:49.680 | tended to break it pretty easily, which is interesting.
00:09:54.240 | So I don't know if that's just a prompting thing.
00:09:56.160 | Maybe I need to get a little better at prompting this model.
00:09:59.880 | So, anyway, we're telling it to just return some bounding boxes, OK?
00:10:05.080 | And they do that as a JSON array with labels.
00:10:08.440 | Now, this seems to be something it's being pre-trained on,
00:10:11.840 | as far as I can tell,
00:10:13.480 | because I'm not really defining much here.
00:10:16.800 | And it is, if you come down, oh, I don't have the example.
00:10:23.160 | But it's producing this structure,
00:10:25.400 | which is exactly the structure that we need, essentially.
00:10:29.600 | So it's obviously -- it's had a few examples of this
00:10:33.040 | in the training or fine-tuning datasets.
00:10:36.520 | So we're just saying bounding boxes.
00:10:40.280 | One thing I'm doing here, if an object --
00:10:43.560 | I don't specify fish, but --
00:10:45.640 | because I also want to just kind of keep it flexible.
00:10:48.720 | If an object is present multiple times,
00:10:50.800 | label them according to their scientific and popular name.
00:10:53.760 | And so my first question with this new prompt --
00:10:56.960 | so this is our new system prompt.
00:10:59.960 | My first question is highlight different fish in this image.
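(The new setup looks something like this. The prompt text is reconstructed from what's described in the video rather than copied from the notebook, so treat the wording as approximate.)

```python
# Reconstructed bounding-box system prompt (approximate wording).
bbox_system_prompt = (
    "Return bounding boxes as a JSON array with labels. Limit to 25 objects. "
    "If an object is present multiple times, label them according to their "
    "scientific and popular name."
)

bbox_config = types.GenerateContentConfig(
    system_instruction=bbox_system_prompt,
    temperature=0.1,
)

response = client.models.generate_content(
    model=model_id,
    contents=["Highlight the different fish in this image.", images[0]],
    config=bbox_config,
)
print(response.text)
```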
00:11:03.240 | And you're seeing here, like, the runtime is super long, right?
00:11:08.720 | And the reason for that is that Gemini is repeating itself
00:11:14.600 | right now a lot, and I don't know why that is.
00:11:18.840 | I think, you know, beyond the example,
00:11:22.520 | like the really simple examples, it does tend to do this.
00:11:25.360 | So we'll see, and we'll fix it.
00:11:28.120 | But, yeah, you get all of these, which is kind of crazy.
00:11:32.080 | There's so much going on there.
00:11:34.280 | I actually didn't look at what the output of this would look like.
00:11:38.880 | And to be honest, I'm not sure we can --
00:11:41.000 | or not easily, because it cuts off just here.
00:11:43.680 | It just stops. So interesting.
00:11:46.760 | But it does seem to be getting -- like, the labels here are --
00:11:50.200 | you know, they are things that are in there for the most part.
00:11:52.280 | Of course, unknown fish is just -- but, yeah, I don't know.
00:11:56.920 | It's kind of interesting that it just repeats a ton of stuff there.
00:12:00.200 | But, anyway, we can resolve that, and it's super easy to do.
00:12:03.400 | All I did was add a pretty high frequency penalty here,
00:12:08.880 | and that resolved the issue.
00:12:10.520 | I was also playing around with the temperature.
00:12:12.040 | Like, you can increase or decrease this.
00:12:13.760 | It doesn't make too much of a difference.
00:12:16.080 | And then, yeah, I mean, the only other thing is, okay,
00:12:18.000 | we have a limit to 25 objects,
00:12:19.480 | but we had that before, and it didn't listen very well.
00:12:22.520 | So, actually, I didn't even modify anything here, right?
00:12:25.840 | This is still the same prompt.
00:12:27.440 | So all I've done is added the frequency penalty, right?
00:12:31.520 | So interesting. Now let's see what it does.
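(In code terms, the only change is one extra parameter on the config. The exact value here is a guess; the video just says it's a pretty high frequency penalty.)

```python
bbox_config = types.GenerateContentConfig(
    system_instruction=bbox_system_prompt,
    temperature=0.1,
    frequency_penalty=1.0,  # "pretty high"; the value in the notebook may differ
)

response = client.models.generate_content(
    model=model_id,
    contents=["Highlight the different fish in this image.", images[0]],
    config=bbox_config,
)
```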
00:12:35.840 | Okay. We have this, right?
00:12:38.240 | It's actually labeling a lot more stuff than it was before,
00:12:42.640 | which is interesting.
00:12:44.880 | Let's see how -- well, we'll see in a moment how all of that looks.
00:12:48.160 | But, okay, we have this JSON output.
00:12:50.880 | That's what we need.
00:12:52.480 | And, yeah, you can see this is --
00:12:54.960 | like, Gemini 2 has obviously been trained to do this.
00:12:58.080 | We have our bounding boxes.
00:13:00.520 | You have the -- these are the coordinates in some order.
00:13:03.840 | We'll see in a moment what they actually are.
00:13:05.600 | And then we have a label, right?
00:13:08.320 | Kind of interesting that it got a different clownfish.
00:13:12.080 | Anemonefish and clownfish are the same, as far as I know.
00:13:16.000 | So interesting they identified a different one to what it did before.
00:13:21.240 | Before, it was usually going with Clark's clownfish for both.
00:13:24.600 | Okay. So what we need to do is sometimes this will also output a message, right?
00:13:30.800 | So it might have some text, and then it outputs this JSON here, right?
00:13:36.680 | And it's probably worth me just showing you what that, you know,
00:13:42.600 | text actually looks like, because it's not --
00:13:46.400 | you know, we're formatting it with mark down here, right?
00:13:51.520 | So what it actually looks like is this.
00:13:56.240 | So to extract out our JSON,
00:13:58.880 | we're just going to be looking for this, like, code block.
00:14:02.920 | And we extract out whatever is in the middle.
00:14:05.520 | And we do that with regex here, like a simple regex.
00:14:09.720 | So we're just looking, okay, we're looking for this and this, right?
00:14:14.280 | And anything in between.
00:14:16.040 | So we're looking, yeah, first line of the JSON, final line of the JSON,
00:14:20.640 | and we grab everything in between.
00:14:22.440 | Okay? Yeah, that's it.
00:14:23.440 | So we do that, we extract that out, and then we just load it with JSON.
00:14:29.240 | And then we get this, right?
00:14:30.440 | So it's just a dictionary -- sorry, a list of dictionary objects
00:14:34.880 | with all of our coordinates and the labels.
00:14:37.880 | Then, okay, this is just wrapping what we did just up here
00:14:43.240 | into a parse JSON function, just to make things a little easier.
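(A minimal sketch of that helper, assuming the markdown-fenced JSON shape shown above; the notebook's actual regex and key names may differ slightly.)

```python
import json
import re


def parse_json(text: str) -> list[dict]:
    """Extract the JSON array from a fenced json code block, falling back to the raw text."""
    match = re.search(r"```json\s*(.*?)```", text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


boxes = parse_json(response.text)
# e.g. [{"box_2d": [y0, x0, y1, x1], "label": "Clark's anemonefish"}, ...]  (assumed keys)
```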
00:14:47.640 | Then what we're going to do is plot the bounding boxes, right?
00:14:51.400 | So there are a few things going on here.
00:14:54.480 | Most of this is pulled from one of Google's examples
00:14:58.360 | where they have a similar setup.
00:15:02.280 | And basically what we're doing is --
00:15:06.560 | okay, here we're just saying we pull out a few different colors
00:15:10.240 | or color names so that we can use them when drawing our bounding boxes.
00:15:14.520 | So we have, like, colorful bounding boxes, essentially.
00:15:16.920 | We don't necessarily need to do that, but we do here.
00:15:19.480 | It is easier when you're trying to read everything, I think.
00:15:22.840 | We take the image, right, we're passing it into here,
00:15:26.400 | but we're making a copy of it
00:15:27.920 | because otherwise we would modify the original image
00:15:30.720 | when we're adding those bounding boxes,
00:15:32.320 | which is super annoying if you run it more than once.
00:15:35.680 | Then we're just taking the width and height of the image,
00:15:38.400 | which we need for the normalization step below here.
00:15:41.320 | We initialize a drawing object.
00:15:43.920 | So this is part of the pillow library.
00:15:45.640 | Basically, we take that, the picture,
00:15:49.320 | and we're going to draw on top of it.
00:15:50.720 | That's what this draw object will allow us to do.
00:15:52.760 | Extract our bounding boxes from the LLM output.
00:15:57.080 | And then for each bounding box,
00:15:59.240 | we're going to go through and, like, well, draw them
00:16:04.480 | on our, like, draw object.
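(Condensed, the drawing function is along these lines. It assumes Gemini's usual convention of [ymin, xmin, ymax, xmax] boxes normalised to a 0 to 1000 range, which is an assumption; check the coordinate order against your own output.)

```python
from PIL import Image, ImageColor, ImageDraw


def plot_bounding_boxes(image: Image.Image, boxes: list[dict]) -> Image.Image:
    img = image.copy()                  # copy so we never scribble on the original
    width, height = img.size            # needed to de-normalise the coordinates
    draw = ImageDraw.Draw(img)          # Pillow object we draw rectangles and text onto
    colors = list(ImageColor.colormap)  # a pool of named colours to cycle through

    for i, box in enumerate(boxes):
        color = colors[i % len(colors)]
        y0, x0, y1, x1 = box["box_2d"]  # assumed key and ordering (see note above)
        left, top = x0 / 1000 * width, y0 / 1000 * height
        right, bottom = x1 / 1000 * width, y1 / 1000 * height
        draw.rectangle([left, top, right, bottom], outline=color, width=3)
        draw.text((left + 4, top + 4), box.get("label", ""), fill=color)
    return img


plot_bounding_boxes(images[0], boxes)
```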
00:16:06.720 | Let's run this and see what it gives us.
00:16:11.000 | Nice. Okay, so actually when you do look at this,
00:16:15.360 | it identified quite a few different fish,
00:16:17.640 | but I think they're all actually pretty accurate, right?
00:16:20.160 | This is pretty good.
00:16:22.600 | So you have the clownfish or two clownfish here.
00:16:27.200 | It identifies them as different species,
00:16:29.280 | but they're definitely the same.
00:16:30.880 | It does pick up on the clownfish in the background
00:16:33.240 | just as it identifies a clownfish.
00:16:36.200 | Then I think that those two are actually other fish.
00:16:38.920 | I think maybe there's fish here, but I can hardly even --
00:16:43.800 | Yeah, there are. I think there's some there,
00:16:45.800 | but I can hardly see that.
00:16:48.200 | For some reason, none of the models
00:16:49.880 | seem to be able to identify this fish in the corner here,
00:16:54.040 | maybe because it's partly cut off,
00:16:55.520 | and it does identify the wrasse, which is cool.
00:16:58.680 | Now, this one's harder, I think,
00:17:01.080 | to identify the different corals.
00:17:03.920 | Sometimes this kind of gets some of them okay,
00:17:06.880 | and then other times it just kind of goes off the rails.
00:17:10.320 | So, yeah, we can see.
00:17:12.960 | All right, so here, yeah, you can kind of see.
00:17:18.560 | I mean, there's just a lot of stuff going on there.
00:17:20.400 | So what it's saying here is that this is an equidopora coral.
00:17:26.400 | We can see it here. Let me -- All right, this thing here.
00:17:30.440 | Equidopora, right?
00:17:33.480 | So, yeah, I mean, I think those things are kind of everywhere,
00:17:37.520 | if it is what I think it is,
00:17:39.120 | and I wouldn't say necessarily got any of them here.
00:17:43.800 | Maybe, like, over here or here they could be, but, yeah.
00:17:47.040 | So it's just kind of saying
00:17:48.640 | that there's loads of that everywhere, which is interesting.
00:17:51.160 | I wouldn't say it's necessarily accurate.
00:17:55.320 | This one, I have no idea.
00:17:58.920 | Not bad.
00:18:01.440 | Actually, that's correct.
00:18:03.640 | I thought it was just anemone.
00:18:05.080 | Okay, so I thought this was called,
00:18:08.400 | like, giant anemone or something.
00:18:10.160 | So, yeah, it got the anemone correct,
00:18:12.640 | I think, in some places here, right?
00:18:16.480 | So it's saying the anemone is, like, this one here,
00:18:22.240 | so kind of covering the clownfish,
00:18:25.040 | and also here, so it's interesting.
00:18:27.160 | I think probably because clownfish live in the anemones,
00:18:30.320 | so I wonder if it's seeing the clownfish,
00:18:33.360 | seeing some corals around it, and it's like,
00:18:35.200 | okay, yeah, it's definitely an anemone, right?
00:18:38.800 | Because it's weird that the rest of them around here
00:18:41.280 | doesn't identify as that,
00:18:42.760 | but then when there's a clownfish next to it, it does.
00:18:45.880 | So that's kind of cool. Again, not perfect, though, right?
00:18:48.200 | But honestly, like, a few more models down the line,
00:18:51.960 | this would be pretty impressive.
00:18:53.760 | So I'm asking specifically for the clownfish,
00:18:56.480 | and it actually got the one in the background this time.
00:18:59.520 | This is the first time.
00:19:00.800 | Like, every other time, it didn't get this.
00:19:03.320 | So that is really very cool.
00:19:05.920 | I think Google are actually doing something with the model.
00:19:09.640 | Like, at the moment, they're definitely modifying it
00:19:12.800 | because I'm getting, like, gradually better results
00:19:15.880 | every time, I swear.
00:19:17.440 | So, yeah, it does.
00:19:21.120 | So it says that this is the Clark's clownfish,
00:19:26.480 | and this one is the other one, the,
00:19:28.280 | what did they call it, like, ocean something?
00:19:30.600 | I'm not entirely sure.
00:19:32.000 | So I think this one is wrong, but that one over there,
00:19:33.920 | I don't, I'm not entirely sure.
00:19:36.520 | I don't know what they all look like.
00:19:39.440 | Maybe we can have a quick look.
00:19:42.160 | Not bad. I don't think it is this one, though,
00:19:44.800 | but it does kind of look like it in this picture.
00:19:48.440 | It kind of looks like that,
00:19:49.680 | but I think it's just another one of these
00:19:51.360 | without the other stripe,
00:19:52.760 | or maybe it's a juvenile or something.
00:19:54.760 | I'm not sure.
00:19:57.680 | Let's just check if those are the correct one.
00:20:03.480 | That looks right to me.
00:20:06.280 | If it's not right, I mean, it's close.
00:20:09.840 | So that's cool. Very exciting.
00:20:12.800 | And then, so I was really having issues
00:20:15.480 | with identifying the cleaner wrasse over here.
00:20:19.000 | I don't know if that's how you pronounce it again,
00:20:20.480 | but it just would not,
00:20:22.800 | Gemini could not seem to identify the wrasse.
00:20:25.840 | So let's see if it manages, and it kind of does maybe.
00:20:31.160 | What is this? So let's see what that is.
00:20:38.560 | Oh, amazing.
00:20:42.200 | Google is up to something.
00:20:45.800 | Or should I say DeepMind?
00:20:47.320 | Anyway, so that is pretty cool.
00:20:52.480 | I was happy with the results anyway.
00:20:54.280 | I thought it was just amazing.
00:20:56.600 | But that being said, I'm just super happy
00:20:59.680 | that the results are better
00:21:00.960 | since when I tested this less than 24 hours ago.
00:21:04.800 | So that is just wonderful.
00:21:07.080 | Okay, moving on to the next example.
00:21:10.200 | So another image.
00:21:12.360 | This one, oh, this one it didn't get last time as well,
00:21:16.640 | and I tried so many times, and they have it this time.
00:21:20.520 | So weirdly, when I ran this last time,
00:21:25.920 | so this big one in the middle is SomethingSweetLips, I think.
00:21:30.720 | They have interesting names for all the fish.
00:21:33.000 | So this is a SomethingSweetLips.
00:21:34.880 | I can't remember exactly.
00:21:36.320 | And you can see, well, you probably can't
00:21:38.480 | because this text is really small,
00:21:41.680 | but that here says SweetLips.
00:21:44.880 | So it actually did get it, which is just cool.
00:21:50.400 | Before, it was not even identifying it as a fish.
00:21:53.960 | It was pulling in all the fish from the background,
00:21:56.200 | but not this one, which I found super weird.
00:22:00.320 | So that is amazing that it actually got it now.
00:22:05.280 | Yeah, that's cool.
00:22:06.840 | I even wrote it doesn't catch a very large fish in the middle.
00:22:10.560 | Now it does.
00:22:12.280 | Well done.
00:22:15.520 | Yeah, I love this sort of thing.
00:22:19.440 | Yeah, nice, PaintedSweetLips, that was it.
00:22:29.880 | So PaintedSweetLips, I don't know,
00:22:32.200 | because they all have different patterns and stuff,
00:22:34.280 | and I don't know if this is-- it's definitely the type of fish,
00:22:37.240 | but is it right with the pattern?
00:22:41.120 | I don't know, to be honest.
00:22:43.480 | But if you look, there was one picture
00:22:46.680 | where it looked pretty similar.
00:22:48.080 | Yeah, yeah, yeah.
00:22:49.040 | Also, this is maybe a yellow-banded SweetLips
00:22:52.040 | rather than the painted SweetLips.
00:22:54.640 | But honestly, it's not bad.
00:22:59.960 | Come on.
00:23:00.640 | Yeah, so it would be a yellow-banded SweetLips
00:23:03.800 | instead.
00:23:05.120 | Really very cool.
00:23:06.840 | I'm impressed with that.
00:23:09.160 | So yeah, and then I was like, oh, come on.
00:23:12.600 | Give me-- tell me what that big fish is in the middle.
00:23:16.880 | I mean, it already got it, so I think it's fine.
00:23:19.800 | Yeah, he got it.
00:23:21.480 | But cool that it doesn't identify the other fish
00:23:23.800 | in the background now.
00:23:25.160 | That is pretty exciting.
00:23:27.880 | Cool.
00:23:28.360 | So that's good.
00:23:29.600 | We're going to move on to the next picture.
00:23:32.040 | There's just a couple more here.
00:23:33.960 | Every now and again, this does happen.
00:23:36.160 | So it's not perfect.
00:23:38.360 | But I mean, I was thinking, OK, if you're
00:23:40.760 | going to do tool calling and stuff--
00:23:42.960 | I haven't tried tool calling yet.
00:23:45.720 | I'm not even sure if you can.
00:23:47.360 | I didn't see anything super obvious on how to yet.
00:23:50.520 | But if you're going to do tool calling,
00:23:52.640 | that would kind of resolve the issue.
00:23:54.400 | But anyway, so yeah, this is pretty good.
00:23:57.120 | It's identifying a ton of different fish
00:23:59.160 | all over the place.
00:24:00.200 | I don't know what they all are.
00:24:02.360 | But these are all just--
00:24:03.880 | they're kind of fish that you--
00:24:05.920 | I don't think most divers would even know.
00:24:07.960 | I don't know.
00:24:08.920 | But it does identify these black ones over here.
00:24:11.200 | So this is a nasolitoratus.
00:24:15.880 | So we can have a look at what that is.
00:24:20.360 | So I mean, you kind of look at this and like, oh, no.
00:24:22.640 | It's not the same fish.
00:24:23.920 | But it's actually-- I think it is relatively close
00:24:28.800 | or related in some way.
00:24:30.040 | Because if you look at the fin, they
00:24:32.000 | have this weird fin where it's like a--
00:24:33.920 | I don't know what it is.
00:24:34.920 | But they have that very distinct shape.
00:24:37.840 | And if you look here, you can't really see it very well.
00:24:42.400 | Maybe if I just go and take the image itself,
00:24:46.440 | you might be able to see a little better.
00:24:48.320 | You can kind of see--
00:24:49.840 | not very well, but this fish here
00:24:51.640 | does have that sort of weird tail shape
00:24:54.560 | where it's like it sits up.
00:24:55.800 | And then it has a little two streams at either end.
00:25:01.660 | So potentially accurate.
00:25:04.720 | The front of it looks kind of similar.
00:25:07.280 | I don't know.
00:25:08.560 | But it seems like it's at least related.
00:25:12.520 | I don't know if it is the exact same fish.
00:25:15.000 | But that's pretty cool.
00:25:15.960 | Now, OK, not bad.
00:25:18.360 | Then, oh, yeah, the corals.
00:25:20.320 | Let's see how it is with corals.
00:25:21.680 | Because there's a lot of corals in this.
00:25:23.640 | I have no idea what most of those are or any of--
00:25:27.880 | I don't think I know what any of them are, to be honest.
00:25:30.400 | But yeah, let's see what it comes up with.
00:25:33.220 | Nice, so a ton of things here.
00:25:36.460 | So I think some of these--
00:25:38.700 | so the brain coral kind of doesn't look really like that.
00:25:41.860 | It actually looks more kind of brainy.
00:25:44.100 | Staghorn coral, I think, is correct for this thing,
00:25:46.740 | or at least close to this thing down here.
00:25:50.940 | So if we look at a picture, kind of similar.
00:25:55.740 | If it's not that, it looks pretty similar to--
00:25:59.340 | probably, to be honest, it probably isn't that, actually.
00:26:01.760 | No, look at it.
00:26:02.440 | But it's pretty similar, anyway.
00:26:04.220 | Now, you have the brain-- let me just
00:26:05.680 | show you what a brain coral looks like,
00:26:06.960 | because it doesn't really look like that, I don't think.
00:26:09.320 | So it's different.
00:26:10.600 | It's big, like those other blobs that it's pointing out.
00:26:14.560 | But it doesn't look quite like this.
00:26:18.000 | So I would say no for that, in my opinion.
00:26:21.680 | Maybe I'm wrong.
00:26:22.920 | Then we have-- yeah, staghorn, staghorn, brain.
00:26:29.980 | Yeah, it needs-- yeah, I think that's all it identified here.
00:26:33.780 | So generally, wrong.
00:26:36.380 | But at least it identified--
00:26:38.580 | mostly identified the actual fish, which is kind of cool.
00:26:42.940 | OK, not bad.
00:26:44.380 | It identified a few fishes, corals.
00:26:46.820 | I think it's pretty hard, to be honest.
00:26:48.420 | But not bad, I don't think.
00:26:52.340 | Then, finally, one more picture.
00:26:55.340 | So this picture, if I just pull it up before we label it,
00:27:02.740 | so it's actually to the side, the picture.
00:27:05.820 | So the ground would be down here.
00:27:09.100 | And up here is the surface.
00:27:12.580 | And this is a big ship turret, like a sunken ship turret.
00:27:18.860 | And there is a fish that lives in here.
00:27:22.740 | And you can put a--
00:27:26.380 | the fish keeps his home clean.
00:27:29.900 | So if you put something in there, the fish will grab it.
00:27:32.500 | And he'll throw it out, because he
00:27:34.420 | wants to keep his home clean, which is kind of cool.
00:27:37.860 | So it's kind of hidden, right?
00:27:39.460 | It's hidden inside there.
00:27:40.540 | So I just want to see if Gemini can actually identify that.
00:27:44.820 | And let's see.
00:27:47.780 | Interesting.
00:27:48.480 | So it's-- I think that was an error.
00:27:50.860 | It was running for a long time as well.
00:27:52.520 | So maybe it started repeating.
00:27:54.860 | I didn't-- I haven't had that before.
00:27:56.420 | That's the first time I've had it break on that example
00:28:00.380 | after setting the frequency penalty.
00:28:04.420 | And I was actually quite impressed
00:28:06.980 | by how many fish it pointed out.
00:28:09.980 | There's a lot of fish just in the background.
00:28:12.380 | You can't really see them, like this one here or these.
00:28:17.260 | And it managed to identify them, which
00:28:19.360 | I thought was pretty cool.
00:28:20.640 | And I think this is also a--
00:28:22.400 | I think it's at least a damselfish.
00:28:23.980 | I don't know if it's exactly what this is.
00:28:26.520 | We can have a quick look.
00:28:27.520 | But it got the hidden fish, which is cool.
00:28:31.480 | So that is a Moluccan damselfish.
00:28:38.600 | So the different colors here, but it's basically
00:28:41.920 | a black version of one of these.
00:28:43.680 | I don't know.
00:28:44.760 | I mean, at least to me, it looks--
00:28:46.360 | it's basically the same as these things here.
00:28:49.540 | So even if I just go damselfish, yeah,
00:28:53.940 | you can even see, like, there's a picture in the middle here.
00:28:56.440 | Or it's just like a black one.
00:28:57.680 | But I think they're the same.
00:29:01.000 | It seems pretty similar to me, size-wise, for sure.
00:29:05.680 | So yeah, I mean, that was that.
00:29:08.600 | And then-- oh, yeah, I just want to--
00:29:11.080 | so another thing is, OK, this is kind of a different picture.
00:29:14.240 | It's not just, like, fish and corals.
00:29:15.880 | It's actually-- it's a shipwreck.
00:29:18.120 | So can Gemini identify this as a shipwreck?
00:29:23.160 | So I've just explained--
00:29:25.040 | I've modified the system instructions
00:29:26.800 | to not do the whole bounding box thing.
00:29:29.400 | And said, OK, describe what you see in this image.
00:29:32.120 | Identify fish or coral species in the image.
00:29:34.520 | And tell us how many of each you can see.
00:29:36.200 | So I'm not even saying anything about, OK, look,
00:29:37.920 | there's something.
00:29:38.760 | There's, like, a shipwreck or anything.
00:29:40.360 | I'm just saying what is in this image.
00:29:43.400 | And then I've just said, OK, explain what this image
00:29:45.720 | contains, what's happening, and what is the location.
00:29:48.200 | I wanted to see if it could give, like,
00:29:49.820 | a super accurate location, but it didn't, of course.
00:29:53.200 | I did try and push for it.
00:29:55.440 | So there's a large cylindrical object
00:29:59.720 | with a hole in the center.
00:30:01.080 | It appears to be made of metal, and it's
00:30:02.760 | covered in marine growth, so on and so on.
00:30:05.120 | It is likely this is part of a shipwreck,
00:30:09.880 | and the cylindrical object could be a gun barrel, which
00:30:12.760 | is kind of cool.
00:30:13.680 | So you got that right.
00:30:15.280 | Subtropical, tropical, subtropical.
00:30:18.760 | One fish inside the hole, that's kind of cool.
00:30:23.240 | Then I just wanted to see if it would, like,
00:30:26.960 | guess a specific place.
00:30:31.000 | I knew it probably wouldn't, but who knows.
00:30:35.360 | A section of a shipwreck.
00:30:37.440 | Basically, it just kind of refuses to answer my question.
00:30:41.320 | But that's fine, it did well there, I think.
00:30:45.160 | Maybe if we gave it access to Google Search,
00:30:47.640 | it might actually be able to.
00:30:48.840 | But yeah, I would be impressed.
00:30:51.040 | Anyway, that's it.
00:30:52.840 | So that is our example.
00:30:55.280 | I mean, it does pretty well, in my opinion.
00:30:58.000 | Just, you know, like, as a first test, I am pretty impressed.
00:31:05.160 | I think it's cool.
00:31:06.760 | I think the structured output abilities of the model,
00:31:11.360 | even without tool calling here, pretty impressive.
00:31:14.680 | Of course, it has been fine-tuned to output this structure,
00:31:17.840 | I think, as well, which helps.
00:31:19.920 | But I'm pretty optimistic.
00:31:22.600 | I'm optimistic that this model might do very well as an agent.
00:31:26.360 | And that's mostly what, at least for me,
00:31:28.600 | it's mostly what I'm building with almost all the time,
00:31:31.280 | like building agents.
00:31:32.680 | So I think it could be pretty big, in my opinion.
00:31:39.160 | Especially compared to all the other models
00:31:41.040 | I've sort of tested over the years.
00:31:43.480 | Like, OK, every time OpenAI comes out with a new model,
00:31:46.280 | that's pretty big.
00:31:47.280 | But then the others just don't really
00:31:49.200 | match up in terms of agentic ability all that much.
00:31:53.560 | Or they have some weaknesses that really just kind of stop
00:31:56.720 | you from implementing actual use cases.
00:31:59.320 | So this one from Google, maybe--
00:32:04.160 | let's see-- maybe that model, that finally
00:32:07.960 | gets a good portion of people to move away from OpenAI
00:32:11.880 | and start actually exploring these other models, which
00:32:15.600 | could be pretty cool.
00:32:16.760 | So yeah, that's it for this video.
00:32:19.680 | I hope this has all been useful and interesting.
00:32:22.340 | But for now, I'll leave it there.
00:32:23.720 | So thank you very much for watching.
00:32:25.160 | And I will see you again in the next one.
00:32:27.760 | [MUSIC PLAYING]