Gemini 2 Multimodal and Spatial Awareness in Python
Chapters
0:00 Gemini 2 Multimodal
0:41 Gemini Focus on Agents
1:53 Running the Code
3:08 Asking Gemini to Describe Images
9:29 Gemini Image Bounding Boxes
21:06 Gemini Spatial Awareness Example 2
23:29 Gemini Spatial Awareness Example 3
26:52 Gemini Spatial Awareness Example 4
29:09 Gemini Image-to-Text
30:50 Google Gemini vs OpenAI GPTs
00:00:00.000 |
Today, we're going to be taking a look at Google's new Gemini 2 model. 00:00:04.240 |
Now, Gemini 2, I think it's probably one of the most impressive LLM releases out there. 00:00:10.240 |
And although I can't form a full opinion on it quite yet, I still need to work with it a little more. 00:00:17.840 |
It does seem to be something that I would actually stop using OpenAI for, potentially. Again, I need to test it more. 00:00:30.040 |
But if it's not that model, it's the closest I think I've ever seen to getting to that point where I'm like, oh, I don't necessarily need OpenAI for a lot of stuff anymore. 00:00:41.280 |
And one thing I really like that they're doing is they're focusing on the agents use case. 00:00:46.320 |
I think agents are, like, the short-term future of LLMs and AI in general. 00:00:52.720 |
And I mean, their announcement here is literally like it is the new AI model for the agentic era. 00:01:02.280 |
And one thing I have noticed is that this model produces a structured output very well. 00:01:07.480 |
So we're going to be focusing on the text to image and image to text modalities. 00:01:11.760 |
And we're going to be jumping into this multi-modal example here. 00:01:17.280 |
So this is in the Aurelio Labs cookbook repo, Gen AI, Google AI, Gemini 2, and then multi-modal under here. 00:01:24.440 |
We have a load more Gemini 2 examples coming as well. 00:01:31.720 |
Now we're going to be working towards something kind of like this. 00:01:36.160 |
And there's probably actually a few other really interesting examples where we're just taking some images that are not necessarily super clear. 00:01:43.440 |
And we're just going to be able to see, you know, what does it do? 00:01:53.200 |
So in that notebook, you will be able to open it in Colab. 00:01:57.360 |
And that's probably the easiest way of like running through alongside me in this example. 00:02:02.720 |
But I'm going to run it locally and you can run it locally as well. 00:02:06.680 |
There are some setup instructions here, but the easiest is just Colab. 00:02:12.360 |
So if we're running locally, the first thing we need to do is just select our environment. 00:02:17.000 |
Of course, in Colab, you don't really need to do that. 00:02:29.280 |
OK, so we'll just see each of the four images that we're going to ask for. 00:02:34.880 |
So these are just screen grabs from like some diving videos. 00:02:39.160 |
And, you know, there's like there's a lot going on in here. 00:02:42.520 |
I think, you know, a lot of the examples I've seen from Google are like really clean images. 00:02:49.960 |
So these images are kind of blurry in a lot of cases, with a lot of motion 00:02:56.400 |
and a lot of noise. There's just a lot going on in all these pictures. 00:02:59.920 |
So they're not easy images to work with here. 00:03:04.520 |
So we'll see how Gemini performs against them. 00:03:08.080 |
First thing, we're just going to ask Gemini to describe what is in these images. 00:03:13.800 |
So we're going with the first image where you have like these anemones 00:03:19.760 |
and then the clownfish or anemone fish in the front here. 00:03:33.000 |
Like there's a lot of stuff in this image that we can say, OK, what is in the image? 00:03:38.560 |
There is also a clownfish over there in the background. 00:03:41.720 |
I haven't managed to get Gemini to label that one. 00:03:48.400 |
Interestingly, running this notebook today compared to yesterday, 00:03:56.720 |
I'm not sure if I just got lucky one time, but the results seem a lot better. 00:03:59.640 |
So I think maybe Google are doing something. 00:04:06.240 |
So first thing, we are going to need a Google AI Studio API key. 00:04:19.800 |
Right. So you will need to just go and create an account. 00:04:24.200 |
So you create an account and then you just go through. 00:04:26.520 |
And where is it? I think maybe Settings, API plan information. 00:04:33.880 |
You come into here and then you want to go to you and open this up. 00:04:38.040 |
So this is just going to open a window in GCP. 00:04:41.840 |
And basically GCP generates this project for you, Gemini API. 00:04:46.480 |
You can go ahead and just create a credential here. 00:04:56.320 |
You get your API key and you just want to paste it in here. 00:04:59.440 |
A little box should pop up, or you can just pop it straight into a string. 00:05:07.680 |
And that will just initialize the connection to your client. 00:05:16.080 |
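A minimal sketch of the key-loading step, assuming the key lives in an environment variable named GOOGLE_API_KEY (that name and the helper load_api_key are illustrative, not taken from the notebook). If the variable is missing, it falls back to a hidden prompt, which is what shows up as the little input box in a notebook.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from the environment, or prompt for it.

    The env var name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides what you type; in a notebook it renders as an input box
        key = getpass(f"Enter {env_var}: ")
    return key
```

The returned string is what you would then hand to the client when initializing it.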
OK, we're going to be using this Gemini 2 Flash. 00:05:22.720 |
Again, as I said, it's not really intended to be used in production. 00:05:26.560 |
I'm not sure if they allow you to at the moment. 00:05:32.800 |
So the actual sort of production ready model will hopefully come soon. 00:05:37.160 |
Although Google are very slow with that sort of thing. 00:05:47.480 |
And this is basically how we're using the API 00:05:59.120 |
And these here basically, at least the way that I understand this, 00:06:04.400 |
this here is going to be like both of these independently, 00:06:08.920 |
I believe, are being transformed into user messages. 00:06:14.080 |
And this is a user message with just an image. 00:06:35.360 |
And we pass in our system instruction or system prompt in there. 00:06:40.240 |
All right. So the system prompt is just describe what you see in this image, 00:06:42.960 |
identify any fish or coral species in the image 00:06:50.600 |
We set the safety settings, which are basically just 00:06:58.440 |
I don't think we really need this, to be honest, for this. 00:07:03.600 |
Then also, yeah, temperature is pretty low, 0.1. 00:07:11.240 |
At least for this task, a slightly higher than zero 00:07:16.400 |
temperature setting seems to actually get better results, which is interesting. 00:07:20.480 |
So usually I would not expect that for a task 00:07:25.760 |
where the agent needs to produce such structured output. 00:07:37.200 |
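To keep the settings discussed here in one place, something like the following could work. This is a plain dictionary, not an SDK object: build_settings is a hypothetical helper, and the key names are only chosen to mirror the parameters mentioned (system instruction and temperature). The prompt text is limited to the part quoted in the video.

```python
# Hypothetical container for the request settings; a plain dict, not an SDK type.
SYSTEM_PROMPT = (
    "Describe what you see in this image, "
    "identify any fish or coral species in the image"
    # the rest of the prompt is not given in the transcript, so it is left out
)


def build_settings(temperature: float = 0.1) -> dict:
    """Bundle the system prompt with a low, but non-zero, temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    return {
        "system_instruction": SYSTEM_PROMPT,
        "temperature": temperature,
    }
```

Keeping the temperature as an argument makes it easy to compare 0.0 against 0.1 on the same image.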
So the one thing with Gemini 2 is it will generally output everything 00:07:42.120 |
in Markdown without you even telling it to, which is fine. 00:07:46.840 |
So I'm just basically showing everything in Markdown here. 00:07:54.040 |
It's describing it as an underwater scene, coral reef, 00:07:57.160 |
marine life, coral formations, it's daytime, natural light 00:08:14.240 |
it actually seems to identify things very well. 00:08:16.760 |
But then when you're asking it to draw the bounding box, it doesn't. 00:08:20.000 |
Or at least before it wasn't quite doing as well. 00:08:24.360 |
There are two clownfish; it didn't notice the one in the back. 00:08:33.880 |
So I don't know how you pronounce it, by the way, 00:08:37.440 |
So this wrasse, it's this little thing on the left here, 00:08:43.400 |
I think also, if you haven't been diving much, 00:08:50.320 |
The only reason you notice, like, if you've been diving a little bit, 00:08:52.960 |
it's because they'll try and kind of bite you and stuff 00:08:59.080 |
So that fish is just kind of like ingrained in your memory. 00:09:24.960 |
So image to text or text to text and image to text. 00:09:29.040 |
Now, what we want to do is we're going to draw these bounding boxes. 00:09:34.120 |
And so this prompt here, it's mostly adapted from an example 00:09:40.360 |
that Google provided where they're doing the same thing, right? 00:09:46.000 |
And what I found is, like, modifying this much 00:09:49.680 |
tended to break it pretty easily, which is interesting. 00:09:54.240 |
So I don't know if that's just a prompting thing. 00:09:56.160 |
Maybe I need to get a little better at prompting this model. 00:09:59.880 |
So, anyway, we're telling it to just return some bounding boxes, OK? 00:10:05.080 |
And they do that as a JSON array with labels. 00:10:08.440 |
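As a rough illustration of what a JSON array with labels might look like, here is a small check that each item has four coordinates and a text label. The key names box_2d and label and the sample values are assumptions made for this sketch; the exact structure is not spelled out in the video.

```python
import json

# Illustrative response only; key names and values are assumed, not from the notebook.
example = '[{"box_2d": [100, 200, 300, 400], "label": "clownfish"}]'


def is_valid_box(item) -> bool:
    """Check that an item has a string label and four numeric coordinates."""
    if not isinstance(item, dict):
        return False
    box = item.get("box_2d")
    label = item.get("label")
    if not isinstance(label, str) or not isinstance(box, list) or len(box) != 4:
        return False
    # bool is a subclass of int, so exclude it explicitly
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in box)


boxes = [b for b in json.loads(example) if is_valid_box(b)]
```

Filtering like this means one malformed item does not stop the rest from being drawn.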
Now, this seems to be something it's been pre-trained on, 00:10:16.800 |
And it is, if you come down, oh, I don't have the example. 00:10:25.400 |
which is exactly the structure that we need, essentially. 00:10:29.600 |
So it's obviously -- it's had a few examples of this 00:10:45.640 |
because I also want to just kind of keep it flexible. 00:10:50.800 |
label them according to their scientific and popular name. 00:10:53.760 |
And so my first question with this new prompt -- 00:10:59.960 |
My first question is highlight different fish in this image. 00:11:03.240 |
And you're seeing here, like, the runtime is super long, right? 00:11:08.720 |
And the reason for that is that Gemini is repeating itself 00:11:14.600 |
right now a lot, and I don't know why that is. 00:11:22.520 |
like the really simple examples, it does tend to do this. 00:11:28.120 |
But, yeah, you get all of these, which is kind of crazy. 00:11:34.280 |
I actually didn't look at what the output of this would look like. 00:11:41.000 |
or not easily, because it cuts off just here. 00:11:46.760 |
But it does seem to be getting -- like, the labels here are -- 00:11:50.200 |
you know, they are things that are in there for the most part. 00:11:52.280 |
Of course, unknown fish is just -- but, yeah, I don't know. 00:11:56.920 |
It's kind of interesting that it just repeats a ton of stuff there. 00:12:00.200 |
But, anyway, we can resolve that, and it's super easy to. 00:12:03.400 |
All I did was add a pretty high-frequency penalty here, 00:12:10.520 |
I was also playing around with the temperature. 00:12:16.080 |
And then, yeah, I mean, the only other thing is, okay, 00:12:19.480 |
but we had that before, and it didn't listen very well. 00:12:22.520 |
So, actually, I didn't even modify anything here, right? 00:12:27.440 |
So all I've done is added the frequency penalty, right? 00:12:38.240 |
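For intuition on why a frequency penalty stops the repetition, here is a toy version of the commonly described formulation: each token's score is lowered in proportion to how many times it has already been generated. How Gemini applies it internally is not stated in the video, so treat apply_frequency_penalty and the numbers as illustrative only.

```python
from collections import Counter


def apply_frequency_penalty(
    logits: dict[str, float], generated: list[str], penalty: float
) -> dict[str, float]:
    """Lower each token's score by penalty * (times it was already generated)."""
    counts = Counter(generated)  # missing tokens count as 0
    return {tok: score - penalty * counts[tok] for tok, score in logits.items()}


logits = {"fish": 2.0, "coral": 1.5}
adjusted = apply_frequency_penalty(logits, ["fish", "fish", "fish"], penalty=0.5)
```

After three repeats, "fish" falls below "coral", so the model is pushed toward something it has not already said.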
It's actually labeling a lot more stuff than it was before, 00:12:44.880 |
Let's see how -- well, we'll see in a moment how all of that looks. 00:12:54.960 |
like, Gemini 2 has obviously been trained to do this. 00:13:00.520 |
You have the -- these are the coordinates in some order. 00:13:03.840 |
We'll see in a moment what they actually are. 00:13:08.320 |
Kind of interesting that it got a different clownfish. 00:13:12.080 |
Anemonefish and clownfish are the same, as far as I know. 00:13:16.000 |
So interesting it identified a different one to what it did before. 00:13:21.240 |
Before, it was usually going with Clark's clownfish for both. 00:13:24.600 |
Okay. So what we need to do is sometimes this will also output a message, right? 00:13:30.800 |
So it might have some text, and then it outputs this JSON here, right? 00:13:36.680 |
And it's probably worth me just showing you what that, you know, 00:13:42.600 |
text actually looks like, because it's not -- 00:13:46.400 |
you know, we're formatting it with Markdown here, right? 00:13:58.880 |
we're just going to be looking for this, like, code block. 00:14:02.920 |
And we extract out whatever is in the middle. 00:14:05.520 |
And we do that with regex here, like a simple regex. 00:14:09.720 |
So we're just looking, okay, we're looking for this and this, right? 00:14:16.040 |
So we're looking, yeah, first line of the JSON, final line of the JSON, 00:14:23.440 |
So we do that, we extract that out, and then we just load it with JSON. 00:14:30.440 |
So it's just a dictionary -- sorry, a list of dictionary objects 00:14:37.880 |
Then, okay, this is just wrapping what we did just up here 00:14:43.240 |
into a parse JSON function, just to make things a little easier. 00:14:47.640 |
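A sketch of what such a parse function could look like, assuming the model wraps its JSON in a Markdown code block tagged json. The exact regex in the notebook is not shown, so this pattern is a guess at the same idea, and the fallback to parsing the raw text is an addition for the case where there is no code block.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the sketch can sit inside a fence
PATTERN = re.compile(FENCE + r"json\s*\n(.*?)\n\s*" + FENCE, re.DOTALL)


def parse_json(text: str):
    """Pull the JSON out of a Markdown code block and load it."""
    match = PATTERN.search(text)
    # Fall back to the whole text when the model returns bare JSON
    payload = match.group(1) if match else text
    return json.loads(payload)
```

Because the search is non-greedy, any message text before or after the code block is ignored.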
Then what we're going to do is plot the bounding boxes, right? 00:14:54.480 |
Most of this is pulled from one of Google's examples 00:15:06.560 |
okay, here we're just saying we pull out a few different colors 00:15:10.240 |
or color names so that we can use them when drawing our bounding boxes. 00:15:14.520 |
So we have, like, colorful bounding boxes, essentially. 00:15:16.920 |
We don't necessarily need to do that, but we do here. 00:15:19.480 |
It is easier when you're trying to read everything, I think. 00:15:22.840 |
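The color step can be as simple as cycling through a short list of names so every box gets one, even when there are more boxes than colors. The list below and the helper assign_colors are placeholders for this sketch; the notebook pulls its color names from elsewhere.

```python
from itertools import cycle

# Placeholder palette; any color names your drawing library accepts would do.
COLORS = ["red", "green", "blue", "orange", "purple", "cyan"]


def assign_colors(labels: list[str]) -> list[tuple[str, str]]:
    """Pair each label with a color, wrapping around when colors run out."""
    # zip stops at the end of labels, so the infinite cycle is safe here
    return list(zip(labels, cycle(COLORS)))
```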
We take the image, right, we're passing it into here, 00:15:27.920 |
because otherwise we would modify the original image 00:15:32.320 |
which is super annoying if you run it more than once. 00:15:35.680 |
Then we're just taking the width and height of the image, 00:15:38.400 |
which we need for the normalization step below here. 00:15:50.720 |
That's what this draw image will allow us to do. 00:15:52.760 |
Extract our bounding boxes from the LLM output. 00:15:59.240 |
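The normalization step can be sketched like this. It assumes the coordinates come back as [y_min, x_min, y_max, x_max] on a 0 to 1000 scale, which is my understanding of the convention in Google's examples rather than something stated in the video, so check it against your own output. The helper name to_pixel_box is made up for illustration.

```python
def to_pixel_box(box_2d, width: int, height: int, scale: int = 1000):
    """Convert a normalized box to (left, top, right, bottom) in pixels.

    Assumes box_2d is [y_min, x_min, y_max, x_max] on a 0..scale range.
    """
    y_min, x_min, y_max, x_max = box_2d
    left = int(x_min * width / scale)
    right = int(x_max * width / scale)
    top = int(y_min * height / scale)
    bottom = int(y_max * height / scale)
    # Guard against the model swapping min and max
    left, right = sorted((left, right))
    top, bottom = sorted((top, bottom))
    return left, top, right, bottom
```

For a 1920 by 1080 image, to_pixel_box([100, 200, 300, 400], 1920, 1080) gives (384, 108, 768, 324), which is the rectangle you would hand to the drawing call.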
we're going to go through and, like, well, draw them 00:16:11.000 |
Nice. Okay, so actually when you do look at this, 00:16:17.640 |
but I think they're all actually pretty accurate, right? 00:16:22.600 |
So you have the clownfish or two clownfish here. 00:16:30.880 |
It does pick up on the clownfish in the background 00:16:36.200 |
Then I think that those two are actually other fish. 00:16:38.920 |
I think maybe there's fish here, but I can hardly even -- 00:16:49.880 |
seem to be able to identify this fish in the corner here, 00:16:55.520 |
and it does identify the wrasse, which is cool. 00:17:03.920 |
Sometimes this kind of gets some of them okay, 00:17:06.880 |
and then other times it just kind of goes off the rails. 00:17:12.960 |
All right, so here, yeah, you can kind of see. 00:17:18.560 |
I mean, there's just a lot of stuff going on there. 00:17:20.400 |
So what it's saying here is that this is an Acropora coral. 00:17:26.400 |
We can see it here. Let me -- All right, this thing here. 00:17:33.480 |
So, yeah, I mean, I think those things are kind of everywhere, 00:17:39.120 |
and I wouldn't say it necessarily got any of them here. 00:17:43.800 |
Maybe, like, over here or here they could be, but, yeah. 00:17:48.640 |
that there's loads of that everywhere, which is interesting. 00:18:16.480 |
So it's saying the anemone is, like, this one here, 00:18:27.160 |
I think probably because clownfish live in the anemones, 00:18:35.200 |
okay, yeah, it's definitely an anemone, right? 00:18:38.800 |
Because it's weird that the rest of them around here 00:18:42.760 |
but then when there's a clownfish next to it, it does. 00:18:45.880 |
So that's kind of cool. Again, not perfect, though, right? 00:18:48.200 |
But honestly, like, a few more models down the line, 00:18:53.760 |
So I'm asking specifically for the clownfish, 00:18:56.480 |
and it actually got the one in the background this time. 00:19:05.920 |
I think Google are actually doing something with the model. 00:19:09.640 |
Like, at the moment, they're definitely modifying it 00:19:12.800 |
because I'm getting, like, gradually better results 00:19:21.120 |
So it says that this is the Clark's clownfish, 00:19:28.280 |
what did they call it, like, ocean something? 00:19:32.000 |
So I think this one is wrong, but that one over there, 00:19:42.160 |
Not bad. I don't think it is this one, though, 00:19:44.800 |
but it does kind of look like it in this picture. 00:19:57.680 |
Let's just check if those are the correct ones. 00:20:15.480 |
with identifying the cleaner wrasse over here. 00:20:19.000 |
I don't know if that's how you pronounce it again, 00:20:25.840 |
So let's see if it manages, and it kind of does maybe. 00:21:00.960 |
since when I tested this less than 24 hours ago. 00:21:12.360 |
This one, oh, this one it didn't get last time as well, 00:21:16.640 |
and I tried so many times, and they have it this time. 00:21:25.920 |
so this big one in the middle is something sweetlips, I think. 00:21:30.720 |
They have interesting names for all the fish. 00:21:44.880 |
So it actually did get it, which is just cool. 00:21:50.400 |
Before, it was not even identifying it as a fish. 00:21:53.960 |
It was pulling in all the fish from the background, 00:22:00.320 |
So that is amazing that it actually got it now. 00:22:06.840 |
I even wrote it doesn't catch a very large fish in the middle. 00:22:32.200 |
because they all have different patterns and stuff, 00:22:34.280 |
and I don't know if this is-- it's definitely the type of fish, 00:22:49.040 |
Also, this is maybe a yellow-banded sweetlips 00:23:00.640 |
Yeah, so it would be a yellow-banded sweetlips 00:23:12.600 |
Give me-- tell me what that big fish is in the middle. 00:23:16.880 |
I mean, it already got it, so I think it's fine. 00:23:21.480 |
But cool that it doesn't identify the other fish 00:23:47.360 |
I didn't see anything super obvious on how to yet. 00:24:08.920 |
But it does identify these black ones over here. 00:24:20.360 |
So I mean, you kind of look at this and like, oh, no. 00:24:23.920 |
But it's actually-- I think it is relatively close 00:24:37.840 |
And if you look here, you can't really see it very well. 00:24:42.400 |
Maybe if I just go and take the image itself, 00:24:55.800 |
And then it has a little two streams at either end. 00:25:23.640 |
I have no idea what most of those are or any of-- 00:25:27.880 |
I don't think I know what any of them are, to be honest. 00:25:38.700 |
so the brain coral kind of doesn't look really like that. 00:25:44.100 |
Staghorn coral, I think, is correct for this thing, 00:25:55.740 |
If it's not that, it looks pretty similar to-- 00:25:59.340 |
probably, to be honest, it probably isn't that, actually. 00:26:06.960 |
because it doesn't really look like that, I don't think. 00:26:10.600 |
It's big, like those other blobs that it's pointing out. 00:26:22.920 |
Then we have-- yeah, staghorn, staghorn, brain. 00:26:29.980 |
Yeah, it needs-- yeah, I think that's all it identified here. 00:26:38.580 |
mostly identified the actual fish, which is kind of cool. 00:26:55.340 |
So this picture, if I just pull it up before we label it, 00:27:12.580 |
And this is a big ship turret, like a sunken ship turret. 00:27:29.900 |
So if you put something in there, the fish will grab it. 00:27:34.420 |
wants to keep his home clean, which is kind of cool. 00:27:40.540 |
So I just want to see if Gemini can actually identify that. 00:27:56.420 |
That's the first time I've had it break on that example 00:28:09.980 |
There's a lot of fish just in the background. 00:28:12.380 |
You can't really see them, like this one here or these. 00:28:38.600 |
So the different colors here, but it's basically 00:28:46.360 |
it's basically the same as these things here. 00:28:53.940 |
you can even see, like, there's a picture in the middle here. 00:29:01.000 |
It seems pretty similar to me, size-wise, for sure. 00:29:11.080 |
so another thing is, OK, this is kind of a different picture. 00:29:29.400 |
And said, OK, describe what you see in this image. 00:29:36.200 |
So I'm not even saying anything about, OK, look, 00:29:43.400 |
And then I've just said, OK, explain what this image 00:29:45.720 |
contains, what's happening, and what is the location. 00:29:49.820 |
a super accurate location, but it didn't, of course. 00:30:09.880 |
and the cylindrical object could be a gun barrel, which 00:30:18.760 |
One fish inside the hole, that's kind of cool. 00:30:37.440 |
Basically, it just kind of refuses to answer my question. 00:30:58.000 |
Just, you know, like, as a first test, I am pretty impressed. 00:31:06.760 |
I think the structured output abilities of the model, 00:31:11.360 |
even without tool calling here, pretty impressive. 00:31:14.680 |
Of course, it has been fine-tuned to output this structure, 00:31:22.600 |
I'm optimistic that this model might do very well as an agent. 00:31:28.600 |
it's mostly what I'm building with almost all the time, 00:31:32.680 |
So I think it could be pretty big, in my opinion. 00:31:43.480 |
Like, OK, every time OpenAI comes out with a new model, 00:31:49.200 |
match up in terms of agentic ability all that much. 00:31:53.560 |
Or they have some weaknesses that really just kind of stop 00:32:07.960 |
gets a good portion of people to move away from OpenAI 00:32:11.880 |
and start actually exploring these other models, which 00:32:19.680 |
I hope this has all been useful and interesting.