
Gemini 2 Multimodal and Spatial Awareness in Python


Chapters

0:00 Gemini 2 Multimodal
0:41 Gemini Focus on Agents
1:53 Running the Code
3:08 Asking Gemini to Describe Images
9:29 Gemini Image Bounding Boxes
21:06 Gemini Spatial Awareness Example 2
23:29 Gemini Spatial Awareness Example 3
26:52 Gemini Spatial Awareness Example 4
29:09 Gemini Image-to-Text
30:50 Google Gemini vs OpenAI GPTs

Transcript

Today, we're going to be taking a look at Google's new Gemini 2 model. Now, Gemini 2 is, I think, probably one of the most impressive LLM releases out there, although I can't form a full opinion on it quite yet; I still need to work with it a little more.

It does seem to be something that could actually get me to stop using OpenAI. Again, I need to test it more, but if it's not that model, it's the closest I think I've ever seen to getting to that point where I'm like, oh, I don't necessarily need OpenAI for a lot of stuff anymore.

So that is really interesting. And one thing I really like is that they're focusing on the agents use case. I think agents are the short-term future of LLMs and AI in general. And I mean, their announcement literally calls it the new AI model for the agentic era.

They're really focusing on that. And one thing I have noticed is that this model produces structured output very well. So we're going to be focusing on the text-to-image and image-to-text modalities, and we're going to be jumping into this multimodal example here. This is in the Aurelio Labs cookbook repo: Gen AI, Google AI, Gemini 2, and then multi-modal under here.

We have a load more Gemini 2 examples coming as well, so there'll be more in here very soon. Now, we're going to be working towards something like this. And there are actually a few other really interesting examples where we're taking some images that are not necessarily super clear.

And we're just going to see, you know, what does it do? Where does Gemini 2 not work very well? Where does it work incredibly well? We'll see a few of those examples. You'll be able to open that notebook in Colab.

And that's probably the easiest way of running through alongside me in this example. But I'm going to run it locally, and you can run it locally as well. There are some setup instructions here, but the easiest is just Colab. If we're running locally, the first thing we need to do is select our environment.

Of course, in Colab, you don't really need to do that. So we run this. I have these four images here, and we're going to see them now. So we run this, and we see each of the four images that we're going to ask about. These are just screen grabs from some diving videos.
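To follow along locally, loading and displaying the images is just Pillow. Here is a minimal sketch; the filenames are placeholders, not the actual paths used in the notebook:

```python
from PIL import Image

# Placeholder filenames: the notebook uses its own image paths.
image_paths = [
    "reef_clownfish.jpg",  # anemones + clownfish
    "reef_sweetlips.jpg",  # the big sweetlips
    "reef_corals.jpg",     # coral-heavy scene
    "wreck_turret.jpg",    # sunken ship turret
]
images = [Image.open(path) for path in image_paths]

for img in images:
    display(img)  # display() is built into Jupyter/Colab notebooks
```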

And, you know, there's a lot going on in here. A lot of the examples I've seen from Google are really clean images. These images are kind of blurry in a lot of cases, with a lot of motion and a lot of noise; there's just a lot going on in all these pictures.

So they're not easy images to work with, and we'll see how Gemini performs against them. First thing, we're just going to ask Gemini to describe what is in these images. We're going with the first image, where you have these anemones and then the clownfish, or anemonefish, in the front here.

There's a cleaner wrasse or something; I'm not entirely sure what you call it. I think these are table corals. There's a lot of stuff in this image that we can ask about: OK, what is in the image? There's also a clownfish over there in the background. I haven't managed to get Gemini to label that one.

It probably won't this time either, but we'll see how things go. Interestingly, running this notebook today compared to yesterday, the results changed quite significantly. I'm not sure if I just got lucky one time, but the results seem a lot better, so maybe Google are doing something behind the scenes.

I don't know. Anyway. So first thing, we're going to need a Google AI Studio API key, and we get that from Google AI Studio. So I'm going to go and open that. You'll need to create an account, and then you just go through.

And, where is it... I think under settings there's API plan information. You come in here and then open this up. This is just going to open a window in GCP, and basically GCP generates this project for you, Gemini API. You can go ahead and just create a credential here.

OK, so it'll ask you which credential type; API key is the one you want. So just click API key, you get your API key, and you paste it in. A little box should pop up, or you can just pop it straight into a string here.
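As a rough sketch of that step, assuming the google-genai SDK the notebook uses (`pip install google-genai`), with getpass standing in for the pop-up box:

```python
import getpass
from google import genai

# Prompt for the key rather than hard-coding it as a string.
api_key = getpass.getpass("Google AI Studio API key: ")
client = genai.Client(api_key=api_key)
```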

Either way, it's up to you. I'm going to go ahead and do that, and that will just initialize the connection to your client. Fairly straightforward; nothing complicated so far. OK, we're going to be using this Gemini 2 Flash model. The "exp" here means experimental. Again, as I said, it's not really intended to be used in production.

I'm not sure if they even allow you to at the moment; I haven't read too much into that. But yes, experimental. The actual production-ready model will hopefully come soon, although Google are very slow with that sort of thing, so who knows. But anyway, we have the Flash model.

So it should be pretty fast. And this is basically how we're using the API, or how we're using the model: we are generating the content. So we have contents. And, at least the way that I understand it, both of these here are independently, I believe, being transformed into user messages.

This is a text-based user message, and this is a user message with just an image. I could be wrong there; maybe they both get combined. I'm not super sure. But yeah, we have both of those. We haven't set a system prompt. Oh, sorry. Did we? Oh, no, we did set a system prompt.

Sorry. So the system prompt we defined here. We define this config object, and we pass our system instruction, or system prompt, in there. The system prompt is just: describe what you see in this image, identify any fish or coral species in the image, and tell us how many of each you can see.

Fairly simple. We set the safety settings, which basically just tell the model not to be too strict; that's kind of what we're doing here. I don't think we really need this for this task, to be honest. Maybe they're just fish, but who knows. Then also, the temperature is pretty low, 0.1.

I did find that, at least for this task, a slightly higher-than-zero temperature setting seems to actually get better results, which is interesting. Usually I would not expect that for a task where the agent needs to produce such structured output. OK, so we have that, and we pass that to our model.
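Put together, the call looks something like this: a minimal sketch assuming the google-genai SDK, with the safety category and threshold values as illustrative guesses rather than the notebook's exact settings:

```python
from google.genai import types
from IPython.display import Markdown, display

config = types.GenerateContentConfig(
    system_instruction=(
        "Describe what you see in this image, identify any fish or coral "
        "species in the image, and tell us how many of each you can see."
    ),
    temperature=0.1,  # slightly above zero seemed to work better here
    safety_settings=[
        # Loosen content filtering; the notebook's exact settings may differ.
        types.SafetySetting(
            category="HARM_CATEGORY_DANGEROUS_CONTENT",
            threshold="BLOCK_ONLY_HIGH",
        ),
    ],
)

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # the experimental Flash model
    contents=["What do you see in this image?", images[0]],
    config=config,
)
display(Markdown(response.text))  # Gemini tends to answer in Markdown anyway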

So we pass in those contents, and one thing with Gemini 2 is that it will generally output everything in Markdown without you even telling it to, which is fine; I like Markdown. So I'm just showing everything in Markdown here. We have the overall scene: it's describing an underwater scene, a coral reef, marine life, coral formations; it's daytime, with natural light filtering through the water.

So that's pretty accurate, I think. It has the clownfish, right? This is an interesting thing: when I'm asking it to describe the scene, it actually seems to identify things very well, but when you ask it to draw the bounding boxes, it doesn't. Or at least before, it wasn't doing quite as well.

So we have the clownfish. There are two clownfish; it didn't notice the one in the back, the one over here. So interesting. And then there's this wrasse. I don't know how you pronounce it, by the way, but I think it's "wrasse." This wrasse is this little thing on the left here, which is pretty tiny.

I think also, if you haven't been diving much, you probably wouldn't even notice this fish. The only reason you notice, if you've been diving a little bit, is because they'll try and kind of bite you and stuff and just be generally annoying. So that fish is kind of ingrained in your memory.

So it's kind of cool that it identified that. You have the coral species: the anemone, which is what the clownfish live in; the hard corals, branching corals, plate corals, and some massive corals. That's kind of cool. And then there are some counts. So, yeah, that's pretty cool.

So that is just describing an image: image-to-text, or rather text plus image to text. Now what we want to do is draw these bounding boxes. This prompt here is mostly adapted from an example that Google provided where they're doing the same thing.

They're drawing the bounding boxes. And what I found is that modifying this prompt much tended to break it pretty easily, which is interesting. I don't know if that's just a prompting thing; maybe I need to get a little better at prompting this model. So, anyway, we're telling it to just return some bounding boxes.

And to do that as a JSON array with labels. Now, this seems to be something it's been trained on, as far as I can tell, because I'm not really defining much here. And if you come down, oh, I don't have the example, but it's producing this structure, which is exactly the structure that we need, essentially.

So it's obviously had a few examples of this in the training or fine-tuning datasets. So we're just asking for bounding boxes. One thing I'm doing here: I don't specify fish, because I want to keep it flexible, but if an object is present multiple times, the prompt says to label them according to their scientific and popular name.
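Roughly, the system prompt amounts to something like this. It's paraphrased from the description above; the notebook's exact wording, adapted from Google's example, may differ:

```python
# Paraphrased bounding-box system prompt; the notebook's exact wording,
# adapted from Google's example, may differ.
bbox_system_prompt = (
    "Return bounding boxes as a JSON array with labels. "
    "Limit to 25 objects. "
    "If an object is present multiple times, label them according to "
    "their scientific and popular name."
)
```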

So with this new system prompt, my first question is: highlight the different fish in this image. And you can see here that the runtime is super long, and the reason for that is that Gemini is repeating itself a lot right now, and I don't know why that is.

Beyond the really simple examples, it does tend to do this. So we'll see, and we'll fix it. But, yeah, you get all of these, which is kind of crazy; there's so much going on there. I actually didn't look at what the output of this would look like.

And to be honest, I'm not sure we can, or not easily, because it cuts off just here. It just stops. Interesting. But it does seem to be getting somewhere; the labels here are things that are in the image, for the most part.

Of course, "unknown fish" is just filler, but, yeah, it's kind of interesting that it repeats a ton of stuff there. Anyway, we can resolve that, and it's super easy to: all I did was add a pretty high frequency penalty here, and that resolved the issue.
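In config terms, that one change looks something like this; the penalty value here is illustrative, not necessarily the notebook's:

```python
config = types.GenerateContentConfig(
    system_instruction=bbox_system_prompt,
    temperature=0.1,
    frequency_penalty=1.0,  # illustrative value; discourages repetition
)
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=["Highlight the different fish in this image.", images[0]],
    config=config,
)
```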

I was also playing around with the temperature. You can increase or decrease this; it doesn't make too much of a difference. And then the only other thing is that we have a limit of 25 objects, but we had that before, and it didn't listen very well.

So, actually, I didn't even modify anything here; this is still the same prompt. All I've done is added the frequency penalty. Interesting. Now let's see what it does. OK, we have this. It's actually labeling a lot more stuff than it was before, which is interesting.

We'll see in a moment how all of that looks. But we have this JSON output, which is what we need. And you can see that Gemini 2 has obviously been trained to do this. We have our bounding boxes, and these are the coordinates, in some order.

We'll see in a moment what they actually are. And then we have a label. Kind of interesting that it got a different clownfish; anemonefish and clownfish are the same, as far as I know. So it's interesting it identified a different one to what it did before. Before, it was usually going with Clark's clownfish for both.

OK. So sometimes this will also output a message; it might have some text, and then it outputs this JSON. And it's probably worth showing you what that text actually looks like, because we're formatting it with Markdown here, so it's not quite the raw output.

So what it actually looks like is this. To extract our JSON, we're just going to be looking for this code block, and we extract whatever is in the middle. We do that with a simple regex here. So we're just looking for this and this, right?

And anything in between. So we look for the first line of the JSON and the final line of the JSON, and we grab everything in between. That's it. We extract that out, and then we just load it with JSON. And then we get this: a list of dictionary objects with all of our coordinates and the labels.
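As a sketch, that extraction step is just a regex over the fenced code block plus json.loads, assuming the response wraps its JSON in a fence:

```python
import json
import re

# Grab everything between the opening ```json fence and the closing fence.
match = re.search(r"```json\n(.*?)\n```", response.text, re.DOTALL)
bounding_boxes = json.loads(match.group(1)) if match else []
```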

Then, this is just wrapping what we did up here into a parse JSON function, to make things a little easier. Then what we're going to do is plot the bounding boxes. There are a few things going on here; most of this is pulled from one of Google's examples where they have a similar setup.

And basically, what we're doing here is pulling out a few different color names so that we can use them when drawing our bounding boxes. So we have colorful bounding boxes, essentially. We don't necessarily need to do that, but we do here.

It is easier when you're trying to read everything, I think. We take the image we're passing in here, but we make a copy of it, because otherwise we would modify the original image when adding those bounding boxes, which is super annoying if you run it more than once.

Then we take the width and height of the image, which we need for the normalization step below. We initialize a drawing object; this is part of the Pillow library. Basically, we take the picture, and we're going to draw on top of it. That's what this draw object allows us to do.
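Here is a sketch of that plotting function, loosely following Google's example. It assumes Gemini's documented convention of [y_min, x_min, y_max, x_max] coordinates normalized to a 0-1000 range, under "box_2d" and "label" keys; the notebook's exact parsing may differ:

```python
from PIL import Image, ImageColor, ImageDraw

def plot_bounding_boxes(image: Image.Image, boxes: list[dict]) -> Image.Image:
    img = image.copy()  # copy so we don't draw on the original
    width, height = img.size  # needed to rescale the normalized coordinates
    draw = ImageDraw.Draw(img)
    colors = list(ImageColor.colormap.keys())  # named colors to cycle through
    for i, box in enumerate(boxes):
        color = colors[i % len(colors)]
        # Assumed convention: [y_min, x_min, y_max, x_max], normalized 0-1000.
        y0, x0, y1, x1 = box["box_2d"]
        left, top = x0 / 1000 * width, y0 / 1000 * height
        right, bottom = x1 / 1000 * width, y1 / 1000 * height
        draw.rectangle([left, top, right, bottom], outline=color, width=3)
        draw.text((left, max(top - 12, 0)), box["label"], fill=color)
    return img

display(plot_bounding_boxes(images[0], bounding_boxes))
```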

We extract our bounding boxes from the LLM output, and then for each bounding box, we go through and draw it on our draw object. Let's run this and see what it gives us. Nice. When you actually look at this, it identified quite a few different fish, but I think they're all pretty accurate.

This is pretty good. You have the clownfish, or two clownfish, here. It identifies them as different species, but they're definitely the same. It does pick up on the clownfish in the background, just identifying it as a clownfish. Then I think those two are actually other fish. I think maybe there are fish here, but I can hardly even... yeah, there are.

I think there are some there, but I can hardly see them. For some reason, none of the models seem to be able to identify this fish in the corner here, maybe because it's partly cut off. And it does identify the wrasse, which is cool. Now, this one's harder, I think: identifying the different corals.

Sometimes it gets some of them OK, and other times it just kind of goes off the rails. So, yeah, we can see. All right, so here you can kind of see; I mean, there's just a lot of stuff going on there. What it's saying here is that this is an Echinopora coral.

We can see it here. Let me... all right, this thing here. Echinopora, right? I mean, I think those things are kind of everywhere, if it is what I think it is, and I wouldn't say it necessarily got any of them here. Maybe over here or here they could be, but, yeah.

So it's just kind of saying that there's loads of that everywhere, which is interesting; I wouldn't say it's necessarily accurate. This one, I have no idea. Not bad. Actually, that's correct. I thought it was just an anemone; I thought this was called, like, giant anemone or something. So, yeah, it got the anemone correct, I think, in some places here.

So it's saying the anemone is this one here, kind of covering the clownfish, and also here, which is interesting. I think it's probably because clownfish live in anemones, so I wonder if it's seeing the clownfish, seeing some corals around it, and thinking, OK, yeah, that's definitely an anemone.

Because it's weird that the rest of the anemones around here it doesn't identify as that, but when there's a clownfish next to one, it does. So that's kind of cool. Again, not perfect, though. But honestly, a few more models down the line, this could be pretty impressive. So now I'm asking specifically for the clownfish, and it actually got the one in the background this time.

This is the first time; every other time, it didn't get this. So that is really very cool. I think Google are actually doing something with the model. At the moment, they're definitely modifying it, because I'm getting gradually better results every time, I swear. So, yeah.

So it says that this is the Clark's clownfish, and this one is the other one; what did they call it, ocean-something? I'm not entirely sure. I think this one is wrong, but that one over there, I'm not entirely sure about. I don't know what they all look like.

Maybe we can have a quick look. Not bad. I don't think it is this one, though, but it does kind of look like it in this picture. It kind of looks like that, but I think it's just another one of these without the other stripe, or maybe it's a juvenile or something.

I'm not sure. Let's just check if those are the correct ones. That looks right to me, and if it's not right, it's at least close. So that's cool. Very exciting. And then, I was really having issues with identifying the cleaner wrasse over here. I don't know if that's how you pronounce it, but Gemini just could not seem to identify the wrasse.

So let's see if it manages, and it kind of does, maybe. What is this? Let's see what that is. Oh, amazing. Google is up to something. Or should I say DeepMind? Anyway, that is pretty cool. I was happy with the results before anyway; I thought they were just amazing.

But that being said, I'm just super happy that the results are better than when I tested this less than 24 hours ago. That is just wonderful. OK, moving on to the next example: another image. This one it didn't get last time either, and I tried so many times, and it has it this time.

So, weirdly, when I ran this last time, this big one in the middle, which is a something-sweetlips, I think (they have interesting names for all the fish), wasn't picked up. I can't remember exactly which sweetlips. And you can see, well, you probably can't because this text is really small, but that text there says "sweetlips."

So it actually did get it, which is just cool. Before, it was not even identifying it as a fish; it was pulling in all the fish from the background, but not this one, which I found super weird. So it's amazing that it actually got it now. Yeah, that's cool.

I even wrote that it doesn't catch the very large fish in the middle. Now it does. Well done. I love this sort of thing. Yeah, nice, painted sweetlips, that was it. So, painted sweetlips. I don't know, because they all have different patterns and stuff; it's definitely the right type of fish, but is it right with the pattern?

I don't know, to be honest. But if you look, there was one picture where it looked pretty similar. Yeah, yeah, yeah. Although this is maybe a yellow-banded sweetlips rather than the painted sweetlips. But honestly, it's not bad. Come on. So it would be a yellow-banded sweetlips instead. Really very cool.

I'm impressed with that. And then I was like, oh, come on, tell me what that big fish is in the middle. I mean, it already got it, so I think it's fine. Yeah, it got it. And it's cool that it doesn't label all the other fish in the background now.

That is pretty exciting. Cool. So that's good. We're going to move on to the next picture; there are just a couple more here. Every now and again, this does happen, so it's not perfect. But I was thinking, OK, if you're going to do tool calling and stuff... I haven't tried tool calling yet.

I'm not even sure if you can; I didn't see anything super obvious on how to yet. But if you're going to do tool calling, that would kind of resolve the issue. Anyway, this is pretty good. It's identifying a ton of different fish all over the place.

I don't know what they all are. They're the kind of fish that I don't think most divers would even know. But it does identify these black ones over here; this is a Naso lituratus. So we can have a look at what that is.

You kind of look at this and think, oh no, it's not the same fish. But I think it actually is relatively close, or related in some way, because if you look at the fin, they have this weird fin; I don't know what it is.

But they have that very distinct shape. And if you look here, you can't really see it very well; maybe if I just take the image itself, you might be able to see a little better. You can kind of see, not very well, but this fish here does have that sort of weird tail shape where it sits up.

And then it has two little streamers at either end. So it's potentially accurate. The front of it looks kind of similar. I don't know, but it seems like it's at least related. I don't know if it is the exact same fish, but that's pretty cool. Now, OK, not bad.

Then, oh yeah, the corals. Let's see how it does with corals, because there are a lot of corals in this one. I don't think I know what any of them are, to be honest. But let's see what it comes up with.

Nice, so a ton of things here. I think some of these are off; the thing labeled brain coral doesn't really look like one. A real brain coral actually looks more, well, brainy. Staghorn coral, I think, is correct, or at least close, for this thing down here. If we look at a picture, it's kind of similar.

If it's not that, it looks pretty similar to it. Although, to be honest, it probably isn't that, actually. But it's pretty similar, anyway. Now, for the brain coral, let me just show you what a brain coral looks like, because I don't think this really looks like one.

So it's different. It's big, like those other blobs that it's pointing out, but it doesn't look quite like this. So I would say no for that, in my opinion. Maybe I'm wrong. Then we have staghorn, staghorn, brain; I think that's all it identified here.

So, on the corals, generally wrong. But at least it mostly identified the actual fish, which is kind of cool. OK, not bad: it identified a few fishes and some corals. I think it's a pretty hard image, to be honest. Then, finally, one more picture. This picture, if I just pull it up before we label it, is actually on its side.

So the ground would be down here, and up here is the surface. This is a big ship turret, a sunken ship turret, and there's a fish that lives in here. The fish keeps his home clean, so if you put something in there, the fish will grab it.

And he'll throw it out, because he wants to keep his home clean, which is kind of cool. So it's kind of hidden, right? It's hidden inside there. I just want to see if Gemini can actually identify that. Let's see. Interesting. I think that was an error.

It was running for a long time as well, so maybe it started repeating. I haven't had that before; that's the first time I've had it break on that example after setting the frequency penalty. And I was actually quite impressed by how many fish it pointed out. There are a lot of fish just in the background.

You can't really see them, like this one here or these, and it managed to identify them, which I thought was pretty cool. And I think this is at least a damselfish; I don't know if it's exactly what it says it is. We can have a quick look.

But it got the hidden fish, which is cool. So it says that is a Moluccan damselfish. The colors here are different, but it's basically a black version of one of these. At least to me, it looks basically the same as these things here. And even if I just search "damselfish," you can see there's a picture in the middle here.

It's just like a black one, but I think they're the same. It seems pretty similar to me, size-wise, for sure. So, yeah, that was that. And then, oh yeah, one more thing: this is kind of a different picture. It's not just fish and corals.

It's actually a shipwreck. So can Gemini identify this as a shipwreck? I've modified the system instructions to not do the whole bounding box thing, and instead said: OK, describe what you see in this image, identify fish or coral species in the image, and tell us how many of each you can see.

So I'm not saying anything about there being a shipwreck or anything; I'm just asking what is in this image. And then I've just said: OK, explain what this image contains, what's happening, and what the location is. I wanted to see if it could give a super accurate location, but it didn't, of course. I did try and push for it.
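That step reuses the same call pattern, just with the describe-style system instruction; a sketch, with the prompt wording paraphrased and the image index as a placeholder:

```python
describe_config = types.GenerateContentConfig(
    system_instruction=(
        "Describe what you see in this image. Identify any fish or coral "
        "species in the image and tell us how many of each you can see."
    ),
    temperature=0.1,
)
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[
        "Explain what this image contains, what is happening, "
        "and what the location might be.",
        images[3],  # the shipwreck turret image (placeholder index)
    ],
    config=describe_config,
)
display(Markdown(response.text))
```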

So: there's a large cylindrical object with a hole in the center; it appears to be made of metal and is covered in marine growth, and so on. It is likely this is part of a shipwreck, and the cylindrical object could be a gun barrel, which is kind of cool.

So it got that right. Tropical or subtropical waters. One fish inside the hole; that's kind of cool. Then I just wanted to see if it would guess a specific place. I knew it probably wouldn't, but who knows. "A section of a shipwreck." Basically, it just kind of refuses to answer my question.

But that's fine; it did well there, I think. Maybe if we gave it access to Google Search, it might actually be able to, and that would be impressive. Anyway, that's it. That is our example. It does pretty well, in my opinion. As a first test, I am pretty impressed.

I think it's cool. The structured output abilities of the model, even without tool calling here, are pretty impressive. Of course, it has likely been fine-tuned to output this structure, which helps. But I'm optimistic that this model might do very well as an agent.

And that's mostly what I'm building almost all the time: agents. So I think it could be pretty big, in my opinion, especially compared to all the other models I've tested over the years. OK, every time OpenAI comes out with a new model, that's pretty big.

But then the others just don't really match up in terms of agentic ability, or they have some weaknesses that really stop you from implementing actual use cases. So maybe this one from Google, let's see, maybe this is the model that finally gets a good portion of people to move away from OpenAI and start actually exploring these other models, which could be pretty cool.

So, yeah, that's it for this video. I hope this has all been useful and interesting. For now, I'll leave it there. Thank you very much for watching, and I will see you again in the next one. Bye.