
AI Engineering with the Google Gemini 2.5 Model Family - Philipp Schmid, Google DeepMind


Transcript

Okay, hi everyone. Welcome, welcome. So welcome to our workshop, AI Engineering with the Google Gemini 2.5 family. As it is a workshop, we are going to keep it super hands-on, so please keep your computers open. You don't need any Google Cloud account. You can use your personal Gmail.

It will be completely free for you to use. So that's the point. Before we get started, can you maybe help me understand how many of you have used Google Gemini before? Oh wow, that's cool. That's a lot of hands. More than the last time I gave a talk like this.

So what we are going to do, we have like three slides, so don't worry, not too much. But we are going to focus on Gemini 2.5. So there's a Gemini 2.5 Pro model and a Gemini 2.5 Flash model. We are going to use the Flash model as it's available for free tier via API access.

So we are going to do coding. And both models are multimodal by default, meaning they can understand text, images, audio, videos, documents, and can generate text. We also have Gemini models which can generate images, which we are also going to use. And we have now Gemini models which can generate audio.

So you can create speech from text. If you are curious where you can find those nice model cards with all of the features, the context window, the output tokens: it's in the Gemini docs. As mentioned, we have two new text-to-speech models since Google I/O last week, or the week before.

Those are really cool. You will try and see them later. And now for the important details. So I created a Slack channel. If you are on the AI engineering Slack channel, you should be able to find it. Feel free to use it during the workshop. You can ask questions.

I'll try to check regularly to answer them. Or even afterwards, if you have questions when you complete the workshop at home or next week, I will take a look and make sure that you get all of the answers. And then we have one QR code for AI Studio. You can also just enter ai.dev or ai.studio in your browser, which will bring you to AI Studio.

And the other link is -- so there's a GitHub repository with the workshop we are going to do. I can now switch directly to it. So let's hope the Wi-Fi gives us some freedom. The good part about the workshop is that we have Google Colabs, so there's not a lot of downloading happening.

And it will all run in the Colab environment if you have a Google account. And the other thing we need to do is go to AI Studio to generate an API key. So the GitHub repository is now loaded. In the GitHub repository, we have notebooks. Can you use the slice piece?

Yeah, of course. Sorry. So in the GitHub repository, we have a notebooks folder, which includes all of our four workshop sections, plus a 00 notebook, which is basically some short instructions on how to set up AI Studio, how to get an API key, and how to send your first request. And then the first section will be all about text generation, getting started, getting familiar a bit with the SDK.

The second part will be all about multimodality. How can Gemini understand images, video, audios? How can we generate images or audio? And then the third section will be about function calling, structured outputs, the native tools. How can I integrate Google search into it? And then I guess with all of the hype currently going on, we look at how you can integrate MCP servers together with Gemini, using it as a model to call the different tools.

Also very nice, there's a solutions folder. The solutions folder includes the same notebooks, but with the solutions. All of the notebooks include to-do text and also some code snippets and some comments. So there's a mix between working code snippets, code snippets which have some pointers, and straight-up exercises with to-dos for you to do.

So I will work with you through the existing snippets, and then everyone can work on the exercises. The idea is that we try to use maybe 30 minutes per section. If you, for example, are already very familiar with what you can do with text generation and would rather look at the multimodality parts or the function calling parts, feel free to jump directly into that section.

And in general, we want to keep it very open, very dynamic. If you have questions related to the content, or maybe unrelated, please keep them coming, ask them in Slack, raise your hands. I'm not sure, maybe we have some microphones here as well, so we can give one to you to make it super interactive.

So I guess let's get started. If you go to the notebooks, there's also a Colab button you can click, which opens the notebook directly in Google Colab. And if you prefer a local Jupyter environment, you can try to clone the repository. I'm not sure if the Wi-Fi will cooperate or not.

I guess Colab will be the easiest. And as mentioned before, the only requirement you basically have is a working Google account. Can be your corporate one, can be your private one, can be one you create in the next five minutes. Again, the first step is what we need to do is basically go to AI Studio.

For the ones who are not familiar with AI Studio, AI Studio is our developer platform to quickly test the models, to experiment with the models, and also keep it very similar to the development code you will be using. So if I try to run a request, like maybe let's ask something.

What's the AI Engineering Summit? I can, on the right side, for example, enable native tools connected with Google Search. I have our Flash preview model. I can run the request, and we'll see how fast the model is thinking. And the nice part here is I can also directly get the SDK code from our request as soon as it's ready.

So if you are experimenting in AI Studio and you want to convert it into a Python script or want to play around with it, extend it, that's all possible. So the AI Engineering Summit refers to several events focusing on artificial intelligence and engineering. That's great. And that also matches the one from New York, which was done this February.

Cool. So what you need to do to get your API key, at the top right, is -- I can also make that bigger, maybe it's easier. So we have a Get API key button at the top. You go to it. I'm sorry that it's in German. But in the top right corner, there should be a Create API key button, a blue one.

And in there, you should -- it should open some -- I can click it -- should open some modal, where you can either select your Google Cloud project. If there's none, you should be able to create one. If -- yep? Can you switch to my project? Yep. Of course.

Um. Appearance. Wait. Okay. Good idea? Okay. Once you select your Google Cloud project or create one, you should be able to create an API key, and once it is created, you should have it available as a popup. If not, you can scroll down a bit; there are your API keys. And then the second step would be to go into Colab and to go to the left side in the navigation.

So I can also -- let me quickly change that to light mode as well. But anyways, on the left side there's a key icon, which is called Secrets, and there you enter a name, which is GEMINI_API_KEY, and then the value of your API key. All of what I walked through is also part of the 00 setup and authentication notebook.

So if that was too fast, you can look it up. There should be a screenshot of it, where I clicked, and what to edit. If you are running locally, you need to expose the API key as an environment variable with the same name. That should also be part of our notebook.

So in the first cell, basically what we try here is we check whether we are in Google Colab. If we are in Colab -- yep? Yeah, no, don't worry, we go through it once, and then you have enough time, like five to ten minutes, to set it up yourself.

Okay? Quickly that we -- Can we use the font? Yep. Is there any PowerPoint presentation? No. No PowerPoint. Code only. So we go through the API key setup in a minute. You have plenty of time to do it yourself, and if you have questions, I'm very happy to come to your place and help you get it created.

Just to complete the setup: once you have created your API key and made it available to Colab, or made it available in your environment, the best thing is to open the first notebook. It has a super small code snippet in it, which uses your API key and uses Gemini 2.5 Flash to generate a first string.
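A minimal sketch of what that setup cell does, assuming the google-genai Python SDK and the GEMINI_API_KEY secret name from above; the model ID is an assumption, use whatever the notebook defines. The later sketches reuse this client and MODEL_ID.

```python
# Hedged sketch: read the API key from Colab secrets or the environment,
# create a client, and send a first request.
import os
import sys

from google import genai
from google.genai import types

if "google.colab" in sys.modules:
    # Running inside Google Colab: read the key from the Secrets panel.
    from google.colab import userdata
    api_key = userdata.get("GEMINI_API_KEY")
else:
    # Running locally: expose GEMINI_API_KEY as an environment variable.
    api_key = os.environ["GEMINI_API_KEY"]

client = genai.Client(api_key=api_key)
MODEL_ID = "gemini-2.5-flash"  # assumed model ID, swap for the one in the notebook

response = client.models.generate_content(
    model=MODEL_ID,
    contents="Say hello in one short sentence.",
)
print(response.text)
```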

So our goal for the next five to ten minutes is to get this working. Okay? And again, going back, we have those QR codes. One is for AI Studio, it's the left one. The other one is for the GitHub repository. You can also go to ai.dev to enter AI Studio, or go through Google Search.

And you can find the GitHub repository on my GitHub account. It's called something like gemini-2.5-ai-engineering-workshop, and I will, in the meantime, change the appearance. Yeah, sorry. So the notebooks are in the GitHub repository, if you go to the notebooks folder. And at the top of each notebook, there's a button which opens the notebook directly in Colab.

And then you can do it. Okay, thank you. Quick check: are we ready? Yeah. Any no's? Okay, cool. So the first section will be all about the defaults, basically. LLMs started out being text only; we generated text. So the first section basically covers: how can I generate text?

How can I generate text and have a streaming response? How can I count my tokens? It's always important, right, to understand how many tokens did I use and how much will it cost. And there are a few exercises for you to try out different models and different prompts.

It will also go a bit into detail on how the SDK works in terms of which inputs you can provide. So in the Google Gen AI SDK, we have this concept of a client. The client has a models abstraction, and the models abstraction has the methods generate_content and generate_content_stream.

And I can also make it maybe a bit bigger. It has a parameter for the model, and the model ID is basically the Gemini model we want to use, which is defined at the top. So all of those cells use the same concept; all of the workshop sections follow the same pattern.

If you think, okay, 2.5 flash is not the right model for you, you can change it to a different model ID. If you have like a paid account and want to use the pro version, you can also change it. And content is basically our way to provide data or conversations, chats, and messages to Gemini.

So the first test basically is: we ask it to generate three names for a coffee shop that emphasizes sustainability. We use client.models.generate_content, we have our model ID and prompt, and then we get our response from Gemini. And if you have already set up everything, you can try prompting a few things, ask it to explain some terms, or maybe just change the model ID, and then we continue with counting tokens.

So there are exercises in there which don't have any code snippets. The solutions part of the workshop has the code. So if you are getting stuck or if you want to look it up what I added, feel free to take a look there. But definitely try it first yourself.

If you want to get familiar with the SDK, there are plenty of other snippets. So the exercises are basically just to make sure that you understand the concepts and can practice them. And there are other cells which are partially done, for example the one we have here, which has some code comments and also some to-do comments.

Here the idea is really that you learn the new APIs yourself. So next to the generate_content method, there's also a count_tokens API, which we can use to count our tokens. Similar to our generate method, we provide our model ID and then our prompt. And basically what the API does is count only the tokens for our prompt, since we haven't generated anything yet.

So we can run it and we get an input token count of 11. The Gemini tokenizer basically converted those nine words plus a full stop into 11 tokens, which is then an estimate of roughly $0.00002. The count_tokens API doesn't expose the pricing, so basically what I did here is look up the 2.5 Flash pricing and calculate it.
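A rough sketch of that call, reusing the client and MODEL_ID from the setup sketch; the price constant is a placeholder, not official pricing, so check the pricing page yourself.

```python
# count_tokens only tokenizes the input; it does not expose pricing,
# so the cost estimate below is a manual calculation with an assumed price.
prompt = "Generate three names for a coffee shop that emphasizes sustainability."

token_count = client.models.count_tokens(model=MODEL_ID, contents=prompt)
print(token_count.total_tokens)

INPUT_PRICE_PER_1M_USD = 0.15  # placeholder, look up the current Gemini 2.5 Flash price
print(f"~${token_count.total_tokens / 1_000_000 * INPUT_PRICE_PER_1M_USD:.8f}")
```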

Similar to only counting the input tokens, we oftentimes also want to count the output tokens to understand, okay, how much does it cost? So in the next example, we basically generate content, and each response has a very nice abstraction called .text, which allows us to easily access the generation, but it also has a usage metadata object.

And the usage metadata object includes all of our consumed and generated tokens. So we have input tokens and we have thought tokens. Gemini 2.5 is a thinking model: before generating your response, it first generates thinking tokens, basically an abstraction where it uses more compute to have more room to generate a good answer for you.

And then there are also the candidate tokens, which are the response tokens at the end. And that's how we can calculate the total cost of a request: we use the input token price for the prompt tokens and the output token price for the thought and candidate tokens. And for this case, it would be less than 0.2 cents.
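A sketch of reading the usage metadata and estimating the total cost; the two price constants are assumptions, and thoughts_token_count can be None when thinking is disabled.

```python
response = client.models.generate_content(
    model=MODEL_ID,
    contents="Write a haiku about the ocean.",
)
usage = response.usage_metadata
print(usage.prompt_token_count)      # input tokens
print(usage.thoughts_token_count)    # thinking tokens (may be None if thinking is off)
print(usage.candidates_token_count)  # response tokens

INPUT_PRICE_PER_1M, OUTPUT_PRICE_PER_1M = 0.15, 3.50  # assumed USD prices per 1M tokens
output_tokens = (usage.thoughts_token_count or 0) + usage.candidates_token_count
cost = (
    usage.prompt_token_count / 1_000_000 * INPUT_PRICE_PER_1M
    + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_1M
)
print(f"~${cost:.6f}")
```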

Yeah. Is 2.5 Flash also a thinking model? Yeah. Okay. Yep. I just see different multipliers. Yeah. Is there a different price for thought tokens in your estimate? No. So input and output tokens are priced differently. Yes. So we have prompt tokens, which are basically the input tokens of your prompt, and then we have the candidate tokens, which are your response, and the thought tokens.

And those basically have the same pricing. And the output tokens are much more expensive than the input tokens, because that's where the computation mostly happens, and the input is just one encoding. So that's why you always see the big difference in input pricing versus output pricing.

More? Yeah. So for Gemini 2.0 Flash, which is our most cost-effective and cheapest model, the input price for one million tokens is 10 cents, and the output price is 40 cents. So you mean why we got 639 thought tokens? That's sadly not directly visible to us.

We can look up thought summaries. But basically the model first generates a lot of reasoning. Of course, in our case we asked it to generate a haiku, which might not be the most difficult prompt. You can control the thought tokens with something called the thinking budget, where you can limit how many tokens the model has to think or reason.

So you have some sort of cost control, but it's basically done dynamically based on your prompt. Okay. Yeah, there's also a question. Yeah, I was just curious about the price for the output tokens. I was just looking at the documentation that says $3.50 for thinking. Yes, so let me open the docs and maybe make it easier.

So Gemini 2.5 Flash is a hybrid model, so you can use it with thinking and without thinking. Without thinking, the computation is basically much more cost effective. As you might know, transformers are all based on attention, which is an n-by-n kind of calculation.

So it gets bigger and bigger, which means it gets more and more compute intensive. And without thinking, it's much easier and more efficient for us to run. So if you set the thinking budget to zero, you will have zero thought tokens, but you still have your candidate tokens.

And those candidate tokens will then be charged at $0.60 per million. But once you use thinking, meaning a thinking budget greater than zero, you will pay the thinking price for the thinking tokens and for the output tokens. And that's where the $3.50 comes from. Yes, I will open the documentation so we'll see it in one second.

And how do you turn thinking on or off? Do you have to control it with the budget, or is there a flag? I saw you have some kind of config available on the generate content. Yeah, so in any case, if you have any questions, the Gemini docs are a great way to find the answers, or just ask Google or Gemini directly.

And under the model capabilities, we have the thinking section, and there are the thinking budgets. If you want to disable thinking, you basically set the thinking budget to zero. You can control it; it can be an integer between zero and 24,000. And setting the thinking budget to zero disables thinking.

So that's your way to disable thinking. Is the thinking budget like a number of reasoning traces to do? So if I say 4? No, it's tokens. So we have seen in our example here, we had a bit more than 600 thought tokens.

So if we set our thinking budget to 512, it would be a maximum of 512 thinking tokens. But how does it know how many reasoning traces to do, or is that not controlled? Does it spin out further reasoning traces, or is that defined? No. Okay, I see, that's just the intermediate number of tokens for the chain of thought, I guess.
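A sketch of setting the thinking budget (a token limit, not a number of reasoning traces); zero disables thinking.

```python
response = client.models.generate_content(
    model=MODEL_ID,
    contents="Write a haiku about autumn.",
    config=types.GenerateContentConfig(
        # The budget caps how many thought tokens the model may spend; 0 turns thinking off.
        thinking_config=types.ThinkingConfig(thinking_budget=512)
    ),
)
print(response.usage_metadata.thoughts_token_count)
```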

Okay, continuing with our notebook. And please continue yourself; you can go at your own tempo, faster or slower, just to make sure that we are all on the same page. And I guess the most interesting part about streaming and LLMs in general, we have all seen it with ChatGPT, is that waiting for the whole response is a very bad user experience, right?

Like, who wants to wait 60 seconds or two minutes for a response? So that's why everyone now uses streaming. And with Gemini and the Gemini SDK, it's super easy: instead of generate_content, we have generate_content_stream, same input parameters, except that we now get an iterator back from our call, which we can loop over, and we can print each chunk or stream it back to our user over an HTTP service.
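The streaming variant, as a sketch:

```python
# generate_content_stream returns an iterator of chunks instead of one response.
for chunk in client.models.generate_content_stream(
    model=MODEL_ID,
    contents="Explain why streaming improves user experience, in three sentences.",
):
    print(chunk.text, end="", flush=True)
```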

And then, similar to other models, Gemini is a chat model, right? Maybe you are familiar with the OpenAI SDK, where you have the concept of messages, with a different input per user turn and assistant turn, and I would say it's still complex to manage yourself, because you need to keep track of it.

To make it easier, we added something called the chats API, which basically does all of the state management on the client, as part of the SDK. So you can create a chat with your model, and then you can send messages into the chat session. In this case we are planning a trip: we send a message, we get back the response, but our chat session now also includes the user prompt and the assistant message.

So instead of needing to create this object of user turns and model turns, we can directly continue with sending our next message, asking for some good food recommendations. And since we are in a conversational setting, the model knows the context; it had suggested different European cities for us.

And for the next request, it uses the whole conversation as history to get our response. I can also quickly print the response here and get some different examples. And if you need to store it to a database, or in general, we have a nice get_history method available, which allows you to retrieve the complete current state, and you can store it or update it or whatever you want to do.
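A sketch of the chats API; the prompts are just examples.

```python
# The chat object keeps the conversation state for you on the client side.
chat = client.chats.create(model=MODEL_ID)

first = chat.send_message("I'm planning a trip to Europe in October. Which cities would you suggest?")
print(first.text)

# The previous turns are sent along automatically as history.
second = chat.send_message("Any good food recommendations for those cities?")
print(second.text)

# Full history, e.g. to store in a database.
for turn in chat.get_history():
    print(turn.role, turn.parts[0].text)
```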

Yeah? That's only a client abstraction. The backend receives the same request as if you would send it as a single request with an array. It's only a client abstraction to make it easier for developers to build quickly. And then, similar to OpenAI or other models, you can give the model some kind of system instruction to have it behave differently, respond in a different language, or make sure it respects policies or guidelines you provide.

This can be done through generation configs. So we have another argument now in our model call, next to the model ID and the contents: we have a config where we can provide our system instructions. And similar to the system instruction, we can provide other generation configurations. So temperature can be used to make the generation more creative or more deterministic.

So if you, for example, build a retrieval-augmented generation application where you want the model to mostly use what you provide as context, you would normally set the temperature to zero or a very low value. If you work on content writing or marketing, you would set the temperature to a higher value.

We can control the max output tokens to make sure that we are not exceeding some budget or some length, and top-p and top-k are also ways to make our generation more or less diverse. Similarly, in the config we have the thinking config, where you can set the thinking budget or also include the thoughts in your response.
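A sketch combining a system instruction with the other generation settings mentioned here; all the values are illustrative.

```python
config = types.GenerateContentConfig(
    system_instruction="You are a concise assistant. Always answer in English.",
    temperature=0.2,        # low for grounded/deterministic answers, higher for creative writing
    max_output_tokens=512,  # cap the response length
    top_p=0.95,
    top_k=40,
    thinking_config=types.ThinkingConfig(thinking_budget=1024, include_thoughts=False),
)

response = client.models.generate_content(
    model=MODEL_ID,
    contents="Summarize what a system instruction is in two sentences.",
    config=config,
)
print(response.text)
```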

Okay. And then I think what's more unique about what we can do with Gemini is that we have direct support for files. So in this case, I download The Adventures of Tom Sawyer, store the whole book in a file, and use the Files API to upload the file to a Google Cloud Storage bucket.

That's free for you. So if you don't want to use your own corporate bucket or whatever bucket: with each AI Studio account there's basically a personal bucket, which stores the file for, I think, one day, but you can control the time to live. And instead of needing to send the whole file with your request, which can be very intensive, you can upload it once and then just provide the reference to the file. What the Gemini API does behind the scenes is basically download the file on the backend and make it available inside your prompt.

And similarly here, we uploaded our book and passed it into our contents list. So we no longer have a single prompt; we now have an array with our file, and we ask it to summarize the book. It was also done while I was talking, and then we can also see, okay, the token usage we had now was 100,000 tokens.

So much bigger than what we tested before. Using the Files API also makes it very easy to work with PDFs, which we'll see in the next chapter. And then, as an exercise for you, combine all of the things a bit: how can I use the book together with the chat to talk to our model and help me better understand it?
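A sketch of the Files API flow with the book; the file path is an assumption, and older SDK versions name the upload argument path instead of file.

```python
# Upload once, then pass only the file reference in the contents list.
book = client.files.upload(file="tom_sawyer.txt")

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[book, "Summarize this book in five bullet points."],
)
print(response.text)
print(response.usage_metadata.prompt_token_count)  # the whole book counts as input tokens
```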

Yep. So what you do is basically upload the file from your client to the cloud, into a bucket. And when you send a request, part of your request, I mean, I can show it, would be only the reference to where the file is stored.

And what the Gemini API does behind the scenes is it loads the file into where the request runs and then puts it into the context. So you can use now this file pointer for all requests and you don't need to send it every time. And here we also have like the URI.

So that's basically where our file is stored or can be accessed as well. Yep. How do you deal with that file in terms of security? Is it hidden? Is it only available to the chat session that references it? It's only available to your user.

So when you send a request, you send an API key, and this API key is basically used to get the file, so nobody else can access the file. So if you would use a PDF which has more than a million tokens, what would happen? You would most likely receive an error saying that the file is too big in terms of context.

What you can do is use the file to count the tokens. So we have the file here and we can call client.models.count_tokens. You would need to chunk it then. If the file doesn't fit, it doesn't fit, and you would need to think about, okay, can I chunk it?

Can I summarize it? Can I maybe use other techniques to first extract the important information? And once I have context which is smaller than the maximum context of my model, I can provide it again. Yeah. What about the files we upload?

No, no. But it will be deleted, so don't expect the file you upload now to be there in an hour or a day. But you can use the same concept with Vertex AI and your own bucket, where you have more control over it, where you can say, okay, maybe I want to upload it using a different API call or already have it available.

That also works. Okay. So there's an entry point to say "use this file in my Google Cloud Storage bucket"? With Vertex AI, yes. It's roughly the same, but you need to set up the client differently. Yeah. Good question. So, I mean, we can maybe jump directly into the PDF section.

So, continue with section one or jump to section two or three directly. What I will do is jump into section two, which is all about multimodality, meaning we will cover visual understanding, audio understanding, videos, and document processing. And that's where I will jump to, okay, once we are connected: the part about working with PDFs, basically.

Being connected. Part of the, uh, working with PDFs basically. So similar to what we have seen a minute ago, we have a PDF. In this case, it's basically an invoice from a supermarket. Um, I upload it and I asked the model like, what's the, the, the total amount, uh, we can run it.

And what happens behind the scenes? I can show. Oh, it's not here, wait one second. Okay, we don't have the file here; I'll upload it quickly. But what happens behind the scenes is we run OCR on your PDF and provide the PDF as an image, so you don't need to do it manually.

So there's no "hey, it's a PDF, let's convert it to an image, then run OCR, and then I provide the image and the OCR text." That's not needed; we are doing it for you. Okay.

So OCR, or the image understanding, is not perfect yet, right? If we reach a point where the model understands the image the same way as the text, then I guess there's no point. But based on what we have seen, and also what the industry does, you receive better results when you provide the OCR plus the image.

Yeah, that's basically it. Yeah. Maybe I kind of missed that, but the PDF itself, do you actually look at it as an image, say if there are tables and diagrams? And do you extract it image-based? Or do you also take the text that's sometimes hiding in the PDF, like copy and paste?

No, I think, I mean, I don't know exactly, but I think it's just basic OCR, nothing special, no magic, and then a screenshot of the PDF. Yeah, so it's both. Okay, we can try again; we now have the PDF available. And in case you hit this too: the workshop has multiple sections with files, which are part of the repository.

So if you run into a similar issue, especially for the image understanding part or the audio understanding part, and you use Colab, you might need to download the files manually and then upload them. But in our case, we now have our invoice; I think we can quickly show it.

So I was shopping in Germany; we have a supermarket called Rewe, and I basically bought some butter, some bread, some sweet potatoes, and we prompted it and asked, okay, what's the total amount? We can see here the total amount is 20.20. Now let's see if we got it correctly, and we got it correctly.

And it also correctly extracted it in German, even though I prompted it in English, which I think is pretty cool. Okay, let's start with the image understanding part. Can I ask just one quick question? Yeah, please. Sorry, about the thinking budget: so it's not just limiting the thinking process that you see, it's actually changing how the model thinks?

I'm not exactly sure what happens. I only know that by defining the thinking budget, you can limit how many tokens will be used or generated as a maximum. And very similar to what OpenAI has with low, medium, and high effort, we have a bit more granular control.

So we could technically do the same and say low would be a thousand thinking tokens, medium would be maybe 12,000 thinking tokens, and high would be 24,000 thinking tokens. And it would then use at most those tokens before generating your response. But what exactly happens internally, I can't tell you.

Yeah. So without thinking, what we have seen, especially on more math-type questions where the model benefits from the reasoning, is that the performance is a bit worse. But for general everyday use, especially image understanding or OCR, you can easily run it without it.

The truth is, you need to try. And I think the real benefit here is that you have those granular controls. So you can run evaluations with a thinking budget of zero, 1,000, 2,000, 4,000, and see how it impacts your evaluation. And then you can calculate for yourself, okay,

how much am I able to spend, what's my maximum cost, and what's the accuracy I need to reach. Okay. More questions, or should we continue? There's a question. Yeah. No, there's documentation for it: JSON, PDFs, all the different image types, all the different video types, all the different audio types.

So all of the multimodal features we support. If it gets a bit more specific, like .jsx files and .vue files for web development, we are working on it, but you might see an error, and the easiest way is to just rename it to a .txt.

No, I would say for those you would need to use a Markdown converter or another library to convert it, or copy-paste the input. Yeah. So I'm not exactly sure what our researchers did. The only thing I know is that we get better performance when you provide the image plus the OCR.

So I guess there's a benefit of having both. Awesome. And the second question is about the script you just showed us. Yeah. Is it essentially the same thing as we saw in the UI? Yes, kinda. So the UI, or AI Studio, of course doesn't use Python, but it calls the same APIs behind the scenes.

So the API behind the file upload is the same API we call from AI Studio. We call the exact same model, and both have the exact same parameters. So what you test and experiment with in AI Studio, you can easily convert into code and run locally.

There's also this Get code button. So if you are in AI Studio, I mean, we can quickly try uploading our invoice again and use our prompt. So here, now we run basically the same request. And we also have this, it's a bit hidden.

It's this code button at the top, where you can get the exact same code. It takes a few seconds as the Wi-Fi is super bad. But here you get the exact same Python code where you create your client. We have our model. In this case, as we uploaded it manually, we provide the document not as a file URI.

We provide it directly as base64. We have our prompt, we have the model response already since we generated it, and then you can continue. Okay, cool. Question? Yep. Can we still use the count tokens function when we use a file? Yep. So I mean, we can quickly try it.

So we are here, we have our PDF. I guess it's very interesting to know how many tokens we will use. So let's quickly count tokens. If we go here, we have our count_tokens, and now we use the same contents. We have the same model ID.

We have our prompt, and we have our PDF. And let's print our token count. Okay. Alternatively, if you have already run a request, you should have access to the response and its usage metadata. Yes. And so we have our token count.
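count_tokens accepts the same contents list, files included, so you can estimate before you generate; a tiny sketch reusing the uploaded invoice:

```python
token_count = client.models.count_tokens(
    model=MODEL_ID,
    contents=[invoice, "What is the total amount on this invoice?"],
)
print(token_count.total_tokens)
```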

So our PDF here is converted into roughly 500 tokens, and the prompt we have is around 20. And if we compare it to the request we ran, we see, okay, we have the exact same amount of prompt tokens, and we have the prompt details.

We have our output tokens: thought tokens of 42 and candidate tokens of 78. Okay. More questions, or? Yeah.

So the PDF is converted into an image, and we provide the image. And if the image or the PDF has tables, visuals, mind maps, the model is trained on similar data, so it will definitely understand parts of it.

I mean, we can try it. Maybe we can search for some mind map and ask it something. Maybe you can start thinking about a prompt while I'm searching for a mind map, if the Wi-Fi allows us. OK. We have our mind map. I guess it's just a mind map on how to do mind maps. Any idea what you would like to know? I mean, what's the most central concept for you?

Sorry? What's the most central concept in the mind map? Yeah. Like this? Yeah. OK. I mean, I'm opening it, but I think at a small scale it looks correct: "how to mind map." Answers? Perfect. I mean, the issue there, in theory, right, is if there's a table, the fidelity of the table depends on the fact that we're going to capture it in an image.

So whether the fidelity of the image represents that table well, if, for example, it's a complicated table. Those are the challenges, right? And if you're saying it's only OCR and then only an image, whether you're going to get this done with that, or whether it might not be usable.

But if that's the challenge... So where we have seen the most success is when you separate things a bit. There are already very good existing methods which allow you to extract tables or other visuals from documents, and you then work with those images and tables directly. The way we work changes: previously we provided a table, ran OCR, then asked our question.

Now we directly ask the question based on the table, as the models have gotten so good at multimodal understanding that they can combine the different aspects. I mean, we can try it as a good example as well. Maybe we find some nice, I don't know, invoice image or something.

And then we can ask it maybe to add something up or to combine it, which would be very hard, I guess, for a normal model. But in general, Gemini is especially good with multimodal understanding. Also videos: my newest favorite thing is to take a YouTube video which is below one hour, put it into AI Studio, and have it summarize it.

Or if you have any specific question, it's so much faster than sitting there, even watching it at 2x speed. You get a response in 80 seconds or something. And you can even ask it to extract specific timestamps of when something was said, or to help you section it very easily.

Sorry, can you speak a little bit louder? Yeah. I don't think so. So AI Studio is very developer-centric, and we don't want to do too much black magic. There are safety filters and controls, which you can configure in the SDK or in AI Studio, and in AI Studio it's under Advanced Settings.

I have safety settings; I have all of them off. But if you want to filter on very explicit content or hateful content, we run some classifications, basically, before and after to make sure that you are not creating harmful content for your users. OK. I have our invoice image.

OK. What should we ask? How much it would cost if we subtract the pedal, maybe? Yeah. So I sadly don't have an answer for this.

I think you can always think a bit about: would we as humans struggle to understand those PDFs if they are switched up? If yes, then most likely the model will struggle as well. I think one very nice part about Gemini, AI Studio, and the Gemini API is that you can get started very quickly.

All of us set up, within 20 to 30 minutes, a free account with access to Gemini 2.5 Flash via the API and 2.5 Pro in the UI. So the best thing is always to test and to explore and evaluate. And even if you need to run 1,000 PDFs, it's not very expensive anymore.

So really the best thing is to run your own evals to get something more than "I tried five PDFs in the UI." Really look into it. And if you have any questions or problems, the best way is to reach out to us. We have teams helping and building with customers.

And then we can iterate on it together. Going back to your PDF: any question, any prompt idea? Yeah, I have a question. Yeah. I know you explained about security, but there are security measures mandated by law, like LLM firewalls, or benchmarking automations, or LLM guardrails, or SIEM, posture management, mandated by CISA.

So when I implement those rules for either defense or for financial applications, how do you -- just a small set of -- I know that Google is pretty strong in setting up those security measures from a different perspective. How do you integrate this Colab with the other security measures?

So I guess for those types of environments, where you have a lot of compliance regulations, the best is to work with Google Cloud. Everything we do in AI Studio is also available in a somewhat similar way in Google Cloud, in Vertex AI, and Vertex AI provides more features for those kinds of use cases.

I'm not exactly sure, but they definitely have more information they can provide on how to handle all of those things and those guardrails with Gemini. So you're suggesting we need to go to a global Google GCP environment?

When I go global on a GCP environment, then there are certain performance and cost implications, such as -- there's a guardrail that is bot-based, there's a guardrail that is organization-based, and there's a guardrail that is industry-based. So we have to split those platforms from the most granular level all the way to the industry-wide level.

So that's becoming, when I do it globally at the GCP level, more expensive, and there can be performance issues as well. Yeah. So I know that there are regional endpoints for Gemini in Vertex AI as well. And also at Cloud Next, they announced a new Gemini on-premises kind of thing, where big companies can basically buy a huge box where Gemini is pre-installed, and it gets delivered to your environment.

I'm not exactly sure about the details. The easiest is to do a quick Google search and look for it, but those are exactly where Vertex provides you more features and support than AI Studio. Yeah. Thanks. One of those , someone from Google also represented, and that's a good platform, but I didn't know how you can integrate from here to there.

Yeah. This is the most granular level. Yeah. How do you integrate from the most granular level to the global level? That's the part of . But anyway, thanks for it. Okay. Maybe back to your PDF question regarding what to ask. So instead of like asking what the total sum was, I asked it to sum up the unit prices, which worked very nicely.

Of course, it's a very well-formatted PDF in this case, but the image understanding is very good, and the new way we should think about it is: I should ask the question directly based on the image, before doing too much of the preprocessing we have been doing in the past.

Cool. Okay, we are now almost at half-time. I guess it's time we move a bit away from the multimodality part into, I guess, the more agentic parts, which are, I would say, definitely more interesting, at least to me, especially if you combine them with the multimodality parts.

Okay. So part three is all about structured output and function calling. Do you know what structured outputs and function calling are and roughly how they work? Any hand signals? Yes or no? Yes? Everyone? Okay, not bad. So part three continues with PDFs, as they are kind of very interesting, and also we want to do structured outputs.

So structured output is for us a way to create structured data from text, which we can work with way more easily afterwards, right? And in the end, we prefer structured output much more because we can integrate it into other APIs; we can connect APIs. And Gemini, or the SDK, supports Pydantic.

Pydantic is a very nice Python library which lets you create those data structures, and we can also create nested data structures natively. So here we have a recipe with a name and ingredients, which is a list of strings, and then we have a recipe list, which is basically a list of our recipes.

And we can provide it in our configuration: similar to the generation arguments or our thinking budget, we have a response MIME type and a response schema we can provide. Here we ask it, okay, can it generate two popular cookie recipes for us? And we basically force it to use our structure.

And I already nicely printed it here, but if we look at the raw response of our model, we get back JSON with all of the different fields. And there's also a nice .parsed attribute, which allows us to convert it back into our Pydantic schema, and then we can access all of the data points.
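A sketch of structured output with Pydantic, roughly matching the recipe example; the prompt is illustrative.

```python
from pydantic import BaseModel

class Recipe(BaseModel):
    name: str
    ingredients: list[str]

response = client.models.generate_content(
    model=MODEL_ID,
    contents="Generate two popular cookie recipes.",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[Recipe],
    ),
)
print(response.text)    # the raw JSON
print(response.parsed)  # a list of Recipe objects
```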

And as we had our invoice, we can now maybe complete the exercise together. So we asked about the total amount, right? But when working with PDFs, normally we want structured data as a result. Text is not very helpful for us when we want to put it into a database or work with it.

We really need those data schemas. And what is very nice about Gemini is that we can basically combine both of it. So we use our structured output method together with our multimodal capabilities for files, and we can provide our file... oh, we need our invoice, so maybe it's a good example. So I didn't change the recipe data structure which we want to create, and we ask it to extract the information from our PDF. Our PDF is an invoice from a supermarket, so it doesn't have a recipe name or ingredients, and Gemini did not generate or hallucinate something: we get back an empty recipe list.

So if we now change it to our invoice data structure, we should hopefully then see the correct extracted information. Yes, so we extracted the date, all of the items we bought, and all of the different prices. And with that data it is now much easier for us to work, right? If I have some kind of automated system where I need to take in invoices or PDF documents, I can now provide a structure of what information I want to extract, and Gemini basically does all of the matching for us, which is super nice.
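A sketch of combining structured output with the uploaded invoice; the Invoice schema and its field names are assumptions for illustration, not the workshop's exact schema.

```python
from pydantic import BaseModel

class LineItem(BaseModel):
    name: str
    unit_price: float

class Invoice(BaseModel):
    date: str
    items: list[LineItem]
    total_amount: float

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[invoice, "Extract the structured data from this invoice."],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Invoice,
    ),
)
print(response.parsed)  # an Invoice instance with the date, items, and total
```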

And function calling is basically the same idea, but instead of having a data structure, our output is a name and the arguments. So, similar to what we had, we create a structure of how a function signature looks. We have a weather function, which has a name, a description, and the properties which we need to provide. There's the same function, but just as Python code. And with function calling, we provide the function declaration and our prompt, and then the model generates a structured output which has the function name it wants to call, the get weather method, and the arguments. For our weather function we only have one, location. We provide it, similar to all of the other configurations, in our config argument; this time we have tools. And we want to know what the weather is in Tokyo, and obviously "what's the weather" fits the description of our function, as it helps us retrieve the weather. We run it, and the model, instead of generating a nice response, wants to call the get weather method with the location Tokyo. If I change the prompt to "hello"... oh, makes sense, one second... so if I change it to "hello", we don't have a function call, right? The model correctly understands, hey, that's just a greeting, let's respond with "how can I help you".

But we want to call a function, so we have "what's the weather in Tokyo", and then the next step is for you as a developer to call the function directly. What you would normally have in your code or in your applications is a way to identify which function is called. It could be a simple switch statement to check, okay, what's the name, and if you get the name, call the method with the provided arguments. And then what you do is: the output of your function becomes the next user input. So the model generates this name-and-arguments object, and we provide the output back as a user turn; in our case, it's the result. And if we look at the weather method, it's basically some dummy data about the temperature, the condition, and where it is. And then the model generates a very nice response. So we have the user input, the model produces a structured output, the user provides a structured response, and then the model generates a very nice, user-friendly response. So we call our function, the function returns the weather, and then the model generates a very nice response, which is: the weather in Tokyo is sunny with a temperature of 22 degrees Celsius, and it feels like 24 degrees Celsius.

So you can think about it like this: that's how I can integrate tools, or more or less convert my LLM into an agent, or give it a way to call something. The get weather method can be anything. It can be a database call, it can be a real API call, it can be, I don't know, reading emails, sending emails: all of the things we currently see with all of the MCP hype going on.
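A sketch of that manual function calling loop; get_weather and its dummy data are stand-ins, not a real weather API.

```python
def get_weather(location: str) -> dict:
    # Dummy implementation; in practice this could be a real API or database call.
    return {"location": location, "temperature_c": 22, "feels_like_c": 24, "condition": "sunny"}

weather_declaration = types.FunctionDeclaration(
    name="get_weather",
    description="Get the current weather for a location.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={"location": types.Schema(type=types.Type.STRING)},
        required=["location"],
    ),
)
config = types.GenerateContentConfig(
    tools=[types.Tool(function_declarations=[weather_declaration])]
)

# Turn 1: the model answers with a function call (name + arguments) instead of text.
contents = [
    types.Content(role="user", parts=[types.Part.from_text(text="What's the weather in Tokyo?")])
]
response = client.models.generate_content(model=MODEL_ID, contents=contents, config=config)
call = response.function_calls[0]

# Turn 2: we execute the function ourselves and send the result back as the next turn.
result = get_weather(**call.args)
contents.append(response.candidates[0].content)  # the model's function-call turn
contents.append(types.Content(
    role="user",
    parts=[types.Part.from_function_response(name=call.name, response={"result": result})],
))
final = client.models.generate_content(model=MODEL_ID, contents=contents, config=config)
print(final.text)  # a user-friendly answer based on the tool result
```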

And MCP servers have tools as well, and those tools of MCP servers basically expose the same declarations. So an MCP server has its tools defined, get weather for example if it is a weather MCP server, and it has an endpoint or a method which you can call, which is list tools. And this list tools method returns the schemas of the functions, which look very similar to what we have created here. And then what you do on your LLM side, or client side, is put the schemas the MCP server provides into your LLM call; the LLM then generates, depending on the context, the structured output, and then you call the remote tool. So it's very similar to what we have done here on the client side manually, but more abstracted away and more managed. And of course the benefit here is that not every one of us needs to implement the get weather method; it's way easier to use a weather MCP server from, I don't know, some weather provider. Similarly, if you want to use maybe your own personal Google Drive MCP server, if Google creates those. So that's the whole idea of why MCP servers are kind of so cool.
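A sketch of the idea only, using the mcp Python SDK: the server command is made up, and the point is just that list_tools() returns schemas shaped like the function declarations above.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def show_mcp_tools():
    # Hypothetical weather MCP server started over stdio.
    server = StdioServerParameters(command="npx", args=["-y", "@example/weather-mcp-server"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                # Name, description, and a JSON input schema: the same shape as a
                # Gemini function declaration, so it can be mapped onto the model's tools.
                print(tool.name, tool.description, tool.inputSchema)

asyncio.run(show_mcp_tools())
```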

Yeah, so currently we are working on improving and extending that, but currently the suggestion is probably between five to ten tools. And if you have more tools, you can use embedding models to basically filter based on the user's problem: what you would do is run some similarity matching between the tool descriptions and the prompt, what makes sense and what doesn't make sense, and then you only pass in the top tools, so to speak.

I'm not sure about that claim and how you would do it. What I know is that Gemini 2.5 Pro was the first model to complete Pokemon Blue, which ran, I think, for like 200 hours straight. The big challenge here is that we have a limited context, right, which for Gemini is 1 million tokens. So if you would continue for two hours, you would definitely run out of those 1 million tokens. And Anthropic, I'm not sure what their context currently is, but what you most likely do is summarize and compress the context and the conversation and what you provide to your model. So yes, I'm pretty sure Gemini can run for more than two hours, but it depends on what you want to solve and how you are going to solve it.

Yep? If you call a native tool, it seems like the background traces are all hidden; is that correct? Is there any way to access them, that you can imagine, for managing performance? Yeah, that's correct. So native tools are the next section; we can go into it in one second. And yes, currently the native tool calls are not being returned; what happens behind the scenes stays behind the scenes, and what the user gets is the final assistant response. That's great feedback regarding whether we can expose them or not; I'm very happy to take it to the team. I can definitely see why it would be helpful for people to build with it directly, but for now it's not the case.

And speaking of native tools: Gemini is basically trained to use these tools natively. With function calling we generate declarations and can try to do everything ourselves, but native tools are much easier to use, as you don't need to define or create a declaration, and they basically run on the backend side, so you don't need to execute anything. As native tools we currently have Google Search available. We are basically in AI Studio here.

So we have structured output, which is not a native tool, which we used to get back the structures. Code execution is a native tool, and it basically means that we, or Gemini, run code for us. So, "function to sort the top five cities based on population": what it can do is run Python code for us. So if you prompt it to solve a task using Python, it should run the Python code for us. So it generates the Python code... and it did not run it. That's a bad example. Sorry? Oh, did I? Ah, so, my bad, not Gemini's bad: "use Python". Let's see what it does. Okay, yeah, now it gets it, thanks, perfect. So we have some reasoning, and then it generates executable code. The executable code is also provided via the API, which runs Python for us. So it writes the Python script, it executes the code, and it generates the matplotlib chart. In the notebook, we have an example available on how you get the chart if we look into the code execution tool.

When we run it, it returns the markdown, and it can also generate and return images. Next to the code execution tool there is the URL context tool, which basically allows you to provide a URL as part of your prompt, and we will extract the information from the website behind the scenes.

We make it available in your context: instead of you going to a website, command A, command C, command V, we do it for you. So in this case I ask it what the benefits of Python are, based on the URL. You can provide up to 20 URLs in one request.
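A rough sketch of the same idea in code; the URL is just an example, and `types.UrlContext` and `url_context_metadata` are the names I'd expect from the current SDK, so double-check against the docs:

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What are the benefits of Python according to "
             "https://en.wikipedia.org/wiki/Python_(programming_language)?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(url_context=types.UrlContext())],
    ),
)

print(response.text)
# Metadata about which URLs were actually fetched behind the scenes.
print(response.candidates[0].url_context_metadata)
```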

Behind the scenes it goes to each URL, extracts the information, and provides it as part of your prompt. And if we look back at our earlier prompt, yeah, here, it returned our nice matplotlib chart, which is very cool. And then a final tool is, of course, Google Search.

That kind of makes sense. So, let me find it, yeah, you can enable Google Search, which then allows Gemini to do Google searches behind the scenes. What happens here is that it takes our prompt, in this case "what are the latest developments in renewable energy",

and first converts it into one or multiple Google Search queries. Then it executes those search queries, provides the results back to the model, and the model generates your final response from your user input and from all of the search results. Currently, as mentioned, we don't export or expose those tool calls.

But what's especially helpful or interesting for Google Search is that we have a grounding metadata object. So we have grounding support, which basically points to exactly which information refers to which source.
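A hedged sketch of enabling the Google Search tool and reading the grounding metadata back (attribute names as in the current SDK docs):

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="What are the latest developments in renewable energy?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

print(response.text)

meta = response.candidates[0].grounding_metadata
print(meta.web_search_queries)        # the search queries Gemini generated
for chunk in meta.grounding_chunks:   # the websites the answer is grounded on
    print(chunk.web.title, chunk.web.uri)
```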

We also have grounding metadata about which websites were crawled, and in our case you can see it there. So for code execution, grounding, and URL context, we increase the context, right? For Google Search, there's a free tier of, I think, 1,500 searches,

which are free, and afterwards I think it's $35 per 1,000 searches. Code execution, the Python running, is free. And I think what's very cool, and you already have done it or are going to do it later, is that you can combine those. So you can use the Google Search tool

with the URL context tool, or the code execution tool with the Google Search tool, to have it work more agentically: first search for, say, the latest React or Python version, then write a Python script, run it, and return the result, which makes it very nice to use.

Will the deep research tool also be released? Yeah, we have heard from a few people that they would like to have a deep research API. I think the more people ask for it, the more likely it is going to happen.

Maybe one more thing: as part of our team, we yesterday open sourced an example of how you can build your own deep research using LangGraph with Gemini, very similar to what the deep research agents basically do.

You have a question, we generate queries, we then run multiple web searches, and we check: was the user's question already answered, or do I need to do more research? And then you have this kind of loop: okay, do I need to search again or use other tools? It's completely open source, and it uses all of the things we have seen today:

Gemini 2.5 Flash and Gemini 2.0. So if you want your own deep research, that's probably the best way to start. So, is there additional pricing for the Google Search? We had it two minutes ago: basically, 1,500 searches are free if you use the native tool,

and then it costs money. And all of those tools, in a way, enrich your context. So if you use the URL context tool, which goes to a blog post, and the blog post has 10,000 tokens, you pay for those 10,000 tokens, as they are included in your prompt, and then it tries to answer your question.

For the function calling we had, it's a bit different: of course we generate tokens, but the structured output tokens are fewer, and you do the function call, or the Python call, on your side. If we have two tools which have overlapping functions, like in the same domain,

where one gives the weather and the other gives the temperature, how does the model decide which to call? Yeah, that's definitely a good question. So function calling, of course, is much more than one tool, and it's never really going to be about calling the weather API,

because, I mean, we have weather apps, right? But it's a good example to show. What's very cool is that there is parallel function calling. So we are in the Gemini documentation again, with the lights example, right: if you want to say, okay, set us into party mode,

you would expect all of them to start at the same time. And with Gemini we have parallel tool calling, so basically what Gemini would do, instead of generating one function call object as an output, is generate a list with, say, three objects for all of the functions you need to call,

and then you iterate over those. The inputs and outputs of the tools are not dependent on each other, right? I can start my disco ball and the music at the same time; it's not that I first need to start my disco ball and only once it runs can I start the music.
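As a sketch, the party-mode example could look like this; the three function declarations are hypothetical stand-ins, and the key point is that `response.function_calls` comes back as a list you iterate over:

```python
from google import genai
from google.genai import types

# Hypothetical declarations for the party-mode example.
declarations = [
    {"name": "power_disco_ball", "description": "Turn the disco ball on or off.",
     "parameters": {"type": "object", "properties": {"power": {"type": "boolean"}}}},
    {"name": "start_music", "description": "Start playing music.",
     "parameters": {"type": "object", "properties": {"loud": {"type": "boolean"}}}},
    {"name": "dim_lights", "description": "Dim the lights.",
     "parameters": {"type": "object", "properties": {"brightness": {"type": "number"}}}},
]

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Turn this place into a party!",
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=declarations)],
    ),
)

# With parallel function calling this is a list of calls, not a single one;
# you execute them yourself, in any order, since they don't depend on each other.
for call in response.function_calls or []:
    print(call.name, dict(call.args))
```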

Then there is the case where you have more sequential tool calls, where you basically need the output of the first function as the input of the second function. Say you have some kind of smart home system and you want to set the temperature based on the outside weather:

you would first need to check what the weather is and then set the temperature inside your house. Here you would basically provide instructions to your model using the system prompt, saying: okay, to change the temperature, you first need to look up the weather and then set the temperature.

And what Gemini would do, basically: you have your user prompt, it generates the structured function call, you call the function, and instead of generating a user-friendly response it generates another function call, which you then call as well. So it continues this function-calling loop before generating a nice, user-friendly output at the end.
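A sketch of the loop you would write on your side; `get_weather` and `set_thermostat` are hypothetical local functions standing in for your smart-home APIs:

```python
from google import genai
from google.genai import types

# Hypothetical local implementations standing in for real smart-home APIs.
def get_weather(city: str) -> dict:
    return {"city": city, "temperature_c": 14}

def set_thermostat(temperature_c: float) -> dict:
    return {"status": "ok", "temperature_c": temperature_c}

local_tools = {"get_weather": get_weather, "set_thermostat": set_thermostat}

declarations = [
    {"name": "get_weather", "description": "Get the current outside weather for a city.",
     "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                    "required": ["city"]}},
    {"name": "set_thermostat", "description": "Set the indoor thermostat temperature.",
     "parameters": {"type": "object",
                    "properties": {"temperature_c": {"type": "number"}},
                    "required": ["temperature_c"]}},
]

client = genai.Client()
config = types.GenerateContentConfig(
    system_instruction="To set the temperature, first look up the outside weather "
                       "with get_weather, then call set_thermostat.",
    tools=[types.Tool(function_declarations=declarations)],
)
contents = [types.Content(role="user", parts=[
    types.Part(text="Set a comfortable temperature in my house in Berlin.")])]

# Keep executing tool calls until the model answers in plain text.
while True:
    response = client.models.generate_content(
        model="gemini-2.5-flash", contents=contents, config=config)
    if not response.function_calls:
        print(response.text)  # the final, user-friendly answer
        break
    call = response.function_calls[0]  # handle the first call for brevity
    result = local_tools[call.name](**dict(call.args))
    contents.append(response.candidates[0].content)  # the model's function-call turn
    contents.append(types.Content(role="user", parts=[
        types.Part.from_function_response(name=call.name, response={"result": result})]))
```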

Can you show that again? So, can I run function calling in a loop? Yes, kind of; I put up an example for the sequential function calling. To manage multiple function calls, time-outs, and the responses, is there something that manages that,

or do we have to write it ourselves if there are multiple function calls, some sequential and some longer running? Is there a way to handle functions that could time out, or what's the best way? It's a bit like, I would say, with parallel calling

you need to wait for all of the results and provide an error message or a "not ready yet" or something. But you would need to explore how it works with the Live API, which is basically our way to create real-time agents. There, they are working on something called asynchronous function calling,

where the conversation continues. So when you think about a customer support agent which you talk to, right, it would be very weird if the agent stopped for three minutes and didn't say anything because it needs to look up your information. So that's what they call

asynchronous function calling: you can continue the conversation with the agent, but the agent runs a tool call and then injects the response later. But it's only for the Live API. And where is that documented? Yeah, in the documentation under the Live API. It was launched, I think, at I/O or

at Cloud Next. So yeah, look for tool use with the Live API; there is asynchronous function calling, with code snippets for Python and JavaScript. Cool. Is asynchronous function calling a synonym for parallel function calling? No, it's more like: I start a function call now, I need to continue

my conversation, and I get back the response later. So instead of waiting for the response, the conversation can, for example, be interrupted; the developer knows, okay, I started something, the model knows it started something, and then you can inject the output,

once your function is ready, back into the conversation, and the model uses that information to continue. And what about observability of the API calls, what's the equivalent? So in AI Studio we don't have a feature like this,

but you can use third-party tools like LangSmith or Arize AI Phoenix; Vertex AI does support those features. Okay, so, yeah, there was a question. I don't know for sure; the only thing I know is that previously only Gemini 2.0

Flash was available in the Live API, but now we have Gemini 2.5 Flash, so basically the model we used in the Jupyter notebooks. For the rest, I guess you have to try it, or maybe reach out to the people who are working more with it.

So you have to try. More questions? Yep, about URL context. So the URL context tool basically works based on the link you provide: if you provide your own personal blog, it goes to that link, tries to extract the information, and uses it in the context.

Search, on the other hand, basically uses Google Search: it creates one or multiple search queries based on the prompt, searches, and then provides the outputs into your prompt. So if you already know where the source for the best answer to your prompt is, URL context could be a good use case for it.

Does it work with paywalled content? You mean pages behind a paywall? I would not expect it to. But the API supports function calling, so if you have a subscription, or some way to access that paid content, you can create a function which the model can then invoke and use as context for its information.

Okay, cool. Then, last section: I guess every one of you might have heard about the Model Context Protocol by now. It's basically the solution for how we all align on how to call tools for agents, and I'm a fan of it. I think it makes it way more accessible for people to build agents, especially if we get more first-party remote MCP servers, where you can basically focus on building your agent instead of creating all of those functions.

I mean, I'm pretty sure there are like now a million get weather functions which can be used, and I hope with MCP we can like fix this. What we shipped and announced at Google I/O is a native integration of MCP servers inside the Google Gen AI SDK. So the SDK, which we have used during the whole workshop, allows us to directly use MCP servers and sessions, which makes it even easier for all of us to integrate it.

So when we look back at our function calling example, we needed to call the function, create a declaration, and all of those different things. Now, with the MCP integration, we basically only start or initialize our client. Here I created an MCP weather server, which has roughly the same functionality, and all we need to do is use our generate content method and provide our session in the tools argument.
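As a rough sketch, assuming the MCP Python SDK and a hypothetical local weather server started over stdio, the whole thing looks roughly like this:

```python
import asyncio
from google import genai
from google.genai import types
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

client = genai.Client()

# Hypothetical local MCP weather server started over stdio.
server_params = StdioServerParameters(command="node", args=["weather-mcp-server.js"])

async def main():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            response = await client.aio.models.generate_content(
                model="gemini-2.5-flash",
                contents="What is the weather in San Francisco today?",
                # Passing the MCP session directly: the SDK lists the server's
                # tools and runs the function-calling loop for us.
                config=types.GenerateContentConfig(tools=[session]),
            )
            print(response.text)

asyncio.run(main())
```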

So we start our server, we create our session, and we provide the session to the Gemini SDK. And then what happens behind the scenes is basically the same function call loop we did manually. It's like, okay, what are the tools available? Get all of the tools from the MCP servers, put them into the LLM call.

If the LLM call makes a function call, extract the function call, call the MCP server, get the response from the MCP server, put it back into our conversation, have the model generate a final response. I guess it's probably time to find out if Colab has node installed, and to see if it works.

Okay. Okay, it doesn't. What I can do is quickly set it up locally and show you in one second how it works.


Okay, so we are back. I'm in Cursor, same notebook, same setup. And now I can use it.

And now, basically, what happens if I, well, let's not ask about London, right? We are in San Francisco, so let's ask it about the weather in San Francisco. It now does all of the different function calls and loops behind the scenes. The MCP server is very simple; it uses the Open-Meteo API, which is free to use at a small scale for testing.

So we generate the output and run all of the MCP calls; it might take a bit longer for the API call. And then we get back: okay, the weather in San Francisco today will be 70 degrees, and then all of the other numbers. And that's basically all you need to connect an MCP server with the Gemini SDK. It fits on a single screen, which makes it easy enough for you to get started.

We used a locally running MCP server here, but it works the same way with a remote MCP server. That's also the exercise for this part of the workshop: DeepWiki, from the Cognition AI folks behind Devin, have a very nice remote MCP server which can talk to GitHub repositories.

So instead of creating a stdio client, you can use the streamable HTTP client and connect a remote MCP server to Gemini to talk to it. And yeah, you basically benefit from one of the advancements currently happening in AI.
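A hedged sketch of the remote variant, assuming the MCP Python SDK's streamable HTTP client; the DeepWiki endpoint here is only an example, so check the exercise notebook for the exact URL:

```python
import asyncio
from google import genai
from google.genai import types
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

client = genai.Client()

async def main():
    # Remote MCP server over streamable HTTP instead of a local stdio process.
    async with streamablehttp_client("https://mcp.deepwiki.com/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            response = await client.aio.models.generate_content(
                model="gemini-2.5-flash",
                contents="What is the modelcontextprotocol/python-sdk repository about?",
                config=types.GenerateContentConfig(tools=[session]),
            )
            print(response.text)

asyncio.run(main())
```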

Do we have more questions? Yeah. So ADK, for those of you who don't know, is the Agent Development Kit. It's an agent library which adds a lot more abstraction on top of the client SDK we used. It makes it a lot easier to do all of the tool calling, integrates MCP servers as well, and makes it much easier to manage multiple agents or deploy to the cloud.

So it has a lot of, I would say, batteries included. Whether to use it again depends on whether you would like to use frameworks or rather prefer building things yourself. I see a big benefit in getting started very quickly using agentic frameworks, but the more abstraction you add, right, the less you know at first.

And then maybe you need to dig a bit deeper later, and all of the speed you gained in the beginning becomes your slowdown in the future. But ADK is definitely a great way to start. They also have tons of examples, and it has support for Gemini and the Gemini API.

So it's definitely a good way to take a look. Yep. No, I don't think so. But what you can do with YouTube links or YouTube videos is ask it to return timestamps as part of the response, and it works very nicely; it's very accurate. Because a video is basically more or less just many images after each other, right?

And currently, when we process videos, it is done at one frame per second, and you always have the timestamp, the image, the timestamp, the image. That's how Gemini knows exactly at what place in the video, I don't know, a man is jumping or dancing or something.

So it's analyzing one frame per second? Yes. And then describing it? It depends; I mean, it's part of that section. Can I use that to summarize videos? Yeah. Basically, what happens with videos is that videos are 24 frames per second or even more, which would easily let the context explode.

So currently what is done is that we take one frame per second. So if you have 60 seconds, it would be 60 images, which also makes it easy to count how many tokens it would be. And that's also how we can fit a one-hour-long video into 1 million tokens, basically.
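A small sketch of that workflow with the Files API; the file path is a placeholder, and the loop just waits until the uploaded video has finished processing:

```python
import time
from google import genai

client = genai.Client()

# Upload a local video (placeholder path) and wait until processing is done.
video = client.files.upload(file="my_video.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[video,
              "Summarize this video and give me MM:SS timestamps for the key moments."],
)
print(response.text)
```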

Is it using the transcript from YouTube, or is it doing speech to text? No, it's all natively multimodal. Okay, so it's not using... No, I mean, you can easily upload normal .mp4 files, and you can ask it what was said in the video. Yeah, I tried that for Dutch speech.

Okay, cool. Yeah. So maybe for you, if you don't know: memory is, next to tools, basically what makes an agent an agent, and for memory we have short-term memory and long-term memory. Short-term memory is basically your conversation, which is part of your current state when you talk to an agent, and long-term memory is information about a previous interaction or something about the user.

Long-term memory is not part of the LLM; we need to provide it externally. And mem0 is something I looked at because it's very nice; it does this implicitly. So basically, it takes the conversation and tries to create a nice abstraction for you, which you can include.

I'm not sure what the current state is, or how well it works with Gemini. What I always see, or where Gemini really shines, is the long context. So it comes down to how good your information extraction is and what you provide to the model based on the memory.

I think that works very well; I'm just not sure about those tool-calling memory kinds of systems. Yep. Yeah. So we are working on 2 million context. I'm not sure when it will be generally available. There was research, but you know, the bigger you go, the more expensive it gets.

And do you want to pay, like, $50 per 1 million tokens? Yeah. More questions, or is everyone working? Yep. So, before MCP, we needed to define the functions? Yeah. And call them ourselves? Yeah. And maybe also decide the logic of how to call them ourselves as well?

No. No. So the difference with MCPs now and what we had previously is that we don't need to write and define the functions and the function declarations. The LLM sees at the end the exact same thing. So if I create my function declarations manually with a JSON schema or if I retrieve them from an MCP server, those are the same things for the LLM.

It just makes it much easier to not need to rewrite the same functionalities over and over again, right? If you work at a company and you want to integrate APIs, which you have internally, there's a pretty big chance that two teams write the same function declaration and the same wrapper to call it.

And the idea here is that only one team needs to write it and everyone benefits from it. The same goes for public APIs like Google Maps or Google Drive. So MCP gives us a way to collaboratively create the best, or at least a standard, way to call them.

And then everyone can implement the MCP tools they need for their use case. So an MCP server, for example, exposes four tools. You provide all four input schemas, which are like function declarations with a name, a description, and the parameters, to your LLM call. And then, based on the prompt and those declarations, the LLM, in this case Gemini, decides whether it should call a tool, which tool to call, or whether it should call multiple tools.

And that's the same logic we have for normal function calling, so there's no difference in that way. The only difference we see is that, with MCP servers being easily available and implementing many tools, people get very relaxed about how many tools to add, right?

So we see people with 50 or 100 tools, so we need to improve our LLMs so that they really know, okay, which is the right tool to use when you have 50 different tools available at once. So, there are examples for ADK working with A2A.

A2A stands for the Agent to Agent protocol, which is done by the Google Cloud team. The idea is to allow companies to build agents in different frameworks like LangChain, LangGraph, or LlamaIndex and then have an easy way to create multi-agent systems, so that one agent can call another agent without needing to implement complex logic.

The Gemini SDK doesn't support it directly, but there are great examples for it. Cool. So, we are working on browser use or computer use use cases, which is available in preview and which we are currently testing with a few companies. That would allow Gemini to control a UI, so it could basically go to whatever website you want it to go to.

And then there's, sorry, the URL context tool, where you can provide a website and we programmatically try to extract information from it. Of course, if it's a super heavy JavaScript website, there's not much to extract; that's where the browser use agent would then be useful.

Like an API for Mariner or something? Yeah, kind of. Hopefully coming soon. So it will return structured outputs again, based on what you provide, so you can control the local environment. But we are also working with the Cloud Run team to make it super easy for you to run, so that you can run a Chrome instance on Cloud Run which Gemini can talk to and control, or you can control a local instance.

Yeah, yeah. That's a good question. I think the whole industry is currently trying to answer what the best way to handle agent authentication is. MCP as a protocol itself supports OAuth. So basically, when you want to connect to a protected MCP server, you get back a not-authorized response, which can then trigger an OAuth flow on your client.

Maybe you have seen it in Claude Desktop, where you get a pop-up to log in with, I don't know, your Atlassian account or something. So that's one way of doing it, but it definitely needs more work, right? Do you want your agent to access all of your emails or all of your GitHub repositories, or how can you scope it down to only one specific repository? But they are currently actively working on it.

And I know that Auth0 will be here tomorrow as well; those folks are great to talk to, and I know that they're doing a lot there. So, yeah. I was wondering if you could say something about the citations and how that's different from URL context.

I was struggling with trying to build an application on it, and I'm also just sort of curious about how you trigger it. You mean citations when you just send a regular prompt, or citations when you use Google Search and then have citations?

I think very specifically the citation metadata in the output. Yep. So that's available with the Google Search tool, and it has information about which websites were used for retrieval. You can click on the link directly to see where you land. And then there's this metadata, the grounding chunks and grounding supports, which basically have a start index and an end index into the answer,

and also which sources were used. So you can technically highlight the text or put numbers behind it, to help users understand: okay, that part of the response is generated based on those two links.
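As a sketch, assuming a `response` that was generated with the Google Search tool enabled, that highlighting could look roughly like this (insert the markers from the end backwards so the indices stay valid):

```python
# Assumes `response` came from a generate_content call with the Google Search tool.
meta = response.candidates[0].grounding_metadata
text = response.text

# Walk the supports from the end so earlier indices stay valid while inserting.
for support in sorted(meta.grounding_supports,
                      key=lambda s: s.segment.end_index, reverse=True):
    markers = "".join(f"[{i + 1}]" for i in support.grounding_chunk_indices)
    end = support.segment.end_index
    text = text[:end] + markers + text[end:]

print(text)
for i, chunk in enumerate(meta.grounding_chunks):
    print(f"[{i + 1}] {chunk.web.title} - {chunk.web.uri}")
```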

And the other part is that if you use the Google Search tool, you get the citations there. But then the citation metadata, it seems like it's maybe guessing from the training data? You mean that in general there is sometimes citation metadata, or...? Sorry, which citation data do you mean? The one literally called citation metadata, not the grounding data. Okay. I only know that it's part of legal and training: if something is referred to directly, we need to provide it.

That's why it is there, and that's why it is not always there: it can be there, but it doesn't have to be. And it's not meant for the user, like, "hey, that's coming from there"; it's more like compliance, basically, which is why we need to have it. I only know about the web search feature.

They have a citation feature where, whatever you pass as input, it will quote the relevant chunk from it in the response. Ah, okay, so you provide a document as context, and it says where it's getting this from. I imagine you can kind of force that.

Yep. But I'm curious what the best practices are, and whether you could use something like that versus the URL context as another avenue you could go down. No, that's a very good question. Currently we don't have the same experience that Anthropic has. I guess the best approach is to really try different prompting strategies, depending on what you want to achieve.

Yeah, no, but it's good feedback. Sadly, not yet. Hopefully one day you can just use Gmail, Drive, and chat, but currently there is no public remote MCP server. I don't know exactly what the reasons are. I think the more people ask for it, the more likely it is that we get one.

So it's great feedback: the more people are going to use MCP servers, the higher the chances are going to be. And I think in general, for MCP to succeed, we need more first-party remote servers, right? Because we cannot build a secure GitHub MCP server;

that's something GitHub needs to do, because they know how their OAuth system works and how you can scope it. So we really need those first-party MCP servers in the long run. For AI Mode? I have no idea. So AI Mode, you mean inside Google Search?

Yeah. I think in general the same goes as for normal Google Search: have high-quality content and try to stick to web standards. I guess the same works here as well, but I don't know, sorry. Yeah. When making calls with the Files API, should it trigger the prompt caching on 2.5, or does that work?

It should. So if you upload a PDF, you need to make sure that you put the PDF at the first part of your prompt. The automatic caching works from the beginning to the end, right, so if you change the beginning, you can never cache the long document. But if you put the PDF at the beginning and only change the prompt behind it, it should work.
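A minimal sketch of that ordering, assuming implicit caching on 2.5 and a placeholder PDF: the large, stable document goes first, and only the question changes between calls.

```python
from google import genai

client = genai.Client()

# Upload the large, stable document once (placeholder file name).
report = client.files.upload(file="annual_report.pdf")

def ask(question: str) -> str:
    # PDF first, question last: the shared prefix stays identical across calls,
    # so implicit caching can kick in for the repeated document tokens.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[report, question],
    )
    return response.text

print(ask("Summarize the key risks."))
print(ask("What was the revenue growth?"))
```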

Yes. Yeah, I had a question about Google Search grounding as well. Sometimes there's a case where the index it provides, either the start or the end index, is actually out of range for the length of the... Really?

Yeah, the answer it provides, yeah. I was curious if you had any similar experiences. No, but if you have an example for us to reproduce and share, that would be very helpful. The only thing I know is that sometimes there can be a start index which is null, which basically means it's zero,

like it starts at the beginning. Yeah, sometimes I even get that for the start index. Yeah, if you have an example, please send it to me on Twitter or somewhere; that would be very helpful, because that should not be the case. And then, excuse me, the grounding metadata has two parts, the...

Yeah. Is there any way to actually get the citation? Or maybe that was the question the gentleman asked: the citation from the website, basically the text that it refers to, or not? Because you only get the website, actually, you just get the URL. Yeah, no. So currently you get the start and end of the response which refers to something, but not the part of the web page.

But please put it all together and send it to us. I'm very happy to talk to the Google Search team who is building the native tool; it's super helpful for them to better understand what you need. Okay, cool. Then, thanks all for coming.

Please continue with the workshop, try it out. If you have any questions, we are very happy to receive any positive or negative feedback, any ideas, any pain points you have. We are available on the usual social channels; you can find me, Philipp Schmid, basically everywhere. If not, open a GitHub issue or be very noisy when something doesn't work.

We always try to make sure we fix it. Cool. Thanks. Thank you. We'll see you next time.