
Veo 3 for Developers — Paige Bailey, Google DeepMind


Transcript

Thank you so much for having me. Thank you all for being here today and wanting to learn more about generative media. I'm going to keep this pretty quick. There's a lot to show, especially for some of the new features that we've released in Veo 3. But there's also a lot to discuss in terms of how this revolutionizes the way that people build things.

How do people create ads? And how do people replicate some of the experiences that you might see every day? So, as mentioned, I'm Paige. I am the engineering lead for our DevRel team at Google DeepMind. But I'm here today on behalf of our generative media team, who are all brilliant.

They are wonderful. Many of them could not be here today. But this is all their work. So, I just want to send a thank you to the heavens for everything that they've been building. Today we're going to be talking about three different models. So, Veo 3, which is our new video and audio generation model.

Imagen 4, which can generate these static images. And also Lyria 2, which is a music generation model. And stay tuned, there will be more of all of the above coming shortly. Veo is kind of magical in the sense that, you know, you can create videos of things that you've never seen before.

And we've seen some of these examples already. But it also has the potential to really revolutionize everything that we build and create as humans. And I loved this quote from Andrej Karpathy just recently: video has the potential to be an incredible surface for communication, but also for education and for human creativity.

So, all of these things are designed with that in mind. Veo 2, I just want to touch on it briefly before we get into the Veo 3 capabilities. We released a whole bunch of additional things around creative control just recently. So, think on the order of about a month ago, or not even that.

So, things like reference-powered videos, outpainting, the ability to add and remove objects, character control and consistency, which I think you all were talking about just a little while ago. And also the ability to interpolate across first and last frames. And let's take a look at what that means.

Because for people who are not necessarily filmmakers, it's much, much easier to show and not tell. So, reference-powered videos are things like: you have a person, you have an environment, and you're able to kind of put one within the other, or compose them together into something that feels very stylized, but also really well crafted.

So, here you can see reference-powered video with Veo 2. This is available via the API and via some of our tools like Flow today. And another example of reference-powered video: a really, really cute little monster in a variety of environments that you control just by describing them. It's performing pretty well on benchmarks.

So, you can see here, the green is Veo. So, compared to things like Runway Gen-4 and Kling, the more green, the better. And for reference-powered video, most human raters preferred Veo in some of these side-by-side comparisons. You can also match styles. So, upload a reference image and then have different styles composed together.

Another example of styles being preserved. And then also camera controls, the same ones you might have if you were a filmmaker. So, things like being able to move back, move right, rotate up, zoom in, and to be able to precisely control all of these camera movements. Again, just via natural language and through some of the tools that are available in the APIs.

When I saw all of this, I was blown away, because I don't think that we have nearly enough code samples demonstrating some of these capabilities. But these are all things that you can do with the Veo 2 models today with the APIs.
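
For a concrete sense of what those API calls look like, here is a minimal text-to-video sketch using the google-genai Python SDK. The model ID and config fields are assumptions based on the publicly documented Veo 2 API and may differ for your access tier, so treat this as a sketch rather than the exact sample shown on the slide.

    import time
    from google import genai
    from google.genai import types

    # The client picks up GOOGLE_API_KEY / GEMINI_API_KEY from the environment.
    client = genai.Client()

    # Kick off a long-running video generation job (model ID is an assumption;
    # check the current docs for the Veo model available to you).
    operation = client.models.generate_videos(
        model="veo-2.0-generate-001",
        prompt="a raccoon wearing a black jacket dancing in slow motion "
               "in front of the pyramids",
        config=types.GenerateVideosConfig(aspect_ratio="16:9"),
    )

    # Video generation takes a while, so poll the operation until it finishes.
    while not operation.done:
        time.sleep(10)
        operation = client.operations.get(operation)

    # Download the first generated clip to disk.
    video = operation.response.generated_videos[0]
    client.files.download(file=video.video)
    video.video.save("raccoon.mp4")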

We also have the ability to do outpainting. This was important for a recent project with the Sphere around The Wizard of Oz. So, being able to take a scene or an individual frame of a video and imagine what the rest of the scene might look like. So, even if you only have a view into a small portion, you can create something that looks real, or that looks consistent, across the outpainting.

Adding objects or removing objects from scenes. So, you can see here a few examples as well. And again, all of these are available for you to test and to try today. These are all things that our research team has gotten into the API designs. Some more examples of removing objects.

Character control is quite nice. You might have seen some of these demonstrated in your favorite products, for kind of controlling mouth movements, controlling reference-based movements for particular characters. We also give you the ability to add a script and a voice tone, and to have the character's lips map to that sound, producing it in a way that feels consistent with the location.

These are some of the motion examples with Veo 2 and with Veo 3. So, being able to have an input image and control or change the design across the scene. More benchmarks. And then the first and last frame: you can have an input image and an output image, and Veo is able to interpolate across them, to kind of stitch those images together into a video.

And those are just another couple of examples. I feel like generative media presentations are very gratifying; these are certainly the most beautiful things that we get to see at developer conferences. So, Veo 3. Everything that you just saw was Veo 2. Veo 3? Right, like, blown away. So, Veo 3 is video, but coupled together with audio.

And so all of the tokens are composed together natively. It's not audio being pulled in as a tool; the model is actually able to compose together all of these tokens across multiple modalities. This is similar to what you see with Gemini's native audio output: in addition to being able to output text and code, you can also output images, edit images, edit audio, compose audio, etc.

So, Veo 3 is our latest state-of-the-art video generation model. It has these things around prompt adherence and native audio generation. But again, it's so much cooler to show and not tell. The little llama. So, interestingly, you're able to do not just background noises, but also things like music, including very, very subtle sounds.

And the like. So, let's go to the next one. Veo 3. There we go. So, this is hard, right? Like, it looks very cool. It's very hard to capture the nuances of an input prompt. And it's also historically been really hard to preserve visual consistency. So, characters often, like, jump from one frame to another.

There might be backgrounds, and then suddenly walls disappear and you're able to see behind them. This is one of the reasons why Veo 3 feels like a leap forward: the stylistic consistency, and also the contextual consistency, is much, much better. It's built on years of research, so things like GQN, WALT, etc.

And it has responsibility at its core. So, you can see little human-visible watermarks, as well as SynthID watermarks, for synthetically generated images and video. We've also been partnering really closely with many, many artists along the way. So, Darren Aronofsky, also musicians for the Lyria models, artists for the Imagen models.

And we'll take a look at a couple of these as well. So, Imagen is image generation, you know, able to kind of preserve realism. Everything from humans to whales, you know, cute puppies. I've heard that the more cute puppies you have in a presentation, the better it always is.

And then also being able to preserve detail across all of these images as well, including diverse styles and even things like typography. So, I love these stamps of Alamo Square and the Mission. I really, really wish that we just had these as swag: stickers for laptops, or stickers just in general.

And another example of an artist that the team has been really closely collaborating with, Ross Lovegrove, on some of his designs. So, Lyria 2, also very exciting. It's high-fidelity music and professional-grade audio. It also gives you very granular creative control: the ability to steer the inputs and outputs, and to steer the tones and the styles of the music along the way.

Music AI Sandbox is one of the products that's been created as a visual for this. If folks are familiar with Ableton or things like it, this probably looks very similar. And then there's also MusicFX, which is a project from our Labs team that allows you to kind of compose together beats just via natural language.

And we used that for the demo later on today. Lyria RealTime has also been a deep collaboration with many musicians, both Jacob Collier, who's a legend, and Toro y Moi. And me circa college was, like, blown away that Toro y Moi was at Google I/O. Huge fan.

I do think that a big part of teaching music is giving people a chance to play music and play with music. However, people don't have access to the whole of music from day one. What I think this offers an interesting perspective into is the whole of music: mathematics, physics, history, geography, the human body, language, spelling, syntax.

One thing I've come to realize is that a lot of the same forces that make music work are the forces that make life work. Oh, that was not Veo 3. Some parts, but it's hard to tell, right? Yeah, you know, parts of it might have included visuals generated by Veo 3, though.

So Lyria, again, built in collaboration with the creative industry, not outside it, and it also incorporates many of these techniques like SynthID, to make sure that you have some sort of digital watermarking for the assets themselves. So now we're going to get into it. These are some examples that I thought might be fun to share, to show just how far we've come in the last couple of years. Because I think, being here, we get very sucked into the Bay Area bubble, and we don't really take a step back and appreciate how far the world has changed in just a matter of months.

So this is an example of one of the papers that was produced around 2023. So, released in 2023, research happening around 2022, for text-to-video: a raccoon wearing a black jacket dancing in slow motion in front of the pyramids. So just have that in your brain when you see it.

This is WALT, circa 2023. So very, very choppy; it's really hard to get even, like, that's not even eight seconds' worth of frames. LTX-Video from 2024, one of the ones that are available on Hugging Face. Kling 2.0, which I ran with fal. Heck yeah.

Yep. Yeah. So, which I ran with fal, released in 2025. And then Veo 2, 2024: a very cute little raccoon, but I'm not sure how well he's dancing. But this is just kind of a smattering of how the world has changed in the space of just a couple of years. And then, when you put the same prompt through Veo 3, you get this.

Very stylish raccoon. So image to video, transforming static images into dynamic video content. So you can see here an image of a woman and her puppy, a woman I can only assume is in Texas, walking very slowly forward on the way to a gunfight. And also different stylized images of a person in a single frame being applied to different scenarios.

So, running towards the camera, a tractor beam taking off, being lifted into the sky, again, all steered via natural language. Prompt rewriting is something that we've also released in Veo 3: the ability to take that very, very simple sentence from before, a raccoon wearing a black jacket dancing in slow motion in front of the pyramids, and turn it into something a bit more fully formed that Veo is much better equipped to understand.
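
The rewriting itself happens inside the Veo 3 service, but if you want to play with the idea yourself, a rough equivalent is to ask Gemini to expand the one-liner before you submit it. Everything below, including the instruction wording and model name, is my own hypothetical sketch, not the production rewriter.

    from google import genai

    client = genai.Client()

    simple_prompt = ("a raccoon wearing a black jacket dancing in slow motion "
                     "in front of the pyramids")

    # Ask Gemini to flesh out the one-liner into a shot description that also
    # covers camera movement, lighting, and sound, roughly the idea behind the
    # built-in rewriter on the server side.
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=(
            "Rewrite this video idea as a detailed prompt for a video model. "
            "Describe the subject, setting, camera movement, lighting, mood, "
            "music, and sound effects in a few sentences: " + simple_prompt
        ),
    )

    print(response.text)  # feed the expanded prompt to Veo instead of the one-liner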

And so, have that in your brain, as well as the concept of sound generation. So, both music, sound effects, and background noises. And we'll take a look at what that simple prompt is now. I still, like, get blown away by how much detail there is; you can almost see the reflections in the eyes as they walk forward.

And then this is from our team in Paris, and they are very, very excited to have it shared today. Does anybody know French in the audience? Okay. Amazing. The rest of us will have a translation in a second. It's like Daft Punk. I can't believe this new Veo model.

It is amazing. I can't believe this new Veo model. It is amazing. I can't believe this new Veo model. It is amazing. I can't believe this new Veo model. It is amazing. I can't believe this new Veo model. It is amazing. I can't believe this new Veo model. It is amazing.

I can't believe this new Veo model. It is amazing. It was a good Veo model. You know, I don't believe there is a good or a bad Veo model. So, what do you think now about this artificial intelligence champion? I think the trainer's strategy is good, not on the technical side and the tactical side, but the important thing is the Veo 3.

Boss, the humans are on the point of creating AGI. Do we have to contact them? What do they do? They generate starter-pack images and make stupid videos. They are not ready yet. Veo can make us sing anything. And for anyone who is curious, the translation is: so, "veo, veo"

means "I see, I see" in Spanish. And then the guy responded with, yeah, I don't understand what you're saying, because I'm French, actually. Artificial intelligence, artificial intelligence. And then, also, that's the interview. And: Boss, the humans are about to create AGI, should we contact them? They're not ready yet.

So, amazing. Cool. Actually, I'm going to zoom to the next one. So, how do you access it? Big question. Right now, we have a few different ways to access the Veo 3 models. One is through the Google AI Ultra plan, which is available in many different countries, including the UK just recently.

Also Google AI Pro subscribers, being able to access it via the Gemini mobile app for a limited number of uses. And then, Veo 3 is available in private preview, currently in Vertex AI. Veo 2 is also available via Vertex AI and the Gemini API. Hoping to bring them to AI Studio.

Crossing fingers, but we'll see. And then, you can also fill out a form for early access if you would like it. QR code coming shortly. So, take a picture of that if you would like, with the form to go submit and test it out. This is also, while you're looking, a code sample of how easy it is to use: just your output bucket, where you would like the video to be deposited.

If you have an input image that you want to use as a starter, some things around aspect ratios, and the like. So, toggling between different models, or being able to specify some of these options, is just a handful of lines of code, which is pretty magical.
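
As a rough stand-in for that slide, here is what those handful of lines might look like against Vertex AI with the google-genai SDK. The project, location, bucket paths, and model ID are all placeholders, and the exact config field names may differ from what was on screen.

    import time
    from google import genai
    from google.genai import types

    # Vertex AI client; project and location are placeholders.
    client = genai.Client(vertexai=True, project="my-project", location="us-central1")

    operation = client.models.generate_videos(
        model="veo-2.0-generate-001",  # swap in a Veo 3 model ID once you have preview access
        prompt="she walks slowly toward the camera at golden hour, dust drifting in the light",
        # Optional starter image for image-to-video.
        image=types.Image(
            gcs_uri="gs://my-bucket/inputs/starter-frame.png",
            mime_type="image/png",
        ),
        config=types.GenerateVideosConfig(
            aspect_ratio="16:9",
            # The output bucket where the generated video gets deposited.
            output_gcs_uri="gs://my-bucket/outputs/",
        ),
    )

    while not operation.done:
        time.sleep(15)
        operation = client.operations.get(operation)

    print(operation.response.generated_videos[0].video.uri)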

So, this was intended to be a live demo, and I'm not sure if I can tempt the demo gods. And also, I'm probably, like, close to being over time. But I wanted to see how well I could replicate a commercial with Veo that seemed pretty simple. So, the commercial is this one, which took my name: Hey, my name's Paige.

And what makes the Chick-fil-A chicken... So, that was the input video. The process for replicating it with Veo 2 is: you give Gemini the original video, have it create a really, really detailed plan, and segment that into prompts to handle the eight-second limitation. I used MusicFX to create the background track, which was a combination of down-home farm and slow guitar.
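
That "give Gemini the video and have it plan the shots" step might look something like the sketch below. The model name, file name, and prompt wording are all my own assumptions; it's just one way to get a shot list that respects the eight-second clip limit.

    import time
    from google import genai

    client = genai.Client()

    # Upload the reference commercial and wait for it to finish processing.
    video_file = client.files.upload(file="original_commercial.mp4")
    while video_file.state.name == "PROCESSING":
        time.sleep(5)
        video_file = client.files.get(name=video_file.name)

    # Ask Gemini for a shot-by-shot plan, chunked to fit the ~8 second clip limit.
    plan = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            video_file,
            "Break this commercial into a numbered list of shots, each no longer "
            "than 8 seconds, and write a detailed video-generation prompt for "
            "each one, including dialogue, camera movement, and sound.",
        ],
    )

    print(plan.text)  # each numbered prompt then goes to Veo as its own clip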

And then I put it all in Camtasia and stitched it together with transitions. The whole process was relatively quick, but it also took a lot of thinking and a lot of work to get to that final assembly stage. And it looked like this: "Hey, my name's Paige. And what makes the Chick-fil-A chicken sandwich original to me is the crispiness of the breading and the tenderness of the filet."

It's tasty. It's warm. It's total satisfaction. And that was, again, using Veo 2, using Gemini text-to-speech, and stitching it all together myself. And I actually like that one better than the original commercial, but your mileage may vary. The process with Veo 3: you have the original video, you generate the description, you give it to Veo 3, and you see how well it does.

This took just the span of, like, submitting the prompt to Veo 3: "Hey, my name's Paige. And what makes the Chick-fil-A chicken sandwich original to me is the crispiness of the breading and the tenderness of the filet." So, again, one prompt. And it was able to produce this: "Hey, my name's Paige, and what makes the Chick-fil-A chicken sandwich original to me is the crispiness of the breading and the tenderness of the filet." Incredible.

So, takeaways. Veo 3: pretty magical. We're committed to expanding access as quickly as we can and adding controls around durability. And thank you so much. That is not me actually waving at the camera; that is a static photo of me that has been animated with Veo 3. Excellent. Thank you. Thank you.

We'll see you next time.