Veo 3 for Developers — Paige Bailey, Google DeepMind

00:00:00.000 |
Thank you so much for having me. Thank you all for being here today and wanting to learn more 00:00:19.180 |
about generative media. I'm going to keep this pretty quick. There's a lot to show, especially 00:00:23.880 |
for some of the new features that we've released in Veo 3. But there's also a lot to discuss in 00:00:29.460 |
the frame of how does this revolutionize the way that people build things? How do people 00:00:34.960 |
create ads? And how do people replicate some of the experiences that you might see every 00:00:40.140 |
day? So, as mentioned, I'm Paige. I am the engineering lead for our DevRel team at Google DeepMind. 00:00:45.900 |
But I'm here today on behalf of our generative media team, who are all brilliant. They are 00:00:52.080 |
wonderful. Many of them could not be here today. But this 00:00:58.920 |
this is all their work. So, I just want to send a thank you to the heavens for everything 00:01:03.800 |
that they've been building. Today we're going to be talking about three different models. 00:01:07.420 |
So, Veo 3, which is our new video and audio generation model. Imagen 4, which can generate these static 00:01:12.940 |
images. And also Lyria 2, which is a music generation model. And stay tuned. There will be more of 00:01:18.980 |
all of the above coming shortly. Veo is kind of magical in the sense that, you know, you can create videos of 00:01:26.540 |
things that you've never seen before. And we've seen some of these examples already. 00:01:31.420 |
But it also has the potential to really revolutionize everything that we build and create as humans. 00:01:37.260 |
And I loved this quote from Andrej Karpathy just recently. Video has the potential to be an 00:01:43.180 |
incredible surface for communication, but also for education and for human creativity. So, all of these 00:01:49.660 |
things are designed with that in mind. Veo 2, just want to touch on it briefly before we get into the Veo 3 00:01:56.540 |
capabilities. We released a whole bunch of additional things around creative control just recently. So, 00:02:02.940 |
think on the order of about a month ago or not even that. So, things like reference-powered videos, 00:02:08.540 |
outpainting, the ability to add and remove objects, character control and consistency, 00:02:13.420 |
which I think you all were talking about just a little while ago. And also the ability to interpolate 00:02:18.620 |
across first and last frames. And let's take a look at what that means. Because for people who are not 00:02:23.340 |
necessarily filmmakers, it's much, much easier to show and not tell. So, reference-powered videos are 00:02:29.900 |
things like: you have a person, you have an environment, and you're able to kind of put one 00:02:36.380 |
within the other or compose them together into something that feels very, very stylized, but also 00:02:43.420 |
like really, really well crafted. So, here you can see reference-powered video with Veo 2. This is 00:02:50.780 |
available via the API and via some of our tools like Flow today. And another example of reference-powered 00:02:58.780 |
video. So, a really, really cute little monster in a variety of environments that you just control 00:03:04.540 |
by describing. It's performing pretty well on benchmarks. So, you can see here, the green is Veo. So, 00:03:13.100 |
compared to things like Runway Gen 4 and Kling, the more green, the better. And for reference-powered 00:03:20.540 |
video, most human raters selected Veo in some of these side-by-side comparisons. You can also match 00:03:28.700 |
styles. So, upload a reference image and then have different styles composed together. 00:03:34.540 |
Another example of styles being preserved. And then also camera controls, the same as you might have 00:03:41.820 |
if you were a filmmaker. So, things like being able to move back, move right, rotate up, zoom in, 00:03:47.340 |
and to be able to precisely control all of these camera movements. Again, just via natural language and 00:03:53.580 |
through some of the tools that are available in the APIs. When I saw all of this, I was blown 00:03:58.620 |
away because I don't think that we have nearly enough code samples demonstrating some of these 00:04:02.780 |
capabilities. But these are all things that you can do with the Veo 2 models today with the APIs. 00:04:08.940 |
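For anyone who wants to try these capabilities, a rough sketch of a text-to-video call with the google-genai Python SDK follows; the Veo 2 model ID, config fields, and polling pattern reflect the public SDK surface rather than anything shown on the slides, so treat them as assumptions to verify against the current docs.

```python
# Sketch: text-to-video with Veo 2 via the google-genai SDK (model ID and
# config fields are assumptions; check the current docs before relying on them).
import time
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

operation = client.models.generate_videos(
    model="veo-2.0-generate-001",  # assumed Veo 2 model ID
    prompt=(
        "A small knitted monster explores a mossy forest; "
        "the camera slowly zooms in, then rotates up toward the canopy."
    ),
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        number_of_videos=1,
    ),
)

# Video generation is a long-running operation, so poll until it finishes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

video = operation.response.generated_videos[0]
client.files.download(file=video.video)
video.video.save("monster_forest.mp4")
```

The same call shape covers the camera-control examples above; only the prompt text changes.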
We also have the ability to do outpainting. This was important for a recent project with the Sphere 00:04:14.540 |
around The Wizard of Oz. So, being able to take a scene or an individual frame of a video 00:04:22.460 |
and imagine what the rest of the scene might look like. So, even if you only have a view into a small 00:04:28.140 |
portion, being able to create something that looks real or that looks consistent across the outpainting. 00:04:36.780 |
Adding objects or removing objects from scenes. So, you can see here a few examples as well. 00:04:45.420 |
And again, all of these available for you to test and to try today. These are all things that exist 00:04:51.660 |
that our research team has kind of gotten into the API designs. Some more examples of removing objects. 00:05:00.460 |
Character control is quite nice. You might have seen some of these demonstrated in your favorite products 00:05:06.460 |
for kind of controlling mouth movements, controlling reference space movements for particular characters. 00:05:14.380 |
We also give the ability to add a script and kind of a voice tone, and to have the character map 00:05:22.300 |
its lips to produce that sound in a way that feels consistent with the location. 00:05:30.220 |
These are some of the motion examples with Veo 2 and with Veo 3. So, being able to have an input image 00:05:38.540 |
and controlling them or changing the design across the scene. More benchmarks. And then the first and last 00:05:48.380 |
frame. So, you can have an input image and an output image, and Veo is able to interpolate across them 00:05:55.980 |
to kind of make those images stitched together into a video. And those are just another couple of examples. 00:06:05.660 |
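As a sketch of how that first-and-last-frame interpolation might be wired up on Vertex AI: the image and last_frame fields below are assumptions about how the feature is exposed in the SDK, so check the current Veo docs before relying on them.

```python
# Sketch: interpolate between a first and last frame with Veo 2 on Vertex AI.
# The `last_frame` config field is an assumption about how this is exposed.
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

first = types.Image(image_bytes=open("first_frame.png", "rb").read(), mime_type="image/png")
last = types.Image(image_bytes=open("last_frame.png", "rb").read(), mime_type="image/png")

operation = client.models.generate_videos(
    model="veo-2.0-generate-001",  # assumed model ID
    prompt="The paper kite drifts from the beach up into a stormy sky.",
    image=first,  # starting frame
    config=types.GenerateVideosConfig(
        last_frame=last,  # assumed field for the ending frame
        aspect_ratio="16:9",
    ),
)
# Poll the long-running operation as in the earlier sketch, then download the clip.
```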
I feel like generative media presentations are very gratifying. These are certainly the most beautiful 00:06:11.660 |
things that we get to see at developer conferences. So, Veo 3. Everything that you just saw was Veo 2. This is Veo 3. 00:06:21.740 |
Right? Like, blown away. So, Veo 3 is video coupled together with audio. And so, all of the tokens 00:06:29.100 |
composed together natively, not audio being pulled in as a tool, but the model actually able to compose 00:06:36.140 |
together all of these tokens across multiple modalities. This is similar to what you see with 00:06:40.460 |
Gemini's native audio output. In addition to being able to output text and code, you can also output images, 00:06:46.460 |
edit images, edit audio, compose audio, etc. So, Veo 3 is our latest state-of-the-art video generation model. 00:06:54.700 |
It has these things around prompt adherence and native audio generation. But again, so much cooler to show 00:07:04.380 |
and not tell. The little llama. So, interestingly, you're able to do not just background noises, but also 00:07:27.740 |
things like music. Including very, very subtle sounds. And the like. So, let's go to the next. 00:07:51.020 |
Veo 3. There we go. So, this is hard, right? Like, it looks very cool. It's very hard to capture the 00:07:57.900 |
nuances of an input prompt. And it's also been really historically very hard to preserve visual 00:08:04.060 |
consistency. So, characters often, like, jump from one frame to another. There might be backgrounds, 00:08:11.820 |
and then suddenly walls disappear and you're able to see behind them. This is one of the reasons why Veo 3 00:08:17.900 |
feels like a leap forward: the stylistic consistency and also the contextual consistency 00:08:23.340 |
are much, much better. It's built on years of research, things like GQN, WALT, etc. And it has 00:08:32.140 |
responsibility at its core. So, you can see little human-visible watermarks as well as SynthID watermarks 00:08:38.300 |
for synthetically generated images and video. We've also been partnering really closely with many, many, 00:08:45.180 |
many artists along the way. So, Darren Aronofsky, also musicians for Lyria models, artists for the Imagen 00:08:54.540 |
models. And we'll take a look at a couple of these as well. So, Imagen is image generation. You know, 00:09:02.220 |
able to kind of preserve realism. Everything from humans to whales, you know, cute puppies. I've heard that the 00:09:12.060 |
more cute puppies that you have in a presentation, the better it always is. And then also being able to 00:09:17.900 |
preserve detail across all of these images as well, including diverse styles and even things like 00:09:24.620 |
typography. So, I love these stamps of Alamo Square and the Mission. I really, really wish that we just had 00:09:30.060 |
these as swag, like stickers for laptops or stickers just in general. And another example of an 00:09:41.100 |
artist that the team has been really closely collaborating with, Ross Lovegrove, on some of his designs. 00:09:51.100 |
So, Lyria 2, also very exciting. It's high-fidelity music and professional-grade audio. It also gives you a 00:09:59.500 |
very, very granular creative control. So, the ability to steer the inputs and outputs and to steer the tones 00:10:05.820 |
and the styles of the music along the way. Music AI Sandbox is one of the products that's been created as an interface for this. If folks are familiar with Ableton 00:10:15.980 |
or things like it, this probably looks very similar. And then there's also MusicFX, 00:10:23.340 |
which is a project from our Labs team that allows you to kind of compose together beats just via 00:10:29.660 |
natural language. And we used that for the demo later on today. Lyria RealTime has also been a deep 00:10:37.580 |
collaboration with many musicians, including Jacob Collier, who's a legend, and Toro y Moi. And me circa 00:10:46.700 |
college was, like, blown away that Toro y Moi was at Google I/O. Huge fan. 00:10:55.020 |
I do think that a big part of teaching music is giving people a chance to play music and play with music. 00:11:02.940 |
However, people don't have access to the whole of music from day one. What I think this offers an 00:11:07.980 |
interesting perspective into is the whole of music: mathematics, physics, history, geography, the 00:11:15.660 |
human body, language, spelling, syntax. One thing I've come to realize is that a lot of the same forces 00:11:22.460 |
that make music work are the forces that make life work. 00:11:24.860 |
Oh, that was not Veo 3. Some parts, but it's hard to tell, right? Like, yeah, yeah. 00:11:34.060 |
Yeah, you know, parts of it might have included visuals generated by Veo 3, though. 00:11:40.860 |
So Lyria, again, built in collaboration with the creative industry, not outside it, and then also 00:11:48.220 |
incorporates many of these techniques like SynthID to make sure that you have some sort of digital 00:11:52.220 |
watermarking for the assets themselves. So now we're going to get into it. These are some 00:11:58.860 |
examples that I thought might be fun to share to show just how far we've come in the last couple of 00:12:03.100 |
years, because I think being here, we get very sucked into the Bay Area bubble, and we don't really kind 00:12:07.740 |
of take a step back and appreciate how far, you know, the world has changed in just a matter of months. 00:12:14.140 |
So this is an example of one of the papers that was produced around 2023. So released in 2023, 00:12:21.580 |
research happening around 2022, for text to video, a raccoon wearing a black jacket dancing in slow 00:12:27.100 |
motion in front of the pyramids. So just have that in your brain when you see it. This is WALT, circa 2023. 00:12:35.660 |
So very, very choppy, and really, really hard; that's not even eight seconds' worth of 00:12:42.940 |
frames. LTX-Video from 2024. This is one of the ones that was available on Hugging Face. Kling 2.0, 00:12:49.900 |
which I did with fal. Heck yeah. Yep. Yeah. So, that one was released in 2025. And then Veo 2 from 2024, 00:13:00.700 |
a very cute little raccoon, but I'm not sure how well he's dancing. But this is just kind of a smattering 00:13:06.380 |
of how the world has changed in the space of just a couple of years. And then, when you put the same prompt into Veo 3... 00:13:20.300 |
Very stylish raccoon. So image to video, transforming static images into dynamic video content. So you 00:13:28.060 |
can see here an image of a woman and her puppy, a woman I can only assume is in Texas, walking very 00:13:37.180 |
slowly forward on the way to a gunfight. And also different stylized images of a person in a single 00:13:44.540 |
frame being applied to different scenarios. So running towards the camera, tractor beam taking off, 00:13:52.060 |
being lifted into the sky, again, all steered via natural language. Prompt rewriting is something that 00:13:57.420 |
we've also released in Veo 3. So the ability to take that very, very simple sentence before of a raccoon 00:14:03.820 |
wearing a black jacket dancing in slow motion in front of the pyramids, and turn it into something a bit more 00:14:09.900 |
fully formed that Veo is much better equipped to understand. And so have that in 00:14:18.620 |
your brain, as well as the concept of sound generation. So both music, sound effects, and 00:14:23.580 |
background noises. And we'll take a look at what that simple prompt is now. 00:14:38.300 |
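Veo 3 does this prompt rewriting for you, but the idea is easy to illustrate by expanding a terse prompt yourself with Gemini before generation. This is only a sketch; the Gemini model ID and the instruction wording are assumptions, not part of the talk.

```python
# Sketch: expand a one-line video idea into a detailed, shot-level prompt,
# similar in spirit to the rewriting Veo 3 performs internally.
from google import genai

client = genai.Client()

terse = "a raccoon wearing a black jacket dancing in slow motion in front of the pyramids"

rewrite = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed model ID
    contents=(
        "Rewrite this one-line video idea as a detailed generation prompt. "
        "Describe the subject, camera movement, lighting, background music, "
        "and ambient sound effects:\n" + terse
    ),
)
print(rewrite.text)  # feed the expanded prompt to the video model
```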
I still, like, get blown away by how much detail there is; you can almost see the 00:14:43.420 |
reflections in the eyes for the things as they walk forward. And then this is from our team in Paris, 00:14:52.940 |
and they are very, very excited to have it shared today. Does anybody know French in the audience? 00:15:00.700 |
Okay. Amazing. The rest of us will have a translation in a second. 00:15:14.300 |
I can't believe this new Veo model. It is amazing. 00:15:29.020 |
It was a good Veo model. You know, I don't believe there is a good or a bad Veo model. 00:15:37.740 |
So, what do you think now about this artificial intelligence champion? 00:15:41.020 |
I think the trainer's strategy is good, not on the technical side and tactical side, but the important thing is Veo 3. 00:15:46.060 |
Boss, the humans are about to create AGI. Do we have to contact them? 00:15:49.900 |
They generate starter-pack images and make stupid videos. 00:16:02.380 |
And for anyone who is curious, the translation is: so "veo" means "I see" in Spanish; "veo, veo" is "I see, I see." 00:16:09.020 |
And then the guy responded with, yeah, I don't understand what you're saying because I'm French, actually. 00:16:14.060 |
Artificial intelligence, artificial intelligence. 00:16:18.300 |
And then, also, that's the interview. And Boss, the humans are about to create AGI, should we contact them? 00:16:32.860 |
Actually, I'm going to zoom to the next one. So, how do you access it? Big question. 00:16:41.660 |
Right now, we have a few different ways to access the Veo 3 models. One is through the Google AI 00:16:45.980 |
Ultra Plan, which is available in many different countries, including the UK just recently. 00:16:51.500 |
Then Google AI Pro subscribers are able to access it via the Gemini mobile app for a limited number 00:16:59.020 |
of uses. And then, also, Veo 3 is available in private preview, currently in Vertex AI. Veo 2 is 00:17:05.020 |
also available via Vertex AI and the Gemini APIs. Hoping to bring them to AI Studio. Crossing fingers, 00:17:12.060 |
but we'll see. And then, you can also fill out a form for early access if you would like it. QR code 00:17:18.380 |
coming shortly. So, take a picture of that if you would like, with the form to go submit and test it 00:17:25.820 |
out. This is also, while you're looking, a code sample of how easy it is to use: just your output 00:17:33.260 |
bucket where you would like the video to be deposited, an input image if you 00:17:38.460 |
want to use one as a starter, and some things around aspect ratios and the like. So, toggling between 00:17:46.220 |
different models or being able to specify some of these is just a handful of lines of code, which is 00:17:51.340 |
pretty magical. So, this was intended to be a live demo. I'm not sure if I can tempt the demo gods. 00:17:57.900 |
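The slide itself isn't reproduced here, but a sketch of that handful of lines, using the google-genai SDK against Vertex AI, might look like the following. The model IDs, bucket URIs, and the output_gcs_uri field are illustrative assumptions.

```python
# Sketch: Veo on Vertex AI with an output bucket, an optional input image,
# and an aspect ratio; swap the model string to toggle between Veo versions.
import time
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # or "veo-2.0-generate-001" (assumed IDs)
    prompt="A woman and her puppy walk slowly toward the camera at golden hour.",
    # Optional: start from an image instead of pure text.
    image=types.Image(gcs_uri="gs://my-bucket/starter.png", mime_type="image/png"),
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        output_gcs_uri="gs://my-bucket/veo-output/",  # where the video is deposited
    ),
)

while not operation.done:  # poll the long-running operation
    time.sleep(15)
    operation = client.operations.get(operation)

print(operation.response.generated_videos[0].video.uri)
```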
I'm also probably, like, close to being over time. But I wanted to see how well I could replicate a 00:18:04.620 |
commercial with Veo that seemed pretty simple. So, the commercial is this one, which shares my name. 00:18:12.860 |
Hey, my name's Paige. And what makes the Chick-fil-A chicken 00:18:15.100 |
So, that was the input video. The process for replicating it with Veo 2 is: you give 00:18:32.300 |
Gemini the original video, have it create a really, really detailed plan, and segment it into prompts to 00:18:39.500 |
handle the eight-second limitation. I used MusicFX to create the background track, which was a 00:18:46.060 |
combination of down-home farm and slow guitar. And then I put it all in Camtasia and stitched it together with 00:18:52.540 |
transitions. The whole process was relatively quick, but it also took a lot of thinking and a lot of work to get to 00:18:58.940 |
that final assembly stage. 00:19:04.460 |
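A sketch of that Veo 2 pipeline follows, under assumptions: Gemini segments the source ad into roughly eight-second shot prompts and Veo 2 renders each clip, while the MusicFX track and the Camtasia assembly remain the manual steps described above. File names and model IDs are illustrative.

```python
# Sketch: reproduce the workflow described above. Gemini plans shot-by-shot
# prompts from the original ad, then Veo 2 renders each ~8-second clip.
import time
from google import genai
from google.genai import types

client = genai.Client()

# 1. Upload the original commercial and wait for it to finish processing.
ad = client.files.upload(file="original_ad.mp4")
while ad.state.name == "PROCESSING":
    time.sleep(5)
    ad = client.files.get(name=ad.name)

# 2. Ask Gemini for a detailed plan: one video-generation prompt per shot.
plan = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed model ID
    contents=[ad, "Break this commercial into shots of at most eight seconds. "
                  "Return one detailed video-generation prompt per line."],
)
shot_prompts = [line for line in plan.text.splitlines() if line.strip()]

# 3. Render each shot with Veo 2, respecting the eight-second clip limit.
for i, shot in enumerate(shot_prompts):
    op = client.models.generate_videos(
        model="veo-2.0-generate-001",  # assumed model ID
        prompt=shot,
        config=types.GenerateVideosConfig(aspect_ratio="16:9"),
    )
    while not op.done:
        time.sleep(10)
        op = client.operations.get(op)
    clip = op.response.generated_videos[0]
    client.files.download(file=clip.video)
    clip.video.save(f"shot_{i:02d}.mp4")

# 4. Add the MusicFX background track and stitch the clips together in an editor.
```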
And it looked like this: Hey, my name's Paige. And what makes a Chick-fil-A chicken sandwich original to me is the crispiness 00:19:13.020 |
of the breading and the tenderness of the filet. It's tasty. It's warm. It's total satisfaction. 00:19:19.980 |
And that was, again, using Veo 2, using Gemini text-to-speech, and stitching it all together myself. And I 00:19:28.860 |
actually like that one better than the original commercial, but your mileage may vary. The process with 00:19:34.140 |
Veo 3: you have the original video, you generate the description, you give it to Veo 3, and you see how 00:19:39.180 |
well it does. This took just the span of, like, submitting the prompt to Veo 3. "Hey, my name's 00:19:47.180 |
Paige. And what makes the Chick-fil-A chicken sandwich original to me is the crispiness of the breading and 00:19:52.380 |
the tenderness of the filet." So, again, one prompt. And it was able to produce this. "Hey, my name's Paige. And what makes the Chick-fil-A chicken sandwich original to me is the crispiness of the breading and the tenderness of the filet." 00:20:07.180 |
Incredible. So, takeaways. Veo 3 is pretty magical. We're committed to expanding access as 00:20:17.180 |
quick as we can, adding controls around durability. And thank you so much. That is not me actually waving at the camera. That is a static photo of me that has been animated with Veo 3. Excellent. Thank you.