
Veo 3 for Developers — Paige Bailey, Google DeepMind



00:00:00.000 | Thank you so much for having me. Thank you all for being here today and wanting to learn more
00:00:19.180 | about generative media. I'm going to keep this pretty quick. There's a lot to show, especially
00:00:23.880 | for some of the new features that we've released in Veo 3. But there's also a lot to discuss in
00:00:29.460 | the frame of how does this revolutionize the way that people build things? How do people
00:00:34.960 | create ads? And how do people replicate some of the experiences that you might see every
00:00:40.140 | day? So, as mentioned, I'm Paige. I am the engineering lead for our DevRel team at Google DeepMind.
00:00:45.900 | But I'm here today on behalf of our generative media team, who are all brilliant. They are
00:00:52.080 | wonderful. Many of them could not be here today. But this
00:00:58.920 | this is all their work. So, I just want to send a thank you to the heavens for everything
00:01:03.800 | that they've been building. Today we're going to be talking about three different models.
00:01:07.420 | So, Veo 3, which is our new video and audio generation model. Imagen 4, which can generate these static
00:01:12.940 | images. And also Lyria 2, which is a music generation model. And stay tuned. There will be more of
00:01:18.980 | all of the above coming shortly. Veo is kind of magical in the sense that, you know, you can create videos of
00:01:26.540 | things that you've never seen before. And we've seen some of these examples already.
00:01:31.420 | But it also has the potential to really revolutionize everything that we build and create as humans.
00:01:37.260 | And I loved this quote from Andrej Karpathy just recently: that video has the potential to be an
00:01:43.180 | incredible surface for communication, but also for education and for human creativity. So, all of these
00:01:49.660 | things are designed with that in mind. Veo 2, I just want to touch on it briefly before we get into the Veo 3
00:01:56.540 | capabilities. We released a whole bunch of additional things around creative control just recently. So,
00:02:02.940 | think on the order of about a month ago or not even that. So, things like reference powered videos,
00:02:08.540 | outpainting, the ability to add and remove objects, character control and consistency,
00:02:13.420 | which I think you all were talking about just a little while ago. And also the ability to interpolate
00:02:18.620 | across first and last frames. And let's take a look at what that means. Because for people who are not
00:02:23.340 | necessarily filmmakers, it's much, much easier to show and not tell. So, reference powered videos are
00:02:29.900 | things like: you have a person, you have an environment, and you're able to kind of put one
00:02:36.380 | within the other or compose them together into something that feels very, very stylized, but also
00:02:43.420 | like really, really well crafted. So, here you can see reference powered video with Veo 2. This is
00:02:50.780 | available via the API and via some of our tools like Flow today. And another example of reference powered
00:02:58.780 | video. So, a really, really cute little monster in a variety of environments that you just control
00:03:04.540 | by describing. It's performing pretty well on benchmarks. So, you can see here, the green is Veo. So,
00:03:13.100 | compared to things like Runway Gen 4 and Kling, the more green, the better. And for reference powered
00:03:20.540 | video, most human raters selected Veo for some of these side-by-side comparisons. You can also match
00:03:28.700 | styles. So, upload a reference image and then have different styles composed together.
00:03:34.540 | Another example of styles being preserved. And then also camera controls the same that you might have
00:03:41.820 | if you were a filmmaker. So, things like being able to move back, move right, rotate up, zoom in,
00:03:47.340 | and to be able to precisely control all of these camera movements. Again, just via natural language and
00:03:53.580 | through some of the tools that are available in the APIs. When I saw all of this, I was blown
00:03:58.620 | away because I don't think that we have nearly enough code samples demonstrating some of these
00:04:02.780 | capabilities. But these are all things that you can do with the Veo 2 models today with the APIs.
00:04:08.940 | We also have the ability to do outpainting. This was important for a recent project with the Sphere
00:04:14.540 | around Wizard of Oz. So, being able to take, like, a scene or an individual frame of a video
00:04:22.460 | and imagine what the rest of the scene might look like. So, even if you only have a view into a small
00:04:28.140 | portion, being able to create something that looks real or that looks consistent across the outpainting.
00:04:36.780 | Adding objects or removing objects from scenes. So, you can see here a few examples as well.
00:04:45.420 | And again, all of these available for you to test and to try today. These are all things that exist
00:04:51.660 | that our research team has kind of gotten into the API designs. Some more examples of removing objects.
00:05:00.460 | Character control is quite nice. You might have seen some of these demonstrated in your favorite products
00:05:06.460 | for kind of controlling mouth movements, controlling reference space movements for particular characters.
00:05:14.380 | We also give the ability to add a script and to add kind of a voice tone and to have the character map
00:05:22.300 | the lips to producing that sound and producing it in a way that feels consistent with the location.
00:05:30.220 | These are some of the motion examples with Veo 2 and with Veo 3. So, being able to have an input image
00:05:38.540 | and controlling them or changing the design across the scene. More benchmarks. And then the first and last
00:05:48.380 | frame. So, you can have an input image and an output image and Veo is able to interpolate across them
00:05:55.980 | to kind of make those images stitched together into a video. And those are just another couple of examples.
00:06:05.660 | I feel like generative media presentations are very gratifying. These are certainly the most beautiful
00:06:11.660 | things that we get to see at developer conferences. So, Veo 3. Everything that you just saw was Veo 2. Veo 3.
00:06:21.740 | Right? Like, blown away. So, Veo 3 is video but coupled together with audio. And so, all of the tokens
00:06:29.100 | composed together natively, not audio being pulled in as a tool, but the model actually able to compose
00:06:36.140 | together all of these tokens across multiple modalities. This is similar to what you see with
00:06:40.460 | Gemini's native audio output. In addition to being able to output text and code, you can also output images,
00:06:46.460 | edit images, edit audio, compose audio, etc. So, Veo 3, our latest state-of-the-art video generation model.
00:06:54.700 | It has these things around prompt adherence and native audio generation. But again, so much cooler to show
00:07:04.380 | and not tell. The little llama. So, interestingly, you're able to do not just background noises, but also
00:07:27.740 | things like music. Including very, very subtle sounds. And the like. So, let's go to the next.
00:07:51.020 | Veo 3. There we go. So, this is hard, right? Like, it looks very cool. It's very hard to capture the
00:07:57.900 | nuances of an input prompt. And it's also been really historically very hard to preserve visual
00:08:04.060 | consistency. So, characters often, like, jump from one frame to another. There might be backgrounds,
00:08:11.820 | and then suddenly walls disappear and you're able to see behind them. This is one of the reasons why Veo 3
00:08:17.900 | feels like a leap forward is because the stylistic consistency and then also the contextual consistency
00:08:23.340 | is much, much better. Built on years of research. So, things like GQN, WALT, etc. And it has
00:08:32.140 | responsibility at its core. So, you can see little human-visible watermarks as well as SynthID watermarks
00:08:38.300 | for synthetically generated images and video. We've also been partnering really closely with many, many,
00:08:45.180 | many artists along the way. So, Darren Aronofsky, also musicians for Lyria models, artists for the Imagen
00:08:54.540 | models. And we'll take a look at a couple of these as well. So, Imagen is image generation. You know,
00:09:02.220 | able to kind of preserve realism. Everything from humans to whales, you know, cute puppies. I've heard that the
00:09:12.060 | more cute puppies that you have in a presentation, the better it always is. And then also being able to
00:09:17.900 | preserve detail across all of these images as well, including diverse styles and even things like
00:09:24.620 | typography. So, I love these stamps of Alamo Square and the Mission. I really, really wish that we just had
00:09:30.060 | these as swag ideas that -- stickers for laptops or stickers just in general. And another example of an
00:09:41.100 | artist that the team has been really closely collaborating with, Ross Lovegrove, on some of his designs.
00:09:51.100 | So, Lyria 2, also very exciting. It's high-fidelity music and professional-grade audio. It also gives you a
00:09:59.500 | very, very granular creative control. So, the ability to steer the inputs and outputs and to steer the tones
00:10:05.820 | and the styles of the music along the way. Music AI Sandbox is one of the products that's been created as a visual for this. If folks are familiar with Ableton
00:10:15.980 | or things like it, this probably looks very similar. And then there's also MusicFX,
00:10:23.340 | which is a project from our labs team that allows you to kind of compose together beats just via
00:10:29.660 | natural language. And we used that for the demo later on today. Lyria RealTime has also been a deep
00:10:37.580 | collaboration with many musicians, both Jacob Collier, who's a legend, and Toro y Moi. And me circa
00:10:46.700 | college was, like, blown away that Toro y Moi was at Google I/O. Huge fan.
00:10:55.020 | I do think that a big part of teaching music is giving people a chance to play music and play with music.
00:11:02.940 | However, people don't have access to the whole of music from day one. What I think this offers an
00:11:07.980 | interesting perspective into is the whole of music: mathematics, physics, history, geography, the
00:11:15.660 | human body, language, spelling, syntax. One thing I've come to realize is that a lot of the same forces
00:11:22.460 | that make music work are the forces that make life work.
00:11:24.860 | Oh, that was not Veo 3. Some parts, but it's hard to tell, right? Like, it's, yeah, yeah.
00:11:34.060 | Yeah, you know, parts of it might have included visuals generated by Veo 3, though.
00:11:40.860 | So Lyria, again, built in collaboration with the creative industry, not outside it, and then also
00:11:48.220 | incorporates many of these techniques like SynthID to make sure that you have some sort of digital
00:11:52.220 | watermarking for the assets themselves. So now we're going to get into it. These are some
00:11:58.860 | examples that I thought might be fun to share to show just how far we've come in the last couple of
00:12:03.100 | years, because I think being here, we get very sucked into the Bay Area bubble, and we don't really kind
00:12:07.740 | of take a step back and appreciate how far, you know, the world has changed in just a matter of months.
00:12:14.140 | So this is an example of one of the papers that was produced around 2023. So released in 2023,
00:12:21.580 | research happening around 2022, for text to video, a raccoon wearing a black jacket dancing in slow
00:12:27.100 | motion in front of the pyramids. So just have that in your brain when you see it. This is WALT, circa 2023.
00:12:35.660 | So very, very choppy, like really, really hard to have even like, that's not even eight seconds worth of
00:12:42.940 | a frame. LTX Video from 2024. This is one of the ones that were available on Hugging Face. Kling 2.0,
00:12:49.900 | which I did with fal. Heck yeah. Yep. Yeah. So, which I did with fal, released in 2025. And then Veo 2, from 2024,
00:13:00.700 | a very cute little raccoon, but I'm not sure how well he's dancing. But this is just kind of a smattering
00:13:06.380 | of how the world has changed in the space of just a couple of years. And then when you put the same
00:13:12.220 | prompt through Veo 3, you get this.
00:13:20.300 | Very stylish raccoon. So image to video, transforming static images into dynamic video content. So you
00:13:28.060 | can see here an image of a woman and her puppy, a woman I can only assume is in Texas, walking very
00:13:37.180 | slowly forward on the way to a gunfight. And also different stylized images of a person in a single
00:13:44.540 | frame being applied to different scenarios. So running towards the camera, tractor beam taking off,
00:13:52.060 | being lifted into the sky, again, all steered via natural language. Prompt rewriting is something that
00:13:57.420 | we've also released in Veo 3. So the ability to take that very, very simple sentence before of a raccoon
00:14:03.820 | wearing a black jacket dancing in slow motion in front of the pyramids, and turn it into something a bit more
00:14:09.900 | fully formed that Veo is much better equipped to understand. And so, have that in
00:14:18.620 | your brain, as well as the concept of sound generation. So, both music, sound effects, and
00:14:23.580 | background noises. And we'll take a look at what that simple prompt is now.
00:14:38.300 | I still, like, get blown away by how much detail there is; like, you can almost see the
00:14:43.420 | reflections in the eyes for the things as they walk forward. And then this is from our team in Paris,
00:14:52.940 | and they are very, very excited to have it shared today. Does anybody know French in the audience?
00:15:00.700 | Okay. Amazing. The rest of us will have a translation in a second.
00:15:05.020 | It's like Daft Punk.
00:15:14.300 | I can't believe this new Veo model. It is amazing.
00:15:29.020 | It was a good Veo model. You know, I don't believe there is a good or a bad Veo model.
00:15:37.740 | So, what do you think now about this artificial intelligence champion?
00:15:41.020 | I think the trainer's strategy is good, not on the technical side and tactical side, but the important thing is Veo 3.
00:15:46.060 | Boss, the humans are on the verge of creating AGI. Do we have to contact them?
00:15:49.100 | What do they do?
00:15:49.900 | They generate starter pack images and make stupid videos.
00:15:52.700 | They are not yet ready.
00:15:53.900 | Veo can make us sing anything.
00:16:02.380 | And for anyone who is curious, the translation is: "Veo" means "I see" in Spanish, so, "I see, I see, I see."
00:16:09.020 | And then the guy responded with, yeah, I don't understand what you're saying because I'm French, actually.
00:16:14.060 | Artificial intelligence, artificial intelligence.
00:16:18.300 | And then, also, that's the interview. And Boss, the humans are about to create AGI, should we contact them?
00:16:26.300 | They're not ready yet. So, amazing. Cool.
00:16:32.860 | Actually, I'm going to zoom to the next one. So, how do you access it? Big question.
00:16:41.660 | Right now, we have a few different ways to access the Veo 3 models. One is through the Google AI
00:16:45.980 | Ultra Plan, which is available in many different countries, including the UK just recently.
00:16:51.500 | Also, Google AI Pro subscribers are able to access it via the Gemini mobile app for a limited number
00:16:59.020 | of uses. And then, also, Veo 3 is available in private preview, currently in Vertex AI. Veo 2 is
00:17:05.020 | also available via Vertex AI and the Gemini APIs. Hoping to bring them to AI Studio. Crossing fingers,
00:17:12.060 | but we'll see. And then, you can also fill out a form for early access if you would like it. QR code
00:17:18.380 | coming shortly. So, take a picture of that if you would like, with the form to go submit and test it
00:17:25.820 | out. This is also, while you're looking, a code sample showing how easy it is to use: just your output
00:17:33.260 | bucket where you would like the video to be deposited, an input image that you
00:17:38.460 | want to use as a starter, and some things around aspect ratios and the like. So, toggling between
00:17:46.220 | different models or being able to specify some of these is just a handful of lines of code, which is
00:17:51.340 | pretty magical.
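A minimal sketch of the kind of call shown on the slide, assuming the google-genai Python SDK pointed at Vertex AI; the project ID, bucket paths, and exact model ID here are illustrative placeholders rather than the values from the talk.

```python
import time
from google import genai
from google.genai import types

# Assumes the google-genai SDK with Vertex AI access; project and bucket names are placeholders.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

operation = client.models.generate_videos(
    model="veo-2.0-generate-001",  # swap the model ID here to toggle between Veo versions
    prompt="A raccoon wearing a black jacket dancing in slow motion in front of the pyramids",
    # Optional starter image for image-to-video.
    image=types.Image(gcs_uri="gs://my-bucket/starter-frame.png", mime_type="image/png"),
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        duration_seconds=8,
        number_of_videos=1,
        output_gcs_uri="gs://my-output-bucket/videos/",  # bucket where the video is deposited
    ),
)

# Video generation runs as a long-running operation; poll until it completes.
while not operation.done:
    time.sleep(10)
    operation = client.operations.get(operation)

print(operation.result.generated_videos[0].video.uri)
```

The call returns a long-running operation, so the sketch polls until the finished video has landed in the output bucket.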
00:17:57.900 | So, this was intended to be a live demo, and I'm not sure if I can tempt the demo gods. Also, I'm probably close to being over time. But I wanted to see how well I could replicate a
00:18:04.620 | commercial with Veo that seemed pretty simple. So, the commercial is this one, which took my name.
00:18:12.860 | Hey, my name's Paige. And what makes the Chick-fil-A chicken
00:18:12.860 | So, that was the input video. The process for replicating it with Veo 2 is: you give
00:18:32.300 | Gemini the original video, have it create a really, really detailed plan, and segment it into prompts to
00:18:39.500 | handle the eight-second limitation. I used MusicFX to create the background track, which was a
00:18:46.060 | combination of down-home farm and slow guitar. And then I put it all in Camtasia and stitched it together with
00:18:52.540 | transitions. The whole process was relatively quick, but it also took a lot of thinking and a lot of work to get to that final assembly stage.
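The Gemini planning step described here might look roughly like the sketch below; the model name, file name, and prompt wording are illustrative assumptions, not the exact ones used for this demo.

```python
import time
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Upload the original commercial so Gemini can watch it.
video_file = client.files.upload(file="original_commercial.mp4")

# Wait for the uploaded video to finish processing before prompting against it.
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = client.files.get(name=video_file.name)

# Ask for a shot-by-shot plan, segmented to fit Veo's roughly eight-second clip limit.
plan = client.models.generate_content(
    model="gemini-2.5-flash",  # illustrative model name
    contents=[
        video_file,
        "Describe this commercial shot by shot, then rewrite it as a numbered list of "
        "detailed video-generation prompts, each covering at most eight seconds of footage.",
    ],
)
print(plan.text)
```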
00:18:58.940 | And it looked like this. So, this is
00:19:04.460 | Hey, my name's Paige. And what makes a Chick-fil-A chicken sandwich original to me is the crispiness
00:19:13.020 | of the breading and a tenderness of the filet. It's tasty. It's warm. It's total satisfaction.
00:19:19.980 | And that was, again, using Veo 2, using Gemini text-to-speech, and stitching it all together myself. And I
00:19:28.860 | actually like that one better than the original commercial, but your mileage may vary. Process with
00:19:34.140 | Veo 3. You have the original video, you generate the description, you give it to Veo 3, and you see how
00:19:39.180 | well it does. This took just the span of, like, submitting the prompt to Veo 3. "Hey, my name's
00:19:47.180 | Paige. And what makes the Chick-fil-A chicken sandwich original to me is the crispiness of the breading and
00:19:52.380 | the tenderness of the filet." So, again, one prompt. And it was able to produce this.
00:19:53.180 | "Hey, my name's Paige, and what makes the Chick-fil-A chicken sandwich original to me is the crispiness of the breading and the tenderness of the filet."
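The one-shot Veo 3 flow she describes is essentially a single long-running call, with the spoken line written straight into the prompt so Veo 3 generates the matching speech and background audio itself. A rough sketch, assuming the same SDK setup as the earlier snippet, with an illustrative Veo 3 model ID and a paraphrased prompt.

```python
from google import genai

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Dialogue goes directly into the prompt; Veo 3 generates speech and ambience natively.
prompt = (
    "A woman on a farm smiles at the camera and says: "
    "\"Hey, my name's Paige. And what makes the Chick-fil-A chicken sandwich original to me "
    "is the crispiness of the breading and the tenderness of the filet.\" "
    "Warm slide guitar plays softly in the background."
)

operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # illustrative Veo 3 model ID
    prompt=prompt,
)
# Poll operation.done as in the earlier sketch, then download or view the generated clip.
```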
00:20:07.180 | Incredible. So, takeaways. Veo 3, pretty magical. We're committed to expanding it, expanding access as
00:20:17.180 | quickly as we can, adding controls around durability. And thank you so much. That is not me actually waving at the camera. That is a static photo of me that has been Veo 3 animated. Excellent. Thank you.
00:20:31.180 | Thank you.
00:20:32.180 | We'll see you next time.