Teaching Gemini to Speak YouTube: Adapting LLMs for Video Recommendations to 2B+ DAU - Devansh Tandon

There's a lot of attention on how LLMs are changing search. Perplexity is a product that a lot of people use. But I think recommendations is probably a bigger problem that is under-hyped, because it's kind of transparent to users. And I think the application of LLMs to recommendations is going to be a bigger consumer application than search.

I want to introduce the problem of YouTube recommendations, and then talk about how we've built large recommender models: how we build semantic ID and how we're using it. And then I'll end with a recipe for how you might use an LLM for recommendations in your own domain.
To start, on why this is important: who here watches YouTube? It's one of the biggest consumer apps in the world. And a large majority of the watch time on YouTube comes from recommendations. We serve recommendations across surfaces like Home and Watch Next. And so if you think about consumer applications of LLMs, I think in terms of consumer engagement and impact, recommendations is going to be a much bigger application than search. And this is true of any consumer app with a billion DAU.
The way I think about the recommendation problem is that you're trying to learn a function: you get a user and their context, and you're trying to give them a bunch of recommendations. At YouTube, we have a bunch of user information, like their demographics, their age, their gender, and so on. And we use all of that to make video recommendations.
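The function described above can be sketched as an interface. This is a minimal illustration of the problem shape, not YouTube's actual system; all names and fields here are assumptions, and the body is a stub standing in for real retrieval and ranking models.

```python
# Sketch of the recommendation problem as a learned function
# f(user, context) -> ranked video recommendations.
from dataclasses import dataclass, field


@dataclass
class UserContext:
    """Illustrative user features; real systems use far richer signals."""
    age: int
    gender: str
    country: str
    device: str
    watch_history: list[str] = field(default_factory=list)  # video IDs, most recent last


def recommend(user: UserContext, num_results: int = 10) -> list[str]:
    """Stand-in for the learned function: a real system would run
    retrieval and ranking models here. This stub just echoes the most
    recent watches to show the interface shape."""
    return user.watch_history[-num_results:][::-1]


user = UserContext(age=24, gender="female", country="US", device="android",
                   watch_history=["vid_a", "vid_b", "vid_c"])
print(recommend(user, num_results=2))  # most recent watches first
```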
We've tried a lot of different modeling techniques here. And about two years ago, we started thinking: how can we rethink this recommendation system on top of Gemini, which has been making incredible progress?
So we start with a base Gemini checkpoint. Then we adapt it for YouTube recommendations, teaching it a lot of information about YouTube, to get a unified YouTube-specific checkpoint, the large recommender model, or LRM. Then we can align it for different recommendation-related tasks, for all of the major recommendation surfaces. This is a model that we have had launched in production at YouTube for a while as the retrieval system, and we're experimenting a lot on the ranking side.
So I want to start by explaining semantic ID, and then we'll talk about how we use it for retrieval. The first thing you have to do is develop a way to tokenize videos. In terms of an LLM, when you give it an input, it tokenizes that text and then predicts the next token. What we wanted was to give this model an input of a number of video tokens and then just get video tokens out that would be the recommendations. We had to build this because even with a million tokens of context, when you want to reason over many videos, you have to compress that video representation in some way. And before we settled on this approach, we tried a bunch of other things, like predicting search queries and retrieving videos through that, or trying to recommend videos directly. And those solutions were just not good enough.
So we built semantic ID, which we actually wrote a paper about last year. The way semantic ID works is you take a video and extract features from it. You put all of that into a multidimensional embedding, and then you quantize it using RQ-VAE to give every video a sequence of tokens. We've written a pretty detailed paper about this. But at a high level, the way I think about it is that we're making the atomic units for a new language. Imagine the whole corpus of billions of videos on YouTube getting organized around these semantically meaningful tokens, representing topics like music, gaming, and sports. Within sports, you would have different sports.
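The quantization step can be sketched as toy residual quantization in the spirit of RQ-VAE: each level's codebook quantizes the residual left by the previous level, so a video embedding becomes a short tuple of discrete tokens, its semantic ID. The codebooks here are random for illustration; an actual RQ-VAE learns them end to end, and the dimensions and level counts below are made-up.

```python
# Toy residual quantization: an embedding -> a tuple of discrete tokens.
import numpy as np

rng = np.random.default_rng(0)
DIM, LEVELS, CODES = 8, 3, 16                      # illustrative sizes
codebooks = rng.normal(size=(LEVELS, CODES, DIM))  # one codebook per level


def semantic_id(embedding: np.ndarray) -> tuple[int, ...]:
    """Quantize an embedding level by level; each level encodes the
    residual the previous level could not capture."""
    residual, tokens = embedding.astype(float), []
    for level in range(LEVELS):
        # nearest codeword to the current residual at this level
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        residual = residual - codebooks[level][idx]  # pass leftover down
    return tuple(tokens)


video_embedding = rng.normal(size=DIM)
sid = semantic_id(video_embedding)
print(sid)  # a 3-token semantic ID for this video
```

Because earlier levels capture coarser structure, videos that share a first token tend to be semantically closer than videos that only share later tokens, which is what gives the token space its topic-like hierarchy.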
And this, I think, is in itself an interesting milestone. We can then continue training Gemini and have it understand both English and this new YouTube language: understand sequences of watches and be able to reason across them. For example, take a tennis highlights video, which has some semantic ID. You can prompt the model and say, hey, this video has title XYZ, and the model starts to learn to output the title. You could imagine a very similar task in the other direction. And so you're basically trying to connect text and this video language. Then, because we have a corpus of all the YouTube engagement data, all the paths that users took through YouTube, we can mask parts of it and say: a user has watched the following videos, A, B, C, D. And the model starts to learn to predict those masks. Now it's starting to understand which videos are watched together and to make relationships between videos. After a bunch of pre-training tasks like this, we get a checkpoint that can reason across English and YouTube videos.
This is an example from a user's watch history, and we find that this model can now reason across these videos. So you could prompt it with things like: video three is interesting to math fans because it's about pi; video four is going to be interesting, too. And the model starts to be able to understand that video four is interesting to technology fans because it's about AI. And this is just based on the semantic ID definition of a video; it doesn't really have a lot of other information to go off of. So I think this in itself is a very interesting checkpoint that is starting to reason across English and YouTube.
Once we have this model, we think about how we can use it for different video recommendation tasks at YouTube. The first one that we focused on is generative retrieval. Here, you could construct a prompt for every user. Say they're a 24-year-old woman in the US on Android; they're watching a highlights video from the Olympics; and they have a watch history of 50 videos they've watched in the past, along with how they engaged with them. You can construct a prompt like the one on the right, with this user demographic information and the context video, and have the model decode some video recommendations as SIDs.
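A prompt of the kind described above might be assembled roughly as follows. The field names, SID strings, and wording are all assumptions of mine for illustration; the actual prompt format used in production isn't shown in the talk.

```python
# Illustrative prompt assembly for generative retrieval: demographics +
# context video + recent watches, with the model expected to decode its
# recommendations as semantic-ID tokens.
def build_retrieval_prompt(demographics: dict, context_sid: str,
                           watch_history: list[tuple[str, float]]) -> str:
    history_lines = [f"  {sid} (watch fraction: {frac:.2f})"
                     for sid, frac in watch_history]
    return "\n".join([
        f"User: {demographics['age']}-year-old {demographics['gender']} "
        f"in {demographics['country']} on {demographics['device']}.",
        f"Currently watching: {context_sid}",
        "Recent watch history:",
        *history_lines,
        "Recommend the next videos as semantic IDs:",
    ])


prompt = build_retrieval_prompt(
    {"age": 24, "gender": "woman", "country": "US", "device": "Android"},
    context_sid="SID_5_11_2",
    watch_history=[("SID_5_11_7", 0.92), ("SID_3_0_4", 0.40)],
)
print(prompt)
```

The model's decoded SID sequence is then mapped back to concrete videos, which is what makes this retrieval rather than ranking: candidates are generated directly instead of scored from a pre-built set.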
We find that this gives really interesting, unique recommendations, especially for our hardest recommendation tasks. So in this example, when you're watching this highlights video, before LRM, we would give you other men's track races. Now, with this new model, it's able to find the unique connection between the user demographic and their past watch history, and find related women's races that we weren't able to recommend before. So we get very interesting and unique recommendations.
The big finding from this is that LRM is a very powerful model. But the biggest limitation was that the serving costs were too high, especially for the scale that YouTube operates at. And so after we got our first experiments working, we spent a lot of time just reducing the TPU serving cost, and we got 95%-plus cost savings to be able to actually launch this.
One other strategy that we used, which I think is kind of interesting, is that we tried to turn this into an offline problem, using the same prompt but with the personalized aspects removed. We wanted to build an offline recommendations table: if you're watching video A, what are the candidate videos we should recommend? Normally, these unpersonalized recommendation models just don't hold a candle to personalized ones. But because this LRM is trained from a really big checkpoint, it actually gives us some differentiated recommendations. And so in the YouTube context, we can take our corpus of billions of videos, look at the head, which represents a lot of the watch time, do offline inference to make this offline recs table, and then just do a simple lookup to serve some recommendations. This was a complete way around our serving problems.
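The offline-table strategy above can be sketched like this: run the expensive model once over the head of the corpus, store its outputs, and serve with a plain lookup. `expensive_lrm_infer` is a placeholder I've invented to stand in for the real offline LRM call.

```python
# Sketch of precomputing an offline recs table for the head of the corpus.
def expensive_lrm_infer(video_id: str) -> list[str]:
    """Placeholder for an unpersonalized offline LRM inference call."""
    return [f"{video_id}_rec{i}" for i in range(3)]


def build_recs_table(corpus_by_watch_time: list[str],
                     head_size: int) -> dict[str, list[str]]:
    """Precompute recommendations only for the head videos, which
    account for most of the watch time."""
    head = corpus_by_watch_time[:head_size]
    return {vid: expensive_lrm_infer(vid) for vid in head}


table = build_recs_table(["vid_a", "vid_b", "vid_c"], head_size=2)
print(table.get("vid_a"))  # cheap O(1) lookup at serving time
print(table.get("vid_c"))  # tail video: not in table, fall back elsewhere
```

The trade-off is that the table is unpersonalized and only covers the head, so tail and fresh videos still need another serving path.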
I want to talk a bit about the challenges for YouTube, because in some ways, making an LLM-based recommendation system is a harder problem than a classical LLM. One of the big differences is the vocabulary and the size of the corpus, right? For Gemini, if you're training an English LLM, your vocabulary is about 100,000 words in the Oxford Dictionary. At YouTube, if you imagine the library of YouTube, the vocabulary is billions of videos.

And the freshness of videos is really important. If you think about a new word that's added to the English dictionary, even before the LLM learns it, the model can still answer 99% of the questions that people would have. But when a new video is uploaded, you have to be able to recommend it within the next minutes or hours; otherwise, a lot of users are going to be upset. So you have to keep learning about new videos and start recommending them to the right users, which is very different from classical LLM pre-training like Gemini's, which happens maybe once every three to six months. And so, in that way, it's a much harder problem. We also have to use the smaller, more efficient models, like Flash, just to be able to hit the latency and scale requirements that we have.
So I want to summarize the journey that we've been on at YouTube in what I think of as an LLM-and-RecSys recipe that you can maybe adapt to your own application. There are three major steps, right?

The first is you want to find a way to tokenize your content: you want to distill some essence of your content into an atomic token. The general approach is you find some rich representation, a bunch of features, build an embedding, and then find a way to tokenize or quantize it. In essence, you're making your own domain-specific language.

The second step is you then want to adapt the LLM: basically make links between English and your domain language, and find training tasks that help the model reason across English and your domain language. The outcome after this step, in my mind, is a model that still speaks English but can also speak your domain-specific language.

Once you have that, you can do the third step of prompting it with user information, where you construct personalized prompts with user demographics and context, and then train task-specific or surface-specific models. And then you have a generative recommendation system on top of an LLM. And this is like a tweet-sized summary of maybe two years of work.
Maybe the last thing that I want to talk about is where I see this going, and some possible future directions for LLMs and RecSys. The stage that we're at right now is that LLMs are just augmenting recommendations. They bring these magical recommendation experiences; they enhance the quality, but they're largely invisible to users. You get the recommendations, but you don't really know whether a Gemini inference happened or not. This is why I think the LLM application to RecSys is very under-hyped: users don't directly know what's happening.

I think we're close to a world, and we're experimenting with this, where, given the bilingual LLM across English and recommendations that we talked about, users can talk to it in natural language. And I think you're going to start to see experiences where users can steer recommendations toward their own goals. The recommender can explain why a candidate was recommended to a user, and users can start to align it toward their own goals, expressed in natural language. And I think the lines between search and recommendations also start to blur in this world.

And then, as a hint of the future, I think you're going to see recommendation and generative content start to come together, where we're going to be recommending a personalized version of a piece of content. And further out, instead of recommending content, you can get to really interesting N-of-1 content that's generated for the user. That's going to come sooner than you expect, with all of the advances happening in AI.
One question, generally, on how you balance learning the semantic ID embeddings within the model against keeping the general language capability from being damaged by training on, for example, a tokenized user history, which is very much a second language, very different from English.

In terms of some of our early applications, we mostly cared just about recommendation quality, in which case we over-indexed on speaking the semantic ID language. And as you over-train on more and more of those examples, the model actually forgets how to speak English. Maybe it's reasoning in some intermediate layers, but it finally ends up in the semantic ID language. We are trying a bunch of things; for example, with mixture of experts, maybe we can have a few experts that retain the text capability, while other experts focus on the semantic ID capability. So it's a balance, and I think we're going to shift more toward text as we try to build these interactive experiences, where text input from users is going to become more important.
So, during this process, did you learn any good lessons for cold-starting embeddings on these domain-specific tokens?

Yeah, so one thing is that the semantic ID training process is entirely unsupervised; we're not telling it anything, it's making its own quantization of the video corpus. When you sample to see what the model is doing, we find that it's learning concepts like sports versus movies and entertainment, but we didn't actually try to teach that explicitly, which I think is very interesting. I think the cold-start aspect is that, because of semantic ID, we can warm-start a new video into a semantically meaningful space. And what we find is that performance for videos that were uploaded in the last day or the last week gets much better, because we're better understanding this fresh and tail content.
So, when you said you extract frames as part of making the semantic ID, are you just running a video at, let's say, 3 to 30 FPS, making a grid of frames, running SigLIP or SigLIP 2, and inserting that?

We've tried a few different approaches, where, for example, we try to sample from key moments in the video. We actually have the engagement data; if you've seen in the YouTube player, it can highlight the places where people had the most engagement. Given the scale, we can't sample a lot of video frames, so we try to select them intelligently. But we do have video frames, and over time, I think we'll get more.

With this way of selecting frames, are you able to pick up important things that depend on small objects in a video, say a person in the distance who is the focus of attention in the video?

Hard to say, because at the end, all of this video information gets compressed into eight tokens. So it's probably learning something, but it's hard to know exactly what it picked up from that video frame.
I have a question regarding pre-training. Did you also feed in user queries and what users watched as pre-training data? And if so, did you also use semantic IDs for users in pre-training, or are semantic IDs only for the videos?

In this case, we have only tokenized videos. And we focused more on sequences of watches rather than on search queries and the watches that originated from those queries. You could imagine some parallel work where you try to tokenize users and build some kind of user token that represents, say, the last 500 watches that they have had, and so on. I think that's a very interesting research direction.
So, the pre-training was done on top of an existing Gemini pre-trained model, right?

We basically take a Gemini checkpoint and then adapt it for this YouTube purpose, and get this YouTube-plus-Gemini LRM checkpoint.

One last thing: it would be cool to see semantic IDs of videos fed to Veo 3. How much improvement do you see compared to a non-LLM, more traditional recommendation system? And when should we use a more traditional one versus an LLM-based recommendation system?

I can share everything except code and metrics, you know? And so we've given you as many of the conceptual steps of what we did as we can. Maybe what I'll say is: I think it's been the biggest improvement to recommendation quality we've seen in the last few years.