Netflix's Big Bet: One model to rule recommendations: Yesu Feng, Netflix

00:00:00.000 |
Good afternoon, and thank you, Eugene, for the introduction. Today I'm going to share 00:00:22.200 |
our big bet at Netflix on personalization, namely to use one foundation model to cover 00:00:28.080 |
all the recommendation use cases. At Netflix we have diverse recommendation needs. This is an 00:00:36.360 |
example homepage of one profile on Netflix. It's a 2D layout of rows and items. Diversity comes 00:00:45.780 |
in at least three levels. The first level is about rows. We have diverse rows: we have genre 00:00:53.100 |
rows, for example rows on comedies and rows on action movies. We have rows about new, trending, and just 00:01:01.080 |
released titles. We also have rows about, for example, titles only available on Netflix. So that's the first 00:01:08.760 |
dimension. The second dimension is, of course, the items or entities. In addition to traditional movies and TV 00:01:16.380 |
shows, we now have games and live streaming, and we're going to add more, so our content space is 00:01:23.660 |
expanding to very heterogeneous content types. The third level is the page: we have the home page, the search 00:01:34.640 |
page, and a kids home page, which is tailored very differently to kids' interests. The mobile feed is a 00:01:42.600 |
linear page, not a 2D layout, and so on and so forth. So the different pages are also very diverse. What 00:01:49.880 |
happened traditionally was that this naturally led to many specialized models being developed over the 00:01:59.320 |
years. Some models rank videos, some rank rows, some focus on, for example, shows users have not watched yet, and some focus on shows users are already engaging with. 00:02:11.880 |
Many of those models were built independently over the years. They may have different objectives, but they have a lot of overlap 00:02:21.880 |
as well. So naturally this led to duplication in our label engineering as well as our feature engineering. Take feature engineering as an example: we have this very commonly used factual data about user interaction history. 00:02:41.160 |
The factual data is the same, but over the years many features were derived from the same fact data, like counts of different actions, 00:02:51.160 |
counts of actions within various time windows or other slice-and-dice dimensions, similarity between the titles in the user's history and the target title, 00:03:03.160 |
and lastly just a sequence of unique show IDs to be used as a sequence feature in the model. This list can go on and on. 00:03:13.160 |
A lot of those features, because they were developed independently for each model, have slight variations but are largely very similar, so they become very hard to maintain. 00:03:25.160 |
So the challenge back then was: is this scalable? Obviously not. If we keep expanding 00:03:35.160 |
our landscape of content types or business use cases, it's not manageable to spin up new models for each individual use case. 00:03:45.160 |
There's not much leverage. There are some shared components for building features and labels, but by and large each model is basically 00:03:55.160 |
spun up independently. That also impacts our innovation velocity, in the sense that you don't reuse as much as you can; instead you just 00:04:05.160 |
spin up new models pretty much from scratch. 00:04:09.160 |
So this was the situation about four years ago, at the beginning or middle of the pandemic. 00:04:15.400 |
The question we asked at that time was: can we centralize the learning of user representations 00:04:21.720 |
in one place? The answer is yes, and we had a key hypothesis about a foundation model based on the transformer 00:04:31.480 |
architecture. Concretely, there are two hypotheses here. One is that through scaled-up semi-supervised learning, 00:04:38.360 |
personalization can be improved: the scaling law applies to recommendation systems as it applies to LLMs. 00:04:46.040 |
The second is that by integrating the foundation model into all systems we can create high leverage: we can 00:04:51.880 |
simultaneously improve all the downstream canvas-facing models at the same time. We'll see in the 00:04:59.080 |
following slides how we validated those hypotheses. I'll break the overview into two subsections: first, 00:05:07.640 |
data and training, and second, application and serving. 00:05:12.280 |
So, data and training. Starting from data, a very interesting aspect of building such a foundation 00:05:20.120 |
model, an autoregressive transformer, is that there are a lot of analogies, but also sometimes differences, between 00:05:27.400 |
this and LLMs, so we can transfer a lot of learnings and inspiration from LLM development. If we start from the 00:05:37.400 |
very bottom layer, which is basically data cleaning and tokenization, people who work with LLMs understand that 00:05:43.960 |
tokenization decisions have a profound impact on model quality. Although it's the bottom layer, the 00:05:52.120 |
decisions you make there can percolate through all the downstream layers and manifest as either a model 00:05:57.160 |
quality problem or a model quality plus. This applies to a recommendation foundation model as well. 00:06:06.520 |
There are some differences, though. Very importantly, instead of language tokens, each of which is just one ID, here, if 00:06:13.800 |
we want to translate the user interaction history into a sequence, each token is an interaction event 00:06:21.720 |
from the user. But that event has many facets, or many fields, so you can't represent it with just one ID. 00:06:27.320 |
There is a lot of rich information about the event, and all of those fields can play a role in the tokenization decision. 00:06:36.120 |
I think that's what we need to consider very carefully: what is the granularity of tokenization, 00:06:43.320 |
and how to trade that off against the context window, for example. Through many iterations I think we reached the 00:06:49.320 |
right abstraction and interfaces that we can use to adjust our tokenization to 00:06:56.680 |
different use cases. For example, you can imagine we have one version of tokenization used for pre-training, and 00:07:02.040 |
for fine-tuning against a specific application we apply a slightly different tokenization. 00:07:06.760 |
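The event-as-token idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Netflix's tokenizer: the `Event` fields, the duration buckets, and the `tokenize` helper are all hypothetical names chosen for the example. It shows how selecting fewer or more fields changes token granularity, and how a maximum length stands in for the context window.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """One user interaction; every field name here is hypothetical."""
    title_id: int
    action: str       # e.g. "play", "thumbs_up"
    device: str       # e.g. "tv", "mobile"
    duration_s: int   # seconds watched


def bucket_duration(seconds: int) -> str:
    """Coarsen a continuous field so it can be part of a discrete token."""
    if seconds < 60:
        return "short"
    if seconds < 1800:
        return "medium"
    return "long"


def tokenize(events: List[Event], fields: Tuple[str, ...], max_len: int) -> List[tuple]:
    """Turn events into composite tokens built from the selected fields only.

    Fewer fields give coarser tokens; max_len keeps only the most recent
    events, standing in for a fixed context window.
    """
    tokens = []
    for e in events:
        values = {
            "title_id": e.title_id,
            "action": e.action,
            "device": e.device,
            "duration": bucket_duration(e.duration_s),
        }
        tokens.append(tuple(values[f] for f in fields))
    return tokens[-max_len:]


history = [
    Event(101, "play", "tv", 2400),
    Event(102, "play", "mobile", 45),
    Event(103, "thumbs_up", "tv", 0),
]

# A coarse scheme (as might be used for pre-training) and a richer one
# (as might be used when fine-tuning for one application).
pretrain_tokens = tokenize(history, ("title_id", "action"), max_len=2)
finetune_tokens = tokenize(history, ("title_id", "action", "device", "duration"), max_len=3)
```

Keeping field selection as a parameter is one way to let pre-training and fine-tuning share the same event data while using different tokenizations.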
Moving up from the tokenization layer, we come to the model layers. At a high level, 00:07:18.200 |
from bottom to top, we go through the event representation, the embedding layer, the transformer layer, and the objective layer. 00:07:28.200 |
Event representation, as we just briefly touched upon: there is a lot of information in the event. At a high 00:07:36.200 |
level you can break it down by when, where, and what. When the event happened: that's about time encoding. 00:07:43.640 |
Where it happened: that's about physical location, your locale, country, and so on, but also about the device and 00:07:49.080 |
about the canvas, that is, which row and which page the action happened on. And then what is basically 00:07:57.880 |
about the target entity or title: which title you interacted with, what the interaction was, how long it lasted, 00:08:04.760 |
and any other information associated with the action. That's where we need to decide 00:08:13.480 |
what information to keep and what to drop, and so on and so forth. Moving one layer above, to the 00:08:20.440 |
embedding and feature transformation layer, one thing that needs to be pointed out is that for recommendation 00:08:26.760 |
we need to combine ID embedding learning with other semantic content information. 00:08:31.800 |
If you only have ID embeddings learned from scratch in the model, then you have a cold-start problem, 00:08:37.720 |
meaning that the model doesn't know how to deal with titles it hasn't seen during training when they show up at 00:08:43.000 |
inference time. So we need semantic content information to be complementary to those ID embeddings. 00:08:52.280 |
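One simple way to combine the two sources is sketched below. This is an assumption-laden toy, not the method used in the talk: the embedding tables, the values in them, and the choice to sum the two vectors are all hypothetical. The point is only that a title with no learned ID embedding still gets a usable vector from its content side.

```python
from typing import Dict, List

DIM = 3

# Learned ID embeddings exist only for titles seen in training (toy values).
id_embeddings: Dict[int, List[float]] = {
    101: [0.2, -0.1, 0.4],
    102: [0.0, 0.3, -0.2],
}

# Semantic embeddings come from content information, so every title has one.
semantic_embeddings: Dict[int, List[float]] = {
    101: [1.0, 0.0, 0.0],
    102: [0.0, 1.0, 0.0],
    999: [0.0, 0.0, 1.0],  # a new title never seen in training
}


def title_embedding(title_id: int) -> List[float]:
    """Sum the semantic vector with the ID vector, or with zeros if the ID is unseen."""
    semantic = semantic_embeddings[title_id]
    learned = id_embeddings.get(title_id, [0.0] * DIM)
    return [s + v for s, v in zip(semantic, learned)]


seen = title_embedding(101)   # uses both sources
cold = title_embedding(999)   # falls back to the semantic vector alone
```

Summation is only one option; concatenation or a learned mixing layer would serve the same purpose of giving unseen titles a meaningful representation.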
This is not a problem for LLMs, but cold start is a very commonly encountered problem for recommendation 00:08:57.480 |
systems. For the transformer layer, I think there's no need to go too deep in terms of architecture 00:09:03.880 |
choices, optimization, and so on. The only thing I want to point out is that we are using 00:09:10.280 |
the hidden state output from this layer as our user representation; one of the primary goals of 00:09:15.880 |
the foundation model is to learn a good long-term user representation. Then we need to put this 00:09:22.040 |
into context. Things to consider are, for example: how stable is our user representation, given that 00:09:28.280 |
our user profile and user interaction history keep changing? How do we guarantee or maintain the stability of 00:09:33.960 |
the representation? And what kind of aggregation should we use? Broadly, you can aggregate 00:09:39.880 |
across the time dimension, meaning the sequence dimension, or aggregate across the layers: you have 00:09:47.080 |
multiple self-attention layers, so how do you aggregate them? And then lastly, 00:09:51.880 |
do we need to do explicit adaptation of the representation based on our downstream objective? 00:09:58.120 |
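The two aggregation axes mentioned here, across time and across layers, can be illustrated with plain Python lists. This is a sketch under assumptions: the talk does not say which pooling is used, so the `last` and `mean` modes, the `last_k_layers` parameter, and the toy hidden states are all hypothetical.

```python
from typing import List

Vector = List[float]
# hidden[layer][position] is the hidden-state vector at that layer and position.
Hidden = List[List[Vector]]


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pool_time(layer_states: List[Vector], mode: str = "last") -> Vector:
    """Aggregate one layer across the sequence (time) dimension."""
    if mode == "last":
        return layer_states[-1]
    if mode == "mean":
        return mean_vectors(layer_states)
    raise ValueError(f"unknown mode: {mode}")


def user_representation(hidden: Hidden, time_mode: str = "mean", last_k_layers: int = 2) -> Vector:
    """Pool each of the top layers over time, then average across those layers."""
    per_layer = [pool_time(states, time_mode) for states in hidden[-last_k_layers:]]
    return mean_vectors(per_layer)


hidden = [
    [[0.0, 0.0], [2.0, 2.0]],  # layer 0, two positions
    [[1.0, 3.0], [3.0, 5.0]],  # layer 1
    [[2.0, 0.0], [4.0, 2.0]],  # layer 2
]
rep_mean = user_representation(hidden, time_mode="mean", last_k_layers=2)
rep_last = user_representation(hidden, time_mode="last", last_k_layers=1)
```

Averaging over several positions and layers is one plausible way to make the representation change less when a single new event arrives, which relates to the stability concern raised above.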
Then we move to the very top layer, the objective or loss function. This is also very interesting, 00:10:06.680 |
in the sense that it's much richer than in LLMs. First, instead of one 00:10:12.600 |
sequence we use multiple sequences to represent the output. You can have a sequence of entity IDs; 00:10:19.000 |
that's your next-token prediction with softmax or sampled softmax. But then we have many other facets 00:10:27.400 |
or fields of each event that can also be used as targets. It could be things like action 00:10:34.520 |
type; it could be some aspect of the entity's metadata, like entity type, genre, language, and so 00:10:40.280 |
forth; and it could be about your action, like predicting the duration, the device where the action 00:10:47.000 |
happened, or the time when the next user play will happen. Those are all legitimate targets or labels. 00:10:55.880 |
Depending on your use case, you can use them for fine-tuning. You can cast the 00:11:00.840 |
problem as a multi-task learning problem with multi-head or hierarchical prediction, but you can also use them just as 00:11:08.440 |
weights, rewards, or masks on the loss function, in order to adapt the model by 00:11:14.120 |
zooming into the one subcategory of user behavior you want the model to learn. So that's 00:11:21.720 |
what I wanted to say about the model architecture. So does it scale? The first question, part of the 00:11:30.680 |
first hypothesis, that we want to answer is: does a scaling law apply? I think the answer is 00:11:35.880 |
yes. This is over the roughly two to two and a half years we were scaling up, and we 00:11:44.120 |
still consistently see gains, from only on the order of 10 million profiles, or a few million profiles, to now on 00:11:52.840 |
the order of one billion model parameters, and we scale up the data accordingly. For now we stop 00:12:01.720 |
here. We could still keep going, but as you may realize, recommendation systems usually have much more 00:12:09.080 |
stringent latency and cost requirements, so scaling up further also requires distilling back. But certainly I think this is not the end of the scaling law. 00:12:20.840 |
Before wrapping up the data and training discussion, I would like to highlight some of the learnings, I think quite 00:12:26.840 |
interesting, that we borrowed from LLMs. This is not an exhaustive list, but these are the top three that are very interesting to me. 00:12:34.360 |
One is multi-token prediction. You may have seen this in the DeepSeek paper and 00:12:40.520 |
so forth. Implementation-wise you can use multi-head, multi-label, and other 00:12:46.360 |
implementation flavors, but the goal is really to force the model to be less 00:12:50.760 |
myopic and more robust to serving-time shift, because you have a time gap between your training and serving, 00:12:56.680 |
and also to force the model to target long-term user satisfaction and long-term user behavior instead 00:13:02.920 |
of just focusing on the next action. We have observed a very notable metrics improvement by doing that. 00:13:11.400 |
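How the training targets differ between next-token and multi-token prediction can be shown with a small data-preparation sketch. It is illustrative only: the `multi_token_targets` helper and the `horizon` parameter are hypothetical, and the talk does not specify how the targets or heads are actually built.

```python
from typing import List, Tuple


def multi_token_targets(sequence: List[int], horizon: int) -> List[Tuple[List[int], List[int]]]:
    """Pair each prefix with up to `horizon` future items as targets.

    With horizon=1 this reduces to ordinary next-token prediction.
    """
    examples = []
    for t in range(1, len(sequence)):
        context = sequence[:t]
        targets = sequence[t:t + horizon]
        examples.append((context, targets))
    return examples


watched = [11, 22, 33, 44]  # toy sequence of title IDs
next_only = multi_token_targets(watched, horizon=1)
multi = multi_token_targets(watched, horizon=3)
```

With a longer horizon, each context is asked to account for several upcoming interactions rather than only the immediate one, which is the sense in which the objective becomes less myopic.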
The second is multi-layer representation, which I touched upon with the profile representation. This 00:13:17.480 |
is also translated from the LLM side: techniques of layer-wise supervision, self-distillation, or multi-layer 00:13:24.200 |
output aggregation. The goal here is really to make a better and more stable user representation. 00:13:30.680 |
Lastly, and this should be no surprise, long context window handling: from truncated sliding windows, 00:13:37.240 |
to sparse attention, to progressively training on longer and longer sequences, and eventually to all of 00:13:44.440 |
the parallelism strategies. This is about more efficient training and maximizing the learning. 00:13:51.640 |
Okay, so let me shift gears to talk about serving and applications. Before the foundation model (FM), 00:13:58.360 |
this is roughly the algo stack we had for personalization: a lot of data, many features, and many models 00:14:05.800 |
independently developed, each serving one or multiple canvases, or applications, as we call them. 00:14:11.160 |
Now with the foundation model we largely consolidate the data and representation layer, especially the user 00:14:19.400 |
representation as well as the content representation in the personalization domain, and the model layer as well, because 00:14:27.080 |
each application model is now built on top of the FM, so it becomes a thinner layer instead of a 00:14:33.400 |
standalone, full-fledged model trained from scratch. So how do the various models utilize the foundation 00:14:39.960 |
model? There are three main approaches, or consumption patterns. The first is that the foundation model can be 00:14:47.720 |
integrated as a subgraph within the downstream model. Additionally, the content embeddings learned from the 00:14:53.320 |
foundation model can be integrated as embedding lookup layers. So the downstream model is a neural network; 00:14:58.680 |
it may already have some sequence transformer tower or graph, and then we use a pre-trained 00:15:10.440 |
foundation model subgraph to directly replace that. The second is that we can push out embeddings. 00:15:17.720 |
This is no surprise: both content or entity embeddings and member embeddings. 00:15:22.600 |
The main concern here, of course, is how frequently we want to refresh the 00:15:29.480 |
member embeddings and how we make sure they are stable. We push them to the centralized embedding store, and this of course allows far 00:15:38.840 |
wider use cases than just personalization, because people in analytics and data scientists can also just 00:15:45.880 |
fetch those embeddings directly to do the things that they want. Finally, users can extract the model and 00:15:55.000 |
fine-tune it for specific applications, either fine-tuning it directly or doing distillation to meet the online 00:16:01.240 |
serving requirements, especially for applications with very strict latency requirements. 00:16:07.080 |
To wrap up, I want to show at a high level the wins we accumulated over the last year and a half 00:16:15.480 |
by incorporating the FM into various places. The blue bars represent how many applications have the FM incorporated, and the green bars 00:16:26.440 |
represent the A/B test wins, because in any application we may have multiple A/B tests going on 00:16:32.680 |
that produce wins. So we indeed see high leverage from the FM in bringing about both A/B test wins as well as 00:16:44.520 |
So I think the big bets are validated. It is a scalable solution, both in 00:16:51.880 |
terms of scaling up the model with improved quality and in making the whole infra consolidated, 00:16:59.080 |
so that scaling to new applications is much easier. It is high leverage because it's centralized learning. 00:17:05.960 |
Innovation velocity is also faster, because we allow a lot of newly launched applications to directly fine-tune 00:17:13.880 |
the foundation model to launch the first experience. 00:17:16.760 |
As for current directions, one is that we want to have a universal representation for heterogeneous 00:17:26.600 |
entities. This is, as you can guess, semantic IDs and work along those lines, because we want to cover that 00:17:32.680 |
as Netflix expands to very heterogeneous content types. The second is generative retrieval for 00:17:41.080 |
collection recommendation. Instead of just recommending a single video, we can be generative at 00:17:45.240 |
inference and serving time, and because you have multi-step decoding, a lot of the considerations 00:17:51.000 |
about business rules or diversity, for example, can be naturally handled in the decoding process. 00:17:56.520 |
Lastly, faster adaptation through prompt tuning. This is also borrowed from LLMs: can we just train some soft tokens 00:18:05.080 |
so that at inference time we can directly swap those soft tokens in and out to prompt the FM to behave differently? 00:18:12.280 |
That is also a very promising direction that we are getting into. All right, that concludes my talk. Thank you for your attention and questions. 00:18:25.240 |
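The swap-in, swap-out mechanic of soft tokens can be sketched as prepending a small set of learned vectors to the input sequence. Everything concrete here is an assumption for illustration: the task names, the vector values, and the `with_soft_prompt` helper are hypothetical, and training the soft tokens (with the foundation model frozen) is outside the sketch.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical learned soft tokens: one short prefix of vectors per application.
soft_prompts: Dict[str, List[Vector]] = {
    "homepage": [[0.1, 0.2], [0.3, 0.4]],
    "kids": [[-0.5, 0.5]],
}


def with_soft_prompt(task: str, event_embeddings: List[Vector]) -> List[Vector]:
    """Prepend the task's soft tokens; a frozen model would consume the result."""
    return soft_prompts[task] + event_embeddings


user_sequence = [[1.0, 0.0], [0.0, 1.0]]  # toy embedded interaction history
homepage_input = with_soft_prompt("homepage", user_sequence)
kids_input = with_soft_prompt("kids", user_sequence)
```

Because only the prefix changes between applications, the same model weights and the same user sequence can be reused, which is what makes this form of adaptation fast.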
Thank you. If you have any questions, may I invite you to come to the mic in front while our next speakers from mr kata get set up. 00:18:35.640 |
Hi, thank you for the talk. Since you have billions of users, beyond the recommendation system 00:18:45.320 |
maybe you can do much more, right? So what's your thought on that? Say I just ask you to 00:18:52.120 |
predict who's the next president of the United States. Thank you. So, could you explain 00:19:00.280 |
a little bit what you mean by beyond recommendation? Do you mean other personalization, or other things? 00:19:05.000 |
Yeah, since you have something like a billion users' preferences, that preference is 00:19:13.720 |
also linked to what things they're buying or who they will vote for as the next president. So do you think 00:19:20.200 |
your foundation model has the capability to expand beyond recommending what videos they want to 00:19:25.800 |
watch, to what else they like or what their opinions are on anything else? Thank you. Yes, so I think 00:19:32.280 |
we are expanding to different entity types and also capturing users' tastes both on and 00:19:40.760 |
off our platform. I think that's a general trend that we're going toward, yes. Great, thank you, this was really 00:19:50.040 |
helpful. A question, and you might not be able to share everything for IP reasons, but share whatever you can: 00:19:55.640 |
thoughts on graph models? I didn't hear a lot about that in your talk. Graphs and 00:20:01.880 |
reinforcement learning: any utilization there, any benefits you saw, any boost in performance and 00:20:07.720 |
accuracy? Yeah, that's a good question. We actually have a dedicated sub-team doing graph 00:20:15.160 |
models, especially around our knowledge graph, to cover the content space both on and off our platform in 00:20:22.200 |
the entire entertainment ecosystem. So we actually use a lot of embeddings, for example from the 00:20:29.800 |
graph model, for cold start. Where I showed those semantic embeddings, that's where they come from. 00:20:35.960 |
In terms of reinforcement learning, yes as well, especially where we consider sparse rewards. The rewards we have 00:20:42.760 |
from users' actions are pretty sparse, but we want to use them to guide how, for example, we generate 00:20:50.040 |
the whole collection. That's where we need to consider how to use 00:20:54.120 |
those rewards to guide the process. I think one more question. I'm sorry, can I ask a two-part question? 00:21:02.600 |
Sure. I will be here, so we can also follow up. Do you also use these unified representations as 00:21:09.400 |
embedding features for downstream models? You had a slide on how you use the unified model. Yeah, so 00:21:18.280 |
the embeddings learned within our model we also expose downstream for others to consume. 00:21:25.400 |
To train our unified embeddings we also have some upstream inputs, for example 00:21:33.320 |
the GNN embeddings; those are also consumed to do that. Last one, is it fast? Yes. 00:21:41.480 |
Hello. In your embeddings, are you just using when someone does an action, or, sorry, for these 00:21:51.560 |
embeddings are you just using metadata about the video to understand what they like, or are you actually using 00:21:56.200 |
something like frame-by-frame video or second-long clips? Not yet. We do have that from another content group in our 00:22:06.040 |
organization, and I think the trend will go there, but we are not yet at the very granular sub-level, 00:22:12.360 |
like clip level or view level. We have those embeddings but have not quite incorporated them yet. Thank you. 00:22:18.680 |
Thank you, Yesu. Please, another round of applause for Yesu.