Netflix's Big Bet: One model to rule recommendations: Yesu Feng, Netflix

Good afternoon, and thank you, Eugene, for the introduction. Today I'm going to share our big bet at Netflix on personalization: using one foundation model to cover all the recommendation use cases.

At Netflix we have diverse recommendation needs. This is an example homepage for one profile on Netflix. It's a 2D layout of rows and items. Diversity comes at at least three levels. The first level is the row. We have diverse rows: we have genres, for example rows on comedies and rows on action movies; we have rows about new, trending, just-released titles; we also have rows for, say, titles only available on Netflix. That's the first dimension. The second dimension is, of course, the items or entities. In addition to the traditional movies and TV shows, we now have games, we have live streaming, and we're going to add more, so our content space is expanding to very heterogeneous content types. The third level is the page. We have the home page, the search page, and a kids' home page, which is tailored very differently to kids' interests. The mobile feed is a linear page, not a 2D layout, and so on and so forth. So the different pages are also very diverse.

What happened traditionally was that this naturally led to many specialized models being developed over the years. Some models rank videos, some rank rows; some focus on, for example, shows users have not watched yet, and some focus on shows users are already engaging with. Many of those models were built independently over the years. They may have different objectives, but they have a lot of overlap as well. So naturally this led to duplication, in our label engineering as well as our feature engineering. Take feature engineering as an example. We have this very commonly used factual data about user interaction history. The factual data is the same, but over the years many features were derived from the same fact data, such as counts of different actions, counts of actions within
various time windows or other slice-and-dice dimensions, similarity between the titles in the user's history and the target titles, and lastly just a sequence of unique show IDs to be used as a sequence feature in the model. This list can go on and on. Because those features were developed independently in each model, they have slight variations but are largely very similar, so they become very hard to maintain.

So the challenge back then was: is this scalable? Obviously not. If we keep expanding our landscape of content types or business use cases, it's not manageable to spin up new models for each individual use case. There's not much leverage. There are some shared components for building features and labels, but by and large each model is spun up independently. That also impacts our innovation velocity, in the sense that you don't reuse as much as you can; instead you just spin up new models pretty much from scratch.

This was the situation about four years ago, at the beginning or middle of the pandemic. The question we asked at that time was: can we centralize the learning of user representation in one place? The answer is yes, and we had a key hypothesis about a foundation model based on the transformer architecture. Concretely, there are two hypotheses. The first is that through scaled-up semi-supervised learning, personalization can be improved: the scaling law applies to recommendation systems as it applies to LLMs. The second is that by integrating the foundation model into all systems we can create high leverage: we can improve all the downstream canvas-facing models at the same time. We'll see in the following slides how we validated those hypotheses.

I'll break the overview into two subsections: first, data and training, and second,
application and serving.

Let's start with data and training, beginning with the data. A very interesting aspect of building such a foundation model, an autoregressive transformer, is that there are a lot of analogies, but sometimes also differences, between it and an LLM, so we can transfer a lot of learnings and inspiration from LLM development. Start from the very bottom layer, which is basically data cleaning and tokenization. People who work with LLMs understand that tokenization decisions have a profound impact on model quality. Although it's the bottom layer, the decisions you make there can percolate through all the downstream layers and manifest as either a model quality problem or a model quality plus. This applies to a recommendation foundation model as well.

There are some differences, though. Very importantly, a language token is just one ID, whereas if we want to translate the user interaction history into a sequence, each token is an interaction event from the user, and that event has many facets or fields. It's not just one ID; there is a lot of rich information about the event. All of those fields can play a role in the tokenization decision, and that's what we need to consider very carefully: what is the granularity of tokenization, and how do we trade that off against the context window, for example. Through many iterations I think we reached the right abstraction and interfaces, which we can use to adjust our tokenization to different use cases. For example, you can imagine one version of tokenization used for pre-training, and then, for fine-tuning against a specific application, a slightly different tokenization.

Moving up from the tokenization layer, we come to the model layers. At a high level, from bottom to top, we go through event representation, the embedding layer, the transformer layer, and the objective layer. For event representation, as we
just briefly touched upon, there is a lot of information in an event. At a high level you can break it down into when, where, and what. When the event happened is about time encoding. Where it happened is about physical location, your locale, country, and so forth, but also about the device and the canvas: which row and which page the action happened on. What is about the target entity or title: which title you interacted with, what the interaction was, how long it lasted, and any other information associated with the action. That's where we need to decide what information to keep, what to drop, and so on.

Moving one layer up, to the embedding and feature transformation layer, one thing that needs to be pointed out is that for recommendation we need to combine ID embedding learning with other semantic content information. If you only have ID embeddings learned from scratch in the model, then you have a cold-start problem: the model doesn't know how to deal at inference time with titles it hasn't seen during training. So we need semantic content information to complement those ID embeddings. This is not a problem for LLMs, but cold start is very commonly encountered in recommendation systems.

For the transformer layer, I think there's no need to go too deep in terms of architecture choices, optimization, and so forth. The only thing I want to point out is that we use the hidden-state output from this layer as our user representation; one of the primary goals of the foundation model is to learn a good long-term user representation. We then need to put this into context. Things to consider are, for example: how stable is our user representation, given that the user profile and interaction history keep changing? How do we guarantee or maintain the stability of the representation? And what kind of aggregation should we use? Broadly, you can
aggregate across the time (sequence) dimension, or aggregate across the layers: you have multiple self-attention layers, so how do you aggregate them? And lastly, do we need to explicitly adapt the representation to our downstream objective by fine-tuning it?

Then we move to the very top layer, the objective or loss function. This is also very interesting in the sense that it's much richer than for an LLM. First, instead of one sequence, we use multiple sequences to represent the output. You can have a sequence of entity IDs; that's your next-token prediction with softmax or sampled softmax. But many other facets or fields of each event can also be used as targets. It could be the action type; it could be some aspect of the entity's metadata, like entity type, genre, language, and so forth; and it could be about the action, like predicting the duration, the device where the action happened, or the time when the next user play will happen. Those are all legitimate targets or labels, and depending on your use case you can use them for fine-tuning. You can cast the problem as a multi-task learning problem, with multi-head or hierarchical prediction, but you can also use them simply as weights, rewards, or masks on the loss function, to adapt the model by zooming in on the one subcategory of user behavior you want it to learn. That's what I wanted to say about the model architecture.

So, does it scale? Part of the first hypothesis we wanted to answer is whether a scaling law applies, and I think the answer is yes. This is over roughly two to two and a half years of scaling up, and we still constantly see gains, from only on the order of 10 million or a few million to now on the order of one billion model
parameters, and we scaled up the data accordingly. We stop here for now. We could keep going, but as you may realize, recommendation systems usually have much more stringent latency and cost requirements, so scaling up more also requires distilling back down. But certainly I think this is not the end of the scaling law.

Before wrapping up the data and training discussion, I would like to highlight some of the learnings we borrowed from LLMs. This is not an exhaustive list, but these are the top three that I find very interesting. The first is multi-token prediction. You may have seen this in the DeepSeek paper and elsewhere. Implementation-wise you can use multi-head, multi-label, or other flavors, but the goal is really to force the model to be less myopic and more robust to serving-time shift, because there is a time gap between training and serving, and also to force the model to target long-term user satisfaction and long-term user behavior instead of just focusing on the next action. We have observed very notable metric improvements by doing that.

The second is multi-layer representation, which I touched upon for the user profile representation. This is also translated from LLMs: techniques such as layer-wise supervision, self-distillation, and multi-layer output aggregation. The goal here is really to make a better and more stable user representation.

Lastly, and this should be no surprise, long context window handling: from truncated sliding windows, to sparse attention, to progressively training on longer and longer sequences, and eventually all of the parallelism strategies. This is about more efficient training and maximizing the learning.

Let's shift gears to talk about serving and applications. Before the foundation model (FM), this was roughly the algorithm stack we had for personalization: many data sources, many features, and many models developed independently, each serving multiple or
a single canvas or application. Now, with the foundation model, we largely consolidate the data and representation layers, especially the user representation as well as the content representation in the personalization domain. The model layer is consolidated as well, because each application model is now built on top of the FM, so it becomes a thinner layer instead of a standalone, full-fledged model trained from scratch.

How do the various models utilize the foundation model? There are three main approaches or consumption patterns. The first is that the foundation model can be integrated as a subgraph within the downstream model. Additionally, the content embeddings learned by the foundation model can be integrated as embedding lookup layers. The downstream model is a neural network that may already have a sequence transformer tower or graph, and the pre-trained foundation model subgraph directly replaces it.

The second is that we can push out embeddings, which is no surprise: both content or entity embeddings and member embeddings. The main concerns here, of course, are how frequently we want to refresh the member embeddings and how we make sure they are stable. We push them to a centralized embedding store. This of course allows far wider use cases than just personalization, because analytics people and data scientists can also fetch those embeddings directly to do what they want.

Finally, users can extract the model and fine-tune it for specific applications, or distill it to meet online serving requirements, especially for applications with very strict latency requirements.

To wrap up, I want to show at a high level the wins we accumulated over the last year and a half by incorporating the FM into various places. The blue bar represents how many applications have the FM incorporated. The green bar represents the A/B test wins, because in any application
we may have multiple A/B tests running that produce wins. We indeed see high leverage from the FM, bringing about both A/B test wins and infrastructure consolidation. So I think the big bets are validated. It is a scalable solution, both in terms of scaling up the model with improved quality and in terms of consolidating the whole infrastructure and making it much easier to scale to new applications. It is high leverage because the learning is centralized. Innovation velocity is also faster, because we allow a lot of newly launched applications to directly fine-tune the foundation model to launch their first experience.

On to the current directions. One is that we want a universal representation for heterogeneous entities. As you can guess, this is semantic IDs and work along those lines, because we want to cover that as Netflix expands to very heterogeneous content types. The second is generative retrieval for collection recommendation: instead of just recommending a single video, be generative at inference and serving time. Because you have multi-step decoding, a lot of the considerations about business rules or diversity, for example, can be handled naturally in the decoding process. Lastly, faster adaptation through prompt tuning. This is also borrowed from LLMs: can we just train some soft tokens, so that at inference time we can directly swap those soft tokens in and out to prompt the FM to behave differently? That is also a very promising direction that we are getting into.

All right, that concludes my talk. Thank you for your attention and questions.

Thank you. If you have any questions, may I invite you to come to the mic in front while our next speakers from mr kata get set up.

Hi, thank you for the talk. Since you have billions of users, beyond the recommendation system maybe you can do much more, right? So what's your thought on that, since I can just
ask you to predict who's the next president of the United States? Thank you.

Could you explain a little what you mean by beyond recommendation? Do you mean other personalization, or other things?

Yeah. Since you have a kind of billion users' preferences, those preferences are also linked to what things they're buying, or who they will vote for as the next president. So do you think your foundation model has the capability to expand beyond recommending what videos they want to watch, to what else they like or what their opinions are on anything else? Thank you.

Yes. I think we are expanding to different entity types, and also to capturing users' tastes both on and off our platform. I think that's the general trend we're going toward.

Great, thank you, this was really helpful. A question, and you might not be able to share for IP reasons, but whatever you can: thoughts on graph models? I didn't hear a lot about that in your talk. Graphs and reinforcement learning: any utilization there, any benefits you saw, any boost in performance and accuracy?

Yeah, that's a good question. We actually have a dedicated sub-team doing graph models, especially around our knowledge graph, to cover the content space both on and off our platform, across the entire entertainment ecosystem. We actually use a lot of embeddings from the graph model, for example for cold start; that's where the semantic embeddings I showed come from. In terms of reinforcement learning, yes as well, especially where we consider sparse rewards. The rewards we get from users' actions are pretty much sparse, but we want to use them to guide, for example, how we generate the whole collection. That's where we need to consider how to use those rewards to guide the process.

I think one more question.

I'm sorry, can I ask a two-part question?

Sure, I will be
here, so we can also follow up.

Do you also use these unified representations as embedding features in downstream models? You had a slide on how you use the unified model.

Yeah. The embeddings learned within our model are also exposed to downstream models to consume. To train our unified embeddings we also have some upstream inputs, for example the GNN embeddings, which are consumed as well.

Last one, is it fast?

Yes.

Hello. In these embeddings, are you just using metadata about the video to understand what they like, or are you actually using the video frame by frame, or second-long clips?

Not yet. We do have that from some other content groups in our organization, but I think the trend will go there. We are not yet at a very granular clip level or view level. We have those embeddings, but they are not yet incorporated.

Thank you.

Thank you, Yesu. Please, another round of applause for Yesu.
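
The tokenization discussion in the talk (each token is an interaction event with many fields, and granularity trades off against the context window) can be illustrated with a small sketch. Everything here is an assumption for illustration: the `Event` fields, the `merge_adjacent` helper, and the rule of merging consecutive events on the same title are hypothetical and are not described in the talk as Netflix's actual tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """One interaction event; field names are illustrative assumptions."""
    title_id: str    # "what": the entity the user interacted with
    action: str      # "what": the kind of interaction, e.g. "play"
    duration_s: int  # "what": how long the interaction lasted
    device: str      # "where": the device the action happened on


def merge_adjacent(events: List[Event]) -> List[Event]:
    """Coarsen granularity by merging consecutive events that share
    the same title and action, so a long history uses fewer tokens."""
    merged: List[Event] = []
    for ev in events:
        last = merged[-1] if merged else None
        if last is not None and last.title_id == ev.title_id and last.action == ev.action:
            # Sum durations; the merged token keeps the first event's device.
            merged[-1] = Event(last.title_id, last.action,
                               last.duration_s + ev.duration_s, last.device)
        else:
            merged.append(ev)
    return merged


history = [
    Event("show_1", "play", 600, "tv"),
    Event("show_1", "play", 900, "tv"),
    Event("show_2", "play", 300, "mobile"),
]
tokens = merge_adjacent(history)
```

Under this assumed rule, the three raw events become two tokens. A finer-grained tokenizer would keep all three and spend more of the context window on the same history.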
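
The cold-start point from the embedding-layer discussion (ID embeddings alone cannot represent titles unseen in training, so semantic content information must complement them) can be sketched as below. This is a minimal sketch under assumptions: the talk does not say how the two embeddings are combined, so the element-wise sum, the `item_embedding` helper, and the tiny two-dimensional vectors are all hypothetical.

```python
from typing import Dict, List, Optional


def item_embedding(
    title_id: str,
    id_table: Dict[str, List[float]],
    semantic_vec: List[float],
) -> List[float]:
    """Combine a learned ID embedding with a content-derived semantic
    embedding. Summation is an assumed combination rule."""
    id_vec: Optional[List[float]] = id_table.get(title_id)
    if id_vec is None:
        # Cold start: the title was never seen in training, so there is
        # no ID embedding; fall back to semantic content alone.
        return list(semantic_vec)
    return [a + b for a, b in zip(id_vec, semantic_vec)]


# ID embeddings learned during training (illustrative values).
id_table = {"show_1": [0.5, -0.25]}

seen = item_embedding("show_1", id_table, [0.25, 0.25])
unseen = item_embedding("show_new", id_table, [0.25, 0.25])
```

The unseen title still gets a usable vector because the semantic embedding comes from content information rather than from interaction history.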
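
The multi-token prediction idea from the talk (train the model to predict several upcoming items, not only the next one, so it is less myopic) can be sketched at the level of training-example construction. This is an illustrative sketch only: `multi_token_targets` and its `horizon` parameter are hypothetical names, and the talk does not specify how targets are built or how the heads are implemented.

```python
from typing import List, Tuple


def multi_token_targets(
    item_ids: List[int], horizon: int
) -> List[Tuple[List[int], List[int]]]:
    """For each prefix of the history, pair the context with the next
    `horizon` item IDs (fewer near the end of the sequence)."""
    examples: List[Tuple[List[int], List[int]]] = []
    for t in range(1, len(item_ids)):
        context = item_ids[:t]
        targets = item_ids[t:t + horizon]
        examples.append((context, targets))
    return examples


examples = multi_token_targets([10, 11, 12, 13], horizon=2)
```

With `horizon=1` this reduces to ordinary next-token prediction. Larger horizons give each position several labels, which could be fed to a multi-head or multi-label loss as mentioned in the talk.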
