back to index

Netflix's Big Bet: One model to rule recommendations: Yesu Feng, Netflix


Whisper Transcript | Transcript Only Page

00:00:00.000 | good afternoon thank you Eugene for the introduction so today I'm going to share
00:00:22.200 | our big bet and Netflix on personalization namely to use one foundation model to cover
00:00:28.080 | all the recommendation use cases and Netflix we have diverse recommendation needs this is an
00:00:36.360 | example homepage of one node profile on Netflix it's a 2d layout roles and items diversity comes
00:00:45.780 | at at least the three levels there is first level about role we have diverse roles we have a young
00:00:53.100 | runs for example roles on comedies roles on action movies we have roles about new trending just the
00:01:01.080 | release titles we also have a rose about for example titles only available on Netflix so that's the first
00:01:08.760 | dimension second dimension is of course of the items or entities in addition to traditionally movie and TV
00:01:16.380 | shows now we have games we have live streaming and we're going to add more so our content space is
00:01:23.660 | expanding to very heterogeneous content types the third level is page so we have home page we have search
00:01:34.640 | page we have a kids home page which is tailored very differently to what kids interest mobile feed is a
00:01:42.600 | linear page is not a 2d layout so on and so forth so page different pages are also very diverse what
00:01:49.880 | happened traditionally was that these lead to naturally many specialized models that get developed over the
00:01:59.320 | years some models rank videos some rank rows some focus on for example shows user have not watched yet some focus on shows was a user are already engaging
00:02:11.880 | and many of those models are were built independently over the years they may have different objectives but have a lot of overlaps as well
00:02:21.880 | as well so naturally this lead to duplications duplications in our label engineering as well as feature engineering take the feature engineering as example we have this very commonly used the factual data about user intact and history
00:02:41.160 | uh the the the factual data is the same but over the years many features are developed derived out of the same facts data like counts of different actions
00:02:51.160 | counts of actions within various time window or other kind of uh slice and dice dimensions similarity between the users history titles against the target titles unique
00:03:03.160 | and lastly like uh lastly like uh just a sequence of unique show ids uh to be used as a sequence feature into the model so this list can go on and on and on
00:03:13.160 | and a lot of those features uh because they are developed independently into each model they have slight variations but become very but largely uh very similar so become very hard to uh maintain
00:03:25.160 | so the challenge the challenge the challenge back then was uh is this scalable uh obviously not if we keep expanding
00:03:35.160 | our landscape of our landscape of content type or business use cases it's not manageable to spin up new models for each uh individual use cases
00:03:45.160 | uh there's not much leverage uh there's some shared components on building the feature label but still by and large each model uh basically uh
00:03:55.160 | uh spinned up independently and that also impact our innovation velocity in in the terms that you don't reuse as much as you can instead you just
00:04:05.160 | spin up new models uh pretty much from scratch
00:04:09.160 | uh so this was the situation about four years ago uh at the beginning or middle of the pandemic
00:04:15.400 | so the question we asked at that time was uh can we centralize the learning of user representation
00:04:21.720 | in one place so the answer is yes and we had this key hypothesis that about foundation model based on transformer
00:04:31.480 | architecture uh concretely two hypotheses here one hypothesis is that through scaled up semi-supervised learning
00:04:38.360 | personalization can be improved uh the scaling law also applies to recommendation system as it applies to llm
00:04:46.040 | second is that by integrating the foundation model into all systems we can create high leverage we can
00:04:51.880 | simultaneously improve all the downstream canvas facing models at the same time so we'll see in the
00:04:59.080 | following slides how we validate those hypotheses uh i'll break up the overview into two subsessions first about
00:05:07.640 | data data data and training and the later uh second about application and serving
00:05:12.280 | so um about data and training so starting from data a very interesting aspect of a building such foundation
00:05:20.120 | model auto regressive transformer is that there's a lot of analogy but also differences sometimes uh between
00:05:27.400 | this and llm so we can transfer a lot of learnings inspirations from llm development if we start from the
00:05:37.400 | very bottom layer which is basically data cleaning and tokenization people work with llm understand
00:05:43.960 | tokenization decisions have profound impact in your model quality so although it's the bottom layer the
00:05:52.120 | decision you made there can percolate through all the downstream layers and manifest as either your model
00:05:57.160 | quality problem or model quality plus so this applies to recommendation uh foundation model as well instead of
00:06:06.520 | uh there are some differences very importantly instead of language tokens which is just one id here for uh if
00:06:13.800 | we want to translate the user interaction history or sequence each of the token is a event interaction event
00:06:21.720 | from the user right but that event has many facets or many fields so it's not just the one id you can represent
00:06:27.320 | there are a lot of rich information about the event so how you all of those fields can play a role in making the decision of tokenization
00:06:36.120 | uh i think that's what we need to consider very carefully um what is the granularity of tokenization
00:06:43.320 | and trade off that versus the context window for example um and through many iterations we reach the
00:06:49.320 | right i think reach the right abstraction and interfaces that we can use to uh adjust our tokenization to
00:06:56.680 | different use cases for example you can imagine we have a token at one version of tokenization used for pre-training
00:07:02.040 | for fine-tuning against a specific application we apply slightly different tokenization
00:07:06.760 | um so moving up from the tokenization layer then becomes the model layers uh at high level
00:07:18.200 | uh from bottom to top we go through the uh event representation embedding layer transformer layer and the objective layer
00:07:28.200 | uh so event representation as we just briefly touched upon uh many information in the event about a high
00:07:36.200 | level you can break it down by when where and what when that even happened that's about timing coding and
00:07:43.640 | where it happened it's about a physical location your locale country so and so forth but also about device
00:07:49.080 | about the uh canvas or which row which page this action happened uh and then uh what basically
00:07:57.880 | is about the target entity or the title which title you interacted with what is the interaction how long
00:08:04.760 | and uh any that kind of information associated with the action so um that's where the we need to decide
00:08:13.480 | what information we need to keep what we should drop so on and so forth uh moving one layer above uh the
00:08:20.440 | embedding feature transformation layer uh one thing that needs to be pointed out is that for recommendation
00:08:26.760 | we need to combine id embedding learning with other semantic content information
00:08:31.800 | if you only have id embedding learn from scratch in the model then you have problem with costar
00:08:37.720 | meaning that titles the model hasn't seen during training it doesn't know how to deal with it at
00:08:43.000 | inference time so we need to have semantic content information to be a comp complementary to those id embeddings
00:08:52.280 | this is not a problem for llm but very commonly encountered a costar problem for rec recommendation
00:08:57.480 | system uh transformer layer i think there's no need to talk too much into this in terms of architecture
00:09:03.880 | choices optimization so on and so forth the only thing that i want to point out is that uh we are using
00:09:10.280 | the hidden state output from this layer as our user representation which is one of the primary goal of
00:09:15.880 | the foundation model is to learn a good long-term user representation then uh we need to put this
00:09:22.040 | into context then things to consider are for example how stable is our user user representation given
00:09:28.280 | our user profile user interaction history keep changing how do we guarantee or maintain the stability of
00:09:33.960 | their representation and what kind of aggregation we should use you can think of broadly aggregate
00:09:39.880 | across the time dimension in terms of sequence dimension or aggregate uh across the layers you have
00:09:47.080 | multiple self-attention layer how do you aggregate that um and then lastly
00:09:51.880 | do we need to do explicit adaptation of the representation based on our downstream objective
00:09:57.240 | to fine tune it
00:09:58.120 | so then we move to last uh the very top layer uh objective loss function this is also very interesting
00:10:06.680 | in the sense that it's much richer than llm because you can see first we use uh instead of one
00:10:12.600 | sequence but multiple sequence to represent the output because you can have a sequence of entity ids
00:10:19.000 | that's your like uh next token prediction softmax or sample softmax but then we have many many other facets
00:10:27.400 | of field of each event that can be also used as a target okay so it could be for things like uh action
00:10:34.520 | type it could be some aspect of the entities metadata like entity type young round language so on and so
00:10:40.280 | forth and also about your action like the prediction of the duration or uh the device where the action
00:10:47.000 | happened or the time when the next uh user play will happen so those are all legitimate targets or labels
00:10:55.880 | depends on your use case you can use them to do the fine tuning now instead of so you can cast the
00:11:00.840 | problem as a multi-task learning problem multi-head or hierarchical prediction but you can also use them just as your
00:11:08.440 | your weights your rewards or your mask on the loss function so in terms of to adapt the model to
00:11:14.120 | zooming into one subcategory of uh user behavior you want to you want the model to learn okay so that's
00:11:21.720 | about the model architecture that i want to talk about um so does it scale the first question a part of the
00:11:30.680 | first hypothesis we want to answer is that does a school a scaling law apply and i think the answer is
00:11:35.880 | yes so this is over the uh roughly two to two to two and a half a years we were scaling up and then we
00:11:44.120 | constantly still see the gain uh from only on the order of 10 million profile or a few million profile to now on
00:11:52.840 | the order of one billion uh model parameters we scale up for the data accordingly um now we stop
00:12:01.720 | here because we can still keep going but uh as you may realize that recommendation system usually have much
00:12:09.080 | stringent latency cost requirement so scaling up scaling up more requires to also distill back yeah but certainly i think this is not the end of the scaling law
00:12:20.840 | uh before we're wrapping up the data and training session uh discussion i would like to highlight some of the learnings i think quite
00:12:26.840 | interesting we borrow from llm this is not exhaustive list but uh i think very interesting to me
00:12:34.360 | uh the top three one is multi-token prediction you may have seen this in the deep seek paper so
00:12:40.520 | and so forth so you get uh implementation wise you can use multi-head multi-label so and uh different
00:12:46.360 | implementation flavor but the goal is really to force the model to be less mild
00:12:50.760 | biopic more robust to serving time shift because you have a time gap between your training and serving
00:12:56.680 | and also force the model to targets long-term user satisfaction and long-term user behavior instead
00:13:02.920 | of just focus on next action i we have observed in a very notable uh metrics improvement by doing that
00:13:11.400 | uh the second is multi-layer representation which i touched upon on the profile representation so this
00:13:17.480 | is also translated from llm aside of techniques of layer wise supervision self-distillation or multi-layer
00:13:24.200 | output aggregation the goal here is really to make a better and more stable user representation
00:13:30.680 | lastly uh this is also should be no surprise long context window handling from truncated sliding window
00:13:37.240 | to sparse attention to progressively training uh longer and longer sequences uh to eventually all of
00:13:44.440 | the parallelism strategies so this is about more efficient training and maximize the learning
00:13:51.640 | okay so uh shift gear to talk about the serving and applications uh before the foundation model fm
00:13:58.360 | this is a roughly the algo stack we have for personalization many data many features many models
00:14:05.800 | independently developed each serving multiple or one canvases or applications we call
00:14:11.160 | now with the foundation model we consolidate largely the data and representation layer especially the user
00:14:19.400 | representation as well as content representation in the personalization domain model layer as well because
00:14:27.080 | model now each application model now are built on top of fm so become a thinner layer instead of a very
00:14:33.400 | standalone full-fledged model trained from scratch so how do the various models utilize the foundation
00:14:39.960 | model there are three main approaches or consumption patterns the first is foundation model can be
00:14:47.720 | integrated as a sub graph within the downstream model additionally the content embeddings learned from the
00:14:53.320 | foundation model can be integrated as the embedding lookup layers so downstream model is a newer network
00:14:58.680 | it may already have initially some of the sequence transformer tower or graph and then using a pre-trained
00:15:10.440 | foundation model model sub graph to directly replace that uh second is that uh we can push out embeddings
00:15:17.720 | this is no surprise from both content side and entity embedding as well as member embeddings
00:15:22.600 | uh the only uh the main concern here of course is how we want to ref how frequently we want to refresh the
00:15:29.480 | member embeddings and how we make sure they are stable uh and push them to the centralized embedding store and this of course allow far more uh
00:15:38.840 | uh wider use cases than just personalization because people analytics data scientists can also just
00:15:45.880 | fetch those embeddings directly to do the things that they want finally user can uh extract the models and
00:15:55.000 | fine-tune it for specific applications either fine-tune or they need to do distillation to meet the online
00:16:01.240 | serving requirement especially for those with very strict latency requirement
00:16:07.080 | to wrap up uh i want to show at high level the wings we accumulated over the last one year and a half
00:16:15.480 | by incorporating fm into various places so the blue bar represent how many applications have fm incorporated the green bar
00:16:26.440 | represent the a b test wings because in any application we may have multiple a b tests going on there
00:16:32.680 | to have wings so we see we indeed see high leverage of fm to bring about both a b test wings as well as
00:16:40.360 | infrastructure consolidation
00:16:44.520 | uh so i think the big back uh big bets are validated uh it is a scalable solution uh in terms of both in
00:16:51.880 | terms of a scalable scaled up the model with improved quality as well as making the whole infra consolidated
00:16:59.080 | and the scale uh to new applications to be much easier high leverage because it's a centralized learning
00:17:05.960 | innovation velocity also is faster because we allow a lot of newly launched applications to directly fine-tune
00:17:13.880 | the foundation model to launch the first experience
00:17:16.760 | so the current directions one is that um we want to have universal representation for heterogeneous
00:17:26.600 | entities this is uh as you can guess the semantic id and along those lines because we want to cover that
00:17:32.680 | as netflix expanding to very different very heterogeneous content types second is generative retrieval for
00:17:41.080 | collection recommendation right so instead of just recommending a single video be generative at
00:17:45.240 | inference time and serving time because you have a modest step decoding a lot of the consideration
00:17:51.000 | about business business rules or diversity for example can be naturally handled in the decoding process
00:17:56.520 | lastly faster adaptation through prompt tuning so this is also borrowed from llm can we just train some soft tokens
00:18:05.080 | uh so that at inference time we can directly swap in and out of those soft tokens to prompt the fm to behave differently
00:18:12.280 | so that is also a very promising direction that we are getting into all right that concludes my talk thank you for your attention and questions
00:18:25.240 | thank you um if you have any questions may i invite you to come to the mic in front um while we get our next speakers from mr kata get set up
00:18:35.640 | uh hi thank you for the talk uh since you get billions of users so except the recommendation system you
00:18:45.320 | you're maybe you're maybe you can do much more right so what's your cloud on that since i can just ask you to
00:18:52.120 | to predict who's the next president in the united states thank you um so i actually don't uh could you explain
00:19:00.280 | a little bit what do you mean by beyond recommendation do you mean the other personalization or other things
00:19:05.000 | um yeah yeah since you get kind of beating users preference so actually that that that preference is
00:19:13.720 | also been linked to what things they're buying or who they will vote for the next president so do you think
00:19:20.200 | your foundation model has that capability to expand not only recommendation what videos they want to
00:19:25.800 | look but what others they like or what's their opinions on anything else thank you yes so i think
00:19:32.280 | we are expanding to different uh i think entity type and also capture uh users taste from both on and
00:19:40.760 | off our platform i think that's a general trend that we're going to yes great thank you this was really
00:19:50.040 | helpful um question on and you might not be able to share it um for ipv reasons but whatever you can
00:19:55.640 | thoughts on graph models didn't i didn't hear a lot of that in your talks graphs and
00:20:01.880 | uh reinforcement learning any utilization there any benefits you saw any boost in performance and
00:20:07.720 | accuracy yeah that's a good question i think we have actually uh a dedicated team sub team doing graph
00:20:15.160 | model uh especially around our knowledge graph to cover the content space both on and off our platform in
00:20:22.200 | the whole entire entertainment ecosystem so we use actually a lot of embeddings for example from the graph
00:20:29.800 | graph model to co-start that's where i see i show those semantic embeddings that's where it comes from
00:20:35.960 | in terms of reinforcement learning yes as well especially where we consider sparse reward that we have on
00:20:42.760 | users from users action are pretty much sparse but we want to use them to guide how for example we generate
00:20:50.040 | the whole collection that's where we need to the whole collection that's where we need to consider how to use
00:20:54.120 | those reward to guide those uh process yeah i think one more question i'm sorry can i ask a two-part question
00:21:02.600 | sure i would be here and so we can also follow up yeah do you also use these unified representations as
00:21:09.400 | embedding features to downstream models you had a slide how you use the unified model yeah so uh the
00:21:18.280 | we so for the embeddings learning within our model we also expose the downstream to do and consume them
00:21:25.400 | uh we also have able to train our unified embedding we also have some upstream like just the for example
00:21:33.320 | the gn embeddings that those are also consumed to to do that last one is it fast yes
00:21:41.480 | uh hello uh in your embeddings are you just using when someone does an action or sorry for the in these
00:21:51.560 | embeddings you're just using metadata over the video to understand what they like or are you actually using
00:21:56.200 | like frame by frame of the video or second clips uh not yet we do have that from some other content group of our
00:22:06.040 | or organization but i think the trend will go there so we are not yet uh into very granular sub
00:22:12.360 | like clips level or view level we have those embeddings but not quite yet to incorporate yeah thank you
00:22:18.680 | thank you yesu please another round of applause for yesu