Netflix's Big Bet: One model to rule recommendations: Yesu Feng, Netflix

00:00:00.000 | good afternoon thank you Eugene for the introduction so today I'm going to share

00:00:22.200 | our big bet at Netflix on personalization namely to use one foundation model to cover

00:00:28.080 | all the recommendation use cases at Netflix we have diverse recommendation needs this is an

00:00:36.360 | example homepage of one profile on Netflix it's a 2d layout of rows and items diversity comes

00:00:45.780 | at at least three levels the first level is about rows we have diverse rows we have genre

00:00:53.100 | rows for example rows on comedies rows on action movies we have rows about new trending just

00:01:01.080 | released titles we also have rows about for example titles only available on Netflix so that's the first

00:01:08.760 | dimension second dimension is of course the items or entities in addition to traditional movies and TV

00:01:16.380 | shows now we have games we have live streaming and we're going to add more so our content space is

00:01:23.660 | expanding to very heterogeneous content types the third level is page so we have home page we have search

00:01:34.640 | page we have a kids home page which is tailored very differently to kids' interests mobile feed is a

00:01:42.600 | linear page not a 2d layout so on and so forth so different pages are also very diverse what

00:01:49.880 | happened traditionally was that these led naturally to many specialized models that got developed over the

00:01:59.320 | years some models rank videos some rank rows some focus on for example shows users have not watched yet some focus on shows users are already engaging with

00:02:11.880 | and many of those models were built independently over the years they may have different objectives but have a lot of overlaps

00:02:21.880 | as well so naturally this led to duplications in our label engineering as well as feature engineering take feature engineering as an example we have this very commonly used factual data about user interaction history

00:02:41.160 | uh the factual data is the same but over the years many features were derived out of the same fact data like counts of different actions

00:02:51.160 | counts of actions within various time windows or other kinds of uh slice and dice dimensions similarity between the user's history titles and the target titles

00:03:03.160 | and lastly uh just a sequence of unique show ids uh to be used as a sequence feature into the model so this list can go on and on

00:03:13.160 | and a lot of those features uh because they are developed independently into each model have slight variations but are largely very similar so they become very hard to maintain
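
A minimal sketch of the duplication described above: several features derived from one shared interaction log. The `Interaction` record, its field names, and the action names are assumptions made for illustration, not Netflix's schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    show_id: str
    action: str     # hypothetical action names, e.g. "play", "thumbs_up"
    timestamp: int  # seconds, any consistent clock


def action_counts(history):
    """Count of each action type over the whole history."""
    return Counter(event.action for event in history)


def action_counts_in_window(history, now, window_seconds):
    """Same counts, restricted to events in the last `window_seconds`."""
    return Counter(
        event.action
        for event in history
        if 0 <= now - event.timestamp <= window_seconds
    )


def unique_show_sequence(history):
    """Show ids in order of first interaction, duplicates removed."""
    seen, sequence = set(), []
    for event in sorted(history, key=lambda e: e.timestamp):
        if event.show_id not in seen:
            seen.add(event.show_id)
            sequence.append(event.show_id)
    return sequence


history = [
    Interaction("s1", "play", 100),
    Interaction("s2", "play", 200),
    Interaction("s1", "thumbs_up", 250),
    Interaction("s3", "play", 900),
]
```

All three functions read the same facts; when each model re-implements them with slightly different windows or filters, the variants drift apart, which is the maintenance problem the talk describes.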

00:03:25.160 | so the challenge back then was uh is this scalable uh obviously not if we keep expanding

00:03:35.160 | our landscape of content types or business use cases it's not manageable to spin up new models for each individual use case

00:03:45.160 | uh there's not much leverage uh there are some shared components for building the features and labels but still by and large each model is basically

00:03:55.160 | spun up independently and that also impacts our innovation velocity in the sense that you don't reuse as much as you can instead you just

00:04:05.160 | spin up new models uh pretty much from scratch

00:04:09.160 | uh so this was the situation about four years ago uh at the beginning or middle of the pandemic

00:04:15.400 | so the question we asked at that time was uh can we centralize the learning of user representation

00:04:21.720 | in one place so the answer is yes and we had this key hypothesis about a foundation model based on the transformer

00:04:31.480 | architecture uh concretely two hypotheses here one hypothesis is that through scaled up semi-supervised learning

00:04:38.360 | personalization can be improved uh the scaling law also applies to recommendation system as it applies to llm

00:04:46.040 | second is that by integrating the foundation model into all systems we can create high leverage we can

00:04:51.880 | simultaneously improve all the downstream canvas facing models at the same time so we'll see in the

00:04:59.080 | following slides how we validate those hypotheses uh i'll break up the overview into two subsections first about

00:05:07.640 | data and training and second about application and serving

00:05:12.280 | so um about data and training so starting from data a very interesting aspect of building such a foundation

00:05:20.120 | model an autoregressive transformer is that there's a lot of analogy but also differences sometimes uh between

00:05:27.400 | this and llm so we can transfer a lot of learnings inspirations from llm development if we start from the

00:05:37.400 | very bottom layer which is basically data cleaning and tokenization people who work with llm understand

00:05:43.960 | tokenization decisions have profound impact on your model quality so although it's the bottom layer the

00:05:52.120 | decisions you make there can percolate through all the downstream layers and manifest as either your model

00:05:57.160 | quality problem or model quality plus so this applies to recommendation uh foundation model as well

00:06:06.520 | uh there are some differences very importantly instead of language tokens which are just one id here if

00:06:13.800 | we want to translate the user interaction history or sequence each token is an interaction event

00:06:21.720 | from the user right but that event has many facets or many fields so it's not just one id you can represent

00:06:27.320 | there is a lot of rich information about the event so all of those fields can play a role in making the decision of tokenization

00:06:36.120 | uh i think that's what we need to consider very carefully um what is the granularity of tokenization

00:06:43.320 | and trade off that versus the context window for example um and through many iterations we reached

00:06:49.320 | i think the right abstraction and interfaces that we can use to uh adjust our tokenization to

00:06:56.680 | different use cases for example you can imagine we have one version of tokenization used for pre-training

00:07:02.040 | for fine-tuning against a specific application we apply slightly different tokenization
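
To make the granularity versus context window trade-off concrete, here is a minimal sketch, assuming a hypothetical `Event` record and a single `merge_adjacent` switch. The talk does not describe the actual tokenization rules, so the merging rule below is only an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    title_id: str
    action: str
    duration_s: int


def tokenize(events, merge_adjacent, context_window):
    """Turn interaction events into tokens (dicts of fields).

    merge_adjacent=True is the coarser setting: consecutive events with the
    same title and action collapse into one token with summed duration, so
    the same context window covers a longer stretch of history.
    """
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    tokens = []
    for event in events:
        previous = tokens[-1] if tokens else None
        if (
            merge_adjacent
            and previous is not None
            and previous["title_id"] == event.title_id
            and previous["action"] == event.action
        ):
            previous["duration_s"] += event.duration_s
        else:
            tokens.append(
                {
                    "title_id": event.title_id,
                    "action": event.action,
                    "duration_s": event.duration_s,
                }
            )
    # Keep only the most recent tokens that fit in the context window.
    return tokens[-context_window:]


events = [
    Event("t1", "play", 600),
    Event("t1", "play", 1200),
    Event("t2", "play", 300),
]
fine = tokenize(events, merge_adjacent=False, context_window=2)
coarse = tokenize(events, merge_adjacent=True, context_window=2)
```

With a window of two tokens, the fine-grained setting drops the first play of `t1`, while the coarse setting keeps its duration inside a merged token. One setting could be used for pre-training and another for a fine-tuning use case.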

00:07:06.760 | um so moving up from the tokenization layer then come the model layers uh at high level

00:07:18.200 | uh from bottom to top we go through the uh event representation embedding layer transformer layer and the objective layer

00:07:28.200 | uh so event representation as we just briefly touched upon uh there is a lot of information in the event at a high

00:07:36.200 | level you can break it down by when where and what when that event happened that's about time encoding and

00:07:43.640 | where it happened it's about a physical location your locale country so and so forth but also about device

00:07:49.080 | about the uh canvas or which row which page this action happened uh and then uh what basically

00:07:57.880 | is about the target entity or the title which title you interacted with what is the interaction how long

00:08:04.760 | and uh any information of that kind associated with the action so um that's where we need to decide

00:08:13.480 | what information we need to keep what we should drop so on and so forth uh moving one layer above uh the

00:08:20.440 | embedding feature transformation layer uh one thing that needs to be pointed out is that for recommendation

00:08:26.760 | we need to combine id embedding learning with other semantic content information

00:08:31.800 | if you only have id embeddings learned from scratch in the model then you have a problem with cold start

00:08:37.720 | meaning that titles the model hasn't seen during training it doesn't know how to deal with it at

00:08:43.000 | inference time so we need to have semantic content information to be complementary to those id embeddings
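
A minimal sketch of why the semantic signal helps with cold start. It assumes the two embeddings share a dimension and are combined by summation; the talk does not say how they are actually combined, and `id_table` and `semantic_table` are hypothetical lookup tables.

```python
def title_embedding(title_id, id_table, semantic_table):
    """Combine a learned id embedding with a content-derived embedding.

    A title unseen during training has no row in id_table, so its id part
    falls back to zeros and the semantic part carries the representation.
    """
    semantic = semantic_table[title_id]
    id_part = id_table.get(title_id, [0.0] * len(semantic))
    return [a + b for a, b in zip(id_part, semantic)]


id_table = {"old_title": [0.5, -0.5]}            # learned during training
semantic_table = {
    "old_title": [1.0, 2.0],                     # from content metadata
    "new_title": [3.0, 4.0],                     # available at launch
}
known = title_embedding("old_title", id_table, semantic_table)
cold = title_embedding("new_title", id_table, semantic_table)
```

The new title still gets a usable vector at inference time even though the model never learned an id embedding for it.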

00:08:52.280 | this is not a problem for llm but cold start is a very commonly encountered problem for recommendation

00:08:57.480 | systems uh transformer layer i think there's no need to talk too much about this in terms of architecture

00:09:03.880 | choices optimization so on and so forth the only thing that i want to point out is that uh we are using

00:09:10.280 | the hidden state output from this layer as our user representation one of the primary goals of

00:09:15.880 | the foundation model is to learn a good long-term user representation then uh we need to put this

00:09:22.040 | into context then things to consider are for example how stable is our user representation given

00:09:28.280 | our user profile and user interaction history keep changing how do we guarantee or maintain the stability of

00:09:33.960 | the representation and what kind of aggregation we should use you can think of broadly aggregating

00:09:39.880 | across the time dimension in terms of sequence dimension or aggregating uh across the layers you have

00:09:47.080 | multiple self-attention layers how do you aggregate that um and then lastly

00:09:51.880 | do we need to do explicit adaptation of the representation based on our downstream objective

00:09:57.240 | to fine tune it
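
The two aggregation axes mentioned above can be sketched as follows. Mean pooling over time and averaging the last few layers are assumptions chosen for simplicity; the talk does not state which aggregation is used.

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def user_representation(hidden_states, last_k_layers=2):
    """hidden_states[layer][time_step] is one hidden vector.

    First pool each of the last k layers across the time (sequence)
    dimension, then average the pooled vectors across layers.
    """
    pooled_per_layer = [
        mean_vectors(layer) for layer in hidden_states[-last_k_layers:]
    ]
    return mean_vectors(pooled_per_layer)


# Three layers, two time steps, hidden size two (toy numbers).
hidden_states = [
    [[0.0, 0.0], [2.0, 2.0]],
    [[1.0, 3.0], [3.0, 5.0]],
    [[4.0, 0.0], [6.0, 2.0]],
]
representation = user_representation(hidden_states, last_k_layers=2)
```

Averaging over more time steps and layers generally changes less when one new event arrives, which is one way to think about the stability question raised in the talk.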

00:09:58.120 | so then we move to last uh the very top layer uh objective loss function this is also very interesting

00:10:06.680 | in the sense that it's much richer than llm because you can see first instead of one

00:10:12.600 | sequence we use multiple sequences to represent the output because you can have a sequence of entity ids

00:10:19.000 | that's your like uh next token prediction softmax or sampled softmax but then we have many many other facets

00:10:27.400 | or fields of each event that can also be used as a target okay so it could be things like uh action

00:10:34.520 | type it could be some aspect of the entity's metadata like entity type genre language so on and so

00:10:40.280 | forth and also about your action like the prediction of the duration or uh the device where the action

00:10:47.000 | happened or the time when the next uh user play will happen so those are all legitimate targets or labels

00:10:55.880 | depending on your use case you can use them to do the fine tuning so you can cast the

00:11:00.840 | problem as a multi-task learning problem multi-head or hierarchical prediction but you can also use them just as

00:11:08.440 | your weights your rewards or your mask on the loss function so as to adapt the model to

00:11:14.120 | zoom into one subcategory of uh user behavior you want the model to learn okay so that's

00:11:21.720 | about the model architecture that i want to talk about um so does it scale the first question a part of the

00:11:30.680 | first hypothesis we want to answer is does the scaling law apply and i think the answer is

00:11:35.880 | yes so this is over the uh roughly two to two and a half years we were scaling up and

00:11:44.120 | we constantly still see the gain uh from only on the order of 10 million profile or a few million profile to now on

00:11:52.840 | the order of one billion uh model parameters we scaled up the data accordingly um now we stop

00:12:01.720 | here because we can still keep going but uh as you may realize recommendation systems usually have much more

00:12:09.080 | stringent latency and cost requirements so scaling up more requires us to also distill back yeah but certainly i think this is not the end of the scaling law

00:12:20.840 | uh before wrapping up the data and training discussion i would like to highlight some of the learnings i think are quite

00:12:26.840 | interesting that we borrowed from llm this is not an exhaustive list but uh i think very interesting to me

00:12:34.360 | uh the top three one is multi-token prediction you may have seen this in the DeepSeek paper so on

00:12:40.520 | and so forth so implementation wise you can use multi-head multi-label and so on uh different

00:12:46.360 | implementation flavors but the goal is really to force the model to be less

00:12:50.760 | myopic more robust to serving time shift because you have a time gap between your training and serving

00:12:56.680 | and also force the model to target long-term user satisfaction and long-term user behavior instead

00:13:02.920 | of just focusing on the next action we have observed a very notable uh metrics improvement by doing that
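
A minimal sketch of the multi-label flavor of multi-token prediction: each prefix is paired with the next `horizon` items instead of only the next one. The function names and the averaged log-likelihood loss are illustrative assumptions, not the production objective.

```python
import math


def multi_token_targets(sequence, horizon):
    """Pair each prefix with the next `horizon` items as a multi-label target."""
    return [
        (sequence[: t + 1], sequence[t + 1 : t + 1 + horizon])
        for t in range(len(sequence) - 1)
    ]


def multi_token_loss(predicted_probs, targets):
    """Average negative log-likelihood over all target items for one prefix.

    predicted_probs maps item id to the model's probability for that item;
    here it is a plain dict standing in for a softmax output.
    """
    return -sum(math.log(predicted_probs[item]) for item in targets) / len(targets)


examples = multi_token_targets(["a", "b", "c", "d"], horizon=2)
loss = multi_token_loss({"b": 0.5, "c": 0.5}, ["b", "c"])
```

Because the target covers several future events, a model trained this way is rewarded for predicting what the user does over a longer span, which matches the stated goal of being less myopic.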

00:13:11.400 | uh the second is multi-layer representation which i touched upon in the profile representation so this

00:13:17.480 | is also translated from llm a set of techniques of layer-wise supervision self-distillation or multi-layer

00:13:24.200 | output aggregation the goal here is really to make a better and more stable user representation

00:13:30.680 | lastly uh this also should be no surprise long context window handling from truncated sliding window

00:13:37.240 | to sparse attention to progressively training uh longer and longer sequences uh to eventually all of

00:13:44.440 | the parallelism strategies so this is about more efficient training and maximize the learning
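
Two of the long-context ideas listed above, sketched under assumptions: the window, stride, and the doubling schedule are illustrative choices and are not taken from the talk.

```python
def sliding_windows(sequence, window, stride):
    """Split a long history into overlapping windows of at most `window` items.

    The final window is aligned to the end of the sequence so the most
    recent events are always covered.
    """
    if len(sequence) <= window:
        return [sequence]
    last_start = len(sequence) - window
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [sequence[s : s + window] for s in starts]


def progressive_lengths(start, maximum, stages):
    """Sequence length per training stage: double each stage, capped at maximum."""
    return [min(start * 2**stage, maximum) for stage in range(stages)]


windows = sliding_windows(list(range(9)), window=4, stride=3)
schedule = progressive_lengths(start=128, maximum=1024, stages=5)
```

Training on shorter sequences first and longer ones later keeps early stages cheap, while the windows make sure very long histories still contribute training examples.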

00:13:51.640 | okay so uh shift gear to talk about the serving and applications uh before the foundation model fm

00:13:58.360 | this is roughly the algo stack we had for personalization many data many features many models

00:14:05.800 | independently developed each serving one or multiple canvases or applications as we call them

00:14:11.160 | now with the foundation model we consolidate largely the data and representation layer especially the user

00:14:19.400 | representation as well as content representation in the personalization domain the model layer as well because

00:14:27.080 | each application model now is built on top of fm so it becomes a thinner layer instead of a very

00:14:33.400 | standalone full-fledged model trained from scratch so how do the various models utilize the foundation

00:14:39.960 | model there are three main approaches or consumption patterns the first is foundation model can be

00:14:47.720 | integrated as a sub graph within the downstream model additionally the content embeddings learned from the

00:14:53.320 | foundation model can be integrated as the embedding lookup layers so the downstream model is a neural network

00:14:58.680 | it may already have initially some sequence transformer tower or graph and then we use a pre-trained

00:15:10.440 | foundation model sub graph to directly replace that uh second is that uh we can push out embeddings

00:15:17.720 | this is no surprise both content or entity embeddings as well as member embeddings

00:15:22.600 | uh the main concern here of course is how frequently we want to refresh the

00:15:29.480 | member embeddings and how we make sure they are stable uh and push them to the centralized embedding store and this of course allows far

00:15:38.840 | wider use cases than just personalization because people in analytics and data scientists can also just

00:15:45.880 | fetch those embeddings directly to do the things that they want finally users can uh extract the models and

00:15:55.000 | fine-tune it for specific applications either fine-tune or they need to do distillation to meet the online

00:16:01.240 | serving requirement especially for those with very strict latency requirement

00:16:07.080 | to wrap up uh i want to show at high level the wins we accumulated over the last one year and a half

00:16:15.480 | by incorporating fm into various places so the blue bar represents how many applications have fm incorporated the green bar

00:16:26.440 | represents the a b test wins because in any application we may have multiple a b tests going on there

00:16:32.680 | to have wins so we indeed see high leverage of fm to bring about both a b test wins as well as

00:16:40.360 | infrastructure consolidation

00:16:44.520 | uh so i think the big bets are validated uh it is a scalable solution uh in

00:16:51.880 | terms of both scaling up the model with improved quality and making the whole infra consolidated

00:16:59.080 | and scaling uh to new applications is much easier high leverage because it's a centralized learning

00:17:05.960 | innovation velocity also is faster because we allow a lot of newly launched applications to directly fine-tune

00:17:13.880 | the foundation model to launch the first experience

00:17:16.760 | so the current directions one is that um we want to have universal representation for heterogeneous

00:17:26.600 | entities this is uh as you can guess the semantic id and along those lines because we want to cover that

00:17:32.680 | as Netflix is expanding to very heterogeneous content types second is generative retrieval for

00:17:41.080 | collection recommendation right so instead of just recommending a single video be generative at

00:17:45.240 | inference time and serving time because you have multi-step decoding a lot of the considerations

00:17:51.000 | about business rules or diversity for example can be naturally handled in the decoding process
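
A minimal sketch of how a rule can be enforced inside multi-step decoding. Greedy decoding, the per-genre cap, and `score_fn` (a stand-in for the model's next-item score given what has been chosen so far) are all assumptions for illustration.

```python
def decode_collection(score_fn, candidates, genre_of, size, max_per_genre):
    """Greedily build a collection one item per step.

    At each step the candidates that would break the diversity rule
    (more than max_per_genre items of one genre) are masked out before
    picking the best-scoring remaining item.
    """
    chosen = []
    genre_counts = {}
    for _ in range(size):
        allowed = [
            item
            for item in candidates
            if item not in chosen
            and genre_counts.get(genre_of[item], 0) < max_per_genre
        ]
        if not allowed:
            break
        best = max(allowed, key=lambda item: score_fn(chosen, item))
        chosen.append(best)
        genre_counts[genre_of[best]] = genre_counts.get(genre_of[best], 0) + 1
    return chosen


base_score = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
genre_of = {"a": "comedy", "b": "comedy", "c": "comedy", "d": "action"}
collection = decode_collection(
    score_fn=lambda chosen, item: base_score[item],  # toy scorer, ignores context
    candidates=list(base_score),
    genre_of=genre_of,
    size=3,
    max_per_genre=2,
)
```

The third comedy is skipped in favor of the action title because the constraint is applied at decoding time rather than in a separate post-processing pass.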

00:17:56.520 | lastly faster adaptation through prompt tuning so this is also borrowed from llm can we just train some soft tokens

00:18:05.080 | uh so that at inference time we can directly swap in and out of those soft tokens to prompt the fm to behave differently
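
The swap-in, swap-out idea can be sketched as below. In prompt tuning as it is generally done for LLMs, only the soft token vectors are trained while the base model stays frozen; the task names and vectors here are made up, and the foundation model itself is left out.

```python
def build_input(soft_prompts, task, event_embeddings):
    """Prepend the task's learned soft tokens to the user's event embeddings.

    The event embeddings and the (frozen) model are unchanged across tasks;
    only the prefix differs.
    """
    return soft_prompts[task] + event_embeddings


# Hypothetical learned soft tokens, one small set per application.
soft_prompts = {
    "kids_home": [[1.0, 0.0]],
    "search": [[0.0, 1.0], [0.0, 1.0]],
}
events = [[0.5, 0.5], [0.2, 0.8]]

kids_input = build_input(soft_prompts, "kids_home", events)
search_input = build_input(soft_prompts, "search", events)
```

Switching behavior then costs a dictionary lookup at inference time instead of loading a separately fine-tuned model.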

00:18:12.280 | so that is also a very promising direction that we are getting into all right that concludes my talk thank you for your attention and questions

00:18:25.240 | thank you um if you have any questions may i invite you to come to the mic in front um while we get our next speakers from mr kata get set up

00:18:35.640 | uh hi thank you for the talk uh since you get billions of users so except the recommendation system

00:18:45.320 | maybe you can do much more right so what's your thought on that since i can just ask you

00:18:52.120 | to predict who's the next president in the united states thank you um so i actually don't uh could you explain

00:19:00.280 | a little bit what do you mean by beyond recommendation do you mean the other personalization or other things

00:19:05.000 | um yeah yeah since you kind of get users' preferences so actually that preference is

00:19:13.720 | also linked to what things they're buying or who they will vote for the next president so do you think

00:19:20.200 | your foundation model has that capability to expand not only recommendation what videos they want to

00:19:25.800 | look but what others they like or what's their opinions on anything else thank you yes so i think

00:19:32.280 | we are expanding to different uh i think entity type and also capture uh users taste from both on and

00:19:40.760 | off our platform i think that's a general trend that we're going to yes great thank you this was really

00:19:50.040 | helpful um question on and you might not be able to share it um for IP reasons but whatever you can

00:19:55.640 | thoughts on graph models i didn't hear a lot of that in your talk graphs and

00:20:01.880 | uh reinforcement learning any utilization there any benefits you saw any boost in performance and

00:20:07.720 | accuracy yeah that's a good question i think we have actually uh a dedicated team sub team doing graph

00:20:15.160 | model uh especially around our knowledge graph to cover the content space both on and off our platform in

00:20:22.200 | the whole entire entertainment ecosystem so we use actually a lot of embeddings for example from the graph

00:20:29.800 | model for cold start that's where those semantic embeddings i showed come from

00:20:35.960 | in terms of reinforcement learning yes as well especially where we consider sparse reward the rewards we have

00:20:42.760 | from users' actions are pretty much sparse but we want to use them to guide how for example we generate

00:20:50.040 | the whole collection that's where we need to consider how to use

00:20:54.120 | those rewards to guide that uh process yeah i think one more question i'm sorry can i ask a two-part question

00:21:02.600 | sure i would be here and so we can also follow up yeah do you also use these unified representations as

00:21:09.400 | embedding features to downstream models you had a slide how you use the unified model yeah so uh the

00:21:18.280 | so for the embeddings learned within our model we also expose them downstream to consume

00:21:25.400 | uh we are also able to train our unified embedding we also have some upstream like for example

00:21:33.320 | the gnn embeddings those are also consumed to do that last one is it fast yes

00:21:41.480 | uh hello uh in your embeddings are you just using when someone does an action or sorry for the in these

00:21:51.560 | embeddings you're just using metadata over the video to understand what they like or are you actually using

00:21:56.200 | like frame by frame of the video or second clips uh not yet we do have that from some other content group of our

00:22:06.040 | organization but i think the trend will go there so we are not yet uh into very granular sub

00:22:12.360 | like clips level or view level we have those embeddings but not quite yet to incorporate yeah thank you

00:22:18.680 | thank you yesu please another round of applause for yesu