
360Brew: LLM-based Personalized Ranking and Recommendation - Hamed and Maziar, LinkedIn AI



00:00:00.000 | Hi, everyone. Very excited to be here. I'm Hamed, this is Maziar, and today we're
00:00:21.720 | going to talk about our journey in leveraging large language models for personalization
00:00:26.840 | and ranking, and our path to productionizing such a large model for LinkedIn use cases.
00:00:39.800 | Recommendation, ranking, and personalization are deeply integrated into our daily lives. When
00:00:45.060 | you go to a feed to read an article, when you're looking for a job, when you're searching for
00:00:49.360 | something or buying something online, a backend powered by a recommendation system
00:00:55.680 | tries to find the best content or best entity based on your interests and relevance to you.
00:01:05.600 | However, these systems usually suffer from some challenges. In particular, they are trained
00:01:16.800 | on a specific task, so they are optimized in isolation. They usually don't leverage
00:01:24.720 | the most advanced architectures, and they are rolled out one by one, which is very time consuming
00:01:29.260 | and unproductive. So the question that we are asking is: what if you had only one model
00:01:37.680 | to solve all the tasks at the same time? So the mission that we started was to build a large
00:01:46.320 | foundation model, based on large language models, that has a holistic understanding of the
00:01:53.600 | user journey on the LinkedIn platform, and can solve all the personalization tasks that LinkedIn has with just one model.
00:02:01.600 | And in addition to that, we wanted this model to have three other main characteristics. One, we want this model to have zero-shot
00:02:09.520 | capability, so that when you have a new problem or new surface, instead of basically collecting the
00:02:13.520 | data, building a new ranking model and putting it into production, which is a very time consuming journey,
00:02:23.440 | you can basically leverage this model out of the box to solve your task. You just basically prompt the model and tell the model that this is the task that I want to solve.
00:02:31.440 | This is the kind of recommendation. This is the entity. This is the user. And what do you think about the relevance between these two entities?
00:02:37.440 | The second characteristic that we want this model to have is to leverage in-context learning as much as possible, so that for the cold-start user problem, for example, we can leverage this model by just giving very few examples, or just by explaining what the user might be interested in, and the model can solve that problem for cold-start users.
00:02:57.360 | And the last one is instruction following. We want to give our users and members the ability to tell the model what they're interested in. Imagine that the next time you go to the LinkedIn feed, you can tell the model: these are my niche interests, and these are the topics that I'm interested in exploring. And the model, basically the recommendation system, will start finding the relevant information for you and recommending it to you.
00:03:25.360 | Now, Maziar will talk about how we build this model, and then I will talk about how we serve this model.
00:03:32.360 | Okay. So let me talk a little bit about the brewing part, the building of the model. So in order to make use of the LLMs, which is what I think most of you are here for, we need to convert all the information we have about the users, their interactions, and everything that they add into a prompt.
00:03:51.360 | And this is what we call the magic of promptification. So we take all the information we have about the user history and their profiles and a lot of interactions that they have had, and we turn it into a prompt, something like the one on the right-hand side here.
00:04:04.360 | So there's, as you can see, there's an instruction for the model to follow, for example, what we want the model to do in this case, so that we can actually generalize all the different instructions.
00:04:13.360 | We give some information about the member profile, and we have some past interactions, for example, that they have had with items that we have already shown to them.
00:04:21.360 | And then the question comes in: what do you think the user is going to do with this new piece of information or this new item that we are showing to them?
00:04:30.360 | So that's basically how we formalize the problem in order to feed it into an LLM.
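As a rough illustration of the promptification idea described above, here is a minimal sketch in Python. The field names, template wording, and the yes/no question format are hypothetical placeholders, not LinkedIn's actual prompt format.

```python
# Hypothetical sketch of "promptification": turning a member profile, their
# interaction history, and a candidate item into a single prompt string.
# Field names and the template are illustrative, not the production format.

def promptify(profile: dict, history: list[dict], candidate: dict) -> str:
    instruction = (
        "You are a job recommendation assistant. Given the member profile, "
        "their past interactions, and a new job posting, answer whether the "
        "member is likely to apply. Answer with 'Yes' or 'No'."
    )
    profile_block = f"Member profile: {profile['title']} with skills {', '.join(profile['skills'])}."
    history_block = "\n".join(f"- {h['action']}: {h['item_title']}" for h in history)
    question = f"New job posting: {candidate['title']}. Will the member apply?"
    return "\n\n".join([instruction, profile_block, "Past interactions:", history_block, question])


prompt = promptify(
    profile={"title": "Data Scientist", "skills": ["Python", "SQL"]},
    history=[
        {"action": "applied to", "item_title": "ML Engineer at Acme"},
        {"action": "skipped", "item_title": "Sales Lead at Foo Corp"},
    ],
    candidate={"title": "Senior Data Scientist at Bar Inc"},
)
print(prompt)
```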
00:04:34.360 | So obviously, I mean, if you take one of the LLMs out of the box and try to solve this problem with it, it's going to work a little bit, but it's not going to be perfect.
00:04:43.360 | So in order to do that, we have to train the model.
00:04:45.360 | So this is actually the pipeline that we have for developing the model and making it productionized.
00:04:50.360 | So as you can see, the left-hand side, we start with the open source model.
00:04:54.360 | Then we do some upcycling magic so that we can actually control the size of the model and trade off throughput versus the quality of the model.
00:05:03.360 | And then we have like a few blocks of training, continuous pre-training, fine-tuning, and instruction fine-tuning, and also alignment.
00:05:12.360 | And at this point, we have this large model, which we call Brew XL, which you can think of as a large model with 150 billion parameters that does really, really well.
00:05:22.360 | And we have maximized the quality, but obviously, this model cannot be served online because, as you know, recommendation systems are very, very compute-hungry.
00:05:33.360 | So from here, we go all the way down to try to distill the model, so maximize the efficiency, and we're going to talk a little bit about that.
00:05:39.360 | But basically, we go all the way down to, let's say, 3B model, which is actually something that can be productionized.
00:05:45.360 | But as you can see, there are so many different boxes here.
00:05:48.360 | And in order to make sure that the development cycle is actually smooth, we had to do a lot of automation.
00:05:54.360 | So one of the key lessons from here is to build a lot of automation into these pipelines, to turn the fact that making these models is very complicated into a much easier and more manageable situation.
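To make the automation point concrete, here is a toy sketch of a stage-by-stage pipeline runner. The stage names mirror the boxes mentioned in the talk, but the runner, its configuration, and the lambdas standing in for real training jobs are all hypothetical.

```python
# Toy sketch of automating the multi-stage training pipeline described above.
# The stage names mirror the talk; the runner and the stand-in stage
# functions are hypothetical.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    run: Callable[[dict], dict]  # takes the current pipeline state, returns the next one


def run_pipeline(stages: list[Stage], initial: dict) -> dict:
    state = initial
    for stage in stages:
        print(f"[pipeline] running {stage.name}")
        state = stage.run(state)  # each stage would also log metrics, push results, etc.
    return state


stages = [
    Stage("upcycle", lambda s: {**s, "size": "XL"}),
    Stage("continuous_pretrain", lambda s: {**s, "pretrained": True}),
    Stage("instruction_finetune", lambda s: {**s, "instruction_tuned": True}),
    Stage("alignment", lambda s: {**s, "aligned": True}),
    Stage("distill_to_3B", lambda s: {**s, "size": "3B"}),
]
final_state = run_pipeline(stages, {"base_model": "open-source-checkpoint"})
print(final_state)
```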
00:06:09.360 | One big question that might actually come up here is that why do you actually need the XL model?
00:06:13.360 | And in fact, we did a lot of experimentation to see if you can actually get away with not having the XL model.
00:06:19.360 | Unfortunately, that's not actually the case. You have to first go big and then go small.
00:06:22.360 | If you do try to train the model from scratch with a small model, it doesn't actually work that well.
00:06:28.360 | So in this case, we did this and we showed that the distillation is actually something that is very important for the smaller models.
00:06:34.360 | But now, let me tell you a little bit about the levers that you can use in order to improve these models over time.
00:06:40.360 | This is actually something that's very important.
00:06:42.360 | I mean, if you look at all the literature, there is a lot about scaling laws, how these models actually scale with data, with compute, and with this and that.
00:06:50.360 | So in this case, we have three different levers that I'm going to talk about.
00:06:53.360 | The first one is obviously the data scaling.
00:06:55.360 | So what if we have actually more and more data?
00:06:57.360 | This is something that comes up a lot in the recommendation systems.
00:07:01.360 | We actually have a lot of data depending on how much you actually log about the user behavior.
00:07:06.360 | You might have a lot of data that goes back to six months, one year, or whatever.
00:07:11.360 | And in this graph, as you see, as we increase the amount of data, the performance of the model actually improves.
00:07:16.360 | And we hope that we can actually improve the model even further by feeding more and more data into it.
00:07:25.360 | Another lever that you can actually pull in order to improve the quality of the model, especially the XL model, is to increase the size of the model.
00:07:33.360 | And in this experiment, we actually did this experiment over mixture-of-experts architectures.
00:07:38.360 | You can see that if you go from 7B to 8x22B, the performance of the model actually increases and improves.
00:07:45.360 | And finally, I think one of the take-home messages from here would be that the context length actually matters a lot for these kinds of applications with recommendation systems.
00:07:56.360 | And the context length actually defines how much history from the user you can actually give to the model.
00:08:01.360 | So in this experiment, we actually show that if you increase the context length by feeding more history from the user to the model,
00:08:08.360 | you can actually improve the performance of the model.
00:08:14.360 | As you can see, towards the end of this graph, the performance actually drops.
00:08:18.360 | We don't believe that this is because of the fact that the context is actually less informative.
00:08:22.360 | The problem is that the models, at least the model that we were using in this experiment, don't generalize that well to longer contexts.
00:08:28.360 | Actually, if you look at most of the literature, it says that the performance of the model drops if you go beyond a certain context length.
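A minimal sketch of the kind of history-selection step implied here: pack as much recent member history as fits into a fixed context budget. The whitespace token count and the interaction strings are illustrative stand-ins; a real system would use the model's tokenizer.

```python
# Sketch: fit as much recent member history as possible into a fixed context
# budget, keeping chronological order. Token counting here is a crude
# whitespace proxy; a real system would use the model's tokenizer.

def select_history(interactions: list[str], budget_tokens: int) -> list[str]:
    selected: list[str] = []
    used = 0
    # Walk from most recent to oldest, then restore chronological order.
    for line in reversed(interactions):
        cost = len(line.split())
        if used + cost > budget_tokens:
            break
        selected.append(line)
        used += cost
    return list(reversed(selected))


history = [f"viewed job posting #{i}" for i in range(1, 1001)]  # oldest -> newest
print(len(select_history(history, budget_tokens=2000)))
```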
00:08:38.360 | Actually, I have to give it back to you.
00:08:41.360 | Okay.
00:08:42.360 | Okay.
00:08:43.360 | Let's talk a little bit about the results and see if we can actually deliver on some of the promises that we had.
00:08:51.360 | So one of the things that we promised was that we can actually improve the performance of the system on cold-start users.
00:09:01.360 | In this case, we actually show the gap between our model and the production models on the users that have few interactions.
00:09:11.360 | Like, for example, less than five interactions, less than 100 interactions, and so on.
00:09:15.360 | And as you can see, the gap between the 360Brew model and the production model actually grows as the number of interactions decreases.
00:09:25.360 | So this actually shows you that the world knowledge that the model brings into these systems improves the quality of its predictions.
00:09:37.360 | Finally, we were promising to give you some generalization to new domains, meaning problems that the model has never seen during its training.
00:09:49.360 | So in this graph, these are four different tasks, and these tasks are completely out of domain; the model has seen no information about those surfaces during training.
00:10:00.360 | But as you can see, it can actually be on par or even beat some of the production models.
00:10:08.360 | And just to say, these production models are specific for that specific task.
00:10:12.360 | So they have been trained on that task.
00:10:14.360 | So this is not actually a small feat.
00:10:16.360 | So it's actually something that's significant.
00:10:18.360 | So as you can see, this gives the people who are developing these platforms the ability to roll out features and surfaces much more quickly, because they can actually use these models to do recommendation for them.
00:10:34.360 | And now I give it back to Hamed to talk about serving.
00:10:37.360 | So let me walk you through how we can productionize such a large model in an environment that requires very high QPS and low latency.
00:10:46.360 | Many recommendation systems have tens of thousands of QPS, and they also require sub-second latency, like 400 or 500 milliseconds at best.
00:11:00.360 | So there are three levers that we can pull in order to make the model more efficient and improve the throughput and reduce the latency for these models:
00:11:09.360 | sparsification, going to smaller models, and quantization.
00:11:18.360 | As Maziar explained before, smaller models definitely have better throughput, but our recipe is that we need to go big and then go small.
00:11:28.360 | If you go with a smaller model initially, it doesn't have enough capacity, it doesn't have enough reasoning power to solve the complicated task that we have.
00:11:37.360 | So we go with a larger model, and then we start with this 150 billion parameter model, and then we start distilling it into smaller models.
00:11:44.360 | And one of the recipes here is that we need to do the distillation step by step, and that means that we go with, for example, an 8B model, then a 3B model, and then a 1B model.
00:11:55.360 | So we slowly decrease the size of the model, and we distill over and over from the previous model.
00:12:04.360 | And that recipe turns out to be much more effective than directly going from the 150 billion parameter model to a 1B parameter model.
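For readers who want to see what one link of such a distillation chain might look like, here is a generic knowledge-distillation loss sketch in PyTorch. The temperature, loss weighting, and random tensors are illustrative; this is not the exact objective used in the talk.

```python
# Minimal knowledge-distillation loss sketch: the student matches the teacher's
# softened output distribution plus the ground-truth labels. A chain like
# XL -> 8B -> 3B -> 1B would apply this repeatedly, with the previous student
# becoming the next teacher. Temperature and weighting are illustrative.

import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the (frozen) teacher, softened by temperature T.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy on the hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss


student_logits = torch.randn(4, 32000, requires_grad=True)  # batch x vocab
teacher_logits = torch.randn(4, 32000)
labels = torch.randint(0, 32000, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```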
00:12:14.360 | Same thing for pruning.
00:12:15.360 | So pruning is a mathematical optimization problem.
00:12:20.360 | You want to either reduce the number of heads in the transformers,
00:12:26.360 | or you can reduce the number of MLPs.
00:12:28.360 | Overall, these transformer models have proven to be very, very redundant in terms of keeping the information.
00:12:34.360 | So we can start pruning and removing some of these layers, or reduce the precision for each of the activations and parameters.
00:12:45.360 | However, again, if you do the pruning very aggressively at the beginning, your performance would significantly suffer.
00:12:53.360 | So the recipe here is also to do gradual pruning.
00:12:58.360 | What we do is that we start with very light pruning of the model.
00:13:03.360 | We distill to the smaller model, and we do it over and over again.
00:13:08.360 | More pruning, more distillation, more pruning, more distillation.
00:13:11.360 | And as you can see from this plot, doing gradual pruning can result in essentially no information loss,
00:13:22.360 | whereas if you just do aggressive pruning at the beginning, you can have up to a 1% reduction in model quality.
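Here is a high-level sketch of that gradual prune-then-distill schedule, using PyTorch's built-in magnitude pruning on a toy network. The `distill_step` function is a hypothetical placeholder for the recovery training described above.

```python
# High-level sketch of the gradual schedule described above: prune a little,
# then recover with a distillation pass, and repeat, instead of pruning
# aggressively in one shot. Pruning here uses PyTorch's built-in magnitude
# pruning on a toy network; `distill_step` is a hypothetical placeholder.

import torch.nn as nn
import torch.nn.utils.prune as prune


def distill_step(student: nn.Module, teacher: nn.Module) -> None:
    # Placeholder: in a real pipeline this would run distillation training
    # (see the loss sketch above) to recover quality after each pruning round.
    pass


def gradual_prune(student: nn.Module, teacher: nn.Module,
                  rounds: int = 5, amount_per_round: float = 0.1) -> nn.Module:
    for _ in range(rounds):
        for module in student.modules():
            if isinstance(module, nn.Linear):
                # Prune a fraction of the currently remaining weights by magnitude.
                prune.l1_unstructured(module, name="weight", amount=amount_per_round)
        distill_step(student, teacher)  # recover quality after each small prune
    return student


student = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
teacher = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
gradual_prune(student, teacher)
```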
00:13:28.360 | Another lever is quantization.
00:13:33.360 | We go to lower precision.
00:13:35.360 | We are leveraging FP8 for activations and model parameters.
00:13:39.360 | However, doing just FP8 in all the layers hurts the performance and the quality of the model significantly.
00:13:47.360 | So basically the tool here is to do mixed precision.
00:13:50.360 | And one of the important aspects when it comes to ranking and recommendation and prediction tasks overall is that you want the prediction, the probability output of the model, to have very good precision.
00:14:02.360 | So the LM head at the end of the language model has to be in FP32.
00:14:08.360 | If you do it in FP16, BF16 or FP8, what happens is that the numbers collapse.
00:14:14.360 | You don't have very good calibration on top of that, and you cannot distinguish between the different recommended items.
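A small sketch of the mixed-precision idea: build a per-module precision plan that keeps the LM head in FP32 while everything else goes to FP8. The plan format and the toy model are hypothetical; actual FP8 execution would be handled by the serving stack's quantization machinery.

```python
# Sketch of a per-module precision plan: most linear layers go to FP8, but the
# LM head stays in FP32 so the output probabilities remain well calibrated.
# The plan format and toy model are hypothetical; real FP8 kernels come from
# the inference engine, not from this dict.

import torch.nn as nn


class TinyLM(nn.Module):
    def __init__(self, d=64, vocab=1000):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.lm_head = nn.Linear(d, vocab)


def precision_plan(model: nn.Module) -> dict[str, str]:
    plan = {}
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        # Keep the final projection in full precision; everything else goes to FP8.
        plan[name] = "fp32" if name == "lm_head" else "fp8"
    return plan


print(precision_plan(TinyLM()))
# {'backbone.0': 'fp8', 'backbone.2': 'fp8', 'lm_head': 'fp32'}
```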
00:14:21.360 | Last part is sparsification.
00:14:24.360 | We can basically sparsify the attention.
00:14:27.360 | The most expensive part of the transformer is computing the attention scores.
00:14:31.360 | And we can leverage sparsification.
00:14:33.360 | Not every item needs to attend to every item.
00:14:36.360 | And when you know your task, when you know this is a recommendation and these are the items that you want in the history, you can sparsify and not have every item attend to every other item.
00:14:45.360 | And same goes with when you are recommending the items.
00:14:48.360 | Instead of recommending one item, you can recommend 50 items, 500 items at the same time.
00:14:53.360 | But you want to make sure that these items are not attending to each other.
00:14:57.360 | So you sparsify the attention scores for the output and for the query.
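To make the sparsification concrete, here is a sketch of the kind of attention mask being described: candidate items attend to the member history and profile prefix (and causally to themselves), but not to the other candidates. Shapes and the boolean convention (True means "may attend") are illustrative.

```python
# Sketch of the sparsified attention described above: when scoring many
# candidate items in one pass, each candidate's tokens may attend to the
# member history/profile prefix but not to the other candidates.

import torch


def multi_item_mask(prefix_len: int, item_lens: list[int]) -> torch.Tensor:
    total = prefix_len + sum(item_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # The prefix (history + profile) uses ordinary causal attention.
    mask[:prefix_len, :prefix_len] = torch.tril(torch.ones(prefix_len, prefix_len)).bool()
    start = prefix_len
    for length in item_lens:
        end = start + length
        # Each candidate attends causally within itself...
        mask[start:end, start:end] = torch.tril(torch.ones(length, length)).bool()
        # ...and to the whole prefix, but not to other candidates.
        mask[start:end, :prefix_len] = True
        start = end
    return mask  # expand to [batch, heads, q_len, kv_len] for a "4D" mask


m = multi_item_mask(prefix_len=5, item_lens=[3, 3])
print(m.int())
```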
00:15:08.360 | If you put everything together, we can see that we get a significant reduction in the latency.
00:15:15.360 | What we have done is that across four or five of our releases, one release after the other, we were able to reduce the latency by 7x,
00:15:24.360 | and at the same time increase the throughput, which is basically the number of queries that we can handle with one GPU, by 30x.
00:15:31.360 | So we are improving the amount of work that the GPU is doing,
00:15:35.360 | and at the same time, we are reducing the latency that each query sees.
00:15:42.360 | These are some of the technical reports and papers that we published during our journey to share our lessons learned with the community.
00:15:57.360 | And that's the end of our talk.
00:15:59.360 | So we have some time also to answer some questions.
00:16:02.360 | Thank you.
00:16:03.360 | Please walk up to the microphones if you want to ask a question.
00:16:10.360 | Yeah.
00:16:11.360 | Thank you.
00:16:12.360 | Great talk.
00:16:13.360 | One question.
00:16:14.360 | How did you measure that it doesn't lose generalization power?
00:16:15.360 | Obviously, you've done a lot of fine tuning.
00:16:17.360 | And you mentioned it works for four or five tasks instead of task-specific models.
00:16:21.360 | How do you know it's going to work for the next five tasks?
00:16:24.360 | That's a good question.
00:16:25.360 | So we have a lot of -- I mean, the answer overall is having a very comprehensive benchmark set.
00:16:30.360 | We have something around 50 to 60 benchmarks.
00:16:33.360 | Some of them are internal.
00:16:34.360 | Some of them are external.
00:16:35.360 | For example, we leverage IFEval to make sure that the model still follows instructions very well.
00:16:40.360 | And as Maziar mentioned, some of the tasks were never part of our training data.
00:16:46.360 | And that's how we are measuring the generalization to new domains within LinkedIn use cases, for example.
00:16:52.360 | There are also external benchmarks that we can use in the same way.
00:16:53.360 | And that's how we do that.
00:16:54.360 | Thank you.
00:16:55.360 | Thanks for the talk.
00:16:56.360 | I'm wondering what a small listing website can use out of the box.
00:17:02.360 | Have you heard of NLWeb, which was launched recently by Microsoft?
00:17:05.360 | If yes, what are your views on that as a recommendation system?
00:17:09.360 | NLWeb.
00:17:10.360 | No, I haven't actually heard of it.
00:17:12.360 | Okay.
00:17:13.360 | Okay.
00:17:14.360 | Anything for smaller ones? Let's say a real-estate listing website has, like, thousands of real-estate listings.
00:17:21.360 | What are the out-of-the-box recommendation models that people can start using?
00:17:26.360 | I mean, that's -- I wish that such a model would exist.
00:17:31.360 | I don't really -- I mean, that's why I think we started this work.
00:17:35.360 | We were trying to see if we can actually make it a foundation model so that you can actually solve those kinds of problems.
00:17:40.360 | I think there's a lot of potential for this to be able to serve a lot of the use cases that are beyond the bigger companies.
00:17:47.360 | But definitely, I don't know any of them.
00:17:49.360 | I think you should check out NLWeb once.
00:17:51.360 | Okay.
00:17:52.360 | I'll look at that.
00:17:53.360 | Thanks.
00:17:54.360 | Thank you for the great talk.
00:17:57.360 | On the slide, you mentioned you have multi-item scoring.
00:18:02.360 | I'm curious, like, what does it effectively mean?
00:18:05.360 | Does it mean that you need to do multi-step decoding or it's just one step or just processing the logits for multiple items?
00:18:12.360 | What does it --
00:18:13.360 | It's not multi-step.
00:18:14.360 | We didn't want to go into the complication of, for example, speculative decoding, or the decoding aspect in general.
00:18:22.360 | We wanted to have everything at the pre-fill.
00:18:24.360 | Okay.
00:18:25.360 | So what we did was basically all the items are being sequenced.
00:18:27.360 | All the recommended items or potential candidates are sequenced together.
00:18:31.360 | Mm-hmm.
00:18:32.360 | But we also wanted to avoid them to attend to each other.
00:18:36.360 | Mm-hmm.
00:18:37.360 | So we leveraged what we call a 4D attention mask.
00:18:41.360 | And we actually developed a special kernel in SGLang and vLLM to be able to do that.
00:18:48.360 | And now when you have up to 500 items in your query segment, those items don't attend to each other.
00:18:55.360 | They only attend to the historical user and user profile information.
00:19:00.360 | Okay.
00:19:01.360 | Thank you.
00:19:03.360 | Great talk.
00:19:04.360 | So a user history means many things, right?
00:19:07.360 | So like there is all of the jobs that they've applied to are in the job postings.
00:19:11.360 | There are so many entities and so on.
00:19:13.360 | The context of the model can get quite large.
00:19:17.360 | How did you manage that?
00:19:18.360 | Did you compress it or were there parts that you focused on?
00:19:22.360 | Yeah.
00:19:23.360 | So we actually experimented with a lot of things.
00:19:25.360 | We experimented with a RAG system, so that basically when we have a query, we try to figure out the closest items in your history and bring them up.
00:19:34.360 | We also experimented with chronological order and some variations on top of the chronological order.
00:19:41.360 | It turns out that for the majority of applications that we have, chronological order is actually good enough.
00:19:46.360 | And that kind of makes sense because recommendation systems are very biased to the freshness.
00:19:50.360 | So the more recent user activity helps.
00:19:53.360 | One of the biggest challenge is actually the, this is more, now become more like a traditional LM problem.
00:20:00.360 | How do you balance the distribution of your positive and negative within the context?
00:20:04.360 | And I think that's become something that more like an ML engineering effort to figure out, okay, do I want more positive or negative?
00:20:11.360 | Like how much information I need to put in the context?
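As a toy illustration of that balancing question, here is a hypothetical sketch that samples positive and negative interactions into the context at a chosen ratio; the ratio and selection strategy are illustrative, not the production recipe.

```python
# Hypothetical sketch of the balancing question raised here: choosing how many
# positive vs. negative past interactions to include in the prompt context.

import random


def sample_context(positives: list[str], negatives: list[str],
                   n_examples: int, pos_ratio: float = 0.5,
                   seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    n_pos = min(len(positives), int(n_examples * pos_ratio))
    n_neg = min(len(negatives), n_examples - n_pos)
    chosen = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(chosen)  # avoid grouping all positives together in the prompt
    return chosen


ctx = sample_context(
    positives=[f"applied to job {i}" for i in range(20)],
    negatives=[f"dismissed job {i}" for i in range(200)],
    n_examples=10, pos_ratio=0.3,
)
print(ctx)
```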
00:20:14.360 | Yeah.
00:20:15.360 | I can add one more thing to this.
00:20:17.360 | Sure.
00:20:18.360 | There is also another complication when you go to the serving of these models.
00:20:21.360 | You don't want to break the KV caching or something that you're using in the serving.
00:20:25.360 | So it's going to be a little bit more complicated, more cumbersome to do something that's smarter than just putting the chronological order.
00:20:31.360 | So that's something that needs to be designed.
00:20:33.360 | So it's not something that's very obvious.
00:20:35.360 | Absolutely.
00:20:36.360 | One more question.
00:20:37.360 | You guys did so many experiments, tried out so many things.
00:20:40.360 | How's your entire system set up?
00:20:41.360 | Because I'm assuming that you say quantization, but you must have tried different forms of quantization, whatnot.
00:20:47.360 | How do you set up the system in such a way that you can try out multiple experiments and see what works best?
00:20:53.360 | Can you talk a bit about that?
00:20:55.360 | Yeah, so I'll just touch a bit on that one.
00:20:57.360 | I think the one thing that we held a very high bar for was automation.
00:21:02.360 | So our system is very automated, to the extent that when you're running an experiment, the results of the experiment are pushed automatically into an Excel sheet.
00:21:11.360 | And when you have such an automated system, the developers become very efficient. Say I just want to try a different quantization:
00:21:18.360 | you just change the quantization parameters, and everything else happens end to end with the click of a button.
00:21:24.360 | So I think automation is the key if you want to basically really optimize for these models.
00:21:30.360 | So did you build all of that automation in-house or did you?
00:21:34.360 | Most of them.
00:21:35.360 | Well, we leveraged, for example, Lightning, vLLM, SGLang.
00:21:37.360 | We leveraged a lot of open source tools, but we made sure that they are integrated very well with each other and optimized the entire flow.
00:21:47.360 | Okay, thank you.
00:21:48.360 | Thank you.
00:21:49.360 | Thank you again, Hamed and Maziar.
00:21:51.360 | Thank you.