Thank you. Hi, everyone. Thank you for joining us today for the inaugural RecSys track at the AI Engineer World's Fair. Today, what I want to share is what the future might look like when we try to merge recommendation systems and language models.
So my wife looked at my slides and said they're so plain, so I'll be giving the talk together with Latte and Mochi. You might have seen Mochi wandering the halls somewhere, and there'll be a lot of doggos throughout these slides. I hope you enjoy it. First, language modeling techniques are not new in recommendation systems.
It started with word2vec in 2013, where we learned item embeddings from co-occurrences in user interaction sequences. After that, we started using GRUs. Who here remembers recurrent neural networks, gated recurrent units? Yeah. Those were very short-term: we predicted the next item from a short sequence.
Then, of course, transformers and attention came about, and we became better at attending to long-range dependencies. That's where we started asking, hey, can we just process everything in the user sequence, hundreds or even 2,000 item IDs long, and try to learn from that? Today in this track, I want to share three ideas that I think are worth thinking about: semantic IDs, data augmentation, and unified models.
So the first challenge we have is hash-based item IDs. Who here works on recommendation systems? Then you probably know that hash-based item IDs don't encode the content of the item itself. The problem is that every time you have a new item, you suffer from the cold-start problem: you have to relearn about this item all over again. And then there's sparsity, whereby you have a long tail of items with maybe one to ten interactions each, which is just not enough to learn from.
So recommendation systems end up very popularity-biased, and they struggle with cold start and sparsity. The solution is semantic IDs, which may even involve multimodal content. Here's an example of trainable multimodal semantic IDs from Kuaishou. Kuaishou is kind of like TikTok or Xiaohongshu.
It's a short-video platform in China, I think the number two short-video platform. You might have used their text-to-video model, Kling, which they released sometime last year. The problem they had, being a short-video platform, is that users upload hundreds of millions of short videos every day.
And it's really hard to learn from these new short videos. So how can we combine static content embeddings with dynamic user behavior? Here's how they did it, with trainable multimodal semantic IDs. I'm going to go through each step here. This is the Kuaishou model; it's a standard two-tower network.
On the left, this is the embedding layer for the user, which is a standard sequence of IDs and the user ID. And on the right is the embedding layer for the item IDs. So these are fairly standard. But what's new here is that they now take in content input.
So all of these slides will be available online; don't worry, I'll make them available right after this. To encode visuals, they use ResNet. To encode video descriptions, they use BERT. And to encode audio, they use VGGish. Now here's the trick: when you have these encoder models, it's very hard to backpropagate through them and keep updating their embeddings.
So what did they do? Well, firstly, they took all these content embeddings and then they just concatenated them together. I know it sounds crazy, right? But they just concatenated them together. Then they learned cluster IDs. So I think they shared in the paper, they had like 100 million short videos.
And via k-means clustering, they learned 1,000 cluster IDs. That's what you see in the model encoder, in the boxes at the bottom: the cluster IDs. Above them, you have the non-trainable content embeddings; below, you have the trainable cluster IDs, which are each mapped to their own embedding table.
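To make this concrete, here's a minimal sketch of the idea, not Kuaishou's actual code; the shapes, counts, and variable names are all made up and scaled way down:

```python
# Minimal sketch of trainable multimodal semantic IDs (illustrative only;
# shapes and counts are scaled down, and all names are assumptions).
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

n_videos = 2_000                          # stand-in for ~100M short videos
visual = np.random.randn(n_videos, 64)    # e.g. ResNet visual embeddings (frozen)
text   = np.random.randn(n_videos, 96)    # e.g. BERT description embeddings (frozen)
audio  = np.random.randn(n_videos, 16)    # e.g. VGGish audio embeddings (frozen)

# 1) Concatenate the frozen content embeddings.
content = np.concatenate([visual, text, audio], axis=1)

# 2) Learn cluster IDs with k-means (the talk mentions ~1,000 clusters; 50 here).
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(content)
cluster_ids = torch.as_tensor(kmeans.labels_, dtype=torch.long)

# 3) Map each cluster ID to a *trainable* embedding used in the item tower,
#    so the content space gets tied to the behavioral space during training.
semantic_id_emb = nn.Embedding(num_embeddings=50, embedding_dim=32)
item_semantic_vec = semantic_id_emb(cluster_ids[:8])   # a batch of 8 items
print(item_semantic_vec.shape)   # torch.Size([8, 32])
```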
So the trick is this: as you train the model, the model encoder learns to map the content space, via the cluster IDs and their embedding table, into the behavioral space. And the output is this: these semantic IDs not only outperform regular hash-based IDs on clicks and likes, right?
Like, that's pretty standard. But they were also able to increase cold-start coverage, which is, of 100 videos shared, how many of them are new, by 3.6%. And they increased cold-start velocity, which is, how many new videos were able to hit some threshold of views?
They did not share what that threshold was, but being able to increase cold-start coverage and cold-start velocity by these numbers is pretty outstanding. Long story short, the benefit of semantic IDs is that you can address cold start with the semantic ID itself, and now your recommendations understand content. Later in the talk, we're going to see some amazing sharing from Pinterest and YouTube.
And in the YouTube one, you'll see how they blend language models with semantic IDs, whereby the model can actually explain why you might like an item, because it understands the semantic ID, and it can give human-readable explanations, and vice versa. Now, the next question, and I'm sure everyone here has this challenge.
The lifeblood of machine learning is data: good quality data at scale. This is essential for recommendation systems, and for search even more so. You need a lot of metadata, query expansion, synonyms, spell checking, all sorts of metadata to attach to your search index.
But this is very costly and high effort to get. In the past, we used to do it with human annotations, or maybe you can try to do it automatically, but LLMs have been outstanding at this. And I'm sure everyone here is sort of doing this to some extent, using LLMs for synthetic data and labels.
But I want to share two examples with you, from Spotify and Indeed. Now, the Indeed paper, I really like it a lot. The problem they faced is that they were sending job recommendations to users via email, but some of these job recommendations were bad.
They were just not a good fit for the user. So users had a poor experience and lost trust in the job recommendations. And how would they indicate that they lost trust? These job recommendations are not a good fit for me; I'm just going to unsubscribe.
Now, the moment a user unsubscribes from your feed or your newsletter, it's very, very hard to get them back. Almost impossible. So, while they had explicit negative feedback, thumbs up and thumbs down, it was very sparse. How often do you actually give thumbs-down feedback? Very rarely.
And implicit feedback is often imprecise. What do I mean? If you get some recommendations, but you actually don't act on it, is it because you didn't like it? Or is it because it's not the right time? Or maybe your wife works there and you don't want to work in the same company as your wife?
So, the solution they had was a lightweight classifier to filter bad recommendations. And I'll tell you why I really like this paper from Indeed: they didn't just share their successes, they shared the entire process of how they got there, and it was fraught with challenges.
Of course, the first thing that makes me really like it is that they started with evals. They had their experts label job recommendation and user pairs, where for each user you have their resume data and their activity data, and the experts judged whether the recommendation was a good fit.
Then, they prompted open LLMs, Mistral and Llama 2. Unfortunately, their performance was very poor. These models couldn't really pay attention to what was in the resume and what was in the job description, even though they had sufficient context length, and the output was just very generic. So they prompted GPT-4, and GPT-4 worked really well.
Specifically, GPT-4 had about 90% precision and recall. However, it was very costly, and they didn't share the actual cost, and it was too slow: 22 seconds. Okay, if GPT-4 is too slow, what can we do? Let's try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision. What does this mean?
Of the recommendations it said were bad, only 63% were actually bad. In other words, they would be throwing out 37%, about one-third, of recommendations that were actually good. And for a company that runs on recommendations, where people find jobs through your recommendations, throwing out one-third of the good ones was not acceptable.
Precision was their key metric here. So what they did next was fine-tune GPT-3.5. You can see the entire journey, right? Open models, GPT-4, GPT-3.5, and now fine-tuning GPT-3.5. The fine-tuned GPT-3.5 got the precision they wanted, at about a quarter of GPT-4's cost and latency.
But unfortunately, it was still too slow, about 6.7 seconds, and that would not work in an online filtering system. So what they did was distill a lightweight classifier on the fine-tuned GPT-3.5's labels. And this lightweight classifier was able to achieve very high performance, specifically 0.86 AUC-ROC.
The numbers may not mean much to you, but suffice to say that in an industrial setting, this is pretty good. They didn't mention the exact latency, but it was good enough for real-time filtering, I think under 200 milliseconds or so. So the outcome was that they were able to reduce bad recommendations.
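Here's a hedged sketch of that distillation step, not Indeed's implementation: use the fine-tuned LLM's judgments as labels for a small, fast student model. The features, data, and model choice below are all assumptions for illustration.

```python
# Illustrative sketch: distill LLM judgments ("bad recommendation" yes/no)
# into a lightweight classifier cheap enough for online filtering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

# Pretend these labels came from the fine-tuned LLM scoring (resume, job) pairs.
pairs = [
    "python backend engineer || senior java developer, fintech",
    "nurse, 5 yrs ICU || registered nurse, night shift",
    "graphic designer || forklift operator, warehouse",
    "data analyst, SQL || business intelligence analyst",
]
llm_says_bad = [1, 0, 1, 0]   # 1 = LLM judged the recommendation a bad fit

# Small, fast student model: TF-IDF features + logistic regression.
student = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
student.fit(pairs, llm_says_bad)

# At serving time, score in milliseconds and drop recs above a threshold.
scores = student.predict_proba(pairs)[:, 1]
print(roc_auc_score(llm_says_bad, scores))   # sanity check on training data
```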
They cut bad recommendations by about 20%. Initially, they had hypothesized that cutting down recommendations, even bad ones, would lower application rates. It's like sending out links: even clickbait links get clicks. But this was not the case. In fact, because the recommendations were now better, the application rate actually went up by 4%, and the unsubscribe rate went down by 5%. That's quite a lot.
Essentially, what this means is that in recommendations, quantity is not everything; quality makes a big difference, and here it moved the needle by 5%. The next example I want to share with you is Spotify. Who here knows that Spotify has podcasts and audiobooks? Oh, okay.
I guess you are not the target audience for this use case. Spotify is really known for songs and artists, and a lot of their users just search for songs and artists, and they are very good at that. But when they started introducing podcasts and audiobooks, how would you help your users know that these new items are available?
And of course, there's a huge cold-start problem. It's not only cold start on items, it's now cold start on an entire category. How do you start growing a new category within your service? And exploratory search was essential to the business for expanding beyond music.
Spotify doesn't want to do just music and songs; they now want to do all of audio. The solution is a query recommendation system. First, how did they generate new queries? They had a bunch of sources: extracting from catalog titles and playlist titles, and mining the search logs.
You just take the artist and add "cover" to it. This is what they got from existing data. Now, you might be wondering, where is the LLM in this? Well, the LLM is used to generate natural-language queries. It might not be sexy, but this works really well.
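As a rough illustration of that kind of pipeline, not Spotify's actual system: mine candidates from data you already have with cheap rules, and only call an LLM for the natural-language exploratory queries. The function names, prompt, and the stubbed LLM call are all hypothetical.

```python
# Illustrative sketch of exploratory query candidate generation.
# The LLM call is a placeholder; any chat-completion API could slot in there.

def mine_conventional_queries(artists, playlist_titles, search_log):
    """Cheap, rule-based candidates from data you already have."""
    candidates = set()
    candidates.update(f"{artist} cover" for artist in artists)      # artist + "cover"
    candidates.update(t.lower() for t in playlist_titles)           # playlist titles
    candidates.update(q for q, count in search_log if count > 100)  # frequent queries
    return candidates

def llm_exploratory_queries(seed_entity, call_llm):
    """Use an LLM only for the part rules can't do: natural-language queries."""
    prompt = (
        f"Suggest 3 short natural-language search queries a listener interested in "
        f"'{seed_entity}' might type to discover podcasts or audiobooks."
    )
    return call_llm(prompt)

# Stubbed LLM so the sketch runs without any external service.
fake_llm = lambda prompt: ["true crime podcasts like serial",
                           "audiobooks about jazz history",
                           "interviews with indie musicians"]

queries = mine_conventional_queries(
    artists=["Radiohead"], playlist_titles=["Lo-fi Study Beats"],
    search_log=[("taylor swift", 5000), ("obscure b-side", 3)],
)
queries.update(llm_exploratory_queries("Radiohead", fake_llm))
print(sorted(queries))
```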
Take whatever you have with conventional techniques that already works well, and use the LLM to augment it where you need it. Don't use the LLM for everything from the start. So now they have these exploratory queries. When you search for something, you still get the immediate result hits, right?
You take these exploratory queries, add them to the immediate results, and rank them together. This is why, when you do a search, this is the UX you probably get right now. I took this from the paper, so it may have changed recently. You still see the item results at the bottom.
But at the top, with the query recommendations, this is how Spotify informs users, without needing a banner, that they now have audiobooks and podcasts. You search for something, and it informs you that these new categories exist. The benefit: plus 9% exploratory queries. Essentially, one-tenth of their users are now exploring these new products.
Imagine one-tenth of your users exploring your new products every day. How quickly would you grow that new product category? It compounds, roughly 1.1 to the power of n, so it grows pretty fast. Long story short, I don't have to tell you about the benefits of LLM-augmented synthetic data: high-quality data at scale, even on tail queries and tail items, at far lower cost and effort than is possible with human annotation. Later, we also have a talk from Instacart on how they use LLMs to improve their search system. Now, the last thing I want to share is this challenge: right now, in a typical company, the systems for ads, for recommendations, and for search are all separate systems.
And even within recommendations, the model for homepage recommendations, the model for item recommendations, the model for add-to-cart recommendations, the model for the thank-you-page recommendations may all be different models. So you can imagine: you end up with many, many models, while leadership expects you to keep the same headcount.
So then how do you try to get around this, right? You have duplicative engineering pipelines. There's a lot of maintenance costs, and improving one model doesn't naturally transfer to the improvement in another model. So the solution for this is unified models, right? I mean, it works for vision. It works for language.
So why not recommendation systems? And we've been doing this for a while; this is not new. As an aside, maybe the text is too small, but this is a tweet from Stripe, where they built a transformer-based payments fraud model. Even for payments, over a sequence of payments, you can build a transformer-based foundation model.
I want to share an example of a unified ranker for search and recommendations at Netflix. The problem, as I mentioned: they had teams building bespoke models for search, for similar-video recommendations, and for pre-query recommendations on the search page before you ever enter a query. That means high operational cost and missed opportunities for learning across tasks.
Their solution is a unified ranker they call the Unified Contextual Ranker, or UniCoRn. You can see here at the bottom there's the user foundation model, into which you put the user's watch history, and then the context and relevance model, into which you put the context of the videos and what they've watched.
Now, the thing about this unified model is that it takes in unified input, right? So now, if you are able to find a data schema where all your use cases and all your features can use the same input, you can adopt an approach like this, which is similar to multitask learning.
So the input would be the user ID; the item ID, meaning the video, drama, or series; the search query, if one exists; the country; and the task. They have many different tasks; in the paper they show three: search, pre-query recommendations, and more-like-this.
What they did then was very smart imputation of missing inputs. For example, if you're doing an item-to-item recommendation, you've just finished watching a video and you're recommending the next one, there's no search query. How do you impute it? You simply use the title of the current item and use that to find similar items.
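Here's a minimal sketch of what such a unified input with imputation could look like; the field names follow the talk, everything else is my own assumption, not Netflix's schema.

```python
# Illustrative sketch of a unified ranker input with missing-field imputation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedInput:
    user_id: str
    item_id: str                      # the video / series being scored
    search_query: Optional[str]
    country: str
    task: str                         # e.g. "search", "pre_query", "more_like_this"

def impute_missing_query(example: UnifiedInput, item_titles: dict) -> UnifiedInput:
    """For item-to-item tasks there is no search query; impute it with the
    title of the current item so every task shares the same input schema."""
    if example.search_query is None and example.task == "more_like_this":
        example.search_query = item_titles.get(example.item_id, "")
    return example

titles = {"show_42": "space western drama"}
x = UnifiedInput(user_id="u1", item_id="show_42",
                 search_query=None, country="US", task="more_like_this")
print(impute_missing_query(x, titles).search_query)   # "space western drama"
```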
The outcome is that this unified model was able to match or exceed the metrics of their specialized models on multiple tasks. It may not seem very impressive; you might say we did all this work just to match. But imagine unifying all of it: removing the tech debt and building a better foundation for your future iterations.
It's going to make you iterate faster. The last example I want to share with you is unified embeddings at Etsy. So you might think that embeddings are not very sexy, but this paper from Etsy is really outstanding in what they share in terms of model architecture as well as their system.
The problem they had was: how can you help users get better results from very specific queries or very broad queries? And the Etsy inventory is constantly changing; they don't carry the same products throughout, since it's very homegrown. So you might query for something like "Mother's Day gift".
That would match very few items; very few items would have "Mother's Day gift" in their description or title. The other problem is that lexical and embedding-based retrieval don't account for user preferences. So how do you address this?
How they addressed this is with unified embedding and retrieval. If you remember, at the start of my presentation I talked about the two-tower model: there's the user tower and the item tower. We see the same pattern again. Over here, you see the product tower.
This is the product encoder. To encode the product, they use T5 models for text embeddings of the item descriptions, as well as query-product logs for query embeddings: what was the query, and what product was eventually clicked or purchased.
And then over here on the left, you see the query encoder, which encodes the search query. Both towers share encoders for the text tokens, for the product category, which is a token of its own, and for the user location. What this means is that your embedding can now match the user to the location of the product itself.
And then, to personalize this, they encode user preferences from query-user interactions: essentially, what queries the user searched for, what they bought previously, all their preferences. They also shared their system architecture. Over here is the product encoder from the previous slide, and the query encoder from the previous slide.
But what's very interesting here is that they added a quality vector, because they wanted to ensure that whatever was searched and retrieved was actually of good quality in terms of ratings, freshness, and conversion rate. And, you know, what they did is they just simply concatenated this quality vector to the product embedding vector.
But when you do that, you have to expand the query vector by the same number of dimensions so that you can still take a dot product or cosine similarity. So essentially, they just slapped a constant vector onto the query embedding, and it just works.
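Here's a tiny numerical sketch of that trick, illustration only, with made-up dimensions: append the quality vector to the product embedding, and pad the query embedding with a constant vector of the same length so the dot product still lines up.

```python
# Illustrative sketch: appending a quality vector to the product embedding
# and padding the query embedding with a constant so dimensions still match.
import numpy as np

d, q = 64, 4                                   # embedding dim, quality-vector dim
product_emb = np.random.randn(d)
quality_vec = np.array([0.9, 0.7, 0.8, 0.95])  # e.g. ratings, freshness, conversion
query_emb   = np.random.randn(d)

product_full = np.concatenate([product_emb, quality_vec])   # shape (d + q,)
query_full   = np.concatenate([query_emb, np.ones(q)])      # constant padding

# The extra dot-product terms add a query-independent quality boost to the score.
score = query_full @ product_full
print(score, query_emb @ product_emb + quality_vec.sum())   # identical values
```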
The result: a 2.6% increase in conversion across the entire site, which is quite remarkable, and more than a 5% increase in purchases from search. These are very, very good results for e-commerce. So the benefit of unified models: you simplify your system, and whatever you build to improve the unified model also improves every other use case that relies on it.
That said, there may also be an alignment tax. When you try to compress, say, all 12 use cases into a single unified model, you may find you need to split it into two or three separate unified models, because of that alignment tax: getting better on one task actually makes another task worse. We have a talk from LinkedIn in this afternoon's block, the last talk of the block, and then a talk from Netflix, who will be sharing about their unified model at the start of the next block.
All right, the three takeaways I have for you, to think about and consider: semantic IDs, data augmentation, and unified models. And do stay tuned for the rest of the talks in this track. Okay, that's it. Thank you. We may have time for one question while our speakers from Pinterest
come up and join us. Oh, you have a question? Do you mind speaking into the mic, please? I read the very long paper you wrote on recommendation systems and what's available today, but you didn't mention the GenRec or HSTU work from Meta, and I was just curious why you left that out.
I didn't deliberately leave it out. There were just so many papers and I had time-boxed myself. I said, okay, you're going to be done with this in two weeks, and when the two weeks were up, that's all I had, so ship it.
That's all. Yes, another question, please. I feel like I've read anecdotal blog posts arguing that part of the decline in Google search quality is a move away from explicit ranking factors and an easily auditable ranking algorithm toward something more black-box, using more of these techniques.
And I guess I was curious if you had an opinion on whether that seems likely to be the case or whether that is just, you know, noise and not actually influencing the quality of the search results. Yeah, that's a good question. Unfortunately, I don't have any insider information on why that might happen.
I think we do have some Google folks here; maybe you can ask them. But honestly, I haven't noticed or experienced that degradation myself.
Okay. Thank you, everyone, for your patience. Next, we have Han and Mukunda, machine learning engineers from Pinterest. They'll be sharing with us how they integrated LLMs to enhance relevance scoring at Pinterest, and how they combine search queries with multimodal context.
This multimodal context includes visual captions, link-based text, and user-curated signals. Thanks for the introduction. Hi, everyone. Thanks for joining the talk today. We're super excited to be here and share some of the learnings we have from integrating LLMs into Pinterest search. My name is Han, and today I'll be presenting with Mukunda; we are both machine learning engineers on the search relevance team at Pinterest.
To start with a brief introduction to Pinterest: Pinterest is a visual discovery platform where Pinners can come to find inspiration to create a life they love. There are three main discovery surfaces on Pinterest: the home feed, related pins, and search. Today's talk will focus on search, where the user can type in a query and find useful, inspiring content based on their information need.
And we'll share how we leverage LLMs to improve search relevance. Here are some key statistics for Pinterest search: every month we handle over 6 billion searches, with billions of pins to search over, covering topics from recipes, home decor, and travel to fashion and beyond. And at Pinterest, search is truly global and multilingual.
We support over 45 languages and reach Pinners in more than 100 countries. These numbers highlight the importance of search at Pinterest and why we invest in search relevance to improve the user experience. So this is an overview of how Pinterest search works at the back end. It's similar to many recommendation systems in industry.
It has query understanding, retrieval, re-ranking, and blending stages, and finally produces a relevant and engaging search feed. In today's talk, we'll focus on the semantic relevance model at the re-ranking stage, and share how we use LLMs to improve relevance of the search feed.
Okay. So here's our search relevance model, which is essentially a classification model: given a search query and a pin, the model predicts how relevant the pin is to the query. To measure this, we use a five-point scale ranging from most relevant to most irrelevant.
All right. Now we're going to share some key learnings from using LLMs to improve Pinterest search relevance. Here are four main takeaways we'd like to go into in more detail. Lesson one: LLMs are good at relevance prediction. Before I present the results, let me first give a quick overview of the model architecture we're using.
We concatenate the query and the pin text together and pass them into an LLM to get an embedding. This is called a cross-encoder structure, where we can better capture the interaction between the query and the pin. We then feed the embedding from the LLM into an MLP layer to produce a five-dimensional vector corresponding to the five relevance levels.
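Here's a simplified sketch of a cross-encoder set up this way; this is my own illustration rather than Pinterest's code, and the base model, pooling, and head sizes are arbitrary choices.

```python
# Simplified cross-encoder relevance head (illustrative; model choice is arbitrary).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CrossEncoderRelevance(nn.Module):
    def __init__(self, base="distilbert-base-uncased", num_levels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)
        hidden = self.encoder.config.hidden_size
        self.mlp = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(),
                                 nn.Linear(256, num_levels))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # [CLS]-style pooling
        return self.mlp(pooled)                # logits over 5 relevance levels

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = CrossEncoderRelevance()
# Query and pin text are concatenated into one sequence (cross-encoder).
batch = tok(["mid century modern lamp"],
            ["walnut table lamp, 1960s style, handmade"],
            padding=True, truncation=True, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
print(logits.shape)   # torch.Size([1, 5])
```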
During training, we fine-tune open-source LLMs on Pinterest internal data to better adapt the model to Pinterest content. Here I'd like to share some results that demonstrate the usefulness of LLMs. As a baseline, we use SearchSage, Pinterest's in-house content and query embedding model.
If you look at the table, you can see that the LLMs substantially improve relevance prediction performance, and as we use more advanced LLMs and increase model size, performance keeps improving. For example, the 8-billion-parameter model gives a 12% improvement over the multilingual BERT-based model and a 20% improvement over the SearchSage embedding baseline.
So the lesson here is that LLMs are quite good at relevance prediction. All right, lesson two: vision-language-model-generated captions and user actions can be quite useful for content annotation. To use an LLM for relevance prediction, we need a text representation of each pin.
Here I've listed several features that we use, such as the user-curated boards the pin has been saved to and the queries that led to engagement with the pin.

To serve these models we need well under a second, more like 500 or 400 milliseconds of latency at best. There are a few levers we can pull to make the model more efficient, improve throughput, and reduce latency. The first is distillation, and one of the recipes here is that we need to do the distillation step by step.
That means we go with, for example, an 8B model, then a 3B model, then a 1B model. We slowly decrease the size of the model and distill over and over from the previous model. That recipe turns out to be much more effective than directly distilling from, say, a 150B model down to a 1B model.
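A schematic of that step-by-step distillation loop, purely illustrative: the models here are tiny stand-ins for the 8B/3B/1B chain, and the loss is plain KL on temperature-softened targets.

```python
# Schematic step-by-step distillation: 8B -> 3B -> 1B becomes big -> medium -> small.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model(width):      # tiny stand-ins for LLMs of decreasing size
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), nn.Linear(width, 10))

def distill(teacher, student, data, epochs=3, T=2.0):
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    teacher.eval()
    for _ in range(epochs):
        for x in data:
            with torch.no_grad():
                soft_targets = F.softmax(teacher(x) / T, dim=-1)
            loss = F.kl_div(F.log_softmax(student(x) / T, dim=-1),
                            soft_targets, reduction="batchmean")
            opt.zero_grad(); loss.backward(); opt.step()
    return student

data = [torch.randn(64, 32) for _ in range(20)]
teacher = make_model(512)                      # "8B"
for width in (256, 128):                       # "3B", then "1B"
    student = make_model(width)
    teacher = distill(teacher, student, data)  # each student becomes the next teacher
```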
Same thing for pruning. Pruning is a mathematical optimization problem: you can reduce the number of heads in the transformer, or reduce the number of MLPs. Overall, these transformer models have proven to be very redundant in how they store information, so we can prune away some of these layers, or reduce the precision of the activations and parameters.
However, if you prune very aggressively at the beginning, your performance suffers significantly. So the recipe here is gradual pruning: we apply a very small amount of pruning to the model, distill into the smaller model, and do it over and over again.
More pruning, more distillation, more pruning, more distillation. As you can see from this plot, gradual pruning can be done with essentially no information loss, whereas aggressive pruning up front can cost up to 1% in model quality.
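A sketch of gradual pruning interleaved with a recovery step, using torch's built-in magnitude pruning; the real recipe, schedules, and criteria would differ, and the recovery function here is just a placeholder.

```python
# Gradual pruning sketch: prune a little, (re)distill or fine-tune, repeat.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))

def recover(model):
    """Placeholder for the distillation / fine-tuning step after each round."""
    pass  # in practice: distill from the unpruned teacher or fine-tune on task data

for step in range(5):                    # e.g. 5 gentle rounds instead of one big cut
    for module in model:
        if isinstance(module, nn.Linear):
            # prune 10% of the remaining weights by L1 magnitude
            prune.l1_unstructured(module, name="weight", amount=0.10)
    recover(model)

# Sparsity compounds across rounds instead of being applied all at once.
w = model[0].weight
print(f"final sparsity: {(w == 0).float().mean().item():.2%}")
```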
Another lever is quantization: going to lower precision, we leverage FP8 for activations and model parameters. However, using FP8 in all the layers hurts the quality of the model significantly, so the tool here is mixed precision. One important aspect of ranking, recommendation, and prediction tasks in general is that you want the model's predicted probabilities to have very good precision.
So the LM head at the end of the language model has to be in FP32. If you run it in FP16, BF16, or FP8, the numbers collapse: you lose calibration and you can no longer distinguish between the different recommended items.
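A minimal sketch of that mixed-precision split; BF16 stands in for FP8 here, since native FP8 support depends on hardware and serving stack, and the tiny model is only an illustration of keeping the head in FP32.

```python
# Mixed-precision sketch: low precision for the trunk, FP32 for the LM head.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, d=64, vocab=1000):
        super().__init__()
        self.trunk = nn.Sequential(nn.Embedding(vocab, d),
                                   nn.Linear(d, d), nn.GELU())
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, tokens):
        h = self.trunk(tokens)
        # Up-cast before the head so the output logits keep full FP32 resolution,
        # which matters for calibrated, distinguishable ranking scores.
        return self.lm_head(h.float())

model = TinyLM()
model.trunk.to(torch.bfloat16)    # downcast only the expensive part of the network
model.lm_head.to(torch.float32)   # keep the head in FP32

tokens = torch.randint(0, 1000, (2, 16))
logits = model(tokens)
print(logits.dtype)               # torch.float32
```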
The last lever is sparsification. We can sparsify the attention: the most expensive part of a transformer is computing attention scores, and not every token needs to attend to every other token. When you know your task, and you know which items in the history matter for the recommendation, you can sparsify and avoid having every item attend to every other item.
The same goes for the items being scored. Instead of scoring one item, you can score 50 or 500 items at the same time, but you want to make sure these candidate items do not attend to each other. So you sparsify the attention scores across the candidates as well.
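A small sketch of building such a mask, my illustration of the idea rather than their custom kernel: history tokens attend causally among themselves, and each candidate attends to the history and to itself only.

```python
# Sketch of a "candidates don't attend to each other" attention mask.
import torch
import torch.nn.functional as F

n_hist, n_cand = 6, 4
n = n_hist + n_cand

allowed = torch.zeros(n, n, dtype=torch.bool)
allowed[:n_hist, :n_hist] = torch.tril(
    torch.ones(n_hist, n_hist, dtype=torch.bool))            # causal history
allowed[n_hist:, :n_hist] = True                              # candidates see history
allowed[n_hist:, n_hist:] = torch.eye(n_cand, dtype=torch.bool)  # ...and only themselves

# Broadcast to the 4D shape (batch, heads, query, key) that attention kernels expect.
mask_4d = allowed[None, None, :, :]

q = k = v = torch.randn(1, 2, n, 16)                          # (batch, heads, seq, dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask_4d)
print(out.shape)   # torch.Size([1, 2, 10, 16])
```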
If you put everything together, we get a significant reduction in latency. Over four or five releases, one after another, we were able to reduce latency by 7x while increasing throughput, the number of queries one GPU can handle, by 30x.
So we are increasing the amount of work each GPU does while reducing the latency of each query. These are some of the technical reports and papers we published along the way, to share our lessons learned with the community.
And that's the end of our talk, so we have some time to answer questions. Thank you. Great talk. One question: how did you measure that it doesn't lose generalization power? Obviously, you've done a lot of fine-tuning, and you mentioned it works for four or five tasks instead of task-specific models.
How do you know it's going to work for the next five tasks? That's a good question. The answer, overall, is having a very comprehensive benchmarking set. We have something like 50 to 60 benchmarks; some of them are internal, some are external.
For example, we leverage IFEval to make sure the model still follows instructions well. And as Maziar mentioned, some of the tasks were never part of our training data; that's how we measure generalization to new domains and use cases.
Thanks for the talk. I'm wondering what a small listing website can use out of the box. Have you heard of NLWeb, which Microsoft launched recently? If yes, what are your views on it as a recommendation system? NLWeb? No, I haven't actually heard of it.
Sorry about that. For smaller listing websites, say a real estate listing site with thousands of listings, what out-of-the-box recommendation models can people start using? I wish such a model existed. I mean, that's partly why we started this work.
We were trying to see if we could actually make a foundation model that could solve those kinds of problems. I think there's a lot of potential for this to serve use cases beyond the bigger companies. But I don't know of any off-the-shelf option; I think you should check out that NLWeb one.
Okay, I'll look at that. Thank you for the great talk. On the slide where you mentioned multi-item scoring, I'm curious what that effectively means. Does it mean you need to do multi-step decoding, or is it just one step, just processing the logits for multiple items?
It's not multi-step decoding. We didn't want to go into the complications of speculative decoding or the decoding side at all; we wanted to have everything in the prefill. So what we did is put all the items into the sequence.
All the recommended items, the potential candidates, are sequenced together. But we also wanted to avoid them attending to each other, so we leveraged what we call a 4D attention mask, and we developed a special kernel in SGLang and vLLM to be able to do that.
And now, when you have up to 500 items in your query segment, those items don't attend to each other; they only attend to the user's history and profile information. Okay, thank you. Hey, great talk. So user history means many things, right? There are all the jobs they've applied to, the job postings.
There are so many entities and so on, so the model's context can get quite large. How did you manage that? Did you compress it, or were there parts you focused on? Yeah, we experimented with a lot of things. We experimented with a RAG-style setup, where given a query we try to figure out the closest items in your history and bring those in.
We also experimented with chronological ordering and some sort of weight decay over the chronological order. It turns out that for the majority of our applications, chronological order is good enough, and that kind of makes sense, because recommendation systems are very biased toward freshness. So the more recent user activity helps.
One of the biggest challenges has actually become more of a traditional problem: how do you balance the distribution of positives and negatives within the context? That's become more of an ML engineering effort to figure out. Do I want more positives or negatives?
How much information do I need to put in the context? Yeah. I can add one more thing to this. There's also another complication: when you go to serve these models, you don't want to break the KV caching you rely on in serving.
So it's going to be a bit more complicated and cumbersome to do something smarter than just using chronological order. That's something that needs to be designed; it's not obvious. Yeah, absolutely. One more question. You did so many experiments and tried out so many things.
How's your entire system set up? Because I'm assuming that you say quantization, but you must have tried different forms of quantization and whatnot. How do you set up the system in such a way that you can try out multiple experiments and see what works best? Can you talk a bit about that?
Yeah, I'll just touch a bit on that one. One thing we hold a very high bar for is automation. Our system is very automated, to the extent that when you run an experiment, the result is automatically pushed into the Excel sheet.
And when you have such an automated system, developers become very efficient. Say I just want to try a different quantization: you just change the quantization parameters and everything else happens end to end. So automation is key if you really want to optimize these models.
So did you build all of that automation in-house, or...? Yes, most of it. We leveraged, for example, Lightning, vLLM, and SGLang; we leveraged a lot of open-source tools, but we make sure they are integrated well with each other and optimize the entire flow.
Okay, thank you. Thank you again, Hamed and Maziar. So we'll come back after lunch at 2:00. I'll see you guys back here.