Hi everyone, thank you for joining today's RecSys track, the inaugural RecSys track at the AI Engineer World's Fair. Today I want to share what the future might look like when we merge recommendation systems and language models. My wife looked at my slides and said they're so plain.
So I'll be giving this talk together with Latte and Mochi. You might have seen Mochi wandering the halls somewhere, and there'll be a lot of doggos throughout these slides. I hope you enjoy them. First, language modeling techniques are not new in recommendation systems. It started with word2vec in 2013.
We started learning item embeddings from co-occurrences in user interaction sequences. After that, we started using GRUs for recommendations. I don't know who here still remembers recurrent neural networks and gated recurrent units. Those were very short-term: we predicted the next item from a short sequence of recent items.
Then of course transformers and attention came about, and we got better at attending over long-range dependencies. That's where we started asking: can we just process everything in the user sequence, hundreds or even two thousand item IDs long, and try to learn from that? Today in this track, I want to share three ideas that I think are worth thinking about.
Semantic IDs, data augmentation, and unified models. The first challenge we have is hash-based item IDs. Who here works on recommendation systems? You probably know that hash-based item IDs don't encode the content of the item itself. The problem is that every time you have a new item, you suffer from the cold-start problem: you have to relearn everything about that item all over again.
And then there's also sparsity, where you have a long tail of items with maybe one or two interactions, or even up to ten, but that's just not enough to learn from. So recommendation systems end up with a strong popularity bias, and they struggle with cold start and sparsity.
The solution is semantic IDs, which may even involve multimodal content. Here's an example of trainable multimodal semantic IDs from Kuaishou. Kuaishou is kind of like TikTok or Xiaohongshu, a short-video platform in China; I think it's the number two short-video platform. You might have used their text-to-video model, Kling, which they released sometime last year.
The problem they had, being a short-video platform, is that users upload hundreds of millions of short videos every day, and it's really hard to learn about these new videos. So how can we combine static content embeddings with dynamic user behavior? Here's how they did it with trainable multimodal semantic IDs.
I'm going to go through each step here. This is the Kuaishou model, a standard two-tower network. On the left is the embedding layer for the user, which takes a standard sequence of item IDs plus the user ID, and on the right is the embedding layer for the item IDs.
These are fairly standard. What's new here is that they now take in content input. All of these slides will be available online, so don't worry; I'll make them available right after this. To encode the visuals, they use ResNet. To encode video descriptions, they use BERT.
And to encode audio, they use VGGish. Now, the trick is this: with these pretrained encoder models, it's very hard to backpropagate through and update the encoder embeddings. So what did they do? First, they took all these content embeddings and simply concatenated them together.
I know it sounds crazy, but they just concatenated them together. Then they learned cluster IDs. I think they shared in the paper that they had around a hundred million short videos, and via k-means clustering they learned a thousand cluster IDs. That's what you see in the model encoder, in the boxes at the bottom: the cluster IDs.
Above the cluster IDs you have the non-trainable content embeddings; below them you have the trainable cluster IDs, which are each mapped to their own embedding table. The trick is this: as you train the model, the model encoder learns to map the content space, via the cluster IDs and their embedding table, into the behavioral space.
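To make that concrete, here's a minimal sketch in code of the idea as I understand it, with made-up embedding sizes and a far smaller catalog and cluster count than Kuaishou's: frozen content embeddings are concatenated, clustered with k-means into cluster IDs, and those cluster IDs index a trainable embedding table inside the item tower.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# 1) Frozen content embeddings from pretrained encoders (ResNet for visuals,
#    BERT for descriptions, VGGish for audio). Random stand-ins here.
num_items = 5_000
visual = torch.randn(num_items, 512)
text = torch.randn(num_items, 768)
audio = torch.randn(num_items, 128)

# 2) Concatenate the non-trainable content embeddings per item.
content = torch.cat([visual, text, audio], dim=-1)            # (num_items, 1408)

# 3) Learn cluster IDs over the content space with k-means.
#    (The talk mentions ~1,000 clusters over ~100M videos; 100 here to keep it fast.)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0)
cluster_ids = torch.tensor(kmeans.fit_predict(content.numpy())).long()

# 4) The trainable part: each cluster ID gets its own embedding, learned jointly
#    with the rest of the two-tower model, so the content space gets mapped into
#    the behavioral space.
class ItemTower(nn.Module):
    def __init__(self, num_items, num_clusters, dim=64):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, dim)            # classic hash/ID embedding
        self.cluster_emb = nn.Embedding(num_clusters, dim)    # trainable semantic ID embedding

    def forward(self, item_ids, item_cluster_ids):
        return self.id_emb(item_ids) + self.cluster_emb(item_cluster_ids)

tower = ItemTower(num_items, num_clusters=100)
item_vec = tower(torch.tensor([42]), cluster_ids[[42]])       # (1, 64) item representation
```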
The output is this: these semantic IDs not only outperform regular hash-based IDs on clicks and likes, which is pretty standard, but they were also able to increase cold-start coverage, which is, of a hundred videos that get surfaced, how many of them are new. They increased it by 3.6%.
They also increased cold-start velocity, which is how many new videos were able to hit some threshold of views. They did not share what that threshold was, but being able to increase cold-start coverage and cold-start velocity by these numbers is pretty outstanding. Long story short, the benefits of semantic IDs: you address cold start with the semantic ID itself, and now your recommendations understand content.
Later in this track, we're going to see some amazing sharing from Pinterest and YouTube. In the YouTube one, you'll see how they blend language models with semantic IDs, so the model can explain why you might like an item, because it understands the semantic ID, and it can give human-readable explanations, and vice versa.
Now, the next question, and I'm sure everyone here has this challenge: the lifeblood of machine learning is data, good-quality data at scale. This is essential for recommendation systems, and even more so for search. You need a lot of metadata, query expansion, synonyms, spell checking, all sorts of metadata to attach to your search index.
But this is very costly and high-effort to get. In the past, we did it with human annotation, or maybe tried to do it automatically. LLMs have turned out to be outstanding at this, and I'm sure everyone here is doing this to some extent, using LLMs for synthetic data and labels.
I want to share two examples with you, from Spotify and Indeed. The Indeed paper, I really like it a lot. The problem they faced is that they were sending job recommendations to users via email, but some of these job recommendations were bad.
They were just not a good fit for the user. So there was poor user experience, and users lost trust in the job recommendations. And the way users indicated that lost trust was: these job recommendations are not a good fit for me, I'm just going to unsubscribe.
Now, the moment a user unsubscribes from your feed or your newsletter, it's very, very hard to get them back. Almost impossible. So while they had explicit negative feedback, thumbs up and thumbs down, it was very sparse. How often do you actually give thumbs-down feedback? Very rarely.
And implicit feedback is often imprecise. What do I mean? If you get some recommendations but you don't act on them, is it because you didn't like them? Or is it because it's not the right time? Or maybe your wife works there and you don't want to work at the same company as your wife?
So the solution they had was a lightweight classifier to filter bad recommendations. And I'll tell you why I really like this paper from Indeed: they didn't just share their successes, they shared the entire process and how they got there. And it was fraught with challenges.
The first thing that makes me really like it is that they started with evals. They had their experts label job recommendation and user pairs: from the user you have their resume data and their activity data, and the experts judged, is this recommendation a good fit?
Then they prompted open LLMs, Mistral and Llama 2. Unfortunately, their performance was very poor. These models couldn't really pay attention to what was in the resume and what was in the job description, even though they had sufficient context length, and the output was just very generic. So to get it to work, they prompted GPT-4.
And GPT-4 worked really well; specifically, GPT-4 had about 90% precision and recall. However, it was very costly. They didn't share the actual cost, but it was also too slow, around 22 seconds. Okay, if GPT-4 is too slow, what can we do? Let's try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision. What does this mean?
Of the recommendations it said were bad, only 63% were actually bad. That means they would be throwing out 37% of the flagged recommendations, roughly one-third, that were actually good. For a company that thrives on recommendations, where people get recruited through those recommendations, throwing out one-third of good recommendations breaches their guardrail.
Precision was their key metric here. So what they did next was fine-tune GPT-3.5. You can see the entire journey: open models, GPT-4, GPT-3.5, and now fine-tuning GPT-3.5. The fine-tuned GPT-3.5 got the precision they wanted, at roughly a quarter of GPT-4's cost and latency.
But unfortunately, it was still too slow, about 6.7 seconds, and that would not work in an online filtering system. So what they did was distill a lightweight classifier on the fine-tuned GPT-3.5 labels. This lightweight classifier was able to achieve very high performance, specifically 0.86 AUC-ROC.
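Here's a minimal sketch of that distillation step, with random features standing in for the (user, job) features and synthetic labels standing in for the fine-tuned GPT-3.5 judgments; the point is just that the student is a small model cheap enough to call on every recommendation in real time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in features for (user, job) pairs, e.g. resume/job-description similarity,
# location match, recency. In practice these come from your feature store.
rng = np.random.default_rng(0)
X = rng.random((50_000, 32))

# Teacher labels: 1 = "bad recommendation", as judged by the fine-tuned LLM.
y = (rng.random(50_000) < 0.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The distilled student: a lightweight classifier trained on the LLM's labels.
student = LogisticRegression(max_iter=1000)
student.fit(X_train, y_train)

print("AUC-ROC:", roc_auc_score(y_test, student.predict_proba(X_test)[:, 1]))

# At serving time, scoring a candidate is a dot product plus a sigmoid,
# well within a real-time latency budget, unlike a 6.7-second LLM call.
```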
I mean, the numbers may not make sense to you, but suffice to say that in an industrial setting, this is pretty good. And of course, they didn't mention the latency, but it was good enough for real-time filtering. I think less than 200 milliseconds or something. So the outcome of this was that they were able to reduce bad recommendations.
They were able to cut bad recommendations by about 20%. Initially, they had hypothesized that cutting down recommendations, even bad ones, would lower the application rate. It's just like sending out links: even clickbait links that are bad still get clicks. But this was not the case. In fact, because the recommendations were now better, the application rate actually went up by 4%, and the unsubscribe rate went down by 5%. That's quite a lot.
Essentially, what this means is that in recommendations, quantity is not everything. Quality makes a big difference, and here quality moved the needle by 5%. The next example I want to share with you is Spotify. Who here knows that Spotify has podcasts and audiobooks?
Oh, okay. I guess you're not the target audience for this use case. Spotify is really known for songs and artists, and a lot of their users just search for songs and artists, and they're very good at that. But when they started introducing podcasts and audiobooks, how would you help your users know that these new items are available?
And of course, there's a huge cold-start problem. Now it's not only cold start on items, it's cold start on a whole category. How do you start growing a new category within your service? Exploratory search was essential to the business for expanding beyond music. Spotify doesn't want to just do music and songs.
They now want to do all of audio. The solution to that is a query recommendation system. First, how did they generate new queries? They had a bunch of sources: extracting from catalog titles and playlist titles, mining the search logs, or just taking an artist name and appending "cover" to it.
That's what they used from existing data. Now you might be wondering, where's the LLM in this? Well, the LLM is used to generate natural-language queries. This might not be sexy, but it works really well: take whatever conventional techniques already work well, and use the LLM to augment them where you need it.
Don't use the LLM for everything at the start. So now they have these exploratory queries. When you search for something, you still get the immediate results, and then these new queries are ranked alongside them. So when you do a search, this is the UX you're probably going to get right now.
I got this from the paper; it may have changed recently. You still see the item results at the bottom, but at the top, with the query recommendations, this is how Spotify informs users, without having a banner, that now we have audiobooks, now we have podcasts. You search for something, and it informs you that these new categories exist.
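As a rough sketch of that flow, assuming hypothetical helper names rather than Spotify's actual pipeline: assemble exploratory query candidates from conventional sources, add LLM-generated natural-language queries, and surface a few ranked query recommendations above the usual item results.

```python
# All function and field names here are hypothetical placeholders for illustration.
def generate_candidate_queries(catalog_titles, playlist_titles, mined_queries,
                               artists, llm_queries):
    candidates = set()
    candidates.update(catalog_titles)                  # e.g. podcast and audiobook titles
    candidates.update(playlist_titles)
    candidates.update(mined_queries)                   # mined from search logs
    candidates.update(f"{a} cover" for a in artists)   # artist name + "cover"
    candidates.update(llm_queries)                     # LLM-written natural-language queries
    return candidates

def search_page(user_query, immediate_results, candidates, score_fn, k=3):
    # Rank exploratory queries against the current search and show a few of them
    # above the usual item results, so new categories surface without a banner.
    ranked = sorted(candidates, key=lambda q: score_fn(user_query, q), reverse=True)
    return {"query_recommendations": ranked[:k], "items": immediate_results}
```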
The benefit here is plus 9% exploratory queries. Essentially, roughly one-tenth of their users are now exploring their new products. Imagine one-tenth exploring your new products every day; how quickly would you be able to grow a new product category? It compounds like 1.1 to the power of N; it grows pretty fast.
Long story short, I don't have to tell you the benefits of LLM-augmented synthetic data: richer, higher-quality data at scale, even on tail queries and tail items, at far lower cost and effort than is possible with human annotation.
So later, we also have a talk from Instacart, who will tell us about how they use LLMs to improve their search system. Now the last thing I want to share is this challenge, whereby right now, in a regular company, the system for ads, for recommendations, for search, they're all separate systems.
And even for recommendations, the model for homepage recommendations, the model for item recommendations, the model for add-to-cart recommendations, the model for thank-you-page recommendations, they may all be different models. So you can imagine you're going to have many, many models, while leadership expects you to keep the same headcount.
So how do you get around this? You have duplicative engineering pipelines, a lot of maintenance cost, and improving one model doesn't naturally transfer to improvements in another model. The solution for this is unified models. It works for vision, it works for language, so why not recommendation systems?
And we've been doing this for a while; this is not new. As an aside, maybe the text is too small, but this is a tweet from Stripe, where they built a transformer-based payments fraud model. Even for payments, over the sequence of payments, you can build a foundation model that is transformer-based.
I want to share the example of the unified ranker for search and recommendations at Netflix. The problem, as I mentioned: they had teams building bespoke models for search, similar-video recommendations, and pre-query recommendations, the ones on the search page before you ever enter a search query. High operational costs, and missed opportunities to learn across tasks.
Their solution is a unified ranker, which they call the unified contextual ranker, UniCoRn. At the bottom you can see the user foundation model, into which you put the user's watch history, and the context and relevance model, into which you put the context of the videos and what they've watched.
The thing about this unified model is that it takes a unified input. If you can find a data schema where all your use cases and all your features can share the same input, you can adopt an approach like this, which is similar to multi-task learning.
The input would be the user ID, the item ID (the video, drama, or series), the search query if one exists, the country, and the task. They have many different tasks; in the paper, they use three: search, pre-query, and more-like-this.
What they did next was a very smart imputation of missing inputs. For example, if you're doing an item-to-item recommendation, you've just finished watching a video and you want to recommend the next one, there is no search query. How do you impute it? You simply use the title of the current item and try to find similar items.
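Here's a minimal sketch of what such a unified input plus imputation could look like; the field names are my own assumptions, not Netflix's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnifiedInput:
    user_id: str
    item_id: str                      # the candidate video / series being scored
    search_query: Optional[str]
    country: str
    task: str                         # e.g. "search", "pre_query", "more_like_this"

def build_input(user_id, candidate_id, country, task,
                search_query=None, source_item_title=None):
    if search_query is None and task == "more_like_this":
        # No query exists for item-to-item recommendations, so impute it from
        # the title of the item the user just watched.
        search_query = source_item_title
    return UnifiedInput(user_id, candidate_id, search_query, country, task)

example = build_input("u42", "show_123", "US", "more_like_this",
                      source_item_title="Stranger Things")
```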
The outcome is that this unified model was able to match or exceed the metrics of their specialized models on multiple tasks. Think about it. It may not seem very impressive, match or exceed; it might seem like we did all this work just to match.
But imagine unifying all of it: removing the tech debt and building a better foundation for your future iterations. It's going to make you iterate faster. The last example I want to share with you is unified embeddings at Etsy. You might think embeddings are not very sexy, but this paper from Etsy is really outstanding in what they share about the model architecture as well as the system.
The problem they had was: how can you help users get better results from very specific or very broad queries? And as you know, Etsy's inventory is constantly changing; they don't carry the same products all the time, it's mostly handmade. So you might query for something like "Mother's Day gift", and that would match very few items.
Very few items would have "Mother's Day gift" in their description or title. And the other problem is that lexical retrieval doesn't account for user preferences. So how do you address this? They addressed it with unified embedding and retrieval.
If you remember, at the start of my presentation I talked about the Kuaishou two-tower model: there's the user tower and the item tower. We see the same pattern again here. Over here you see the product tower, the product encoder. To encode the product, they use T5 models for text embeddings of the item descriptions, as well as query-product logs for query embeddings: what query was made, and what product was eventually clicked or purchased. And then on the left you see the query encoder, the search query encoder. Both towers share encoders for the text tokens, the product category, which is a token of its own, and the user location.
What this means is that your embedding can now match a user to the location of the product itself. And of course, to personalize this, they encode user preferences via the query and user features at the bottom: essentially, what queries the user searched for, what they bought previously, all their preferences.
They also shared their system architecture. Over here is the product encoder from the previous slide, and the query encoder from the previous slide. What's very interesting is that they added a quality vector, because they wanted to ensure that whatever was searched and retrieved was actually of good quality in terms of ratings, freshness, and conversion rate.
What they did is simply concatenate this quality vector to the product embedding vector. But when you do that, you also have to expand the query vector by the same number of dimensions, so that you can still take a dot product or cosine similarity. So essentially, they just slapped a constant vector onto the query embedding, and it just works.
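Here's a minimal sketch of that trick with illustrative dimensions: the quality vector is appended to the product embedding, and the query embedding is padded with a constant vector of the same length so the dot product still lines up.

```python
import numpy as np

def extend_product(product_emb: np.ndarray, quality: np.ndarray) -> np.ndarray:
    # product_emb: (d,) learned embedding; quality: (q,) e.g. [rating, freshness, cvr]
    return np.concatenate([product_emb, quality])

def extend_query(query_emb: np.ndarray, q_dim: int, constant: float = 1.0) -> np.ndarray:
    # Pad the query with a constant so the extra dimensions act as a fixed-weight
    # bias toward higher-quality products in the dot product.
    return np.concatenate([query_emb, np.full(q_dim, constant)])

d, q = 64, 3
product = extend_product(np.random.randn(d), np.array([4.8, 0.9, 0.03]))
query = extend_query(np.random.randn(d), q_dim=q)
score = product @ query   # similarity now includes a quality contribution
```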
The result: a 2.6% increase in conversion across the entire site. That's quite crazy. And more than a 5% increase in search purchases: if you search for something, the purchase rate goes up by 5%. These are very, very good results for e-commerce. So the benefits of unified models: you simplify the system, and whatever you build to improve the unified model also improves the other use cases that sit on top of it.
That said, there may also be an alignment tax. You may find that when you try to compress, say, all twelve use cases into a single unified model, you need to split it up into maybe two or three separate unified models, because of that alignment tax.
Trying to get better at one task actually makes another task worse. We have a talk from LinkedIn in this afternoon's block, the last talk of the block, and then we also have a talk from Netflix, who will be sharing about their unified model at the start of the next block.
All right, the three takeaways I have for you to think about and consider: semantic IDs, data augmentation, and unified models. And of course, stay tuned for the rest of the talks in this track. Okay, that's it. Thank you.