RecSys Keynote: Improving Recommendation Systems & Search in the Age of LLMs - Eugene Yan, Amazon

Chapters
0:00 Introduction to Language Modeling in Recommendation Systems
1:31 Challenge 1: Hash-based Item IDs
2:14 Solution: Semantic IDs
5:37 Challenge 2: Data Augmentation and Quality
6:10 Solution: LLM-Augmented Synthetic Data
6:21 Indeed Case Study
10:37 Spotify Case Study
13:34 Challenge 3: Separate Systems and High Operational Costs
14:24 Solution: Unified Models
14:51 Netflix Case Study (Unicorn)
16:46 Etsy Case Study (Unified Embeddings)
20:26 Key Takeaways
Hi everyone, thank you for joining us at today's RecSys track, the inaugural RecSys track at the AI Engineer World's Fair. Today I want to share what the future might look like when we try to merge recommendation systems and language models. My wife looked at my slides and said they're so plain, so I'll be giving this talk together with Latte and Mochi. You might have seen Mochi wandering the halls somewhere; there will be a lot of doggos throughout these slides. I hope you enjoy them.

First, language modeling techniques are not new in recommendation systems. It started with Word2Vec in 2013, when we began learning item embeddings from co-occurrences in user interaction sequences. After that, we started using GRU4Rec; I don't know who here remembers recurrent neural networks and gated recurrent units. Those were very short-term: we predicted the next item from a short sequence. Then transformers and attention came along, and we got much better at attending to long-range dependencies. That's where we are now: can we just process everything in the user sequence, hundreds or even 2,000 item IDs long, and learn from that?

Today in this track, I want to share three ideas that I think are worth thinking about: semantic IDs, data augmentation, and unified models.

The first challenge we have is hash-based item IDs.
Who here works on recommendation systems? Then you probably know that hash-based item IDs don't encode the content of the item itself. The problem is that every time you have a new item, you suffer from the cold-start problem: you have to relearn everything about that item all over again. And then there's sparsity, where you have a long tail of items with maybe one or two interactions, or even up to ten, which is just not enough to learn from. So recommendation systems end up very popularity-biased, and they struggle with cold start and sparsity.

The solution is semantic IDs, which may even involve multimodal content. Here's an example of trainable multimodal semantic IDs from Kuaishou. Kuaishou is kind of like TikTok or Xiaohongshu: a short-video platform in China, I think the number two short-video platform. You might have used their text-to-video model, Kling, which they released sometime last year. The problem they had, being a short-video platform, is that users upload hundreds of millions of short videos every day, and it's really hard to learn from these new videos. So how can we combine static content embeddings with dynamic user behavior? Here's how they did it with trainable multimodal semantic IDs.
I'll go through each step. This is the Kuaishou model, a standard two-tower network. On the left is the embedding layer for the user, which takes a standard sequence of item IDs plus the user ID. On the right is the embedding layer for the item IDs. These are fairly standard. (All of these slides will be available online; I'll make them available immediately after this.)

What's new here is that they also take in content input. To encode visuals they use ResNet, to encode video descriptions they use BERT, and to encode audio they use VGGish. Now, the trick is this: with these pretrained encoder models, it's very hard to backpropagate through them and update their embeddings. So what did they do? First, they took all of these content embeddings and simply concatenated them together. I know it sounds crazy, but they just concatenated them. Then they learned cluster IDs: I think they shared in the paper that they had around a hundred million short videos, and via k-means clustering they learned a thousand cluster IDs. That's what you see in the model encoder, in the boxes at the bottom: the cluster IDs. Above the cluster IDs sit the non-trainable content embeddings; below them, the trainable cluster IDs, each mapped to its own embedding table. The trick is that, as you train the model, the model encoder learns to map the content space, via the cluster IDs and their embedding table, to the behavioral space.
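Here is a minimal sketch of that idea in Python: frozen content embeddings from several modalities are concatenated, clustered offline with k-means into cluster IDs, and each cluster ID is then looked up in a trainable embedding table inside the ranking model. The encoder outputs, dimensions, and corpus size below are illustrative assumptions, not Kuaishou's actual configuration.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# --- Offline step: build semantic (cluster) IDs from frozen content embeddings ---
# Assume we already ran frozen encoders (e.g. ResNet / BERT / VGGish) over each video.
num_videos = 10_000                        # illustrative; the paper mentions ~100M videos
visual = np.random.randn(num_videos, 512)  # stand-in for ResNet embeddings
text   = np.random.randn(num_videos, 768)  # stand-in for BERT embeddings
audio  = np.random.randn(num_videos, 128)  # stand-in for VGGish embeddings

# Concatenate the non-trainable content embeddings per item.
content = np.concatenate([visual, text, audio], axis=1)

# Learn cluster IDs with k-means (the talk mentions ~1,000 clusters).
kmeans = KMeans(n_clusters=1000, n_init=10, random_state=0).fit(content)
cluster_ids = kmeans.labels_               # one semantic ID per video

# --- Online step: trainable embeddings keyed by cluster ID ---
class ItemTower(nn.Module):
    """Item tower combining a hash-based item ID with a trainable
    semantic (cluster) ID embedding. A sketch, not the actual architecture."""
    def __init__(self, num_items: int, num_clusters: int = 1000, dim: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)         # behavioral ID embedding
        self.cluster_emb = nn.Embedding(num_clusters, dim)   # trainable semantic ID embedding

    def forward(self, item_id: torch.Tensor, cluster_id: torch.Tensor) -> torch.Tensor:
        # The cluster embedding is trained end to end with the ranking objective,
        # so it learns to bridge content space and behavioral space.
        return self.item_emb(item_id) + self.cluster_emb(cluster_id)

tower = ItemTower(num_items=num_videos)
item_vec = tower(torch.tensor([42]), torch.tensor([int(cluster_ids[42])]))
```

The key design choice is that the k-means step stays frozen and offline, while the cluster-ID embedding table is trained with the rest of the model; that is what lets content information flow into the behavioral space.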
The output is this: these semantic IDs not only outperform regular hash-based IDs on clicks and likes, which is fairly standard, but they also increased cold-start coverage, meaning, of a hundred videos you share, how many of them are new, by 3.6%. They also increased cold-start velocity, meaning how many new videos are able to hit some threshold of views. They didn't share what that threshold was, but being able to increase cold-start coverage and cold-start velocity by these numbers is pretty outstanding.

Long story short, the benefits of semantic IDs: you can address cold start with the semantic ID itself, and your recommendations now understand content. Later in the talk we'll see some amazing sharing from Pinterest and YouTube. In the YouTube one, you'll see how they blend language models with semantic IDs, so the model can actually explain why you might like an item, because it understands the semantic ID and can give human-readable explanations, and vice versa.
The next challenge, and I'm sure everyone here has it: the lifeblood of machine learning is data, good-quality data at scale. This is essential for search and, of course, for recommendation systems, but for search it matters even more. You need a lot of metadata: query expansion, synonyms, spell checking, all sorts of metadata attached to your search index. And it's very costly and high-effort to get. In the past we did it with human annotation, or maybe tried to do it automatically, but LLMs have turned out to be outstanding at this. I'm sure everyone here is doing this to some extent, using LLMs for synthetic data and labels. I want to share two examples, from Indeed and Spotify. I really like the Indeed paper a lot.
The problem Indeed faced is that they were sending job recommendations to users via email, but some of those recommendations were bad: they were just not a good fit for the user. That meant a poor user experience, and users lost trust in the job recommendations. The way users indicated they had lost trust was to say, these job recommendations are not a good fit for me, I'm just going to unsubscribe. And the moment a user unsubscribes from your feed or your newsletter, it's very, very hard to get them back, almost impossible. They had explicit negative feedback, thumbs up and thumbs down, but it was very sparse; how often do you actually give thumbs-down feedback? And implicit feedback is often imprecise: if you get a recommendation but don't act on it, is it because you didn't like it, because it's not the right time, or maybe because your wife works there and you don't want to work at the same company as your wife?

Their solution was a lightweight classifier to filter bad recs. I'll tell you why I really like this paper from Indeed: they didn't just share their successes, they shared the entire process of how they got there, and it was fraught with challenges. The first thing that makes me like it a lot is that they started with evals. They had their experts label job-recommendation-and-user pairs: from the user you have resume data and activity data, and the experts judged, is this recommendation a good fit?
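Mechanically, "starting with evals" means holding a set of expert-labeled (user, job) pairs fixed and scoring every candidate labeler against it. A minimal sketch, with purely hypothetical labels, might look like this:

```python
from sklearn.metrics import precision_score, recall_score

def score_llm_judge(expert_labels: list[int], llm_labels: list[int]) -> dict:
    """Compare an LLM's 'is this job a bad fit?' judgments (1 = bad fit)
    against expert labels on the same (user, job) pairs."""
    return {
        "precision": precision_score(expert_labels, llm_labels),
        "recall": recall_score(expert_labels, llm_labels),
    }

# Illustrative usage: expert labels vs. labels from some candidate model.
expert = [1, 0, 1, 1, 0, 0, 1, 0]
candidate = [1, 0, 0, 1, 0, 1, 1, 0]
print(score_llm_judge(expert, candidate))  # {'precision': 0.75, 'recall': 0.75}
```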
They then prompted open LLMs, Mistral and Llama 2. Unfortunately, their performance was very poor: the models couldn't really pay attention to what was in the resume and what was in the job description, even though they had sufficient context length, and the output was very generic. To get it to work, they prompted GPT-4, and GPT-4 worked really well, with around 90% precision and recall. However, it was very costly (they didn't share the actual cost) and too slow, about 22 seconds. Okay, if GPT-4 is too slow, what can we do? Let's try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision.
What does this mean? Of the recommendations it said were bad, only 63% were actually bad, which means they would be throwing out 37% of the filtered recommendations, about one-third, that were actually fine. For a company that runs on recommendations, where people get hired through your recommendations, throwing away one-third of good recommendations is not acceptable; precision was their guardrail metric here.

So what they did next was fine-tune GPT-3.5. You can see the entire journey: open models, then GPT-4, then GPT-3.5, and now fine-tuned GPT-3.5. The fine-tuned GPT-3.5 got the precision they wanted, at roughly a quarter of GPT-4's cost and latency. Unfortunately, it was still too slow, about 6.7 seconds, which would not work in an online filtering system. So they distilled a lightweight classifier on the fine-tuned GPT-3.5's labels, and this lightweight classifier achieved very high performance, specifically 0.86 AUC-ROC. The numbers may not mean much to you, but suffice to say that in an industrial setting this is pretty good. They didn't mention the exact latency, but it was good enough for real-time filtering, I think under 200 milliseconds or so.
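As a rough illustration of that distillation step, here is a minimal sketch: a cheap student model is trained on labels produced by the fine-tuned LLM, then used as an online filter. The features, labels, model choice, and threshold are all illustrative assumptions; the talk does not say what Indeed's classifier actually looks like.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Assume each (user, job) pair has already been featurized, e.g. from resume/job
# embeddings and activity features, and that `y` holds the fine-tuned LLM's
# judgments (1 = bad fit). Both features and labels here are synthetic stand-ins.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 64))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# A lightweight student model: cheap enough to run inside an online filtering path.
student = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUC-ROC:", roc_auc_score(y_val, student.predict_proba(X_val)[:, 1]))

# At serving time, drop a recommendation if the predicted probability of
# "bad fit" exceeds a threshold tuned for precision (the guardrail metric).
def keep_recommendation(features: np.ndarray, threshold: float = 0.8) -> bool:
    return student.predict_proba(features.reshape(1, -1))[0, 1] < threshold
```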
The outcome was that they were able to cut out bad recommendations by about 20%. Initially, they had hypothesized that cutting down recommendations, even bad ones, would lead to fewer applications. It's like sending out links: even clickbait links that are bad still get clicks, so they expected that removing recommendations, even bad ones, would lower the application rate. But that was not the case. In fact, because the recommendations were now better, the application rate actually went up by 4%, and the unsubscribe rate went down by 5%. That's quite a lot. Essentially, what this means is that in recommendations, quantity is not everything; quality makes a big difference, and here it moved the needle by 5%.

The next example I want to share with you is Spotify.
Who here knows that Spotify has podcasts and audiobooks? Oh, okay, I guess you're not the target audience for this use case. Spotify is really known for songs and artists; a lot of their users just search for songs and artists, and they're very good at that. But when they started introducing podcasts and audiobooks, how do you help your users know these new items are available? There's a huge cold-start problem, and now it's not only cold start on items, it's cold start on an entire category. How do you grow a new category within your service? Exploratory search was essential to the business for expanding beyond music: Spotify doesn't want to just do music and songs, they want to do audio.

The solution is a query recommendation system. First, how did they generate new queries? They had a bunch of ideas using existing data: extract them from catalog titles and playlist titles, mine them from the search logs, or just take an artist name and append "cover" to it. Now you might be wondering, where's the LLM in this? The LLM is used to generate natural-language queries. It might not sound sexy, but it works really well: take the conventional techniques you already have that work well, and use the LLM to augment them where you need it. Don't use the LLM for everything from the start. So now they have these exploratory queries.
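To make the "conventional candidates plus LLM augmentation" pattern concrete, here is a small, hypothetical sketch. The candidate sources mirror the ones mentioned in the talk; the `generate_fn` LLM call, the scores, and the ranking are placeholder assumptions, not Spotify's system.

```python
from dataclasses import dataclass

@dataclass
class QueryCandidate:
    text: str
    source: str   # "catalog", "playlist", "search_log", "artist_cover", "llm"
    score: float  # placeholder relevance/popularity score

def mine_candidates(catalog_titles, playlist_titles, search_log, artists):
    """Exploratory query candidates from existing data, as described in the talk."""
    cands = [QueryCandidate(t, "catalog", 1.0) for t in catalog_titles]
    cands += [QueryCandidate(t, "playlist", 1.0) for t in playlist_titles]
    cands += [QueryCandidate(q, "search_log", 1.0) for q in search_log]
    cands += [QueryCandidate(f"{a} cover", "artist_cover", 0.8) for a in artists]
    return cands

def llm_generate_queries(seed_topics, generate_fn):
    """Augment with LLM-written natural-language queries. `generate_fn` is a
    stand-in for whatever LLM call you use; it returns query strings for a topic."""
    out = []
    for topic in seed_topics:
        out += [QueryCandidate(q, "llm", 0.7) for q in generate_fn(topic)]
    return out

def recommend_queries(mined, generated, k=5):
    """Dedupe and rank the blended pool; ranking here is just the placeholder score."""
    seen, pool = set(), []
    for c in sorted(mined + generated, key=lambda c: c.score, reverse=True):
        if c.text.lower() not in seen:
            seen.add(c.text.lower())
            pool.append(c)
    return pool[:k]
```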
When you search for something, you still get the immediate result hits. You take those, add the immediate results, and then rank these new queries alongside them. That's why, when you do a search, this is the UX you'll probably see today (I got this from the paper; it may have changed recently): you still see the item results at the bottom, but at the top, with the query recommendations, this is how Spotify informs users, without needing a banner, that now we have audiobooks, now we have podcasts. You search for something and it tells you these new categories exist.

The benefit here was plus 9% exploratory queries. Essentially, one-tenth of their users are now exploring the new products. Imagine one-tenth of your users exploring your new products every day; how quickly could you grow a new product category? It compounds like 1.1 to the power of N, so it grows pretty fast.

Long story short, I don't have to tell you about the benefits of LLM-augmented synthetic data: richer, higher-quality data at scale, even on the tail queries and tail items, at far lower cost and effort than is possible with human annotation. Later we also have a talk from Instacart, who will tell us how they use LLMs to improve their search system.
The last thing I want to share is this challenge: right now, in a typical company, the systems for ads, for recommendations, and for search are all separate. Even within recommendations, the model for homepage recommendations, the model for item recommendations, the model for add-to-cart recommendations, and the model for thank-you-page recommendations may all be different models. You can imagine you end up with many, many models, while leadership expects you to keep the same headcount. So how do you get around this? You have duplicative engineering pipelines, a lot of maintenance cost, and improving one model doesn't naturally transfer to improvements in another.

The solution is unified models. It works for vision, it works for language, so why not recommendation systems? And we've been doing this for a while; this is not new. As an aside, maybe the text is too small, but this is a tweet from Stripe, where they built a transformer-based payments fraud model. Even for payments, for a sequence of payments, you can build a transformer-based foundation model.
I want to share an example of a unified ranker for search and recsys at Netflix. The problem, as I mentioned: they had teams building bespoke models for search, for similar-video recommendations, and for pre-query recommendations (what you see on the search page before you ever enter a search query). That means high operational cost and missed opportunities to learn across tasks. Their solution is a unified ranker, which they call the unified contextual ranker, UniCoRn. At the bottom you can see the user foundation model, into which you put the user's watch history, and the context and relevance model, into which you put the context of the videos and what they've watched.

The thing about this unified model is that it takes a unified input. If you can find a data schema that all your use cases and all your features can share, you can adopt an approach like this, which is similar to multi-task learning. The input is the user ID; the item ID (the video, drama, or series); the search query, if one exists; the country; and the task. They have many different tasks; in the paper's example there are three: search, pre-query, and more-like-this. What they then did was a very smart imputation of missing inputs. For example, if you're doing item-to-item recommendation, you've just finished watching a video and want to recommend the next one, so there is no search query. How do you impute it? You simply use the title of the current item and try to find similar items.
current item and try to find similar items. The outcome of this is that this unified model was able to 00:16:26.300 |
match or exceed the metrics of their specialized models on multiple tasks. Think about it. I mean, 00:16:32.940 |
it doesn't seem very impressive, right? It may not seem very impressive. Match or exceed. It might seem, 00:16:37.340 |
we did all this work just to match. But imagine unifying all of it, like removing the tech debt 00:16:42.620 |
and building a better foundation for your future iterations. It's going to make you iterate faster. 00:16:47.340 |
The last example I want to share with you is unified embeddings at Etsy. You might think embeddings are not very sexy, but this paper from Etsy is really outstanding in what it shares about both the model architecture and the system. The problem they had was: how can you help users get better results from both very specific queries and very broad queries? And bear in mind that Etsy's inventory is constantly changing; they don't have the same products all year round, it's very homegrown. So you might query something like "Mother's Day gift", which would match very few items; I think very few items would have "Mother's Day gift" in their description or title. The other problem is that keyword-based approaches like lexical retrieval don't account for user preferences.

How did they address this? With unified embedding and retrieval. If you remember, at the start of my presentation I talked about the Kuaishou two-tower model: there's a user tower and an item tower. We see the same pattern again here. On one side you see the product tower, the product encoder. They encode the product with T5 models for text embeddings (the item descriptions), as well as query-product logs for query embeddings: what query was made, and what product was eventually clicked or purchased. On the left you see the query encoder, the search-query encoder. The two towers share encoders for the text tokens, for the product category (which is a token of its own), and for location, which means the embedding can match a user to the location of the product itself. And to personalize this, they encode user preferences via the user features at the bottom: essentially, what queries the user searched for, what they bought previously, all their preferences.
They also shared their system architecture. Over here are the product encoder and the query encoder from the previous slide. What's very interesting is that they added a quality vector, because they wanted to ensure that whatever was retrieved was actually of good quality in terms of ratings, freshness, and conversion rate. What they did is simply concatenate this quality vector to the product embedding vector. But when you do that, you have to expand the query vector by the same number of dimensions so that you can still take a dot product or cosine similarity. So essentially, they just slapped a constant vector onto the query embedding, and it just works.
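A tiny numpy sketch of that concatenation trick, with illustrative dimensions and pad value:

```python
import numpy as np

def score(query_emb: np.ndarray, product_emb: np.ndarray, quality: np.ndarray,
          pad_value: float = 1.0) -> float:
    """Dot-product score with a quality vector concatenated to the product side.

    The product embedding is extended with quality features (e.g. ratings,
    freshness, conversion rate); the query embedding is padded with a constant
    so the two vectors keep the same dimensionality. The padding constant and
    dimensions here are illustrative, not Etsy's actual values.
    """
    product_full = np.concatenate([product_emb, quality])
    query_full = np.concatenate([query_emb, np.full(quality.shape, pad_value)])
    return float(query_full @ product_full)

# Toy example: a 4-d embedding plus a 3-d quality vector.
q = np.array([0.1, 0.3, -0.2, 0.5])
p = np.array([0.2, 0.1, -0.1, 0.4])
quality = np.array([0.9, 0.7, 0.05])   # e.g. rating, freshness, conversion rate
print(score(q, p, quality))
```

Because the query side of the extra dimensions is constant, the quality features contribute a fixed linear boost to every product's score, which is why appending a constant works without changing the retrieval machinery.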
The result: a 2.6% increase in conversion across the entire site, which is quite crazy, and more than a 5% increase in search purchases, meaning if you search for something, the purchase rate increases by over 5%. These are very, very good results for e-commerce.
So, the benefits of unified models: you simplify the system, and whatever you build to improve one tower or one model also improves the other use cases that share the unified model. That said, there may also be an alignment tax. You may find that when you try to compress all twelve of your use cases into a single unified model, you need to split it into two or three separate unified models, because of that alignment tax: getting better at one task actually makes another task worse. We have a talk from LinkedIn in this afternoon's block, the last talk of the block, and then a talk from Netflix, who will share their unified model at the start of the next block.

All right, the three takeaways I have for you to think about and consider: semantic IDs, data augmentation, and unified models. And of course, stay tuned for the rest of the talks in this track. Okay, that's it. Thank you.