
Recsys Keynote: Improving Recommendation Systems & Search in the Age of LLMs - Eugene Yan, Amazon


Chapters

0:00 Introduction to Language Modeling in Recommendation Systems
1:31 Challenge 1: Hash-based Item IDs
2:14 Solution: Semantic IDs
5:37 Challenge 2: Data Augmentation and Quality
6:10 Solution: LLM-Augmented Synthetic Data
6:21 Indeed Case Study
10:37 Spotify Case Study
13:34 Challenge 3: Separate Systems and High Operational Costs
14:24 Solution: Unified Models
14:51 Netflix Case Study (Unicorn)
16:46 Etsy Case Study (Unified Embeddings)
20:26 Key Takeaways


00:00:00.500 | Hi everyone, thank you for joining us in today's RecSys, the inaugural RecSys track at the AI
00:00:19.960 | Engineer World's Fair. So today what I want to share is what the future might look
00:00:25.280 | like when we try to merge recommendation systems and language models. So my wife looked
00:00:32.360 | at my slides and she's like, they're so plain. So therefore, I'll be giving a talk together
00:00:36.800 | with Latte and Mochi. You might have seen Mochi wandering the halls around somewhere,
00:00:40.280 | but there'll be a lot of doggos throughout these slides. I hope you enjoy. First, language
00:00:45.100 | modeling techniques are not new in recommendation systems. I mean, it started with Word2Vec in
00:00:50.560 | 2013. We started learning item embeddings from co-occurrences in user interaction sequences.
00:00:57.640 | And then after that, we started using GRU4Rec. I don't know who here remembers recurrent
00:01:02.160 | neural networks, gated recurrent units. Yeah. So those were very short-term, and we predicted
00:01:07.440 | the next item from a short sequence. Then of course, transformers and attention came
00:01:12.980 | about, and we became better at attending to long-range dependencies. So that's where we
00:01:18.660 | started asking: hey, you know, can we just process everything in the user sequence, hundreds to
00:01:23.100 | 2,000 item IDs long, and try to learn from that?
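To make "language modeling over user interaction sequences" concrete, here is a minimal next-item prediction sketch in PyTorch. The vocabulary size, dimensions, and two-layer encoder are illustrative assumptions, not any production setup from the talk.

```python
# Treat a user's interaction history as a token sequence and predict the next
# item with a causal transformer over item IDs. Toy sizes throughout.
import torch
import torch.nn as nn

n_items, d_model, max_len = 50_000, 64, 2000  # up to ~2,000 item IDs per user

item_emb = nn.Embedding(n_items, d_model)
pos_emb = nn.Embedding(max_len, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(d_model, n_items)  # scores over the item vocabulary

def next_item_logits(item_ids: torch.Tensor) -> torch.Tensor:
    # item_ids: (batch, seq_len) item IDs from the user's interaction history
    seq_len = item_ids.size(1)
    x = item_emb(item_ids) + pos_emb(torch.arange(seq_len, device=item_ids.device))
    causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
    h = encoder(x, mask=causal_mask)  # each position attends only to earlier items
    return head(h[:, -1])             # logits for the item that comes next

logits = next_item_logits(torch.randint(0, n_items, (8, 20)))  # (8, n_items)
```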
00:01:29.160 | And of course, now, today in this track, I want to share with you three ideas that I think are worth thinking about. Semantic
00:01:33.480 | IDs, data augmentation, and unified models. So the first challenge we have is hash-based item
00:01:40.580 | IDs. Who here works on recommendation systems? So you probably know that hash-based item IDs
00:01:47.380 | actually don't encode the contents of the item itself. And then the problem is that every
00:01:51.620 | time you have a new item, you suffer from the cold start problem, which is that you have
00:01:56.000 | to relearn about this item all over again. And then there's also sparsity, right,
00:02:02.000 | whereby you have a long tail of items that have maybe one or two interactions, or even up to
00:02:05.580 | 10, but it's just not enough to learn from. So recommendation systems have this issue of being
00:02:09.880 | very popularity-biased. And they just struggle with cold start and sparsity. So the solution
00:02:14.880 | is semantic IDs that may even involve multimodal content. So here's an example of trainable multimodal
00:02:22.300 | semantic IDs from Kuaishou. So Kuaishou is kind of like TikTok or Xiaohongshu, it's a short video
00:02:27.980 | platform in China. I think it's the number two short video platform. You might have used their
00:02:31.900 | text-to-video model, Kling, which they released sometime last year. So the problem they had,
00:02:36.860 | you know, being a short video platform, users upload hundreds of millions of short videos every day.
00:02:42.220 | And it's really hard to learn from these short videos. So how can we combine static content embeddings
00:02:47.660 | with dynamic user behavior? Here's how they did it with trainable multimodal semantic IDs. So I'm
00:02:54.700 | going to go through each step here. So this is the Kuaishou model. It's a standard two tower network.
00:03:01.100 | On the left, this is the embedding layer for the user, which is a standard sequence of IDs and the user ID.
00:03:10.060 | And on the right is the embedding layer for the item IDs. So these are fairly standard.
00:03:14.300 | What's new here is that they now take in content input. So all of these slides will be available
00:03:20.940 | online. Don't worry about it. I'll make them available immediately after this. And to encode visuals,
00:03:27.740 | they use ResNet. To encode video descriptions, they use BERT. And to encode audio, they use VGGish.
00:03:34.940 | Now, the trick is this. When you have these encoder models, it's very hard to backpropagate and try to
00:03:41.420 | update these encoder model embeddings. So what did they do? Well, firstly, they took all these content
00:03:46.700 | embeddings, and then they just concatenated them together. I know it sounds crazy, right? But they just
00:03:51.260 | concatenated them together. Then they learned cluster IDs. So I think they shared in the paper, they had like
00:03:58.060 | a hundred million short videos, and they learned, just via k-means clustering, a thousand cluster IDs.
00:04:04.140 | So that's what you see over there in the model encoder, which is in the boxes at the bottom,
00:04:08.940 | which is the cluster IDs. So above the cluster IDs, you have the non-trainable embeddings. Below that,
00:04:15.260 | you have the trainable cluster IDs, which are then all mapped to their own embedding table.
00:04:20.460 | So the trick here is this: as you train the model, the model encoder learns to map
00:04:25.900 | the content space, via the cluster IDs mapped to the embedding table, to the behavioral space.
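Here is a minimal sketch of that concatenate, cluster, and embed pattern, assuming frozen content encoders and toy sizes (the talk describes roughly a thousand clusters learned over about a hundred million videos); it illustrates the idea rather than Kuaishou's actual implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Frozen content embeddings per video, e.g. ResNet (visual), BERT (description),
# VGGish (audio). Random arrays stand in for real encoder outputs; toy counts.
n_videos = 5_000
visual = np.random.randn(n_videos, 2048).astype("float32")
text = np.random.randn(n_videos, 768).astype("float32")
audio = np.random.randn(n_videos, 128).astype("float32")

# Step 1: concatenate the frozen content embeddings.
content = np.concatenate([visual, text, audio], axis=1)

# Step 2: k-means over the concatenated embeddings to get cluster IDs
# (the talk mentions ~1,000 clusters over ~100M videos; 100 here for speed).
cluster_ids = KMeans(n_clusters=100, n_init=10, random_state=0).fit(content).labels_

# Step 3: the cluster IDs index a *trainable* embedding table, so gradients from
# the click/like objective shape the semantic-ID space even though the content
# encoders themselves stay frozen.
semantic_id_table = nn.Embedding(num_embeddings=100, embedding_dim=64)
batch = torch.as_tensor(cluster_ids[:32], dtype=torch.long)
semantic_id_emb = semantic_id_table(batch)  # fed into the item tower
```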
00:04:31.340 | So the output is this. These semantic IDs not only outperform regular hash-based IDs on clicks and likes,
00:04:40.460 | right? Like, that's pretty standard. But what they were able to do was they were able to increase
00:04:44.860 | cold-start coverage, which is: of a hundred videos that you show, how many of them are new?
00:04:50.540 | They were able to increase it by 3.6%. And they also increased cold-start velocity, which is:
00:04:55.900 | how many new videos were able to hit some threshold of views? They did not share what the
00:05:01.980 | threshold was, but being able to increase cold-start coverage and cold-start velocity by these numbers is pretty
00:05:07.020 | outstanding. So, long story short, the benefits of semantic IDs: you can address cold start with the
00:05:13.500 | semantic ID itself, and now your recommendations understand content. So later in the talk,
00:05:18.620 | we're going to see some amazing sharing from Pinterest and YouTube. And in the YouTube one,
00:05:24.460 | you see how they actually blend language models with semantic IDs, whereby the model can actually explain
00:05:31.420 | why you might like an item, because it understands the semantic ID, and it's able to give
00:05:35.900 | human-readable explanations, and vice versa. Now, the next question, and I'm sure
00:05:42.220 | everyone here has this challenge. The lifeblood of machine learning is data, good quality data at
00:05:49.340 | scale. And this is essential for search, and of course recommendation systems, but for search it's
00:05:54.780 | actually even more important. You need a lot of metadata, you need a lot of query expansion, synonyms,
00:06:01.340 | you need spell checking, you need all sorts of metadata to attach to your search index. But this is very
00:06:07.980 | costly and high effort to get. In the past, we used to do it with human annotations, or maybe you can try
00:06:11.820 | to do it automatically. But LLMs have been outstanding at this. And I'm sure everyone here is sort of doing
00:06:17.980 | doing this to some extent, using LLMs for synthetic data and labels. But I want to share with you two
00:06:22.540 | examples from Spotify and Indeed. Now the Indeed paper, I really like it a lot. So the problem that they
00:06:32.860 | were trying to face is that they were sending job recommendations to users via email. But some of
00:06:39.340 | these job recommendations were bad. They were just not a good fit for the user, right? So they had poor user
00:06:44.380 | experience and then users lost trust in the job recommendations. And how they would indicate
00:06:49.260 | that they had lost trust was: these job recommendations are not a good fit for me, I'm just going to
00:06:53.340 | unsubscribe. Now the moment a user unsubscribes from your feed or from your newsletter, it's very, very
00:06:58.780 | hard to get them back. Almost impossible. So while they had explicit negative feedback, thumbs up and
00:07:04.380 | thumbs down, it was very sparse. How often would you actually give thumbs-down feedback? Rarely. And
00:07:09.100 | implicit feedback is often imprecise. What do I mean? If you get some recommendations, but you actually
00:07:14.220 | don't act on it, is it because you didn't like it? Or is it because it's not the right time? Or maybe
00:07:19.740 | your wife works there and you don't want to work in the same company as your wife? So the solution they had
00:07:25.260 | was a lightweight classifier to filter bad recs. And I'll tell you why I really like this paper from Indeed,
00:07:31.180 | in the sense that they didn't just share their successes, but they shared the entire process and how they
00:07:35.900 | got there. And it was fraught with challenges. Well, of course, the first thing that
00:07:40.700 | made me really like it a lot was that they started with evals. So they had their experts label job
00:07:47.900 | recommendation and user pairs. And from the user, you have their resume data, you have their activity data,
00:07:53.980 | and they tried to see, hey, you know, is this recommendation a good fit? Then they prompted
00:08:05.260 | open LLMs, Mistral and Llama 2. Unfortunately, their performance was very poor. These models couldn't
00:08:05.260 | really pay attention to what was in the resume and what was in the job description, even though they had
00:08:10.780 | sufficient context length. And the output was just very generic. So to get it to work, they prompted
00:08:17.580 | GPT-4. And GPT-4 worked really well. Specifically, GPT-4 had like 90% precision and recall. However,
00:08:24.780 | it was very costly. They didn't share the actual cost, and it was also too slow: 22 seconds. Okay,
00:08:29.900 | if GPT-4 is too slow, what can we do? Let's try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision.
00:08:37.820 | What does this mean? In the sense that of the recommendations that it said were bad,
00:08:44.620 | only 63% of them were actually bad. What this means is that 37% of what they were throwing out,
00:08:50.860 | about one-third, was actually good. And for a company that thrives on recommendations, where people
00:08:54.860 | get recruited through your recommendations, throwing out one-third of recommendations that are actually good is
00:09:00.780 | unacceptable. Precision was their key guardrail metric here. So what they did then is they
00:09:07.500 | fine-tuned GPT-3.5. So you can see the entire journey, right? Open models, GPT-4, GPT-3.5,
00:09:13.180 | now fine-tuning GPT-3.5. The fine-tuned GPT-3.5 got the precision they wanted, at one
00:09:19.260 | quarter of GPT-4's cost and latency, right? But unfortunately, it was still too slow. It was about
00:09:24.460 | 6.7 seconds, and this would not work in an online filtering system. So therefore, what they did was they
00:09:30.300 | distilled a lightweight classifier on the fine-tuned GPT-3.5 labels. And this lightweight classifier was able to
00:09:37.180 | achieve very high performance, specifically 0.86 AUC-ROC. I mean, the numbers may not make sense to
00:09:44.700 | you, but suffice to say that in an industrial setting, this is pretty good. And of course,
00:09:48.540 | they didn't mention the latency, but it was good enough for real-time filtering. I think less than
00:09:51.900 | 200 milliseconds or something.
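Here is a minimal sketch of that distillation step, with random arrays standing in for the LLM-labeled (job, seeker) pairs; the feature set, model choice, and threshold are assumptions for illustration, since those details weren't shared.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Offline: the fine-tuned LLM labels (job, seeker) pairs as good fit (1) / bad fit (0).
# Random placeholders stand in for pair features such as resume/job embedding
# similarity, title overlap, or location match.
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 32))           # pair features
y = (rng.random(50_000) < 0.8).astype(int)  # LLM-provided labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The "student": small enough to score a recommendation in well under 200 ms.
student = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC-ROC vs. LLM labels:", roc_auc_score(y_te, student.predict_proba(X_te)[:, 1]))

def filter_recs(pair_features: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # Online: drop recommendations the student scores as likely bad fits before
    # the email goes out. The 0.5 cutoff is arbitrary; in practice it would be
    # tuned against the precision guardrail described above.
    keep = student.predict_proba(pair_features)[:, 1] >= threshold
    return pair_features[keep]
```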
00:09:58.540 | So the outcome of this was that they were able to reduce bad recommendations, cutting them by about 20%.
00:10:03.740 | Initially, they had hypothesized that by cutting down recommendations, even though they were bad,
00:10:07.900 | they would get fewer applications. It's just like sending out links, right? You might have links
00:10:11.900 | that are clickbait. Even though they are bad, people still click on them. And they thought that even if the
00:10:15.580 | recommendations they cut were bad, they would get a lower application rate. But this was not
00:10:19.980 | the case. In fact, because the recommendations were now better, application rate actually went up by 4%.
00:10:27.260 | And unsubscribe rate went down by 5%. That's quite a lot. So essentially, what this means is that in
00:10:32.380 | recommendations, quantity is not everything. Quality makes a big difference, and quality here moves the
00:10:37.260 | needle quite a bit by 5%. The next example I want to share with you is Spotify. So who here knows that
00:10:44.460 | Spotify has podcasts and audiobooks? Oh, okay. I guess you guys are not the target audience for this use case.
00:10:51.900 | So Spotify is really known for songs and artists, and a lot of their users just search for songs and artists,
00:10:57.180 | and they're very good at that. But when they started introducing podcasts and audio books,
00:11:02.620 | how would you help your users know that, you know, these new items are available? And of course,
00:11:06.940 | there's a huge cold-start problem. Now it's not only cold start on the item, it's now cold start on the category.
00:11:13.020 | How do you start growing a new category within your service? And of course, exploratory search was
00:11:19.900 | essential to the business, right, for expanding beyond music. Spotify doesn't want to just do music
00:11:25.260 | and songs. They now want to be doing audio. So the solution to that is a query recommendation system.
00:11:32.780 | So first, how did they generate new queries? Well, they have a bunch of
00:11:39.340 | sources: queries extracted from catalog titles and playlist titles, queries mined from the search
00:11:43.820 | logs, or you just take the artist and then you add "cover" to it. And this is what
00:11:49.340 | they use from existing data. Now you might be wondering, where's the LLM in this? Well, the LLM is used to
00:11:56.780 | generate natural language queries. So this might not be sexy, but it works really well, right? Take whatever you
00:12:02.620 | have with conventional techniques that work really well, and use the LLM to augment it when you need
00:12:07.100 | it. Don't use the LLM for everything at the start. So now they have these exploratory queries.
00:12:12.940 | When you search for something, you still get the immediate result hits, right? So you take all of this,
00:12:20.940 | you add the immediate results, and then you rank these new queries.
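Here is a minimal sketch of that pattern: conventional candidate generation first, an optional LLM call for natural-language queries, and then ranking the exploratory queries alongside the immediate results. The function names and scoring hook are illustrative assumptions, not Spotify's implementation.

```python
from typing import Callable, Iterable, Optional

def candidate_queries(catalog_titles: Iterable[str],
                      playlist_titles: Iterable[str],
                      mined_log_queries: Iterable[str],
                      artists: Iterable[str],
                      llm_generate: Optional[Callable[[str], list]] = None) -> set:
    # Conventional sources first: catalog titles, playlist titles, search logs,
    # and artist + "cover".
    candidates = set(catalog_titles) | set(playlist_titles) | set(mined_log_queries)
    candidates |= {f"{artist} cover" for artist in artists}
    if llm_generate is not None:
        # The LLM only augments: it turns seed items into natural-language queries.
        for seed in list(candidates)[:100]:
            candidates |= set(llm_generate(f"Natural-language search queries for: {seed}"))
    return candidates

def search_page(query: str, immediate_hits: list,
                exploratory: set, score: Callable[[str, str], float]) -> dict:
    # Query recommendations at the top, the usual item results below - the way
    # the talk describes surfacing podcasts and audiobooks without a banner.
    ranked = sorted(exploratory, key=lambda q: score(query, q), reverse=True)
    return {"query_recommendations": ranked[:5], "item_results": immediate_hits}
```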
00:12:28.940 | So this is why, when you do a search, this is the UX that you're probably going to get right now. I got this from the paper. It may have
00:12:32.300 | changed recently. So you still see the item results at the bottom. But at the top, with the query
00:12:37.260 | recommendations, this is how Spotify informs users without having a banner. Now we have audio books,
00:12:43.180 | now we have podcasts, right? You search for something, it actually informs you that we have
00:12:46.860 | these new categories. The benefit here is plus 9% exploratory queries. Essentially, one-tenth of
00:12:54.460 | their users are now exploring their new products. So imagine that: one-tenth, every day, exploring their
00:13:01.420 | new products. How quickly would you be able to grow your new product category, right? It's actually 1.1 to
00:13:07.820 | the power of N. It will grow pretty fast. Long story short, I don't have to tell you about the
00:13:13.020 | benefits of LLM-augmented synthetic data: richer, higher-quality data at scale, even on the tail queries
00:13:19.660 | and the tail items, and at far lower cost and effort than is even
00:13:24.780 | possible with human annotation. So later, we also have a talk from Instacart, who will tell us about how
00:13:30.380 | they use LLMs to improve their search system. Now the last thing I want to share is this challenge,
00:13:40.460 | whereby right now, in a regular company, the system for ads, for recommendations, for search,
00:13:49.340 | they're all separate systems. And even for recommendations, the model for homepage recommendations,
00:13:55.260 | the model for item recommendations, the model for add-to-cart recommendations, the model for the thank
00:13:59.900 | you page recommendations, they may all be different models, right? So you can imagine this: you're
00:14:04.380 | going to have many, many models, but leadership expects you to keep the
00:14:09.820 | same amount of headcount. So then how do you get around this, right? You have duplicative
00:14:13.820 | engineering pipelines, there's a lot of maintenance costs, and improving one model doesn't naturally
00:14:19.180 | transfer to the improvement in another model. So the solution for this is unified models, right? I mean,
00:14:25.580 | it works for vision, it works for language, so why not recommendation systems? And we've been
00:14:29.740 | doing this for a while, this is not new. As an aside, maybe the text is too small, but this is a tweet
00:14:35.900 | from Stripe, whereby they built a transformer-based payments fraud model, right? Even for payments,
00:14:42.700 | the sequence of payments, you can build a foundation model, which is transformer-based.
00:14:46.860 | So I want to share an example of the unified ranker for search and recs at Netflix, right?
00:14:53.420 | The problem, as I mentioned: they have teams building bespoke models for search, similar-video
00:14:59.180 | recommendations, and pre-query recommendations, like on the search page before you ever enter
00:15:02.780 | a search query. High operational costs, you know, and missed opportunities for learning across tasks.
00:15:06.540 | So their solution is a unified ranker, and they call it the unified contextual ranker,
00:15:12.460 | which is UNICORN. So you can see over here at the bottom, there's the user foundation model,
00:15:17.980 | and into it, you put the user's watch history. And then you also have the context and relevance model,
00:15:23.500 | where you put in the context of the videos and what they've watched.
00:15:26.140 | Now, the thing about this unified model is that it takes in unified input, right? So now, if you are able
00:15:34.300 | to find a data schema where all your use cases and all your features can use the same input, you can adopt
00:15:41.420 | an approach like this, which is similar to multi-task learning. So the input would be the user ID,
00:15:47.660 | the item ID, you know, the video or the drama or the series, the search query, if a search query exists,
00:15:53.900 | the country and the task. So of course, they have many different tasks. In this example,
00:15:58.140 | in the paper, they have three different tasks: search, pre-query, and more-like-this. Now,
00:16:03.580 | what they did then was a very smart imputation of missing inputs. So for example, if you are doing an
00:16:11.260 | item-to-item recommendation, you're just done watching this video, you want to recommend the next
00:16:14.540 | video, you will have no search query. How would you impute it? Well, you just simply use the title of the
00:16:19.420 | current item and try to find similar items.
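Here is a minimal sketch of a unified input schema with a task field and that query imputation step, in the spirit of what's described for UNICORN; the field names and dataclass are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class UnifiedInput:
    user_id: str
    item_id: str                  # the video / drama / series
    search_query: Optional[str]   # present only when a search query exists
    country: str
    task: str                     # e.g. "search", "pre_query", "more_like_this"

def impute_query(example: UnifiedInput, item_titles: Dict[str, str]) -> UnifiedInput:
    # For item-to-item tasks there is no search query, so impute it with the
    # title of the current item; "more like X" then behaves like searching for X.
    if example.search_query is None and example.task == "more_like_this":
        example.search_query = item_titles[example.item_id]
    return example

example = impute_query(
    UnifiedInput(user_id="u1", item_id="show_123", search_query=None,
                 country="US", task="more_like_this"),
    item_titles={"show_123": "Some Current Title"},
)
```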
00:16:26.300 | The outcome is that this unified model was able to match or exceed the metrics of their
00:16:32.940 | specialized models on multiple tasks. Think about it. It may not seem very impressive. Match or exceed? It might seem
00:16:37.340 | we did all this work just to match. But imagine unifying all of it: removing the tech debt
00:16:42.620 | and building a better foundation for your future iterations. It's going to make you iterate faster.
00:16:47.340 | The last example I want to share with you is unified embeddings at Etsy. So you might think that
00:16:52.620 | embeddings are not very sexy, but this paper from Etsy is really outstanding in what they share about
00:16:58.220 | their model architecture as well as their system. So the problem they had was,
00:17:02.620 | how can you help users get better results from very specific queries or very broad queries?
00:17:08.620 | And you know that Etsy's inventory is constantly changing. They don't have the same products all
00:17:13.340 | throughout, right? It's very homegrown. So now you might be querying for something like "Mother's Day gift",
00:17:18.380 | which would match very few items. I think very few items would have "Mother's Day gift" in their
00:17:23.900 | description or their title, right? And the other problem is that lexical retrieval
00:17:28.300 | doesn't account for user preferences.
00:17:33.260 | So how do you try to address this? The way they address this is with unified embedding and
00:17:39.660 | retrieval. So if you remember, at the start of my presentation, I talked about the Kuaishou two-tower model,
00:17:47.340 | right? There's the user tower, and then there's the item tower. We will see the same pattern again.
00:17:52.300 | And then over here, you see the product tower, right? This is the product encoder.
00:17:56.140 | So how they encode the product is that they use T5 models for text embeddings, right, of text item
00:18:01.820 | descriptions, as well as query-product logs for query embeddings: what was the query that was made,
00:18:07.740 | and what was the product that was eventually clicked or purchased. And then over here on the left,
00:18:12.700 | you see the query encoder, which is the search query encoder. And they both share encoders for the tokens,
00:18:19.660 | which are the text tokens, the product category, which is a token of itself, and the
00:18:25.500 | user location. So what this means is that now your embedding is able to match the user to the location
00:18:31.500 | of the product itself. And then of course, to personalize this, they encode user preferences
00:18:36.940 | via the query-user features at the bottom: essentially, what were the queries that the
00:18:41.100 | user searched for, what did they buy previously, all their preferences. Now, they also shared their
00:18:46.860 | system architecture. And over here, this is the product encoder from the previous slide,
00:18:51.340 | and the query encoder from the previous slide. But what's very interesting here is that they added a
00:18:56.140 | quality vector, because they wanted to ensure that whatever was searched and retrieved
00:19:01.420 | was actually of good quality in terms of ratings, freshness, and conversion rate.
00:19:05.980 | And you know, what they did is they just simply concatenated this quality vector to the product
00:19:10.780 | embedding vector. But when you do that, you have to expand the query vector
00:19:17.260 | by the same number of dimensions, so that you can still do a dot product or cosine similarity. So essentially,
00:19:21.820 | they just slapped a constant vector onto the query embedding, and it just works.
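Here is a minimal sketch of that quality-vector trick, with made-up dimensions and quality signals; it shows why padding the query embedding with a constant vector keeps the dot product aligned.

```python
import numpy as np

d_embed, d_quality = 256, 3  # e.g. ratings, freshness, conversion rate

product_emb = np.random.randn(d_embed).astype("float32")
quality = np.array([0.9, 0.7, 0.4], dtype="float32")   # hypothetical quality signals
product_vec = np.concatenate([product_emb, quality])   # (d_embed + d_quality,)

query_emb = np.random.randn(d_embed).astype("float32")
query_pad = np.ones(d_quality, dtype="float32")        # the constant vector
query_vec = np.concatenate([query_emb, query_pad])     # same width as product_vec

# Retrieval score: the padded dimensions add a quality bonus on top of the
# semantic match, without changing the query encoder at all.
score = float(np.dot(query_vec, product_vec))
```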
00:19:27.340 | The result: a 2.6% increase in conversion across the entire site. That's quite crazy. And more than a
00:19:34.220 | 5% increase in search purchases. If you search for something, the purchase rate increases by 5%.
00:19:38.940 | These are very, very good results for e-commerce.
00:19:43.820 | So the benefits of unified models: you simplify the system, and whatever you build to improve one side of
00:19:53.100 | the tower, to improve your model, also improves the other use cases that use this unified
00:19:59.260 | model. That said, there may also be an alignment tax. You may find that when you try to build this,
00:20:04.700 | trying to compress all 12 use cases into a single unified model, you may need to split it up into maybe
00:20:09.420 | two or three separate unified models, because that's just the alignment tax: trying to get better at
00:20:13.340 | one task actually makes the other task worse. We have a talk from LinkedIn in this afternoon block,
00:20:21.980 | the last talk of the block, and then we also have a talk from Netflix, which will be sharing about
00:20:25.820 | their unified model at the start of the next block. All right, the three takeaways I have for you,
00:20:30.700 | think about them, consider them: semantic IDs, data augmentation, and unified models.
00:20:39.580 | And of course, stay tuned for the rest of the talks in this track. Okay, that's it. Thank you.