RecSys Keynote: Improving Recommendation Systems & Search in the Age of LLMs - Eugene Yan, Amazon

Chapters
0:00 Introduction to Language Modeling in Recommendation Systems
1:31 Challenge 1: Hash-based Item IDs
2:14 Solution: Semantic IDs
5:37 Challenge 2: Data Augmentation and Quality
6:10 Solution: LLM-Augmented Synthetic Data
6:21 Indeed Case Study
10:37 Spotify Case Study
13:34 Challenge 3: Separate Systems and High Operational Costs
14:24 Solution: Unified Models
14:51 Netflix Case Study (Unicorn)
16:46 Etsy Case Study (Unified Embeddings)
20:26 Key Takeaways
Hi everyone, thank you for joining us at today's RecSys track, the inaugural RecSys track at the AI Engineer World's Fair. Today I want to share what the future might look like when we try to merge recommendation systems and language models. My wife looked at my slides and said they're so plain, so I'll be giving this talk together with Latte and Mochi. You might have seen Mochi wandering the halls somewhere; there will be a lot of doggos throughout these slides. I hope you enjoy them.

First, language modeling techniques are not new in recommendation systems. It started with Word2Vec in 2013, when we began learning item embeddings from co-occurrences in user interaction sequences. After that, we started using GRU4Rec; I don't know who here remembers recurrent neural networks and gated recurrent units. Those were very short-term: we predicted the next item from a short sequence. Then transformers and attention came along, and we got much better at attending to long-range dependencies. That's where we are now: can we just process everything in the user sequence, hundreds or even 2,000 item IDs long, and learn from that?

Today in this track, I want to share three ideas that I think are worth thinking about: semantic IDs, data augmentation, and unified models.

The first challenge we have is hash-based item IDs.
Who here works on recommendation systems? Then you probably know that hash-based item IDs don't encode the content of the item itself. The problem is that every time you have a new item, you suffer from the cold-start problem: you have to relearn everything about that item all over again. And then there's sparsity, where you have a long tail of items with maybe one or two interactions, or even up to ten, which is just not enough to learn from. So recommendation systems end up very popularity-biased, and they struggle with cold start and sparsity.

The solution is semantic IDs, which may even involve multimodal content. Here's an example of trainable multimodal semantic IDs from Kuaishou. Kuaishou is kind of like TikTok or Xiaohongshu: a short-video platform in China, I think the number two short-video platform. You might have used their text-to-video model, Kling, which they released sometime last year. The problem they had, being a short-video platform, is that users upload hundreds of millions of short videos every day, and it's really hard to learn from these new videos. So how can we combine static content embeddings with dynamic user behavior? Here's how they did it with trainable multimodal semantic IDs.
I'll go through each step. This is the Kuaishou model, a standard two-tower network. On the left is the embedding layer for the user, which takes a standard sequence of item IDs plus the user ID. On the right is the embedding layer for the item IDs. These are fairly standard. (All of these slides will be available online; I'll make them available immediately after this.)

What's new here is that they also take in content input. To encode visuals they use ResNet, to encode video descriptions they use BERT, and to encode audio they use VGGish. Now, the trick is this: with these pretrained encoder models, it's very hard to backpropagate through them and update their embeddings. So what did they do? First, they took all of these content embeddings and simply concatenated them together. I know it sounds crazy, but they just concatenated them. Then they learned cluster IDs: I think they shared in the paper that they had around a hundred million short videos, and via k-means clustering they learned a thousand cluster IDs. That's what you see in the model encoder, in the boxes at the bottom: the cluster IDs. Above the cluster IDs sit the non-trainable content embeddings; below them, the trainable cluster IDs, each mapped to its own embedding table. The trick is that, as you train the model, the model encoder learns to map the content space, via the cluster IDs and their embedding table, to the behavioral space.
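Here is a minimal sketch of that idea in Python: frozen content embeddings from several modalities are concatenated, clustered offline with k-means into cluster IDs, and each cluster ID is then looked up in a trainable embedding table inside the ranking model. The encoder outputs, dimensions, and corpus size below are illustrative assumptions, not Kuaishou's actual configuration.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# --- Offline step: build semantic (cluster) IDs from frozen content embeddings ---
# Assume we already ran frozen encoders (e.g. ResNet / BERT / VGGish) over each video.
num_videos = 10_000                        # illustrative; the paper mentions ~100M videos
visual = np.random.randn(num_videos, 512)  # stand-in for ResNet embeddings
text   = np.random.randn(num_videos, 768)  # stand-in for BERT embeddings
audio  = np.random.randn(num_videos, 128)  # stand-in for VGGish embeddings

# Concatenate the non-trainable content embeddings per item.
content = np.concatenate([visual, text, audio], axis=1)

# Learn cluster IDs with k-means (the talk mentions ~1,000 clusters).
kmeans = KMeans(n_clusters=1000, n_init=10, random_state=0).fit(content)
cluster_ids = kmeans.labels_               # one semantic ID per video

# --- Online step: trainable embeddings keyed by cluster ID ---
class ItemTower(nn.Module):
    """Item tower combining a hash-based item ID with a trainable
    semantic (cluster) ID embedding. A sketch, not the actual architecture."""
    def __init__(self, num_items: int, num_clusters: int = 1000, dim: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)         # behavioral ID embedding
        self.cluster_emb = nn.Embedding(num_clusters, dim)   # trainable semantic ID embedding

    def forward(self, item_id: torch.Tensor, cluster_id: torch.Tensor) -> torch.Tensor:
        # The cluster embedding is trained end to end with the ranking objective,
        # so it learns to bridge content space and behavioral space.
        return self.item_emb(item_id) + self.cluster_emb(cluster_id)

tower = ItemTower(num_items=num_videos)
item_vec = tower(torch.tensor([42]), torch.tensor([int(cluster_ids[42])]))
```

The key design choice is that the k-means step stays frozen and offline, while the cluster-ID embedding table is trained with the rest of the model; that is what lets content information flow into the behavioral space.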
The output is this: these semantic IDs not only outperform regular hash-based IDs on clicks and likes, which is fairly standard, but they also increased cold-start coverage, meaning, of a hundred videos you share, how many of them are new, by 3.6%. They also increased cold-start velocity, meaning how many new videos are able to hit some threshold of views. They didn't share what that threshold was, but being able to increase cold-start coverage and cold-start velocity by these numbers is pretty outstanding.

Long story short, the benefits of semantic IDs: you can address cold start with the semantic ID itself, and your recommendations now understand content. Later in the talk we'll see some amazing sharing from Pinterest and YouTube. In the YouTube one, you'll see how they blend language models with semantic IDs, so the model can actually explain why you might like an item, because it understands the semantic ID and can give human-readable explanations, and vice versa.
The next challenge, and I'm sure everyone here has it: the lifeblood of machine learning is data, good-quality data at scale. This is essential for search and, of course, for recommendation systems, but for search it matters even more. You need a lot of metadata: query expansion, synonyms, spell checking, all sorts of metadata attached to your search index. And it's very costly and high-effort to get. In the past we did it with human annotation, or maybe tried to do it automatically, but LLMs have turned out to be outstanding at this. I'm sure everyone here is doing this to some extent, using LLMs for synthetic data and labels. I want to share two examples, from Indeed and Spotify. I really like the Indeed paper a lot.
The problem Indeed faced is that they were sending job recommendations to users via email, but some of those recommendations were bad: they were just not a good fit for the user. That meant a poor user experience, and users lost trust in the job recommendations. The way users indicated they had lost trust was to say, these job recommendations are not a good fit for me, I'm just going to unsubscribe. And the moment a user unsubscribes from your feed or your newsletter, it's very, very hard to get them back, almost impossible. They had explicit negative feedback, thumbs up and thumbs down, but it was very sparse; how often do you actually give thumbs-down feedback? And implicit feedback is often imprecise: if you get a recommendation but don't act on it, is it because you didn't like it, because it's not the right time, or maybe because your wife works there and you don't want to work at the same company as your wife?

Their solution was a lightweight classifier to filter bad recs. I'll tell you why I really like this paper from Indeed: they didn't just share their successes, they shared the entire process of how they got there, and it was fraught with challenges. The first thing that makes me like it a lot is that they started with evals. They had their experts label job-recommendation-and-user pairs: from the user you have resume data and activity data, and the experts judged, is this recommendation a good fit?
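Mechanically, "starting with evals" means holding a set of expert-labeled (user, job) pairs fixed and scoring every candidate labeler against it. A minimal sketch, with purely hypothetical labels, might look like this:

```python
from sklearn.metrics import precision_score, recall_score

def score_llm_judge(expert_labels: list[int], llm_labels: list[int]) -> dict:
    """Compare an LLM's 'is this job a bad fit?' judgments (1 = bad fit)
    against expert labels on the same (user, job) pairs."""
    return {
        "precision": precision_score(expert_labels, llm_labels),
        "recall": recall_score(expert_labels, llm_labels),
    }

# Illustrative usage: expert labels vs. labels from some candidate model.
expert = [1, 0, 1, 1, 0, 0, 1, 0]
candidate = [1, 0, 0, 1, 0, 1, 1, 0]
print(score_llm_judge(expert, candidate))  # {'precision': 0.75, 'recall': 0.75}
```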
They then prompted open LLMs, Mistral and Llama 2. Unfortunately, their performance was very poor: the models couldn't really pay attention to what was in the resume and what was in the job description, even though they had sufficient context length, and the output was very generic. To get it to work, they prompted GPT-4, and GPT-4 worked really well, with around 90% precision and recall. However, it was very costly (they didn't share the actual cost) and too slow, about 22 seconds. Okay, if GPT-4 is too slow, what can we do? Let's try GPT-3.5. Unfortunately, GPT-3.5 had very poor precision.
What does this mean? Of the recommendations it said were bad, only 63% were actually bad, which means they would be throwing out 37% of the filtered recommendations, about one-third, that were actually fine. For a company that runs on recommendations, where people get hired through your recommendations, throwing away one-third of good recommendations is not acceptable; precision was their guardrail metric here.

So what they did next was fine-tune GPT-3.5. You can see the entire journey: open models, then GPT-4, then GPT-3.5, and now fine-tuned GPT-3.5. The fine-tuned GPT-3.5 got the precision they wanted, at roughly a quarter of GPT-4's cost and latency. Unfortunately, it was still too slow, about 6.7 seconds, which would not work in an online filtering system. So they distilled a lightweight classifier on the fine-tuned GPT-3.5's labels, and this lightweight classifier achieved very high performance, specifically 0.86 AUC-ROC. The numbers may not mean much to you, but suffice to say that in an industrial setting this is pretty good. They didn't mention the exact latency, but it was good enough for real-time filtering, I think under 200 milliseconds or so.
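As a rough illustration of that distillation step, here is a minimal sketch: a cheap student model is trained on labels produced by the fine-tuned LLM, then used as an online filter. The features, labels, model choice, and threshold are all illustrative assumptions; the talk does not say what Indeed's classifier actually looks like.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Assume each (user, job) pair has already been featurized, e.g. from resume/job
# embeddings and activity features, and that `y` holds the fine-tuned LLM's
# judgments (1 = bad fit). Both features and labels here are synthetic stand-ins.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 64))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# A lightweight student model: cheap enough to run inside an online filtering path.
student = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("AUC-ROC:", roc_auc_score(y_val, student.predict_proba(X_val)[:, 1]))

# At serving time, drop a recommendation if the predicted probability of
# "bad fit" exceeds a threshold tuned for precision (the guardrail metric).
def keep_recommendation(features: np.ndarray, threshold: float = 0.8) -> bool:
    return student.predict_proba(features.reshape(1, -1))[0, 1] < threshold
```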
The outcome was that they were able to cut out bad recommendations by about 20%. Initially, they had hypothesized that cutting down recommendations, even bad ones, would lead to fewer applications. It's like sending out links: even clickbait links that are bad still get clicks, so they expected that removing recommendations, even bad ones, would lower the application rate. But that was not the case. In fact, because the recommendations were now better, the application rate actually went up by 4%, and the unsubscribe rate went down by 5%. That's quite a lot. Essentially, what this means is that in recommendations, quantity is not everything; quality makes a big difference, and here it moved the needle by 5%.

The next example I want to share with you is Spotify.
Who here knows that Spotify has podcasts and audiobooks? Oh, okay, I guess you're not the target audience for this use case. Spotify is really known for songs and artists; a lot of their users just search for songs and artists, and they're very good at that. But when they started introducing podcasts and audiobooks, how do you help your users know these new items are available? There's a huge cold-start problem, and now it's not only cold start on items, it's cold start on an entire category. How do you grow a new category within your service? Exploratory search was essential to the business for expanding beyond music: Spotify doesn't want to just do music and songs, they want to do audio.

The solution is a query recommendation system. First, how did they generate new queries? They had a bunch of ideas using existing data: extract them from catalog titles and playlist titles, mine them from the search logs, or just take an artist name and append "cover" to it. Now you might be wondering, where's the LLM in this? The LLM is used to generate natural-language queries. It might not sound sexy, but it works really well: take the conventional techniques you already have that work well, and use the LLM to augment them where you need it. Don't use the LLM for everything from the start. So now they have these exploratory queries.
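To make the "conventional candidates plus LLM augmentation" pattern concrete, here is a small, hypothetical sketch. The candidate sources mirror the ones mentioned in the talk; the `generate_fn` LLM call, the scores, and the ranking are placeholder assumptions, not Spotify's system.

```python
from dataclasses import dataclass

@dataclass
class QueryCandidate:
    text: str
    source: str   # "catalog", "playlist", "search_log", "artist_cover", "llm"
    score: float  # placeholder relevance/popularity score

def mine_candidates(catalog_titles, playlist_titles, search_log, artists):
    """Exploratory query candidates from existing data, as described in the talk."""
    cands = [QueryCandidate(t, "catalog", 1.0) for t in catalog_titles]
    cands += [QueryCandidate(t, "playlist", 1.0) for t in playlist_titles]
    cands += [QueryCandidate(q, "search_log", 1.0) for q in search_log]
    cands += [QueryCandidate(f"{a} cover", "artist_cover", 0.8) for a in artists]
    return cands

def llm_generate_queries(seed_topics, generate_fn):
    """Augment with LLM-written natural-language queries. `generate_fn` is a
    stand-in for whatever LLM call you use; it returns query strings for a topic."""
    out = []
    for topic in seed_topics:
        out += [QueryCandidate(q, "llm", 0.7) for q in generate_fn(topic)]
    return out

def recommend_queries(mined, generated, k=5):
    """Dedupe and rank the blended pool; ranking here is just the placeholder score."""
    seen, pool = set(), []
    for c in sorted(mined + generated, key=lambda c: c.score, reverse=True):
        if c.text.lower() not in seen:
            seen.add(c.text.lower())
            pool.append(c)
    return pool[:k]
```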
When you search for something, you still get the immediate result hits. You take those, add the immediate results, and then rank these new queries alongside them. That's why, when you do a search, this is the UX you'll probably see today (I got this from the paper; it may have changed recently): you still see the item results at the bottom, but at the top, with the query recommendations, this is how Spotify informs users, without needing a banner, that now we have audiobooks, now we have podcasts. You search for something and it tells you these new categories exist.

The benefit here was plus 9% exploratory queries. Essentially, one-tenth of their users are now exploring the new products. Imagine one-tenth of your users exploring your new products every day; how quickly could you grow a new product category? It compounds like 1.1 to the power of N, so it grows pretty fast.

Long story short, I don't have to tell you about the benefits of LLM-augmented synthetic data: richer, higher-quality data at scale, even on the tail queries and tail items, at far lower cost and effort than is possible with human annotation. Later we also have a talk from Instacart, who will tell us how they use LLMs to improve their search system.
The last thing I want to share is this challenge: right now, in a typical company, the systems for ads, for recommendations, and for search are all separate. Even within recommendations, the model for homepage recommendations, the model for item recommendations, the model for add-to-cart recommendations, and the model for thank-you-page recommendations may all be different models. You can imagine you end up with many, many models, while leadership expects you to keep the same headcount. So how do you get around this? You have duplicative engineering pipelines, a lot of maintenance cost, and improving one model doesn't naturally transfer to improvements in another.

The solution is unified models. It works for vision, it works for language, so why not recommendation systems? And we've been doing this for a while; this is not new. As an aside, maybe the text is too small, but this is a tweet from Stripe, where they built a transformer-based payments fraud model. Even for payments, for a sequence of payments, you can build a transformer-based foundation model.
I want to share an example of a unified ranker for search and recsys at Netflix. The problem, as I mentioned: they had teams building bespoke models for search, for similar-video recommendations, and for pre-query recommendations (what you see on the search page before you ever enter a search query). That means high operational cost and missed opportunities to learn across tasks. Their solution is a unified ranker, which they call the unified contextual ranker, UniCoRn. At the bottom you can see the user foundation model, into which you put the user's watch history, and the context and relevance model, into which you put the context of the videos and what they've watched.

The thing about this unified model is that it takes a unified input. If you can find a data schema that all your use cases and all your features can share, you can adopt an approach like this, which is similar to multi-task learning. The input is the user ID; the item ID (the video, drama, or series); the search query, if one exists; the country; and the task. They have many different tasks; in the paper's example there are three: search, pre-query, and more-like-this. What they then did was a very smart imputation of missing inputs. For example, if you're doing item-to-item recommendation, you've just finished watching a video and want to recommend the next one, so there is no search query. How do you impute it? You simply use the title of the current item and try to find similar items.
current item and try to find similar items. The outcome of this is that this unified model was able to 00:16:26.300 |
match or exceed the metrics of their specialized models on multiple tasks. Think about it. I mean, 00:16:32.940 |
it doesn't seem very impressive, right? It may not seem very impressive. Match or exceed. It might seem, 00:16:37.340 |
we did all this work just to match. But imagine unifying all of it, like removing the tech debt 00:16:42.620 |
and building a better foundation for your future iterations. It's going to make you iterate faster. 00:16:47.340 |
The last example I want to share with you is unified embeddings at Etsy. You might think embeddings are not very sexy, but this paper from Etsy is really outstanding in what it shares about both the model architecture and the system. The problem they had was: how can you help users get better results from both very specific queries and very broad queries? And bear in mind that Etsy's inventory is constantly changing; they don't have the same products all year round, it's very homegrown. So you might query something like "Mother's Day gift", which would match very few items; I think very few items would have "Mother's Day gift" in their description or title. The other problem is that keyword-based approaches like lexical retrieval don't account for user preferences.

How did they address this? With unified embedding and retrieval. If you remember, at the start of my presentation I talked about the Kuaishou two-tower model: there's a user tower and an item tower. We see the same pattern again here. On one side you see the product tower, the product encoder. They encode the product with T5 models for text embeddings (the item descriptions), as well as query-product logs for query embeddings: what query was made, and what product was eventually clicked or purchased. On the left you see the query encoder, the search-query encoder. The two towers share encoders for the text tokens, for the product category (which is a token of its own), and for location, which means the embedding can match a user to the location of the product itself. And to personalize this, they encode user preferences via the user features at the bottom: essentially, what queries the user searched for, what they bought previously, all their preferences.
They also shared their system architecture. Over here are the product encoder and the query encoder from the previous slide. What's very interesting is that they added a quality vector, because they wanted to ensure that whatever was retrieved was actually of good quality in terms of ratings, freshness, and conversion rate. What they did is simply concatenate this quality vector to the product embedding vector. But when you do that, you have to expand the query vector by the same number of dimensions so that you can still take a dot product or cosine similarity. So essentially, they just slapped a constant vector onto the query embedding, and it just works.
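A tiny numpy sketch of that concatenation trick, with illustrative dimensions and pad value:

```python
import numpy as np

def score(query_emb: np.ndarray, product_emb: np.ndarray, quality: np.ndarray,
          pad_value: float = 1.0) -> float:
    """Dot-product score with a quality vector concatenated to the product side.

    The product embedding is extended with quality features (e.g. ratings,
    freshness, conversion rate); the query embedding is padded with a constant
    so the two vectors keep the same dimensionality. The padding constant and
    dimensions here are illustrative, not Etsy's actual values.
    """
    product_full = np.concatenate([product_emb, quality])
    query_full = np.concatenate([query_emb, np.full(quality.shape, pad_value)])
    return float(query_full @ product_full)

# Toy example: a 4-d embedding plus a 3-d quality vector.
q = np.array([0.1, 0.3, -0.2, 0.5])
p = np.array([0.2, 0.1, -0.1, 0.4])
quality = np.array([0.9, 0.7, 0.05])   # e.g. rating, freshness, conversion rate
print(score(q, p, quality))
```

Because the query side of the extra dimensions is constant, the quality features contribute a fixed linear boost to every product's score, which is why appending a constant works without changing the retrieval machinery.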
The result: a 2.6% increase in conversion across the entire site, which is quite crazy, and more than a 5% increase in search purchases, meaning if you search for something, the purchase rate increases by over 5%. These are very, very good results for e-commerce.
So, the benefits of unified models: you simplify the system, and whatever you build to improve one tower or one model also improves the other use cases that share the unified model. That said, there may also be an alignment tax. You may find that when you try to compress all twelve of your use cases into a single unified model, you need to split it into two or three separate unified models, because of that alignment tax: getting better at one task actually makes another task worse. We have a talk from LinkedIn in this afternoon's block, the last talk of the block, and then a talk from Netflix, who will share their unified model at the start of the next block.

All right, the three takeaways I have for you to think about and consider: semantic IDs, data augmentation, and unified models. And of course, stay tuned for the rest of the talks in this track. Okay, that's it. Thank you.