
Improving RecSys and Search in the age of LLMs — Eugene Yan



00:00:00.000 | Okay, so I've been, I work in the field of recommendation systems and search, and every
00:00:10.420 | now and then I like to pop my head up to try to see what's going on.
00:00:13.760 | I think the recent trend for the past one or two years has been the interaction between
00:00:18.200 | the recommendation systems and search and LLMs.
00:00:22.300 | Let's call it RecSys and search and LLMs. And I think in early 2023,
00:00:29.760 | we would see some papers where people used decoder-only models to try to predict IDs.
00:00:34.720 | Those didn't work very well.
00:00:36.680 | But at the end of last year and early this year, we're starting to see signs of life whereby
00:00:40.060 | some of these actually are A/B tested and have very good empirical results.
00:00:43.760 | So I want to highlight a few of those patterns and we can see how it goes.
00:00:49.120 | The one thing that, for the longest time, we've been trying to do for RecSys and search is to
00:00:53.060 | slowly move away from item IDs.
00:00:55.740 | You can imagine if Eugene interacts with item IDs 1, 10, 25, his next predicted interaction
00:01:01.980 | is probably item ID number 33.
00:01:04.660 | But all of this is relying solely on item IDs, and you can imagine every time that you have
00:01:08.500 | a new item, you have to learn a new embedding for it, and that leads to a cold-start problem.
00:01:13.700 | So I know this was not part of my recommended reads in my write-up, I will actually go back
00:01:18.340 | and update it to be recommended reads, but I want to discuss two papers here to try to address
00:01:22.620 | this.
00:01:23.620 | The first one is semantic IDs, which is from YouTube.
00:01:26.260 | You can imagine YouTube has a lot of new videos all the time, they can't learn new item IDs for them
00:01:33.540 | all the time.
00:01:34.540 | So what they do is they have a transformer encoder, it generates dense content embeddings.
00:01:38.740 | This is actually just a video encoder that converts a video into dense content embeddings.
00:01:42.860 | And then they compress this into what they call a semantic ID.
00:01:46.180 | It's via a residual-quantized autoencoder, an RQ-VAE.
00:01:47.780 | So the dense video embedding is 2048-dimensional, but what they do is they take the embedding, find
00:01:53.260 | the nearest neighbor in a codebook (the codebook also has 2048 entries), assign it to that nearest code,
00:01:59.340 | take the residual, find the nearest neighbor in the next codebook, assign it.
00:02:02.180 | So it just keeps compressing and compressing it.
00:02:04.580 | In the image here, they have four layers.
00:02:07.860 | So essentially you can compress an item into four integers; in the paper,
00:02:12.580 | they actually compress an item into eight integers.
00:02:15.140 | So I thought this was pretty cool.
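(A minimal sketch of the residual quantization idea, assuming already-trained codebooks; in the paper the codebooks are learned jointly inside the RQ-VAE, and the sizes below are toy numbers.)

```python
import numpy as np

def semantic_id(embedding: np.ndarray, codebooks: list) -> list:
    """Greedily quantize a dense content embedding into one integer per codebook level."""
    residual = embedding.copy()
    codes = []
    for codebook in codebooks:                      # codebook: (num_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                 # nearest code to the current residual
        codes.append(idx)
        residual = residual - codebook[idx]         # carry the leftover to the next level
    return codes

# Toy usage: 4 levels of 256 codes over a 64-d embedding -> an item becomes 4 integers
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]
print(semantic_id(rng.normal(size=64), codebooks))
```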
00:02:16.580 | And now that you have an item, you have a content embedding, and you've converted it into eight integers.
00:02:23.620 | How do you then learn on it, right?
00:02:25.620 | They tried using n-grams and SentencePiece.
00:02:28.900 | N-grams are really just, you know, like fastText character n-grams, you know, every subword:
00:02:34.500 | every one, two, or three consecutive codes gets its own learned embedding.
00:02:37.620 | And then they also tried using SentencePiece.
00:02:39.220 | Essentially, SentencePiece is really just looking across all of these semantic IDs.
00:02:44.100 | What are the most common subwords, the most common sub-sequences of codes?
00:02:48.740 | So therefore, it's no longer just unigrams, bigrams, and trigrams.
00:02:53.860 | It's that you can learn variable-length subwords.
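(A minimal sketch of the n-gram variant over one semantic ID; each n-gram would get its own learned embedding, fastText-style, while SentencePiece would instead be trained on a corpus of these code sequences to discover variable-length pieces.)

```python
def semantic_id_ngrams(codes, n_max=3):
    """fastText-style n-grams over a semantic ID, e.g. codes = [513, 72, 1990, 8]."""
    tokens = [f"c{c}" for c in codes]
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams

print(semantic_id_ngrams([513, 72, 1990, 8]))
# ['c513', 'c72', 'c1990', 'c8', 'c513_c72', ..., 'c513_c72_c1990', 'c72_c1990_c8']
```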
00:02:55.780 | What are the results from this?
00:02:58.260 | Well, not surprisingly, dense content embeddings by themselves do worse than item IDs, right?
00:03:08.180 | And you can see this, right?
00:03:10.180 | You can see this on the chart on the left here, right?
00:03:13.860 | You can see unigram and bigram, the red line and the purple line.
00:03:21.140 | The unigram is actually worse than the random hash item ID, the orange line, to some extent.
00:03:27.860 | Oh, actually, no, I didn't include the--
00:03:31.220 | Okay, I have the content embeddings line itself in my write-up.
00:03:36.100 | I didn't include it here.
00:03:37.300 | But this chart here is actually trying to show that if they use the dense content embeddings directly, it's crap.
00:03:42.020 | But when they use both n-grams and SentencePiece, it did better.
00:03:45.620 | So you have to do this trick whereby you convert that full dense content embedding into its semantic ID and then learn on those IDs.
00:03:55.540 | Now, the benefit of this, you might be saying, hey, you know, isn't this all back to IDs again?
00:03:59.540 | Well, not necessarily, because now, given a piece of content, you can convert it to an embedding and then assign it to its nearest semantic ID.
00:04:05.780 | And therefore, you don't need to learn it from behavioral data.
00:04:07.860 | So that's the benefit here.
00:04:09.860 | Similarly, Kuaishou, which is like the number-two TikTok in China, adopted the same approach.
00:04:18.500 | They use multimodal content embeddings.
00:04:21.780 | So they use embeddings from ResNet, Sentence-BERT, and VGGish to get the respective modalities.
00:04:27.380 | And then these are just simply concatenated into a single vector.
00:04:29.780 | Then they take all these embeddings.
00:04:32.420 | They just do k-means to identify a thousand clusters.
00:04:36.660 | Right.
00:04:37.540 | So now each item maps to a cluster ID, which becomes a trainable ID.
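(A rough sketch of that clustering step, with made-up embedding sizes; the real embeddings would come from the frozen per-modality encoders.)

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frozen per-modality embeddings for each item in the catalog
rng = np.random.default_rng(0)
visual = rng.normal(size=(10_000, 64))
text = rng.normal(size=(10_000, 32))
audio = rng.normal(size=(10_000, 16))
multimodal = np.concatenate([visual, text, audio], axis=1)  # simple concatenation

# Cluster once offline; the 1,000 cluster IDs become the trainable vocabulary
kmeans = KMeans(n_clusters=1000, n_init=10, random_state=0).fit(multimodal)

# A brand-new item only needs its content embeddings to get an ID, no behavioral data required
new_item = rng.normal(size=(1, multimodal.shape[1]))
cluster_id = int(kmeans.predict(new_item)[0])
```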
00:04:40.980 | These cluster IDs are then embedded via a multimodal encoder.
00:04:45.380 | So this multimodal encoder, there's quite a bit to it.
00:04:48.420 | Let me try to simplify it.
00:04:49.780 | You can see on the top, it says non-trainable visual embeddings.
00:04:53.940 | In this example, they only show the visual embedding.
00:04:56.820 | But you can imagine the same for all the non-trainable embeddings.
00:05:00.180 | They take it and they project it into a different space via the mapping network.
00:05:05.140 | Secondly, for every cluster ID, they convert it into learned embeddings for visual, textual, and acoustic.
00:05:11.700 | These are not the original embeddings that come from the content.
00:05:14.020 | These are just the representation of the multimodal cluster ID.
00:05:18.020 | And then fusion is really just concatenating them.
00:05:20.420 | So now you might be thinking, how is this multimodal encoder learned?
00:05:24.180 | How is this multimodal encoder trained, right?
00:05:26.180 | This multimodal encoder is not trained on its own.
00:05:28.180 | It is trained within the overall ranking network,
00:05:32.580 | and you can see the multimodal encoder at the bottom, right?
00:05:35.380 | So this ranking network takes the user tower, which is on the left,
00:05:40.980 | and the item tower that's on the right, and it tries to predict the likelihood that a user will click or like.
00:05:45.380 | Therefore, based on this, they just backprop.
00:05:48.900 | You backprop the likelihood of clicking or liking or following, and you backprop it through the multimodal encoder.
00:05:55.220 | And that's how the multimodal encoder learns the mapping network and the cluster-ID embeddings.
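(A rough sketch of how that end-to-end training could look; the tiny towers, dimensions, and simple fusion here are my simplification, not the paper's actual architecture.)

```python
import torch
import torch.nn as nn

class TinyTwoTower(nn.Module):
    def __init__(self, n_clusters=1000, dim=32):
        super().__init__()
        self.cluster_emb = nn.Embedding(n_clusters, dim)   # learned cluster-ID embeddings
        self.map_frozen = nn.Linear(128, dim)               # "mapping network" for frozen content vectors
        self.user_tower = nn.Sequential(nn.Linear(64, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.item_tower = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, user_feats, frozen_content, cluster_id):
        item = torch.cat([self.map_frozen(frozen_content), self.cluster_emb(cluster_id)], dim=-1)
        return (self.user_tower(user_feats) * self.item_tower(item)).sum(-1)  # click/like logit

model = TinyTwoTower()
logits = model(torch.randn(8, 64), torch.randn(8, 128), torch.randint(0, 1000, (8,)))
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (8,)).float())
loss.backward()  # gradients flow into cluster_emb and the mapping network
```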
00:06:00.740 | So the benefit of this is that they shared that it outperformed several multimodal baselines.
00:06:04.420 | I won't go through them here.
00:06:05.460 | And when they did A/B testing, I think those are pretty freaking significant numbers.
00:06:11.060 | Anything more than 1% is pretty strong in a platform like this.
00:06:17.060 | And the benefit of this is that they mentioned that they had increased cold start velocity and cold start coverage.
00:06:21.540 | It means that, you know, cold start is able to pick up faster.
00:06:24.340 | If it's a good cold start video, it's able to pick up faster.
00:06:27.460 | And they are also able to show more cold start, 3.6% more cold start content, which increases coverage.
00:06:35.300 | So they also did the ablation studies.
00:06:37.540 | So let me butt in: for those new to the RecSys world, you said anything more than 1% is a big deal.
00:06:45.540 | Pretty huge.
00:06:46.500 | Can you contextualize, like how much is that worth or is that like...
00:06:51.140 | So you can imagine, right?
00:06:52.660 | Imagine you are making, I don't know, let's just make something up.
00:06:55.940 | A billion dollars worth of ads.
00:06:57.460 | Yeah.
00:06:58.180 | Right.
00:06:58.740 | And if people are engaging more, spending 1% more time, you can show 1% more ads.
00:07:03.620 | That's like 10 million.
00:07:05.780 | A million dollars.
00:07:06.660 | Right.
00:07:07.140 | So you just expand that, right?
00:07:08.660 | Of course, clicks and likes and follows.
00:07:10.420 | This is, these are engagement, engagement proxy metrics.
00:07:15.940 | Okay.
00:07:16.420 | So like, and is this, is this absolute or relative?
00:07:19.860 | So for example, are we saying that, let's say likes was plus three right here.
00:07:23.940 | Are we saying they went from six to 9% or are we saying they're currently 6% and now we are 6.09%.
00:07:31.860 | I suspect it's relative.
00:07:34.020 | I suspect it's like maybe from 5% to 5.15%.
00:07:37.780 | going from 6% to 9% is impossible.
00:07:41.700 | You wouldn't have to work.
00:07:43.540 | Yeah.
00:07:44.820 | Okay.
00:07:45.780 | All right.
00:07:46.100 | So no surprises here: using multi-modality consistently outperforms single-modality features.
00:07:54.420 | But there was also a trick here whereby they had to learn user-specific modality interests.
00:07:59.380 | And if you look at it on the left tower there, close to the top, there's this thing called multi-modal interest intensity learning.
00:08:09.380 | Essentially what they're learning there is for each user, which modality they are interested in.
00:08:14.340 | And then they actually map to that.
00:08:15.460 | Some people, like swyx, are very acoustically inclined.
00:08:18.100 | He might care more about the soundtrack.
00:08:19.620 | For other people, they might care more about the visuals or the video itself.
00:08:24.180 | Or the text, having a good caption is worth a thousand words, maybe.
00:08:27.380 | So yeah.
00:08:28.740 | So that's one trend I've seen, which is increasingly including more content into the model itself.
00:08:38.020 | The other trend that I've seen is using LLMs for synthetic data.
00:08:43.140 | And there's two papers that I really like here because they share a lot of details.
00:08:48.660 | And they share a lot about the pitfalls that they faced.
00:08:52.100 | So the first one is, this is paper I really like.
00:08:55.060 | It's called expected bad match.
00:08:57.060 | It's from Indeed.
00:08:57.940 | Essentially, you can imagine you are providing people with job recommendations, right?
00:09:03.060 | And then you want to have a final filter at the end to filter out bad job recommendations.
00:09:09.780 | So this paper, it's not easy to get the full access to this paper online.
00:09:14.980 | I've included it in our, we have a, in our Discord channel, we have a thread.
00:09:20.020 | I've actually dropped the PDF in there and they talk through the entire, I think they talk through the entire process.
00:09:24.980 | And I think it's quite a role-model process.
00:09:27.380 | They started with looking at 250 job matches, right?
00:09:32.100 | 250 job matches.
00:09:33.140 | And then they compared it across various experts.
00:09:36.020 | They have very rigorous criteria.
00:09:37.860 | And in the end, they built a final eval set of 147 matches that were very high confidence:
00:09:43.540 | multiple experts agreed and, you know, there was nothing subjective.
00:09:46.820 | Then they tried prompting the LLMs with recruitment guidelines, right,
00:09:50.980 | to classify job match quality.
00:09:52.980 | And of course, you know, they tried things like the cheap stuff and then they tried things like the expensive stuff.
00:09:57.060 | Unfortunately, the cheap stuff doesn't work.
00:09:59.060 | Only GPT-4 worked.
00:10:01.380 | But GPT-4 was so slow.
00:10:02.900 | GPT-4 took an average of 31 seconds, right?
00:10:06.580 | Right.
00:10:07.140 | Okay.
00:10:07.780 | No problem.
00:10:08.420 | OpenAI lets you fine-tune GPT-3.5.
00:10:11.300 | So what they did was they fine-tuned GPT-3.5.
00:10:15.300 | And GPT-3.5, you can see, let's just focus on the green boxes and then the red boxes.
00:10:22.660 | You can see the fine-tuned GPT-3.5 is able to achieve almost as good precision and recall as GPT-4.
00:10:31.620 | Fine-tuned, right?
00:10:32.340 | Fine-tuned on the labels of GPT-4.
00:10:34.100 | And GPT-3.5 reduced latency by two-thirds and cost by two-thirds, which is perfect, right?
00:10:43.300 | But the thing is, you see that the average latency there is like six seconds.
00:10:48.020 | And that's still not good enough for when they needed to do it online.
00:10:52.180 | So then what they did is they fine tune the lightweight classifier.
00:10:56.180 | Unfortunately, they didn't go into very many details of what this lightweight classifier is.
00:11:00.020 | I suspect that this lightweight classifier is maybe not a language model.
00:11:03.460 | I suspect that it is probably a decision tree because they talk a lot about categorical features.
00:11:09.380 | And then the labels are just solely the LLM-generated labels, right?
00:11:13.300 | So you can see their entire journey.
00:11:15.620 | First the eval set, then we test GPT-4: GPT-4 is good, but too slow.
00:11:19.380 | Then we try GPT-3.5 like step-by-step incremental progress, but still too slow, too expensive.
00:11:24.980 | Okay.
00:11:25.540 | We would really not like to have to train our own classifier and have to maintain the ops of retraining
00:11:31.860 | it, but we don't have a choice.
00:11:33.220 | This is what we need to do to reduce inference latency.
00:11:37.380 | So that's what they did.
00:11:38.260 | They were able to achieve an AUC-ROC of 0.86.
00:11:45.060 | That's pretty freaking good against the LLM labels.
00:11:48.100 | And this is, according to them, low latency.
00:11:50.340 | I don't know how low the latency is, but it's suitable for real-time filtering.
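(The paper doesn't say what the lightweight classifier actually is, so here's a hedged sketch of the general pattern: LLM-generated labels distilled into a cheap tree model over hand-built features. The features and data are made up.)

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical features for (job seeker, job) pairs, e.g. title match, distance, seniority gap...
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 8))
# Labels produced offline by the (fine-tuned) LLM, simulated here
llm_is_bad_match = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=50_000) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, llm_is_bad_match, test_size=0.2, random_state=0)
clf = HistGradientBoostingClassifier().fit(X_tr, y_tr)   # milliseconds at inference time
print("AUC vs. LLM labels:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```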
00:11:53.460 | The benefits of this are pretty tremendous, right?
00:11:56.020 | You can read the benefits there yourself.
00:11:59.060 | But I think the one big benefit is they lowered unsubscribed rates by 5%.
00:12:03.060 | That is huge.
00:12:05.300 | If you maintain some kind of push notification or email notification product,
00:12:10.980 | the unsubscribe rate is like your biggest guardrail.
00:12:14.100 | Because if people aren't subscribed, you're never, ever going to reach out to them again.
00:12:17.220 | You lose them.
00:12:19.220 | So, you know, all your customer acquisition costs is really down the drain.
00:12:23.300 | Like maybe you have an offer and you let people sign up.
00:12:26.020 | Hey, would you like to hear more about us?
00:12:27.540 | Okay.
00:12:27.940 | We give that out.
00:12:28.820 | But once they unsubscribe, you lose them.
00:12:30.580 | And so over here on the top line in Table 2, they share the results for the invite-to-apply email.
00:12:38.820 | That's one.
00:12:39.460 | And then that's the results I highlighted here.
00:12:41.780 | And in the bottom, I also see they had online experiments for the homepage recommendation feed.
00:12:47.300 | And that's how low latency this classifier has to be.
00:12:51.300 | It has to be on the homepage recommendations.
00:12:53.300 | And similarly, we see very good results, right?
00:12:58.180 | You can see it.
00:12:59.300 | For example, impressions, right?
00:13:00.500 | Impressions drop 5.1% at threshold of 15 and then drop by 7.95% at threshold 25.
00:13:08.820 | What does that mean?
00:13:11.460 | That means you freed up 5% to 8% of impressions.
00:13:15.620 | You can now show more good stuff.
00:13:18.740 | Right?
00:13:20.740 | That's huge.
00:13:21.780 | But freeing up 1/12 of impressions is a very big deal.
00:13:25.380 | Freeing up more space.
00:13:27.140 | And as we know, more real estate is better.
00:13:29.940 | So I think that this was quite an outstanding result.
00:13:34.100 | The other one I want to share and then we'll pause for questions, short questions before I go into two other sections, is query understanding at Yelp.
00:13:41.380 | So query understanding at Yelp was very nice.
00:13:43.940 | It's purely using OpenAI.
00:13:45.220 | They had two things.
00:13:47.540 | One is query segmentation and another one is highlights.
00:13:49.780 | The query segmentation one is not so straightforward to understand, but essentially, given a query like "Epcot restaurants", they can split it into different parts like topic, location, name, question, etc.
00:14:02.660 | And then by having better segmentation, they can have greater confidence in rewriting those parts of the query to help them search better.
00:14:12.900 | So the second bullet point gives you an example.
00:14:15.300 | If we know that the user's location is approximately there and the user said Epcot restaurants, we can rewrite the user's location from Orlando, Florida to Epcot for the search backend.
00:14:28.100 | And because the search backend is based on location, by rewriting Orlando, Florida to Epcot, they were able to get more precise results for the user.
00:14:37.620 | So that's one example.
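(A hedged sketch of what LLM-based query segmentation can look like; the schema, prompt, and model name are illustrative, not Yelp's actual setup.)

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def segment_query(query: str) -> dict:
    """Ask the model to split a local-search query into labeled segments."""
    prompt = (
        "Segment this local-search query into JSON with keys "
        "'topic', 'location', 'business_name', and 'question' (use null when absent).\n"
        f"Query: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

print(segment_query("Epcot restaurants"))
# e.g. {"topic": "restaurants", "location": "Epcot", "business_name": null, "question": null}
```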
00:14:40.340 | The other example is review highlights.
00:14:42.580 | And the original write-up, they have a lot of good images.
00:14:45.780 | I didn't include those images here because I didn't have time.
00:14:48.500 | I only started reading this like an hour before.
00:14:51.060 | One of this is review highlights.
00:14:53.620 | So imagine if you search for some food, maybe you search for vegetarian friendly Thai food, right?
00:15:01.860 | And then sometimes in the reviews, people would say things like vegetarian, veggie only, suitable for vegetarians.
00:15:08.980 | And then I'm sure that there are way more synonyms for this.
00:15:14.180 | In the past, they had to get humans to write these different synonyms.
00:15:17.460 | And then they add these to dictionaries.
00:15:18.660 | And you can imagine this is not scalable.
00:15:23.380 | But now they can use LLMs to replicate the human reasoning, right?
00:15:25.380 | In the phrase extraction.
00:15:25.940 | And they get way better coverage and it can cover 95% of traffic.
00:15:31.780 | So for query segmentation, they are able to understand the user intent a little bit better.
00:15:36.500 | And then for review highlights, because they were showing more reviews, especially for the long tail queries, it makes search more engaging.
00:15:46.260 | By highlighting the relevant reviews for a user's query, they help the user feel more confident about the food.
00:15:53.460 | Let's say it's vegetarian friendly.
00:15:55.380 | And then maybe the user review would really say something like, oh, the vegetarian food is great and delicious or something.
00:16:00.420 | Or like definitely no meat involved.
00:16:02.340 | It happens to be gluten free as well.
00:16:04.020 | I don't know.
00:16:04.580 | Things like that help make the users more confident in the search results that they're seeing.
00:16:09.060 | Okay, I'll pause here.
00:16:12.900 | Any questions?
00:16:14.820 | I know there's a lot in the chat.
00:16:16.820 | Oh my goodness.
00:16:17.460 | There's a quick, I mean, I have a quick question on just this query understanding thing.
00:16:22.260 | What was the previous state of the art in query segmentation?
00:16:25.220 | Like this seems like the most obvious, dumb possible thing to do.
00:16:28.260 | Named entity extraction.
00:16:32.260 | You get a span and then you train some kind of classic, you train some kind of transformer model
00:16:36.980 | that takes the input and then they'll cut it at characters.
00:16:40.340 | Yeah.
00:16:41.380 | Okay.
00:16:42.580 | So like it's basically, but like this guy did not compare it to NER, right?
00:16:46.420 | Like they, they, they mentioned that their original was NER and this is better.
00:16:51.620 | Okay.
00:16:53.060 | Nice.
00:16:53.380 | Yeah.
00:16:53.540 | They definitely, yep.
00:16:55.140 | I mean, everyone starts with some kind of NER based approach.
00:16:58.020 | My, I mean, my, my theory is that like, yeah, I, I basically,
00:17:01.540 | basically there's no point doing traditional NER anymore.
00:17:04.820 | You just do LLMs, uh, with some kind of schema.
00:17:08.740 | Could be.
00:17:09.380 | So fast.
00:17:10.260 | Yeah.
00:17:10.980 | NER is fast and cheap, uh, that's what you need for search, right?
00:17:15.300 | Uh, sorry.
00:17:17.860 | Great.
00:17:18.260 | But slow, right?
00:17:20.020 | Yeah.
00:17:20.260 | Um, like if you want a spell check or autocomplete thing, like a grammar tool, Gemini is slow.
00:17:27.380 | Oh, so this is why I might rather prefer, uh, present from this, um, where is it search query?
00:17:38.980 | So, um, um, they started their legacy models.
00:17:44.100 | Uh, their legacy models.
00:17:45.540 | Oh, go ahead.
00:17:46.340 | Yeah.
00:17:47.700 | So you can see named entity recognition, right?
00:17:50.580 | Um, they use named entity recognition, and, and they do this.
00:17:53.540 | Um, but then, oh, wow.
00:17:56.180 | People are drawing on this.
00:17:57.380 | I didn't know you could do that.
00:17:58.820 | Um, but then they actually shared, one thing I really like about this, uh, write-up is this,
00:18:04.340 | this, this chart right here.
00:18:05.540 | Now, this chart seems very, very basic: formulation, scoped task, proof of concept, scaling up.
00:18:11.140 | Um, but they, they wrote it very well to explain how they did it in the context of these two case
00:18:17.380 | studies.
00:18:17.780 | I feel like a lot of people just completely just, uh, drop this.
00:18:22.100 | They completely ignore this.
00:18:23.940 | And for query segmentation, right?
00:18:25.460 | Um, I know I'm taking a long time to get to the point,
00:18:28.420 | which is that 10% of queries make up 80% of traffic.
00:18:31.860 | So they could do all this query segmentation once, period.
00:18:35.940 | And then just retrieve from the cache.
00:18:38.500 | Um, derive a golden dataset, fine-tune.
00:18:43.540 | I can't remember where they wrote it, but that's how they did it for query segmentation and for review
00:18:47.780 | highlights.
00:18:48.420 | So essentially, a lot of these things may feel like they have to be done online,
00:18:53.380 | but because of the power law in e-commerce and online search,
00:18:56.900 | you don't have to, uh, you, you can make use of a cache a lot.
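(A minimal sketch of that pattern, with hypothetical function names and cache contents: precompute segmentation for head queries offline, serve from a cache, and only fall back to the expensive model for the tail.)

```python
# Precomputed offline for the head queries that dominate traffic (hypothetical entries)
HEAD_QUERY_CACHE = {
    "epcot restaurants": {"topic": "restaurants", "location": "Epcot"},
}

def llm_segment(query: str) -> dict:
    """Placeholder for the expensive LLM call (see the earlier segmentation sketch)."""
    return {"topic": query, "location": None}

def segment(query: str) -> dict:
    key = query.strip().lower()
    if key in HEAD_QUERY_CACHE:      # head queries: served from the cache for free
        return HEAD_QUERY_CACHE[key]
    return llm_segment(key)          # long tail: pay for the model (or a cheaper fallback)

print(segment("Epcot restaurants"))
```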
00:19:00.260 | Um, there's also another question from Tyler Cross.
00:19:06.020 | How do these approaches compare to more information retrieval methods like BM25 vector methods?
00:19:11.380 | Um, uh, Tyler, do you want to, what do you mean by, by these approaches?
00:19:16.900 | What approaches specifically?
00:19:18.180 | Oh, no, maybe Tyler's not on the...
00:19:22.820 | I think you said this when you were doing the LLM, like we tried 3.5,
00:19:26.180 | fine tune, 3.5, Llama 2, Nostro 11B.
00:19:28.420 | Ah, okay.
00:19:30.500 | I think that's like a classification.
00:19:31.940 | I think in this case, uh, if it's a classification approach and this wouldn't work.
00:19:35.860 | Right.
00:19:36.900 | Um, but if you're talking more generally, like using LLMs or embeddings, uh, for retrieval.
00:19:44.260 | Yeah.
00:19:45.540 | I actually don't know.
00:19:46.180 | I don't know the full, the full context of the question.
00:19:48.260 | So I probably better not answer.
00:19:49.620 | Um, okay.
00:19:50.740 | Any, any other questions?
00:19:51.860 | Okay.
00:19:55.940 | I think that's, that's great.
00:19:57.700 | I can move on because the other two sections are fairly heavy and I, and then, you know,
00:20:02.100 | we have more times, more time after that.
00:20:03.860 | Ooh, wait.
00:20:05.540 | I don't know if I'm sharing the right screen.
00:20:08.660 | Give me a second.
00:20:09.220 | Okay.
00:20:10.340 | First I need to go to my Google Chrome.
00:20:14.420 | Ah, see, this is what happens when you get a noob to do slides.
00:20:17.540 | Okay.
00:20:21.860 | Share.
00:20:22.660 | You are seeing this, right?
00:20:24.580 | You're seeing my slides.
00:20:25.460 | Are you seeing it in slideshow mode?
00:20:28.820 | Full screen.
00:20:29.460 | Okay.
00:20:29.700 | Perfect.
00:20:30.100 | Perfect.
00:20:30.980 | So then the other thing is what I'm calling LLM-inspired training paradigms.
00:20:34.980 | So maybe it's LLM inspired, maybe you've been doing this for a very long time, but I thought
00:20:38.660 | to highlight it.
00:20:39.700 | The first one is really looking at the scaling laws.
00:20:43.140 | And ever since I published, I shared about this post.
00:20:45.540 | Like people have gotten back to me with like at least three or four papers about other studies
00:20:48.740 | of scaling laws, but along a very similar, very similar view.
00:20:52.580 | So I want to talk about the experimentation that they did.
00:20:55.860 | This scaling-law study used a decoder-only transformer architecture.
00:20:59.060 | And they tried various model sizes.
00:21:00.900 | The training data is treated the same as sentences.
00:21:03.460 | Essentially it's fixed length sequences, 50 item IDs each.
00:21:06.740 | And the training objective is given the past 10 items, predict item number 11.
00:21:10.740 | Given the past 20 items, predict item number 21.
00:21:13.140 | So it's fairly straightforward.
00:21:14.660 | But they did introduce two key things that are very interesting.
00:21:20.180 | The first one is layer-wise adaptive dropout.
00:21:22.500 | So you can imagine, right, for LLMs, every single layer has same dimension.
00:21:28.660 | You know, usually when they draw a transformer layer, it's every single layer has same dimension.
00:21:33.060 | But for recommendation system models, that's not the case.
00:21:35.540 | It's usually fairly fat at the bottom and gets skinnier towards the top.
00:21:39.300 | So what they do over here is they have higher dropout in the lower layers and lower dropout in the higher layers.
00:21:43.780 | And ablation studies showed that this works.
00:21:46.420 | So the intuition here is that the lower layers process more direct input from the data.
00:21:51.140 | And because e-commerce data or recommendation data is fairly noisy, it's more prone to overfitting.
00:21:58.260 | Therefore, they have more dropout in the lower layers.
00:22:01.780 | Vice versa, at the upper layers, it learns from more abstract data.
00:22:06.260 | And therefore, you want to make sure it doesn't underfit.
00:22:09.140 | You want to make sure it gets all the juice you can get.
00:22:11.380 | And therefore, they have lower dropout at the higher layers.
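(The exact dropout schedule isn't spelled out here, so this is a minimal sketch of the idea: higher dropout near the noisy input, lower dropout near the top. The linear schedule and rates are assumptions.)

```python
import torch.nn as nn

def build_stack(layer_dims, p_bottom=0.3, p_top=0.05):
    """Hypothetical MLP stack with dropout decreasing from bottom to top layers."""
    n = len(layer_dims) - 1
    blocks = []
    for i in range(n):
        p = p_bottom + (p_top - p_bottom) * i / max(n - 1, 1)  # linear schedule, made up here
        blocks += [nn.Linear(layer_dims[i], layer_dims[i + 1]), nn.ReLU(), nn.Dropout(p)]
    return nn.Sequential(*blocks)

model = build_stack([1024, 512, 256, 128])  # fat at the bottom, skinny at the top
```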
00:22:16.100 | The other thing, which feels a little bit like black magic, is that they switch optimizers halfway through training.
00:22:22.580 | Firstly, they start with Adam and then they switch to SGD.
00:22:26.340 | The observation they had, you know, from running a full run with Adam and a full run with SGD,
00:22:30.420 | is that Adam is able to very quickly reduce the loss at the start,
00:22:35.780 | but then it slowly tapers off, whereas SGD is slower at the start but achieves better convergence.
00:22:40.500 | So they had to do these two tricks for their sequential models.
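(A minimal sketch of the optimizer switch; the switch point, learning rates, and toy model are made up.)

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
loss_fn = nn.MSELoss()
data = [(torch.randn(32, 16), torch.randn(32, 1)) for _ in range(100)]

switch_at = len(data) // 2                                   # a made-up choice of switch point
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # fast early progress
for step, (x, y) in enumerate(data):
    if step == switch_at:
        # hand the same parameters to SGD for better late-stage convergence
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```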
00:22:44.660 | What were the results?
00:22:46.820 | I mean, obviously, no, this is fairly obvious.
00:22:50.340 | Higher model capacity reduces cross-entropy loss.
00:22:53.140 | And this model capacity is model params excluding the ID embeddings.
00:22:59.380 | So it's purely just the layers themselves without the ID embeddings.
00:23:03.380 | And they were able to model this.
00:23:05.860 | If you look at the dashed line, the test-loss curve: they estimated that power-law curve from the blue dots.
00:23:14.900 | And they were able to fairly accurately predict where the red dots are going to be.
00:23:19.300 | And, you know, this is like the Kaplan-style and Chinchilla-style scaling laws.
00:23:25.140 | So essentially: given some smaller models, if we had a bigger model, how would it perform?
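(A small sketch of how such a fit works in practice; the data points are invented, only the functional form L(N) = a * N^(-b) + c is the standard one for these curves.)

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, b, c):
    """Loss as a function of model size: L(N) = a * N^(-b) + c."""
    return a * np.power(n_params, -b) + c

# Hypothetical (model size, test loss) points from small training runs
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([4.10, 3.85, 3.62, 3.45, 3.31])

(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 3.0), maxfev=10_000)
print("predicted loss at 1B params:", power_law(1e9, a, b, c))
```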
00:23:29.060 | The other thing was, oh gosh, these lines don't look correct.
00:23:33.860 | Okay, the red arrow does look correct.
00:23:36.740 | Everything else is fine.
00:23:38.020 | So over here, I think this is a very nice result, which is that smaller models need more data to achieve comparable performance.
00:23:46.980 | So over here on the left, you can see that there's a small model there on the orange line.
00:23:51.700 | It needed twice the amount of data compared to a bigger model to get similar performance, right?
00:23:59.540 | So the flip side of it is that, hey, you know, if you want highly performant models online, you're going to need a factor more data.
00:24:09.940 | Of course, this is also nothing unusual.
00:24:12.820 | This is something we know, but it's really nice to have someone have done the experiments and distill it into the results here.
00:24:19.140 | The other thing that I thought was really interesting is this idea of recommendation model pre-training.
00:24:28.500 | So this was fairly new to me.
00:24:31.140 | I didn't think it could be done.
00:24:32.420 | Most people do this on content embeddings, which is given some content of this item, can you predict the content of that item?
00:24:40.180 | I thought this was fairly novel, whereby it's trained solely on item popularity statistics.
00:24:44.580 | They say it works.
00:24:47.940 | It's still quite unfathomable to me how it works.
00:24:51.860 | Essentially just take the item popularity statistic in the monthly and the weekly timescale, convert it to percentiles, and then convert those percentiles to vector representations.
00:25:02.340 | And that's it.
00:25:03.940 | That's your representation of the item.
00:25:06.500 | So anytime you have a new item, so long as you have the past statistic for the past month and week, you can convert the percentile and then map it into vector representation.
00:25:14.900 | So what this means is that, imagine our percentiles are bucketed into hundredths and we have stats for monthly and weekly: all we need is 200 embeddings, for a month and a week.
00:25:24.660 | For each timescale we need 100 percentile buckets, and each bucket gets a vector representation.
00:25:31.220 | So instead of millions of item IDs or billions of item IDs, all you need is 200 percentile vector representations.
00:25:39.300 | So that is extremely compressed.
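(A rough sketch of that representation; the bucket counts, dimensions, and the rank-to-percentile mapping are my assumptions.)

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical weekly / monthly view counts for a 10,000-item catalog
rng = np.random.default_rng(0)
weekly_views = rng.poisson(50, size=10_000)
monthly_views = rng.poisson(200, size=10_000)

def to_percentile_bucket(counts: np.ndarray) -> np.ndarray:
    """Rank each item's count against the catalog and bucket into 0..99."""
    ranks = counts.argsort().argsort()
    return np.clip((ranks / len(counts) * 100).astype(int), 0, 99)

weekly_bucket = torch.as_tensor(to_percentile_bucket(weekly_views))
monthly_bucket = torch.as_tensor(to_percentile_bucket(monthly_views))

# Only 2 x 100 trainable vectors stand in for the whole catalog, no per-item ID embeddings
weekly_emb = nn.Embedding(100, 32)
monthly_emb = nn.Embedding(100, 32)
item_repr = torch.cat([weekly_emb(weekly_bucket), monthly_emb(monthly_bucket)], dim=-1)
print(item_repr.shape)  # torch.Size([10000, 64])
```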
00:25:41.060 | They also had to do several tricks, like relative time intervals and fixed position encoding, that don't come across as intuitive to me.
00:25:50.340 | They explain that they did that, but it's unclear: how would I know a priori, without running the experiment, that I needed to do it?
00:25:59.700 | So it feels like there's like too many tricks.
00:26:02.900 | There's so many tricks in like, okay, I need these three things, the stats to perfectly align for this to work.
00:26:07.380 | So I think it's very promising, but I wish there was a simpler way to do this.
00:26:11.300 | The results, it has promising zero shot performance.
00:26:15.380 | What I mean by zero-shot performance: basically it trains on a source domain and then is applied to another domain, right?
00:26:23.700 | And you can see a two to six percent drop in recall at 10.
00:26:25.940 | This is compared to baselines, which are fairly good baselines, SASRec and BERT4Rec, which are trained on the target domain itself.
00:26:32.020 | Now, if you take this model and you train it on that target domain,
00:26:38.660 | it matches or surpasses SASRec and BERT4Rec when trained from scratch.
00:26:42.100 | But the thing is, it only uses one to five percent of the parameters, because it doesn't have item embeddings, right?
00:26:48.100 | It only has those 200 embeddings at the monthly and the weekly scale for every percentile.
00:26:52.100 | So this is quite promising in the sense that it's one direction towards pre-trained recommendation models.
00:27:00.180 | You can imagine some kind of recommendation as a service, adopting this idea and maybe it could work.
00:27:07.060 | Maybe something like Shopify, right?
00:27:08.660 | Shopify has a lot of new merchants onboarding.
00:27:11.220 | Hey, you know, can we take existing merchant data with their permission?
00:27:13.860 | Of course, completely anonymized, right?
00:27:16.260 | It's just solely trained on popularity, right?
00:27:19.220 | And then we just train this model.
00:27:20.580 | Now for any new merchant that's onboarding, as long as we have a week of data, we can use the weekly popularity embeddings.
00:27:26.820 | And once we have a month of data, we can use that model.
00:27:29.860 | So we don't actually need semantic IDs.
00:27:31.860 | The second one is we have two papers from YouTube.
00:27:36.660 | We have two papers from Google and YouTube here.
00:27:38.900 | And this is, so, the one thing about distillation is that
00:27:44.900 | if you solely learn on the teacher labels, it is very noisy, right?
00:27:54.260 | The teacher models are not perfect models. It's better to learn from the ground truth.
00:27:57.140 | But we do know that adding the teacher labels does help.
00:28:00.100 | So what they do here is on the left side, you can see that direct distillation, which is learning from
00:28:06.980 | both the hard labels, which is the ground truth and the distillation labels, which is what the teacher provides,
00:28:11.860 | the teacher model, the big teacher model provides, is not as good as auxiliary distillation.
00:28:17.620 | And essentially, what auxiliary distillation means is that you just split, give them two logits.
00:28:21.620 | One logit to learn from the hard label, one logit to learn from the distillation label.
00:28:25.060 | And they find that this works very well.
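(A minimal sketch of the two-logit idea; the tiny network, loss weighting, and shapes are placeholders.)

```python
import torch
import torch.nn as nn

class AuxDistillModel(nn.Module):
    """Shared trunk with two heads: one learns the ground truth, one learns the teacher."""
    def __init__(self, in_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.hard_head = nn.Linear(64, 1)      # trained on clicks/likes (ground truth)
        self.distill_head = nn.Linear(64, 1)   # trained on the teacher's soft labels

    def forward(self, x):
        h = self.trunk(x)
        return self.hard_head(h).squeeze(-1), self.distill_head(h).squeeze(-1)

model, bce = AuxDistillModel(), nn.BCEWithLogitsLoss()
x = torch.randn(32, 64)
y_true = torch.randint(0, 2, (32,)).float()   # hard labels
y_teacher = torch.rand(32)                    # soft labels from the (frozen) teacher
hard_logit, distill_logit = model(x)
loss = bce(hard_logit, y_true) + bce(distill_logit, y_teacher)  # equal weighting is a free choice
loss.backward()
# At serving time, only the hard head is needed; the auxiliary head just shapes the shared trunk.
```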
00:28:28.100 | I didn't have time to put the results here, but they find that this works well for YouTube.
00:28:32.580 | And then the thing is that the teacher model is useful.
00:28:35.060 | So what they did is that they amortized the cost by having a big fat teacher model. And
00:28:40.260 | by big fat teacher model, I mean, it's only two to four X bigger.
00:28:44.100 | By having a teacher model that's two to four X bigger, this teacher model will just keep pumping
00:28:48.580 | out the soft labels that all the students can learn from. And this makes all the students better.
00:28:53.460 | And of course, why do we want students? If you're saying that teacher model is better,
00:28:56.420 | why do we want students? We want the students because the student models are small and cheap.
00:29:00.580 | And at YouTube scale, where they have to make a lot of requests, this is probably what they need to do.
00:29:05.140 | Another approach, which is from Google, and I think they applied this in the YouTube setting as well,
00:29:12.660 | is called self auxiliary distillation. So the intuition here is this, don't look at the image first.
00:29:19.380 | So intuition here is this, they want to prioritize high quality labels and improve the resolution of
00:29:24.340 | low quality. What does it mean to improve the resolution of lower quality labels? Essentially,
00:29:29.220 | what they're saying is that if something is impressed, but not clicked, we should not treat that as a label
00:29:36.900 | of zero. Instead, what we should do is to try to get the teacher to predict what that label is,
00:29:42.820 | to smoothen it out. So if you look at the image, you can see that they have ground truth labels,
00:29:48.660 | which is those in green, and they have teacher predictions, which is those in yellow.
00:29:52.020 | So to combine a hard label with the soft label, they suggested a very simple function,
00:29:57.300 | I don't know if that's what they actually use, but essentially the max of the teacher and
00:30:02.740 | the ground truth. So imagine if the actual label was zero,
00:30:09.540 | and the teacher said that, you know, it's a 0.3, you just use the 0.3. Or if the actual label is one,
00:30:14.660 | and the teacher says 0.5, you just take the one. So by smoothing the labels like that, and then having
00:30:20.020 | the student learn on the auxiliary head, right, you are actually able to improve the teacher
00:30:27.540 | model itself and use it for serving.
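(A small sketch of that smoothing step; the max is what the description above suggests, not necessarily exactly what ships, and the two-head setup mirrors the previous sketch.)

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def self_aux_distill_loss(hard_logit, aux_logit, y_true, teacher_prob):
    """Auxiliary head learns a smoothed target: max(ground truth, teacher prediction)."""
    smoothed = torch.maximum(y_true, teacher_prob)  # 0 with teacher 0.3 -> 0.3; 1 with teacher 0.5 -> 1.0
    return bce(hard_logit, y_true) + bce(aux_logit, smoothed)

y_true = torch.tensor([0.0, 1.0])
teacher_prob = torch.tensor([0.3, 0.5])
loss = self_aux_distill_loss(torch.zeros(2), torch.zeros(2), y_true, teacher_prob)
```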
00:30:33.780 | So there's a lot of distillation techniques here, which I think are quite inspired by what we see
00:30:41.940 | from computer vision and language models. I haven't seen too many of these distillation techniques
00:30:45.700 | myself in the field of recommendations, which I thought were pretty interesting. The last one, and unfortunately this is the last one I have slides for, I can go through
00:30:51.140 | the other recommended reads I have, but unfortunately I didn't have slides to do for it,
00:30:56.980 | is this one. So this is quite eye-opening for me. Essentially what LinkedIn did was they replaced
00:31:07.700 | several ID-based ranking models with a single 150B decoder-only model. What this means is that, for
00:31:16.900 | example, you could replace 30 different logistic regressions or decision trees or neural networks
00:31:22.660 | with a single text-based decoder-only model. This model is built on a Mixtral MoE, right,
00:31:30.500 | that's why it's approximately 150B. And it's trained on three to six months of interaction data.
00:31:36.180 | And the key thing, so you may think, okay, decoder-only model, what does it mean?
00:31:40.500 | Will you write posts for me? Will you write LinkedIn posts for me? Will you
00:31:43.780 | update my job title, whatever? The focus here is solely binary classification:
00:31:50.420 | will the user like or interact with a post, or apply for a job?
00:31:56.340 | So you can imagine that this model probably only needs to output like or not like. It's probably more
00:32:03.220 | complex than that. But essentially, this is a big fat decoder-only model that is very good at
00:32:07.140 | binary classification. That's why it's able to actually do well. And that's how they were evaluated.
00:32:14.180 | So there are different training stages. And over here, I think maybe it's better for me to go over,
00:32:18.500 | go into the actual write-up itself because I just didn't have time to
00:32:26.980 | share this. So they have continuous pre-training. So continuous pre-training, they just take member
00:32:33.540 | interactions on LinkedIn, different LinkedIn products, right? And then your raw entity data,
00:32:38.260 | essentially just take all this job-related, job hunting-related data to pre-train the model,
00:32:43.860 | to help the model get some idea of what is the domain. After continuous pre-training,
00:32:51.380 | they do the post-training approach. They do instruction tuning. So essentially, this is like
00:32:56.900 | training the model for instructions. They follow, they use UltraChat and internally generated instruction
00:33:04.260 | following data, right? So get LLMs to come up with questions and answers, relevant LinkedIn tasks,
00:33:08.820 | and then try to find high-quality ones. So that's training it, fine-tuning it to follow instructions.
00:33:14.020 | And then finally is supervised fine-tuning. They say a lot of things about a multi-turn chat
00:33:18.980 | format. But essentially, the goal for supervised fine-tuning is to train the model to do the specific
00:33:29.780 | task. I don't remember where it is exactly, but it's like, ah, so this is the specific task.
00:33:36.820 | So now that we know it can follow instructions, now let's make it better at the specific task:
00:33:42.260 | what action would a member take on this post? Would it just be an impression? Would it be
00:33:48.420 | a like? A comment? Etc. So that's how they go through the different stages. And I'm going back to my slides.
00:34:03.540 | Okay. So they have these three different stages. And so here's the crazy thing. You can see the slides,
00:34:15.140 | right? Can someone just say yes? Yes. Okay. The crazy thing is that they have now replaced feature
00:34:23.220 | engineering with prompt engineering because of this unified decoder model. So you can broadly read it. It's like,
00:34:30.900 | this is the instruction. Here's the current member profile, software engineer at Google. Here's their
00:34:36.020 | job. Here's their resume. Here are the things that they have applied to. So will the user apply to this job?
00:34:44.500 | And the answer is apply. And you can probably simplify this into a one or zero, right? I guess they just
00:34:50.180 | say it in the text as an example, but that's all this model is doing. For a user, we have retrieved several
00:34:57.220 | jobs. This model is doing the final pass of which ones to rank. And they take the log probs of the output
00:35:04.020 | to score it. So essentially, if this says that the member will apply, maybe you have 10 jobs that a
00:35:09.780 | member might apply to. Then we take the log probs to rank them. I don't know if this is a good thing or a bad thing.
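(A hedged sketch of the prompt-as-features idea. This is my reconstruction, not LinkedIn's actual prompt, model, or serving stack; the OpenAI client is just a stand-in for any decoder-only model that exposes log probabilities.)

```python
import math
from openai import OpenAI

client = OpenAI()

def apply_logprob(member_profile: str, job: str) -> float:
    """Score one (member, job) pair by the log-prob the model assigns to 'Apply'."""
    prompt = (
        "Instruction: predict whether the member will apply to the job. Answer Apply or Skip.\n"
        f"Member profile: {member_profile}\nJob: {job}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return next((t.logprob for t in top if t.token.strip().lower().startswith("apply")), -math.inf)

member = "Software engineer at Google, 5 years of backend experience"
jobs = ["Staff engineer, fintech startup", "Barista, coffee shop"]
ranked = sorted(jobs, key=lambda j: apply_logprob(member, j), reverse=True)
```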
00:35:16.260 | I find feature engineering more intuitive than prompt engineering, but maybe it's a skill issue. But
00:35:22.020 | essentially now, all PMs can engineer their own features. The impressive thing was, is that this
00:35:31.700 | can support 30 different ranking tasks. That's insane. So now, instead of 30 different models,
00:35:38.820 | you just have one big fat decoder model. That sounds a bit crazy to me. Firstly, it's crazy impressive.
00:35:46.260 | Secondly, it's a lot of savings. Thirdly, I don't know how to deal with the alignment tax, or maybe it's
00:35:51.620 | just a do-no-harm tax. I don't know. Essentially, the goal of RecSys was to decouple everything, right?
00:35:57.940 | It's like, have retrieval be very good at retrieval, have ranking be very good at ranking. And then
00:36:02.740 | each model just squeezes as much juice as we can. Now, what this is saying is that, okay, we're going
00:36:08.660 | to unify. We have too many separate ranking models. We're going to unify into a big fat model and then
00:36:15.220 | push all the data through it. And hopefully, you'll outperform. And it does outperform. It needs a lot
00:36:20.260 | of data. So you can see it in the graph there, right? Up to release three, it was not better than
00:36:27.620 | production. And you can see that based on the axis on the left, which is the gap versus production. Zero
00:36:33.220 | means that it's on par with production. Up to release three, it was not better. I mean, I don't know who
00:36:37.540 | had to get whatever budget, or had to quarterback this, to make sure this worked, to push it through.
00:36:46.340 | But as they add more and more tokens, it starts to get better than production, like a 2.5% increase. So this is
00:36:57.300 | a huge leap of faith: okay, we'll just go with the bitter lesson. Just give us more data,
00:37:04.900 | we will outperform, um, with a single model. Um, okay, so that's it. That's all I had to share.
00:37:11.860 | Um, I want to just briefly highlight two other papers, which I think are
00:37:19.300 | good. Um, they're a little bit less connected to LLMs, but these two papers are very good because of
00:37:26.900 | how deep they go into their system architecture. The first one is Etsy. Um, Etsy, you can see this, this is extremely
00:37:33.780 | complicated. Uh, but this really shows you a very realistic and practical approach, right? Classic two-tower
00:37:40.900 | architecture. They share about negative sampling and then talk about product quality, right? The thing
00:37:46.020 | is, an item can have very good, very polished images, but when people buy it, they return it.
00:37:52.660 | Um, you will never be able to detect that if you're just using content. So what they did was they
00:37:57.780 | actually have a product quality embedding that they use to augment, um, their approximate
00:38:04.820 | nearest neighbor index, right? So you can see the quality vector. Uh, this is extremely pragmatic
00:38:11.540 | and I can tell you that not, uh, no, no e-commerce website or no search engine, search, whatever online
00:38:18.260 | discovery thing can do without some form of quality vector or some kind of post quality filtering. We
00:38:24.820 | saw that with indeed, right? Expected bad match. They need the quality. They just, uh, operationalize it
00:38:29.540 | in a different way as a final post filtering layer over here. They include it in the approximate nearest
00:38:34.900 | neighbors index. So I highly recommend reading this, um, very practical, uh, shares a lot of detail into
00:38:41.460 | their system design. Uh, the next one I also highly recommend is the model ranking platform at Zalando.
00:38:46.900 | I think this is all the best practices, uh, talks about all the different tenets, like composability,
00:38:52.020 | scalability, steerable ranking, and they really go deep into, hey, you know, here's the candidate
00:38:56.260 | generator, essentially the retrieval step, a two-tower model. And then, you know, they just use an
00:39:02.340 | ANN to retrieve. And then they talk about the ranker and then finally the policy layer. What is this policy
00:39:07.860 | layer, right? The policy layer encourages exploration, business rules, like previously purchased items, item
00:39:13.380 | diversity, again, some, some measure of quality that the model would never be
00:39:20.340 | able to learn from the data. The model will never learn that showing good items is good, right? Because
00:39:24.260 | they're untested. So you have to override the model with this policy layer. Um, and of course, very good
00:39:30.260 | results. Uh, but what I really like about this paper is that if you want to learn about system design
00:39:36.020 | for RecSys, the Zalando paper and the Etsy paper are really, really good and really in depth.
00:39:42.980 | Um, but of course everything here is, uh, very good. If there were a few papers I read, I'm like, okay,
00:39:48.340 | this is pretty crap. I wouldn't include it. Uh, but every paper here is pretty good for system design.
00:39:52.900 | Um, under the final section, which is unified architectures. Um, okay. Any, I spoke a lot,
00:40:00.980 | any questions, they'll lose anyone. They'll lose everyone. Uh, Eugene, I have one quick question to
00:40:10.020 | double check on the LinkedIn paper. Hmm. My understanding is the big model, the 150
00:40:17.860 | billion model, is actually used as a teacher model, and then the knowledge is distilled into smaller models
00:40:24.500 | that are then used for the different tasks. So I don't know if that aligns with your understanding, because
00:40:30.580 | practically, serving a 150 billion model for prediction, the latency will not be acceptable and
00:40:38.020 | too costly also. They do actually have a full paper discussing more about how
00:40:45.220 | knowledge distillation is happening with that 150 billion model. I kind of put it in the, in the chat.
00:40:51.860 | I don't know if you have come across that paper.
00:40:53.860 | I have not. Um, but my impression was that they were actually using it. Thank you for sharing this
00:40:59.540 | and thank you for correcting my misunderstanding. I need to look deeper into the original paper.
00:41:04.260 | Um, um, to confirm this. Let me tag you in this thread. That's my understanding.
00:41:12.020 | So, but I also find this, uh, the, the approach very interesting. So we were actually thinking of
00:41:18.980 | similar kind of approach and as well, but actually they kind of proved that this, this way kind of works.
00:41:24.980 | Yeah. Thank you. I, I think that, I think that could probably be it. I think there's no way for it to be
00:41:30.900 | feasible to serve it at that scale. Uh, I think you're probably right. I, I don't know. I, I didn't see
00:41:38.020 | anything in the original paper that actually suggests that. Um, but I, I think you're right. That's,
00:41:43.380 | there's just no way to serve it at scale.
00:41:46.100 | Yeah. So I tagged you in this thread. Maybe, maybe I can,
00:41:52.820 | I have the paper. Thank you. I've added it to my reading list to confirm.
00:41:57.700 | Yeah. Yeah. Thanks. Thanks for explaining this. Just to double check. Thank you.
00:42:03.940 | Any other questions?
00:42:08.980 | I mean, so, um, one thing that I tried to look for and I found myself doing, but I might as well ask the,
00:42:22.180 | pre-trained LLM, uh, that is you, uh, is to rank, um, what is highest, uh, you know, the,
00:42:30.980 | the lowest hanging fruit versus the higher ones. Um, so for example, right, you, the way that you
00:42:37.380 | organize your, at your write-up was four sections. It was model architecture, data generation, scaling laws,
00:42:46.180 | and then unified architectures. Um, why? Was there a particular reason? Is there an order, right? Like,
00:42:53.780 | to me, it was very clear that model architecture is basically useless. Is that true? Would you,
00:42:58.340 | would you recommend? I don't think so. I actually think that, um, I think the model architecture right
00:43:05.140 | now, it's like a little bit more like, um, you know, meters dilemma. Um, I would say that in 2023
00:43:11.860 | is useless. Um, I haven't seen good results. Now I'm seeing good results. Um, would you classify this
00:43:18.420 | YouTube one as a good result? Because I was, I read it and I was like, wait, like this, I don't know,
00:43:23.300 | you know, like it's, these are smart ideas. And then the, then the results are like, uh, you know,
00:43:29.300 | doesn't, doesn't really outperform our baseline. I, I, I think it's decent results. Um, so, and, uh,
00:43:36.260 | coincidentally, after I published this, I think after it made the rounds on Hacker News, people from
00:43:41.220 | YouTube actually reached out. One of the authors on this exact paper reached out. Um,
00:43:45.540 | they wanted me to like go in and chat with them. Uh, and then they were like, we have more papers.
00:43:50.580 | We're pushing to publish them. And this is a perennial problem, right? Especially for YouTube and TikTok,
00:43:55.700 | right? Um, new videos get uploaded all the time. They have to deal with it; cold start is their bread and
00:44:02.500 | butter. So I wouldn't be surprised that they are focusing so hard on content embeddings for
00:44:07.460 | cold start. This is for, for a new video, but not, I mean, they have users, uh, they have user histories
00:44:14.740 | and they saturated the world on that. I think very likely. So, right. Um, you can imagine YouTube,
00:44:22.420 | Twitter, I mean, it's unmentioned here, but ads, Google ads, it's always cold start; being able to
00:44:29.540 | crack this cold start problem, just even 0.1%, is huge. It is huge. And you can see a lot of the
00:44:36.020 | papers in this: semantic IDs, YouTube, Kuaishou, which is like TikTok. Um,
00:44:41.940 | Huawei, this one, I'm not very sure why they did this. Um, yeah, a lot of it, Kyle Rack also,
00:44:47.060 | uh, solving cold start. So yeah. Okay. But like, you know, orders of magnitude, um, which takes
00:44:57.700 | orders of magnitude? I really think that the low-hanging fruit right now is really using LLMs for data
00:45:02.260 | generation. Right. Yeah. That's my, yeah. I mean, obviously that is the, I think everyone can do this
00:45:08.020 | now. And you know, the expected bad match paper, um, Indeed did it, right. I actually did this, uh,
00:45:14.820 | last year, something very similar. I did this last year. Uh, it got published internally. This is very,
00:45:21.300 | very effective. This approach of, um, starting from somewhere, doing active
00:45:27.620 | learning, fine-tuning a model, more active learning. It really helps improve quality. Um, I was
00:45:34.180 | doing it in the context of LLM and hallucinations, but I can imagine doing this in terms of relevance,
00:45:39.780 | in terms of any level of measure of quality that you want to focus on, it will work very well.
00:45:44.580 | Okay. Um, and then of course, yeah. Then architecture. I would say data generation and then like model
00:45:52.420 | architecture and system architecture. Yeah. Oh, wait, actually, even the scaling laws part,
00:45:57.460 | there are some things that are very, uh, practical. Um, one example, which I didn't have time to go
00:46:03.860 | through, is this: basically LoRAs for recommendation. So what they did is they train
00:46:11.460 | a single model on all-domain data. You can imagine all domains, like fashion, e-commerce,
00:46:18.580 | furniture, toys, et cetera. Or it could be all domains like ads, videos, uh, e-commerce. And then after
00:46:26.340 | that, they have specific LoRAs for each domain. Um, and this works very well. So I, I definitely
00:46:34.900 | think that essentially right now it's not easy to learn from data across domains for recommendation
00:46:42.260 | systems. And, you know, correct me if I'm wrong, but I really
00:46:45.700 | think that you want to overfit on your domain. Um, you want to overfit and predict the next best thing
00:46:51.540 | for tomorrow. And that's it, period. I can overfit and just retrain every day.
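(A minimal sketch of the per-domain LoRA idea using the PEFT library; the base-model name and adapter settings are hypothetical.)

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# One base model trained on all domains; a small LoRA adapter per domain
base = AutoModelForCausalLM.from_pretrained("my-org/all-domain-recsys-base")  # hypothetical name

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
fashion_model = get_peft_model(base, lora_cfg)
# ...fine-tune fashion_model on fashion interactions only, then save just the adapter:
fashion_model.save_pretrained("adapters/fashion")  # a few MB, versus the full base model
```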
00:46:56.820 | I know we have a few questions here. Uh, Daniel asks, shouldn't the LinkedIn model combine with an
00:47:02.900 | information model that's used to generate? Yes, correct. Um, that's probably an upstream, uh,
00:47:08.660 | retrieval model. And then the LinkedIn model just does the ranking. So in a two-step process,
00:47:14.020 | you have a retrieval, the LinkedIn, the decoder model, I think it just does the ranking.
00:47:18.180 | Um, for LLM-based search retrieval, are there any papers that talk about the impact of query rewriting or prompt engineering?
00:47:24.020 | Also the sensitivity. Uh, I think we know that LLMs are sensitive to prompts, but I think they're
00:47:29.700 | increasingly less sensitive to prompts, uh, because they're just way more instruction tuned. Um, I'm not
00:47:35.220 | sure about LLM papers that talk about the impact of query rewriting. I think the only one that I
00:47:39.620 | have seen at least covered here is the one by Yelp. Uh, so I think that could be helpful. What's the
00:47:45.460 | process for keeping this hybrid models up to date and personalization, uh, keeping them up to date.
00:47:50.660 | That's a good question. I don't know if they actually need to be kept up to date. So if you look at the,
00:47:57.140 | if you look at this hybrid models, right, let's just take the, this, uh,
00:48:06.180 | semantic ID embedding, it actually uses a frozen VideoBERT. Similarly for Kuaishou, they use frozen
00:48:13.380 | Sentence-BERT, ResNet, and VGGish. So the content embeddings themselves don't need to be up to date. And the
00:48:20.100 | assumption is that, okay, content today is going to be the same as content tomorrow, and the same
00:48:23.220 | as content in one month. So that is not up to date, but what is learnable is the semantic ID
00:48:28.500 | embedding and a cluster embedding. Now for personalization, that's very interesting. Uh, that's the hard question.
00:48:33.220 | Right. And the personalization, I guess, how you include personalization is okay. After we learn
00:48:37.620 | the content, we also need to learn what the user is interested in. And that's how they have this two
00:48:41.780 | tower approach. And that's why you can see over here, there's this small layer, uh, which is multimodal
00:48:47.780 | interest intensity, right? Which is given a user and their past historical sequence. How can we tell what
00:48:53.380 | modality they're interested in? I think that's how they do personalization over here.
00:48:58.820 | Any other questions?
00:49:01.700 | If not, we can always ask for volunteers for next week's session.
00:49:08.180 | Anyone?
00:49:19.220 | No, I think this is really helpful. It just feels like RecSys is always these
00:49:34.340 | bundles of ideas. In the way that agents are bundles of ideas for LLMs, RecSys
00:49:44.420 | is also a bundle of ideas. And I think they are obviously converging.
00:49:52.020 | I definitely think so. And you can see examples, right? We learned item
00:49:58.580 | embeddings via word2vec. When people talk about graph embeddings, it's really just
00:50:02.260 | taking the graph, doing a random walk, converting that random walk into a sentence of item IDs,
00:50:06.500 | and just using word2vec. Similarly, for learning the next best action, GRUs, transformers, and BERT —
00:50:12.500 | it was very obvious they would work. So I think we will see more from the LLM space being adopted in RecSys as well.
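That graph-embedding trick is essentially the DeepWalk recipe: walk the item graph, treat each walk as a sentence of item IDs, and run word2vec over it. A toy sketch (the graph and item IDs here are made up):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_len=20):
    """Turn an item co-interaction graph into 'sentences' of item IDs."""
    walks = []
    for _ in range(walks_per_node):
        for node in graph.nodes():
            walk = [node]
            for _ in range(walk_len - 1):
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

# Edges = items that co-occur in user sessions (toy data).
g = nx.Graph([("item_1", "item_10"), ("item_10", "item_25"), ("item_25", "item_33")])
walks = random_walks(g)

# Treat each walk as a sentence and learn item embeddings with skip-gram word2vec.
model = Word2Vec(sentences=walks, vector_size=64, window=5, min_count=1, sg=1)
print(model.wv.most_similar("item_10"))
```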
00:50:19.540 | What is the link in your mind between re-rankers and RecSys?
00:50:28.900 | In RecSys, we have retrieval and ranking, right? So you can see over
00:50:40.740 | here — where is it — what we do is retrieval will retrieve a lot of top candidates; it's
00:50:50.900 | going to retrieve a hundred candidates, and then ranking is going to find the best five candidates,
00:50:55.460 | so you can focus on the best five. I think it's the same thing with retrieval and re-ranking
00:51:00.980 | as with retrieval and ranking in RecSys. What people call re-ranking in RAG is really just
00:51:08.900 | taking the retrieved results and then finding the best five. And, you know, Cohere has re-rankers for finding
00:51:13.220 | the best five for the LLM as part of the context. Yeah, to me it's a bit weird, right?
00:51:18.420 | The re-ranker models are being promoted as a way to just feed in your top-k results,
00:51:24.900 | and then they re-rank them, and somehow that is supposed to produce better RAG because
00:51:29.780 | the more relevant results are at the top. But without the context that RecSys has, for example
00:51:35.780 | user preferences and user histories and whatever, how can you have any useful re-ranking at all?
00:51:41.300 | I think there are some ways. So for example, take retrieval:
00:51:44.660 | you can imagine the most naive retrieval is really just BM25 or semantic search.
00:51:50.420 | Then you can imagine you have a lot of historical data on all these BM25 and
00:51:56.500 | semantic search results and all the associated metadata, which you probably cannot use at the
00:52:00.340 | retrieval stage because it's too expensive. And then you can just train a re-ranker —
00:52:03.380 | say, whether the author matches, or that this author usually looks for this kind of document —
00:52:09.220 | and then you can try to re-rank with it. I think it's possible.
00:52:13.460 | I haven't dived too deep into how re-ranking is done for RAG, but it's possible.
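A minimal sketch of that kind of re-ranker, assuming you have logged (query, candidate, user) examples with click labels. The feature names and data structures below are invented for illustration, and a gradient-boosted classifier stands in for whatever learning-to-rank model you would actually use.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def features(query, doc, user):
    """Illustrative metadata features for one (query, candidate) pair --
    signals that are too expensive to use at retrieval time but cheap
    to score over the top-100 candidates."""
    return [
        doc["bm25_score"],                               # first-stage score
        float(doc["author"] == query.get("author")),     # author match
        doc["historical_ctr"],                           # past click-through rate
        float(doc["doc_type"] in user["preferred_types"]),
    ]

def train_reranker(training_triples, labels):
    """training_triples: list of (query, doc, user); labels: 1 if clicked/relevant."""
    X = np.array([features(q, d, u) for q, d, u in training_triples])
    return GradientBoostingClassifier().fit(X, np.array(labels))

def rerank(reranker, query, user, candidates, k=5):
    """Score all retrieved candidates and keep the best k for the final context."""
    scores = reranker.predict_proba(
        np.array([features(query, d, user) for d in candidates]))[:, 1]
    return [candidates[i] for i in np.argsort(-scores)[:k]]
```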
00:52:21.220 | Oh, Apulantia, you had a hand raise.
00:52:22.740 | Yeah, and thank you so much, Swyx and Eugene. You guys are amazing. I'm a huge fan of you both.
00:52:28.100 | But the question I had, I guess, is: is it worthwhile for an organization to go and build
00:52:34.340 | this when you have something like Jina.ai, who, as we know, popularized a lot of work on deep search —
00:52:40.740 | not just internal retrieval but external search and retrieval — because they have the embedding models,
00:52:45.860 | the re-rankers, the deep search retrieval APIs, all unified? How do you feel about that, Eugene and Swyx?
00:52:53.220 | Should teams build it themselves, or should they just buy something off the shelf?
00:52:58.740 | Yeah, Jina.ai has kind of this full stack that you're talking about, with all these fine-tuned
00:53:03.780 | CLIP models, embedding models, re-rankers. They have the deep search retrieval, and
00:53:08.980 | they've posted probably some of the better technical blogs on deep search that are out right now,
00:53:14.500 | and embedding models and re-rankers. Yeah, my answer is going to be probably very boring, and you can
00:53:19.460 | apply it to any question of whether someone should use an LLM off the shelf or fine-tune their own
00:53:24.100 | model. I think that for prototyping, just do whatever is fast, right? Demonstrate user value. It's going
00:53:30.340 | to get to a point in time where what you need does not fit what something off the shelf is going to
00:53:37.060 | do. And that's Indeed's story, right? For expected bad match, latency
00:53:47.460 | continued to be too high, and the only way to bring it down was to fine-tune their own model.
00:53:53.620 | Similarly, for retrieval, you can imagine they're going to provide a lot of out-of-the-box
00:53:58.340 | embeddings, and maybe it's going to be good enough. And then eventually, when you want to really squeeze more juice
00:54:03.540 | out, you probably need to go fine-tune your own embeddings. I know Replit recently
00:54:07.380 | shared something about fine-tuning their own embeddings — half the size, et cetera —
00:54:10.660 | and it does way better. There are a lot of examples here. I think Etsy also fine-tuned their own
00:54:14.820 | embeddings, and they outperform embeddings out of the box. I think it's a little bit of an unfair
00:54:21.140 | comparison — if you take those off-the-shelf embedding models and further fine-tune them,
00:54:25.700 | I think they could do better — but essentially the point is: just use something off the shelf to move fast,
00:54:30.820 | and then after that, if you do need to customize it, you customize it.
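For the "eventually fine-tune your own embeddings" step, here is a minimal sketch with sentence-transformers: start from an off-the-shelf model and adapt it on your own (query, item) pairs with in-batch negatives. The model name is a common public checkpoint and the training pairs are toy placeholders, not anyone's actual data.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from an off-the-shelf model, then adapt it to your own query->item pairs.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive (query, item) pairs mined from your own click/purchase logs (toy placeholders).
train_examples = [
    InputExample(texts=["ants in my house", "indoor ant killer bait traps"]),
    InputExample(texts=["mid century couch", "walnut mid-century modern sofa"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # uses in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("my-domain-embeddings")
```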
00:54:36.180 | I love it. Thank you so much. Super helpful. You're welcome.
00:54:39.460 | Sorry, we have a question. Let's go ahead. And I think this is probably my last question.
00:54:49.540 | Okay. What do you think is the biggest opportunity for
00:54:54.820 | applying LLMs in the recommendation domain? Because we're discussing retrieval,
00:54:58.900 | ranking, content understanding, etc., right? There are so many different prediction tasks.
00:55:04.500 | You're asking me what I think is the big opportunity?
00:55:12.100 | Is that your question? Yeah. I think embeddings — what I've seen is that embeddings
00:55:20.900 | are helpful for retrieval. So instead of purely keyword retrieval, like "ant killer", someone might
00:55:26.260 | just ask, "I have ants in my house", and I think a semantic embedding could help you match
00:55:32.100 | that. And I think ranking will clearly work using an LLM-based ranker —
00:55:39.460 | I think the LinkedIn example shows it can clearly work. And of course for search, I think
00:55:44.100 | there's this guy, Doug Turnbull, who's increasingly going down this route,
00:55:50.100 | and we've seen examples from Yelp, right? Using an LLM to do query segmentation, query expansion,
00:55:56.980 | query rewriting. It clearly works in Yelp's use case, and you can just cache all these results.
00:56:02.740 | So those are the three things I can think of off the top of my head.
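A rough sketch of that "LLM query rewriting plus caching" pattern, assuming the OpenAI chat completions API. The prompt, model name, and JSON keys are placeholders, not Yelp's actual setup, and a real system would use a shared cache such as Redis rather than an in-process dict.

```python
import hashlib
import json
from openai import OpenAI

client = OpenAI()
_cache = {}  # in production this would be a shared KV store keyed by normalized query

def rewrite_query(query: str) -> dict:
    """Ask an LLM to segment/expand/rewrite a search query, caching the result
    so repeated (head) queries never hit the model twice."""
    key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
    if key in _cache:
        return _cache[key]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content":
             "Rewrite the search query. Return JSON with keys: "
             "segments, synonyms, rewritten_query."},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    _cache[key] = result
    return result

# rewrite_query("ants in my house")
# -> {"segments": [...], "synonyms": ["ant killer", ...], "rewritten_query": "..."}
```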
00:56:08.740 | Okay. Thank you, everyone. I do need to drop. Maybe you can discuss what the next paper is.
00:56:13.220 | Maybe Swyx will talk about Moore's Law for AI every seven months, which I think is interesting.
00:56:17.540 | No, not that one — autoregressive image generation.
00:56:21.300 | Okay. All right. Bye. Bye. Bye. Bye. Thank you.
00:56:25.300 | Thank you, everyone. Take care.
00:56:26.580 | Thank you. Thank you.