
Improving RecSys and Search in the age of LLMs — Eugene Yan



00:00:00.000 | Okay, so I've been, I work in the field of recommendation systems and search, and every
00:00:10.420 | now and then I like to pop my head up to try to see what's going on.
00:00:13.760 | I think the recent trend for the past one or two years has been the interaction between
00:00:18.200 | the recommendation systems and search and LLMs.
00:00:22.300 | Let's call it RecSys and search and LLMs. And I think in early 2023,
00:00:29.760 | we would see some papers where people used decoder-only models to try to predict IDs.
00:00:34.720 | Those didn't work very well.
00:00:36.680 | But at the end of last year and early this year, we're starting to see signs of life whereby
00:00:40.060 | some of these actually are A/B tested and have very good empirical results.
00:00:43.760 | So I want to highlight a few of those patterns and we can see how it goes.
00:00:49.120 | The one thing that, for the longest time, we've been trying to do for RecSys and search is to
00:00:53.060 | slowly move away from item IDs.
00:00:55.740 | You can imagine if Eugene interacts with item IDs 1, 10, 25, his next predicted interaction
00:01:01.980 | is probably item ID number 33.
00:01:04.660 | But all of this is relying solely on item IDs, and you can imagine every time that you have
00:01:08.500 | a new item, you have to learn a new embedding for it, and that leads to a cold-start problem.
00:01:13.700 | So I know this was not part of my recommended reads in my write-up, I will actually go back
00:01:18.340 | and update it to be recommended reads, but I want to discuss two papers here to try to address
00:01:22.620 | this.
00:01:23.620 | The first one is semantic IDs, which is from YouTube.
00:01:26.260 | You can imagine YouTube has a lot of new videos all the time, they can't learn new item IDs for them
00:01:33.540 | all the time.
00:01:34.540 | So what they do is they have a transformer encoder, it generates dense content embeddings.
00:01:38.740 | This is actually just a video encoder that converts a video into dense content embeddings.
00:01:42.860 | And then they compress this into what they call a semantic ID.
00:01:46.180 | It's via a residual-quantized autoencoder, an RQ-VAE.
00:01:47.780 | So the dense video embedding is 2048-dimensional, but what they do is they take the embedding, find
00:01:53.260 | the nearest neighbor in a codebook (the codebook also has 2048 entries), assign it to that nearest code,
00:01:59.340 | take the residual, find the nearest neighbor in the next codebook, assign it.
00:02:02.180 | So it just keeps compressing and compressing it.
00:02:04.580 | In the image here, they have four layers.
00:02:07.860 | So essentially you can compress an item into four integers; in the paper,
00:02:12.580 | they actually compress an item into eight integers.
00:02:15.140 | So I thought this was pretty cool.
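(A minimal sketch of the residual quantization idea, assuming already-trained codebooks; in the paper the codebooks are learned jointly inside the RQ-VAE, and the sizes below are toy numbers.)

```python
import numpy as np

def semantic_id(embedding: np.ndarray, codebooks: list) -> list:
    """Greedily quantize a dense content embedding into one integer per codebook level."""
    residual = embedding.copy()
    codes = []
    for codebook in codebooks:                      # codebook: (num_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                 # nearest code to the current residual
        codes.append(idx)
        residual = residual - codebook[idx]         # carry the leftover to the next level
    return codes

# Toy usage: 4 levels of 256 codes over a 64-d embedding -> an item becomes 4 integers
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]
print(semantic_id(rng.normal(size=64), codebooks))
```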
00:02:16.580 | And now that you have an item, you have a content embedding, and you've converted it into eight integers.
00:02:23.620 | How do you then learn on it, right?
00:02:25.620 | They tried using n-grams and SentencePiece.
00:02:28.900 | N-grams are really just, you know, like fastText character n-grams, you know, every subword:
00:02:34.500 | every one, two, or three consecutive codes gets its own learned embedding.
00:02:37.620 | And then they also tried using SentencePiece.
00:02:39.220 | Essentially, SentencePiece is really just looking across all of these semantic IDs.
00:02:44.100 | What are the most common subwords, the most common sub-sequences of codes?
00:02:48.740 | So therefore, it's no longer just unigrams, bigrams, and trigrams.
00:02:53.860 | It's that you can learn variable-length subwords.
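(A minimal sketch of the n-gram variant over one semantic ID; each n-gram would get its own learned embedding, fastText-style, while SentencePiece would instead be trained on a corpus of these code sequences to discover variable-length pieces.)

```python
def semantic_id_ngrams(codes, n_max=3):
    """fastText-style n-grams over a semantic ID, e.g. codes = [513, 72, 1990, 8]."""
    tokens = [f"c{c}" for c in codes]
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams

print(semantic_id_ngrams([513, 72, 1990, 8]))
# ['c513', 'c72', 'c1990', 'c8', 'c513_c72', ..., 'c513_c72_c1990', 'c72_c1990_c8']
```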
00:02:55.780 | What are the results from this?
00:02:58.260 | Well, not surprisingly, dense content embeddings by themselves do worse than item IDs, right?
00:03:08.180 | And you can see this, right?
00:03:10.180 | You can see this on the chart on the left here, right?
00:03:13.860 | You can see unigram and bigram, the red line and the purple line.
00:03:21.140 | The unigram is actually worse than the random hash item ID, the orange line, to some extent.
00:03:27.860 | Oh, actually, no, I didn't include the--
00:03:31.220 | Okay, I have the content embeddings line itself in my write-up.
00:03:36.100 | I didn't include it here.
00:03:37.300 | But this chart here is actually trying to show that if they use the dense content embeddings directly, it's crap.
00:03:42.020 | But when they use both n-grams and SentencePiece, it did better.
00:03:45.620 | So you have to do this trick whereby you convert that full dense content embedding into its semantic ID and then learn on those IDs.
00:03:55.540 | Now, the benefit of this, you might be saying, hey, you know, isn't this all back to IDs again?
00:03:59.540 | Well, not necessarily, because now, given a piece of content, you can convert it to an embedding and then assign it to its nearest semantic ID.
00:04:05.780 | And therefore, you don't need to learn it from behavioral data.
00:04:07.860 | So that's the benefit here.
00:04:09.860 | Similarly, Kuaishou, which is like the number-two TikTok in China, adopted the same approach.
00:04:18.500 | They use multimodal content embeddings.
00:04:21.780 | So they use embeddings from ResNet, Sentence-BERT, and VGGish to get the respective modalities.
00:04:27.380 | And then these are just simply concatenated into a single vector.
00:04:29.780 | Then they take all these embeddings.
00:04:32.420 | They just do k-means to identify a thousand clusters.
00:04:36.660 | Right.
00:04:37.540 | So now each item maps to a cluster ID, which becomes a trainable ID.
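(A rough sketch of that clustering step, with made-up embedding sizes; the real embeddings would come from the frozen per-modality encoders.)

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical frozen per-modality embeddings for each item in the catalog
rng = np.random.default_rng(0)
visual = rng.normal(size=(10_000, 64))
text = rng.normal(size=(10_000, 32))
audio = rng.normal(size=(10_000, 16))
multimodal = np.concatenate([visual, text, audio], axis=1)  # simple concatenation

# Cluster once offline; the 1,000 cluster IDs become the trainable vocabulary
kmeans = KMeans(n_clusters=1000, n_init=10, random_state=0).fit(multimodal)

# A brand-new item only needs its content embeddings to get an ID, no behavioral data required
new_item = rng.normal(size=(1, multimodal.shape[1]))
cluster_id = int(kmeans.predict(new_item)[0])
```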
00:04:40.980 | These cluster IDs are then embedded via a multimodal encoder.
00:04:45.380 | So this multimodal encoder, there's quite a bit to it.
00:04:48.420 | Let me try to simplify it.
00:04:49.780 | You can see on the top, it says non-trainable visual embeddings.
00:04:53.940 | In this example, they only show the visual embedding.
00:04:56.820 | But you can imagine the same for all the non-trainable embeddings.
00:05:00.180 | They take it and they project it into a different space via the mapping network.
00:05:05.140 | Secondly, for every cluster ID, they convert it into learned embeddings for visual, textual, and acoustic.
00:05:11.700 | These are not the original embeddings that come from the content.
00:05:14.020 | These are just the representation of the multimodal cluster ID.
00:05:18.020 | And then fusion is really just concatenating them.
00:05:20.420 | So now you might be thinking, how is this multimodal encoder learned?
00:05:24.180 | How is this multimodal encoder trained, right?
00:05:26.180 | This multimodal encoder is not trained on its own.
00:05:28.180 | It is trained within the overall ranking network,
00:05:32.580 | and you can see the multimodal encoder at the bottom, right?
00:05:35.380 | So this ranking network takes the user tower, which is on the left,
00:05:40.980 | and the item tower that's on the right, and it tries to predict the likelihood that a user will click or like.
00:05:45.380 | Therefore, based on this, they just backprop.
00:05:48.900 | You backprop the likelihood of clicking or liking or following, and you backprop it through the multimodal encoder.
00:05:55.220 | And that's how the multimodal encoder learns the mapping network and the cluster-ID embeddings.
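(A rough sketch of how that end-to-end training could look; the tiny towers, dimensions, and simple fusion here are my simplification, not the paper's actual architecture.)

```python
import torch
import torch.nn as nn

class TinyTwoTower(nn.Module):
    def __init__(self, n_clusters=1000, dim=32):
        super().__init__()
        self.cluster_emb = nn.Embedding(n_clusters, dim)   # learned cluster-ID embeddings
        self.map_frozen = nn.Linear(128, dim)               # "mapping network" for frozen content vectors
        self.user_tower = nn.Sequential(nn.Linear(64, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.item_tower = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, user_feats, frozen_content, cluster_id):
        item = torch.cat([self.map_frozen(frozen_content), self.cluster_emb(cluster_id)], dim=-1)
        return (self.user_tower(user_feats) * self.item_tower(item)).sum(-1)  # click/like logit

model = TinyTwoTower()
logits = model(torch.randn(8, 64), torch.randn(8, 128), torch.randint(0, 1000, (8,)))
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (8,)).float())
loss.backward()  # gradients flow into cluster_emb and the mapping network
```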
00:06:00.740 | So the benefit of this is that they shared that it outperformed several multimodal baselines.
00:06:04.420 | I won't go through them here.
00:06:05.460 | And when they did A/B testing, I think those are pretty freaking significant numbers.
00:06:11.060 | Anything more than 1% is pretty strong in a platform like this.
00:06:17.060 | And the benefit of this is that they mentioned that they had increased cold start velocity and cold start coverage.
00:06:21.540 | It means that, you know, cold start is able to pick up faster.
00:06:24.340 | If it's a good cold start video, it's able to pick up faster.
00:06:27.460 | And they are also able to show more cold start, 3.6% more cold start content, which increases coverage.
00:06:35.300 | So they also did the ablation studies.
00:06:37.540 | So let me butt in: for those new to the RecSys world, you said anything more than 1% is a big deal.
00:06:45.540 | Pretty huge.
00:06:46.500 | Can you contextualize, like how much is that worth or is that like...
00:06:51.140 | So you can imagine, right?
00:06:52.660 | Imagine you are making, I don't know, let's just make something up.
00:06:55.940 | A billion dollars worth of ads.
00:06:57.460 | Yeah.
00:06:58.180 | Right.
00:06:58.740 | And if people are engaging more, spending 1% more time, you can show 1% more ads.
00:07:03.620 | That's like 10 million.
00:07:05.780 | A million dollars.
00:07:06.660 | Right.
00:07:07.140 | So you just expand that, right?
00:07:08.660 | Of course, clicks and likes and follows.
00:07:10.420 | This is, these are engagement, engagement proxy metrics.
00:07:15.940 | Okay.
00:07:16.420 | So like, and is this, is this absolute or relative?
00:07:19.860 | So for example, are we saying that, let's say likes was plus three right here.
00:07:23.940 | Are we saying they went from six to 9% or are we saying they're currently 6% and now we are 6.09%.
00:07:31.860 | I suspect it's relative.
00:07:34.020 | I suspect it's like maybe from 5% to 5.15%.
00:07:37.780 | going from 6% to 9% is impossible.
00:07:41.700 | You wouldn't have to work.
00:07:43.540 | Yeah.
00:07:44.820 | Okay.
00:07:45.780 | All right.
00:07:46.100 | So no surprises here: using multi-modality consistently outperforms single-modality features.
00:07:54.420 | But there was also a trick here whereby they had to learn user-specific modality interests.
00:07:59.380 | And if you look at it on the left tower there, close to the top, there's this thing called multi-modal interest intensity learning.
00:08:09.380 | Essentially what they're learning there is for each user, which modality they are interested in.
00:08:14.340 | And then they actually map to that.
00:08:15.460 | Some people, like swyx, are very acoustically inclined.
00:08:18.100 | He might care more about the soundtrack.
00:08:19.620 | For other people, they might care more about the visuals or the video itself.
00:08:24.180 | Or the text, having a good caption is worth a thousand words, maybe.
00:08:27.380 | So yeah.
00:08:28.740 | So that's one trend I've seen, which is increasingly including more content into the model itself.
00:08:38.020 | The other trend that I've seen is using LLMs for synthetic data.
00:08:43.140 | And there's two papers that I really like here because they share a lot of details.
00:08:48.660 | And they share a lot about the pitfalls that they faced.
00:08:52.100 | So the first one is, this is paper I really like.
00:08:55.060 | It's called expected bad match.
00:08:57.060 | It's from Indeed.
00:08:57.940 | Essentially, you can imagine you are providing people with job recommendations, right?
00:09:03.060 | And then you want to have a final filter at the end to filter out bad job recommendations.
00:09:09.780 | So this paper, it's not easy to get the full access to this paper online.
00:09:14.980 | I've included it in our, we have a, in our Discord channel, we have a thread.
00:09:20.020 | I've actually dropped the PDF in there and they talk through the entire, I think they talk through the entire process.
00:09:24.980 | And I think it's quite a role-model process.
00:09:27.380 | They started with looking at 250 job matches, right?
00:09:32.100 | 250 job matches.
00:09:33.140 | And then they compared it across various experts.
00:09:36.020 | They have very rigorous criteria.
00:09:37.860 | And in the end, they built a final eval set of 147 matches that were very high confidence:
00:09:43.540 | multiple experts agreed and, you know, there was nothing subjective.
00:09:46.820 | Then they tried prompting the LLMs with recruitment guidelines, right,
00:09:50.980 | to classify job match quality.
00:09:52.980 | And of course, you know, they tried things like the cheap stuff and then they tried things like the expensive stuff.
00:09:57.060 | Unfortunately, the cheap stuff doesn't work.
00:09:59.060 | Only GPT-4 worked.
00:10:01.380 | But GPT-4 was so slow.
00:10:02.900 | GPT-4 took an average of 31 seconds, right?
00:10:06.580 | Right.
00:10:07.140 | Okay.
00:10:07.780 | No problem.
00:10:08.420 | OpenAI lets you fine-tune GPT-3.5.
00:10:11.300 | So what they did was they fine-tuned GPT-3.5.
00:10:15.300 | And GPT-3.5, you can see, let's just focus on the green boxes and then the red boxes.
00:10:22.660 | You can see the fine-tuned GPT-3.5 is able to achieve almost as good precision and recall as GPT-4.
00:10:31.620 | Fine-tuned, right?
00:10:32.340 | Fine-tuned on the labels of GPT-4.
00:10:34.100 | And GPT-3.5 reduced latency by two-thirds and cost by two-thirds, which is perfect, right?
00:10:43.300 | But the thing is, you see that the average latency there is like six seconds.
00:10:48.020 | And that's still not good enough for when they needed to do it online.
00:10:52.180 | So then what they did is they fine tune the lightweight classifier.
00:10:56.180 | Unfortunately, they didn't go into very many details of what this lightweight classifier is.
00:11:00.020 | I suspect that this lightweight classifier is maybe not a language model.
00:11:03.460 | I suspect that it is probably a decision tree because they talk a lot about categorical features.
00:11:09.380 | And then the labels are just solely the LLM-generated labels, right?
00:11:13.300 | So you can see their entire journey.
00:11:15.620 | First the eval set, then we test GPT-4: GPT-4 is good, but too slow.
00:11:19.380 | Then we try GPT-3.5 like step-by-step incremental progress, but still too slow, too expensive.
00:11:24.980 | Okay.
00:11:25.540 | We would really not like to have to train our own classifier and have to maintain the ops of retraining
00:11:31.860 | it, but we don't have a choice.
00:11:33.220 | This is what we need to do to reduce inference latency.
00:11:37.380 | So that's what they did.
00:11:38.260 | They were able to achieve an AUC-ROC of 0.86.
00:11:45.060 | That's pretty freaking good against the LLM labels.
00:11:48.100 | And this is, according to them, low latency.
00:11:50.340 | I don't know how low the latency is, but it's suitable for real-time filtering.
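(The paper doesn't say what the lightweight classifier actually is, so here's a hedged sketch of the general pattern: LLM-generated labels distilled into a cheap tree model over hand-built features. The features and data are made up.)

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical features for (job seeker, job) pairs, e.g. title match, distance, seniority gap...
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 8))
# Labels produced offline by the (fine-tuned) LLM, simulated here
llm_is_bad_match = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=50_000) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, llm_is_bad_match, test_size=0.2, random_state=0)
clf = HistGradientBoostingClassifier().fit(X_tr, y_tr)   # milliseconds at inference time
print("AUC vs. LLM labels:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```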
00:11:53.460 | The benefits of this are pretty tremendous, right?
00:11:56.020 | You can read the benefits there yourself.
00:11:59.060 | But I think the one big benefit is they lowered unsubscribed rates by 5%.
00:12:03.060 | That is huge.
00:12:05.300 | If you maintain some kind of push notification or email notification product,
00:12:10.980 | the unsubscribe rate is like your biggest guardrail.
00:12:14.100 | Because if people aren't subscribed, you're never, ever going to reach out to them again.
00:12:17.220 | You lose them.
00:12:19.220 | So, you know, all your customer acquisition costs is really down the drain.
00:12:23.300 | Like maybe you have an offer and you let people sign up.
00:12:26.020 | Hey, would you like to hear more about us?
00:12:27.540 | Okay.
00:12:27.940 | We give that out.
00:12:28.820 | But once they unsubscribe, you lose them.
00:12:30.580 | And so over here on the top line in Table 2, they share the results for the invite-to-apply email.
00:12:38.820 | That's one.
00:12:39.460 | And then that's the results I highlighted here.
00:12:41.780 | And in the bottom, I also see they had online experiments for the homepage recommendation feed.
00:12:47.300 | And that's how low latency this classifier has to be.
00:12:51.300 | It has to be on the homepage recommendations.
00:12:53.300 | And similarly, we see very good results, right?
00:12:58.180 | You can see it.
00:12:59.300 | For example, impressions, right?
00:13:00.500 | Impressions drop 5.1% at threshold of 15 and then drop by 7.95% at threshold 25.
00:13:08.820 | What does that mean?
00:13:11.460 | That means you freed up 5% to 8% of impressions.
00:13:15.620 | You can now show more good stuff.
00:13:18.740 | Right?
00:13:20.740 | That's huge.
00:13:21.780 | But freeing up 1/12 of impressions is a very big deal.
00:13:25.380 | Freeing up more space.
00:13:27.140 | And as we know, more real estate is better.
00:13:29.940 | So I think that this was quite an outstanding result.
00:13:34.100 | The other one I want to share and then we'll pause for questions, short questions before I go into two other sections, is query understanding at Yelp.
00:13:41.380 | So query understanding at Yelp was very nice.
00:13:43.940 | It's purely using OpenAI.
00:13:45.220 | They had two things.
00:13:47.540 | One is query segmentation and another one is highlights.
00:13:49.780 | The query segmentation one is not so straightforward to understand, but essentially, given a query like "Epcot restaurants", they can split it into different parts like topic, location, name, question, etc.
00:14:02.660 | And then by having better segmentation, they can have greater confidence in rewriting those parts of the query to help them search better.
00:14:12.900 | So the second bullet point gives you an example.
00:14:15.300 | If we know that the user's location is approximately there and the user said Epcot restaurants, we can rewrite the user's location from Orlando, Florida to Epcot for the search backend.
00:14:28.100 | And because the search backend is based on location, by rewriting Orlando, Florida to Epcot, they were able to get more precise results for the user.
00:14:37.620 | So that's one example.
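(A hedged sketch of what LLM-based query segmentation can look like; the schema, prompt, and model name are illustrative, not Yelp's actual setup.)

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def segment_query(query: str) -> dict:
    """Ask the model to split a local-search query into labeled segments."""
    prompt = (
        "Segment this local-search query into JSON with keys "
        "'topic', 'location', 'business_name', and 'question' (use null when absent).\n"
        f"Query: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

print(segment_query("Epcot restaurants"))
# e.g. {"topic": "restaurants", "location": "Epcot", "business_name": null, "question": null}
```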
00:14:40.340 | The other example is review highlights.
00:14:42.580 | And the original write-up, they have a lot of good images.
00:14:45.780 | I didn't include those images here because I didn't have time.
00:14:48.500 | I only started reading this like an hour before.
00:14:51.060 | One of this is review highlights.
00:14:53.620 | So imagine if you search for some food, maybe you search for vegetarian friendly Thai food, right?
00:15:01.860 | And then sometimes in the reviews, people would say things like vegetarian, veggie only, suitable for vegetarians.
00:15:08.980 | And then I'm sure that there are way more synonyms for this.
00:15:14.180 | In the past, they had to get humans to write these different synonyms.
00:15:17.460 | And then they add these to dictionaries.
00:15:18.660 | And you can imagine this is not scalable.
00:15:23.380 | But now they can use LLMs to replicate the human reasoning, right?
00:15:25.380 | In the phrase extraction.
00:15:25.940 | And they get way better coverage and it can cover 95% of traffic.
00:15:31.780 | So for query segmentation, they are able to understand the user intent a little bit better.
00:15:36.500 | And then for review highlights, because they were showing more reviews, especially for the long tail queries, it makes search more engaging.
00:15:46.260 | By highlighting the relevant reviews for a user's query, they help the user feel more confident about the food.
00:15:53.460 | Let's say it's vegetarian friendly.
00:15:55.380 | And then maybe the user review would really say something like, oh, the vegetarian food is great and delicious or something.
00:16:00.420 | Or like definitely no meat involved.
00:16:02.340 | It happens to be gluten free as well.
00:16:04.020 | I don't know.
00:16:04.580 | Things like that help make the users more confident in the search results that they're seeing.
00:16:09.060 | Okay, I'll pause here.
00:16:12.900 | Any questions?
00:16:14.820 | I know there's a lot in the chat.
00:16:16.820 | Oh my goodness.
00:16:17.460 | There's a quick, I mean, I have a quick question on just this query understanding thing.
00:16:22.260 | What was the previous state of the art in query segmentation?
00:16:25.220 | Like this seems like the most obvious, dumb possible thing to do.
00:16:28.260 | Named entity extraction.
00:16:32.260 | You get a span and then you train some kind of classic, you train some kind of transformer model
00:16:36.980 | that takes the input and then they'll cut it at characters.
00:16:40.340 | Yeah.
00:16:41.380 | Okay.
00:16:42.580 | So like it's basically, but like this guy did not compare it to NER, right?
00:16:46.420 | Like they, they, they mentioned that their original was NER and this is better.
00:16:51.620 | Okay.
00:16:53.060 | Nice.
00:16:53.380 | Yeah.
00:16:53.540 | They definitely, yep.
00:16:55.140 | I mean, everyone starts with some kind of NER based approach.
00:16:58.020 | My, I mean, my, my theory is that like, yeah, I, I basically,
00:17:01.540 | basically there's no point doing traditional NER anymore.
00:17:04.820 | You just do LLMs, uh, with some kind of schema.
00:17:08.740 | Could be.
00:17:09.380 | So fast.
00:17:10.260 | Yeah.
00:17:10.980 | NER is fast and cheap, uh, that's what you need for search, right?
00:17:15.300 | Uh, sorry.
00:17:17.860 | Great.
00:17:18.260 | But slow, right?
00:17:20.020 | Yeah.
00:17:20.260 | Um, like if you want a spell check or autocomplete thing, like a grammar tool, Gemini is slow.
00:17:27.380 | Oh, so this is why I might rather prefer, uh, present from this, um, where is it search query?
00:17:38.980 | So, um, um, they started their legacy models.
00:17:44.100 | Uh, their legacy models.
00:17:45.540 | Oh, go ahead.
00:17:46.340 | Yeah.
00:17:47.700 | So you can see named entity recognition, right?
00:17:50.580 | Um, they use named entity recognition, and, and they do this.
00:17:53.540 | Um, but then, oh, wow.
00:17:56.180 | People are drawing on this.
00:17:57.380 | I didn't know you could do that.
00:17:58.820 | Um, but then they actually shared, one thing I really like about this, uh, write-up is this,
00:18:04.340 | this, this chart right here.
00:18:05.540 | Now, this chart seems very, very basic: formulation, scoped task, proof of concept, scaling up.
00:18:11.140 | Um, but they, they wrote it very well to explain how they did it in the context of these two case
00:18:17.380 | studies.
00:18:17.780 | I feel like a lot of people just completely just, uh, drop this.
00:18:22.100 | They completely ignore this.
00:18:23.940 | And for query segmentation, right?
00:18:25.460 | Um, I know I'm taking a long time to get to the point,
00:18:28.420 | which is that 10% of queries make up 80% of traffic.
00:18:31.860 | So they could do all this query segmentation once, period.
00:18:35.940 | And then just retrieve from the cache.
00:18:38.500 | Um, derive a golden dataset, fine-tune.
00:18:43.540 | I can't remember where they wrote it, but that's how they did it for query segmentation and for review
00:18:47.780 | highlights.
00:18:48.420 | So essentially, a lot of these things may feel like they have to be done online,
00:18:53.380 | but because of the power law in e-commerce and online search,
00:18:56.900 | you don't have to, uh, you, you can make use of a cache a lot.
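(A minimal sketch of that pattern, with hypothetical function names and cache contents: precompute segmentation for head queries offline, serve from a cache, and only fall back to the expensive model for the tail.)

```python
# Precomputed offline for the head queries that dominate traffic (hypothetical entries)
HEAD_QUERY_CACHE = {
    "epcot restaurants": {"topic": "restaurants", "location": "Epcot"},
}

def llm_segment(query: str) -> dict:
    """Placeholder for the expensive LLM call (see the earlier segmentation sketch)."""
    return {"topic": query, "location": None}

def segment(query: str) -> dict:
    key = query.strip().lower()
    if key in HEAD_QUERY_CACHE:      # head queries: served from the cache for free
        return HEAD_QUERY_CACHE[key]
    return llm_segment(key)          # long tail: pay for the model (or a cheaper fallback)

print(segment("Epcot restaurants"))
```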
00:19:00.260 | Um, there's also another question from Tyler Cross.
00:19:06.020 | How do these approaches compare to more information retrieval methods like BM25 vector methods?
00:19:11.380 | Um, uh, Tyler, do you want to, what do you mean by, by these approaches?
00:19:16.900 | What approaches specifically?
00:19:18.180 | Oh, no, maybe Tyler's not on the...
00:19:22.820 | I think you said this when you were doing the LLM, like we tried 3.5,
00:19:26.180 | fine tune, 3.5, Llama 2, Nostro 11B.
00:19:28.420 | Ah, okay.
00:19:30.500 | I think that's like a classification.
00:19:31.940 | I think in this case, uh, if it's a classification approach and this wouldn't work.
00:19:35.860 | Right.
00:19:36.900 | Um, but if you're talking more generally, like using LLMs or embeddings, uh, for retrieval.
00:19:44.260 | Yeah.
00:19:45.540 | I actually don't know.
00:19:46.180 | I don't know the full, the full context of the question.
00:19:48.260 | So I probably better not answer.
00:19:49.620 | Um, okay.
00:19:50.740 | Any, any other questions?
00:19:51.860 | Okay.
00:19:55.940 | I think that's, that's great.
00:19:57.700 | I can move on because the other two sections are fairly heavy and I, and then, you know,
00:20:02.100 | we have more times, more time after that.
00:20:03.860 | Ooh, wait.
00:20:05.540 | I don't know if I'm sharing the right screen.
00:20:08.660 | Give me a second.
00:20:09.220 | Okay.
00:20:10.340 | First I need to go to my Google Chrome.
00:20:14.420 | Ah, see, this is what happens when you get a noob to do slides.
00:20:17.540 | Okay.
00:20:21.860 | Share.
00:20:22.660 | You are seeing this, right?
00:20:24.580 | You're seeing my slides.
00:20:25.460 | Are you seeing it in slideshow mode?
00:20:28.820 | Full screen.
00:20:29.460 | Okay.
00:20:29.700 | Perfect.
00:20:30.100 | Perfect.
00:20:30.980 | So then the other thing is what I'm calling LLM-inspired training paradigms.
00:20:34.980 | So maybe it's LLM inspired, maybe you've been doing this for a very long time, but I thought
00:20:38.660 | to highlight it.
00:20:39.700 | The first one is really looking at the scaling laws.
00:20:43.140 | And ever since I published, I shared about this post.
00:20:45.540 | Like people have gotten back to me with like at least three or four papers about other studies
00:20:48.740 | of scaling laws, but along a very similar, very similar view.
00:20:52.580 | So I want to talk about the experimentation that they did.
00:20:55.860 | This scaling-law study used a decoder-only transformer architecture.
00:20:59.060 | And they tried various model sizes.
00:21:00.900 | The training data is treated the same as sentences.
00:21:03.460 | Essentially it's fixed length sequences, 50 item IDs each.
00:21:06.740 | And the training objective is given the past 10 items, predict item number 11.
00:21:10.740 | Given the past 20 items, predict item number 21.
00:21:13.140 | So it's fairly straightforward.
00:21:14.660 | But they did introduce two key things that are very interesting.
00:21:20.180 | The first one is layer-wise adaptive dropout.
00:21:22.500 | So you can imagine, right, for LLMs, every single layer has same dimension.
00:21:28.660 | You know, usually when they draw a transformer layer, it's every single layer has same dimension.
00:21:33.060 | But for recommendation system models, that's not the case.
00:21:35.540 | It's usually fairly fat at the bottom and gets skinnier towards the top.
00:21:39.300 | So what they do over here is they have higher dropout in the lower layers and lower dropout in the higher layers.
00:21:43.780 | And ablation studies showed that this works.
00:21:46.420 | So the intuition here is that the lower layers process more direct input from the data.
00:21:51.140 | And because e-commerce data or recommendation data is fairly noisy, it's more prone to overfitting.
00:21:58.260 | Therefore, they have more dropout in the lower layers.
00:22:01.780 | Vice versa, at the upper layers, it learns from more abstract data.
00:22:06.260 | And therefore, you want to make sure it doesn't underfit.
00:22:09.140 | You want to make sure it gets all the juice you can get.
00:22:11.380 | And therefore, they have lower dropout at the higher layers.
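(The exact dropout schedule isn't spelled out here, so this is a minimal sketch of the idea: higher dropout near the noisy input, lower dropout near the top. The linear schedule and rates are assumptions.)

```python
import torch.nn as nn

def build_stack(layer_dims, p_bottom=0.3, p_top=0.05):
    """Hypothetical MLP stack with dropout decreasing from bottom to top layers."""
    n = len(layer_dims) - 1
    blocks = []
    for i in range(n):
        p = p_bottom + (p_top - p_bottom) * i / max(n - 1, 1)  # linear schedule, made up here
        blocks += [nn.Linear(layer_dims[i], layer_dims[i + 1]), nn.ReLU(), nn.Dropout(p)]
    return nn.Sequential(*blocks)

model = build_stack([1024, 512, 256, 128])  # fat at the bottom, skinny at the top
```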
00:22:16.100 | The other thing, which feels a little bit like black magic, is that they switch optimizers halfway through training.
00:22:22.580 | Firstly, they start with Adam and then they switch to SGD.
00:22:26.340 | The observation they had, you know, from running a full run with Adam and a full run with SGD,
00:22:30.420 | is that Adam is able to very quickly reduce the loss at the start,
00:22:35.780 | but then it slowly tapers off, whereas SGD is slower at the start but achieves better convergence.
00:22:40.500 | So they had to do these two tricks for their sequential models.
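(A minimal sketch of the optimizer switch; the switch point, learning rates, and toy model are made up.)

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
loss_fn = nn.MSELoss()
data = [(torch.randn(32, 16), torch.randn(32, 1)) for _ in range(100)]

switch_at = len(data) // 2                                   # a made-up choice of switch point
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # fast early progress
for step, (x, y) in enumerate(data):
    if step == switch_at:
        # hand the same parameters to SGD for better late-stage convergence
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```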
00:22:44.660 | What were the results?
00:22:46.820 | I mean, obviously, no, this is fairly obvious.
00:22:50.340 | Higher model capacity reduces cross-entropy loss.
00:22:53.140 | And this model capacity is model params excluding the ID embeddings.
00:22:59.380 | So it's purely just the layers themselves without the ID embeddings.
00:23:03.380 | And they were able to model this.
00:23:05.860 | If you look at the dashed line, the test-loss curve: they estimated that power-law curve from the blue dots.
00:23:14.900 | And they were able to fairly accurately predict where the red dots are going to be.
00:23:19.300 | And, you know, this is like the Kaplan-style and Chinchilla-style scaling laws.
00:23:25.140 | So essentially: given some smaller models, if we had a bigger model, how would it perform?
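(A small sketch of how such a fit works in practice; the data points are invented, only the functional form L(N) = a * N^(-b) + c is the standard one for these curves.)

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, b, c):
    """Loss as a function of model size: L(N) = a * N^(-b) + c."""
    return a * np.power(n_params, -b) + c

# Hypothetical (model size, test loss) points from small training runs
sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
losses = np.array([4.10, 3.85, 3.62, 3.45, 3.31])

(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 3.0), maxfev=10_000)
print("predicted loss at 1B params:", power_law(1e9, a, b, c))
```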
00:23:29.060 | The other thing was, oh gosh, these lines don't look correct.
00:23:33.860 | Okay, the red arrow does look correct.
00:23:36.740 | Everything else is fine.
00:23:38.020 | So over here, I think this is a very nice result, which is that smaller models need more data to achieve comparable performance.
00:23:46.980 | So over here on the left, you can see that there's a small model there on the orange line.
00:23:51.700 | It needed twice the amount of data compared to a bigger model to get similar performance, right?
00:23:59.540 | So the flip side of it is that, hey, you know, if you want highly performant models online, you're going to need a factor more data.
00:24:09.940 | Of course, this is also nothing unusual.
00:24:12.820 | This is something we know, but it's really nice to have someone have done the experiments and distill it into the results here.
00:24:19.140 | The other thing that I thought was really interesting is this idea of recommendation model pre-training.
00:24:28.500 | So this was fairly new to me.
00:24:31.140 | I didn't think it could be done.
00:24:32.420 | Most people do this on content embeddings, which is given some content of this item, can you predict the content of that item?
00:24:40.180 | I thought this was fairly novel, whereby it's trained solely on item popularity statistics.
00:24:44.580 | They say it works.
00:24:47.940 | It's still quite unfathomable to me how it works.
00:24:51.860 | Essentially just take the item popularity statistic in the monthly and the weekly timescale, convert it to percentiles, and then convert those percentiles to vector representations.
00:25:02.340 | And that's it.
00:25:03.940 | That's your representation of the item.
00:25:06.500 | So anytime you have a new item, so long as you have the past statistic for the past month and week, you can convert the percentile and then map it into vector representation.
00:25:14.900 | So what this means is that, imagine our percentiles are bucketed into hundredths and we have stats for monthly and weekly: all we need is 200 embeddings, for a month and a week.
00:25:24.660 | For each timescale we need 100 percentile buckets, and each bucket gets a vector representation.
00:25:31.220 | So instead of millions of item IDs or billions of item IDs, all you need is 200 percentile vector representations.
00:25:39.300 | So that is extremely compressed.
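(A rough sketch of that representation; the bucket counts, dimensions, and the rank-to-percentile mapping are my assumptions.)

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical weekly / monthly view counts for a 10,000-item catalog
rng = np.random.default_rng(0)
weekly_views = rng.poisson(50, size=10_000)
monthly_views = rng.poisson(200, size=10_000)

def to_percentile_bucket(counts: np.ndarray) -> np.ndarray:
    """Rank each item's count against the catalog and bucket into 0..99."""
    ranks = counts.argsort().argsort()
    return np.clip((ranks / len(counts) * 100).astype(int), 0, 99)

weekly_bucket = torch.as_tensor(to_percentile_bucket(weekly_views))
monthly_bucket = torch.as_tensor(to_percentile_bucket(monthly_views))

# Only 2 x 100 trainable vectors stand in for the whole catalog, no per-item ID embeddings
weekly_emb = nn.Embedding(100, 32)
monthly_emb = nn.Embedding(100, 32)
item_repr = torch.cat([weekly_emb(weekly_bucket), monthly_emb(monthly_bucket)], dim=-1)
print(item_repr.shape)  # torch.Size([10000, 64])
```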
00:25:41.060 | They also had to do several tricks, like relative time intervals and fixed position encoding, that don't come across as intuitive to me.
00:25:50.340 | They explain that they did that, but it's unclear: how would I know a priori, without running the experiment, that I needed to do it?
00:25:59.700 | So it feels like there's like too many tricks.
00:26:02.900 | There's so many tricks in like, okay, I need these three things, the stats to perfectly align for this to work.
00:26:07.380 | So I think it's very promising, but I wish there was a simpler way to do this.
00:26:11.300 | The results, it has promising zero shot performance.
00:26:15.380 | What I mean by zero-shot performance: basically it trains on a source domain and then is applied to another domain, right?
00:26:23.700 | And you can see a two to six percent drop in recall at 10.
00:26:25.940 | This is compared to baselines, which are fairly good baselines, SASRec and BERT4Rec, which are trained on the target domain itself.
00:26:32.020 | Now, if you take this model and you train it on that target domain,
00:26:38.660 | it matches or surpasses SASRec and BERT4Rec when trained from scratch.
00:26:42.100 | But the thing is, it only uses one to five percent of the parameters, because it doesn't have item embeddings, right?
00:26:48.100 | It only has those 200 embeddings at the monthly and the weekly scale for every percentile.
00:26:52.100 | So this is quite promising in the sense that it's one direction towards pre-trained recommendation models.
00:27:00.180 | You can imagine some kind of recommendation as a service, adopting this idea and maybe it could work.
00:27:07.060 | Maybe something like Shopify, right?
00:27:08.660 | Shopify has a lot of new merchants onboarding.
00:27:11.220 | Hey, you know, can we take existing merchant data with their permission?
00:27:13.860 | Of course, completely anonymized, right?
00:27:16.260 | It's just solely trained on popularity, right?
00:27:19.220 | And then we just train this model.
00:27:20.580 | Now for any new merchant that's onboarding, as long as we have a week of data, we can use the weekly popularity embeddings.
00:27:26.820 | And once we have a month of data, we can use that model.
00:27:29.860 | So we don't actually need semantic IDs.
00:27:31.860 | The second one is we have two papers from YouTube.
00:27:36.660 | We have two papers from Google and YouTube here.
00:27:38.900 | And this is, so, the one thing about distillation is that
00:27:44.900 | if you solely learn on the teacher labels, it is very noisy, right?
00:27:54.260 | The teacher models are not perfect models. It's better to learn from the ground truth.
00:27:57.140 | But we do know that adding the teacher labels does help.
00:28:00.100 | So what they do here is on the left side, you can see that direct distillation, which is learning from
00:28:06.980 | both the hard labels, which is the ground truth and the distillation labels, which is what the teacher provides,
00:28:11.860 | the teacher model, the big teacher model provides, is not as good as auxiliary distillation.
00:28:17.620 | And essentially, what auxiliary distillation means is that you just split, give them two logits.
00:28:21.620 | One logit to learn from the hard label, one logit to learn from the distillation label.
00:28:25.060 | And they find that this works very well.
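(A minimal sketch of the two-logit idea; the tiny network, loss weighting, and shapes are placeholders.)

```python
import torch
import torch.nn as nn

class AuxDistillModel(nn.Module):
    """Shared trunk with two heads: one learns the ground truth, one learns the teacher."""
    def __init__(self, in_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.hard_head = nn.Linear(64, 1)      # trained on clicks/likes (ground truth)
        self.distill_head = nn.Linear(64, 1)   # trained on the teacher's soft labels

    def forward(self, x):
        h = self.trunk(x)
        return self.hard_head(h).squeeze(-1), self.distill_head(h).squeeze(-1)

model, bce = AuxDistillModel(), nn.BCEWithLogitsLoss()
x = torch.randn(32, 64)
y_true = torch.randint(0, 2, (32,)).float()   # hard labels
y_teacher = torch.rand(32)                    # soft labels from the (frozen) teacher
hard_logit, distill_logit = model(x)
loss = bce(hard_logit, y_true) + bce(distill_logit, y_teacher)  # equal weighting is a free choice
loss.backward()
# At serving time, only the hard head is needed; the auxiliary head just shapes the shared trunk.
```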
00:28:28.100 | I didn't have time to put the results here, but they find that this works well for YouTube.
00:28:32.580 | And then the thing is that the teacher model is useful.
00:28:35.060 | So what they did is that they amortized the cost by having a big fat teacher model. And
00:28:40.260 | by big fat teacher model, I mean, it's only two to four X bigger.
00:28:44.100 | By having a teacher model that's two to four X bigger, this teacher model will just keep pumping
00:28:48.580 | out the soft labels that all the students can learn from. And this makes all the students better.
00:28:53.460 | And of course, why do we want students? If you're saying that teacher model is better,
00:28:56.420 | why do we want students? We want the students because the student models are small and cheap.
00:29:00.580 | And at YouTube scale, where they have to make a lot of requests, this is probably what they need to do.
00:29:05.140 | Another approach, which is from Google, and I think they applied this in the YouTube setting as well,
00:29:12.660 | is called self auxiliary distillation. So the intuition here is this, don't look at the image first.
00:29:19.380 | So intuition here is this, they want to prioritize high quality labels and improve the resolution of
00:29:24.340 | low quality. What does it mean to improve the resolution of lower quality labels? Essentially,
00:29:29.220 | what they're saying is that if something is impressed, but not clicked, we should not treat that as a label
00:29:36.900 | of zero. Instead, what we should do is to try to get the teacher to predict what that label is,
00:29:42.820 | to smoothen it out. So if you look at the image, you can see that they have ground truth labels,
00:29:48.660 | which is those in green, and they have teacher predictions, which is those in yellow.
00:29:52.020 | So to combine a hard label with the soft label, they suggested a very simple function,
00:29:57.300 | I don't know if that's what they actually use, but essentially the max of the teacher and
00:30:02.740 | the ground truth. So imagine if the actual label was zero,
00:30:09.540 | and the teacher said that, you know, it's a 0.3, you just use the 0.3. Or if the actual label is one,
00:30:14.660 | and the teacher says 0.5, you just take the one. So by smoothing the labels like that, and then having
00:30:20.020 | the student learn on the auxiliary head, right, you are actually able to improve the teacher
00:30:27.540 | model itself and use it for serving.
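(A small sketch of that smoothing step; the max is what the description above suggests, not necessarily exactly what ships, and the two-head setup mirrors the previous sketch.)

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def self_aux_distill_loss(hard_logit, aux_logit, y_true, teacher_prob):
    """Auxiliary head learns a smoothed target: max(ground truth, teacher prediction)."""
    smoothed = torch.maximum(y_true, teacher_prob)  # 0 with teacher 0.3 -> 0.3; 1 with teacher 0.5 -> 1.0
    return bce(hard_logit, y_true) + bce(aux_logit, smoothed)

y_true = torch.tensor([0.0, 1.0])
teacher_prob = torch.tensor([0.3, 0.5])
loss = self_aux_distill_loss(torch.zeros(2), torch.zeros(2), y_true, teacher_prob)
```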
00:30:33.780 | So there's a lot of distillation techniques here, which I think are quite inspired by what we see
00:30:41.940 | from computer vision and language models. I haven't seen too many of these distillation techniques
00:30:45.700 | myself in the field of recommendations, which I thought were pretty interesting. The last one, and unfortunately this is the last one I have slides for, I can go through
00:30:51.140 | the other recommended reads I have, but unfortunately I didn't have slides to do for it,
00:30:56.980 | is this one. So this is quite eye-opening for me. Essentially what LinkedIn did was they replaced
00:31:07.700 | several ID-based ranking models with a single 150B decoder-only model. What this means is that, for
00:31:16.900 | example, you could replace 30 different logistic regressions or decision trees or neural networks
00:31:22.660 | with a single text-based decoder-only model. This model is built on a Mixtral MoE, right,
00:31:30.500 | that's why it's approximately 150B. And it's trained on three to six months of interaction data.
00:31:36.180 | And the key thing, so you may think, okay, decoder-only model, what does it mean?
00:31:40.500 | Will you write posts for me? Will you write LinkedIn posts for me? Will you
00:31:43.780 | update my job title, whatever? The focus here is solely binary classification:
00:31:50.420 | will the user like or interact with a post, or apply for a job?
00:31:56.340 | So you can imagine that this model probably only needs to output like or not like. It's probably more
00:32:03.220 | complex than that. But essentially, this is a big fat decoder-only model that is very good at
00:32:07.140 | binary classification. That's why it's able to actually do well. And that's how they were evaluated.
00:32:14.180 | So there are different training stages. And over here, I think maybe it's better for me to go over,
00:32:18.500 | go into the actual write-up itself because I just didn't have time to
00:32:26.980 | share this. So they have continuous pre-training. So continuous pre-training, they just take member
00:32:33.540 | interactions on LinkedIn, different LinkedIn products, right? And then your raw entity data,
00:32:38.260 | essentially just take all this job-related, job hunting-related data to pre-train the model,
00:32:43.860 | to help the model get some idea of what is the domain. After continuous pre-training,
00:32:51.380 | they do the post-training approach. They do instruction tuning. So essentially, this is like
00:32:56.900 | training the model for instructions. They follow, they use UltraChat and internally generated instruction
00:33:04.260 | following data, right? So get LLMs to come up with questions and answers, relevant LinkedIn tasks,
00:33:08.820 | and then try to find high-quality ones. So that's training it, fine-tuning it to follow instructions.
00:33:14.020 | And then finally is supervised fine-tuning. They say a lot of things about a multi-turn chat
00:33:18.980 | format. But essentially, the goal for supervised fine-tuning is to train the model to do the specific
00:33:29.780 | task. I don't remember where it is exactly, but it's like, ah, so this is the specific task.
00:33:36.820 | So now that we know it can follow instructions, now let's make it better at the specific task:
00:33:42.260 | what action would a member take on this post? Would it just be an impression? Would it be
00:33:48.420 | a like? A comment? Etc. So that's how they go through the different stages. And I'm going back to my slides.
00:34:03.540 | Okay. So they have these three different stages. And so here's the crazy thing. You can see the slides,
00:34:15.140 | right? Can someone just say yes? Yes. Okay. The crazy thing is that they have now replaced feature
00:34:23.220 | engineering with prompt engineering because of this unified decoder model. So you can broadly read it. It's like,
00:34:30.900 | this is the instruction. Here's the current member profile, software engineer at Google. Here's their
00:34:36.020 | job. Here's their resume. Here are the things that they have applied to. So will the user apply to this job?
00:34:44.500 | And the answer is apply. And you can probably simplify this into a one or zero, right? I guess they just
00:34:50.180 | say it in the text as an example, but that's all this model is doing. For a user, we have retrieved several
00:34:57.220 | jobs. This model is doing the final pass of which ones to rank. And they take the log probs of the output
00:35:04.020 | to score it. So essentially, if this says that the member will apply, maybe you have 10 jobs that a
00:35:09.780 | member might apply to. Then we take the log probs to rank them. I don't know if this is a good thing or a bad thing.
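(A hedged sketch of the prompt-as-features idea. This is my reconstruction, not LinkedIn's actual prompt, model, or serving stack; the OpenAI client is just a stand-in for any decoder-only model that exposes log probabilities.)

```python
import math
from openai import OpenAI

client = OpenAI()

def apply_logprob(member_profile: str, job: str) -> float:
    """Score one (member, job) pair by the log-prob the model assigns to 'Apply'."""
    prompt = (
        "Instruction: predict whether the member will apply to the job. Answer Apply or Skip.\n"
        f"Member profile: {member_profile}\nJob: {job}\nAnswer:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return next((t.logprob for t in top if t.token.strip().lower().startswith("apply")), -math.inf)

member = "Software engineer at Google, 5 years of backend experience"
jobs = ["Staff engineer, fintech startup", "Barista, coffee shop"]
ranked = sorted(jobs, key=lambda j: apply_logprob(member, j), reverse=True)
```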
00:35:16.260 | I find feature engineering more intuitive than prompt engineering, but maybe it's a skill issue. But
00:35:22.020 | essentially now, all PMs can engineer their own features. The impressive thing was, is that this
00:35:31.700 | can support 30 different ranking tasks. That's insane. So now, instead of 30 different models,
00:35:38.820 | you just have one big fat decoder model. That sounds a bit crazy to me. Firstly, it's crazy impressive.
00:35:46.260 | Secondly, it's a lot of savings. Thirdly, I don't know how to deal with the alignment tax, or maybe it's
00:35:51.620 | just a do-no-harm tax. I don't know. Essentially, the goal of RecSys was to decouple everything, right?
00:35:57.940 | It's like, have retrieval be very good at retrieval, have ranking be very good at ranking. And then
00:36:02.740 | each model just squeezes as much juice as we can. Now, what this is saying is that, okay, we're going
00:36:08.660 | to unify. We have too many separate ranking models. We're going to unify into a big fat model and then
00:36:15.220 | push all the data through it. And hopefully, you'll outperform. And it does outperform. It needs a lot
00:36:20.260 | of data. So you can see it in the graph there, right? Up to release three, it was not better than
00:36:27.620 | production. And you can see that based on the axis on the left, which is the gap versus production. Zero
00:36:33.220 | means that it's on par with production. Up to release three, it was not better. I mean, I don't know who
00:36:37.540 | had to get whatever budget, or had to quarterback this, to make sure this worked, to push it through.
00:36:46.340 | But as they add more and more tokens, it starts to get better than production, like a 2.5% increase. So this is
00:36:57.300 | a huge leap of faith: okay, we'll just go with the bitter lesson. Just give us more data,
00:37:04.900 | we will outperform, um, with a single model. Um, okay, so that's it. That's all I had to share.
00:37:11.860 | Um, I want to just briefly highlight two other papers, which I think are
00:37:19.300 | good. Um, they're a little bit less connected to LLMs, but these two papers are very good because of
00:37:26.900 | how deep they go into their system architecture. The first one is Etsy. Um, Etsy, you can see this, this is extremely
00:37:33.780 | complicated. Uh, but this really shows you a very realistic and practical approach, right? Classic two-tower
00:37:40.900 | architecture. They share about negative sampling and then talk about product quality, right? The thing
00:37:46.020 | is, an item can have very good, very polished images, but when people buy it, they return it.
00:37:52.660 | Um, you will never be able to detect that if you're just using content. So what they did was they
00:37:57.780 | actually have a product quality embedding that they use to augment, um, their approximate
00:38:04.820 | nearest neighbor index, right? So you can see the quality vector. Uh, this is extremely pragmatic
00:38:11.540 | and I can tell you that not, uh, no, no e-commerce website or no search engine, search, whatever online
00:38:18.260 | discovery thing can do without some form of quality vector or some kind of post quality filtering. We
00:38:24.820 | saw that with indeed, right? Expected bad match. They need the quality. They just, uh, operationalize it
00:38:29.540 | in a different way as a final post filtering layer over here. They include it in the approximate nearest
00:38:34.900 | neighbors index. So I highly recommend reading this, um, very practical, uh, shares a lot of detail into
00:38:41.460 | their system design. Uh, the next one I also highly recommend is the model ranking platform at Zalando.
00:38:46.900 | I think this is all the best practices, uh, talks about all the different tenets, like composability,
00:38:52.020 | scalability, steerable ranking, and they really go deep into, hey, you know, here's the candidate
00:38:56.260 | generator, essentially the retrieval step, a two-tower model. And then, you know, they just use an
00:39:02.340 | ANN to retrieve. And then they talk about the ranker and then finally the policy layer. What is this policy
00:39:07.860 | layer, right? The policy layer encourages exploration, business rules, like previously purchased items, item
00:39:13.380 | diversity, again, some, some measure of quality that the model would never be
00:39:20.340 | able to learn from the data. The model will never learn that showing good items is good, right? Because
00:39:24.260 | they're untested. So you have to override the model with this policy layer. Um, and of course, very good
00:39:30.260 | results. Uh, but what I really like about this paper is that if you want to learn about system design
00:39:36.020 | for RecSys, the Zalando paper and the Etsy paper are really, really good and really in depth.
00:39:42.980 | Um, but of course everything here is, uh, very good. If there were a few papers I read, I'm like, okay,
00:39:48.340 | this is pretty crap. I wouldn't include it. Uh, but every paper here is pretty good for system design.
00:39:52.900 | Um, under the final section, which is unified architectures. Um, okay. Any, I spoke a lot,
00:40:00.980 | any questions, they'll lose anyone. They'll lose everyone. Uh, Eugene, I have one quick question to
00:40:10.020 | double check on the LinkedIn paper. Hmm. My understanding is the big model, the 150
00:40:17.860 | billion model, is actually used as a teacher model, and then the knowledge is distilled into smaller models
00:40:24.500 | that are then used for the different tasks. So I don't know if that aligns with your understanding, because
00:40:30.580 | practically, serving a 150 billion model for prediction, the latency will not be acceptable and
00:40:38.020 | too costly also. They do actually have a full paper discussing more about how
00:40:45.220 | knowledge distillation is happening with that 150 billion model. I kind of put it in the, in the chat.
00:40:51.860 | I don't know if you have come across that paper.
00:40:53.860 | I have not. Um, but my impression was that they were actually using it. Thank you for sharing this
00:40:59.540 | and thank you for correcting my misunderstanding. I need to look deeper into the original paper.
00:41:04.260 | Um, um, to confirm this. Let me tag you in this thread. That's my understanding.
00:41:12.020 | So, but I also find this, uh, the, the approach very interesting. So we were actually thinking of
00:41:18.980 | similar kind of approach and as well, but actually they kind of proved that this, this way kind of works.
00:41:24.980 | Yeah. Thank you. I, I think that, I think that could probably be it. I think there's no way for it to be
00:41:30.900 | feasible to serve it at that scale. Uh, I think you're probably right. I, I don't know. I, I didn't see
00:41:38.020 | anything in the original paper that actually suggests that. Um, but I, I think you're right. That's,
00:41:43.380 | there's just no way to serve it at scale.
00:41:46.100 | Yeah. So I tagged you in this thread. Maybe, maybe I can,
00:41:52.820 | I have the paper. Thank you. I've added it to my reading list to confirm.
00:41:57.700 | Yeah. Yeah. Thanks. Thanks for explaining this. Just to double check. Thank you.
00:42:03.940 | Any other questions?
00:42:08.980 | I mean, so, um, one thing that I tried to look for and I found myself doing, but I might as well ask the,
00:42:22.180 | pre-trained LLM, uh, that is you, uh, is to rank, um, what is highest, uh, you know, the,
00:42:30.980 | the lowest hanging fruit versus the higher ones. Um, so for example, right, you, the way that you
00:42:37.380 | organize your, at your write-up was four sections. It was model architecture, data generation, scaling laws,
00:42:46.180 | and then unified architectures. Um, why? Was there a particular reason? Is there an order, right? Like,
00:42:53.780 | to me, it was very clear that model architecture is basically useless. Is that true? Would you,
00:42:58.340 | would you recommend? I don't think so. I actually think that, um, I think the model architecture right
00:43:05.140 | now, it's like a little bit more like, um, you know, meters dilemma. Um, I would say that in 2023
00:43:11.860 | is useless. Um, I haven't seen good results. Now I'm seeing good results. Um, would you classify this
00:43:18.420 | YouTube one as a good result? Because I was, I read it and I was like, wait, like this, I don't know,
00:43:23.300 | you know, like it's, these are smart ideas. And then the, then the results are like, uh, you know,
00:43:29.300 | doesn't, doesn't really outperform our baseline. I, I, I think it's decent results. Um, so, and, uh,
00:43:36.260 | coincidentally, after I published this, I think after it made the rounds on Hacker News, people from
00:43:41.220 | YouTube actually reached out. One of the authors on this exact paper reached out. Um,
00:43:45.540 | they wanted me to like go in and chat with them. Uh, and then they were like, we have more papers.
00:43:50.580 | We're pushing to publish them. And this is a perennial problem, right? Especially for YouTube and TikTok,
00:43:55.700 | right? Um, new videos get uploaded all the time. They have to deal with it; cold start is their bread and
00:44:02.500 | butter. So I wouldn't be surprised that they are focusing so hard on content embeddings for
00:44:07.460 | cold start. This is for, for a new video, but not, I mean, they have users, uh, they have user histories
00:44:14.740 | and they saturated the world on that. I think very likely. So, right. Um, you can imagine YouTube,
00:44:22.420 | Twitter, I mean, it's unmentioned here, but ads, Google ads, it's always cold start; being able to
00:44:29.540 | crack this cold start problem, just even 0.1%, is huge. It is huge. And you can see a lot of the
00:44:36.020 | papers in this: semantic IDs, YouTube, Kuaishou, which is like TikTok. Um,
00:44:41.940 | Huawei, this one, I'm not very sure why they did this. Um, yeah, a lot of it, Kyle Rack also,
00:44:47.060 | uh, solving cold start. So yeah. Okay. But like, you know, orders of magnitude, um, which takes
00:44:57.700 | orders of magnitude? I really think that the low-hanging fruit right now is really using LLMs for data
00:45:02.260 | generation. Right. Yeah. That's my, yeah. I mean, obviously that is the, I think everyone can do this
00:45:08.020 | now. And you know, the expected bad match paper, um, Indeed did it, right. I actually did this, uh,
00:45:14.820 | last year, something very similar. I did this last year. Uh, it got published internally. This is very,
00:45:21.300 | very effective. This approach of, um, starting from somewhere, doing active
00:45:27.620 | learning, fine-tuning a model, more active learning. It really helps improve quality. Um, I was
00:45:34.180 | doing it in the context of LLM and hallucinations, but I can imagine doing this in terms of relevance,
00:45:39.780 | in terms of any level of measure of quality that you want to focus on, it will work very well.
00:45:44.580 | Okay. Um, and then of course, yeah. Then architecture. I would say data generation and then like model
00:45:52.420 | architecture and system architecture. Yeah. Oh, wait, actually, even the scaling laws part,
00:45:57.460 | there are some things that are very, uh, practical. Um, one example, which I didn't have time to go
00:46:03.860 | through, is this: basically LoRAs for recommendation. So what they did is they train
00:46:11.460 | a single model on all-domain data. You can imagine all domains, like fashion, e-commerce,
00:46:18.580 | furniture, toys, et cetera. Or it could be all domains like ads, videos, uh, e-commerce. And then after
00:46:26.340 | that, they have specific LoRAs for each domain. Um, and this works very well. So I, I definitely
00:46:34.900 | think that essentially right now it's not easy to learn from data across domains for recommendation
00:46:42.260 | systems. And, you know, correct me if I'm wrong, but I really
00:46:45.700 | think that you want to overfit on your domain. Um, you want to overfit and predict the next best thing
00:46:51.540 | for tomorrow. And that's it, period. I can overfit and just retrain every day.
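(A minimal sketch of the per-domain LoRA idea using the PEFT library; the base-model name and adapter settings are hypothetical.)

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# One base model trained on all domains; a small LoRA adapter per domain
base = AutoModelForCausalLM.from_pretrained("my-org/all-domain-recsys-base")  # hypothetical name

lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
fashion_model = get_peft_model(base, lora_cfg)
# ...fine-tune fashion_model on fashion interactions only, then save just the adapter:
fashion_model.save_pretrained("adapters/fashion")  # a few MB, versus the full base model
```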
00:46:56.820 | I know we have a few questions here. Uh, Daniel asks, shouldn't the LinkedIn model combine with an
00:47:02.900 | information model that's used to generate? Yes, correct. Um, that's probably an upstream, uh,
00:47:08.660 | retrieval model. And then the LinkedIn model just does the ranking. So in a two-step process,
00:47:14.020 | you have a retrieval, the LinkedIn, the decoder model, I think it just does the ranking.
00:47:18.180 | Um, for LLM-based search retrieval, are there any papers that talk about the impact of query rewriting or prompt engineering?
00:47:24.020 | Also the sensitivity. Uh, I think we know that LLMs are sensitive to prompts, but I think they're
00:47:29.700 | increasingly less sensitive to prompts, uh, because they're just way more instruction tuned. Um, I'm not
00:47:35.220 | sure about LLM papers that talk about the impact of query rewriting. I think the only one that I
00:47:39.620 | have seen at least covered here is the one by Yelp. Uh, so I think that could be helpful. What's the
00:47:45.460 | process for keeping this hybrid models up to date and personalization, uh, keeping them up to date.
00:47:50.660 | That's a good question. I don't know if they actually need to be kept up to date. So if you look at the,
00:47:57.140 | if you look at this hybrid models, right, let's just take the, this, uh,
00:48:06.180 | semantic ID embedding, it actually uses a frozen VideoBERT. Similarly for Kuaishou, they use frozen
00:48:13.380 | Sentence-BERT, ResNet, and VGGish. So the content embeddings themselves don't need to be up to date. And the
00:48:20.100 | assumption is that, okay, content today is going to be the same as content tomorrow, and the same
00:48:23.220 | as content in one month. So that is not up to date, but what is learnable is the semantic ID
00:48:28.500 | embedding and a cluster embedding. Now for personalization, that's very interesting. Uh, that's the hard question.
00:48:33.220 | Right. And the personalization, I guess, how you include personalization is okay. After we learn
00:48:37.620 | the content, we also need to learn what the user is interested in. And that's how they have this two
00:48:41.780 | tower approach. And that's why you can see over here, there's this small layer, uh, which is multimodal
00:48:47.780 | interest intensity, right? Which is given a user and their past historical sequence. How can we tell what
00:48:53.380 | modality they're interested in? I think that's how they do personalization over here.
00:48:58.820 | Any other questions?
00:49:01.700 | If not, we can always ask for volunteers for next week's session.
00:49:08.180 | Anyone?
00:49:19.220 | No, I think this is really helpful. It just feels like RecSys is always these
00:49:34.340 | bundles of ideas. In the way that agents are bundles of ideas for LLMs, RecSys
00:49:44.420 | is also a bundle of ideas. And I think they are obviously converging.
00:49:52.020 | I definitely think so. And you can see examples, right? We learned item
00:49:58.580 | embeddings via word2vec. When people talk about graph embeddings, it's really just
00:50:02.260 | taking the graph, doing a random walk, converting that random walk into a sentence of item IDs,
00:50:06.500 | and just using word2vec. Similarly, for learning the next best action, GRUs, transformers, and BERT —
00:50:12.500 | it was very obvious they would work. So I think we will see more from the LLM space being adopted in RecSys as well.
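That graph-embedding trick is essentially the DeepWalk recipe: walk the item graph, treat each walk as a sentence of item IDs, and run word2vec over it. A toy sketch (the graph and item IDs here are made up):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_len=20):
    """Turn an item co-interaction graph into 'sentences' of item IDs."""
    walks = []
    for _ in range(walks_per_node):
        for node in graph.nodes():
            walk = [node]
            for _ in range(walk_len - 1):
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

# Edges = items that co-occur in user sessions (toy data).
g = nx.Graph([("item_1", "item_10"), ("item_10", "item_25"), ("item_25", "item_33")])
walks = random_walks(g)

# Treat each walk as a sentence and learn item embeddings with skip-gram word2vec.
model = Word2Vec(sentences=walks, vector_size=64, window=5, min_count=1, sg=1)
print(model.wv.most_similar("item_10"))
```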
00:50:19.540 | What is the link in your mind between re-rankers and RecSys?
00:50:28.900 | In RecSys, we have retrieval and ranking, right? So you can see over
00:50:40.740 | here — where is it — what we do is retrieval will retrieve a lot of top candidates; it's
00:50:50.900 | going to retrieve a hundred candidates, and then ranking is going to find the best five candidates,
00:50:55.460 | so you can focus on the best five. I think it's the same thing with retrieval and re-ranking
00:51:00.980 | as with retrieval and ranking in RecSys. What people call re-ranking in RAG is really just
00:51:08.900 | taking the retrieved results and then finding the best five. And, you know, Cohere has re-rankers for finding
00:51:13.220 | the best five for the LLM as part of the context. Yeah, to me it's a bit weird, right?
00:51:18.420 | The re-ranker models are being promoted as a way to just feed in your top-k results,
00:51:24.900 | and then they re-rank them, and somehow that is supposed to produce better RAG because
00:51:29.780 | the more relevant results are at the top. But without the context that RecSys has, for example
00:51:35.780 | user preferences and user histories and whatever, how can you have any useful re-ranking at all?
00:51:41.300 | I think there are some ways. So for example, take retrieval:
00:51:44.660 | you can imagine the most naive retrieval is really just BM25 or semantic search.
00:51:50.420 | Then you can imagine you have a lot of historical data on all these BM25 and
00:51:56.500 | semantic search results and all the associated metadata, which you probably cannot use at the
00:52:00.340 | retrieval stage because it's too expensive. And then you can just train a re-ranker —
00:52:03.380 | say, whether the author matches, or that this author usually looks for this kind of document —
00:52:09.220 | and then you can try to re-rank with it. I think it's possible.
00:52:13.460 | I haven't dived too deep into how re-ranking is done for RAG, but it's possible.
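A minimal sketch of that kind of re-ranker, assuming you have logged (query, candidate, user) examples with click labels. The feature names and data structures below are invented for illustration, and a gradient-boosted classifier stands in for whatever learning-to-rank model you would actually use.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def features(query, doc, user):
    """Illustrative metadata features for one (query, candidate) pair --
    signals that are too expensive to use at retrieval time but cheap
    to score over the top-100 candidates."""
    return [
        doc["bm25_score"],                               # first-stage score
        float(doc["author"] == query.get("author")),     # author match
        doc["historical_ctr"],                           # past click-through rate
        float(doc["doc_type"] in user["preferred_types"]),
    ]

def train_reranker(training_triples, labels):
    """training_triples: list of (query, doc, user); labels: 1 if clicked/relevant."""
    X = np.array([features(q, d, u) for q, d, u in training_triples])
    return GradientBoostingClassifier().fit(X, np.array(labels))

def rerank(reranker, query, user, candidates, k=5):
    """Score all retrieved candidates and keep the best k for the final context."""
    scores = reranker.predict_proba(
        np.array([features(query, d, user) for d in candidates]))[:, 1]
    return [candidates[i] for i in np.argsort(-scores)[:k]]
```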
00:52:21.220 | Oh, Apulantia, you had a hand raise.
00:52:22.740 | Yeah, and thank you so much, Swyx and Eugene. You guys are amazing. I'm a huge fan of you both.
00:52:28.100 | But the question I had, I guess, is: is it worthwhile for an organization to go and build
00:52:34.340 | this when you have something like Jina.ai, who, as we know, popularized a lot of work on deep search —
00:52:40.740 | not just internal retrieval but external search and retrieval — because they have the embedding models,
00:52:45.860 | the re-rankers, the deep search retrieval APIs, all unified? How do you feel about that, Eugene and Swyx?
00:52:53.220 | Should teams build it themselves, or should they just buy something off the shelf?
00:52:58.740 | Yeah, Jina.ai has kind of this full stack that you're talking about, with all these fine-tuned
00:53:03.780 | CLIP models, embedding models, re-rankers. They have the deep search retrieval, and
00:53:08.980 | they've posted probably some of the better technical blogs on deep search that are out right now,
00:53:14.500 | and embedding models and re-rankers. Yeah, my answer is going to be probably very boring, and you can
00:53:19.460 | apply it to any question of whether someone should use an LLM off the shelf or fine-tune their own
00:53:24.100 | model. I think that for prototyping, just do whatever is fast, right? Demonstrate user value. It's going
00:53:30.340 | to get to a point in time where what you need does not fit what something off the shelf is going to
00:53:37.060 | do. And that's Indeed's story, right? For expected bad match, latency
00:53:47.460 | continued to be too high, and the only way to bring it down was to fine-tune their own model.
00:53:53.620 | Similarly, for retrieval, you can imagine they're going to provide a lot of out-of-the-box
00:53:58.340 | embeddings, and maybe it's going to be good enough. And then eventually, when you want to really squeeze more juice
00:54:03.540 | out, you probably need to go fine-tune your own embeddings. I know Replit recently
00:54:07.380 | shared something about fine-tuning their own embeddings — half the size, et cetera —
00:54:10.660 | and it does way better. There are a lot of examples here. I think Etsy also fine-tuned their own
00:54:14.820 | embeddings, and they outperform embeddings out of the box. I think it's a little bit of an unfair
00:54:21.140 | comparison — if you take those off-the-shelf embedding models and further fine-tune them,
00:54:25.700 | I think they could do better — but essentially the point is: just use something off the shelf to move fast,
00:54:30.820 | and then after that, if you do need to customize it, you customize it.
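For the "eventually fine-tune your own embeddings" step, here is a minimal sketch with sentence-transformers: start from an off-the-shelf model and adapt it on your own (query, item) pairs with in-batch negatives. The model name is a common public checkpoint and the training pairs are toy placeholders, not anyone's actual data.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from an off-the-shelf model, then adapt it to your own query->item pairs.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive (query, item) pairs mined from your own click/purchase logs (toy placeholders).
train_examples = [
    InputExample(texts=["ants in my house", "indoor ant killer bait traps"]),
    InputExample(texts=["mid century couch", "walnut mid-century modern sofa"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # uses in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("my-domain-embeddings")
```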
00:54:36.180 | I love it. Thank you so much. Super helpful. You're welcome.
00:54:39.460 | Sorry, we have a question. Let's go ahead. And I think this is probably my last question.
00:54:49.540 | Okay. What do you think is the biggest opportunity for
00:54:54.820 | applying LLMs in the recommendation domain? Because we're discussing retrieval,
00:54:58.900 | ranking, content understanding, etc., right? There are so many different prediction tasks.
00:55:04.500 | You're asking me what I think is the big opportunity?
00:55:12.100 | Is that your question? Yeah. I think embeddings — what I've seen is that embeddings
00:55:20.900 | are helpful for retrieval. So instead of purely keyword retrieval, like "ant killer", someone might
00:55:26.260 | just ask, "I have ants in my house", and I think a semantic embedding could help you match
00:55:32.100 | that. And I think ranking will clearly work using an LLM-based ranker —
00:55:39.460 | I think the LinkedIn example shows it can clearly work. And of course for search, I think
00:55:44.100 | there's this guy, Doug Turnbull, who's increasingly going down this route,
00:55:50.100 | and we've seen examples from Yelp, right? Using an LLM to do query segmentation, query expansion,
00:55:56.980 | query rewriting. It clearly works in Yelp's use case, and you can just cache all these results.
00:56:02.740 | So those are the three things I can think of off the top of my head.
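A rough sketch of that "LLM query rewriting plus caching" pattern, assuming the OpenAI chat completions API. The prompt, model name, and JSON keys are placeholders, not Yelp's actual setup, and a real system would use a shared cache such as Redis rather than an in-process dict.

```python
import hashlib
import json
from openai import OpenAI

client = OpenAI()
_cache = {}  # in production this would be a shared KV store keyed by normalized query

def rewrite_query(query: str) -> dict:
    """Ask an LLM to segment/expand/rewrite a search query, caching the result
    so repeated (head) queries never hit the model twice."""
    key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
    if key in _cache:
        return _cache[key]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content":
             "Rewrite the search query. Return JSON with keys: "
             "segments, synonyms, rewritten_query."},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    result = json.loads(resp.choices[0].message.content)
    _cache[key] = result
    return result

# rewrite_query("ants in my house")
# -> {"segments": [...], "synonyms": ["ant killer", ...], "rewritten_query": "..."}
```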
00:56:08.740 | Okay. Thank you, everyone. I do need to drop. Maybe you can discuss what the next paper is.
00:56:13.220 | Maybe Swyx will talk about Moore's Law for AI every seven months, which I think is interesting.
00:56:17.540 | No, not that one — autoregressive image generation.
00:56:21.300 | Okay. All right. Bye. Bye. Bye. Bye. Thank you.
00:56:25.300 | Thank you, everyone. Take care.
00:56:26.580 | Thank you. Thank you.