Okay, so I work in the field of recommendation systems and search, and every now and then I like to pop my head up to see what's going on. I think the big trend for the past one or two years has been the interaction between recommendation systems, search, and LLMs: RecSys and LLMs, RecSys and search and LLMs. In early 2023 we saw some papers where people used decoder-only models to try to predict item IDs, and those didn't work very well. But at the end of last year and early this year, we're starting to see signs of life, where some of these approaches are actually A/B tested and show very good empirical results.
So I want to highlight a few of those patterns and we can see how it goes. One thing we've been trying to do for the longest time in RecSys and search is to slowly move away from item IDs. You can imagine: if Eugene interacts with item IDs 1, 10, and 25, his next predicted interaction is probably item ID 33. But all of this relies on item IDs, and every time you have a new item you have to learn a new embedding for it, which leads to a cold-start problem. This wasn't part of the recommended reads in my write-up (I'll go back and update it), but I want to discuss two papers here that try to address this.
The first one is semantic IDs, from YouTube. YouTube gets a lot of new videos all the time; they can't keep learning new item ID embeddings for all of them. So what they do is they have a transformer encoder that generates dense content embeddings. It's really just a video encoder that converts a video into a dense content embedding.
And then they compress this into what they call a semantic ID, via an autoencoder; essentially residual quantization. The dense video encoding is 2048-dimensional. They take the encoding, find its nearest neighbor in a codebook (also of size 2048), assign it, take the residual, find the nearest entry in the next codebook, assign it, and so on. It just keeps compressing. In the image here they show four layers, so you'd compress an item into four integers; in the paper they actually compress an item into eight integers. I thought this was pretty cool. So now, instead of an item ID, you have a content embedding that gets converted into eight integers.
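To make the residual quantization step concrete, here is a minimal sketch, assuming the codebooks are already trained (in the paper they are learned jointly with an RQ-VAE, and the dimensions and codebook sizes below are placeholders):

```python
# A minimal sketch of residual quantization (the mechanism behind semantic IDs),
# assuming already-trained codebooks; the real system learns them jointly with
# an RQ-VAE, and the dims/codebook sizes here are made up.
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, num_levels = 64, 256, 4

# Hypothetical codebooks: one matrix of centroids per quantization level.
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_levels)]

def to_semantic_id(content_embedding: np.ndarray) -> list[int]:
    """Compress a dense content embedding into a short tuple of integers."""
    residual = content_embedding.copy()
    codes = []
    for codebook in codebooks:
        # Find the nearest centroid at this level...
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))
        codes.append(idx)
        # ...then carry the leftover (residual) to the next level.
        residual = residual - codebook[idx]
    return codes

item_embedding = rng.normal(size=dim)
print(to_semantic_id(item_embedding))  # e.g. [137, 5, 201, 44]
```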
Now that each item is a handful of integers, how do you learn from them? They tried n-grams and SentencePiece. N-grams is really just like fastText character n-grams: every subword, every one-, two-, or three-character piece, gets its own embedding. SentencePiece, on the other hand, looks at the whole corpus of these item IDs and asks: what are the most common sub-pieces? So it's no longer just unigrams, bigrams, and trigrams; you can learn variable-length subwords. What are the results? Well, not surprisingly, dense content embeddings by themselves do worse than item IDs. And you can see this, right?
You can see it on the chart on the left: the unigram and bigram lines, the red and purple lines. The unigram actually does worse than the random hash ID (the orange line) to some extent. Oh, actually, I didn't include the content-embeddings line here; it's in my write-up. But what this chart is trying to show is that using the dense content embeddings directly is bad, while using n-grams and SentencePiece over them does better. So you have to do this trick where you convert the full dense content embedding into its own semantic ID and then learn those IDs.
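As a rough sketch of the n-gram idea over those semantic IDs, you can treat the ID tuple like a short word, embed its hashed unigrams and bigrams, and pool them. The table size and dimensions below are arbitrary, and SentencePiece would instead learn variable-length pieces from the corpus:

```python
# Rough sketch: fastText-style n-gram embeddings over a semantic ID tuple.
# Table size and dims are arbitrary; this is an illustration, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
table_size, emb_dim = 10_000, 32
embedding_table = rng.normal(scale=0.1, size=(table_size, emb_dim))

def ngrams(semantic_id: tuple[int, ...], n: int):
    return [semantic_id[i:i + n] for i in range(len(semantic_id) - n + 1)]

def embed_item(semantic_id: tuple[int, ...]) -> np.ndarray:
    pieces = ngrams(semantic_id, 1) + ngrams(semantic_id, 2)   # unigrams + bigrams
    rows = [hash(piece) % table_size for piece in pieces]      # hashed lookup
    return embedding_table[rows].mean(axis=0)                  # mean-pool into one vector

print(embed_item((137, 5, 201, 44)).shape)  # (32,)
```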
Now, you might be saying: isn't this all just back to IDs again? Well, not necessarily, because given a new piece of content, you can convert it to an embedding and assign it to its nearest semantic ID, so you don't need to learn from behavioral data. That's the benefit here. Similarly, Kuaishou, which is like the number-two TikTok in China, adopted the same approach. They use multimodal content embeddings: embeddings from ResNet, Sentence-BERT, and VGGish for the respective modalities, simply concatenated into a single vector.
Then they take all these embeddings and run k-means to identify a thousand clusters, so each item gets a cluster ID that can be treated as a trainable ID. These cluster IDs are embedded via what they call the modal encoder, and there's quite a bit to it, so let me try to simplify. At the top you can see the non-trainable visual embeddings. In this example they only show the visual one, but imagine the same for all the non-trainable modality embeddings: they project them into a different space via a mapping network. Second, for every cluster ID they learn embeddings for the visual, textual, and acoustic modalities. These are not the original content embeddings; they're representations of the multimodal cluster IDs. Fusion is then really just concatenation. So now you might be thinking: how is this modal encoder trained? It isn't trained on its own. It's trained within the overall ranking network, where you can see the modal encoder sitting at the bottom. The ranking network takes the user tower on the left and the item tower on the right, and tries to predict the likelihood that the user will click or like. Based on that, they backprop the click/like/follow loss through the modal encoder, and that's how it learns the mapping network and the cluster ID embeddings. The benefit: they shared that this outperformed several multimodal baselines.
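Here's a toy sketch of that setup as I understand it: the frozen content embedding is projected by a mapping network, the cluster ID gets a trainable embedding, and both are learned by backpropagating the click loss through the ranking network. The shapes, the cluster count, and the simple concat fusion are my assumptions:

```python
# Toy sketch: frozen multimodal content embeddings + trainable cluster-ID embeddings,
# trained end-to-end by backpropagating the click/like loss through the ranking network.
# Dimensions, cluster counts, and the concat fusion are assumptions for illustration.
import torch
import torch.nn as nn

NUM_CLUSTERS, CONTENT_DIM, MODEL_DIM = 1000, 512, 64

class ModalEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.mapping = nn.Linear(CONTENT_DIM, MODEL_DIM)           # projects the frozen content embedding
        self.cluster_emb = nn.Embedding(NUM_CLUSTERS, MODEL_DIM)   # trainable cluster-ID embedding

    def forward(self, content_emb, cluster_id):
        return torch.cat([self.mapping(content_emb), self.cluster_emb(cluster_id)], dim=-1)

item_tower = ModalEncoder()
user_tower = nn.Linear(MODEL_DIM, 2 * MODEL_DIM)   # stand-in for the real user tower

# One fake training step: a frozen content embedding (e.g. ResNet/SBERT/VGGish concat),
# its k-means cluster ID, a user embedding, and a click label.
content = torch.randn(8, CONTENT_DIM)              # frozen, not trained
cluster = torch.randint(0, NUM_CLUSTERS, (8,))
user = torch.randn(8, MODEL_DIM)
clicked = torch.randint(0, 2, (8,)).float()

logit = (user_tower(user) * item_tower(content, cluster)).sum(dim=-1)
loss = nn.functional.binary_cross_entropy_with_logits(logit, clicked)
loss.backward()   # gradients flow into the mapping network and the cluster embeddings
```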
I won't go through the baselines here. And when they did A/B testing, those are pretty freaking significant numbers; anything more than 1% is strong on a platform like this. They also mentioned increased cold-start velocity and cold-start coverage: a good cold-start video is able to pick up faster, and they're able to show 3.6% more cold-start content, which increases coverage. They also did ablation studies.
Let me butt in for those new to the RecSys world: you said anything more than 1% is a big deal? Pretty huge. Can you contextualize how much that's worth? So you can imagine (let's just make something up) you're making a billion dollars' worth of ads. If people are engaging 1% more, spending 1% more time, you can show 1% more ads. That's like ten million dollars. And you just expand from there. Of course, clicks and likes and follows are engagement proxy metrics.
Okay, and is this absolute or relative? For example, let's say likes was +3% here: are we saying they went from 6% to 9%, or that they're currently at 6% and now it's 6.09%? I suspect it's relative, like going from 5% to 5.15%. Going from 6% to 9% is impossible; you wouldn't have to work anymore. Yeah. Okay.
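To make the arithmetic concrete (with made-up numbers):

```python
# Quick arithmetic on why the reported lifts are almost certainly relative:
# a +3% relative lift on a 5% like rate gives 5.15%, not 8%.
baseline_like_rate = 0.05
relative_lift = 0.03
print(baseline_like_rate * (1 + relative_lift))  # 0.0515 -> 5.15%

# And why ~1% matters at platform scale (the revenue number is invented):
annual_ad_revenue = 1_000_000_000
print(annual_ad_revenue * 0.01)  # 10,000,000
```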
All right, back to the ablation studies. No surprises here: using multi-modality consistently outperforms single-modality features. But there was also a trick where they learn user-specific modality interests. If you look at the left tower, close to the top, there's a module called multimodal interest intensity learning. Essentially, for each user, it learns which modality they care about, and the model maps to that. Some people, like Swix, are very acoustically inclined and might care more about the soundtrack; others might care more about the visuals or the video itself, or the text (having a good caption is worth a thousand words, maybe). So that's one trend I've seen: increasingly bringing more content into the model itself.
The other trend I've seen is using LLMs for synthetic data. There are two papers I really like here, because they share a lot of details, including the pitfalls they faced. The first one, a paper I really like, is called Expected Bad Match, from Indeed. Essentially, imagine you're providing people with job recommendations, and you want a final filter at the end to filter out bad recommendations. Full access to this paper isn't easy to find online.
I've dropped the PDF in a thread in our Discord channel. They talk through the entire process, and I think it's a role-model process. They started by looking at 250 job matches and compared labels across various experts, with very rigorous criteria. In the end they built a final eval set of 147 matches that were very high confidence: multiple experts agreed, and there was nothing subjective. Then they tried prompting LLMs with recruitment guidelines to classify job-match quality.
Of course, they tried the cheap stuff first and then the expensive stuff. Unfortunately, the cheap stuff didn't work; only GPT-4 worked, and GPT-4 was too slow, taking an average of 31 seconds. No problem: OpenAI lets you fine-tune GPT-3.5. So they fine-tuned GPT-3.5 on labels from GPT-4, and if you focus on the green and red boxes, you can see the fine-tuned GPT-3.5 achieves almost as good precision and recall as GPT-4 while cutting latency and cost by about two-thirds. Which is perfect, except the average latency was still around six seconds, and that's still not good enough for where they needed to run it online.
So then they trained a lightweight classifier. Unfortunately, they don't go into much detail about what this lightweight classifier is. I suspect it's not a language model; it's probably something like a decision tree, because they talk a lot about categorical features. And the labels are solely the LLM-generated labels. So you can see the whole journey: first the eval set; then GPT-4, which is good but too slow; then fine-tuned GPT-3.5, step-by-step incremental progress, but still too slow and too expensive; then, even though they'd rather not train their own classifier and maintain the ops of retraining it, they had no choice, because that's what it took to reduce inference latency. They achieved an AUC-ROC of 0.86 against the LLM labels, which is pretty freaking good, and it's low latency (they don't say how low) and suitable for real-time filtering.
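A hedged sketch of that last distillation step: the paper doesn't say what the lightweight classifier is, so the gradient-boosted tree and the features below are purely my stand-ins for training a small model on LLM-generated labels and checking AUC-ROC against them:

```python
# Sketch: distill LLM judgments into a small, fast classifier over (mostly
# categorical) match features. The classifier choice and features are invented;
# only the recipe (train on LLM labels, evaluate AUC-ROC) follows the paper.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.integers(0, 50, n),   # e.g. job-category ID
    rng.integers(0, 10, n),   # e.g. seniority bucket
    rng.random(n),            # e.g. title-similarity score
])
y = (X[:, 2] + 0.1 * rng.standard_normal(n) > 0.5).astype(int)  # stand-in for LLM "bad match" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("AUC-ROC vs. LLM labels:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```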
The benefits of this are pretty tremendous; you can read them yourself, but the big one is that they lowered unsubscribe rates by 5%. That is huge. If you maintain any kind of push- or email-notification product, the unsubscribe rate is your biggest guardrail, because once people unsubscribe you can never reach them again. You lose them, and all your customer acquisition cost goes down the drain. Maybe you ran an offer to get people to sign up and hear more from you, and once they unsubscribe, that's it. So on the top line of Table 2 they share results for the invite-to-apply emails; those are the results I've highlighted here. And at the bottom they also have online experiments for the homepage recommendation feed.
That's how low-latency this classifier has to be: it runs on homepage recommendations. And similarly, we see very good results. For example, impressions dropped by 5.1% at a threshold of 15 and by 7.95% at a threshold of 25. What does that mean? It means you freed up 5% to 8% of impressions to show more good stuff. That's huge; freeing up roughly a twelfth of your impressions is a very big deal. More real estate is better. So I think this was an outstanding result.
The other one I want to share, before we pause for a few quick questions and I go into the two other sections, is query understanding at Yelp. This one was very nice, and it's purely using OpenAI models. They had two things: query segmentation and review highlights.
The query segmentation one is not so straightforward to understand, but essentially, given a query like "Epcot restaurants", they can split it into parts like topic, location, name, question, and so on. With better segmentation, they can rewrite those parts of the query with greater confidence to help search. The second bullet point gives an example: if we know the user's location is approximately Orlando, Florida, and the user searched "Epcot restaurants", we can rewrite the location from Orlando, Florida to Epcot for the search backend. And because the search backend is location-based, rewriting it to Epcot gives the user more precise results.
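Here's a sketch of what that segmentation step could look like with a strict output schema. The prompt wording, the segment labels, and the call_llm stub are my assumptions, not Yelp's actual implementation:

```python
# Sketch of LLM-based query segmentation with a JSON output schema.
# `call_llm` is a placeholder for whatever LLM client you actually use.
import json

SEGMENT_PROMPT = """Segment the search query into parts.
Return JSON with keys: topic, name, location, question (use null if absent).
Query: "{query}"
"""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your actual LLM call; this just returns a canned answer.
    return '{"topic": "restaurants", "name": null, "location": "Epcot", "question": null}'

def segment_query(query: str) -> dict:
    return json.loads(call_llm(SEGMENT_PROMPT.format(query=query)))

def rewrite_location(user_location: str, segments: dict) -> str:
    # If the query itself names a location, let it override the coarse user location.
    return segments["location"] or user_location

segments = segment_query("Epcot restaurants")
print(rewrite_location("Orlando, Florida", segments))  # -> "Epcot"
```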
So that's one example. The original write-up has a lot of good images that I didn't include here; I only started reading it about an hour beforehand. The other piece is review highlights. Imagine you search for some food, maybe "vegetarian friendly Thai food". Sometimes reviews say things like "vegetarian", "veggie only", "suitable for vegetarians", and I'm sure there are many more synonyms. In the past, humans had to write out these synonyms and add them to dictionaries, which isn't scalable. Now they can use LLMs to replicate that human reasoning in the phrase extraction, and they get much better coverage: it covers 95% of traffic. So with query segmentation they understand user intent a bit better, and with review highlights they can surface more relevant reviews, especially for long-tail queries, which makes search more engaging.
By highlighting the reviews relevant to a user's query, they help the user feel more confident. Say the query is about vegetarian-friendly food: maybe the highlighted review says "the vegetarian food is great and delicious", or "definitely no meat involved, and it happens to be gluten-free as well". Things like that make users more confident in the search results they're seeing. Okay, I'll pause here. Any questions? I know there's a lot in the chat. Oh my goodness. I have a quick question on this query understanding thing.
What was the previous state of the art in query segmentation? This seems like the most obvious possible thing to do. Named entity recognition: you get a span, and you train some kind of classic transformer model that takes the input and cuts it into segments. Okay, so did they compare it to NER? They mentioned that their original system was NER-based and this is better. Okay, nice. Yeah, everyone starts with some kind of NER-based approach. My theory is that there's basically no point doing traditional NER anymore; you just use LLMs with some kind of schema. Could be. Although fast NER is cheap, and that matters for search. LLMs are great but slow: if you want spell check or autocomplete, like a grammar tool, Gemini is slow.
Oh, this is why I'd rather present from the write-up itself. Where is it, the search query section? So they show their legacy models: you can see they used named entity recognition, and that's how they did it. Oh wow, people are drawing on this; I didn't know you could do that. But one thing I really like about this write-up is this chart right here. The chart may look very basic: formulation, scoping the task, proof of concept, scaling up.
But they wrote it very well to explain how they did it in the context of these two case studies. I feel like a lot of people just completely drop this step; they ignore it. And for query segmentation (I know I'm taking a long time to get to the point), 10% of queries make up 80% of traffic. So they could do the query segmentation once, offline, and then just retrieve from a cache. They derive a golden dataset and fine-tune; I can't remember exactly where they wrote it, but that's how they did it for query segmentation and for review highlights. So a lot of things that feel like they have to be done online don't have to be, because of the power law in e-commerce and online search: you can lean on a cache a lot.
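A small sketch of that caching idea; the thresholds, the toy query log, and the segment_query stand-in are all placeholders:

```python
# Sketch: because a small head of queries covers most traffic, segment the head
# offline once and only fall back to the LLM for the long tail. Numbers are arbitrary.
from collections import Counter

def segment_query(query: str) -> dict:
    # Stand-in for the LLM-backed segmentation sketched earlier.
    return {"topic": query, "location": None}

query_log = ["epcot restaurants"] * 80 + ["thai food"] * 15 + ["vegan pho near me"] * 5
head_queries = [q for q, count in Counter(query_log).most_common() if count >= 10]

cache = {q: segment_query(q) for q in head_queries}     # computed offline, in batch

def segment_online(query: str) -> dict:
    return cache.get(query) or segment_query(query)     # long tail falls back to the model

print(segment_online("epcot restaurants"))
```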
There's also a question from Tyler Cross: how do these approaches compare to more classic information retrieval methods like BM25 or vector methods? Tyler, what do you mean by "these approaches", which ones specifically? Oh, maybe Tyler's not on the call. I think you said it when you were going through the LLM part, like "we tried GPT-3.5, fine-tuned GPT-3.5, Llama 2, an 11B model". Ah, okay. I think that's a classification task, so BM25 wouldn't really apply there. But if you're asking more generally about using LLMs or embeddings for retrieval, hmm, I don't know the full context of the question, so I'd better not answer. Okay, any other questions? If not, I'll move on, because the other two sections are fairly heavy, and then we'll have more time after that. Ooh, wait, I'm not sure I'm sharing the right screen.
Give me a second. Okay, first I need to go to Google Chrome. See, this is what happens when you get a noob to do slides. Okay, share. You're seeing this, right? You're seeing my slides, in slideshow mode? Yes, full screen. Okay, perfect.
Perfect. So the other theme is LLM-inspired training paradigms. Maybe it's LLM-inspired, or maybe people have been doing this for a very long time, but I thought I'd highlight it. The first one is scaling laws. Ever since I shared my post, people have gotten back to me with at least three or four other papers studying scaling laws along very similar lines. So I want to talk about the experiments this one ran. They used a decoder-only transformer architecture and tried various model sizes. The training data is framed just like sentences: fixed-length sequences of 50 item IDs each, and the training objective is, given the past 10 items, predict item 11; given the past 20 items, predict item 21. Fairly straightforward. But they did introduce two key things that are very interesting.
The first is layer-wise adaptive dropout. For LLMs, every transformer layer usually has the same dimension, but for recommendation models that's not the case: they're usually fairly fat at the bottom and get skinnier towards the top. So here they use higher dropout in the lower layers and lower dropout in the higher layers, and ablation studies showed this works. The intuition is that the lower layers process more direct input from the data, and because e-commerce and recommendation data is fairly noisy, they're more prone to overfitting, hence more dropout at the bottom. Vice versa, the upper layers learn from more abstract representations, and you want to make sure they don't underfit; you want to squeeze out all the juice you can, so they use lower dropout at the higher layers.
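A minimal sketch of what that might look like; the dropout schedule and layer widths below are invented, not the paper's values:

```python
# Layer-wise adaptive dropout: more dropout on the lower (noisier, closer-to-input)
# layers, less on the upper layers. Schedule and widths are placeholders.
import torch.nn as nn

dropout_schedule = [0.3, 0.2, 0.1, 0.0]   # bottom layer -> top layer
hidden_sizes = [512, 256, 128, 64]        # fat at the bottom, skinnier at the top

layers, in_dim = [], 1024
for width, p in zip(hidden_sizes, dropout_schedule):
    layers += [nn.Linear(in_dim, width), nn.ReLU(), nn.Dropout(p)]
    in_dim = width
tower = nn.Sequential(*layers)
print(tower)
```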
The other thing, which feels a little bit like black magic, is that they switch optimizers halfway through training: they start with Adam and then switch to SGD. Their observation, from running a full run with Adam and a full run with SGD, was that Adam reduces the loss very quickly at the start but then slowly tapers off, whereas SGD is slower at the start but achieves better convergence. So those were the two tricks they needed for their sequential models.
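Here's a sketch of that optimizer handoff; the switch point, learning rates, and the tiny model are placeholders:

```python
# Optimizer switch: Adam for fast early loss reduction, then hand the same
# parameters over to SGD for the rest of training. All numbers are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(32, 1)
data = [(torch.randn(16, 32), torch.randn(16, 1)) for _ in range(100)]

switch_step = 50
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
sgd = torch.optim.SGD(model.parameters(), lr=1e-2)

for step, (x, y) in enumerate(data):
    optimizer = adam if step < switch_step else sgd
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```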
What were the results? This part is fairly obvious: higher model capacity reduces cross-entropy loss, where model capacity means model parameters excluding the ID embeddings, so purely the layers themselves. And they were able to model this. Look at the dashed test-loss curve and the blue dots: they fit a power-law curve to the blue dots and were able to fairly accurately predict where the red dots would land. This is like the Kaplan-style and Chinchilla-style scaling laws: given some smaller models, how would a bigger model perform? The other thing was... oh gosh, do these lines look correct? Okay, the red arrow looks correct; everything else is fine.
So over here, I think this is a very nice result: smaller models need more data to achieve comparable performance. On the left you can see the small model on the orange line needed twice the amount of data to match a bigger model. The flip side is that if you want a highly performant small model online, you're going to need a factor more data. Of course, this is nothing unusual; it's something we already know. But it's really nice that someone ran the experiments and distilled them into these results.
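For the power-law fit mentioned above, here's a sketch of the exercise with synthetic numbers; the paper fits measured test losses against non-embedding parameter counts, whereas these points are generated just to show the mechanics:

```python
# Sketch: fit a power law to the losses of small models, extrapolate to a bigger one.
# Data points are synthetic; only the fitting/extrapolation mechanics are the point.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])   # non-embedding parameter counts
losses = power_law(params, 120.0, 0.35, 5.2) + 0.01 * np.random.default_rng(0).normal(size=5)

(a, b, c), _ = curve_fit(power_law, params, losses, p0=[100.0, 0.3, 5.0])
print("predicted loss at 1B params:", power_law(1e9, a, b, c))
```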
The other thing I thought was really interesting is this idea of recommendation-model pre-training. This was fairly new to me; I didn't think it could be done. Most approaches do this on content embeddings: given the content of this item, predict the content of that item. This one is fairly novel in that it's trained solely on item popularity statistics. They say it works; it's still quite hard for me to fathom how. Essentially, they take an item's popularity statistics on the monthly and weekly timescales, convert them to percentiles, and convert those percentiles to vector representations. That's it; that's the representation of the item. So any time you have a new item, as long as you have its statistics for the past month and week, you can compute its percentiles and map them to vector representations. What this means is that if the percentiles are bucketed into hundredths and we track monthly and weekly stats, all we need is 200 embeddings: 100 percentile vectors for the monthly scale and 100 for the weekly scale. So instead of millions or billions of item ID embeddings, all you need is 200 percentile vector representations. That is extremely compressed.
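A rough sketch of that representation, with invented counts and dimensions, and glossing over the paper's normalization and time-interval details:

```python
# Sketch: represent an item only by its monthly and weekly popularity percentiles,
# so the whole catalog shares ~200 embeddings. Counts and dims are made up.
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 32
monthly_emb = rng.normal(size=(100, emb_dim))   # one vector per monthly percentile
weekly_emb = rng.normal(size=(100, emb_dim))    # one vector per weekly percentile

monthly_counts = rng.poisson(50, size=10_000)   # fake play counts for 10k items
weekly_counts = rng.poisson(12, size=10_000)

def percentile_bucket(count: int, reference: np.ndarray) -> int:
    return min(int((reference < count).mean() * 100), 99)

def item_vector(monthly: int, weekly: int) -> np.ndarray:
    m = monthly_emb[percentile_bucket(monthly, monthly_counts)]
    w = weekly_emb[percentile_bucket(weekly, weekly_counts)]
    return np.concatenate([m, w])   # a new item only needs a week/month of stats

print(item_vector(monthly=240, weekly=60).shape)  # (64,)
```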
They also had to do several tricks, like relative time intervals and fixed positional encodings, that don't come across as intuitive to me. They explain that they did them, but it's unclear how I would know a priori, without running the experiments, that I needed to. So it feels like there are too many tricks, too many things that have to perfectly align for this to work.
So I think it's very promising, but I wish there were a simpler way to do it. The results: it has promising zero-shot performance. By zero-shot I mean it's trained on a source domain and then applied to a different target domain, and you see only a 2-6% drop in Recall@10 compared to fairly strong baselines, SASRec and BERT4Rec, trained on the target domain itself. And if you then train this model on the target domain, it matches or surpasses SASRec and BERT4Rec trained from scratch, while using only 1-5% of the parameters, because it has no item embeddings, just those 200 monthly and weekly percentile embeddings. So this is quite promising as one direction towards pre-trained recommendation models. You can imagine some kind of recommendations-as-a-service adopting this idea, and maybe it could work.
Maybe something like Shopify: they have a lot of new merchants onboarding. Can we take existing merchants' data, with their permission, completely anonymized, trained solely on popularity, and train this model? Then for any new merchant onboarding, as long as we have a week of data, we can use the weekly popularity embeddings; once we have a month of data, we can use the full model. We don't even need semantic IDs. The second topic is distillation, with two papers from Google and YouTube here. The thing about distillation is that if you learn solely from the teacher's labels, it's very noisy.
The teacher models aren't perfect, so it's better to also learn from the ground truth; but we do know that adding the teacher's labels helps. What they show on the left is that direct distillation, learning from both the hard labels (the ground truth) and the distillation labels (what the big teacher model provides) with a single head, is not as good as auxiliary distillation. Auxiliary distillation essentially means you split into two logits: one logit learns from the hard label, and one learns from the distillation label. They find this works very well. I didn't have time to put the results here, but it works well for YouTube.
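A sketch of that two-head setup, with invented sizes; only the idea of separating the ground-truth head from the distillation head follows the description:

```python
# Auxiliary distillation sketch: two heads on a shared trunk, one trained on the
# ground-truth label, one on the teacher's soft label. Sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Student(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim, 64), nn.ReLU())
        self.hard_head = nn.Linear(64, 1)     # trained on ground truth, used for serving
        self.distill_head = nn.Linear(64, 1)  # trained on teacher soft labels

    def forward(self, x):
        h = self.trunk(x)
        return self.hard_head(h).squeeze(-1), self.distill_head(h).squeeze(-1)

student = Student()
x = torch.randn(8, 64)
hard_label = torch.randint(0, 2, (8,)).float()   # clicked / not clicked
teacher_prob = torch.rand(8)                     # teacher's predicted click probability

hard_logit, distill_logit = student(x)
loss = (F.binary_cross_entropy_with_logits(hard_logit, hard_label)
        + F.binary_cross_entropy_with_logits(distill_logit, teacher_prob))
loss.backward()
```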
And then the thing is that the teacher model is expensive, so they amortize its cost by having one big fat teacher model; and by big fat, I mean only two to four times bigger. That teacher just keeps pumping out soft labels that all the students can learn from, and this makes all the students better. And why do we want students at all, if the teacher is better? Because the student models are small and cheap, and at YouTube's scale, where they have to serve a huge number of requests, that's probably what they need to do.
Another approach, from Google, which I think they applied in the YouTube setting as well, is called self-auxiliary distillation. The intuition (don't look at the image yet) is that they want to prioritize high-quality labels and improve the resolution of low-quality ones. What does it mean to improve the resolution of low-quality labels? Essentially: if something was impressed but not clicked, we shouldn't treat that as a hard label of zero. Instead, we get the teacher to predict what that label should be, to smooth it out. In the image, the ground-truth labels are in green and the teacher predictions are in yellow. To combine a hard label with a soft label, they suggest a very simple function (I don't know if it's what they actually use): essentially the max of the teacher prediction and the ground-truth label. So if the actual label is zero and the teacher says 0.3, you use 0.3; if the actual label is one and the teacher says 0.5, you take the one. By smoothing the labels this way and having the model learn them on an auxiliary head, you're able to improve the model itself and use it for serving.
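In code, that label-combination step could look like this minimal sketch (the auxiliary head is just a placeholder tensor here):

```python
# Self-auxiliary distillation sketch: soften the labels by taking the max of the
# ground truth and the teacher prediction, then learn them on an auxiliary head.
import torch
import torch.nn.functional as F

ground_truth = torch.tensor([0., 0., 1., 0.])     # impressed-but-not-clicked -> 0
teacher_pred = torch.tensor([0.3, 0.05, 0.5, 0.8])

soft_target = torch.maximum(ground_truth, teacher_pred)  # [0.3, 0.05, 1.0, 0.8]
aux_logits = torch.zeros(4, requires_grad=True)          # stand-in for the auxiliary head's output
loss = F.binary_cross_entropy_with_logits(aux_logits, soft_target)
loss.backward()
```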
So there are a lot of distillation techniques here, which I think are quite inspired by what we've seen in computer vision and language models. I haven't seen many of these distillation techniques in the recommendations field myself, so I thought they were pretty interesting. The last one, and unfortunately the last one I have slides for (I can go through the other recommended reads, but I didn't have time to make slides), was quite eye-opening for me. Essentially, LinkedIn replaced several ID-based ranking models with a single 150B-parameter decoder-only model. What this means is that you could replace, say, 30 different logistic regressions or decision trees or neural networks with a single text-based decoder-only model. It's built on a Mistral MoE model, which is why it's approximately 150B.
It's trained on about six months of interaction data. And the key thing: you might wonder, a decoder-only model, what does that mean? Will it write LinkedIn posts for me, update my job title? No; the focus here is solely binary classification: will the user like or interact with a post, or apply for a job? So you can imagine this model probably only needs to output "apply" or "not apply". It's probably more complex than that, but essentially this is a big fat decoder-only model that is very good at binary classification.
That's why it's able to do well, and that's how it was evaluated. There are different training stages, and here it's probably better for me to go into the actual write-up, because I didn't have time to put this on slides. First, continuous pre-training: they take member interactions across the different LinkedIn products, plus the raw entity data, essentially all this job-hunting-related data, and pre-train the model so it gets some idea of the domain. After continuous pre-training comes the post-training approach. They do instruction tuning, which is training the model to follow instructions; they use UltraChat plus internally generated instruction-following data, that is, getting LLMs to come up with questions and answers for relevant LinkedIn tasks and keeping the high-quality ones. And then finally supervised fine-tuning. They say things like multi-turn chat format, but essentially the goal of supervised fine-tuning is to train the model on the specific tasks. Now that we know it can follow instructions, let's make it better at the specific task: what action would a member take on this post? Would it only be an impression, a like, a comment, and so on. So that's how they go through the different stages. And now I'm going back to my slides.
Okay, so they have these three different stages. And here's the crazy thing. You can see the slides, right? Can someone just say yes? Yes. Okay. The crazy thing is that they have now replaced feature engineering with prompt engineering, because of this unified decoder model. You can broadly read it: here's the instruction; here's the current member's profile, a software engineer at Google; here's their resume; here are the jobs they've applied to. Will the user apply to this job? And the answer is "apply". You could probably simplify this into a one or a zero.
I guess they just state it in text as an example, but that's all this model is doing. For a user, we've retrieved several jobs, and this model does the final pass of deciding how to rank them. They take the log probs of the output to score each one; so if there are, say, 10 jobs the member might apply to, we take the log probs to rank them. I don't know if this is a good thing or a bad thing. I find feature engineering more intuitive than prompt engineering, but maybe it's a skill issue. Essentially, now every PM can engineer their own features.
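A sketch of scoring-by-log-prob with a decoder-only LM. GPT-2 is just a stand-in so the snippet runs locally; LinkedIn's model is their own 150B MoE, and the prompt and answer format below are my guesses, not theirs:

```python
# Sketch: rank candidate jobs by the log-probability the LM assigns to an
# "apply" answer token. GPT-2 and the prompt format are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def apply_score(member_profile: str, job_posting: str) -> float:
    prompt = (f"Member profile: {member_profile}\n"
              f"Job posting: {job_posting}\n"
              f"Will the member apply? Answer:")
    target_id = tokenizer.encode(" apply")[0]           # token whose probability we score
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]          # next-token logits
    return torch.log_softmax(logits, dim=-1)[target_id].item()

jobs = ["Backend engineer at a startup", "Staff SWE, search infra", "Sales associate"]
scores = {job: apply_score("Software engineer at Google", job) for job in jobs}
print(sorted(scores, key=scores.get, reverse=True))     # jobs ranked by log-prob of "apply"
```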
The impressive thing is that this can support 30 different ranking tasks. That's insane. So now, instead of 30 different models, you just have one big fat decoder model. That sounds a bit crazy to me. Firstly, it's crazy impressive. Secondly, it's a lot of savings. Thirdly, I don't know how they deal with the alignment tax, or maybe it's just a do-no-harm tax.
I don't know. Essentially, the goal of RecSys has been to decouple everything, right? Have retrieval be very good at retrieval, have ranking be very good at ranking, and squeeze as much juice as we can out of each model. What this is saying instead is: we have too many separate ranking models, we're going to unify them into one big fat model, push all the data through it, and hopefully it outperforms. And it does outperform, but it needs a lot of data. You can see in the graph that up to release three it was not better than production; the axis on the left is the gap to production, where zero means on par with production. I don't know who had to get whatever budget, or quarterback this, to push it through. But as they add more and more tokens, it starts to beat production, something like a 2.5% increase. So this is a huge leap of faith, very much a bitter-lesson bet: just give us more data and a single model, and we will outperform.
Okay, that's all I had to share. I want to briefly highlight two other papers which I think are very good; they're a little less connected to LLMs, but they go deep into their system architecture.
The first one is from Etsy. You can see it's extremely complicated, but it shows a very realistic and practical approach: a classic two-tower architecture, they share about negative sampling, and they talk about product quality. The thing is, a listing can have very appealing images, but when people buy the item, they return it; you'll never detect that if you're only using content. So what they did is build a product-quality embedding that they use to augment their approximate-nearest-neighbor index; you can see the quality vector in the diagram. This is extremely pragmatic, and I can tell you that no e-commerce site, search engine, or online-discovery product can do without some form of quality vector or post-hoc quality filtering. We saw that with Indeed's Expected Bad Match: they need the quality signal, they just operationalize it differently, as a final post-filtering layer, whereas Etsy includes it in the approximate-nearest-neighbor index. So I highly recommend reading this one: very practical, with a lot of detail about their system design.
The next one I also highly recommend is the model ranking platform at Zalando. It covers all the best practices and the different tenets, like composability, scalability, and steerable ranking, and they go really deep: here's the candidate generator, essentially the retrieval step, a two-tower model retrieved via ANN; then the ranker; and finally the policy layer. What is the policy layer? It encourages exploration and applies business rules, like removing previously purchased items and enforcing item diversity; again, some measure of quality or business logic that the model would never learn from the data. The model will never learn that showing new items is good, because they're untested, so you have to override the model with this policy layer. And of course, very good results. What I really like about this paper is that if you want to learn about system design for RecSys, the Zalando paper and the Etsy paper are really, really good and really in-depth. But of course, everything here is very good; if a paper I read was pretty crap, I wouldn't include it. Every paper here is pretty good for system design, under the final section, which is unified architectures. Okay.
I spoke a lot. Any questions? Did I lose anyone? Did I lose everyone? Eugene, I have one quick question, to double-check on the LinkedIn paper. My understanding is that the big 150-billion-parameter model is actually used as a teacher model, and the knowledge is distilled into smaller models that are then used for the different tasks. I don't know if that aligns with your understanding, because practically, serving a 150-billion-parameter model for prediction, the latency wouldn't be acceptable and it would be too costly. They actually have a separate paper that discusses how the knowledge distillation from that 150-billion-parameter model is done.
I put it in the chat; I don't know if you've come across that paper. I have not, but my impression was that they were actually serving it. Thank you for sharing this and for correcting my misunderstanding; I need to look deeper into the original paper to confirm. Let me tag you in the thread. That's my understanding. I also find the approach very interesting; we were actually thinking about a similar approach, and they've kind of proved that it works. Yeah, thank you. I think that could probably be it; there's no way it would be feasible to serve at that scale. I didn't see anything in the original paper that actually suggested it, but I think you're right: there's just no way to serve it at that scale. Yeah, I tagged you in the thread; I have the paper. Thank you, I've added it to my saved list to confirm. Thanks.
Thanks for explaining this, and for the double-check. Thank you. Any other questions? I mean, one thing that I tried to figure out, and I might as well ask the pre-trained LLM that is you, is how to rank these: what's the lowest-hanging fruit versus the higher ones? The way you organized your write-up was four sections: model architecture, data generation, scaling laws, and unified architectures. Was there any particular reason or order to that? To me it seemed very clear that model architecture is basically useless. Is that true? Would you agree? I don't think so. I actually think model architecture right now is a bit more of a dilemma. I would say that in 2023 it was useless; I hadn't seen good results.
Now I'm seeing good results. Would you classify this YouTube one as a good result? Because I read it and I was like, wait, these are smart ideas, and then the results don't really outperform the baseline. I think they're decent results. And coincidentally, after I published this and it made the rounds on Hacker News, people from YouTube reached out, including one of the authors of this exact paper. They wanted me to come in and chat with them, and they said they have more papers they're pushing to publish. And this is a perennial problem, especially for YouTube and TikTok: new videos get uploaded all the time, and cold start is their bread and butter. So I wouldn't be surprised that they're focusing so hard on content embeddings for cold start. That's for new videos, but they have users and user histories, and they've saturated the world on that. Very likely so. You can imagine YouTube, Twitter, and, not mentioned here, ads, Google ads: it's always cold start. Being able to crack the cold-start problem by even 0.1% is huge. And you can see that a lot of the papers here, the semantic IDs from YouTube, Kuaishou (which is like TikTok), Huawei (this one I'm not quite sure why they did it), a lot of them are also solving cold start. So yeah.
Okay, but in terms of orders of magnitude, which of these gives orders of magnitude? I really think the low-hanging fruit right now is using LLMs for data generation. Yeah, that's my read too; I think everyone can do this now. And you know, the Expected Bad Match paper: Indeed did it, and I actually did something very similar last year, which got published internally. This approach of starting from a small labeled set, doing active learning, fine-tuning a model, then more active learning, is very, very effective. It really helps improve quality. I was doing it in the context of LLMs and hallucinations, but I can imagine doing it for relevance, or for any measure of quality you want to focus on, and it will work very well. Okay.
And then, of course, architecture: I would say data generation first, then model architecture and system architecture. Oh wait, actually, even the scaling-laws part has some things that are very practical. One example, which I didn't have time to go through, is basically LoRAs for recommendation. They train a single model on all-domain data; imagine all domains like fashion, furniture, toys, and so on, or even ads, videos, and e-commerce. And then they have domain-specific LoRAs for each domain, and this works very well.
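A minimal sketch of that per-domain LoRA idea; the rank, sizes, and domain names are placeholders:

```python
# Per-domain LoRA sketch: one shared base layer trained on all domains, plus a
# small low-rank adapter per domain that is swapped in at serving time.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # shared weights stay frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))

    def forward(self, x):
        return self.base(x) + x @ self.A @ self.B      # base output + low-rank update

base_layer = nn.Linear(128, 128)                       # trained once on all-domain data
adapters = {domain: LoRALinear(base_layer) for domain in ["fashion", "furniture", "toys"]}
out = adapters["fashion"](torch.randn(4, 128))         # only the fashion A/B get fine-tuned
```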
So I definitely think that, right now, it's not easy to learn from data across domains for recommendation systems. And, correct me if I'm wrong, I really think that for RecSys you want to overfit on your domain.
You want to overfit and predict the next best thing for tomorrow, and that's it, period. I can overfit and just retrain every day. But yeah, I know we have a few questions here. Daniel asks: shouldn't the LinkedIn model be combined with a retrieval model that's used to generate candidates? Yes, correct. That's probably an upstream retrieval model, and then the LinkedIn decoder model just does the ranking; a two-step process, retrieval and then ranking. Next: for LLM-based search and retrieval, are there any papers that talk about the impact of query rewriting and prompt engineering, and the sensitivity to it? I think we know LLMs are sensitive to prompts, but they're increasingly less sensitive because they're much more instruction-tuned. I'm not sure about papers that measure the impact of query rewriting; the only one I've covered here is the one by Yelp.
So I think that could be helpful. Next: what's the process for keeping these hybrid models up to date, and for personalization? Keeping them up to date, that's a good question; I don't know if they actually need to be. If you look at these hybrid models, take the semantic ID embedding: it uses a frozen VideoBERT. Similarly for Kuaishou, they use frozen Sentence-BERT, ResNet, and VGGish. The content encoders themselves don't need to be up to date; the assumption is that content today looks the same as content tomorrow and a month from now. What is learnable is the semantic ID embeddings and the cluster embeddings. Now, personalization, that's the harder question. I guess the way they include personalization is: after we learn the content, we also need to learn what the user is interested in, and that's the two-tower approach. That's why you can see this small layer here, multimodal interest intensity learning: given a user and their past historical sequence, which modality are they interested in? I think that's how they do personalization here.
Any other questions? If not, we can always ask for volunteers for next week's session. Anyone? No, I think this was really helpful. It just feels like RecSys is always these bundles of ideas; in the way that agents are bundles of ideas for LLMs, RecSys is also a bundle of ideas. And they're obviously converging. I definitely think so, and you can see examples: we learned item embeddings via word2vec; when people talk about graph embeddings, it's really just taking the graph, doing a random walk, converting that random walk into a sentence of item IDs, and using word2vec. Similarly, learning the next best action went from GRUs to transformers and BERT; it was very obvious it would work. So I think we'll see more from the LLM space being adopted in RecSys as well. What's the link in your mind between re-rankers and RecSys? So in RecSys we have retrieval and ranking, right?
You can see it over here, where is it... so what we do is: retrieval retrieves a lot of top candidates, say a hundred, and then ranking finds the best five so you can focus on those. I think it's the same relationship between retrieval and re-ranking. What people call re-ranking in RAG is really just taking the retrieved results and finding the best five; Cohere has re-rankers that find the best five to put into the LLM's context.
Yeah. To me it's a bit weird, right? The re-ranker models are being promoted as: you feed in your top-k results, they re-rank them, and somehow that produces better RAG because the more relevant results end up at the top. But without the context that RecSys has, like user preferences and user histories, how can you have any useful re-ranking at all? I think there are some ways. Imagine the most naive retrieval is just BM25 or semantic search. Now imagine you have a lot of historical data on those BM25 and semantic-search results and all the associated metadata, which you probably can't use at the retrieval stage because it's too expensive. Then you can train a re-ranker on top, say boosting when the author matches, or when this user usually looks for this kind of document. I think it's possible. I haven't dived too deep into how re-ranking is done for RAG, but it's possible.
Oh, Apulantia, you had your hand raised. Yeah, and thank you so much, Swix and Eugene, you guys are amazing; I'm a huge fan of you both. The question I had is: is it worthwhile for an organization to go and build this themselves, when you have something like Jina.ai, who, as we know, popularized a lot of the work on deep search, not just internal retrieval but external search and retrieval, because they have the embedding models, the re-rankers, the deep-search retrieval APIs, all unified? How do you feel about that, Eugene and Swix? Should teams build it themselves, or should they buy something off the shelf? Jina.ai has this full stack with fine-tuned CLIP models, embedding models, and re-rankers, plus the deep-search retrieval, and they've posted some of the better technical blogs on deep search out right now. Yeah, my answer is going to be pretty boring, and you can apply it to any question of whether someone should use an LLM off the shelf or fine-tune their own model.
I think that for prototyping, just do whatever is fast and demonstrate user value. It will get to a point where what you need doesn't fit what something off the shelf can do, and that's Indeed's story with Expected Bad Match: latency was still too high even with the fine-tuned model, so the only way was to train their own. Similarly for retrieval: out-of-the-box embeddings may be good enough for a while, and then, when you want to squeeze out more juice, you'll probably need to fine-tune your own embeddings. Replit recently shared something about fine-tuning their own embeddings, half the size and so on, and it does way better. There are a lot of examples here; I think Etsy also fine-tuned their own embeddings, and they outperform embeddings out of the box. It's a little bit of an unfair comparison, because if you took those embedding models and fine-tuned them further, I think they could do better too. But essentially the point is: use off-the-shelf to move fast, and then, if you do need to customize it, customize it.
I love it. Thank you so much, super helpful. You're welcome. Sorry, we have one more question; go ahead, and I think this is probably my last one. What do you think is the biggest opportunity for applying LLMs in the recommendation domain? Because we've been discussing retrieval, ranking, content understanding, and so on, and there are so many different prediction tasks. You're asking what I think the big opportunity is? I think embeddings: what I've seen is that embeddings are helpful for retrieval. Instead of purely keyword retrieval, like "ant killer", someone might just ask, "I have ants in my house", and a semantic embedding can help you match that. Ranking will also clearly work with an LLM-based ranker; I think LinkedIn shows it can clearly work. And for search, there's this guy Doug Turnbull, who's increasingly going down this route, and we've seen the example from Yelp: using an LLM for query segmentation, query expansion, and query rewriting clearly works in Yelp's use case, and you can just cache all those results. So those are the three things off the top of my head. Okay. Thank you, everyone. I do need to drop.
Maybe you can discuss what the next paper is. Maybe Swix will talk about the Moore's Law for AI every seven months paper, which I think is interesting. No, not that one; the autoregressive image generation one. Okay. All right. Bye, everyone. Thank you. Take care.