Improving RecSys and Search in the age of LLMs — Eugene Yan

00:00:00.000 |
Okay, so I've been, I work in the field of recommendation systems and search, and every 00:00:10.420 |
now and then I like to pop my head up to try to see what's going on. 00:00:13.760 |
I think the recent trend for the past one or two years has been the interaction between 00:00:18.200 |
the recommendation systems and search and LLMs. 00:00:22.300 |
Let's say RecSys and LLMs, RecSys and search and LLMs. And I think in early 2023, 00:00:29.760 |
we would see some papers where people use the decoder-only models to try to predict IDs. 00:00:36.680 |
But at the end of last year and early this year, we're starting to see signs of life whereby 00:00:40.060 |
some of these actually are A/B tested and have very good empirical results. 00:00:43.760 |
So I want to highlight a few of those patterns and we can see how it goes. 00:00:49.120 |
The one thing is that, for the longest time, what we're trying to do for RecSys and search is to 00:00:55.740 |
predict the next item from past item IDs. You can imagine if Eugene interacts with item IDs 1, 10, 25, we predict his next interaction. 00:01:04.660 |
But all of this is relying solely on item IDs, and you can imagine every time that you have 00:01:08.500 |
a new item ID, you have to learn a new embedding for it, and that leads to a cold-start problem. 00:01:13.700 |
So I know this was not part of my recommended reads in my write-up, I will actually go back 00:01:18.340 |
and update it to be recommended reads, but I want to discuss two papers here that try to address this. 00:01:23.620 |
The first one is semantic IDs, which is from YouTube. 00:01:26.260 |
You can imagine YouTube has a lot of new videos all the time; they can't learn new item ID embeddings for all of them. 00:01:34.540 |
So what they do is they have a transformer encoder, it generates dense content embeddings. 00:01:38.740 |
This is actually just a video encoder that converts a video into dense content embeddings. 00:01:42.860 |
And then they compress this into what they call semantic ID. 00:01:47.780 |
So the dense video embedding is 2048-dimensional, but what they do is they take that embedding, find 00:01:53.260 |
its nearest neighbor in a codebook (the codebook is also 2048 entries), assign it to that nearest codebook entry, 00:01:59.340 |
take the residual, find the next nearest neighbor, assign it. 00:02:02.180 |
So it just keeps compressing and compressing it. 00:02:07.860 |
So essentially you can compress an item into a handful of integers; in the paper, 00:02:12.580 |
they actually compress an ID into eight integers. 00:02:16.580 |
So now, for each item, you take its content embedding and convert it to eight integers. 00:02:28.900 |
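To make the residual-quantization step concrete, here's a minimal sketch (this is not the paper's RQ-VAE training code; the codebooks here are random and the sizes are assumptions):

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Greedy residual quantization: assign to the nearest entry of each
    codebook in turn, quantizing whatever is left over at each level."""
    codes, residual = [], embedding.copy()
    for codebook in codebooks:                       # e.g. 8 levels -> 8 integers
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest codebook entry
        codes.append(idx)
        residual = residual - codebook[idx]          # keep compressing the residual
    return codes                                     # this tuple is the "semantic ID"

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(2048, 2048)) for _ in range(8)]  # sizes are assumptions
video_embedding = rng.normal(size=2048)
print(residual_quantize(video_embedding, codebooks))           # eight integers
```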
To use these semantic IDs, they tried n-grams; n-gram is really just, you know, like fastText, like character n-grams, where every subword, 00:02:34.500 |
every one-character, two-character, three-character chunk learns its own embedding. 00:02:37.620 |
And then they also tried using SentencePiece. 00:02:39.220 |
Essentially, SentencePiece is really just looking at the corpus of all these item IDs: 00:02:44.100 |
what are the most common subwords, the most common sub-pieces? 00:02:48.740 |
So therefore, it's no longer just a unigram, bigram and trigram. 00:02:53.860 |
It's that you can learn variable length subwords. 00:02:58.260 |
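Here's a tiny sketch of the n-gram idea applied to semantic IDs; the SentencePiece variant would instead learn variable-length pieces from a corpus of IDs. The position-tagging scheme below is illustrative, not the paper's:

```python
def semantic_id_ngrams(codes, max_n=2):
    """Turn a semantic ID like (5, 31, 7, 2) into unigram/bigram tokens,
    each of which gets its own learned embedding, fastText-style."""
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(codes) - n + 1):
            # tag with the position so code 5 at level 0 != code 5 at level 1
            tokens.append(f"L{i}:" + "-".join(str(c) for c in codes[i:i + n]))
    return tokens

print(semantic_id_ngrams((5, 31, 7, 2)))
# ['L0:5', 'L1:31', 'L2:7', 'L3:2', 'L0:5-31', 'L1:31-7', 'L2:7-2']
```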
Well, not surprisingly, dense content embeddings by themselves do worse than item IDs, right? 00:03:10.180 |
You can see this on the chart on the left here, right? 00:03:13.860 |
You can see unigram and bigram, the red line and the purple line; the unigram 00:03:21.140 |
actually is worse than the random hash item ID, the orange line, to some extent. 00:03:31.220 |
Okay, I have the content-embeddings-only line in my write-up. 00:03:37.300 |
But this chart here is actually trying to show that when they use the dense content embeddings alone, it's crap. 00:03:42.020 |
But when they used both n-gram and SentencePiece, it did better. 00:03:45.620 |
So you have to do this trick whereby you convert that full dense content embedding into its own semantic ID and then learn embeddings for those IDs. 00:03:55.540 |
Now, the benefit of this, you might be saying, hey, you know, isn't this all back to IDs again? 00:03:59.540 |
Well, not necessarily, because now, given a piece of content, you can convert it to an embedding and then you can assign it to its nearest semantic ID. 00:04:05.780 |
And therefore, you don't need to learn on behavioral data. 00:04:09.860 |
Similarly, Kuaishou, which is like the number two TikTok in China, adopted the same approach. 00:04:21.780 |
So they use embeddings from ResNet, Sentence-BERT and VGGish to get the respective modalities. 00:04:27.380 |
And then they simply concatenate these into a single vector. 00:04:32.420 |
They just do k-means to identify a thousand clusters. 00:04:37.540 |
Now, therefore, each cluster is a trainable ID. 00:04:40.980 |
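Roughly, that clustering step looks like this (a sketch with scikit-learn and made-up embedding sizes, not Kuaishou's actual pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_items = 5_000
# Stand-ins for frozen ResNet / Sentence-BERT / VGGish embeddings per item
visual, textual, acoustic = (rng.normal(size=(n_items, d)) for d in (512, 384, 128))
fused = np.concatenate([visual, textual, acoustic], axis=1)      # simple concat

kmeans = KMeans(n_clusters=1000, n_init=3, random_state=0).fit(fused)
cluster_ids = kmeans.predict(fused)      # each item -> one of 1,000 trainable IDs

# A brand-new item only needs its content embeddings to get a cluster ID:
new_item = rng.normal(size=(1, 512 + 384 + 128))
print(int(kmeans.predict(new_item)[0]))  # no behavioral data required
```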
These cluster IDs are then embedded via a modal encoder. 00:04:45.380 |
So this modal encoder, there's quite a bit to it. 00:04:49.780 |
You can see at the top, it says non-trainable visual embeddings. 00:04:53.940 |
In this example, they only use visual embedding as an example. 00:04:56.820 |
But you can imagine all the non-trainable embeddings. 00:05:00.180 |
They take it and they project it into a different space via the mapping network. 00:05:05.140 |
Secondly, for every cluster ID, they convert it into learned embeddings for visual, textual and acoustic. 00:05:11.700 |
These are not the original embeddings that come from it. 00:05:14.020 |
These are just the representation of the multimodal cluster ID. 00:05:18.020 |
And then fusion is really just concatenating it. 00:05:20.420 |
So now you might be thinking, how is this modal encoder learned? 00:05:28.180 |
This modal encoder is trained within the overall ranking network, 00:05:32.580 |
where you can see the modal encoder is at the bottom, right? 00:05:35.380 |
So this overall network takes the user tower, which is on the left, 00:05:40.980 |
and the item tower that's on the right, and it tries to predict the likelihood that the user will click or like. 00:05:45.380 |
Therefore, based on this, they just backprop. 00:05:48.900 |
You backprop the loss on the likelihood of clicking or liking or following, and you backprop it through the modal encoder. 00:05:55.220 |
And that's how the modal encoder learns the mapping network and the cluster ID embeddings. 00:06:00.740 |
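Here's a compressed PyTorch sketch of how gradients from the click/like loss can flow back into the mapping network and cluster embeddings (shapes and layer sizes are made up; this shows the idea, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ModalEncoder(nn.Module):
    def __init__(self, content_dim=1024, n_clusters=1000, dim=64):
        super().__init__()
        self.mapping = nn.Linear(content_dim, dim)        # projects the frozen content embedding
        self.cluster_emb = nn.Embedding(n_clusters, dim)  # learned multimodal cluster-ID embedding
    def forward(self, content_emb, cluster_id):
        return torch.cat([self.mapping(content_emb), self.cluster_emb(cluster_id)], dim=-1)

modal = ModalEncoder()
user_tower = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
item_tower = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

user_feats = torch.randn(32, 64)                         # batch of 32 users
content_emb = torch.randn(32, 1024)                      # frozen, non-trainable content embeddings
cluster_id = torch.randint(0, 1000, (32,))
clicked = torch.randint(0, 2, (32,)).float()

logits = (user_tower(user_feats) * item_tower(modal(content_emb, cluster_id))).sum(-1)
loss = nn.functional.binary_cross_entropy_with_logits(logits, clicked)
loss.backward()  # gradients reach the mapping network and the cluster embeddings
```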
So the benefit of this is that they shared that it outperformed several multimodal baselines. 00:06:05.460 |
And when they did A/B testing, I think those are pretty freaking significant numbers. 00:06:11.060 |
Anything more than 1% is pretty strong in a platform like this. 00:06:17.060 |
And the benefit of this is that they mentioned that they had increased cold start velocity and cold start coverage. 00:06:21.540 |
It means that, you know, cold start is able to pick up faster. 00:06:24.340 |
If it's a good cold start video, it's able to pick up faster. 00:06:27.460 |
And they are also able to show more cold start, 3.6% more cold start content, which increases coverage. 00:06:37.540 |
So let me butt in: for those new to the RecSys world, you said anything more than 1% is a big deal. 00:06:46.500 |
Can you contextualize, like how much is that worth or is that like... 00:06:52.660 |
Imagine you are making, I don't know, let's just make something up. 00:06:58.740 |
And if people are engaging more, spending 1% more time, you can show 1% more ads. 00:07:10.420 |
This is, these are engagement, engagement proxy metrics. 00:07:16.420 |
So like, and is this, is this absolute or relative? 00:07:19.860 |
So for example, are we saying that, let's say, likes was plus three right here: 00:07:23.940 |
are we saying they went from 6% to 9%, or are we saying they're currently at 6% and now it's 6.09%? 00:07:46.100 |
So no surprises here: using multi-modality consistently outperforms single-modality features. 00:07:54.420 |
But there was also a trick here whereby they had to learn user-specific modality interests. 00:07:59.380 |
And if you look at it on the left tower there, close to the top, there's this thing called multi-modal interest intensity learning. 00:08:09.380 |
Essentially what they're learning there is for each user, which modality they are interested in. 00:08:15.460 |
Some people, like Swyx, are very acoustically inclined. 00:08:19.620 |
For other people, they might care more about the visuals or the video itself. 00:08:24.180 |
Or the text, having a good caption is worth a thousand words, maybe. 00:08:28.740 |
So that's one trend I've seen, which is increasingly including more content into the model itself. 00:08:38.020 |
The other trend that I've seen is using LLMs for synthetic data. 00:08:43.140 |
And there's two papers that I really like here because they share a lot of details. 00:08:48.660 |
And they share a lot about the pitfalls that they faced. 00:08:52.100 |
So the first one is, this is paper I really like. 00:08:57.940 |
Essentially, you can imagine you are providing people with job recommendations, right? 00:09:03.060 |
And then you want to have a final filter at the end to filter out bad job recommendations. 00:09:09.780 |
So it's not easy to get full access to this paper online. 00:09:14.980 |
I've included it in our Discord channel; we have a thread there. 00:09:20.020 |
I've actually dropped the PDF in there, and they talk through the entire process. 00:09:24.980 |
And I think it's quite a role-model process. 00:09:27.380 |
They started with looking at 250 job matches, right? 00:09:33.140 |
And then they compared it across various experts. 00:09:37.860 |
And in the end, they built a final eval set of 147 matches that were very high confidence. 00:09:43.540 |
multiple experts agreed and, you know, there was nothing subjective. 00:09:46.820 |
Then they tried prompting the LLMs with recruitment guidelines, right? 00:09:52.980 |
And of course, you know, they tried things like the cheap stuff and then they tried things like the expensive stuff. 00:10:11.300 |
So what they did was then they fine tune GPT-3.5. 00:10:15.300 |
And GPT-3.5, you can see, let's just focus on the green boxes and then the red boxes. 00:10:22.660 |
You can see GPT-3.5 is able to achieve almost as good precision and recall as GPT-4. 00:10:34.100 |
But GPT-3.5 was like reduced latency by two thirds and cost by two thirds, which is perfect, right? 00:10:43.300 |
But the thing is, you see that the average latency there is like six seconds. 00:10:48.020 |
And that's still not good enough for when they needed to do it online. 00:10:52.180 |
So then what they did is they fine tune the lightweight classifier. 00:10:56.180 |
Unfortunately, they didn't go into very many details of what this lightweight classifier is. 00:11:00.020 |
I suspect that this lightweight classifier is maybe not a language model. 00:11:03.460 |
I suspect that it is probably a decision tree because they talk a lot about categorical features. 00:11:09.380 |
And the labels are just solely the LLM-generated labels, right? 00:11:15.620 |
So the story is: first the eval set, then they test GPT-4; GPT-4 is good, but too slow. 00:11:19.380 |
Then they try fine-tuned GPT-3.5: step-by-step incremental progress, but still too slow, too expensive. 00:11:25.540 |
They would really rather not have to train their own classifier and maintain the ops of retraining it, 00:11:33.220 |
but this is what they needed to do to reduce inference latency. 00:11:38.260 |
They were able to achieve an ROC AUC of 0.86. 00:11:45.060 |
That's pretty freaking good against LLM labels. 00:11:50.340 |
I don't know exactly how low the latency is, but it's suitable for real-time filtering. 00:11:53.460 |
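My guess at what that looks like, if it is indeed a tree-based model: distill the fine-tuned LLM's judgments into a fast classifier over tabular features. Everything below (features, labels, model choice) is synthetic and illustrative, not Indeed's actual setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in features of a (job, seeker) pair: title overlap, distance, salary gap, ...
X = rng.normal(size=(20_000, 12))
# The labels come from the fine-tuned LLM judging "bad match", not from humans
llm_labels = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=20_000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, llm_labels, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("AUC vs LLM labels:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
# Millisecond-scale inference is what makes filtering every recommendation online feasible.
```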
The benefits of this are pretty tremendous, right? 00:11:59.060 |
But I think the one big benefit is they lowered unsubscribed rates by 5%. 00:12:05.300 |
If you maintain some kind of push notification or email notification thing, 00:12:10.980 |
the unsubscribe rate is like your biggest guardrail. 00:12:14.100 |
Because if people unsubscribe, you're never, ever going to reach them again. 00:12:19.220 |
So, you know, all your customer acquisition cost is really down the drain. 00:12:23.300 |
Like maybe you have an offer and you let people sign up. 00:12:30.580 |
And so over here on the top line in table two, they share the results for the invite-to-apply emails. 00:12:39.460 |
And then that's the results I highlighted here. 00:12:41.780 |
And in the bottom, I also see they had online experiments for the homepage recommendation feed. 00:12:47.300 |
And that's how low latency this classifier has to be. 00:12:51.300 |
It has to be on the homepage recommendations. 00:12:53.300 |
And similarly, we see very good results, right? 00:13:00.500 |
Impressions drop 5.1% at threshold of 15 and then drop by 7.95% at threshold 25. 00:13:11.460 |
That means you freed up 5% to 8% of impressions. 00:13:21.780 |
But freeing up 1/12 of impressions is a very big deal. 00:13:29.940 |
So I think that this was quite an outstanding result. 00:13:34.100 |
The other one I want to share and then we'll pause for questions, short questions before I go into two other sections, is query understanding at Yelp. 00:13:41.380 |
So query understanding at Yelp was very nice. 00:13:47.540 |
One is query segmentation and the other one is review highlights. 00:13:49.780 |
The query segmentation one is not so straightforward to understand, but essentially, given a query like "Epcot restaurants", they can split it into different parts like topic, location, name, question, etc. 00:14:02.660 |
And then by having better segmentation, they can have greater confidence in rewriting those parts of the query to help them search better. 00:14:12.900 |
So the second bullet point gives you an example. 00:14:15.300 |
If we know that the user's location is approximately there and the user said Epcot restaurants, we can rewrite the user's location from Orlando, Florida to Epcot for the search backend. 00:14:28.100 |
And because the search backend is based on location, by rewriting Orlando, Florida to Epcot, they were able to get more precise results for the user. 00:14:42.580 |
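Yelp doesn't publish their exact prompt or schema, but the shape of the idea is something like this (the model name and JSON keys are my assumptions, using the OpenAI client as a stand-in):

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()

def segment_query(query: str) -> dict:
    """Ask an LLM to split a search query into segments like topic / location / name."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content":
             "Segment the search query. Return JSON with keys topic, location, name, question. "
             "Use null for segments that are not present."},
            {"role": "user", "content": query},
        ],
    )
    return json.loads(resp.choices[0].message.content)

print(segment_query("Epcot restaurants"))
# e.g. {"topic": "restaurants", "location": "Epcot", "name": None, "question": None}
```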
And the original write-up, they have a lot of good images. 00:14:45.780 |
I didn't include those images here because I didn't have time. 00:14:48.500 |
I only started reading this like an hour before. 00:14:53.620 |
The second one is review highlights. So imagine if you search for some food, maybe you search for vegetarian-friendly Thai food, right? 00:15:01.860 |
And then sometimes in the reviews, people would say things like vegetarian, veggie only, suitable for vegetarians. 00:15:08.980 |
And then I'm sure that there are way more synonyms for this. 00:15:14.180 |
In the past, they had to get humans to write these different synonyms. 00:15:23.380 |
But now they can use LLMs to replicate the human reasoning, right? 00:15:25.940 |
And they get way better coverage and it can cover 95% of traffic. 00:15:31.780 |
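The synonym-expansion side is roughly the same pattern: use the LLM offline to enumerate the phrasings a human taxonomist used to write by hand, then cache them. Again, the prompt and model below are placeholders:

```python
from openai import OpenAI  # again, any OpenAI-compatible endpoint

client = OpenAI()

def expand_phrase(phrase: str) -> list[str]:
    """Generate review-text variants of a query concept, to be cached offline."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"List 10 short phrases a restaurant review might use to mean '{phrase}'. "
                   "One per line, no numbering."}],
    )
    return [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

# e.g. expand_phrase("vegetarian friendly") might return
# ["veggie only", "suitable for vegetarians", "vegetarian options", ...]
```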
So for query segmentation, they are able to understand the user intent a little bit better. 00:15:36.500 |
And then for review highlights, because they were showing more reviews, especially for the long tail queries, it makes search more engaging. 00:15:46.260 |
By highlighting the relevant reviews for a user's query, they help the user feel more confident about the food. 00:15:55.380 |
And then maybe the user review would really say something like, oh, the vegetarian food is great and delicious or something. 00:16:04.580 |
Things like that help make the users more confident in the search results that they're seeing. 00:16:17.460 |
There's a quick, I mean, I have a quick question on just this query understanding thing. 00:16:22.260 |
What was the previous SOTA in query segmentation? 00:16:25.220 |
Like this seems like the most obvious, dumb possible thing to do. 00:16:32.260 |
You get a span and then you train some kind of classic transformer model 00:16:36.980 |
that takes the input and then cuts it at certain characters. 00:16:42.580 |
So it's basically... but this guy did not compare it to NER, right? 00:16:46.420 |
They mentioned that their original approach was NER and this is better. 00:16:55.140 |
I mean, everyone starts with some kind of NER-based approach. 00:16:58.020 |
My theory is that, yeah, basically 00:17:01.540 |
there's no point doing traditional NER anymore. 00:17:04.820 |
You just do LLMs with some kind of schema. 00:17:10.980 |
Unless you need fast, cheap NER; that's it for search, right? 00:17:20.260 |
Like if you want a spell-check or autocomplete thing, like a grammar tool, Gemini is slow. 00:17:27.380 |
Oh, so this is why I might rather present from this, um, where is it, the search query write-up? 00:17:38.980 |
So they started with their legacy models. 00:17:47.700 |
You can see named entity recognition, right? 00:17:50.580 |
They used named entity recognition, and they do this. 00:17:58.820 |
But then, one thing I really like about this write-up is this chart. 00:18:05.540 |
The chart seems very basic: formulation, scoping the task, proof of concept, scaling up. 00:18:11.140 |
But they wrote it very well to explain how they did it in the context of these two case studies. 00:18:17.780 |
I feel like a lot of people just completely drop this. 00:18:25.460 |
I know I'm taking a long time to get to the point, 00:18:28.420 |
which is that 10% of queries make up 80% of traffic. 00:18:31.860 |
So they could do all this query segmentation once, period. 00:18:43.540 |
I can't remember where they wrote it, but that's how they did it for query segmentation and for review highlights. 00:18:48.420 |
So essentially, a lot of these things may feel like they have to be done online, 00:18:53.380 |
but because of the power law in e-commerce and online search, 00:18:56.900 |
you don't have to; you can make use of a cache a lot. 00:19:00.260 |
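As a toy illustration of why the cache works so well under a power law: precompute the expensive LLM call for the head queries and only fall back to it for the tail (the numbers and helper names here are made up):

```python
from collections import Counter

def build_head_cache(query_log, coverage=0.8, expensive_fn=lambda q: {"segments": q.split()}):
    """Precompute results for the most frequent queries that cover ~80% of traffic."""
    counts = Counter(query_log)
    total, covered, cache = sum(counts.values()), 0, {}
    for query, n in counts.most_common():
        if covered / total >= coverage:
            break
        cache[query] = expensive_fn(query)   # e.g. the LLM segmentation call, done offline
        covered += n
    return cache

log = ["epcot restaurants"] * 80 + ["thai food"] * 15 + ["vegan ramen near me"] * 5
cache = build_head_cache(log)
print(len(cache), "queries cached, covering", sum(log.count(q) for q in cache), "of", len(log))
```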
Um, there's also another question from Tyler Cross. 00:19:06.020 |
How do these approaches compare to more traditional information retrieval methods like BM25 or vector methods? 00:19:11.380 |
Um, uh, Tyler, do you want to, what do you mean by, by these approaches? 00:19:22.820 |
I think you said this when you were doing the LLM, like we tried 3.5, 00:19:31.940 |
I think in this case, uh, if it's a classification approach and this wouldn't work. 00:19:36.900 |
Um, but if you're talking more generally, like using LLMs or embeddings, uh, for retrieval. 00:19:46.180 |
I don't know the full, the full context of the question. 00:19:57.700 |
I can move on because the other two sections are fairly heavy and I, and then, you know, 00:20:14.420 |
Ah, see, this is what happens when you get a noob to do slides. 00:20:30.980 |
So then the other thing is what I'm calling LLM-inspired training paradigms. 00:20:34.980 |
Maybe it's LLM-inspired, maybe people have been doing this for a very long time, but I thought it was interesting. 00:20:39.700 |
The first one is really looking at the scaling laws. 00:20:43.140 |
And ever since I published, I shared about this post. 00:20:45.540 |
Like people have gotten back to me with like at least three or four papers about other studies 00:20:48.740 |
of scaling laws, but along a very similar, very similar view. 00:20:52.580 |
So I want to talk about the experimentation that they did. 00:20:55.860 |
This scaling-law study used a decoder-only transformer architecture. 00:21:03.460 |
Essentially it's fixed length sequences, 50 item IDs each. 00:21:06.740 |
And the training objective is given the past 10 items, predict item number 11. 00:21:10.740 |
Given the past 20 items, predict item number 21. 00:21:14.660 |
But they did introduce two key things that are very interesting. 00:21:20.180 |
The first one is layer-wise adaptive dropout. 00:21:22.500 |
So you can imagine, right, for LLMs, every single layer has the same dimension. 00:21:28.660 |
You know, usually when they draw a transformer, every single layer has the same dimension. 00:21:33.060 |
But for recommendation system models, that's not the case. 00:21:35.540 |
It's usually fairly fat at the bottom and gets skinnier towards the top. 00:21:39.300 |
So what they do over here is they have higher dropout in the lower layers and lower dropout in the higher layers. 00:21:46.420 |
So the intuition here is that the lower layers process more direct input from the data. 00:21:51.140 |
And because e-commerce data or recommendation data is fairly noisy, it's more prone to overfitting. 00:21:58.260 |
Therefore, they have more dropout in the lower layers. 00:22:01.780 |
Vice versa, at the upper layers, it learns from more abstract data. 00:22:06.260 |
And therefore, you want to make sure it doesn't underfit. 00:22:09.140 |
You want to make sure it gets all the juice you can get. 00:22:11.380 |
And therefore, they have lower dropout at the higher layers. 00:22:16.100 |
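A small sketch of what layer-wise adaptive dropout could look like in PyTorch; the actual rates and schedule are whatever the authors tuned, the numbers below are placeholders:

```python
import torch.nn as nn

def layerwise_dropout(n_layers=8, p_bottom=0.3, p_top=0.05):
    """Higher dropout near the (noisier) input, lower dropout near the top."""
    return [p_bottom + (p_top - p_bottom) * i / (n_layers - 1) for i in range(n_layers)]

layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=128, nhead=4, dropout=p, batch_first=True)
    for p in layerwise_dropout()
])
print([round(layer.dropout.p, 3) for layer in layers])  # 0.3 ... 0.05, decreasing with depth
```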
The other thing, which feels a little bit like black magic, is that they switch optimizers halfway through training. 00:22:22.580 |
Firstly, they start with Adam and then they switch to SGD. 00:22:26.340 |
The observation they had was that, you know, they ran a full run with Adam, ran a full run with SGD, 00:22:30.420 |
is that Adam is able to very quickly reduce the loss at the start. 00:22:35.780 |
But then it like slowly tapers off, whereas SGD is slower at the start but achieves better convergence. 00:22:40.500 |
So they had to do these two tricks for their sequential models. 00:22:46.820 |
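The optimizer hand-off is simple to express; the switch point and learning rates below are placeholders, not the paper's settings:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 1)
data = [(torch.randn(32, 64), torch.randn(32, 1)) for _ in range(100)]

adam = torch.optim.Adam(model.parameters(), lr=1e-3)              # fast initial loss reduction
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # better late convergence

for step, (x, y) in enumerate(data):
    opt = adam if step < len(data) // 2 else sgd   # switch halfway through training
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
```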
I mean, this is fairly obvious: 00:22:50.340 |
higher model capacity reduces cross-entropy loss. 00:22:53.140 |
And this model capacity is model params excluding ID embeddings. 00:22:59.380 |
So it's purely just the layers itself without the ID embeddings. 00:23:05.860 |
If you look at the dashed line, the test loss curve, and the blue dots: they estimated that power-law curve with the blue dots. 00:23:14.900 |
And they were able to fairly accurately predict where the red dots are going to be. 00:23:19.300 |
And, you know, this is like the Kaplan-style scaling laws and the Chinchilla-style scaling laws. 00:23:25.140 |
So essentially, given some smaller model, if we had bigger model, how would it perform? 00:23:29.060 |
The other thing was, oh gosh, these lines don't look correct. 00:23:38.020 |
So over here, I think this is a very nice result, which is that smaller models need more data to achieve comparable performance. 00:23:46.980 |
So over here on the left, you can see that there's a small model there on the orange line. 00:23:51.700 |
It needed twice the amount of data compared to a bigger model to get similar performance, right? 00:23:59.540 |
So the flip side of it is that, hey, you know, if you want highly performant models online, you're going to need a factor more data. 00:24:12.820 |
This is something we know, but it's really nice to have someone have done the experiments and distill it into the results here. 00:24:19.140 |
The other thing that I thought was really interesting is this idea of recommendation model pre-training. 00:24:32.420 |
Most people do this on content embeddings, which is given some content of this item, can you predict the content of that item? 00:24:40.180 |
I thought this was fairly novel, whereby it's trained solely on item popularity statistics. 00:24:47.940 |
It's still quite hard for me to fathom how it works. 00:24:51.860 |
Essentially, just take the item popularity statistics at the monthly and the weekly timescale, convert them to percentiles, and then convert those percentiles to vector representations. 00:25:06.500 |
So anytime you have a new item, so long as you have its statistics for the past month and week, you can compute the percentiles and then map them into vector representations. 00:25:14.900 |
So what this means is: imagine our percentiles are only at a granularity of 100 and we have stats for monthly and weekly; all we need is 200 embeddings, 00:25:24.660 |
100 percentile vector representations for the month and 100 for the week. 00:25:31.220 |
So instead of millions or billions of item ID embeddings, all you need is 200 percentile vector representations. 00:25:41.060 |
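As a rough sketch of that representation (the percentile granularity and embedding sizes here are illustrative, not the paper's):

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
monthly_counts = np.sort(rng.integers(1, 10_000, size=1_000_000))  # corpus popularity stats
weekly_counts = np.sort(rng.integers(1, 3_000, size=1_000_000))

monthly_emb = nn.Embedding(100, 32)  # one vector per monthly percentile
weekly_emb = nn.Embedding(100, 32)   # one vector per weekly percentile -> 200 vectors total

def percentile(count, sorted_counts):
    """Map a raw popularity count to a 0-99 percentile bucket."""
    return min(99, int(np.searchsorted(sorted_counts, count) / len(sorted_counts) * 100))

def item_representation(monthly, weekly):
    """Any item, even a brand-new one, is represented purely by its popularity stats."""
    m = torch.tensor(percentile(monthly, monthly_counts))
    w = torch.tensor(percentile(weekly, weekly_counts))
    return torch.cat([monthly_emb(m), weekly_emb(w)])

print(item_representation(monthly=4_200, weekly=950).shape)  # torch.Size([64])
```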
They also had to do several tricks like relative time intervals and fixed position encoding that don't come across as intuitive to me. 00:25:50.340 |
They explain that they did that, but it's unclear: how would I know a priori, without running the experiment, that I needed to do it? 00:25:59.700 |
So it feels like there's like too many tricks. 00:26:02.900 |
There are so many tricks, like, okay, I need these three things, the stars to perfectly align, for this to work. 00:26:07.380 |
So I think it's very promising, but I wish there was a simpler way to do this. 00:26:11.300 |
The results: it has promising zero-shot performance. 00:26:15.380 |
What I mean by zero-shot performance: basically, it trains on one source domain and then tries to apply across domains, to another domain, right? 00:26:23.700 |
And you only see a two to six percent drop in recall at 10. 00:26:25.940 |
This is compared to baselines, which are fairly good baselines, SASRec and BERT4Rec, which are trained on the target domain itself. 00:26:32.020 |
Now, if you take this model and you train it on that target domain, 00:26:38.660 |
it matches or surpasses SASRec and BERT4Rec when trained from scratch. 00:26:42.100 |
And the thing is, it only uses one to five percent of the parameters, because it doesn't have item embeddings, right? 00:26:48.100 |
It only has those 200 embeddings at the monthly and the weekly scale for every percentile. 00:26:52.100 |
So this is quite promising, in the sense that it's one direction towards pre-trained recommendation models. 00:27:00.180 |
You can imagine some kind of recommendation as a service, adopting this idea and maybe it could work. 00:27:08.660 |
Shopify has a lot of new merchants onboarding. 00:27:11.220 |
Hey, you know, can we take existing merchant data with their permission? 00:27:16.260 |
It's just solely trained on popularity, right? 00:27:20.580 |
Now for any new merchant that's onboarding, as long as we have a week of data, we can use the weekly popularity embeddings. 00:27:26.820 |
And once we have a month of data, we can use that model. 00:27:31.860 |
The second one: we have two papers here, 00:27:36.660 |
from Google and YouTube. 00:27:38.900 |
And the one thing about distillation is that 00:27:44.900 |
if you solely learn from the teacher labels, it is very noisy, right? 00:27:54.260 |
The teacher models are not perfect models; it's better to learn from the ground truth. 00:27:57.140 |
But we do know that adding the teacher models does help. 00:28:00.100 |
So what they do here is on the left side, you can see that direct distillation, which is learning from 00:28:06.980 |
both the hard labels, which is the ground truth and the distillation labels, which is what the teacher provides, 00:28:11.860 |
the teacher model, the big teacher model provides, is not as good as auxiliary distillation. 00:28:17.620 |
And essentially, what auxiliary distillation means is that you split it into two logits: 00:28:21.620 |
one logit to learn from the hard label, one logit to learn from the distillation label. 00:28:28.100 |
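A minimal sketch of the two-logit idea: a shared body, one head for the ground truth, one for the teacher's soft labels (shapes and losses are made up, not YouTube's actual model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentWithAuxHead(nn.Module):
    """One shared body, two logits: one trained on ground truth, one on teacher labels."""
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, 128), nn.ReLU())
        self.hard_head = nn.Linear(128, 1)     # learns from real clicks
        self.distill_head = nn.Linear(128, 1)  # learns from the teacher's soft labels
    def forward(self, x):
        h = self.body(x)
        return self.hard_head(h).squeeze(-1), self.distill_head(h).squeeze(-1)

student = StudentWithAuxHead()
x = torch.randn(256, 64)
hard_labels = torch.randint(0, 2, (256,)).float()  # ground-truth clicks
teacher_probs = torch.rand(256)                    # the big teacher's predictions

hard_logit, distill_logit = student(x)
loss = (F.binary_cross_entropy_with_logits(hard_logit, hard_labels)
        + F.binary_cross_entropy_with_logits(distill_logit, teacher_probs))
loss.backward()  # at serving, you'd typically use the hard-label head
```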
I didn't have time to put the results here, but they find that this works well for YouTube. 00:28:32.580 |
And then the thing is that the teacher model is useful. 00:28:35.060 |
So what they did is that they amortized the cost by having a big fat teacher model. And 00:28:40.260 |
by big fat teacher model, I mean, it's only two to four X bigger. 00:28:44.100 |
By having a teacher model that's two to four X bigger, this teacher model will just keep pumping 00:28:48.580 |
out the soft labels that all the students can learn from. And this makes all the students better. 00:28:53.460 |
And of course, why do we want students? If you're saying that teacher model is better, 00:28:56.420 |
why do we want students? We want the students because the student models are small and cheap. 00:29:00.580 |
And at YouTube scale, where they have to make a lot of requests, this is probably what they need to do. 00:29:05.140 |
Another approach, which is from Google, and I think they applied this in the YouTube setting as well, 00:29:12.660 |
is called self auxiliary distillation. So the intuition here is this, don't look at the image first. 00:29:19.380 |
So intuition here is this, they want to prioritize high quality labels and improve the resolution of 00:29:24.340 |
low quality. What does it mean to improve the resolution of lower quality labels? Essentially, 00:29:29.220 |
what they're saying is that if something got an impression but was not clicked, we should not treat that as a label 00:29:36.900 |
of zero. Instead, what we should do is to try to get the teacher to predict what that label is, 00:29:42.820 |
to smoothen it out. So if you look at the image, you can see that they have ground truth labels, 00:29:48.660 |
which is those in green, and they have teacher predictions, which is those in yellow. 00:29:52.020 |
So to combine the hard label with the soft label, they suggested a very simple function in the paper; 00:29:57.300 |
I don't know if that's what they actually use, but essentially it's the max of the teacher prediction and the 00:30:02.740 |
ground truth. So imagine the actual label was zero 00:30:09.540 |
and the teacher said it's 0.3: you just use the 0.3. Or if the actual label is one 00:30:14.660 |
and the teacher says 0.5, you just take the one. So by smoothing the labels 00:30:20.020 |
and having the student learn on the auxiliary head, you are actually able to improve the 00:30:27.540 |
model itself and use it for serving. 00:30:33.780 |
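Here's the label combination in a few lines, as I read it (the max combiner may not be exactly what they ship, as noted above):

```python
import torch

ground_truth = torch.tensor([0., 0., 1., 1.])    # impressed-but-not-clicked items get a hard 0
teacher_pred = torch.tensor([0.3, 0.05, 0.5, 0.9])

# Element-wise max: a hard 0 gets "resolution" from the teacher's soft score, a hard 1 stays 1.
aux_labels = torch.maximum(ground_truth, teacher_pred)
print(aux_labels)  # tensor([0.3000, 0.0500, 1.0000, 1.0000])
# The auxiliary head trains on aux_labels; the main head keeps the raw ground truth.
```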
So there are a lot of distillation techniques here, which I think are quite inspired by what we see from computer vision and language models. I haven't seen too many of 00:30:41.940 |
these distillation techniques myself in the field of recommendations, so I thought they were pretty 00:30:45.700 |
interesting. The last one, and unfortunately this is the last one I have slides for, I can go through 00:30:51.140 |
the other recommended reads I have, but unfortunately I didn't have slides to do for it, 00:30:56.980 |
is this one. So this is quite eye-opening for me. Essentially what LinkedIn did was they replaced 00:31:07.700 |
several ID-based ranking models into a single 150B decoder-only model. What this means is that, for 00:31:16.900 |
example, you could replace 30 different logistic regressions or decision trees or neural networks 00:31:22.660 |
with a single text-based decoder-only model. This model is built on the Mistral MoE, right, 00:31:30.500 |
that's why it's like approximately 150B. And it's trained on three, six months of interaction data, 00:31:36.180 |
and the key, the main, so you may think, okay, decoder-only model, what does it mean? 00:31:40.500 |
Will you write posts for me? Will you write LinkedIn posts for me? Will you write my, 00:31:43.780 |
will you write, update my job title, whatever? The focus here is solely binary classification, 00:31:50.420 |
whether the user will like or interact with a post, or apply for a job. 00:31:56.340 |
So you can imagine that this model probably only needs to output like or not like. It's probably more 00:32:03.220 |
complex than that. But essentially, this is a big fat decoder-only model that is very good at 00:32:07.140 |
binary classification. That's why it's able to actually do well. And that's how they were evaluated. 00:32:14.180 |
So there are different training stages. And over here, I think maybe it's better for me to go over, 00:32:18.500 |
go into the actual write-up itself because I just didn't have time to 00:32:26.980 |
share this. So they have continuous pre-training. So continuous pre-training, they just take member 00:32:33.540 |
interactions on LinkedIn, different LinkedIn products, right? And then your raw entity data, 00:32:38.260 |
essentially just take all this job-related, job hunting-related data to pre-train the model, 00:32:43.860 |
to help the model get some idea of what is the domain. After continuous pre-training, 00:32:51.380 |
they do the post-training approach. They do instruction tuning. So essentially, this is like 00:32:56.900 |
training the model for instructions. They follow, they use UltraChat and internally generated instruction 00:33:04.260 |
following data, right? So get LLMs to come up with questions and answers, relevant LinkedIn tasks, 00:33:08.820 |
and then try to find high-quality ones. So that's training it, fine-tuning it to follow instructions. 00:33:14.020 |
And then finally there is supervised fine-tuning. They do a lot of things, like a multi-turn chat 00:33:18.980 |
format. But essentially, the goal for supervised fine-tuning is to train the model to do the specific 00:33:29.780 |
task. I don't remember where it is exactly, but it's like, ah, so this is a specific task. 00:33:36.820 |
So now that we know it can follow instructions, now let's make it better at the specific task. 00:33:42.260 |
What action would a member take on this post? Would it solely be an impression? Would it be 00:33:48.420 |
a like? Would it be a comment? Etc. So that's how they go through the different stages. And now I'm going back to my slides. 00:34:03.540 |
Okay. So they have these three different stages. And so here's the crazy thing. You can see the slides, 00:34:15.140 |
right? Can someone just say yes? Yes. Okay. The crazy thing is that they have now replaced feature 00:34:23.220 |
engineering with prompt engineering because of this unified decoder model. So you can broadly read it. It's like, 00:34:30.900 |
this is the instruction. Here's the current member profile, software engineer at Google. Here's their 00:34:36.020 |
job. Here's their resume. Here's the things that they have applied to. So will the user apply to this job? 00:34:44.500 |
And the answer is apply. And you can probably simplify this into a one or zero, right? I guess they just 00:34:50.180 |
say in the text as an example, but that's all this model is doing. For a user, we have retrieved several 00:34:57.220 |
jobs. This model is doing the final pass of which ones to rank. And they take the log probs of the output 00:35:04.020 |
to score it. So essentially, if it says that the member will apply, maybe you have 10 jobs that a 00:35:09.780 |
member might apply to. Then you take the log probs to rank them. I don't know if this is a good thing or a bad thing. 00:35:16.260 |
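To make the log-prob scoring concrete, here's a rough sketch with a small open model standing in for LinkedIn's 150B ranker; the prompt template and answer token are my paraphrase, not their actual format:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")              # stand-in for the big decoder
model = AutoModelForCausalLM.from_pretrained("gpt2")

def apply_score(profile: str, job: str) -> float:
    """Score a (member, job) pair by the log-probability of the ' apply' continuation."""
    prompt = (f"Member profile: {profile}\nJob posting: {job}\n"
              "Will the member apply to this job? Answer apply or not_apply: ")
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]                 # next-token distribution
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs[tok(" apply").input_ids[0]].item()    # log P(" apply" | prompt)

jobs = ["Senior ML Engineer at Acme", "Barista at Coffee Co"]
scores = {j: apply_score("Software engineer at Google, 5 yrs Python/ML", j) for j in jobs}
print(sorted(jobs, key=scores.get, reverse=True))         # rank candidates by log prob
```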
I find feature engineering more intuitive than prompt engineering, but maybe it's a skill issue. But 00:35:22.020 |
essentially now, all PMs can engineer their own features. The impressive thing was, is that this 00:35:31.700 |
can support 30 different ranking tasks. That's insane. So now, instead of 30 different models, 00:35:38.820 |
you just have one big fat decoder model. That sounds a bit crazy to me. Firstly, it's crazy impressive. 00:35:46.260 |
Secondly, it's a lot of savings. Thirdly, I don't know how to deal with the alignment tax, or maybe it's 00:35:51.620 |
just a do-no-harm tax. I don't know. Essentially, the goal of RecSys was to decouple everything, right? 00:35:57.940 |
It's like, have retrieval be very good at retrieval, have ranking be very good at ranking. And then 00:36:02.740 |
each model just squeezes as much juice as we can. Now, what this is saying is that, okay, we're going 00:36:08.660 |
to unify. We have too many separate ranking models. We're going to unify into a big fat model and then 00:36:15.220 |
push all the data through it. And hopefully, you'll outperform. And it does outperform. It needs a lot 00:36:20.260 |
of data. So you can see in the graph of that, right? Up to release three, it was not better than 00:36:27.620 |
production. And you can see that based on the axis on the left, which is the gap to production. Zero 00:36:33.220 |
means that it's on par with production. Up to release three, it was not better. I mean, I don't know who 00:36:37.540 |
had to get whatever budget, or quarterback this, to make sure that this worked, to push this through. 00:36:46.340 |
But as they add more and more tokens, it starts to get better than production, like 2.5% increase. So this is 00:36:57.300 |
a huge leap of faith that, okay, we'll just go with the bitter lesson: just give us more data, 00:37:04.900 |
we will outperform, and with a single model. Okay, so that's it. That's all I had to share. 00:37:11.860 |
I want to just briefly highlight two other papers, which I think are 00:37:19.300 |
good. It's a little bit less connected to LLMs, but these two papers I think are very good because of 00:37:26.900 |
how deep they go into their system architecture. The first one is Etsy. Etsy, you can see, this is extremely 00:37:33.780 |
complicated. Uh, but this really shows you a very realistic and practical approach, right? Classic two-tower 00:37:40.900 |
architecture. They share about negative sampling and then talk about product quality, right? The thing 00:37:46.020 |
is you can have a very good-looking baby crib with really nice images, but when people buy it, they return it. 00:37:52.660 |
You will never be able to detect that if you're just using the content. So what they did was they 00:37:57.780 |
actually have a product quality embedding index that they use to augment their approximate 00:38:04.820 |
nearest neighbor index, right? So you can see the quality vector there. This is extremely pragmatic, 00:38:11.540 |
and I can tell you that no e-commerce website, no search engine, no online 00:38:18.260 |
discovery thing can do without some form of quality vector or some kind of post-hoc quality filtering. We 00:38:24.820 |
saw that with Indeed, right? Expected bad match. They need the quality signal; they just operationalize it 00:38:29.540 |
in a different way, as a final post-filtering layer. Over here, they include it in the approximate nearest 00:38:34.900 |
neighbors index. So I highly recommend reading this, um, very practical, uh, shares a lot of detail into 00:38:41.460 |
their system design. Uh, the next one I also highly recommend is the model ranking platform at Zalando. 00:38:46.900 |
I think this has all the best practices; it talks about all the different tenets, like composability, 00:38:52.020 |
scalability, steerable ranking, and they really go deep into: hey, you know, here's the candidate 00:38:56.260 |
generator, essentially the retrieval step, a two-tower model. And then, you know, they're just using an 00:39:02.340 |
ANN to retrieve it. And then they talk about the ranker and then finally the policy layer. What is this policy 00:39:07.860 |
layer, right? The policy layer encourages exploration, applies business rules like previously purchased items, item 00:39:13.380 |
diversity; again, some measure of quality that the model would never be 00:39:20.340 |
able to learn from the data. The model will never learn that showing these items is good, right? Because 00:39:24.260 |
they're untested. So you have to override the model with this policy layer. Um, and of course, very good 00:39:30.260 |
results. Uh, but what, what I really like about this paper is that if you want to learn about system design 00:39:36.020 |
for RecSys, the Zalando paper and the Etsy paper are really, really good and really in-depth. 00:39:42.980 |
But of course, everything here is very good. If there were papers I read where I thought, okay, 00:39:48.340 |
this is pretty crap, I wouldn't include them. But every paper here is pretty good for system design. 00:39:52.900 |
That's under the final section, which is unified architectures. Okay, I spoke a lot. 00:40:00.980 |
Any questions? Did I lose anyone? Did I lose everyone? Eugene, I have one quick question to 00:40:10.020 |
double check on the LinkedIn paper. My understanding is the big model, the 150 00:40:17.860 |
billion model, is actually used as a teacher model, and then the knowledge is distilled into smaller models 00:40:24.500 |
then used for the different tasks. So I don't know if that aligns with your understanding, because 00:40:30.580 |
practically, serving a 150 billion model for prediction, the latency will not be acceptable, and it's 00:40:38.020 |
too costly also. They do actually have a full paper that discusses more about how 00:40:45.220 |
knowledge distillation is happening with that 150 billion model. I put it in the chat. 00:40:51.860 |
I don't know if you have come across that paper. 00:40:53.860 |
I have not. Um, but my impression was that they were actually using it. Thank you for sharing this 00:40:59.540 |
and thank you for correcting my misunderstanding. I need to look deeper into the original paper. 00:41:04.260 |
to confirm this. Let me tag you in this thread. That's my understanding. 00:41:12.020 |
So, but I also find this, uh, the, the approach very interesting. So we were actually thinking of 00:41:18.980 |
similar kind of approach and as well, but actually they kind of proved that this, this way kind of works. 00:41:24.980 |
Yeah. Thank you. I, I think that, I think that could probably be it. I think there's no way for it to be 00:41:30.900 |
feasible to serve it at that scale. Uh, I think you're probably right. I, I don't know. I, I didn't see 00:41:38.020 |
anything in the original paper that actually suggest us that. Um, but I, I, I think you're right. That's, 00:41:43.380 |
there's just no way for it to serve it at scale. 00:41:46.100 |
Yeah. So I tagged you in this thread. Maybe I can, 00:41:52.820 |
I have the paper. Thank you, I added it to my saved list, to confirm. 00:41:57.700 |
Yeah. Yeah. Thanks. Thanks for explaining this. Just to double check. Thank you. 00:42:08.980 |
I mean, so one thing that I tried to look for, and found myself doing, but I might as well ask the 00:42:22.180 |
pre-trained LLM that is you, is to rank what is, you know, 00:42:30.980 |
the lowest-hanging fruit versus the higher ones. So for example, right, the way that you 00:42:37.380 |
organized your write-up was four sections: model architecture, data generation, scaling laws, 00:42:46.180 |
and then unified architectures. Why? Is there a particular reason? There's no order, right? Like, 00:42:53.780 |
to me, it was very clear that model architecture is basically useless. Is that true? Would you, 00:42:58.340 |
would you recommend it? I don't think so. I actually think that model architecture right 00:43:05.140 |
now is a little bit more nuanced. I would say that in 2023 00:43:11.860 |
it was useless; I hadn't seen good results. Now I'm seeing good results. Would you classify this 00:43:18.420 |
YouTube one as a good result? Because I was, I read it and I was like, wait, like this, I don't know, 00:43:23.300 |
you know, like it's, these are smart ideas. And then the, then the results are like, uh, you know, 00:43:29.300 |
doesn't, doesn't really outperform our baseline. I, I, I think it's decent results. Um, so, and, uh, 00:43:36.260 |
coincidentally, after I published this, I think after I made the rounds on Hacker News, people from 00:43:41.220 |
YouTube actually reached out. One of the authors on this exact paper reached out. 00:43:45.540 |
they wanted me to like go in and chat with them. Uh, and then they were like, we have more papers. 00:43:50.580 |
We're pushing to publish it. And this is a perennial problem, right? Especially for YouTube and TikTok, 00:43:55.700 |
right? New videos get uploaded all the time; they have to deal with it. Cold start is their bread and 00:44:02.500 |
butter. So I wouldn't be surprised that they are focusing so hard on content embeddings for 00:44:07.460 |
cold start. This is for a new video, but I mean, they have users, they have user histories, 00:44:14.740 |
and they've saturated the world on that. I think very likely so, right. You can imagine YouTube, 00:44:22.420 |
Twitter, and, I mean, it's unmentioned here, but ads, Google ads: it's always cold start. Being able to 00:44:29.540 |
crack this cold-start problem, even by 0.1%, is huge. It is huge. And you can see a lot of the 00:44:36.020 |
papers in this: semantic IDs from YouTube, Kuaishou, which is like TikTok, 00:44:41.940 |
Huawei (this one, I'm not very sure why they did this), yeah, a lot of it is also 00:44:47.060 |
solving cold start. So yeah. Okay. But like, you know, in terms of orders of magnitude, which takes 00:44:57.700 |
orders of magnitude? I really think that the low-hanging fruit right now is really using LLMs for data 00:45:02.260 |
generation. Right. Yeah. I mean, obviously that is the one. I think everyone can do this 00:45:08.020 |
now. And you know, the expected bad match paper: Indeed did this, right. I actually did this 00:45:14.820 |
last year, something very similar; it got published internally. This is very, 00:45:21.300 |
very effective, this approach of starting from somewhere, active 00:45:27.620 |
learning, fine-tuning a model, more active learning. It really helps improve quality. I was 00:45:34.180 |
doing it in the context of LLM and hallucinations, but I can imagine doing this in terms of relevance, 00:45:39.780 |
in terms of any level of measure of quality that you want to focus on, it will work very well. 00:45:44.580 |
Okay. And then of course, yeah, then architecture. I would say data generation, and then model 00:45:52.420 |
architecture and system architecture. Yeah. Oh wait, actually, even the scaling laws part 00:45:57.460 |
has some things that are very practical. One example, which I didn't have time to go 00:46:03.860 |
through, is this: basically LoRAs for recommendation. So what they did is they train 00:46:11.460 |
a single model on all-domain data. You can imagine all domains, like fashion, e-commerce, 00:46:18.580 |
furniture, toys, et cetera. Or it could be all domains like ads, videos, e-commerce. And then after 00:46:26.340 |
that, they have specific LoRAs for each domain. And this works very well. So I definitely 00:46:34.900 |
think that, essentially, right now it's not easy to learn from data across domains for recommendation 00:46:42.260 |
systems. And, you know, correct me if I'm wrong, but I really 00:46:45.700 |
think that you want to overfit on your domain. You want to overfit and predict the next best thing 00:46:51.540 |
for tomorrow, and that's it, period. I can overfit and just retrain every day. But yeah, 00:46:56.820 |
I know we have a few questions here. Daniel asks: shouldn't the LinkedIn model be combined with an 00:47:02.900 |
information retrieval model that's used to generate candidates? Yes, correct. That's probably an upstream 00:47:08.660 |
retrieval model, and then the LinkedIn model just does the ranking. So in a two-step process, 00:47:14.020 |
you have a retrieval, the LinkedIn, the decoder model, I think it just does the ranking. 00:47:18.180 |
For LLM-based search retrieval, do any papers talk about the impact of query rewriting and prompt engineering? 00:47:24.020 |
Also the sensitivity. I think we know that LLMs are sensitive to prompts, but I think they're 00:47:29.700 |
increasingly less sensitive to prompts, because they're just way more instruction-tuned. I'm not 00:47:35.220 |
sure about papers that talk about the impact of query rewriting. I think the only one that I 00:47:39.620 |
have seen at least covered here is the one by Yelp. Uh, so I think that could be helpful. What's the 00:47:45.460 |
process for keeping this hybrid models up to date and personalization, uh, keeping them up to date. 00:47:50.660 |
That's a good question. I don't know if they actually need to be kept up to date. So if you look at the, 00:47:57.140 |
if you look at these hybrid models, right, let's just take the 00:48:06.180 |
semantic ID embedding: it actually uses a frozen video encoder. Similarly for Kuaishou, they use frozen 00:48:13.380 |
Sentence-BERT, ResNet and VGGish. So the content encoders themselves don't need to be up to date. And the 00:48:20.100 |
assumption is that, okay, content today is going to be the same as content tomorrow, and the 00:48:23.220 |
same as content in one month. So that is not kept up to date, but what is learnable is the semantic ID 00:48:28.500 |
embedding and a cluster embedding. Now for personalization, that's very interesting. Uh, that's the hard question. 00:48:33.220 |
Right. And for personalization, I guess the way you include it is: okay, after we learn 00:48:37.620 |
the content, we also need to learn what the user is interested in. And that's why they have this two- 00:48:41.780 |
tower approach. And that's why you can see over here, there's this small layer, which is multimodal 00:48:47.780 |
interest intensity, right? Which is: given a user and their past historical sequence, how can we tell 00:48:53.380 |
which modality they're interested in? I think that's how they do personalization over here. 00:49:01.700 |
If not, we can always ask for volunteers for next week's session. 00:49:19.220 |
No, I think this is really helpful. It just feels like RecSys is always these 00:49:34.340 |
bundles of ideas. Yeah. In the way that agents are bundles of ideas for LLMs, RecSys 00:49:44.420 |
is also a bundle of ideas. And I think that they are obviously converging. Yeah, I mean, 00:49:52.020 |
I definitely think so, right. And you can see examples, right? We learned item 00:49:58.580 |
embeddings via word2vec. You know, when people talk about graph embeddings, it's actually just 00:50:02.260 |
taking the graph, doing a random walk, converting that random walk into a sentence of item IDs, 00:50:06.500 |
and just using word2vec. Similarly, learning the next best action with GRUs, transformers and BERT: 00:50:12.500 |
it was very obvious it would work. So I think we will see more from the LLM space being adopted 00:50:19.540 |
in RecSys as well. What is the link in your mind between re-rankers and RecSys? 00:50:28.900 |
I think, so in RecSys, we have retrieval and ranking, right? So you can see over 00:50:40.740 |
here, uh, where is it? So what we do over here is retrieval will retrieve a lot of top candidates. It's 00:50:50.900 |
going to retrieve a hundred candidates. And then ranking is going to find the best five candidates. 00:50:55.460 |
You can focus on the best five. I think it's the same thing in retrieval and re-ranking, uh, 00:51:00.980 |
in ranking in RecSys. What people call re-ranking in RAG, I think it's really just 00:51:08.900 |
taking the retrieval results and then finding the best five. And, you know, Cohere has re-rankers for finding 00:51:13.220 |
the best five for the LLM as part of the context. Yeah. To me, it's a bit weird, right? Like, 00:51:18.420 |
the re-ranker models are being promoted as a way to just, you feed in your top K whatever results. 00:51:24.900 |
And then they re-rank them. And somehow that is supposed to produce better RAG because 00:52:29.780 |
the more relevant results are at the top. But without the context that RecSys has, for example, 00:52:35.780 |
user preferences and user histories and whatever, like, how can you have any useful re-ranking at all? 00:51:41.300 |
Right? Like, I think there are some ways. So for example, like maybe retrieval, right? 00:51:44.660 |
You can imagine the most naive retrieval is really just BM25 or, um, semantic search. 00:51:50.420 |
Now, then you can imagine you have a lot of historical data on all these BM25 and his, uh, 00:51:56.500 |
semantic search and all the associated metadata, which you probably cannot use in 00:52:00.340 |
retrieval stage because it's too expensive. And then you can just train a re-ranker. 00:52:03.380 |
Just say that when the author match or this author usually looks for this kind of document, 00:52:09.220 |
um, and then you can try to re-rank it. I think it's possible. Um, 00:52:13.460 |
I haven't, I haven't dove too deep into how re-ranking is done for a rag, but it's possible. 00:52:22.740 |
Yeah, no. And thank you so much, Swyx and Eugene. You guys are amazing. I'm huge fans of you both. 00:52:28.100 |
But, um, I, with the question I had, I guess is, is it worthwhile for an organization to go and build 00:52:34.340 |
this when you have something like Jina.ai, who, as we know, popularized a lot of work on deep search and 00:52:40.740 |
not just internal retrieval, but external search and retrieval, because they have the embedding models, 00:52:45.860 |
the re-rankers, the retrieval and deep search APIs, all unified. How do you feel about that, Eugene and Swyx? 00:52:53.220 |
Should teams build them, build it themselves? Or should they just buy something off the shelf? 00:52:58.740 |
Yeah, like Jina.ai has kind of this full stack that you're talking about with all these fine-tuned 00:53:03.780 |
clip models, embedding models, re-rankers. They have the deep search retrieval. There's, 00:53:08.980 |
they've posted probably some of the better technical blogs on deep search. I think that's out right now, 00:53:14.500 |
and embedding models and re-rankers. Yeah, my answer is going to be probably very boring, and you can 00:53:19.460 |
apply it to any question of whether someone should use an LLM off the shelf or fine-tune a 00:53:24.100 |
model. I think that for prototyping, just do whatever is fast, right? Demonstrate user value. It's going 00:53:30.340 |
to get to a point in time where what you need does not fit, um, what something off the shelf is going to 00:53:37.060 |
do. And that's Indeed's story, right? For expected bad match, latency 00:53:47.460 |
continued to be too high, and the only way around that was to fine-tune their own model. 00:53:53.620 |
Similarly, like for retrieval, you can imagine, okay, they're going to provide a lot of out-of-the-box 00:53:58.340 |
embeddings and maybe it's going to be good enough. And then finally, you want to really squeeze more juice 00:54:03.540 |
out. You probably need to go fine-tune your own embeddings. I know Replit recently 00:54:07.380 |
shared something about fine-tuning their own embeddings: half the size, et cetera, et cetera, 00:54:10.660 |
and it does way better. There are a lot of examples here. I think Etsy also fine-tuned their own 00:54:14.820 |
embeddings, and they outperform embeddings out of the box. I think it's a little bit of an unfair 00:54:21.140 |
comparison; I think if you take those off-the-shelf embedding models and further fine-tune them, 00:54:25.700 |
they could do better. But essentially the point is: just use off-the-shelf to move fast, 00:54:30.820 |
and then after that, if you do need to customize it, then you customize it. 00:54:36.180 |
I love it. Thank you so much. Super helpful. You're welcome. 00:54:39.460 |
Sorry, we have a question. Let's go ahead. And I think this is probably my last question. 00:54:49.540 |
Okay. What do you think is the biggest opportunity in terms of 00:54:54.820 |
applying LLMs in the recommendation domain? Because we're discussing retrieval, 00:54:58.900 |
ranking, content understanding, etc., right? There are so many different prediction tasks. 00:55:04.500 |
You're asking me what I think is the big opportunity? 00:55:12.100 |
Is that your question? Yeah. Uh, I think, I think embeddings, I think what I've seen is that embeddings 00:55:20.900 |
are helpful for retrieval. So instead of purely keyword retrieval, like "ant killer", someone might 00:55:26.260 |
just ask, "I have ants in my house". I think a semantic embedding would be able to help you match 00:55:32.100 |
that. And I think ranking will clearly work using an LLM-based ranker; 00:55:39.460 |
I think the LinkedIn approach clearly can work. And of course for search, I think 00:55:44.100 |
there's this guy, Doug Turnbull; he's increasingly going down the route of, 00:55:50.100 |
and we've seen examples from Yelp, right, using an LLM to do query segmentation, query expansion, 00:55:56.980 |
query rewriting. It clearly works in Yelp's use case, and you can just cache all these results. 00:56:02.740 |
So those are the three things, off the top of my head, that I can think of. 00:56:08.740 |
Okay. Thank you, everyone. I do need to drop. Maybe you can discuss what the next paper is. 00:56:13.220 |
Maybe Swyx will talk about Moore's Law for AI every seven months, which I think is interesting. 00:56:17.540 |
No, not on the image generation, autoregressive image generation. 00:56:21.300 |
Okay. All right. Bye. Bye. Bye. Bye. Thank you.