Transforming search and discovery using LLMs — Tejaswi & Vinesh, Instacart

00:00:20.600 |
We are part of the search and machine learning team, 00:00:25.320 |
and today we'll talk about how we are using LLMs to transform our search and discovery. 00:00:29.960 |
So yeah, so first, a little bit about ourselves. 00:00:33.360 |
Yeah, as I mentioned, we are part of the search and machine learning team. 00:00:37.960 |
And for those of you who may not be familiar with Instacart, 00:00:40.960 |
it's the leader in online grocery in North America. 00:00:47.060 |
Our mission is to create a world where everyone has access to the food they love and more time to enjoy it together. 00:00:51.280 |
So coming to what we'll actually talk about today. 00:00:55.320 |
First, we'll talk about the importance of search and discovery for Instacart and some of the challenges with our existing setup. 00:01:04.320 |
And then actually get to the meat of the talk today, 00:01:07.120 |
which is how we are using LLMs to solve some of these problems. 00:01:10.920 |
Finally, we'll finish with some key takeaways. 00:01:17.320 |
So, coming to search in grocery commerce: I think we've all gone grocery shopping, and a typical order contains a lot of items. 00:01:27.760 |
Of these, the majority are just restocking purchases. 00:01:32.160 |
That is, things that the customer has bought in the past. 00:01:34.760 |
And the remaining are items that the user is trying out for the first time. 00:01:38.760 |
And the majority of these purchases come from search. 00:01:44.960 |
So search needs to do two things well: 00:01:47.960 |
it needs to have the customer quickly and efficiently find the product they're looking for, while also supporting new product discovery. 00:01:55.560 |
And new product discovery isn't just important for the customer. 00:01:59.960 |
It's also great for our advertisers because it helps them showcase new products. 00:02:04.160 |
And it's also good for the platform because overall it leads to larger basket sizes. 00:02:09.560 |
So let's see what some problems are with our existing setup that sort of makes this hard. 00:02:15.160 |
So to begin with, we have two classes of queries that are generally more challenging, especially for models trained on engagement data. 00:02:26.760 |
The first is broad queries, like the snacks query on the left, where there are tons of products that could potentially match. 00:02:33.640 |
And because our models are trained on engagement data, if we aren't exposing these products to 00:02:40.320 |
the user, it's hard to actually collect engagement data on them and rank them up high. 00:02:44.940 |
So the traditional cold start problem in a way. 00:02:47.820 |
Then, as you can see on the query on the right, we have very specific queries like unsweetened 00:02:52.000 |
plant-based yogurt, where the user is looking for something very specific. 00:02:56.540 |
And these queries don't happen very frequently, which means that we just don't have enough engagement data to train on. 00:03:04.340 |
And while we have done quite a bit of work to sort of improve this, the challenge that we 00:03:11.900 |
continually keep facing is that while recall improves, precision is still a challenge, especially for these tail queries. 00:03:18.960 |
The next class of problems is how do we actually support that new item discovery, as we spoke about earlier. 00:03:24.120 |
So when a customer walks into a grocery store, let's say into a pasta aisle, they might see 00:03:29.780 |
new brands of pasta that they would want to try out. 00:03:32.960 |
Along with that, they would also see pasta sauce and every other thing that's needed to make a pasta dish. 00:03:38.840 |
And customers would want a similar experience on our site. 00:03:42.020 |
We have heard multiple rounds of feedback from our customers that, hey, I can find the product 00:03:48.160 |
that I want via search, but when I'm trying to find any other related products, it's a bit cumbersome. 00:03:55.620 |
I would need to make multiple searches to get to where I want to. 00:03:58.880 |
So this was a problem that we wanted to solve as well. 00:04:02.320 |
And yeah, as I mentioned, pre-LLMs, this was a hard problem because of the lack of engagement data. 00:04:09.880 |
So let's see how we actually use the LLMs to sort of solve these problems. 00:04:13.100 |
I'll sort of talk specifically about how we use the LLMs to up-level our query understanding module. 00:04:20.080 |
Now, query understanding, as I'm sure most of you know, is the most upstream part of the 00:04:25.080 |
search stack and very accurate outputs are needed to sort of enable better retrieval and 00:04:32.080 |
recall and finally improve our ranking results. 00:04:35.560 |
So our query understanding module has multiple models in them like query normalization, query 00:04:40.060 |
tagging, query classification, category classification, et cetera. 00:04:44.840 |
So in the interest of time, I'll just pick a couple of models and talk about how we sort of up-leveled them with LLMs. 00:04:52.640 |
The first is our query-to-category model, our product category classifier. 00:04:57.300 |
Essentially, we are taking a query and mapping it to a category in our taxonomy. 00:05:02.500 |
So as an example, if you take a query like watermelon, that maps to categories like fruits, organic foods, and so on. 00:05:11.320 |
And our taxonomy has about 10,000 labels, of which about 6,000 are more commonly used. 00:05:16.240 |
So because a query can map to multiple labels, this is essentially a multi-label classification problem. 00:05:24.300 |
And in the past, we actually had a couple of different traditional models. 00:05:31.880 |
One was a FastText-based neural network, which essentially modeled the semantic relationship between queries and categories. 00:05:39.360 |
And then as a fallback, we had an NPMI model, which was a statistical co-occurrence model between queries and categories. 00:05:46.320 |
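For reference, NPMI (normalized pointwise mutual information) scores how much more often a query and a category co-occur in engagement logs than chance would predict. Below is a minimal sketch of how such a co-occurrence score could be computed from conversion logs; the function name, input shape, and count threshold are illustrative assumptions, not Instacart's actual pipeline.

```python
import math
from collections import Counter

def npmi_scores(conversions, min_count=5):
    """Score (query, category) pairs by normalized PMI over conversion logs.

    `conversions`: list of (query, category) pairs, one per converted search.
    Returns {(query, category): npmi}, where npmi lies in [-1, 1].
    """
    pairs = list(conversions)
    pair_counts = Counter(pairs)
    query_counts = Counter(q for q, _ in pairs)
    cat_counts = Counter(c for _, c in pairs)
    total = len(pairs)

    scores = {}
    for (q, c), n_qc in pair_counts.items():
        if n_qc < min_count:                      # rare pairs are too noisy to score
            continue
        p_qc = n_qc / total
        p_q = query_counts[q] / total
        p_c = cat_counts[c] / total
        pmi = math.log(p_qc / (p_q * p_c))
        scores[(q, c)] = pmi / (-math.log(p_qc))  # normalize PMI into [-1, 1]
    return scores
```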
Now while these techniques were great for the head and torso queries, we had really low coverage 00:05:52.880 |
for our tail queries because, again, we just didn't have enough engagement data to train the models on. 00:05:58.820 |
And to be honest, we actually tried more sophisticated BERT-based models as well. 00:06:03.120 |
And while we did see some improvement, the lack of engagement data meant that for the increased 00:06:09.400 |
latency, we didn't see the wins that we actually hoped for. 00:06:13.180 |
So this is where we actually tried to use an LLM. 00:06:16.460 |
First, we took all of our queries and, along with the taxonomy, fed them into an LLM and 00:06:22.540 |
asked it to predict the most relevant categories for that query. 00:06:29.320 |
Actually, when we all looked at it, it made a lot of sense. 00:06:31.740 |
But when we actually ran an online A/B test, the results weren't as great. 00:06:37.440 |
And one particular example that illustrates this point very well is a query like protein. 00:06:44.140 |
Users that come to Instacart, when they type something like protein, they're looking for 00:06:47.180 |
maybe protein shakes, protein bars, or other protein supplements. 00:06:51.980 |
The LLM, on the other hand, thinks that when a user types protein, they're looking for maybe chicken or other natural protein sources. 00:07:00.200 |
So this mismatch, wherein the LLM doesn't truly understand Instacart user behavior, was the core reason the test results weren't great. 00:07:08.400 |
So to sort of maybe improve our results, we sort of switched the problem around. 00:07:13.540 |
We took the most commonly converting categories or the top K converting categories for each query 00:07:19.240 |
and fed that as additional context to the LLM. 00:07:24.680 |
There's a bunch of ranking and downstream validation that happens after that. 00:07:31.340 |
But essentially, we generated a bunch of candidates and let the LLM rank those candidates, which greatly simplified the problem for the LLM. 00:07:38.840 |
And again, to illustrate this with an example, take a query like Werner's soda. 00:07:43.980 |
Our previous model actually identified this as a brand of fruit-flavored soda, which is not correct, whereas with the converting-category context the LLM mapped it to the right categories. 00:07:58.020 |
And with this, our downstream retrieval and ranking improved greatly as well. 00:08:02.480 |
And as you can see from the results below, especially for tail queries, we saw a big improvement. 00:08:08.620 |
Our precision improved by 18 percentage points, and our recall improved by 70 percentage points, which is actually pretty significant for our tail queries. 00:08:17.060 |
And maybe to very briefly look at our prompt, as you can see, it's very simple. 00:08:21.780 |
We are essentially passing in the top converted categories as context. 00:08:26.880 |
There are a bunch of guidelines about what the LLM should actually output, and that's it. 00:08:32.860 |
So this was all that was needed to sort of enable this. 00:08:35.820 |
Again, I'm simplifying the overall flow, but the general concepts are pretty straightforward. 00:08:43.560 |
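To make the reframed setup concrete, here is a minimal sketch of the idea: the LLM validates and re-ranks the top converting categories for a query rather than predicting against the full 10,000-label taxonomy. The prompt wording, the JSON output format, and the call_llm helper are assumptions for illustration, not the production prompt shown on the slide.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a third-party LLM; swap in a real client."""
    raise NotImplementedError

CATEGORY_PROMPT = """You are classifying grocery search queries into product categories.

Query: "{query}"
Top converting categories for this query, from engagement data:
{categories}

Guidelines:
- Prefer the converting categories above; only add another taxonomy category if it is clearly relevant.
- Drop any category a shopper would not expect for this query.
- Respond with JSON: {{"categories": ["..."]}}
"""

def classify_query(query: str, top_converting_categories: list[str]) -> list[str]:
    prompt = CATEGORY_PROMPT.format(
        query=query,
        categories="\n".join(f"- {c}" for c in top_converting_categories),
    )
    return json.loads(call_llm(prompt))["categories"]

# e.g. classify_query("protein", ["Protein Bars", "Protein Shakes", "Nutritional Supplements"])
```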
So coming to another model, the query rewrites model is actually pretty important as well from an e-commerce perspective, especially at Instacart because not all retailers are created equal. 00:08:56.580 |
Some have large catalogs, some have very small catalogs. 00:08:59.580 |
The same query may not always return results. 00:09:02.460 |
And that is where a rewrite is really helpful. 00:09:04.480 |
For example, going from a query like 1% milk to just milk, or at least returning results that are still relevant to the user. 00:09:13.300 |
And again, our previous approach, which was trained on engagement data, didn't do too well. 00:09:19.200 |
It did decently well on head and torso queries, but it suffered from a lack of engagement data on the tail. 00:09:26.800 |
So by using an LLM, similar to how we did for the product category classifier, we were able to generate much better rewrites. 00:09:34.800 |
In the example here, you can see that there's a substitute, a broad and a synonymous rewrite. 00:09:40.240 |
So for the case of avocado oil, a substitute is olive oil, a broader rewrite is healthy cooking 00:09:46.060 |
oil, and a synonymous rewrite is just avocado extract. 00:09:50.140 |
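A rough sketch of what generating the three rewrite types might look like, using the same placeholder LLM helper as above; the prompt and JSON schema are illustrative assumptions (in practice the converting-category context described earlier can be passed in as well).

```python
import json

def call_llm(prompt: str) -> str:
    """Same placeholder LLM helper as in the earlier sketch."""
    raise NotImplementedError

REWRITE_PROMPT = """You generate rewrites for grocery search queries so that retailers
with smaller catalogs can still return useful results.

Query: "{query}"

Return JSON with exactly these keys:
  "substitute": a query for a reasonable replacement product
  "broad":      a more general query that still covers the intent
  "synonymous": a different phrasing for the same product
"""

def generate_rewrites(query: str) -> dict:
    return json.loads(call_llm(REWRITE_PROMPT.format(query=query)))

# generate_rewrites("avocado oil") might return something like:
# {"substitute": "olive oil", "broad": "healthy cooking oil", "synonymous": "avocado extract"}
```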
And again, just looking at the results from this, we saw a bunch of offline improvements. 00:09:57.920 |
We were using third-party LLMs here, and just going from simpler models to better models helped. 00:10:10.280 |
So as you can see, just improving the models themselves improved the overall performance of the rewrites. 00:10:16.420 |
And in terms of online improvements, we actually saw a large drop in the number of queries without any results. 00:10:23.420 |
This is pretty significant, again, because we could now actually show results to users where 00:10:28.820 |
they previously saw empty results, which was great for the business. 00:10:34.280 |
So coming to the important part of this, which is how we actually scored and served the data. 00:10:43.820 |
The thing is that Instacart has a pretty idiosyncratic query pattern. 00:10:49.140 |
There's a very fat head and torso set of queries, and we have a sort of a long tail. 00:10:55.240 |
So by precomputing the outputs for all of the head and torso queries offline in a batch mode, 00:11:02.360 |
we were able to sort of cache all of this data. 00:11:06.160 |
And then online, when a query comes in, we could just serve it off of the cache with 00:11:10.600 |
very low impact on latency and fall back to our existing models for the long tail of queries. 00:11:17.860 |
And again, this worked really well because it didn't impact our latency, while it greatly 00:11:23.780 |
improved our coverage for the long tail of queries. 00:11:26.800 |
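In serving terms, the flow described here is roughly a cache lookup keyed by normalized query, with the existing models as a fallback for the long tail. A hypothetical sketch, with an in-memory dict standing in for the real cache and made-up function names:

```python
from typing import Callable, Optional

def make_category_server(
    cache_get: Callable[[str], Optional[list]],   # precomputed head/torso outputs
    fallback_predict: Callable[[str], list],      # existing (or distilled) fallback model
    normalize: Callable[[str], str] = lambda q: q.strip().lower(),
):
    """Return a serving function: cache hit for head/torso queries, model fallback for the tail."""
    def get_categories(query: str) -> list:
        key = normalize(query)                    # must match how the offline batch job keyed the cache
        cached = cache_get(key)
        if cached is not None:
            return cached                         # low-latency path, no LLM call online
        return fallback_predict(query)            # long-tail queries fall back to the existing model
    return get_categories

# Example wiring, with a plain dict standing in for the real cache:
cache = {"watermelon": ["Fruits", "Organic Foods"]}
serve = make_category_server(cache.get, lambda q: ["(fallback model prediction)"])
print(serve("Watermelon "))   # cache hit
print(serve("1% oat milk"))   # cache miss -> fallback
```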
Now, for the really long tail where I said we would fall back to our existing models, we're 00:11:32.380 |
actually trying to replace them with our distilled Llama model so that we can actually do a much better job there as well. 00:11:41.680 |
So yeah, to sort of summarize, essentially what we saw was that from a query understanding perspective, 00:11:48.020 |
we have a bunch of models and just using our hybrid approach greatly improved their performance. 00:11:56.020 |
But what's actually more interesting is that today, query understanding consists of a bunch of separate models. 00:12:01.080 |
And as Yazoo was talking about in the Netflix talk, managing all of these models is actually quite a challenge. 00:12:08.320 |
So consolidating all of these into an SLM or maybe a large language model can make the whole system much easier to manage and improve. 00:12:17.820 |
And I'll finish it off by giving an example here. 00:12:20.820 |
There's a query, "humm," spelled H-U-M-M, that we sort of saw some interesting issues with. 00:12:29.720 |
For that query, our brand tagger correctly identified it as a brand of kombucha. 00:12:37.820 |
But then our spell corrector, unfortunately, corrected it as hummus. 00:12:41.460 |
So the results were really confusing to users and were pretty bad. 00:12:45.960 |
But by using a more unified model, I think the results were much better. 00:12:49.880 |
The second is that by using an LLM for query understanding, we can actually pass in extra context. 00:12:57.160 |
So instead of just generating results for that query in isolation, we can really try to understand what the user is actually trying to accomplish. 00:13:06.240 |
So for example, detect if they're actually here to buy ingredients for a recipe, et cetera. 00:13:13.240 |
So to talk more about that, I have Tejaswi here. 00:13:19.480 |
Now I'll quickly talk about how we used LLMs for showing more discovery-oriented content in our search results. 00:13:25.120 |
Just to restate the problem: our users found that while our search engine was very good at 00:13:30.520 |
showing exactly the results they wanted to see, 00:13:34.520 |
once they added an item to the cart, they couldn't do anything useful with the search results page. 00:13:38.400 |
They either had to do another search or go to another page to fulfill their next intent. 00:13:45.280 |
Solving this with traditional methods would require a lot of feature engineering or manual curation. 00:13:50.280 |
LLMs solved this problem for us, and I will talk about how. 00:13:55.900 |
So for queries like swordfish, let's say there are no exact results. 00:13:59.900 |
We used LLMs to generate substitute results like other seafood alternatives or similarly meaty fish. 00:14:07.800 |
And similarly, for queries like sushi where there were a lot of exact results, let's say, 00:14:13.520 |
at the bottom of the search results page we would show things like Asian cooking ingredients 00:14:18.020 |
or Japanese drinks and so on, in order to get the users to engage. 00:14:23.020 |
I'll get to the techniques in a moment, but both of these discovery-oriented results, we saw, led 00:14:31.140 |
to improvements in engagement as well as in revenue per search. 00:14:38.020 |
Now, like I said, I'll get into the techniques, but let's first talk about the requirements for generating this content. 00:14:43.200 |
First, obviously we wanted to generate content that is incremental to the current solutions. 00:14:47.500 |
We don't want duplicates to what we were already showing. 00:14:50.360 |
And the second requirement and the most important one is we wanted all of the LLM answers or the 00:14:55.840 |
generation to be aligned with Instacart's domain knowledge. 00:15:00.120 |
So if a user searches for a query like dishes, the LLM should understand that it refers 00:15:05.120 |
to cookware and not food, and vice versa for a query like Thanksgiving dishes. 00:15:10.620 |
So with these requirements in mind, we started with a very basic generation approach. 00:15:16.920 |
We took the query and we told the LLM, "Hey, you are an AI assistant and your job is to generate two lists: 00:15:23.320 |
one is a list of complementary items and another is a list of substitute items for a given query." 00:15:38.120 |
And like Vinesh said in the query understanding section, when we launched this to our users, we saw that the results 00:15:44.120 |
were good, but users weren't engaging with it as much as we would have liked. 00:15:48.120 |
So we went back to the drawing board and we tried to analyze what was going on. 00:15:53.120 |
And what we quickly realized was that while the LLM's answers were reasonable, common-sense answers, 00:15:58.120 |
they weren't really what users were looking for. 00:16:02.120 |
Taking the protein example again: users, when they search for protein, look for 00:16:07.120 |
protein bars and protein shakes, rather than what the LLM would give us as an answer, which is chicken and so on. 00:16:14.120 |
So what we did was we augmented the prompt with Instacart domain knowledge. 00:16:20.120 |
So in one case, what we did was we took the query and then we augmented it with like, "Hey, 00:16:25.120 |
here is the query and here are the top converting categories for this particular query," along 00:16:30.120 |
with any annotations from the query understanding model like, "Hey, here is a brand present in the query, 00:16:35.120 |
here is a dietary attribute present in the query," and so on. 00:16:39.120 |
In another case, we were like, "Here is the query and here are the subsequent queries that users typically make after it." 00:16:47.120 |
So once we augmented the prompt with this additional metadata about how Instacart users behave, the generated content got much better. 00:16:55.120 |
I don't have the time to show before and after, but like I said, we definitely saw like a huge 00:17:00.120 |
improvement in both engagement as well as revenue. 00:17:05.120 |
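Putting those signals together, the augmented prompt looks roughly like the sketch below. The field names and wording are assumptions based on what was described (top converting categories, query understanding annotations, and follow-up queries), not the actual production prompt.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a batch-mode LLM call."""
    raise NotImplementedError

DISCOVERY_PROMPT = """You suggest shoppable content themes for a grocery search results page.

Query: "{query}"
Top converting categories for this query: {categories}
Query understanding annotations (brand, dietary attributes, ...): {annotations}
Queries users commonly issue next: {next_queries}

Ground your suggestions in the context above rather than generic common-sense answers.
Return JSON with two keys:
  "substitutes": item themes a shopper could buy instead of the query
  "complements": item themes typically bought together with the query
"""

def generate_discovery_content(query, categories, annotations, next_queries):
    prompt = DISCOVERY_PROMPT.format(
        query=query,
        categories=categories,
        annotations=annotations,
        next_queries=next_queries,
    )
    return json.loads(call_llm(prompt))
```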
I'll quickly talk about how we served all of this content. 00:17:10.120 |
It's impractical to call the LLM in real-time because of latency and maybe cost concerns sometimes. 00:17:15.120 |
So what we did was we took all of our historical search logs. 00:17:20.120 |
We called the LLM in a batch mode and stored everything: 00:17:24.120 |
the query, the content metadata, along with even the products that could potentially show up in these recommendations. 00:17:29.120 |
And online, it's just a very quick lookup from a feature store. 00:17:33.120 |
And that's how we were able to serve all of these recommendations blazing fast. 00:17:39.120 |
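A hypothetical sketch of that offline-plus-online split: a batch job walks historical queries, generates content with the LLM, attaches the candidate products, and writes everything to a feature store keyed by query, so the online path is a single lookup. The feature_store interface and helper names are made up for illustration.

```python
def backfill_discovery_content(historical_queries, generate_content, retrieve_products, feature_store):
    """Offline batch job: precompute discovery content for queries seen in search logs."""
    for query in historical_queries:
        themes = generate_content(query)   # batch LLM call; assumed to return a list of content themes
        feature_store.put(
            f"discovery:{query}",
            {
                "themes": themes,
                # Also precompute the products each theme would surface, so the online
                # path never calls the LLM or runs retrieval per request.
                "products": {theme: retrieve_products(theme) for theme in themes},
            },
        )

def serve_discovery_content(query, feature_store):
    """Online path: a single feature-store lookup in the request path."""
    return feature_store.get(f"discovery:{query}")
```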
Again, things weren't as simple as we're making them out to be. 00:17:43.120 |
Like Vinesh said, the overall concept is simple. 00:17:48.120 |
But there were three key challenges that we solved along the way. 00:17:51.120 |
One is aligning generation with business metrics like revenue. 00:17:55.120 |
This was very important to see top-line wins. 00:17:57.120 |
So we iterated over the prompts and the kind of metadata that we would feed to the LLM in order to get there. 00:18:04.120 |
Second, we spent a lot of time on ranking, on improving the ranking of the content itself on the page. 00:18:11.120 |
Our traditional pCTR and pCVR models did not work here. 00:18:13.120 |
So we had to employ strategies like diversity-based ranking and so on and so forth to get users to engage with the content. 00:18:21.120 |
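The talk does not detail the exact ranking strategy, but a common form of diversity-based ranking is a greedy MMR-style selection that trades off relevance against similarity to the items already chosen. A minimal sketch under that assumption:

```python
def diversity_rerank(candidates, relevance, similarity, k=10, lam=0.7):
    """Greedy MMR-style re-ranking: balance relevance against similarity to already-picked items.

    candidates: list of item ids
    relevance:  dict mapping item id -> relevance score (e.g. an engagement-model score)
    similarity: function (id, id) -> similarity in [0, 1]
    lam:        trade-off; 1.0 = pure relevance, 0.0 = pure diversity
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr(c):
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```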
And then the third thing is evaluating the content itself. 00:18:24.120 |
So one is making sure that, hey, whatever the LLM is giving us is, one, correct. 00:18:30.120 |
And second, that it adheres to what Instacart, or what we, need as a product. 00:18:36.120 |
So summarizing the key takeaways from our talk. 00:18:39.120 |
LLM's world knowledge was super important to improve query understanding predictions, especially for tail queries. 00:18:46.120 |
While LLMs were super helpful, we really found success by combining the domain knowledge of 00:18:53.120 |
Instacart with LLMs in order to see the top-line wins that we saw. 00:18:57.120 |
And the third and last one is that evaluating the content as well as the query understanding predictions 00:19:02.120 |
and so on was far more important and far more difficult than we anticipated. 00:19:07.120 |
We used LLMs as a judge in order to make this happen, but it was a very, very important step. 00:19:26.120 |
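As a rough illustration of the LLM-as-a-judge step, the sketch below scores a generated suggestion on correctness and product fit; the rubric, output schema, and helper are assumptions rather than the actual evaluation prompt.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the judge LLM."""
    raise NotImplementedError

JUDGE_PROMPT = """You are evaluating content generated for a grocery search results page.

Query: "{query}"
Generated suggestion: "{suggestion}"

Score the suggestion from 1 to 5 on each dimension and return JSON:
  "correct":     is it a sensible substitute or complement for this query?
  "on_platform": is it something a grocery marketplace would actually sell?
Also include a one-sentence "rationale".
"""

def judge_suggestion(query: str, suggestion: str) -> dict:
    return json.loads(call_llm(JUDGE_PROMPT.format(query=query, suggestion=suggestion)))

# e.g. judge_suggestion("swordfish", "soy sauce") -> {"correct": ..., "on_platform": ..., "rationale": "..."}
```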
Have you also been experimenting with queries which are very long and in natural language? 00:19:31.120 |
Like, I want these three items and these five items. 00:19:42.120 |
I think we have actually launched something in the past, like, Ask Instacart, if you've seen it. 00:19:51.120 |
Which essentially takes natural language queries and tries to map that to search intent. 00:19:55.120 |
So, for example, you might say healthy foods for a three-year-old baby or something like that. 00:20:01.120 |
And so that would map to things like fruit slices or popcorn. 00:20:03.120 |
I don't know if three-year-old toddlers can eat popcorn, but something along those lines. 00:20:09.120 |
And then we had our usual recall and ranking stack sort of retrieve those results. 00:20:15.120 |
So, any learnings from that experiment for you? 00:20:20.120 |
So, I think we actually have a lot of learnings from that. 00:20:22.120 |
Essentially, as Tejaswi already mentioned, we need to inject a lot of Instacart context into the LLM. 00:20:33.120 |
So, having a robust automated evaluation pipeline was important. 00:20:39.120 |
For example, let's say it's a Mother's Day query, 00:20:44.120 |
and let's say one of the individual search intents we come up with is perfumes. 00:20:49.120 |
You really want women's perfumes to be in there. 00:20:52.120 |
Whereas when we just had perfumes, we could see all kinds of items. 00:20:55.120 |
So, passing that context from the LLM to the downstream systems is really important. 00:21:01.120 |
Yeah, we have a lot of examples where we failed.