Hi, good afternoon everyone. My name is Vinesh and this is Desvi. We are part of the search and machine learning team at Instacart. Today we'd like to talk to you about how we are using LLMs to transform our search and discovery. So first, a little bit about ourselves.
Yeah, as I mentioned, we are part of the search and discovery ML team at Instacart. And for those of you who may not be familiar with Instacart, it's the leader in online grocery in North America. And our mission is to create a world where everyone has access to the food they love and more time to enjoy it together.
So coming to what we'll actually talk about today. First, we'll talk about the importance of search in grocery e-commerce. And we'll look into some of the challenges facing conventional search engines. And then actually get to the meat of the talk today, which is how we are using LLMs to solve some of these problems.
Finally, we'll finish with some key takeaways from today's talk. So coming to the importance of search in grocery commerce, I think we've all gone grocery shopping. Customers come with long shopping lists, and it's the same on the platform as well. People are looking for tens of items. And of these, the majority are just restocking purchases.
That is, things that the customer has bought in the past. The remaining are items that the user is trying out for the first time. And the majority of these purchases come from search. So search has a dual role: it needs to help the customer quickly and efficiently find the product they're looking for, and it also needs to enable new product discovery.
And new product discovery isn't just important for the customer. It's also great for our advertisers because it helps them showcase new products. And it's also good for the platform because overall it leads to larger basket sizes. So let's see what some problems are with our existing setup that make this hard.
So to begin with, we have two classes of queries that are generally more challenging, especially from an e-commerce perspective. The first are overly broad queries. In this case, like on the left, the snacks query, where there are tons of products that map to that query. And because our models are trained on engagement data, if we aren't exposing these products to the user, it's hard to actually collect engagement data on them and rank them highly.
So the traditional cold start problem in a way. Then, as you can see on the query on the right, we have very specific queries like unsweetened plant-based yogurt, where the user is looking for something very specific. And these queries don't happen very frequently, which means that we just don't have enough engagement data to train the models on.
And while we have done quite a bit of work to sort of improve this, the challenge that we continually keep facing is that while recall improves, precision is still a challenge, especially in a pre-LLM world. The next class of problems is how do we actually support that new item discovery, as we spoke about.
So when a customer walks into a grocery store, let's say into a pasta aisle, they might see new brands of pasta that they would want to try out. Along with that, they would also see pasta sauce and every other thing that's needed to make a bowl of pasta. And customers would want a similar experience on our site.
We have heard multiple rounds of feedback from our customers that, hey, I can find the product that I want via search, but when I'm trying to find any other related products, it's a bit of a dead end. I would need to make multiple searches to get to where I want to.
So this was a problem that we wanted to solve as well. And yeah, as I mentioned, pre-LLMs, this was a hard problem because of the lack of engagement data, et cetera. So let's see how we actually use the LLMs to sort of solve these problems. I'll sort of talk specifically about how we use the LLMs to up-level our query understanding module.
Now, query understanding, as I'm sure most of you know, is the most upstream part of the search stack, and very accurate outputs are needed to enable better retrieval and recall and finally improve our ranking results. So our query understanding module has multiple models in it, like query normalization, query tagging, query classification, category classification, et cetera.
So in the interest of time, I'll just pick a couple of models and talk about how we really improved them. The first is our query-to-product-category classifier. Essentially, we are taking a query and mapping it to a category in our taxonomy. As an example, a query like watermelon maps to categories like fruits, organic foods, et cetera.
And our taxonomy has about 10,000 labels, of which about 6,000 are commonly used. Because a query can map to multiple labels, this is essentially a multi-label classification problem. In the past, we actually had a couple of different traditional models. One was a FastText-based neural network, which essentially modeled the semantic relationship between the query and the category.
And then as a fallback, we had an NPMI model, which was a statistical co-occurrence model between the query and the category. Now while these techniques were great for the head and torso queries, we had really low coverage for our tail queries because, again, we just didn't have enough engagement data to train the models on.
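For reference, here is a minimal sketch of what an NPMI-style co-occurrence score between a query and a category could look like when estimated from engagement logs. The counter structure and edge-case handling are assumptions for illustration, not Instacart's actual model.

```python
import math
from collections import Counter

def npmi(query: str, category: str,
         co_counts: Counter, query_counts: Counter,
         category_counts: Counter, total: int) -> float:
    """Normalized pointwise mutual information between a query and a category,
    estimated from conversion logs. Returns a value in [-1, 1]."""
    if co_counts[(query, category)] == 0:
        return -1.0  # never co-converted: minimum score
    p_qc = co_counts[(query, category)] / total
    p_q = query_counts[query] / total
    p_c = category_counts[category] / total
    pmi = math.log(p_qc / (p_q * p_c))
    return pmi / -math.log(p_qc)  # normalize PMI by -log p(q, c)
```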
And to be honest, we actually tried more sophisticated BERT-based models as well. And while we did see some improvement, the lack of engagement data meant that for the increased latency, we didn't see the wins that we actually hoped for. So this is where we actually tried to use an LLM.
First, we took all of our queries and, along with the taxonomy, fed them into an LLM and asked it to predict the most relevant categories for each query. Now, the output that came back was decent. Actually, when we looked at it, it made a lot of sense.
But when we actually ran an online A/B test, the results weren't as great. And one particular example that illustrates this point very well is a query like protein. Users that come to Instacart, when they type something like protein, they're looking for maybe protein shakes, protein bars, or other protein supplements.
The LLM, on the other hand, thinks that when a user types protein, they're looking for maybe chicken, tofu, or other protein foods. So this mismatch, wherein the LLM doesn't truly understand Instacart user behavior, was really the cause of the problem. So to sort of maybe improve our results, we sort of switched the problem around.
We took the most commonly converting categories or the top K converting categories for each query and fed that as additional context to the LLM. And then I'm sort of simplifying this a bit. There's a bunch of ranking and downstream validation that happens. But essentially, that was what we did.
We generated a bunch of candidates and ranked them, and this greatly simplified the problem for the LLM as well. And again, to illustrate this with an example, take a query like Vernors soda. Our previous model actually identified this as a brand of fruit-flavored soda, which is not incorrect, but it's not very precise either.
Now, the LLM did a much better job. It identified it as a brand of ginger ale. And with this, our downstream retrieval and ranking improved greatly as well. And as you can see from the results below, especially for tail queries, we saw a big improvement. Our precision improved by 18 percentage points, and our recall improved by 70 percentage points, which is actually pretty significant for our tail queries.
And maybe to very briefly look at our prompt, as you can see, it's very simple. We are essentially passing in the top converting categories as context. There are a bunch of guidelines about what the LLM should actually output, and that's it. So this was all that was needed to enable this.
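To make that concrete, here is a hedged sketch of what a candidate-constrained classification call might look like. The prompt wording, the `call_llm` client, and the JSON output format are all assumptions; the talk only describes passing top converting categories as context plus output guidelines.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the production LLM client."""
    raise NotImplementedError

def classify_query(query: str, top_converting_categories: list[str],
                   max_categories: int = 5) -> list[str]:
    """Ask the LLM to pick the most relevant taxonomy categories for a query,
    constrained to the categories that historically convert for it."""
    prompt = f"""You are categorizing grocery search queries into a product taxonomy.

Query: "{query}"
Candidate categories, ranked by historical conversions: {top_converting_categories}

Guidelines:
- Return a JSON list of at most {max_categories} categories.
- Only use categories from the candidate list; do not invent new ones.
- Order them from most to least relevant to the query."""
    predicted = json.loads(call_llm(prompt))
    # Downstream validation: drop anything outside the candidate set.
    return [c for c in predicted if c in top_converting_categories][:max_categories]
```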
Again, I'm simplifying the overall flow, but the general concepts are pretty straightforward. So coming to another model, the query rewrites model is actually pretty important as well from an e-commerce perspective, especially at Instacart because not all retailers are created equal. Some have large catalogs, some have very small catalogs.
The same query may not always return results, and that is where a rewrite is really helpful. For example, going from a query like 1% milk to just milk, so that we at least return results that the customer can decide to buy or not. And again, our previous approach, which was trained on engagement data, didn't do too well.
It did decently well on head and torso queries, but it suffered from a lack of engagement data on tail queries. So by using an LLM, similar to how we did for the product category classifier, we were able to generate very precise rewrites. In the example here, you can see that there's a substitute, a broad, and a synonymous rewrite.
So for the case of avocado oil, a substitute is olive oil, a broader rewrite is healthy cooking oil, and a synonymous rewrite is just avocado extract. And again, looking at the results from this, we saw a bunch of offline improvements, and with the third-party LLMs here, just going from simpler models to better models improved the results quite a bit.
This is based off our human evaluation data. So as you can see, just improving the models itself, improved the overall performance of the task. And in terms of online improvements, we actually saw a large drop in the number of queries without any results. This is pretty significant, again, because we could now actually show results to users where they previously saw empty results, which was great for the business.
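As a rough illustration of the rewrite generation just described, here is a sketch that asks for the three rewrite types as structured output. The prompt text and the `call_llm` client are assumptions, not the production implementation.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the production LLM client."""
    raise NotImplementedError

def generate_rewrites(query: str) -> dict:
    """Generate substitute, broad, and synonymous rewrites for a query, so
    retailers with small catalogs can still return useful results."""
    prompt = f"""You rewrite grocery search queries for catalogs that may not
contain an exact match.

Query: "{query}"

Return JSON with exactly these keys:
  "substitute": a different product the shopper could use instead,
  "broad": a more general version of the query,
  "synonymous": an alternative phrasing with the same meaning."""
    rewrites = json.loads(call_llm(prompt))
    return {k: rewrites.get(k, "") for k in ("substitute", "broad", "synonymous")}

# For "avocado oil" this should produce something like:
# {"substitute": "olive oil", "broad": "healthy cooking oil",
#  "synonymous": "avocado extract"}
```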
So coming to the important part of this, which is how we actually scored and served the data. The thing is that Instacart has a pretty idiosyncratic query pattern. There's a very fat head and torso set of queries, and we have a sort of a long tail. So by precomputing the outputs for all of the head and torso queries offline in a batch mode, we were able to sort of cache all of this data.
Then online, when a query comes in, we can just serve it off the cache with very low impact on latency, and fall back to our existing models for the long tail of queries. And again, this worked really well because it didn't impact our latency, while it greatly improved our coverage for the long tail of queries.
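A minimal sketch of that serving pattern, assuming a generic key-value cache populated offline and an existing model for the tail; the interfaces here are illustrative, not Instacart's actual services.

```python
class CachedQueryUnderstanding:
    """Serve precomputed LLM outputs for head/torso queries and fall back to
    the existing model for the long tail."""

    def __init__(self, cache, fallback_model):
        self.cache = cache                    # key-value store filled by the offline batch job
        self.fallback_model = fallback_model  # existing (or distilled) online model

    def categories(self, query: str) -> list[str]:
        key = query.strip().lower()
        cached = self.cache.get(key)
        if cached is not None:
            return cached                     # head/torso: cheap lookup, no online LLM call
        return self.fallback_model.predict(key)  # tail: fall back

def precompute(head_and_torso_queries, cache, llm_classify):
    """Offline batch job: run the LLM once per frequent query and cache the result."""
    for q in head_and_torso_queries:
        cache.set(q.strip().lower(), llm_classify(q))
```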
Now, for the really long tail where I said we would fall back to our existing models, we're actually trying to replace those with a distilled Llama 8B model so that we can do a much better job compared to the existing models. So yeah, to summarize, essentially what we saw was that from a query understanding perspective, we have a bunch of models, and just using our hybrid approach greatly improved their performance.
But what's actually more interesting is that today, query understanding consists of a bunch of models. And as Yazoo was talking about in the Netflix talk, managing all of these models is actually complex from a system perspective. So consolidating all of these into an SLM or maybe a large language model can make the results a lot more consistent.
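To illustrate the consolidation idea, here is a hypothetical sketch of a single call that returns several query understanding outputs together so they cannot contradict each other. The schema and prompt are assumptions, not the production design.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the production LLM client."""
    raise NotImplementedError

UNIFIED_QU_PROMPT = """Analyze the grocery search query "{query}" and return JSON with:
  "normalized_query": spell-corrected / normalized form of the query,
  "brand": the brand name if the query contains one, else null,
  "attributes": dietary or other attributes (e.g. "organic"),
  "categories": up to 5 relevant product categories.
Keep the fields consistent with each other: if "brand" is set, do not
spell-correct the query away from that brand."""

def understand_query(query: str) -> dict:
    """One structured output instead of separate taggers, correctors, and classifiers."""
    return json.loads(call_llm(UNIFIED_QU_PROMPT.format(query=query)))
```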
And I'll finish it off by giving an example here. There's a query, humm, spelled H-U-M-M, that we saw some interesting issues with. Our query brand tagger correctly identified it as a brand of kombucha, but then our spell corrector unfortunately corrected it to hummus.
So the results were really confusing to users and pretty bad. But by using a more unified model, the results were much better. The second benefit is that by using an LLM for query understanding, we can actually pass in extra context. So instead of just generating results for that query in isolation, we can really try to understand what the customer's mission is.
So for example, detect if they're actually here to buy ingredients for a recipe, et cetera, and then generate the content for that. To talk more about that, I have Desvi here. Thank you, Vinesh. Now I'll quickly talk about how we used LLMs for showing more discovery-oriented content on the search results page.
Just to restate the problem: our users found that while our search engine was very good at showing exactly the results they wanted to see, once they added an item to the cart, they couldn't do anything useful with the search results page. They either had to do another search or go to another page to fulfill their next intent.
Solving this with traditional methods would require a lot of feature engineering or manual work. LLMs solved this problem for us, and I will talk about how. So this is how it looked in the end. For queries like swordfish, let's say there are no exact results. We used LLMs to generate substitute results like other seafood alternatives, meaty fish like tilapia, and whatnot.
And similarly for queries like sushi where there were a lot of exact results, let's say, we would show, at the bottom of the search results page, things like Asian cooking ingredients or Japanese drinks and so on, in order to get the users to engage. I'll talk about the techniques here, but we saw that both of these discovery-oriented results led to improvements in engagement as well as revenue per search.
Cool. Now, like I said, I'll get into the techniques, but let's first talk about the requirements to generate such content. First, obviously we wanted to generate content that is incremental to the current results. We don't want duplicates of what we were already showing. And the second requirement, and the most important one, is we wanted all of the LLM's generations to be aligned with Instacart's domain knowledge.
What does this mean? So if a user searches for a query called dishes, LLM should understand that it refers to cookware and not food, and vice versa for a query like Thanksgiving dishes. So with these requirements in mind, we started with a very basic generation approach. So what did we do?
We took the query and we told the LLM, "Hey, you are an AI assistant and your job is to generate two shopping lists. One is a list of complementary items and another is a list of substitute items for a given query." It looked good. I mean, so we saw the results.
They looked pretty good. Our PMs vetted everything. We looked at everything. And similar to what Vinesh saw with QU, when we launched this to our users, the results were good, but users weren't engaging with it as much as we would have liked. So we went back to the drawing board and tried to analyze what was going on.
And what we realized quickly was that while the LLM's answers were common-sense answers, they weren't really what users were looking for. Taking the protein example again: when users search for protein, they look for protein bars and protein shakes rather than what the LLM would give as an answer, which is chicken, turkey, tofu, and whatnot.
So what we did was we augmented the prompt with Instacart domain knowledge. So in one case, what we did was we took the query and then we augmented it with like, "Hey, here is the query and here are the top converting categories for this particular query," along with any annotations from the query understanding model like, "Hey, here is a brand present in the query.
Here is a dietary attribute present in the query," and so on. In another case, we were like, "Here is the query and here are the subsequent queries that users did once they issued this particular query." So once we augmented the prompt with this additional metadata about how Instacart users behave, the results were far better.
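Roughly, the augmented prompt might be built like this; the field names and wording are illustrative assumptions based on the signals mentioned above (top converting categories, query understanding annotations, subsequent queries).

```python
def build_discovery_prompt(query: str,
                           top_converting_categories: list[str],
                           query_annotations: dict,
                           subsequent_queries: list[str]) -> str:
    """Augment the basic 'complementary + substitute shopping lists' prompt
    with Instacart-specific engagement signals."""
    return f"""You generate discovery content for a grocery search results page.

Query: "{query}"
Top converting categories for this query: {top_converting_categories}
Query annotations (brand, dietary attributes, etc.): {query_annotations}
Queries users commonly issue next: {subsequent_queries}

Grounding your answer in the signals above (not generic world knowledge alone),
return JSON with two keys:
  "complementary": items commonly bought together with this query,
  "substitute": items a shopper could buy instead."""
```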
I don't have the time to show a before and after, but like I said, we definitely saw a huge improvement in both engagement as well as revenue. I'll quickly talk about how we served all of this content. Very similar to QU, it's impractical to call the LLM in real time because of latency and sometimes cost concerns.
So what we did was we took all of our historical search logs, called the LLM in batch mode, and stored everything: the query, the content metadata, along with even the products that could potentially show up in the carousel. And online, it's just a very quick lookup from a feature store.
And that's how we were able to serve all of these recommendations blazingly fast. Again, things weren't as simple as we're making them out to be. Like Vinesh said, the overall concept is simple. The prompt itself is very simple. But there were three key challenges that we solved along the way.
One is aligning generation with business metrics like revenue. This was very important to see top-line wins. So we iterated over the prompts and the kind of metadata that we would feed to the LLM in order to achieve this. Second, we spent a lot of time on ranking, on improving the ranking of the content itself.
Our traditional pCTR and pCVR models did not work, so we had to employ strategies like diversity-based ranking to get users to engage with the content.
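As a sketch of what diversity-based ranking can look like, here is a greedy MMR-style re-ranker. This is a generic technique shown under assumed relevance and similarity functions, not Instacart's actual ranker.

```python
def diversity_rerank(items: list, relevance: list[float], similarity,
                     k: int = 10, lam: float = 0.7) -> list:
    """Greedy maximal-marginal-relevance selection: trade off an item's
    relevance against its similarity to items already chosen, so a carousel
    isn't filled with near-duplicates. `similarity(i, j)` returns [0, 1]."""
    selected: list[int] = []
    candidates = list(range(len(items)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            max_sim = max((similarity(i, j) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * max_sim
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [items[i] for i in selected]
```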
And then the third thing is evaluating the content itself: one, making sure that whatever the LLM is giving is right, that it's not hallucinating something; and second, that it adheres to what we need as a product. Cool. So summarizing the key takeaways from our talk: the LLM's world knowledge was super important for improving query understanding predictions, especially for the tail queries. And while LLMs were super helpful, we really found success by combining Instacart's domain knowledge with LLMs in order to see the top-line wins that we saw.
And the third and last one is that evaluating the content as well as the QU predictions was far more important and far more difficult than we anticipated. We used LLMs as a judge in order to make this happen; it was a very, very important step, and we realized that kind of late.
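For completeness, a minimal sketch of the LLM-as-a-judge idea: score each generated suggestion for relevance and hallucination against the query. The prompt, the `call_llm` client, and the metric are illustrative assumptions.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the production LLM client."""
    raise NotImplementedError

JUDGE_PROMPT = """You are evaluating a suggestion shown on a grocery search page.

Query: "{query}"
Suggested item: "{item}"

Return JSON: {{"relevant": true or false, "reason": "<one sentence>"}}.
Mark it not relevant if it looks hallucinated, unrelated to the query, or not
something a grocery shopper would plausibly want alongside this query."""

def judge_suggestions(query: str, items: list[str]) -> float:
    """Fraction of suggestions the judge marks relevant (a precision-style metric)."""
    verdicts = [json.loads(call_llm(JUDGE_PROMPT.format(query=query, item=item)))
                for item in items]
    return sum(v["relevant"] for v in verdicts) / max(len(verdicts), 1)
```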
So, yeah, that's all from us. We'll take questions now. Thank you. Yeah, we'll take questions at the mic while the next speaker gets set up. Hi, thanks for the talk. Have you also been experimenting with queries which are very long, in natural language? Like, I want these three items and these five items.
Like what we would do on ChatGPT? Or is it still, like, single item? That's the focus. Yeah, I think we have actually launched something in the past, Ask Instacart, if you've heard of it, which essentially takes natural language queries and tries to map them to search intents. So, for example, you might say healthy foods for a three-year-old baby or something like that.
And so that would map to things like fruit slices. I don't know if three-year-old toddlers can eat popcorn, but something along those lines. And then we had our usual recall and ranking stack retrieve those results. So, any learnings from that experiment for you? Yeah, I think we actually have a lot of learnings from that.
Essentially, as Desvi already mentioned, we need to inject a lot of Instacart context into the model to be able to get decent results. The evaluation part is really key, so having a robust automated evaluation pipeline was important. And lastly, passing context downstream. For example, let's say it's a Mother's Day query.
And let's say we come up with an individual search intent of perfumes. You really want women's perfumes to be in there, whereas when we just had perfumes, we could see all kinds of items. So passing that context from the LLM to the downstream systems is really important. Thanks. Yeah, we have a lot of examples where we failed.
We can talk about those.