
Transforming search and discovery using LLMs — Tejaswi & Vinesh, Instacart



00:00:02.000 | - Hi, good afternoon everyone.
00:00:18.160 | My name is Vinesh, and this is Tejaswi.
00:00:20.600 | We are part of the search and machine learning team
00:00:22.760 | at Instacart.
00:00:23.880 | So today we'd like to talk to you about
00:00:25.320 | how we are using LLMs to transform our search and discovery.
00:00:29.960 | So yeah, so first, a little bit about ourselves.
00:00:33.360 | Yeah, as I mentioned, we are part of the search
00:00:35.320 | and discovery ML team at Instacart.
00:00:37.960 | And for those of you who may not be familiar with Instacart,
00:00:40.960 | it's the leader in online grocery in North America.
00:00:44.920 | And our mission is to create a world
00:00:47.060 | where everyone has access to the food they love
00:00:49.740 | and more time to enjoy it together.
00:00:51.280 | So coming to what we'll actually talk about today.
00:00:55.320 | First, we'll talk about the importance of search
00:00:57.820 | in grocery e-commerce.
00:00:59.920 | And we'll look into some of the challenges
00:01:02.120 | facing conventional search engines.
00:01:04.320 | And then actually get to the meat of the talk today,
00:01:07.120 | which is how we are using LLMs to solve some of these problems.
00:01:10.920 | Finally, we'll finish with some key takeaways
00:01:13.440 | from today's talk.
00:01:14.480 | So coming to the importance of search
00:01:17.320 | in grocery e-commerce, I think we've all gone grocery shopping.
00:01:20.440 | Customers come with long shopping lists,
00:01:23.120 | and it's the same on the platform as well.
00:01:25.560 | People are looking for tens of items.
00:01:27.760 | And of these, the majority of them are just restocking purchases.
00:01:32.160 | That is, things that the customer has bought in the past.
00:01:34.760 | And the remaining are items that the user is trying out for the first time.
00:01:38.760 | And the majority of these purchases come from search.
00:01:42.960 | So search has a dual role.
00:01:44.960 | It needs to help the customer quickly and efficiently
00:01:47.960 | find the product they're looking for,
00:01:52.960 | and it also needs to enable this new product discovery.
00:01:55.560 | And new product discovery isn't just important for the customer.
00:01:59.960 | It's also great for our advertisers because it helps them showcase new products.
00:02:04.160 | And it's also good for the platform because overall it leads to larger basket sizes.
00:02:09.560 | So let's see what some problems are with our existing setup that sort of makes this hard.
00:02:15.160 | So to begin with, we have two classes of queries that are generally more challenging, especially
00:02:22.560 | from an e-commerce perspective.
00:02:24.880 | The first are overly broad queries.
00:02:26.760 | In this case, like on the left, the snacks query, where there are tons of products that
00:02:31.480 | map to that query.
00:02:33.640 | And now because our models are trained on engagement data, if we aren't exposing these products to
00:02:40.320 | the user, it's hard to actually collect engagement data on them and rank them highly.
00:02:44.940 | So the traditional cold start problem in a way.
00:02:47.820 | Then, as you can see on the query on the right, we have very specific queries like unsweetened
00:02:52.000 | plant-based yogurt, where the user is looking for something very specific.
00:02:56.540 | And these queries don't happen very frequently, which means that we just don't have enough engagement
00:03:01.700 | data to train the models on.
00:03:04.340 | And while we have done quite a bit of work to sort of improve this, the challenge that we
00:03:11.900 | continually keep facing is that while recall improves, precision is still a challenge, especially
00:03:16.780 | in a pre-LLM world.
00:03:18.960 | The next class of problems is how do we actually support that new item discovery, as we spoke
00:03:23.120 | about.
00:03:24.120 | So when a customer walks into a grocery store, let's say into a pasta aisle, they might see
00:03:29.780 | new brands of pasta that they would want to try out.
00:03:32.960 | Along with that, they would also see pasta sauce and every other thing that's needed to make
00:03:36.780 | a bowl of pasta.
00:03:38.840 | And customers would want a similar experience on our site.
00:03:42.020 | We have heard multiple rounds of feedback from our customers that, hey, I can find the product
00:03:48.160 | that I want via search, but when I'm trying to find any other related products, it's a bit
00:03:54.620 | of a dead end.
00:03:55.620 | I would need to make multiple searches to get to where I want to.
00:03:58.880 | So this was a problem that we wanted to solve as well.
00:04:02.320 | And yeah, as I mentioned, pre-LLMs, this was a hard problem because of the lack of engagement
00:04:08.440 | data, et cetera.
00:04:09.880 | So let's see how we actually use the LLMs to sort of solve these problems.
00:04:13.100 | I'll sort of talk specifically about how we use the LLMs to up-level our query understanding
00:04:19.080 | module.
00:04:20.080 | Now, query understanding, as I'm sure most of you know, is the most upstream part of the
00:04:25.080 | search stack and very accurate outputs are needed to sort of enable better retrieval and
00:04:32.080 | recall and finally improve our ranking results.
00:04:35.560 | So our query understanding module has multiple models in it, like query normalization, query
00:04:40.060 | tagging, query classification, category classification, et cetera.
00:04:44.840 | So in the interest of time, I'll just pick a couple of models and talk about how we sort of
00:04:50.880 | really improve them.
00:04:52.640 | The first is our query-to-category model, our product category classifier.
00:04:57.300 | Essentially, we are taking a query and mapping it to a category in our taxonomy.
00:05:02.500 | So as an example, if you take a query like watermelon, that maps to categories like fruits, organic foods,
00:05:09.320 | et cetera.
00:05:11.320 | And our taxonomy has about 10,000 labels, of which about 6,000 are more commonly used.
00:05:16.240 | So because a query can map to multiple labels, this is essentially a multi-label
00:05:21.020 | classification problem.
00:05:24.300 | And in the past, we actually had a couple of different
00:05:30.880 | traditional models.
00:05:31.880 | One was a FastText-based neural network, which essentially modeled the semantic relationship
00:05:36.900 | between the query and the category.
00:05:39.360 | And then as a fallback, we had an NPMI model, which was a statistical co-occurrence model between
00:05:44.460 | the query and the category.
00:05:46.320 | Now while these techniques were great for the head and torso queries, we had really low coverage
00:05:52.880 | for our tail queries because, again, we just didn't have enough engagement data to train the
00:05:56.780 | models on.
00:05:58.820 | And to be honest, we actually tried more sophisticated BERT-based models as well.
00:06:03.120 | And while we did see some improvement, the lack of engagement data meant that for the increased
00:06:09.400 | latency, we didn't see the wins that we actually hoped for.
00:06:13.180 | So this is where we actually tried to use an LLM.
00:06:16.460 | First, we took all of our queries, and along with the taxonomy, fed them into an LLM and
00:06:22.540 | asked it to predict the most relevant categories for each query.
00:06:26.280 | Now, the output that came back was decent.
00:06:29.320 | Actually, when we all looked at it, it made a lot of sense.
00:06:31.740 | But when we actually ran an online A/B test, the results weren't as great.
00:06:37.440 | And one particular example that illustrates this point very well is a query like protein.
00:06:44.140 | Users that come to Instacart, when they type something like protein, they're looking for
00:06:47.180 | maybe protein shakes, protein bars, or other protein supplements.
00:06:51.980 | The LLM, on the other hand, thinks that when a user types protein, they're looking for maybe
00:06:56.780 | chicken, tofu, or other protein foods.
00:07:00.200 | So this mismatch, wherein the LLM doesn't truly understand Instacart user behavior, was
00:07:05.740 | really the cause of the problem.
00:07:08.400 | So to sort of maybe improve our results, we sort of switched the problem around.
00:07:13.540 | We took the most commonly converting categories or the top K converting categories for each query
00:07:19.240 | and fed that as additional context to the LLM.
00:07:22.720 | And then I'm sort of simplifying this a bit.
00:07:24.680 | There's a bunch of ranking and downstream validation that happens.
00:07:28.940 | But essentially, that was what we did.
00:07:31.340 | We generated a bunch of candidates, ranked those candidates, and this greatly simplified the problem for
00:07:36.800 | the LLM as well.
00:07:38.840 | And again, to illustrate this with an example, take a query like Vernors soda.
00:07:43.980 | Our previous model actually identified this as a brand of fruit-flavored soda, which is not
00:07:50.360 | incorrect, but it's not very precise either.
00:07:53.380 | Now, the LLM did a much better job.
00:07:55.620 | It identified it as a brand of ginger ale.
00:07:58.020 | And with this, our downstream retrieval and ranking improved greatly as well.
00:08:02.480 | And as you can see from the results below, especially for tail queries, we saw a big improvement.
00:08:08.620 | Our precision improved by 18 percentage points, and our recall improved by 70 percentage points, which is actually pretty significant for our tail queries.
00:08:17.060 | And maybe to very briefly look at our prompt, as you can see, it's very simple.
00:08:21.780 | We are essentially passing in the top converted categories as context.
00:08:26.880 | There are a bunch of guidelines about what the LLM should actually output, and that's it.
00:08:32.860 | So this was all that was needed to sort of enable this.
00:08:35.820 | Again, I'm simplifying the overall flow, but the general concepts are pretty straightforward.
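To make the flow just described concrete, here is a minimal sketch of the "top converting categories as context" idea, assuming a generic LLM completion callable and pre-fetched candidate categories; the function names and prompt wording are illustrative assumptions, not Instacart's production code.

```python
# Hypothetical sketch of the candidate-constrained query-to-category classification
# described in the talk. Names and prompt wording are illustrative assumptions.
import json
from typing import Callable

def classify_query_to_categories(
    query: str,
    top_converting_categories: list[str],   # candidates mined from engagement data
    llm: Callable[[str], str],              # any chat/completion client
) -> list[str]:
    prompt = (
        "You map grocery search queries to product categories from our taxonomy.\n"
        f"Query: {query}\n"
        f"Candidate categories, ranked by past conversions: {json.dumps(top_converting_categories)}\n"
        "Return a JSON list of the candidates that are truly relevant to the query, "
        "most relevant first."
    )
    predicted = json.loads(llm(prompt))
    # Simplified downstream validation: keep only labels from the candidate set.
    return [c for c in predicted if c in top_converting_categories]
```

Restricting the LLM to categories users actually converted to is what narrows the problem, instead of asking it to choose among all 10,000 taxonomy labels.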
00:08:43.560 | So coming to another model, the query rewrites model is actually pretty important as well from an e-commerce perspective, especially at Instacart because not all retailers are created equal.
00:08:56.580 | Some have large catalogs, some have very small catalogs.
00:08:59.580 | The same query may not always return results.
00:09:02.460 | And that is where a rewrite is really helpful.
00:09:04.480 | For example, going from a query like 1% milk to just milk, so we at least return results that
00:09:10.200 | the customer can decide to buy or not.
00:09:13.300 | And again, our previous approach, which was trained on engagement data, didn't do too well.
00:09:19.200 | It did decently well on head and torso queries, but it suffered from a lack
00:09:24.080 | of engagement data on tail queries.
00:09:26.800 | So by using an LLM, similar to how we did for the product category classifier, we were able
00:09:31.940 | to generate very precise rewrites.
00:09:34.800 | In the example here, you can see that there's a substitute, a broad and a synonymous rewrite.
00:09:40.240 | So for the case of avocado oil, a substitute is olive oil, a broader rewrite is healthy cooking
00:09:46.060 | oil, and a synonymous rewrite is just avocado extract.
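As a rough illustration of the three rewrite types just mentioned, a sketch like the following could ask the LLM for all three in one structured response; the prompt and the JSON schema are assumptions for illustration only.

```python
import json
from typing import Callable

def generate_rewrites(query: str, llm: Callable[[str], str]) -> dict[str, str]:
    # Ask for the three rewrite types described in the talk in one structured response.
    prompt = (
        "For the grocery search query below, produce three rewrites as a JSON object "
        "with keys 'substitute', 'broader', and 'synonymous'.\n"
        "- substitute: a different product the shopper could buy instead\n"
        "- broader: a more general query that should still return useful results\n"
        "- synonymous: another phrasing of the same intent\n"
        f"Query: {query}"
    )
    return json.loads(llm(prompt))

# e.g. generate_rewrites("avocado oil", llm) might return
# {"substitute": "olive oil", "broader": "healthy cooking oil", "synonymous": "avocado extract"}
```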
00:09:50.140 | And again, just looking at the results from this, we saw a bunch of offline improvements.
00:09:57.920 | We used third-party LLMs here, and just going from simpler models to better models
00:10:05.360 | improved the results quite a bit.
00:10:08.280 | This is based off our human evaluation data.
00:10:10.280 | So as you can see, just improving the models themselves improved the overall performance of
00:10:15.420 | the task.
00:10:16.420 | And in terms of online improvements, we actually saw a large drop in the number of queries without
00:10:22.140 | any results.
00:10:23.420 | This is pretty significant, again, because we could now actually show results to users where
00:10:28.820 | they previously saw empty results, which was great for the business.
00:10:34.280 | So coming to the important part of this, which is how we actually scored and served the data.
00:10:43.820 | The thing is that Instacart has a pretty idiosyncratic query pattern.
00:10:49.140 | There's a very fat head and torso set of queries, and we have a sort of a long tail.
00:10:55.240 | So by precomputing the outputs for all of the head and torso queries offline in a batch mode,
00:11:02.360 | we were able to sort of cache all of this data.
00:11:06.160 | And then online, when a query comes in, we could just serve it off of the cache with
00:11:10.600 | very low impact on latency and fall back to our existing models for the long tail of queries.
00:11:17.860 | And again, this worked really well because it didn't impact our latency, while it greatly
00:11:23.780 | improved our coverage for the long tail of queries.
00:11:26.800 | Now, for the really long tail where I said we would fall back to our existing models, we're
00:11:32.380 | actually trying to replace them with our distilled Llama 8B model so that we can actually do a much
00:11:38.980 | better job compared to the existing models.
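A minimal sketch of that serving pattern, assuming a simple key-value cache of the precomputed LLM outputs and a fallback model for the tail; the interfaces below are placeholders, not the actual Instacart stack.

```python
from typing import Any, Mapping, Protocol

class QueryModel(Protocol):
    def predict(self, query: str) -> dict[str, Any]: ...

def get_query_understanding(
    query: str,
    qu_cache: Mapping[str, dict[str, Any]],   # precomputed offline, keyed by normalized query
    fallback_model: QueryModel,               # existing models (or a distilled LLM) for the tail
) -> dict[str, Any]:
    normalized = query.strip().lower()
    cached = qu_cache.get(normalized)
    if cached is not None:
        return cached                          # head/torso query: cache hit, minimal added latency
    return fallback_model.predict(normalized)  # long tail: fall back to the online model
```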
00:11:41.680 | So yeah, to sort of summarize, essentially what we saw was that from a query understanding perspective,
00:11:48.020 | we have a bunch of models and just using our hybrid approach greatly improved their performance.
00:11:56.020 | But what's actually more interesting is that today, query understanding consists of a bunch
00:11:59.980 | of models.
00:12:01.080 | And as Yazoo was talking about in the Netflix talk, managing all of these models is actually
00:12:06.260 | complex from a system perspective.
00:12:08.320 | So consolidating all of these into an SLM or maybe a large language model can make the
00:12:15.760 | results a lot more consistent.
00:12:17.820 | And I'll finish it off by giving an example here.
00:12:20.820 | There's a query hum that we sort of saw some interesting issues with, which is spelled H-U-M-M.
00:12:29.720 | The actual query, our query brand tagger identified the brand correctly as a brand of kombucha.
00:12:37.820 | But then our spell corrector, unfortunately, corrected it as hummus.
00:12:41.460 | So the results were really confusing to users and was pretty bad.
00:12:45.960 | But by using a more unified model, I think the results were much better.
00:12:49.880 | The second is by using an LLM for query understanding, we can actually pass an extra context.
00:12:57.160 | So instead of just generating results for that query in isolation, we can really try to understand
00:13:03.620 | what the customer's mission is.
00:13:06.240 | So for example, detect if they're actually here to buy ingredients for a recipe, et cetera.
00:13:11.420 | And then generate the content for that.
00:13:13.240 | So to talk more about that, I have Tejaswi here.
00:13:16.380 | - Thank you, Vinesh.
00:13:19.480 | Now I'll quickly talk about how we used LLMs for showing more discovery-oriented content in
00:13:23.380 | the search results page.
00:13:25.120 | Just to restate the problem, our users found that while our search engine was very good at
00:13:30.520 | showing exactly the results that they wanted to see,
00:13:34.520 | once they added an item to the cart, they couldn't do anything useful with the search results page.
00:13:38.400 | They either had to do another search or go to another page
00:13:43.100 | to fulfill their next intent.
00:13:45.280 | Solving this with traditional methods would require a lot of feature engineering or manual
00:13:49.280 | work.
00:13:50.280 | LLMs solved this problem for us, and I will talk about how.
00:13:54.340 | So this is how it looked in the end.
00:13:55.900 | So for queries like swordfish, let's say there are no exact results.
00:13:59.900 | We used LLMs to generate substitute results like other seafood alternatives, meaty fish like
00:14:05.380 | tilapia and whatnot.
00:14:07.800 | And similarly for queries like sushi where there were a lot of exact results, let's say,
00:14:13.520 | we would show, at the bottom of the search results page, things like Asian cooking ingredients
00:14:18.020 | or Japanese drinks and so on, in order to get the users to engage.
00:14:23.020 | I'll talk about the techniques here, but both of these discovery-oriented results we saw led
00:14:31.140 | to improvement in engagement as well as improvement in revenue for each search.
00:14:37.020 | Cool.
00:14:38.020 | Now, like I said, I'll get into the techniques, but let's first talk about the requirements to generate
00:14:42.200 | such content.
00:14:43.200 | First, obviously we wanted to generate content that is incremental to the current results.
00:14:47.500 | We don't want duplicates of what we were already showing.
00:14:50.360 | And the second requirement and the most important one is we wanted all of the LLM answers or the
00:14:55.840 | generation to be aligned with Instacart's domain knowledge.
00:14:59.120 | What does this mean?
00:15:00.120 | So if a user searches for a query like dishes, the LLM should understand that it refers
00:15:05.120 | to cookware and not food, and vice versa for a query like Thanksgiving dishes.
00:15:10.620 | So with these requirements in mind, we started with a very basic generation approach.
00:15:15.920 | So what did we do?
00:15:16.920 | We took the query and we told the LLM, "Hey, you are an AI assistant and your job is to generate
00:15:21.960 | two shopping lists.
00:15:23.320 | One is a list of complementary items and another is a list of substitute items for a given query."
00:15:29.120 | It looked good.
00:15:31.120 | I mean, so we saw the results.
00:15:32.120 | They looked pretty good.
00:15:33.120 | Our PMs vetted everything.
00:15:35.120 | We looked at everything.
00:15:38.120 | And like Vinesh said for query understanding, when we launched this to our users, we saw that the results
00:15:44.120 | were good, but users weren't engaging with it as much as we would have liked.
00:15:48.120 | So we went back to the drawing board and we tried to analyze what was going on.
00:15:53.120 | And what we realized quickly was that while the LLM's answers were sensible, common-sense answers,
00:15:58.120 | they weren't really what users were looking for.
00:16:02.120 | Taking the protein example again, like users, when they search for protein, they look for
00:16:07.120 | protein bars and protein shakes rather than what the LLM would give as an answer, which is chicken,
00:16:12.120 | turkey, tofu, and whatnot.
00:16:14.120 | So what we did was we augmented the prompt with Instacart domain knowledge.
00:16:20.120 | So in one case, what we did was we took the query and then we augmented it with like, "Hey,
00:16:25.120 | here is the query and here are the top converting categories for this particular query," along
00:16:30.120 | with any annotations from the query understanding model like, "Hey, here is a brand present in
00:16:34.120 | the query.
00:16:35.120 | Here is like a dietary attribute present in the query," and so on and such.
00:16:39.120 | In another case, we were like, "Here is the query and here are the subsequent queries that users
00:16:44.120 | did once they issued this particular query."
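A sketch of what that augmented prompt could look like, folding in the top converting categories, query understanding annotations, and follow-up queries just described; all field names here are illustrative assumptions.

```python
import json

def build_discovery_prompt(
    query: str,
    top_categories: list[str],      # top converting categories for this query
    annotations: dict[str, str],    # e.g. {"brand": "...", "dietary_attribute": "..."}
    next_queries: list[str],        # queries users typically issue afterwards
) -> str:
    return (
        "You are an AI shopping assistant. For the search query below, generate two lists "
        "as JSON: complementary items and substitute items.\n"
        f"Query: {query}\n"
        f"Top converting categories for this query: {json.dumps(top_categories)}\n"
        f"Query annotations: {json.dumps(annotations)}\n"
        f"Queries users typically issue next: {json.dumps(next_queries)}\n"
        "Keep suggestions consistent with these signals."
    )
```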
00:16:47.120 | So once we augmented the prompt with this additional metadata about how Instacart users behave,
00:16:52.120 | the results were far better.
00:16:55.120 | I don't have the time to show before and after, but like I said, we definitely saw like a huge
00:17:00.120 | improvement in both engagement as well as revenue.
00:17:05.120 | I'll quickly talk about how we served all of this content.
00:17:08.120 | It's very similar to query understanding.
00:17:10.120 | It's impractical to call the LLM in real-time because of latency and maybe cost concerns sometimes.
00:17:15.120 | So what we did was we took all of our historical search logs.
00:17:20.120 | We called LLM in like a batch mode and stored everything.
00:17:24.120 | So query, content metadata, along with even the products that could potentially show up in the
00:17:28.120 | carousel.
00:17:29.120 | And online, it's just a very quick lookup from a feature store.
00:17:33.120 | And that's how we were able to like serve all of these recommendations in like blazing fast time.
00:17:39.120 | Again, things weren't as simple as we making them out to be.
00:17:43.120 | Like Vinesh said, the overall concept is simple.
00:17:46.120 | The prompt itself is very simple.
00:17:48.120 | But there were three key challenges that we solved along the way.
00:17:51.120 | One is aligning generation with business metrics like revenue.
00:17:55.120 | This was very important to see top-line wins.
00:17:57.120 | So we iterated over the prompts and the kind of metadata that we would feed to the LLM in
00:18:02.120 | order to achieve this.
00:18:04.120 | Second, we spent a lot of time on ranking, on improving the ranking of the content itself
00:18:10.120 | and so on and such.
00:18:11.120 | So our traditional pCTR and pCVR models did not work.
00:18:13.120 | So we had to employ strategies like diversity-based ranking and so on and so forth to get users
00:18:19.120 | to engage with the content.
00:18:21.120 | And then the third thing is evaluating the content itself.
00:18:24.120 | So one is making sure that, hey, whatever the LLM is giving is correct, right?
00:18:28.120 | It's not hallucinating something.
00:18:30.120 | And second, that it adhered to what Instacart, or what we, need as a product, right?
00:18:35.120 | Cool.
00:18:36.120 | So summarizing the key takeaways from our talk.
00:18:39.120 | LLM's world knowledge was super important to improve query understanding predictions, especially
00:18:45.120 | for the tail queries.
00:18:46.120 | While LLMs were super helpful, we really found success by combining the domain knowledge of
00:18:53.120 | Instacart with LLMs in order to see the top-line wins that we saw.
00:18:57.120 | And the third and the last one is that evaluating the content as well as the QU predictions
00:19:02.120 | and so on and such was far more important and far more difficult than we anticipated.
00:19:07.120 | We used LLMs as a judge in order to make this happen, but it was a very, very important step.
00:19:12.120 | And we realized that kind of late.
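For reference, a minimal LLM-as-a-judge check along the lines mentioned above could look like this; the rubric and the judge client are illustrative assumptions, not the evaluation pipeline actually used.

```python
import json
from typing import Callable

def judge_recommendations(
    query: str,
    recommended_items: list[str],
    judge_llm: Callable[[str], str],
) -> dict:
    # Grade generated recommendations for relevance and hallucination.
    prompt = (
        "You are grading recommendations shown for a grocery search query.\n"
        f"Query: {query}\n"
        f"Recommended items: {json.dumps(recommended_items)}\n"
        'Respond as JSON: {"relevant": true or false, "hallucinated_items": [], "notes": ""}'
    )
    return json.loads(judge_llm(prompt))
```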
00:19:13.120 | So, yeah, that's all from us.
00:19:15.120 | We'll take questions now.
00:19:16.120 | Thank you.
00:19:17.120 | Yeah, we'll take questions at the mic.
00:19:22.120 | While the next speaker gets set up.
00:19:25.120 | Thanks for the talk.
00:19:26.120 | Have you also been trying out queries which are very long, in natural language?
00:19:31.120 | Like, I want these three items and these five items.
00:19:35.120 | Like what we would do on ChatGPT?
00:19:38.120 | Or it's still, like, single item?
00:19:40.120 | That's the focus.
00:19:41.120 | Yeah.
00:19:42.120 | I think we have actually launched something in the past, like, Ask Instacart, if you've
00:19:50.120 | heard of it.
00:19:51.120 | Which essentially takes natural language queries and tries to map that to search intent.
00:19:55.120 | So, for example, you might say healthy foods for a three-year-old baby or something like
00:20:00.120 | that.
00:20:01.120 | And so that would map to things like fruit slices.
00:20:03.120 | I don't know if three-year-old toddlers can eat popcorn, but something along those lines.
00:20:09.120 | And then we had our usual recall and ranking stack sort of retrieve those results.
00:20:15.120 | So, any learnings from that experiment for you?
00:20:19.120 | Yeah.
00:20:20.120 | So, I think we actually have a lot of learnings from that.
00:20:22.120 | Essentially, as Dees already mentioned, we need to inject a lot of Instacart context into
00:20:28.120 | the model to be able to get decent results.
00:20:31.120 | The evaluation part is really key.
00:20:33.120 | So, having a robust automated evaluation pipeline was important.
00:20:37.120 | And lastly, passing context.
00:20:39.120 | That is, for example, if it's a, let's say it's a Mother's Day query.
00:20:44.120 | And let's say we come up with individual search intents as perfumes.
00:20:49.120 | You really want women's perfumes to be in there.
00:20:52.120 | Whereas when we just had perfumes, we could see all kinds of items.
00:20:55.120 | So, passing that context from the LLM to the downstream systems is really important.
00:21:00.120 | Thanks.
00:21:01.120 | Yeah, we have a lot of examples where we failed.
00:21:02.120 | We can talk about those.