Transforming search and discovery using LLMs — Tejaswi & Vinesh, Instacart

00:00:20.600 |
We are part of the search and machine learning team, 00:00:25.320 |
and today we'll talk about how we are using LLMs to transform our search and discovery. 00:00:29.960 |
So yeah, so first, a little bit about ourselves. 00:00:33.360 |
Yeah, as I mentioned, we are part of the search and machine learning team. 00:00:37.960 |
And for those of you who may not be familiar with Instacart, 00:00:40.960 |
it's the leader in online grocery in North America. 00:00:47.060 |
Our mission is to create a world where everyone has access to the food they love and more time to enjoy it together. 00:00:51.280 |
So coming to what we'll actually talk about today. 00:00:55.320 |
First, we'll talk about the importance of search and discovery for Instacart and some of the challenges with our existing setup. 00:01:04.320 |
And then actually get to the meat of the talk today, 00:01:07.120 |
which is how we are using LLMs to solve some of these problems. 00:01:10.920 |
Finally, we'll finish with some key takeaways. 00:01:17.320 |
So, coming to search in grocery commerce: I think we've all gone grocery shopping, and a typical order contains a lot of items. 00:01:27.760 |
Of these, the majority are just restocking purchases. 00:01:32.160 |
That is, things that the customer has bought in the past. 00:01:34.760 |
And the remaining are items that the user is trying out for the first time. 00:01:38.760 |
And the majority of these purchases come from search. 00:01:44.960 |
So search needs to do two things well: 00:01:47.960 |
it needs to have the customer quickly and efficiently find the product they're looking for, while also supporting new product discovery. 00:01:55.560 |
And new product discovery isn't just important for the customer. 00:01:59.960 |
It's also great for our advertisers because it helps them showcase new products. 00:02:04.160 |
And it's also good for the platform because overall it leads to larger basket sizes. 00:02:09.560 |
So let's see what some problems are with our existing setup that sort of makes this hard. 00:02:15.160 |
So to begin with, we have two classes of queries that are generally more challenging, especially for models trained on engagement data. 00:02:26.760 |
The first is broad queries, like the snacks query on the left, where there are tons of products that could potentially match. 00:02:33.640 |
And because our models are trained on engagement data, if we aren't exposing these products to 00:02:40.320 |
the user, it's hard to actually collect engagement data on them and rank them up high. 00:02:44.940 |
So the traditional cold start problem in a way. 00:02:47.820 |
Then, as you can see on the query on the right, we have very specific queries like unsweetened 00:02:52.000 |
plant-based yogurt, where the user is looking for something very specific. 00:02:56.540 |
And these queries don't happen very frequently, which means that we just don't have enough engagement data to train on. 00:03:04.340 |
And while we have done quite a bit of work to sort of improve this, the challenge that we 00:03:11.900 |
continually keep facing is that while recall improves, precision is still a challenge, especially for these tail queries. 00:03:18.960 |
The next class of problems is how do we actually support that new item discovery, as we spoke about earlier. 00:03:24.120 |
So when a customer walks into a grocery store, let's say into a pasta aisle, they might see 00:03:29.780 |
new brands of pasta that they would want to try out. 00:03:32.960 |
Along with that, they would also see pasta sauce and every other thing that's needed to make a pasta dish. 00:03:38.840 |
And customers would want a similar experience on our site. 00:03:42.020 |
We have heard multiple rounds of feedback from our customers that, hey, I can find the product 00:03:48.160 |
that I want via search, but when I'm trying to find any other related products, it's a bit cumbersome. 00:03:55.620 |
I would need to make multiple searches to get to where I want to. 00:03:58.880 |
So this was a problem that we wanted to solve as well. 00:04:02.320 |
And yeah, as I mentioned, pre-LLMs, this was a hard problem because of the lack of engagement data. 00:04:09.880 |
So let's see how we actually use the LLMs to sort of solve these problems. 00:04:13.100 |
I'll sort of talk specifically about how we use the LLMs to up-level our query understanding module. 00:04:20.080 |
Now, query understanding, as I'm sure most of you know, is the most upstream part of the 00:04:25.080 |
search stack and very accurate outputs are needed to sort of enable better retrieval and 00:04:32.080 |
recall and finally improve our ranking results. 00:04:35.560 |
So our query understanding module has multiple models in them like query normalization, query 00:04:40.060 |
tagging, query classification, category classification, et cetera. 00:04:44.840 |
So in the interest of time, I'll just pick a couple of models and talk about how we sort of up-leveled them with LLMs. 00:04:52.640 |
The first is our query-to-category model, our product category classifier. 00:04:57.300 |
Essentially, we are taking a query and mapping it to a category in our taxonomy. 00:05:02.500 |
So as an example, if you take a query like watermelon, that maps to categories like fruits, organic foods, and so on. 00:05:11.320 |
And our taxonomy has about 10,000 labels, of which about 6,000 are more commonly used. 00:05:16.240 |
So because a query can map to multiple labels, this is essentially a multi-label classification problem. 00:05:24.300 |
And in the past, we actually had a couple of different traditional models. 00:05:31.880 |
One was a FastText-based neural network, which essentially modeled the semantic relationship between queries and categories. 00:05:39.360 |
And then as a fallback, we had an NPMI model, which was a statistical co-occurrence model between queries and categories. 00:05:46.320 |
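For reference, NPMI (normalized pointwise mutual information) scores how much more often a query and a category co-occur in engagement logs than chance would predict. Below is a minimal sketch of how such a co-occurrence score could be computed from conversion logs; the function name, input shape, and count threshold are illustrative assumptions, not Instacart's actual pipeline.

```python
import math
from collections import Counter

def npmi_scores(conversions, min_count=5):
    """Score (query, category) pairs by normalized PMI over conversion logs.

    `conversions`: list of (query, category) pairs, one per converted search.
    Returns {(query, category): npmi}, where npmi lies in [-1, 1].
    """
    pairs = list(conversions)
    pair_counts = Counter(pairs)
    query_counts = Counter(q for q, _ in pairs)
    cat_counts = Counter(c for _, c in pairs)
    total = len(pairs)

    scores = {}
    for (q, c), n_qc in pair_counts.items():
        if n_qc < min_count:                      # rare pairs are too noisy to score
            continue
        p_qc = n_qc / total
        p_q = query_counts[q] / total
        p_c = cat_counts[c] / total
        pmi = math.log(p_qc / (p_q * p_c))
        scores[(q, c)] = pmi / (-math.log(p_qc))  # normalize PMI into [-1, 1]
    return scores
```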
Now while these techniques were great for the head and torso queries, we had really low coverage 00:05:52.880 |
for our tail queries because, again, we just didn't have enough engagement data to train the models on. 00:05:58.820 |
And to be honest, we actually tried more sophisticated BERT-based models as well. 00:06:03.120 |
And while we did see some improvement, the lack of engagement data meant that for the increased 00:06:09.400 |
latency, we didn't see the wins that we actually hoped for. 00:06:13.180 |
So this is where we actually tried to use an LLM. 00:06:16.460 |
First, we took all of our queries and, along with the taxonomy, fed them into an LLM and 00:06:22.540 |
asked it to predict the most relevant categories for that query. 00:06:29.320 |
Actually, when we all looked at it, it made a lot of sense. 00:06:31.740 |
But when we actually ran an online A/B test, the results weren't as great. 00:06:37.440 |
And one particular example that illustrates this point very well is a query like protein. 00:06:44.140 |
Users that come to Instacart, when they type something like protein, they're looking for 00:06:47.180 |
maybe protein shakes, protein bars, or other protein supplements. 00:06:51.980 |
The LLM, on the other hand, thinks that when a user types protein, they're looking for maybe chicken or other natural protein sources. 00:07:00.200 |
So this mismatch, wherein the LLM doesn't truly understand Instacart user behavior, was the core reason the test results weren't great. 00:07:08.400 |
So to sort of maybe improve our results, we sort of switched the problem around. 00:07:13.540 |
We took the most commonly converting categories or the top K converting categories for each query 00:07:19.240 |
and fed that as additional context to the LLM. 00:07:24.680 |
There's a bunch of ranking and downstream validation that happens after that. 00:07:31.340 |
But essentially, we generated a bunch of candidates and let the LLM rank those candidates, which greatly simplified the problem for the LLM. 00:07:38.840 |
And again, to illustrate this with an example, take a query like Werner's soda. 00:07:43.980 |
Our previous model actually identified this as a brand of fruit-flavored soda, which is not correct, whereas with the converting-category context the LLM mapped it to the right categories. 00:07:58.020 |
And with this, our downstream retrieval and ranking improved greatly as well. 00:08:02.480 |
And as you can see from the results below, especially for tail queries, we saw a big improvement. 00:08:08.620 |
Our precision improved by 18 percentage points, and our recall improved by 70 percentage points, which is actually pretty significant for our tail queries. 00:08:17.060 |
And maybe to very briefly look at our prompt, as you can see, it's very simple. 00:08:21.780 |
We are essentially passing in the top converted categories as context. 00:08:26.880 |
There are a bunch of guidelines about what the LLM should actually output, and that's it. 00:08:32.860 |
So this was all that was needed to sort of enable this. 00:08:35.820 |
Again, I'm simplifying the overall flow, but the general concepts are pretty straightforward. 00:08:43.560 |
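To make the reframed setup concrete, here is a minimal sketch of the idea: the LLM validates and re-ranks the top converting categories for a query rather than predicting against the full 10,000-label taxonomy. The prompt wording, the JSON output format, and the call_llm helper are assumptions for illustration, not the production prompt shown on the slide.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a third-party LLM; swap in a real client."""
    raise NotImplementedError

CATEGORY_PROMPT = """You are classifying grocery search queries into product categories.

Query: "{query}"
Top converting categories for this query, from engagement data:
{categories}

Guidelines:
- Prefer the converting categories above; only add another taxonomy category if it is clearly relevant.
- Drop any category a shopper would not expect for this query.
- Respond with JSON: {{"categories": ["..."]}}
"""

def classify_query(query: str, top_converting_categories: list[str]) -> list[str]:
    prompt = CATEGORY_PROMPT.format(
        query=query,
        categories="\n".join(f"- {c}" for c in top_converting_categories),
    )
    return json.loads(call_llm(prompt))["categories"]

# e.g. classify_query("protein", ["Protein Bars", "Protein Shakes", "Nutritional Supplements"])
```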
So coming to another model, the query rewrites model is actually pretty important as well from an e-commerce perspective, especially at Instacart because not all retailers are created equal. 00:08:56.580 |
Some have large catalogs, some have very small catalogs. 00:08:59.580 |
The same query may not always return results. 00:09:02.460 |
And that is where a rewrite is really helpful. 00:09:04.480 |
For example, going from a query like 1% milk to just milk, or at least returning results that are still relevant to the user. 00:09:13.300 |
And again, our previous approach, which was trained on engagement data, didn't do too well. 00:09:19.200 |
It did decently well on head and torso queries, but it suffered from a lack of engagement data on the tail. 00:09:26.800 |
So by using an LLM, similar to how we did for the product category classifier, we were able to generate much better rewrites. 00:09:34.800 |
In the example here, you can see that there's a substitute, a broad and a synonymous rewrite. 00:09:40.240 |
So for the case of avocado oil, a substitute is olive oil, a broader rewrite is healthy cooking 00:09:46.060 |
oil, and a synonymous rewrite is just avocado extract. 00:09:50.140 |
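A rough sketch of what generating the three rewrite types might look like, using the same placeholder LLM helper as above; the prompt and JSON schema are illustrative assumptions (in practice the converting-category context described earlier can be passed in as well).

```python
import json

def call_llm(prompt: str) -> str:
    """Same placeholder LLM helper as in the earlier sketch."""
    raise NotImplementedError

REWRITE_PROMPT = """You generate rewrites for grocery search queries so that retailers
with smaller catalogs can still return useful results.

Query: "{query}"

Return JSON with exactly these keys:
  "substitute": a query for a reasonable replacement product
  "broad":      a more general query that still covers the intent
  "synonymous": a different phrasing for the same product
"""

def generate_rewrites(query: str) -> dict:
    return json.loads(call_llm(REWRITE_PROMPT.format(query=query)))

# generate_rewrites("avocado oil") might return something like:
# {"substitute": "olive oil", "broad": "healthy cooking oil", "synonymous": "avocado extract"}
```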
And again, just looking at the results from this, we saw a bunch of offline improvements. 00:09:57.920 |
We were using third-party LLMs here, and just going from simpler models to better models helped. 00:10:10.280 |
So as you can see, just improving the models themselves improved the overall performance of the rewrites. 00:10:16.420 |
And in terms of online improvements, we actually saw a large drop in the number of queries without any results. 00:10:23.420 |
This is pretty significant, again, because we could now actually show results to users where 00:10:28.820 |
they previously saw empty results, which was great for the business. 00:10:34.280 |
So coming to the important part of this, which is how we actually scored and served the data. 00:10:43.820 |
The thing is that Instacart has a pretty idiosyncratic query pattern. 00:10:49.140 |
There's a very fat head and torso set of queries, and we have a sort of a long tail. 00:10:55.240 |
So by precomputing the outputs for all of the head and torso queries offline in a batch mode, 00:11:02.360 |
we were able to sort of cache all of this data. 00:11:06.160 |
And then online, when a query comes in, we could just serve it off of the cache with 00:11:10.600 |
very low impact on latency and fall back to our existing models for the long tail of queries. 00:11:17.860 |
And again, this worked really well because it didn't impact our latency, while it greatly 00:11:23.780 |
improved our coverage for the long tail of queries. 00:11:26.800 |
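In serving terms, the flow described here is roughly a cache lookup keyed by normalized query, with the existing models as a fallback for the long tail. A hypothetical sketch, with an in-memory dict standing in for the real cache and made-up function names:

```python
from typing import Callable, Optional

def make_category_server(
    cache_get: Callable[[str], Optional[list]],   # precomputed head/torso outputs
    fallback_predict: Callable[[str], list],      # existing (or distilled) fallback model
    normalize: Callable[[str], str] = lambda q: q.strip().lower(),
):
    """Return a serving function: cache hit for head/torso queries, model fallback for the tail."""
    def get_categories(query: str) -> list:
        key = normalize(query)                    # must match how the offline batch job keyed the cache
        cached = cache_get(key)
        if cached is not None:
            return cached                         # low-latency path, no LLM call online
        return fallback_predict(query)            # long-tail queries fall back to the existing model
    return get_categories

# Example wiring, with a plain dict standing in for the real cache:
cache = {"watermelon": ["Fruits", "Organic Foods"]}
serve = make_category_server(cache.get, lambda q: ["(fallback model prediction)"])
print(serve("Watermelon "))   # cache hit
print(serve("1% oat milk"))   # cache miss -> fallback
```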
Now, for the really long tail where I said we would fall back to our existing models, we're 00:11:32.380 |
actually trying to replace them with our distilled Llama model so that we can actually do a much better job there as well. 00:11:41.680 |
So yeah, to sort of summarize, essentially what we saw was that from a query understanding perspective, 00:11:48.020 |
we have a bunch of models and just using our hybrid approach greatly improved their performance. 00:11:56.020 |
But what's actually more interesting is that today, query understanding consists of a bunch of separate models. 00:12:01.080 |
And as Yazoo was talking about in the Netflix talk, managing all of these models is actually quite a challenge. 00:12:08.320 |
So consolidating all of these into an SLM or maybe a large language model can make the whole system much easier to manage and improve. 00:12:17.820 |
And I'll finish it off by giving an example here. 00:12:20.820 |
There's a query, "humm," spelled H-U-M-M, that we sort of saw some interesting issues with. 00:12:29.720 |
For that query, our brand tagger correctly identified it as a brand of kombucha. 00:12:37.820 |
But then our spell corrector, unfortunately, corrected it as hummus. 00:12:41.460 |
So the results were really confusing to users and were pretty bad. 00:12:45.960 |
But by using a more unified model, I think the results were much better. 00:12:49.880 |
The second is that by using an LLM for query understanding, we can actually pass in extra context. 00:12:57.160 |
So instead of just generating results for that query in isolation, we can really try to understand what the user is actually trying to accomplish. 00:13:06.240 |
So for example, detect if they're actually here to buy ingredients for a recipe, et cetera. 00:13:13.240 |
So to talk more about that, I have Tejaswi here. 00:13:19.480 |
Now I'll quickly talk about how we used LLMs for showing more discovery-oriented content in our search results. 00:13:25.120 |
Just to restate the problem: our users found that while our search engine was very good at 00:13:30.520 |
showing exactly the results they wanted to see, 00:13:34.520 |
once they added an item to the cart, they couldn't do anything useful with the search results page. 00:13:38.400 |
They either had to do another search or go to another page to fulfill their next intent. 00:13:45.280 |
Solving this with traditional methods would require a lot of feature engineering or manual curation. 00:13:50.280 |
LLMs solved this problem for us, and I will talk about how. 00:13:55.900 |
So for queries like swordfish, let's say there are no exact results. 00:13:59.900 |
We used LLMs to generate substitute results like other seafood alternatives or similarly meaty fish. 00:14:07.800 |
And similarly, for queries like sushi where there were a lot of exact results, let's say, 00:14:13.520 |
at the bottom of the search results page we would show things like Asian cooking ingredients 00:14:18.020 |
or Japanese drinks and so on, in order to get the users to engage. 00:14:23.020 |
I'll get to the techniques in a moment, but both of these discovery-oriented results, we saw, led 00:14:31.140 |
to improvements in engagement as well as in revenue per search. 00:14:38.020 |
Now, like I said, I'll get into the techniques, but let's first talk about the requirements for generating this content. 00:14:43.200 |
First, obviously we wanted to generate content that is incremental to the current solutions. 00:14:47.500 |
We don't want duplicates to what we were already showing. 00:14:50.360 |
And the second requirement and the most important one is we wanted all of the LLM answers or the 00:14:55.840 |
generation to be aligned with Instacart's domain knowledge. 00:15:00.120 |
So if a user searches for a query like dishes, the LLM should understand that it refers 00:15:05.120 |
to cookware and not food, and vice versa for a query like Thanksgiving dishes. 00:15:10.620 |
So with these requirements in mind, we started with a very basic generation approach. 00:15:16.920 |
We took the query and we told the LLM, "Hey, you are an AI assistant and your job is to generate two lists: 00:15:23.320 |
one is a list of complementary items and another is a list of substitute items for a given query." 00:15:38.120 |
And like Vinesh said in the query understanding section, when we launched this to our users, we saw that the results 00:15:44.120 |
were good, but users weren't engaging with it as much as we would have liked. 00:15:48.120 |
So we went back to the drawing board and we tried to analyze what was going on. 00:15:53.120 |
And what we quickly realized was that while the LLM's answers were reasonable, common-sense answers, 00:15:58.120 |
they weren't really what users were looking for. 00:16:02.120 |
Taking the protein example again: users, when they search for protein, look for 00:16:07.120 |
protein bars and protein shakes, rather than what the LLM would give us as an answer, which is chicken and so on. 00:16:14.120 |
So what we did was we augmented the prompt with Instacart domain knowledge. 00:16:20.120 |
So in one case, what we did was we took the query and then we augmented it with like, "Hey, 00:16:25.120 |
here is the query and here are the top converting categories for this particular query," along 00:16:30.120 |
with any annotations from the query understanding model like, "Hey, here is a brand present in the query, 00:16:35.120 |
here is a dietary attribute present in the query," and so on. 00:16:39.120 |
In another case, we were like, "Here is the query and here are the subsequent queries that users typically make after it." 00:16:47.120 |
So once we augmented the prompt with this additional metadata about how Instacart users behave, the generated content got much better. 00:16:55.120 |
I don't have the time to show before and after, but like I said, we definitely saw like a huge 00:17:00.120 |
improvement in both engagement as well as revenue. 00:17:05.120 |
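Putting those signals together, the augmented prompt looks roughly like the sketch below. The field names and wording are assumptions based on what was described (top converting categories, query understanding annotations, and follow-up queries), not the actual production prompt.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a batch-mode LLM call."""
    raise NotImplementedError

DISCOVERY_PROMPT = """You suggest shoppable content themes for a grocery search results page.

Query: "{query}"
Top converting categories for this query: {categories}
Query understanding annotations (brand, dietary attributes, ...): {annotations}
Queries users commonly issue next: {next_queries}

Ground your suggestions in the context above rather than generic common-sense answers.
Return JSON with two keys:
  "substitutes": item themes a shopper could buy instead of the query
  "complements": item themes typically bought together with the query
"""

def generate_discovery_content(query, categories, annotations, next_queries):
    prompt = DISCOVERY_PROMPT.format(
        query=query,
        categories=categories,
        annotations=annotations,
        next_queries=next_queries,
    )
    return json.loads(call_llm(prompt))
```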
I'll quickly talk about how we served all of this content. 00:17:10.120 |
It's impractical to call the LLM in real-time because of latency and maybe cost concerns sometimes. 00:17:15.120 |
So what we did was we took all of our historical search logs. 00:17:20.120 |
We called the LLM in a batch mode and stored everything: 00:17:24.120 |
the query, the content metadata, along with even the products that could potentially show up in these recommendations. 00:17:29.120 |
And online, it's just a very quick lookup from a feature store. 00:17:33.120 |
And that's how we were able to serve all of these recommendations blazing fast. 00:17:39.120 |
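A hypothetical sketch of that offline-plus-online split: a batch job walks historical queries, generates content with the LLM, attaches the candidate products, and writes everything to a feature store keyed by query, so the online path is a single lookup. The feature_store interface and helper names are made up for illustration.

```python
def backfill_discovery_content(historical_queries, generate_content, retrieve_products, feature_store):
    """Offline batch job: precompute discovery content for queries seen in search logs."""
    for query in historical_queries:
        themes = generate_content(query)   # batch LLM call; assumed to return a list of content themes
        feature_store.put(
            f"discovery:{query}",
            {
                "themes": themes,
                # Also precompute the products each theme would surface, so the online
                # path never calls the LLM or runs retrieval per request.
                "products": {theme: retrieve_products(theme) for theme in themes},
            },
        )

def serve_discovery_content(query, feature_store):
    """Online path: a single feature-store lookup in the request path."""
    return feature_store.get(f"discovery:{query}")
```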
Again, things weren't as simple as we're making them out to be. 00:17:43.120 |
Like Vinesh said, the overall concept is simple. 00:17:48.120 |
But there were three key challenges that we solved along the way. 00:17:51.120 |
One is aligning generation with business metrics like revenue. 00:17:55.120 |
This was very important to see top-line wins. 00:17:57.120 |
So we iterated over the prompts and the kind of metadata that we would feed to the LLM in order to get there. 00:18:04.120 |
Second, we spent a lot of time on ranking, on improving the ranking of the content itself on the page. 00:18:11.120 |
Our traditional pCTR and pCVR models did not work here. 00:18:13.120 |
So we had to employ strategies like diversity-based ranking and so on and so forth to get users to engage with the content. 00:18:21.120 |
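The talk does not detail the exact ranking strategy, but a common form of diversity-based ranking is a greedy MMR-style selection that trades off relevance against similarity to the items already chosen. A minimal sketch under that assumption:

```python
def diversity_rerank(candidates, relevance, similarity, k=10, lam=0.7):
    """Greedy MMR-style re-ranking: balance relevance against similarity to already-picked items.

    candidates: list of item ids
    relevance:  dict mapping item id -> relevance score (e.g. an engagement-model score)
    similarity: function (id, id) -> similarity in [0, 1]
    lam:        trade-off; 1.0 = pure relevance, 0.0 = pure diversity
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr(c):
            max_sim = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * max_sim
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```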
And then the third thing is evaluating the content itself. 00:18:24.120 |
So one is making sure that, hey, whatever the LLM is giving us is, one, correct. 00:18:30.120 |
And second, that it adheres to what Instacart, or what we, need as a product. 00:18:36.120 |
So summarizing the key takeaways from our talk. 00:18:39.120 |
LLM's world knowledge was super important to improve query understanding predictions, especially for tail queries. 00:18:46.120 |
While LLMs were super helpful, we really found success by combining the domain knowledge of 00:18:53.120 |
Instacart with LLMs in order to see the top-line wins that we saw. 00:18:57.120 |
And the third and last one is that evaluating the content as well as the query understanding predictions 00:19:02.120 |
and so on was far more important and far more difficult than we anticipated. 00:19:07.120 |
We used LLMs as a judge in order to make this happen, but it was a very, very important step. 00:19:26.120 |
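As a rough illustration of the LLM-as-a-judge step, the sketch below scores a generated suggestion on correctness and product fit; the rubric, output schema, and helper are assumptions rather than the actual evaluation prompt.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the judge LLM."""
    raise NotImplementedError

JUDGE_PROMPT = """You are evaluating content generated for a grocery search results page.

Query: "{query}"
Generated suggestion: "{suggestion}"

Score the suggestion from 1 to 5 on each dimension and return JSON:
  "correct":     is it a sensible substitute or complement for this query?
  "on_platform": is it something a grocery marketplace would actually sell?
Also include a one-sentence "rationale".
"""

def judge_suggestion(query: str, suggestion: str) -> dict:
    return json.loads(call_llm(JUDGE_PROMPT.format(query=query, suggestion=suggestion)))

# e.g. judge_suggestion("swordfish", "soy sauce") -> {"correct": ..., "on_platform": ..., "rationale": "..."}
```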
Have you also been experimenting with queries which are very long and in natural language? 00:19:31.120 |
Like, I want these three items and these five items. 00:19:42.120 |
I think we have actually launched something in the past, like, Ask Instacart, if you've seen it. 00:19:51.120 |
Which essentially takes natural language queries and tries to map that to search intent. 00:19:55.120 |
So, for example, you might say healthy foods for a three-year-old baby or something like that. 00:20:01.120 |
And so that would map to things like fruit slices or popcorn. 00:20:03.120 |
I don't know if three-year-old toddlers can eat popcorn, but something along those lines. 00:20:09.120 |
And then we had our usual recall and ranking stack sort of retrieve those results. 00:20:15.120 |
So, any learnings from that experiment for you? 00:20:20.120 |
So, I think we actually have a lot of learnings from that. 00:20:22.120 |
Essentially, as Tejaswi already mentioned, we need to inject a lot of Instacart context into the LLM. 00:20:33.120 |
So, having a robust automated evaluation pipeline was important. 00:20:39.120 |
For example, let's say it's a Mother's Day query, 00:20:44.120 |
and let's say one of the individual search intents we come up with is perfumes. 00:20:49.120 |
You really want women's perfumes to be in there. 00:20:52.120 |
Whereas when we just had perfumes, we could see all kinds of items. 00:20:55.120 |
So, passing that context from the LLM to the downstream systems is really important. 00:21:01.120 |
Yeah, we have a lot of examples where we failed.