
What We Learned from Using LLMs in Pinterest — Mukuntha Narayanan, Han Wang, Pinterest


Chapters

0:00 Introduction to Pinterest and its search functionality.
1:52 Overview of the Pinterest search backend architecture.
2:29 The search relevance model.
2:55 Key learnings from using LLMs for search relevance.
5:04 The value of VLM-generated captions and user actions as content annotations.
7:16 Productionizing LLMs with knowledge distillation.
12:14 The utility of relevance-tuned LLM embeddings as general-purpose semantic representations.
13:55 Q&A session.

Whisper Transcript

00:00:00.600 | Hi, everyone. Thanks for joining the talk today.
00:00:17.240 | We are super excited to be here and share some of the learnings
00:00:22.100 | we have from integrating LLMs into Pinterest search.
00:00:27.400 | My name is Han, and today I will be presenting with Mukuntha.
00:00:30.300 | We are both machine learning engineers on the search relevance team at Pinterest.
00:00:34.440 | So let's start with a brief introduction to Pinterest.
00:00:39.140 | Pinterest is a visual discovery platform where pinners can come
00:00:43.900 | to find inspiration to create a life they love.
00:00:46.600 | And there are three main discovery services on Pinterest.
00:00:50.000 | The home feed, related pins, and search.
00:00:53.860 | And today's talk will be focusing on search, where users can type in their queries
00:01:00.900 | and find useful, inspiring content based on their information needs.
00:01:06.360 | And we will share how we leverage LLMs to improve search relevance.
00:01:10.960 | Here are some key statistics for Pinterest search.
00:01:17.460 | Every month we handle over 6 billion searches, with billions of pins to search from, covering topics
00:01:24.500 | from recipes, home decor, travel, fashion, and beyond.
00:01:28.820 | And search at Pinterest is remarkably global and multilingual.
00:01:33.920 | We support over 45 languages, reaching pinners in more than 100 countries.
00:01:40.500 | These numbers highlight the importance of search at Pinterest and why we are investing in search
00:01:46.900 | relevance to improve the search experience.
00:01:49.240 | So this is an overview of how Pinterest search works at the backend.
00:01:57.540 | It's similar to many recommendation systems in industry:
00:02:02.540 | it has query understanding, retrieval, re-ranking, and blending stages, and finally produces
00:02:08.700 | a relevant and engaging search feed.
00:02:11.400 | And in today's talk, we'll be focusing on the semantic relevance modeling that happens
00:02:18.300 | at the re-ranking stage, and we'll share how we use LLMs to improve relevance in search.
00:02:29.500 | Okay, so here's our search relevance model, which is essentially a classification model.
00:02:36.460 | Given a search query and a pin, the model will predict how relevant the pin is to the search query.
00:02:43.740 | And to measure this, we use a five-point scale ranging from most relevant to most irrelevant.
00:02:53.700 | All right, now we are going to share some key learnings we have from using LLMs to improve
00:03:00.860 | Pinterest search relevance.
00:03:02.700 | And here are four main takeaways that we would like to go into more details.
00:03:08.260 | Lesson one: LLMs are good at relevance prediction.
00:03:17.740 | So before I present the results, let me first give a quick overview of the model architecture that we are using.
00:03:25.180 | We concatenate the query and the pin text together and pass them into an LLM to get an embedding.
00:03:34.940 | This is called a cross-encoder structure, where we can better capture the interaction between the query and the pin.
00:03:43.780 | And then we feed the embedding from the LLM into an MLP layer to produce a five-dimensional vector, which corresponds to the five relevance levels.
00:03:54.820 | And during training, we fine-tune open-source LLMs using Pinterest internal data to better adapt the model to our Pinterest content.
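To make that concrete, here is a minimal sketch of such a cross-encoder relevance head, assuming a PyTorch and Hugging Face setup; the base model, the mean pooling, and the head dimensions are illustrative assumptions, not Pinterest's production configuration:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CrossEncoderRelevanceModel(nn.Module):
    def __init__(self, base_model_name: str, num_levels: int = 5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model_name)
        hidden = self.encoder.config.hidden_size
        # MLP head mapping the pooled embedding to the five relevance levels.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_levels)
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool the last hidden state over non-padding tokens.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.head(pooled)  # logits over the five relevance levels

# Query and pin text are encoded together as one sequence (cross-encoder).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = CrossEncoderRelevanceModel("bert-base-multilingual-cased")
batch = tokenizer("chocolate cake recipe", "Easy one-bowl chocolate cake",
                  return_tensors="pt", truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```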
00:04:06.880 | And here I'd like to share some results to demonstrate the usefulness of LLMs. As a baseline, we use SearchSage, which is Pinterest's in-house content and query embedding model.
00:04:24.220 | And so if you look at the table, you can see that the LLMs have substantially
00:04:30.080 | improved the performance of the relevance prediction.
00:04:33.980 | And as we use more advanced LMs and increase the model size, the performance keeps improving.
00:04:41.560 | For example, the 8-billion-parameter Llama model gives a 12% improvement over the multilingual BERT-based model
00:04:50.560 | and a 20% improvement over the SearchSage embedding model.
00:04:54.520 | So the lesson here is that LLMs are quite good at relevance prediction.
00:05:01.000 | Lesson two:
00:05:05.920 | vision language model (VLM) generated captions and user actions can be quite useful for content annotations.
00:05:16.160 | So to use LLMs for relevance prediction, we need to build a text representation of each pin.
00:05:24.560 | And here I listed several features that we used in our model.
00:05:29.500 | Besides the title and description of the pin, we also include the VLM-generated synthetic image caption, to directly extract information from the image itself.
00:05:43.620 | And besides that, we add some user-engagement-based features, like the board titles for the user-curated boards that the pin has been saved to,
00:05:54.680 | or the queries that led to the highest engagement with this pin on the search surface.
00:06:01.780 | So these two user-action-based features serve as additional annotations for the content, and the five sources of features together help build a more robust and comprehensive text representation for each pin.
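As a rough illustration, a pin's text representation might be assembled along these lines; the field names and the separator are hypothetical, and the talk does not specify how the five sources are actually combined:

```python
def build_pin_text(pin: dict) -> str:
    """Concatenate the five text feature sources described above."""
    parts = [
        pin.get("title", ""),
        pin.get("description", ""),
        pin.get("vlm_caption", ""),             # VLM-generated synthetic image caption
        " ".join(pin.get("board_titles", [])),  # titles of boards the pin was saved to
        " ".join(pin.get("top_engaged_queries", [])),  # highest-engagement search queries
    ]
    return " | ".join(p for p in parts if p)
```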
00:06:21.940 | To understand the importance of each text feature,
00:06:27.180 | we also did some ablation studies.
00:06:29.600 | We used the VLM-generated image caption as a baseline.
00:06:34.840 | And as you can see, it already provides a very solid baseline by itself.
00:06:42.100 | And as we sequentially add more text features, we keep seeing performance improvements.
00:06:47.660 | And this indicates that enriching the text features is quite useful for relevance prediction.
00:06:54.040 | And notably, the last two rows of the table show the performance gains we get by adding these user-action-based features.
00:07:04.540 | So these features turn out to be quite useful content annotations that help the model better understand the content.
00:07:11.680 | All right.
00:07:14.500 | Next, I will hand over to Mukuntha to talk about how we use knowledge distillation to productionize this model.
00:07:21.300 | Great.
00:07:23.000 | Yeah.
00:07:23.980 | So now we have a good relevance model, which is good at predicting search relevance.
00:07:29.100 | But how do we actually scale this up without bankrupting Pinterest?
00:07:33.100 | Usually the answer is knowledge distillation into smaller models.
00:07:36.700 | And this is the production-served relevance student model that we distilled from the teacher model using semi-supervised learning.
00:07:44.740 | The student model is trained to predict five-scale relevance scores, too.
00:07:49.400 | It is trained on the five-scale soft scores produced by the teacher model.
00:07:55.060 | And we produce data for this using a semi-supervised learning setup that I'll show in the next slide.
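A minimal sketch of what training against the teacher's soft scores can look like; the exact production loss, temperature, and any mixing with human-labeled data are not specified in the talk:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    # Soft cross-entropy: push the student's distribution over the five
    # relevance levels toward the teacher's (equivalent to KL divergence
    # up to a constant given by the teacher's entropy).
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```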
00:08:01.660 | So the LLM teacher model is trained on a small set of human-labeled data that we get from human annotators who are trained on very specific segments.
00:08:12.460 | We fine-tune a multilingual language model, which uses pretty generic features that scale across a lot of different domains, et cetera.
00:08:22.500 | And the way we get training data for the student is by sampling from daily search logs, which cover all the searches people make on Pinterest. And since we sample daily, this includes any trending queries and all the latest, freshest pins on Pinterest.
00:08:43.440 | This data is also remarkably global, like we mentioned, and only a small subset of it comes from the US, where most of our human-labeled data comes from.
00:08:52.360 | We sample from this, we label using the teacher, and we scale it up pretty much 100x across the different domains, languages, and countries where the LLM teacher model produces pretty good labels.
00:09:05.720 | We train the student model, and this is the model that actually gets served online. Zooming into the student model: this also has language models in it.
00:09:18.620 | But unlike the teacher model, it's not a cross-encoder.
00:09:21.480 | It is a bi-encoder, which essentially means we don't have cross-interactions between the pin and the query representations.
00:09:29.980 | The pin gets embedded separately, the query gets embedded separately, and it also uses a lot of other features, like the SearchSage embeddings we previously mentioned, for both embedding the query and the pin.
00:09:41.140 | We have GraphSage embeddings, which Pinterest has published papers on, and OmniSage, and a lot of other embedding features for the query and pin.
00:09:49.540 | But we also use a lot of pin-query text match statistics like BM25, which we've seen historically perform really well for predicting search relevance.
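Sketched below is the shape of such a bi-encoder scorer: independently produced query and pin embeddings, plus extra features such as pretrained embeddings and BM25-style match statistics, feed a small head. The layer sizes and the exact feature set here are assumptions:

```python
import torch
import torch.nn as nn

class BiEncoderStudent(nn.Module):
    def __init__(self, embed_dim: int, num_extra_features: int, num_levels: int = 5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim + num_extra_features, 256),
            nn.ReLU(),
            nn.Linear(256, num_levels),
        )

    def forward(self, query_emb, pin_emb, extra_features):
        # No cross-attention between query and pin: the pin embedding can be
        # precomputed offline, while the query embedding is inferred online.
        x = torch.cat([query_emb, pin_emb, extra_features], dim=-1)
        return self.head(x)  # logits over the five relevance levels
```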
00:09:59.080 | And the reason this scales well is the bi-encoder: bi-encoder language models can scale really well when we use offline inference and caching.
00:10:12.640 | The pin embedding here is entirely offline-inferred on billions of pins.
00:10:16.960 | It uses predominantly the same text features that we mentioned on the teacher, which helps distill efficiently.
00:10:23.620 | And we only re-infer these embeddings when these inputs meaningfully change,
00:10:32.580 | meaning that every time we need new embeddings, inference only runs on a small set of new or changed pins, all offline.
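One simple way to implement that kind of change detection (purely illustrative; the talk does not describe Pinterest's actual mechanism) is to hash each pin's text features and re-embed only on a hash change:

```python
import hashlib
from typing import Optional, Tuple

def pin_needs_reembedding(pin_text: str, stored_hash: Optional[str]) -> Tuple[bool, str]:
    # Re-embed a pin only when its text features meaningfully change.
    new_hash = hashlib.sha256(pin_text.encode("utf-8")).hexdigest()
    return new_hash != stored_hash, new_hash
```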
00:10:44.400 | So none of this is happening online when a user issues a search query. The query embedding, on the other hand, is inferred online in pretty much real time.
00:10:52.140 | And search queries are pretty short.
00:10:54.360 | They don't occupy too many tokens, which means we can keep the latencies for the query embedding
00:11:01.320 | down to a few milliseconds. And we also cache this, because search queries get repeated a lot, and we get around an 85% cache hit rate.
00:11:10.860 | And yeah, this scales really well to actually serve Pinterest traffic.
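A toy sketch of that online query-embedding cache; the encoder below is a random stand-in for the real query encoder, and the cache size and key normalization are assumptions:

```python
import hashlib
from functools import lru_cache
import numpy as np

def embed_query(query: str) -> np.ndarray:
    # Normalizing the key before lookup raises the cache hit rate on repeats.
    return _embed_query_cached(query.strip().lower())

@lru_cache(maxsize=1_000_000)
def _embed_query_cached(key: str) -> np.ndarray:
    # Stand-in for the real-time LLM query encoder (a few milliseconds per call).
    seed = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(256, dtype=np.float32)

emb1 = embed_query("chocolate cake recipe")
emb2 = embed_query("Chocolate cake recipe ")  # cache hit after normalization
```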
00:11:16.140 | For the online results here, the first four numbers are relevance measurements, nDCG and precision@8, measured on the US, Germany, and France, specific segments that we zoomed into.
00:11:31.620 | We can actually see that we get relevance gains internationally, even though we started with a very limited set of US data for this particular experiment.
00:11:42.900 | And we also see that search fulfillment, which measures search-fulfilling engagement actions, also goes up, also on non-US traffic,
00:11:53.280 | even though our starting data was predominantly US. And yeah, large language models are very good at expanding across many different domains and countries, even though they weren't explicitly trained for this.
00:12:09.960 | And this is a bonus: we also found that relevance-tuned large language models produce really rich semantic representations, which are very good general-purpose ones.
00:12:22.080 | This is the same production relevance student model that I shared on the previous slide, and the pin embedding and the query embedding are basically free representations that we get from these models,
00:12:36.000 | which can be used across Pinterest for representing pins and search queries.
00:12:40.640 | We also use this to represent boards, using the titles, et cetera.
00:12:45.360 | And we found that these embeddings, especially since they've been distilled from a large language model teacher and also have large language models in them,
00:12:55.980 | are very good semantic content representations, and yeah, they perform pretty well across related pins,
00:13:05.940 | home feed, and a lot of other surfaces, where we've seen representations improved by adding these embeddings.
00:13:13.560 | So let me go over the key takeaways again.
00:13:17.560 | Lesson one: we found that LLMs are really good at relevance prediction.
00:13:23.040 | Lesson two: we found that visual language model captions are a good way to imbue the models with image information, and user actions are very good content annotations.
00:13:36.640 | Lesson three: we found that knowledge distillation is a very good way to scale and efficiently serve models online. And lesson four: relevance tuning produces pretty rich representations that embed semantic content fairly well.
00:13:55.680 | Thank you.
00:13:56.640 | I wonder if there are any questions from the audience. Please come up to the mics.
00:14:02.880 | How did you decide which open-source LLMs to fine-tune?
00:14:07.620 | Yeah, that's a very good question.
00:14:12.440 | So we did a lot of experiments trying different language models.
00:14:15.940 | And in a previous slide, we also shared some performance numbers for different language models.
00:14:21.640 | Yeah.
00:14:22.960 | Could you just walk us through somebody typing a search prompt? The confusion that I have is you have LLMs building some sort of matching.
00:14:40.100 | Is it just being used to produce the labels to be distilled, or how did you shim that into the bi-encoder? It wasn't really clear on the two-tower side how the LLM influenced that.
00:14:55.740 | We use LLMs to distill into a student model, which predicts search relevance specifically and produces five-scale relevance scores.
00:15:02.960 | And it's served at the end of the search pipeline,
00:15:06.380 | at the re-ranking stage.
00:15:08.460 | Like every recommendation system, we have a lot of candidate generators, and we have early-stage ranking.
00:15:16.140 | And this is one of the things that sits further down the pipeline: it actually predicts search relevance scores, and this is used
00:15:22.920 | right before blending to actually produce a feed.
00:15:26.800 | So I think it's very similar to most recommender systems.
00:15:29.760 | So I have a question on how you evolved into this architecture. I'm sure Pinterest had pre-LLM-era search as well.
00:15:46.120 | So what limitations did you see in those systems that this new architecture is solving for?
00:15:51.200 | So, if I'm understanding correctly, your question is about what the driver was for adopting LLMs in your search pipeline?
00:16:09.600 | So, did it support new features, or does it improve on the existing features where you had limitations?
00:16:16.920 | I think they definitely improve things, especially with the visual language model captions.
00:16:24.200 | I think we were effectively able to expand beyond limited markets for actually measuring relevance and getting relevance data.
00:16:34.100 | And yeah, these models are very good at generating synthetic data for different markets.
00:16:47.540 | Hey, great talk.
00:16:49.700 | I was wondering why, or if, the embedding model is inherently multimodal, because you have text, which is the query, and then you're matching against either text, links, or images.
00:17:03.020 | And so how do you think about multimodality?
00:17:05.220 | It's definitely something we're exploring, but for a lot of applications, I think we found visual captions are very good at capturing what the image has.
00:17:17.060 | And we have some very good captioning models in house, which help us there.
00:17:23.000 | Great.
00:17:24.500 | Thanks.
00:17:24.800 | Great talk.
Yeah, just a quick question.
00:17:31.340 | You mentioned that you saw improvements in other languages as well.
00:17:34.380 | Did you start with a common baseline model for all languages and just change the features for each language, or did you actually start with separate models for
00:17:47.000 | individual languages?
00:17:47.900 | I'm curious how you actually saw the improvements manifest everywhere.
00:17:52.340 | Yeah.
00:17:54.240 | So we use the same model for all languages, because we are using a multilingual LLM.
00:18:02.900 | So we believe it can transfer to other languages.