
Spotify's Podcast Search Explained


Chapters

0:00 Intro
4:16 NLP in Semantic Search
8:35 Why Now?
9:29 Transformer Models
11:52 Sentence Transformers
13:12 Vector Search
15:56 How Spotify Built Podcast Search
17:35 Data Source, Fine-tuning, and Eval
22:58 Code Implementation, Dataset
24:44 Data Preparation
26:39 Query Generation
29:54 Fine-tuning a Podcast Model
41:40 Evaluation
48:05 Does it Scale?
49:00 Sharing Your Work


00:00:00.000 | In the past few years podcasts have become increasingly popular, and
00:00:05.120 | leading the charge in podcasts is Spotify. Now
00:00:09.200 | Spotify
00:00:11.260 | only very recently entered the market for podcasts, in 2018.
00:00:16.440 | Despite that, in a few short years,
00:00:19.880 | by 2021 they had already usurped Apple as the leader in terms of
00:00:26.000 | monthly active listeners.
00:00:29.000 | Now Apple has been doing this for a very long time, since I think 2008 or maybe even earlier.
00:00:34.940 | But despite that, Spotify are now ahead of everyone, and of course to back
00:00:41.920 | their investments in podcasts, they've also invested heavily in the technology that powers
00:00:49.320 | different components of the podcast experience,
00:00:52.800 | whether that's Anchor, which is an all-in-one app that allows people to record and
00:00:59.320 | publish podcasts, or what we're going to focus on: natural language search for podcasts.
00:01:08.080 | Spotify's natural language search or semantic search for podcasts is I think a really interesting use case because it enables a more intuitive
00:01:15.640 | user experience
00:01:17.840 | By allowing us to search through their podcast catalog, which is huge using more natural language queries
00:01:26.400 | Rather than the typical keyword or term matching that you would do in most places
00:01:31.440 | So beforehand, when we were searching, we'd have to kind of know the words or the terms that we're looking for, and
00:01:40.120 | I think we as humans don't really know exactly what we're looking for in terms of
00:01:47.200 | words or keywords
00:01:49.760 | for whatever it is we're looking for.
00:01:51.880 | Instead we tend to think more in
00:01:56.280 | concepts and ideas. Our language is meaningful; it's not specifically about term matching or keyword matching.
00:02:02.760 | so it makes sense that we might want to create a search experience that
00:02:09.560 | Replicates the meaning in language rather than just terms and words
00:02:16.480 | Imagine we wanted to find a podcast that talks about eating healthily over the winter holidays
00:02:23.280 | Okay, we might search something like this: "eat better during Xmas holidays".
00:02:29.960 | Now in the data that we're going to be using, there is a podcast description that talks about exactly this.
00:02:35.960 | Its description is: Alex Draney talks to Dr.
00:02:39.320 | Priya Alexander about how to stay healthy over Christmas and about her letter to patients. Now if you compare those two,
00:02:46.920 | they don't share any of the same words, despite
00:02:50.840 | pretty much being about the same thing. So if you try and use this in a term matching query,
00:02:58.880 | it's not going to work. But if we were to swap that for a natural language query, or a semantic search,
00:03:06.120 | it does because
00:03:09.120 | What we are searching for here
00:03:11.680 | the meaning overlaps. The genuine, like, human meaning behind those two phrases is
00:03:20.440 | very similar, and
00:03:22.440 | when a semantic search is done properly it is able to identify,
00:03:27.480 | okay, we're looking for the meaning behind these, and the meaning of these two does overlap.
00:03:34.000 | We're talking about the Christmas holidays, being healthy, eating better;
00:03:38.320 | they're very similar concepts. But
00:03:41.600 | enabling meaningful search in this way is not easy,
00:03:46.320 | so what I want to do is actually have a look at how Spotify have done this, and
00:03:51.280 | actually replicate it and see what we can do as well. So the technology powering
00:03:58.280 | Spotify's semantic search
00:04:00.880 | consists of two different components
00:04:02.880 | We have the NLP or natural language processing side of it and we also have the vector search component now
00:04:10.680 | These technologies can be seen as two steps in the search process
00:04:15.920 | given a natural language query a
00:04:18.120 | language model, so this is the NLP side, will take that query and convert it into something we call a dense vector,
00:04:26.320 | which is essentially a meaning-encoded,
00:04:30.760 | numeric representation of
00:04:33.640 | whatever that query is. Okay, sounds confusing, but we'll see that it's not
00:04:40.160 | really too complex quite soon.
00:04:43.400 | These dense vectors can then be compared as you would compare normal vectors
00:04:49.360 | So imagine you have two points in a particular space and you want to calculate the distance between them
00:04:55.160 | We can literally do that but with the meaning behind these queries
00:05:00.880 | so let's have a look at an example of that so here we just have a
00:05:05.480 | 3d chart and
00:05:08.080 | Typically when you have these dense vectors, they have many dimensions
00:05:12.040 | Okay, so you're looking at 700-plus dimensions in
00:05:15.640 | these vector representations
00:05:18.640 | In this one we've used
00:05:21.480 | PCA to reduce the dimensionality of those very high dimensional vectors into a 3D space, just so we can visualize it,
00:05:29.040 | but still maintain the relationships between those vectors. So if we have a look here we can see
00:05:36.240 | There's one main
00:05:40.360 | Grouping of vectors. Okay, so these three over here now
00:05:45.160 | Let's just zoom in a little bit and have a look what they are
00:05:50.120 | So we have the text "podcasts about cooking and writing", "a chat with a chef who wrote books", and
00:05:56.760 | "interview with cookbook author", so we can see that they're all about cooking and
00:06:02.640 | writing, and they've all been grouped together.
00:06:07.560 | that's because our language model has taken these queries and
00:06:11.840 | Encoded them in such a way that the vector representations for these
00:06:17.240 | represent the meaning
00:06:20.400 | Okay, and the meaning of all these although it's not the same is very similar talking about cooking writing books and so on
00:06:28.480 | And we can see that represented in the proximity of these vectors now
00:06:34.560 | Let's have a look at something else. That's kind of close but not within that same cluster
00:06:39.140 | So over here we have "how do I cook great food?" Okay, so we're talking about cooking again.
00:06:44.980 | Okay, and then we have this one further over here, "eat better during Christmas holidays".
00:06:51.400 | So we can kind of see that this one,
00:06:54.120 | "how do I cook great food", is sort of in between, like, these about cooking and writing and a cookbook, and then over here
00:07:04.320 | we have "eat better during Christmas holidays", which is talking specifically about food.
00:07:08.760 | Okay, so this one is cooking and food, and it's in the middle of them both, and then we have a few more over here.
00:07:15.280 | And we have "the writing show" over here,
00:07:18.200 | so it's like a writing podcast or something, and then over here one of the more similar
00:07:22.960 | vectors is "how to tell more engaging stories" and then also "how to keep readers interested".
00:07:28.640 | So again talking about writing, and then we go up here, we see this by itself, "superhero film and arts".
00:07:34.880 | It's obviously very different to everything else we've looked at. So this is kind of what I mean by
00:07:41.120 | representing the meaning
00:07:44.280 | numerically in a vector space
00:07:46.400 | Similar things get grouped together
00:07:49.000 | Dissimilar things are separated now
00:07:52.240 | All of those vectors have been encoded by a special kind of NLP model called a sentence transformer
00:07:58.800 | Now that sentence transformer is specially trained to do what we just saw, which is group similar things together and
00:08:04.800 | separate dissimilar things. Once we have those vectors from our sentence transformer,
00:08:10.200 | We need a way to compare them and that's where the vector search component comes in. So imagine
00:08:16.800 | Within all those queries that we just saw
00:08:20.160 | Imagine we came up with a new query:
00:08:23.220 | convert that query into a vector, place it within that vector space, and then we would search for the most similar of the
00:08:30.220 | vectors, and we would return those as being the ones that are most similar to your particular query.
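As a minimal sketch of that comparison step, here's the idea in code, using plain NumPy and some made-up three-dimensional vectors (real dense vectors have hundreds of dimensions):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity: dot product of the two vectors divided by
    # the product of their magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.2, 0.9, 0.1])   # made-up query vector
episodes = np.array([
    [0.1, 0.8, 0.2],                # similar meaning -> high score
    [0.9, -0.1, 0.3],               # different meaning -> low score
])

# score every episode vector against the query and rank them
scores = [cosine_sim(query, e) for e in episodes]
best = int(np.argmax(scores))       # index of the most similar episode
print(scores, best)
```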
00:08:36.880 | NLP and vector search have been around for a long time, but both fields have had very recent
00:08:43.780 | developments that have really acted as catalysts for the
00:08:49.520 | Performance increases and subsequent adoption of semantic search in NLP
00:08:55.040 | we've had the introduction of transformer models and
00:08:58.480 | in vector search the
00:09:01.720 | introduction of ever better
00:09:04.320 | approximate nearest neighbor search algorithms now transformers and
00:09:09.400 | approximate nearest neighbors have
00:09:12.760 | powered the growth of semantic search, but it's not
00:09:16.160 | inherently clear what both of those are. I'm assuming most of you probably have no idea what either of those are; if you do,
00:09:22.520 | then that's great,
00:09:23.800 | but otherwise no problem, because we're going to actually go through both of those concepts as well.
00:09:28.880 | So starting with transformer models: transformer models are
00:09:32.160 | essentially the standard in NLP now. They
00:09:37.680 | Typically consist of two components, which is quite important for us to know
00:09:42.520 | There's the core of the transformer and then you usually have a head which adapts the transformer for a particular task
00:09:49.880 | Now there's just one problem: the core of these transformer models is huge, like they're massive models, and
00:09:57.840 | For most organizations
00:10:01.440 | it is far too expensive to actually train the core of the model. For example with BERT, which is one of the most popular
00:10:10.840 | language models, and not even a particularly large one, the cost was reportedly,
00:10:16.480 | I think, two and a half thousand to fifty thousand dollars to train a small one.
00:10:20.600 | And then when you look at a larger BERT model, that shifts up to eighty thousand to one point six million dollars,
00:10:28.200 | which is obviously huge.
00:10:30.680 | now most organizations
00:10:33.320 | Don't have that kind of money lying around
00:10:35.960 | So how do we, you know, how are transformers useful for us if we can't afford to train them?
00:10:40.800 | Well, the way that we typically go about using transformer models is:
00:10:46.120 | one, the core of a transformer model is
00:10:50.320 | trained by the likes of Microsoft or Google. It costs them a lot of money to train it,
00:10:56.060 | but they have the money to do that.
00:11:00.600 | Two, the model is made publicly available by Microsoft or Google.
00:11:06.440 | Three, other
00:11:08.560 | organizations take this core, this transformer core, and they add different heads to it.
00:11:15.140 | So these heads are like a final few layers at the end of the model
00:11:19.720 | that adapt it for a particular task.
00:11:22.560 | Now, using that extended model with the head,
00:11:28.980 | the model can be fine-tuned using a lot less computing power, which
00:11:34.700 | brings it within the range of
00:11:37.440 | feasibility for other, more normal organizations. And
00:11:41.880 | four, once you've fine-tuned your
00:11:44.960 | extended transformer model, you are able to go ahead and actually apply that to your specific task.
00:11:51.840 | So in the case of our example podcast search, we might want to take a BERT model,
00:11:57.240 | so like BERT base uncased, which has been pre-trained by Google.
00:12:01.300 | We would take that and then we would add what's called a pooling head onto the end of it.
00:12:07.320 | That converts it into what we call a sentence transformer, which will take an input like a normal transformer,
00:12:13.700 | so some sort of text input. The normal transformer will output a set of
00:12:19.100 | token level embeddings. Now, we can't use token level embeddings when we're comparing sentences, because
00:12:26.540 | one sentence for BERT can contain up to 512 tokens, so we
00:12:32.860 | really need a way to actually compress that down into one single vector to represent the full set of inputs that were given to our
00:12:40.560 | transformer model. So that's where we have that pooling layer. That pooling layer takes all of those token level embeddings and
00:12:47.740 | converts them into a single
00:12:50.260 | embedding,
00:12:51.460 | which is our sentence vector or sentence embedding.
00:12:55.420 | Now there are different types of pooling layers, and we'll talk about one later,
00:13:00.180 | but they all consume the same input and they will output that single embedding, and
00:13:06.300 | it is that sort of model that we refer to as a sentence transformer. That sentence embedding at the end is
00:13:14.400 | what we referred to earlier as a dense vector, okay, the numerical representation of
00:13:21.780 | whatever we fed into that transformer model.
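As a minimal sketch of what that looks like in code with the sentence transformers library (the model name is the multilingual distilled USE model used later in the video, and the queries are just examples):

```python
from sentence_transformers import SentenceTransformer

# load a pre-trained sentence transformer (transformer core + pooling head)
model = SentenceTransformer("distiluse-base-multilingual-cased")

queries = [
    "eat better during Xmas holidays",
    "interview with cookbook author",
]
# encode() returns one dense vector per input string
embeddings = model.encode(queries)
print(embeddings.shape)  # (2, 512): one 512-d sentence embedding each
```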
00:13:25.820 | So given that dense vector, we move on to the next step, which is vector search.
00:13:34.380 | Approximate nearest neighbors search allows us to search more efficiently through a lot of
00:13:40.140 | vectors, because in many use cases, for example with Spotify, how many episodes do they have on
00:13:48.900 | there? I imagine there are many.
00:13:51.460 | Okay, and
00:13:54.060 | it's very hard, or it takes a very long time, if you compare your query to all of the
00:14:01.480 | existing vector embeddings.
00:14:04.100 | It's very hard to
00:14:06.740 | do that quickly.
00:14:09.240 | In fact, it's impossible if you're using a typical k-nearest neighbors search, otherwise known as exhaustive search,
00:14:15.500 | because you need to compare everything, and
00:14:18.620 | if you have millions of
00:14:20.620 | items to compare there,
00:14:23.100 | no matter what hardware you run, it's going to take a long time.
00:14:26.500 | So approximate search
00:14:30.760 | Allows us to speed that up by literally approximating the answer now
00:14:37.380 | Approximate nearest neighbor algorithms are very good because they allow you to approximate the answer with a
00:14:46.460 | super high degree of accuracy, like 99% accuracy, and
00:14:50.860 | make your search incredibly fast. You're talking sub-second,
00:14:56.540 | sub half a second; you're in the 10 millisecond range in a lot of cases.
00:15:01.840 | So that's really useful.
00:15:06.020 | But it's also worth noting that between these different algorithms there's usually a trade-off between
00:15:12.860 | speed, like your search times, and accuracy. So you can have an even faster search,
00:15:20.040 | but then your accuracy will tend to be
00:15:22.940 | not quite as good, so you'll be returning maybe 70% accuracy rather than 99% accuracy.
00:15:29.000 | But there's always sort of a trade-off between the two. Generally the algorithms now are very good, and
00:15:36.220 | you can get incredible accuracy and incredible speed as well. So we've merged those two together:
00:15:42.220 | We have our sentence transformers. We have our approximate search and we now have a
00:15:47.700 | tool set that is capable of performing semantic search at
00:15:53.020 | Incredibly large scales. So what I want to do now is look specifically at how Spotify did it to build this kind of
00:16:02.020 | semantic search tool
00:16:04.220 | Spotify first needed a model that could
00:16:08.420 | encode queries and
00:16:12.100 | episode metadata, or descriptions, into the same sort of vector space. Now there are
00:16:19.420 | existing transformer models, or sentence transformer models, that
00:16:24.660 | can do a lot of things, for example SBERT.
00:16:28.220 | But Spotify found a few issues with those. Specifically for SBERT, one, they needed a model
00:16:35.860 | that was capable of supporting multilingual queries, because obviously Spotify has a lot of content from
00:16:43.740 | every place in the world,
00:16:46.780 | so they couldn't use SBERT, because SBERT has been trained on English-only data, and
00:16:51.820 | two, SBERT's performance on
00:16:55.500 | new domains, so for example podcasts, is
00:16:59.300 | not great without further fine-tuning. So the out-of-the-box,
00:17:05.340 | pre-trained, fine-tuned sentence-BERT model could not be used; it didn't satisfy what Spotify needed.
00:17:12.860 | With that in mind, they decided to start with the pre-trained
00:17:17.660 | Universal Sentence Encoder model. This allowed them to cover the multilingual issue, and
00:17:24.580 | yes, the USE model still needs to be fine-tuned,
00:17:30.060 | but they were going to need to do that anyway, so it's not really an issue.
00:17:35.340 | So let's have a look at what sort of data they used to actually fine-tune their model. So to fine-tune their model,
00:17:41.900 | Spotify needs
00:17:44.100 | query-
00:17:46.300 | episode pairs.
00:17:49.220 | Now, of course, Spotify has a ton of search logs
00:17:57.540 | over here, so they took these search logs and they used them to create two different data sources.
00:18:03.740 | The first was simply: okay, if there was a successful search,
00:18:07.900 | what was the query from that search and what was the episode? Okay, and that created this little data source that you see here.
00:18:15.980 | The other one's a bit more interesting. So whenever they found that there was an
00:18:22.300 | unsuccessful search that was straightaway followed by a successful search,
00:18:26.580 | they looked at what the query for that
00:18:29.040 | unsuccessful search was, the logic behind that being: if you're typing something, you're searching, and it doesn't work,
00:18:36.360 | you probably used a more sort of natural language, or more natural-feeling, query,
00:18:42.420 | but then it doesn't work, so then you change it to be a bit more robotic, trying to fit what you would expect
00:18:51.820 | Spotify's search to come up with
00:18:53.820 | for the correct
00:18:56.340 | episode or podcast episode.
00:18:58.060 | So what they did is they took that unsuccessful search query and then the
00:19:02.700 | episode that the user found after their successful search, and put them together to create this little data source.
00:19:10.900 | So we have two data sources now
00:19:13.860 | I've used dotted lines here because we can't replicate that; we don't have Spotify's past search logs,
00:19:20.260 | so we will just ignore that,
00:19:23.420 | but be aware that the way that they would be used would be exactly the same as this third
00:19:30.740 | data source over here, which we are going to replicate. Now that third data source is
00:19:36.040 | taking podcast episodes, so this little diamond over here, and we're transforming them.
00:19:41.700 | So what I mean by transforming is taking the podcast show title and description, and also the
00:19:49.820 | specific episode title and description, and
00:19:51.820 | concatenating all that together to create a sort of episode
00:19:56.340 | description from these multiple things, and then we use those
00:20:02.140 | episodes, that episode data, with a query generation step. Now query generation is super interesting, and
00:20:09.740 | what we're going to do there is actually use a query generation model to
00:20:14.860 | create synthetic queries for our episodes, and that will create this third data source over here,
00:20:22.360 | okay, which is synthetic query to episode pairs, and
00:20:26.820 | we'll be using that.
00:20:28.740 | So we'll split that into sort of a training set, and we'll use that to fine-tune a
00:20:34.500 | model in the same way that we would have used the other two data sources, and
00:20:40.020 | we'll be fine-tuning a pre-trained Universal Sentence Encoder model.
00:20:44.660 | So we use all that, and in the end what we'll have is this podcast
00:20:50.060 | Universal Sentence Encoder model, which we can then use to encode both our queries and our episodes.
00:20:57.740 | So there's one other. So we have our podcast episodes over here, we placed them into this transformation,
00:21:04.140 | we've created those podcast episode descriptions.
00:21:07.260 | This other data source that you see over here is a manually created data source.
00:21:12.340 | Spotify did this, and they used it purely for evaluating the model; they didn't use this in training.
00:21:17.600 | So they just manually curated a set of queries for particular podcast episodes.
00:21:22.700 | We will do the same; we just do, I think it's like, seven queries,
00:21:27.000 | and see how that works, and use that to actually evaluate our models. Now, once we've done all of that,
00:21:32.900 | we come to the evaluation, and also the actual, you know, how you use this.
00:21:37.860 | You would take your
00:21:40.140 | podcast model, you bring it over here into the middle here, and then you're going to take the
00:21:47.000 | evaluation data
00:21:49.260 | from your query-generated data, bring it into here.
00:21:52.420 | You're going to encode everything and place it into a vector search engine; in this case we're using Pinecone.
00:22:00.140 | This will allow us to store all of our episodes as
00:22:03.600 | embeddings, and then we'll be able to query and see if we are returning the correct
00:22:08.880 | episode based on our synthetic query to episode pairs, and we'll also do the same
00:22:14.200 | using these manually curated query-episode pairs. And then this is exactly what you would do in production as well:
00:22:22.340 | so after you've evaluated, you take all of your episode
00:22:25.340 | embeddings, put them in your
00:22:28.860 | vector database, the Pinecone at the bottom there, and then you would let your users query. Your query would go into your podcast
00:22:36.740 | sentence transformer model, and then it would be passed into your vector database, which would identify the most
00:22:44.700 | semantically similar, or most relevant,
00:22:47.380 | episode vectors for you.
00:22:50.020 | Okay, so that's a very high level overview of how it all works. If it's confusing, no problem,
00:22:55.980 | because we're going to work through all of this. So now we'll move on to the actual implementation of all of this.
00:23:00.900 | Now, what we first want is a dataset that is going to, as closely as possible,
00:23:06.200 | replicate the episode data that Spotify are using.
00:23:11.340 | So we'll need Kaggle for this. You can download or install the Kaggle API using pip install kaggle,
00:23:18.580 | and then you'll need an account. So you need to sign up on kaggle.com and you can get an API key.
00:23:24.220 | So you just come over here, you click on, on the top right,
00:23:27.440 | you'll see your profile, you go to, I think, Account, and you can come down and you have API here,
00:23:34.780 | so you'll be able to create a new API token, and it will download
00:23:39.340 | a kaggle.json file for you.
00:23:42.360 | And then what you do with that is, if you try to import kaggle, so you run this import kaggle,
00:23:49.060 | it will say, okay, you are not authenticated, because you don't have kaggle.json in a particular directory.
00:23:55.020 | So all you do is put your kaggle.json in the directory that is specified.
00:23:59.380 | Once you've done that, you should be able to import kaggle without any errors, and we'll move on to the data download.
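A rough sketch of that authentication and download step (the dataset slug is an assumption; confirm it on the Kaggle page mentioned below):

```python
from kaggle.api.kaggle_api_extended import KaggleApi

# requires kaggle.json in the directory the error message specifies
api = KaggleApi()
api.authenticate()

# dataset slug is an assumption; check the Listen Notes Kaggle page
api.dataset_download_files(
    "listennotes/all-podcast-episodes-published-in-december-2017",
    path="./data",
    unzip=True,  # the files come zipped, so extract on download
)
```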
00:24:06.700 | So we just authenticate our API. I'm not going to go too much
00:24:10.020 | into depth here; there will be a link to this notebook so you can read through everything
00:24:17.260 | as you want, and
00:24:19.260 | all we're going to do is download these two datasets.
00:24:22.360 | Okay, so let me copy this and I'll show you this dataset on Kaggle.
00:24:29.760 | So this is all podcast episodes published in December 2017, from this company called Listen Notes,
00:24:35.220 | which is a podcast search engine.
00:24:37.660 | So we download these podcasts and episodes data, and we extract them, because they'll be in zip files, and
00:24:44.340 | we can see, okay, this podcasts one is actually the details of the podcast show itself, and
00:24:49.660 | episodes is the details of the, you know, individual episodes, and
00:24:53.780 | what Spotify did was concatenate the title and description from both of those, and use that to create a sort of a single
00:25:03.140 | episode
00:25:04.820 | description that we then encode.
00:25:06.920 | So we do the same.
00:25:09.500 | We first need to merge those two data frames,
00:25:12.300 | so we're doing that based on the unique podcast ID, and we, you know,
00:25:17.540 | get this sort of data frame with everything we need inside it.
00:25:21.460 | I'm just doing a little bit of data cleaning here,
00:25:26.460 | so stripping any white space from the features that we care about, so the title EP, description EP, title
00:25:33.580 | podcast, and description podcast;
00:25:35.940 | EP here is just episode. And
00:25:39.380 | then I'm also just removing, where we have any null values in any of those
00:25:44.900 | columns or features, just removing them, because we have a lot here anyway, and we don't even use all of these, so
00:25:51.740 | it's not a problem. And then I'm concatenating everything; I'm just putting a full stop in between each one of them, and
00:25:58.820 | yeah, then converting that into a list. So we just have, like, a list of these episode
00:26:08.140 | texts now. Okay, you see they do vary a lot; it's not the cleanest data,
00:26:12.660 | but it's usable.
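A rough sketch of that preparation, including the shuffle mentioned next; the column names here are assumptions about the Listen Notes CSVs, so the real files may differ:

```python
import pandas as pd

# column names below are assumptions about the Listen Notes CSVs
podcasts = pd.read_csv("data/podcasts.csv")
episodes = pd.read_csv("data/episodes.csv")

# join each episode row to its parent show on the unique podcast ID
df = episodes.merge(
    podcasts, left_on="podcast_uuid", right_on="uuid",
    suffixes=("_ep", "_pod"),
)

features = ["title_ep", "description_ep", "title_pod", "description_pod"]
for col in features:
    df[col] = df[col].str.strip()   # strip stray whitespace
df = df.dropna(subset=features)     # drop rows missing any feature

# concatenate episode + show text with full stops into one description
df["episode_text"] = df[features].agg(". ".join, axis=1)

# shuffle and convert to a plain list of episode texts
episodes_list = df["episode_text"].sample(frac=1, random_state=42).tolist()
```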
00:26:15.660 | We also shuffle here as well.
00:26:18.920 | When you're shuffling, just bear in mind your shuffle is going to randomize the order,
00:26:24.540 | so later on, when we have these curated queries, they won't necessarily be in the test set for you as well,
00:26:33.580 | so you may need to just make up some new queries.
00:26:38.260 | At that point we have our episodes data, and we need to move on to query generation, because we need, you know,
00:26:45.880 | query-episode pairs. So Spotify used, or they fine-tuned, a BART model on MS
00:26:52.580 | MARCO. Now, I'm not going to fine-tune a model, because there are,
00:26:57.480 | honestly, so many query generation models that have been fine-tuned on MS MARCO already, including BART models,
00:27:05.180 | so I don't think there's really any point in us fine-tuning it, because we'd just be replicating what we can just
00:27:11.160 | pull from Hugging Face.
00:27:13.760 | So I tested a few different BART and T5 query generation models,
00:27:18.960 | and I found this one to be the best for this particular use case;
00:27:22.460 | the queries were just more consistently, like,
00:27:27.620 | sensical, like they actually made sense, and
00:27:29.940 | it also supports multi, or it has some multilingual support.
00:27:35.420 | I'm not sure how much, but it does manage to actually produce queries that make sense in different languages. So that's
00:27:42.640 | definitely pretty useful, and aligns pretty nicely with what Spotify wanted as well.
00:27:48.740 | And then what I'm doing is, so I'm using the transformers library, this is Hugging Face transformers.
00:27:54.980 | You can pip install transformers if you need to. I'm just initializing the tokenizer and the model.
00:28:01.340 | Okay, so this is going to handle our query generation. On the end here you see I put CUDA;
00:28:08.340 | that's because I have a CUDA-enabled GPU, which will make things a lot faster when we're doing this, because this can take a long time,
00:28:14.820 | and that's why I'm just taking the first
00:28:18.020 | 100,000 episodes here.
00:28:21.580 | So we move on to the query generation loop. It's fairly long, sorry;
00:28:27.420 | this isn't like the quickest way you can do it.
00:28:29.740 | So a larger batch size means faster processing, but this is limited by the size of your GPU, and
00:28:37.860 | then the number of queries I'm generating for
00:28:40.860 | each episode:
00:28:46.460 | Spotify didn't specify what they used for the number of queries here, but I'm using three, because that's
00:28:52.860 | in line with the approach taken by these other query generation techniques, GenQ and GPL,
00:28:59.900 | so I'm going to stick with that.
00:29:03.100 | Then we're just going through, encoding everything in batches. So we're just tokenizing everything, and then we're generating
00:29:08.900 | three queries per episode,
00:29:11.660 | decoding those queries back to human-readable text, because this just outputs a load of
00:29:17.100 | token IDs, numbers,
00:29:19.100 | decoding that back to text, and looping through and putting those all together,
00:29:25.420 | so the query and episode pairs, placing those together as synthetic query and episode pairs.
00:29:30.980 | Okay, and then we put all those in this pairs list.
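Here's a condensed sketch of that loop. The exact checkpoint isn't named in the video, so a well-known MS MARCO query generation model stands in, and the generation settings are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# stand-in checkpoint; the video doesn't name the exact model used
model_name = "BeIR/query-gen-msmarco-t5-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")
gen_model.eval()

episodes_list = episodes_list[:100_000]  # first 100K episodes only
pairs = []
batch_size = 64      # larger is faster, limited by GPU memory
num_queries = 3      # synthetic queries per episode, as in GenQ/GPL

for i in range(0, len(episodes_list), batch_size):
    batch = episodes_list[i:i + batch_size]
    inputs = tokenizer(batch, truncation=True, padding=True,
                       max_length=384, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = gen_model.generate(**inputs, max_length=64,
                                     do_sample=True, top_p=0.95,
                                     num_return_sequences=num_queries)
    # decode the generated token IDs back to human-readable queries
    queries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # pair each episode with the three queries generated for it
    for j, episode in enumerate(batch):
        for q in queries[j * num_queries:(j + 1) * num_queries]:
            pairs.append((q.strip(), episode))
```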
00:29:37.620 | Okay, and we can see a few here. So we have like a query, and then we have an
00:29:43.540 | episode,
00:29:44.780 | another query, episode, and so on. You can also see, so here we have the multilingual support as well,
00:29:50.860 | which is very cool.
00:29:54.220 | So we now have a data source that we can use for fine-tuning a
00:30:00.580 | model. Spotify talk about this in their article, which is here, and
00:30:05.620 | they mention what they tried, okay, here.
00:30:10.620 | So they tried SBERT, but as I said, they decided they didn't really like it,
00:30:18.220 | so, you know, we're not using it, for these multiple reasons I mentioned before.
00:30:23.780 | So what they did is use this Universal Sentence Encoder model, and we're going to do something similar,
00:30:31.300 | but what I will note here is that they used TensorFlow Hub, and
00:30:36.700 | we can use TensorFlow Hub, but it means we'd be stuck with TensorFlow, and it makes our lives harder,
00:30:43.420 | rather than using
00:30:45.900 | Hugging Face and PyTorch and the sentence transformers library. So
00:30:50.460 | rather than using the
00:30:53.380 | Universal Sentence Encoder model, which we can't get from the sentence transformers library as a pre-trained model,
00:31:00.340 | we are going to use a distilled
00:31:02.940 | Universal Sentence Encoder model, which is literally like a smaller version of
00:31:07.100 | USE. So we'll come down here, and we're using this one here, distiluse-base-multilingual-cased. Okay,
00:31:15.300 | so it's good, it's multilingual, and it's still pretty much the same model as what
00:31:21.340 | Spotify used. So we just have the model details here:
00:31:25.300 | the maximum number of tokens that it will accept is 128, and it will output a
00:31:31.820 | 512 dimensional dense vector. So that's, you know, the meaningful vector that we spoke about earlier.
00:31:38.620 | Now, when we're fine-tuning with the sentence transformers library,
00:31:44.180 | and, you know, by the way, if you do need to install that, you can just do
00:31:47.880 | pip install
00:31:50.620 | sentence-transformers,
00:31:52.900 | like this, okay, obviously in your terminal, or using the exclamation mark to start there in a notebook.
00:32:00.540 | So when we're using the sentence transformers library and we're fine-tuning a model,
00:32:04.860 | we need to reformat our input data, so our pairs, into a list of InputExample objects.
00:32:12.660 | Now, the format of these InputExample objects varies depending on
00:32:20.260 | what task you're actually fine-tuning with. Now,
00:32:23.880 | we're using a ranking task for fine-tuning, or optimizing, our model, and that means that we
00:32:30.380 | only need the query and the episode. Okay, so all we do in that case,
00:32:37.580 | and here as well I'm just
00:32:39.580 | splitting, so we have an evaluation set
00:32:43.140 | down here, and a test set as well, for later on when we're evaluating things.
00:32:48.180 | So all we need in our InputExample, which we have imported from sentence transformers, is the query and episode,
00:32:56.860 | literally it. So it's just a list of these InputExample objects,
00:32:59.980 | with query and episode in there, and you see the sample size that we have there.
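A minimal sketch of that reformatting and splitting step (the split sizes are assumptions):

```python
from sentence_transformers import InputExample
from sklearn.model_selection import train_test_split

# hold some pairs out for evaluation and final testing; the exact
# split sizes here are assumptions
train_pairs, holdout = train_test_split(pairs, test_size=0.3,
                                        random_state=42)
eval_pairs, test_pairs = train_test_split(holdout, test_size=0.5,
                                          random_state=42)

# for a ranking loss, each example just needs the (query, episode)
# pair, with no similarity label
train_examples = [
    InputExample(texts=[query, episode])
    for query, episode in train_pairs
]
```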
00:33:05.820 | So let's continue. And, yeah, so as
00:33:09.860 | I mentioned, we're going to be using a ranking optimization function. Now,
00:33:14.640 | what does that look like? Given a specific query, like the one we used earlier,
00:33:19.180 | so eat better during Christmas holidays, or Xmas holidays,
00:33:22.140 | our model is going to be given a set of episodes, and it's going to need to rank the
00:33:29.100 | true episode
00:33:31.700 | for that particular query as
00:33:34.020 | number one in terms of scores. Okay, so the true pair for this query is
00:33:40.660 | this one here, and it will have to be able to identify that all these other
00:33:45.500 | episodes are not real pairs, or are less semantically similar to this particular query.
00:33:52.100 | So the model learns to give higher scores to more semantically similar episodes for a particular query, and
00:33:57.660 | lower scores to the more dissimilar episodes for a particular query.
00:34:02.340 | And that's why we don't need, like, a label in this case. We don't actually need a particular similarity
00:34:10.140 | label or anything when training our model, which is a really nice
00:34:14.540 | benefit of using ranking functions: you literally just need pairs of data and you can train with that,
00:34:21.340 | which is pretty cool. Now,
00:34:23.340 | the way the model actually needs to, like, compare the queries and episodes is
00:34:29.900 | by taking the query, converting it into a query vector, taking the episode, converting that into an episode vector,
00:34:38.660 | placing them in a vector space, and
00:34:43.380 | comparing the
00:34:45.020 | similarity between them. Now, the way that we are going to be calculating similarity in this case is using cosine similarity,
00:34:52.340 | so we're essentially calculating the similarity, or the angle, between vectors.
00:34:57.980 | So this, in the middle here,
00:35:00.580 | that is our query vector, and this little cone is
00:35:04.740 | representing the angle in a 3D space
00:35:08.220 | around
00:35:10.300 | that query vector. These two
00:35:13.540 | episode vectors here, or episode embeddings, are the two most similar,
00:35:19.420 | so we've got top k equals 2 over here; that's because we're trying to return the two most similar. And
00:35:25.460 | these other vectors, like this one for example: in terms of actual distance, this one might actually be closer
00:35:32.580 | than this one, but in terms of angular distance it's further away.
00:35:37.940 | So that's why we have these two embeddings being selected rather than, you know, any of these other ones.
00:35:44.660 | So the model needs to create these vectors, and it needs to learn how to put
00:35:50.140 | similar-meaning
00:35:52.460 | queries and episodes in a similar vector space, and
00:35:57.620 | dissimilar queries and episodes
00:35:57.620 | as far apart as possible. Now, another thing that we need to consider here is, because we're using a ranking function,
00:36:05.460 | which is going to work by taking a query and then a batch of episodes,
00:36:09.180 | we need to make sure that within that batch of episodes we don't have any duplicates.
00:36:13.100 | Because imagine, so
00:36:15.540 | we're optimizing this model, we've got a query and episode pair, and we're saying, okay, for this query
00:36:20.740 | this episode is the most similar.
00:36:23.660 | We're optimizing the model based on that. If we have the exact same episode further down in that list, then we're telling the model,
00:36:32.220 | actually, okay, these two episodes are exactly the same,
00:36:35.100 | but one is right and one is not. Our model is just going to be confused;
00:36:39.360 | that doesn't make sense, it can't do that. So we have to make sure that we don't have duplicates in
00:36:45.340 | any of the training batches. Now, it's probably not a problem if you have the odd one,
00:36:50.740 | but we don't want any; we just want to be, you know, certain there's nothing in there.
00:36:54.620 | So we can actually do that using the sentence transformers NoDuplicatesDataLoader,
00:37:00.220 | which will handle removing any duplicates from batches for us as well.
00:37:06.220 | We use a batch size of 64. So if we consider the training
00:37:11.400 | optimization function here, we're taking a query and then we're taking a batch of episodes.
00:37:16.300 | Imagine we have a batch size of three, right, or even two. With a batch size of two,
00:37:22.580 | we're just comparing our one query against two episodes.
00:37:26.100 | Our model, even if it's just randomly guessing, can get the right answer around 50% of the time
00:37:33.340 | with such a small batch size. If we increase that to a hundred, then
00:37:38.220 | the model is going to perform a lot more poorly when randomly guessing.
00:37:44.300 | The consequence of that is that using a larger batch size for this particular training method is
00:37:53.900 | typically going to result in you having a better performing model, because it makes
00:38:00.060 | the task of ranking harder for the model,
00:38:03.540 | so your model must get better in order to actually accurately rank everything.
00:38:08.300 | So we increase the batch size as much as our hardware will allow us to.
00:38:13.980 | Now, after that, we can initialize the loss function.
00:38:18.480 | This is where the ranking optimization comes in. So in sentence transformers, the ranking function is this MultipleNegativesRankingLoss.
00:38:26.520 | Okay, so we initialize that.
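In code, with the sentence transformers library, that setup looks roughly like this (the base model name matches the distilled USE checkpoint mentioned earlier):

```python
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

model = SentenceTransformer("distiluse-base-multilingual-cased")

# NoDuplicatesDataLoader guarantees an episode never appears twice in
# a batch, which would give the ranking loss contradictory targets
batch_size = 64  # as large as the hardware allows
loader = NoDuplicatesDataLoader(train_examples, batch_size=batch_size)

# MNR loss treats each query's paired episode as the positive and
# every other episode in the batch as an in-batch negative to rank
loss = losses.MultipleNegativesRankingLoss(model)
```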
00:38:28.640 | There is just one more step before we actually move on to fine-tuning. So if we have a look over here, and
00:38:35.580 | we go down to offline evaluation: in order to evaluate a model,
00:38:41.720 | they use two different types of metrics, in-batch metrics and full-retrieval metrics.
00:38:45.760 | We're going to do both. So for the in-batch metrics, we're just calculating recall and MRR at the batch level.
00:38:52.200 | Okay, so what we can do to replicate that is use this RerankingEvaluator from sentence transformers, and
00:39:00.920 | we'll use that to perform ranking and calculate the MRR metric
00:39:06.080 | using
00:39:07.920 | our evaluation set.
00:39:10.840 | So, just one thing: as we have already done here, we've removed duplicates from batches. We need to do the same
00:39:17.580 | before we actually feed data into our RerankingEvaluator, and there are a lot, because we create three queries per episode,
00:39:24.080 | so we definitely do need to remove those, so remove any duplicates
00:39:28.760 | using this, and in the end we have a thousand unique pairs for our evaluation set.
00:39:36.080 | So then we feed them into the RerankingEvaluator, which requires a particular format of query,
00:39:43.160 | any positives we have, so we just have one positive per query here,
00:39:47.080 | there's just a single list, and then negatives, and in here we just need to pass in all of the other episodes
00:39:54.460 | that are not the positive episode for that particular query.
00:39:58.940 | So we do that in here, and then we initialize the RerankingEvaluator,
00:40:04.840 | and I set MRR at k to consider the top five
00:40:10.000 | items
00:40:11.840 | when it's calculating its score. I'm not going to go into the metrics, because it takes a little bit more time.
00:40:20.800 | Calculating the performance: so this is the MRR at
00:40:23.660 | 5 for the model without any fine-tuning, so we get 0.68. Okay.
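A rough sketch of that evaluator setup (this assumes eval_pairs has already been deduplicated as described above; the simple loop building negatives is the straightforward, unoptimized version):

```python
from sentence_transformers.evaluation import RerankingEvaluator

# one sample per query: its true episode as the single positive, and
# every other episode in the evaluation set as a negative
samples = []
for query, episode in eval_pairs:
    negatives = [ep for _, ep in eval_pairs if ep != episode]
    samples.append({
        "query": query,
        "positive": [episode],
        "negative": negatives,
    })

evaluator = RerankingEvaluator(samples, mrr_at_k=5)
print(evaluator(model))  # MRR@5 before fine-tuning, ~0.68 in the video
```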
00:40:31.880 | So at this point we're actually ready to go ahead and fine-tune our model, and
00:40:35.920 | this is basically the easiest step. So we set the number of epochs to one.
00:40:42.200 | I did try more epochs;
00:40:45.220 | typically with sentence transformers you really only need to train for one epoch, depending on the model and the data,
00:40:51.200 | and anything more actually degraded the performance.
00:40:54.560 | So one epoch here, and we always, or typically, use a number of warm-up steps.
00:41:00.480 | So here, for the first 10% of training steps,
00:41:03.200 | the learning rate is going to be slowly increasing up to the default learning rate,
00:41:08.440 | which I think is like 1e-5 or something along those lines.
00:41:13.080 | And then we're just saving the model that we train
00:41:16.840 | into this directory here, so this distiluse-podcast-nq; NQ here is natural query.
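The fine-tuning call itself is short; a sketch, reusing the loader, loss, and evaluator from above:

```python
epochs = 1
# warm the learning rate up over the first 10% of training steps
warmup_steps = int(len(loader) * epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    evaluator=evaluator,                  # logs MRR@5 to the eval dir
    output_path="distiluse-podcast-nq",   # NQ for "natural query"
)
```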
00:41:24.160 | And then, yeah, so that trains.
00:41:29.680 | It shouldn't take too long; I think maybe this took an hour, maybe two hours, to train,
00:41:36.560 | so it's pretty quick. And
00:41:39.160 | with that we can actually look at the evaluation: so, you know, how did this model perform on the
00:41:47.760 | evaluator that we initialized at the start? And we can actually see that. So, I
00:41:54.160 | mean, in another environment here where we're actually training the model, we go to this distiluse-podcast-nq
00:42:00.820 | folder, and then you will see this eval
00:42:03.200 | directory. Go in here, open this, and we can see we have,
00:42:08.760 | over on the right here, the MRR at five value, and it's 0.888.
00:42:15.960 | So that is a big improvement from what we had up here, which is 0.68. So it's a
00:42:22.960 | 20 point increase, which is really good.
00:42:25.800 | That's on small batches, though, so
00:42:28.920 | that doesn't really replicate the real-world use of this model.
00:42:33.200 | So what we want to do in the final evaluation step is replicate that.
00:42:37.520 | So what we're going to do is take the test data we had, which has many
00:42:43.040 | thousands of
00:42:45.520 | episodes in there; we're going to encode all of those, and
00:42:48.360 | then we're going to perform a semantic search as we would expect this to be used in real life. So what we'll do is
00:42:55.520 | set up the vector database with
00:42:58.920 | thousands of
00:43:01.080 | episode vectors in there from our test data set, and then we're going to calculate the recall value, this time
00:43:07.440 | for our fine-tuned model and also our
00:43:13.400 | non-fine-tuned model. So we initialize Pinecone, create an evaluation index here, and make sure we use the cosine metric.
00:43:21.440 | We connect to that. Oh, and if you need an API key, you need to go to
00:43:26.120 | pinecone
00:43:29.240 | .io; it's free, so, you know, this is all free, you don't need to worry about anything there.
00:43:35.760 | We create our index, we
00:43:39.960 | connect to it here.
00:43:44.840 | Then, what we can do is, so before we index our test data, we remove duplicates like we did with the evaluation set earlier.
00:43:51.120 | Same thing, I'm not going to go through it again. And
00:43:53.160 | then here, so what I'm doing is going through again, just to make sure there's definitely no duplicates in there;
00:44:01.240 | there shouldn't be anyway, but just in case, going through and
00:44:04.680 | creating batches of episodes to actually encode here using our new model.
00:44:10.520 | This is using the distiluse-
00:44:12.880 | podcast-nq model, and then we upsert them, so insert them, into our Pinecone
00:44:18.520 | vector database, and then just refreshing the batch and doing it again, all in batches.
00:44:23.740 | And then at the end we can look at the index stats, so you can see how many vectors we actually have in there,
00:44:28.800 | which is 18 and a half thousand.
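A sketch of that indexing step, written against the pinecone-client 2.x style API that was current at the time (API key and environment are placeholders, and newer Pinecone clients use a different interface):

```python
import pinecone  # pinecone-client 2.x style API

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
# our model outputs 512-d vectors; cosine matches how we trained
pinecone.create_index("podcast-eval", dimension=512, metric="cosine")
index = pinecone.Index("podcast-eval")

# encode and upsert the (deduplicated) test episodes in batches,
# using the position in the list as the vector ID
test_episodes = [episode for _, episode in test_pairs]
for i in range(0, len(test_episodes), 64):
    batch = test_episodes[i:i + 64]
    embeds = model.encode(batch).tolist()
    ids = [str(i + j) for j in range(len(batch))]
    index.upsert(vectors=list(zip(ids, embeds)))

print(index.describe_index_stats())  # total vector count in the index
```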
00:44:31.700 | Then what I'm doing here is I'm going to calculate the recall at k.
00:44:37.720 | So I'm going to loop through all the queries we have, so all the queries from our test set;
00:44:43.200 | these are synthetic queries first, and
00:44:45.720 | just looking at, okay, what is the recall using those synthetic queries?
00:44:52.880 | Now, it's 0.88, which is like incredibly good, but we're using synthetic queries here, so it's
00:45:01.200 | not really
00:45:04.000 | representative of real, like, human queries.
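For reference, the loop that produced that 0.88 might look roughly like this; it assumes the positional vector IDs from the upsert sketch above:

```python
def recall_at_k(query_id_pairs, k: int = 5) -> float:
    # fraction of queries whose true episode ID lands in the top k
    hits = 0
    for query, true_id in query_id_pairs:
        xq = model.encode(query).tolist()
        res = index.query(vector=xq, top_k=k)
        if true_id in [match["id"] for match in res["matches"]]:
            hits += 1
    return hits / len(query_id_pairs)

# synthetic queries, paired with the ID their episode was upserted under
synthetic = [(q, str(i)) for i, (q, _) in enumerate(test_pairs)]
print(recall_at_k(synthetic))  # ~0.88 with synthetic queries
```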
00:45:07.480 | So this is where the curated queries come in.
00:45:11.320 | So I went through, I found different episode descriptions,
00:45:15.120 | which were at the indexes 1, 8, 14 and so on, and I just created, like, a
00:45:20.840 | query for each one of those episodes that made sense to me.
00:45:25.720 | Now, this isn't perfectly accurate, in that we could, you know, use this and return another episode that's actually more relevant,
00:45:34.560 | but using what I've done here, it would show up as not a match and bring down the score,
00:45:40.840 | so just bear that in mind as well.
00:45:43.200 | So once we use those curated, like, true query-episode pairs, or more sort of human
00:45:51.240 | queries,
00:45:52.440 | we get a
00:45:53.960 | recall at k of 0.57. Now, you know,
00:45:57.800 | it's definitely not as impressive as the 0.88 we had before; it's okay, but it's nothing special.
00:46:03.160 | But what I want to do here is just compare this to what we had before, so the pre-trained model without fine-tuning.
00:46:09.800 | Okay, so to do that, I'm just creating a new index. If you're using the free version of Pinecone,
00:46:16.760 | you can only have one index at a time, so you will have to, you can delete an index, you know,
00:46:22.520 | so you can then refresh it and do this.
00:46:25.320 | So what we do is, I'm going to initialize the pre-trained model
00:46:31.480 | that hasn't been fine-tuned on our particular data set, and
00:46:34.760 | here I am
00:46:38.640 | creating a new
00:46:40.800 | index, eval-0, connecting to that index, and I'm going through doing the exact same thing as I did before,
00:46:46.480 | okay, so just pushing all of those
00:46:49.520 | vectors, or episode embeddings, that have been encoded using the
00:46:56.000 | older model that hasn't been fine-tuned, and then calculating the recall, okay, again using
00:47:01.440 | the curated data set, and we get a terrible score of 0.28.
00:47:08.040 | So we can see there's actually a massive improvement, from, you know, 0.28 to
00:47:14.920 | 0.57. That's a
00:47:18.120 | 29 point improvement, which is huge, especially given that our training data was synthetic.
00:47:24.840 | Right, so a synthetic data set is never going to perform as well as a genuine data set,
00:47:30.020 | so if we had those other two data sources that Spotify used,
00:47:33.080 | we could probably create something that is, like,
00:47:36.560 | very impressive compared to what we've done here.
00:47:40.880 | But given we just used episode data without any queries and synthetically created queries,
00:47:46.320 | I think this is really impressive, and it shows that
00:47:50.800 | Spotify's methodology, at least in terms of, you know, adding these synthetic queries, definitely does
00:47:56.440 | contribute to their overall model, and it shows us a really cool way to do this without any real training data.
00:48:03.680 | Now,
00:48:06.400 | another thing that we should really think about here is, how well does this scale?
00:48:11.240 | So we're only using 18 and a half thousand
00:48:14.680 | vectors in this, like, evaluation case.
00:48:19.440 | Using Pinecone, we're quite easily able to go up to, you know,
00:48:24.880 | hundreds of millions of episodes, or episode vectors, in there,
00:48:28.680 | and even billions. So in terms of scalability,
00:48:33.320 | that is,
00:48:36.320 | well, unless you're going into this sort of trillion scale,
00:48:40.160 | this is very possible with pretty much every data set that you're going to be using, and
00:48:46.040 | it will still be incredibly fast, which is, I think, really cool. I think, you know, that's
00:48:51.800 | really awesome, the fact that we can do this with no data, make it incredibly scalable, and,
00:48:57.040 | you know, build something really, really awesome. So that's it for this sort of walkthrough.
00:49:04.280 | I will say, just, if you are going to build something
00:49:08.480 | like this with Pinecone, and, you know, doing all this sort of cool vector search
00:49:14.920 | stuff, let me know, because over at Pinecone
00:49:17.760 | we are looking for people that would like to showcase the work that they're doing,
00:49:21.840 | and it would be really cool to see what you're building.
00:49:25.400 | And if you'd sort of like to share that and get other people seeing what you're building,
00:49:31.360 | I think that's a really good way to do that.
00:49:33.760 | So if you're interested in that, just go over to the community page at Pinecone, and
00:49:41.480 | you'll be able to find how to submit any projects that you're working on there.
00:49:46.920 | So I hope all this has been useful and interesting.
00:49:50.680 | Thank you very much for watching, and I will see you again in the next one. Bye.