Spotify's Podcast Search Explained
Chapters
0:00 Intro
4:16 NLP in Semantic Search
8:35 Why Now?
9:29 Transformer Models
11:52 Sentence Transformers
13:12 Vector Search
15:56 How Spotify Built Podcast Search
17:35 Data Source, Fine-tuning, and Eval
22:58 Code Implementation, Dataset
24:44 Data Preparation
26:39 Query Generation
29:54 Fine-tuning a Podcast Model
41:40 Evaluation
48:05 Does it Scale?
49:00 Sharing Your Work
In the past few years, podcasts have become increasingly popular, and leading the charge in podcasts is Spotify. Spotify only entered the podcast market very recently, in 2018, yet by 2021 they had already usurped Apple as the leader in podcast listenership. Apple had been doing this for a very long time, since, I think, 2008 or maybe even earlier, but despite that, Spotify is now ahead of everyone. And to back their investment in podcasts, they've also invested heavily in the technology that powers the different components of the podcast experience, whether that's Anchor, an all-in-one app that allows people to record and publish podcasts, or what we're going to focus on: natural language search for podcasts.
Spotify's natural language search, or semantic search, for podcasts is, I think, a really interesting use case because it enables a more intuitive search experience. It allows us to search through their podcast catalog, which is huge, using natural language queries rather than the typical keyword or term matching you'd find in most places. Before, when we were searching, we had to know the exact words or terms we were looking for. But as humans, we don't really think in terms of exact words; our language is about concepts, ideas, and meaning, not term matching or keyword matching. So it makes sense that we might want to create a search experience that captures the meaning in language rather than just the terms and words.
Imagine we wanted to find a podcast that talks about eating healthily over the winter holidays. We might search something like "eat better during Xmas holidays". Now, in the data we're going to be using, there is a podcast description that talks about exactly this: Priya Alexander on how to stay healthy over Christmas, and about her letter to patients. If you compare those two, they don't share any of the same words, despite being about pretty much the same thing. So if you try to use this as a term-matching query, it's not going to work. But if we swap that for a natural language query, a semantic search, the meanings overlap; the genuine human meaning behind those two phrases is very similar. When semantic search is done properly, it is able to identify that the meaning behind the two does overlap: we're talking about the Christmas holidays, being healthy, eating better.
Enabling meaningful search in this way is not easy. So what I want to do is look at how Spotify have done this, and then actually replicate it and see what we can do as well. The technology powering this search has two parts: the NLP, or natural language processing, side, and the vector search component. These technologies can be seen as two steps in the search process. First, a language model (this is the NLP side) takes the query and converts it into something we call a dense vector, a numerical representation of the meaning behind whatever that query is. That sounds confusing, but we'll see that it's not. These dense vectors can then be compared as you would compare normal vectors. Imagine you have two points in a particular space and you want to calculate the distance between them: we can do exactly that, but with the meaning behind these queries.
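To make that concrete, here's a minimal sketch of encoding two queries and comparing the resulting dense vectors. The model name is just a convenient example, not the one Spotify used or the one we fine-tune later:

```python
# A minimal sketch: two queries that share meaning but no keywords still
# score high when compared as dense vectors.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

a = model.encode("eat better during Xmas holidays")
b = model.encode("how to stay healthy over Christmas")

print(util.cos_sim(a, b))  # cosine similarity of the two dense vectors
```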
Let's have a look at a visual example. Typically these dense vectors have many dimensions; you're looking at 700-plus dimensions in a lot of models. Here I've used PCA to reduce the dimensionality of those very high-dimensional vectors down to a 3D space, just so we can visualize it while still roughly maintaining the relationships between the vectors. If we look here, we can see a grouping of vectors: these three over here. Let's zoom in a little and have a look at what they are.
We have "podcasts about cooking and writing", "a chat with a chef who wrote books", and "interview with cookbook author". They're all about cooking and writing, and they've all been grouped together. That's because our language model has taken these queries and encoded them in such a way that the vector representations of similar queries end up close together. The meaning of all of these, although not identical, is very similar (cooking, writing books, and so on), and we can see that represented in the proximity of the vectors.

Now let's look at something that's close but not within that same cluster. Over here we have "how do I cook great food?", so we're talking about cooking again, and then further over here we have "eat better during Christmas holidays". "How do I cook great food?" sits in between: it's about cooking, like the cooking-and-writing cluster, while "eat better during Christmas holidays" is specifically about food, so it lands in the middle of them both. Then we have a few more over here about writing podcasts, and among the most similar vectors are "how to tell more engaging stories" and "how to keep readers interested", again talking about writing. And up here, by itself, we see "superhero film and arts", which is obviously very different from everything else we've looked at. That's what I mean by vectors that capture meaning.
All of those vectors were encoded by a special kind of NLP model called a sentence transformer. That sentence transformer has been specially trained to do what we just saw: group similar things together and separate dissimilar things. Once we have those vectors from our sentence transformer, we need a way to compare them, and that's where the vector search component comes in. Imagine you make a query: we convert that query into a vector, place it within that vector space, and then search for the most similar vectors, returning those as the results for your particular query.
NLP and vector search have both been around for a long time, but both fields have had very recent developments that acted as catalysts for the performance increases, and subsequent adoption, of semantic search. In NLP, we've had the introduction of transformer models; in vector search, approximate nearest neighbor search algorithms. Transformers and approximate nearest neighbor search have powered the growth of semantic search, but it's not inherently clear what either of those are. I'm assuming most of you probably have no idea what either of those are; if you do, great, but otherwise, no problem, because we're going to go through both of those concepts as well.
Starting with transformer models: transformer models typically consist of two components, which is quite important for us to know. There's the core of the transformer, and then you usually have a head, which adapts the transformer to a particular task. Now, there's just one problem: the core of these transformer models is huge. They're massive models, and it is far too expensive for most of us to train the core of a model from scratch. Take BERT, which is one of the most popular language models, and not even a particularly large one: its training cost was reported at, I think, two and a half thousand to fifty thousand dollars for a small one, and for the larger BERT model that shifts up to eighty thousand to 1.6 million dollars.
So how are transformers useful to us if we can't afford to train them? Well, the way we typically go about using transformer models is this. The core is pre-trained by the likes of Microsoft or Google; it costs them a lot of money to train it; then the model is made publicly available. Other organizations take this transformer core and add different heads to it; these heads are like a final few layers at the end of the model. Using that extended model with the head, the model can be fine-tuned using a lot less computing power, which brings it within the realm of feasibility for other, more normal organizations. With that extended transformer model, you can go ahead and actually apply it to your specific task.
In the case of our example, podcast search, we might want to take a BERT model, like BERT-base, which has been pre-trained by Google. We would take that and add what's called a pooling head onto the end of it, and that converts it into what we call a sentence transformer. It takes an input like a normal transformer, some sort of text input, and the normal transformer outputs a set of token-level embeddings. Now, we can't use token-level embeddings directly when comparing sentences, because a single sentence (one input sequence) for BERT can contain up to 512 tokens. We really need a way to compress that down into one single vector representing the full input that was given to our transformer model. That's where the pooling layer comes in: it takes all of those token-level embeddings and compresses them into a single vector, which is our sentence vector or sentence embedding. There are different types of pooling layers, and we'll talk about one later, but they all consume the same input and output that single embedding. It is this sort of model that we refer to as a sentence transformer, and that sentence embedding at the end is what we referred to earlier as a dense vector: the numerical representation of whatever we fed into the transformer model.
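Here's a minimal sketch of that construction using the sentence-transformers library: a pre-trained transformer core plus a mean-pooling head. BERT-base is used purely for illustration; it isn't the model we fine-tune later:

```python
# A sketch of building a sentence transformer: transformer core + pooling head.
from sentence_transformers import SentenceTransformer, models

bert = models.Transformer("bert-base-uncased", max_seq_length=128)
pooling = models.Pooling(
    bert.get_word_embedding_dimension(),  # 768 for BERT-base
    pooling_mode="mean",  # average all token embeddings into a single vector
)
model = SentenceTransformer(modules=[bert, pooling])

vec = model.encode("eat better during Xmas holidays")
print(vec.shape)  # one 768-dimensional dense vector for the whole input
```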
Given that dense vector, we move on to the next step, which is vector search.
Approximate nearest neighbor search allows us to search efficiently through a lot of vectors, because in many use cases, Spotify for example, think about how many episodes they have on the platform. It takes a very long time if you compare your query to every one of those vectors; in practice it's infeasible at scale if you're using a typical k-nearest neighbors search, otherwise known as exhaustive search. No matter what hardware you run, it's going to take a long time. Approximate nearest neighbor search lets us speed that up by, quite literally, approximating the answer. Good approximate nearest neighbor algorithms let you approximate the answer with a very high degree of accuracy, something like 99%, while making your search incredibly fast: sub-second, sub-half-a-second, and in a lot of cases in the 10-millisecond range. It's worth noting that across these different algorithms there's usually a trade-off between speed (your search times) and accuracy: you can have an even faster search where the results are not quite as good, returning maybe 70% accuracy rather than 99%. There's always some trade-off between the two, but the algorithms now are generally very good, and you can get incredible accuracy and incredible speed.
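To make that trade-off tangible, here's a tiny sketch using hnswlib, an ANN library chosen purely for illustration (it isn't what Spotify uses, and the parameter values are arbitrary). The ef parameter is the knob: higher means better recall but slower queries:

```python
# A sketch of approximate nearest neighbor search with HNSW (via hnswlib).
import hnswlib
import numpy as np

dim = 512
vectors = np.random.rand(100_000, dim).astype(np.float32)  # dummy embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors)

index.set_ef(50)  # higher ef = better recall, slower queries
labels, distances = index.knn_query(vectors[:1], k=10)  # top-10 neighbors
```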
So if we merge those two together, our sentence transformers and our approximate search, we now have a toolset capable of performing semantic search at incredibly large scales. What I want to do now is look specifically at how Spotify did it.
To build this kind of search, Spotify needed a model that encodes both queries and episode metadata, or descriptions, into the same vector space. Now, there are existing pre-trained sentence transformer models, but Spotify found a few issues with those, specifically with SBERT (Sentence-BERT). They needed a model capable of supporting multilingual queries, because Spotify obviously has a lot of content from all over the world, and they couldn't use SBERT because it has been trained on English-only data; its performance on other languages is not great without further fine-tuning. So the out-of-the-box, pre-trained Sentence-BERT model couldn't be used; it didn't satisfy what Spotify needed. With that in mind, they decided to start with the pre-trained Universal Sentence Encoder (USE) model instead. This covered the multilingual issue, and yes, the USE model still needs to be fine-tuned, but they were going to need to do that anyway, so it's not really an issue.
So let's have a look at what sort of data they used to fine-tune their model. Of course, Spotify has a ton of search logs, and they took those search logs and used them to create two different data sources. The first was simple: for each successful search, take the query from that search and the episode it led to, and that created the first little data source you see here. The second is a bit more interesting. Whenever they found an unsuccessful search that was immediately followed by a successful search, they kept the unsuccessful query. The logic behind that: if you're typing something, searching, and it doesn't work, you probably used a more natural-feeling, natural-language query; when it fails, you change it to be a bit more robotic, to fit what you expect a keyword search to match. So they took that unsuccessful search query and the episode the user found after their subsequent successful search, and put those together to create the second little data source.

I've used dotted lines here because we can't replicate these: we don't have Spotify's past search logs. But be aware that the way they would be used is exactly the same as the third data source.
That third data source, over here, is the one we are going to replicate. It's built by taking podcast episodes (the little diamond over here) and transforming them. By transforming, I mean taking the podcast show title and description, plus the episode title and description, and concatenating all of that together to create a sort of combined episode description. We then feed that episode data into a query generation step. Query generation is super interesting: we use a query generation model to create synthetic queries for our episodes, and that produces this third data source, which is synthetic query-to-episode pairs. We'll split that into a training set and use it to fine-tune a model, in the same way those other two data sources would have been used, and we'll be fine-tuning a pre-trained universal sentence encoder model. In the end, what we'll have is this podcast universal sentence encoder model, which we can then use to encode both our queries and our episodes.
There's one other data source. We have our podcast episodes over here; we place them into that transform step and create the podcast episode descriptions. The other data source you see over here is manually created: Spotify did this and used it purely for evaluating the model; it was not used in training. They manually curated a set of queries for particular podcast episodes. We will do the same; we'll write, I think, around seven queries, see how that works, and use those to evaluate our models.
Once we've done all of that, we come to the evaluation, and also to how you would actually use this. You take the fine-tuned podcast model, bring it over here into the middle, and take the pairs from your query-generated data. You encode everything and place it into a vector search engine; in this case we're using Pinecone. This allows us to store all of our episodes as embeddings, and then we can query and see whether we return the correct episode for each of our synthetic query-episode pairs. We'll also do the same using the manually curated query-episode pairs. And this is exactly what you would do in production as well: after you've evaluated, you take all of your episode embeddings, put them in the vector database (the Pinecone index at the bottom there), and then let your users query. The query goes into your podcast sentence transformer model and is then passed to your vector database, which identifies the most similar episodes. So that's how it all works at a very high level. If it's confusing, no problem; we're going to work through all of it. Now we'll move on to the actual implementation.
The first thing we need is a dataset that replicates, as closely as possible, the episode data Spotify used. We'll need Kaggle for this: you can install the Kaggle API using pip install kaggle, and then you'll need an account, so sign up on kaggle.com and you can get an API key. You just come over here, click on your profile in the top right, go to (I think) Account, and come down to the API section, where you can create a new API token; that downloads a kaggle.json file. Then, if you try to import kaggle (so you run import kaggle), it will say you are not authenticated, because you don't have kaggle.json in the expected directory. All you do is put your kaggle.json in the directory specified by the error. Once you've done that, you should be able to import kaggle without any errors, and we can move on to the data download.
So we just authenticate our API; I'm not going to go too far into depth here, and there will be a link to this notebook so you can read through everything. All we're going to do is download these two datasets. Let me show you the dataset on Kaggle: it's all podcast episodes published in December 2017, from a company called Listen Notes. We download the podcasts and episodes data and extract them, because they'll be in zip files, and we can see that podcasts contains the details of the podcast shows themselves, while episodes contains the details of the individual episodes.
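A sketch of that download step with the Kaggle API; the dataset slug is my best guess from the dataset's title, so double-check it against the Kaggle page:

```python
# Downloading the Listen Notes data via the Kaggle API.
import kaggle  # raises an error if kaggle.json is not in place

kaggle.api.authenticate()
kaggle.api.dataset_download_files(
    "listennotes/all-podcast-episodes-published-in-december-2017",  # assumed slug
    path="./data",
    unzip=True,  # extracts the podcasts and episodes CSVs
)
```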
What Spotify did was concatenate the show title and description with the episode title and description to create a single episode description, so we do the same. We merge the two tables on the unique podcast ID and get a data frame with everything we need inside it. I'm also doing a little bit of data cleaning here: stripping whitespace from the features we care about (the show title and description, and the episode title and description), and then removing rows with null values in any of those columns. We have a lot of data here anyway, and we don't even use all of it, so that's not a problem. Then I concatenate everything, putting a full stop between each field, and convert that into a list, so we end up with a list of these episode texts. You'll see they vary a lot; it's not the cleanest data.
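Something like this sketch covers that preparation; the exact column names in the CSVs are assumptions, so adjust them to match the files:

```python
# A sketch of the data preparation: merge shows with episodes, clean,
# and concatenate into one text per episode.
import pandas as pd

podcasts = pd.read_csv("data/podcasts.csv")
episodes = pd.read_csv("data/episodes.csv")

# Join each episode to its parent show on the podcast ID (names assumed)
df = episodes.merge(
    podcasts, left_on="podcast_uuid", right_on="uuid",
    suffixes=("_ep", "_show"),
)

features = ["title_show", "description_show", "title_ep", "description_ep"]
df[features] = df[features].apply(lambda col: col.str.strip())  # trim whitespace
df = df.dropna(subset=features)  # drop rows missing any feature

# One combined text per episode, fields separated by full stops
episode_texts = df[features].agg(". ".join, axis=1).tolist()
```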
One thing to bear in mind: shuffling is going to randomize the order, so later on, when we get to the curated queries, the matching episodes won't necessarily be in your test set. You may need to make up some new queries of your own.
At that point we have our episode data, and we need to move on to query generation, because we need query-episode pairs. Spotify fine-tuned a BART model on MS MARCO. I'm not going to fine-tune a model, because there are, honestly, so many query generation models that have already been fine-tuned on MS MARCO, including BART models, that I don't think there's any point in us fine-tuning one just to replicate what already exists. I tested a few different BART and T5 query generation models, and I found this one to be the best for this particular use case. It also has some multilingual support; I'm not sure how much, but it does manage to produce queries that make sense in different languages, which is definitely useful and aligns nicely with what Spotify wanted.
Then I'm using the transformers library, which is Hugging Face transformers (you can pip install transformers if you need to), and initializing the tokenizer and the model. This is going to handle our query generation. At the end here you'll see I move the model to CUDA: that's because I have a CUDA-enabled GPU, which makes things a lot faster, because this can take a long time.
So we move on to the query generation loop. It's fairly long, sorry, and this isn't the quickest way you can do it. A larger batch size means faster processing, but it's limited by the size of your GPU. Then there's the number of queries I'm generating for each episode: Spotify didn't specify what they used here, but I'm using three, because that's in line with the approach taken by other query generation techniques, GenQ and GPL. Then we just go through everything in batches: we tokenize each batch, generate queries, and decode those queries back to human-readable text, because the model just outputs token IDs. Then we loop through, pair each query with its episode, and place those together as synthetic query-episode pairs in this pairs list. We can see a few here: a query followed by its episode, and so on. You can also see the multilingual support in action here.
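Here's a sketch of that loop. The checkpoint name is illustrative (an MS MARCO doc2query-style model), not necessarily the exact one used in the notebook:

```python
# A sketch of batched synthetic query generation with a seq2seq model.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "doc2query/msmarco-t5-base-v1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

pairs = []
batch_size = 64   # limited by GPU memory
num_queries = 3   # three synthetic queries per episode, as in GenQ/GPL

for i in range(0, len(episode_texts), batch_size):
    batch = episode_texts[i : i + batch_size]
    inputs = tokenizer(
        batch, truncation=True, padding=True,
        max_length=384, return_tensors="pt",
    ).to("cuda")
    with torch.no_grad():
        outputs = gen_model.generate(
            **inputs, max_length=64, do_sample=True,
            top_p=0.95, num_return_sequences=num_queries,
        )
    queries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for j, query in enumerate(queries):
        pairs.append((query, batch[j // num_queries]))  # (synthetic query, episode)
```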
We now have a data source we can use for fine-tuning a model. Spotify talk about this in their article, which is here. They tried BERT, but as I said, they decided it wasn't suitable, so we're not using it either, for the multiple reasons I mentioned before. What they used instead is this universal sentence encoder model, and we're going to do something similar. One thing I will note is that they used TensorFlow Hub. We could use TensorFlow Hub too, but then we'd be stuck with TensorFlow, which makes our lives harder if we want to use Hugging Face, PyTorch, and the sentence-transformers library. We can't get the universal sentence encoder itself as a pre-trained model from the sentence-transformers library, but there is a distilled universal sentence encoder model, which is literally a smaller version of USE. So we come down here, and we're using this one: distiluse-base-multilingual-cased. It's multilingual, and it's still pretty much the same model as what Spotify used. Looking at the model details: the maximum number of tokens it accepts is 128, and it outputs a 512-dimensional dense vector, the meaningful vector we spoke about earlier. By the way, if you need to install sentence-transformers, you can just run pip install sentence-transformers in your terminal, or with an exclamation mark at the start in a notebook.
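Loading it looks like this; whether you grab the v1 or v2 checkpoint is up to you (v2 here is my assumption):

```python
# Loading the distilled multilingual USE model from sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
print(model.max_seq_length)                      # 128 tokens max input
print(model.get_sentence_embedding_dimension())  # 512-dimensional vectors
```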
When we're fine-tuning with the sentence-transformers library, we need to reformat our input data, our pairs, into a list of InputExample objects. The format of an InputExample varies depending on which task you're fine-tuning with. We're using a ranking task to optimize our model, and that means we only need the query and the episode. So all we do in that case is build the training list down here, and a test set as well, for later on when we're evaluating things. All we need in our InputExample, which we've imported from sentence_transformers, is the query and the episode. It's literally just a list of these InputExample objects, each with a query and an episode inside, and you can see the sample size we have there.
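A minimal version of that formatting, with an assumed train/test split size:

```python
# Formatting (query, episode) pairs for a ranking objective; no labels needed.
from sentence_transformers import InputExample

train_pairs = pairs[:-1000]  # hold out some pairs for evaluation (split assumed)
test_pairs = pairs[-1000:]

train_examples = [
    InputExample(texts=[query, episode]) for query, episode in train_pairs
]
```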
As I mentioned, we're going to be using a ranking optimization function. What does that look like? Given a specific query, like the one we used earlier, "eat better during Christmas (or Xmas) holidays", our model is given a set of episodes, and it needs to rank the true episode as number one in terms of score. So the true pair for this query is this one here, and the model has to work out that all the other episodes are not real pairs, or are less semantically similar to this particular query. The model learns to give higher scores to more semantically similar episodes for a particular query, and lower scores to the more dissimilar episodes. That's why we don't need a label in this case: we don't need a similarity label or anything else when training the model, which is a really nice benefit of using ranking functions. You literally just need pairs of data and you can train with that.
The way the model compares queries and episodes is by taking the query and converting it into a query vector, taking the episode and converting it into an episode vector, and then calculating the similarity between them. The way we are going to calculate similarity in this case is cosine similarity, so we're essentially calculating the angle between vectors. This here is our query vector, and this little cone is the region of most similar vectors. These two episode vectors here, or episode embeddings, are the two most similar; we've got top_k equals 2 over here because we're trying to return the two most similar. These other vectors, like this one for example, might actually be closer in terms of straight-line distance, but in terms of angular distance they're further away. That's why those two embeddings were selected rather than any of the others. So the model needs to create these vectors, and it needs to learn how to place queries and their episodes close together in vector space, and dissimilar pairs as far apart as possible.
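For reference, cosine similarity is just the normalized dot product, so it depends only on the angle between the vectors, not their lengths:

```python
# A bare-bones cosine similarity for two embedding vectors.
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```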
Another thing we need to consider, because we're using a ranking function that takes a query and a batch of episodes, is that within that batch of episodes we must not have any duplicates. When we're optimizing this model, we've got a query-episode pair, and we're telling the model: for this query, this episode is the right one. If the exact same episode appears again further down in that batch, we're telling the model that two identical episodes exist but one is right and one is not. Our model is just going to be confused; that doesn't make sense, and it can't learn from it. So we have to make sure there are no duplicates in any of the training batches. It's probably not a problem if you have the odd one, but we don't want any; we just want to be certain there's nothing in there. We can do that using the sentence-transformers NoDuplicatesDataLoader, which handles removing duplicates from batches for us.
We use a batch size of 64. Consider the training optimization function here: we take a query and then a batch of episodes. Imagine we have a batch size of three, or even two. With a batch size of two we're comparing our one query against just two episodes, so our model, even if it's randomly guessing, gets the right answer around 50% of the time. If we increase that to a hundred, a randomly guessing model performs far worse. The consequence is that a larger batch size for this particular training method typically results in a better-performing model, because it makes the ranking task harder; the model must genuinely get better in order to rank everything accurately. So we increase the batch size as much as our hardware allows. After that we can initialize the loss function, and this is where the ranking optimization comes in: in sentence-transformers, the ranking function is MultipleNegativesRankingLoss.
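Putting the batching and the loss together, a sketch of that setup (it reuses the model and train_examples from above):

```python
# NoDuplicatesDataLoader keeps identical episodes out of a batch;
# MultipleNegativesRankingLoss treats every other episode in the batch
# as a negative for each query.
from sentence_transformers import losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

batch_size = 64  # as large as the hardware allows
loader = NoDuplicatesDataLoader(train_examples, batch_size=batch_size)
loss = losses.MultipleNegativesRankingLoss(model)
```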
There's just one more step before we actually move on to fine-tuning. If we look over here at Spotify's article and go down to offline evaluation, they use two types of metrics to evaluate the model: in-batch metrics and full-retrieval metrics. We're going to do both. For the in-batch metrics, that means calculating recall and MRR at the batch level. What we can do to replicate that is use the RerankingEvaluator from sentence-transformers, and we'll use it to perform ranking and calculate the MRR metric. One thing: just as we removed duplicates from the training batches, we need to do the same before we feed data into our reranking evaluator, and there are a lot of duplicates, because we create three queries per episode. So we remove any duplicates using this, and in the end we have a thousand unique pairs for our evaluation set. Then we feed them into the reranking evaluator, which requires a particular format: a query, then any positives we have (we just have one positive per query here, in a single-item list), and then negatives, where we pass in all of the other episodes that are not the positive episode for that particular query. We do that here, and then we initialize the reranking evaluator.
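A sketch of that setup; eval_pairs stands in for the deduplicated, held-out (query, episode) pairs:

```python
# In-batch style evaluation with the RerankingEvaluator.
from sentence_transformers.evaluation import RerankingEvaluator

samples = []
for query, episode in eval_pairs:  # eval_pairs: deduplicated held-out pairs
    samples.append({
        "query": query,
        "positive": [episode],  # the one true episode for this query
        "negative": [ep for _, ep in eval_pairs if ep != episode],
    })

evaluator = RerankingEvaluator(samples, mrr_at_k=5)
# Scores the model; metrics (including MRR@5) are also written to a CSV
print(evaluator(model, output_path="."))
```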
I'm not going to go into how the score is calculated, because that takes a little more time, but calculating the performance, this is the MRR@5 for the model without any fine-tuning: we get 0.68.
At this point we're ready to go ahead and fine-tune our model, and this is basically the easiest step. We set the number of epochs to one. Typically with sentence transformers you really only need to train for one epoch, depending on the model and the data; anything more and performance can actually degrade. So it's one epoch here, and we also, as usual, use a number of warm-up steps, during which the learning rate slowly increases up to the default learning rate, which I think is 1e-5 or something along those lines. Then we save the trained model into this directory here, distiluse-podcast-nq (nq for "natural query"). It shouldn't take too long; I think this took an hour, maybe two, to train.
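The fit call itself, as a sketch; the warm-up fraction is my assumption, and the output path matches the directory mentioned above:

```python
# Fine-tuning the sentence transformer on the synthetic pairs.
warmup_steps = int(len(loader) * 0.1)  # ramp the learning rate up slowly

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,                  # one epoch is usually enough here
    warmup_steps=warmup_steps,
    evaluator=evaluator,       # logs MRR@5 to a CSV in the output path
    output_path="distiluse-podcast-nq",
)
```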
With that done, we can look at the evaluation: how did this model perform on the evaluator we initialized at the start? In another environment, where we actually trained the model, we go to the distiluse-podcast-nq directory, open the evaluation file, and over on the right here we can see the MRR@5 value: 0.888. That's a big improvement from the 0.68 we had before, so it's a significant jump.
But that doesn't really replicate the real-world use of this model, so what we want to do in the final evaluation step is replicate that. We're going to take the test data, which has many episodes in it, encode all of those, and then perform a semantic search as we'd expect it to be used in real life. We'll put all the episode vectors from our test dataset into an index, and then calculate the recall value, this time for both the fine-tuned and the not-fine-tuned model. So we initialize Pinecone and create an evaluation index here, making sure we use the cosine metric, and we connect to that. If you need an API key, go to app.pinecone.io; it's free, so you don't need to worry about anything there.
Then, before we index our test set, we remove duplicates like we did with the evaluation set earlier; it's the same thing, so I'm not going to go through it again. Here I'm going through once more just to make sure there are definitely no duplicates (there shouldn't be, but just in case), creating batches of episodes, encoding them with our new distiluse-podcast-nq model, and upserting them, so inserting them, into our Pinecone vector database, then refreshing the batch and doing it again, all in batches. At the end we look at the index stats, so you can see how many vectors we actually have in there.
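A sketch of that indexing step, using the older pinecone-client API from the time of this walkthrough; the index name, environment, and position-based ID scheme are all assumptions:

```python
# Encoding test episodes and upserting them into a Pinecone index.
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
pinecone.create_index("evaluation", dimension=512, metric="cosine")
index = pinecone.Index("evaluation")

batch_size = 64
for i in range(0, len(test_episodes), batch_size):  # test_episodes: deduped texts
    batch = test_episodes[i : i + batch_size]
    embeddings = model.encode(batch).tolist()
    ids = [str(i + j) for j in range(len(batch))]  # IDs from list position
    index.upsert(vectors=list(zip(ids, embeddings)))

print(index.describe_index_stats())  # confirm the vector count
```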
Then what I'm doing here is calculating the recall@K: I loop through all the queries from our test set, query the index, and check whether the true episode is returned. Using those synthetic queries, the recall is 0.88, which looks incredibly good; but we're using synthetic queries here, so it's an easier test than real usage. To get a more realistic number, I went through and found different episode descriptions, which were at indexes 1, 8, 14, and so on, and I created a query for each of those episodes, queries that make sense to me. Now, this isn't perfectly accurate: one of these queries could return another episode that's actually more relevant, but with what I've done here, that would show up as not a match and bring the score down. Once we use those curated, true query-episode pairs, the more human-like queries, we get around 0.57. That's definitely not as impressive as the 0.88 we had before; it's okay, but it's nothing special.
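The recall loop itself looks something like this; test_queries is a hypothetical list of (query, true episode ID) pairs, and the K value is an assumption:

```python
# recall@K over the test queries: is the true episode in the top K results?
top_k = 5

hits = 0
for query, true_id in test_queries:
    xq = model.encode(query).tolist()
    res = index.query(vector=xq, top_k=top_k)
    if true_id in {match["id"] for match in res["matches"]}:
        hits += 1  # the true episode was returned in the top K

print(f"recall@{top_k}: {hits / len(test_queries):.2f}")
```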
What I want to do now is compare this to what we had before: the pre-trained model without fine-tuning. To do that, I'm creating a new index. If you're using the free version of Pinecone, you can only have one index at a time, so you will have to delete your existing index first. Then I initialize the pre-trained model that hasn't been fine-tuned on our particular dataset, create the index (eval0), connect to it, and go through exactly the same thing as before: upserting episode embeddings encoded with the older, not-fine-tuned model, and then calculating the recall, again using the curated dataset. We get a much worse score of 0.28. So there's actually a massive improvement, from 0.28 to 0.57, a 29-point improvement, which is huge, especially considering that our training data was synthetic.
A synthetic dataset is never going to perform as well as a genuine one, so if we had those other two data sources that Spotify used, we could probably create something very impressive compared to what we've done here. But given that we just used episode data, without any real queries, and synthetically created the queries, I think this is really impressive. It shows that Spotify's methodology, at least in terms of adding these synthetic queries, definitely contributes to their overall model, and it shows us a really cool way to train a model without any real training data.
Another thing we should think about here is how well this scales. Using Pinecone, we're quite easily able to go up to hundreds of millions of episodes, or vectors, in there, and even billions. So in terms of scalability, unless you're going to trillion scale, this is very possible with pretty much every dataset you're going to be using, and it will still be incredibly fast, which I think is really cool. The fact that we can do this with no real training data, make it incredibly scalable, and build something genuinely useful is really awesome. So that's it for this walkthrough.
I will say, if you are going to build something like this with Pinecone, doing this sort of cool vector search, we are looking for people who would like to showcase the work they're doing, and it would be really cool to see what you're building. If you'd like to share it and get other people seeing what you've built, just go over to the community page at Pinecone, and you'll find out how to submit any projects you're working on there. I hope all of this has been useful and interesting. Thank you very much for watching, and I will see you again in the next one. Bye.