Spotify's Podcast Search Explained


Chapters

0:00 Intro
4:16 NLP in Semantic Search
8:35 Why Now?
9:29 Transformer Models
11:52 Sentence Transformers
13:12 Vector Search
15:56 How Spotify Built Podcast Search
17:35 Data Source, Fine-tuning, and Eval
22:58 Code Implementation, Dataset
24:44 Data Preparation
26:39 Query Generation
29:54 Fine-tuning a Podcast Model
41:40 Evaluation
48:05 Does it Scale?
49:00 Sharing Your Work

Transcript

In the past few years podcasts have become increasingly popular, and leading the charge in podcasts is Spotify. Now, Spotify only very recently entered the market for podcasts, in 2018. Despite that, in a few short years, so by 2021, they had already usurped Apple as the leader in terms of monthly active listeners. Apple has been doing this for a very long time, since I think 2008 or maybe even earlier, but despite that Spotify are now ahead of everyone.

Of course, to back their investments in podcasts, they've also invested heavily in the technology that powers different components of the podcast experience, whether that's Anchor, which is an all-in-one app that allows people to record and publish podcasts, or what we're going to focus on: natural language search for podcasts. Spotify's natural language search, or semantic search, for podcasts is I think a really interesting use case, because it enables a more intuitive user experience by allowing us to search through their podcast catalog, which is huge, using more natural language queries rather than the typical keyword or term matching that you would do in most places.

Beforehand, when we were searching, we'd have to know the words or the terms that we're looking for. And I think we as humans don't really know exactly what we're looking for in terms of words or keywords. Instead we tend to think more in concepts and ideas. Our language is meaningful; it's not specifically about term matching or keyword matching. So it makes sense that we might want to create a search experience that replicates the meaning in language rather than just terms and words.

Imagine we wanted to find a podcast that talks about eating healthily over the winter holidays. We might search something like this: "eat better during Xmas holidays". Now, in the dataset that we're going to be using there is a podcast description that talks about exactly this. Its description is "Alex Draney talks to Dr Priya Alexander about how to stay healthy over Christmas and about her letter to patients". If you compare those two, they don't share any of the same words, despite pretty much being about the same thing.

So if you try to use this in a term matching query, it's not going to work. But if we swap that for a natural language query, or a semantic search, it does, because the meaning overlaps: the genuine human meaning behind those two phrases is very similar. When a semantic search is done properly, it is able to identify that the meaning of these two does overlap. We're talking about Christmas holidays, being healthy, eating better; they're very similar concepts. But enabling meaningful search in this way is not easy. So what I want to do is have a look at how Spotify have done this, and actually replicate it and see what we can do as well.

So the technology powering Spotify's semantic search consists of two different components: the NLP, or natural language processing, side of it, and the vector search component. These technologies can be seen as two steps in the search process. Given a natural language query, a language model (this is the NLP side) will take that query and convert it into something we call a dense vector, which is essentially a meaning-encoded numeric representation of whatever that query is. It sounds confusing, but we'll see quite soon that it's not really too complex.

These dense vectors can then be compared as you would compare normal vectors. Imagine you have two points in a particular space and you want to calculate the distance between them. We can literally do that, but with the meaning behind these queries. Let's have a look at an example of that. Here we just have a 3D chart. Typically, when you have these dense vectors, they have many dimensions, so you're looking at 700-plus dimensions in these vector representations. In this one we've used PCA to reduce the dimensionality of those very high-dimensional vectors into a 3D space, just so we can visualize it while still maintaining the relationships between those vectors.
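As a rough illustration of comparing dense vectors like ordinary points in space, here is a minimal sketch in plain Python. The three 3D vectors are made-up stand-ins for real embeddings (real ones would come from a language model and have hundreds of dimensions), and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up 3D vectors standing in for real sentence embeddings.
cookbook_interview = [0.9, 0.8, 0.1]
chef_who_wrote_books = [0.8, 0.9, 0.2]
superhero_film = [0.1, 0.0, 0.9]

similar_score = cosine_similarity(cookbook_interview, chef_who_wrote_books)
dissimilar_score = cosine_similarity(cookbook_interview, superhero_film)
similar_distance = euclidean_distance(cookbook_interview, chef_who_wrote_books)
dissimilar_distance = euclidean_distance(cookbook_interview, superhero_film)

print(f"cooking texts:    cosine={similar_score:.3f} distance={similar_distance:.3f}")
print(f"cooking vs film:  cosine={dissimilar_score:.3f} distance={dissimilar_distance:.3f}")
```

The two cooking-related vectors end up with a higher cosine similarity and a smaller distance than the cooking and film pair, which is the grouping behaviour the chart is showing.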

So if we have a look here, we can see there's one main grouping of vectors: these three over here. Let's zoom in a little bit and have a look at what they are. We have the texts "podcasts about cooking and writing", "a chat with chef who wrote books" and "interview with cookbook author". We can see that they're all about cooking and writing, and they've all been grouped together. That's because our language model has taken these queries and encoded them in such a way that the vector representations represent the meaning. The meaning of all of these, although it's not the same, is very similar (talking about cooking, writing books and so on), and we can see that represented in the proximity of these vectors.

Now let's have a look at something else that's kind of close but not within that same cluster. Over here we have "how do I cook great food?", so we're talking about cooking again. And then we have this one further over here, "eat better during Christmas holidays", which is talking specifically about food. So "how do I cook great food?" sits in between: on one side we have the texts about cooking and writing and a cookbook, on the other we have "eat better during Christmas holidays", and this one, which is about cooking and food, is in the middle of them both.

Then we have a few more over here. We have "the writing show" over here, so it's like a writing podcast or something, and one of the more similar vectors is "how to tell more engaging stories", and then also "how to keep readers interested". So again, talking about writing. And then we go up here and we see this one by itself, "superhero film and arts". It's obviously very different to everything else we've looked at.

So this is what I mean by representing the meaning numerically in a vector space. Similar things get grouped together; dissimilar things are separated. All of those vectors have been encoded by a special kind of NLP model called a sentence transformer. That sentence transformer is specially trained to do what we just saw, which is group similar things together and separate dissimilar things.

Once we have those vectors from our sentence transformer, we need a way to compare them, and that's where the vector search component comes in. Within all those queries that we just saw, imagine we came up with a new query. We convert that query into a vector, place it within that vector space, and then search for the most similar of the other vectors and return those as being the ones most similar to your particular query.

Now, NLP and vector search have been around for a long time, but both fields have had very recent developments that have really acted as catalysts for the performance increases and subsequent adoption of semantic search. In NLP we've had the introduction of transformer models, and in vector search the introduction of ever better approximate nearest neighbor search algorithms. Transformers and approximate nearest neighbors have powered the growth of semantic search, but it's not inherently clear what both of those are.
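The search step described above (encode a new query, place it in the vector space, return the closest stored vectors) can be sketched as a simple exhaustive search. This is only an illustration: the vectors are invented by hand rather than produced by a sentence transformer, and `search` and `index` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vector, index, top_k=2):
    """Score every stored vector against the query and keep the best top_k."""
    scored = [(cosine_similarity(query_vector, vector), text)
              for text, vector in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]

# Hand-made vectors; a real system would get these from the language model.
index = {
    "interview with cookbook author": [0.9, 0.8, 0.1],
    "how to tell more engaging stories": [0.1, 0.9, 0.2],
    "superhero film and arts": [0.1, 0.0, 0.9],
}

# Pretend this is the encoding of "podcasts about cooking and writing".
query_vector = [0.8, 0.7, 0.0]

results = search(query_vector, index, top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Because every stored vector is scored, this is the exhaustive version of the search; the approximate algorithms mentioned here exist to avoid exactly that full scan.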

I'm assuming most of you probably have no idea what either of those are. If you do, that's great, but otherwise no problem, because we're going to go through both of those concepts as well. So, starting with transformer models: transformer models are essentially the standard in NLP now. They typically consist of two components, which is quite important for us to know. There's the core of the transformer, and then you usually have a head, which adapts the transformer for a particular task. Now there's just one problem.

The core of these transformer models is huge; they're massive models, and for most organizations it is far too expensive to actually train the core of the model. For example, with BERT, which is one of the most popular language models and not even a particularly large one, the reported costs were, I think, two and a half thousand to fifty thousand dollars to train a small one. When you look at a larger BERT model, that shifts up to eighty thousand to 1.6 million dollars, which is obviously huge. Most organizations don't have that kind of money lying around. So how are transformers useful for us if we can't afford to train them?

Well, the way that we typically go about using transformer models is this. One, the core of a transformer model is trained by the likes of Microsoft or Google. It costs them a lot of money to train it, but they have the money to do that. Two, the model is made publicly available by Microsoft or Google. Three, other organizations take this transformer core and add different heads to it. These heads are like a final few layers at the end of the model that adapt it for a particular task. With that extended model with the head, the model can be fine-tuned using a lot less computing power, which brings it within the range of feasibility for other, more normal organizations. And four, once you've fine-tuned your extended transformer model, you are able to go ahead and apply it to your specific task.

So in the case of our example, podcast search, we might want to take a BERT model, like bert-base-uncased, which has been pre-trained by Google. We would take that and add what's called a pooling head onto the end of it. That converts it into what we call a sentence transformer. It will take an input like a normal transformer, so some sort of text input, and the normal transformer will output a set of token-level embeddings. Now, we can't use token-level embeddings when we're comparing sentences, because one sentence for BERT can contain up to 512 tokens. So we really need a way to compress that down into one single vector that represents the full set of inputs given to our transformer model. That's where we have the pooling layer: it takes all of those token-level embeddings and converts them into a single embedding, which is our sentence vector or sentence embedding.

Now there are different types of pooling layers, and we'll talk about one later, but they all consume the same input and they output that single embedding. It is that sort of model that we refer to as a sentence transformer. That sentence embedding at the end is what we referred to earlier as a dense vector.
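To make the pooling step concrete, here is a minimal sketch of mean pooling, which is one common type of pooling layer (the choice of mean pooling here is an assumption for illustration, not necessarily the layer used later). The token embeddings and the `attention_mask` values are invented; in a real model they would come from the transformer and its tokenizer.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of real tokens, skipping padding positions."""
    dims = len(token_embeddings[0])
    totals = [0.0] * dims
    count = 0
    for embedding, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(embedding):
                totals[i] += value
    return [total / count for total in totals]

# Three token-level embeddings of size 4; the last position is padding.
token_embeddings = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding token, masked out below
]
attention_mask = [1, 1, 0]

sentence_embedding = mean_pool(token_embeddings, attention_mask)
print(sentence_embedding)  # [2.0, 2.0, 2.0, 2.0]
```

However many tokens go in, a single vector of the same size comes out, which is what lets whole sentences be compared with one another.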

That's the numerical representation of whatever we fed into that transformer model. Given that dense vector, we move on to the next step, which is vector search. Approximate nearest neighbor search allows us to search more efficiently through a lot of vectors. In many use cases, for example with Spotify, how many episodes do they have on there? I imagine there are many. And it takes a very long time if you compare your query to all of the existing vector embeddings; it's very hard to do that quickly. In fact, it's impossible if you're using a typical k-nearest-neighbor search, otherwise known as exhaustive search, because you need to compare everything, and if you have millions of items to compare, no matter what hardware you run, it's going to take a long time.

So approximate search allows us to speed that up by literally approximating the answer. Approximate nearest neighbor algorithms are very good, because they allow you to approximate the answer with a super high degree of accuracy, something like 99% accuracy, and make your search incredibly fast.
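One way to picture how an approximate search avoids comparing everything is a toy bucketed index: group vectors around a few centroids up front, then only score the vectors in the buckets closest to the query. This is a simplified sketch of one family of approximate methods, written for illustration; it is not a description of what Pinecone or Spotify actually run, and all names and the random data are assumptions.

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_buckets(vectors, centroids):
    """Assign every vector id to the bucket of its most similar centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        best = max(range(len(centroids)),
                   key=lambda c: cosine_similarity(vec, centroids[c]))
        buckets[best].append(vec_id)
    return buckets

def approximate_search(query, vectors, centroids, buckets, n_probe=1, top_k=3):
    """Score only the vectors in the n_probe buckets closest to the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: cosine_similarity(query, centroids[c]),
                    reverse=True)
    candidates = [vec_id for c in ranked[:n_probe] for vec_id in buckets[c]]
    candidates.sort(key=lambda i: cosine_similarity(query, vectors[i]), reverse=True)
    return candidates[:top_k], len(candidates)

def exhaustive_search(query, vectors, top_k=3):
    """Baseline: score every vector."""
    ids = sorted(range(len(vectors)),
                 key=lambda i: cosine_similarity(query, vectors[i]), reverse=True)
    return ids[:top_k]

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
centroids = random.sample(vectors, 10)  # crude centroids: 10 random vectors
buckets = build_buckets(vectors, centroids)
query = [random.gauss(0, 1) for _ in range(8)]

exact = exhaustive_search(query, vectors)
fast, n_scored_fast = approximate_search(query, vectors, centroids, buckets, n_probe=2)
full, n_scored_full = approximate_search(query, vectors, centroids, buckets, n_probe=10)

print("exhaustive top 3:", exact)
print(f"n_probe=2  top 3: {fast} (scored {n_scored_fast} of {len(vectors)})")
print(f"n_probe=10 top 3: {full} (scored {n_scored_full} of {len(vectors)})")
```

With a small `n_probe` far fewer vectors are scored, at the risk of missing some true neighbors; probing every bucket recovers the exhaustive result but gives up the speed-up.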

You're talking sub-second, sub half a second; you're in the 10 millisecond range in a lot of cases. So that's really useful. But it's also worth noting that between these different algorithms there's usually a trade-off between speed, so your search times, and accuracy. You can have an even faster search, but then your accuracy will tend to be not quite as good, so you'll be returning maybe 70% accuracy rather than 99% accuracy. There's always a trade-off between the two, but generally the algorithms are now very good, and you can get incredible accuracy and incredible speed as well. So we've merged those two together. We have our sentence transformers.

We have our approximate search, and we now have a toolset that is capable of performing semantic search at incredibly large scales. So what I want to do now is look specifically at how Spotify did it.

To build this kind of semantic search tool, Spotify first needed a model that can encode queries and episode metadata or descriptions into the same vector space. There are existing transformer models, or sentence transformer models, that can do a lot of things, for example SBERT, but Spotify found a few issues with SBERT specifically. One, they needed a model that was capable of supporting multilingual queries, because obviously Spotify has a lot of content from every place in the world, so they couldn't use SBERT, because SBERT has been trained on English-only data. And two, SBERT's performance on new domains, for example podcasts, is not great without further fine-tuning.

So the out-of-the-box, pre-trained, fine-tuned Sentence-BERT model could not be used; it didn't satisfy what Spotify needed. With that in mind, they decided to start with the pre-trained Universal Sentence Encoder (USE) model. This allowed them to cover the multilingual issue. And yes, the USE model still needs to be fine-tuned. But they were going to need to do that.

Anyway, so it's not really an issue. So let's have a look at what data they used to fine-tune their model. To fine-tune their model, Spotify needs query-episode pairs. Now, of course, Spotify has a ton of search logs, over here, so they took these search logs and used them to create two different data sources. The first was simply: if there is a successful search, what was the query from that search and what was the episode? That created this little data source that you see here.

The other one is a bit more interesting. Whenever they found an unsuccessful search that was straight away followed by a successful search, they looked at what the query for that unsuccessful search was. The logic behind that is: if you're typing something and it doesn't work, you probably used a more natural language, more natural-feeling query. It doesn't work, so you change it to be a bit more robotic, trying to fit what you would expect Spotify's search to need in order to come up with the correct podcast episode. So what they did is take that unsuccessful search query and the episode that the user found after their successful search, and put them together to create this little data source. So we have two data sources now. I've used dotted lines here because we can't replicate that.
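The two log-based data sources described above could be built along these lines. This is purely a sketch: Spotify's real log format is not described, so the `user`, `query` and `clicked_episode` fields, the example events and the function name are all hypothetical.

```python
def pairs_from_logs(search_log):
    """Build (query, episode) pairs from a time-ordered list of search events.

    successful: the query and episode from every successful search.
    recovered:  the query of a failed search, paired with the episode found by
                the same user's immediately following successful search.
    """
    successful, recovered = [], []
    previous = None
    for event in search_log:
        if event["clicked_episode"] is not None:
            successful.append((event["query"], event["clicked_episode"]))
            if (previous is not None
                    and previous["clicked_episode"] is None
                    and previous["user"] == event["user"]):
                recovered.append((previous["query"], event["clicked_episode"]))
        previous = event
    return successful, recovered

# Invented events; clicked_episode=None marks an unsuccessful search.
search_log = [
    {"user": "u1", "query": "how to eat healthy at christmas", "clicked_episode": None},
    {"user": "u1", "query": "christmas health dr alexander", "clicked_episode": "ep_42"},
    {"user": "u2", "query": "cookbook author interview", "clicked_episode": "ep_7"},
]

successful, recovered = pairs_from_logs(search_log)
print(successful)
print(recovered)
```

The recovered pairs are the interesting ones: they attach the more natural first attempt to the episode the user eventually found.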

We don't have Spotify's past search logs, so we will just ignore those. But be aware that the way they would be used is exactly the same as this third data source over here, which we are going to replicate. That third data source starts from podcast episodes, the little diamond over here, and we're transforming them. What I mean by transforming is taking the podcast show title and description, and also the specific episode title and description, and concatenating all that together to create a sort of episode description from these multiple things. Then we use that episode data with a query generation step. Query generation is super interesting: what we're going to do there is use a query generation model to create synthetic queries for our episodes, and that will create this third data source over here, which is synthetic-query-to-episode pairs. We'll split that, so we have a training set, and use it to fine-tune a model in the same way that we would have used the other two data sources. We'll be fine-tuning a pre-trained Universal Sentence Encoder model. So we use all that, and in the end what we'll have is this podcast Universal Sentence Encoder model, which we can then use to encode both our queries and our episodes.

There's one other data source. We have our podcast episodes over here, we place them into this transform, and we've created those podcast episode descriptions. This other data source that you see over here is a manually created data source. Spotify did this and used it purely for evaluating the model; they didn't use it in training. They just manually curated a set of queries for particular podcast episodes. We will do the same.

We just do, I think, seven queries, see how that works, and use that to evaluate our models. Once we've done all of that, we come to the evaluation, and also how you would actually use this. You take your podcast model and bring it over here into the middle. Then you take the evaluation data from your query-generated data and bring it in here. You encode everything and place it into a vector search engine; in this case we're using Pinecone. This allows us to store all of our episodes as embeddings, and then we're able to query and see whether we are returning the correct episode based on our synthetic-query-to-episode pairs. We'll also do the same using the manually curated query-episode pairs.

This is exactly what you would do in production as well. After you've evaluated, you take all of your episode embeddings and put them in your vector database, the Pinecone at the bottom there, and then you let your users query. Your query goes into your podcast sentence transformer model and is then passed into your vector database, which identifies the most semantically similar, or most relevant, episode vectors for you. Okay, so that's a very high-level view.

That's how it all works. If it's confusing, no problem; we're going to work through all of this. So now we'll move on to the actual implementation. What we first need is a dataset that as closely as possible replicates the episode data that Spotify are using. We'll need Kaggle for this, so you can install the Kaggle API using pip install kaggle, and then you'll need an account.

You need to sign up on kaggle.com and get an API key. You come over here and click on the top right, where you'll see your profile. You go to, I think, Account, and you can come down and you have API here. You'll be able to create a new API token, and it will download a kaggle.json file for you. What you do with that: if you try to import kaggle, so you run this import kaggle, it will say you are not authenticated, because you don't have kaggle.json in this directory. So all you do is put your kaggle.json in the directory that is specified. Once you've done that, you should be able to import kaggle without any errors, and we'll move on to the data download. So we just authenticate our API.

I'm not going to go too in depth here. There will be a link to this notebook, so you can read through everything as you want. All we're going to do is download these two datasets. Let me copy this and I'll show you this dataset on Kaggle. So this is all podcast episodes published in December 2017, from a company called Listen Notes, which is a podcast search engine. We download the podcasts and episodes data and extract them, because they'll be in zip files. We can see that podcasts is the details of the podcast show itself, and episodes is the details of the individual episodes.

What Spotify did was concatenate the title and description from both of those and use that to create a single episode description that we then encode. So we do the same. We first need to merge those two data frames, which we do based on the unique podcast ID, and we get this data frame with everything we need inside it. I'm just doing a little bit of data cleaning here: stripping any white space from the features that we care about, so the title EP, description EP, title podcast and description podcast (EP here is just episode). Then I'm also removing any rows where we have null values in any of those columns or features. I'm just removing them because we have a lot here anyway; we don't even use all of these, so it's not a problem. Then I'm concatenating everything, putting a full stop in between each one of them, and converting that into a list. So we just have a list of these episode texts now.
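The cleaning and concatenation steps just described can be sketched without any dependencies. The notebook works on merged data frames; this version uses plain dictionaries instead, and the field names, their order in the final text and the example rows are assumptions made for illustration.

```python
# Assumed field names and order; the notebook's actual columns may differ.
FIELDS = ["title_podcast", "description_podcast", "title_ep", "description_ep"]

def build_episode_texts(rows):
    """Drop rows with missing values, strip whitespace, join with full stops."""
    texts = []
    for row in rows:
        values = [row.get(field) for field in FIELDS]
        if any(value is None for value in values):
            continue  # same idea as removing rows that contain nulls
        cleaned = [value.strip() for value in values]
        texts.append(". ".join(cleaned))
    return texts

rows = [
    {
        "title_podcast": " Health Hour ",
        "description_podcast": "Weekly health chat",
        "title_ep": "Christmas special",
        "description_ep": "How to stay healthy over Christmas ",
    },
    {
        "title_podcast": "The Writing Show",
        "description_podcast": None,  # missing value, so this row is dropped
        "title_ep": "Episode 12",
        "description_ep": "Keeping readers interested",
    },
]

episode_texts = build_episode_texts(rows)
print(episode_texts)
```

Each surviving row becomes a single string, which is the text that later gets passed to query generation and to the encoder.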

You see they do vary a lot. It's not the cleanest data, but it's usable. We also shuffle here. When you're shuffling, just bear in mind that the shuffle is going to randomize the order, so later on, when we have these curated queries, the matching episodes won't necessarily be in the test set for you. You may need to make up some new queries. At that point we have our episodes data, and we need to move on to query generation, because we need query-episode pairs.

So Spotify fine-tuned a BART model on MS MARCO. Now, I'm not going to fine-tune a model, because there are honestly so many query generation models that have been fine-tuned on MS MARCO already, including BART models. I don't think there's really any point in us fine-tuning one, because we'd just be replicating what we can pull from the Hugging Face Hub. I tested a few different BART and T5 query generation models, and I found this one to be the best for this particular use case. The queries were just more consistently sensible, like they actually made sense. It also has some multilingual support. I'm not sure how much, but it does manage to produce queries that make sense in different languages. So that's definitely pretty useful, and aligns pretty nicely with what Spotify wanted as well.

What I'm doing here is using the transformers library, that is, Hugging Face Transformers. You can pip install transformers if you need to. I'm just initializing the tokenizer and the model, which are going to handle our query generation. On the end here you see I put cuda. That's because I have a CUDA-enabled GPU, which makes things a lot faster, because this can take a long time. That's also why I'm just taking the first 100,000 episodes here. So we move on to the query generation loop.

It's fairly long, sorry. This isn't the quickest way you can do it. A larger batch size means faster processing, but this is limited by the size of your GPU. Then there's the number of queries I'm generating for each episode. Spotify didn't specify what they used for the number of queries, but I'm using three, because that's in line with the approach taken by these other query generation techniques, GenQ and GPL. So I'm going to stick with that. Then we're just going through and encoding everything in batches. We tokenize everything, generate three queries per episode, and decode those queries back to human-readable text, because the model just outputs a load of token IDs, which are numbers. We loop through, put each synthetic query together with its episode, and place all of those pairs in this pairs list. And we can see a few here.
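The shape of that loop, batching the episodes and then matching each generated query back to its episode, can be sketched as follows. To keep it runnable without a GPU or a model download, the tokenizer and model are replaced by a stand-in function, `fake_generate_queries`, which is entirely hypothetical. It returns queries in one flat list grouped by input, which is my understanding of how generating several sequences per input behaves in Hugging Face Transformers.

```python
def fake_generate_queries(batch, num_queries):
    """Hypothetical stand-in for tokenize -> model.generate -> decode.

    Returns num_queries strings per input text, as one flat list grouped by input.
    """
    return [f"query {n} about {text[:20]}"
            for text in batch
            for n in range(num_queries)]

def build_pairs(episodes, generate_queries, batch_size=2, num_queries=3):
    """Generate queries batch by batch and pair each one with its source episode."""
    pairs = []
    for start in range(0, len(episodes), batch_size):
        batch = episodes[start:start + batch_size]
        queries = generate_queries(batch, num_queries)
        for i, query in enumerate(queries):
            # Queries 0..num_queries-1 belong to the first episode, and so on.
            episode = batch[i // num_queries]
            pairs.append((query, episode))
    return pairs

episodes = ["ep a", "ep b", "ep c"]
pairs = build_pairs(episodes, fake_generate_queries, batch_size=2, num_queries=3)
for query, episode in pairs:
    print(f"{query!r} -> {episode!r}")
```

The integer division is the part that is easy to get wrong: with three queries per episode, every run of three consecutive outputs maps back to the same episode in the batch.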

So we have a query and then an episode, a query and an episode, and so on. You can also see here that we have the multilingual support as well, which is very cool. And again here. So we now have a data source that we can use for fine-tuning the model. Spotify talk about this in their article, which is here, and they mention that they tried BERT, but as I said, they decided they didn't really like it. So we're not using it, for the multiple reasons I mentioned before.

What they did is use this Universal Sentence Encoder model, and we're going to do something similar. What I will note here is that they use TensorFlow Hub. We can use TensorFlow Hub, but it means we'd be stuck with TensorFlow, and it makes our lives harder than using Hugging Face, PyTorch and the sentence-transformers library. So rather than using the Universal Sentence Encoder model, which we can't get from the sentence-transformers library as a pre-trained model, we are going to use a distilled Universal Sentence Encoder model, which is literally like a smaller version of USE. We'll come down here, and we're using this one: distiluse-base-multilingual-cased. So it's good, it's multilingual, and it's still pretty much the same model as what Spotify used. We have the model details here: the maximum number of tokens it will accept is 128, and it will output a 512-dimensional dense vector.

So that's the meaningful vector that we spoke about earlier. Now, when we're fine-tuning with the sentence-transformers library (and by the way, if you do need to install that, you can just pip install sentence-transformers, like this, in your terminal, or using the exclamation mark at the start there), we need to reformat our input data, so our pairs, into a list of InputExample objects. The format of the InputExample object varies depending on what task you're fine-tuning with. We're using a ranking task for fine-tuning, or optimizing, our model, and that means we only need the query and the episode. Here as well, I'm just splitting, so we have an evaluation set down here and a test set for later on when we're evaluating things. So all we need in our InputExample, which we have imported from sentence-transformers, is the query and the episode. That's literally it. It's just a list of these input examples with a query and an episode in there, and you see the sample size that we have there.

So let's continue. As I mentioned, we're going to be using a ranking optimization function. What does that look like? Given a specific query, like the one we used earlier, "eat better during Xmas holidays", our model is going to be given a set of episodes, and it's going to need to rank the true episode for that particular query as number one in terms of scores. So the true pair for this query is this one here, and the model has to be able to identify that all these other episodes are not real pairs, or are less semantically similar to this particular query. The model learns to give higher scores to more semantically similar episodes for a particular query, and lower scores to the more dissimilar episodes. And that's why we don't need a label in this case.
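The ranking objective described above can be written out as a small loss function: for each query, score every episode in the batch, then penalize the model unless the true episode gets the highest score. This is a plain-Python sketch of that idea using cosine similarity and a softmax cross-entropy. The `scale` value of 20 and the toy vectors are assumptions for illustration, and this is not the library's own implementation.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ranking_loss(query_vectors, episode_vectors, scale=20.0):
    """Mean cross-entropy where episode i is the only positive for query i.

    Every other episode in the batch acts as a negative, so no explicit
    similarity labels are needed, only the pairing itself.
    """
    losses = []
    for i, query in enumerate(query_vectors):
        scores = [scale * cosine_similarity(query, episode) for episode in episode_vectors]
        max_score = max(scores)  # subtract the max for numerical stability
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)

# Toy 2D embeddings: query i should match episode i.
queries = [[1.0, 0.1], [0.1, 1.0]]
aligned_episodes = [[0.9, 0.2], [0.2, 0.9]]   # true pairs are close together
swapped_episodes = [[0.2, 0.9], [0.9, 0.2]]   # true pairs are far apart

loss_aligned = ranking_loss(queries, aligned_episodes)
loss_swapped = ranking_loss(queries, swapped_episodes)
print(f"aligned: {loss_aligned:.4f}  swapped: {loss_swapped:.4f}")
```

The loss is small when each query is closest to its own episode and large when it is closer to someone else's, which is the signal that pushes true pairs together during fine-tuning.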

We don't actually need a particular similarity label or anything when training our model, which is a really nice benefit of using ranking functions. You literally just need pairs of data and you can train with that, which is pretty cool. Now, the way the model compares the queries and episodes is by taking the query and converting it into a query vector, taking the episode and converting that into an episode vector, placing them in a vector space and comparing the similarity between them. The way that we are going to be calculating similarity in this case is using cosine similarity, so we're essentially calculating the angle between vectors. This in the middle here is our query vector, and this little cone represents the angle in a 3D space around that query vector. These two episode vectors, or episode embeddings, are the two most similar. We've got top k equals 2 over here, because we're trying to return the two most similar.

Take one of these other vectors, like this one for example. In terms of actual distance, this one might actually be closer than this one, but in terms of angular distance it's further away. That's why we have these two embeddings being selected rather than any of these other ones. So the model needs to create these vectors, and it needs to learn how to put similar-meaning queries and episodes in a similar part of the vector space, and dissimilar queries and episodes as far apart as possible. Another thing that we need to consider here comes from the fact that we're using a ranking function, which works by taking a query and then a batch of episodes: we need to be careful about what goes into that batch of episodes.

We can't have any duplicates in a batch. Imagine we're optimizing this model: we've got a query-episode pair and we're saying, for this query, this episode is the most similar, and we're optimizing the model based on that. If we have the exact same episode further down in that list, we're telling the model that these two episodes are exactly the same, but one is right and one is not. Our model is just going to be confused. That doesn't make sense; it can't do that. So we have to make sure that we don't have duplicates in any of the training batches. It's probably not a problem if you have the odd one, but we don't want any; we just want to be certain there's nothing in there. We can do that using the sentence-transformers NoDuplicatesDataLoader, which will handle removing any duplicates from batches for us.

We also use a batch size of 64. Consider the training optimization function here, where we're taking a query and then a batch of episodes. Imagine we have a batch size of two. We're just comparing our one query against two episodes, so even if our model is just randomly guessing, it can get the right answer around 50% of the time with such a small batch size. If we increase that to a hundred, the model is going to perform a lot more poorly when randomly guessing. The consequence of that is that using a larger batch size for this particular training method is typically going to result in a better performing model, because it makes the task of ranking harder for the model, so your model must get better in order to accurately rank everything. So we increase the batch size as much as our hardware will allow us to. After that we can initialize the loss function. This is where the ranking optimization comes in.

So in sentence transformers, the ranking function is this multiple natives ranking loss Okay, so we initialize that There is just one more step before we actually move on to fine-tuning. So if we have a look over here and We go down to offline offline evaluation in order to evaluate a model They use two different types of metrics in batch metrics and for retrieval We're gonna do both so for the in batch metrics, so just calculating recall and MMR at the batch level okay, so what we can do to replicate that is use this re-ranking evaluator from sentence transformers and we Will use that to perform ranking and calculate the MMR metric using using our evaluation set so Just one thing as we have already done here is we've removed duplicates from batches.

We need to do the same before we actually feed data into our reranking evaluator, and there are a lot of duplicates, because we create three queries per episode, so we definitely do need to remove those. So we remove any duplicates using this, and in the end we have a thousand unique pairs for our evaluation set. Then we feed them into the reranking evaluator, which requires a particular format: a query, any positives we have (we just have one positive per query here, so there's just a single-item list), and then the negatives.

For the negatives, we just need to pass in all of the other episodes that are not the positive episode for that particular query. So we do that in here, and then we initialize a reranking evaluator, and I set MRR at K to consider the top five items when it's calculating its score. I'm not going to go into the metrics because it takes a little bit more time. So, calculating the performance.
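
The evaluation format and the MRR@K metric described above can be sketched in plain Python as follows. This is a hedged illustration, not the code from the walkthrough: the deduplication rule (keep the first pair for any repeated query or episode), the toy word-overlap scorer standing in for a model, and the example pairs are all assumptions.

```python
def build_samples(pairs):
    """Turn (query, episode) pairs into query / positive / negative dicts."""
    unique_pairs = []
    seen_queries, seen_episodes = set(), set()
    for query, episode in pairs:
        # Keep only the first pair for any repeated query or episode.
        if query in seen_queries or episode in seen_episodes:
            continue
        seen_queries.add(query)
        seen_episodes.add(episode)
        unique_pairs.append((query, episode))
    episodes = [episode for _, episode in unique_pairs]
    return [
        {"query": q, "positive": [pos], "negative": [e for e in episodes if e != pos]}
        for q, pos in unique_pairs
    ]


def mrr_at_k(samples, score, k=5):
    """Mean reciprocal rank of the first positive within the top k candidates."""
    total = 0.0
    for sample in samples:
        candidates = sample["positive"] + sample["negative"]
        ranked = sorted(candidates, key=lambda doc: score(sample["query"], doc), reverse=True)
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc in sample["positive"]:
                total += 1.0 / rank
                break
    return total / len(samples)


def word_overlap(query, doc):
    """Toy scorer: number of shared lowercase words (a stand-in for a real model)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


pairs = [
    ("healthy eating over the holidays", "Tips for healthy eating during winter holidays"),
    ("holiday food tips", "Tips for healthy eating during winter holidays"),  # repeated episode
    ("training for a first marathon", "A beginner marathon training plan"),
    ("how to start investing", "Investing basics for beginners"),
]
samples = build_samples(pairs)
print(len(samples), mrr_at_k(samples, word_overlap, k=5))
```

A score of 1.0 means the positive episode was ranked first for every query; a positive that falls outside the top K contributes zero.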

So this is the MRR@5 for the model without any fine-tuning, so we get 0.68. Okay, so at this point we're actually ready to go ahead and fine-tune our model, and this is basically the easiest step. We set the number of epochs to one. I did try more epochs; typically with sentence-transformers you only really need to train for one epoch, depending on the model and the data, and anything more actually degraded the performance. So one epoch here, and we typically use a number of warm-up steps, so here for the first 10% of training steps the learning rate is going to be slowly increasing up to the default learning rate, which I think is like 1e-5 or something along those lines. And then we're just saving the model that we train into this directory here.
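
The warm-up behaviour described above can be sketched as a small schedule function. This is a minimal sketch under assumptions: it uses a linear warm-up over the first 10% of steps followed by a linear decay to zero, and the peak learning rate of 1e-5 is just the rough figure mentioned above, not a confirmed default. The function name and the step counts are hypothetical.

```python
def warmup_linear_lr(step, total_steps, peak_lr, warmup_fraction=0.1):
    """Learning rate at a given step: linear warm-up, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Ramp up from 0 towards peak_lr over the warm-up steps.
        return peak_lr * step / warmup_steps
    # Decay from peak_lr down to 0 over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


total = 1_000
for step in (0, 50, 100, 550, 1_000):
    print(step, warmup_linear_lr(step, total, peak_lr=1e-5))
```

The point of the warm-up is that the earliest updates, made while the optimizer state is still settling, use a smaller step size than the rest of training.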

So this is distiluse-podcast-nq, where NQ here is natural query. And then, yeah, that trains. It shouldn't take too long; I think maybe this took an hour, maybe two hours to train, so it's pretty quick. And with that we can actually look at the evaluation: how did this model perform on the evaluator that we initialized at the start? And we can actually see that.

So I'm in another environment here, where I actually trained the model. We go to this distiluse-podcast-nq folder, and then here we'll see this eval directory. Go in here, open this, and we can see we have, over on the right here, the MRR@5 value, and it's 0.888. So that is a big improvement from what we had up here, which is 0.68.

So it's a 20-point increase, which is really good. That's only on small batches, though, so that doesn't really replicate the real-world use of this model. What we want to do in the final evaluation step is replicate that. So what we're going to do is take the test data we had, which has many thousands of episodes in there; we're going to encode all of those, and then we're going to perform a semantic search as we would expect this to be used in real life.

So what we'll do is set up the vector database with thousands of episode vectors in there from our test dataset, and then we're going to calculate the recall value this time for our fine-tuned model and also our non-fine-tuned model. So we initialize Pinecone and create an evaluation index here.

Make sure we use the cosine metric. We connect to that. Oh, if you need an API key, you need to go to pinecone.io; it's free. So this is all free; you don't need to worry about anything there. We create our index, we connect to it here, and then, before we index our test data, we remove duplicates like we did with the evaluation set earlier. Same thing.
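
For intuition about what the cosine metric does inside the index, here is a small in-memory stand-in written in plain Python. It is a sketch only: a dictionary plays the role of the vector database, and the IDs, vectors, and function names are hypothetical rather than anything from the walkthrough or the Pinecone client.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, index, k=3):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy "index": episode ID -> embedding (real embeddings have hundreds of dimensions).
index = {"ep-1": [1.0, 0.0], "ep-2": [0.7, 0.7], "ep-3": [0.0, 2.0]}
print(top_k([10.0, 1.0], index, k=2))
```

Because cosine similarity depends only on direction, a long query vector and a short episode vector pointing the same way still count as a close match.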

I'm not going to go through it again. And then here, what I'm doing is going through again, just to make sure there are definitely no duplicates in there (there shouldn't be anyway, but just in case), and creating batches of episodes to actually encode using our new model. This is using the distiluse-podcast-nq model, and then we upsert them, so insert them into our Pinecone vector database, and then just refresh the batch and do it again, all in batches. And then at the end I just look at the index stats, so you can see how many vectors we actually have in there, which is 18 and a half thousand. Then what I'm doing here is calculating the recall at K. So I'm going to loop through all the queries from our test set, so these are synthetic queries first, and just look at, okay, what is the recall using those synthetic queries?

It's 0.88, which is incredibly good, but we're using synthetic queries here, so it's not really representative of real human queries. This is where the curated queries come in. So I went through and found different episode descriptions, which were at indexes 1, 8, 14, and so on, and I just created a query for each one of those episodes that makes sense to me. Okay, now this isn't perfectly accurate: we could use this and we could return another episode.

That other episode could actually be more relevant, but using what I've done here it would show up as not a match and bring down the score, so just bear that in mind as well. Once we use those curated, true query-episode pairs, or more human-like queries, we get a recall at K of 0.57. Now, it's definitely not as impressive as the 0.88 we had before; it's okay, but it's nothing special. But what I want to do here is just compare this to what we had before, so the pre-trained model without fine-tuning. Okay, so to do that I'm just creating a new index. If you're using the free version of Pinecone, you can only have one index at a time.

So you will have to delete an index, you know, so you can then refresh it and do this. So what I'm going to do is initialize the pre-trained model that hasn't been fine-tuned on our particular dataset, and here I am creating a new index, eval zero, connecting to that index, and going through the exact same thing as I did before. Okay, so just pushing all of those vectors, or episode embeddings, that have been encoded using the older model that hasn't been fine-tuned, and then calculating the recall.

Okay, again using the curated dataset, and we get a terrible score of 0.28. So we can see there's actually a massive improvement from 28 to 57; that's a 29-point improvement, which is huge, especially to say that our training data was synthetic. Right, a synthetic dataset is never going to perform as well as a genuine dataset, so if we had those other two data sources that Spotify use, we could probably create something that is very impressive compared to what we've done here. But to say we just used episode data without any queries and synthetically created queries, I think this is really impressive, and it shows that Spotify's methodology, at least in terms of adding these synthetic queries, definitely does contribute to their overall model, and it shows us a really cool way to do this without any real training data.
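
The recall-at-K comparison between the two models can be sketched in plain Python like this. It is a minimal sketch, assuming one relevant episode per query and a search function that returns ranked episode IDs; the lookup tables below are invented toy data, not the results from the walkthrough.

```python
def recall_at_k(queries, search, k=5):
    """Fraction of queries whose one relevant episode appears in the top k results.

    queries: list of (query_text, relevant_episode_id)
    search:  callable (query_text, k) -> ranked list of episode IDs
    """
    hits = sum(1 for text, relevant_id in queries if relevant_id in search(text, k))
    return hits / len(queries)


# Hypothetical curated queries, each paired with the episode it was written for.
curated = [("q1", "e1"), ("q2", "e2"), ("q3", "e3"), ("q4", "e4")]

# Invented ranked results standing in for the two indexes.
baseline_results = {"q1": ["e9", "e1"], "q2": ["e7", "e8"], "q3": ["e3", "e5"], "q4": ["e6", "e2"]}
tuned_results = {"q1": ["e1", "e9"], "q2": ["e2", "e7"], "q3": ["e5", "e3"], "q4": ["e6", "e2"]}


def make_search(results):
    """Wrap a lookup table as a search function with the expected signature."""
    return lambda text, k: results[text][:k]


baseline_recall = recall_at_k(curated, make_search(baseline_results), k=2)
tuned_recall = recall_at_k(curated, make_search(tuned_results), k=2)
print(baseline_recall, tuned_recall)
```

As noted earlier, this metric undercounts when a search returns a different episode that is also relevant, since only the paired episode counts as a hit.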

Another thing that I think we should really think about here is how well does this scale? So we're only using 18 and a half thousand vectors in this evaluation case. Using Pinecone, we're quite easily able to go up to hundreds of millions of episodes, or vectors, in there, and even billions.

So in terms of scalability, well, unless you're going into this sort of trillion scale, this is very possible with pretty much every dataset that you're going to be using, and it will still be incredibly fast, which I think is really cool. I think that's really awesome, the fact that we can do this with no data, make it incredibly scalable, and build something really, really awesome.

So that's it for this walkthrough. I will say, if you are going to build something like this with Pinecone, doing all this sort of cool vector search stuff, let me know, because over at Pinecone we are looking for people that would like to showcase the work that they're doing. It would be really cool to see what you're building, and if you'd like to share that and get other people seeing what you're building, I think that's a really good way to do it. So if you're interested in that, just go over to the community page at Pinecone, and you'll be able to find how to submit any projects that you're working on there. So I hope all this has been useful and interesting. Thank you very much for watching, and I will see you again in the next one.

Bye