Spotify's Podcast Search Explained
Chapters
0:00 Intro
4:16 NLP in Semantic Search
8:35 Why Now?
9:29 Transformer Models
11:52 Sentence Transformers
13:12 Vector Search
15:56 How Spotify Built Podcast Search
17:35 Data Source, Fine-tuning, and Eval
22:58 Code Implementation, Dataset
24:44 Data Preparation
26:39 Query Generation
29:54 Fine-tuning a Podcast Model
41:40 Evaluation
48:05 Does it Scale?
49:00 Sharing Your Work
In the past few years, podcasts have become increasingly popular, and leading the charge in podcasts is Spotify. Spotify only entered the podcast market very recently, in 2018, yet by 2021 they had already usurped Apple as the leader in podcast listenership. Apple had been doing this for a very long time, since, I think, 2008 or maybe even earlier, but despite that, Spotify is now ahead of everyone. And to back their investment in podcasts, they've also invested heavily in the technology that powers the different components of the podcast experience, whether that's Anchor, an all-in-one app that allows people to record and publish podcasts, or what we're going to focus on: natural language search for podcasts.
Spotify's natural language search, or semantic search, for podcasts is, I think, a really interesting use case because it enables a more intuitive search experience. It allows us to search through their podcast catalog, which is huge, using natural language queries rather than the typical keyword or term matching you'd find in most places. Before, when we were searching, we had to know the exact words or terms we were looking for. But as humans, we don't really think in terms of exact words; our language is about concepts, ideas, and meaning, not term matching or keyword matching. So it makes sense that we might want to create a search experience that captures the meaning in language rather than just the terms and words.
Imagine we wanted to find a podcast that talks about eating healthily over the winter holidays. We might search something like "eat better during Xmas holidays". Now, in the data we're going to be using, there is a podcast description that talks about exactly this: Priya Alexander on how to stay healthy over Christmas, and about her letter to patients. If you compare those two, they don't share any of the same words, despite being about pretty much the same thing. So if you try to use this as a term-matching query, it's not going to work. But if we swap that for a natural language query, a semantic search, the meanings overlap; the genuine human meaning behind those two phrases is very similar. When semantic search is done properly, it is able to identify that the meaning behind the two does overlap: we're talking about the Christmas holidays, being healthy, eating better.
Enabling meaningful search in this way is not easy. So what I want to do is look at how Spotify have done this, and then actually replicate it and see what we can do as well. The technology powering this search has two parts: the NLP, or natural language processing, side, and the vector search component. These technologies can be seen as two steps in the search process. First, a language model (this is the NLP side) takes the query and converts it into something we call a dense vector, a numerical representation of the meaning behind whatever that query is. That sounds confusing, but we'll see that it's not. These dense vectors can then be compared as you would compare normal vectors. Imagine you have two points in a particular space and you want to calculate the distance between them: we can do exactly that, but with the meaning behind these queries.
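To make that concrete, here's a minimal sketch of encoding two queries and comparing the resulting dense vectors. The model name is just a convenient example, not the one Spotify used or the one we fine-tune later:

```python
# A minimal sketch: two queries that share meaning but no keywords still
# score high when compared as dense vectors.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

a = model.encode("eat better during Xmas holidays")
b = model.encode("how to stay healthy over Christmas")

print(util.cos_sim(a, b))  # cosine similarity of the two dense vectors
```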
Let's have a look at a visual example. Typically these dense vectors have many dimensions; you're looking at 700-plus dimensions in a lot of models. Here I've used PCA to reduce the dimensionality of those very high-dimensional vectors down to a 3D space, just so we can visualize it while still roughly maintaining the relationships between the vectors. If we look here, we can see a grouping of vectors: these three over here. Let's zoom in a little and have a look at what they are.
We have "podcasts about cooking and writing", "a chat with a chef who wrote books", and "interview with cookbook author". They're all about cooking and writing, and they've all been grouped together. That's because our language model has taken these queries and encoded them in such a way that the vector representations of similar queries end up close together. The meaning of all of these, although not identical, is very similar (cooking, writing books, and so on), and we can see that represented in the proximity of the vectors.

Now let's look at something that's close but not within that same cluster. Over here we have "how do I cook great food?", so we're talking about cooking again, and then further over here we have "eat better during Christmas holidays". "How do I cook great food?" sits in between: it's about cooking, like the cooking-and-writing cluster, while "eat better during Christmas holidays" is specifically about food, so it lands in the middle of them both. Then we have a few more over here about writing podcasts, and among the most similar vectors are "how to tell more engaging stories" and "how to keep readers interested", again talking about writing. And up here, by itself, we see "superhero film and arts", which is obviously very different from everything else we've looked at. That's what I mean by vectors that capture meaning.
All of those vectors were encoded by a special kind of NLP model called a sentence transformer. That sentence transformer has been specially trained to do what we just saw: group similar things together and separate dissimilar things. Once we have those vectors from our sentence transformer, we need a way to compare them, and that's where the vector search component comes in. Imagine you make a query: we convert that query into a vector, place it within that vector space, and then search for the most similar vectors, returning those as the results for your particular query.
NLP and vector search have both been around for a long time, but both fields have had very recent developments that acted as catalysts for the performance increases, and subsequent adoption, of semantic search. In NLP, we've had the introduction of transformer models; in vector search, approximate nearest neighbor search algorithms. Transformers and approximate nearest neighbor search have powered the growth of semantic search, but it's not inherently clear what either of those are. I'm assuming most of you probably have no idea what either of those are; if you do, great, but otherwise, no problem, because we're going to go through both of those concepts as well.
Starting with transformer models: transformer models typically consist of two components, which is quite important for us to know. There's the core of the transformer, and then you usually have a head, which adapts the transformer to a particular task. Now, there's just one problem: the core of these transformer models is huge. They're massive models, and it is far too expensive for most of us to train the core of a model from scratch. Take BERT, which is one of the most popular language models, and not even a particularly large one: its training cost was reported at, I think, two and a half thousand to fifty thousand dollars for a small one, and for the larger BERT model that shifts up to eighty thousand to 1.6 million dollars.
So how are transformers useful to us if we can't afford to train them? Well, the way we typically go about using transformer models is this. The core is pre-trained by the likes of Microsoft or Google; it costs them a lot of money to train it; then the model is made publicly available. Other organizations take this transformer core and add different heads to it; these heads are like a final few layers at the end of the model. Using that extended model with the head, the model can be fine-tuned using a lot less computing power, which brings it within the realm of feasibility for other, more normal organizations. With that extended transformer model, you can go ahead and actually apply it to your specific task.
In the case of our example, podcast search, we might want to take a BERT model, like BERT-base, which has been pre-trained by Google. We would take that and add what's called a pooling head onto the end of it, and that converts it into what we call a sentence transformer. It takes an input like a normal transformer, some sort of text input, and the normal transformer outputs a set of token-level embeddings. Now, we can't use token-level embeddings directly when comparing sentences, because a single sentence (one input sequence) for BERT can contain up to 512 tokens. We really need a way to compress that down into one single vector representing the full input that was given to our transformer model. That's where the pooling layer comes in: it takes all of those token-level embeddings and compresses them into a single vector, which is our sentence vector or sentence embedding. There are different types of pooling layers, and we'll talk about one later, but they all consume the same input and output that single embedding. It is this sort of model that we refer to as a sentence transformer, and that sentence embedding at the end is what we referred to earlier as a dense vector: the numerical representation of whatever we fed into the transformer model.
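Here's a minimal sketch of that construction using the sentence-transformers library: a pre-trained transformer core plus a mean-pooling head. BERT-base is used purely for illustration; it isn't the model we fine-tune later:

```python
# A sketch of building a sentence transformer: transformer core + pooling head.
from sentence_transformers import SentenceTransformer, models

bert = models.Transformer("bert-base-uncased", max_seq_length=128)
pooling = models.Pooling(
    bert.get_word_embedding_dimension(),  # 768 for BERT-base
    pooling_mode="mean",  # average all token embeddings into a single vector
)
model = SentenceTransformer(modules=[bert, pooling])

vec = model.encode("eat better during Xmas holidays")
print(vec.shape)  # one 768-dimensional dense vector for the whole input
```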
Given that dense vector, we move on to the next step, which is vector search.
Approximate nearest neighbor search allows us to search efficiently through a lot of vectors, because in many use cases, Spotify for example, think about how many episodes they have on the platform. It takes a very long time if you compare your query to every one of those vectors; in practice it's infeasible at scale if you're using a typical k-nearest neighbors search, otherwise known as exhaustive search. No matter what hardware you run, it's going to take a long time. Approximate nearest neighbor search lets us speed that up by, quite literally, approximating the answer. Good approximate nearest neighbor algorithms let you approximate the answer with a very high degree of accuracy, something like 99%, while making your search incredibly fast: sub-second, sub-half-a-second, and in a lot of cases in the 10-millisecond range. It's worth noting that across these different algorithms there's usually a trade-off between speed (your search times) and accuracy: you can have an even faster search where the results are not quite as good, returning maybe 70% accuracy rather than 99%. There's always some trade-off between the two, but the algorithms now are generally very good, and you can get incredible accuracy and incredible speed.
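To make that trade-off tangible, here's a tiny sketch using hnswlib, an ANN library chosen purely for illustration (it isn't what Spotify uses, and the parameter values are arbitrary). The ef parameter is the knob: higher means better recall but slower queries:

```python
# A sketch of approximate nearest neighbor search with HNSW (via hnswlib).
import hnswlib
import numpy as np

dim = 512
vectors = np.random.rand(100_000, dim).astype(np.float32)  # dummy embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors)

index.set_ef(50)  # higher ef = better recall, slower queries
labels, distances = index.knn_query(vectors[:1], k=10)  # top-10 neighbors
```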
So if we merge those two together, our sentence transformers and our approximate search, we now have a toolset capable of performing semantic search at incredibly large scales. What I want to do now is look specifically at how Spotify did it.
To build this kind of search, Spotify needed a model that encodes both queries and episode metadata, or descriptions, into the same vector space. Now, there are existing pre-trained sentence transformer models, but Spotify found a few issues with those, specifically with SBERT (Sentence-BERT). They needed a model capable of supporting multilingual queries, because Spotify obviously has a lot of content from all over the world, and they couldn't use SBERT because it has been trained on English-only data; its performance on other languages is not great without further fine-tuning. So the out-of-the-box, pre-trained Sentence-BERT model couldn't be used; it didn't satisfy what Spotify needed. With that in mind, they decided to start with the pre-trained Universal Sentence Encoder (USE) model instead. This covered the multilingual issue, and yes, the USE model still needs to be fine-tuned, but they were going to need to do that anyway, so it's not really an issue.
So let's have a look at what sort of data they used to fine-tune their model. Of course, Spotify has a ton of search logs, and they took those search logs and used them to create two different data sources. The first was simple: for each successful search, take the query from that search and the episode it led to, and that created the first little data source you see here. The second is a bit more interesting. Whenever they found an unsuccessful search that was immediately followed by a successful search, they kept the unsuccessful query. The logic behind that: if you're typing something, searching, and it doesn't work, you probably used a more natural-feeling, natural-language query; when it fails, you change it to be a bit more robotic, to fit what you expect a keyword search to match. So they took that unsuccessful search query and the episode the user found after their subsequent successful search, and put those together to create the second little data source.

I've used dotted lines here because we can't replicate these: we don't have Spotify's past search logs. But be aware that the way they would be used is exactly the same as the third data source.
That third data source, over here, is the one we are going to replicate. It's built by taking podcast episodes (the little diamond over here) and transforming them. By transforming, I mean taking the podcast show title and description, plus the episode title and description, and concatenating all of that together to create a sort of combined episode description. We then feed that episode data into a query generation step. Query generation is super interesting: we use a query generation model to create synthetic queries for our episodes, and that produces this third data source, which is synthetic query-to-episode pairs. We'll split that into a training set and use it to fine-tune a model, in the same way those other two data sources would have been used, and we'll be fine-tuning a pre-trained universal sentence encoder model. In the end, what we'll have is this podcast universal sentence encoder model, which we can then use to encode both our queries and our episodes.
There's one other data source. We have our podcast episodes over here; we place them into that transform step and create the podcast episode descriptions. The other data source you see over here is manually created: Spotify did this and used it purely for evaluating the model; it was not used in training. They manually curated a set of queries for particular podcast episodes. We will do the same; we'll write, I think, around seven queries, see how that works, and use those to evaluate our models.
Once we've done all of that, we come to the evaluation, and also to how you would actually use this. You take the fine-tuned podcast model, bring it over here into the middle, and take the pairs from your query-generated data. You encode everything and place it into a vector search engine; in this case we're using Pinecone. This allows us to store all of our episodes as embeddings, and then we can query and see whether we return the correct episode for each of our synthetic query-episode pairs. We'll also do the same using the manually curated query-episode pairs. And this is exactly what you would do in production as well: after you've evaluated, you take all of your episode embeddings, put them in the vector database (the Pinecone index at the bottom there), and then let your users query. The query goes into your podcast sentence transformer model and is then passed to your vector database, which identifies the most similar episodes. So that's how it all works at a very high level. If it's confusing, no problem; we're going to work through all of it. Now we'll move on to the actual implementation.
The first thing we need is a dataset that replicates, as closely as possible, the episode data Spotify used. We'll need Kaggle for this: you can install the Kaggle API using pip install kaggle, and then you'll need an account, so sign up on kaggle.com and you can get an API key. You just come over here, click on your profile in the top right, go to (I think) Account, and come down to the API section, where you can create a new API token; that downloads a kaggle.json file. Then, if you try to import kaggle (so you run import kaggle), it will say you are not authenticated, because you don't have kaggle.json in the expected directory. All you do is put your kaggle.json in the directory specified by the error. Once you've done that, you should be able to import kaggle without any errors, and we can move on to the data download.
So we just authenticate our API; I'm not going to go too far into depth here, and there will be a link to this notebook so you can read through everything. All we're going to do is download these two datasets. Let me show you the dataset on Kaggle: it's all podcast episodes published in December 2017, from a company called Listen Notes. We download the podcasts and episodes data and extract them, because they'll be in zip files, and we can see that podcasts contains the details of the podcast shows themselves, while episodes contains the details of the individual episodes.
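A sketch of that download step with the Kaggle API; the dataset slug is my best guess from the dataset's title, so double-check it against the Kaggle page:

```python
# Downloading the Listen Notes data via the Kaggle API.
import kaggle  # raises an error if kaggle.json is not in place

kaggle.api.authenticate()
kaggle.api.dataset_download_files(
    "listennotes/all-podcast-episodes-published-in-december-2017",  # assumed slug
    path="./data",
    unzip=True,  # extracts the podcasts and episodes CSVs
)
```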
What Spotify did was concatenate the show title and description with the episode title and description to create a single episode description, so we do the same. We merge the two tables on the unique podcast ID and get a data frame with everything we need inside it. I'm also doing a little bit of data cleaning here: stripping whitespace from the features we care about (the show title and description, and the episode title and description), and then removing rows with null values in any of those columns. We have a lot of data here anyway, and we don't even use all of it, so that's not a problem. Then I concatenate everything, putting a full stop between each field, and convert that into a list, so we end up with a list of these episode texts. You'll see they vary a lot; it's not the cleanest data.
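Something like this sketch covers that preparation; the exact column names in the CSVs are assumptions, so adjust them to match the files:

```python
# A sketch of the data preparation: merge shows with episodes, clean,
# and concatenate into one text per episode.
import pandas as pd

podcasts = pd.read_csv("data/podcasts.csv")
episodes = pd.read_csv("data/episodes.csv")

# Join each episode to its parent show on the podcast ID (names assumed)
df = episodes.merge(
    podcasts, left_on="podcast_uuid", right_on="uuid",
    suffixes=("_ep", "_show"),
)

features = ["title_show", "description_show", "title_ep", "description_ep"]
df[features] = df[features].apply(lambda col: col.str.strip())  # trim whitespace
df = df.dropna(subset=features)  # drop rows missing any feature

# One combined text per episode, fields separated by full stops
episode_texts = df[features].agg(". ".join, axis=1).tolist()
```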
One thing to bear in mind: shuffling is going to randomize the order, so later on, when we get to the curated queries, the matching episodes won't necessarily be in your test set. You may need to make up some new queries of your own.
At that point we have our episode data, and we need to move on to query generation, because we need query-episode pairs. Spotify fine-tuned a BART model on MS MARCO. I'm not going to fine-tune a model, because there are, honestly, so many query generation models that have already been fine-tuned on MS MARCO, including BART models, that I don't think there's any point in us fine-tuning one just to replicate what already exists. I tested a few different BART and T5 query generation models, and I found this one to be the best for this particular use case. It also has some multilingual support; I'm not sure how much, but it does manage to produce queries that make sense in different languages, which is definitely useful and aligns nicely with what Spotify wanted.
Then I'm using the transformers library, which is Hugging Face transformers (you can pip install transformers if you need to), and initializing the tokenizer and the model. This is going to handle our query generation. At the end here you'll see I move the model to CUDA: that's because I have a CUDA-enabled GPU, which makes things a lot faster, because this can take a long time.
So we move on to the query generation loop. It's fairly long, sorry, and this isn't the quickest way you can do it. A larger batch size means faster processing, but it's limited by the size of your GPU. Then there's the number of queries I'm generating for each episode: Spotify didn't specify what they used here, but I'm using three, because that's in line with the approach taken by other query generation techniques, GenQ and GPL. Then we just go through everything in batches: we tokenize each batch, generate queries, and decode those queries back to human-readable text, because the model just outputs token IDs. Then we loop through, pair each query with its episode, and place those together as synthetic query-episode pairs in this pairs list. We can see a few here: a query followed by its episode, and so on. You can also see the multilingual support in action here.
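Here's a sketch of that loop. The checkpoint name is illustrative (an MS MARCO doc2query-style model), not necessarily the exact one used in the notebook:

```python
# A sketch of batched synthetic query generation with a seq2seq model.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "doc2query/msmarco-t5-base-v1"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
gen_model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

pairs = []
batch_size = 64   # limited by GPU memory
num_queries = 3   # three synthetic queries per episode, as in GenQ/GPL

for i in range(0, len(episode_texts), batch_size):
    batch = episode_texts[i : i + batch_size]
    inputs = tokenizer(
        batch, truncation=True, padding=True,
        max_length=384, return_tensors="pt",
    ).to("cuda")
    with torch.no_grad():
        outputs = gen_model.generate(
            **inputs, max_length=64, do_sample=True,
            top_p=0.95, num_return_sequences=num_queries,
        )
    queries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for j, query in enumerate(queries):
        pairs.append((query, batch[j // num_queries]))  # (synthetic query, episode)
```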
We now have a data source we can use for fine-tuning a model. Spotify talk about this in their article, which is here. They tried BERT, but as I said, they decided it wasn't suitable, so we're not using it either, for the multiple reasons I mentioned before. What they used instead is this universal sentence encoder model, and we're going to do something similar. One thing I will note is that they used TensorFlow Hub. We could use TensorFlow Hub too, but then we'd be stuck with TensorFlow, which makes our lives harder if we want to use Hugging Face, PyTorch, and the sentence-transformers library. We can't get the universal sentence encoder itself as a pre-trained model from the sentence-transformers library, but there is a distilled universal sentence encoder model, which is literally a smaller version of USE. So we come down here, and we're using this one: distiluse-base-multilingual-cased. It's multilingual, and it's still pretty much the same model as what Spotify used. Looking at the model details: the maximum number of tokens it accepts is 128, and it outputs a 512-dimensional dense vector, the meaningful vector we spoke about earlier. By the way, if you need to install sentence-transformers, you can just run pip install sentence-transformers in your terminal, or with an exclamation mark at the start in a notebook.
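Loading it looks like this; whether you grab the v1 or v2 checkpoint is up to you (v2 here is my assumption):

```python
# Loading the distilled multilingual USE model from sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")
print(model.max_seq_length)                      # 128 tokens max input
print(model.get_sentence_embedding_dimension())  # 512-dimensional vectors
```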
When we're fine-tuning with the sentence-transformers library, we need to reformat our input data, our pairs, into a list of InputExample objects. The format of an InputExample varies depending on which task you're fine-tuning with. We're using a ranking task to optimize our model, and that means we only need the query and the episode. So all we do in that case is build the training list down here, and a test set as well, for later on when we're evaluating things. All we need in our InputExample, which we've imported from sentence_transformers, is the query and the episode. It's literally just a list of these InputExample objects, each with a query and an episode inside, and you can see the sample size we have there.
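A minimal version of that formatting, with an assumed train/test split size:

```python
# Formatting (query, episode) pairs for a ranking objective; no labels needed.
from sentence_transformers import InputExample

train_pairs = pairs[:-1000]  # hold out some pairs for evaluation (split assumed)
test_pairs = pairs[-1000:]

train_examples = [
    InputExample(texts=[query, episode]) for query, episode in train_pairs
]
```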
As I mentioned, we're going to be using a ranking optimization function. What does that look like? Given a specific query, like the one we used earlier, "eat better during Christmas (or Xmas) holidays", our model is given a set of episodes, and it needs to rank the true episode as number one in terms of score. So the true pair for this query is this one here, and the model has to work out that all the other episodes are not real pairs, or are less semantically similar to this particular query. The model learns to give higher scores to more semantically similar episodes for a particular query, and lower scores to the more dissimilar episodes. That's why we don't need a label in this case: we don't need a similarity label or anything else when training the model, which is a really nice benefit of using ranking functions. You literally just need pairs of data and you can train with that.
The way the model compares queries and episodes is by taking the query and converting it into a query vector, taking the episode and converting it into an episode vector, and then calculating the similarity between them. The way we are going to calculate similarity in this case is cosine similarity, so we're essentially calculating the angle between vectors. This here is our query vector, and this little cone is the region of most similar vectors. These two episode vectors here, or episode embeddings, are the two most similar; we've got top_k equals 2 over here because we're trying to return the two most similar. These other vectors, like this one for example, might actually be closer in terms of straight-line distance, but in terms of angular distance they're further away. That's why those two embeddings were selected rather than any of the others. So the model needs to create these vectors, and it needs to learn how to place queries and their episodes close together in vector space, and dissimilar pairs as far apart as possible.
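For reference, cosine similarity is just the normalized dot product, so it depends only on the angle between the vectors, not their lengths:

```python
# A bare-bones cosine similarity for two embedding vectors.
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```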
Another thing we need to consider, because we're using a ranking function that takes a query and a batch of episodes, is that within that batch of episodes we must not have any duplicates. When we're optimizing this model, we've got a query-episode pair, and we're telling the model: for this query, this episode is the right one. If the exact same episode appears again further down in that batch, we're telling the model that two identical episodes exist but one is right and one is not. Our model is just going to be confused; that doesn't make sense, and it can't learn from it. So we have to make sure there are no duplicates in any of the training batches. It's probably not a problem if you have the odd one, but we don't want any; we just want to be certain there's nothing in there. We can do that using the sentence-transformers NoDuplicatesDataLoader, which handles removing duplicates from batches for us.
We use a batch size of 64. Consider the training optimization function here: we take a query and then a batch of episodes. Imagine we have a batch size of three, or even two. With a batch size of two we're comparing our one query against just two episodes, so our model, even if it's randomly guessing, gets the right answer around 50% of the time. If we increase that to a hundred, a randomly guessing model performs far worse. The consequence is that a larger batch size for this particular training method typically results in a better-performing model, because it makes the ranking task harder; the model must genuinely get better in order to rank everything accurately. So we increase the batch size as much as our hardware allows. After that we can initialize the loss function, and this is where the ranking optimization comes in: in sentence-transformers, the ranking function is MultipleNegativesRankingLoss.
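Putting the batching and the loss together, a sketch of that setup (it reuses the model and train_examples from above):

```python
# NoDuplicatesDataLoader keeps identical episodes out of a batch;
# MultipleNegativesRankingLoss treats every other episode in the batch
# as a negative for each query.
from sentence_transformers import losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

batch_size = 64  # as large as the hardware allows
loader = NoDuplicatesDataLoader(train_examples, batch_size=batch_size)
loss = losses.MultipleNegativesRankingLoss(model)
```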
There's just one more step before we actually move on to fine-tuning. If we look over here at Spotify's article and go down to offline evaluation, they use two types of metrics to evaluate the model: in-batch metrics and full-retrieval metrics. We're going to do both. For the in-batch metrics, that means calculating recall and MRR at the batch level. What we can do to replicate that is use the RerankingEvaluator from sentence-transformers, and we'll use it to perform ranking and calculate the MRR metric. One thing: just as we removed duplicates from the training batches, we need to do the same before we feed data into our reranking evaluator, and there are a lot of duplicates, because we create three queries per episode. So we remove any duplicates using this, and in the end we have a thousand unique pairs for our evaluation set. Then we feed them into the reranking evaluator, which requires a particular format: a query, then any positives we have (we just have one positive per query here, in a single-item list), and then negatives, where we pass in all of the other episodes that are not the positive episode for that particular query. We do that here, and then we initialize the reranking evaluator.
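A sketch of that setup; eval_pairs stands in for the deduplicated, held-out (query, episode) pairs:

```python
# In-batch style evaluation with the RerankingEvaluator.
from sentence_transformers.evaluation import RerankingEvaluator

samples = []
for query, episode in eval_pairs:  # eval_pairs: deduplicated held-out pairs
    samples.append({
        "query": query,
        "positive": [episode],  # the one true episode for this query
        "negative": [ep for _, ep in eval_pairs if ep != episode],
    })

evaluator = RerankingEvaluator(samples, mrr_at_k=5)
# Scores the model; metrics (including MRR@5) are also written to a CSV
print(evaluator(model, output_path="."))
```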
I'm not going to go into how the score is calculated, because that takes a little more time, but calculating the performance, this is the MRR@5 for the model without any fine-tuning: we get 0.68.
At this point we're ready to go ahead and fine-tune our model, and this is basically the easiest step. We set the number of epochs to one. Typically with sentence transformers you really only need to train for one epoch, depending on the model and the data; anything more and performance can actually degrade. So it's one epoch here, and we also, as usual, use a number of warm-up steps, during which the learning rate slowly increases up to the default learning rate, which I think is 1e-5 or something along those lines. Then we save the trained model into this directory here, distiluse-podcast-nq (nq for "natural query"). It shouldn't take too long; I think this took an hour, maybe two, to train.
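The fit call itself, as a sketch; the warm-up fraction is my assumption, and the output path matches the directory mentioned above:

```python
# Fine-tuning the sentence transformer on the synthetic pairs.
warmup_steps = int(len(loader) * 0.1)  # ramp the learning rate up slowly

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,                  # one epoch is usually enough here
    warmup_steps=warmup_steps,
    evaluator=evaluator,       # logs MRR@5 to a CSV in the output path
    output_path="distiluse-podcast-nq",
)
```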
With that done, we can look at the evaluation: how did this model perform on the evaluator we initialized at the start? In another environment, where we actually trained the model, we go to the distiluse-podcast-nq directory, open the evaluation file, and over on the right here we can see the MRR@5 value: 0.888. That's a big improvement from the 0.68 we had before, so it's a significant jump.
But that doesn't really replicate the real-world use of this model, so what we want to do in the final evaluation step is replicate that. We're going to take the test data, which has many episodes in it, encode all of those, and then perform a semantic search as we'd expect it to be used in real life. We'll put all the episode vectors from our test dataset into an index, and then calculate the recall value, this time for both the fine-tuned and the not-fine-tuned model. So we initialize Pinecone and create an evaluation index here, making sure we use the cosine metric, and we connect to that. If you need an API key, go to app.pinecone.io; it's free, so you don't need to worry about anything there.
Then, before we index our test set, we remove duplicates like we did with the evaluation set earlier; it's the same thing, so I'm not going to go through it again. Here I'm going through once more just to make sure there are definitely no duplicates (there shouldn't be, but just in case), creating batches of episodes, encoding them with our new distiluse-podcast-nq model, and upserting them, so inserting them, into our Pinecone vector database, then refreshing the batch and doing it again, all in batches. At the end we look at the index stats, so you can see how many vectors we actually have in there.
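A sketch of that indexing step, using the older pinecone-client API from the time of this walkthrough; the index name, environment, and position-based ID scheme are all assumptions:

```python
# Encoding test episodes and upserting them into a Pinecone index.
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
pinecone.create_index("evaluation", dimension=512, metric="cosine")
index = pinecone.Index("evaluation")

batch_size = 64
for i in range(0, len(test_episodes), batch_size):  # test_episodes: deduped texts
    batch = test_episodes[i : i + batch_size]
    embeddings = model.encode(batch).tolist()
    ids = [str(i + j) for j in range(len(batch))]  # IDs from list position
    index.upsert(vectors=list(zip(ids, embeddings)))

print(index.describe_index_stats())  # confirm the vector count
```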
Then what I'm doing here is calculating the recall@K: I loop through all the queries from our test set, query the index, and check whether the true episode is returned. Using those synthetic queries, the recall is 0.88, which looks incredibly good; but we're using synthetic queries here, so it's an easier test than real usage. To get a more realistic number, I went through and found different episode descriptions, which were at indexes 1, 8, 14, and so on, and I created a query for each of those episodes, queries that make sense to me. Now, this isn't perfectly accurate: one of these queries could return another episode that's actually more relevant, but with what I've done here, that would show up as not a match and bring the score down. Once we use those curated, true query-episode pairs, the more human-like queries, we get around 0.57. That's definitely not as impressive as the 0.88 we had before; it's okay, but it's nothing special.
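The recall loop itself looks something like this; test_queries is a hypothetical list of (query, true episode ID) pairs, and the K value is an assumption:

```python
# recall@K over the test queries: is the true episode in the top K results?
top_k = 5

hits = 0
for query, true_id in test_queries:
    xq = model.encode(query).tolist()
    res = index.query(vector=xq, top_k=top_k)
    if true_id in {match["id"] for match in res["matches"]}:
        hits += 1  # the true episode was returned in the top K

print(f"recall@{top_k}: {hits / len(test_queries):.2f}")
```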
What I want to do now is compare this to what we had before: the pre-trained model without fine-tuning. To do that, I'm creating a new index. If you're using the free version of Pinecone, you can only have one index at a time, so you will have to delete your existing index first. Then I initialize the pre-trained model that hasn't been fine-tuned on our particular dataset, create the index (eval0), connect to it, and go through exactly the same thing as before: upserting episode embeddings encoded with the older, not-fine-tuned model, and then calculating the recall, again using the curated dataset. We get a much worse score of 0.28. So there's actually a massive improvement, from 0.28 to 0.57, a 29-point improvement, which is huge, especially considering that our training data was synthetic.
A synthetic dataset is never going to perform as well as a genuine one, so if we had those other two data sources that Spotify used, we could probably create something very impressive compared to what we've done here. But given that we just used episode data, without any real queries, and synthetically created the queries, I think this is really impressive. It shows that Spotify's methodology, at least in terms of adding these synthetic queries, definitely contributes to their overall model, and it shows us a really cool way to train a model without any real training data.
Another thing we should think about here is how well this scales. Using Pinecone, we're quite easily able to go up to hundreds of millions of episodes, or vectors, in there, and even billions. So in terms of scalability, unless you're going to trillion scale, this is very possible with pretty much every dataset you're going to be using, and it will still be incredibly fast, which I think is really cool. The fact that we can do this with no real training data, make it incredibly scalable, and build something genuinely useful is really awesome. So that's it for this walkthrough.
I will say, if you are going to build something like this with Pinecone, doing this sort of cool vector search, we are looking for people who would like to showcase the work they're doing, and it would be really cool to see what you're building. If you'd like to share it and get other people seeing what you've built, just go over to the community page at Pinecone, and you'll find out how to submit any projects you're working on there. I hope all of this has been useful and interesting. Thank you very much for watching, and I will see you again in the next one. Bye.