
Is GPL the Future of Sentence Transformers? | Generative Pseudo-Labeling Deep Dive


Chapters

0:00 Intro
1:08 Semantic Web and Other Uses
4:36 Why GPL?
7:31 How GPL Works
10:37 Query Generation
12:08 CORD-19 Dataset and Download
13:27 Query Generation Code
21:53 Query Generation is Not Perfect
22:39 Negative Mining
26:28 Negative Mining Implementation
27:21 Negative Mining Code
35:19 Pseudo-Labeling
35:55 Pseudo-Labeling Code
37:01 Importance of Pseudo-Labeling
41:20 Margin MSE Loss
43:40 MarginMSE Fine-tune Code
46:30 Choosing Number of Steps
48:54 Fast Evaluation
51:43 What's Next for Sentence Transformers?

Transcript

Today we're going to dive into what I think is probably one of the most interesting techniques in semantic search for a very long time. That technique is called generative pseudo-labeling, or GPL. It's from Nils Reimers' team at the UKP Lab, and it is primarily the work of Kexin Wang.

Now Kexin Wang also produced the TSDAE paper maybe around a year or so ago, and TSDAE is an unsupervised training method for sentence transformers. GPL takes that and pushes it up another notch. So we're going to be using GPL to take unstructured data and generate labeled data from it for training sentence transformers.

Now before we really dive deep into GPL, I want to give a little bit of background and set the scene of where we might want to use this sort of technique. So in 1999 there was a concept known as semantic web described by the creator of the World Wide Web, Tim Berners-Lee.

Now Tim Berners-Lee had this sort of dream of the web in the way that we know it today, but where you have machines roaming the web and being able to just understand everything. For machines to understand language like that, you need models, and models need training data; for broad, popular domains that sort of labeled data does exist. But when you start to become a bit more niche, it gets hard to actually find that sort of data set, and most of the time it's pretty much impossible.

So for example, let's say you're in the finance industry. You have these internal documents, fairly technical financial documents, and you want to build a question answering system where people, or staff, can ask a question in natural language and return answers from those documents. It's going to be hard for you to find a model: there are financial Q&A models out there, but they've been fine-tuned on things like personal finance questions from Reddit, which is kind of similar but not really.

Personal finance and the technical financial documentation inside a finance company are very different. So it's very hard to actually find a data set that is going to satisfy what you need in that scenario because it's fairly niche. And it's not even that niche; there are a lot of companies that need to do that sort of thing.

Another example is a project I worked on for the Dhivehi language. We wanted to create a set of language models for Dhivehi, which is the national language of the Maldives. Now there are not that many Dhivehi speakers worldwide, so it's pretty niche, and it's very hard to find labeled data. What we ended up doing was actually using unsupervised methods.

We used TSDAE for the actual sentence transformer model. So there we have another use case that's fairly niche and where it's hard to find labeled data. But there is unlabeled, unstructured data in both of those scenarios. In your finance company you have the documents there; they're just not labeled in a way that you can train a model with.

Or in the Dhivehi language example, there are many web pages in Dhivehi that we can scrape data from, but they're not labeled. So for these more niche topics, which cover the vast majority of use cases, you either need to spend a lot of money labeling your data, or we need to find something else that allows us to either synthetically generate data or just train on the unlabeled data.

Now training on the unlabeled data doesn't really work. It works to an extent, and TSDAE showed that it can, but you're looking at very generic semantic similarity and it's not perfect. GPL, again, is not perfect, but it's definitely much better. So back to the Semantic Web example: in the case of the Semantic Web, the majority of it is going to be these niche topics that we don't have labeled data for.

GPL is not a complete solution; it's not going to create the Semantic Web or anything like that. But it does allow us to actually target those previously inaccessible domains and begin producing models that can intelligently comprehend the meaning behind the language in those particular domains, which I think is super fascinating, and the fact that you can actually do this is so cool.

Now there is a lot to it. This video I'm sure will be quite long but we really are going to go into the details and I think by the end of this video you should know everything you need to know about GPL and you will certainly be able to apply it in practice.

So we've introduced where we might want to use it. I just want to now have a look at sort of like an overview of GPL and where we are going to use it in this video. So GPL is, you can use it in two ways. You can use it to fine tune a pre-trained model.

So when I say pre-trained, I mean a transformer model that hasn't been fine-tuned specifically for semantic search. So a BERT base cased model, for example, would be a pre-trained model from the Hugging Face Hub. You take that and you can use GPL to fine-tune it for semantic search. You'll get okay results with that.

The better way of using GPL is for domain adaptation. Domain adaptation is where you already have a semantic search model. So for example, what we are going to look at: we have a model that has been trained on MS MARCO data from 2018. This model, if you give it some questions about COVID-19, because 2018 was pre-COVID, really struggles with even simple questions.

You ask it a simple question about COVID-19 and it will start returning answers about things like Ebola and flu rather than COVID-19, because it's confused; it's never seen that before. So this is a typical issue with sentence transformers: they're very brittle, and they don't adapt to new domains very well whatsoever.

So GPL is best used in these sorts of scenarios where we want to adapt a model to a particular domain. Now, as I said, our example is going to be COVID. We're going to teach this model to understand COVID even though it's never seen anything about COVID before. So let's have a look at how GPL actually works.

So at a very high level, GPL consists of three data preparation steps followed by one fine-tuning step. So the data preparation steps are the steps that produce the synthetic data set that we then use in that fine-tuning step. So that fine-tuning step is actually a common supervised learning method.

The unsupervised, if you want to call it that, part of this method is that we are actually generating the text or the data that we'll be using automatically. We don't need to label it ourselves. So these three data preparation steps are the key to GPL. They are query generation, which is where we create queries from passages, so like paragraphs of text.

The second is negative mining. That is an information retrieval step where we retrieve passages that are similar to our positive passage (the passage from before) but do not actually answer our query, or that we assume do not answer our query. And then the pseudo-labeling step cleans up the data from those two earlier parts; it uses a cross-encoder model to identify the similarity scores to assign to the query and passage pairs that we've generated already.

So you can see that process here. We start with P+, which means positive passage, at the top. We pass that through a query generation model (we use T5 for this), and that generates queries or questions. That's passed through a dense retrieval step, the negative mining step, which will hopefully return things that either partially answer or share many of the same words with our query or positive passage, and therefore are similar but do not actually match our query.

And then we pass all of those, the query, the positive, and the negative, through to our pseudo-labeling step, which is the cross-encoder step. Now the cross-encoder step is going to produce a margin score; that's essentially the difference in similarity between our query-positive pair and our query-negative pair.

We'll explain all this in much more depth later on. I assume right now it's probably quite confusing, so no worries, we will go through all of it. Now, as you've probably noticed, each of these steps requires the use of a pre-existing model that has been fine-tuned for each one of these parts.

Now we don't need to have fine-tuned these ourselves. They're general models; they don't need to have been specially trained for that particular task within this particular domain, okay? These models are generally quite good at adapting to new domains, which is why we're using them. So let's dive deeper into what I've just described and begin looking at the query generation step of GPL.

So as I mentioned, GPL is perfect for those scenarios where we have no labeled data, but we do need a lot of unstructured or unlabeled data, okay? So we need a lot of text data that is not labeled. That could be text scraped from web pages, from PDF documents, from anywhere where you can find text data.

The only requirement is that this text data is actually relevant to your particular use case, i.e. it is in-domain. Now, consider the case where we need to build a semantic search retrieval model for German finance documents. For in-domain data there, we might consider German finance news articles or German finance regulations, okay?

They would be in-domain, but other documents like PyTorch documentation or English financial documents are not in-domain at all. It's almost like if you imagine you're studying for an exam in one topic, like biology, and you start studying some chemistry papers for that biology exam. Maybe some of that chemistry might be relevant, a crossover with your biology, but not much.

You're never actually going to do that if you want to pass your exam at least. Now, in our examples in this video, we're going to be using the CORD-19 dataset, which is a set of papers that, I don't think all of them are specifically about COVID-19, but I think most of them at least mention COVID-19.

There are a lot of files here. Let's have a look at the code we use to actually download that data. Okay, so we need to find the CORD-19 dataset from the Allen Institute for AI; it's this website here. So it's not available, at least when I checked, it wasn't available on Hugging Face

Datasets or anywhere like that, so we need to pull it manually. So I'm just going to download that. We download this tar file, and essentially, I'm not going to explain all of this; I want to move through this a little bit quickly, but you will be able to find the link to this in the description.

So you can just run this script, and everything will be stored within this document_parses/pdf_json directory, which is just going to be a load of JSON files. There are a lot of them, just under 340,000, and they're going to be named like this, and they're going to look like this.

So they each have these paragraphs in there, which we're going to pull in and use in our examples. Once we've actually downloaded all of that data, we're going to move on to the query generation code. So we have our CORD-19 data, and we're going to read it from here.

I use a generator function here. So we're going through, getting the text, and we're just yielding the passages, one passage at a time. So it's one paragraph at a time that I'm pulling in, and I'm using a generator function here to do that. Now, just be aware there are a lot of duplicates in this data set.

So what I've done is create this passage dupes set. Every time we have a new passage, we just check if we've already pulled it in. If so, we skip it and move on to the next one. Otherwise, it's a unique passage, and we can pull it in, add it to that duplication check, and then yield it.

We're using yield here because this is a generator function that we loop through one item at a time. So basically, we're going to iterate through that function, and it will return us one passage at a time, which is what we're doing here. I returned two passages here, and you can see they're pretty long.
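To make that concrete, here's a minimal sketch of that generator, assuming the standard CORD-19 JSON layout where each file contains a body_text list of paragraphs (the directory path is just wherever the archive was extracted):

```python
import json
from pathlib import Path

def get_passages(json_dir="document_parses/pdf_json"):
    """Yield unique paragraphs from the CORD-19 JSON files, one at a time."""
    passage_dupes = set()  # dedup check, many paragraphs repeat across papers
    for path in Path(json_dir).glob("*.json"):
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
        for paragraph in doc.get("body_text", []):
            passage = paragraph["text"]
            if passage in passage_dupes:
                continue  # already seen this one, skip it
            passage_dupes.add(passage)
            yield passage

passages = get_passages()
print(next(passages))  # pull passages from the generator one at a time
```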

So there is a lot of text data in there; we're probably going to cut a lot of it out when we're feeding it into our models. That's fine. We just want to see how this method works; we're not trying to make everything perfect. So what we're going to do is use this model here.

So this is a T5 model trained on MS MARCO data from 2018, so it's pre-COVID; it doesn't know anything about COVID. We're using Hugging Face Transformers here, AutoTokenizer, and then AutoModelForSeq2SeqLM. And one thing, just to be very aware of: if you have a CUDA-enabled GPU, definitely try and use it.

So here we're just moving that model to CUDA, to our GPU. It will make things much quicker. This step, otherwise, on CPU, will take a very long time, so it's best to avoid that if you can. I think for me, on a Tesla V100, this took maybe one or two hours for 200,000 examples, which is a fair few.

So here's just an example of what we're going to do here. So I'm taking one passage. We are going to tokenize the passages to create our inputs. And now, inputs will include the input IDs and attention masks. We need both of those. And we feed them into our model.

And this generates three queries from that. One other thing that you really should make note of here is that, again, I'm moving those tensors to CUDA, if I can. If I've moved my model to CUDA already, these also need to be on CUDA. And let's have a look at what we produce from that.
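As a rough sketch, the model loading and single-passage generation look something like this. The exact checkpoint isn't named on screen here, so the MS MARCO-trained T5 query generator below is just one public example you could substitute:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# One publicly available MS MARCO T5 query generator (pre-COVID training data)
model_name = "BeIR/query-gen-msmarco-t5-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)  # this step is painfully slow on CPU, use a GPU if you have one

passage = next(passages)
inputs = tokenizer(passage, truncation=True, max_length=256, return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=64,
    do_sample=True,          # sampling gives us varied queries
    top_p=0.95,
    num_return_sequences=3,  # three queries per passage
)
for query in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(query)
```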

So sometimes it's good, sometimes it's not. It really depends. I think this is kind of a relevant example, but it's not perfect, because the queries are about SARS here rather than COVID, even though we're talking about COVID. Obviously, the model doesn't know that. So this generative step, or query generation step, is very good for out-of-domain topics.

Because, for example, if this said COVID-19, I think there's a good chance that this would say, where is COVID-19 infection found? Because it does a lot of word replacement. So it's seeing your answer, basically. And it's just kind of reformatting that answer into a question about the answer. That's a lot of what this T5 model is doing.

So it can be quite good. And in some cases, it can be less good. But that's fine, because it's better than nothing. We really do need some sort of labeled data here. And in most cases, it works fairly well. So here, we get some OK examples. We have SARS rather than COVID.

But otherwise, it's fine. So what I'm doing here is creating a new directory called data, where I'm going to store all of the synthetic data we create. I'm importing tqdm for the progress bar, setting a target of 200,000 pairs that I want to create, and a batch size, which you should increase as much as you can.

That's going to make it quicker, but obviously it's restricted depending on your hardware. We also specify the number of queries per passage. We're going to do this all in batches to make things a little bit faster. And then we reinitialize that generator that's going to return our passages, because we've already run through three of them already.

So I just want to restart it, essentially. So we're going to go through. I'm using this with TQDM as progress here. Because this is a generator, we don't know how long it is. So it's not going to give us a good progress bar if we just run TQDM on passages.

So I'm specifying the total as the target, so the total number of steps is the target that I have up here, the 200,000. And I'm going to increase the count with every new passage that we generate queries for. In fact, I'm going to increase it by three, I think.

Yeah, because we create three queries per passage, so that's what we do. And once our count is greater than or equal to the target, 200,000, we stop the whole process. Here, for the passage batch, I'm appending a passage, the passage that we have from here, and I'm just replacing any tab and newline characters to keep it clean for when we're writing the data later on, more than anything.

Because we're going to be using tab-separated files. We're going to encode everything in batches. So once the length of the batch is equal to the batch size that we specify up here, so 256, we're going to begin encoding everything. So we tokenize everything in the batch, and then we generate all of our queries.

Again, generating three for each passage that we have. And then another thing is that we need to decode the queries that we've generated back to human-readable text, because the next model we use is going to be a different model, so it cannot read the same tokens that this model outputs.

So we need to decode those back to human-readable text. And then we're going to loop through all of the decoded outputs, which is one query per line. And we have to consider that there are actually three queries per passage. So if we have maybe five passages here, that means we have 15 queries on the other side.

So we need to consider that when we're pairing those back together. So we use this passage_idx, which is the integer of i divided by the number of queries. So imagine, for passage 0, we are going to have queries 0, 1, and 2, and you divide those by 3.

And all of them are going to be less than 1; 2 is the highest one, and 2 divided by 3 is 0.66. If you take the integer value of that, it becomes 0, which maps to the passage on this side. So that maps to passage number 0, which means queries 0, 1, and 2 all map to passage 0.

And then we do the next ones, so 3, 4, 5, and those are going to map to passage 1. So that's how we're mapping our generated queries back to our passages. And then for each one of those, we just append it to a line here, using a tab to separate them, because we're just going to write them to file.

And we increase the count, refresh the batch every time we've been through a batch, and update the progress bar. Actually, we're updating it by the size of the decoded output rather than by 3, so if we had five passages, that would be 15, because we have three queries per passage.
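Pulling those pieces together, a condensed sketch of the batched generation loop might look like the following; the target, batch size, and directory name are the values mentioned above, and the rest is an approximation rather than the exact original script:

```python
import os
from tqdm.auto import tqdm

os.makedirs("data", exist_ok=True)  # where we'll keep the synthetic data

target = 200_000   # total (query, passage) pairs we want
batch_size = 256   # push this as high as your hardware allows
num_queries = 3    # queries generated per passage

passages = get_passages()  # restart the generator from the top
lines, passage_batch, count = [], [], 0

with tqdm(total=target) as progress:
    for passage in passages:
        if count >= target:
            break
        # strip tabs/newlines so the tab-separated file stays clean
        passage_batch.append(passage.replace("\t", " ").replace("\n", " "))
        if len(passage_batch) < batch_size:
            continue
        inputs = tokenizer(passage_batch, truncation=True, padding=True,
                           max_length=256, return_tensors="pt").to(device)
        outputs = model.generate(**inputs, max_length=64, do_sample=True,
                                 top_p=0.95, num_return_sequences=num_queries)
        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        for i, query in enumerate(decoded):
            # queries 0,1,2 -> passage 0; queries 3,4,5 -> passage 1; and so on
            passage_idx = int(i / num_queries)
            lines.append(query.replace("\t", " ") + "\t" + passage_batch[passage_idx])
        count += len(decoded)
        progress.update(len(decoded))
        passage_batch = []
# 'lines' now holds the query-passage pairs, ready to be written to a TSV file
```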

OK, now we want to write those query-passage pairs to file, which we do like that. So that's the query generation step; there's a lot in there, but from this, we now have our query-passage pairs. Now, it is worth noting that, like we saw before, query generation is not perfect.

It can generate noisy, imperfect, even nonsensical queries. And this is where GPL improves upon GenQ. If you know GenQ (we covered it in the last chapter), GenQ just relies on this query generation step. With GPL, later on, we have this pseudo-labeling step, which cleans up any noisy data that we might have, or at least cleans it up to an extent, which is really useful.

So that's great. We've finished with our query generation step, and now we're ready to move on to the next step, negative mining, which I think is probably one of the most interesting steps. So now, if you think about the data we have, we have our queries and the assumed positive passages, or the positively paired queries and passages, that we have at the moment.

Now, suppose we fine-tune our sentence transformer or bi-encoder with just those. Our sentence transformer is just going to learn how to put things together; it's going to learn how to place these queries and passages in the same vector space. And yes, OK, that will work to an extent.

But the performance is not great. What we need is a way to actually find negative passages. There was a paper on RocketQA, and they had this really cool chart that I liked, showing model performance when they've trained it with hard negatives.

So hard negatives are negative samples where it's quite hard for the model to figure out that they're negative, because they're very similar to your positive. We'll explain more about hard negatives later on. The chart compares that against training the model without any hard negatives, so just the positives, and it's very clear that models perform better when there are hard negatives.

It's almost like you can think of it as making your exams harder. The people that pass the exams now will have studied even harder than they would have in the past. Because you've made the exam harder. They need to study harder. And as a result of that, they will be better because of that.

Their knowledge will be better because the exam was harder. And they had to try harder to pass it. It's very similar in why we include negatives. It makes the training process harder for our model. But at the same time, the model that we output from that is much better.

So to actually get these negatives, we perform what's called a negative mining step. Now, this negative mining step is used to find passages that are highly similar to our positive passages but are not the same. So we are assuming that these very similar passages maybe talk about the same topic, but don't really answer our query.

We're assuming something like this; we don't actually know, but we're assuming it. So we're performing a semantic search, or information retrieval, step to actually return these possible negatives, and then we can use them as negative training examples. So now what we will have is our query, positive passage P plus, and assumed negative passage P minus.

And the result of this is that our model is going to have to learn very nuanced differences between that negative or assumed negative and that positive in order to be able to separate them out and understand that, OK, the positive is the pair for our query, but this negative, even though it's super similar, is not the pair for our query.

So it's just giving our model an extra task to perform here. Now, with all that in mind, we also need to understand that we're assuming that all the passages we're returning are negatives. We don't actually know that they are. And again, this is something that we're going to handle later on in the pseudo-labeling step.

But for now, we just assume that they are, in fact, negative passages to our query. So let's move on to the actual implementation of negative mining. So the first thing we need to consider for this negative mining step is that there are two parts to it. There is an embedding or retrieval model.

Again, we're going to use something that has been trained on data from pre-COVID times. And what that is going to do is taking our passages and our queries and translate them into vectors. OK, but we need somewhere to store those vectors. So we need a vector database to do that.

So we're going to use a vector database to store all of those vectors. And then we're going to perform a search through all of those with our queries in order to identify the most similar passages. And then we're going to say, OK, is this a positive passage for this query?

If not, then great, we're going to assume it's a negative passage for our query. We're just going to go through and do that. So the first thing we're going to do is load our model. We're using the msmarco-distilbert-base-tas-b model here. Again, that has been trained on pre-COVID data.

And we're loading that into Sentence Transformers and just setting the max sequence length to 256 here. OK, and then we're going to initialize a Pinecone index. This is going to be our vector database, where we're going to store all of our vectors. For that, we do need an API key.

It's all free; you don't need to pay anything for this, and all the infrastructure and so on is handled for us, which is great. So we go to app.pinecone.io, make an account if you need to, and then you'll get an API key. Now, the way I handle the key is maybe not the best way to do it.

It's just easy: I'm storing my API key in a file called secret in the same directory as my code, and I'm just reading that in here. OK, so that's my API key. Oh, one other thing you need to do here: to install the client, you need to pip install pinecone-client.

OK, not pinecone, but pinecone-client. So we initialize Pinecone with our API key. This is just the default environment; there are other environments available if you want to use them, but I don't think you really need to. And then we create the index. OK, so we have a negative mining index.

So here, I'm just saying you can check your currently running indexes with pinecone list_indexes. If negative-mine is not in there, then I'm going to create the negative-mine index. And there are a few things you need to pass, such as the dimension. The dimension is what we have here, so the embedding dimension that our embedding model is outputting.

So you specify that, 768. The metric, this is important: we need to be using the dot product metric here for GPL. And also, the number of pods. By default, this is one, but what you can do is increase it. I think I increased it to 70, which is probably massively overkill for this.

But that shortened the runtime. With one pod, I had a runtime, I think, of one hour 30 or one hour 40; with 70 (which, again, is probably overkill, maybe 40 would do the same, I don't know) it was, I think, 40 minutes. So a lot faster, but you do have to pay for anything that's more than one pod.

So if you're in a rush, then fair enough, fine, go with that. Otherwise, you just stick with one pod; it's free. Okay. And then you connect: this creates your index, and then you connect to your index, the negative-mine index. Okay.
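As a sketch, using the pinecone-client API as it was around the time of this video (newer versions of the client look different), that setup is roughly:

```python
import pinecone

# read the API key from a local file called 'secret' (easy, if not best practice)
with open("secret") as f:
    api_key = f.read().strip()

pinecone.init(api_key=api_key, environment="us-west1-gcp")  # the default environment

if "negative-mine" not in pinecone.list_indexes():
    pinecone.create_index(
        "negative-mine",
        dimension=768,        # embedding dimension of the retriever
        metric="dotproduct",  # GPL needs dot product similarity
        pods=1,               # more pods is faster, but anything beyond one pod is paid
    )
index = pinecone.Index("negative-mine")  # connect to the index
```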

Then we want to go through and use our model. But first, this bit here creates our file-reading generator, which is just going to yield the query and passage pairs. I include the error handling in here because I got a ValueError; there's literally one row of weird data in there, so I just added that to skip it.

And yeah, we initialize that generator. And then here, I'm going to go through and actually encode the passages. You see, we're creating a passage batch here; we're doing it in batches again to make it faster. So I'm adding my passages to the passage batch and also adding IDs to the ID batch.

And then once we reach the batch size, which is 64 in this example, we encode our passage batch. We also convert the embeddings to a list, because we need them to be a list when we're upserting everything. Upsert just means update or insert; it's just database lingo.

We need it to be a list when we're pushing this because it goes through an API call here, a JSON request, so we need to convert it to a list rather than a NumPy array, which is, I think, the default. And then we just create a list of IDs and vectors and upload that to Pinecone, to our index.

And then we just refresh those batches, so we start again, do the next 64, and then encode and upsert again. And then at the end here, I just wanted to check the number of vectors we have in the index. So we see the dimensionality of the index, as well as the index fullness; this will tell you pretty much how quickly it's going to run.

At zero, it's perfect; it means it's basically empty and it'll run pretty quick. And then you have the vector count here. Remember, we have 200,000 examples or pairs; they are all being used, but not all of the passages are unique, because for each passage that we have, we have three queries.

So three unique queries, but three duplicated passages. So obviously, out of those 200,000 passages that we have, we need to divide by three to get the actual number of unique passages we have in there, which is something like the 76,840 here. So the database is now set up for us to begin negative mining; it's full of all of our passages.
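Here's a rough sketch of that encode-and-upsert loop. I'm storing each passage's text as metadata so we can read it back at query time, and only indexing each unique passage once (the pairs file repeats every passage three times); treat those details as assumptions rather than the exact original code:

```python
import pinecone
from sentence_transformers import SentenceTransformer

index = pinecone.Index("negative-mine")  # assumes pinecone.init(...) ran as in the setup above

# the pre-COVID retriever (TAS-B model trained on MS MARCO)
retriever = SentenceTransformer("msmarco-distilbert-base-tas-b")
retriever.max_seq_length = 256

def get_pairs(path="data/pairs.tsv"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                query, passage = line.rstrip("\n").split("\t")
            except ValueError:
                continue  # skip the odd malformed row
            yield query, passage

batch_size = 64
passage_batch, id_batch, seen = [], [], set()

for i, (query, passage) in enumerate(get_pairs()):
    if passage not in seen:  # only index each unique passage once
        seen.add(passage)
        passage_batch.append(passage)
        id_batch.append(str(i))
    if len(passage_batch) == batch_size:
        # lists (not numpy arrays) because this goes out as a JSON API request
        embeds = retriever.encode(passage_batch).tolist()
        metadata = [{"passage": p} for p in passage_batch]
        index.upsert(vectors=list(zip(id_batch, embeds, metadata)))
        passage_batch, id_batch = [], []

print(index.describe_index_stats())  # dimension, index fullness, vector count
```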

It's full of all of our passages. So what we're going to do is loop through each of the queries in pairs, which we created here. So we're going to loop through all of those here in batches of batch size again, so 100. And we're going to initialize this triplets list where we're going to store our query, positive and negative.

At the moment, we just have query and positive, remember? So we're going to go through there, get our queries, get our positives, and create the query embeddings, in batches again. Then we search for the top 10 most similar matches to each query, and then we loop through all of those.

So for each query, positive passage, and query response, the query response will actually have the 10 highest-similarity passages for that query inside it. So we extract those (remember, there are 10 in there), shuffle them, and then loop through them.

Now, we do this so that we're not just returning the single most similar one all the time, but one of the top 10 most similar instead. And then we extract the negative passage from that one record, that hit. And one thing we really need to consider here is, OK, if we've got all of our passages in there, we're also going to have the positive passage for our query in there as well.

So it's pretty likely that we're going to return that, at least for a few of our queries for sure. So we need to check that the negative passage we're looking at does not match the positive passage for that query. And if it doesn't, that means it's a negative passage, or assumed negative passage, and we can append it to our triplets.

So we have query, tab, positive, tab, negative, and then we save those to file.
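A minimal sketch of that mining loop, querying one embedding at a time for readability (the video batches these in groups of 100):

```python
import random

triplets = []
for query, positive in get_pairs():
    query_embed = retriever.encode(query).tolist()
    # retrieve the 10 most similar passages to the query
    res = index.query(vector=query_embed, top_k=10, include_metadata=True)
    hits = list(res["matches"])
    random.shuffle(hits)  # don't always take the single most similar passage
    for hit in hits:
        negative = hit["metadata"]["passage"]
        if negative != positive:  # make sure we didn't just retrieve the positive itself
            triplets.append(f"{query}\t{positive}\t{negative}")
            break

with open("data/triplets.tsv", "w", encoding="utf-8") as f:
    f.write("\n".join(triplets))
```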

OK? Now, one last thing before we move on: we should delete our index. If you're on the free tier, just using that one pod, it's fine, you're not going to pay anything anyway, but it's good practice to remove it after you're done with it. And if you are paying, of course, you want to do this so you're not spending any more money than you need to. So that's the negative mining step. Now we can move on to the final data preparation step, which is pseudo-labeling. Pseudo-labeling is essentially where we use a cross-encoder model to clean up the data from the previous two steps.

So what this cross-encoder is going to do is generate similarity scores for both the query positive and the query negative pairs. OK? So we pass both those into cross-encoder model, and it will output predicted similarity scores for the pairs. So first thing you need to do is actually initialize a cross-encoder model.

Again, this should have been trained on pre-COVID data. And then we're going to use a generator function, as we have been doing throughout this entire thing, just to read the data, the triplets in this case, from file, and then yield them. And what we're going to do is use that cross-encoder, with this function here, to calculate the similarity of both the positive and the negative.

And then we subtract those to get the margin between them, the separation between them, which is exactly what we need when we're performing margin MSE loss, which we will be very soon for fine-tuning our model. So we go through using our generator here; we get a line at a time, a query, positive, and negative.

We get a positive score and a negative score, and then we calculate the margin between them. And then we're going to append those to the label lines and save it all to file. So it's actually a pretty quick step, the pseudo-labeling step, but this final step is also very important in ensuring that we have high-quality training data.
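As a sketch, that labeling loop could look like this; the checkpoint below is just one public MS MARCO cross-encoder (trained on pre-COVID data) used as an example:

```python
from sentence_transformers import CrossEncoder

ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def get_triplets(path="data/triplets.tsv"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            query, positive, negative = line.rstrip("\n").split("\t")
            yield query, positive, negative

label_lines = []
for query, positive, negative in get_triplets():
    # score the (query, positive) and (query, negative) pairs, then take the difference
    pos_score, neg_score = ce.predict([(query, positive), (query, negative)])
    margin = pos_score - neg_score
    label_lines.append(f"{query}\t{positive}\t{negative}\t{margin}")

with open("data/triplets_margin.tsv", "w", encoding="utf-8") as f:
    f.write("\n".join(label_lines))
```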

Without this pseudo-labeling step, we would need to assume that all passages returned by the negative mining step are actually negatives, and that they're irrelevant for our particular query. In reality, this is obviously never the case, because some negative passages are going to be more or less negative than others, and maybe some of them are not even negative in the first place.

So there was this chart or example from the GPL paper that I really liked. I've just adapted it for our particular use case. So the query is, what are the symptoms of COVID-19? The positive, so the first row there, is COVID-19 symptoms include fever, coughing, loss of sense of smell or taste.

So that positive passage is the passage that we started with. We generated that query in the generative or query generation stage. And then we retrieved these negatives in the negative mining stage. Now, just have a look at each one of these. So the first one, fever, coughing, and a loss of sense of smell are common COVID symptoms.

That is not actually a negative. It's a positive. But this is very likely to happen in the negative mining stage, because we're returning the most similar other passages. And those other very similar passages could very easily be actual genuine positives that are just not marked as being the pair for our query.

We don't actually know that they're negatives. So in this case, this is a false negative, because it should be a positive. Now, the next one, we have these easy negatives. The next one, symptoms are physical or mental features that indicate a condition of disease. Now, this is a pretty typical easy negative, because it has some sort of crossover.

So the question is, what are the symptoms of COVID-19? And this is just defining what a symptom is. But it's not about COVID-19 or the symptoms of COVID-19. This is an easy negative, because it mentions one of those keywords. But other than that, it's not really similar at all.

So it will be quite easy for our model to separate that. So it will be easy for it to look at this easy negative and say, OK, it's not relevant for our query. It's very different to the positive that I have here. That's why it's an easy negative. Next one, another easy negative is COVID-19 is believed to have spread from animals to humans in late 2019.

Again, it's talking about COVID-19, but it's not the symptoms of COVID-19. So maybe it's slightly harder for our model to separate this one. But it's still a super easy negative. And then we have a final one. So this is a hard negative. Coughs are a symptom of many illnesses, including flu, COVID-19, and asthma.

In this case, it's actually a partial answer, because it tells us, OK, a cough is one of the symptoms. But it's still not a full answer; we're asking about the symptoms, we want to know about multiple symptoms, so it kind of partially answers, but not really. And what I really like about this is when we look at the scores on the right here.

We have the scores when you're using something like pseudo-labeling from GPL, and the scores when you're not using pseudo-labeling, like if you're using GenQ. Now, with GenQ, even for the false negative, GenQ is seeing this as a full-on negative. So you're just going to confuse your model if you're doing this.

If it has two things that are talking about the same thing, and then you're telling your model that, actually, one of them is not relevant, you're really going to damage the performance of your model by doing that. And it's the same with the other ones as well.

There's almost like a sliding scale of relevance here. It's not just black or white. There's all these shades of gray in the middle. And when using GPL and pseudo-labeling, we can fill in those shades of gray, and we can see that there is a sliding scale of relevance. Without pseudo-labeling, we really miss that.

So it's a very simple step, but really important for the performance of our model. Okay, so we now have the fully prepared data, and we can move on to actually fine-tuning our sentence transformer using margin MSE loss. Now, this fine-tuning portion is not anything new or unique; it's actually very common.

Margin MSE loss is used for a lot of sentence transformer models. So looking back at the generated data, we have the format of query, positive passage, negative passage, and margin. Let's have a look at how those fit into the margin MSE loss function. So we have the query, positive, and negative.

We pass those through, and we get the similarity here; this is our similarity of the query and the positive. Okay, and this bi-encoder here is the one we're training, by the way. So this bi-encoder is the final model that we're fine-tuning, that we're trying to adapt to our new domain.

And then, using the query and negative vectors here, we also calculate the similarity between those to get the similarity of the query and the negative. We subtract those to get this here, delta hat, which is our predicted margin. With our predicted margin, we can compare that against our true margin, or what we assume is the true margin, from our cross-encoder.

And we feed those into the margin MSE loss function over here. So we have the predicted margin minus the true margin for each sample, we square that to get the squared error, and then we go over all of our samples and take the average. And that is our margin MSE loss.
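Written out, with sim being the dot product similarity from the bi-encoder we're training, delta-hat the predicted margin, and delta the cross-encoder margin from the pseudo-labeling step, that loss is:

```latex
\hat{\delta}_i = \mathrm{sim}(Q_i, P_i^{+}) - \mathrm{sim}(Q_i, P_i^{-}), \qquad
\mathcal{L}_{\mathrm{MarginMSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{\delta}_i - \delta_i \right)^{2}
```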

Now, what is really nice is that we can actually use the default training methods in the Sentence Transformers library for this, because margin MSE loss is just a standard loss; there's nothing new here at all. OK. So from Sentence Transformers, I'm going to import InputExample, which is just the data format we always use for training here.

So we're opening up the triplets margin file here and just reading all the lines. I'm not going to use a generator this time; I'm just going to create our training data, which is a list of these InputExamples. We pass the query, positive, and negative as the texts in here, so as our triplet.

And then for the label, we take the float of the margin, because the margin is a string from our TSV. And from this, we get 200,000 training examples, and we can load these into a data loader here. We're just using the normal torch DataLoader, nothing special here. We use empty_cache here just to clear our GPU in case we have anything on there.

So if you keep running this script again and again, trying to get it working or just modifying different parameters to see how it goes, you'll want to include this in here. Then the batch size. Batch size is important with margin MSE loss: the larger you can get it, the better. I did see, I think, in the Sentence Transformers documentation that it used something like 64.

And that's hard; you're going to need a pretty good GPU for that. To be fair, even for 32 you need a good GPU. I'm using one Tesla V100 for this, and it works, it's fine. So go as high as you can: 32, I think, is good, and even better if you can get to 64.

And, yeah, we just initialize the data loader; we set the batch size and set shuffle to true. The next step we want to do here is initialize the bi-encoder, or sentence transformer, model that we're going to be fine-tuning using domain adaptation. This is the same one we used earlier in the retrieval step.

We're actually going to be taking that and fine-tuning it. So we set the model max sequence length here, again, as we did before. And then we initialize the margin MSE loss function as we would with normal Sentence Transformers. And then we just run model fit. We train for one epoch, or actually, it depends.

You can get better performance if you train for more; I'll show you. And I just use 10% of the steps as warmup steps here, like we usually do. And I'm going to save it with the same model name, but with covid added onto the end there. And this doesn't take long; I think it was maybe 40 minutes per epoch on the Tesla V100, so it's not bad.
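Putting that together, a minimal fine-tuning sketch using the standard Sentence Transformers fit API might look like this (file and output names are placeholders):

```python
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# load the labeled triplets: (query, positive, negative) with the margin as the label
train_examples = []
with open("data/triplets_margin.tsv", encoding="utf-8") as f:
    for line in f:
        query, positive, negative, margin = line.rstrip("\n").split("\t")
        train_examples.append(
            InputExample(texts=[query, positive, negative], label=float(margin))
        )

torch.cuda.empty_cache()  # clear anything left on the GPU from earlier steps

batch_size = 32  # as large as your GPU allows; bigger is better for MarginMSE
loader = DataLoader(train_examples, batch_size=batch_size, shuffle=True)

# the bi-encoder we're adapting: the same pre-COVID retriever used for negative mining
model = SentenceTransformer("msmarco-distilbert-base-tas-b")
model.max_seq_length = 256

loss = losses.MarginMSELoss(model)

epochs = 1
warmup_steps = int(0.1 * len(loader))  # 10% of steps for warmup, as usual
model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    show_progress_bar=True,
    output_path="msmarco-distilbert-base-tas-b-covid",
)
```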

Now, in terms of the number of training steps you want to actually put your model through: with GPL, they did find, in the paper, there's a really nice chart that shows that after around 100,000 steps, for that particular use case, it leveled off. The performance didn't really improve that much more from there.

However, I did find-- so I'm using 200,000 examples. So one epoch for me is already 200,000 training steps, which is more. And I found that actually training for 10 epochs-- I didn't test in between. I just went from one. And I also tested 10. I found that 10, actually, the performance seemed to be better.

I didn't really quantitatively assess this; I just looked at it qualitatively, and it was performing, or returning, better results than the model that had been trained for one epoch. So rather than 200,000 steps, it had been trained for 2 million steps, which is a fair bit more than the 100,000 mentioned in the paper.

But that's just what I found; I don't know why. I suppose it will depend very much on the particular data set you're using. But I would definitely recommend you just try and test a few different models. The training doesn't take too long for this step, so you have the luxury of actually being able to do that and test these different numbers of training steps.

So, once training is complete, we can actually test that model as we usually would. We have the model available as well, so I can show you. We have the model over on the Hugging Face Hub: if we go to huggingface.co/models and search in here for Pinecone,

And one of these is the model that has been trained on COVID, so hit this one here. You can actually copy this, and it should have an example; ah, so you write this: from Sentence Transformers, you import SentenceTransformer.

And you just do model equals SentenceTransformer with your model name, which in this case is what you can see up here, the Pinecone MS MARCO model. And you can test the model that I've actually trained here. Again, this was trained for 10 epochs, not one. OK, so in this evaluation here, you can see what I was looking at.
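If you want to reproduce that kind of comparison yourself, a tiny sketch looks something like this; the query and passages below are made-up illustrations, not the actual test set from the video, and the second model name is just the locally saved output from the fine-tuning step:

```python
from sentence_transformers import SentenceTransformer, util

old_model = SentenceTransformer("msmarco-distilbert-base-tas-b")
new_model = SentenceTransformer("msmarco-distilbert-base-tas-b-covid")  # saved above

query = "How is COVID-19 transmitted?"
passages = [
    "Ebola is transmitted through direct contact with bodily fluids.",
    "Influenza spreads mainly by droplets when infected people cough or sneeze.",
    "COVID-19 spreads primarily through respiratory droplets and aerosols.",
]

for name, model in [("old", old_model), ("gpl", new_model)]:
    q_emb = model.encode(query, convert_to_tensor=True)
    p_emb = model.encode(passages, convert_to_tensor=True)
    scores = util.dot_score(q_emb, p_emb)[0]  # dot product, matching the training metric
    ranked = sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True)
    print(name, "->", ranked[0][0])  # top-ranked passage for this model
```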

So we load the old model, which is just msmarco-distilbert-base-tas-b. And then here, I've got the model trained for one epoch, and here, the model trained for 10 epochs, which is equivalent to the model I just showed you on the Hugging Face Hub. So I've taken these tests.

So we have these queries and three answers or three passages for each one of those. Now, these are not even really perfect matches to the sort of data we were training on. But they're kind of similar in that they just include COVID-19. OK, so it's not a perfect match.

It's not even perfectly in the domain. But it's good enough, I think. But I think that really shows how useful this technique is because it does work even with this kind of not perfect match between what I'm testing on and what I'm actually training on. So here, I'm just going through those.

I'm getting the dot product score between those and sorting based on that, so the highest-rated passage for each query is returned. So for the old model, how is COVID-19 transmitted? We get Ebola, then HIV, and then at the very bottom, we have corona, OK? Next, what is the latest named variant of corona?

And then it's these here; again, COVID-19 is right at the bottom, and we have Corona lager right at the top. I don't know if the rest of the world calls it lager or beer, but I put lager there anyway. And then we ask, OK, what are the symptoms of COVID-19?

Then we get flu; corona comes second here, and then we have the symptom definition again. And then, how will most people identify that they have contracted the coronavirus? The top result is that after drinking too many bottles of Corona beer, most people are hungover, right? So obviously, it's not working very well.

And then we do the GPL model, so this is the one trained for one epoch, and we see slightly better results. So corona is number one here, and then here, it's number two; not great, but fine. What are the symptoms of COVID-19? It returns this one, I guess, right? So that's good.

And then here, how will most people identify that they have contracted the coronavirus? Again, it's returning the drinking-too-many-bottles-of-Corona passage, and then you have COVID-19 in second place. Then we have the model that's trained for 10 epochs. This one gets it right every time: number one, number one, and number one.

So it's pretty cool to see that. And yeah, I think it's super interesting that this technique can actually do that using nothing more than just unstructured text data. So that's it for this video on generative pseudo-labeling, or GPL. Now, I think it's already very impressive what it can do.

And more importantly, I think, or what is more interesting for me, is where this technique will go in the future. Is this going to become a new standard in training sentence transformers, where you build synthetic data? I feel like there's really a lot that we covered in this video, and from the GPL paper, to unpack and think about.

We have negative mining, we have pseudo-labeling, we have all these different techniques that go together and produce this really cool and, in my opinion, super powerful technique that we can use. For me, it's just super impressive that you can actually do this. So I hope there is more to come in this field, whether GPL or some new form of GPL, in the future.

I'm very excited to see that. But for now, of course, that's it. I hope this video has been useful, and I hope you're as excited about this as I am. So thank you very much for watching, and I will see you again in the next one. Bye.