
Stanford XCS224U: NLU | Fantastic Language Models and How to Build Them, Part 2 | Spring 2023


Transcript

All right. Welcome everyone. Again, we have a very full day. The plan is to finish up our review of core information retrieval stuff. The focus will be on neural information retrieval, where a lot of the action is these days. I have then a few datasets to show you, and then I'm going to turn it over to Sid, and Sid is going to help us talk again about how to build fantastic language models.

So let's dive in. We'll start by using our big handout here, information retrieval. Right. So here we are, and we are going to skip. That's right. I had a couple more metrics that I wanted to show you. So let's start there. So last time we talked about how assessment in the space of IR should be multidimensional.

We've been focused on accuracy, but I will make amends. We are going to circle back and talk about these other dimensions, which I regard as absolutely crucial in this space. But with that said, we did dive into different metrics. We talked about success and reciprocal rank. Success, you should think of as just saying, for my chosen k, is there a star above me?

That is, is there a relevant document at or above k? So it's a very coarse-grained measure. So this one here, if we set success at 2, D1 has a success of 1 because there is a star at 2 or above. D2, that ranking also has a success of 1 because there is a star at 2 or above, and poor D3 gets a success of 0.

And you can see already that it's coarse-grained because D1 and D2 are different, in some intuitive sense, but here they both got a success score of 1. Reciprocal rank is a little bit better in the sense that it's more or less just registering whether there's a star at or above k, except now we are sensitive to where the topmost star is ranked.

So for example, D1 here has an RR at 2 of 1 because there is a star in first place. Whereas D2 has 1 over 2 because the first star is in second place. And then D3 still gets its poor 0. So pretty coarse-grained but very intuitive and sometimes success and RR are good metrics in the sense that you kind of just want to know for your chosen k whether you hit the mark, whether you got a star.
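Here is a minimal sketch of these two metrics in Python, assuming each ranking is encoded as a list of 0/1 relevance judgments from rank 1 downward; the three example lists match the star positions used in the running example (stars at ranks 1, 2, 6 for D1; 2, 5, 6 for D2; 3, 4, 5 for D3).

```python
def success_at_k(ranking, k):
    """1.0 if any relevant (starred) document sits at rank k or above, else 0.0."""
    return float(any(ranking[:k]))

def rr_at_k(ranking, k):
    """1 / rank of the first relevant document, if it occurs at or above k; else 0."""
    for rank, rel in enumerate(ranking[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

d1 = [1, 1, 0, 0, 0, 1]   # stars at ranks 1, 2, 6
d2 = [0, 1, 0, 0, 1, 1]   # stars at ranks 2, 5, 6
d3 = [0, 0, 1, 1, 1, 0]   # stars at ranks 3, 4, 5

for name, r in [("D1", d1), ("D2", d2), ("D3", d3)]:
    print(name, success_at_k(r, 2), rr_at_k(r, 2))
# D1 1.0 1.0    D2 1.0 0.5    D3 0.0 0.0
```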

And especially if you only have one relevant document per query, you might as well use these metrics. And then RR will just be a little bit more nuanced. We also talked about precision and recall, the classic accuracy style metrics in this space. The differentiator here from the previous ones is that these are going to be sensitive to multiple stars.

So if you have more than one document that's relevant to your query, you will be able to detect that. So we have this notion of a return value that is just the set of documents k or above. And then the relevant documents, those are the ones with stars. And precision is saying for my chosen k, what percentage of the things at or above k are relevant?

And that's precision in the sense that if you picked k, you're looking at the set of documents and you want to know how many of them have stars relative to the total. Or like the reverse of precision would be like which ones are kind of imprecise as predictions because there's no star there.

And then recall is kind of the dual of that and it says for my chosen k, how many of the stars made it up to k or above? And the opposite of that would be like how many stars are lingering down below? So you can see here because of the numerator that we're going to differentiate systems now based on how many stars are at k or above.

So it's sensitive to multiple stars. So just to walk through again, precision at 2 for D1 is 2 out of 2. For D2, it's 1 out of 2 because just half of them have a star. And for poor D3, 0 out of 2. Recall is very similar but now the denominator changes, right?

So the recall at 2 for this first one is 2 out of 3. That is of the three-star documents, 2 are at k or above. Here it's 1 out of 3 and here at 0 out of 3. And just to round this out, poor D3 has not fared well in our ranking so far.
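Continuing the same sketch, precision@k and recall@k differ only in the denominator:

```python
def precision_at_k(ranking, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(ranking[:k]) / k

def recall_at_k(ranking, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(ranking[:k]) / sum(ranking)

d1 = [1, 1, 0, 0, 0, 1]   # stars at ranks 1, 2, 6
d2 = [0, 1, 0, 0, 1, 1]   # stars at ranks 2, 5, 6
d3 = [0, 0, 1, 1, 1, 0]   # stars at ranks 3, 4, 5

for name, r in [("D1", d1), ("D2", d2), ("D3", d3)]:
    print(name, precision_at_k(r, 2), round(recall_at_k(r, 2), 2))
# D1 1.0 0.67    D2 0.5 0.33    D3 0.0 0.0
```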

But in a surprise twist, if I change the value of k to 5, all of a sudden D3 looks pretty good. Because now it's got all three of its stars at 5 or above. Whereas the other two, even though they've got some high stars up there, we're not sensitive to that precisely.

And so now D3 has pulled ahead. And that is maybe something that you want to watch out for because people kind of innocently choose these k values when they're evaluating systems. And I just showed you that that could really impact the ranking of systems. And in particular, like, you know, it's hard to imagine since there are only six documents.

But if it was a lot of work to travel down to our chosen k, if k was 1,000, this would obscure the fact that we might pick as our winner a system that had all the stars more or less at 1,000. And the other systems which have their stars at the top of this ranking, and therefore they're easy to find, those might be diminished with such a high k.

And so that kind of gets you into the role of thinking, what are my users trying to do? What is the cost of them scanning down a list of ranked results and things like that? And that's where I want you to be when you think about these metrics. What are you trying to solve out there in the world?

What are your users confronting? What is the cost of reviewing examples and so forth and so on? Yeah. Will the neural IR models we're about to see kind of solve this problem? Because right now everything's based on the presence or not of a word, rather than maybe a longer meaning or, um, the quality of the relevance, however we define it.

Like maybe it only says the word once but actually has the best information afterwards. Will that take care of that or is- is all of neural also going to be based on presence or not of words? That's a great question. Ah, wait, we should be careful. So yeah, I think for the first part of your question, I want to say the neural IR models are overall going to be better.

Because of what you alluded to, they have a very rich semantic space. It won't directly impact this because these stars after all aren't about terms. This is about whether a whole document was relevant to a query. You should imagine that the background process is like some team of humans went through and said, okay, you searched for BERT and now I'm going through documents and saying, yeah, this one is relevant, this one isn't.

That's what produced these rankings. But I think you're right in your core intuition. Term-based models are going to be kind of brittle. And if we have hard query document pairs, they might miss them. Actually, that reminds me like this, for some reason it didn't display before. Let's see if it displays now.

I had this nice example that Omar created. This is an example of why search is a hard NLU problem. Because this is a query, what compounds protect the digestive system against viruses, where the response is certainly relevant, but there is zero relevant term overlap between query and document. All of the connections that we want to make are deeply semantic connections.

And I do think that that is why neural IR models have pulled ahead for accuracy style assessments trying to be careful as you'll see. I have one more metric which is average precision. This will be, I think this is fair to say, our most nuanced metric. Okay. So a little bit hard to think about, but I think it's intuitive.

Average precision, notice it has no K. And the reason it has no K is that we're going to sum over all the precision values for different Ks here where there is a relevant document. Think back to our rankings. Wherever there was a star, we're going to choose that as a K.

And we're going to sum up just those precision values and divide it by the number of relevant documents. Here's an example. Same three rankings that we had before, and what I'll show you are these precision calculations. So for the first one, we have stars at position one, two, and six.

And so we accumulate the precision values for one, two, and six. And those are, I hope, the ones I've given there. That sums to 2.5, and then we divide that by three, which is the number of relevant documents. So we've abstracted away the K, which is reassuring, and we're also checking at every level.

So it's not going to have that sensitivity I showed you before where the choice of K dramatically impacts which rankings we favor because now we're kind of looking at all of the ones chosen by the ranking. So that's D1, and then for D2, same thing. But now we're checking at two, five, and six because that's where the stars are, and that sums to 1.4.

And then for D3, we do the same thing at positions three, four, and five, and notice interestingly that D3 has pulled ahead of D2. That's less surprising to me in the current context because D2 is kind of good and kind of not. It has that one star that's near the top, but the other two stars are way at the bottom of our ranking, whereas at least D3 kind of put them all at least not literally at the bottom.

Whereas D1 looks like just a slam dunk winner here. I mean, it simply has most of the stars right at the top. It has that one lonely one at the bottom, but you know on balance D1 looks good. So if I just stepped back from this little example, I would say that average precision is kind of nice in terms of giving what looks to me like a pretty nuanced picture of these three rankings.
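As a rough sketch, average precision just averages precision@k over exactly the ranks where a star sits, matching the arithmetic above:

```python
def average_precision(ranking):
    """Mean of precision@k over the ranks k at which a relevant document sits."""
    star_ranks = [k for k, rel in enumerate(ranking, start=1) if rel]
    return sum(sum(ranking[:k]) / k for k in star_ranks) / len(star_ranks)

# D1 (stars at 1, 2, 6): (1/1 + 2/2 + 3/6) / 3 ≈ 0.83
# D2 (stars at 2, 5, 6): (1/2 + 2/5 + 3/6) / 3 ≈ 0.47
# D3 (stars at 3, 4, 5): (1/3 + 2/4 + 3/5) / 3 ≈ 0.48, so D3 edges out D2
```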

I think that's all of the accuracy style metrics. Of course, there are others that you'll encounter. Some are sensitive to the numerical like the float value, because sometimes you have not just a one or a zero, a star or not, but rather a float value for relevance. There are lots of versions that of course average these over sets of queries.

That'll be very common to see, but underlyingly that's just some kind of arithmetic average of these scores. So I think this is a good sample. Are there questions I can answer about these metrics? Really great. Yeah. The float value one for relevance, how's that at a high level computed?

What's it called? Like the discounted cumulative gain, and it is the sum of all of the scores divided by something or other. This must be in your history somewhere. It is something. You could also just label, you know, have human labels and then you can take the precision, or not precision, maybe just position-weighted combination of human labels.

So you found the discounted cumulative gain. That's a metric I left out. And then you're just observing that very often for these datasets, we'd have humans do a bunch of labeling, and then average precision is one way of aggregating over the labels we might have collected. I kind of alluded to this before, but like here's a partial list of things you could think about.

Which metric? Fundamentally, there is no single answer. Is the cost of sc- scrolling through k passages low? Then maybe success at k is fine. Because you don't care whether it was a position nine or position one, what you really care about is that the user is kind of confronted with the success that they can easily find.

That's one scenario that you could be in. Are there multiple relevant documents per query? This is straightforward. If so, you probably shouldn't use success at k or RR at k, because they're only sensitive really to one star. And if you went to the trouble of getting multiple stars, you know, why have your metric be insensitive to that?

So that seems clear. Is it more important to find every relevant document? If so, you should favor recall. That would be a case where maybe human review is cheap, or the cost of missing an example is hugely expensive. In that case, you want to favor recall. You can't miss anything.

Conversely, if you just need to find some relevant things, maybe in an ocean of examples, because you want to label them, or because it's just good to know about them, then you could favor precision. Because then all you really care about is that near the top are some relevant things.

F1 at k is the harmonic mean of precision and recall. Same thing as we do in NLP. And that can be used where there are multiple relevant documents, but maybe the relative order above k doesn't matter. That's just one perspective on what I mean when I say we're combining precision and recall.

And then finally, average precision of the ones I've showed you, will give you the most fine-grained distinctions of the metrics, right? Because it's sensitive to rank, and it's sensitive to precision and recall. Precision because it aggregates over those values, and recall because that's the denominator. So that looks like an awfully good way to really get a fine-grained ranking of systems.

And then finally, I'm going to talk about this a bit later. We have to move on beyond accuracy. This is a paper that I did with a team recently of researchers here and at IBM. What we're seeing here is a kind of post-hoc leaderboard. Not an actual leaderboard because part of our complaint is that there are no leaderboards that really do anything beyond measuring accuracy style things.

But if you did go through the literature as we did here and find a lot of systems, you can see that they vary widely along other dimensions. Here is the mean reciprocal rank, one of our rankings, goes from 19 to 37 or 39 or something. So you say, okay, but then look just to the right of that at the query latency.

To get to 37, look how much time I have to spend versus down here where 36, I spend a fraction of the time. That is absolutely something that will matter to the search experience of users. There is almost no way they're waiting around for 691 milliseconds, for example. Or what about the index size?

Right. If you care about space footprint and you will if you are indexing the web, some of these have tiny little indices and then, uh-oh, that's our model, ColBERT v1, 154 gigabytes. Right. So now if you need to hold it in memory, your world just got a lot more expensive.

You'll see over here RAM requirements. So BM25, it has no hardware requirements at all. You can run that on anything. Whereas these models down here that have these really high MRR scores, hugely expensive in terms of compute. Classic story of the neural age, right? So you have to pay somewhere.

Then of course, I hope you're thinking about this. So then what is the best combination of all these things? Well, it depends on how much money you have and how much time you have, and how much you care about accuracy. And so the best pitch I can make to you is that as you evaluate systems, you think about what you care about, what matters, and construct your evaluations on that basis.

That's gonna be a big theme of the course later on. And I'm hoping to time it so that you all, for your papers, are thinking about assessment. And you think, ah, you know, I should have a whole section about my philosophy of assessment here and not just fall into F1 or fall into success at K or whatever is relevant.

This is kind of interesting too. This is from the same paper. Here's BM25. Costs essentially nothing, but it has very low performance. If you travel straight up from there, look at these SPLADE models. Also costing essentially nothing, but vastly better in terms of their performance. That looks like a real discovery to me.

You know, this is like the Pareto frontier as they call it. These are the systems such that you just wouldn't choose anything off the frontier, no matter what your values are. And obviously, you can see that to favor this model, there are gonna have to be other dimensions that we care about beyond cost and MRR, because otherwise, that's just not a choice you would make.

But for all I know, there are hidden dimensions that need to be teased out that would show that that ANS model is the best relative to. Let's dive into some of those models then. Neural IR. First, we'll start with cross-encoders. This will be very intuitive. Okay. Here, just imagine I have a huge transformer.

And for cross-encoders, what I do is, I just concatenate the query and the document together, process them with my transformer model. And then on the top here, I put a little scoring function. And the scoring function will just say, for this query, how good is this document? Enormously powerful. To the comment from before, we are making like maximal use of this, say, BERT model here to get every possible interaction between query and document.

So this will be good in terms of accuracy. But you might worry about some other things. Here, let me walk through a bit more. In the background here, I'm assuming that our dataset looks like this. We have a query, one positive document, and a set of one or more negative documents.

We could have multiple of the negatives. What I'm depicting on the left here is a model we could summarize like this. This is the encoder. We concatenate the query and the document, and process them, and we retrieve this representation here, layer n, position 0. We feed that through a dense layer that does our scoring, and that is the basis for essentially a classifier.

This is called the negative log likelihood of the positive passage. And if you squint or you don't squint, you just let it go blurry, you will see that it is a typical classifier loss. The only possible twist is that the numerator is the positive passage score, and then in the denominator, I have the positive passage summed together with all the negative passages that I have in my example set.
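As a minimal sketch of what this loss looks like in code, assuming a Hugging Face BERT-style encoder and an example with one positive and several negative documents (the model name and helper names here are illustrative, not the course's actual code):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed encoder choice
enc = AutoModel.from_pretrained("bert-base-uncased")
score_layer = torch.nn.Linear(enc.config.hidden_size, 1)   # dense scoring head

def score_pair(query, doc):
    # Concatenate query and document, take the final-layer representation at
    # position 0 (the [CLS] token), and map it to a scalar relevance score.
    inputs = tok(query, doc, return_tensors="pt", truncation=True)
    cls = enc(**inputs).last_hidden_state[:, 0]
    return score_layer(cls).squeeze(-1)

def nll_of_positive_passage(query, pos_doc, neg_docs):
    # Softmax over the positive score and all negative scores; the loss is the
    # negative log probability assigned to the positive passage (index 0).
    scores = torch.cat([score_pair(query, pos_doc)] +
                       [score_pair(query, d) for d in neg_docs])
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))
```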

But fundamentally, it's a classifier. So that's why the examples look like this, because that's what's being used here to optimize all these parameters to score documents. Final thing, I hope you're thinking about this. It's going to be incredibly expressive and powerful, but it just won't scale. The cost of having the query and document interact at query time is that I can't process any of these documents ahead of time.

So just imagine this, your query comes in on the web, like you're Google and you're using a cross-encoder, the user queries. You need to process that query together with every single document on the web, to score them, and then on that basis, you will get beautiful scores. But obviously, each query could take years to serve.

So from this perspective, it is just not a practical choice. Maybe we could use it for re-ranking. You see this sometimes where a cheap retriever gets a lot of documents, like 1,000, and then this is used to re-rank those 1,000. But we can't do this at web scale.

So a question in the back. Yeah. Um, could you use this with multiple possible, uh, positive documents as well? Um. Like if you were like, like for example, for like the ranking thing right here, like multiple of those could be like good, but. I- let's see. I don't see why not.

The numerator could be the sum of the positive and then the denominator could just include all of those. So what you would be doing is, well, I'm just trying to think through. That, that would be one approach. The other approach would be to just treat them as separate examples.

I think under some conditions, those will be identical, but I'd have to think it through. But I don't see a problem. I don't see a problem. But it's worth thinking about. I'll get back to you on that. Let's improve on this. DPR, dense passage retriever. This will also be intuitive.

Here we go. Query and document, except notice now, they are processed by separate models. The query encoder and the document encoder. Could be the same parameters, but the point is, we process them separately. And I've grayed out every state except the outputs above the two [CLS] tokens, because those are the only ones that we need.

Okay. These two. And then we do some kind of scoring on that basis like similarity. Right. So here, our examples are the same. Now, the similarity function, as I'm calling it, for a query and a document is: we process the query, we get this guy, process the document, and we get this guy, and then we do scoring on that basis.

There are no additional parameters. We just score based on those representations; it's their dot product. So now, we've got something that is highly scalable because we can process every document in our entire web collection into a single vector, this one, and it can just sit there on disk. And then at query time, process the query, get its vector, and do this super fast dot product comparison for scoring.

So now we've got something that is probably even going to function as a full ranking model, not just a re-ranker. But the real game is that we can process all our documents offline. The cost was that we now have almost no interactions between the query and the document. Like if you think about token identities, if you think about the soft matching that happens with TF-IDF, none of that is going to be able to happen here.
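Here is a hedged sketch of that setup, again assuming BERT-style encoders from Hugging Face; the two encoders can share weights or not, and only their output dimensionalities need to match so the dot product is defined:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")    # assumed encoder choice
query_enc = AutoModel.from_pretrained("bert-base-uncased")
doc_enc = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, text):
    # Only the final-layer [CLS] vector is kept.
    inputs = tok(text, return_tensors="pt", truncation=True)
    return encoder(**inputs).last_hidden_state[:, 0]

def sim(query, doc):
    # Dot product of the two [CLS] vectors. Document vectors can be computed
    # offline and stored; only the query is encoded at query time.
    return (encode(query_enc, query) @ encode(doc_enc, doc).T).squeeze()
```

The loss is then the same negative log-likelihood of the positive passage, with this dot product plugged in as the comparison function.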

That was the cost. Yeah. So, uh, I mean, essentially what we'll be comparing would be two fixed length vectors, right? I mean, a vector representing the document and another representing the query. So is there, I mean, a limit to the length of the document that would be represented in that vector?

I mean, like, could it represent an arbitrary, a long document? It would lose context. These are great questions. Let me repeat them. For the first question, yes, you are right. The one constraint we need to impose on the query encoder and the document encoder is that they have the same dimensionality so that we can do the dot product.

They can otherwise be separate models if we want. And then length of query and length of document, that's just going to be imposed by whatever we choose for the query and the document themselves. So if you choose BERT, you're going to be stuck with 512 as the length, the longest document that you can process unless we do some further manipulation of these things.

Yeah. If these models are trained to project into kind of like a shared embedding space, so like documents that are similar to a query are going to fall into a similar location in embedding space, could we have a system where we essentially take, like we pre-process all the documents, we take a query at inference time, project it into embedding space and then do like a nearest-neighbor search or something like that?

Yes. Well, yes. So some aspects of what you're describing, I think, are what DPR will be optimized for. The other parts of what you're saying are going to be optimization tricks that I show you in a second, I believe. Yes. Can you elaborate on what you mean by limited query-doc interactions?

Just that all we've got in the end is this vector for the whole query and this vector for the whole document. So token identities to the extent that they're preserved at all, they have to have been packed into those vectors. Whereas over here, we had every token level interaction you can imagine as a result of us using like the transformer.

Yeah. Is there any room for training some more clever synthesis of the two representations you get at the end as opposed to just dot producting them? Oh, that's, yeah. I think that's a natural follow-on is that you might think, I want to have in this layer here some additional parameters, and you can kind of see how that might work, right?

So instead of just using this vector, I would put some parameters on top and then the same optimization can be used. Yeah. If they're going to the same embedding space, attention. Yeah. Yeah. Yeah. And this could be good in the sense that we, we would pay a little bit of a cost by adding more parameters, but we might gain something in terms of expressivity.

Nice. Let me show you a happy compromise. Oh yeah, I just wanted to point this out, that I've just showed you two loss functions. I showed you the cross encoder and the DPR, and you can probably already see that they are identical except for this function that you might call comp here.

And that's kind of freeing. And as you think about these different model architectures, probably what you're thinking about is simply changing this comp function and then using your available data to train the model against this negative log likelihood of the positive passage. There are other losses out there in the literature, but this is the most widely used and it seems to be very effective.

ColBERT. This stands for contextualized late interaction with BERT. It was invented by Omar and Matei, who are here. Omar is my student, I work closely with Matei. Um, and let's see. So Omar would want you to know that this stands for contextualized late interaction, and he pronounces it Colbert because Stephen Colbert has a show full of contextualized late night interactions.

But you can also pronounce it Col-BERT. It's your choice because the BERT there is the BERT model. And we are, yes, still hoping that Stephen Colbert will take notice of this. I haven't been so bold, but I welcome you all to do that. That's great. Add him on Twitter, yes.

Here's how this will work. I've drawn the query encoder on the side for reasons that you'll see, but it's the same kind of thing. So imagine BERT processes my query and I've grayed out everything, but the final representations because crucially, those are the only ones that we actually need.

Same thing with the document, and it could be the same encoder. Now what I'm going to do with Colbert is form a grid of scores. And this is going to essentially give the similarity value between every query token and every document token. And then I will choose the values along the rows, that is for each query, the document token that maximizes that similarity comparison.

And the scoring function is essentially the sum of those three max values. That is why you see MaxSim all over the place for ColBERT. Examples are as before, losses as before, and I wrote down here what we would think of as the comp function, and I wrote it as MaxSim.

For a query and a document, you sum over all the query tokens, and you get the max matching document token. So you can see why it's contextualized late interaction, because I'm using the output states. But unlike DPR, I'm allowing them all to interact with each other via these very fast MaxSim calculations.
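A minimal sketch of that comparison function, assuming we already have the output states for the query and the document as matrices of shape (num_tokens, dim), typically normalized so dot products act like cosine similarities:

```python
import torch

def maxsim(query_states, doc_states):
    """MaxSim: for each query token, take the max similarity over all document
    tokens, then sum those maxima over the query tokens."""
    sims = query_states @ doc_states.T        # grid of token-level similarity scores
    return sims.max(dim=1).values.sum()
```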

I have token-level interactions. Right, so highly scalable and highly expressive. The only cost is that the interactions happen only in this very thin final layer. But this is really pleasing for IR, that ColBERT because of this brings us back to common intuitions in IR. We do genuinely achieve with these MaxSim scores intuitive soft alignments.

Here I have the query, when did the Transformers cartoon series come out? And the response document, the animated Transformers was released in August 1986. And these lines are proportional to actual MaxSim values, query relative to document; the thickness of each line. And you can see that it is doing something very intuitive, and also something very semantic.

Because unlike term-based models, I don't have to do anything special to capture the fact that come in the context of come out is a lot like released. And similarly with when and that date. Here I'm showing the two topmost MaxSim values, and they're also very intuitive. And this is wonderful because IR has been so successful for so long, doing term matching, and it is nice to see that intuition carried forward into this more semantic space in my view.

So I can go up. Yeah. Is that matrix of like query to document mapping, is that why the in-memory index is very purple there? Yes. So your question is, why is the index for ColBERT so big? It is because we have to store every token level representation. Yes. I'm gonna, I'm gonna show you that we can do better, but naively storing these for our entire document store is gonna be a lot of vectors.

One per token, not one per type, one per token. Yeah. I have to pay somewhere. I guess that's the insight. Yeah. Question. Maybe there's some intuition that can make this a little clearer. If the document has multiple variants of, of transformers or, you know, Decepticon, Optimus Prime or whatever, they, they're all related to the original token transformers.

Would you be able to kind of draw that link to all those other tokens as well, or would you have to pick, I guess, one relationship? Transformers is represented once. I think that's a great question. So transformers, that's why I picked it. It's amusingly ambiguous between our model and the animated cartoon series.

Um, of my youth. So, but I think the point is that because it's BERT, transformers in this context will have a very different representation from the one it would have if we were talking about NLP. And that's why it's so good that we are using BERT, because then we'll get MaxSim scores that are appropriately semantic.

That is the hope. Yeah. Whereas term-based models are really gonna struggle. The best they're gonna be able to do is have n-grams that kind of capture the fact that this transformers is the cartoon series one. So another, actually, that's another argument in favor of being in a more semantic space.

I want to just quickly talk with you about how we have worked to optimize ColBERT, because I think that this suggests things that you would want to do if you developed your own neural retrieval models because the hard truth here is that BM25 is blazingly fast and scalable and these neural models are not.

You have to work much harder to get them to that point of being as performant in terms of other dimensions beyond accuracy. We could use ColBERT as a re-ranker as I alluded to before, right? So here I have all these token level representations which I do have to store, and they're each connected to a document.

Now, if used naively, this will not be scalable, but I could do this. Given some query, uh, that's represented as a sequence of tokens, uh, I could get the top k documents for it using like BM25 and then re-rank that top k. And so if k is small, I pay the full price of the ColBERT model but only for k documents.

And you're hoping that BM25 did a good job of getting you to that initial point. It's a very common application and it can be really meaningful to re-rank that final set of k documents. But we could do a little better. If we wanted to use ColBERT end-to-end, here's how it could work.

We again store all those token level vectors, but now we're gonna kind of turn things around. We just need to keep track of those vectors and their associated documents. For a query that we have encoded as a set of vectors using ColBERT, we take each query vector w_i, and we retrieve the p token vectors from this huge list that are most similar to our target.

And that doesn't require the full ColBERT model. That could just be a similarity calculation and you can do those really fast. People have really optimized that. And then you get all the documents associated with this small set of vectors that you found and you score them. So again, the name of the game is to use ColBERT only very sparingly in a final stage because it is so expensive.
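A rough sketch of that candidate-generation step, assuming the stored token vectors sit in one big matrix with a parallel mapping from row to document id; the p value and the names here are illustrative, and in practice this lookup is done with an optimized nearest-neighbor library rather than a dense matrix multiply:

```python
import torch

def candidate_docs(query_states, token_index, token_to_doc, p=16):
    # token_index: (num_stored_token_vectors, dim); token_to_doc: list mapping
    # each stored token vector back to the id of the document it came from.
    sims = query_states @ token_index.T      # similarity of every query vector to every stored vector
    top = sims.topk(p, dim=1).indices        # p most similar stored vectors per query vector
    return {token_to_doc[j] for j in top.flatten().tolist()}
```

Only the documents in this candidate set are then scored with the full MaxSim machinery.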

And then a third step here, just quickly, we can do even better and this is quite striking. What we can do is cluster our token level representations into their centroids using k-means clustering. That's what I've got in red here. And then use them as the basis for search. So again, we encode our query into a series of vectors.

And then for this target vector wi, we first get the centroids that are closest to that. And this is important because in practice, we can collect only like four centroids per token vector and do really well. That's a tiny number. Then we get the t most similar token vectors to that centroid.

And then we finally do scoring on the associated documents. And so by leaps and bounds here, we have reduced the amount of compute we need to do with this huge index by using the centroids and then using ColBERT again very sparingly. Final thing and then I'll take some questions.

The team has worked very hard to reduce the latency of ColBERT. This is a latency analysis here. And the thing I want to point out to you is that the ColBERT model steps, actually for this second version, the one I just described to you with the centroids, were actually a relatively small part of the overall cost because the model was being used so sparingly.

The big costs were in dealing with the huge index and also in the work to quantize the vectors so that they were easier to store on disk by making them smaller. And so after a bunch of work with this framework called PLAID, they were able to get rid of almost all of the index lookup and de-quantization or decompression steps for the vectors that were costing so much.

And they brought the latency down to like 58 milliseconds. So it went from something that is impossible to imagine deploying industrially to something that is close to what you might entertain as a possibility for deployment. And, you know, the details are in the PLAID paper. We can talk about them offline.

I just wanted to call out that I think this is an incredible achievement. It is so clever, the set of things that they did to achieve this enormous improvement. So shout out to them. And it does mean that if you had heard a rumor that ColBERT was impractical to use because the index was too large and the latency was too long, I think it's not true anymore.

The indices are small because of quantization, and this is that picture of latency. So give it a shot. I have one more model, but let me take questions. Yeah. Did you have a question? Oh, sorry. I just had a question about the latency and also the predictor. Okay. Cool.

Yeah, I'm happy to talk more. The PLAID paper is full of tricks and things like that. I don't want to take up too much time. I definitely want to give Sid plenty of time to talk about models. So let me just show you one more. This is SPLADE. This is also ingenious.

It'll get you thinking in a new way. Okay. So for SPLADE, I wrote sequence at the bottom because we're going to do this for both queries and documents, this process. And crucially, here I have the vocabulary. I've only represented seven tokens, but if it was BERT, it would be like 30,000.

Okay. So we again process the text into the output states, T1 through T3 there. And then we form all these scores. And the scores are determined by this thing here, that's s_ij. So we're going to apply a linear layer to the encoding, those output states, and we're going to combine it with the embedding for these vocabulary items with a bias.

So if you strip away the details, you can see that this is like a dot product of these states with all of these values here in our vocabulary. And then SPLADE is the sum of that. And so you can think of that as summing across all the document tokens.

And so what we've got in that orange column there is a probably very sparse vector that represents this text down here with respect to our vocabulary. So this is a lot like term-based, uh, work of old, right? This is a lot like a TF-IDF representation, except it was done in the neural space.

So we should get the advantages of being with a semantic model. And then the similarity value is just the SPLADE representation, that is this representation here for the query dot product with the document. And the loss is the one that we've been using all along. So just to be clear, so you do the SPLADE process both with the query and with the document.

And then, okay, cool. Yeah, that's it. There's a bunch of- it looks similar in my document. This is great. Let me review what you just said. There's a bunch of new things. Sequence, not query or document because we do this for both kinds. And of course, we can do all the documents ahead of time.

The big twist is that we're scoring these sequences with respect to the vocabulary. And we are essentially getting in semantic space because this is an embedding space here, and this is a contextual embedding space here, scores for each query term with respect to the whole vocabulary. That gives us this big, presumably pretty sparse vector, and their optimization further encourages sparsity.

And then the similarity value is the dot product of those for queries and for documents. So it has some hallmarks of late interaction, except it is interacting the text representations with the vocabulary, kind of like what you get with TF-IDF. And this model is outstanding. You saw it in some of my slides before, very impressive.
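As a hedged sketch of the scoring just described, assume `hidden` holds the encoder's output states for one sequence, `transform` is the linear layer applied to them, `vocab_emb` is the vocabulary embedding matrix, and `bias` is a per-vocabulary-item bias; the log(1 + ReLU) saturation is how the SPLADE paper keeps the vector sparse and positive, which the lecture only alludes to:

```python
import torch

def splade_vector(hidden, transform, vocab_emb, bias):
    # hidden: (seq_len, dim) output states; vocab_emb: (vocab_size, dim); bias: (vocab_size,)
    scores = transform(hidden) @ vocab_emb.T + bias       # s_ij: one score per token, per vocab item
    return torch.log1p(torch.relu(scores)).sum(dim=0)     # sum over tokens -> sparse vector over the vocabulary

def splade_sim(query_vec, doc_vec):
    # Query and document are both encoded this way; relevance is a plain dot product.
    return query_vec @ doc_vec
```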

And it's also a new way of thinking, which I really like. Here's a bunch of more recent developments, and one theme of them, I won't go through them, is just that people are working hard finally on making these models more efficient. So a big theme of this is not just obsession with accuracy, but also obsession with especially latency.

And then finally, for that paper that I mentioned before, we did a bunch of systematic investigations of different approaches. You can see BM25, DPR, ColBERT, and some SPLADE models here. And these are all kind of variants of these models where people have worked hard to optimize them. There's lots of tables like this in the paper.

Let me just draw out a few comparisons. BM25 is the only solution that could run on this tiny hardware here. We couldn't even run the other systems. That's why it's alone in its own little block there. And it costs nothing. Right. But it's not that successful either. Success at 10 is low relative to the rest.

When we move here, this is sort of interesting. These two ColBERT models, uh, achieve very similar performance. If you look all the way to the right, except one of them is double the latency of the other one for this hardware. And so you might wonder, do I really need this extra point of performance?

If I'm gonna have to wait that long. And then if you look at SPLADE, so SPLADE is below ColBERT v2 small, but its latency is a quarter or something of the ColBERT v2 small's. So maybe you care more about that and not so much about the success. And then if you compare these two SPLADE models, right, they have the same performance.

But if you just jack up the hardware a little bit, then you get much lower latency. But look how much the price went up. It went up for all of them with this heavy duty hardware. Uh, yeah. So this is the space that you're actually operating in. I'll- we'll talk later about how we might more systematically integrate all these scores.

I think this is enough now to get you thinking about all of these dimensions. And- They are the PLAID ColBERT ones. Yeah. Pretty expensive there. Luckily, in the paper, we show that you never need a GPU for ColBERT, I believe. You just- so you can always use cheaper hardware. Yeah.

But those costs do look scary. The final section of this is just some datasets. I think I don't need to go through it because you have it as a resource. If you want to get started in neural information retrieval, you've got TREC, MS MARCO, and then there are a bunch of new benchmarks that are designed to assess systems out of the box, that is zero-shot.

BEIR is great. LoTTE is great for long-tailed, topic-stratified evaluation, and then this XOR-TyDi is cool because it is multilingual. And I know you have expressed interest in multilingual stuff. This could be a great playground for doing that with kind of QA and retrieval, like open QA as we've been doing it.

Bunch of other topics. I think the bottom line here is just, again, this is like a refrain in this class. NLU and IR are back together again after being apart for so long, and this is having profound implications for research and technology development. So this is absolutely a very exciting moment to participate in this research because there is so much innovation yet to happen and it is having such an impact on research and also out in the wider world.

Excellent. All right. Sid, want to take over? So it's cool. It's like retrieval isn't just hitting NLU, it's hitting everywhere, like vision and robotics. As of like this week, we're starting to use retrieval methods to do things like: what's the best way to figure out how to do a new task?

Maybe retrieve some examples of a robot or a human doing the same task and then generating your actions. So cool stuff. Cool. All right. Let's see if this works. Yeah. All right. So I'm going to kind of pick up or try to pick up where I left off last week and kind of give you this evolution, this history lesson on how we got to the transformer, and then go from there into tips and tricks for training big models generally, and then end with like a small little teaser on fine tuning and parameter efficient tuning.

So you can use that in your projects down the road. Cool. So just to kind of blaze past things, I kind of started by talking through where things were pre-2017 when the transformer paper came out on both the RNN and the CNN side, and tied a lot of the innovation around the transformer to how modern convolutional neural nets, specifically residual nets were working, and the connections there were closer than the connections to RNNs.

Kind of walk through how we got to the self-attention block with this fancy code, which is basically just saying like you're splitting your heads and you can kind of think of your heads in a self-attention block as the different kind of kernels or filters in a CNN layer. Then kind of closing with like this full self-attention block, where we're actually doing the RNN style attention, and then this question of this non-linearity that we're adding at the end.

Because without this non-linearity and this sort of MLP that we're adding to the end of each transformer block, we're really just doing weighted averages of linear transforms of values. Okay. So, if we kind of take this as ground truth, starting point for what a transformer block looks like, very much inspired by the ideas of CNNs and RNNs with attention at the time.

We have this residual connection here, which is kind of just adding X over and over again as we stack more and more layers together. There's a problem. Can anyone spot the problem in this implementation by itself? So, the problem is that activations blow up. We keep adding the same input over and over again as we go deeper.

Eventually, specifically in the RNN attention layer, when we take this dot product between the queries and the keys, we're going to get overflow. So, we need to do something about that. All right. So, while the first part of kind of building the transformer layer is very, very much inspired by history, the second part is just trying to make sure it doesn't fail, and doesn't blow up, and doesn't crash when we try training it.

So, what's one thing that we can do to kind of avoid this sort of blow up of our activations? So, layer normalization. So, layer normalization, maybe batch norm and layer norm were covered earlier on, is a very, very simple idea. Along each feature dimension, we're just going to normalize so that each feature has mean zero, standard deviation one, which means that every time we add a residual connection, we're going to normalize so that everything comes back to a decent space.

We're still able to learn the kind of same level of expressivity we care about. We're just not necessarily going to keep blowing up or growing the magnitude of our activations. What that looks like is two calls to nn.LayerNorm with the dimensionality of our transformer, and then adding that into a res block. We're just going to normalize each x before we pass it into the attention and the MLP layers respectively.
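A minimal sketch of that block, where `attn` and `mlp` stand in for the self-attention and two-layer MLP sublayers defined earlier:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, attn, mlp):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)   # two calls to nn.LayerNorm with the model dimensionality
        self.ln2 = nn.LayerNorm(d_model)
        self.attn, self.mlp = attn, mlp

    def forward(self, x):
        x = x + self.attn(self.ln1(x))     # normalize x before attention, then residual add
        x = x + self.mlp(self.ln2(x))      # normalize x before the MLP, then residual add
        return x
```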

We're just going to normalize each X before we pass it into the attention and the MLP layers respectively. Now, there's a problem with this that isn't obvious, and actually wasn't obvious to the people building transformers at the time. It wasn't really explained kind of till three years later, which is that you have optimization issues when you do this.

Specifically, if you just try to optimize the naive transformer with this layer norm in place, with kind of conventional ML wisdom, which is like learning rate decay or a constant learning rate, bad things happen. Specifically, I'm going to use the hugging face emojis, my stand in for a transformer.

Stuff happens. The optimization crashes. It's either exploding gradients, it's either vanishing gradients. If you ask someone in 2018 or 2019 or 2020, they would tell you one or the other is happening, but there are definitely no gradients that are like stable throughout the training process. So, you introduce this kind of weird thing.

It kind of comes out of almost nowhere, which is like this transform, this like warm-up schedule that you see a lot of the time in like any code for training or even fine-tuning transformers these days. Now, this is actually just like fun because I have the time. I'm going to like go through it.

Who came up with this? >> I'm thinking, like I think I remember in the paper, they had like a weird learning rate, but I don't remember. >> So, it is in the original transformer paper, like the main thing that they need to get this stable. So, it's one of the authors that came up with it.

But if you actually run a Git blame on the first ever transformer code base from Google, the Tensor2Tensor code base, like in the very first commit, in like the argparse flags for the different optimizers you can use, there's one option just called Noam, after Noam Shazeer. And in the Annotated Transformer, like the very first blog post, that's what the optimizer is called.

It's called NoamOpt in Sasha Rush's code. And it's called the Noam optimizer for a really long time, until they just decided to call it just like, you know, linear warmup, cosine decay. And so, Noam Shazeer kind of came up with it. And if you were to kind of go back and think about like the sorts of problems and the papers he was working on at the time, he was actually doing a lot of stuff with different types of gradient descent optimizers, like RMSProp; Adafactor came out like a year after the transformer paper came out.

And he was really interested in looking at this problem of like, "Huh, okay, if you just really inspect the gradients early on with this layer norm thing, the variance seems to be high and you kind of want to burn that in." And he was seeing this for LSTMs, so he was kind of doing this already for his LSTM work.

And then he just like, "Let's try this." It worked and no one really questioned it. It breaks conventional machine learning wisdom, like why am I warming up my learning rate before I'm bringing it down, right? Like if I'm optimizing some surface, I kind of like want to start kind of high, move in, and then like maybe anneal it as I get closer to my minimum.

But no one is able to explain why, till three years later, a paper comes out that kind of like steps through the specifics of training a transformer model on some data, like synthetic data with the Adam optimizer, and actually ties it to the layer normalization layers that we just added.

We fixed one problem, we added another. Right. So up top, we have kind of good gradients. Right. So on the left here is the gradient magnitude and here's the update magnitude. So the gradients that are computed and the updates that are actually applied to the weights. With warm up in blue, in red, we have the same thing but without warm up.

And what ends up happening is that gradients go to zero somehow as you train. It's actually a weird graph because like as you're coming forward in time, it's like as you're training more and more. So this is kind of like starting out and then like this is kind of, yeah.

And this is kind of like towards the, you know, wherever training becomes unstable. And then your updates also become super high variance. So they do some math and they kind of bound the update as a kind of like proportional to the dimensionality of the, or the square root of the dimensionality of your transformer, over the input norm that's coming in.

So if your input norm, like if the size of your activation is like sufficiently large, your layer norm gradient is going to be completely, completely screwed. So what they end up doing is like, okay, so warm up is necessary because it helps you get the Adam optimizer to kind of like move slowly enough at the beginning.

So that we're kind of like saturating the gradients, we're like, okay. And then when we kind of go full throttle, like things are generally stable. The activations norms aren't changing all too much. They're changing in a predictable way and we can kind of start to handle that, and then conventional ML kicks in.

But it's weird. And it's also weird that it took three years later, and some people still don't buy this explanation, but it's the best explanation I've got to why we need that warm up. So general wisdom, you're fine tuning or pre-training a transformer, warm up for at least five percent of your full training, and then start decaying.
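A sketch of that rule of thumb, with the five-percent warm-up from above and cosine decay afterwards; the exact decay shape (cosine here, inverse square root in the original Noam schedule) is a choice, not a requirement:

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.05):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                        # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay to zero
```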

It just helps. >> Can I ask this? I don't know this paper. So is there some data dependency or some assumption about what the data will be like? Because it seems like you said, hey look, after a while we can relax. These updates are small or reasonable, but the world could do a lot to you.

And if you shifted genres or data types, it would go back into being very unstable. >> Yeah. So I think in this paper they're looking at what I'll call nice datasets. They're looking at the Wikitext 2s of the world that are somewhat predictable. It's all Wikipedia homogenized language. But even when you're training the modern, like the really big transformer these days, even after all of these tricks, you're still going to have bad batches.

Just really, really unpredictable things that are low likelihood under your model, that are going to cause big updates in the middle of training that are going to completely crash your run. So this happened tons of times while we were training like the 355 million parameter models. This happened every time you're training any big model in like the one million plus parameter range.

So the EleutherAI folks, like this happens all of the time. The T5 models have this thing in like one of the notes in like the GitHub repository for like training a T5, which is like, if training fails, rewind to the latest checkpoint, re-randomize your data order and then try again, and it won't crash.

That's kind of how modern ML or most modern language models are trained right now. We don't know how to avoid it yet. We think it's tied to the data, we can't isolate it. So why not just re-roll and try again? Eventually, you'll just keep making progress. Cool. So question.

>> Back to the graphs, what do the different colors represent in- >> Yeah. So the question was, what the different colors represent. So up top is blue with the traditional transformer learning rate, so warm up and then go down. Red is the no warm up, let's just start the learning rate high and then taper.

So red is bad, blue is good. >> >> Yeah. So the way you can interpret this graph, and the paper's linked in at the bottom of the slide, but you can think of the furthest back magnitude is like this, basically plotting the mean standard deviation of the updates across layers.

The furthest back is like batch zero. As you get further in, you get to batch 100 or batch 400 or whatever. Yeah. Question. >> I wonder to what extent the warm up and the rate, I think, relate to the choice of optimizer. Because I've run into some problems where I found that using Adamax, the infinity-norm variant, will work because I've got this level of dropout or whatever.

Is there any guidance of all these big hyperparameters that go into this tune, where if I pull one lever, I should be pushing one down or choose this optimizer, I should do another because it feels like, I mean, taking three years, it feels a little bit like the Wild West, which is what it is.

>> Yeah. So if I were to paraphrase your question, it's like, if you decide to change anything about the current recipe, like change your optimizer, change dropout, change your learning rate, are there rules of thumb for what else you need to change to get things to work? No. It is.

So part of why I led with how we got to here, starting from the historical context, was to unpack a lot of this folk knowledge. Because it's still at the point where optimizing these models, especially as we go bigger, is still concentrated in the minds and experience of a very small number of people.

Because who's trained a seven billion parameter language model, or who's trained a 100 billion parameter language model? Where does that skill set come from? When you're talking about a training run that cost millions of dollars to develop, plus however much the compute costs, how many things are you really going to be trying at the end?

What things can you extrapolate from? So folks at OpenAI have definitely done like scaling laws research where they're trying these different things in some bounded search space. But if you were to invent like a brand new optimizer, it kind of looks at like maybe second, third, fourth order moments of your gradients, maybe do something fancy relative to how things are changing over time.

And you were to just try and apply it to the biggest language model you could train, I have no idea what I would tell you in terms of like what things you should change, beyond like some obvious things, like maybe don't set the learning rate to be ridiculously high.

Starting now. >> If you have a batch of data that's like, let's just say during training you come across a bad batch of data that happens to cause like the destabilization with the gradients. And then you rewind back to your checkpoint and you take that same exact batch of data, but instead of running it through training when gradients are enabled, if you just run it through inference, will that bad batch of data have caused like anomalous behavior from the model during inference, or is it strictly just a back propagation issue?

>> So when we debug or when we debug this, so we a couple of years ago trained like some, by today's standards, really, really small language models, but like 124 million to 355 million scale. We were noticing this problem. The way we debugged it was like looking at the gradients, which didn't tell us much, but then we just looked at activation norms per layer, and that's how we actually debug this, right?

So looking at the forward pass, looking at the magnitudes of each layer, like where we thought we could possibly be overflowing or underflowing, that's exactly how we debugged it. But we didn't debug it at the batch level. We debugged it as a function of time, right? Because a single batch isn't going to perturb everything.

A series of batches, like maybe two, three, who knows how many, eventually you're going to fall under some trajectory where things get bad, your activations blow up. So we would be able to deterministically be running to that checkpoint, then deterministically step through training and log every activation, which is expensive, but that's how we were able to get to the bottom of the problem.

But I don't think we have tools for actually figuring out which batch of data was, or which sequence of batches of data were the actual triggers for that behavior. >> I guess I was just curious specifically about if the same data that causes destabilization in training can cause anomalous behavior, or just like a normal form of pass?

It probably would. There's some more recent work about how to quantize these models, like how to get a transformer that's trained with like 16-bit precision to train or to run with like 8-bit precision by like intelligently bucketing floats, from Tim Dettmers who's a PhD student up at UW.

He has this theory on something called outlier features that show up in these really big models that kind of try and get at this, but more of an art than a science right now. Yeah. Okay. So are we done now that we fix this like layer norm stuff, the learning rate stuff, all of the stuff to get the transformer to work?

Kind of. Right. So like over the last few years, like training has been stable, people want to like milk the most out of the transformer, you know, especially as they scale up. So they do a couple of things. So one, when you're training and you're projecting to queries and keys at sufficient scale, the bias term in each linear layer, like Wx plus b, you can get rid of the b's because it's not really doing anything and it saves a little bit of compute.

So let's get rid of them. Like that's like the first thing to throw out. There are different activations that have been invented and different types of like cool ways to just better fit your data. So there's this SwiGLU, where a gated linear unit actually defines a separate weight matrix as part of the activation.

And then the Swish part is a sigmoid-style activation that applies to one part of the projection and not the other. I have code for all of this. It works better. This is actually the activation of choice now in most transformer implementations. So LLaMA was trained with this, PaLM was trained with this; it works really well.

One thing that folks noticed is that moving the layer norm to happen before you actually feed things through the attention or MLP layers, instead of after, is a stabilizing force, and that's actually kind of important. Also, a layer norm has trainable weights, so some papers decide you don't actually need those trainable parameters for mean and variance.

You can just divide by the root mean square, the RMS, of your entire activation vector. All of this is about getting rid of irrelevant flops, because we're training massive models and we're trying to fit the biggest model we can on the compute that we have. Oh yeah, here's the code for a SwiGLU activation and an RMSNorm.
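(The slide code itself isn't reproduced in the transcript; below is a minimal sketch of what such an implementation might look like. Names like SwiGLUMLP are mine, the dimensions are illustrative, and where the description below splits a single projection into two chunks, this sketch uses two separate projection matrices, which is equivalent.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Gated MLP: project into a gate and a value, gate with SiLU, multiply, project back."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)   # biases dropped, as discussed
        self.w_value = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

class RMSNorm(nn.Module):
    """Normalize by the root mean square of the features; no mean subtraction or bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(d_model))  # many variants keep a learned gain
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * x * inv_rms
```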

So SwiGLU is: the SiLU is basically a sigmoid-weighted activation, and the projection layer is basically saying, let's take this input feature and project it into two separate chunks. One chunk becomes a gating value, kind of like in a gated recurrent unit from the RNN literature.

The other becomes the actual value; you apply the sigmoid-style gate and multiply it element-wise with the value, and you get your new representation. Works really well. An RMSNorm is literally just dividing by the norm of the vector instead of trying to learn anything fancy.

Cool. This is what the modern transformer looks like. So that's it for the evolution of the transformer. As far as I know, nothing in the last two weeks has changed drastically from this. In the last two weeks. >> To what extent, let's say we are doing a fine-tune instead of the full training, or we're doing LoRA on top of it.

Would we still want to follow these kinds of guidelines, or are they specific to doing a full pre-train on all of the data? >> Yeah. So the question is, what of this do we really need if we're fine-tuning, or if we're doing parameter-efficient fine-tuning? Is this only necessary for pre-training?

So I've started using SwiGLU instead of any other activation pretty much everywhere, even for a two-layer MLP; it tends to work better. Take that with a grain of salt. Everything else you can probably not care about. The RMSNorm and the pre-norm are probably just a general rule of thumb if you're adding any transformer layers, because they're demonstrably more stable, but other than that, no.
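Putting those pieces together, a pre-norm block in the spirit of what's described above might look like this rough sketch. It reuses the RMSNorm and SwiGLUMLP classes from the earlier sketch, uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the attention layer, and omits causal masking and other details:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual block: normalize *before* attention/MLP, then add the residual."""
    def __init__(self, d_model: int, n_heads: int, d_hidden: int):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        self.mlp_norm = RMSNorm(d_model)
        self.mlp = SwiGLUMLP(d_model, d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-block: norm first, then attend, then residual add.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # no causal mask in this sketch
        x = x + attn_out
        # MLP sub-block, same pre-norm pattern.
        x = x + self.mlp(self.mlp_norm(x))
        return x
```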

Other questions here before we move on to how to train on lots and lots of compute? Cool. So let's talk about training at scale. So I'll start with a story, my story. Okay. So I am not old, but I have seen a few things as far as language models go.

So 2018 is when I think I did my first deep learning tutorial. I trained the typical two- or four-layer MLP on MNIST for classification. There's actually a line there; it's the 100,000 parameter line. That's 2018. As I start my PhD in 2019, I'm doing more NLP stuff.

I'm looking at word vectors, RNNs, some more sophisticated things; I'm getting up to a million parameters. In 2020, I branch out from the small NLP stuff I'm doing to more intensive NLP, looking at tasks like summarization, training models with around 10 million parameters. Then by 2021, the biggest models I was training, when I switched into multimodality and robotics, looking at visual question answering, were around 18 million parameters.

At the time, the standard pipeline for me, and I think this was the standard pipeline for a lot of grad students that I talked to then, was that I'd be able to train most of my things on one GPU, or even my laptop CPU, for a maximum of a few hours.

That's about what a training run would take, at least for most of the things I was doing day-to-day. But in 2021, Percy's like, "Hey, this GPT-3 thing seems cool. Let's at least figure out if we can get an academic lab to try and train a GPT-2, the earlier generation." So that clocks in at 124 million parameters, which is notably an order of magnitude bigger than anything I had trained at the time.

So why I decided to do this is still beyond me, but I learned a lot of useful things. One of the useful things that I learned is that training a 124 million parameter model on a decent GPU that we had access to at the time would go out of memory with a batch size greater than four, which is bad, because a batch size of four is small and we ideally wanted to train with a batch size of 512.

So there was a simple trick called gradient accumulation, which is: I'm going to run batches of four however many times it takes to get to 512, and only do an update after processing all of those batches sequentially. So I'm just going to keep accumulating the gradients, and PyTorch makes that really easy.
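That for-loop-and-if-statement version might look roughly like this (a sketch, not the actual training script; model, optimizer, and loader are placeholders, and the model is assumed to return its loss directly):

```python
micro_batch_size = 4        # what actually fits in GPU memory
target_batch_size = 512     # the effective batch size we want to train with
accum_steps = target_batch_size // micro_batch_size  # 128 micro-batches per update

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = model(inputs, targets)          # forward pass on a micro-batch of 4
    (loss / accum_steps).backward()        # gradients accumulate in .grad across calls
    if (step + 1) % accum_steps == 0:      # only update once we've "seen" 512 examples
        optimizer.step()
        optimizer.zero_grad()
```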

It really is just a for loop and an if statement, as in the sketch above. But if you do the math, it's like 100 days to train on that single GPU for 400,000 steps. So how do we go from this clock of 100 days to something reasonable? That's what we're going to talk about. The scaling toolbox, at least as far as we were concerned, ended up looking like three different parts across 16 GPUs, because Percy and Chris Ré, and I think Chris and Dan and Chris Manning, and the NLP group decided to invest upfront in really powerful GPU machines, so we could actually train on 16 GPUs at once.

For reference, 16 GPUs on AWS, rented on an hourly basis, is 56 bucks an hour now. Fifty-six bucks an hour if you want to just sit on them. But if you're willing to let anyone who has the money and wants them preempt you, you can get them for 16 bucks an hour.

So across four days, that's not the worst. It's not great, but totally doable. The scaling toolbox we ended up with starts with data parallelism. You can think about this as literally just divide and conquer: how do I parallelize work across all of these GPUs instead of one?

Then mixed precision training, and we're going to talk a little bit about what that means. Then this interesting idea called zero redundancy (ZeRO), which is about minimizing the memory footprint of training. Then later on, as you want to scale up to hundreds of billions of parameters on 256, 512, 1024, 2048 GPUs,

there are things that come in handy, like model parallelism, and things to consider, like hardware and software limitations. But some of you might be here looking at me and thinking, okay, do I need any of this stuff if I'm not training really big models?

Like if I'm just fine-tuning stuff. A lot of these tips and tricks come in handy even if you don't have access to 100 GPUs or even eight, but do have access to two or four. A lot of the ideas here are still ideas that I'm using when I'm training stuff on my laptop, or when I'm trying to run inference with the latest big model that gets publicly released, so it's useful.

But please ask questions if things become too hazy or not useful. >> Mm-hm. >> For people relying on Colab, data parallelism might actually not help. >> So Colab, yeah, with Colab you're still limited to a single GPU. >> And I'm guessing zero redundancy might help. >> So mixed precision would help, kind of, definitely for running inference.

And zero redundancy would also help for running inference. >> What's zero redundancy? >> So zero redundancy has an add-on that they wrote up in a later paper, called ZeRO-Infinity, which is: what if I didn't put all my weights on the GPU at once? What if I put some of them in CPU RAM, or even in NVMe SSD storage?

So it actually turns your laptop into a more powerful workhorse than a Colab GPU. Cool, so this is a toy example going through data parallelism. We're running low on time-ish, so this is MNIST with an MLP. We're trying to do classification. It's the typical PyTorch workflow.

I'm going to define an nn.Module, a batch size, and a data loader that loads from the torchvision dataset. And then I'm just going to run lots and lots of gradient steps. The idea here is, how do we parallelize this across multiple workers, multiple GPUs?

Well, that batch size you see there is totally divisible, especially given that what we're doing at the end when we kind of compute the loss is just take an average. An average of averages is still the average. That's the idea we're going to work with. The mean of means is still the global mean.

So just like in CPU land, where you can think about SIMD instructions, single instruction, multiple data, which is how most graphics and media operations work on your laptop, we're now going to think about the SPMD paradigm: single program, multiple data. I'm going to write one program, and it's just going to automatically scale across our machines, because we're going to split the data across multiple machines.

It seems hard, but as of PyTorch 1.4, a lot of the hard parts are taken care of for you. These are the only lines you need to change in the implementation. Two of them are import statements. So the first thing we're going to do is we're going to just create something called a distributed sampler, which is going to automatically partition our data across the number of workers we define up front.

Right, we're defining a world size of eight, so that means we're training on eight GPUs. So this is going to partition our data into eight different subsets that each worker gets to go through. We're going to wrap our nn.Module with this nice little wrapper, DistributedDataParallel, which is going to sync the gradients for us behind the scenes.

And then we're going to run this with a special command called torchrun, which is just going to inject a bunch of environment variables so that we can get some statistics about our local rank: who should be printing stuff to the screen, who should be logging stuff, where each worker lives.
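In code, the handful of changes look roughly like this (a sketch of the pattern rather than the exact script from the slides; train_dataset and MyModel are placeholders, and the script is launched with torchrun rather than plain python):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")      # reads env vars injected by the launcher
local_rank = int(os.environ["LOCAL_RANK"])   # which GPU this worker should use
torch.cuda.set_device(local_rank)

# Each of the 8 workers sees its own 1/8 shard of the data.
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)  # per-worker batch size

# The wrapper all-reduces (averages) gradients across workers during backward().
model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])

# Launched with something like:
#   torchrun --nproc_per_node=8 train.py
```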

And that's about it. You can do all of this and just parallelize naively across 16 GPUs. You get not quite a 16x speedup, because there is some overhead from communication. Now it's like seven days, which is cool. But it was not good enough, because we were trying to train lots of models reproducibly, five seeds for ten different model types, so around 50 models.

So we needed to go a little faster than this. So let's talk about memory footprints. When I am training any model with an Adam optimizer, how much memory does just storing that model and the optimizer state take up? In 32-bit precision, our model is going to have parameters, where each parameter is stored with 32 bits; that's a float.

Gradients: 32 bits. Now, your optimizer is also going to do this weird thing where it keeps its own separate copy of the parameters, kind of duplicating a little bit of work there. That's also 32 bits. And then Adam tracks momentum and variance, the first and second moments of the gradients.

So that's another 64 bits right there. So the lower bound on static memory, just storing this stuff on a GPU, is 20 bytes times the number of parameters that you have. This doesn't include activations at all; for these larger transformer models, if I want to keep around every buffer, every intermediate matrix as I pass it through the network, that takes up way more space.

But this at least gives us something we can reason about. The training implication is that if I want to fit a model with a billion parameters, that's going to take about 18 gigs at rest, and 31 gigs of GPU RAM with a batch size of one. Which is problematic, because most GPUs cap out at 24 gigs.

The really expensive ones now have 40 or 80, but this is still bad. 175 billion parameters would take three terabytes of RAM, not storage, just RAM, without activations. With activations, it's probably looking like ten terabytes. Good luck. The numbers in bold are with a batch size of one; the numbers not in bold are just the static cost of putting the model on the device.
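As a quick sanity check on those numbers, here's the 20-bytes-per-parameter arithmetic (static cost only, no activations):

```python
# fp32 params + fp32 grads + optimizer's fp32 param copy + Adam momentum + Adam variance
BYTES_PER_PARAM = 4 + 4 + 4 + 4 + 4  # = 20 bytes per parameter

for n_params in (124e6, 1e9, 175e9):
    gib = n_params * BYTES_PER_PARAM / 2**30
    print(f"{n_params / 1e9:>7.3f}B params -> ~{gib:,.1f} GiB just sitting in memory")

# Roughly: ~2.3 GiB for 124M, ~18.6 GiB for 1B, ~3,260 GiB (about 3.2 TiB) for 175B.
```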

And things you should know about floats: the float32 format is defined in the IEEE 754 standard. You have a one-bit sign, an eight-bit exponent, and a 23-bit mantissa, all the stuff that comes after the exponent in scientific notation. Wide range, up to about 1e38. And the question is, do you need that range?

The answer is, kind of, but not really. So, the mixed precision memory footprint. If I'm training a model in mixed precision, what that means is that I'm going to run everything in the forward pass, and part of the backward pass, in 16-bit precision instead of 32-bit precision.

Notably, what that means is that now I'm storing my parameters in 16 bits and my gradients in 16 bits. All of those intermediate activations that take up lots and lots of memory, especially as you go bigger, are halved, which is great. But the weird part about mixed precision is that not everything is mixed precision.

Your optimizer, to stably update your model, still needs the 32-bit parameter copies, the 32-bit momentum, the 32-bit variance. But you've dropped four bytes per parameter, and those four bytes are kind of useful. And training with mixed precision, at least a couple of years ago, and it's still mostly true now, is way faster than training with full precision.

And the reason for that is most NVIDIA GPUs, starting with the Volta cards, started shipping with these things called tensor cores. Tensor cores are basically the individual logical units on a GPU that are responsible for matrix multiplies. Your GPU is really good at accelerating neural network training because it's really good at doing matrix multiplies.

These things are optimized for small matrix tiles, roughly 4x4 to 16x16 shards. If you're training in 16-bit precision, you can end up using way more of the tensor cores than if you were using 32-bit precision, and so you get a ton of speedup just because you're able to tap into the underlying hardware more of the time.
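In PyTorch, mixed precision training is typically just an autocast context plus a gradient scaler. Here's a minimal sketch, where model, optimizer, and loader are placeholders and the model is assumed to return its loss:

```python
import torch

scaler = torch.cuda.amp.GradScaler()    # rescales the loss so fp16 gradients don't underflow

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # runs the forward pass in half precision where safe
        loss = model(inputs, targets)
    scaler.scale(loss).backward()       # backward pass on the scaled loss
    scaler.step(optimizer)              # unscales grads, skips the step if they overflowed
    scaler.update()                     # adjusts the scale factor for the next iteration
```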

As of the Ampere-style cards, like the A100s, or the more recent 3090s, 3090 Tis, and 4090s, those started shipping with tensor cores that can also handle float32 (TF32) inputs, but are still even faster at 16-bit precision. So when you can, train with 16-bit precision. All right, this shaves a day off of our small-scale training.

This shaves off way more, especially as you go bigger. And now the final bit is how do we eliminate the redundancies? Right, so in standard data parallelism. Yeah. >> I have a question. So why do you need the 32 bit precision for the optimizer, but what is it for the model?

>> So when you are estimating the gradients, precision matters more. You want that full range. Specifically, you want those 23 mantissa bits to be meaningful. Because while the full range of float32 is really, really big, it can't be arbitrarily precise in the 0-to-1 range, for example.

So you want as much precision there as possible. Okay, so zero redundancy. In standard data parallelism, you're basically storing everything on each GPU. The key idea is: I don't need to store everything on every GPU. I just need to store some things on every GPU.

So the model stays on each GPU, because the model has to do all the work. But the gradients I can split across the number of devices that I have: half my gradients, if I have two GPUs, go on one device, and half go on the other device.

Same with the optimizer states: half of them go on one device, half go on the other. With this model of zero redundancy, you're not adding any extra communication cost; you just get free memory, because you're intelligently partitioning things across devices. And notice that this scales: you use less and less memory as you add more and more machines, right?
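For the optimizer-state part of this idea, PyTorch ships a built-in ZeroRedundancyOptimizer (DeepSpeed's ZeRO goes further and also shards gradients and parameters). A rough sketch, assuming the distributed setup from the earlier DDP example is already in place:

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer

# Instead of every rank holding full Adam state (momentum + variance) for all parameters,
# each rank only holds the optimizer state for its own shard of the parameters.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),                   # `model` is the DDP-wrapped model from before
    optimizer_class=torch.optim.AdamW,    # the wrapped optimizer whose state gets sharded
    lr=3e-4,
)
```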

So this is kind of the biggest trick to start training 1 billion to 10 billion parameter plus models. And when you add this, you're at three days. >> So would you have to tell it what, because this would require things being slightly out of sync to optimize.

Would you have to- >> So this actually doesn't require anything to be out of sync, because in a backward pass with distributed data parallel, the individual updates already have to be synced across processes. If you average the loss across all processes, that means the gradients you apply have to be communicated as well.

So this just basically does that work for you. We're gonna wrap up. At some point you hit a communication wall, matrix multiplies stop fitting on a device so you start sharding them, and then you have to start scheduling things wisely, and yeah, great. For fine-tuning and inference, there is a great library that you should use called PEFT from Hugging Face. It's great, and that's it.
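For reference, a rough sketch of what using PEFT for LoRA-style fine-tuning looks like; the base checkpoint and the target modules below are illustrative and depend on the model architecture:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # any causal LM checkpoint

config = LoraConfig(
    r=8,                          # rank of the low-rank update matrices
    lora_alpha=16,                # scaling factor applied to the update
    target_modules=["c_attn"],    # which linear layers get adapters (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()   # only a small fraction of weights are trainable
```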