
Stanford CS224N NLP with Deep Learning | 2023 | Hugging Face Tutorial, Eric Frankel


Transcript

Hi everyone. Welcome to the 224n Hugging Face Transformers tutorial. So this tutorial is just going to be about using the Hugging Face library. It's really useful and a super effective way of using off-the-shelf NLP models, specifically transformer-based models, whether for your custom final project or for anything you do in the future.

So these are -- it's a really helpful package to learn, and it interfaces really well with PyTorch in particular too. Okay, so first things first is in case there's anything else that you are missing from this kind of like tutorial, the Hugging Face documentation is really good. They also have lots of kind of tutorials and walkthroughs as well as other kind of like notebooks that you can play around with as well.

So if you're ever wondering about something else, that's a really good place to look. Okay, so in the Colab, the first thing we're going to do that I already did but can maybe run again is just installing the Transformers Python package and then the Datasets Python package. So this corresponds to the Hugging Face Transformers and Datasets.
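
(For reference, the install cell at the top of the Colab looks roughly like this; exact versions may differ.)

```python
# In a Colab cell: install the Hugging Face libraries used in this tutorial
!pip install transformers datasets
```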

And so those are really helpful. The Transformers is where we'll get a lot of these kind of pre-trained models from, and the Datasets will give us some helpful Datasets that we can potentially use for various tasks, so in this case, sentiment analysis. Okay, and so we'll use a bit of like a helper function for helping us understand what encoding is -- what encodings are actually happening as well.

So we'll run this just to kind of kick things off and import a few more things. Okay, so first what we'll do is this is generally kind of like the step-by-step for how to use something off of Hugging Face. So first what we'll do is we'll find some model from like the Hugging Face hub here.

And note that there's like a ton of different models that you're able to use. There's BERT, there's GPT-2, there's T5-small, which is another language model from Google. So there are a bunch of these different models that are pre-trained, and all of these weights are up here in Hugging Face that are freely available for you guys to download.

So if there's a particular model you're interested in, you can probably find a version of it here. You can also see kind of different types of models on the side as well that -- for a specific task. So if we wanted to do something like zero-shot classification, there are a couple of models that are specifically good at doing that particular task.

Okay, so based off of what task you're looking for, there's probably a Hugging Face model for it that's available online for you to download. Okay, so that's what we'll do first is we'll go ahead and find a model in the Hugging Face hub, and then, you know, whatever you want to do.

In this case, we'll do sentiment analysis. And then there are two things that we need next. The first is a tokenizer for actually, you know, splitting your input text into tokens that your model can use, and the actual model itself. And so the tokenizer, again, kind of converts this to some vocabulary IDs, these discrete IDs that your model can actually take in, and the model will produce some prediction based off of that.

Okay, so first what we can do is, again, import this AutoTokenizer and this AutoModelForSequenceClassification. So what this will do initially is download some of the key things that we need so that we can actually initialize these. So what does each of these do? So first the tokenizer, this AutoTokenizer, is built from some pre-trained tokenizer that has already been used.

So in general, there's a corresponding tokenizer for every model that you want to try and use. In this case, it's SiEBERT, so something around sentiment and RoBERTa. And then the second is you can import this model for sequence classification as well from something pre-trained on the model hub again.

So again, this corresponds to sentiment-roberta-large-english. And if we want, we can even find this over here on the hub; we can find it as, I think, yeah, large English. So again, this is something we can easily find. You just copy this string up here, and then you can import that.
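
(A minimal sketch of that loading step, assuming the sentiment RoBERTa checkpoint named above:)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint string copied from the Hugging Face Hub page
model_name = "siebert/sentiment-roberta-large-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```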

Okay, we've downloaded all of the things that we need, some binary weight files as well. And now we can go ahead and actually use some inputs, right? So this gives you some input, right? This input string, "I'm excited to learn about Hugging Face Transformers."

We'll get some tokenized inputs here after we pass it through the tokenizer. And then lastly, we'll get some notion of the model output that we get. So this is some set of logits here over whatever classes we have, so in this case, good or bad.

And then some corresponding prediction. Okay, and we'll walk through what this looks like in just a second as well, in a little more depth. But this is broadly how we can actually use these together: we'll tokenize some input, and then we'll pass these inputs through the model.
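
(Roughly, the tokenize-then-predict flow just described looks like this; the variable names are illustrative:)

```python
import torch

input_str = "I'm excited to learn about Hugging Face Transformers!"
model_inputs = tokenizer(input_str, return_tensors="pt")  # string -> vocabulary IDs
with torch.no_grad():
    outputs = model(**model_inputs)                        # forward pass -> logits over classes
prediction = outputs.logits.argmax(dim=-1).item()          # index of the predicted class
```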

So we'll talk about tokenizers first. So tokenizers are used for basically just pre-processing the inputs that you get for any model. And it takes some raw string to like essentially a mapping to some number or ID that the model can take in and actually kind of understand. So tokenizers are either kind of like are specific to the model that you want to use, or you can use the auto-tokenizer that will kind of conveniently import whatever corresponding tokenizer you need for that model type.

So that's kind of like the helpfulness of the auto-tokenizer. It'll kind of make that selection for you and make sure that you get the correct tokenizer for whatever model you're using. So the question is, does it make sure that everything is mapped to the correct index that the model is trained on?

The answer is yes. So that's why the AutoTokenizer is helpful. So there are two types of tokenizers: there's a Python tokenizer, and there's also a fast tokenizer, which is written in Rust. In general, if you use the AutoTokenizer, it'll just default to the fast one. There's not really a huge difference here.

It's just about kind of like the inference time for getting the model outputs. Yeah, so the question was, the tokenizer creates dictionaries of the model inputs. So it's more like I think the way to think about a tokenizer is like that dictionary almost, right? So you want to kind of translate almost or have this mapping from the tokens that you can get from like this string and then map that into kind of some inputs that the model will actually use.

So we'll see an example of that in just a second. So for example, we can kind of call the tokenizer in any way that we would for like a typical PyTorch model, but we're just going to call it on like a string. So here we have our input string is HuggingFaceTransformers is great.

We pass that into the tokenizer almost like it's like a function, right? And then we'll get out some tokenization. So this gives us a set of input IDs. So to answer the earlier question, these are basically the numbers that each of these tokens represent, right? So that the model can actually use them.

And then a corresponding attention mask for the particular transformer. Okay. So there are a couple ways of accessing the actual tokenized input IDs. You can treat it like a dictionary, so hence kind of thinking about it almost as that dictionary form. It's also just like a property of the output that you get.
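
(A quick sketch of both access styles on the tokenizer output:)

```python
encoded = tokenizer("HuggingFace Transformers is great!")

# Dictionary-style access
print(encoded["input_ids"])
print(encoded["attention_mask"])

# Attribute-style access on the same output
print(encoded.input_ids)
```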

So there are two ways of accessing this in like a pretty Pythonic way. Okay. So what we can see as well is that we can look at the particular, the actual kind of tokenization process almost. And so this can maybe give some insight into what happens at each step.

Right, so our initial input string is going to be HuggingFaceTransformers is great. Okay, the next step is that we actually want to tokenize these individual kind of individual words that are passed in. So here, this is the kind of output of this tokenization step. Right, we get kind of these individual split tokens.

We'll convert them to IDs here. And then we'll add any special tokens that our model might need for actually performing inference on this. So there are a couple of steps that happen underneath when you use a tokenizer, and they all happen at once.
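
(A sketch of those individual steps, using standard tokenizer methods; which special tokens get added, e.g. [CLS] and [SEP], depends on the checkpoint:)

```python
input_str = "HuggingFace Transformers is great!"

tokens = tokenizer.tokenize(input_str)                                 # 1. split into subword tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)                    # 2. map tokens to vocabulary IDs
with_special = tokenizer.build_inputs_with_special_tokens(token_ids)   # 3. add the model's special tokens
print(tokens, token_ids, with_special)
```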

One thing to note is that for fast tokenizers as well, there is another option that you're able to get to. So you have essentially, right, you have this input string. You have the number of tokens that you get. And you might have some notion of like the special token mask as well.

Right, so using char_to_word is going to give you the word index of a particular character in the input. So this is just giving you additional options that you can use with the fast tokenizer for understanding how the tokens relate to the input string.
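
(For example, something like this with a fast tokenizer; the character index is arbitrary:)

```python
encoded = tokenizer("HuggingFace Transformers is great!")

# Fast (Rust-backed) tokenizers expose alignment helpers like char_to_word:
# which word of the original string does character index 3 belong to?
print(encoded.char_to_word(3))
```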

Okay, so there are different ways of using the outputs of these tokenizers too. So one is that you can pass this in. And if you indicate that you want it to return a tensor, you can also return a PyTorch tensor. So that's great in case you need a PyTorch tensor, which you probably generally want.

You can also pass multiple strings into the tokenizer and then pad them however you need. So here, for example, the pad token is this [PAD] bracket, and its token ID corresponds to zero. So this is just going to add padding to whatever input you give.

So if you need your outputs to be the same length for a particular type of model, right, this will add those padding tokens and then correspondingly gives you like the zeros in the attention mask where you actually need it. Okay, and so the way to do that here is you basically set padding to be true, and also set truncation to be true as well.
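
(A sketch of padding and truncating a small batch while returning PyTorch tensors; the sentences here are made up:)

```python
sentences = [
    "HuggingFace Transformers is great!",
    "This second sentence is longer, so the first one will get padding tokens.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)   # (batch_size, longest sequence in the batch)
print(batch["attention_mask"])    # zeros mark the padded positions
```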

And so if there are any other features of the tokenizer that you're interested in, again you can check the Hugging Face documentation, which is pretty thorough about what each of these things does. Yeah, so the question is about the ## prefix on some of the tokens, and whether that means there should be a space before or not.

So here in this case, yeah, so in this case, we probably don't want like the space before, right, just because we have like the hugging, like I don't know, hugging is all one word in this case. Generally, like, generally the tokenizers, generally the output that they give is still pretty consistent though, in terms of how the tokenization process works.

So there might be kind of these like, you know, instances where it might be contrary to what you might expect for kind of how something is tokenized. In general, the tokenization generally works fine. So in most cases, kind of like the direct output that you get from the hugging face tokenizer is sufficient.

Okay, awesome. So one last thing, past adding additional padding, is that you can also decode an entire batch at one time. So if we look again, our tokenizer additionally has this method called batch_decode. So if we have the model inputs that we get up here, this is the output of passing these sentences or strings into the tokenizer.

We can go ahead and just pass like these input IDs that correspond to that into the batch decode, and it'll give us kind of this good, this decoding that corresponds to all the padding we added in, each of the particular kind of like words and strings. And if you want to, you know, ignore all the presence of these padding tokens or anything like that, you can also pass that in as skipping the special tokens.
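
(Continuing from the padded batch sketched above, decoding looks roughly like this:)

```python
# Decode the whole batch of input IDs back into strings
print(tokenizer.batch_decode(batch["input_ids"]))

# Same thing, but dropping [PAD] and other special tokens
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True))
```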

Gotcha. So this is a pretty high-level overview of how you would want to use tokenizers when using HuggingFace. So now we can talk about how to use the HuggingFace models themselves. So again, this is pretty similar to what we saw for initializing a tokenizer.

You just choose the specific model class for your model, or the corresponding auto model class. Where again, this auto model takes care of the initialization process for you in a pretty easy way, without too much overhead.

So additionally, the pre-trained transformers that we have generally share the same underlying architecture, but you'll have different heads associated with each transformer: task-specific heads that you might have to train if you're doing some sequence classification or just some other task. So HuggingFace will do this for you.

And so for this, I will walk through an example of how to do this for sentiment analysis. So if there's a specific context like sequence classification we want to use, we can use the very specific class HuggingFace provides, DistilBertForSequenceClassification. Alternatively, if we were using DistilBERT in a masked language model setting, we'd use DistilBertForMaskedLM.

And then lastly, if we're just doing it purely for the representations that we get out of DistilBERT, we just use the base DistilBertModel. So the key thing here, or key takeaway, is that there are task-specific classes that we can use from HuggingFace to initialize, as sketched below.
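
(A sketch of those three task-specific classes, assuming the distilbert-base-uncased checkpoint:)

```python
from transformers import (
    DistilBertForSequenceClassification,  # encoder plus a classification head
    DistilBertForMaskedLM,                # encoder plus a masked-LM head
    DistilBertModel,                      # bare encoder, just the representations
)

checkpoint = "distilbert-base-uncased"
clf_model = DistilBertForSequenceClassification.from_pretrained(checkpoint)
mlm_model = DistilBertForMaskedLM.from_pretrained(checkpoint)
base_model = DistilBertModel.from_pretrained(checkpoint)
```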

So AutoModel again is similar to the AutoTokenizer: it's just going to load that specific model class by default, and in this case, it's going to be just the basic weights that you need for that. Okay. So here, we'll have basically three different types of models that we can look at.

One is like an encoder type model, which is BERT. A decoder type model, like GPT-2, that's like performing these like, you know, generating some text potentially. And encoder decoder models, so BART or T5 in this case. So again, if you go back to kind of the HuggingFace hub, there's a whole sort of different types of models that you could potentially use.

And if we look in the documentation as well, so here we can understand some notion of like the different types of classes that we might want to use. Right. So there's some notion of like the AutoTokenizer, different AutoModels for different types of tasks. So here, again, if you have any kind of like specific use cases that you're looking for, then you can check the documentation.

Here, again, if you use AutoModel.from_pretrained, you'll just create a model that's an instance of that BERT model, in this case a BertModel for bert-base-cased. Okay, we can go ahead and start. One last thing to note is that the particular choice of your model matches up with the type of architecture that you have to use.

Right. So there are different, these different types of models can perform specific tasks. So you're not going to be able to kind of load or use BERT for instance, or DistilBERT as like a sequence to sequence model for instance, which requires the encoder and decoder because DistilBERT only consists of an encoder.

So there's a bit of a limitation on how you can exactly use these, but it's basically based on the model architecture itself. Okay. Awesome. So let's go ahead and get started here. So similarly here, we can import AutoModelForSequenceClassification. So again, we're going to perform some classification task, and we'll import this auto model here so that we don't have to explicitly reference something like DistilBertForSequenceClassification.

We'll be able to load it automatically and it'll be all set. Alternatively, we can use DistilBertForSequenceClassification here, and that specifically will require a DistilBERT checkpoint as the input there. Okay. So these are two different ways of getting the same model here: one using the AutoModel, one using DistilBertForSequenceClassification explicitly.

Cool. And here, because it's classification, we need to specify the number of labels or the number of classes that we're actually going to classify for each of the input sentences. Okay. So here, we'll get some like a warning here, right? If you are following along and you print this out, because some of the sequence classification parameters aren't trained yet, and so we'll go ahead and take care of that.
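
(The two equivalent loading calls, sketched with distilbert-base-uncased; num_labels sets the size of the new classification head, which is why that warning about freshly initialized parameters shows up:)

```python
from transformers import AutoModelForSequenceClassification, DistilBertForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```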

So here similarly, we'll kind of like walk through how to, how to actually, you know, train some of these models. So the first is how do you actually pass any of the inputs that you get from a tokenizer into the model? Okay. Well, if we get some model inputs from the tokenizer up here, and we pass this into the model by specifying that the input IDs are input IDs from the model inputs.

And likewise, we want to emphasize or we can, you know, show here and specifically pass in that the attention mask is going to correspond to the attention mask that we gave from these like, these outputs of the tokenizer. Okay. So this is option one where you can specifically identify which property goes to what.

The second option is using kind of a Pythonic hack almost, which is where you can directly pass in the model inputs. And so this will basically unpack almost the keys of like the model inputs here. So the model input keys, so the input IDs correspond to this. The attention mask corresponds to the attention mask argument.

So when we use this ** syntax, it will go ahead and unpack our dictionary and map each entry to the keyword argument of the same name. So this is an alternative way of passing it into the model. Both are going to be the same. Okay. So now what we can do is we can actually print out what the model outputs look like.
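
(Both options, roughly:)

```python
model_inputs = tokenizer("HuggingFace Transformers is great!", return_tensors="pt")

# Option 1: name every argument explicitly
out1 = model(input_ids=model_inputs["input_ids"],
             attention_mask=model_inputs["attention_mask"])

# Option 2: ** unpacks the dict, so each key becomes the keyword argument of the same name
out2 = model(**model_inputs)
```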

So again, these are the inputs, the token IDs and the attention mask. And then second, we'll get the actual model outputs. So here, notice that the outputs are given by these logits here. There are two of them: we pass in one example, and there are two potential classes that we're trying to classify.

Okay. And then lastly, we have, of course, the corresponding distribution over the labels here, right? Since this is going to be binary classification. Yes, it's a little bit weird that you have two classes for a binary classification task, when you could basically just classify one class or not.
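
(Turning the logits into that distribution is just one softmax, for example:)

```python
import torch.nn.functional as F

logits = out2.logits               # shape (1, 2): one example, two classes
probs = F.softmax(logits, dim=-1)  # distribution over the two labels
predicted_label = probs.argmax(dim=-1)
```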

But we do this just basically because of how HuggingFace models are set up. And additionally, the models that we load in from HuggingFace are basically just PyTorch modules. So these are the actual models, and we can use them in the same way that we've been using models before.

So that means things like loss.backward() will actually do this backpropagation step corresponding to the loss of the inputs that you pass in. So it's really easy to train these guys. As long as you have a label for your data, you can calculate your loss using the PyTorch cross entropy function.

You get some loss back and then you can go ahead and backpropagate it. You can even look at the parameters in the model that would get updated from this. So this is just some big tensor of the actual embedding weights that you have.
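
(A sketch of that manual PyTorch-style step; the label here is made up, and the attribute path for the embedding weights assumes a DistilBERT classifier:)

```python
import torch
import torch.nn.functional as F

label = torch.tensor([1])                       # hypothetical gold label for this example
outputs = model(**model_inputs)
loss = F.cross_entropy(outputs.logits, label)   # plain PyTorch cross entropy
loss.backward()                                 # backprop through the whole pretrained model

# The embedding weights now have gradients attached (DistilBERT attribute path assumed)
print(model.distilbert.embeddings.word_embeddings.weight.grad.shape)
```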

Okay. We also have like a pretty easy way, um, for HuggingFace itself to be able to, to calculate the loss that we get. So again, if we tokenize some input string, we get our model inputs. We have two labels, positive and negative, um, and then give some kind of corresponding label that we assign to the, the model inputs and we pass this in.

We can see here that the actual model outputs given by HuggingFace include this loss here. Right, so it'll include the loss corresponding to that input. So it's a really easy way of calculating the loss natively in HuggingFace without having to call anything additional from the PyTorch library.

And lastly, if we have these two labels here, again for positive or negative, what we can do is just take the model outputs, look at the logits, and see which one is the biggest. We take the argmax, which gives the index that's largest, and that's the output label that the model is actually predicting.
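
(The native HuggingFace version, sketched; the 1 = positive mapping here is an assumption:)

```python
model_inputs = tokenizer("I love this movie!", return_tensors="pt")
labels = torch.tensor([1])                      # assumed mapping: 1 = positive, 0 = negative

# Passing labels= makes HuggingFace compute the loss for you
outputs = model(**model_inputs, labels=labels)
print(outputs.loss)

# The predicted label is just the argmax over the logits
print(outputs.logits.argmax(dim=-1).item())
```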

So again, it gives a really easy way of being able to do this sort of classification, getting the loss, getting what the actual labels are, just from within HuggingFace. Okay, awesome. So the last thing as well is that we can also look inside the model in a pretty cool way and see what attention weights the model actually has.

So this is helpful if you're trying to understand what's going on inside of some NLP model. And so for here, we can again import our model from some pre-trained model weights in the HuggingFace hub, and we set output_attentions to True and output_hidden_states to True.

So these are going to be the key arguments that we can use when we're actually investigating what's going on inside the model at each point in time. Again, we'll set the model to be in eval mode, and lastly, we'll go ahead and tokenize our input string again.

We don't really care about any of the gradients here; again, we don't actually want to backpropagate anything here. And finally, we pass in the model inputs. So now what we're able to do is print out the model hidden states. This is a new property in the output dictionary that we get.

We can look at what these actually look like here. And sorry, this is a massive output. So you can actually look at the hidden state size per layer, right? And so this gives a notion of what the shape of this is at each given layer in our model, as well as the attention head size per layer.
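
(A sketch of getting those hidden states and attentions out, using distilbert-base-uncased as a stand-in checkpoint:)

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    output_attentions=True,
    output_hidden_states=True,
)
model.eval()

model_inputs = tokenizer("HuggingFace Transformers is great!", return_tensors="pt")
with torch.no_grad():
    model_output = model(**model_inputs)

print(len(model_output.hidden_states))      # embedding layer plus one entry per transformer layer
print(model_output.hidden_states[0].shape)  # (batch_size, seq_len, hidden_dim)
print(model_output.attentions[0].shape)     # (batch_size, num_heads, seq_len, seq_len)
```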

So this gives you like the kind of shape of what you're looking at. And then if we actually look at the model output itself, we'll get all of these different like hidden states basically, right? So, um, so we have like tons and tons of these, uh, different hidden states.

We'll have the last hidden state, um, here. So the model output is pretty robust for kind of showing you what the hidden state looks like, as well as what attention weights actually look like here. So in case you're trying to analyze a particular model, this is a really helpful way of doing that.

So what model.eval() does is, sorry, the question is what does .eval() do? What it does is it basically sets your model, and this is true for any PyTorch module or model, into "eval mode". So again, for this we're not really trying to calculate any of the gradients or anything like that.

That might correspond to, um, like correspond to some data that we pass in or try and update our model in any way. We just care about evaluating it on that particular data point. Um, so for that then it's helpful to set the model into like eval mode essentially, um, to help make sure that, that kind of like, uh, disables some of like that stuff that you'd use during training time.

So it just makes it a little more efficient. Yeah, the question was, uh, it's already pre-trained so can you go ahead and evaluate it? Yeah, you, you can. Um, so yeah, this is just the raw pre-trained model with no, no fine-tuning. So the question is like how do you interpret, um, these shapes basically, uh, for the attention head size and then the hidden state size?

So yeah, the key thing here is you'll probably want to look at the shape given on the side. It'll correspond to the layer that you're actually looking at. So here, when we looked at the shape, we're specifically looking at the first one in this list, right?

So this will give us the first hidden layer. The next gives us a notion of the batch that we're looking at. And then the last is some tensor, a 768-dimensional representation that corresponds there. And then for the attention head size, the last two dimensions correspond to the actual query word and the key word.

But yes, for this, we would expect this initial index here, the one, to be bigger if we printed out all of the layers, but we're just looking at the first one here. We can also get some notion of how this actually looks and plot it out along these axes as well.

So again, if we take this same kind of model input, which again is like this hugging face transformers is great, we're actually trying to see like what do these representations look like, on like a per layer basis. So what we can do here is basically, we're looking at for each layer that we have in our model, right?

And again, this is purely from the model output attentions, or the actual outputs of the model. Um, so what we can do is for each layer, and then for each head, we can analyze essentially like what these representations look like, and in particular, what the attention weights are, across each of like the tokens that we have.

So this is like a good way of again, understanding like what your model is actually attending to, within each layer. So on the side, if we look here, maybe zoom in a bit, we can see that this is going to be like, corresponds to the different layers, and the top will correspond to, these are across the attention, the different attention heads.

Okay. This will just give you some notion of like what the weights are. Here. So again, just to, um, to clarify. So again, if we maybe look at the labels, sorry, it's like a little cut off and like zoomed out, but so this y-axis here, like these different rows, corresponds to the different layers within the model.

Oops. Um, on the x-axis here, right, we have like the, um, like the different attention heads that are present in the model as well. And so for each head, we're able to for each, uh, at each layer to basically get a sense of like what, how the attention distribution is actually being distributed, what's being attended to, corresponding to each of like the tokens that you actually get here.

So if we look up here again, right, we're just trying to look at basically the model attentions that we get for each corresponding layer. The question is what's the color key: yellow is a higher magnitude, a higher value, and darker is closer to zero.

So the very dark color is basically zero. So what we can do now is maybe walk through what a fine-tuning task looks like here. And so first, in a project, you're probably going to want to fine-tune a model, and that's fine. We'll go ahead and walk through an example of what that looks like here.

Okay. So what we can do as well is use some of the datasets that we can get from HuggingFace. So it doesn't just have models, it has really nice datasets, and we're able to load those in as well.

So here what we're going to be looking at is the IMDB dataset, and so here again it's for sentiment analysis. We'll just look at only the first 50 tokens or so, and this is a helper function that we'll use for truncating the output that we get.

And then lastly, for actually making this dataset, we can use the DatasetDict class from HuggingFace again, which will basically give us this smaller dataset for the train split as well as specifying what we want for validation.

So here what we're going to do for our like mini data set for the purpose of this demonstration is we'll use, uh, make train and val both from the IMDB train, uh, data set. Uh, we'll shuffle it a bit, and then we're just going to select here 128 examples, um, and then 32 for validation.

So it'll shuffle it around, it'll take the first 128 and it'll take the next 32. And then we'll truncate those particular inputs that we get, again just to make sure we're efficient, and we can actually run this on a CPU. Okay. So next we can just see what this looks like.
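
(Putting those dataset steps together, a rough sketch; the truncation helper and the seed here are my own placeholders, not necessarily what the notebook uses:)

```python
from datasets import load_dataset, DatasetDict

imdb = load_dataset("imdb")

# Hypothetical helper: keep only roughly the first 50 whitespace tokens of each review
def truncate(example):
    return {"text": " ".join(example["text"].split()[:50]), "label": example["label"]}

small_imdb = DatasetDict(
    train=imdb["train"].shuffle(seed=42).select(range(128)).map(truncate),
    val=imdb["train"].shuffle(seed=42).select(range(128, 160)).map(truncate),
)
```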

It'll just, again, this is kind of just like a dictionary, it's a wrapper class almost of giving, you know, your train data set and then your validation data set. And in particular, we can even look at like what the first 10 of these looks like. Um, so first, like the output, so we specify train.

We want to look at the first 10 entries in our train dataset. And the output of this is going to be a dictionary as well, which is pretty cool. So we have the first 10 text examples that give the actual movie reviews here. So this is given in a list.

And then the second key that you get is the labels corresponding to each of these, so whether it's positive or negative. So here one is going to be a positive review, zero is negative. So it makes it really easy to use this for something like sentiment analysis.

Okay. So what we can do is go ahead and, uh, prepare the data set and put it into batches of 16. Okay. So what does this look like? What we can do is we can call the map function that this like, uh, that this small like data set dictionary has.

So you call map and pass in a lambda function of what we want to actually do. So here the lambda function says, for each example that we have, we want to tokenize the text. So this is basically saying how we want to pre-process this, and here we're extracting the tokens, the input IDs that we'll pass to the model.

We're adding padding and truncation as well. We're going to do this in a batch, and the batch size will be 16. Okay, hopefully this makes sense. Okay. So next we're basically just going to do a little more modification on what the dataset actually looks like.

So we're going to remove the column that corresponds to- to text. And then we're going to rename the column label to labels. So again if we see this, this was called label. We're just going to call it labels and we're going to remove the text column because we don't really need it anymore.

We just have gone ahead and pre-processed our data into the input IDs that we need. Okay. And lastly we're going to set it, the format to torch so we can go ahead and just pass this in, um, pass this into our model or our PyTorch model. The question is what is labels?

So label here corresponds to, again, in the context of sentiment analysis, just positive or negative. And so here we're just renaming the column. Okay. So now we'll just go ahead and see what this looks like. Again, we're going to look at the train set and only these first two things.

And so, um, so here now we have the two labels that correspond to each of the reviews. And the input IDs that we get corresponding for each of the reviews as well. Lastly, we also get the attention mask. So it's basically just taking the, what you get out from the tokenizer and it's just adding this back into the dataset.
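
(The preprocessing steps just described, sketched end to end on that small dataset:)

```python
# Tokenize every example in batches of 16, then clean up the columns for PyTorch
small_tokenized = small_imdb.map(
    lambda ex: tokenizer(ex["text"], padding=True, truncation=True),
    batched=True,
    batch_size=16,
)
small_tokenized = small_tokenized.remove_columns(["text"])
small_tokenized = small_tokenized.rename_column("label", "labels")
small_tokenized.set_format("torch")
```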

So it's really easy to pass in. The question is: we truncated, which makes things easy, but how do you apply padding evenly? So here, the first option is that you could manually set some high truncation limit like we did.

The second is that you can just go ahead and set padding to be true, and then the padding is added based off of the longest sequence that you have. Yes. So the question is, I guess, about doing it for all of them, all the text lists, evenly.

So again, it just depends on the size of the dataset you're loading, right? So if you're looking at particular batches at a time, you can just pad within that particular batch; you don't need to load the whole dataset into memory and pad the entire dataset in the same way.

So it's fine to do it within just batches. Yeah, the question was how the input IDs get added, and the answer is that it's basically done automatically. So we had to manually remove the text column here, in this first line here.

But if you recall the outputs of the tokenizer, it's basically just the input IDs and the attention mask, so it's smart enough to aggregate those together. Okay. The last thing we're gonna do is basically just put these together. So we have this dataset now, and it looks great.

We're just gonna import a PyTorch DataLoader, a typical normal data loader, and then go ahead and load each of these datasets that we just had, specifying the batch size to be 16. Okay. So that's fine and great. And so now for training the model, it's basically exactly the same as what we would do in typical PyTorch.

So again, it's like you still want to compute the loss, you can back propagate the loss and everything. Um, yeah. So it's, it's really up to your own design how you do, uh, how you do the training. Um, so here there's only like a few kind of asterisks I guess.

One is that you can import specific optimizer types from the transformers package. So you can do AdamW, Adam with weight decay, and you can get a linear schedule for the learning rate, which will decrease the learning rate over time with each training step.

So again, it's basically up to your choice. But if you look at the structure of like this code, right, we load the model for classification, we set a number of epochs, and then however many training steps we actually want to do. We initialize our optimizer and get some learning rate schedule, right.

And then from there, it's basically the same thing as what we would do, for a typical kind of like PyTorch model, right. We set the model to train mode, we go ahead and pass in all of these batches from like the, the data loader and then back propagate, step the optimizer and everything like that.
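
(A rough sketch of that explicit loop; the epoch count, learning rate, and shuffle=False are illustrative choices. Since each group of 16 examples was padded together in the map step, keeping the batch order lets the default collation stack them. AdamW is imported from transformers as in the lecture, though torch.optim.AdamW works too.)

```python
from torch.utils.data import DataLoader
from transformers import AdamW, get_linear_schedule_with_warmup

train_loader = DataLoader(small_tokenized["train"], batch_size=16, shuffle=False)

num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
optimizer = AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        outputs = model(**batch)   # batch holds input_ids, attention_mask, labels
        outputs.loss.backward()    # backpropagate
        optimizer.step()           # update the weights
        lr_scheduler.step()        # decay the learning rate
        optimizer.zero_grad()
```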

So it's, uh, pretty, pretty similar from what we're kind of like used to seeing essentially. Awesome. So that'll go do its thing at some point. Um, okay. And so, uh, so that's one potential option is if you really like PyTorch, you can just go ahead and do that and it's really nice and easy.

Um, the second thing is, uh, that HuggingFace actually has some sort of like a trainer class that you're able to use that can handle most of, most of these things. Um, so again, if we do that kind of like the same thing here, this will actually run once our model is done training.

We can create our dataset in the same way as before. Now, what we need to use is this import of the TrainingArguments class; this is going to be basically a dictionary of all the things that we want to use when we actually train our model.

And then this kind of like additional trainer class, which will handle the training kind of like magically for us and kind of wrap around in that way. Okay. So if you can, okay, I think we're missing a directory, but, um, I think, yeah, pretty straightforward for how you want to train.

Yeah. Um, so for, for here at least, um, again, there are kind of the two key arguments. The first is training arguments. So this will specify, have a number of specifications that you can actually pass through to it. It's where you want to log things for each kind of like device.

In this case, like we're just using one GPU, but potentially if you're using multiple GPUs, what the batch size is during training, what the batch size is during evaluation time. How long you want to train it for. How you want to evaluate it. So this is kind of like evaluating on an epoch level.

What the learning rate is and so on, so on. So again, if you want to check the documentation, you can see that here. There's a bunch of different arguments that you can give. There's like warm-up steps, warm-up ratio, like weight decay. There's like so many things. So again, it's basically like a dictionary.

Feel free to kind of like look at these different arguments you can pass in. But there's a couple of key ones here. And this is basically, this basically mimics the same arguments that we used before in our like explicit PyTorch method here for HuggingFace. Okay, similarly, what we do is we can just pass this into the trainer and that will take care of basically everything for us.

So that whole training loop that we did before is condensed into this one class for actually doing the training. So we pass the model, the arguments, the train dataset, the eval dataset, what tokenizer we want to use, and then some function for computing metrics. So here, we pass in this function, and it takes eval predictions as input.

Basically what this does is these predictions are given from the trainer, we pass them into this function, and we can split them into the actual logits and the ground truth labels that we have. And then from here, we can just calculate any sort of additional metrics we want, like accuracy, F1 score, recall, or whatever you want.
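
(A sketch of those two pieces; the output directory, batch sizes, and metric are illustrative:)

```python
import numpy as np
from transformers import TrainingArguments, Trainer

def compute_metrics(eval_pred):
    logits, labels = eval_pred                  # predictions handed back by the Trainer
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

training_args = TrainingArguments(
    output_dir="sample_hf_trainer",             # hypothetical checkpoint/log directory
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_tokenized["train"],
    eval_dataset=small_tokenized["val"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
```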

Okay, so this is like an alternative way of formulating that training loop. Okay, the last thing here as well is that we can have some sort of callback as well if you want to do things during the training process. So after every epoch or something like that, you want to evaluate your model on the validation set or something like that, or just go ahead and like dump some sort of output.

That's what you can use a callback for. And so here, this is just a logging callback. It's just going to log kind of like the information about the process itself. Again, not super important, but in case that you're looking to try and do any sort of callback during training, it's an easy way to add it in.

The second is if you want to do early stopping as well. So early stopping will basically stop your model early, as it sounds. If it's not learning anything and a bunch of epochs are going by, and so you can set that so that you don't waste kind of like compute time, or you can see the results more easily.
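
(Adding early stopping looks roughly like this; the patience of 3 is arbitrary, and it also needs load_best_model_at_end and metric_for_best_model set in TrainingArguments:)

```python
from transformers import EarlyStoppingCallback

# Stop training if the tracked metric hasn't improved for 3 evaluations
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
```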

The question is, is there a good choice for the patience value? It just depends on the model architecture. Not really, I guess; it's pretty much up to your discretion. Okay, awesome. And so the last thing that we do is just call trainer.train. So if you recall, this is just the instantiation of this trainer class; we call trainer.train, and it'll just go.

So now it's training, which is great. It gives us a nice kind of estimate of how long things are taking, what's going on, what arguments that we actually pass in. So that's just going to run. And then likewise, hopefully it'll train relatively quickly. Okay, it'll take two minutes. We can also evaluate the model pretty easily as well.

So we just call trainer.predict on whatever dataset we're interested in. So here it's the tokenized dataset; of course, we're going to use the validation dataset. Okay, hopefully we can pop that out soon. And lastly, if we saved anything to our model checkpoints, so hopefully this is saving stuff right now.

Yeah, so this is going to continue to save stuff to the folder that we specified. And so here, in case we ever want to load our model again from the weights that we've actually saved, we just pass in the name of the checkpoint, the relative path here to our checkpoint.

So notice how we have some checkpoint-8 here. We just pass in the path to that folder, we load it back in, we tokenize, and it's the same thing as we did before.
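
(The train, predict, and reload steps, sketched; the checkpoint-8 path depends on your run and output directory:)

```python
trainer.train()

# Evaluate on the validation split
predictions = trainer.predict(small_tokenized["val"])

# Later, reload a saved checkpoint by passing its path
reloaded = AutoModelForSequenceClassification.from_pretrained("sample_hf_trainer/checkpoint-8")
```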

There are a few additional appendices for how to do different tasks as well: an appendix on generation, how to define a custom dataset, and how to pipeline different tasks together. So this is using a pre-trained model really easily through the pipeline interface. There are different types of tasks like masked language modeling.
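
(As a tiny illustration of that pipeline interface; the default checkpoint it downloads may vary:)

```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # one line to get a default sentiment model
print(sentiment("I'm excited to learn about Hugging Face Transformers!"))
```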

Feel free to look through those at your own time. And yeah, thanks a bunch.