Stanford CS224N NLP with Deep Learning | 2023 | Hugging Face Tutorial, Eric Frankel
Welcome to the CS224N Hugging Face Transformers tutorial. This tutorial is about using the Hugging Face library. It's a really useful and effective way to work with transformer-based models, and you can use those models for your final project, especially a custom final project. It's a really helpful package to learn, and it interfaces very well with PyTorch in particular.
Okay, so first things first: if there's anything else you're looking for, the Hugging Face documentation is really good. They also have lots of tutorials and walkthroughs, as well as other notebooks that you can play around with. So if you're ever wondering about something, the documentation is the place to look.
Okay, so in the Colab, the first thing we're going to do, which I already did but can run again, is install the Transformers Python package. This corresponds to the Hugging Face Transformers and Datasets libraries. Transformers is where we'll get a lot of these pre-trained models, and Datasets gives us helpful datasets that we can use for various tasks, in this case sentiment analysis. We'll also use a small helper function for understanding what encodings are actually happening. So we'll run this to kick things off and import a few more things.
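Here's a minimal sketch of what that setup cell looks like; the exact pinned versions in the notebook may differ.

```python
# Install the two Hugging Face libraries used in this tutorial (run once in Colab).
# !pip install transformers datasets

import torch  # the models we load are ordinary PyTorch modules
```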
Okay, so first, here's the general step-by-step for how to use something off of Hugging Face. The first thing we'll do is find a model on the Hugging Face Hub. Note that there are a ton of different models you can use: there's BERT, there's GPT-2, there's T5-small, and so on. There are a bunch of different pre-trained models, and all of these weights are freely available on Hugging Face, so if there's a particular model you're interested in, you can just look it up. You can also filter by different types of models on the side. So if we wanted to do something like zero-shot classification, there are a couple of models that are specifically good at that particular task. Based on what task you're looking for, there's probably a Hugging Face model for it that's available online for you to download. So that's what we'll do first: find a model in the Hugging Face Hub for whatever you want to do.
And then there are two things that we need next. The first is a tokenizer for splitting your input text into tokens that your model can use. The tokenizer converts the text to vocabulary IDs, the discrete IDs that your model can actually take in, and the model will produce some prediction based on those. So we'll import this AutoTokenizer and this AutoModelForSequenceClassification. What this will do initially is download the key things we need so that we can actually initialize these.
The tokenizer we load is from some pre-trained tokenizer that already exists; in general, there's a corresponding tokenizer for every model, so here it's something like a sentiment RoBERTa tokenizer. The second is that you can import the model for sequence classification, again pre-trained from the model hub. Here that corresponds to a sentiment RoBERTa large English model. If we want, we can even find this over on the hub; it's the large English one. So this is something we can easily find: you just copy the model string from the hub page, and then you can import it.
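A small sketch of that loading step follows. The exact hub id is an assumption based on the "sentiment RoBERTa large English" model mentioned above; in practice you copy the string from the model's hub page.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed hub id for the sentiment RoBERTa large English model discussed above.
model_name = "siebert/sentiment-roberta-large-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)                 # downloads the matching tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)  # downloads the model weights
```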
Okay, we've downloaded all the things that we need, and now we can go ahead and actually use them. This gives you some input, right? This input string, "I'm excited to learn about Hugging Face Transformers!" We'll get some tokenized inputs from the tokenizer, and then lastly we'll get the model output, which is some logits over whatever classification labels we have. We'll walk through what this looks like in just a second, but this is broadly how we can use these together: we'll tokenize some input, and then we'll pass those inputs through the model.
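Here's a minimal sketch of that end-to-end flow, assuming the tokenizer and model loaded above.

```python
# Tokenize a raw string, then run the classifier on it.
inputs = tokenizer("I'm excited to learn about Hugging Face Transformers!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)  # raw scores over the classification labels
```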
So tokenizers are used for pre-processing the inputs. A tokenizer takes some raw string and essentially maps it to numbers or IDs that the model can take in and actually understand. Tokenizers are either specific to the model you want to use, or you can use the AutoTokenizer, which conveniently imports whatever corresponding tokenizer you need for that model type. That's the helpfulness of the AutoTokenizer: it makes that selection for you and makes sure you get the correct one.
So the question is, does it make sure that everything is mapped to the correct index? There's a Python tokenizer, and there's also a "fast" tokenizer. In general, if you use the AutoTokenizer, it will just default to the fast one; the difference is mostly about how quickly you get the model inputs. The other question was whether the tokenizer creates dictionaries of the model inputs. The way to think about a tokenizer is like that dictionary: you want to translate, or have a mapping from, the tokens you get from a string into the inputs that the model can consume. We'll see an example of that in just a second.
So for example, we can call the tokenizer the way we would call a typical PyTorch model, but we're just going to call it on a string. Here our input string is "HuggingFace Transformers is great!" We pass that into the tokenizer almost like it's a function. To answer the earlier question, these are basically the numbers that each of the tokens maps to, along with a corresponding attention mask for the transformer. There are a couple of ways of accessing the actual tokenized input IDs: you can treat the output like a dictionary, hence thinking about it almost as one, or you can access them as a property of the output you get. So there are two ways of accessing this in a pretty Pythonic way.
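A short sketch of that call and the two access styles:

```python
# Call the tokenizer like a function on a raw string.
model_inputs = tokenizer("HuggingFace Transformers is great!")

# Two equivalent, Pythonic ways to get at the token IDs:
print(model_inputs["input_ids"])   # dictionary-style access
print(model_inputs.input_ids)      # attribute-style access
print(model_inputs["attention_mask"])
```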
What we can also do is look at the actual tokenization process step by step, which can give some insight into what happens at each stage. Our initial input string is "HuggingFace Transformers is great!" The next step is that we tokenize this into individual pieces; the output of that tokenization step is these individual split tokens. Those tokens then get mapped to vocabulary IDs, and lastly we'll add any special tokens that our model might need. So there are a couple of steps that happen underneath when you use a tokenizer, all at once.
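Here's a sketch that breaks those steps apart using standard tokenizer methods; the exact subword tokens you see depend on the model's vocabulary.

```python
text = "HuggingFace Transformers is great!"

tokens = tokenizer.tokenize(text)                     # split into subword tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)   # map tokens -> vocabulary IDs
final_ids = tokenizer.build_inputs_with_special_tokens(token_ids)  # add the model's special tokens

print(tokens)
print(token_ids)
print(final_ids)
```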
One thing to note is that fast tokenizers give you some additional options. You have this input string, and you might also have some notion of a special tokens mask. Using char_to_word, for example, will give you the word that a particular character position belongs to. So these are just additional options you can use with the fast tokenizer for understanding how the tokens are being produced.
Okay, so there are different ways of using the outputs of these tokenizers too. If you indicate that you want it to return a tensor, it will give you PyTorch tensors back, which is great in case you need one. You can also pass multiple sentences into the tokenizer at once. Here, for example, the pad token is the padding symbol, and its token ID corresponds to zero for this model. This just adds padding to whatever input you give, so if you need your inputs to be the same length for a particular type of model, it will add those padding tokens and correspondingly give you zeros in the attention mask wherever the padding is. The way to do that here is to basically set padding to true. For any other features of the tokenizer you're interested in, you can check the Hugging Face documentation, which is pretty thorough about what each of these things does.
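A sketch of padding a small batch and getting PyTorch tensors back:

```python
sentences = [
    "HuggingFace Transformers is great!",
    "Tokenizers pad shorter sentences.",
]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

print(batch["input_ids"])       # shorter sentences are padded with the tokenizer's pad token id
print(batch["attention_mask"])  # zeros mark the padding positions
```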
Yeah, so the question is about the "##" prefix, and whether that means there should be a space before the token or not. We probably don't want the space before, since the subword continues the previous piece, as in "Hugging". Generally, though, the output the tokenizers give is still pretty consistent in terms of how the tokenization process works. There might be instances where it's contrary to what you might expect, but in general the tokenization works fine.
Beyond the direct output that you get, you can also decode an entire batch at one time; tokenizers additionally have a method called batch_decode. So if we have the model inputs we got up here, the output of passing in those sentences, we can just pass those input IDs in and get back the decoded strings, including all the padding we added in around each of the words. If you'd rather ignore the presence of the padding tokens and other special tokens, you can also pass in an option for skipping the special tokens.
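A quick sketch of both variants, reusing the batch from above:

```python
# Decode the whole batch back to strings, padding included.
print(tokenizer.batch_decode(batch["input_ids"]))

# Same thing, but dropping padding and other special tokens.
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True))
```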
So that's roughly how you would want to use tokenizers. Now we can talk about how to use the Hugging Face models themselves. Again, this is pretty similar to what we saw for the tokenizer: you just choose the specific model class for your model, or you use the corresponding auto model class, where again the auto model takes care of the selection for you in a pretty easy way. Additionally, the pre-trained transformers we have generally share the same underlying architecture, but you'll have different heads associated with each transformer: task-specific output heads that you might have to train if you're doing sequence classification or some other task.
And so for this, I'll walk through an example of how to do this for sentiment analysis. If there's a specific setting like sequence classification we want to use, we can use the very specific class Hugging Face provides. Alternatively, if we were using DistilBERT in a masked language modeling setting, or if we just want the raw representations that come out of DistilBERT, there are task-specific classes we can use from Hugging Face to initialize those. AutoModel, again, is similar to the AutoTokenizer: it's just going to load that specific model by default, and in this case it gives you just the basic weights that you need for it.
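Here's a sketch of those three variants side by side, assuming the standard distilbert-base-uncased checkpoint.

```python
from transformers import (
    AutoModel,
    DistilBertForSequenceClassification,
    DistilBertForMaskedLM,
)

checkpoint = "distilbert-base-uncased"

cls_model = DistilBertForSequenceClassification.from_pretrained(checkpoint)  # classification head
mlm_model = DistilBertForMaskedLM.from_pretrained(checkpoint)                # masked-LM head
base_model = AutoModel.from_pretrained(checkpoint)                           # just the encoder representations
```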
Okay. So here we have basically three different types of models we can look at. One is an encoder-type model, which is BERT. A decoder-type model, like GPT-2, performs autoregressive generation. And there are encoder-decoder models, so BART or T5 in this case. Again, if you go back to the Hugging Face Hub, there's a whole range of different types of models you could potentially use, and here we can get some notion of the different classes we might want: the AutoTokenizer and different AutoModels for different types of tasks, depending on the specific use case you're looking for. If you use AutoModel.from_pretrained with a BERT checkpoint, you'll just create a model that's an instance of that BERT model. One thing to make sure of is that the particular choice of your model matches up with the type of architecture the task requires, because these different types of models can perform specific tasks. You're not going to be able to load or use BERT, or DistilBERT, as a sequence-to-sequence model, for instance, which requires an encoder and a decoder, because DistilBERT only consists of an encoder. So there's a bit of a limitation on how exactly you can use these, but it's basically based on the model architecture itself.
Okay. Awesome. So let's go ahead and get started here. Similarly, we can import AutoModelForSequenceClassification. Again, we're going to perform a classification task, and we'll import this auto model so that we don't have to reference something like DistilBertForSequenceClassification explicitly; it will be loaded automatically and be all set. Alternatively, we can use DistilBertForSequenceClassification directly, and that specifically requires a DistilBERT checkpoint as the input. So these are two different ways of getting the same model: one using the AutoModel, one using DistilBERT explicitly. In both cases, we need to specify the number of labels, the number of classes we're actually going to classify each input sentence into.
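A sketch of the two equivalent ways to get a two-class DistilBERT classifier:

```python
from transformers import AutoModelForSequenceClassification, DistilBertForSequenceClassification

# Option A: the auto class figures out the architecture from the checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Option B: name the architecture explicitly (requires a DistilBERT checkpoint).
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
```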
Okay. So here we'll get a warning if you're following along and print this out, because some of the sequence classification parameters aren't trained yet; the classification head is newly initialized. So, similarly, we'll walk through how to actually train some of these models.
So the first question is how you actually pass the inputs you get from the tokenizer into the model. Well, if we get some model inputs from the tokenizer up here, we can pass them into the model by specifying that the input IDs are the tokenizer's input IDs, and likewise we can explicitly pass in that the attention mask corresponds to the attention mask the tokenizer gave us. So this is option one, where you specifically identify which property goes to which argument. The second option is a Pythonic trick, where you can directly pass in the model inputs. This will basically unpack the keys of the model inputs: the input IDs correspond to the input_ids argument, and the attention mask corresponds to the attention_mask argument. So when we use this double-star syntax, it unpacks our dictionary and maps each key to the matching keyword argument. This is an alternative way of passing the inputs into the model.
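A sketch of both options. Since the classifier above is a DistilBERT checkpoint, the tokenizer has to match it, so we load the DistilBERT tokenizer here too.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # must match the model checkpoint
model_inputs = tokenizer("HuggingFace Transformers is great!", return_tensors="pt")

# Option 1: name each argument explicitly.
outputs = model(
    input_ids=model_inputs["input_ids"],
    attention_mask=model_inputs["attention_mask"],
)

# Option 2: unpack the whole dictionary so each key maps to the matching keyword argument.
outputs = model(**model_inputs)
print(outputs.logits)
```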
Okay. So now what we can do is print out what the model inputs look like, and then, second, the actual model outputs. Notice that the outputs are given by these logits here. There are two of them: we pass in one example, and there are two potential classes we're trying to classify into, so a softmax over them gives the corresponding distribution over the labels, since this is binary classification. Yes, it's a little bit weird to have two classes for a binary classification task; you could basically just predict one class or not. But we do this because of how Hugging Face models are set up.
An important point is that the models we load in from Hugging Face are basically just PyTorch modules. These are the actual models, and we can use them in the same way we've been using models before. That means things like loss.backward() will actually do the backpropagation step corresponding to the loss on the inputs you pass in. So it's really easy to train these. As long as you have a label for your data, you can calculate your loss using the PyTorch cross-entropy function, get some loss back, and then go ahead and backpropagate it. You can even look at the parameters in the model that would get updated from this; each one is just some big tensor of actual weights.
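Here's a small sketch of treating the model as a plain PyTorch module; the label is made up for the single example above.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([1])                      # a made-up gold label for the single example
outputs = model(**model_inputs)
loss = F.cross_entropy(outputs.logits, labels)  # standard PyTorch loss on the logits
loss.backward()                                 # backpropagates through the whole transformer

# The parameters are ordinary PyTorch parameters; here's one of the big weight tensors
# that would get updated by an optimizer step.
for name, param in list(model.named_parameters())[:1]:
    print(name, param.shape)
```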
Okay. We also have a pretty easy way to do this natively. We have two labels, positive and negative, and we give some corresponding label that we assign to the input. We can see here that the model outputs given by Hugging Face include this loss; it will include the loss corresponding to that input when you pass the labels in. So that's calculating the loss natively in Hugging Face without having to call anything additional from the PyTorch library. And if we have these two labels, what we can do is just take the model outputs, look at the logits, and see which one is the biggest; the argmax gives the index that's largest, and that's the output label the model is actually predicting. So again, it gives a really easy way of doing this sort of classification: getting the loss, and getting what the predicted labels are.
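A sketch of the native route, passing labels straight into the forward call:

```python
# The model computes cross-entropy internally when labels are provided.
outputs = model(**model_inputs, labels=torch.tensor([1]))
print(outputs.loss)

# Predicted label: index of the largest logit.
pred = outputs.logits.argmax(dim=-1)
print(pred)
```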
The last thing is that we can also look inside the model in a pretty cool way and see what attention weights the model actually has. This is helpful if you're trying to understand what's going on inside an NLP model. Here we're importing our model from some pre-trained model weights on the Hugging Face Hub, and we set output_attentions to true and output_hidden_states to true. These are the key arguments we can use to see what's going on inside the model at each point. Again, we'll set the model to eval mode, and lastly we'll tokenize our input string again. We don't really care about any of the gradients here, since we don't actually want to backpropagate anything.
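A sketch of that setup, assuming the same distilbert-base-uncased checkpoint as above:

```python
inspect_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    output_attentions=True,      # return attention weights per layer
    output_hidden_states=True,   # return hidden states per layer
)
inspect_model.eval()             # turn off dropout and other training-time behavior

inputs = tokenizer("HuggingFace Transformers is great!", return_tensors="pt")
with torch.no_grad():            # we only want to look, not backpropagate
    outputs = inspect_model(**inputs)
```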
So now when we print out the model hidden states, this is a new property in the output dictionary we get, and we can look at what these actually look like. You can look at the hidden state size per layer, which gives a notion of what the shape is at each given layer in our model, as well as the attention size per layer. So this gives you the shape of what you're looking at. And if we look at the model output itself, we get all of these different hidden states; there are tons and tons of them. So the model output is pretty thorough about showing you what the hidden states look like, as well as what the attention weights actually look like.
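A quick sketch of the shapes you get back from the call above:

```python
print(len(outputs.hidden_states))      # embedding output plus one entry per transformer layer
print(outputs.hidden_states[0].shape)  # (batch_size, sequence_length, hidden_size)
print(len(outputs.attentions))         # one entry per layer
print(outputs.attentions[0].shape)     # (batch_size, num_heads, seq_len, seq_len)
```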
So in case you're trying to analyze a particular model, a note on eval mode, and this is true for any PyTorch module or model: here we're not trying to calculate any gradients corresponding to the data we pass in, or trying to update our model in any way; we just care about evaluating it on that particular data point. For that, it's helpful to set the model into eval mode, which disables some of the behavior you'd use during training time.
The question is, it's already pre-trained, so can you go ahead and evaluate it? Yes, this is just the raw pre-trained model with no fine-tuning.
The question is how to interpret the attention size and the hidden state size. You'll probably want to look at the shape shown alongside each one; it corresponds to the layer you're actually looking at. Here we're specifically looking at the first layer, and the last two dimensions correspond to the actual query word and the key word. We'd expect this initial index, which is one here, to be bigger if we printed out all of the examples, but we're just looking at the first one here.
The point is being able to get some notion of how these different representations behave. So if we take this same model input, which again is "HuggingFace Transformers is great!", we're trying to see what these representations look like at each layer that we have in our model. And again, purely from the model output attentions, we can analyze what these representations look like, and in particular what the attention weights are, to understand what your model is actually attending to.
Okay. This plot will just give you some notion of what the weights are; sorry, it's a little cut off and zoomed out. One axis corresponds to the different layers within the model, and the other to the different attention heads present in the model, so at each layer you can get a sense of how the attention distribution is spread over each of the tokens. We're just trying to look at the model attentions that we get. The question is what the color key is: yellow is higher magnitude, a higher value.
So what we can do now is walk through what a fine-tuning task looks like. In a project, you're probably going to want to fine-tune a model, and we'll go ahead and walk through an example of that. What we can also do is use some of the datasets that we can get from Hugging Face and load those in as well.
So here what we're going to be looking at is the IMDB dataset, again for sentiment analysis. We'll just look at only the first 50 tokens or so, using a helper function to truncate the text that we get. Then, for actually making this dataset, we can use the DatasetDict class from Hugging Face, which basically gives us a smaller dataset where we specify what we want for train and what we want for validation. For our mini dataset, for the purpose of this demonstration, we'll make train and val both from the IMDB train split: it'll take the first 128 examples for train and the next 32 for validation, and then we'll truncate those particular inputs, again just to make sure we're efficient.
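Here's a sketch of building that mini dataset; the truncation helper is a stand-in for the notebook's helper function, and keeps roughly the first 50 whitespace-separated tokens.

```python
from datasets import load_dataset, DatasetDict

imdb = load_dataset("imdb")

def truncate(example):
    # Keep roughly the first 50 tokens of each review to keep things fast.
    return {"text": " ".join(example["text"].split()[:50]), "label": example["label"]}

small_imdb = DatasetDict(
    train=imdb["train"].select(range(128)).map(truncate),       # first 128 examples
    val=imdb["train"].select(range(128, 160)).map(truncate),    # the next 32 examples
)
```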
Okay. So next we can see what this looks like. Again, it's basically just a dictionary, a wrapper class that gives you your train dataset and your validation dataset. In particular, we can look at what the first 10 entries look like: we specify train and look at the first 10 entries in our train dataset. The text key gives the first 10 text examples, the actual movie reviews, and the other key you get is the labels corresponding to each of these. Here one is going to be a positive review and zero a negative one, so it makes it really easy to use this for something like sentiment analysis.
Next, we want to prepare the dataset and put it into batches of 16. What we can do is call the map function that this small dataset dictionary has. You call map and pass in a lambda function of what we want to actually do; here the lambda function tokenizes each example that we have. So this is basically saying how we want to pre-process each example, and we're going to do it batched, with a batch size of 16.
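A sketch of that tokenization pass. Padding to a fixed max length is a simplifying choice here (an assumption, not necessarily what the notebook does) so that a plain PyTorch DataLoader can batch shuffled examples later.

```python
tokenized = small_imdb.map(
    lambda example: tokenizer(
        example["text"], padding="max_length", truncation=True, max_length=64
    ),
    batched=True,
    batch_size=16,
)
```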
Okay. So next we're just going to do a little more modification on what the dataset looks like. We're going to remove the column that corresponds to text, and then rename the column label to labels. Again, if we look at this, it was called label; we're just going to call it labels, and we'll remove the text column because we don't really need it anymore, since we've already pre-processed our data into the input IDs that we need. Then we set the format to torch so we can just pass this into our PyTorch model. The labels column again holds the sentiment label for each review.
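A sketch of that clean-up step:

```python
tokenized = tokenized.remove_columns(["text"])          # raw text no longer needed
tokenized = tokenized.rename_column("label", "labels")  # Hugging Face models expect "labels"
tokenized.set_format("torch")                           # return PyTorch tensors when indexing
```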
Okay. So now we'll just see what this looks like. Again, we're going to look at the train set and only the first two entries: the two labels that correspond to each of the reviews, and the input IDs that we get for each of the reviews as well. This is just what you get out of the tokenizer, added back into the dataset.
The question is, we truncated, which makes things easy; what if we hadn't? First, you could either manually set some high truncation limit like we did, or padding can be added based on the longest example. It depends on the size of the dataset you're working with: if you're looking at particular batches at a time, you can just pad within that particular batch, and then you don't need to load the whole dataset into memory at its maximum length.
The next question is how the input IDs get added. We had to manually remove the text column here, but if you recall, the output of the tokenizer is basically just the input IDs and the attention mask, so map is smart enough to aggregate those together into the dataset.
Okay. The last thing we're going to do is put these into data loaders. We have this dataset now and it looks great; we're just going to import a PyTorch DataLoader and load each of the datasets we just made. From there, it's basically exactly the same as what we would do in typical PyTorch: you still compute the loss, you can backpropagate the loss, and everything else. So it's really up to your own design how you do it.
There are only a few asterisks here. One is that you can import specific optimizer types, and you can get a linear schedule for the learning rate that decays it over each training step. But if you look at the structure of this code: we set the number of epochs and however many training steps we actually want to do, we initialize our optimizer and get some learning rate schedule, and then from there it's basically the same thing as what we would do for a typical PyTorch model. We pass in all of the batches from the data loader, which is pretty similar to what we're used to seeing.
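Here's a minimal sketch of that manual fine-tuning loop, assuming the tokenized dataset and two-class model from above; the epoch count and learning rate are illustrative.

```python
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

train_loader = DataLoader(tokenized["train"], batch_size=16, shuffle=True)

num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        outputs = model(**batch)   # batch already has input_ids, attention_mask, labels
        loss = outputs.loss        # computed internally because labels are present
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```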
Awesome. So that will go do its thing. That's one potential option: if you really like PyTorch, you can just go ahead and do that and it's really nice and easy. The alternative is that Hugging Face actually has a Trainer class you can use that handles most of these things. If we do the same thing here, this will actually run once our model is done training; we prepare our dataset in the same way as before. What we need is the import of the TrainingArguments class, which is basically a dictionary of all the settings we want to use when we actually train our model, and then the additional Trainer class, which handles the training magically for us by wrapping around all of that.
It's pretty straightforward to specify how you want to train. TrainingArguments has a number of specifications you can pass through to it: where you want to log things, the batch size for each device (relevant if you're using multiple GPUs), what the batch size is during evaluation, and an evaluation strategy, so here we're evaluating on an epoch level. Again, check the documentation; there are a bunch of different arguments you can give, like warm-up steps, warm-up ratio, and weight decay. Feel free to look at the different arguments you can pass in. This basically mimics the same settings that we used before in our explicit PyTorch training loop.
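A sketch of typical TrainingArguments mirroring the manual loop above; the output directory name and exact values are illustrative assumptions, and the last three settings are included to support the early-stopping callback shown later.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sample_hf_trainer",      # where checkpoints and logs are written
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",         # evaluate at the end of every epoch
    num_train_epochs=3,
    learning_rate=5e-5,
    weight_decay=0.01,
    save_strategy="epoch",               # save a checkpoint every epoch
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```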
Okay, similarly, what we do is pass this into the Trainer, and that will take care of basically everything for us. That whole training loop we wrote before is condensed into this one class. We pass the model, the arguments, the train dataset, the eval dataset, and then some function for computing metrics. Basically, that function takes the predictions given by the trainer, splits them into the actual logits and the labels, and from there calculates any additional metrics we want, like accuracy, F1 score, recall, or whatever you want. So this is an alternative way of formulating that training loop.
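Here's a sketch of wiring up the Trainer with a simple accuracy-only compute_metrics function (the metrics you compute are up to you):

```python
import numpy as np
from transformers import Trainer

def compute_metrics(eval_pred):
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["val"],
    compute_metrics=compute_metrics,
)
```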
Okay, the last thing here is that we can also have callbacks if you want to do things during the training process, say if you want to evaluate your model on the validation set, or just dump some sort of output. Here, this is just a logging callback; it's going to log information about the training process itself, in case you're looking to do any sort of callback during training. The second is early stopping. Early stopping will, as it sounds, stop your training early if the model isn't learning anything and a bunch of epochs are going by, so you can set that so you don't waste compute time. The question is whether there's a good choice for the patience value.
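A sketch of attaching the built-in early-stopping callback; the patience of 3 evaluation rounds is an illustrative choice, and it relies on the load_best_model_at_end and metric_for_best_model settings included in the TrainingArguments above.

```python
from transformers import EarlyStoppingCallback

# Stop training if the monitored metric fails to improve for 3 consecutive evaluations.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
```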
And so the last thing we do is just call trainer.train. If you recall, trainer is just the instantiation of this Trainer class; we call trainer.train, and it'll just go. It gives us a nice estimate of how long things are taking, what's going on, and what arguments we actually passed in, and hopefully it'll train relatively quickly. We can also evaluate the model pretty easily: we just call trainer.predict on whatever dataset we're interested in.
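A sketch of those two calls on the mini validation split from above:

```python
trainer.train()

predictions = trainer.predict(tokenized["val"])
print(predictions.metrics)                           # eval loss plus our accuracy metric
print(np.argmax(predictions.predictions, axis=-1))   # predicted labels
```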
And lastly, if we saved anything to our model checkpoints, the trainer will have continued to save stuff to the folder that we specified. So in case we ever want to load our model again, we use the relative path to our checkpoint; notice how we have some checkpoint-8 here. We just pass in the path to that folder, load the model back in, tokenize, and it's the same thing as we did before.
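A sketch of reloading from a checkpoint folder; the path assumes the illustrative output_dir above, and "checkpoint-8" is just whatever checkpoint number your run produced.

```python
reloaded = AutoModelForSequenceClassification.from_pretrained("sample_hf_trainer/checkpoint-8")

inputs = tokenizer("This movie was great!", return_tensors="pt")
print(reloaded(**inputs).logits)
```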
There are a few additional appendices for how to do different tasks as well: an appendix on generation, how to define a custom dataset, and how to chain different tasks together with the pipeline interface, which lets you use a pre-trained model really easily. There are different types of tasks, like masked language modeling. Feel free to look through those in your own time.