Stanford CS224N NLP with Deep Learning | 2023 | Hugging Face Tutorial, Eric Frankel
Welcome to the CS224N Hugging Face Transformers tutorial. This tutorial is about using the Hugging Face library. It's a really useful and effective way to work with transformer-based models, and you can use those models for your final project, especially a custom final project. It's a really helpful package to learn, and it interfaces very well with PyTorch in particular.
Okay, so first things first: if there's anything else you're looking for, the Hugging Face documentation is really good. They also have lots of tutorials and walkthroughs, as well as other notebooks that you can play around with. So if you're ever wondering about something, the documentation is the place to look.
Okay, so in the Colab, the first thing we're going to do, which I already did but can run again, is install the Transformers Python package. This corresponds to the Hugging Face Transformers and Datasets libraries. Transformers is where we'll get a lot of these pre-trained models, and Datasets gives us helpful datasets that we can use for various tasks, in this case sentiment analysis. We'll also use a small helper function for understanding what encodings are actually happening. So we'll run this to kick things off and import a few more things.
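Here's a minimal sketch of what that setup cell looks like; the exact pinned versions in the notebook may differ.

```python
# Install the two Hugging Face libraries used in this tutorial (run once in Colab).
# !pip install transformers datasets

import torch  # the models we load are ordinary PyTorch modules
```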
Okay, so first, here's the general step-by-step for how to use something off of Hugging Face. The first thing we'll do is find a model on the Hugging Face Hub. Note that there are a ton of different models you can use: there's BERT, there's GPT-2, there's T5-small, and so on. There are a bunch of different pre-trained models, and all of these weights are freely available on Hugging Face, so if there's a particular model you're interested in, you can just look it up. You can also filter by different types of models on the side. So if we wanted to do something like zero-shot classification, there are a couple of models that are specifically good at that particular task. Based on what task you're looking for, there's probably a Hugging Face model for it that's available online for you to download. So that's what we'll do first: find a model in the Hugging Face Hub for whatever you want to do.
And then there are two things that we need next. The first is a tokenizer for splitting your input text into tokens that your model can use. The tokenizer converts the text to vocabulary IDs, the discrete IDs that your model can actually take in, and the model will produce some prediction based on those. So we'll import this AutoTokenizer and this AutoModelForSequenceClassification. What this will do initially is download the key things we need so that we can actually initialize these.
The tokenizer we load is from some pre-trained tokenizer that already exists; in general, there's a corresponding tokenizer for every model, so here it's something like a sentiment RoBERTa tokenizer. The second is that you can import the model for sequence classification, again pre-trained from the model hub. Here that corresponds to a sentiment RoBERTa large English model. If we want, we can even find this over on the hub; it's the large English one. So this is something we can easily find: you just copy the model string from the hub page, and then you can import it.
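A small sketch of that loading step follows. The exact hub id is an assumption based on the "sentiment RoBERTa large English" model mentioned above; in practice you copy the string from the model's hub page.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed hub id for the sentiment RoBERTa large English model discussed above.
model_name = "siebert/sentiment-roberta-large-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)                 # downloads the matching tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)  # downloads the model weights
```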
Okay, we've downloaded all the things that we need, and now we can go ahead and actually use them. This gives you some input, right? This input string, "I'm excited to learn about Hugging Face Transformers!" We'll get some tokenized inputs from the tokenizer, and then lastly we'll get the model output, which is some logits over whatever classification labels we have. We'll walk through what this looks like in just a second, but this is broadly how we can use these together: we'll tokenize some input, and then we'll pass those inputs through the model.
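Here's a minimal sketch of that end-to-end flow, assuming the tokenizer and model loaded above.

```python
# Tokenize a raw string, then run the classifier on it.
inputs = tokenizer("I'm excited to learn about Hugging Face Transformers!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)  # raw scores over the classification labels
```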
So tokenizers are used for pre-processing the inputs. A tokenizer takes some raw string and essentially maps it to numbers or IDs that the model can take in and actually understand. Tokenizers are either specific to the model you want to use, or you can use the AutoTokenizer, which conveniently imports whatever corresponding tokenizer you need for that model type. That's the helpfulness of the AutoTokenizer: it makes that selection for you and makes sure you get the correct one.
So the question is, does it make sure that everything is mapped to the correct index? There's a Python tokenizer, and there's also a "fast" tokenizer. In general, if you use the AutoTokenizer, it will just default to the fast one; the difference is mostly about how quickly you get the model inputs. The other question was whether the tokenizer creates dictionaries of the model inputs. The way to think about a tokenizer is like that dictionary: you want to translate, or have a mapping from, the tokens you get from a string into the inputs that the model can consume. We'll see an example of that in just a second.
So for example, we can call the tokenizer the way we would call a typical PyTorch model, but we're just going to call it on a string. Here our input string is "HuggingFace Transformers is great!" We pass that into the tokenizer almost like it's a function. To answer the earlier question, these are basically the numbers that each of the tokens maps to, along with a corresponding attention mask for the transformer. There are a couple of ways of accessing the actual tokenized input IDs: you can treat the output like a dictionary, hence thinking about it almost as one, or you can access them as a property of the output you get. So there are two ways of accessing this in a pretty Pythonic way.
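A short sketch of that call and the two access styles:

```python
# Call the tokenizer like a function on a raw string.
model_inputs = tokenizer("HuggingFace Transformers is great!")

# Two equivalent, Pythonic ways to get at the token IDs:
print(model_inputs["input_ids"])   # dictionary-style access
print(model_inputs.input_ids)      # attribute-style access
print(model_inputs["attention_mask"])
```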
What we can also do is look at the actual tokenization process step by step, which can give some insight into what happens at each stage. Our initial input string is "HuggingFace Transformers is great!" The next step is that we tokenize this into individual pieces; the output of that tokenization step is these individual split tokens. Those tokens then get mapped to vocabulary IDs, and lastly we'll add any special tokens that our model might need. So there are a couple of steps that happen underneath when you use a tokenizer, all at once.
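Here's a sketch that breaks those steps apart using standard tokenizer methods; the exact subword tokens you see depend on the model's vocabulary.

```python
text = "HuggingFace Transformers is great!"

tokens = tokenizer.tokenize(text)                     # split into subword tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)   # map tokens -> vocabulary IDs
final_ids = tokenizer.build_inputs_with_special_tokens(token_ids)  # add the model's special tokens

print(tokens)
print(token_ids)
print(final_ids)
```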
One thing to note is that fast tokenizers give you some additional options. You have this input string, and you might also have some notion of a special tokens mask. Using char_to_word, for example, will give you the word that a particular character position belongs to. So these are just additional options you can use with the fast tokenizer for understanding how the tokens are being produced.
Okay, so there are different ways of using the outputs of these tokenizers too. If you indicate that you want it to return a tensor, it will give you PyTorch tensors back, which is great in case you need one. You can also pass multiple sentences into the tokenizer at once. Here, for example, the pad token is the padding symbol, and its token ID corresponds to zero for this model. This just adds padding to whatever input you give, so if you need your inputs to be the same length for a particular type of model, it will add those padding tokens and correspondingly give you zeros in the attention mask wherever the padding is. The way to do that here is to basically set padding to true. For any other features of the tokenizer you're interested in, you can check the Hugging Face documentation, which is pretty thorough about what each of these things does.
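A sketch of padding a small batch and getting PyTorch tensors back:

```python
sentences = [
    "HuggingFace Transformers is great!",
    "Tokenizers pad shorter sentences.",
]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

print(batch["input_ids"])       # shorter sentences are padded with the tokenizer's pad token id
print(batch["attention_mask"])  # zeros mark the padding positions
```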
Yeah, so the question is about the "##" prefix, and whether that means there should be a space before the token or not. We probably don't want the space before, since the subword continues the previous piece, as in "Hugging". Generally, though, the output the tokenizers give is still pretty consistent in terms of how the tokenization process works. There might be instances where it's contrary to what you might expect, but in general the tokenization works fine.
Beyond the direct output that you get, you can also decode an entire batch at one time; tokenizers additionally have a method called batch_decode. So if we have the model inputs we got up here, the output of passing in those sentences, we can just pass those input IDs in and get back the decoded strings, including all the padding we added in around each of the words. If you'd rather ignore the presence of the padding tokens and other special tokens, you can also pass in an option for skipping the special tokens.
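A quick sketch of both variants, reusing the batch from above:

```python
# Decode the whole batch back to strings, padding included.
print(tokenizer.batch_decode(batch["input_ids"]))

# Same thing, but dropping padding and other special tokens.
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True))
```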
So that's roughly how you would want to use tokenizers. Now we can talk about how to use the Hugging Face models themselves. Again, this is pretty similar to what we saw for the tokenizer: you just choose the specific model class for your model, or you use the corresponding auto model class, where again the auto model takes care of the selection for you in a pretty easy way. Additionally, the pre-trained transformers we have generally share the same underlying architecture, but you'll have different heads associated with each transformer: task-specific output heads that you might have to train if you're doing sequence classification or some other task.
And so for this, I'll walk through an example of how to do this for sentiment analysis. If there's a specific setting like sequence classification we want to use, we can use the very specific class Hugging Face provides. Alternatively, if we were using DistilBERT in a masked language modeling setting, or if we just want the raw representations that come out of DistilBERT, there are task-specific classes we can use from Hugging Face to initialize those. AutoModel, again, is similar to the AutoTokenizer: it's just going to load that specific model by default, and in this case it gives you just the basic weights that you need for it.
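Here's a sketch of those three variants side by side, assuming the standard distilbert-base-uncased checkpoint.

```python
from transformers import (
    AutoModel,
    DistilBertForSequenceClassification,
    DistilBertForMaskedLM,
)

checkpoint = "distilbert-base-uncased"

cls_model = DistilBertForSequenceClassification.from_pretrained(checkpoint)  # classification head
mlm_model = DistilBertForMaskedLM.from_pretrained(checkpoint)                # masked-LM head
base_model = AutoModel.from_pretrained(checkpoint)                           # just the encoder representations
```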
Okay. So here we have basically three different types of models we can look at. One is an encoder-type model, which is BERT. A decoder-type model, like GPT-2, performs autoregressive generation. And there are encoder-decoder models, so BART or T5 in this case. Again, if you go back to the Hugging Face Hub, there's a whole range of different types of models you could potentially use, and here we can get some notion of the different classes we might want: the AutoTokenizer and different AutoModels for different types of tasks, depending on the specific use case you're looking for. If you use AutoModel.from_pretrained with a BERT checkpoint, you'll just create a model that's an instance of that BERT model. One thing to make sure of is that the particular choice of your model matches up with the type of architecture the task requires, because these different types of models can perform specific tasks. You're not going to be able to load or use BERT, or DistilBERT, as a sequence-to-sequence model, for instance, which requires an encoder and a decoder, because DistilBERT only consists of an encoder. So there's a bit of a limitation on how exactly you can use these, but it's basically based on the model architecture itself.
Okay. Awesome. So let's go ahead and get started here. Similarly, we can import AutoModelForSequenceClassification. Again, we're going to perform a classification task, and we'll import this auto model so that we don't have to reference something like DistilBertForSequenceClassification explicitly; it will be loaded automatically and be all set. Alternatively, we can use DistilBertForSequenceClassification directly, and that specifically requires a DistilBERT checkpoint as the input. So these are two different ways of getting the same model: one using the AutoModel, one using DistilBERT explicitly. In both cases, we need to specify the number of labels, the number of classes we're actually going to classify each input sentence into.
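A sketch of the two equivalent ways to get a two-class DistilBERT classifier:

```python
from transformers import AutoModelForSequenceClassification, DistilBertForSequenceClassification

# Option A: the auto class figures out the architecture from the checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Option B: name the architecture explicitly (requires a DistilBERT checkpoint).
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
```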
Okay. So here we'll get a warning if you're following along and print this out, because some of the sequence classification parameters aren't trained yet; the classification head is newly initialized. So, similarly, we'll walk through how to actually train some of these models.
So the first question is how you actually pass the inputs you get from the tokenizer into the model. Well, if we get some model inputs from the tokenizer up here, we can pass them into the model by specifying that the input IDs are the tokenizer's input IDs, and likewise we can explicitly pass in that the attention mask corresponds to the attention mask the tokenizer gave us. So this is option one, where you specifically identify which property goes to which argument. The second option is a Pythonic trick, where you can directly pass in the model inputs. This will basically unpack the keys of the model inputs: the input IDs correspond to the input_ids argument, and the attention mask corresponds to the attention_mask argument. So when we use this double-star syntax, it unpacks our dictionary and maps each key to the matching keyword argument. This is an alternative way of passing the inputs into the model.
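A sketch of both options. Since the classifier above is a DistilBERT checkpoint, the tokenizer has to match it, so we load the DistilBERT tokenizer here too.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # must match the model checkpoint
model_inputs = tokenizer("HuggingFace Transformers is great!", return_tensors="pt")

# Option 1: name each argument explicitly.
outputs = model(
    input_ids=model_inputs["input_ids"],
    attention_mask=model_inputs["attention_mask"],
)

# Option 2: unpack the whole dictionary so each key maps to the matching keyword argument.
outputs = model(**model_inputs)
print(outputs.logits)
```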
Okay. So now what we can do is print out what the model inputs look like, and then, second, the actual model outputs. Notice that the outputs are given by these logits here. There are two of them: we pass in one example, and there are two potential classes we're trying to classify into, so a softmax over them gives the corresponding distribution over the labels, since this is binary classification. Yes, it's a little bit weird to have two classes for a binary classification task; you could basically just predict one class or not. But we do this because of how Hugging Face models are set up.
An important point is that the models we load in from Hugging Face are basically just PyTorch modules. These are the actual models, and we can use them in the same way we've been using models before. That means things like loss.backward() will actually do the backpropagation step corresponding to the loss on the inputs you pass in. So it's really easy to train these. As long as you have a label for your data, you can calculate your loss using the PyTorch cross-entropy function, get some loss back, and then go ahead and backpropagate it. You can even look at the parameters in the model that would get updated from this; each one is just some big tensor of actual weights.
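Here's a small sketch of treating the model as a plain PyTorch module; the label is made up for the single example above.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([1])                      # a made-up gold label for the single example
outputs = model(**model_inputs)
loss = F.cross_entropy(outputs.logits, labels)  # standard PyTorch loss on the logits
loss.backward()                                 # backpropagates through the whole transformer

# The parameters are ordinary PyTorch parameters; here's one of the big weight tensors
# that would get updated by an optimizer step.
for name, param in list(model.named_parameters())[:1]:
    print(name, param.shape)
```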
Okay. We also have a pretty easy way to do this natively. We have two labels, positive and negative, and we give some corresponding label that we assign to the input. We can see here that the model outputs given by Hugging Face include this loss; it will include the loss corresponding to that input when you pass the labels in. So that's calculating the loss natively in Hugging Face without having to call anything additional from the PyTorch library. And if we have these two labels, what we can do is just take the model outputs, look at the logits, and see which one is the biggest; the argmax gives the index that's largest, and that's the output label the model is actually predicting. So again, it gives a really easy way of doing this sort of classification: getting the loss, and getting what the predicted labels are.
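A sketch of the native route, passing labels straight into the forward call:

```python
# The model computes cross-entropy internally when labels are provided.
outputs = model(**model_inputs, labels=torch.tensor([1]))
print(outputs.loss)

# Predicted label: index of the largest logit.
pred = outputs.logits.argmax(dim=-1)
print(pred)
```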
The last thing is that we can also look inside the model in a pretty cool way and see what attention weights the model actually has. This is helpful if you're trying to understand what's going on inside an NLP model. Here we're importing our model from some pre-trained model weights on the Hugging Face Hub, and we set output_attentions to true and output_hidden_states to true. These are the key arguments we can use to see what's going on inside the model at each point. Again, we'll set the model to eval mode, and lastly we'll tokenize our input string again. We don't really care about any of the gradients here, since we don't actually want to backpropagate anything.
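A sketch of that setup, assuming the same distilbert-base-uncased checkpoint as above:

```python
inspect_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    output_attentions=True,      # return attention weights per layer
    output_hidden_states=True,   # return hidden states per layer
)
inspect_model.eval()             # turn off dropout and other training-time behavior

inputs = tokenizer("HuggingFace Transformers is great!", return_tensors="pt")
with torch.no_grad():            # we only want to look, not backpropagate
    outputs = inspect_model(**inputs)
```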
So now when we print out the model hidden states, this is a new property in the output dictionary we get, and we can look at what these actually look like. You can look at the hidden state size per layer, which gives a notion of what the shape is at each given layer in our model, as well as the attention size per layer. So this gives you the shape of what you're looking at. And if we look at the model output itself, we get all of these different hidden states; there are tons and tons of them. So the model output is pretty thorough about showing you what the hidden states look like, as well as what the attention weights actually look like.
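A quick sketch of the shapes you get back from the call above:

```python
print(len(outputs.hidden_states))      # embedding output plus one entry per transformer layer
print(outputs.hidden_states[0].shape)  # (batch_size, sequence_length, hidden_size)
print(len(outputs.attentions))         # one entry per layer
print(outputs.attentions[0].shape)     # (batch_size, num_heads, seq_len, seq_len)
```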
So in case you're trying to analyze a particular model, a note on eval mode, and this is true for any PyTorch module or model: here we're not trying to calculate any gradients corresponding to the data we pass in, or trying to update our model in any way; we just care about evaluating it on that particular data point. For that, it's helpful to set the model into eval mode, which disables some of the behavior you'd use during training time.
The question is, it's already pre-trained, so can you go ahead and evaluate it? Yes, this is just the raw pre-trained model with no fine-tuning.
The question is how to interpret the attention size and the hidden state size. You'll probably want to look at the shape shown alongside each one; it corresponds to the layer you're actually looking at. Here we're specifically looking at the first layer, and the last two dimensions correspond to the actual query word and the key word. We'd expect this initial index, which is one here, to be bigger if we printed out all of the examples, but we're just looking at the first one here.
The point is being able to get some notion of how these different representations behave. So if we take this same model input, which again is "HuggingFace Transformers is great!", we're trying to see what these representations look like at each layer that we have in our model. And again, purely from the model output attentions, we can analyze what these representations look like, and in particular what the attention weights are, to understand what your model is actually attending to.
Okay. This plot will just give you some notion of what the weights are; sorry, it's a little cut off and zoomed out. One axis corresponds to the different layers within the model, and the other to the different attention heads present in the model, so at each layer you can get a sense of how the attention distribution is spread over each of the tokens. We're just trying to look at the model attentions that we get. The question is what the color key is: yellow is higher magnitude, a higher value.
So what we can do now is walk through what a fine-tuning task looks like. In a project, you're probably going to want to fine-tune a model, and we'll go ahead and walk through an example of that. What we can also do is use some of the datasets that we can get from Hugging Face and load those in as well.
So here what we're going to be looking at is the IMDB dataset, again for sentiment analysis. We'll just look at only the first 50 tokens or so, using a helper function to truncate the text that we get. Then, for actually making this dataset, we can use the DatasetDict class from Hugging Face, which basically gives us a smaller dataset where we specify what we want for train and what we want for validation. For our mini dataset, for the purpose of this demonstration, we'll make train and val both from the IMDB train split: it'll take the first 128 examples for train and the next 32 for validation, and then we'll truncate those particular inputs, again just to make sure we're efficient.
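Here's a sketch of building that mini dataset; the truncation helper is a stand-in for the notebook's helper function, and keeps roughly the first 50 whitespace-separated tokens.

```python
from datasets import load_dataset, DatasetDict

imdb = load_dataset("imdb")

def truncate(example):
    # Keep roughly the first 50 tokens of each review to keep things fast.
    return {"text": " ".join(example["text"].split()[:50]), "label": example["label"]}

small_imdb = DatasetDict(
    train=imdb["train"].select(range(128)).map(truncate),       # first 128 examples
    val=imdb["train"].select(range(128, 160)).map(truncate),    # the next 32 examples
)
```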
Okay. So next we can see what this looks like. Again, it's basically just a dictionary, a wrapper class that gives you your train dataset and your validation dataset. In particular, we can look at what the first 10 entries look like: we specify train and look at the first 10 entries in our train dataset. The text key gives the first 10 text examples, the actual movie reviews, and the other key you get is the labels corresponding to each of these. Here one is going to be a positive review and zero a negative one, so it makes it really easy to use this for something like sentiment analysis.
Next, we want to prepare the dataset and put it into batches of 16. What we can do is call the map function that this small dataset dictionary has. You call map and pass in a lambda function of what we want to actually do; here the lambda function tokenizes each example that we have. So this is basically saying how we want to pre-process each example, and we're going to do it batched, with a batch size of 16.
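A sketch of that tokenization pass. Padding to a fixed max length is a simplifying choice here (an assumption, not necessarily what the notebook does) so that a plain PyTorch DataLoader can batch shuffled examples later.

```python
tokenized = small_imdb.map(
    lambda example: tokenizer(
        example["text"], padding="max_length", truncation=True, max_length=64
    ),
    batched=True,
    batch_size=16,
)
```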
Okay. So next we're just going to do a little more modification on what the dataset looks like. We're going to remove the column that corresponds to text, and then rename the column label to labels. Again, if we look at this, it was called label; we're just going to call it labels, and we'll remove the text column because we don't really need it anymore, since we've already pre-processed our data into the input IDs that we need. Then we set the format to torch so we can just pass this into our PyTorch model. The labels column again holds the sentiment label for each review.
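A sketch of that clean-up step:

```python
tokenized = tokenized.remove_columns(["text"])          # raw text no longer needed
tokenized = tokenized.rename_column("label", "labels")  # Hugging Face models expect "labels"
tokenized.set_format("torch")                           # return PyTorch tensors when indexing
```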
Okay. So now we'll just see what this looks like. Again, we're going to look at the train set and only the first two entries: the two labels that correspond to each of the reviews, and the input IDs that we get for each of the reviews as well. This is just what you get out of the tokenizer, added back into the dataset.
The question is, we truncated, which makes things easy; what if we hadn't? First, you could either manually set some high truncation limit like we did, or padding can be added based on the longest example. It depends on the size of the dataset you're working with: if you're looking at particular batches at a time, you can just pad within that particular batch, and then you don't need to load the whole dataset into memory at its maximum length.
The next question is how the input IDs get added. We had to manually remove the text column here, but if you recall, the output of the tokenizer is basically just the input IDs and the attention mask, so map is smart enough to aggregate those together into the dataset.
Okay. The last thing we're going to do is put these into data loaders. We have this dataset now and it looks great; we're just going to import a PyTorch DataLoader and load each of the datasets we just made. From there, it's basically exactly the same as what we would do in typical PyTorch: you still compute the loss, you can backpropagate the loss, and everything else. So it's really up to your own design how you do it.
There are only a few asterisks here. One is that you can import specific optimizer types, and you can get a linear schedule for the learning rate that decays it over each training step. But if you look at the structure of this code: we set the number of epochs and however many training steps we actually want to do, we initialize our optimizer and get some learning rate schedule, and then from there it's basically the same thing as what we would do for a typical PyTorch model. We pass in all of the batches from the data loader, which is pretty similar to what we're used to seeing.
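Here's a minimal sketch of that manual fine-tuning loop, assuming the tokenized dataset and two-class model from above; the epoch count and learning rate are illustrative.

```python
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

train_loader = DataLoader(tokenized["train"], batch_size=16, shuffle=True)

num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
optimizer = AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        outputs = model(**batch)   # batch already has input_ids, attention_mask, labels
        loss = outputs.loss        # computed internally because labels are present
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```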
Awesome. So that will go do its thing. That's one potential option: if you really like PyTorch, you can just go ahead and do that and it's really nice and easy. The alternative is that Hugging Face actually has a Trainer class you can use that handles most of these things. If we do the same thing here, this will actually run once our model is done training; we prepare our dataset in the same way as before. What we need is the import of the TrainingArguments class, which is basically a dictionary of all the settings we want to use when we actually train our model, and then the additional Trainer class, which handles the training magically for us by wrapping around all of that.
It's pretty straightforward to specify how you want to train. TrainingArguments has a number of specifications you can pass through to it: where you want to log things, the batch size for each device (relevant if you're using multiple GPUs), what the batch size is during evaluation, and an evaluation strategy, so here we're evaluating on an epoch level. Again, check the documentation; there are a bunch of different arguments you can give, like warm-up steps, warm-up ratio, and weight decay. Feel free to look at the different arguments you can pass in. This basically mimics the same settings that we used before in our explicit PyTorch training loop.
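A sketch of typical TrainingArguments mirroring the manual loop above; the output directory name and exact values are illustrative assumptions, and the last three settings are included to support the early-stopping callback shown later.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sample_hf_trainer",      # where checkpoints and logs are written
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",         # evaluate at the end of every epoch
    num_train_epochs=3,
    learning_rate=5e-5,
    weight_decay=0.01,
    save_strategy="epoch",               # save a checkpoint every epoch
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```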
Okay, similarly, what we do is pass this into the Trainer, and that will take care of basically everything for us. That whole training loop we wrote before is condensed into this one class. We pass the model, the arguments, the train dataset, the eval dataset, and then some function for computing metrics. Basically, that function takes the predictions given by the trainer, splits them into the actual logits and the labels, and from there calculates any additional metrics we want, like accuracy, F1 score, recall, or whatever you want. So this is an alternative way of formulating that training loop.
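Here's a sketch of wiring up the Trainer with a simple accuracy-only compute_metrics function (the metrics you compute are up to you):

```python
import numpy as np
from transformers import Trainer

def compute_metrics(eval_pred):
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["val"],
    compute_metrics=compute_metrics,
)
```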
Okay, the last thing here is that we can also have callbacks if you want to do things during the training process, say if you want to evaluate your model on the validation set, or just dump some sort of output. Here, this is just a logging callback; it's going to log information about the training process itself, in case you're looking to do any sort of callback during training. The second is early stopping. Early stopping will, as it sounds, stop your training early if the model isn't learning anything and a bunch of epochs are going by, so you can set that so you don't waste compute time. The question is whether there's a good choice for the patience value.
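A sketch of attaching the built-in early-stopping callback; the patience of 3 evaluation rounds is an illustrative choice, and it relies on the load_best_model_at_end and metric_for_best_model settings included in the TrainingArguments above.

```python
from transformers import EarlyStoppingCallback

# Stop training if the monitored metric fails to improve for 3 consecutive evaluations.
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
```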
And so the last thing we do is just call trainer.train. If you recall, trainer is just the instantiation of this Trainer class; we call trainer.train, and it'll just go. It gives us a nice estimate of how long things are taking, what's going on, and what arguments we actually passed in, and hopefully it'll train relatively quickly. We can also evaluate the model pretty easily: we just call trainer.predict on whatever dataset we're interested in.
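A sketch of those two calls on the mini validation split from above:

```python
trainer.train()

predictions = trainer.predict(tokenized["val"])
print(predictions.metrics)                           # eval loss plus our accuracy metric
print(np.argmax(predictions.predictions, axis=-1))   # predicted labels
```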
And lastly, if we saved anything to our model checkpoints, the trainer will have continued to save stuff to the folder that we specified. So in case we ever want to load our model again, we use the relative path to our checkpoint; notice how we have some checkpoint-8 here. We just pass in the path to that folder, load the model back in, tokenize, and it's the same thing as we did before.
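A sketch of reloading from a checkpoint folder; the path assumes the illustrative output_dir above, and "checkpoint-8" is just whatever checkpoint number your run produced.

```python
reloaded = AutoModelForSequenceClassification.from_pretrained("sample_hf_trainer/checkpoint-8")

inputs = tokenizer("This movie was great!", return_tensors="pt")
print(reloaded(**inputs).logits)
```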
There are a few additional appendices for how to do different tasks as well: an appendix on generation, how to define a custom dataset, and how to chain different tasks together with the pipeline interface, which lets you use a pre-trained model really easily. There are different types of tasks, like masked language modeling. Feel free to look through those in your own time.