
Stanford CS224N NLP with Deep Learning | 2023 | Hugging Face Tutorial, Eric Frankel


Whisper Transcript | Transcript Only Page

00:00:00.000 | [PAUSE]
00:00:05.000 | Hi everyone.
00:00:06.000 | Welcome to the 224n Hugging Face Transformers tutorial.
00:00:11.000 | So this tutorial is just going to be about using the Hugging Face library.
00:00:12.000 | It's a really useful and effective way to use some
00:00:22.000 | off-the-shelf NLP models,
00:00:24.000 | specifically transformer-based models,
00:00:28.000 | and to use those for your final project,
00:00:33.000 | your custom final project or something like that,
00:00:35.000 | or just to use in the future.
00:00:37.000 | So it's a really helpful package to learn,
00:00:41.000 | and it interfaces really well with PyTorch in particular too.
00:00:46.000 | Okay, so first things first: in case there's anything else that you are
00:00:51.000 | missing from this tutorial,
00:00:53.000 | the Hugging Face documentation is really good.
00:00:56.000 | They also have lots of kind of tutorials and walkthroughs as well as other kind
00:01:01.000 | of like notebooks that you can play around with as well.
00:01:04.000 | So if you're ever wondering about something else,
00:01:06.000 | that's a really good place to look.
00:01:08.000 | Okay, so in the Colab, the first thing we're going to do that I already did
00:01:12.000 | but can maybe run again is just installing the Transformers Python package
00:01:17.000 | and then the Datasets Python package.
00:01:20.000 | So this corresponds to the Hugging Face Transformers and Datasets.
00:01:25.000 | And so those are really helpful.
00:01:27.000 | The Transformers is where we'll get a lot of these kind of pre-trained models
00:01:31.000 | from, and the Datasets will give us some helpful Datasets that we can
00:01:35.000 | potentially use for various tasks, so in this case, sentiment analysis.
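For reference, the install cell looks something like this:

```python
# Notebook cell: install the two Hugging Face packages used below.
!pip install transformers datasets
```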
00:01:41.000 | Okay, and so we'll use a bit of a helper function to help us
00:01:45.000 | understand what encodings are actually being produced as well.
00:01:51.000 | So we'll run this just to kind of kick things off and import a few more things.
00:01:57.000 | Okay, so first, this is generally the step-by-step
00:02:03.000 | for how to use something off of Hugging Face.
00:02:06.000 | So first what we'll do is we'll find some model from like the Hugging Face hub here.
00:02:13.000 | And note that there's like a ton of different models that you're able to use.
00:02:17.000 | There's BERT, there's GPT-2, there's T5-small,
00:02:21.000 | which is another language model from Google.
00:02:23.000 | So there are a bunch of these different models that are pre-trained,
00:02:28.000 | and all of these weights are up here in Hugging Face that are freely available
00:02:32.000 | for you guys to download.
00:02:34.000 | So if there's a particular model you're interested in,
00:02:36.000 | you can probably find a version of it here.
00:02:39.000 | You can also filter the different types of models on the side as well
00:02:43.000 | for a specific task.
00:02:45.000 | So if we wanted to do something like zero-shot classification,
00:02:50.000 | there are a couple of models that are specifically good at doing that particular task.
00:02:55.000 | Okay, so based off of what task you're looking for,
00:02:57.000 | there's probably a Hugging Face model for it that's available online for you to download.
00:03:02.000 | Okay, so that's what we'll do first is we'll go ahead and find a model
00:03:08.000 | in the Hugging Face hub, and then, you know, whatever you want to do.
00:03:12.000 | In this case, we'll do sentiment analysis.
00:03:14.000 | And then there are two things that we need next.
00:03:16.000 | The first is a tokenizer for actually, you know,
00:03:19.000 | splitting your input text into tokens that your model can use,
00:03:24.000 | and the actual model itself.
00:03:27.000 | And so the tokenizer, again, kind of converts this to some vocabulary IDs,
00:03:32.000 | these discrete IDs that your model can actually take in,
00:03:36.000 | and the model will produce some prediction based off of that.
00:03:39.000 | Okay, so first what we can do is, again,
00:03:44.000 | import this auto-tokenizer and this auto-model for sequence classification.
00:03:52.000 | So what this will do initially is download some of the, you know,
00:03:55.000 | key things that we need so that we can actually initialize these.
00:03:59.000 | So what do each of these do?
00:04:01.000 | So first the tokenizer, this auto-tokenizer,
00:04:05.000 | is from some pre-trained tokenizer that has already been used.
00:04:10.000 | So in general, there's a corresponding tokenizer for every model
00:04:13.000 | that you want to try and use.
00:04:15.000 | In this case, it's SiEBERT,
00:04:17.000 | a sentiment model based on RoBERTa.
00:04:20.000 | And then the second is you can import this model for sequence classification
00:04:25.000 | as well from something pre-trained on the model hub again.
00:04:29.000 | So again, this corresponds to sentiment-roberta-large-english.
00:04:33.000 | And if we want, we can even find this over here.
00:04:37.000 | We can find it as, I think English, yeah, large English.
00:04:46.000 | So again, this is something we can easily find.
00:04:48.000 | You just copy this string up here, and then you can import that.
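A sketch of that loading step, assuming the siebert/sentiment-roberta-large-english checkpoint mentioned above:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint string copied from the model page on the Hugging Face Hub.
model_name = "siebert/sentiment-roberta-large-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)                    # matching tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)  # weights + classification head
```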
00:04:53.000 | Okay, we've downloaded all of the things that we need,
00:04:57.000 | including some binary weight files.
00:04:59.000 | And then now we can go ahead and actually, you know,
00:05:02.000 | use some of these inputs, right?
00:05:04.000 | So this gives you some set of an input, right?
00:05:07.000 | This input string, "I'm excited to learn about Hugging Face Transformers."
00:05:11.000 | We'll get some tokenized inputs here
00:05:18.000 | after we pass it through the tokenizer.
00:05:21.000 | And then lastly, we'll get some notion of the model output that we get.
00:05:25.000 | So these are some logits over whatever classes we have,
00:05:30.000 | so in this case, good or bad.
00:05:32.000 | And then some corresponding prediction.
00:05:35.000 | Okay, and we'll walk through what this kind of looks like in just a second as well,
00:05:39.000 | a little more depth.
00:05:40.000 | But this is broadly kind of like how we can actually use these together.
00:05:44.000 | We'll tokenize some input, and then we'll pass these inputs through the model.
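Putting those two steps together, a minimal end-to-end sketch (using the tokenizer and model loaded above) might look like:

```python
import torch

inputs = tokenizer("I'm excited to learn about Hugging Face Transformers!",
                   return_tensors="pt")        # token IDs + attention mask
with torch.no_grad():
    outputs = model(**inputs)                  # forward pass through the model
probs = torch.softmax(outputs.logits, dim=-1)  # distribution over the classes
prediction = probs.argmax(dim=-1)              # index of the predicted class
```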
00:05:48.000 | So we'll talk about tokenizers first.
00:05:50.000 | So tokenizers are used for basically just pre-processing the inputs
00:05:56.000 | that you get for any model.
00:06:02.000 | And it takes some raw string and essentially maps it to some number
00:06:03.000 | or ID that the model can take in and actually understand.
00:06:08.000 | So tokenizers are either specific to the model that you want to use,
00:06:14.000 | or you can use the auto-tokenizer, which will conveniently import
00:06:19.000 | whatever corresponding tokenizer you need for that model type.
00:06:24.000 | So that's kind of like the helpfulness of the auto-tokenizer.
00:06:28.000 | It'll kind of make that selection for you and make sure that you get the correct
00:06:32.000 | tokenizer for whatever model you're using.
00:06:35.000 | So the question is, does it make sure that everything is mapped to the correct index
00:06:39.000 | that the model is trained on?
00:06:40.000 | The answer is yes.
00:06:41.000 | So that's why the auto-tokenizer is helpful.
00:06:44.000 | So there are two types of tokenizers.
00:06:48.000 | There's a Python tokenizer, and there's also a "fast" tokenizer.
00:06:55.000 | The fast tokenizer is written in Rust.
00:06:57.000 | In general, if you use the auto-tokenizer, it'll just default to the fast one.
00:07:02.000 | There's not really a huge difference here.
00:07:04.000 | It's just about kind of like the inference time for getting the model outputs.
00:07:09.000 | Yeah, so the question was whether the tokenizer creates dictionaries of the model inputs.
00:07:15.000 | I think the way to think about a tokenizer is like that dictionary,
00:07:24.000 | almost, right?
00:07:25.000 | So you want to kind of translate almost or have this mapping from the tokens that
00:07:30.000 | you can get from like this string and then map that into kind of some inputs that
00:07:35.000 | the model will actually use.
00:07:37.000 | So we'll see an example of that in just a second.
00:07:40.000 | So for example, we can kind of call the tokenizer in any way that we would for
00:07:45.000 | like a typical PyTorch model, but we're just going to call it on like a string.
00:07:49.000 | So here our input string is "HuggingFace Transformers is great!"
00:07:54.000 | We pass that into the tokenizer almost like it's like a function, right?
00:07:58.000 | And then we'll get out some tokenization.
00:08:00.000 | So this gives us a set of input IDs.
00:08:03.000 | So to answer the earlier question, these are basically the numbers that each of
00:08:08.000 | these tokens represent, right?
00:08:10.000 | So that the model can actually use them.
00:08:13.000 | And then a corresponding attention mask for the particular transformer.
00:08:19.000 | Okay.
00:08:21.000 | So there are a couple ways of accessing the actual tokenized input IDs.
00:08:27.000 | You can treat it like a dictionary, so hence kind of thinking about it almost as
00:08:31.000 | that dictionary form.
00:08:32.000 | It's also just like a property of the output that you get.
00:08:36.000 | So there are two ways of accessing this in like a pretty Pythonic way.
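Both access patterns, sketched (assuming a BERT-family tokenizer for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
encoding = tokenizer("HuggingFace Transformers is great!")
print(encoding["input_ids"])       # dictionary-style access
print(encoding.input_ids)          # attribute-style access, same values
print(encoding["attention_mask"])  # all ones here, since nothing is padded
```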
00:08:39.000 | Okay.
00:08:43.000 | So what we can see as well is that we can look at
00:08:48.000 | the actual tokenization process step by step.
00:08:52.000 | And so this can maybe give some insight into what happens at each step.
00:08:56.000 | Right, so our initial input string is going to be
00:09:00.000 | "HuggingFace Transformers is great!"
00:09:01.000 | Okay, the next step is that we actually want to tokenize
00:09:07.000 | the individual words that are passed in.
00:09:10.000 | So here, this is the kind of output of this tokenization step.
00:09:15.000 | Right, we get kind of these individual split tokens.
00:09:19.000 | We'll convert them to IDs here.
00:09:22.000 | And then we'll add any special tokens that our model might need for actually
00:09:28.000 | performing inference on this.
00:09:33.000 | So there are a couple of steps that happen underneath
00:09:38.000 | when you use a tokenizer; a few things happen at once.
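Those underlying steps can be reproduced by hand, continuing with the tokenizer from the sketch above; roughly:

```python
text = "HuggingFace Transformers is great!"
tokens = tokenizer.tokenize(text)              # split into subword tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # map tokens to vocabulary IDs
ids_with_special = tokenizer.build_inputs_with_special_tokens(ids)  # add [CLS]/[SEP]-style tokens
print(tokens, ids, ids_with_special, sep="\n")
```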
00:09:44.000 | One thing to note is that for fast tokenizers as well,
00:09:48.000 | there is another option that you're able to get to.
00:09:52.000 | So you have essentially, right, you have this input string.
00:09:57.000 | You have the number of tokens that you get.
00:09:59.000 | And you might have some notion of like the special token mask as well.
00:10:04.000 | Right, so using char_to_word is going to give you the index of the word
00:10:09.000 | that a particular character in the input belongs to.
00:10:11.000 | So here, this is just giving you additional options that you can use with
00:10:15.000 | the fast tokenizer for understanding how the tokens are derived
00:10:18.000 | from the input string.
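For instance, a small sketch of char_to_word on a fast tokenizer's output:

```python
encoding = tokenizer("HuggingFace Transformers is great!")
# Map a character position in the raw string to the index of the word it belongs to.
print(encoding.char_to_word(3))  # character 3 sits inside the first word, so this prints 0
```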
00:10:26.000 | Okay, so there are different ways of using the outputs of these tokenizers too.
00:10:32.000 | So one is that you can pass this in.
00:10:35.000 | And if you indicate that you want it to return a tensor,
00:10:39.000 | you can also return a PyTorch tensor.
00:10:42.000 | So that's great in case you need a PyTorch tensor,
00:10:47.000 | which you probably generally want.
00:10:49.000 | You can also pass multiple strings into the tokenizer and
00:10:53.000 | then pad them however you need.
00:10:56.000 | So for here, for example, the pad token is this
00:11:03.000 | [PAD] bracket,
00:11:05.000 | and its token ID corresponds to zero.
00:11:09.000 | So this is just going to add padding to whatever input that you give.
00:11:12.000 | So if you need your outputs to be the same length for
00:11:17.000 | a particular type of model, right, this will add those padding tokens and
00:11:21.000 | then correspondingly gives you like the zeros in the attention mask where you
00:11:25.000 | actually need it.
00:11:28.000 | Okay, and so the way to do that here is you basically set padding to be true,
00:11:34.000 | and also set truncation to be true as well.
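A sketch of batching with padding, truncation, and PyTorch tensors:

```python
sentences = ["HuggingFace Transformers is great!",
             "It interfaces really well with PyTorch."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)  # both rows padded to the same length
print(batch["attention_mask"])   # zeros mark the padded positions
```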
00:11:37.000 | And if there are ever
00:11:41.000 | any other features of the tokenizer that you're interested in,
00:11:46.000 | again, you can check the Hugging Face documentation,
00:11:49.000 | which is pretty thorough for what each of these things do.
00:11:52.000 | Yeah, so the question is about the ## prefix on some tokens,
00:11:58.000 | and whether that means that we should have a space before or not.
00:12:03.000 | So here in this case,
00:12:08.000 | we probably don't want the space before, right,
00:12:11.000 | because the ## marks a subword piece that continues the previous token,
00:12:16.000 | and "hugging" is all one word in this case.
00:12:20.000 | Generally, the output
00:12:24.000 | that the tokenizers give is still pretty consistent, though,
00:12:29.000 | in terms of how the tokenization process works.
00:12:32.000 | So there might be
00:12:34.000 | instances where it's contrary to what you might expect for
00:12:38.000 | how something is tokenized.
00:12:41.000 | In general, the tokenization generally works fine.
00:12:45.000 | So in most cases,
00:12:47.000 | kind of like the direct output that you get from
00:12:49.000 | the hugging face tokenizer is sufficient.
00:12:57.000 | Okay, awesome. So one last thing,
00:13:00.000 | beyond adding additional padding,
00:13:03.000 | is that you can also decode an entire batch at one given time.
00:13:10.000 | So if we look again,
00:13:13.000 | our tokenizer will
00:13:16.000 | additionally have this method called batch_decode.
00:13:19.000 | So if we have like the model inputs that we get up here,
00:13:23.000 | this is the output of passing these sentences or
00:13:26.000 | these strings into the tokenizer.
00:13:29.000 | We can go ahead and just pass the input IDs that
00:13:34.000 | correspond to that into batch_decode,
00:13:37.000 | and it'll give us the decoding
00:13:39.000 | that corresponds to each of the particular words and strings,
00:13:43.000 | including all the padding we added in.
00:13:48.000 | And if you want to
00:13:50.000 | ignore the presence of these padding tokens or anything like that,
00:13:55.000 | you can also pass in skip_special_tokens=True.
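A sketch of batch_decode on the padded batch from above:

```python
print(tokenizer.batch_decode(batch["input_ids"]))
# -> strings with the padding/special tokens still visible

print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True))
# -> just the original text back
```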
00:13:59.000 | Gotcha. So this gives
00:14:01.000 | a pretty high-level overview of
00:14:04.000 | how you would want to use tokenizers
00:14:07.000 | in HuggingFace.
00:14:10.000 | So now we can talk about maybe how to use the HuggingFace models themselves.
00:14:16.000 | So again, this is pretty similar to what we saw for
00:14:21.000 | initializing a tokenizer.
00:14:23.000 | You just choose the specific model type for your model,
00:14:28.000 | and then you can use that or the specific kind of auto model class.
00:14:33.000 | Where again, this auto model takes care of
00:14:37.000 | the initialization process
00:14:40.000 | for you in a pretty easy way,
00:14:42.000 | without too much overhead.
00:14:45.000 | So additionally, for the pre-trained transformers that we have,
00:14:52.000 | they generally have the same underlying architecture,
00:14:54.000 | but you'll have different task-specific heads associated with each transformer.
00:15:00.000 | So output heads that you might have to train
00:15:02.000 | if you're doing some sequence classification or just some other task.
00:15:06.000 | So HuggingFace will do this for you.
00:15:09.000 | And so for this, I will walk through an example of how to do this for sentiment analysis.
00:15:16.000 | So if there's a specific context like sequence classification we want to use,
00:15:21.000 | we can use the very specific class HuggingFace provides,
00:15:28.000 | so DistilBertForSequenceClassification.
00:15:31.000 | Alternatively, if we were using DistilBERT in a masked language model setting,
00:15:37.000 | we'd use DistilBertForMaskedLM.
00:15:40.000 | And then lastly, if we're just using it purely for the representations that we get out of DistilBERT,
00:15:45.000 | we just use the baseline model, DistilBertModel.
00:15:47.000 | So the key thing here, or key takeaway,
00:15:50.000 | is that there are some task specific classes that we can use from HuggingFace to initialize.
00:15:56.000 | So AutoModel again is similar to kind of like the AutoTokenizer.
00:16:01.000 | So for this, it's just going to load that specific model by default.
00:16:08.000 | And in this case, that's just the basic pre-trained weights, without any task head.
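A sketch of the DistilBERT variants just mentioned, assuming the standard distilbert-base-uncased checkpoint:

```python
from transformers import (AutoModel, DistilBertForMaskedLM, DistilBertModel,
                          DistilBertForSequenceClassification)

ckpt = "distilbert-base-uncased"
clf = DistilBertForSequenceClassification.from_pretrained(ckpt, num_labels=2)  # adds a classification head
mlm = DistilBertForMaskedLM.from_pretrained(ckpt)   # adds a masked-LM head
base = DistilBertModel.from_pretrained(ckpt)        # bare encoder, representations only
auto = AutoModel.from_pretrained(ckpt)              # AutoModel resolves to the same bare encoder
```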
00:16:18.000 | Okay. So here, we'll have basically three different types of models that we can look at.
00:16:24.000 | One is like an encoder type model, which is BERT.
00:16:27.000 | A decoder-type model, like GPT-2, that's
00:16:35.000 | potentially generating some text.
00:16:37.000 | And encoder-decoder models, so BART or T5 in this case.
00:16:40.000 | So again, if you go back to kind of the HuggingFace hub,
00:16:45.000 | there's a whole sort of different types of models that you could potentially use.
00:16:51.000 | And if we look in the documentation as well,
00:16:54.000 | so here we can understand some notion of like the different types of classes that we might want to use.
00:17:01.000 | Right. So there's some notion of like the AutoTokenizer,
00:17:05.000 | different AutoModels for different types of tasks.
00:17:08.000 | So here, again, if you have any kind of like specific use cases that you're looking for,
00:17:14.000 | then you can check the documentation.
00:17:16.000 | Here, again, if you use like an AutoModel from like pre-trained,
00:17:21.000 | you'll just create a model that's an instance of that BERT model.
00:17:25.000 | In this case, a BertModel for bert-base-cased.
00:17:31.000 | Okay. Let's, we can go ahead and start.
00:17:35.000 | One last thing to note is that like again,
00:17:38.000 | the particular choice of your model matches up with kind of the type of architecture that you have to use.
00:17:44.000 | Right. So there are different,
00:17:46.000 | these different types of models can perform specific tasks.
00:17:50.000 | So you're not going to be able to kind of load or use BERT for instance,
00:17:55.000 | or DistilBERT as like a sequence to sequence model for instance,
00:17:59.000 | which requires the encoder and decoder because DistilBERT only consists of an encoder.
00:18:06.000 | So there's a bit of like a limitation on how you can exactly use these,
00:18:10.000 | but it's basically based on like the model architecture itself.
00:18:16.000 | Okay. Awesome. So let's go ahead and get started here.
00:18:21.000 | So similarly here, we can import so AutoModel for sequence classification.
00:18:28.000 | So again, this is, we're going to perform some classification task,
00:18:31.000 | and we'll import this AutoModel here so that we don't have to reference again,
00:18:36.000 | like something like DistilBERT for sequence classification.
00:18:39.000 | We'll be able to load it automatically and it'll be all set.
00:18:43.000 | Alternatively, we can do DistilBERT for sequence classification here,
00:18:47.000 | and that specifically will require DistilBERT to be the input there.
00:18:52.000 | Okay. So these are two different ways of basically getting the same model here.
00:18:56.000 | One using the AutoModel, one using just explicitly DistilBERT.
00:19:01.000 | Cool. And here, because it's classification,
00:19:05.000 | we need to specify the number of labels or the number of classes that we're
00:19:09.000 | actually going to classify for each of the input sentences.
00:19:13.000 | Okay. So here, we'll get some like a warning here, right?
00:19:18.000 | If you are following along and you print this out,
00:19:21.000 | because some of the sequence classification parameters aren't trained yet,
00:19:26.000 | and so we'll go ahead and take care of that.
00:19:29.000 | So here, similarly, we'll walk through
00:19:34.000 | how to actually train some of these models.
00:19:38.000 | So the first is how do you actually pass any of the inputs that you get from
00:19:42.000 | a tokenizer into the model?
00:19:44.000 | Okay. Well, if we get some model inputs from the tokenizer up here,
00:19:50.000 | and we pass this into the model by specifying that the input IDs are
00:19:57.000 | input IDs from the model inputs.
00:19:59.000 | And likewise, we can
00:20:04.000 | explicitly pass in that the attention mask is going to
00:20:08.000 | correspond to the attention mask that we got from
00:20:11.000 | these outputs of the tokenizer. Okay.
00:20:14.000 | So this is option one where you can specifically identify which property goes to what.
00:20:20.000 | The second option is using kind of a Pythonic hack almost,
00:20:27.000 | which is where you can directly pass in the model inputs.
00:20:31.000 | And so this will basically unpack almost the keys of like the model inputs here.
00:20:38.000 | So the model input keys,
00:20:40.000 | so the input IDs correspond to this.
00:20:43.000 | The attention mask corresponds to the attention mask argument.
00:20:47.000 | So when we use this star star kind of syntax,
00:20:51.000 | this will go ahead and unpack our dictionary and basically map
00:20:54.000 | the arguments to something of the same key.
00:20:56.000 | So this is an alternative way of passing it into the model.
00:21:00.000 | Both are going to be the same.
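Both options, sketched (continuing with the DistilBERT classifier and its matching tokenizer from above):

```python
model_inputs = tokenizer("HuggingFace Transformers is great!", return_tensors="pt")

# Option 1: name each argument explicitly.
out1 = model(input_ids=model_inputs["input_ids"],
             attention_mask=model_inputs["attention_mask"])

# Option 2: ** unpacks the dict, matching each key to the keyword argument of the same name.
out2 = model(**model_inputs)  # identical to option 1
```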
00:21:03.000 | Okay. So now what we can do is we can actually print out what the model outputs look like.
00:21:10.000 | So again, these are the inputs,
00:21:12.000 | the token IDs and the attention mask.
00:21:15.000 | And then second, we'll get the actual model outputs.
00:21:19.000 | So here, notice that the outputs are given by these logits here.
00:21:25.000 | There's two of them. We pass in one example and
00:21:27.000 | there's kind of two potential classes that we're trying to classify.
00:21:30.000 | Okay. And then lastly, we have, of course,
00:21:33.000 | the corresponding distribution over the labels here, right?
00:21:37.000 | Since this is going to be binary classification.
00:21:40.000 | Yes, it's like a little bit weird that you're going to have like
00:21:43.000 | the two classes for the binary classification task.
00:21:46.000 | And you could basically just choose to classify one class or not.
00:21:50.000 | But we do this just basically because of how HuggingFace models are set up.
00:21:56.000 | And so additionally, you know,
00:22:00.000 | the models that we load in from HuggingFace are basically just PyTorch modules.
00:22:07.000 | So like these are the actual models and we can use them in the same way that we've been using models before.
00:22:13.000 | So that means things like loss.backward or something like that,
00:22:16.000 | actually will do this back propagation step corresponding to the loss of like your inputs that you pass in.
00:22:24.000 | So it's really easy to train these guys.
00:22:28.000 | As long as you have like a label, you know, label for your data,
00:22:31.000 | you can calculate your loss using, you know, the PyTorch cross entropy function.
00:22:37.000 | You get some loss back and then you can go ahead and back propagate it.
00:22:42.000 | You can even look at the parameters
00:22:46.000 | in the model that would probably get updated from this.
00:22:50.000 | So this is just some big tensor of the actual
00:22:53.000 | embedding weights that you have.
00:22:56.000 | Okay. We also have a pretty easy way
00:23:00.000 | for HuggingFace itself
00:23:03.000 | to calculate the loss that we get.
00:23:05.000 | So again, if we tokenize some input string,
00:23:08.000 | we get our model inputs.
00:23:09.000 | We have two labels, positive and negative,
00:23:13.000 | and then we give some corresponding label that we assign to
00:23:17.000 | the model inputs, and we pass this in.
00:23:20.000 | We can see here that the actual model outputs that are given by HuggingFace include this loss here.
00:23:28.000 | Right. So it'll include the loss corresponding to that input anyways.
00:23:32.000 | So it's a really easy way of
00:23:35.000 | calculating the loss natively in HuggingFace, without having to call any additional things from the PyTorch library.
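A sketch of both ways of getting a loss, assuming a single binary label for the example:

```python
import torch
import torch.nn.functional as F

label = torch.tensor([1])  # e.g. 1 = positive

# Option 1: compute the loss yourself in PyTorch.
outputs = model(**model_inputs)
loss = F.cross_entropy(outputs.logits, label)
loss.backward()  # backpropagates like any PyTorch module

# Option 2: pass labels in and let HuggingFace compute it.
outputs = model(**model_inputs, labels=label)
print(outputs.loss)                   # same cross-entropy loss, computed natively
print(outputs.logits.argmax(dim=-1))  # index of the predicted label
```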
00:23:43.000 | And lastly,
00:23:47.000 | if we have these two labels here,
00:23:50.000 | again for positive or negative,
00:23:52.000 | what we can do is just take the model outputs,
00:23:55.000 | look at the logits, and see which one is the biggest.
00:24:00.000 | We take the argmax,
00:24:03.000 | which gives the index that's largest, and that's the output label that the model is actually predicting.
00:24:09.000 | So again, it gives a really easy way of being able to do this sort of like classification,
00:24:13.000 | getting the loss, getting what the actual labels are, um,
00:24:17.000 | just from within HuggingFace.
00:24:19.000 | Okay. Awesome.
00:24:23.000 | So, the last thing as well is that we can also look inside the model
00:24:30.000 | in a pretty cool way and see
00:24:37.000 | what attention weights the model actually has.
00:24:40.000 | Um, so this is helpful if you're trying to understand like what's going on inside of some NLP model.
00:24:47.000 | Um, and so for here, we can do again,
00:24:51.000 | uh, where we're importing our model from some pre-trained,
00:24:55.000 | um, kind of pre-trained model,
00:24:57.000 | model weights in the, um, the HuggingFace hub.
00:25:01.000 | We want to output attentions,
00:25:03.000 | set output attentions to true and output hidden states to true.
00:25:07.000 | So these are going to be the key arguments that we can use.
00:25:09.000 | We're actually kind of investigating, um,
00:25:12.000 | what's going on inside the model at each point in time.
00:25:15.000 | Again, we'll set the model to be in eval mode, um,
00:25:20.000 | and lastly, we'll go ahead and tokenize our input string again.
00:25:25.000 | Um, we don't really care about any of the gradients here.
00:25:29.000 | Um, again, so we don't actually want to back propagate anything here.
00:25:33.000 | And finally, pass in the model inputs.
00:25:36.000 | So now what we're able to do is when we print out the model hidden states.
00:25:41.000 | So now this is a new kind of property in the output dictionary that we get.
00:25:46.000 | We can look at what these actually look like here.
00:25:49.000 | Um, and sorry, this is a massive output.
00:25:53.000 | So you can actually look at the hidden state size per layer, right?
00:25:58.000 | And so this kind of gives a notion of what we're going to be looking like,
00:26:02.000 | looking at like what the shape of this is at each given layer in our model,
00:26:07.000 | as well as the attention head size per layer.
00:26:10.000 | So this gives you like the kind of shape of what you're looking at.
00:26:13.000 | And then if we actually look at the model output itself,
00:26:17.000 | we'll get all of these different like hidden states basically, right?
00:26:22.000 | So, um, so we have like tons and tons of these, uh, different hidden states.
00:26:27.000 | We'll have the last hidden state, um, here.
00:26:30.000 | So the model output is pretty robust for kind of showing you what the hidden state looks like,
00:26:36.000 | as well as what attention weights actually look like here.
00:26:39.000 | So in case you're trying to analyze a particular model,
00:26:43.000 | this is a really helpful way of doing that.
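A sketch of that setup, with distilbert-base-uncased as an assumed checkpoint for illustration:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "distilbert-base-uncased"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(
    ckpt,
    output_attentions=True,      # return attention weights for every layer
    output_hidden_states=True,   # return hidden states for every layer
)
model.eval()  # evaluation mode: disables dropout and other train-time behavior

inputs = tok("HuggingFace Transformers is great!", return_tensors="pt")
with torch.no_grad():            # no gradients needed, we're only inspecting
    out = model(**inputs)

print(len(out.hidden_states), out.hidden_states[0].shape)  # per layer: [batch, seq_len, hidden_dim]
print(len(out.attentions), out.attentions[0].shape)        # per layer: [batch, heads, seq_len, seq_len]
```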
00:26:45.000 | Sorry, the
00:26:49.000 | question is, what does .eval() do?
00:26:52.000 | What it does is it basically,
00:26:55.000 | and this is true for any PyTorch module or model,
00:26:58.000 | sets it into "eval mode".
00:27:01.000 | So again, for this we're not really trying to calculate any gradients
00:27:08.000 | corresponding to
00:27:11.000 | some data that we pass in, or trying to update our model in any way.
00:27:16.000 | We just care about evaluating it on that particular data point.
00:27:21.000 | So for that, it's helpful to set the model into eval mode,
00:27:27.000 | to make sure that it
00:27:29.000 | disables some of the stuff,
00:27:30.000 | like dropout, that you'd use during training time.
00:27:34.000 | So it just makes it a little more efficient.
00:27:36.000 | Yeah, the question was, uh,
00:27:38.000 | it's already pre-trained so can you go ahead and evaluate it?
00:27:41.000 | Yeah, you, you can.
00:27:42.000 | Um, so yeah, this is just the raw pre-trained model with no, no fine-tuning.
00:27:46.000 | So the question is like how do you interpret,
00:27:49.000 | um, these shapes basically,
00:27:52.000 | uh, for the attention head size and then the hidden state size?
00:27:56.000 | So yeah,
00:27:58.000 | the key thing here
00:27:59.000 | is you'll probably want to look at the shape given on the side.
00:28:03.000 | It'll correspond to the layer that you're actually looking at.
00:28:08.000 | So here, when we
00:28:11.000 | looked at the shape here,
00:28:12.000 | we're specifically looking at the first,
00:28:15.000 | the first one in this list, right?
00:28:17.000 | So this will give us the first hidden layer.
00:28:20.000 | The second gives us a
00:28:21.000 | notion of
00:28:23.000 | the batch that we're looking at.
00:28:25.000 | And then the last is
00:28:27.000 | some tensor, right?
00:28:29.000 | A 768-dimensional
00:28:32.000 | representation that corresponds there.
00:28:34.000 | And then for the attention head size,
00:28:37.000 | the last two dimensions correspond to the actual query word and the key word.
00:28:44.000 | But yes, so, um, but for this,
00:28:51.000 | you know, we would expect this kind of initial index here, right?
00:28:55.000 | The one to be bigger if we printed out all of the,
00:28:58.000 | you know, all of the layers,
00:28:59.000 | but we're just looking at the first one here.
00:29:01.000 | So we can also do this,
00:29:04.000 | um, for, um, you know,
00:29:07.000 | actually being able to get some notion of how these different,
00:29:11.000 | how this actually like looks,
00:29:13.000 | um, and plot out these axes as well.
00:29:16.000 | So again, if we take this same kind of model input,
00:29:19.000 | which again is like this hugging face transformers is great,
00:29:22.000 | we're actually trying to see like what do these representations look like,
00:29:26.000 | on like a per layer basis.
00:29:28.000 | So what we can do here is basically,
00:29:31.000 | we're looking at for each layer that we have in our model, right?
00:29:35.000 | And again, this is purely from the model output attentions,
00:29:38.000 | or the actual outputs of the model.
00:29:40.000 | Um, so what we can do is for each layer,
00:29:44.000 | and then for each head,
00:29:46.000 | we can analyze essentially like what these representations look like,
00:29:50.000 | and in particular, what the attention weights are,
00:29:52.000 | across each of like the tokens that we have.
00:29:55.000 | So this is like a good way of again,
00:29:57.000 | understanding like what your model is actually attending to,
00:30:00.000 | within each layer.
00:30:02.000 | So on the side, if we look here,
00:30:04.000 | maybe zoom in a bit,
00:30:06.000 | we can see that this is going to be like,
00:30:08.000 | corresponds to the different layers,
00:30:10.000 | and the top will correspond to,
00:30:12.000 | these are across the attention,
00:30:13.000 | the different attention heads.
00:30:16.000 | Okay. This will just give you some notion of like what the weights are.
00:30:19.000 | Here. So again, just to, um, to clarify.
00:30:23.000 | So again, if we maybe look at the labels,
00:30:25.000 | sorry, it's like a little cut off and like zoomed out,
00:30:28.000 | but so this y-axis here,
00:30:31.000 | like these different rows,
00:30:32.000 | corresponds to the different layers within the model.
00:30:36.000 | Oops. Um, on the x-axis here, right,
00:30:41.000 | we have like the, um,
00:30:43.000 | like the different attention heads that are present in the model as well.
00:30:47.000 | And so for each head,
00:30:49.000 | at each layer,
00:30:51.000 | we're able to basically get a sense of
00:30:55.000 | how the attention is actually being distributed,
00:30:59.000 | what's being attended to,
00:31:00.000 | corresponding to each of the tokens that you actually get here.
00:31:04.000 | So if we look up again
00:31:07.000 | here as well, right,
00:31:10.000 | we're just trying to look at basically the model attentions that we get
00:31:14.000 | for each corresponding layer.
00:31:17.000 | The question is what the color key is:
00:31:21.000 | yellow is a higher magnitude, a higher value,
00:31:25.000 | and then darker is closer to zero.
00:31:27.000 | So the darkest color is basically zero.
00:31:31.000 | So what we can do is now maybe walk through like what a fine-tuning task looks like here.
00:31:38.000 | And so first, in a project, you know,
00:31:42.000 | you're probably going to want to fine-tune a model.
00:31:44.000 | That's fine,
00:31:46.000 | and we'll go ahead and walk through an example of
00:31:48.000 | what that looks like here.
00:31:51.000 | Okay. So what we can do as well
00:31:59.000 | is use some of the
00:32:03.000 | datasets that we can get from HuggingFace.
00:32:07.000 | So it doesn't just have models,
00:32:08.000 | it has really nice datasets
00:32:11.000 | that we can load in as well.
00:32:13.000 | So here what we're going to be looking at
00:32:16.000 | is the IMDB dataset.
00:32:20.000 | Um, and so here again is for sentiment analysis.
00:32:24.000 | Uh, we'll just look at only the first 50 tokens or so.
00:32:28.000 | And generally, this is
00:32:32.000 | a helper function
00:32:34.000 | that we'll use for truncating the text that we get.
00:32:38.000 | And then lastly, for actually making this dataset,
00:32:43.000 | we can use the DatasetDict class from HuggingFace again,
00:32:48.000 | which will basically give us this smaller dataset
00:32:53.000 | for the train dataset, as well as
00:32:56.000 | specifying what we want for validation as well.
00:32:58.000 | So here what we're going to do for our like mini data set for
00:33:02.000 | the purpose of this demonstration is we'll use, uh,
00:33:06.000 | make train and val both from the IMDB train, uh, data set.
00:33:10.000 | Uh, we'll shuffle it a bit,
00:33:12.000 | and then we're just going to select here 128 examples,
00:33:16.000 | um, and then 32 for validation.
00:33:19.000 | So it'll shuffle it around,
00:33:20.000 | it'll take the first 128, and it'll take the next 32.
00:33:25.000 | Um, and then we'll kind of truncate those particular inputs that we get.
00:33:30.000 | Again, just to kind of make sure we're efficient.
00:33:33.000 | And we can actually run this on a CPU.
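A sketch of building that mini dataset, with the sizes from the walkthrough (the seed and the exact truncation rule are assumptions):

```python
from datasets import load_dataset, DatasetDict

imdb = load_dataset("imdb")

def truncate(example):
    # Keep roughly the first 50 whitespace-separated tokens of each review.
    return {"text": " ".join(example["text"].split()[:50]), "label": example["label"]}

shuffled = imdb["train"].shuffle(seed=0)  # seed chosen arbitrarily here
small_dataset = DatasetDict(
    train=shuffled.select(range(128)).map(truncate),
    val=shuffled.select(range(128, 160)).map(truncate),
)
```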
00:33:38.000 | Okay. So next we can do is just see kind of what does this look like.
00:33:44.000 | It'll just, again, this is kind of just like a dictionary,
00:33:46.000 | it's a wrapper class almost of giving, you know,
00:33:49.000 | your train data set and then your validation data set.
00:33:52.000 | And in particular, we can even look at like what the first 10 of these looks like.
00:33:58.000 | Um, so first, like the output, so we specify train.
00:34:02.000 | We want to look at the first 10 entries in our train data set.
00:34:06.000 | And the output of this
00:34:08.000 | is going to be
00:34:10.000 | a dictionary as well, which is pretty cool.
00:34:12.000 | So we have
00:34:14.000 | the first 10 text examples, which give the actual movie reviews here.
00:34:20.000 | Um, so this is the given in a list.
00:34:24.000 | And then the second, uh,
00:34:26.000 | key that you get are the labels corresponding to each of these.
00:34:29.000 | So whether it's positive or negative.
00:34:32.000 | So here one is going to be a positive review,
00:34:34.000 | uh, zero is negative.
00:34:36.000 | So it makes it really easy to use this for something like sentiment analysis.
00:34:42.000 | Okay. So what we can do is go ahead and, uh,
00:34:48.000 | prepare the data set and put it into batches of 16.
00:34:52.000 | Okay. So what does this look like?
00:34:53.000 | What we can do is call the map function
00:34:58.000 | that this small dataset dictionary has.
00:35:02.000 | So you call map and pass in a lambda function of what we want to actually do.
00:35:07.000 | So here the lambda function is for each example that we have.
00:35:12.000 | We want to tokenize the text basically.
00:35:15.000 | So this is basically saying how do we want to,
00:35:18.000 | you know, pre-process this.
00:35:20.000 | And so here we're extracting the tokens,
00:35:23.000 | the input IDs that we'll pass to the model.
00:35:25.000 | We're adding padding and truncation as well.
00:35:29.000 | We're going to do this in a batch and then the batch size will be 16.
00:35:32.000 | Okay. Hopefully this makes sense.
00:35:35.000 | Okay. So next we're basically just going to
00:35:41.000 | do a little more modification on what the dataset actually looks like.
00:35:47.000 | So we're going to remove the column that corresponds to text.
00:35:52.000 | And then we're going to rename the column label to labels.
00:35:56.000 | So again if we see this, this was called label.
00:35:59.000 | We're just going to call it labels and we're going to remove
00:36:01.000 | the text column because we don't really need it anymore.
00:36:04.000 | We just have gone ahead and pre-processed our data into the input IDs that we need.
00:36:09.000 | Okay. And lastly we're going to set it,
00:36:11.000 | the format to torch so we can go ahead and just pass this in,
00:36:15.000 | um, pass this into our model or our PyTorch model.
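Those pre-processing steps, sketched against the small_dataset built above:

```python
tokenized_dataset = small_dataset.map(
    lambda batch: tokenizer(batch["text"], padding=True, truncation=True),
    batched=True, batch_size=16,  # tokenize (and pad) 16 examples at a time
)
tokenized_dataset = tokenized_dataset.remove_columns(["text"])          # raw text no longer needed
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")  # the name the model expects
tokenized_dataset.set_format("torch")                                   # return PyTorch tensors
```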
00:36:18.000 | The question is what is labels?
00:36:20.000 | So, um, so label here corresponds to like again the first,
00:36:25.000 | in the context of sentiment analysis.
00:36:27.000 | It's like just, yeah, positive or negative.
00:36:30.000 | And so here we're just renaming the column.
00:36:33.000 | Okay. So now we'll just go ahead and see what this looks like.
00:36:36.000 | Again, we're going to look at the train set and only these first two things.
00:36:41.000 | And so, um, so here now we have
00:36:44.000 | the two labels that correspond to each of the reviews.
00:36:47.000 | And the input IDs that we get corresponding for each of the reviews as well.
00:36:52.000 | Lastly, we also get the attention mask.
00:36:55.000 | So it's basically just taking
00:36:57.000 | what you get out from the tokenizer and adding it back into the dataset.
00:37:01.000 | So it's really easy to pass in.
00:37:03.000 | The question is, we truncated, which makes things easy,
00:37:08.000 | but how do you apply
00:37:10.000 | padding evenly?
00:37:13.000 | So here,
00:37:16.000 | first, you could manually set some high truncation limit like we did.
00:37:21.000 | The second is that
00:37:24.000 | you can just go ahead and set
00:37:26.000 | padding to be true, and then
00:37:29.000 | the padding is
00:37:32.000 | added based off of the longest
00:37:37.000 | sequence that you have.
00:37:39.000 | Yes. So the question is,
00:37:41.000 | I guess doing it for all of them,
00:37:43.000 | all the text lists evenly.
00:37:45.000 | So again, it just depends on the size of the dataset
00:37:49.000 | you're loading, right?
00:37:51.000 | So if you're looking at particular batches at a time,
00:37:54.000 | you can just pad within that particular batch;
00:37:58.000 | you don't need to load the whole dataset into memory
00:38:01.000 | and pad the entire dataset
00:38:03.000 | in the same way.
00:38:05.000 | So it's fine to do it within just batches.
00:38:07.000 | Yeah, the question was,
00:38:09.000 | how does, uh, how are the input IDs like added?
00:38:12.000 | And, uh, yeah, the answer is yes.
00:38:15.000 | It's basically done automatically.
00:38:16.000 | Um, so we had to manually remove the text column here,
00:38:21.000 | in that first line here.
00:38:24.000 | But if you recall the outputs of
00:38:28.000 | the tokenizer, it's basically just the input IDs
00:38:31.000 | and the attention mask.
00:38:33.000 | So it just, it's smart enough to basically aggregate those together.
00:38:37.000 | Okay. The last thing we're gonna do is basically just put these into data loaders.
00:38:44.000 | So we have this dataset now, and it looks great.
00:38:48.000 | We're just gonna import a PyTorch DataLoader,
00:38:52.000 | a typical, normal data loader,
00:38:54.000 | and then go ahead and load each of these datasets that we just had,
00:38:58.000 | specifying the batch size to be 16.
00:39:01.000 | Okay. So that's fine and, and great.
00:39:07.000 | Um, and so now for training the model,
00:39:10.000 | it's basically like exactly the same as what we would do in typical PyTorch.
00:39:16.000 | So again, it's like you still want to compute the loss,
00:39:19.000 | you can back propagate the loss and everything.
00:39:22.000 | Yeah. So it's really up to your own design
00:39:27.000 | how you do the training.
00:39:29.000 | Um, so here there's only like a few kind of asterisks I guess.
00:39:34.000 | One is that you can import specific kind of optimizer types
00:39:39.000 | from the transformers, uh, package.
00:39:42.000 | So you can do Adam with weight decay,
00:39:44.000 | You can get a linear schedule for the learning rate,
00:39:48.000 | which will decrease the learning rate
00:39:50.000 | over time with each training step.
00:39:54.000 | So again, it's basically up to your choice.
00:39:56.000 | But if you look at the structure of like this code, right,
00:39:59.000 | we load the model for classification,
00:40:01.000 | we set a number of epochs,
00:40:03.000 | and then however many training steps we actually want to do.
00:40:06.000 | We initialize our optimizer and get some learning rate schedule, right.
00:40:11.000 | And then from there, it's basically the same thing as what we would do,
00:40:15.000 | for a typical kind of like PyTorch model, right.
00:40:18.000 | We set the model to train mode,
00:40:20.000 | we go ahead and pass in all of these batches from like the,
00:40:25.000 | the data loader and then back propagate,
00:40:28.000 | step the optimizer and everything like that.
00:40:31.000 | So it's, uh, pretty,
00:40:33.000 | pretty similar from what we're kind of like used to seeing essentially.
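A sketch of that loop, with the hyperparameters (learning rate, epochs) as placeholder choices:

```python
from torch.utils.data import DataLoader
from transformers import AdamW, get_linear_schedule_with_warmup

# shuffle=False keeps each DataLoader batch aligned with a tokenization batch,
# so the padded sequence lengths inside a batch match.
train_loader = DataLoader(tokenized_dataset["train"], batch_size=16, shuffle=False)

num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
optimizer = AdamW(model.parameters(), lr=5e-5)  # Adam with weight decay, from transformers
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        outputs = model(**batch)  # batch includes labels, so outputs.loss is set
        outputs.loss.backward()   # backpropagate
        optimizer.step()          # update the parameters
        scheduler.step()          # decay the learning rate
        optimizer.zero_grad()
```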
00:40:38.000 | Awesome. So that'll go do its thing at some point.
00:40:47.000 | Um, okay. And so
00:40:49.000 | that's one potential option: if you really like PyTorch,
00:40:53.000 | you can just go ahead and do that, and it's really nice and easy.
00:40:56.000 | The second thing is
00:40:59.000 | that HuggingFace actually has a Trainer class that you're able to
00:41:05.000 | use that can handle most of these things.
00:41:08.000 | So again, if we do the same thing here,
00:41:12.000 | this will actually run once our model is done training.
00:41:15.000 | We can create
00:41:18.000 | our dataset in the same way as before.
00:41:21.000 | Now, what we
00:41:23.000 | need to use is this import of a TrainingArguments class,
00:41:28.000 | which is going to be basically a dictionary of all the things that we want to
00:41:32.000 | use when we actually train our model,
00:41:35.000 | and then this additional Trainer class,
00:41:39.000 | which will handle the training almost magically for us and wrap around it in that way.
00:41:45.000 | Okay. So if you can, okay,
00:41:49.000 | I think we're missing a directory, but, um,
00:41:52.000 | I think, yeah, pretty straightforward for how you want to train.
00:41:55.000 | Yeah. Um, so for,
00:41:58.000 | for here at least, um, again,
00:42:00.000 | there are kind of the two key arguments.
00:42:02.000 | The first is training arguments.
00:42:04.000 | So this will specify, have a number of specifications that you can actually pass through to it.
00:42:09.000 | It's where you want to log things for each kind of like device.
00:42:13.000 | In this case, like we're just using one GPU,
00:42:16.000 | but potentially if you're using multiple GPUs,
00:42:19.000 | what the batch size is during training,
00:42:21.000 | what the batch size is during evaluation time.
00:42:24.000 | How long you want to train it for.
00:42:27.000 | How you want to evaluate it.
00:42:29.000 | So this is kind of like evaluating on an epoch level.
00:42:33.000 | What the learning rate is and so on, so on.
00:42:36.000 | So again, if you want to check the documentation,
00:42:39.000 | you can see that here.
00:42:41.000 | There's a bunch of different arguments that you can give.
00:42:44.000 | There's like warm-up steps, warm-up ratio, like weight decay.
00:42:48.000 | There's like so many things.
00:42:50.000 | So again, it's basically like a dictionary.
00:42:53.000 | Feel free to kind of like look at these different arguments you can pass in.
00:42:57.000 | But there's a couple of key ones here.
00:42:59.000 | And this is basically, this basically mimics the same arguments that we used before
00:43:04.000 | in our like explicit PyTorch method here for HuggingFace.
00:43:10.000 | Okay, similarly, what we do is we can just pass this into the trainer
00:43:15.000 | and that will take care of basically everything for us.
00:43:18.000 | So that whole training loop that we did before
00:43:21.000 | is kind of condensed into this one class function
00:43:24.000 | for actually just doing the training.
00:43:26.000 | So we pass the model, the arguments, the trained dataset, eval dataset,
00:43:31.000 | what tokenizer we want to use,
00:43:33.000 | and then some function for computing metrics.
00:43:37.000 | So for here, we pass in this function, eval,
00:43:41.000 | and it takes eval predictions as input.
00:43:44.000 | Basically, these predictions are given from the trainer,
00:43:48.000 | we pass them into this function,
00:43:50.000 | and we can just split them into the actual logits
00:43:53.000 | and the ground-truth
00:43:55.000 | labels that we have.
00:43:57.000 | And then from here, we can just calculate any sort of additional metrics we want,
00:44:01.000 | like accuracy, F1 score, recall, or whatever you want.
00:44:07.000 | Okay, so this is like an alternative way of formulating that training loop.
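A sketch of that formulation (the output directory name and hyperparameters are placeholders):

```python
import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_preds):
    logits, labels = eval_preds  # model predictions and ground-truth labels
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

args = TrainingArguments(
    output_dir="sample_hf_trainer",  # where checkpoints and logs go (name assumed)
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",     # evaluate at the end of every epoch
    learning_rate=5e-5,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized_dataset["train"],
                  eval_dataset=tokenized_dataset["val"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
```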
00:44:14.000 | Okay, the last thing here as well is that we can have some sort of callback as well
00:44:20.000 | if you want to do things during the training process.
00:44:23.000 | So after every epoch or something like that,
00:44:26.000 | you want to evaluate your model on the validation set or something like that,
00:44:30.000 | or just go ahead and like dump some sort of output.
00:44:35.000 | That's what you can use a callback for.
00:44:37.000 | And so here, this is just a logging callback.
00:44:41.000 | It's just going to log kind of like the information about the process itself.
00:44:48.000 | Again, not super important,
00:44:50.000 | but in case that you're looking to try and do any sort of callback during training,
00:44:56.000 | it's an easy way to add it in.
00:44:58.000 | The second is if you want to do early stopping as well.
00:45:01.000 | So early stopping will basically stop your model early, as it sounds.
00:45:07.000 | If it's not learning anything and a bunch of epochs are going by,
00:45:11.000 | you can set that so that you don't waste compute time,
00:45:14.000 | or so you can see the results more easily.
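Attaching callbacks is just one more argument to the Trainer; a sketch with early stopping:

```python
from transformers import EarlyStoppingCallback

# Note: early stopping also needs load_best_model_at_end=True and a
# metric_for_best_model set in the TrainingArguments.
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized_dataset["train"],
                  eval_dataset=tokenized_dataset["val"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```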
00:45:16.000 | The question is, is there a good choice for the patience value?
00:45:21.000 | It just depends on the model architecture.
00:45:23.000 | Not really, I guess.
00:45:24.000 | It's pretty much up to your discretion.
00:45:31.000 | Okay, awesome.
00:45:33.000 | And so the last thing that we do is just do a call, trainer.train.
00:45:37.000 | So if you recall, this is just the instantiation of this trainer class,
00:45:42.000 | called trainer.train, and it'll just kind of go.
00:45:46.000 | So now it's training, which is great.
00:45:49.000 | It gives us a nice kind of estimate of how long things are taking,
00:45:53.000 | what's going on, what arguments that we actually pass in.
00:45:58.000 | So that's just going to run.
00:46:01.000 | And then likewise, hopefully it'll train relatively quickly.
00:46:06.000 | Okay, it'll take two minutes.
00:46:08.000 | We can also evaluate the model pretty easily as well.
00:46:12.000 | So we just call trainer.predict on whatever data set that we're interested in.
00:46:17.000 | So here it's the tokenized dataset; of course,
00:46:19.000 | we're going to do the validation dataset.
00:46:23.000 | Okay, hopefully we can pop that out soon.
00:46:27.000 | And lastly, so if we saved anything to our model checkpoints,
00:46:32.000 | so hopefully this is saving stuff right now.
00:46:40.000 | Yeah, so this is going to
00:46:41.000 | continue to save stuff to the folder that we specified.
00:46:46.000 | And so here, in case we ever want to kind of like load our model again,
00:46:50.000 | from the weights that we've actually saved,
00:46:53.000 | we just pass in the name of the checkpoint,
00:46:55.000 | like the relative path here to our checkpoint.
00:46:58.000 | So notice how we have some checkpoint eight here.
00:47:02.000 | We just pass in the path to that folder, we load it back in,
00:47:06.000 | we tokenize, and it's the same thing as we did before.
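A sketch of reloading from a saved checkpoint (the checkpoint-8 folder is whatever the Trainer actually wrote out, and the directory name is the assumed output_dir from above):

```python
from transformers import AutoModelForSequenceClassification

reloaded = AutoModelForSequenceClassification.from_pretrained(
    "sample_hf_trainer/checkpoint-8")  # relative path to a saved checkpoint folder
inputs = tokenizer("I'm excited to learn about Hugging Face Transformers!",
                   return_tensors="pt")
print(reloaded(**inputs).logits)
```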
00:47:12.000 | There are a few additional appendices for how to do different tasks as well.
00:47:19.000 | So an appendix on generation, how to define a custom dataset,
00:47:25.000 | and how to pipeline different tasks together.
00:47:32.000 | That last one is about using a pre-trained model really easily through the pipeline interface.
00:47:43.000 | There are different types of tasks, like masked language modeling.
00:47:48.000 | Feel free to look through those on your own time.
00:47:50.000 | And yeah, thanks a bunch.
00:47:52.000 | [BLANK_AUDIO]