Stanford XCS224U: NLU | Contextual Word Representations, Part 1: Guiding Ideas | Spring 2023
00:00:06.280 |
Welcome to our unit on contextual word representations. 00:00:09.360 |
I thought I'd kick this off with some high-level guiding ideas. 00:00:12.780 |
The first thing I wanted to say is that in previous iterations of this course, 00:00:16.080 |
we spent about two weeks focused on static vector representations 00:00:20.480 |
of words and in some cases phrases and full sentences. 00:00:25.460 |
In this iteration, we're going to go directly to contextual word representations, which have 00:00:29.000 |
proven so powerful for today's NLP research and technologies. 00:00:35.720 |
Still, I thought I would briefly review the history that leads to contextual representations. 00:00:38.920 |
Let's rewind the clock back to what I've called feature-based, 00:00:42.400 |
classical lexical representations and the hallmark 00:00:45.520 |
of these representations is that they are sparse. 00:00:48.400 |
Typically, what we're thinking of is very long vectors where 00:00:52.120 |
the column dimensions represent the output of 00:00:54.680 |
usually hand-built feature functions that we've written that might 00:00:57.520 |
capture things like the binary sentiment classification of a word, 00:01:01.720 |
or which part of speech it tends to have most dominantly, 00:01:05.600 |
or whether or not it ends in a suffix like -ing in English, and so forth. 00:01:11.000 |
We write a lot of these feature functions and as a result, 00:01:13.560 |
we have vectors that are mostly zeros because most words don't have these properties, 00:01:18.000 |
but the few non-zero entries carry important information. 00:01:21.320 |
In the next phase, we have count-based methods. 00:01:24.400 |
This is the introduction of the distributional hypothesis. 00:01:27.640 |
I'm thinking of methods like point-wise mutual information and TF-IDF, 00:01:34.400 |
a classic from information retrieval that we'll visit a bit later in the course. 00:01:39.400 |
For these methods, we begin from some count matrix. 00:01:44.040 |
For PMI, it would be a word-by-word matrix where the cells in that matrix capture 00:01:47.800 |
the number of times each word co-occurs with the other words in our corpus. 00:01:52.680 |
For TF-IDF, it would probably be a word-by-document matrix, 00:01:56.200 |
and now we're capturing which words appear and with 00:01:59.240 |
what frequency in all the documents in our corpus. 00:02:02.760 |
The idea behind these methods is that we take PMI or 00:02:06.360 |
TF-IDF and we massage those counts in a way that leads to better representations. 00:02:12.000 |
That's coming purely from distributional information. 00:02:14.560 |
We typically don't write hand-built feature functions in this mode; we rely instead on patterns of co-occurrence in the data. 00:02:21.400 |
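To make the count-based idea concrete, here is a minimal sketch, not from the lecture, of reweighting a tiny word-by-word co-occurrence matrix with positive PMI; the vocabulary and counts are invented purely for illustration.

```python
import numpy as np

# Toy word-by-word co-occurrence counts (rows and columns share one vocabulary).
vocab = ["the", "vase", "broke", "news"]
counts = np.array([
    [0., 20., 10., 15.],
    [20., 0., 5., 0.],
    [10., 5., 0., 3.],
    [15., 0., 3., 0.],
])

def ppmi(X, smooth=1e-8):
    """Reweight a count matrix with positive pointwise mutual information."""
    total = X.sum()
    p_xy = X / total                                  # joint probabilities
    p_x = X.sum(axis=1, keepdims=True) / total        # row marginals
    p_y = X.sum(axis=0, keepdims=True) / total        # column marginals
    pmi = np.log((p_xy + smooth) / (p_x @ p_y + smooth))
    return np.maximum(pmi, 0.0)                       # clip negatives: PPMI

print(np.round(ppmi(counts), 2))
```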
The next phase is what I've called classical dimensionality reduction. 00:02:25.360 |
This is the introduction of dense representations. 00:02:39.160 |
Typically, what we're doing at this phase is taking the output of 00:02:42.720 |
representations built in the mode of step 2 and compressing them with a method like SVD or PCA. 00:02:47.760 |
In compressing them, we typically get denser, 00:02:51.000 |
more informative representations that can also capture 00:02:54.240 |
higher-order notions of distributional similarity and co-occurrence and so forth. 00:03:01.960 |
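As a rough illustration of this compression step (my own sketch, with a random matrix standing in for a reweighted count matrix), truncated SVD in the style of latent semantic analysis yields dense, lower-dimensional word vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend this is a (vocab_size x vocab_size) PPMI-reweighted count matrix.
X = np.abs(rng.normal(size=(1000, 1000)))

k = 100                                    # target dimensionality
U, s, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * s[:k]            # dense k-dimensional word vectors

print(word_vectors.shape)                  # (1000, 100)
```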
Then the fourth phase, which might be the final phase for this history of static representations, 00:03:07.440 |
is what I've called learned dimensionality reduction approaches, 00:03:15.880 |
with methods like word2vec or GloVe. 00:03:19.240 |
What we do at this phase is essentially combine 00:03:23.040 |
the count-based methods from step 2 in this history with 00:03:27.240 |
the dimensionality reduction that we see from methods like SVD and PCA. 00:03:37.920 |
The result is dense representations with a tremendous capacity to capture higher-order notions of co-occurrence. 00:03:44.760 |
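If you want a feel for this phase in code, here is a small sketch using the gensim library; the choice of toolkit is my assumption (the lecture doesn't name one), and the toy corpus is far too small for real training.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text).
corpus = [
    ["the", "vase", "broke"],
    ["the", "news", "broke"],
    ["the", "day", "broke"],
]

# Skip-gram word2vec producing 50-dimensional static vectors.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["broke"].shape)         # (50,)
print(model.wv.most_similar("broke"))  # nearest neighbors in the learned space
```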
If you would like to get hands-on with these methods and really 00:03:51.360 |
dig into the details, check out this page that I've linked to here from the course website. 00:03:54.240 |
It links to a lot of notebooks and some older videos that 00:03:57.320 |
could help you with a refresher on all these methods, or just get you up to 00:04:01.240 |
speed so that you're ready to think directly about contextual representations. 00:04:09.540 |
I have to say, these contextual models that will be the focus for this unit, and really for 00:04:12.860 |
the entire course, really resonate with me as a linguist. 00:04:16.840 |
I thought I would just pause here and run through some examples that I think 00:04:21.040 |
lead from the linguistics angle to the conclusion that 00:04:24.200 |
word representations are highly sensitive to the context in which they're used. 00:04:29.040 |
Let's start with a simple case like the vase broke and 00:04:31.880 |
our focus will be on that English verb break. 00:04:35.000 |
Here, the sense of break is something like shatter. 00:04:40.600 |
Compare that with a superficially similar-looking sentence, subject plus verb. 00:04:44.180 |
But now the sense of break is more like begin. 00:04:47.960 |
That's presumably conditioned by what we know about the subject and how 00:04:52.440 |
that operates on the sense of the verb break for a stereotypical reading. 00:04:56.760 |
The news broke: again, a superficially similar sentence, 00:05:01.420 |
but now it seems to be conditioning a reading that is more 00:05:04.220 |
like was announced or appeared or was published. 00:05:13.640 |
In other examples, the sense of break is something like surpass the previous world record, 00:05:23.540 |
again conditioned in this case by the direct object. 00:05:28.800 |
Next, we have break appearing with the particle into, 00:05:31.640 |
and in this case it means something like enter forcibly without permission. 00:05:35.560 |
But the newscaster broke into the movie broadcast means something more like interrupt, 00:05:40.040 |
a related but interestingly distinct meaning that is coming 00:05:43.360 |
from the same break plus particle construction. 00:05:50.960 |
With break even, this is an entirely new sense: break plus even means 00:05:53.640 |
something like we lost the same amount as we gained. 00:06:00.560 |
And these are just a few of the many senses that break can take on in English. 00:06:03.640 |
What we see in these patterns is that the sense break does have in context is 00:06:08.280 |
driven transparently by the surrounding words, 00:06:11.200 |
but maybe also by other things that are happening in the context. 00:06:15.000 |
Similar things arise for adjectives like flat, whose sense shifts with the noun it modifies. 00:06:34.880 |
There might be some metaphorical sense in which all of these senses are related, 00:06:39.200 |
but they are also clearly distinct, and they're being driven 00:06:42.800 |
by something about how the surrounding words interact with the adjective or verb in question. 00:06:47.400 |
We should extend this beyond just the simple morphosyntactic context. 00:06:52.920 |
Let's think about ambiguity resolution coming from the things we say to each other. 00:07:00.120 |
If the crane in my sentence is doing something only a bird could do, you probably infer that there the sense of crane is a bird. 00:07:11.160 |
If it is doing something only a machine could do, you probably infer in that case that a crane is a machine. 00:07:18.080 |
And if I just say something like I saw a crane, you might not, absent more context, know whether I mean a bird or a machine. 00:07:23.120 |
But if we do embed this sentence in larger context, 00:07:26.120 |
you'll begin to make inferences about whether it was a bird or a machine. 00:07:30.680 |
That shows you that even things that are happening ambiently in the context of utterance 00:07:35.120 |
can impact what sense crane has in these sentences. 00:07:45.840 |
Consider also the word any: if I have been talking about typos and say I didn't see any, we probably infer that I didn't see any means I didn't see any typos. 00:07:50.760 |
Contrast that with are there any bookstores downtown? 00:07:57.240 |
Here the inference is that any means something like any bookstores. 00:08:01.680 |
That is again showing that the senses these words take on in context can be 00:08:06.520 |
driven by everything that is happening in the surrounding sentence, 00:08:12.900 |
extending out into things about world knowledge and so forth. 00:08:17.200 |
That for me really shows that the static word vector representation approach was 00:08:22.200 |
never really going to work out, because it insists that broke in all those examples 00:08:28.520 |
corresponds to a single vector, that flat, into, and any each have to be a single vector, and so forth and so on. 00:08:36.400 |
What we actually see in how language works is 00:08:39.400 |
much more malleability for individual word meanings, 00:08:42.920 |
and that's precisely what contextual word representations allow us to capture. 00:08:52.480 |
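To see this malleability directly, here is a brief sketch, my own rather than the lecture's, that uses the Hugging Face transformers library to pull out the contextual vector for broke in two of the example sentences; a static embedding would be forced to assign the same vector in both.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Two of the example sentences: same surface form "broke", different senses.
sentences = ["The vase broke.", "The news broke."]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

vectors = []
for sent in sentences:
    enc = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]        # (seq_len, 768)
    # Find the position of "broke" and grab its contextual vector.
    idx = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("broke"))
    vectors.append(hidden[idx])

# A static embedding would make these identical; contextually they differ.
sim = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity between the two 'broke' vectors: {sim.item():.3f}")
```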
Let me also offer a brief history of where these ideas come from, because it's a very recent history. 00:08:56.760 |
Things are moving fast, but it's also interesting to track. 00:09:01.200 |
Maybe the starting point for this is Dai and Le 2015. 00:09:05.040 |
They really showed the value of language model style pre-training for 00:09:08.920 |
downstream tasks and the paper is fascinating to look at. 00:09:14.160 |
A lot of the things that they do look complicated in retrospect, 00:09:17.540 |
and some of the simpler ideas that they offer we can now see in retrospect are incredibly powerful. 00:09:23.560 |
We fast forward a little bit to August 2017 and McCann et al., 00:09:29.240 |
who showed that pre-trained bidirectional LSTMs for 00:09:33.640 |
machine translation could offer us sequence representations that were 00:09:38.480 |
a useful starting point for many other downstream tasks outside of MT. 00:09:44.360 |
That begins this move toward pre-training for contextual representations. 00:09:49.440 |
It really takes off with ELMo in February 2018. 00:09:53.720 |
That team was really the first to show that very large-scale pre-training of 00:09:58.400 |
bidirectional LSTMs could lead to rich multipurpose representations that were 00:10:03.320 |
easily adapted to lots of downstream tasks via fine-tuning. 00:10:16.240 |
Then, of course, BERT. Devlin et al. 2019 is when the paper was published, but the model 00:10:21.960 |
appeared earlier and already had tremendous influence by the time it was actually officially published. 00:10:26.200 |
That BERT model is really the cornerstone of so much that we'll discuss in this unit and beyond. 00:10:36.640 |
The field has also been on a parallel journey that I thought I would talk a little bit about as another guiding idea. 00:10:41.520 |
I've put this under the heading of model structure and linguistic structure, 00:10:45.600 |
and this is related also to this idea of what kinds 00:10:49.600 |
of structural biases we build into our model architectures. 00:10:54.200 |
In the upper left, I have a simple model that's reviewed in those background materials, 00:10:58.800 |
where you can imagine that each word in the sentence 00:11:01.880 |
the Rock rules is looked up in something like a GloVe space or a word2vec space. 00:11:07.240 |
Then what this model does is simply add those representations 00:11:10.400 |
together to get a representation for the entire sentence. 00:11:17.200 |
This is a high-bias model, in particular because we have to decide ahead of time that 00:11:20.920 |
the way those representations will combine will be via addition. 00:11:25.520 |
You have to think, even if that's approximately correct, 00:11:30.360 |
it would be a minor miracle if addition turned out to 00:11:33.760 |
be actually the optimal way to combine those word meanings. 00:11:38.040 |
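Here is a minimal sketch of that additive model; the embedding table is random, standing in for a pre-trained static space like GloVe.

```python
import torch

# Stand-in for a static lookup table (e.g., GloVe); the vectors are random here.
vocab = {"the": 0, "rock": 1, "rules": 2}
embeddings = torch.nn.Embedding(len(vocab), 50)

def sentence_vector(tokens):
    """Additive model: the sentence representation is just the sum of word vectors."""
    ids = torch.tensor([vocab[t] for t in tokens])
    return embeddings(ids).sum(dim=0)

print(sentence_vector(["the", "rock", "rules"]).shape)  # torch.Size([50])
```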
You could see that as giving rise to the models that are on the right here, recurrent neural networks. 00:11:44.080 |
Again, you could imagine that we look up each one of those words in 00:11:47.600 |
a static vector representation space like GloVe. 00:11:51.400 |
But now we feed them into this RNN process that actually 00:11:55.240 |
processes them with a bunch of new neural network parameters. 00:12:03.240 |
The hope is that, with enough data, it could be taught to combine those word meanings in an optimal way. 00:12:07.560 |
It could be the case that the optimal way to combine word meanings is with addition, 00:12:12.880 |
and the models that we have over here on the right are certainly powerful 00:12:16.520 |
enough to learn addition of those vectors if it's correct, but they could also learn 00:12:24.080 |
a much more complicated function that might be much more nuanced. 00:12:27.800 |
In that way, we've released some of the biases that we introduced over here, 00:12:33.080 |
and we're going to allow ourselves to learn in a more free-form way 00:12:36.320 |
from data about how to optimally combine words. 00:12:45.040 |
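And here is the corresponding lower-bias variant sketched in PyTorch: the same static lookup, but an RNN with learnable parameters decides how the word vectors get combined (dimensions are invented).

```python
import torch
import torch.nn as nn

# Static word vectors (stand-in for GloVe), plus an RNN that learns how to combine them.
vocab = {"the": 0, "rock": 1, "rules": 2}
embeddings = nn.Embedding(len(vocab), 50)
rnn = nn.RNN(input_size=50, hidden_size=64, batch_first=True)

def sentence_vector(tokens):
    """Run the looked-up vectors through the RNN; use the final hidden state."""
    ids = torch.tensor([[vocab[t] for t in tokens]])       # shape (1, seq_len)
    outputs, h_n = rnn(embeddings(ids))                     # h_n: (1, 1, 64)
    return h_n.squeeze()

print(sentence_vector(["the", "rock", "rules"]).shape)      # torch.Size([64])
```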
A different option is a tree-structured recursive model. In this case, we decide ahead of time that we know the constituent structure, 00:12:49.120 |
and then we might have a bunch of neural network parameters that combine 00:12:52.840 |
the child nodes to create representations for 00:12:55.440 |
the parents in a recursive fashion as we move up the tree. 00:12:59.200 |
That is very powerful in the sense that we could have 00:13:04.880 |
learned parameters for how to combine child nodes into their parents. 00:13:08.240 |
But this model is high bias because it decides ahead of 00:13:11.680 |
time about how the constituent structure should look. 00:13:18.120 |
For example, it assumes that the rock forms a constituent in a way that rock rules simply does not. 00:13:22.600 |
Whereas the models up here take a more free-form approach. 00:13:29.120 |
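A tree-structured model of the sort just described might be sketched like this, with one learned composition function applied bottom-up over an assumed constituent structure; all names and dimensions here are my own illustration.

```python
import torch
import torch.nn as nn

# One composition function, applied recursively: parent = tanh(W [left; right] + b).
dim = 50
compose = nn.Linear(2 * dim, dim)

def parent(left, right):
    return torch.tanh(compose(torch.cat([left, right])))

# Hypothetical leaf vectors for "the", "rock", "rules" (random stand-ins for GloVe).
the, rock, rules = (torch.randn(dim) for _ in range(3))

# The tree [[the rock] rules]: combine "the" and "rock" first, then add "rules".
np_node = parent(the, rock)
sentence = parent(np_node, rules)
print(sentence.shape)  # torch.Size([50])
```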
Down at the bottom here, I have the models that we saw in the lead-up to the transformer, 00:13:36.760 |
where again, we could look up the words in a static vector representation space, 00:13:40.840 |
but now we have information flowing left to right in these hidden representations, 00:13:45.200 |
and we might have added a bunch of attention mechanisms on top that 00:13:49.680 |
essentially connect each hidden unit to 00:13:52.080 |
every other hidden unit in this representation. 00:13:55.240 |
Whereas this had a presumption that we would process left to right, 00:13:58.520 |
and this one had an assumption that we would process by constituent structure, 00:14:02.640 |
this model down here says essentially anything goes. 00:14:08.480 |
One lesson of the transformer era is that anything goes, given sufficient data. 00:14:15.480 |
That really is a kind of insight behind the transformer architecture. 00:14:21.040 |
The attention mechanisms that I mentioned there are really important, 00:14:25.320 |
and this is also part of the journey that leads us to 00:14:27.640 |
the transformer and might harbor most of its power. 00:14:31.240 |
Let me give you a simple example of how attention worked in these earlier models. 00:14:37.880 |
Here I have a model that you might think of as an RNN, 00:14:40.320 |
where maybe we're processing the sequence really not so good left to right. 00:14:44.080 |
We look up those words in some static vector representation space, 00:14:48.860 |
and then we have our left-to-right process that builds up hidden representations across the sequence. 00:14:53.280 |
Suppose now that I want to train a classifier on top of this final output state. 00:15:02.040 |
The concern is that there'll be a lot of information about the words that are late in the sequence, 00:15:06.280 |
but not enough information in this representation 00:15:09.080 |
here about the words that were earlier in the sequence. 00:15:12.640 |
Attention emerges as a way to remind ourselves, in late states, of what came earlier in the sequence, 00:15:22.120 |
and here what I've depicted is a simple dot product scoring function, 00:15:25.780 |
exactly the sort that we get from the transformer in essence. 00:15:31.560 |
We take the dot product of our target representation with all of the previous hidden states. 00:15:39.080 |
Those scores tell us how similar each earlier state is to the target, and then we bring them into the representation that we are targeting. 00:15:48.220 |
Then finally, we get this attention combination H here, 00:15:52.360 |
and that could be a neural network parameterized function that takes in 00:15:56.600 |
this representation plus the attention representation that we created here, 00:16:01.760 |
and feeds that through some parameters and a non-linearity. 00:16:06.160 |
That finally gives us the representation that we feed 00:16:09.640 |
into the classifier that we wanted to fit originally. 00:16:13.400 |
The idea here is that now our classification decision is based indeed on this representation 00:16:19.680 |
at the end but now infused with a lot of information about how 00:16:23.880 |
similar that representation is to the ones that preceded it. 00:16:27.600 |
That is the essential idea behind dot product attention, and it carries over directly into the transformer. 00:16:36.480 |
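Here is a small sketch of that dot-product attention step, my own illustration with invented dimensions; I've used a softmax to normalize the scores, which is one standard choice, before combining the attention summary with the target state.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, seq_len = 8, 4

# Hidden states for "really not so good"; the last one is our target state.
hiddens = torch.randn(seq_len, dim)
target = hiddens[-1]
previous = hiddens[:-1]

# 1. Score the target against every previous state with a dot product.
scores = previous @ target                     # shape (seq_len - 1,)

# 2. Normalize the scores and form a weighted average of the previous states.
weights = F.softmax(scores, dim=0)
context = weights @ previous                   # shape (dim,)

# 3. Combine the target with the attention summary via parameters + nonlinearity.
W = torch.nn.Linear(2 * dim, dim)
h_tilde = torch.tanh(W(torch.cat([target, context])))
print(h_tilde.shape)                           # torch.Size([8]); fed to the classifier
```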
Another idea that has proved so powerful is a notion of sub-word modeling. 00:16:42.240 |
I thought I would take you on a brief journey of how we 00:16:44.520 |
arrived at the current phase of sub-word modeling, starting with ELMo. 00:16:53.080 |
The ELMo word representation space begins with character level representations, 00:16:58.960 |
and then it has a bunch of filters on top of those, 00:17:04.240 |
convolutional layers with different filter sizes that we then do max pooling 00:17:08.120 |
over to get a representation for the entire character sequence. 00:17:16.080 |
We gather up these max pooling representations at the top, 00:17:19.180 |
and those form the basis for word representations, 00:17:22.560 |
and the idea is that this gives us whole word vectors that nonetheless have 00:17:28.080 |
lots of information about the sub-word parts, all the way down to characters, because the filters have 00:17:35.440 |
different lengths that capture different notions of sub-word structure within that space. 00:17:42.640 |
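In the spirit of that character-level design (a simplified sketch of my own, not ELMo's actual configuration), a word vector can be built from character embeddings, convolution filters of several widths, and max pooling:

```python
import torch
import torch.nn as nn

# Character embeddings plus convolution filters of several widths, max-pooled,
# roughly in the spirit of the ELMo character CNN (sizes invented for illustration).
char_vocab_size, char_dim = 262, 16
char_emb = nn.Embedding(char_vocab_size, char_dim)
convs = nn.ModuleList([nn.Conv1d(char_dim, 32, kernel_size=k) for k in (2, 3, 4)])

def word_vector(char_ids):
    x = char_emb(torch.tensor(char_ids)).T.unsqueeze(0)      # (1, char_dim, n_chars)
    pooled = [conv(x).max(dim=2).values for conv in convs]   # max pool over positions
    return torch.cat(pooled, dim=1).squeeze(0)               # (3 * 32,) word vector

print(word_vector([10, 42, 97, 5, 33]).shape)                 # torch.Size([96])
```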
One thing I should note though is that the ELMo vocabulary has about 100,000 words in it, 00:17:52.840 |
but as you process new text, you will find that you are mostly encountering words that are not in that vocabulary, 00:17:57.980 |
and even if you double or triple it to 200,000 or 300,000, 00:18:02.340 |
now you're getting a really large embedding space, 00:18:04.600 |
you will still mostly encounter words that are unked out, 00:18:11.320 |
and that's an incredibly limiting factor for this whole word approach, 00:18:15.600 |
but we see in here the essence of the idea that we should model sub-words. 00:18:20.600 |
A big change happened for the transformer when we 00:18:24.320 |
got in parallel this notion of word piece tokenization. 00:18:28.280 |
Here I'm going to give you a feel for that by looking at the BERT tokenizer. 00:18:35.400 |
We load the tokenizer, and then when we call it on a sentence, 00:18:39.520 |
we get things that look mostly like whole words, 00:18:42.420 |
or at least familiar if you're used to NLP: the suffix for 00:18:45.980 |
isn't has been broken apart, and so has the punctuation. 00:18:49.500 |
But when we call the tokenizer on the sequence encode me with an exclamation mark, 00:18:54.860 |
notice that the word encode has been split apart into two pieces, 00:18:59.640 |
en and then ##code, with those markers indicating that ##code is a word-internal piece. 00:19:16.900 |
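You can reproduce that behavior with the Hugging Face transformers library; the first input sentence here is my own stand-in, but encode me! is the example from the lecture.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Mostly whole words, with "isn't" and the punctuation split apart.
print(tokenizer.tokenize("This isn't hard."))

# "encode" is not in the vocabulary, so it is analyzed into pieces;
# "##" marks a word-internal piece.
print(tokenizer.tokenize("encode me!"))   # ['en', '##code', 'me', '!']

print(tokenizer.vocab_size)               # roughly 30,000 word pieces
```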
The effect here is that we can have a vanishingly small vocabulary. 00:19:21.540 |
There are only around 30,000 word pieces in this BERT tokenization space, 00:19:27.780 |
But nonetheless, when we encounter words that are outside of that space, 00:19:33.560 |
we don't simply mark them as UNK, but rather we analyze them into sub-word pieces that we do have in the vocabulary. 00:19:40.460 |
This is incredibly powerful, and in the context of a contextual model, 00:19:45.420 |
we might have some hope that for cases like encode being split into two tokens, 00:19:50.740 |
the model will learn internally that, in some sense, those two pieces constitute a single word. 00:19:56.620 |
But we don't need that directly reflected in the tokenizer's vocabulary. 00:20:06.320 |
Another piece of the transformer that is so foundational, and that I think 00:20:08.860 |
the field is still figuring out, is positional encoding. 00:20:16.700 |
On its own, the transformer has hardly any way of keeping track of the order of words in a sequence. 00:20:19.300 |
It's mostly a bunch of columns of different things that happen 00:20:22.860 |
independently with some attention mechanisms bringing them together. 00:20:29.240 |
To give the model a notion of word order, we typically have something like a positional encoding. 00:20:32.420 |
The most heavy-handed way to do that is to simply have 00:20:41.060 |
a second embedding space that records where individual words appear in the sequence. 00:20:44.940 |
Here, the word the is paired with position 1 because it's at the start of the sequence. 00:20:50.900 |
If it appeared later, say in position 4, it would be this fixed vector here combined with the embedding for position 4. 00:20:56.560 |
Those get added together into what you might think of as a position-sensitive representation of the word. 00:21:05.060 |
That simple scheme works, but it has many limitations that we're going to talk about later in the unit, 00:21:08.620 |
and we're going to explore ways to capture what's good about it while avoiding those limitations. 00:21:17.060 |
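Here is a minimal sketch of that heavy-handed scheme, with a learned embedding per absolute position added to the word embedding; all of the sizes are invented.

```python
import torch
import torch.nn as nn

# Absolute positional embeddings added to word embeddings (the "heavy-handed" scheme).
vocab_size, max_len, dim = 30_000, 512, 64
word_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)       # one learned vector per position

def embed(token_ids):
    positions = torch.arange(token_ids.size(0))        # 0, 1, 2, ...
    return word_emb(token_ids) + pos_emb(positions)    # same word, different vector per position

print(embed(torch.tensor([101, 7592, 2088, 102])).shape)   # torch.Size([4, 64])
```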
Then of course, one of the major guiding ideas behind 00:21:21.620 |
all of this is simply massive scale pre-training. 00:21:25.500 |
This is an idea that was unlocked by the distributional hypothesis, 00:21:29.980 |
which had the insight that we don't need to write 00:21:32.100 |
hand-built feature functions but rather we can just rely on 00:21:38.140 |
which words are appearing with which other words. 00:21:40.980 |
It really comes into its own in the neural era with 00:21:44.520 |
models like Word2Vec and GloVe, as I mentioned. 00:21:50.660 |
Then we got ELMo, and I mentioned before that that was the eye-opening moment when we 00:21:53.540 |
saw that we could learn contextual representations at 00:21:57.100 |
scale and have that transfer into tasks we wanted to fine-tune for. 00:22:08.460 |
Then you get, at the end of this little history here, 00:22:11.780 |
the GPT-3 paper which applied this massive scale idea at 00:22:16.300 |
a level that was previously unimagined and unimaginable, 00:22:20.180 |
and that really did introduce a phase change in research as we 00:22:23.780 |
started to deal with these truly massive models. 00:22:39.300 |
The last guiding idea I want to trace is our changing notion of fine-tuning and its corresponding notion of pre-training. 00:22:44.340 |
In 2016-2018, the notion of pre-training that we had was 00:22:49.300 |
essentially that we would feed static word representations 00:22:56.520 |
into a model like an RNN; that model would then be fine-tuned and it would learn a bunch of stuff. 00:22:59.100 |
When we fast-forward to the BERT era in 2018, 00:23:02.800 |
we start to get fine-tuning of contextual models. 00:23:05.980 |
Here is just a bit of code of the sort that you will write in this course, 00:23:13.140 |
where we load pre-trained BERT parameters and fine-tune them essentially to be a classifier. 00:23:16.820 |
That started in 2018, and I think it continues. 00:23:19.940 |
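The slide's code isn't preserved in the transcript, but a sketch of that kind of BERT-era fine-tuning, using the Hugging Face transformers library as an assumed toolkit, might look like this:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load pre-trained BERT parameters and put a fresh classifier head on top.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One toy batch; in practice you would loop over a labeled dataset.
batch = tokenizer(["The vase broke.", "We broke even."], padding=True, return_tensors="pt")
labels = torch.tensor([0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss   # cross-entropy over the classifier head
loss.backward()
optimizer.step()
```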
We might be headed into an era in which most of what we do involves interacting with 00:23:24.980 |
these massive language models that we mostly don't have access to. 00:23:28.460 |
We can't write code as in the BERT case there, 00:23:31.780 |
but rather we just call an API, and through some process that is only partially understood, we coax 00:23:41.500 |
one of these large models to do what we want to do. 00:23:47.760 |
It's very powerful analytically and technologically, 00:23:51.060 |
and this is surely part of your future as an NLP-er as well.