[Paper Club] BERT: Bidirectional Encoder Representations from Transformers
00:00:20.360 |
- Yeah, 'cause I'm working on a text classification problem 00:00:39.040 |
that only mirrors a structured output GPT-4o call 00:00:44.040 |
and just mirrors it until it has enough data for BERT 00:01:09.360 |
but I don't think it's as automatic and as seamless 00:01:16.200 |
people suggest using BERT for classification, 00:01:18.840 |
it's cheap, like it's a path that is worth taking. 00:01:32.680 |
you have endpoints and hosted model and all that. 00:01:36.520 |
So to do BERT, like you gotta go deploy it somewhere 00:01:56.200 |
and then there's additional material out there on BERT. 00:02:05.720 |
we can look at some of the other things out there. 00:02:15.440 |
bi-directional encoder representations from transformers. 00:02:19.400 |
So this is one of the first transformer papers, 00:02:34.600 |
So ancient history in terms of deep learning and NLP, 00:03:06.480 |
And so it provided context for search results 00:03:11.480 |
so that there's some examples maybe we can look at later, 00:03:20.240 |
that could mean a couple of different things, 00:03:22.800 |
they use this model to discriminate between the two. 00:03:26.560 |
So let's, I guess, just walk down through the paper here. 00:03:50.600 |
And so models like ELMo were feature-based 00:04:15.880 |
and then you could fine tune it after the fact 00:04:29.360 |
one of the limitations of standard language models 00:04:49.320 |
and try to predict the next word in the sequence. 00:05:02.080 |
it can also start at say the end of a piece of text 00:05:07.080 |
And so we'll talk a little bit how they avoid 00:05:16.680 |
because obviously if you're training from back to front, 00:05:19.800 |
you get a peek at what the words are coming up. 00:05:31.880 |
So a lot of what they reference is like RNNs, GRUs, LSTMs. 00:05:40.920 |
"Crazy idea if you look at it from front to back." 00:05:51.360 |
We listen to the whole sentence, then we classify. 00:05:53.880 |
So this was more so like LSTM, RNN era, yeah. 00:06:09.440 |
So let's see, we talked about bi-directional. 00:06:12.720 |
And then, yeah, so it's fine-tuned versus feature-based. 00:06:28.360 |
I guess one call out on the related work is just ELMo. 00:06:41.600 |
If anyone knows for sure, feel free to correct. 00:07:00.320 |
you could say that means I'm going to chase a dog 00:07:07.960 |
hey, let's stick to the material that we're talking about, 00:07:14.840 |
And so those can mean different things, the same token. 00:07:20.640 |
they want to use like different representations, 00:07:33.040 |
ELMo is from Allen Institute and University of Washington. 00:07:44.680 |
RoBERTa is like BERT, but make it good and bigger. 00:07:51.320 |
at University of Washington and Facebook, I think. 00:07:54.320 |
But ELMo was just, yeah, it wasn't from Google, 00:07:58.480 |
It went from like one hot encoding bag of words 00:08:15.520 |
So let's take a stop here to look at this diagram. 00:08:32.480 |
Essentially, you can see this pink row down here 00:08:44.320 |
I'm not sure why I can't zoom in while I'm sharing, 00:08:56.360 |
but anyway, this token is a classifier token. 00:09:13.160 |
And then there are other tokens, one through M. 00:09:39.120 |
And then you see, so this is a pre-trained model. 00:09:45.240 |
So this is trained on, well, what was at that point 00:09:51.640 |
And then these are the different fine tunes of that model 00:10:02.240 |
So if there's anything in there, I'm not seeing it. 00:10:07.160 |
or if someone just wants to unmute, feel free. 00:10:16.840 |
the classifier token and then the separator token. 00:10:26.560 |
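For reference, here's a minimal sketch of how those two special tokens show up when you encode a sentence pair with the Hugging Face tokenizer (the checkpoint name and example sentences are just placeholders, not from the talk):

```python
from transformers import AutoTokenizer

# Standard BERT WordPiece tokenizer (bert-base-uncased assumed here).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair puts [CLS] at the front and [SEP] after each sentence.
encoded = tokenizer("The dog chased the ball.", "It ran across the yard.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'dog', ..., '[SEP]', 'it', 'ran', ..., '[SEP]']

# token_type_ids marks which tokens belong to sentence A (0) vs. sentence B (1).
print(encoded["token_type_ids"])
```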
So they talk about how there's two steps in the framework, 00:10:33.840 |
Probably a lot of people are familiar with that. 00:10:43.960 |
So this is 2019 numbers of what was at that point 00:10:51.080 |
You can see the base model was 110 million parameters 00:11:02.240 |
or like unbelievably large or something like that, 00:11:06.040 |
is 340 million parameters, which is, you know, 00:11:34.640 |
Well, there's two training sets that they used for it. 00:11:40.800 |
One was all of Wikipedia, the English version, 00:11:58.680 |
while they could have been large for the time, 00:12:03.920 |
Typically, like at least for frontier models, 00:12:08.240 |
you're talking about low trillions of tokens to train them. 00:12:13.760 |
So here they talk about what I mentioned earlier 00:12:25.360 |
in one token sequence with the separator token 00:12:30.200 |
And so let's go down and talk about pre-training. 00:12:39.800 |
And here we get to their answer to the left-to-right 00:12:56.160 |
potentially each word could see itself in the future. 00:13:10.600 |
so instead of having the actual word in the sequence, 00:13:24.920 |
and do the, you know, score the training on that. 00:13:44.560 |
they have to compensate for that in the pre-training step. 00:14:11.160 |
And so that helps during the fine-tuning stage 00:14:15.000 |
so that fine-tuning doesn't expect these mask tokens to show up in real inputs. 00:14:20.520 |
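As a reference point, the paper's masking recipe picks 15% of token positions as prediction targets, and of those, 80% become [MASK], 10% become a random token, and 10% are left unchanged. A rough sketch of that selection logic (illustrative only, not the authors' code):

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Rough sketch of BERT-style masking: pick ~15% of positions as prediction
    targets, then replace 80% of those with [MASK], 10% with a random token,
    and leave 10% unchanged, so the model can't rely on [MASK] always appearing."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # positions the loss is computed on
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            labels[i] = tok                # remember the original word to predict
            roll = random.random()
            if roll < 0.8:
                masked[i] = "[MASK]"
            elif roll < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token at this position
    return masked, labels
```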
And then the other task they give it during pre-training 00:14:42.080 |
that's the sentence A, separator, sentence B. 00:14:47.400 |
And then they have the 50% split of training data, where half the time sentence B actually follows sentence A and half the time it's a random sentence. 00:15:28.320 |
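A rough sketch of how those next-sentence-prediction pairs get built (illustrative helper, not the authors' code):

```python
import random

def make_nsp_pair(doc_sentences, all_sentences):
    """Rough sketch of next-sentence-prediction data: half the time sentence B
    really follows sentence A in the document (label IsNext), half the time it's
    a random sentence from elsewhere in the corpus (label NotNext)."""
    i = random.randrange(len(doc_sentences) - 1)   # needs at least two sentences
    sentence_a = doc_sentences[i]
    if random.random() < 0.5:
        return sentence_a, doc_sentences[i + 1], "IsNext"
    return sentence_a, random.choice(all_sentences), "NotNext"
```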
And so this part of the pre-training figures out 00:16:12.320 |
So it's the vector representation of that particular word. 00:16:20.080 |
So this splits it up between sentence A and sentence B. 00:16:34.520 |
And then finally there's the positional embedding, 00:16:46.600 |
which, passed along to the other layers, helps the model distinguish where each token sits in the sequence. 00:17:12.960 |
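A minimal sketch of that summation, using the BERT-base sizes (hidden size 768, max length 512); the real model also applies LayerNorm and dropout at this point:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768     # BERT-base values

token_emb    = nn.Embedding(vocab_size, hidden)   # one vector per WordPiece token
segment_emb  = nn.Embedding(2, hidden)            # sentence A (0) vs. sentence B (1)
position_emb = nn.Embedding(max_len, hidden)      # learned absolute positions

def embed(input_ids, token_type_ids):
    # The three embeddings are summed element-wise before the encoder layers.
    positions = torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)
    return token_emb(input_ids) + segment_emb(token_type_ids) + position_emb(positions)
```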
We'll just maybe take a look at some of the results 00:17:22.840 |
let me take a break to look at the chat here. 00:17:49.520 |
- So I didn't look at any of the following BERT papers 00:17:58.440 |
So I'm not sure if there's anyone else on the call 00:18:06.720 |
- Yeah, we had a slight follow-up in the comments there. 00:18:14.880 |
Basically for encoder decoder stuff, it still makes sense. 00:18:22.200 |
was they had really interesting pre-training tasks. 00:18:31.600 |
Like next token prediction is still useful to generate words 00:18:41.280 |
Should sentence one come before sentence two or whatever? 00:18:45.960 |
There's never a time where there's people trying to predict 00:18:51.280 |
but it does really teach a model conceptually 00:18:57.320 |
So there's these words and there's these words. 00:18:59.800 |
You have to understand, should these sets of words 00:19:05.640 |
where you have to, like, group words together. 00:19:08.240 |
So in some sense for the tasks that BERT is trying to do 00:19:13.240 |
it's trying to be a small efficient classification model 00:19:16.600 |
as one of the tasks it's trying to do, right? 00:19:19.320 |
It kind of makes sense to do these weird training objectives. 00:19:22.160 |
So next token prediction or like masking of words 00:19:33.880 |
but it does teach a model like over a billion words. 00:19:45.360 |
So it's like, instead of having a very broad task 00:19:54.600 |
like get this emergent capability to do classification. 00:20:03.080 |
You're a small model, use all this to do classification 00:20:11.800 |
and small models that are not just next token prediction. 00:20:19.000 |
to subsets of your like main goal, it's still very effective. 00:20:25.440 |
and look at what BERT did, it makes no sense. 00:20:28.360 |
Like Google doesn't need to spend millions of dollars 00:20:32.880 |
or after another sentence, but it does help a small model 00:20:46.240 |
Like the sentence, the sentence prediction didn't seem 00:20:53.960 |
There's not too many use cases where that would be helpful. 00:21:13.360 |
like breaking down the problem and understand word order. 00:21:17.680 |
I think they also did something where they swapped words 00:21:20.640 |
from different sentences or swap sentences, right? 00:21:23.360 |
And like, that's even more useless in reality. 00:21:34.960 |
But once again, it helps the model generalize 00:21:41.400 |
It's just, if you look at it in today's sense, 00:21:51.880 |
you wanna start to employ breaking down problems 00:21:56.800 |
But there's examples of papers that do this type of work. 00:22:06.800 |
Would be great to follow on this presentation 00:22:09.960 |
with some of the more recent work that kind of builds on it. 00:22:14.080 |
- There are not many other questions in chat, by the way. 00:22:22.880 |
You can see from this, at least when it was released, 00:22:28.360 |
BERT-Large was state-of-the-art, even beating out GPT-1. 00:22:53.760 |
of it being state-of-the-art in a lot of things. 00:23:15.720 |
So here's a couple of different ablations they did 00:23:20.240 |
was they removed the next sentence prediction task. 00:23:25.240 |
So I guess this is something we were just talking about, 00:23:40.280 |
And they also have the no sentence prediction. 00:23:44.680 |
And so you can see the results from those attempts up here. 00:23:57.280 |
And then if you look at the no next sentence prediction, 00:24:11.280 |
But then as you also take away the bidirectional, 00:24:30.640 |
Oh yeah, maybe this is what I was talking about, 00:24:58.680 |
as far as the extreme model size at this point. 00:25:16.880 |
Otherwise, let's go over to and look through. 00:25:19.600 |
There's Jay Alammar, who has made some very helpful, 00:25:34.720 |
So that's just kind of a comparison of some models 00:25:41.120 |
And then this is one thing that was a takeaway for me, 00:25:55.200 |
you basically stick a classifier after the BERT. 00:26:04.960 |
And then you use that classifier to then classify. 00:26:17.000 |
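Roughly what "sticking a classifier on top" looks like with the transformers library; the checkpoint name and label count are placeholders, and the linear head is randomly initialized until you fine-tune it:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Loads pre-trained BERT plus a fresh linear classification head on the [CLS] output.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("this movie was great", return_tensors="pt")
logits = model(**inputs).logits   # shape (1, num_labels); meaningless until fine-tuned
```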
There's one diagram that I thought was especially helpful. 00:26:46.320 |
That one contains essentially the entire sense 00:27:05.760 |
and I think there's like something like 768 dimensions 00:27:39.200 |
So as we mentioned earlier, BERT is encoder only. 00:27:55.720 |
And so encoder is like used mostly these days 00:28:08.200 |
To my knowledge, there's encoder only transformers 00:28:18.320 |
like sequence generation or next token generation. 00:28:21.680 |
This talks about ELMo and the different context of words 00:28:47.400 |
Yeah, so this is just like what we talked about 00:29:02.000 |
you can stick another model for training on the end of it. 00:29:07.960 |
And then you can also use BERT for embedding. 00:29:23.880 |
you can continue pre-training or do fine tuning on BERT 00:29:31.360 |
with your corpus, your industry-specific text corpus, 00:29:36.360 |
and then create an encoder that's especially built for your domain. 00:30:03.880 |
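One possible sketch of that domain-adaptive continued pre-training with the transformers Trainer; the corpus, output directory, and training arguments are placeholders, and a real run would use a proper dataset rather than an in-memory list:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Placeholder stand-in for an industry-specific text corpus.
domain_texts = ["example domain sentence one.", "example domain sentence two."]
train_dataset = [tokenizer(t, truncation=True, max_length=128) for t in domain_texts]

# The collator applies the same 15% masked-LM objective to your own corpus.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-domain-mlm", num_train_epochs=1),
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()   # continued pre-training on the domain corpus
```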
They had a BERT-based model, a masked language model, 00:30:08.080 |
token classification, QA, sequence classification. 00:30:13.360 |
And basically what they did was take BERT models 00:30:16.360 |
with a layer added on top for a classification head. 00:30:25.360 |
what was really common to do was you could either, 00:30:31.840 |
add in a linear output head for classification, 00:30:34.920 |
where you basically take all this, there's no output head. 00:30:46.120 |
Now, then you fine tune it on a lot of your data itself. 00:30:57.920 |
you just continue fine tuning it on your data 00:31:00.200 |
and it's already somewhat good at sequence classification. 00:31:06.400 |
that looked into based on how much data you have, 00:31:14.760 |
it was pretty common to not only add a classification head, 00:31:26.320 |
and then continue training those in as well for your task. 00:31:29.280 |
Because at some level, what people started to learn 00:31:34.840 |
of like masked word prediction and sentence ordering or QA, 00:31:46.000 |
you could just train more of your whole model on that. 00:31:59.160 |
You could, if you have less data, freeze layers, 00:32:08.480 |
with the architecture and add a classification head. 00:32:11.160 |
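A sketch of that "freeze most of the model, train the head (and maybe the top layers)" approach; the layer counts here are just an example:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the pre-trained encoder so only the classification head gets updated.
for param in model.bert.parameters():
    param.requires_grad = False

# With more labeled data, unfreeze the top few of BERT-base's 12 layers as well.
for layer in model.bert.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True
```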
- Do you know if anyone trained all the layers 00:32:27.600 |
where basically you could retrain BERT in 24 hours 00:32:31.200 |
on like a budget of like sub $500 with regular A100s 00:32:54.000 |
There's like an academia paper that came out about this. 00:33:03.000 |
of how they can better objectify these 12 pre-training tasks 00:33:10.560 |
and outperform it on a couple of hundred dollars in 24 hours. 00:33:15.080 |
where like there was a sentence classification 00:33:19.880 |
and sentence extraction tasks that BERT was adapted towards. 00:33:25.760 |
or extraction for like abstractive summarization. 00:33:28.400 |
And then companies that took it to production 00:33:46.800 |
Yeah, I mean that sounds like interesting stuff. 00:33:55.160 |
please drop in the chat and I'll check it out. 00:33:57.520 |
So maybe the last thing we can go through here 00:34:14.120 |
how to do this movie review sentiment classification. 00:34:24.080 |
So DistilBERT is Hugging Face's smaller, distilled version of BERT 00:34:41.400 |
he just uses a basic logistic regression model 00:35:11.120 |
So let's see, that's just installing it, doing imports. 00:35:15.880 |
He uses, he must've mentioned it up above, but a... 00:35:29.240 |
Anyway, there's a particular HuggingFace dataset 00:35:32.480 |
that he's using that has the movie sentiment training data. 00:35:36.480 |
Maybe he just uploaded it somewhere, he had it. 00:35:58.160 |
one of the big things that made BERT somewhat popular 00:36:13.880 |
if you have stuff like, "I hate this so much," 00:36:18.880 |
that in some contexts in tweets could still be positive, 00:36:24.240 |
And when you just look at lexical understanding of words, 00:36:40.520 |
were versions that they used in a lot of these demos. 00:36:53.520 |
So here we're just uploading or downloading BERT, 00:37:06.240 |
So you can see it's just a few lines of code there. 00:37:09.040 |
So we have to do a few things like tokenize it 00:37:26.960 |
So we need to pad out so that they're all the same length. 00:37:30.320 |
And then we need to mark the padded sections as masked 00:37:37.120 |
so that BERT doesn't get confused into thinking the padding is real text. 00:37:55.520 |
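Roughly what that tokenize / pad / mask step looks like with the Hugging Face tokenizer (checkpoint name and example strings are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize a small batch, pad to a common length, and get the attention mask
# that tells the model which positions are real tokens and which are padding.
batch = tokenizer(
    ["a touching, well-acted film", "utterly boring"],
    padding=True,       # pad shorter sequences up to the longest in the batch
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"])       # padded token ids
print(batch["attention_mask"])  # 1 for real tokens, 0 for the padded positions
```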
And again, apologies, I'll try to zoom in again. 00:38:15.120 |
And then the one tricky part about all of this 00:39:08.080 |
You just turn those 768 dimensions into features 00:39:39.320 |
which is a lot better than the roughly 50% you'd expect just from random chance, 00:39:51.280 |
but obviously still plenty of room for improvement. 00:39:55.680 |
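A condensed sketch of that recipe: run DistilBERT, take the hidden state at the first ([CLS]) position as a 768-dimensional feature vector, and fit a scikit-learn logistic regression on top. The two example reviews below are stand-ins for the real movie-review dataset:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

texts = ["a touching, well-acted film", "utterly boring"]   # stand-ins for the review data
labels = [1, 0]                                             # 1 = positive, 0 = negative

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state               # (batch, seq_len, 768)

# Use the first position's vector as a fixed sentence embedding, then classify.
features = hidden[:, 0, :].numpy()
classifier = LogisticRegression().fit(features, labels)
print(classifier.predict(features))
```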
Okay, and down here, it says the highest accuracy score 00:40:30.920 |
So that's about all I had for prepared stuff. 00:40:39.280 |
I'll stop sharing and then maybe go through the chat. 00:40:48.600 |
add any color commentary, feel free to do that. 00:40:52.800 |
- Someone linked the paper on the academic budget, 00:41:07.360 |
where apparently MosaicML showed how you can do it for $20 00:41:12.560 |
So yeah, $20, they did like eight A100s for an hour, 00:41:17.560 |
and they're able to match the GLUE score of basic BERT 00:41:25.480 |
So a note Eugene Cha made is researchers felt 00:41:35.320 |
BERT, all these things that compare like 24-hour, 00:41:46.960 |
which is like at the time, let's say $8 to $10 an hour, 00:41:59.160 |
The BERT-large was trained for more than four days. 00:42:05.040 |
10K on the regular BERT-base, less on the little one. 00:42:08.440 |
And then you got to add in like the time and the R&D. 00:42:12.280 |
Oh, it was well more than a 10K project at Google. 00:42:16.400 |
The BERT-large itself was already a 50K train run, 00:42:20.080 |
plus 10 to 15 for BERT-base, plus just experimentation. 00:43:13.880 |
Because this is regarding the embedding size, right? 00:43:18.840 |
Even though I joke about this was the era before the GPU, 00:43:26.400 |
hey, you need to be divisible by 64, 32, or a power of two, right? 00:43:33.280 |
Do TPUs not have the divisible-by-64 batch constraint? 00:43:44.440 |
That's why they have all these weird embedding 00:43:56.520 |
where I dug through why they specifically did 768 and 512. 00:44:02.240 |
And also someone noted in chat that's a limitation. 00:44:09.760 |
to like SentenceBERT that extends the embedding dimension. 00:44:18.560 |
But back to Eugene's point of, is it hardware limitation? 00:44:28.440 |
I really can't remember the specifics of the reasons. 00:44:38.600 |
And there was a decent range reason for why all this. 00:44:43.160 |
It's also like there's 12 layers, and 768 is divisible by 12. 00:44:56.760 |
through every layer and all of this math working out. 00:44:59.200 |
And then a reason for like, oh, here's why this, 00:45:08.520 |
It's in some notes from a couple of years ago. 00:45:31.680 |
I don't know if there's been enough research in that area 00:45:41.680 |
So I, like when I was researching this presentation, 00:45:51.840 |
or something like that and looked at all the top papers 00:45:59.800 |
And like a lot of them were from 2021 or earlier. 00:46:05.880 |
And so it seems like this direction of like encoder only 00:46:14.320 |
well, I don't know about the bi-directional part, 00:46:20.920 |
So for example, I don't know if anyone's spending 00:46:43.920 |
I think it was all led by these LLMs transformed into embedding models, 00:46:51.240 |
like a Mistral 7B turned into an embedding model. 00:47:21.920 |
to higher performance with this classification, 00:47:26.040 |
but I'd have to do a little bit more research on that. 00:47:39.920 |
directly showed why decoder only token prediction 00:47:54.520 |
You're just straight training on less, right? 00:47:59.120 |
you can mask 15% of them and train on learning the 15%, 00:48:06.640 |
Now, if you have 15 trillion tokens versus 1 trillion, 00:48:15.560 |
the better trade-off scaling curve at the start 00:48:19.360 |
for encoder learns better with less tokens at first, 00:48:23.960 |
but then extending this out in pure scaling laws, 00:48:28.760 |
yeah, you lose a lot of your training data, right? 00:48:35.840 |
because it scales better than other tasks, right? 00:48:38.760 |
So scaling laws were made to show better objectives, 00:48:51.120 |
like you can deploy a BART as a guardrail live, 00:49:04.800 |
There was another part to this that I'm blanking on. 00:49:10.120 |
they're doing encoder decoder generation models 00:49:12.880 |
where they're adding decoder heads to encoders, 00:49:15.200 |
and they're scaling them up to billions of parameters. 00:49:38.520 |
is we need to accumulate some good training data. 00:49:45.360 |
to like actually train a BART or that type of model. 00:49:51.520 |
So to start with, we're just using LLMs and prompts 00:50:01.920 |
to like kind of bootstrap until we get enough data. 00:50:12.920 |
so that we'll have enough like solid training data 00:50:18.880 |
The main purpose of it being faster performance 00:51:07.240 |
I'll look back at anything that I'm interested in. 00:51:28.280 |
I guess like we talked a little bit about embedding papers, 00:51:56.600 |
but I think I remember we went through one embedding paper. 00:52:01.480 |
Maybe it's NOMIC, maybe it's NOMIC, I don't know. 00:52:33.440 |
- Okay, well, I will volunteer for Nomic or Jina. 00:52:49.720 |
If anyone else has papers they wanna cover in the meantime, 00:52:55.720 |
but otherwise I don't wanna drag this too long.