
[Paper Club] BERT: Bidirectional Encoder Representations from Transformers



00:00:00.000 | Let's go.
00:00:00.840 | I'll kill time while Eric figures out
00:00:10.600 | how to share his computer.
00:00:13.260 | So why did we pick BERT today?
00:00:17.280 | I'm kind of curious.
00:00:20.360 | - Yeah, 'cause I'm working on a text classification problem
00:00:25.380 | at work and just wanna get the background.
00:00:28.440 | - Okay.
00:00:29.280 | - Yeah, I wonder if there is a microservice
00:00:35.040 | that is easy to set up.
00:00:37.040 | I wonder if OpenPipe does this,
00:00:39.040 | that only mirrors a structured output GPT-4o call
00:00:44.040 | and just mirrors it until it has enough data for BERT
00:00:50.000 | and then just switches you to BERT.
00:00:51.760 | - What do you mean by mirrors?
00:00:56.880 | - Shadows it.
00:00:58.880 | - Basically, you just use it at the start
00:01:01.840 | until you have enough data and then do URLs
00:01:04.080 | and then just swap it, so it's cheap.
00:01:06.360 | - Yep.
00:01:07.720 | - I think OpenPipe actually does that,
00:01:09.360 | but I don't think it's as automatic and as seamless
00:01:11.520 | as how your vision is.
00:01:14.520 | - Yeah, I mean, like the amount of times
00:01:16.200 | people suggest using BERT for classification,
00:01:18.840 | it's cheap, like it's a path that is worth taking.
00:01:21.600 | Oh, Eric's rejoining, okay.
00:01:24.440 | - BART and T5 are good,
00:01:26.680 | but you also have to host them.
00:01:29.640 | Like, one thing is just switching models,
00:01:30.960 | but with OpenAI or a router,
00:01:32.680 | you have endpoints and a hosted model and all that.
00:01:36.520 | So to use BERT, you've gotta go deploy it somewhere
00:01:39.840 | and handle that part.
00:01:42.160 | - Yeah.
00:01:44.000 | All right, Eric, we can see you.
00:01:47.600 | - Okay, perfect.
00:01:48.880 | So we'll start by going through the paper
00:01:56.200 | and then there's additional material out there on BERT.
00:02:01.200 | So depending how much time we have,
00:02:05.720 | we can look at some of the other things out there.
00:02:09.080 | So first of all, BERT stands for
00:02:15.440 | Bidirectional Encoder Representations from Transformers.
00:02:19.400 | So this is one of the first transformer papers,
00:02:24.560 | like I believe it was about a year after
00:02:28.520 | the Attention Is All You Need paper.
00:02:31.840 | You can see it's from 2019.
00:02:34.600 | So ancient history in terms of deep learning and NLP,
00:02:39.600 | but still useful for a lot of use cases.
00:02:46.600 | And just for a little bit more context.
00:02:53.560 | So after, so this was a Google paper
00:02:58.560 | and after they trained and released BERT,
00:03:02.240 | they started using it in Google search.
00:03:06.480 | And so it provided context for search results
00:03:11.480 | so that there's some examples maybe we can look at later,
00:03:16.960 | but it would, if you gave like a query
00:03:20.240 | that could mean a couple of different things,
00:03:22.800 | they use this model to discriminate between the two.
00:03:26.560 | So let's, I guess, just walk down through the paper here.
00:03:34.240 | One thing to note is that there is this,
00:03:42.840 | two different kinds of approaches early on,
00:03:47.520 | feature-based and fine tuning.
00:03:50.600 | And so models like ELMo were feature-based,
00:03:55.600 | where they had task-specific architectures
00:04:00.600 | that were built into the model.
00:04:03.080 | Whereas models like BERT and GPT
00:04:07.520 | used more of a fine-tuning approach:
00:04:10.720 | they're more general purpose,
00:04:13.960 | they just trained one model,
00:04:15.880 | and then you could fine-tune it after the fact
00:04:18.520 | for your particular tasks.
00:04:22.440 | They talk a little bit here about the,
00:04:29.360 | one of the limitations of standard language models
00:04:33.560 | is that they're unidirectional.
00:04:36.120 | So what that means is for like GPT
00:04:39.600 | and many other models out there now,
00:04:44.600 | they only look forward.
00:04:46.640 | And so they take a sequence of words
00:04:49.320 | and try to predict the next word in the sequence.
00:04:53.080 | However, BERT also goes backwards.
00:04:57.520 | So it takes like in the training data,
00:05:02.080 | it can also start at say the end of a piece of text
00:05:04.960 | and try to predict the previous word.
00:05:07.080 | And so we'll talk a little bit how they avoid
00:05:12.960 | like contamination of the prediction,
00:05:16.680 | because obviously if you're training from back to front,
00:05:19.800 | you get a peek at what the words are coming up.
00:05:23.880 | - So a lot of this section
00:05:27.920 | was pre-decoder only transformers, right?
00:05:31.880 | So a lot of what they reference is like RNNs, GRUs, LSTMs.
00:05:36.480 | So like pre-transformer stuff,
00:05:38.400 | you only get one pass and then they're like,
00:05:40.920 | "Crazy idea if you look at it from front to back."
00:05:43.800 | Like, so a lot of the training tasks
00:05:46.440 | were like classification, right?
00:05:47.920 | When humans classify something,
00:05:49.720 | we don't do it token by token.
00:05:51.360 | We listen to the whole sentence, then we classify.
00:05:53.880 | So this was more so like LSTM, RNN era, yeah.
00:05:58.880 | - Right.
00:06:01.240 | Yeah, it's a very early paper.
00:06:09.440 | So let's see, we talked about bi-directional.
00:06:12.720 | And then, yeah, so it's fine-tuned versus feature-based.
00:06:22.280 | And then they show the effectiveness of BERT
00:06:25.240 | on 11 different NLP tasks.
00:06:28.360 | I guess one call out on the related work is just ELMo.
00:06:36.600 | So ELMo was another model.
00:06:39.680 | I believe it was also by Google.
00:06:41.600 | If anyone knows for sure, feel free to correct.
00:06:45.840 | But they used different representations
00:06:50.840 | for the same word in different contexts.
00:06:55.320 | And so like the word stick, for example,
00:07:00.320 | you could say that means I'm going to chase a dog
00:07:05.760 | with a stick, or it could be like,
00:07:07.960 | hey, let's stick to the material that we're talking about,
00:07:11.840 | or maybe some other words or contexts.
00:07:14.840 | And so those can mean different things, the same token.
00:07:19.760 | And because of that,
00:07:20.640 | they want to use like different representations,
00:07:23.320 | even though it's the same word in English.
00:07:26.160 | And so BERT leverages that feature as well.
00:07:30.120 | - So slight correction.
00:07:33.040 | ELMo is from Allen Institute and University of Washington.
00:07:37.160 | And then further context.
00:07:38.720 | So one of the best BERT adaptations,
00:07:42.440 | I think in 2019, was RoBERTa.
00:07:44.680 | RoBERTa is like BERT, but make it good and bigger.
00:07:48.440 | And it was also from the same team
00:07:51.320 | at University of Washington and Facebook, I think.
00:07:54.320 | But ELMo was just, yeah, it wasn't from Google,
00:07:57.480 | but same thing.
00:07:58.480 | Things went from one-hot encoding and bag of words
00:08:00.640 | to ELMo, which was really popular
00:08:02.760 | for a bunch of Kaggle competitions
00:08:05.160 | where you needed good embeddings.
00:08:06.920 | And then for embeddings, it's not Google.
00:08:09.720 | - Great, thanks for the additional info.
00:08:15.520 | So let's take a stop here to look at this diagram.
00:08:27.480 | So this is the BERT architecture.
00:08:32.480 | Essentially, you can see this pink row down here
00:08:38.080 | is a set of tokens.
00:08:39.480 | Let me see if I can zoom in here.
00:08:44.320 | I'm not sure why I can't zoom in while I'm sharing,
00:08:56.360 | but anyway, this token is a classifier token.
00:09:01.360 | And then you see like tokens one through N.
00:09:06.800 | So those are the first sentence.
00:09:10.640 | And then there's a separator token.
00:09:13.160 | And then there's another token one through M.
00:09:18.000 | And the reason that BERT has this structure
00:09:22.200 | is because one of the tasks it does
00:09:24.520 | is next sentence prediction.
00:09:29.520 | Basically, it can take two sentences
00:09:32.600 | and determine if one of the sentences
00:09:36.440 | seems like it follows the other one.
00:09:39.120 | And then you see, so this is a pre-trained model.
00:09:45.240 | So this is trained on, well, what was at that point
00:09:49.280 | a large amount of text.
00:09:51.640 | And then these are the different fine tunes of that model
00:09:55.120 | for different benchmarks, essentially.
00:09:58.600 | And FYI, I'm not looking at the chat.
00:10:02.240 | So if there's anything in there, I'm not seeing it.
00:10:04.800 | I can answer questions later
00:10:07.160 | or if someone just wants to unmute, feel free.
00:10:10.040 | So here's those two tokens I mentioned
00:10:14.720 | that you might not have been able to see,
00:10:16.840 | the classifier token and then the separator token.
00:10:20.200 | The classifier token is what the model uses to represent
00:10:24.360 | the whole sequence for classification tasks.
00:10:26.560 | So they talk about how there's two steps in the framework,
00:10:31.640 | pre-training and fine-tuning.
00:10:33.840 | Probably a lot of people are familiar with that.
00:10:38.000 | So I won't go into too much detail.
00:10:41.560 | This is interesting.
00:10:43.960 | So this is 2019 numbers of what was at that point
00:10:49.640 | a large model.
00:10:51.080 | You can see the base model was 110 million parameters
00:10:56.080 | and the large model, which was,
00:10:58.680 | I saw someone referred to as like gigantic
00:11:02.240 | or like unbelievably large or something like that,
00:11:06.040 | is 340 million parameters, which is, you know,
00:11:10.600 | like 5% the size of a Llama 7B
00:11:15.600 | or something along those lines.
00:11:18.000 | So what at the time seemed very large,
00:11:21.600 | in retrospect, isn't really.
00:11:25.160 | And that also goes for the training data.
00:11:29.440 | Do they have it in here?
00:11:34.640 | Well, there's two training sets that they used for it.
00:11:40.800 | One was a, all of Wikipedia, the English version,
00:11:46.320 | which was like 800 million words.
00:11:48.560 | And then I think a set of books,
00:11:51.560 | which was 2.5 billion words.
00:11:53.720 | And again, those data sizes,
00:11:58.680 | while they could have been large for the time,
00:12:01.160 | currently are relatively small.
00:12:03.920 | Typically, like at least for frontier models,
00:12:08.240 | you're talking about low trillions of tokens to train them.
00:12:13.760 | So here they talk about what I mentioned earlier
00:12:18.760 | about having a pair of sentences
00:12:23.240 | where you can have a question and answer
00:12:25.360 | in one token sequence with the separator token
00:12:29.280 | in the middle.
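A minimal sketch of what that packed two-segment input looks like with today's HuggingFace tokenizer; the checkpoint name and the example sentences are illustrative assumptions, not from the paper:

```python
from transformers import AutoTokenizer

# Illustrative only: BERT packs two segments into a single sequence as
# [CLS] segment A [SEP] segment B [SEP], plus segment (token type) ids.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Where can I buy a burrito?", "The store down the street sells them.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'where', 'can', 'i', 'buy', 'a', 'burrito', '?', '[SEP]',
#  'the', 'store', 'down', 'the', 'street', 'sells', 'them', '.', '[SEP]']
print(enc["token_type_ids"])  # 0s for segment A, 1s for segment B
```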
00:12:30.200 | And so let's go down and talk about pre-training.
00:12:39.800 | And here we get to their answer to the left-to-right
00:12:44.800 | versus right-to-left dilemma.
00:12:53.280 | So because it's bi-directional,
00:12:56.160 | potentially each word could see itself in the future.
00:13:01.160 | And so what they ended up doing
00:13:04.600 | was to mask 15% of the words
00:13:10.600 | so instead of having the actual word in the sequence,
00:13:15.600 | they have this mask token.
00:13:18.200 | And then they try to predict that word
00:13:22.720 | either going forward or backward
00:13:24.920 | and do the, you know, score the training on that.
00:13:31.760 | They do have an issue that, let's see,
00:13:39.520 | that since the mask token does not appear
00:13:43.240 | during fine tuning,
00:13:44.560 | they have to compensate for that in the pre-training step.
00:13:51.200 | So in order to do that,
00:13:52.320 | they don't always replace the mask words
00:13:55.200 | with this mask token.
00:13:57.320 | They do 80% of the time; 10% of the time,
00:14:02.120 | they just stick in a random token
00:14:04.480 | from the vocabulary.
00:14:06.600 | And then another 10% of the time,
00:14:08.800 | they just leave it unchanged.
00:14:11.160 | And so that helps during the fine tuning stage
00:14:15.000 | so that fine tuning doesn't expect these mask tokens
00:14:19.320 | like all the time.
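A rough sketch of that 80/10/10 masking rule, assuming a plain list of tokens and a toy vocabulary (the real pipeline works on WordPiece ids in batches):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking sketch: pick ~15% of positions as prediction targets,
    then replace 80% of them with [MASK], 10% with a random token, 10% unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok          # the model is trained to predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = random.choice(vocab)
            # else: leave the token unchanged
    return inputs, labels

# toy example
print(mask_tokens("i walked down to the store".split(), vocab=["dog", "store", "the", "walk"]))
```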
00:14:20.520 | And then the other task they give it during pre-training
00:14:25.720 | is this next sentence prediction.
00:14:28.640 | That is the, you saw the two sentences
00:14:35.320 | and whether they are related.
00:14:37.080 | So that is, you know,
00:14:42.080 | that's the sentence A, separator, sentence B.
00:14:47.400 | And then they have the 50% split of training data
00:14:56.120 | where half of it is labeled as is next
00:15:01.960 | and half of it is labeled as not next.
00:15:04.360 | So you can imagine a initial sentence
00:15:09.360 | of I walked down to the store.
00:15:13.040 | One that might be marked as is next
00:15:17.360 | is something like I bought a burrito
00:15:20.840 | and something that might be not next is,
00:15:24.960 | you know, salamanders have many spots
00:15:27.160 | or something like that.
00:15:28.320 | And so this part of the pre-training figures out
00:15:32.120 | which of those sentences follow each other.
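A small sketch of how those IsNext / NotNext training pairs could be built; the helper name and the sentences are made up for illustration:

```python
import random

def make_nsp_pair(doc_sentences, all_sentences):
    """Build one next-sentence-prediction example: 50% of the time the true
    following sentence (IsNext), 50% a random sentence from the corpus (NotNext)."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        return sent_a, doc_sentences[i + 1], "IsNext"
    return sent_a, random.choice(all_sentences), "NotNext"

doc = ["I walked down to the store.", "I bought a burrito."]
corpus = doc + ["Salamanders have many spots."]
print(make_nsp_pair(doc, corpus))
```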
00:15:35.000 | Here, oh, okay.
00:15:41.080 | So here's what I was talking about earlier,
00:15:42.880 | the book corpus, 800 million words.
00:15:44.600 | And I guess I had them flipped Wikipedia
00:15:46.680 | as the 2.5 billion words.
00:15:49.360 | And this shows the input representation.
00:15:58.360 | So there's a couple of different layers
00:16:02.800 | and they just add these layers up
00:16:04.960 | to come up with the input.
00:16:07.320 | So the top layer here is the token.
00:16:12.320 | So it's the vector representation of that particular word.
00:16:16.960 | The next is the segment.
00:16:20.080 | So this splits it up between sentence A and sentence B.
00:16:24.840 | Each one of those has a different vector
00:16:28.680 | that gets added into the input
00:16:31.840 | to help the model distinguish between them.
00:16:34.520 | And then finally there's the positional embedding.
00:16:36.680 | So each place in the sequence
00:16:41.680 | has its own value that, when added
00:16:46.600 | to the other layers, helps the model distinguish
00:16:52.600 | whether "my" comes before "dog" or vice versa.
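A sketch of that sum of embeddings in PyTorch, assuming BERT-base sizes; the real model also applies LayerNorm and dropout after the sum:

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sketch of BERT's input layer: three embedding tables whose outputs are summed."""
    def __init__(self, vocab_size=30522, max_positions=512, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)        # one vector per WordPiece token
        self.segment = nn.Embedding(2, hidden)                # sentence A vs. sentence B
        self.position = nn.Embedding(max_positions, hidden)   # one vector per position

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)

emb = BertInputEmbeddings()
ids = torch.tensor([[101, 2026, 3899, 102]])   # toy ids standing in for [CLS] my dog [SEP]
segs = torch.zeros_like(ids)                   # all segment A
print(emb(ids, segs).shape)                    # torch.Size([1, 4, 768])
```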
00:16:57.600 | Okay, and then they talk about a bunch
00:17:08.520 | of different benchmarks.
00:17:12.960 | We'll just maybe take a look at some of the results
00:17:18.800 | and then continue forward.
00:17:21.480 | I'll also look at the,
00:17:22.840 | let me take a break to look at the chat here.
00:17:27.280 | - Any other serious questions in chat?
00:17:31.040 | I guess Sam had one.
00:17:33.280 | Has there been any revisiting
00:17:35.640 | of the weird pre-training tasks
00:17:37.560 | that were used in BERT?
00:17:39.320 | Or has this all been done
00:17:40.840 | and it's somewhat of a closed book,
00:17:43.000 | and next token prediction is all you need?
00:17:45.880 | Any thoughts, comments?
00:17:49.520 | - So I didn't look at any of the follow-up BERT papers
00:17:53.960 | like RoBERTa or DeBERTa for this.
00:17:58.440 | So I'm not sure; if there's anyone else on the call
00:18:03.280 | who knows, I'd be happy for a contribution.
00:18:06.720 | - Yeah, we had a slight follow-up in the comments there.
00:18:14.880 | Basically for encoder decoder stuff, it still makes sense.
00:18:18.200 | And a lot of my takeaway from this paper
00:18:20.640 | when I read it like years ago
00:18:22.200 | was they had really interesting pre-training tasks.
00:18:26.760 | And if you look at it from a lens
00:18:28.480 | of what they're trying to do,
00:18:30.040 | it's actually pretty useless, right?
00:18:31.600 | Like next token prediction is still useful to generate words
00:18:35.960 | but something like predicting sentence order,
00:18:38.760 | like whether sentence one should come
00:18:41.280 | before sentence two or whatever,
00:18:44.160 | It's not really useful, right?
00:18:45.960 | There's never a time where there's people trying to predict
00:18:49.160 | which sentence came before the other
00:18:51.280 | but it does really teach a model conceptually
00:18:54.880 | like what word order is, right?
00:18:57.320 | So there's these words and there's these words.
00:18:59.800 | You have to understand should these set of words
00:19:01.800 | come before these set of words.
00:19:03.080 | So that helps in the classification task
00:19:05.640 | where you have to group words together.
00:19:08.240 | So in some sense for the tasks that BERT is trying to do
00:19:13.240 | it's trying to be a small efficient classification model
00:19:16.600 | as one of the tasks it's trying to do, right?
00:19:19.320 | It kind of makes sense to do these weird training objectives.
00:19:22.160 | So next token prediction or like masking of words
00:19:26.000 | is still like, there's not many use cases
00:19:28.800 | where you need to fill in the blank, right?
00:19:30.600 | We don't have like 500 words
00:19:32.360 | and you have to fill in one word
00:19:33.880 | but it does teach a model like over a billion words.
00:19:37.720 | If you start to understand
00:19:38.920 | what words go in between other words
00:19:41.080 | it's a good pre-training task
00:19:42.800 | and it helps with classification in general.
00:19:45.360 | So it's like, instead of having a very broad task
00:19:48.920 | of just predict next token and then extract,
00:19:52.600 | like abstract that out to eventually
00:19:54.600 | like get this emergent capability to do classification.
00:19:57.720 | This is like, here's 12 tasks
00:19:59.960 | that mimic understanding words very well.
00:20:03.080 | You're a small model, use all this to do classification
00:20:06.200 | but that concept is still applied
00:20:09.040 | for like current encoder decoder models
00:20:11.800 | and small models that are not just next token prediction.
00:20:15.360 | So basically if you dumb down the task
00:20:19.000 | to subsets of your like main goal, it's still very effective.
00:20:23.200 | And that's why like, if you take a step back
00:20:25.440 | and look at what BERT did, it makes no sense.
00:20:28.360 | Like Google doesn't need to spend millions of dollars
00:20:31.000 | to predict which sentence comes before
00:20:32.880 | or after another sentence, but it does help a small model
00:20:36.480 | that'll better learn embedding per se
00:20:40.080 | and abstract that to these tasks.
00:20:42.120 | - Yeah, agreed.
00:20:46.240 | Like the sentence prediction, you're right,
00:20:52.120 | there's not too many use cases
00:20:53.960 | where that would be helpful.
00:20:57.760 | I wonder if it does help the model
00:21:00.200 | to like figure out the relationships
00:21:03.240 | between words and ideas to some degree.
00:21:09.120 | - Yeah, that's pretty much the whole point.
00:21:11.160 | It should help the model better understand
00:21:13.360 | like breaking down the problem and understand word order.
00:21:17.680 | I think they also did something where they swapped words
00:21:20.640 | from different sentences or swap sentences, right?
00:21:23.360 | And like, that's even more useless in reality.
00:21:27.720 | Like in reality, there's not many times
00:21:31.080 | where you've got mixing of words
00:21:32.840 | from different chunks of sentences.
00:21:34.960 | But once again, it helps the model generalize
00:21:37.600 | towards understanding sentences.
00:21:39.160 | So stuff like that.
00:21:41.400 | It's just, if you look at it in today's sense,
00:21:43.680 | if you have a niche topic that can benefit
00:21:46.160 | from encoder decoder, like small, small,
00:21:49.240 | like on-device active model,
00:21:51.880 | you wanna start to employ breaking down problems
00:21:55.240 | in this sort of methodology.
00:21:56.800 | But there's examples of papers that do this type of work.
00:22:05.200 | - Yeah, for sure.
00:22:06.800 | Would be great to follow on this presentation
00:22:09.960 | with some of the more recent work that kind of builds on it.
00:22:14.080 | - There are not many other questions in chat, by the way.
00:22:20.920 | - Okay, okay, cool.
00:22:22.880 | You can see from this, at least when it was released,
00:22:28.360 | BERT-Large was state-of-the-art, even beating out GPT-1.
00:22:34.120 | And ELMo.
00:22:35.960 | So at least at the time it was released,
00:22:40.360 | it was a very capable model.
00:22:43.000 | And then, I don't know.
00:22:47.240 | I'm gonna kind of skip through these,
00:22:49.880 | except maybe just to show the results
00:22:53.760 | of it being state-of-the-art in a lot of things.
00:22:57.880 | So the next section is ablation studies.
00:23:02.560 | So removing different parts of the model
00:23:05.680 | to see what the effects were.
00:23:09.800 | And let's see.
00:23:15.720 | So here's a couple of different ablations they did:
00:23:20.240 | they removed the next sentence prediction task.
00:23:25.240 | So I guess this is something we were just talking about,
00:23:30.760 | but they still kept the masked LM.
00:23:34.320 | And then the next thing they did
00:23:36.760 | is they only made it go left to right.
00:23:40.280 | And they also have the no sentence prediction.
00:23:44.680 | And so you can see the results from those attempts up here.
00:23:50.480 | The top is the kind of the standard model.
00:23:57.280 | And then if you look at the no next sentence prediction,
00:24:01.200 | it does lose a little bit.
00:24:04.440 | Oh, actually in QNLI,
00:24:06.120 | it looks like it has a significant loss,
00:24:08.560 | maybe in other tasks, much less.
00:24:11.280 | But then as you also take away the bidirectional,
00:24:16.080 | it becomes less capable.
00:24:19.000 | Looks like it kind of varies between tasks,
00:24:22.200 | like how much capability it loses.
00:24:25.240 | But this does show that there is some value
00:24:28.640 | to those components.
00:24:30.640 | Oh yeah, maybe this is what I was talking about,
00:24:43.080 | where they say,
00:24:43.920 | "We believe that this is the first work
00:24:44.760 | "to demonstrate convincingly
00:24:45.920 | "that scaling to extreme model sizes
00:24:48.920 | "also leads to large improvements
00:24:50.480 | "on a very small scale task."
00:24:53.440 | This is kind of like the Bitter Lesson,
00:24:55.720 | but maybe a little bit exaggerated
00:24:58.680 | as far as the extreme model size at this point.
00:25:01.640 | And then they talk about the feature-based approach
00:25:08.640 | with BERT a little bit.
00:25:10.920 | So if there's questions,
00:25:14.120 | feel free to unmute.
00:25:16.880 | Otherwise, let's go over to and look through.
00:25:19.600 | There's Jay Alammar, who has made some very helpful
00:25:24.600 | illustrated BERT and ELMo articles
00:25:30.080 | that we can go through
00:25:31.280 | to just kind of cement our understanding.
00:25:34.720 | So that's just kind of a comparison of some models
00:25:38.920 | that were out at the time.
00:25:41.120 | And then this is one thing that was a takeaway for me,
00:25:47.720 | is that, okay, this is the pre-training step
00:25:50.960 | that we talked about,
00:25:52.040 | but then on the supervised learning step,
00:25:55.200 | you basically stick a classifier after the BERT.
00:25:58.560 | So BERT is in charge of essentially
00:26:01.240 | encoding the text into an embedding.
00:26:04.960 | And then you use that classifier to then classify,
00:26:08.440 | in this case, either spam or not spam.
00:26:12.240 | Let me see if I can find...
00:26:17.000 | There's one diagram that I thought was especially helpful.
00:26:20.280 | Oh yeah, this one.
00:26:21.920 | So this one shows how BERT
00:26:26.520 | takes the entire sequence of tokens
00:26:29.800 | and then creates like for each input token,
00:26:34.800 | it has an output vector.
00:26:39.240 | However, for the purpose of classification,
00:26:42.040 | we only look at the first output vector.
00:26:46.320 | That one contains essentially the entire sense
00:26:52.720 | of all of the input tokens.
00:26:55.800 | And then you can run that through,
00:26:57.880 | it can be a neural network,
00:26:59.240 | could be a logistic regression.
00:27:01.440 | And then from the features here,
00:27:05.760 | and I think there's like something like 768 dimensions
00:27:10.480 | in the embedding, from the features there,
00:27:13.360 | you can then predict spam or not spam
00:27:17.560 | based on your training set.
00:27:19.440 | And let's see, that shows the same thing.
00:27:26.880 | And then here's like some illustration
00:27:33.160 | of the different encoder blocks.
00:27:39.200 | So as we mentioned earlier, BERT is encoder only.
00:27:43.640 | So the kind of classical transformer
00:27:47.520 | is an encoder and a decoder.
00:27:50.600 | Many modern models are decoder only.
00:27:55.720 | And so encoders are mostly used these days
00:28:02.280 | for text classification or text clustering.
00:28:08.200 | To my knowledge, encoder-only transformers
00:28:13.200 | aren't really used for any kind of
00:28:18.320 | sequence generation or next token generation.
00:28:21.680 | This talks about ELMo and the different context of words
00:28:33.440 | and how ELMo captures that.
00:28:36.520 | (silence)
00:28:38.680 | GPT, I thought there was something here.
00:28:47.400 | Yeah, so this is just like what we talked about
00:28:55.680 | about if you have a BERT encoder,
00:29:02.000 | you can stick another model for training on the end of it.
00:29:05.960 | And then go from there.
00:29:07.960 | And then you can also use BERT for embedding.
00:29:14.960 | So if you have like a certain problem space
00:29:19.960 | with a lot of texts that you want to embed,
00:29:23.880 | you can continue pre-training or do fine-tuning on BERT
00:29:31.360 | with your industry-specific text corpus
00:29:36.360 | and then create an encoder that's especially built
00:29:42.720 | for your needs.
00:29:47.240 | So there's one more.
00:29:53.000 | I'll pause for any other questions in the chat.
00:29:57.480 | So some context of how they trained BERT:
00:30:01.400 | they have like those 12 tasks, right?
00:30:03.880 | They had a BERT-based masked language model,
00:30:06.200 | they had next sentence prediction,
00:30:08.080 | token classification, QA, sequence classification.
00:30:11.960 | They had all these tasks.
00:30:13.360 | And basically these were BERT models
00:30:16.360 | with a layer added on top as a classification head.
00:30:21.240 | Now, back in 2019,
00:30:23.720 | when people started using these models,
00:30:25.360 | what was really common to do was:
00:30:29.000 | if you had enough data, take the base model
00:30:31.840 | and add in a linear output head for classification.
00:30:34.920 | The base model has no output head;
00:30:37.520 | its output is just the last step
00:30:39.240 | of processing these tokens or sentences.
00:30:42.160 | Then you add just a linear head
00:30:43.880 | with a softmax for classification,
00:30:46.120 | and then you fine-tune it on a lot of your own data.
00:30:50.600 | If you didn't have as much data,
00:30:53.280 | one thing that was popular was you take
00:30:54.920 | the sequence classification head,
00:30:57.920 | you just continue fine tuning it on your data
00:31:00.200 | and it's already somewhat good at sequence classification.
00:31:03.440 | But there was a whole like series of work
00:31:06.400 | that looked into based on how much data you have,
00:31:10.480 | where you should do your fine tuning.
00:31:12.480 | So if you have a lot of data,
00:31:14.760 | it was pretty common to not only add a classification head,
00:31:18.560 | but also peel back a few layers.
00:31:20.640 | So reset the weights of like the top three,
00:31:23.920 | the top two, the final layer,
00:31:26.320 | and then continue training those in as well for your task.
00:31:29.280 | Because at some level, what people started to learn
00:31:31.600 | was these pre-training objective tasks
00:31:34.840 | of like masked word prediction and sentence ordering or QA,
00:31:39.760 | they were actually affecting the net output
00:31:42.800 | of sequence classification.
00:31:44.480 | And if you wanted better,
00:31:46.000 | you could just train more of your whole model on that.
00:31:48.520 | So there was a whole thing of like,
00:31:50.400 | you should remove the top two layers,
00:31:52.720 | add a sequence classification head,
00:31:54.560 | train on tens of thousands of examples,
00:31:56.440 | and you'll get state-of-the-art results.
00:31:59.160 | You could, if you have less data, freeze layers,
00:32:02.760 | you could unfreeze weights.
00:32:04.080 | There's like a whole set of this,
00:32:05.920 | but it was pretty common to also just mess
00:32:08.480 | with the architecture and add classification head.
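A hedged sketch of that recipe (add a head, unfreeze only the top layers), using today's HuggingFace BertModel layout; the layer count and the two-class head are illustrative:

```python
import torch.nn as nn
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")

for param in bert.parameters():           # freeze everything...
    param.requires_grad = False
for layer in bert.encoder.layer[-2:]:     # ...then unfreeze the top two encoder layers
    for param in layer.parameters():
        param.requires_grad = True

classifier = nn.Sequential(               # linear head over the [CLS] vector
    nn.Dropout(0.1),
    nn.Linear(bert.config.hidden_size, 2),   # e.g. two classes: spam / not spam
)

def logits(input_ids, attention_mask):
    cls_vec = bert(input_ids=input_ids,
                   attention_mask=attention_mask).last_hidden_state[:, 0]
    return classifier(cls_vec)            # train with nn.CrossEntropyLoss
```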
00:32:11.160 | - Do you know if anyone trained all the layers
00:32:18.400 | or like just use that as a starting point
00:32:20.680 | to train all the layers?
00:32:21.920 | Okay.
00:32:23.200 | - So today there's like stuff that came out
00:32:25.520 | like a year or two ago,
00:32:27.600 | where basically you could retrain BERT in 24 hours
00:32:31.200 | on like a budget of like sub $500 with regular A100s
00:32:36.200 | and how you can do this better.
00:32:38.920 | So it was in the realm of like at the time,
00:32:42.680 | not as effective to, you know,
00:32:44.440 | like you don't have Google Compute
00:32:45.760 | to retrain BERT from scratch,
00:32:47.480 | but now there's stuff of like 24 hours
00:32:50.360 | and a couple of hundred dollars
00:32:51.920 | to retrain your own better BERT.
00:32:54.000 | There's an academic paper that came out about this.
00:32:57.080 | If people go down the rabbit hole
00:32:58.720 | of encoder models and this stuff,
00:33:01.600 | it's a cool one to look into,
00:33:03.000 | of how they condensed these 12 pre-training tasks
00:33:08.000 | down to a few, with a better curated dataset,
00:33:10.560 | and outperformed it for a couple of hundred dollars in 24 hours.
00:33:13.920 | But then it was also common
00:33:15.080 | where like there was a sentence classification
00:33:19.880 | and sentence extraction tasks that BERT was adapted towards.
00:33:23.520 | So like BERT for sequence classification
00:33:25.760 | or extraction for like abstractive summarization.
00:33:28.400 | And then companies that took it to production
00:33:30.320 | would do like significant retrains or like,
00:33:34.560 | yeah, they train a lot more of it.
00:33:36.560 | And then this also just went into like
00:33:41.280 | at what part do you want to start training?
00:33:44.040 | (mouse clicking)
00:33:46.800 | Yeah, I mean that sounds like interesting stuff.
00:33:51.640 | If you have any links,
00:33:55.160 | please drop in the chat and I'll check it out.
00:33:57.520 | So maybe the last thing we can go through here
00:34:04.320 | is that the same author, Jay Alammar, has a notebook
00:34:11.400 | where he shows like hands-on
00:34:14.120 | how to do this movie review sentiment classification.
00:34:19.120 | He uses DistilBERT.
00:34:24.080 | So DistilBERT is a Hugging Face distilled version of BERT
00:34:29.080 | that has very comparable performance
00:34:34.920 | with many fewer parameters.
00:34:39.280 | And then to do the classification,
00:34:41.400 | he just uses a basic logistic regression model
00:34:45.320 | from scikit-learn.
00:34:47.720 | And so then the features that go
00:34:51.120 | into this logistic regression model
00:34:52.880 | are just the vector of size 768
00:34:57.040 | that comes out of the DistilBERT embedding.
00:35:04.760 | So a lot of this leans very heavily
00:35:07.920 | on the HuggingFaceTransformers library.
00:35:11.120 | So let's see, that's just installing it, doing imports.
00:35:15.880 | He uses, he must've mentioned it up above, but a...
00:35:23.080 | Maybe it's...
00:35:29.240 | Anyway, there's a particular HuggingFace dataset
00:35:32.480 | that he's using that has the movie sentiment training data.
00:35:36.480 | Maybe he just uploaded it somewhere, he had it.
00:35:43.760 | - I think it's an IMDb dataset.
00:35:47.160 | I think it's one of the Kaggle ones.
00:35:49.360 | Yeah, it's just a Kaggle IMDb.
00:35:50.920 | You have movie reviews, you classify them.
00:35:53.160 | Oh, and on that actually reminds me,
00:35:58.160 | one of the big things that made BERT somewhat popular
00:36:01.560 | was there was another Kaggle competition
00:36:03.840 | on tweet classification of sentiment.
00:36:07.360 | So in tweets, like with previous embeddings,
00:36:10.520 | like bag-of-words or GloVe or ELMo,
00:36:13.880 | if you have stuff like, "I hate this so much,"
00:36:18.880 | that in some contexts in tweets could still be positive,
00:36:22.880 | even though it's very negative.
00:36:24.240 | And when you just look at lexical understanding of words,
00:36:30.480 | it's very negative, but then BERT embeddings
00:36:33.080 | were what really dominated that.
00:36:35.320 | And then for like a few years,
00:36:36.320 | they kept doing follow-ups on that.
00:36:37.640 | But IMDb and tweet classification
00:36:40.520 | were versions that they used in a lot of these demos.
00:36:44.600 | - So let's see.
00:36:53.520 | So here we're just downloading
00:36:58.920 | DistilBERT from Hugging Face
00:37:02.200 | and getting the model initialized.
00:37:06.240 | So you can see it's just a few lines of code there.
00:37:09.040 | So we have to do a few things like tokenize it
00:37:16.800 | and then add padding.
00:37:19.600 | So this is so that all of the sequences
00:37:23.760 | can be run in parallel.
00:37:26.960 | So we need to pad out so that they're all the same length.
00:37:30.320 | And then we need to mark the padded sections as masked
00:37:37.120 | so that BERT doesn't get confused into thinking
00:37:41.680 | the empty space is actual sequence
00:37:45.400 | that we want it to process.
00:37:48.800 | So then you can see this diagram here.
00:37:55.520 | And again, apologies, I'll try to zoom in again.
00:37:58.440 | Oh, it worked this time.
00:38:00.000 | So this just takes the input text,
00:38:04.720 | runs it through DistilBERT
00:38:06.400 | and comes out with the embeddings.
00:38:08.200 | So that's all of this thing does.
00:38:15.120 | And then the one tricky part about all of this
00:38:20.120 | is you need to pick out exactly
00:38:23.800 | which values from this,
00:38:28.800 | I guess it's three-dimensional tensor
00:38:31.720 | you want to predict on.
00:38:34.040 | So if you remember back from here,
00:38:37.040 | we just want the very first output.
00:38:45.120 | We wanna ignore all of these other ones.
00:38:51.680 | So he draws out in detail like
00:38:55.960 | how exactly you pull just those vectors
00:38:59.760 | out of this three-dimensional tensor.
00:39:03.920 | And then it's pretty straightforward
00:39:06.320 | machine learning after that.
00:39:08.080 | You just turn those 768 dimensions into features
00:39:17.800 | and do a train/test split,
00:39:21.400 | train your logistic regression model,
00:39:25.040 | and then run, you know,
00:39:31.000 | once you've got the model trained,
00:39:32.040 | you can run a score and it gets 82%.
00:39:35.560 | So assuming it's a 50/50 split,
00:39:39.320 | then the expected amount just from random chance
00:39:42.760 | would be 50%.
00:39:44.560 | So there is a significant increase
00:39:48.560 | using BERT to do classification,
00:39:51.280 | but obviously still plenty of room for improvement.
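Condensed, that notebook's pipeline looks roughly like this; the toy reviews below stand in for the movie-review data it actually loads:

```python
import torch
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Rough sketch: tokenize + pad, run DistilBERT, keep only the first ([CLS])
# output vector per review, then fit a logistic regression on those 768-d features.
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

texts = ["a touching and well acted film", "an instant classic",
         "a boring mess of a movie", "painfully dull"]      # toy stand-ins
labels = [1, 1, 0, 0]

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)                                  # (batch, seq_len, 768)
features = out.last_hidden_state[:, 0, :].numpy()       # the [CLS] position only

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.25)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))
```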
00:39:55.680 | Okay, and down here, it says the highest accuracy score
00:40:00.360 | for this data set is currently 96.8.
00:40:03.880 | So as you can see,
00:40:06.640 | things have come a long way since 2019,
00:40:10.080 | but still a useful model to start with
00:40:15.080 | for any classification tasks or clustering.
00:40:23.240 | If you just wanna see what text sequences
00:40:26.800 | are close to each other in embedding space,
00:40:28.760 | you can use it for that as well.
00:40:30.920 | So that's about all I had for prepared stuff.
00:40:39.280 | I'll stop sharing and then maybe go through the chat.
00:40:44.280 | If anyone wants to chime in,
00:40:48.600 | add any color commentary, feel free to do that.
00:40:52.800 | - Someone linked the paper on the academic budget,
00:41:02.680 | the 24-hour BERT,
00:41:05.280 | and then I was also trying to think
00:41:07.360 | where apparently MosaicML showed how you can
00:41:10.600 | pre-train BERT from scratch for $20 now.
00:41:12.560 | So yeah, $20, they did like eight A100s for an hour,
00:41:17.560 | and they're able to match the GLUE score of basic BERT
00:41:21.640 | with their recipe.
00:41:23.080 | Kind of interesting.
00:41:25.480 | So a note Eugene Cha made is researchers felt
00:41:30.480 | 10K training is expensive.
00:41:32.560 | So I remember mapping this out.
00:41:35.320 | BERT, all these things that compare like 24-hour,
00:41:39.880 | one-hour BERT, they trained BERT
00:41:42.600 | for four days of TPU v3 equivalent,
00:41:46.960 | which is like at the time, let's say eight to $10 an hour,
00:41:50.760 | which is like 10 to 15K.
00:41:52.720 | But then there wasn't just BERT-base.
00:41:54.960 | There was BERT-base, there was BERT-large,
00:41:56.800 | there was BERT-small.
00:41:57.800 | There's a bunch of experiments.
00:41:59.160 | The BERT-large was trained for more than four days.
00:42:02.240 | Like the cost equivalent is 50K on that,
00:42:05.040 | 10K on the regular BERT-base, less on the little one.
00:42:08.440 | And then you got to add in like the time and the R&D.
00:42:12.280 | Oh, it was well more than a 10K project at Google.
00:42:16.400 | The BERT-large itself was already a 50K train run,
00:42:20.080 | plus 10 to 15 for BERT-base, plus just experimentation.
00:42:24.520 | So expensive, expensive.
00:42:27.680 | - I think most labs
00:42:29.600 | would love to hear those numbers for SOTA.
00:42:32.520 | - Right now.
00:42:33.360 | - True, true.
00:42:39.400 | - I'm just reading through the chat.
00:42:52.280 | - Mm-hmm.
00:43:01.520 | - Did we get a volunteer for next week
00:43:03.360 | or still waiting on that?
00:43:05.160 | - Anyone else next week?
00:43:09.760 | Any other questions on this, by the way?
00:43:12.360 | - I do have a random one.
00:43:13.880 | This is regarding the embedding size, right?
00:43:18.840 | Even though I joke that this was the era before
00:43:23.840 | the gaming GPU folks came in and said,
00:43:26.400 | hey, you need to be divisible by 64, 32, or a power of two, right?
00:43:33.280 | Do TPUs not have that divisible-by-64
00:43:39.120 | optimization when it comes
00:43:41.760 | to embedding size?
00:43:44.440 | Is that why they have all these weird embedding
00:43:46.640 | sizes in TPU-related training?
00:43:48.960 | - I don't think it's TPU based.
00:43:54.120 | I have like old notes that I'm recalling
00:43:56.520 | where I dug through why they specifically did 768 and 512.
00:44:02.240 | And also someone noted in chat that's a limitation.
00:44:06.280 | There's other work that extends this out
00:44:09.760 | to like SentenceBERT that extends the embedding dimension.
00:44:15.200 | And they were also pretty small.
00:44:18.560 | But back to Eugene's point of, is it hardware limitation?
00:44:22.600 | It's not.
00:44:23.480 | It was about divisibility between layers
00:44:26.520 | and adding layers and a bunch of stuff.
00:44:28.440 | I really can't remember the specifics of the reasons.
00:44:31.760 | I'll dig through some old notes,
00:44:33.400 | but someone broke down the per layer map
00:44:35.520 | and sending through inputs.
00:44:38.600 | And there was a decent range reason for why all this.
00:44:43.160 | It's also like there's 12 layers, and 768 is divisible by 12.
00:44:48.000 | It went down that path.
00:44:51.040 | But also it wasn't like someone
00:44:52.800 | from Google that worked on BERT.
00:44:54.440 | It was just mapping through the input
00:44:56.760 | through every layer and all of this math working out.
00:44:59.200 | And then a reason for like, oh, here's why this,
00:45:01.800 | why not this?
00:45:02.840 | And I was like, sounds good.
00:45:04.280 | Checks out to me.
00:45:05.280 | I can probably find this if I look.
00:45:08.520 | It's in some notes from a couple of years ago.
00:45:11.600 | (indistinct)
00:45:14.000 | - Eric, someone has asked,
00:45:20.160 | does BERT's pre-training objective, MLM
00:45:23.240 | (masked language modeling), follow the same
00:45:25.240 | LLM scaling laws as GPTs?
00:45:27.520 | - That's a good question.
00:45:31.680 | I don't know if there's been enough research in that area
00:45:38.080 | to like come to any conclusion.
00:45:41.680 | So I, like when I was researching this presentation,
00:45:46.840 | I went to, I think it was paperswithcode.com
00:45:51.840 | or something like that and looked at all the top papers
00:45:57.720 | for text classification.
00:45:59.800 | And like a lot of them were from 2021 or earlier.
00:46:05.880 | And so it seems like this direction of like encoder only
00:46:10.880 | or bi-directional has,
00:46:14.320 | well, I don't know about the bi-directional part,
00:46:15.720 | but at least encoder only,
00:46:17.320 | research has been pretty sparse recently.
00:46:20.920 | So for example, I don't know if anyone's spending
00:46:26.360 | millions of dollars to train like, you know,
00:46:30.440 | super BERT or something like that.
00:46:35.520 | - I think it's-- - Well, the Reka AI guys
00:46:37.800 | seem to be playing in that space.
00:46:38.640 | - No, the last time I looked at this,
00:46:41.840 | leaderboard on Hugging Face,
00:46:43.920 | I think it was all led by these transformed LLMs
00:46:48.360 | that now get the best performance,
00:46:51.240 | like a Mistral 7B turned into embedding model.
00:46:55.920 | - Is that, sorry, I missed.
00:47:02.440 | Is that for text classification?
00:47:05.320 | - Well, yeah, I guess at the core,
00:47:08.920 | it's all turning text into an embedding.
00:47:11.600 | So yeah.
00:47:14.040 | - Could you drop a link in the chat?
00:47:18.800 | - Yeah, sure.
00:47:20.480 | - Yeah, it seems like that would lead
00:47:21.920 | to higher performance with this classification,
00:47:26.040 | but I'd have to do a little bit more research on that.
00:47:32.720 | - Well, there's two pieces as well, right?
00:47:35.480 | So for masked language modeling,
00:47:37.880 | a lot of the scaling law papers
00:47:39.920 | directly showed why decoder-only token prediction
00:47:43.200 | scales better than masked language models.
00:47:45.880 | One is purely that when you mask 15% of tokens,
00:47:48.720 | you train on the masked ones.
00:47:49.840 | So you lose a lot of data.
00:47:51.720 | You need more quality data.
00:47:54.520 | You're just straight training on less, right?
00:47:56.680 | If you have a data set of a trillion tokens,
00:47:59.120 | you can mask 15% of them and train on learning the 15%,
00:48:03.000 | or you can train on all trillion of them.
00:48:05.360 | That's a straight scaling.
00:48:06.640 | Now, if you have 15 trillion tokens versus 1 trillion,
00:48:09.680 | that's another question,
00:48:11.160 | but for embedding tasks in smaller models,
00:48:15.560 | the better trade-off scaling curve at the start
00:48:19.360 | for encoder learns better with less tokens at first,
00:48:23.960 | but then extending this out in pure scaling laws,
00:48:28.760 | yeah, you lose a lot of your training data, right?
00:48:31.040 | And then that was one of the big points
00:48:33.040 | of why do we do next token prediction,
00:48:35.840 | because it scales better than other tasks, right?
00:48:38.760 | So scaling laws were made to show better objectives,
00:48:41.480 | so directly against it.
00:48:42.760 | But then at small scale and stuff,
00:48:45.440 | there's benefit in this,
00:48:48.240 | specifically for like edge models,
00:48:51.120 | like you can deploy a BART as a guardrail live,
00:48:55.800 | and you can have it intercept every query
00:48:57.840 | because it can act in milliseconds
00:48:59.360 | versus LLMs will still take longer, right?
00:49:02.280 | So at a smaller scale, they'll be better.
00:49:04.800 | There was another part to this that I'm blanking on.
00:49:09.040 | Oh, Reka AI,
00:49:10.120 | they're doing encoder decoder generation models
00:49:12.880 | where they're adding decoder heads to encoders,
00:49:15.200 | and they're scaling them up to billions of parameters.
00:49:18.320 | They're a case study of spending money
00:49:21.720 | to train them up pretty big.
00:49:24.120 | (mouse clicking)
00:49:26.880 | There's a question from Isaac in the chat
00:49:31.960 | about my use case at work.
00:49:35.400 | So currently where we're at in the project
00:49:38.520 | is we need to accumulate some good training data.
00:49:43.520 | So we don't have enough training data
00:49:45.360 | to like actually train a BART or that type of model.
00:49:51.520 | So to start with, we're just using LLMs and prompts
00:49:56.520 | to like do some logical classification
00:50:01.920 | to like kind of bootstrap until we get enough data.
00:50:07.600 | And then also to create a feedback loop
00:50:09.600 | where we can get feedback from people
00:50:12.920 | so that we'll have enough like solid training data
00:50:15.880 | so we can actually train a model.
00:50:18.880 | The main purpose of it being faster performance
00:50:22.640 | as Vibhu mentioned,
00:50:24.320 | then you can respond in milliseconds
00:50:28.400 | versus multiple seconds or tens of seconds
00:50:33.400 | if you're using an LLM.
00:50:35.440 | (mouse clicking)
00:50:38.200 | - Well, thank you, Eric.
00:50:54.760 | - Yeah.
00:50:55.600 | - Always appreciate the OG papers.
00:50:57.240 | - Yeah, it's good to--
00:51:02.720 | - Do you wanna ask one volunteer next week?
00:51:05.640 | - Any paper, it doesn't have to be,
00:51:07.240 | I'll look back at anything that I'm interested in.
00:51:09.800 | (mouse clicking)
00:51:15.560 | - Yeah, I don't know if there's any paper
00:51:26.800 | that's caught my eye recently.
00:51:28.280 | I guess like we talked a little bit about embedding papers,
00:51:33.320 | people are interested in embedding.
00:51:35.120 | - Did we ever do the Jina v2 embeddings paper?
00:51:39.560 | - No, there's also NOMIC embed.
00:51:43.680 | I thought like, I didn't see the Jina one,
00:51:47.000 | but the NOMIC one was pretty detailed
00:51:48.640 | in terms of what their process was.
00:51:51.680 | So, interesting.
00:51:53.520 | - I might be mixing up the papers, though,
00:51:56.600 | but I think I remember we went through one embedding paper.
00:52:01.480 | Maybe it's NOMIC, maybe it's NOMIC, I don't know.
00:52:04.080 | But yeah, probably it wasn't there.
00:52:06.680 | - The NOMIC one compares directly to Jina.
00:52:13.120 | I have the exact same thing.
00:52:14.880 | I haven't seen the NOMIC one as much.
00:52:16.320 | I just know Jina was the one open source,
00:52:18.920 | 8K context, very detailed,
00:52:21.800 | "here's how to do embeddings from scratch
00:52:23.360 | and fine-tune them" paper.
00:52:24.840 | But I guess if NOMIC is the same thing,
00:52:27.000 | 50/50 if anyone wants to take one or both,
00:52:29.680 | I would love both.
00:52:31.600 | - Oh, there's Jina v3 now, crazy.
00:52:33.440 | - Okay, well, I will volunteer for NOMIC or Jina.
00:52:49.720 | If anyone else has papers they wanna cover in the meantime,
00:52:54.800 | let's cover them,
00:52:55.720 | but otherwise I don't wanna drag this too long.
00:52:58.800 | Yeah, nice chat.
00:53:01.040 | - All right, and thanks, Eric.
00:53:04.080 | - Yeah, thank you.
00:53:05.240 | - Thanks, Eric.
00:53:07.000 | Thanks, everyone.
00:53:07.840 | - Bye.
00:53:08.680 | - See ya.
00:53:09.520 | See ya.