[Paper Club] BERT: Bidirectional Encoder Representations from Transformers

00:00:00.000 | Let's go.

00:00:00.840 | I'll kill time while Eric figures out

00:00:10.600 | how to share his computer.

00:00:13.260 | So why do we pick Bert today?

00:00:17.280 | I'm kind of curious.

00:00:20.360 | - Yeah, 'cause I'm working on a text classification problem

00:00:25.380 | at work and just wanna get the background.

00:00:28.440 | - Okay.

00:00:29.280 | - Yeah, I wonder if there is a microservice

00:00:35.040 | that is easy to set up.

00:00:37.040 | I wonder if OpenPipe does this,

00:00:39.040 | that only mirrors a structured output GPT-4-O call

00:00:44.040 | and just mirrors it until it has enough data for Bert

00:00:50.000 | and then just switches you to Bert.

00:00:51.760 | - What do you mean by mirrors?

00:00:56.880 | - Shadows it.

00:00:58.880 | - Basically, you just use it at the start

00:01:01.840 | until you have enough data and then do URLs

00:01:04.080 | and then just swap it, so it's cheap.

00:01:06.360 | - Yep.

00:01:07.720 | - I think OpenPipe actually does that,

00:01:09.360 | but I don't think it's as automatic and as seamless

00:01:11.520 | as how your vision is.

00:01:14.520 | - Yeah, I mean, like the amount of times

00:01:16.200 | people suggest using Bert for classification,

00:01:18.840 | it's cheap, like it's a path that is worth taking.

00:01:21.600 | Oh, Eric's rejoining, okay.

00:01:24.440 | - Bart and T5 are good,

00:01:26.680 | like they also got a host, you need to,

00:01:29.640 | like one thing is just switch,

00:01:30.960 | but like for OpenAI or routing,

00:01:32.680 | you have endpoints and hosted model and all that.

00:01:36.520 | So to do Bert, like you gotta go deploy it somewhere

00:01:39.840 | and handle that part.

00:01:42.160 | - Yeah.

00:01:44.000 | All right, Eric, we can see you.

00:01:47.600 | - Okay, perfect.

00:01:48.880 | So we'll start by going through the paper

00:01:56.200 | and then there's additional material out there on Bert.

00:02:01.200 | So depending how much time we have,

00:02:05.720 | we can look at some of the other things out there.

00:02:09.080 | So first of all, Bert stands for

00:02:15.440 | bi-directional encoder representations from transformers.

00:02:19.400 | So this is one of the first transformer papers,

00:02:24.560 | like I believe it was about a year after

00:02:28.520 | the attention is everything you need paper.

00:02:31.840 | You can see it's from 2019.

00:02:34.600 | So ancient history in terms of deep learning and NLP,

00:02:39.600 | but still useful for a lot of use cases.

00:02:46.600 | And just for a little bit more context.

00:02:53.560 | So after, so this was a Google paper

00:02:58.560 | and after they trained and released Bert,

00:03:02.240 | they started using it in Google search.

00:03:06.480 | And so it provided context for search results

00:03:11.480 | so that there's some examples maybe we can look at later,

00:03:16.960 | but it would, if you gave like a query

00:03:20.240 | that could mean a couple of different things,

00:03:22.800 | they use this model to discriminate between the two.

00:03:26.560 | So let's, I guess, just walk down through the paper here.

00:03:34.240 | One thing to note is that there is this,

00:03:42.840 | two different kinds of approaches early on,

00:03:47.520 | feature-based and fine tuning.

00:03:50.600 | And so the like models like Elmo were feature-based

00:03:55.600 | where they had tasks specific architectures

00:04:00.600 | that were built into the model.

00:04:03.080 | Whereas models like Bert and GPT

00:04:07.520 | were more used to fine tuning approach

00:04:10.720 | where they just, as more general purpose,

00:04:13.960 | they just trained one model

00:04:15.880 | and then you could fine tune it after the fact

00:04:18.520 | for your particular tasks.

00:04:22.440 | They talk a little bit here about the,

00:04:29.360 | one of the limitations of standard language models

00:04:33.560 | is that they're unidirectional.

00:04:36.120 | So what that means is for like GPT

00:04:39.600 | and many other models out there now,

00:04:44.600 | they only look forward.

00:04:46.640 | And so they take a sequence of words

00:04:49.320 | and try to predict the next word in the sequence.

00:04:53.080 | However, Bert also goes backwards.

00:04:57.520 | So it takes like in the training data,

00:05:02.080 | it can also start at say the end of a piece of text

00:05:04.960 | and try to predict the previous word.

00:05:07.080 | And so we'll talk a little bit how they avoid

00:05:12.960 | like contamination of the prediction,

00:05:16.680 | because obviously if you're training from back to front,

00:05:19.800 | you get a peek at what the words are coming up.

00:05:23.880 | - So a lot of this section

00:05:27.920 | was pre-decoder only transformers, right?

00:05:31.880 | So a lot of what they reference is like RNNs, GRUs, LSTMs.

00:05:36.480 | So like pre-transformer stuff,

00:05:38.400 | you only get one pass and then they're like,

00:05:40.920 | "Crazy idea if you look at it from front to back."

00:05:43.800 | Like, so a lot of the training tasks

00:05:46.440 | was like classification, right?

00:05:47.920 | When humans classify something,

00:05:49.720 | we don't do it token by token.

00:05:51.360 | We listen to the whole sentence, then we classify.

00:05:53.880 | So this was more so like LSTM, RNN era, yeah.

00:05:58.880 | - Right.

00:06:01.240 | Yeah, it's a very early paper.

00:06:09.440 | So let's see, we talked about bi-directional.

00:06:12.720 | And then, yeah, so it's fine-tuned versus feature-based.

00:06:22.280 | And then they show the effectiveness on BERT

00:06:25.240 | on 11 different NLP tasks.

00:06:28.360 | I guess one call out on the related work is just ELMo.

00:06:36.600 | So ELMo was another model.

00:06:39.680 | I believe it was also by Google.

00:06:41.600 | If anyone knows for sure, feel free to correct.

00:06:45.840 | But they used different representations

00:06:50.840 | for the same word in different contexts.

00:06:55.320 | And so like the word stick, for example,

00:07:00.320 | you could say that means I'm going to chase a dog

00:07:05.760 | with a stick, or it could be like,

00:07:07.960 | hey, let's stick to the material that we're talking about,

00:07:11.840 | or maybe some other words or contexts.

00:07:14.840 | And so those can mean different things, the same token.

00:07:19.760 | And because of that,

00:07:20.640 | they want to use like different representations,

00:07:23.320 | even though it's the same word in English.

00:07:26.160 | And so BERT leverages that feature as well.

00:07:30.120 | - So slight correction.

00:07:33.040 | ELMo is from Allen Institute and University of Washington.

00:07:37.160 | And then further context.

00:07:38.720 | So one of the best BERT adaptations,

00:07:42.440 | I think in 2019 was Roberta.

00:07:44.680 | Roberta is like BERT, but make it good and bigger.

00:07:48.440 | And it was also from the same team

00:07:51.320 | at University of Washington and Facebook, I think.

00:07:54.320 | But ELMo was just, yeah, it wasn't from Google,

00:07:57.480 | but same thing.

00:07:58.480 | It went from like one hot encoding bag of words

00:08:00.640 | to ELMo was really popular

00:08:02.760 | for like a bunch of Kaggle competitions

00:08:05.160 | where you needed good embeddings.

00:08:06.920 | And then for embeddings, it's not Google.

00:08:09.720 | - Great, thanks for the additional info.

00:08:15.520 | So let's take a stop here to look at this diagram.

00:08:27.480 | So this is the BERT like architecture.

00:08:32.480 | Essentially, you can see this pink row down here

00:08:38.080 | is a set of tokens.

00:08:39.480 | Let me see if I can zoom in here.

00:08:44.320 | I'm not sure why I can't zoom in while I'm sharing,

00:08:56.360 | but anyway, this token is a classifier token.

00:09:01.360 | And then you see like tokens one through N.

00:09:06.800 | So those are the first sentence.

00:09:10.640 | And then there's a separator token.

00:09:13.160 | And then there's another token one through M.

00:09:18.000 | And the reason that BERT has this structure

00:09:22.200 | is because one of the tasks it does

00:09:24.520 | is sentence classification.

00:09:29.520 | Basically, it can take two sentences

00:09:32.600 | and determine if one of the sentences

00:09:36.440 | seems like it follows the other one.

00:09:39.120 | And then you see, so this is a pre-trained model.

00:09:45.240 | So this is trained on, well, what was at that point

00:09:49.280 | a large amount of text.

00:09:51.640 | And then these are the different fine tunes of that model

00:09:55.120 | for different benchmarks, essentially.

00:09:58.600 | And FYI, I'm not looking at the chat.

00:10:02.240 | So if there's anything in there, I'm not seeing it.

00:10:04.800 | I can answer questions later

00:10:07.160 | or if someone just wants to unmute, feel free.

00:10:10.040 | So here's those two tokens I mentioned

00:10:14.720 | that you might not have been able to see,

00:10:16.840 | the classifier token and then the separator token.

00:10:20.200 | The classifier token tells it like what task

00:10:24.360 | it's supposed to be performing.

00:10:26.560 | So they talk about how there's two steps in the framework,

00:10:31.640 | retraining and fine tuning.

00:10:33.840 | Probably a lot of people are familiar with that.

00:10:38.000 | So I won't go into too much detail.

00:10:41.560 | This is interesting.

00:10:43.960 | So this is 2019 numbers of what was at that point

00:10:49.640 | a large model.

00:10:51.080 | You can see the base model was 110 million parameters

00:10:56.080 | and the large model, which was,

00:10:58.680 | I saw someone referred to as like gigantic

00:11:02.240 | or like unbelievably large or something like that,

00:11:06.040 | is 340 million parameters, which is, you know,

00:11:10.600 | like 5% the size of like a LLAMA7B

00:11:15.600 | or something along those lines.

00:11:18.000 | So what at the time seemed very large,

00:11:21.600 | in retrospect, isn't really.

00:11:25.160 | And that also goes for the training data.

00:11:29.440 | Do they have it in here?

00:11:34.640 | Well, there's two training sets that they used for it.

00:11:40.800 | One was a, all of Wikipedia, the English version,

00:11:46.320 | which was like 800 million words.

00:11:48.560 | And then I think a set of books,

00:11:51.560 | which was 2.5 billion words.

00:11:53.720 | And again, those data sizes,

00:11:58.680 | while they could have been large for the time,

00:12:01.160 | currently are relatively small.

00:12:03.920 | Typically, like at least for frontier models,

00:12:08.240 | you're talking about low trillions of tokens to train them.

00:12:13.760 | So here they talk about what I mentioned earlier

00:12:18.760 | about having a pair of sentences

00:12:23.240 | where you can have a question and answer

00:12:25.360 | in one token sequence with the separator token

00:12:29.280 | in the middle.

00:12:30.200 | And so let's go down and talk about pre-training.

00:12:39.800 | And here we get to their answer to the left-to-right

00:12:44.800 | versus right-to-left dilemma.

00:12:53.280 | So because it's bi-directional,

00:12:56.160 | potentially each word could see itself in the future.

00:13:01.160 | And so what they ended up doing

00:13:04.600 | was to mask 15% of the words

00:13:10.600 | so instead of having the actual word in the sequence,

00:13:15.600 | they have this mask token.

00:13:18.200 | And then they try to predict that word

00:13:22.720 | either going forward or backward

00:13:24.920 | and do the, you know, score the training on that.

00:13:31.760 | They do have an issue that, let's see,

00:13:39.520 | that since the mask token does not appear

00:13:43.240 | during fine tuning,

00:13:44.560 | they have to compensate for that in the pre-training step.

00:13:51.200 | So in order to do that,

00:13:52.320 | they don't always replace the mask words

00:13:55.200 | with this mask token.

00:13:57.320 | They do 80% of the time, 10% of the time,

00:14:02.120 | they just stick in a random token

00:14:04.480 | or anything in the vocabulary.

00:14:06.600 | And then another 10% of the time,

00:14:08.800 | they just leave it unchanged.

00:14:11.160 | And so that helps during the fine tuning stage

00:14:15.000 | so that fine tuning doesn't expect these mask tokens

00:14:19.320 | like all the time.

00:14:20.520 | And then the other task they give it during pre-training

00:14:25.720 | is this next sentence prediction.

00:14:28.640 | That is the, you saw the two sentences

00:14:35.320 | and whether they are related.

00:14:37.080 | So that is, you know,

00:14:42.080 | that's the sentence A, separator, sentence B.

00:14:47.400 | And then they have the 50% split of training data

00:14:56.120 | where half of it is labeled as is next

00:15:01.960 | and half of it is labeled as not next.

00:15:04.360 | So you can imagine a initial sentence

00:15:09.360 | of I walked down to the store.

00:15:13.040 | One that might be marked as is next

00:15:17.360 | is something like I bought a burrito

00:15:20.840 | and something that might be not next is,

00:15:24.960 | you know, salamanders have many spots

00:15:27.160 | or something like that.

00:15:28.320 | And so this part of the pre-training figures out

00:15:32.120 | which of those sentences follow each other.

00:15:35.000 | Here, oh, okay.

00:15:41.080 | So here's what I was talking about earlier,

00:15:42.880 | the book corpus, 800 million words.

00:15:44.600 | And I guess I had them flipped Wikipedia

00:15:46.680 | as the 2.5 billion words.

00:15:49.360 | And this shows how the input representation.

00:15:58.360 | So there's a couple of different layers

00:16:02.800 | and they just add these layers up

00:16:04.960 | to come up with the input.

00:16:07.320 | So the top layer here is the token.

00:16:12.320 | So it's the vector representation of that particular word.

00:16:16.960 | The next is the segment.

00:16:20.080 | So this splits it up between sentence A and sentence B.

00:16:24.840 | Each one of those has a different factor

00:16:28.680 | that gets added into the input

00:16:31.840 | to help the model distinguish between them.

00:16:34.520 | And then finally as the positional embedding.

00:16:36.680 | So each place in the sequence,

00:16:41.680 | it's its own value that when it's added

00:16:46.600 | to the other layers helps the model to distinguish

00:16:52.600 | like does my come before dog or vice versa.

00:16:57.600 | Okay, and then they talk about a bunch

00:17:08.520 | of different benchmarks.

00:17:12.960 | We'll just maybe take a look at some of the results

00:17:18.800 | and then continue forward.

00:17:21.480 | I'll also look at the,

00:17:22.840 | let me take a break to look at the chat here.

00:17:27.280 | - Another serious questions in chat.

00:17:31.040 | I guess Sam had one.

00:17:33.280 | Have there been revisiting

00:17:35.640 | since the weird now pre-training tasks

00:17:37.560 | that were used to invert?

00:17:39.320 | Or has this all been done

00:17:40.840 | and it's somewhat of a shut book

00:17:43.000 | and next token prediction is all you need?

00:17:45.880 | Any thoughts, comments?

00:17:49.520 | - So I didn't look at any of the following BERT papers

00:17:53.960 | like Roberta or the Berta for this.

00:17:58.440 | So I'm not sure if there's anyone else on the call

00:18:03.280 | who knows would be happy for a contribution.

00:18:06.720 | - Yeah, we had a slight follow-up in the comments there.

00:18:14.880 | Basically for encoder decoder stuff, it still makes sense.

00:18:18.200 | And a lot of my takeaway from this paper

00:18:20.640 | when I read it like years ago

00:18:22.200 | was they had really interesting pre-training tasks.

00:18:26.760 | And if you look at it from a lens

00:18:28.480 | of what they're trying to do,

00:18:30.040 | it's actually pretty useless, right?

00:18:31.600 | Like next token prediction is still useful to generate words

00:18:35.960 | but something like predicting sentence order

00:18:38.760 | is sentence one or sentence two.

00:18:41.280 | Should sentence one come before sentence two or whatever?

00:18:44.160 | It's not really useful, right?

00:18:45.960 | There's never a time where there's people trying to predict

00:18:49.160 | which sentence came before the other

00:18:51.280 | but it does really teach a model conceptually

00:18:54.880 | like what word order is, right?

00:18:57.320 | So there's these words and there's these words.

00:18:59.800 | You have to understand should these set of words

00:19:01.800 | come before these set of words.

00:19:03.080 | So that helps in the classification task

00:19:05.640 | where you have to like group together words together.

00:19:08.240 | So in some sense for the tasks that BERT is trying to do

00:19:13.240 | it's trying to be a small efficient classification model

00:19:16.600 | as one of the tasks it's trying to do, right?

00:19:19.320 | It kind of makes sense to do these weird training objectives.

00:19:22.160 | So next token prediction or like masking of words

00:19:26.000 | is still like, there's not many use cases

00:19:28.800 | where you need to fill in the blank, right?

00:19:30.600 | We don't have like 500 words

00:19:32.360 | and you have to fill in one word

00:19:33.880 | but it does teach a model like over a billion words.

00:19:37.720 | If you start to understand

00:19:38.920 | what words go in between other words

00:19:41.080 | it's a good pre-training task

00:19:42.800 | and it helps with classification in general.

00:19:45.360 | So it's like, instead of having a very broad task

00:19:48.920 | of just predict next token and then extract,

00:19:52.600 | like abstract that out to eventually

00:19:54.600 | like get this emergent capability to do classification.

00:19:57.720 | This is like, here's 12 tasks

00:19:59.960 | that mimic understanding words very well.

00:20:03.080 | You're a small model, use all this to do classification

00:20:06.200 | but that concept is still applied

00:20:09.040 | for like current encoder decoder models

00:20:11.800 | and small models that are not just next token prediction.

00:20:15.360 | So basically if you dumb down the task

00:20:19.000 | to subsets of your like main goal, it's still very effective.

00:20:23.200 | And that's why like, if you take a step back

00:20:25.440 | and look at what Bert did, it makes no sense.

00:20:28.360 | Like Google doesn't need to spend millions of dollars

00:20:31.000 | to predict which sentence comes before

00:20:32.880 | or after another sentence, but it does help a small model

00:20:36.480 | that'll better learn embedding per se

00:20:40.080 | and abstract that to these tasks.

00:20:42.120 | - Yeah, agreed.

00:20:46.240 | Like the sentence, the sentence prediction didn't seem

00:20:52.120 | like you're right.

00:20:53.960 | There's not too many use cases where that would be helpful.

00:20:57.760 | I wonder if it does help the model

00:21:00.200 | to like figure out the relationships

00:21:03.240 | between words and ideas to some degree.

00:21:09.120 | - Yeah, that's pretty much the whole point.

00:21:11.160 | It should help the model better understand

00:21:13.360 | like breaking down the problem and understand word order.

00:21:17.680 | I think they also did something where they swapped words

00:21:20.640 | from different sentences or swap sentences, right?

00:21:23.360 | And like, that's even more useless in reality.

00:21:27.720 | Like in reality, there's not many times

00:21:31.080 | where you've got mixing of words

00:21:32.840 | from different chunks of sentences.

00:21:34.960 | But once again, it helps the model generalize

00:21:37.600 | towards understanding sentences.

00:21:39.160 | So stuff like that.

00:21:41.400 | It's just, if you look at it in today's sense,

00:21:43.680 | if you have a niche topic that can benefit

00:21:46.160 | from encoder decoder, like small, small,

00:21:49.240 | like on-device active model,

00:21:51.880 | you wanna start to employ breaking down problems

00:21:55.240 | in this sort of methodology.

00:21:56.800 | But there's examples of papers that do this type of work.

00:22:05.200 | - Yeah, for sure.

00:22:06.800 | Would be great to follow on this presentation

00:22:09.960 | with some of the more recent work that kind of builds on it.

00:22:14.080 | - There are not many other questions in chat, by the way.

00:22:20.920 | - Okay, okay, cool.

00:22:22.880 | You can see from this, at least when it was released,

00:22:28.360 | BERT-Large was state-of-the-art, even beating out GPT-1.

00:22:34.120 | And ELMo.

00:22:35.960 | So at least at the time it was released,

00:22:40.360 | it was a very capable model.

00:22:43.000 | And then, I don't know.

00:22:47.240 | I'm gonna kind of skip through these,

00:22:49.880 | except maybe just to show the results

00:22:53.760 | of it being state-of-the-art in a lot of things.

00:22:57.880 | So the next section is ablation studies.

00:23:02.560 | So removing different parts of the model

00:23:05.680 | to see what the effects were.

00:23:09.800 | And let's see.

00:23:15.720 | So here's a couple of different ablations they did

00:23:20.240 | was they removed the next sentence prediction task.

00:23:25.240 | So I guess this is something we were just talking about,

00:23:30.760 | but they still kept the mask LM.

00:23:34.320 | And then the next thing they did

00:23:36.760 | is they only made it go left to right.

00:23:40.280 | And they also have the no sentence prediction.

00:23:44.680 | And so you can see the results from those attempts up here.

00:23:50.480 | The top is the kind of the standard model.

00:23:57.280 | And then if you look at the no next sentence prediction,

00:24:01.200 | it does lose a little bit.

00:24:04.440 | Oh, actually in QNLI,

00:24:06.120 | it looks like it has a significant loss,

00:24:08.560 | maybe in other tasks, much less.

00:24:11.280 | But then as you also take away the bidirectional,

00:24:16.080 | it becomes less capable.

00:24:19.000 | Looks like it kind of varies between tasks,

00:24:22.200 | like how much capability it loses.

00:24:25.240 | But this does show that there is some value

00:24:28.640 | to those components.

00:24:30.640 | Oh yeah, maybe this is what I was talking about,

00:24:43.080 | where they say,

00:24:43.920 | "We believe that this is the first work

00:24:44.760 | "to demonstrate convincingly

00:24:45.920 | "that scaling to extreme model sizes

00:24:48.920 | "also leads to large improvements

00:24:50.480 | "on a very small scale task."

00:24:53.440 | This is kind of like the bidder lesson,

00:24:55.720 | but maybe a little bit exaggerated

00:24:58.680 | as far as the extreme model size at this point.

00:25:01.640 | And then they talk about feature approach

00:25:08.640 | with BERT a little bit.

00:25:10.920 | So if there's questions,

00:25:14.120 | feel free to unmute.

00:25:16.880 | Otherwise, let's go over to and look through.

00:25:19.600 | There's Jay Alomar has made some very helpful,

00:25:24.600 | like illustrated BERT and ELMo articles

00:25:30.080 | that we can go through

00:25:31.280 | to just kind of cement our understanding.

00:25:34.720 | So that's just kind of a comparison of some models

00:25:38.920 | that were out at the time.

00:25:41.120 | And then this is one thing that was a takeaway for me,

00:25:47.720 | is that, okay, this is the pre-training step

00:25:50.960 | that we talked about,

00:25:52.040 | but then on the supervised learning step,

00:25:55.200 | you basically stick a classifier after the BERT.

00:25:58.560 | So BERT is in charge of essentially

00:26:01.240 | encoding the text into an embedding.

00:26:04.960 | And then you use that classifier to then classify,

00:26:08.440 | in this case, either spam or not spam.

00:26:12.240 | Let me see if I can find...

00:26:17.000 | There's one diagram that I thought was especially helpful.

00:26:20.280 | Oh yeah, this one.

00:26:21.920 | So this one shows how BERT

00:26:26.520 | takes the entire sequence of tokens

00:26:29.800 | and then creates like for each input token,

00:26:34.800 | it has an output vector.

00:26:39.240 | However, for the purpose of classification,

00:26:42.040 | we only look at the first output vector.

00:26:46.320 | That one contains essentially the entire sense

00:26:52.720 | of all of the input tokens.

00:26:55.800 | And then you can run that through,

00:26:57.880 | it can be a neural network,

00:26:59.240 | could be a logistic regression.

00:27:01.440 | And then from the features here,

00:27:05.760 | and I think there's like something like 768 dimensions

00:27:10.480 | in the embedding, from the features there,

00:27:13.360 | you can then predict spam or not spam

00:27:17.560 | based on your training set.

00:27:19.440 | And let's see, that shows the same thing.

00:27:26.880 | And then here's like some illustration

00:27:33.160 | of the different encoder blocks.

00:27:39.200 | So as we mentioned earlier, BERT is encoder only.

00:27:43.640 | So the kind of classical transformer

00:27:47.520 | is an encoder and a decoder.

00:27:50.600 | Many modern models are decoder only.

00:27:55.720 | And so encoder is like used mostly these days

00:28:02.280 | for text classification or text clustering.

00:28:08.200 | To my knowledge, there's encoder only transformers

00:28:13.200 | aren't really used for any kind of next,

00:28:18.320 | like sequence generation or next token generation.

00:28:21.680 | This talks about ELMo and the different context of words

00:28:33.440 | and how ELMo captures that.

00:28:36.520 | (silence)

00:28:38.680 | GPT, I thought there was something here.

00:28:47.400 | Yeah, so this is just like what we talked about

00:28:55.680 | about if you have a BERT encoder,

00:29:02.000 | you can stick another model for training on the end of it.

00:29:05.960 | And then go from there.

00:29:07.960 | And then you can also use BERT for embedding.

00:29:14.960 | So if you have like a certain problem space

00:29:19.960 | with a lot of texts that you want to embed,

00:29:23.880 | you can continue pre-training or do fine tuning on BERT

00:29:31.360 | with your corpus in your industry specific like text corpus

00:29:36.360 | and then create an encoder that's especially built

00:29:42.720 | for your needs.

00:29:47.240 | So there's one more.

00:29:53.000 | I'll pause any other questions in the chat.

00:29:57.480 | So some context of how they train BERT,

00:30:01.400 | they have like those 12 paths, right?

00:30:03.880 | They had a BERT based model, a math language model.

00:30:06.200 | They had the next sentence prediction,

00:30:08.080 | token classification, QA, sequence classification.

00:30:11.960 | They had all these tasks.

00:30:13.360 | And basically what they did where they were BERT models

00:30:16.360 | with a layer added on top for a classification head.

00:30:21.240 | Now, in the time of 2019,

00:30:23.720 | when people started using these models,

00:30:25.360 | what was really common to do was you could either,

00:30:29.000 | if you had enough data, take the base model,

00:30:31.840 | add in a linear output head for classification,

00:30:34.920 | where you basically take all this, there's no output head.

00:30:37.520 | It's just output is the last step

00:30:39.240 | of processing these tokens or sentences.

00:30:42.160 | Then you add just a linear head

00:30:43.880 | with a softmax for classification.

00:30:46.120 | Now, then you fine tune it on a lot of your data itself.

00:30:50.600 | If you didn't have as much data,

00:30:53.280 | one thing that was popular was you take

00:30:54.920 | the sequence classification head,

00:30:57.920 | you just continue fine tuning it on your data

00:31:00.200 | and it's already somewhat good at sequence classification.

00:31:03.440 | But there was a whole like series of work

00:31:06.400 | that looked into based on how much data you have,

00:31:10.480 | where you should do your fine tuning.

00:31:12.480 | So if you have a lot of data,

00:31:14.760 | it was pretty common to not only add a classification head,

00:31:18.560 | but also peel back a few layers.

00:31:20.640 | So reset the weights of like the top three,

00:31:23.920 | the top two, the final layer,

00:31:26.320 | and then continue training those in as well for your task.

00:31:29.280 | Because at some level, what people started to learn

00:31:31.600 | was these pre-training objective tasks

00:31:34.840 | of like mask word prediction and sentence ordering or QA,

00:31:39.760 | they were actually affecting the net output

00:31:42.800 | of sequence classification.

00:31:44.480 | And if you wanted better,

00:31:46.000 | you could just train more of your whole model on that.

00:31:48.520 | So there was a whole thing of like,

00:31:50.400 | you should remove the top two layers,

00:31:52.720 | add a sequence classification head,

00:31:54.560 | train on tens of thousands of examples,

00:31:56.440 | and you'll get state-of-the-art results.

00:31:59.160 | You could, if you have less data, freeze layers,

00:32:02.760 | you could unfreeze weights.

00:32:04.080 | There's like a whole set of this,

00:32:05.920 | but it was pretty common to also just mess

00:32:08.480 | with the architecture and add classification head.

00:32:11.160 | - Do you know if anyone trained all the layers

00:32:18.400 | or like just use that as a starting point

00:32:20.680 | to train all the layers?

00:32:21.920 | Okay.

00:32:23.200 | - So today there's like stuff that came out

00:32:25.520 | like a year or two ago,

00:32:27.600 | where basically you could retrain BERT in 24 hours

00:32:31.200 | on like a budget of like sub $500 with regular A100s

00:32:36.200 | and how you can do this better.

00:32:38.920 | So it was in the realm of like at the time,

00:32:42.680 | not as effective to, you know,

00:32:44.440 | like you don't have Google Compute

00:32:45.760 | to retrain BERT from scratch,

00:32:47.480 | but now there's stuff of like 24 hours

00:32:50.360 | and a couple of hundred dollars

00:32:51.920 | to retrain your own better BERT.

00:32:54.000 | There's like an academia paper that came out about this.

00:32:57.080 | If people go down the rabbit hole

00:32:58.720 | of like encoder models and this stuff,

00:33:01.600 | it's a cool one to look into

00:33:03.000 | of how they can better objectify these 12 pre-training tasks

00:33:08.000 | to a few and a better curated dataset

00:33:10.560 | and outperform it on a couple of hundred dollars in 24 hours.

00:33:13.920 | But then it was also common

00:33:15.080 | where like there was a sentence classification

00:33:19.880 | and sentence extraction tasks that BERT was adapted towards.

00:33:23.520 | So like BERT for sequence classification

00:33:25.760 | or extraction for like abstractive summarization.

00:33:28.400 | And then companies that took it to production

00:33:30.320 | would do like significant retrains or like,

00:33:34.560 | yeah, they train a lot more of it.

00:33:36.560 | And then this also just went into like

00:33:41.280 | at what part do you want to start training?

00:33:44.040 | (mouse clicking)

00:33:46.800 | Yeah, I mean that sounds like interesting stuff.

00:33:51.640 | If you have any links,

00:33:55.160 | please drop in the chat and I'll check it out.

00:33:57.520 | So maybe the last thing we can go through here

00:34:04.320 | is the same author Jay Alomar has a notebook

00:34:11.400 | where he shows like hands-on

00:34:14.120 | how to do this movie review sentiment classification.

00:34:19.120 | He uses DistilBERT.

00:34:24.080 | So DistilBERT is a hugging face like recreation of BERT

00:34:29.080 | that like has very comparable performance

00:34:34.920 | on many fewer parameters.

00:34:39.280 | And then to do the classification,

00:34:41.400 | he just uses a basic logistic regression model

00:34:45.320 | from scikit-learn.

00:34:47.720 | And so then the features that go

00:34:51.120 | into this logistic regression model

00:34:52.880 | are just the vector of size 768

00:34:57.040 | that comes out of the DistilBERT embedding.

00:35:04.760 | So a lot of this leans very heavily

00:35:07.920 | on the HuggingFaceTransformers library.

00:35:11.120 | So let's see, that's just installing it, doing imports.

00:35:15.880 | He uses, he must've mentioned it up above, but a...

00:35:23.080 | Maybe it's...

00:35:29.240 | Anyway, there's a particular HuggingFace dataset

00:35:32.480 | that he's using that has the movie sentiment training data.

00:35:36.480 | Maybe he just uploaded it somewhere, he had it.

00:35:43.760 | - I think it's an IMDb dataset.

00:35:47.160 | I think it's one of the Kaggle ones.

00:35:49.360 | Yeah, it's just a Kaggle IMDb.

00:35:50.920 | You have movie reviews, you classify them.

00:35:53.160 | Oh, and on that actually reminds me,

00:35:58.160 | one of the big things that made BERT somewhat popular

00:36:01.560 | was there was another Kaggle competition

00:36:03.840 | on tweet classification of sentiment.

00:36:07.360 | So in tweets, like with previous embeddings,

00:36:10.520 | like Bag of Words or Glove or Elmo,

00:36:13.880 | if you have stuff like, "I hate this so much,"

00:36:18.880 | that in some contexts in tweets could still be positive,

00:36:22.880 | even though it's very negative.

00:36:24.240 | And when you just look at lexical understanding of words,

00:36:30.480 | it's very negative, but then BERT embeddings

00:36:33.080 | were what really dominated that.

00:36:35.320 | And then for like a few years,

00:36:36.320 | they kept doing follow-ups on that.

00:36:37.640 | But IMDb and tweet classification

00:36:40.520 | were versions that they used in a lot of these demos.

00:36:44.600 | - So let's see.

00:36:53.520 | So here we're just uploading or downloading BERT,

00:36:58.920 | DistilBERT from HuggingFace

00:37:02.200 | and getting the model initialized.

00:37:06.240 | So you can see it's just a few lines of code there.

00:37:09.040 | So we have to do a few things like tokenize it

00:37:16.800 | and then add padding.

00:37:19.600 | So this is so that all of the sequences

00:37:23.760 | can be run in parallel.

00:37:26.960 | So we need to pad out so that they're all the same length.

00:37:30.320 | And then we need to mark the padded sections as masked

00:37:37.120 | so that BERT doesn't get confused at thinking like

00:37:41.680 | what's empty space is actual sequence

00:37:45.400 | that we want it to process.

00:37:48.800 | So then you can see this diagram here.

00:37:55.520 | And again, apologies, I'll try to zoom in again.

00:37:58.440 | Oh, it worked this time.

00:38:00.000 | So this just takes the input text,

00:38:04.720 | runs it through DistilBERT

00:38:06.400 | and comes out with the embeddings.

00:38:08.200 | So that's all of this thing does.

00:38:15.120 | And then the one tricky part about all of this

00:38:20.120 | is you need to pick out exactly

00:38:23.800 | which values from this,

00:38:28.800 | I guess it's three-dimensional tensor

00:38:31.720 | you want to predict on.

00:38:34.040 | So if you remember back from here,

00:38:37.040 | we just want the very first output.

00:38:45.120 | We wanna ignore all of these other ones.

00:38:51.680 | So he draws out in detail like

00:38:55.960 | how exactly you pull just those vectors

00:38:59.760 | out of this three-dimensional tensor.

00:39:03.920 | And then it's pretty straightforward

00:39:06.320 | machine learning after that.

00:39:08.080 | You just turn those 768 dimensions into features

00:39:17.800 | and do a test train split,

00:39:21.400 | train your logistic regression model,

00:39:25.040 | and then run, you know,

00:39:31.000 | once you've got the model trained,

00:39:32.040 | you can run a score and it gets 82%.

00:39:35.560 | So assuming it's a 50/50 split,

00:39:39.320 | then the expected amount just from random chance

00:39:42.760 | would be 50%.

00:39:44.560 | So there is a significant increase

00:39:48.560 | using BERT to do classification,

00:39:51.280 | but obviously still plenty of room for improvement.

00:39:55.680 | Okay, and down here, it says the highest accuracy score

00:40:00.360 | for this data set is currently 96.8.

00:40:03.880 | So as you can see,

00:40:06.640 | things have come a long way since 2019,

00:40:10.080 | but still a useful model to start with

00:40:15.080 | for any classification tasks or clustering.

00:40:23.240 | If you just wanna see what tech sequences

00:40:26.800 | are close to each other in embedding space,

00:40:28.760 | you can use it for that as well.

00:40:30.920 | So that's about all I had for prepared stuff.

00:40:39.280 | I'll stop sharing and then maybe go through the chat.

00:40:44.280 | If anyone wants to chime in,

00:40:48.600 | add any color commentary, feel free to do that.

00:40:52.800 | - Someone linked the paper on the academic budget,

00:41:02.680 | the 24-hour BERT,

00:41:05.280 | and then I was also trying to think

00:41:07.360 | where apparently MosaicML showed how you can do it for $20

00:41:10.600 | pre-trained BERT from scratch now.

00:41:12.560 | So yeah, $20, they did like eight A100s for an hour,

00:41:17.560 | and they're able to match the glue store of basic BERT

00:41:21.640 | with their recipe.

00:41:23.080 | Kind of interesting.

00:41:25.480 | So a note Eugene Cha made is researchers felt

00:41:30.480 | 10K training is expensive.

00:41:32.560 | So I remember mapping this out.

00:41:35.320 | BERT, all these things that compare like 24-hour,

00:41:39.880 | one-hour BERT, they trained BERT

00:41:42.600 | for four days of TPU v3 equivalent,

00:41:46.960 | which is like at the time, let's say eight to $10 an hour,

00:41:50.760 | which is like 10 to 15K.

00:41:52.720 | But then there wasn't just BERT-based.

00:41:54.960 | There was BERT-based, there was BERT-large,

00:41:56.800 | there was BERT-small.

00:41:57.800 | There's a bunch of experiments.

00:41:59.160 | The BERT-large was trained for more than four days.

00:42:02.240 | Like the cost equivalent is 50K on that,

00:42:05.040 | 10K on the regular BERT-based, less on the little one.

00:42:08.440 | And then you got to add in like the time and the R&D.

00:42:12.280 | Oh, it was well more than a 10K project at Google.

00:42:16.400 | The BERT-large itself was already a 50K train run,

00:42:20.080 | plus 10 to 15 for BERT-based, plus just experimentation.

00:42:24.520 | So expensive, expensive.

00:42:27.680 | - I think it's more amount of labs

00:42:29.600 | would love to hear those numbers for SOTA.

00:42:32.520 | - Right now.

00:42:33.360 | - True, true.

00:42:39.400 | - I'm just reading through the chat.

00:42:52.280 | - Mm-hmm.

00:43:01.520 | - Did we get a volunteer for next week

00:43:03.360 | or still waiting on that?

00:43:05.160 | - Anyone else next week?

00:43:09.760 | Any other questions on this, by the way?

00:43:12.360 | - I do have a random one.

00:43:13.880 | Because this is regarding the embedding science, right?

00:43:18.840 | Even though I joke about this was the era before the GPU,

00:43:23.840 | the gaming GPU folks came in and said,

00:43:26.400 | hey, you need to be divisible by 64, 32, or power two, right?

00:43:33.280 | Does TPUs not have the divisible by 64 batch?

00:43:39.120 | I mean like optimization when it comes

00:43:41.760 | to embedding size characteristic.

00:43:44.440 | That's why they have all these weird embedding

00:43:46.640 | size on TPUs related training.

00:43:48.960 | - I don't think it's TPU based.

00:43:54.120 | I have like old notes that I'm recalling

00:43:56.520 | where I dug through why they specifically did 768 and 512.

00:44:02.240 | And also someone noted in chat that's a limitation.

00:44:06.280 | There's other work that extends this out

00:44:09.760 | to like SentenceBERT that extends the embedding dimension.

00:44:15.200 | And they were also pretty small.

00:44:18.560 | But back to Eugene's point of, is it hardware limitation?

00:44:22.600 | It's not.

00:44:23.480 | It was well divisibility between layers

00:44:26.520 | and adding layers and a bunch of stuff.

00:44:28.440 | I really can't remember the specifics of the reasons.

00:44:31.760 | I'll dig through some old notes,

00:44:33.400 | but someone broke down the per layer map

00:44:35.520 | and sending through inputs.

00:44:38.600 | And there was a decent range reason for why all this.

00:44:43.160 | It's also like there's 12 layers divisible by 768.

00:44:48.000 | It went down that path.

00:44:51.040 | But also it wasn't like someone

00:44:52.800 | from Google that worked on BERT.

00:44:54.440 | It was just mapping through the input

00:44:56.760 | through every layer and all of this math working out.

00:44:59.200 | And then a reason for like, oh, here's why this,

00:45:01.800 | why not this?

00:45:02.840 | And I was like, sounds good.

00:45:04.280 | Checks out to me.

00:45:05.280 | I can probably find this if I look.

00:45:08.520 | It's in some notes from a couple of years ago.

00:45:11.600 | (indistinct)

00:45:14.000 | - Eric, someone has asked,

00:45:20.160 | does BERT pre-training objective MLM

00:45:23.240 | like mass language modeling follow the same

00:45:25.240 | LLM scaling laws as GPTs?

00:45:27.520 | - That's a good question.

00:45:31.680 | I don't know if there's been enough research in that area

00:45:38.080 | to like come to any conclusion.

00:45:41.680 | So I, like when I was researching this presentation,

00:45:46.840 | I went to, I think it was paperswithcode.com

00:45:51.840 | or something like that and looked at all the top papers

00:45:57.720 | for text classification.

00:45:59.800 | And like a lot of them were pre or 2021 or previous.

00:46:05.880 | And so it seems like this direction of like encoder only

00:46:10.880 | or bi-directional has,

00:46:14.320 | well, I don't know about the bi-directional part,

00:46:15.720 | but at least encoder only,

00:46:17.320 | research has been pretty sparse recently.

00:46:20.920 | So for example, I don't know if anyone's spending

00:46:26.360 | millions of dollars to train like, you know,

00:46:30.440 | super BERT or something like that.

00:46:35.520 | - I think it's-- - Well, the Rekka AI guys

00:46:37.800 | seems to play.

00:46:38.640 | - No, the last time I looked at this,

00:46:41.840 | like leaderboard and hugging face,

00:46:43.920 | I think it was all led by these transformed LLMs

00:46:48.360 | that now get the best performance,

00:46:51.240 | like a Mistral 7B turned into embedding model.

00:46:55.920 | - Is that, sorry, I missed.

00:47:02.440 | Is that for text classification?

00:47:05.320 | - Well, yeah, I guess at the core,

00:47:08.920 | it's all turning text into an embedding.

00:47:11.600 | So yeah.

00:47:14.040 | - Could you drop a link in the chat?

00:47:18.800 | - Yeah, sure.

00:47:20.480 | - Yeah, it seems like that would lead

00:47:21.920 | to higher performance with this classification,

00:47:26.040 | but I'd have to do a little bit more research on that.

00:47:32.720 | - Well, there's two pieces as well, right?

00:47:35.480 | So for mass language modeling,

00:47:37.880 | a lot of the scaling law papers

00:47:39.920 | directly showed why decoder only token prediction

00:47:43.200 | is better scaling than mass language models.

00:47:45.880 | One is purely when you mask 15% of tokens,

00:47:48.720 | you train on the mass.

00:47:49.840 | So you lose a lot of data.

00:47:51.720 | You need more quality data.

00:47:54.520 | You're just straight training on less, right?

00:47:56.680 | If you have a data set of a trillion tokens,

00:47:59.120 | you can mask 15% of them and train on learning the 15%,

00:48:03.000 | or you can train on all trillion of them.

00:48:05.360 | That's a straight scaling.

00:48:06.640 | Now, if you have 15 trillion tokens versus 1 trillion,

00:48:09.680 | that's another question,

00:48:11.160 | but for embedding tasks in smaller models,

00:48:15.560 | the better trade-off scaling curve at the start

00:48:19.360 | for encoder learns better with less tokens at first,

00:48:23.960 | but then extending this out in pure scaling laws,

00:48:28.760 | yeah, you lose a lot of your training data, right?

00:48:31.040 | And then that was one of the big points

00:48:33.040 | of why do we do next token prediction,

00:48:35.840 | because it scales better than other tasks, right?

00:48:38.760 | So scaling laws were made to show better objectives,

00:48:41.480 | so directly against it.

00:48:42.760 | But then at small scale and stuff,

00:48:45.440 | there's benefit in this,

00:48:48.240 | specifically for like edge models,

00:48:51.120 | like you can deploy a BART as a guardrail live,

00:48:55.800 | and you can have it intercept every query

00:48:57.840 | because it can act in milliseconds

00:48:59.360 | versus LLMs will still take longer, right?

00:49:02.280 | So at a smaller scale, they'll be better.

00:49:04.800 | There was another part to this that I'm blanking on.

00:49:09.040 | Oh, Raika AI,

00:49:10.120 | they're doing encoder decoder generation models

00:49:12.880 | where they're adding decoder heads to encoders,

00:49:15.200 | and they're scaling them up to billions of parameters.

00:49:18.320 | They're a case study of spending money

00:49:21.720 | to train them up pretty big.

00:49:24.120 | (mouse clicking)

00:49:26.880 | There's a question from Isaac in the chat

00:49:31.960 | about my use case at work.

00:49:35.400 | So currently where we're at in the project

00:49:38.520 | is we need to accumulate some good training data.

00:49:43.520 | So we don't have enough training data

00:49:45.360 | to like actually train a BART or that type of model.

00:49:51.520 | So to start with, we're just using LLMs and prompts

00:49:56.520 | to like do some logical classification

00:50:01.920 | to like kind of bootstrap until we get enough data.

00:50:07.600 | And then also to create a feedback loop

00:50:09.600 | where we can get feedback from people

00:50:12.920 | so that we'll have enough like solid training data

00:50:15.880 | so we can actually train a model.

00:50:18.880 | The main purpose of it being faster performance

00:50:22.640 | as Vibhu mentioned,

00:50:24.320 | then you can respond in milliseconds

00:50:28.400 | versus multiple seconds or tens of seconds

00:50:33.400 | if you're using an LLM.

00:50:35.440 | (mouse clicking)

00:50:38.200 | - Well, thank you, Eric.

00:50:54.760 | - Yeah.

00:50:55.600 | - Always appreciate the OG papers.

00:50:57.240 | - Yeah, it's good to--

00:51:02.720 | - Do you wanna ask one volunteer next week?

00:51:05.640 | - Any paper, it doesn't have to be,

00:51:07.240 | I'll look back at anything that I'm interested in.

00:51:09.800 | (mouse clicking)

00:51:15.560 | - Yeah, I don't know if there's any paper

00:51:26.800 | that's caught my eye recently.

00:51:28.280 | I guess like we talked a little bit about embedding papers,

00:51:33.320 | people are interested in embedding.

00:51:35.120 | - Did we ever do the Geno2 embeddings paper?

00:51:39.560 | - No, there's also NOMIC embed.

00:51:43.680 | I thought like, I didn't see the Geno one,

00:51:47.000 | but the NOMIC one was pretty detailed

00:51:48.640 | in terms of what their process was.

00:51:51.680 | So, interesting.

00:51:53.520 | - I might be mixing up the paper thought,

00:51:56.600 | but I think I remember we went through one embedding paper.

00:52:01.480 | Maybe it's NOMIC, maybe it's NOMIC, I don't know.

00:52:04.080 | But yeah, probably it wasn't there.

00:52:06.680 | - The NOMIC compares directly to Geno.

00:52:13.120 | I have the exact same thing.

00:52:14.880 | I haven't seen the NOMIC one as much.

00:52:16.320 | I just know Geno was the one open source,

00:52:18.920 | AK context, very detailed.

00:52:21.800 | Here's how to do embeddings from scratch

00:52:23.360 | and fine tune them paper.

00:52:24.840 | But I guess if NOMIC is the same thing,

00:52:27.000 | 50/50 if anyone wants to take one or both,

00:52:29.680 | I would love both.

00:52:31.600 | - Oh, there's Geno three now, crazy.

00:52:33.440 | - Okay, well, I will volunteer for NOMIC or Geno.

00:52:49.720 | If anyone else has papers they wanna cover in the meantime,

00:52:54.800 | let's cover them,

00:52:55.720 | but otherwise I don't wanna drag this too long.

00:52:58.800 | Yeah, nice chat.

00:53:01.040 | - All right, and thanks, Eric.

00:53:04.080 | - Yeah, thank you.

00:53:05.240 | - Thanks, Eric.

00:53:07.000 | Thanks, everyone.

00:53:07.840 | - Bye.

00:53:08.680 | - See ya.

00:53:09.520 | See ya.