ModernBERT: Modern Bidirectional Encoder for Fast, Efficient, and Long Context Finetuning + Inference

00:00:00.000 |
then you can have, I don't know, a full piece of code rather than just one token at a time. 00:00:07.280 |
So I think that was one of the biggest things, because it doesn't surpass everything. 00:00:13.000 |
So, for example, in speed it doesn't surpass the base BERT, but in everything else it does. 00:00:22.400 |
So I think that was one of the key highlights for me. 00:00:27.080 |
I'm still trying to figure out where the code data is mentioned. 00:00:32.840 |
I know I read it inside of here, but I don't know where it is. 00:00:47.160 |
Well, anything else that we should cover at a high level? 00:00:50.400 |
Otherwise, we should, I don't know, go through them. 00:00:55.760 |
This is relatively unstructured because we didn't appoint someone to lead this discussion. 00:01:06.560 |
But I would say on my end, I think training it on two trillion tokens, like basically 00:01:14.720 |
kind of updating every dimension of normal BERT into 2025 BERT, which I think was the 00:01:22.060 |
original name, makes sense, including this kind of alternating global-local attention, which 00:01:30.760 |
is something very, very state of the art that I was only seeing in some of the newest models. 00:01:40.200 |
And it's very surprising to see this applied to BERT. 00:01:43.960 |
But I felt like it was very well written in terms of the justification of how much these 00:01:50.320 |
things are downloaded and how they deserve to be updated, because they actually are used that much. 00:02:05.480 |
One thing that I was, I've never understood in a lot of clarity is why the bidirectional 00:02:13.480 |
encoder is so much more efficient than a causal decoder only model. 00:02:24.780 |
I mean, I don't know if anyone can explain it; it's probably well known, but I never understood it. 00:02:33.440 |
I mean, this is where if Ben is available to speak, he can probably be authoritative. 00:02:40.860 |
To me, it's just like, the only reason you do decoder-only is if you need to generate text token by token. 00:02:50.300 |
But if your job is to fill in the middle or to do classification, then you might as well attend in both directions. 00:03:02.140 |
That's sort of, that's kind of what I suspected. 00:03:08.060 |
Yeah, I don't know if anyone else has stuff they want to cover. 00:03:17.980 |
If I may put in my two cents, the other advantage with encoder models is richness of information 00:03:25.980 |
because you're not just attending to previous tokens, but every token's encoding is going 00:03:35.820 |
So if you have some token that's making a future reference, it's not very common, but 00:03:42.180 |
sometimes you get an adjective or some kind of reference to a future token that hasn't appeared yet. 00:03:48.260 |
Autoregressive models don't get a chance to encode that information because they haven't seen it yet. 00:03:54.620 |
It's fine when you just need to predict the next token. 00:03:58.940 |
Probably the next token's encoding is going to try to reflect that information. 00:04:03.300 |
But when you're doing stuff like classification, like NER, named entity recognition, or you're 00:04:08.300 |
doing sentiment analysis, you want that entire spectrum of information of both past and future tokens. 00:04:16.620 |
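A minimal sketch of the masking difference being discussed here: a causal decoder masks out future positions, while a bidirectional encoder lets every token attend to every other token (illustrative only, not from the paper):

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw attention scores for one head

# Decoder-style causal mask: token i can only attend to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal_attn = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

# Encoder-style bidirectional attention: no mask, so an earlier token (e.g. an
# adjective) can pick up information from a token that only appears later.
bidirectional_attn = torch.softmax(scores, dim=-1)

print(causal_attn[1])         # only the first two weights are non-zero
print(bidirectional_attn[1])  # all five weights are non-zero
```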
So the bidirectionality is what you're saying, basically. 00:04:23.900 |
I think maybe if I were to play devil's advocate and disagree here, it would basically be that 00:04:32.140 |
there's no time to think, there's no chain of thought. 00:04:34.860 |
You just have to run attention on it, and then immediately come up with a classification. 00:04:46.260 |
You have to encode all of the information in the embedding and go from there. 00:05:00.420 |
Who has questions or objectives they want to mention? 00:05:10.260 |
You spoke with the guys at Windsurf and they said that they were doing work on a retrieval 00:05:17.060 |
model or at least a way of implementing retrieval so that it would get the right code. 00:05:23.860 |
Do you know if they were training their own embedding model, and if so, how does it compare? 00:05:27.540 |
And if not, how does this model come into play? 00:05:35.140 |
I would completely not be surprised if they were training their own model. 00:05:41.900 |
They said they were working on retrieval, but I don't know if they said it was their own embedding model. 00:05:50.960 |
This is not one of those heavy lifts for them. 00:05:55.100 |
The beauty of these smaller models is that it's relatively cheap. 00:06:11.260 |
I'm just throwing questions out here, but does anyone know of... 00:06:17.580 |
Do you necessarily... would this embedding model be actually useful for, for example, 00:06:23.700 |
encoding a text description of a function and then retrieving that function, assuming you've indexed the functions themselves. 00:06:32.540 |
Basically, RAG based on the actual text description versus the function itself. 00:06:43.900 |
They did put details in the paper about their... 00:06:48.900 |
I think the level of detail that they put on the training and the testing, the evaluation, was great. 00:06:54.940 |
So, for example, in the training piece, instead of using 15% of masking, they used 30%. 00:07:04.660 |
They started with a 3 billion token warmup, and then they kept the learning 00:07:11.180 |
rate constant, and then if they saw kind of a plateau, they would go back 00:07:16.900 |
and adjust it, but just towards the very end of it. 00:07:19.460 |
Also, it was actually quite interesting that a lot of this stuff, they tested it on an RTX 4090. 00:07:27.780 |
I did some fine tuning and I have an RTX 2080, which is similar to the 3000 series, 00:07:36.460 |
and it took about give or take 40 minutes to fine tune it on a 200,000 sample data set. 00:07:44.820 |
So, that was quite interesting, and I think the level of detail in the training was incredible, 00:07:53.460 |
and also in the evaluation piece, they do answer your question about the coding data. 00:08:07.940 |
Okay, I have no idea how to guide this one, especially because normally I would try to 00:08:31.860 |
But does anyone else have like a thing that they want to show that I can give you the floor for? 00:08:42.460 |
If you want me, I can take it over and like walk through the notes that I took. 00:09:01.180 |
Basically, Nomic just added their Nomic dataset and created ModernBERT Embed, or something like that. 00:09:08.620 |
And that was useful, like something that you can start using for RAG, whereas ModernBERT on its own is just the base encoder. 00:09:19.500 |
I tried to use it a little bit on a dataset that I don't think was very well represented 00:09:25.780 |
in the data that they used, and it wasn't that good for that use case. 00:09:29.940 |
It was with a law-specific one, trying to detect the type of legal case that was in the text. 00:09:40.940 |
And then it didn't really do very well until I fine-tuned it. 00:09:44.140 |
So it's really good off the bat, but for some use cases, you might want to fine-tune it. 00:09:49.540 |
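As a rough sketch of that kind of usage, this is what retrieval with a ModernBERT-based embedding model looks like via sentence-transformers; the checkpoint name nomic-ai/modernbert-embed-base is an assumption about the model being referred to, so substitute whichever embedding checkpoint you actually use:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint name; swap in the embedding model you actually use.
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

docs = [
    "The tenant filed a claim for breach of the lease agreement.",
    "The defendant was charged with driving under the influence.",
]
query = "contract dispute over a rental property"

doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity ranks the documents; for a narrow domain like legal text,
# you may still need to fine-tune, as mentioned above.
print(util.cos_sim(query_emb, doc_emb))
```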
Okay, so I took a few notes and I highlighted the things that I thought were quite interesting. 00:09:57.140 |
And if anybody has any questions, I might be able to answer them, I hope. 00:10:04.900 |
So I think I mentioned this a second ago that one of the coolest pieces is that this can 00:10:12.100 |
extract a lot of power from a single GPU, or you can do a lot with a single GPU. 00:10:18.600 |
That's something that I personally prefer if I have the ability to do so, but I know 00:10:22.540 |
that we all know that a lot of times you just need a much more powerful one. 00:10:28.740 |
So they call it the most speed and memory efficient encoder, designed for inference on common GPUs. 00:10:40.540 |
They point out a couple of nice things regarding the drawbacks. 00:10:45.620 |
So previous BERTs had a length limitation of about 512 tokens, which is suboptimal for 00:10:56.860 |
model design, but in this one they increase the sequence length to 8,192 tokens. 00:11:04.500 |
And the nice thing about it is that they did it in a way in which they could parallelize 00:11:10.300 |
it nicely across the couple of GPUs that they used for training. 00:11:22.620 |
So they can be used in conjunction with LLMs, for example, detecting 00:11:27.500 |
toxic prompts and preventing responses, or routing queries in agentic frameworks. 00:11:33.380 |
I added here that this might not be super obvious for new AI engineers that have only worked with LLMs. 00:11:38.900 |
So you might want to have in a pipeline a smaller model, like a BERT, detecting 00:11:44.860 |
specific words, specific keywords, and then use that to say, like, oh, is there toxic language in this prompt? 00:11:52.020 |
Or is there toxic language in this response back to the user? 00:11:56.220 |
And so you can have a pre and post LLM generation in an agentic framework. 00:12:04.480 |
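A hedged sketch of that pre/post-generation guardrail idea: a small encoder classifier screens the prompt before, and the response after, the expensive LLM call. The classifier name and call_llm are placeholders, not real identifiers:

```python
from transformers import pipeline

# Placeholder model id; any small encoder fine-tuned for toxicity would do.
toxicity = pipeline("text-classification", model="my-org/toxicity-classifier")

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the actual LLM call

def guarded_generate(prompt: str) -> str:
    if toxicity(prompt)[0]["label"] == "toxic":      # pre-generation check
        return "Sorry, I can't help with that."
    response = call_llm(prompt)
    if toxicity(response)[0]["label"] == "toxic":    # post-generation check
        return "Sorry, I can't help with that."
    return response
```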
Previously, the training data was limited to narrow domains and, specifically, lacked coding data. 00:12:12.180 |
So it was quite nice to see kind of a new, refreshed look at what BERT could be. 00:12:21.900 |
And then, overall, both the base and large models reach strong overall performance. 00:12:26.580 |
You saw the image that Strix showed on the screen. 00:12:33.880 |
They disabled the bias term in all the linear layers, except the final decoder linear layer. 00:12:41.140 |
I keep hitting the ones that I -- sorry about that, Iro. 00:12:53.980 |
So they disabled the bias in all linear layers except the final decoder linear layer. 00:13:08.400 |
And then they also disabled the bias in the layer norms. 00:13:14.760 |
And then they also used rotary positional embeddings, RoPE, instead of absolute positional embeddings. 00:13:23.480 |
So one of the things that was interesting about this paper is that a lot of the stuff 00:13:26.820 |
that they included inside the paper for creating ModernBERT are based 00:13:32.560 |
on papers that all came out in 2024. 00:13:36.880 |
I was about to start counting how many there were from 2024, but I think the final count was pretty high. 00:13:48.280 |
So in alternating attention, the attention layers in ModernBERT alternate between global attention, 00:13:54.200 |
where every token within a sequence attends to every other token, and local attention, 00:13:58.960 |
where tokens only attend to each other within a small sliding window. 00:14:04.560 |
So every third layer employs global attention with a RoPE theta of 160,000, and the remaining layers use 00:14:13.920 |
a 128-token local sliding window attention with a RoPE theta of 10,000. 00:14:18.360 |
I feel like if somebody else here knows a little bit more about exactly what this means 00:14:23.200 |
in much simpler terms, that would be super useful. 00:14:25.720 |
Because I was trying to wrap my head around exactly what this looks like in the implementation. 00:14:33.080 |
But I guess it's more a matter of going into the code and seeing exactly how this gets implemented. 00:14:43.720 |
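For intuition, here is one way the alternating pattern could be written down, using the numbers from the paper (every third layer global with RoPE theta 160,000, the rest a 128-token local window with theta 10,000); this is an illustrative config sketch, not the actual ModernBERT source:

```python
NUM_LAYERS = 22  # ModernBERT-base

layer_configs = []
for i in range(NUM_LAYERS):
    if i % 3 == 0:  # every third layer: full global attention
        layer_configs.append({"attention": "global", "rope_theta": 160_000, "window": None})
    else:           # remaining layers: local sliding-window attention
        layer_configs.append({"attention": "local", "rope_theta": 10_000, "window": 128})

for i, cfg in enumerate(layer_configs[:6]):
    print(i, cfg)
```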
Something interesting as well is that it unpads inputs before the token embedding layer and 00:14:48.880 |
optionally repads the model outputs, leading to a 10 to 20% performance improvement over other unpadding methods. 00:14:59.480 |
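Conceptually, unpadding just means dropping the pad tokens and keeping offsets so the outputs can be re-padded afterwards; a toy illustration only, not the actual implementation (which tracks cu_seqlens-style offsets inside the attention kernels):

```python
import torch

batch = torch.tensor([
    [101, 7592,  102,    0,    0],   # 3 real tokens, 2 pad
    [101, 2088, 2003,  102,    0],   # 4 real tokens, 1 pad
])
attention_mask = batch != 0

# Unpad: keep only real tokens, remember per-sequence lengths.
unpadded = batch[attention_mask]        # shape (7,), no wasted compute on pads
seq_lens = attention_mask.sum(dim=1)    # tensor([3, 4]) for later re-padding

# ... the model runs on the unpadded tokens ...

# Optionally repad the outputs if the caller expects a rectangular batch.
repadded = torch.zeros_like(batch)
repadded[attention_mask] = unpadded
print(unpadded, seq_lens, repadded, sep="\n")
```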
It also uses a mixture of FlashAttention 3 for global attention layers and FlashAttention 2 for local attention layers. 00:15:08.820 |
I believe in the latest version of Transformers, FlashAttention 3 has not yet been implemented. 00:15:14.400 |
So if you go to the repo, you're gonna end up seeing that it says, hey, if you're gonna 00:15:18.640 |
be using FlashAttention 3, make sure you pip install from the URL directly to the latest git commit 00:15:24.220 |
on the main branch rather than the latest version available on PyPI. 00:15:31.120 |
Also, using torch.compile yielded a 10% improvement in throughput with negligible compilation overhead. 00:15:45.520 |
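The torch.compile usage itself is a one-liner; the model id below is assumed to be the Hugging Face checkpoint, and the ~10% figure is the paper's number, not something this snippet measures:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")  # assumed HF id
model = torch.compile(model)  # compile once; later forward passes run faster
```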
And so, I mentioned this one about the design: it was designed through many 00:15:54.320 |
small-scale ablations to maximize the utilization of a basket of common GPUs. 00:16:00.720 |
I think if you are new to this, or if you have worked as a data scientist 00:16:06.480 |
but maybe don't read a whole lot of research papers, you might actually not know what an ablation is. 00:16:11.560 |
I had to Google it as well, even though I've read a few, but it just wasn't in my head. 00:16:16.020 |
So ablation studies are experiments where researchers systematically remove, or 00:16:23.400 |
ablate, certain components of a model or system to understand their individual impact on performance. 00:16:30.480 |
It is kind of like a controlled experiment when you're creating an architecture. 00:16:35.160 |
So the reason I mention that is because you're going to see the word "ablation" thrown around a lot in the paper. 00:16:43.160 |
And I'm sure a lot of people here probably know it. 00:16:46.980 |
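A toy illustration of what an ablation study amounts to in code: evaluate the same setup with one component switched off at a time and compare the metric (names here are made up):

```python
def evaluate(config: dict) -> float:
    # Placeholder: train a model with this config and return a validation score.
    return 0.0

baseline = {"rope": True, "glu": True, "alternating_attention": True}
results = {"baseline": evaluate(baseline)}

for component in baseline:
    ablated = {**baseline, component: False}      # remove one component at a time
    results[f"no_{component}"] = evaluate(ablated)

# The drop relative to the baseline is that component's individual impact.
print(results)
```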
So it uses 22 and 28 layers for the base and large models, for a total of 149 and 395 million parameters, respectively. 00:16:57.840 |
And then the base one has a hidden size of 768 with a GLU expansion of 2,304, while 00:17:07.840 |
the large one has a hidden size of 1,024 and a GLU expansion of 5,248. 00:17:19.200 |
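Collecting those quoted dimensions in one place (values as reported in the paper; this is just a reference dict, not the actual config class):

```python
MODERNBERT_DIMS = {
    "base":  {"layers": 22, "hidden_size": 768,  "glu_expansion": 2304, "params": "149M"},
    "large": {"layers": 28, "hidden_size": 1024, "glu_expansion": 5248, "params": "395M"},
}
```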
It's trained on two trillion tokens of primarily English data from a variety of data sources, including 00:17:23.720 |
web documents, code, and scientific literature. 00:17:26.360 |
So I don't know -- I don't remember seeing if there was a direct link to the dataset 00:17:37.880 |
They say that they use a modified version of the OLMo tokenizer, which provides better 00:17:42.440 |
token efficiency and performance on code-related tasks. 00:17:46.320 |
So this adds to it being good with code and coding files in general. 00:17:57.540 |
And then it's nice that it keeps the same things that we are used to seeing in encoder models: 00:18:05.080 |
the special tokens like [CLS] and [SEP], so you can play around with those. 00:18:15.260 |
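You can check the special tokens yourself; the Hugging Face id answerdotai/ModernBERT-base is assumed here:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")  # assumed id
print(tok.cls_token, tok.sep_token, tok.mask_token)  # e.g. [CLS] [SEP] [MASK]
print(tok("def add(a, b): return a + b")["input_ids"])
```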
So I mentioned this earlier at the very beginning, that apparently it was very common to use 00:18:19.800 |
15% of masking when you are doing the training. 00:18:25.200 |
They said that the 15% rate, used since the original BERT paper, has since been shown to be suboptimal. 00:18:34.460 |
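For reference, this is roughly how a 30% masking rate shows up with the standard Hugging Face MLM collator (the paper's own training code may differ, and the tokenizer id is assumed):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")  # assumed id
collator = DataCollatorForLanguageModeling(
    tokenizer=tok,
    mlm=True,
    mlm_probability=0.30,  # ModernBERT uses 30% instead of the classic 15%
)
batch = collator([tok("masked language modeling example")])
print(batch["input_ids"], batch["labels"], sep="\n")
```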
And then -- oh, a couple of things that -- about the training set that I thought were interesting. 00:18:40.120 |
So, for the learning rate, they 00:18:45.020 |
trained ModernBERT-base at a constant learning rate of 8e-4. 00:18:57.860 |
But they started with a 3 billion token warmup first. 00:19:01.960 |
And then, after a 2 billion token warmup, they trained ModernBERT-large at a smaller learning rate 00:19:06.760 |
of 5e-4 for 900 billion tokens. 00:19:12.200 |
Then they rolled back and restarted the training at 5e-5 for the remaining 800 billion 00:19:20.120 |
tokens after seeing a large loss plateau for around 100 billion tokens. 00:19:28.120 |
For the batch size schedule, they warm up the batch size from 768 to 4,608 over 50 billion 00:19:40.880 |
tokens for the base model, and from 448 to 4,928 over 10 billion tokens for the large model. 00:19:50.000 |
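A simplified stand-in for the learning-rate shape being described (short warmup, long constant phase, decay only at the very end); the exact token counts and decay shape in the paper differ, so treat the numbers here as placeholders:

```python
def learning_rate(tokens_seen: float,
                  peak_lr: float = 8e-4,       # base-model rate quoted above
                  warmup_tokens: float = 3e9,  # 3B-token warmup quoted above
                  total_tokens: float = 2e12,  # placeholder total
                  decay_tokens: float = 1e11) -> float:
    if tokens_seen < warmup_tokens:                   # linear warmup
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen < total_tokens - decay_tokens:     # long constant phase
        return peak_lr
    remaining = max(total_tokens - tokens_seen, 0.0)  # decay at the very end
    return peak_lr * remaining / decay_tokens

for t in (1e9, 1e11, 1.95e12, 2e12):
    print(f"{t:.2e} tokens -> lr {learning_rate(t):.2e}")
```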
And they talk a little bit how they do the context length extension. 00:19:58.880 |
And it's quite nice that they took this idea from two papers that were recently released. 00:20:12.520 |
So I think this might address a little bit Sebastian's question. 00:20:19.520 |
So they evaluate the model on both single-vector dense passage retrieval, 00:20:23.480 |
where the entire document is put into a single dense vector, 00:20:28.440 |
and the multi-vector ColBERT setting. 00:20:36.680 |
So they use different methodologies to evaluate it. 00:20:40.120 |
In some of them, they retrieve documents in their entirety and add them to the context. 00:20:49.440 |
And then in others, they will have chunks of a piece of text. 00:20:54.440 |
And then they train every base model on the MS MARCO dataset with 00:21:05.120 |
mined hard negatives on 1.25 million samples, with a batch size of 16 and a learning rate 00:21:11.200 |
warmup over 5% of the training, using sentence-transformers. 00:21:16.120 |
So it's nice that you still see sentence transformers, like, still rocking it. 00:21:21.240 |
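A hedged sketch of that sentence-transformers fine-tuning setup (MS MARCO-style triplets, batch size 16); the loss choice and model id here are assumptions, not the paper's exact recipe:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("answerdotai/ModernBERT-base")  # assumed id, pooling added automatically

train_examples = [
    # (query, positive passage, mined hard negative)
    InputExample(texts=["what is rope in transformers",
                        "RoPE is a rotary positional embedding scheme...",
                        "The capital of France is Paris."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # a common choice for this kind of data

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```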
They adapt the training setup of JaColBERTv2.5, which was also from a recent paper. 00:21:29.960 |
And then they train all the models distilling the knowledge from a teacher model using KL 00:21:34.480 |
divergence between the normalized teacher and the student scores. 00:21:41.720 |
So this is the results for all the models in the different evaluators that they use. 00:21:50.800 |
You can see here at the bottom that the base model is better in this particular one, though 00:22:00.360 |
GTE-en-MLM is a little bit better in these two over here. 00:22:06.600 |
On this table of results, what's really exceptional is that if you look at ColBERT 00:22:13.320 |
versus dense passage retrieval on the multilingual long document retrieval, MLDR, the jump is huge. 00:22:28.360 |
But for BEIR, which I think stands for Benchmarking IR, the jump is not that big. 00:22:34.280 |
So it seems like as your context length gets longer, a ColBERT-based approach really pays off. 00:22:41.280 |
Of course, at a lot more cost, but with a lot more improved metrics. 00:22:51.880 |
>> You can see that there are eight columns of results, right? 00:22:56.240 |
The first three columns are dense passage retrieval. 00:23:02.680 |
And then what you probably want to compare is basically the same benchmarks across settings, BEIR (dense) to 00:23:07.760 |
BEIR (ColBERT), and MLDR out of domain to MLDR out of domain, to see the comparison. 00:23:33.280 |
There's a few -- there's another one as well. 00:23:35.600 |
Well, actually, let me -- let me see if I have something here for it. 00:23:41.720 |
So to mention the programming-related performance, they evaluated all models on the CodeSearchNet benchmark. 00:23:49.800 |
It's a code-to-text benchmark where the model must identify relevant docstrings and comments for code blocks. 00:23:57.360 |
So actually, it's identifying the docstring as opposed to the code itself, but it might 00:24:01.840 |
be useful still in finding a piece of content if you're doing information retrieval. 00:24:09.680 |
But it might be tricky if -- because especially nowadays, like, I don't know about everyone 00:24:14.540 |
here in the call, I don't particularly -- I add comments. 00:24:18.800 |
I add docstrings inside functions, but I don't go to a massive degree -- to a massive length 00:24:25.600 |
to add a lot of comments or a lot of doc -- like a giant docstring. 00:24:29.800 |
But for example, scikit-learn has probably the absolute best docstrings out there. 00:24:36.600 |
If you were looking for functions inside of those, I can imagine that that's an easy one 00:24:41.640 |
But with more recent tools, I can imagine that you might not see the same extensive docstrings. 00:24:55.160 |
They evaluated the benchmarks using the CoIR (CodeIR) framework. 00:25:04.600 |
And all of the models are reusing the best hyperparameters identified in section 3.1.2. 00:25:14.960 |
And then so here they have the -- this is related to memory. 00:25:19.120 |
So here they have memory (max batch size) and inference (in thousands of tokens per second) efficiency 00:25:24.240 |
results on consumer hardware, an RTX 4090, averaged over ten runs. 00:25:33.480 |
I don't know if anybody has any comments on this particular one. 00:25:39.280 |
>> Yeah, I was actually kind of curious about what you were talking about the -- what is 00:25:54.280 |
Is that like -- is it normally evaluated that way? 00:25:57.960 |
By encoding the code and then trying to retrieve the correct docstring? 00:26:03.560 |
Is this like the normal benchmark that's used against encoders? 00:26:10.960 |
>> I was under the impression it's the other way around. 00:26:17.360 |
>> You know, this would be a good thing for Elicit or deep research. 00:26:29.280 |
Maybe we have someone from Cursor in this call, or from any of the major code tools. 00:26:46.920 |
So ModernBERT-large increases its lead despite having comparatively fewer parameters: 00:26:54.080 |
395 million versus GTE-en-MLM large's 435 million. 00:27:01.700 |
So that's nice being able to do a little bit more with less. 00:27:06.400 |
Then ModernBERT outperforms other long-context models with at least a 9 NDCG@10 point lead. 00:27:18.040 |
I was a little bit -- a little bit thrown about. 00:27:21.240 |
Like I didn't really get this and I had to reread this piece. 00:27:24.880 |
So what it essentially means is that it also surpasses all existing base models, 00:27:36.040 |
including DeBERTaV3, becoming the first MLM-trained model to do so; apparently, DeBERTaV3 00:27:43.920 |
base was unbeaten on the GLUE benchmark. 00:27:54.040 |
And then it improved understanding of code at no detriment to its ability to process natural language. 00:28:01.400 |
You often see models that, if they have been fine-tuned too much on code or have 00:28:07.520 |
a large emphasis on code, don't perform that well on other common natural language tasks. 00:28:19.440 |
And then for the evaluation setting, it is able to process batches twice as large 00:28:27.800 |
as every other model on both input lengths. 00:28:33.200 |
But ModernBERT-large is slightly less memory efficient than the original BERT-large on short context inputs. 00:28:45.520 |
But it can still process batches at least 60% bigger. 00:28:49.720 |
So that's just one small piece where it doesn't beat every single 00:28:56.640 |
model in every single characteristic. 00:29:02.400 |
But other than that, almost in every measure, it goes above all the other models. 00:29:10.940 |
It is the first open model to feature full model unpadding, and it 00:29:17.400 |
is the first encoder designed in a hardware-aware way to maximize inference efficiency. 00:29:24.720 |
It is in a class of its own on both single-vector and ColBERT-style long context retrieval benchmarks, scoring 00:29:30.820 |
at least 6.85 and 9.1 percentage points higher than the closest model, respectively, while remaining 00:29:38.400 |
state-of-the-art on short context retrieval in both single and multivector settings. 00:29:44.160 |
And -- oh, I thought that was interesting, that it says that the MLM objective gives 00:29:51.440 |
the model some ability to generate text by suggesting a given token to replace the mask 00:29:57.160 |
token, which could result in generation of harmful content. 00:30:02.200 |
They do say here that, however, it is not primarily a generative model, and as such, 00:30:07.760 |
it has not been trained to, and therefore cannot, generate longer sequences of text, so it might 00:30:12.040 |
predict the next mask and so on in specific ways, but it's not like it's gonna go and write whole passages. 00:30:25.400 |
In terms of the scaling, besides the architectural modifications, a key aspect of the study is data scaling. 00:30:32.360 |
However, other scaling axes, notably in terms of parameters, are still left unexplored, 00:30:39.200 |
so there's -- so what the authors, I think, are saying here is that there's a lot of room 00:30:43.240 |
for improvement, even if this is a large bump in improvement already. 00:30:50.920 |
That's all, like, the highlights that I put when I was going over it. 00:30:54.740 |
If anybody else has another comment or wants to touch on something else. 00:31:05.480 |
>> I think one thing the authors did that's implicit is compare it only against the BERT 00:31:11.960 |
models rather than comparing it against the other, like, collection of them. 00:31:16.360 |
I mean, certainly sizes are different if you were to go on, like, the MTEB leaderboard and compare. 00:31:24.080 |
Yeah, I was somewhat surprised whenever I started to read into it. 00:31:29.200 |
I thought there might be some mention of the other encoding models that are out there, 00:31:37.000 |
>> Yeah, there's a lot of variations of BERT, so I can imagine they only pick a few. 00:31:47.240 |
They do mention in the paper why they pick a couple of them. 00:31:51.400 |
So RoBERTa and also obviously the base BERT, but they definitely could have 00:32:03.560 |
gone further, or maybe they just left it for the reader to do a bit more testing and put stuff out there. 00:32:12.160 |
>> Sorry, I didn't -- your audio cut out for a bit. 00:32:28.880 |
I was saying that I appreciate that you put the NDCG explainer there, the link. 00:32:39.120 |
I just feel like they could just say ranking score and then, you know, footnote it. 00:32:46.400 |
It's always hard to remember all these things when you don't live in this world every day. 00:32:54.600 |
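Since NDCG@10 comes up a lot in these results, here is the standard definition as a small reference implementation (not tied to the paper's evaluation code):

```python
import math

def dcg(relevances: list[float]) -> float:
    # Discounted cumulative gain: relevance discounted by log2 of the rank.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the single relevant document was retrieved at rank 3.
print(ndcg_at_k([0, 0, 1, 0, 0]))  # ~0.5
```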
I think we have Benjamin who said that he wanted to talk after we finish recording. 00:33:09.440 |
I don't really get involved in the training side. 00:33:11.320 |
I just want to see -- understand what I need to know as a user of the thing. 00:33:16.040 |
And then potentially as a fine tuner, you know, for stuff. 00:33:19.280 |
So but I think like a very major advance, I think it's going to be a workhorse tool. 00:33:24.640 |
So like I think multiple of us picked this last week and for good reason. 00:33:28.520 |
So I think we're happy to cut over to the offline part of the convo, unless anyone else has anything. 00:33:39.600 |
I will pause recording and then Ben, Benjamin, you can come up.