then you can have, I don't know, a full piece of code rather than just one token at a time. So I think that was one of the biggest things. It doesn't surpass everything; for example, in speed it doesn't surpass the base BERT, but in everything else it surpasses every other model.
So I think that was one of the key highlights for me. I'm still trying to figure out where the code data is mentioned. I know I read it inside of here, but I don't know where it is. Yeah. It should be very useful for code. Oh, yeah, there we go.
Here. Okay. Well, anything else that we should cover at a high level? Otherwise we should, I don't know, go through them. This is relatively unstructured because we didn't appoint someone to lead this discussion, and I'm not ready to lead it. But I would say on my end, training it on two trillion tokens, basically updating every dimension of normal BERT into a 2025 BERT, which I think was the original name, makes sense, including this kind of alternating global/local attention, which is something very state of the art that I was only seeing in some papers, including from Character AI.
And it's very surprising to see this applied to BERT. But I felt like it was very well written in terms of justifying how much these models are downloaded and how they deserve to be updated, because they are actually much more efficient compared to other models. And that just makes a lot of sense.
One thing that I've never understood with a lot of clarity is why the bidirectional encoder is so much more efficient than a causal decoder-only model. I mean, I don't know if anyone can explain it; it's probably well known, but I never understood it. Yeah. I mean, this is where, if Ben is available to speak, he can probably be authoritative.
To me, it's just that the only reason you do decoder-only is to generate the next token. That is what it is very, very good at. But if your job is to fill in the middle or to do classification, then you might as well look at the whole sentence all at once.
Yeah. Okay. That's kind of what I suspected. I don't know if anyone else has stuff they want to add there. If I may put in my two cents, the other advantage with encoder models is richness of information, because you're not just attending to previous tokens; every token's encoding is going to be based on both past and future tokens.
So if you have some token that's making a future reference -- it's not very common, but sometimes you get an adjective or some kind of reference to a future token that hasn't come yet -- autoregressive models don't get a chance to encode that information, because they haven't seen that future token yet.
It's fine when you just need to predict the next token. Probably the next token's encoding is going to try to reflect that information. But when you're doing stuff like classification, like NER, named entity recognition, or you're doing sentiment analysis, you want that entire spectrum of information of both past and future tokens.
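To make the mask difference concrete, here is a minimal sketch (my own illustration, not from the paper) of the attention masks a causal decoder and a bidirectional encoder use:

```python
# Minimal illustration: a causal decoder masks out future positions,
# while a bidirectional encoder lets every token attend to every other token.
import torch

seq_len = 6

# Causal (decoder-only) mask: position i can only attend to positions 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional (encoder) mask: every position attends to the full sequence.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())         # lower-triangular: no access to future tokens
print(bidirectional_mask.int())  # all ones: past and future both visible
```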
Right. Yeah. So the bidirectionality is what you're saying, basically. Yeah, exactly. I think maybe if I were to play devil's advocate and disagree here, it would basically be that there's no time to think, there's no chain of thought. You just have to run attention on it, and then immediately come up with a classification or an answer.
Yeah. You sort of have less bandwidth. You have to encode all of the information in the embedding and go from there. The chat is blowing up. Steve, you want to... What is this? What's going on? Okay. Who has questions or observations they want to mention? I have a question for Swyx.
You spoke with the guys at Windsurf and they said that they were doing work on a retrieval model or at least a way of implementing retrieval so that it would get the right code. Do you know if they were training their own embedding model, and if so, how does it compare?
And if not, how does this model come into play? I would not be surprised at all if they were training their own model. I don't think they actually confirmed that. They said they were working on retrieval, but I don't know if they said it was their own model or not.
I don't think it matters to them. They do train models for fun; this is not a heavy lift for them. Yeah. The beauty of these smaller models is that it's relatively cheap. Yeah. I'm just throwing questions out here, but does anyone know... would this embedding model actually be useful for, for example, encoding a text description of a function and then retrieving that function, assuming that it's also embedded?
Basically, RAG based on the text description versus the function itself. Yeah. I'm not sure. They did put details in the paper about that. I think the level of detail they put into the training and the evaluation was actually really good. So, for example, in the training piece, instead of using 15% masking, they used 30%.
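As a quick aside on that masking rate, here's a minimal sketch (assuming the standard Hugging Face transformers API, not the paper's actual training code) of what bumping the MLM masking probability to 30% looks like:

```python
# Sketch: masked language modeling collation with a 30% masking rate,
# as ModernBERT uses, instead of the classic 15%.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # 30% of tokens get masked for the MLM objective
)

batch = collator([tokenizer("Masked language modeling with a higher masking rate.")])
print(batch["input_ids"])  # some positions replaced by the mask token id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```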
The warmup was also very detailed. They started with a 3 billion token warmup and then kept the learning rate constant, and if they saw a plateau, they would go back and adjust it, but only towards the very end.
Also, it was actually quite interesting that a lot of this stuff was tested on an RTX 4090. I did some fine-tuning myself; I have an RTX 2080, which is comparable to the 3000 series, and it took, give or take, 40 minutes to fine-tune it on a 200,000-sample dataset.
So, that was quite interesting, and I think the level of detail in the training was incredible, and also in the evaluation piece, they do answer your question with the coding data set that they used to test it. Yeah, that's interesting. Okay, I have no idea how to guide this one, especially because normally I would try to prep a slide deck.
I have been too busy to do a slide deck. But does anyone else have like a thing that they want to show that I can give you the screen for? I did take a bunch of notes on the paper. If you want me, I can take it over and like walk through the notes that I took.
Go ahead. One second. I did add the Nomic-tuned version of it. Basically, Nomic added their own dataset and created ModernBERT Embed, something like that. That's something you can start using for RAG, whereas ModernBERT itself is more of a baseline. I tried to use it a little bit on a dataset that I don't think was very well represented in the data they used, and it wasn't that good for that use case.
It was a law-specific one, trying to detect the type of legal case in a document. It didn't really do very well until I fine-tuned it. So it's really good off the bat, but for some use cases you might want to fine-tune it and take it a little bit further.
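For anyone curious, a rough sketch of that kind of domain fine-tune with sentence-transformers might look like the following (the checkpoint name and the training pairs are placeholders, not the exact setup used here):

```python
# Sketch: fine-tuning an embedding model on a small set of domain-specific
# (query, relevant passage) pairs using sentence-transformers.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed checkpoint name for the Nomic-tuned ModernBERT embedding model.
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# Hypothetical legal-domain training pairs.
train_examples = [
    InputExample(texts=["What kind of case is this?",
                        "The plaintiff alleges breach of contract arising from..."]),
    InputExample(texts=["Is this a criminal matter?",
                        "The defendant is charged with one count of..."]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```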
Okay, so I took a few notes and I highlighted the things that I thought that they were quite useful. So I'm just going to run through them. And if anybody has any questions, I might be able to answer them, I hope. I might not. I hope I am. Okay.
So I think I mentioned this a second ago that one of the coolest pieces is that this can extract a lot of power from a single GPU, or you can do a lot with a single GPU. That's something that I personally prefer if I have the ability to do so, but I know that we all know that a lot of times you just need a much more powerful one.
So they call it the most speed- and memory-efficient encoder, designed for inference on common GPUs. They point out a couple of nice things regarding the drawbacks of previous models. Previous BERTs had a sequence length limit of about 512 tokens, which is suboptimal for modern model design, but in this one they increase the context length to 8,192 tokens.
And the nice thing about it is that they did it in a way that parallelizes nicely across the couple of GPU types they used for training. So these models can be used in conjunction with LLMs, for example detecting toxic prompts and preventing harmful responses, or routing queries in agentic frameworks.
I added here that this might not be super obvious for new AI engineers who have only worked with decoder-only models. You might want to have a smaller model, like a BERT, in a pipeline, detecting specific keywords, and then use that to ask, is there toxic language in this prompt?
Or is there toxic language in this response going back to the user? So you can have a pre- and post-LLM-generation check in an agentic framework. The original BERT's training data is also limited, narrow in domain, specifically lacking coding data, and lacking knowledge of recent events. So it was quite nice to see this kind of refreshed look at what BERT could be.
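A minimal sketch of that pre/post guardrail pattern might look like this (the classifier checkpoint and label names are placeholders for whatever small encoder-based classifier you fine-tune, and `llm_generate` stands in for your LLM call):

```python
# Sketch: using a small encoder-based classifier as a guardrail before and
# after an LLM generation step in a pipeline.
from transformers import pipeline

# Hypothetical fine-tuned toxicity classifier built on a small encoder.
toxicity_clf = pipeline("text-classification", model="your-org/toxicity-classifier")

def guarded_generate(prompt: str, llm_generate) -> str:
    # Pre-check: screen the incoming prompt before it reaches the LLM.
    if toxicity_clf(prompt)[0]["label"] == "toxic":
        return "Request blocked by the input guardrail."

    response = llm_generate(prompt)

    # Post-check: screen the LLM's answer before returning it to the user.
    if toxicity_clf(response)[0]["label"] == "toxic":
        return "Response withheld by the output guardrail."
    return response
```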
And then, overall, both model sizes reach strong overall performance; you saw the image that Swyx showed on the screen earlier. On the transformer architecture itself, they disabled the bias term in all linear layers except the final decoder layer, and also in the layer norms. I keep hitting the wrong ones -- sorry about that, Iro.
>> Weights are all you need. >> Sorry? >> Weights are all you need. No biases. >> So yeah. They disabled the bias in all linear layers except for the final decoder linear layer.
And then they also disabled the bias in the layer norms, which I thought was quite interesting. They also used rotary positional embeddings (RoPE) instead of absolute positional embeddings. One of the interesting things about this paper is that a lot of what went into creating ModernBERT is based on papers that all came out in 2024.
I was about to start counting how many were from 2024, but the final number is actually quite large, which is nice. A couple of things on the improvements. On alternating attention: the attention layers in ModernBERT alternate between global attention, where every token in a sequence attends to every other token, and local attention, where tokens only attend to each other within a small sliding window.
So every third layer employs global attention with a RoPE theta of 160,000, and the remaining layers use a 128-token local sliding-window attention with a RoPE theta of 10,000. If somebody here knows a little bit more about exactly what this means in simpler terms, that would be super useful.
Because I was trying to wrap my head around exactly what this looks like in the implementation. But I guess it's more a matter of going into the code and seeing exactly how it gets implemented. Something else interesting is that it unpads inputs before the token embedding layer and optionally repads the model outputs, leading to a 10 to 20% performance improvement over other unpadding methods.
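Coming back to the alternating attention question above, here is a rough sketch (my own reading of the paper, not the actual ModernBERT code) of what the per-layer mask pattern could look like, assuming the global layers occur every third layer:

```python
# Sketch: alternating global/local attention masks. Every third layer is
# global (full attention); the rest restrict attention to a 128-token
# sliding window around each position.
import torch

def layer_attention_mask(seq_len: int, layer_idx: int, window: int = 128) -> torch.Tensor:
    if layer_idx % 3 == 0:
        # Global layer: every token attends to every other token.
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    # Local layer: token i attends only to tokens within +/- window // 2.
    positions = torch.arange(seq_len)
    distance = (positions[:, None] - positions[None, :]).abs()
    return distance <= window // 2

global_mask = layer_attention_mask(512, layer_idx=0)
local_mask = layer_attention_mask(512, layer_idx=1)
print(global_mask.sum().item())  # 512 * 512 = 262,144 allowed pairs
print(local_mask.sum().item())   # only ~129 allowed pairs per row
```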
It also uses a mixture of FlashAttention 3 for the global attention layers and FlashAttention 2 for the local attention layers. I believe FlashAttention 3 support has not yet landed in the latest released version of transformers, so if you go to the repo, you'll see a note saying that if you're going to use it, make sure you pip install from the URL pointing to the latest git commit on the main branch rather than the latest version available on PyPI.
Also, torch.compile yielded a 10% improvement in throughput with negligible compilation overhead, which I thought was interesting. And the model was designed through many small-scale ablations to maximize utilization across a basket of common GPUs.
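If you want to try that yourself, the compile step is essentially a one-liner (a sketch assuming standard PyTorch and transformers usage, with the `answerdotai/ModernBERT-base` checkpoint as an example):

```python
# Sketch: compiling the encoder with torch.compile; the paper reports roughly
# a 10% throughput gain with negligible compilation overhead.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("answerdotai/ModernBERT-base").eval()
model = torch.compile(model)  # compilation actually happens on the first forward pass
```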
I think if you are new to this, or you haven't worked as a data scientist, or maybe you don't read a whole lot of research papers, you might actually not know what an ablation is. I had to Google it as well, even though I've read a few; it just wasn't in my head.
So ablation studies are experiments where researchers systematically remove, or ablate, certain components of a model or system to understand their individual impact on performance. It's kind of like a controlled experiment when you're designing an architecture. The reason I mention it is that you're going to see the word "ablation" thrown around in a lot of places.
And they just assume that you know it, and I'm sure a lot of people here do. So it uses 22 and 28 layers for the base and large models, for a total of 149 million and 395 million parameters, respectively. The base one has a hidden size of 768 with a GLU expansion of 2,304, while the large one has a hidden size of 1,024 and a GLU expansion of 5,248.
It's trained on two trillion tokens of primarily English data from a variety of sources, including web documents, code, and scientific literature. I don't remember seeing a direct link to the dataset that they used, but it is open data, and they're very explicit about it.
They say that they use a modified version of the OLMo tokenizer, which provides better token efficiency and performance on code-related tasks. So that adds to it being good with code in general. And it's nice that it keeps the same things we're used to seeing in encoder models like BERT.
The special tokens like [CLS] and [SEP] are still there, so you can play around with those. And I mentioned this earlier at the very beginning: apparently it was very common to use 15% masking when doing the training, but this one bumps it up to 30%.
They said the 15% rate, which dates back to the original BERT paper, has since been shown to be suboptimal. And then a couple of things about the training setup that I thought were interesting. They trained ModernBERT-base at a constant learning rate of 8e-4 for about 1.7 trillion tokens.
But they started with a 3 billion token warmup first. Then, after a 2 billion token warmup, they trained ModernBERT-large at a smaller learning rate for 900 billion tokens. They then rolled back and restarted the training at 0.000005 for the remaining 800 billion tokens after they saw a large loss plateau for 100 billion tokens.
For the batch size schedule, they warm up the batch size from 768 to 4,608 over 50 billion tokens for the base model, and from 448 to 4,928 over 10 billion tokens for the large one. They also talk a little bit about how they do the context length extension, and it's nice that they took that idea from two papers that were released this year.
Also, on text retrieval -- I think this might partially address Sebastian's question -- they evaluate the model in both a single-vector dense passage retrieval setting, where the entire document is encoded into one dense vector, and a multi-vector ColBERT setting as well.
So they use different methodologies to evaluate it. In some of them, they retrieve whole documents and add them to the context of the evaluation, and in others they work with chunks of a piece of text. They train every base model on the MS MARCO dataset with mined hard negatives, on 125 million samples, with a batch size of 16 and a learning rate warmup of 5% of training, using sentence-transformers.
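To make the single-vector versus multi-vector distinction concrete, here's a rough sketch of the two scoring schemes (my own illustration with random vectors, not the paper's evaluation code):

```python
# Sketch: single-vector dense scoring vs. ColBERT-style multi-vector MaxSim.
import torch
import torch.nn.functional as F

def dense_score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> torch.Tensor:
    # One pooled vector per text; the score is a single cosine similarity.
    return F.cosine_similarity(query_vec, doc_vec, dim=-1)

def colbert_maxsim(query_toks: torch.Tensor, doc_toks: torch.Tensor) -> torch.Tensor:
    # One (normalized) vector per token; each query token takes its maximum
    # similarity over all document tokens, and the maxima are summed.
    query_toks = F.normalize(query_toks, dim=-1)
    doc_toks = F.normalize(doc_toks, dim=-1)
    sim = query_toks @ doc_toks.T          # [num_query_tokens, num_doc_tokens]
    return sim.max(dim=-1).values.sum()

q_vec, d_vec = torch.randn(768), torch.randn(768)
print(dense_score(q_vec, d_vec))

q_toks, d_toks = torch.randn(12, 128), torch.randn(300, 128)
print(colbert_maxsim(q_toks, d_toks))
```

The multi-vector version keeps per-token detail around, which is one intuition for why it holds up better on long documents, at the cost of storing many more vectors per document.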
So it's nice that you still see sentence-transformers still rocking it. They adapt the training setup from JaColBERTv2.5, which was also from a paper this year. And then they train all the models by distilling knowledge from a teacher model, using KL divergence between the normalized teacher and student scores.
They say -- okay, yeah, here it is. So this is the results for all the models across the different evaluations they use. You can see here at the bottom that the base model is better in this particular one, though there are still better ones; GTE-en-MLM is a little bit better in these two over here.
>> Yeah. On this table of results, what's really exceptional is that if you look at ColBERT versus dense passage retrieval on the multilingual long document retrieval benchmark, MLDR, you look at the jump in performance. It's huge. For ModernBERT, it's like 27.4 to 80.2. But for BEIR -- I think that's Benchmarking IR -- the jump is not that big.
So it seems like as your context length gets longer, doing a ColBERT-based approach really gets you a lot more juice out of it. Of course, a lot more cost, but a lot better metrics. >> Which one is the ColBERT approach? >> You can see that there are eight columns of results, right?
The first three columns are dense passage retrieval. The next two columns are ColBERT. >> Thanks. >> Yeah. And then what you probably want to compare is the same benchmark across the two settings -- BEIR to BEIR, and MLDR out-of-domain to MLDR out-of-domain -- to see the comparison.
>> Yeah, that's a huge win. >> Yeah, huge win. Okay. So there's a couple more. There's another one as well. Actually, let me see if I have something here for it. On programming-related performance, they evaluated all the models on CodeSearchNet.
So this is related to Sebastian's question. It's a code-to-text benchmark where the model must identify relevant docstrings and comments for code blocks. So it's actually identifying the docstring rather than the code itself, but it might still be useful for finding a piece of code if you're doing information retrieval.
But it might be tricky, because especially nowadays -- I don't know about everyone here on the call -- I add comments and docstrings inside functions, but I don't go to massive lengths to add a lot of comments or a giant docstring.
But, for example, scikit-learn probably has the absolute best docstrings out there, along with NumPy and pandas and so on. If you were looking for functions inside those, I can imagine that's an easy win for it. But for more recent tools, I can imagine you might not see the same extensive docstrings.
They evaluated the benchmarks using the CoIR (CodeIR) framework, as a single-vector retrieval task, and all of the models reuse the best hyperparameters identified in section 3.1.2. And then here they have the memory (max batch size) and inference (in thousands of tokens per second) efficiency results on consumer hardware, an RTX 4090, averaged over ten runs.
I don't know if anybody has any comments on this particular one. >> Yeah, I was actually kind of curious about what you were talking about -- what is it, CSN, they call it? The dataset for evaluating code. Is it normally evaluated that way?
By encoding the code and then trying to retrieve the correct docstring? Is this the normal benchmark that's used for encoders? >> That is a great question. I have no idea. >> I was under the impression it's the other way around. >> You know, this would be a good thing for Elicit or deep research.
I'm going to see if I can find that. >> Yeah, but that's a great question. Maybe we have someone for any of the -- from cursor in this call or from any of the major IDEs that can say how they're doing it. We'll keep it a secret. >> Okay.
So ModernBERT-large increases its lead despite having comparatively fewer parameters: 395 million versus GTE-en-MLM-large's 435 million. So that's nice, being able to do a bit more with less. ModernBERT also outperforms other long-context models with at least a 9 NDCG@10-point lead on both model sizes.
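As a quick refresher on the metric (my addition, not from the paper), NDCG@10 discounts each retrieved document's relevance by its rank and normalizes by the score of the ideal ordering:

```latex
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
```

So a 9-point lead means a substantially better ranking of relevant documents near the top.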
I was a little bit thrown off here; I didn't really get this and had to reread the section. What it essentially means is that it also surpasses all existing base models, including DeBERTaV3, becoming the first MLM-trained model to do so. Apparently DeBERTaV3-base had been unbeaten on the GLUE benchmark.
And then it improves understanding of code with no detriment to its ability to process natural text, which is nice. You often see models that, if they have been fine-tuned too heavily on code or place a large emphasis on code, don't perform that well on other common natural language tasks.
And then, for the efficiency evaluation, it is able to process batches twice as large as every other model on both input lengths. But ModernBERT-large is slightly less memory efficient than the original BERT-large on short-context inputs, so we're talking about 512 tokens or so.
But it can still process batches at least 60% bigger than the original BERT-large. So it's just one small area where it doesn't beat every single model in every single characteristic.
But other than that, in almost every measure it goes above all the other models. It is the first open model to feature full model unpadding, and the first encoder designed in a hardware-aware way to maximize inference efficiency. It is in a class of its own on single-vector and ColBERT-style long-context retrieval benchmarks, scoring at least 6.85 and 9.1 percentage points higher than the closest model, respectively, while remaining state of the art on short-context retrieval in both single- and multi-vector settings.
And -- oh, I thought this was interesting -- they note that the MLM objective gives the model some ability to generate text by suggesting a token to replace the [MASK] token, which could result in the generation of harmful content. However, they say it is not primarily a generative model, and as such it has not been trained to, and therefore cannot, generate longer coherent sequences. It might predict a masked token in a specific way, but it's not like it's going to go and converse with you at length.
It might still say something harmful, though. In terms of scaling, besides the architectural modifications, a key aspect of the study is data scaling. However, they note that other scaling axes, in terms of parameters, are still left unexplored. So what the authors are saying, I think, is that there's a lot of room for improvement, even if this is already a large jump.
And I think that's it. Those are all the highlights I noted when I was going over it, if anybody else has another comment or wants to touch on something else. >> I think one thing the authors did, implicitly, is compare it only against the BERT-style models rather than against the broader collection of embedding models out there.
I mean, certainly the sizes are different if you were to go on, like, the MTEB leaderboard and look at them. Yeah, I was somewhat surprised when I started to read into it. I thought there might be some mention of the other encoding models that are out there, but that wasn't the case.
>> Yeah, there's a lot of variations of BERT, so I can imagine they only picked a selected few for specific reasons. They do mention in the paper why they picked a couple of them -- RoBERTa and obviously the base BERT -- but they definitely could have gone further, or maybe they just left it for the reader to do a bit more testing and put stuff out there.
>> Sorry, your audio cut out for a bit; I didn't hear what you said. >> Oh, sorry. I was saying that I appreciate that you put the NDCG explainer link in there. >> It's a ranking similarity thing, yeah. I just feel like they could say ranking score and then footnote NDCG, but it's the same thing.
It's always hard to remember all these things when you have -- you don't live in this world all the time. Yep. I think we have Benjamin who said that he wanted to talk after we finish recording. I really recommend reading the blog post. That was very much more my level.
I don't really get involved in the training side; I just want to understand what I need to know as a user of the thing, and then potentially as a fine-tuner. But I think it's a very major advance, and I think it's going to be a workhorse tool.
So like I think multiple of us picked this last week and for good reason. So I think we're happy to cut over to the offline part of the Convo, unless anyone else has things to share. No? Okay. All right. I will pause recording and then Ben, Benjamin, you can come up.