
ModernBERT: Modern Bidirectional Encoder for Fast, Efficient, and Long Context Finetuning + Inference



00:00:00.000 | then you can have, I don't know, a full piece of code rather than just one token at a time.
00:00:07.280 | So I think that was one of the biggest things.
00:00:13.000 | For example, it doesn't surpass the base BERT in speed, but in everything else
00:00:20.400 | it surpasses every other model.
00:00:22.400 | So I think that was one of the key highlights for me.
00:00:27.080 | I'm still trying to figure out where the code data is mentioned.
00:00:32.840 | I know I read it inside of here, but I don't know where it is.
00:00:37.840 | Yeah.
00:00:38.840 | It should be very useful for code.
00:00:41.360 | Oh, yeah, there we go.
00:00:45.160 | Here.
00:00:46.160 | Okay.
00:00:47.160 | Well, anything else that we should cover at a high level?
00:00:50.400 | Otherwise, we should, I don't know, go through them.
00:00:55.760 | This is relatively unstructured, because we didn't appoint someone to lead this discussion
00:01:02.240 | and I'm not ready to lead this discussion.
00:01:06.560 | But I would say on my end, I think training it on two trillion tokens, like basically
00:01:14.720 | kind of updating every dimension of normal BERT into 2025 BERT, which I think was the
00:01:22.060 | original name, makes sense, including this kind of alternating global/local attention, which
00:01:30.760 | is something that is like very, very state of the art that I was only seeing in some
00:01:37.840 | of the papers, including at Character AI.
00:01:40.200 | And it's very surprising to see this applied to BERT.
00:01:43.960 | But I felt like it was very well written in terms of the justification of how much these
00:01:50.320 | things are downloaded and how they deserve to be updated, because they are actually
00:01:57.160 | much more efficient compared to other models.
00:02:01.240 | And that just makes a lot of sense.
00:02:05.480 | One thing that I've never understood with a lot of clarity is why the bidirectional
00:02:13.480 | encoder is so much more efficient than a causal decoder only model.
00:02:24.780 | I mean, I don't know if anyone can, it's probably well known, but I don't, I never understood
00:02:31.440 | that.
00:02:32.440 | Yeah.
00:02:33.440 | I mean, this is where if Ben is available to speak, he can probably be authoritative.
00:02:40.860 | To me, it's just like, if you're literally like, the only reason you do decoder only
00:02:45.460 | is to generate the next token.
00:02:48.080 | That is what it's very, very good at.
00:02:50.300 | But if your job is to fill in the middle or to do classification, then you might as well
00:02:56.140 | look at the whole sentence all at once.
00:03:00.140 | Yeah.
00:03:01.140 | Okay.
00:03:02.140 | That's sort of, that's kind of what I suspected.
00:03:08.060 | Yeah, I feel like, I don't know, I don't know if anyone else has stuff they want
00:03:14.380 | to add there, but yeah.
00:03:17.980 | If I may put in my two cents, the other advantage with encoder models is richness of information
00:03:25.980 | because you're not just attending to previous tokens, but every token's encoding is going
00:03:32.660 | to be based both on past and future tokens.
00:03:35.820 | So if you have some token that's making a future reference, it's not very common, but
00:03:42.180 | sometimes you get an adjective or some kind of reference to a future token that hasn't
00:03:46.420 | come yet.
00:03:48.260 | Autoregressive models don't get a chance to encode that information because they haven't
00:03:51.940 | seen that future token yet.
00:03:54.620 | It's fine when you just need to predict the next token.
00:03:58.940 | Probably the next token's encoding is going to try to reflect that information.
00:04:03.300 | But when you're doing stuff like classification, like NER, named entity recognition, or you're
00:04:08.300 | doing sentiment analysis, you want that entire spectrum of information of both past and future
00:04:13.620 | tokens.
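
A minimal sketch of the contrast being described here, using plain PyTorch boolean attention masks (illustrative only, not any model's actual implementation): a causal decoder masks out future tokens, while a bidirectional encoder lets every token attend to the whole sequence.

```python
import torch

seq_len = 6

# Decoder-only (causal): token i can only attend to tokens 0..i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Bidirectional encoder: token i attends to every token, past and future.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# In a causal model, the representation of token 2 can never use tokens 3..5;
# an encoder builds each token's representation from the full sentence, which is
# what you want for classification, NER, or retrieval embeddings.
print(causal_mask[2])         # tensor([ True,  True,  True, False, False, False])
print(bidirectional_mask[2])  # tensor([ True,  True,  True,  True,  True,  True])
```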
00:04:14.620 | Right.
00:04:15.620 | Yeah.
00:04:16.620 | So the bidirectionality is what you're saying, basically.
00:04:22.140 | Yeah, exactly.
00:04:23.900 | I think maybe if I were to play devil's advocate and disagree here, it would basically be that
00:04:32.140 | there's no time to think, there's no chain of thought.
00:04:34.860 | You just have to run attention on it, and then immediately come up with a classification
00:04:41.260 | or an answer.
00:04:42.260 | Yeah.
00:04:43.260 | Yeah.
00:04:44.260 | Yeah.
00:04:45.260 | You sort of have less bandwidth.
00:04:46.260 | You have to encode all of the information in the embedding and go from there.
00:04:52.900 | The chat is blowing up.
00:04:54.780 | Steve, you want to...
00:04:56.740 | What is this?
00:04:57.740 | What's going on?
00:04:59.420 | Okay.
00:05:00.420 | Who has questions or objectives they want to mention?
00:05:05.460 | I have a question for Swyx.
00:05:10.260 | You spoke with the guys at Windsurf and they said that they were doing work on a retrieval
00:05:17.060 | model or at least a way of implementing retrieval so that it would get the right code.
00:05:23.860 | Do you know if they were training their own embedding model, and if so, how does it compare?
00:05:27.540 | And if not, how does this model come into play?
00:05:34.140 | I wouldn't be...
00:05:35.140 | I would completely not be surprised if they were training their own model.
00:05:38.780 | I don't think they actually confirmed that.
00:05:41.900 | They said they were working on retrieval, but I don't know if they said it was their
00:05:44.780 | own model or whatever.
00:05:46.260 | I don't think it matters to them.
00:05:49.240 | They do train models for fun.
00:05:50.960 | This is not one of those heavy lifts for them.
00:05:54.100 | Yeah.
00:05:55.100 | The beauty of these smaller models is that it's relatively cheap.
00:06:07.500 | Yeah.
00:06:11.260 | I'm just throwing questions out here, but does anyone know of...
00:06:17.580 | Do you necessarily... would this embedding model be actually useful for, for example,
00:06:23.700 | encoding a text description of a function and then retrieving that function, assuming
00:06:30.580 | that it's also embedded?
00:06:32.540 | Basically, RAG based on the actual text description versus the function itself.
00:06:39.300 | Yeah.
00:06:40.300 | I'm not sure if I'm...
00:06:43.900 | They did put details in the paper about their...
00:06:48.900 | I think the level of detail that they put on the training and the testing, the evaluation
00:06:53.940 | was actually really good.
00:06:54.940 | So, for example, in the training piece, instead of using 15% of masking, they used 30%.
00:07:01.660 | The warmup they used was very detailed.
00:07:04.660 | They started with a 3 billion token warmup and then increased, and then they kept the learning
00:07:11.180 | rate constant, and then if they saw kind of a plateau, they would go back
00:07:16.900 | and adjust it, but just towards the very end of it.
00:07:19.460 | Also, it was actually quite interesting that a lot of this stuff, they tested it on an RTX
00:07:26.180 | 4090.
00:07:27.780 | I did some fine tuning and I have an RTX 20-series card, which is similar to the 30-series ones,
00:07:36.460 | and it took about give or take 40 minutes to fine tune it on a 200,000 sample data set.
00:07:44.820 | So, that was quite interesting, and I think the level of detail in the training was incredible,
00:07:53.460 | and also in the evaluation piece, they do answer your question with the coding data
00:07:58.900 | set that they used to test it.
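
A hedged sketch of the kind of masked-language-model fine-tuning being described, with the masking rate bumped to 30%. The model id, dataset, and hyperparameters here are illustrative assumptions, not the speaker's exact setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-base"  # assumed model id; check the release for the exact name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Any text dataset works here; wikitext-2 is just a small stand-in example.
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.map(lambda batch: tok(batch["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=ds.column_names)

# 30% masking instead of the classic 15%.
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.30)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-mlm",
                           per_device_train_batch_size=8,
                           num_train_epochs=1,
                           bf16=True),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
```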
00:08:02.500 | Yeah, that's interesting.
00:08:07.940 | Okay, I have no idea how to guide this one, especially because normally I would try to
00:08:25.900 | prep a slide deck.
00:08:28.180 | I have been too busy to do a slide deck.
00:08:31.860 | But does anyone else have like a thing that they want to show that I can give you the
00:08:38.100 | screen for?
00:08:39.100 | I did take a bunch of notes on the paper.
00:08:42.460 | If you want, I can take it over and walk through the notes that I took.
00:08:49.700 | Go ahead.
00:08:52.860 | One second.
00:08:56.020 | I did add the Nomic tuned version of it.
00:09:01.180 | Basically, Nomic just added their Nomic dataset and created ModernBERT Embed, something
00:09:07.020 | like that.
00:09:08.620 | And that was useful, something that you can start using for RAG, whereas ModernBERT
00:09:14.540 | was kind of like the baseline.
00:09:19.500 | I tried to use it a little bit on a dataset that I don't think was very well represented
00:09:25.780 | in the data that they used, and it wasn't that good for the use case.
00:09:29.940 | It was with a law-specific one, trying to detect the type of legal case that was in
00:09:39.940 | the document.
00:09:40.940 | And then it didn't really do very well until I fine-tuned it.
00:09:44.140 | So it's really good off the bat, but for some use cases, you might want to fine-tune it
00:09:47.540 | and go out a little bit further.
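
A minimal sketch of using the Nomic-tuned embedding model for retrieval, as described above; the model id and the query/document prefixes are assumptions based on how Nomic's embedding models are typically documented, so check the model card.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/modernbert-embed-base")  # assumed model id

docs = [
    "The plaintiff filed a breach-of-contract claim against the supplier.",
    "This appeal concerns a dispute over patent infringement damages.",
]
query = "What type of legal case is about a broken contract?"

# Nomic-style models usually expect task prefixes on queries and documents.
doc_emb = model.encode([f"search_document: {d}" for d in docs])
query_emb = model.encode(f"search_query: {query}")

print(util.cos_sim(query_emb, doc_emb))  # higher score expected for the contract document
```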
00:09:49.540 | Okay, so I took a few notes and I highlighted the things that I thought that they were quite
00:09:55.140 | useful.
00:09:56.140 | So I'm just going to run through them.
00:09:57.140 | And if anybody has any questions, I might be able to answer them, I hope.
00:10:00.900 | I might not.
00:10:01.900 | I hope I am.
00:10:03.900 | Okay.
00:10:04.900 | So I think I mentioned this a second ago that one of the coolest pieces is that this can
00:10:12.100 | extract a lot of power from a single GPU, or you can do a lot with a single GPU.
00:10:18.600 | That's something that I personally prefer if I have the ability to do so, but I know
00:10:22.540 | that we all know that a lot of times you just need a much more powerful one.
00:10:28.740 | So they call it the most speed and memory efficient encoder and designed for inference
00:10:36.700 | on common GPUs.
00:10:40.540 | They point out a couple of nice things regarding the drawbacks.
00:10:45.620 | So previous BERTs had a length limitation of about 512 tokens, which is suboptimal for
00:10:56.860 | modern use, but in this one they increase the sequence length to 8,192.
00:11:04.500 | And the nice thing about it is that they did it in a way in which they could parallelize
00:11:10.300 | it nicely across the couple of GPUs that they used for training.
00:11:22.620 | So they can be used in conjunction with LLMs, for example, detecting
00:11:27.500 | toxic prompts, filtering responses, or routing queries in agentic frameworks.
00:11:33.380 | I added here that this might not be super obvious for new AI engineers that have worked
00:11:37.660 | on decoder-only models.
00:11:38.900 | So you might want to have in a pipeline a smaller model, like a BERT, detecting
00:11:44.860 | specific words, specific keywords, and then use that to say, like, oh, is there toxic
00:11:49.860 | language in this prompt?
00:11:52.020 | Or is there toxic language in this response back to the user?
00:11:56.220 | And so you can have a pre and post LLM generation in an agentic framework.
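
A hedged sketch of that pre- and post-generation guard idea. The classifier model id below is hypothetical; in practice you would plug in a small encoder (for example a ModernBERT fine-tune) trained for toxicity or routing.

```python
from transformers import pipeline

# Hypothetical toxicity classifier built on a small encoder.
toxicity = pipeline("text-classification", model="your-org/toxicity-classifier")

def guarded_generate(prompt: str, llm_generate) -> str:
    """Run a cheap encoder check before and after the expensive LLM call."""
    if toxicity(prompt)[0]["label"] == "toxic":
        return "Sorry, I can't help with that."
    response = llm_generate(prompt)  # the big decoder-only model
    if toxicity(response)[0]["label"] == "toxic":
        return "Sorry, I can't help with that."
    return response
```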
00:12:04.480 | So the older models' training data is limited to narrow domains, and is specifically lacking coding data
00:12:10.140 | and knowledge of recent events.
00:12:12.180 | So it was quite nice to see this kind of new, refreshed look as to what BERT could be.
00:12:21.900 | And then, so overall, both of them reach state-of-the-art overall performance.
00:12:26.580 | You saw the image that Swyx showed on the screen.
00:12:32.180 | Even the transformer.
00:12:33.880 | They disabled the bias term in all the layers, except the final decoder and in the layer
00:12:40.140 | norm.
00:12:41.140 | I keep hitting the ones that I -- sorry about that, Iro.
00:12:46.480 | >> Weights are all you need.
00:12:47.980 | >> Sorry?
00:12:48.980 | >> Weights are all you need.
00:12:49.980 | >> Sorry?
00:12:50.980 | >> Weights are all you need.
00:12:51.980 | No biases.
00:12:52.980 | >> So yeah.
00:12:53.980 | So they took the biases from the -- so they disabled the bias in all linear layers except
00:13:05.080 | for the final decoder linear layer.
00:13:08.400 | And then they also disabled the bias in the layer norms.
00:13:12.440 | So I thought that was quite interesting.
00:13:14.760 | And then they also used rotary positional embeddings, RoPE, instead
00:13:21.680 | of the absolute positional embeddings.
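
A small sketch of those two choices in plain PyTorch: bias-free linear layers and layer norms, plus a toy rotary-embedding function. This only illustrates the ideas, it does not reproduce ModernBERT's implementation, and the LayerNorm bias flag needs a recent PyTorch.

```python
import torch
import torch.nn as nn

hidden = 768

# Bias disabled in the linear layers and layer norms (except the final decoder layer).
qkv_proj = nn.Linear(hidden, 3 * hidden, bias=False)
norm = nn.LayerNorm(hidden, bias=False)  # requires PyTorch >= 2.1 for the bias flag

def rotary(x: torch.Tensor, theta: float = 10_000.0) -> torch.Tensor:
    """Toy RoPE: rotate channel pairs by position-dependent angles instead of
    adding absolute position vectors. x has shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = theta ** (-torch.arange(half, dtype=torch.float32) / half)    # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```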
00:13:23.480 | So one of the things that was interesting about this paper is that a lot of the stuff
00:13:26.820 | that they included inside the paper for creating ModernBERT are based
00:13:32.560 | on papers that come -- that all came out in 2024.
00:13:36.880 | I was about to start counting like how many were there from 2024, but I think the final
00:13:41.880 | number is actually quite large.
00:13:44.200 | Which is nice.
00:13:45.200 | A couple of things in the improvements.
00:13:48.280 | So in alternating attention -- the attention layers in ModernBERT alternate between global,
00:13:54.200 | where every token within a sequence attends to every other token, and then local attention
00:13:58.960 | where tokens only attend to each other within a small sliding window.
00:14:04.560 | So every third layer employs global attention with a RoPE theta of 160,000, and the remaining layers use
00:14:11.760 | a 128-token local sliding window
00:14:13.920 | attention with a RoPE theta of 10,000.
00:14:18.360 | I feel like if somebody else here knows a little bit more about exactly what this means
00:14:23.200 | in much simpler terms, that would be super useful.
00:14:25.720 | Because I was trying to wrap my head around exactly how does this look like in the implementation
00:14:32.080 | piece.
00:14:33.080 | But I guess it's more a matter of going into the code and seeing exactly how this gets
00:14:38.360 | implemented.
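
For the simpler-terms question above, here is one way to picture it: a hedged sketch that builds the two kinds of attention masks. Roughly every third layer is global (full attention, larger RoPE theta) and the other layers only let each token see a 128-token sliding window. The window bookkeeping below is a simplification, not the reference code.

```python
import torch

def layer_attention_mask(seq_len: int, layer_idx: int,
                         window: int = 128, global_every: int = 3) -> torch.Tensor:
    """Boolean mask where True means 'may attend'."""
    if layer_idx % global_every == 0:
        # Global layer: every token attends to every other token.
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    # Local layer: each token only attends within a sliding window around itself.
    pos = torch.arange(seq_len)
    return (pos[None, :] - pos[:, None]).abs() <= window // 2

print(layer_attention_mask(8, layer_idx=0).all())       # global layer: everything visible
print(layer_attention_mask(8, layer_idx=1, window=4))   # local layer: a band around the diagonal
```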
00:14:43.720 | Something interesting as well is that it unpads inputs before the token embedding layer and
00:14:48.880 | optionally repads the model outputs, leading to a 10 to 20% performance improvement over
00:14:53.920 | other unpadding methods.
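
A rough sketch of what unpadding means: drop the pad tokens and pack all sequences into one long row, keeping cumulative offsets so variable-length attention kernels can tell the sequences apart. This is just the concept, not ModernBERT's actual unpadding code.

```python
import torch

def unpad(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """Remove padding and concatenate sequences; cu_seqlens marks sequence boundaries."""
    seqlens = attention_mask.sum(dim=1)
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), seqlens.cumsum(0)])
    packed = input_ids[attention_mask.bool()]  # 1-D tensor with no pad tokens
    return packed, cu_seqlens

ids = torch.tensor([[101, 2023, 102,    0,   0],
                    [101, 2023, 2003, 999, 102]])
mask = (ids != 0).long()
packed, cu = unpad(ids, mask)
print(packed.shape, cu)  # torch.Size([8]) tensor([0, 3, 8])
```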
00:14:59.480 | Also uses a mixture of FlashAttention3 for global attention layers and FlashAttention2
00:15:06.760 | for local attention layers.
00:15:08.820 | I believe in the latest released version of transformers, FlashAttention 3 has not yet been integrated.
00:15:14.400 | So if you go to the repo, you're gonna end up seeing that it says, hey, if you're gonna
00:15:18.640 | be using FlashAttention 3, make sure you pip install directly from the latest git commit
00:15:24.220 | on the main branch rather than the latest version available in PyPI.
00:15:31.120 | Also, torch.compile yielded a 10% improvement in throughput with negligible
00:15:41.480 | compilation overhead.
00:15:42.480 | Which I thought was interesting.
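
A hedged sketch of how that looks from the user side; the model id and the exact install requirements are assumptions, so follow the model card rather than this snippet.

```python
# pip install git+https://github.com/huggingface/transformers.git  # pre-release with ModernBERT support
# pip install flash-attn                                           # optional, for the FlashAttention kernels
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"  # assumed model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package to be installed
)
model = torch.compile(model)  # the compile step the speaker mentions; small throughput win
```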
00:15:45.520 | And so, I mentioned this one about the design -- so, it was designed through many
00:15:54.320 | small scale ablations to maximize the utilization of a basket of common GPUs.
00:16:00.720 | I think if you are new today, or if you haven't worked as a data scientist,
00:16:06.480 | or maybe you don't read a whole lot of research papers, you might actually not know what ablation means.
00:16:11.560 | I had to Google it as well, even though I've read a few, but it just wasn't in my head.
00:16:16.020 | So ablation studies are experiments where researchers systematically remove, or
00:16:23.400 | ablate, certain components of a model or system to understand their individual impact on performance.
00:16:30.480 | It is kind of like a control experiment when you're creating an architecture.
00:16:35.160 | So the reason I mention that is because you're going to see the word "ablation" thrown in
00:16:39.880 | in a lot of places.
00:16:40.880 | And just assume that you know it.
00:16:43.160 | And I'm sure a lot of people here probably know it.
00:16:46.980 | So it uses 22 and 28 layers for the base and large model, for a total of 149 and 395 million parameters,
00:16:56.840 | respectively.
00:16:57.840 | And then the base one has a hidden size of 768, with a GLU expansion of 2,304, while
00:17:07.840 | the large one has a hidden size of 1,024, and a GLU expansion of 5,248.
00:17:19.200 | It was trained on two trillion tokens of primarily English data from a variety of data sources, including
00:17:23.720 | web documents, code, and scientific literature.
00:17:26.360 | So I don't know -- I don't remember seeing if there was a direct link to the dataset
00:17:32.720 | that they used.
00:17:33.720 | But it is open data, though.
00:17:35.720 | And they're very explicit about it.
00:17:37.880 | They say that they use a modified version of the OLMo tokenizer, which provides better
00:17:42.440 | token efficiency and performance on code-related tasks.
00:17:46.320 | So this adds to it being good with coding files in general.
00:17:57.540 | And then it's nice that it keeps the same thing that we are used to seeing in encoder
00:18:04.080 | models like BERT.
00:18:05.080 | The special tokens being the [CLS] and the [SEP], so you can play around with those.
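
A quick check of the special tokens, with the model id assumed:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")  # assumed model id
enc = tok("ModernBERT keeps the familiar special tokens.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))  # expect a [CLS] ... [SEP] wrapping, as mentioned above
```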
00:18:15.260 | So I mentioned this earlier at the very beginning, that apparently it was very common to use
00:18:19.800 | 15% of masking when you are doing the training.
00:18:22.640 | But then this one bumps it up to 30%.
00:18:25.200 | They said that the 15% rate has, since the first BERT paper, been shown to be suboptimal.
00:18:34.460 | And then -- oh, a couple of things that -- about the training set that I thought were interesting.
00:18:40.120 | So they started with a warmup -- so actually, they
00:18:45.020 | trained ModernBERT-base at a constant learning rate of 8e-4
00:18:56.000 | for about 1.7 trillion tokens.
00:18:57.860 | But they started with a 3 billion token warmup first.
00:19:01.960 | And then after a 2 billion token warmup, they trained ModernBERT-large at a slightly smaller learning rate
00:19:06.760 | for 900 billion tokens.
00:19:12.200 | They then rolled back and restarted the training at 0.000005 for the remaining 800 billion
00:19:20.120 | tokens, after seeing a large loss plateau for about 100 billion tokens.
00:19:28.120 | So for the batch size schedule, they warm up the batch size from 768 to 4,608 over 50 billion
00:19:40.880 | tokens for the base, and from 448 to 4,928 over 10 billion tokens for the large.
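
The general shape being described is a warmup followed by a long constant (trapezoidal-style) phase. A toy sketch with illustrative numbers, not the paper's exact values:

```python
import torch

def warmup_then_constant(optimizer, warmup_steps: int):
    """Linear warmup to the peak learning rate, then hold it constant."""
    def lr_lambda(step):
        return min(1.0, (step + 1) / warmup_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(16, 16)
opt = torch.optim.AdamW(model.parameters(), lr=8e-4)  # peak LR, illustrative
sched = warmup_then_constant(opt, warmup_steps=1_000)

for step in range(5):
    opt.step()
    sched.step()
```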
00:19:50.000 | And they talk a little bit how they do the context length extension.
00:19:58.880 | And it's quite nice that they use -- they took this idea from two papers that were released
00:20:05.120 | on -- in this year.
00:20:10.780 | Also in the text retrieval.
00:20:12.520 | So I think this might address a little bit Sebastian's question.
00:20:19.520 | So they evaluate the model on both single vector dense passage retrieval.
00:20:23.480 | So the entire document put into a -- into a dense -- into a vector.
00:20:28.440 | And then they also use the multi-vector ColBERT setting
00:20:35.280 | as well.
00:20:36.680 | So they use different methodologies to evaluate it.
00:20:40.120 | In some of them, they retrieve documents completely using -- and add them in the context of the
00:20:48.440 | evaluation.
00:20:49.440 | And then in others, they will have chunks of a piece of text.
00:20:54.440 | And then they train every base model using the MS MARCO dataset with
00:21:05.120 | mined hard negatives, on 125 million samples, with a batch size of 16 and a learning rate
00:21:11.200 | warmup of 5% of the training, using Sentence Transformers.
00:21:16.120 | So it's nice that you still see sentence transformers, like, still rocking it.
00:21:21.240 | They adapt the training setup of JaColBERTv2.5, which was also part of a paper from
00:21:28.960 | this year.
00:21:29.960 | And then they train all the models distilling the knowledge from a teacher model using KL
00:21:34.480 | divergence between the normalized teacher and the student scores.
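
A simplified sketch of the single-vector retrieval fine-tuning with Sentence Transformers; it uses an in-batch contrastive loss with a mined hard negative rather than the paper's teacher distillation with KL divergence, and the model id is an assumption.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("answerdotai/ModernBERT-base")  # assumed model id; pooling is added automatically

train_examples = [
    InputExample(texts=[
        "what is rotary position embedding",                         # query
        "Rotary position embeddings encode positions by rotation.",  # positive passage
        "The 2024 Olympics were held in Paris.",                      # mined hard negative
    ]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives plus the explicit hard negative

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```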
00:21:37.960 | They say -- okay.
00:21:39.720 | Yeah.
00:21:40.720 | Here it is.
00:21:41.720 | So this is the results for all the models in the different evaluators that they use.
00:21:50.800 | You can see here at the bottom that the base model is better in this particular one, while
00:22:00.360 | GTE-en-MLM is a little bit better in these two over
00:22:04.600 | here.
00:22:05.600 | >> Yeah.
00:22:06.600 | On this table of results, what's really exceptional is that if you look at ColBERT
00:22:13.320 | versus dense passage retrieval on the multilingual long document retrieval, MLDR, you look at
00:22:21.120 | the jump in performance.
00:22:23.760 | It's huge.
00:22:24.760 | For ModernBERT, it's like 27.4 to 80.2.
00:22:28.360 | But for BEIR, I think it's Benchmarking IR, the jump is not that big.
00:22:34.280 | So it seems like as your context length gets longer, doing a ColBERT-based approach really
00:22:39.880 | gets you a lot more juice out of it.
00:22:41.280 | Of course, a lot more costs, but a lot more metrics, a lot more improved metrics.
00:22:47.680 | >> Which one is the ColBERT approach?
00:22:51.880 | >> You can see that there are eight columns of results, right?
00:22:56.240 | The first three columns are dense passage retrieval.
00:22:58.280 | The next two columns are ColBERT.
00:23:00.680 | >> Thanks.
00:23:01.680 | >> Yeah.
00:23:02.680 | And then what you probably want to compare is basically the same benchmarks, BEIR to
00:23:07.760 | BEIR, and MLDR out of domain to MLDR out of domain, to see the comparison.
00:23:17.480 | >> Yeah, that's a huge win.
00:23:24.440 | >> Yeah, huge win.
00:23:28.440 | Okay.
00:23:29.440 | So there's a couple more.
00:23:33.280 | There's a few -- there's another one as well.
00:23:35.600 | Well, actually, let me -- let me see if I have something here for it.
00:23:41.720 | So to mention the programming-related performance, they evaluated all models on CodeSearchNet.
00:23:47.080 | So this is related to Sebastian's question.
00:23:49.800 | It's a code-to-text benchmark where the model must identify relevant docstrings and comments
00:23:56.360 | for code blocks.
00:23:57.360 | So actually, it's identifying the docstring as opposed to the code itself, but it might
00:24:01.840 | be useful still in finding a piece of content if you're doing information retrieval.
00:24:09.680 | But it might be tricky if -- because especially nowadays, like, I don't know about everyone
00:24:14.540 | here in the call, I don't particularly -- I add comments.
00:24:18.800 | I add docstrings inside functions, but I don't go to a massive degree -- to a massive length
00:24:25.600 | to add a lot of comments or a lot of doc -- like a giant docstring.
00:24:29.800 | But for example, scikit-learn has probably the absolute best docstrings out there.
00:24:34.760 | NumPy and pandas and so on.
00:24:36.600 | If you were looking for functions inside of those, I can imagine that that's an easy one
00:24:40.120 | or that's an easy win for it.
00:24:41.640 | But for more recent tools, I can imagine that it might not see the same extensive docstrings
00:24:51.800 | in them.
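
On Sebastian's earlier question, a hedged sketch of the retrieval direction being discussed: embed the docstrings or text descriptions, then retrieve the matching function for a natural-language query. The embedding model id is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/modernbert-embed-base")  # assumed model id

functions = {
    "def mean(xs): return sum(xs) / len(xs)":
        "Compute the arithmetic mean of a list of numbers.",
    "def flatten(xss): return [x for xs in xss for x in xs]":
        "Flatten a list of lists into a single flat list.",
}
doc_emb = model.encode([f"search_document: {doc}" for doc in functions.values()])
query_emb = model.encode("search_query: how do I average a list of numbers?")

best = util.cos_sim(query_emb, doc_emb).argmax().item()
print(list(functions.keys())[best])  # should retrieve the mean() function
```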
00:24:55.160 | They evaluated the benchmarks using the CoIR (Code Information Retrieval) framework.
00:25:01.840 | So a single vector retrieval task.
00:25:04.600 | And all of the models are reusing the best hyperparameters identified in section 3.1.2.
00:25:14.960 | And then, here, this is related to memory.
00:25:19.120 | So here they have memory, as maximum batch size, and inference, in thousands of tokens per second, efficiency
00:25:24.240 | results on consumer hardware, an RTX 4090, averaged over ten
00:25:32.480 | runs.
00:25:33.480 | I don't know if anybody has any comments on this particular one.
00:25:39.280 | >> Yeah, I was actually kind of curious about what you were talking about the -- what is
00:25:49.160 | it, CSN, they call it?
00:25:51.720 | The data sets for evaluating code.
00:25:54.280 | Is that like -- is it normally evaluated that way?
00:25:57.960 | By encoding the code and then trying to retrieve the correct docstring?
00:26:03.560 | Is this like the normal benchmark that's used against encoders?
00:26:06.320 | >> That is a great question.
00:26:09.960 | I have no idea.
00:26:10.960 | >> I was under the impression it's the other way around.
00:26:17.360 | >> You know, this would be a good thing for Elicit or deep research.
00:26:21.600 | I'm going to see if I can find that.
00:26:24.600 | >> Yeah, but that's a great question.
00:26:29.280 | Maybe we have someone for any of the -- from cursor in this call or from any of the major
00:26:36.240 | IDEs that can say how they're doing it.
00:26:39.920 | We'll keep it a secret.
00:26:45.920 | >> Okay.
00:26:46.920 | So ModernBERT-large increases its lead despite having comparatively fewer parameters.
00:26:54.080 | So 395 million, versus GTE-en-MLM large's 435 million.
00:27:01.700 | So that's nice being able to do a little bit more with less.
00:27:06.400 | Then ModernBERT outperforms other long context models with at least a 9 NDCG@10
00:27:15.600 | point lead on both model sizes.
00:27:18.040 | I was a little bit -- a little bit thrown about.
00:27:21.240 | Like I didn't really get this and I had to reread this piece.
00:27:24.880 | So what it essentially means is -- and then it also surpasses all existing base models,
00:27:36.040 | including DeBERTaV3, becoming the first MLM trained model to do so; apparently, DeBERTaV3
00:27:43.920 | base was unbeaten in the GLUE benchmark.
00:27:54.040 | And then it improved understanding of code at no detriment of its ability to process
00:27:59.080 | natural text.
00:28:00.080 | So it's nice.
00:28:01.400 | You often see models that if they have been too fine tuned on code or too -- or they have
00:28:07.520 | a large emphasis on code, that they don't perform that well on any other common natural
00:28:12.400 | language tasks.
00:28:19.440 | And then for the evaluation setting, it is able to process batches twice as large
00:28:27.800 | as every other model, on both input lengths.
00:28:33.200 | But ModernBERT-large is slightly less memory efficient than the original BERT-large on short context
00:28:39.360 | inputs.
00:28:40.360 | So we're talking about the 512-token range.
00:28:45.520 | But it can still process batches at least 60% bigger than BERT-large.
00:28:49.720 | So it's just one small piece where it doesn't beat every single model
00:28:56.640 | in every single characteristic.
00:29:02.400 | But other than that, almost in every measure, it goes above all the other models.
00:29:10.940 | It is the first open model to feature full model unpadding, and it
00:29:17.400 | is the first encoder designed in a hardware-aware way to maximize inference efficiency.
00:29:24.720 | It is in a class of its own in both single-vector and ColBERT-style long context retrieval benchmarks, scoring
00:29:30.820 | at least 6.85 and 9.1% points higher than the closest model, respectively, while remaining
00:29:38.400 | state-of-the-art on short context retrieval in both single and multivector settings.
00:29:44.160 | And -- oh, I thought that was interesting, that it says that the MLM objective gives
00:29:51.440 | the model some ability to generate text by suggesting a given token to replace the mask
00:29:57.160 | token, which could result in generation of harmful content.
00:30:02.200 | They do say here that, however, it is not primarily a generative model, and as such,
00:30:07.760 | it has not been trained for, and therefore cannot generate, long sequences, so it might
00:30:12.040 | predict the masked tokens in specific ways, but it's not like it's going to go and
00:30:17.600 | tell you something at length.
00:30:21.760 | It might say something harmful.
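
The limited "generation" being described is just mask filling; a quick sketch, with the model id assumed:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")  # assumed model id
mask = fill.tokenizer.mask_token                                   # don't hard-code the mask string
print(fill(f"The capital of France is {mask}."))                   # top candidate tokens for the mask slot
```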
00:30:25.400 | In terms of the scaling, besides the architectural modifications, a key aspect of the study is
00:30:31.040 | data scaling.
00:30:32.360 | However, other scaling axes, they note, in terms of parameters, are still left unexplored,
00:30:39.200 | so there's -- so what the authors, I think, are saying here is that there's a lot of room
00:30:43.240 | for improvement, even if this is a large bump in improvement already.
00:30:48.920 | And I think that's it.
00:30:50.920 | That's all, like, the highlights that I put when I was going over it.
00:30:54.740 | If anybody else has another comment or wants to touch on something else.
00:31:05.480 | >> I think one thing the authors did that's implicit is compare it only against the BERT
00:31:11.960 | models rather than comparing it against the other, like, collection of them.
00:31:16.360 | I mean, certainly sizes are different if you were to go on, like, the MTEB leaderboard and
00:31:22.640 | look at them.
00:31:24.080 | Yeah, I was somewhat surprised whenever I started to read into it.
00:31:29.200 | I thought there might be some mention of the other encoding models that are out there,
00:31:34.040 | but that wasn't the case.
00:31:37.000 | >> Yeah, there's a lot of variations of BERT, so I can imagine they pick -- they only pick
00:31:45.680 | a selected few for specific reasons.
00:31:47.240 | They do mention in the paper why they pick a couple of them.
00:31:51.400 | So RoBERTa and also obviously the base BERT, but they definitely could have
00:32:03.560 | gone further, or maybe just left it for the reader to do a bit more testing and put stuff out
00:32:09.160 | there.
00:32:12.160 | >> Sorry, I didn't -- your audio cut out for a bit.
00:32:26.880 | I didn't hear what you said.
00:32:27.880 | >> Oh, sorry.
00:32:28.880 | I was saying that I appreciate that you put the NDCG explainer there, the link.
00:32:34.580 | >> It's a ranking similarity thing.
00:32:37.120 | Yeah.
00:32:38.120 | Yeah.
00:32:39.120 | I just -- I feel like they could just say ranking score and then, you know, footnote
00:32:43.400 | NDCG, but it's the same thing.
00:32:46.400 | It's always hard to remember all these things when you have -- you don't live in this world
00:32:50.840 | all the time.
00:32:54.600 | I think we have Benjamin who said that he wanted to talk after we finish recording.
00:33:02.720 | I really recommend reading the blog post.
00:33:05.840 | That was very much more my level.
00:33:09.440 | I don't really get involved in the training side.
00:33:11.320 | I just want to see -- understand what I need to know as a user of the thing.
00:33:16.040 | And then potentially as a fine tuner, you know, for stuff.
00:33:19.280 | So but I think like a very major advance, I think it's going to be a workhorse tool.
00:33:24.640 | So like I think multiple of us picked this last week and for good reason.
00:33:28.520 | So I think we're happy to cut over to the offline part of the Convo, unless anyone else
00:33:34.120 | has things to share.
00:33:37.600 | Okay.
00:33:38.600 | All right.
00:33:39.600 | I will pause recording and then Ben, Benjamin, you can come up.