Let's go. I'll kill time while Eric figures out how to share his computer. So why did we pick BERT today? I'm kind of curious. - Yeah, 'cause I'm working on a text classification problem at work and just wanna get the background. - Okay. - Yeah, I wonder if there's a microservice that's easy to set up.
I wonder if OpenPipe does this, where it just mirrors a structured-output GPT-4o call until it has enough data for BERT and then switches you over to BERT. - What do you mean by mirrors? - Shadows it. - Basically, you just use the LLM at the start until you have enough data, then train BERT and just swap it in, so it's cheap.
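Just to make the shadowing idea concrete, here's a minimal sketch; `classify_with_llm` is a hypothetical stand-in for the structured-output GPT-4o call, not an OpenPipe feature:

```python
import json
from pathlib import Path

LOG_PATH = Path("llm_labels.jsonl")  # assumed location for the collected pairs

def classify_with_llm(text: str) -> str:
    # Placeholder for the real structured-output GPT-4o call.
    return "positive" if "love" in text.lower() else "negative"

def classify_and_log(text: str) -> str:
    label = classify_with_llm(text)
    # Shadow every call: append (text, label) so the pairs can later be used
    # to fine-tune a BERT classifier, then swap it in once there's enough data.
    with LOG_PATH.open("a") as f:
        f.write(json.dumps({"text": text, "label": label}) + "\n")
    return label

print(classify_and_log("I love this product"))
```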
- Yep. - I think OpenPipe actually does that, but I don't think it's as automatic and as seamless as your vision. - Yeah, I mean, given the amount of times people suggest using BERT for classification because it's cheap, it's a path that's worth taking. Oh, Eric's rejoining, okay.
- BART and T5 are good too, but you also gotta host them. Part of it is just the switch, but with OpenAI or a router you have endpoints and a hosted model and all that. To do BERT, you gotta go deploy it somewhere yourself and handle that part.
- Yeah. All right, Eric, we can see you. - Okay, perfect. So we'll start by going through the paper and then there's additional material out there on BERT. So depending how much time we have, we can look at some of the other things out there. So first of all, BERT stands for Bidirectional Encoder Representations from Transformers.
So this is one of the first transformer papers; I believe it was about a year after the Attention Is All You Need paper. You can see it's from 2019. So ancient history in terms of deep learning and NLP, but still useful for a lot of use cases. And just for a little bit more context.
So this was a Google paper, and after they trained and released BERT, they started using it in Google Search. It provided context for search results; there are some examples maybe we can look at later. But if you gave a query that could mean a couple of different things, they used this model to discriminate between the meanings.
So let's, I guess, just walk down through the paper here. One thing to note is that there were two different kinds of approaches early on: feature-based and fine-tuning. Models like ELMo were feature-based, where they had task-specific architectures that were built into the model.
Whereas models like BERT and GPT used the fine-tuning approach: they're more general purpose, so they just trained one model and then you could fine-tune it after the fact for your particular task. They also talk a little bit here about how one of the limitations of standard language models is that they're unidirectional.
So what that means is that GPT and many other models out there now only look forward. They take a sequence of words and try to predict the next word in the sequence. However, BERT also goes backwards. So in the training data, it can also start at, say, the end of a piece of text and try to predict the previous word.
And so we'll talk a little bit about how they avoid contamination of the prediction, because obviously if you're training from back to front, you get a peek at the words that are coming up. - So a lot of this section was pre-decoder-only transformers, right? So a lot of what they reference is RNNs, GRUs, LSTMs.
So with pre-transformer stuff, you only get one pass, and then they're like, "Crazy idea: what if you look at it from both directions?" A lot of the training tasks were classification, right? When humans classify something, we don't do it token by token. We listen to the whole sentence, then we classify.
So this was more the LSTM, RNN era, yeah. - Right. Yeah, it's a very early paper. So let's see, we talked about bidirectional. And then, yeah, fine-tuning versus feature-based. And then they show the effectiveness of BERT on 11 different NLP tasks. I guess one call-out on the related work is just ELMo.
So ELMo was another model. I believe it was also by Google; if anyone knows for sure, feel free to correct me. But they used different representations for the same word in different contexts. So take the word "stick," for example: it could mean I'm going to chase a dog with a stick, or it could be like, hey, let's stick to the material that we're talking about, or maybe some other contexts.
And so the same token can mean different things, and because of that, they want to use different representations even though it's the same word in English. And BERT leverages that idea as well. - So slight correction: ELMo is from the Allen Institute and the University of Washington. And then further context.
So one of the best BERT adaptations, I think in 2019, was RoBERTa. RoBERTa is like BERT, but made good and bigger. And it was also from the same team at the University of Washington and Facebook, I think. But ELMo, yeah, it wasn't from Google, but same idea.
Things went from one-hot encodings and bag-of-words to ELMo, which was really popular for a bunch of Kaggle competitions where you needed good embeddings. But anyway, for embeddings, it's not from Google. - Great, thanks for the additional info. So let's take a stop here to look at this diagram.
So this is the BERT architecture. Essentially, you can see this pink row down here is a set of tokens. Let me see if I can zoom in here. I'm not sure why I can't zoom in while I'm sharing, but anyway, this first token is the classifier token. And then you see tokens one through N.
So those are the first sentence. And then there's a separator token. And then there's another token one through M. And the reason that BERT has this structure is because one of the tasks it does is sentence classification. Basically, it can take two sentences and determine if one of the sentences seems like it follows the other one.
And then you see, so this is a pre-trained model. So this is trained on, well, what was at that point a large amount of text. And then these are the different fine tunes of that model for different benchmarks, essentially. And FYI, I'm not looking at the chat. So if there's anything in there, I'm not seeing it.
I can answer questions later, or if someone just wants to unmute, feel free. So here are those two tokens I mentioned that you might not have been able to see: the classifier token and then the separator token. The classifier token is the position whose output gets used when you're doing a classification task.
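For reference, a small sketch (assuming the Hugging Face Transformers library) of how a sentence pair gets packed into that [CLS] ... [SEP] ... [SEP] layout:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Passing two strings packs them as [CLS] sentence A [SEP] sentence B [SEP].
encoded = tokenizer("I walked down to the store.", "I bought a burrito.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# -> ['[CLS]', ..., '[SEP]', ..., '[SEP]']
print(encoded["token_type_ids"])   # 0s for sentence A, 1s for sentence B
```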
So they talk about how there are two steps in the framework: pre-training and fine-tuning. Probably a lot of people are familiar with that, so I won't go into too much detail. This is interesting: these are 2019 numbers for what was at that point a large model. You can see the base model was 110 million parameters, and the large model, which I saw someone refer to as gigantic or unbelievably large or something like that, is 340 million parameters, which is, you know, about 5% the size of a Llama 7B or something along those lines.
So what at the time seemed very large, in retrospect, isn't really. And that also goes for the training data. Do they have it in here? Well, there were two training sets that they used. One was all of Wikipedia, the English version, which was like 800 million words.
And then I think a set of books, which was 2.5 billion words. And again, those data sizes, while they may have been large for the time, are relatively small now. Typically, at least for frontier models, you're talking about low trillions of tokens to train them. So here they talk about what I mentioned earlier, having a pair of sentences where you can have a question and answer in one token sequence with the separator token in the middle.
And so let's go down and talk about pre-training. And here we get to their answer to the left-to-right versus right-to-left dilemma. Because it's bidirectional, potentially each word could see itself in the future. So what they ended up doing was to mask 15% of the words: instead of having the actual word in the sequence, they have this mask token.
And then they try to predict that word either going forward or backward and score the training on that. They do have an issue, though: since the mask token does not appear during fine-tuning, they have to compensate for that in the pre-training step.
So in order to do that, they don't always replace the selected words with this mask token. They do 80% of the time; 10% of the time, they stick in a random token from the vocabulary; and then the other 10% of the time, they just leave the word unchanged.
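A rough sketch of that 80/10/10 rule, with the special-token id and vocab size taken from bert-base-uncased and the input ids made up for illustration:

```python
import random

MASK_ID = 103        # [MASK] id in the bert-base-uncased vocab
VOCAB_SIZE = 30522   # bert-base-uncased vocab size

def mask_tokens(token_ids, mask_prob=0.15, rng=random):
    """Apply the 80/10/10 masking rule to a list of token ids."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            labels.append(-100)      # position not selected: no loss computed here
            continue
        labels.append(tok)           # selected: the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return inputs, labels

# Arbitrary example ids, just to show the shape of the outputs.
print(mask_tokens([1045, 7847, 2091, 2000, 1996, 3573]))
```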
And so that helps during the fine-tuning stage, so that fine-tuning doesn't expect these mask tokens all the time. And then the other task they give it during pre-training is next sentence prediction. That's the one with the two sentences and whether they are related. So that is, you know, sentence A, separator, sentence B.
And then they have a 50% split of training data, where half of it is labeled as IsNext and half of it is labeled as NotNext. So you can imagine an initial sentence of "I walked down to the store." One that might be marked as IsNext is something like "I bought a burrito," and something that might be NotNext is, you know, "Salamanders have many spots," or something like that.
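And a tiny sketch of how those 50/50 IsNext / NotNext pairs could be built, using the example sentences from above (a real pipeline samples from whole documents):

```python
import random

def make_nsp_pair(doc, corpus_sentences, rng=random):
    """Build one (sentence A, sentence B, label) training example."""
    i = rng.randrange(len(doc) - 1)
    a = doc[i]
    if rng.random() < 0.5:
        return a, doc[i + 1], "IsNext"                 # the real next sentence
    # Simplification: a real pipeline would avoid sampling the true next sentence.
    return a, rng.choice(corpus_sentences), "NotNext"  # a random sentence

doc = ["I walked down to the store.", "I bought a burrito."]
corpus = doc + ["Salamanders have many spots."]
print(make_nsp_pair(doc, corpus))
```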
And so this part of the pre-training figures out which of those sentences follow each other. Here, oh, okay. So here's what I was talking about earlier: the book corpus, 800 million words. And I guess I had them flipped, Wikipedia is the 2.5 billion words. And this shows the input representation.
So there's a couple of different layers and they just add these layers up to come up with the input. So the top layer here is the token. So it's the vector representation of that particular word. The next is the segment. So this splits it up between sentence A and sentence B.
Each one of those has a different vector that gets added into the input to help the model distinguish between them. And then finally there's the positional embedding. So each place in the sequence has its own value that, when it's added to the other layers, helps the model distinguish whether, say, "my" comes before "dog" or vice versa.
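A toy sketch of those three embedding tables being summed into the input representation; the sizes are BERT-base-like, but the real model also applies layer norm and dropout here, which this skips:

```python
import torch
import torch.nn as nn

class BertLikeInputEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)     # one row per wordpiece
        self.segment = nn.Embedding(2, hidden)            # sentence A vs sentence B
        self.position = nn.Embedding(max_len, hidden)     # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # The three lookups are simply added together element-wise.
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))

emb = BertLikeInputEmbedding()
ids = torch.tensor([[101, 1045, 7847, 102]])   # arbitrary example ids
segs = torch.zeros_like(ids)                   # everything in "sentence A"
print(emb(ids, segs).shape)                    # torch.Size([1, 4, 768])
```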
Okay, and then they talk about a bunch of different benchmarks. We'll maybe just take a look at some of the results and then continue forward. Let me take a break to look at the chat here. - There are a few questions in chat. I guess Sam had one.
Has there been any revisiting of the, you know, weird pre-training tasks that were used in BERT? Or has this all been done and it's somewhat of a closed book, and next-token prediction is all you need? Any thoughts, comments? - So I didn't look at any of the follow-on BERT papers like RoBERTa or DeBERTa for this.
So I'm not sure; if there's anyone else on the call who knows, I'd be happy for a contribution. - Yeah, we had a slight follow-up in the comments there. Basically, for encoder-decoder stuff, it still makes sense. And a lot of my takeaway from this paper when I read it years ago was that they had really interesting pre-training tasks.
And if you look at it from the lens of what they're trying to do, it's actually pretty useless, right? Like, next-token prediction is still useful to generate words, but something like predicting sentence order, like should sentence one come before sentence two or whatever?
It's not really useful, right? There's never a time when people are trying to predict which sentence came before the other, but it does really teach a model conceptually what word order is, right? There's this set of words and that set of words, and you have to understand whether this set of words should come before that set of words.
So that helps in the classification tasks where you have to group words together. In some sense, for the tasks that BERT is trying to do, it's trying to be a small, efficient classification model, right? So it kind of makes sense to do these weird training objectives.
So next-token prediction, or masking of words, is similar: there aren't many use cases where you need to fill in the blank, right? We don't have 500 words where you have to fill in one word. But over a billion words, it does teach a model to understand what words go in between other words, so it's a good pre-training task and it helps with classification in general. So instead of having a very broad task of just predicting the next token, and abstracting that out to eventually get this emergent capability to do classification...
This is like, here are 12 tasks that mimic understanding words very well. You're a small model, use all of this to do classification. That concept is still applied for current encoder-decoder models and small models that are not just next-token prediction. So basically, if you break the task down into subsets of your main goal, it's still very effective.
And that's why, if you take a step back and look at what BERT did, it makes no sense in isolation. Like, Google doesn't need to spend millions of dollars to predict which sentence comes before or after another sentence, but it does help a small model learn better embeddings, so to speak, and abstract that out to these tasks.
- Yeah, agreed. Like the sentence prediction, you're right, there aren't too many use cases where that would be helpful on its own. I wonder if it does help the model figure out the relationships between words and ideas to some degree. - Yeah, that's pretty much the whole point.
It should help the model better understand breaking down the problem and understand word order. I think they also did something where they swapped words between different sentences, or swapped sentences, right? And that's even more useless in reality; there aren't many times where you've got mixing of words from different chunks of sentences.
But once again, it helps the model generalize towards understanding sentences. So stuff like that. If you look at it in today's terms, if you have a niche topic that can benefit from an encoder-decoder, like a small, on-device model, you wanna start breaking down problems with this sort of methodology.
But there are examples of papers that do this type of work. - Yeah, for sure. It would be great to follow up this presentation with some of the more recent work that builds on it. - There aren't many other questions in chat, by the way. - Okay, okay, cool.
You can see from this that, at least when it was released, BERT-Large was state-of-the-art, even beating out GPT-1 and ELMo. So at the time it was released, it was a very capable model. And then, I don't know, I'm gonna kind of skip through these, except maybe just to show the results of it being state-of-the-art on a lot of things.
So the next section is ablation studies: removing different parts of the model to see what the effects were. And let's see. One of the ablations they did was removing the next sentence prediction task. So I guess this is something we were just talking about, but they still kept the masked LM.
And then the next thing they did was make it only go left to right, also with no next sentence prediction. And you can see the results from those attempts up here. The top is the standard model. And then if you look at the no-next-sentence-prediction row, it does lose a little bit.
Oh, actually in QNLI it looks like there's a significant loss; in other tasks, much less. But then as you also take away the bidirectionality, it becomes less capable. It looks like how much capability it loses varies between tasks, but this does show that there is some value to those components.
Oh yeah, maybe this is what I was talking about, where they say, "We believe that this is the first work to demonstrate convincingly that scaling to extreme model sizes also leads to large improvements on very small scale tasks." This is kind of like the Bitter Lesson, though maybe a little bit exaggerated as far as "extreme model sizes" goes at this point.
And then they talk about the feature-based approach with BERT a little bit. So if there are questions, feel free to unmute. Otherwise, let's go over and look through these. Jay Alammar has made some very helpful Illustrated BERT and ELMo articles that we can go through to kind of cement our understanding.
So that's just a comparison of some models that were out at the time. And then this is one thing that was a takeaway for me: okay, this is the pre-training step that we talked about, but then in the supervised learning step, you basically stick a classifier after BERT.
So BERT is in charge of essentially encoding the text into an embedding, and then you use that classifier to classify, in this case, either spam or not spam. Let me see if I can find... There's one diagram that I thought was especially helpful. Oh yeah, this one. So this one shows how BERT takes the entire sequence of tokens and then, for each input token, produces an output vector.
However, for the purpose of classification, we only look at the first output vector. That one essentially contains the sense of all of the input tokens. And then you can run that through, it can be a neural network, could be a logistic regression. And from the features there, and I think there's something like 768 dimensions in the embedding, you can then predict spam or not spam based on your training set.
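A minimal sketch of that, assuming the Hugging Face Transformers library; the example texts are made up:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tok(["free money, click this link now", "are we still on for lunch?"],
            padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

cls_vectors = out.last_hidden_state[:, 0, :]   # first position only: one 768-dim vector per input
print(cls_vectors.shape)                       # torch.Size([2, 768])
# cls_vectors is what you'd feed to a downstream classifier
# (logistic regression, a small MLP, etc.) to predict spam / not spam.
```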
And let's see, that shows the same thing. And then here's like some illustration of the different encoder blocks. So as we mentioned earlier, BERT is encoder only. So the kind of classical transformer is an encoder and a decoder. Many modern models are decoder only. And so encoder is like used mostly these days for text classification or text clustering.
To my knowledge, encoder-only transformers aren't really used for any kind of sequence generation or next-token generation. This talks about ELMo and the different contexts of words and how ELMo captures that. (silence) GPT, I thought there was something here. Yeah, so this is just what we talked about: if you have a BERT encoder, you can stick another model on the end of it for training.
And then go from there. And then you can also use BERT for embeddings. So if you have a certain problem space with a lot of text that you want to embed, you can continue pre-training or fine-tune BERT on your industry-specific text corpus and create an encoder that's built especially for your needs.
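A hedged sketch of that continued pre-training idea, using the masked-LM objective from Transformers; the corpus file name and training settings are placeholders, not a tuned recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "domain_corpus.txt" is a placeholder for your industry-specific text.
ds = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

# The collator applies the 15% masking on the fly, as in BERT pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=True,
                                           mlm_probability=0.15)
trainer = Trainer(model=model,
                  args=TrainingArguments("domain-bert", num_train_epochs=1,
                                         per_device_train_batch_size=16),
                  train_dataset=ds, data_collator=collator)
# trainer.train()  # commented out: needs a real corpus and a GPU to be useful
```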
So there's one more. I'll pause for any other questions in the chat. So some context on how they trained BERT: they have those 12 tasks, right? They had the BERT base model, the masked language model. They had next sentence prediction, token classification, QA, sequence classification. They had all these tasks.
And basically what they did was take BERT models with a layer added on top as a classification head. Now, in 2019, when people started using these models, what was really common was: if you had enough data, take the base model, which has no output head (its output is just the last step of processing these tokens or sentences), and add a linear head with a softmax for classification. Then you fine-tune it on a lot of your own data. If you didn't have as much data, one thing that was popular was to take the sequence classification head and just continue fine-tuning it on your data, since it's already somewhat good at sequence classification.
But there was a whole series of work that looked into where you should do your fine-tuning based on how much data you have. So if you have a lot of data, it was pretty common to not only add a classification head, but also peel back a few layers.
So reset the weights of, say, the top three, the top two, or the final layer, and then continue training those as well for your task. Because at some level, what people started to learn was that these pre-training objective tasks, like masked word prediction and sentence ordering or QA, were actually affecting the net output of sequence classification.
And if you wanted better results, you could just train more of your whole model on your task. So there was a whole thing of: you should remove the top two layers, add a sequence classification head, train on tens of thousands of examples, and you'll get state-of-the-art results. Or, if you have less data, you could freeze layers or unfreeze weights.
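A sketch of that peel-back-the-top-layers recipe; re-initializing the top two layers (and freezing the rest) is just an illustrative choice, not the exact setup from those papers:

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Re-initialize the top two transformer blocks to fresh random weights.
for layer in model.bert.encoder.layer[-2:]:
    layer.apply(model._init_weights)

# Optionally freeze everything below the re-initialized layers, so only the
# top blocks and the new classification head get fine-tuned on your data.
for layer in model.bert.encoder.layer[:-2]:
    for p in layer.parameters():
        p.requires_grad = False
```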
There's a whole set of this, but it was pretty common to also just mess with the architecture and add a classification head. - Do you know if anyone trained all the layers, or just used that as a starting point to train all the layers? Okay. - So there's stuff that came out a year or two ago where basically you could retrain BERT in 24 hours on a budget of sub-$500 with regular A100s, and how you can do this better.
So at the time it wasn't really feasible, you know, you don't have Google compute to retrain BERT from scratch, but now there's stuff like 24 hours and a couple of hundred dollars to retrain your own, better BERT. There's an academic paper that came out about this.
If people go down the rabbit hole of encoder models and this stuff, it's a cool one to look into: how they condense these 12 pre-training tasks down to a few, with a better curated dataset, and outperform the original for a couple of hundred dollars in 24 hours.
But then it was also common that there were sentence classification and sentence extraction tasks that BERT was adapted towards, like BERT for sequence classification, or sentence extraction for summarization. And then companies that took it to production would do significant retrains, or yeah, they'd train a lot more of it.
And then this also just went into: at what point do you want to start training? (mouse clicking) Yeah, I mean, that sounds like interesting stuff. If you have any links, please drop them in the chat and I'll check them out. So maybe the last thing we can go through here: the same author, Jay Alammar, has a notebook where he shows hands-on how to do this movie review sentiment classification.
He uses DistilBERT. So DistilBERT is Hugging Face's distilled version of BERT that has very comparable performance with many fewer parameters. And then to do the classification, he just uses a basic logistic regression model from scikit-learn. So the features that go into this logistic regression model are just the vector of size 768 that comes out of the DistilBERT embedding.
So a lot of this leans very heavily on the Hugging Face Transformers library. So let's see, that's just installing it, doing imports. He uses, he must've mentioned it up above, but... anyway, there's a particular Hugging Face dataset that he's using that has the movie sentiment training data. Maybe he just uploaded it somewhere.
- I think it's an IMDb dataset. I think it's one of the Kaggle ones. Yeah, it's just a Kaggle IMDb. You have movie reviews, you classify them. Oh, and on that actually reminds me, one of the big things that made BERT somewhat popular was there was another Kaggle competition on tweet classification of sentiment.
So in tweets, with previous embeddings like bag-of-words or GloVe or ELMo, if you have stuff like "I hate this so much," that in some contexts in tweets could still be positive, even though it's very negative when you just look at the lexical meaning of the words. BERT embeddings were what really dominated that.
And then for a few years, they kept doing follow-ups on that. But IMDb and tweet classification were the versions they used in a lot of these demos. - So let's see. So here we're just downloading DistilBERT from Hugging Face and getting the model initialized. You can see it's just a few lines of code.
So we have to do a few things, like tokenize the text and then add padding. This is so that all of the sequences can be run in parallel; we need to pad them out so that they're all the same length. And then we need to mark the padded sections in the attention mask so that BERT doesn't get confused into thinking the empty space is actual sequence that we want it to process.
So then you can see this diagram here. And again, apologies, I'll try to zoom in again. Oh, it worked this time. So this just takes the input text, runs it through DistilBERT, and comes out with the embeddings. That's all this part does. And then the one tricky part about all of this is that you need to pick out exactly which values from this, I guess, three-dimensional tensor you want to predict on.
So if you remember back from here, we just want the very first output; we wanna ignore all of the other ones. So he draws out in detail exactly how you pull just those vectors out of this three-dimensional tensor. And then it's pretty straightforward machine learning after that. You just turn those 768 dimensions into features, do a train/test split, train your logistic regression model, and then, once you've got the model trained, you can run a score, and it gets 82%.
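A self-contained sketch of that last stretch of the notebook flow, with toy reviews and labels standing in for the real dataset:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import DistilBertModel, DistilBertTokenizerFast

tok = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

reviews = ["a masterpiece of modern cinema",
           "two hours I will never get back",
           "funny, moving, beautifully shot",
           "flat characters and a dull plot"]
labels = [1, 0, 1, 0]   # toy sentiment labels, not the real dataset

# padding=True makes every sequence the same length; attention_mask marks
# which positions are real tokens (1) versus padding (0).
batch = tok(reviews, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state      # (4, seq_len, 768)

features = hidden[:, 0, :].numpy()                 # keep only the [CLS] position
clf = LogisticRegression().fit(features, labels)   # a real run would use a train/test split
print(clf.predict(features))
```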
So assuming it's a 50/50 split, then the expected amount just from random chance would be 50%. So there is a significant increase using BERT to do classification, but obviously still plenty of room for improvement. Okay, and down here, it says the highest accuracy score for this data set is currently 96.8.
So as you can see, things have come a long way since 2019, but it's still a useful model to start with for classification tasks or clustering. If you just wanna see what text sequences are close to each other in embedding space, you can use it for that as well.
So that's about all I had for prepared stuff. I'll stop sharing and then maybe go through the chat. If anyone wants to chime in or add any color commentary, feel free. - Someone linked the paper on the academic budget, the 24-hour BERT, and I was also trying to remember where, apparently, MosaicML showed how you can pre-train BERT from scratch for $20 now.
So yeah, $20: they did like eight A100s for an hour, and they were able to match the GLUE score of basic BERT with their recipe. Kind of interesting. So a note Eugene Cha made is that researchers felt a 10K training run was expensive. I remember mapping this out. All these things that compare to the 24-hour or one-hour BERT: they trained BERT for four days of TPU v3 equivalent, which at the time was, let's say, eight to ten dollars an hour, which is like 10 to 15K.
But then there wasn't just BERT-base. There was BERT-base, there was BERT-large, there was BERT-small. There were a bunch of experiments. BERT-large was trained for more than four days; the cost equivalent is like 50K on that, 10K on the regular BERT-base, less on the little one. And then you gotta add in the time and the R&D.
Oh, it was well more than a 10K project at Google. The BERT-large itself was already a 50K training run, plus 10 to 15 for BERT-base, plus just experimentation. So expensive, expensive. - I think most labs would love to hear those numbers for SOTA. - Right now.
- True, true. - I'm just reading through the chat. - Mm-hmm. - Did we get a volunteer for next week or still waiting on that? - Anyone else for next week? Any other questions on this, by the way? - I do have a random one. This is regarding the embedding size, right?
Even though I joke that this was the era before the gaming GPU folks came in and said, hey, you need to be divisible by 64, 32, or a power of two, right? Do TPUs not have the divisible-by-64 batch, I mean, optimization when it comes to the embedding size?
Is that why they have all these weird embedding sizes in TPU-related training? - I don't think it's TPU-based. I have some old notes I'm recalling where I dug through why they specifically did 768 and 512. And also, someone noted in chat that that's a limitation; there's other work that extends this out, like SentenceBERT, which extends the embedding dimension.
And they were also pretty small. But back to Eugene's point of, is it a hardware limitation? It's not. It was about divisibility between layers and adding layers and a bunch of stuff. I really can't remember the specifics of the reasons. I'll dig through some old notes, but someone broke down the per-layer math of sending inputs through.
And there was a decent reason for why all of this. It's also that there are 12 layers and 768 is divisible by 12; it went down that path. But it also wasn't someone from Google that worked on BERT, it was just mapping the input through every layer and all of this math working out.
And then a reason for like, oh, here's why this, why not that. And I was like, sounds good, checks out to me. I can probably find it if I look; it's in some notes from a couple of years ago. (indistinct) - Eric, someone has asked: does BERT's pre-training objective, MLM, masked language modeling, follow the same LLM scaling laws as GPTs?
- That's a good question. I don't know if there's been enough research in that area to like come to any conclusion. So I, like when I was researching this presentation, I went to, I think it was paperswithcode.com or something like that and looked at all the top papers for text classification.
And a lot of them were from 2021 or earlier. So it seems like this direction of encoder-only or bidirectional, well, I don't know about the bidirectional part, but at least encoder-only research has been pretty sparse recently. So, for example, I don't know if anyone's spending millions of dollars to train, you know, super-BERT or something like that.
- I think it's-- - Well, the Reka AI guys seem to be playing there. - No, the last time I looked at the leaderboard on Hugging Face, I think it was all led by these transformed LLMs that now get the best performance, like a Mistral 7B turned into an embedding model.
- Is that, sorry, I missed. Is that for text classification? - Well, yeah, I guess at the core, it's all turning text into an embedding. So yeah. - Could you drop a link in the chat? - Yeah, sure. - Yeah, it seems like that would lead to higher performance with this classification, but I'd have to do a little bit more research on that.
- Well, there's two pieces as well, right? So for masked language modeling, a lot of the scaling law papers directly showed why decoder-only next-token prediction scales better than masked language models. One reason is purely that when you mask 15% of tokens, you only train on the masked ones. So you lose a lot of data.
You need more quality data; you're just straight training on less, right? If you have a dataset of a trillion tokens, you can mask 15% of them and train on predicting that 15%, or you can train on all trillion of them. That's a straight scaling difference. Now, if you have 15 trillion tokens versus 1 trillion, that's another question. For embedding tasks in smaller models, the encoder has the better trade-off on the scaling curve at the start, it learns better with fewer tokens at first, but extending this out in pure scaling laws, yeah, you lose a lot of your training data, right?
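Just to put rough numbers on that point:

```python
corpus_tokens = 1_000_000_000_000          # a hypothetical 1T-token corpus

mlm_targets = int(corpus_tokens * 0.15)    # MLM: loss only at the ~15% masked positions
clm_targets = corpus_tokens                # causal LM: loss at every position

print(f"MLM supervised positions per epoch: {mlm_targets:,}")
print(f"Causal LM supervised positions per epoch: {clm_targets:,}")
```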
And then that was one of the big points of why we do next-token prediction: because it scales better than other tasks, right? So the scaling laws were partly made to show which objectives are better, so this goes directly against them. But at small scale there's benefit in this, specifically for edge models: you can deploy a BART as a guardrail live and have it intercept every query, because it can act in milliseconds, whereas LLMs will still take longer, right?
So at a smaller scale, they'll be better. There was another part to this that I'm blanking on. Oh, Reka AI: they're doing encoder-decoder generation models where they're adding decoder heads to encoders and scaling them up to billions of parameters. They're a case study of spending money to train these up pretty big.
(mouse clicking) There's a question from Isaac in the chat about my use case at work. So currently, where we're at in the project, we need to accumulate some good training data. We don't have enough training data yet to actually train a BART or that type of model.
So to start with, we're just using LLMs and prompts to like do some logical classification to like kind of bootstrap until we get enough data. And then also to create a feedback loop where we can get feedback from people so that we'll have enough like solid training data so we can actually train a model.
The main purpose being faster performance, as Vibhu mentioned: you can respond in milliseconds versus multiple seconds, or tens of seconds, if you're using an LLM. (mouse clicking) - Well, thank you, Eric. - Yeah. - Always appreciate the OG papers. - Yeah, it's good to-- - Do you wanna ask for a volunteer for next week?
- Any paper, it doesn't have to be anything in particular; I'll look at anything that I'm interested in. (mouse clicking) - Yeah, I don't know if there's any paper that's caught my eye recently. I guess we talked a little bit about embedding papers; people are interested in embeddings. - Did we ever do the Jina 2 embeddings paper?
- No. There's also Nomic Embed. I didn't see the Jina one, but the Nomic one was pretty detailed in terms of what their process was. So, interesting. - I might be mixing up the papers, but I think I remember we went through one embedding paper.
Maybe it was Nomic, maybe it was Jina, I don't know. But yeah, probably it wasn't covered. - The Nomic one compares directly to Jina. I have the exact same thing: I haven't seen the Nomic one as much. I just know Jina was the open-source, 8K-context, very detailed, here's-how-to-do-embeddings-from-scratch-and-fine-tune-them paper.
But I guess if Nomic is the same kind of thing, 50/50 if anyone wants to take one or both; I would love both. - Oh, there's Jina 3 now, crazy. - Okay, well, I will volunteer for Nomic or Jina. If anyone else has papers they wanna cover in the meantime, let's cover them, but otherwise I don't wanna drag this too long.
Yeah, nice chat. - All right, and thanks, Eric. - Yeah, thank you. - Thanks, Eric. Thanks, everyone. - Bye. - See ya. See ya.