
[Paper Club] Embeddings in 2024: OpenAI, Nomic Embed, Jina Embed, cde-small-v1 - with swyx


Transcript

there was a whole bunch of interesting embedding work piling up, and I figured it'd be good to have a state of embeddings overview. And so we have basically one blog post and three papers that I've sort of defined in scope. They're all listed in the meeting notes here. And I would consider this basically everything that is relevant for understanding embeddings as of today.

And so I think that the first thing is to understand MTEB, the Massive Text Embedding Benchmark. This is the de facto benchmark. There are criticisms of it, but if you use embeddings and you don't know MTEB, you don't know embeddings at all.

This changes a lot. It used to be that the Chinese models were completely dominating the top 10. Now we have American models, Chinese models, and some I don't recognize. So I wouldn't pay strict attention to the exact ranking, but just know the main benchmarks that people care about, as well as the trade-offs in model size and memory usage.

This becomes extremely relevant when it comes to efficiency concerns. Even though Stella is ranked number six, it's at least an order of magnitude smaller in model size for roughly the same performance you might get from a much larger model. So practically, you might just use it instead of something that's higher ranked.
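
If you want to poke at the benchmark yourself, the `mteb` Python package will run any encoder on individual tasks. A minimal sketch, assuming a small sentence-transformers model as the stand-in (the package API has shifted across versions, so check its README):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Any model exposing .encode() works; this small model is just a placeholder.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Run one task rather than the whole benchmark, which takes a long time.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```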

The other thing that's relevant: they have everything in here, including cde-small, which we're going to cover. I don't know where the OpenAI models land on this leaderboard, but people should definitely understand that this is the latest update to the OpenAI offering.

Typically, you want to at least be familiar with the OpenAI offerings, just because that tends to be the starting point. That's the API key that everyone already has, rather than adding a new API key for other models. They're not going to be the best in the world, but they're going to be very, very good, and usually that's good enough.

I would say the other thing to be aware of is that, for the first time, they're offering two different sizes and also Matryoshka embeddings. Where do I find the document? I think they didn't mention it. They did, I think. They added it to the blog post at the end.

Okay, you're going to see all my emails. Okay, never mind. I'm just setting up this browser for the first time, so: OpenAI text embeddings. There we go. No, that's 2022. Does anyone have that link? Are you looking for Matryoshka, or the one where they referenced Matryoshka?

Where they refreshed it. It would be really awesome if they actually had it here. Nope. Embeddings... okay, I can't find it. Okay, it's actually the first drop-down. I'm pasting the link in the chat here. Give me a second.

If you open it, it's the link that you had. The 2024 one? No, the link that you shared. Yeah, click on that, and scroll down a little bit more. Ah, "reducing embedding dimensions." There we go. Do you see that? It's actually hidden in there.

I don't know if the word Matryoshka actually shows up. Kishore sent the blog post I was looking for in the chat, and they did put it in the footnotes, because the authors complained that they were not credited, which is very, very shady of OpenAI. So yeah, these guys were the first to offer it, and it is very good.

We'll see later, in one of the Jina posts, how efficient it is. I don't think they communicated it very well here, but let me just skip ahead to the Jina post, and then we'll show you. So Matryoshka embeddings let you reduce the amount of data that you store. This is so annoying.

Wait, I have the image in my head... wait, did they get rid of it? Wow, they got rid of it. Okay, so I guess I have to refer to my own blog post about it, because they got rid of it. Oh, maybe it was in the paper. Ah, okay, yeah, it was in the paper.

Sorry. Let me refer to my own notes. Here, there we go. So when you offer Matryoshka, you can do something like this, where you compress: let's say the original output dimension is 1024. You can compress it to 64, so you're reducing the amount of storage space by about 94%, and that only results in an 8% drop.

So basically, from 1024 down to 64, your performance drops from 75 to, like, 69 or whatever, which is pretty good. Accuracy@1 would be 47.2 going down to 41.3, so that's about a 6-point drop, and accuracy@5 would be 75.3 going down to 69.4.

So that's a huge amount of storage saved, as well as compute, for a pretty modest drop. OpenAI was the first big provider to offer this, the first to acknowledge that it's relevant, and now basically every offering should do it. And yeah, that's the state of the art.
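
As a concrete illustration, truncating Matryoshka-trained embeddings is just slicing and re-normalizing; OpenAI exposes this as a `dimensions` parameter on the text-embedding-3 models, but you can also do it client-side. A minimal sketch with made-up data:

```python
import numpy as np

def truncate(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize for cosine similarity."""
    cut = embeddings[:, :dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# Stand-in for a batch of 1024-dim embeddings from an MRL-trained model.
full = np.random.randn(1000, 1024).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate(full, 64)            # ~94% less storage per vector
print(full.nbytes // small.nbytes)    # -> 16x smaller
```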

I don't know if anyone else has played with the OpenAI embedding models enough to offer more notes before I move on to the open models, but I just want to start with OpenAI. And the thing is, training these Matryoshka embeddings essentially comes for free. You just need to update the loss function.
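
The "comes for free" part is that you sum the usual training loss over nested prefixes of the same embedding. A hedged sketch in PyTorch, using an in-batch contrastive loss as the stand-in objective (the dimension list and temperature are illustrative, not any particular paper's recipe):

```python
import torch
import torch.nn.functional as F

def matryoshka_contrastive_loss(q, d, dims=(64, 128, 256, 512, 1024), temperature=0.05):
    """q, d: (batch, 1024) query/document embeddings; positives aligned by index."""
    total = 0.0
    for k in dims:
        qk = F.normalize(q[:, :k], dim=-1)   # truncated prefix of the same vector
        dk = F.normalize(d[:, :k], dim=-1)
        logits = qk @ dk.T / temperature     # (batch, batch) similarity matrix
        labels = torch.arange(q.size(0), device=q.device)
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)
```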

Yes. You can do it, and I've tried something like this, where I cut embeddings to a quarter of the size, and it's almost as good in fidelity. And the thing is, you might think that 1024 to 64 is not such a big deal, but 1024 is just not usable in production; depending on your production use cases, you may not be able to meet the latency. But 64, 128, those are amazing.

So it's essentially the boundary between what's usable and what's not. What exactly... so Dan says 1024 is enormous. I mean, what do you mean? It's just more numbers to store. If you think about it this way, as your embedding size increases, your approximate nearest neighbors lookup will get more expensive as well.
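
To make "more numbers" concrete, here's the back-of-envelope math for a hypothetical 10-million-vector corpus in float32:

```python
n_vectors = 10_000_000
for dims in (64, 128, 1024):
    gigabytes = n_vectors * dims * 4 / 1e9   # 4 bytes per float32
    flops = n_vectors * dims * 2             # brute-force dot products for one query
    print(f"{dims:>4} dims: {gigabytes:6.1f} GB, ~{flops / 1e9:.1f} GFLOPs per exhaustive query")
```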

So this is more about, like, the n-squared explosion: looking up, doing dot products, et cetera. It just costs more to compute. Vibhu, are you there? I think you have a lot more to add to the embedding search space segment. I'm not sure; I know he's in San Diego with family, so I don't really know if he's able to comment.

I think he dropped out. Yeah, we've already done a session on MRL, so we can refer people to that MRL paper if they want. I was just going for: what should you know about state-of-the-art end-of-2024 embeddings? This would be it. There are the different sizes you should be aware of, you should know the models, you should know the costs.

It's very cheap. I feel like they're basically embedding this for you at cost, mostly because embeddings are a fantastic form of lock-in for any API provider: once you've embedded your corpus with their model, you have to keep coming back to the same model for your queries. So let me just continue, unless Sam has other questions that people can answer.

So then we're going to move on to the papers. The first one I would highlight is Nomic, because someone was asking whether there's a good paper on the full training process of a model, and Nomic is the closest one that I know of.

Admittedly, there's a US bias here, because there's a whole bunch of Chinese embedding papers that probably have some detail on their training process. But Nomic has open source code, open data, open training code, and full reproducibility, which in my mind is good enough if you want to deep dive into that.

The main thing I would highlight is the "what they actually use" part, which is a good follow-up from last week. Basically, what we call the Noam Shazeer stack is pretty standard. These are all basically state-of-the-art training processes and training tech, as far as I understand from every single model trainer that I've talked to.

I'm not sure about the masking. I actually did not understand it. I thought you just mask individual tokens, and I thought that was standard. I didn't know there was a hyperparameter here around a mask rate of 30% versus 15%. It's not something I'm familiar with, and neither am I familiar with a lot of these other BERT-based models.
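
For context, that mask rate is just a hyperparameter on standard masked-language-model pretraining. A minimal sketch with Hugging Face's collator, using bert-base-uncased purely as a placeholder tokenizer:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Original BERT used mlm_probability=0.15; the Nomic recipe bumps it to 0.30.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

batch = collator([tokenizer("embeddings are just compressed meaning")])
print(batch["input_ids"])  # some tokens replaced by [MASK]
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```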

But I'm curious if anyone has thoughts or questions. Dr. Charles already came and saw you, right? Yes. You're unmuted, I don't know if that's on purpose. Sorry. Has anyone checked out Nomic? Are you interested in going into any detail? I'm fairly friendly with that team.

Kishore says, "Original BERT paper masks 15%." Oh, I didn't know that. So yeah, that was a new finding for me. The rest of it, I think, was relatively unsurprising for its time. I would say the interesting thing is that one of the reasons Nomic is investing in this is because they sell a cluster visualization tool, Nomic Atlas.

And so they're interested in building tools for embeddings, for you to explore your datasets. RJ asks, "Is there a study of vector retrieval speed versus embedding size?" Not that I know of, but I guess they're correlated. I don't know what specifically you would want. More detail would be great.

So I would say, if you want the state-of-the-art process, paper, data, or code, I would just grab it from here. They've found this a lot. RJ says, "Discussing large embeddings = bad, so you want to quantify it." Yeah. How would you quantify it, Eugene?

It sounds like you've had some... I got you, RJ, and I can address this when you finish whatever you want to say and we turn off recordings. Okay. All right. Keep that in mind as we go. But yeah, you can go through the Nomic paper here.

I would say it's pretty straightforward training stuff here. I just think it's nice to have a starting document that has all the tech choices, the hyperparameters, and reproducible code. To me, this is where you start. I also think that the prefixes and everything basically reflect BERT.

This is just updating BERT in every shape and form, which is kind of nice. I never really thought about that. These are all the Chinese models that I talked about; if you want their papers, I'm sure they're all referenced here as well. I have not read them.

I was a little surprised that it was just BERT, updated and modified slightly. But I wonder if that's because the true value of this neural net is in the data that's going into it, meaning it's more dependent on the data than the architecture. I say that, but it's likely both.

Yeah, I've got nothing for you there. One comment I'll share, which I have had other founders tell me, is that it's very surprising that all embedding models are effectively general purpose, and there are no code embedding models. If you look at the Nomic datasets, code is number 10 down the list, at less than 1% of the dataset.

And StackExchange is maybe down here, and StackExchange is not even a code-specific thing. The Codiums and the Cursors of the world, and MorphLabs, have had to create their own code embedding models that they don't release, which is surprising. It's IP, but it sounds like a high-potential thing for some PhD student to publish as open research, because as of right now, every single embedding model is just general-purpose language, and code is obviously different.

We'll cover a little bit of how to change that with CDE, but I think I'll move on to Jina, unless anyone has issues. Okay. So, Nomic was very focused on a single language, English. I think they have some multilingual capability. I don't know... Spanish? It wasn't covered in here, but I did talk to them about this.

But anyway, Jina, specifically as a European company, is very, very focused on multilinguality, so this would be their update. I would also say that I've been very impressed by their out-of-the-box offering. Remember when we talked about CLIP embeddings? This is one of the AI News articles from last week.

You can see the difference between a paper that is very focused on research technique and algorithms, and a paper that is focused on being a technical specification for an API that they intend to offer. And Jina CLIP v2 is basically that. I'm just gonna chuck that in here as part of the reading.

Jina CLIP v2 came out of the box with, like, here's how you deploy to AWS, Azure, Google Cloud. This is actually what I was looking for, by the way. You won't get this from an Apple paper, and that's just because they're trying to make money off of their API calls, right?

But let's rewind to embeddings. They've been running their own embeddings for a while; they updated this in September, and their focus has been multilinguality. So there's a variant of MTEB for multilinguality. I don't know if it's here. There's Chinese, French, and some Japanese somewhere as well.

I don't think it's this specific leaderboard; there's a different leaderboard as well. And it's mostly a functional dataset. I don't think there's anything in particular I'll call out here, apart from the fact that they also have really good thoughts on scaling laws and the kind of dataset that works well for cross-language transfer.

So they support 89 languages, which is pretty massive. And I think they're also very practical about model size. If you look at the sizes of the models here, some of them are 7B models, which is huge. I don't know if people are actually interested in using these 7B models.

That is definitely benchmark-maxing, compared to the more practically oriented people who say, no, you actually want to use a RoBERTa and keep it to sub-1B for actual embedding inference. The other thing that I will call out here is the LoRA adapters.

They also introduce this concept of task-specific LoRA adapters. Let me see if they cover it. Yeah. So this is where you start to see a departure from the traditional single embedding model, which is basically everything we had up till 2024.

Here we have task-specific adapters, and we'll see another form of adapters with the last paper today. Where are the adapters? Okay, yeah. So they have retrieval for documents, retrieval for queries, separation of documents and clustering them, classifying them, and then text matching, which is, I think, the classic workload.

And I think, at least in traditional RAG, and how I learned it and how I think most people use it, we tend to use the same embedding model in the same mode for all of these things, and maybe try to prompt differently or preprocess differently to get performance out of them.

But training LoRAs for the different RAG tasks, I think, is very interesting and probably a very good idea, because they're basically doing different things. They have different tasks over here, and I think there are implications for the accuracy and precision of each of these things.

But I would say the main contribution of this paper is just the idea that you should have task adapters. They also have MRL. I'm just going to leave the MRL discussion aside; we all know that it's good.
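
For reference, selecting an adapter is just a request parameter on their hosted API. This is a sketch from my reading of the Jina v3 docs and paper; the exact field names and task strings may differ, so verify against the current API reference:

```python
import requests

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": "Bearer <JINA_API_KEY>"},
    json={
        "model": "jina-embeddings-v3",
        "task": "retrieval.query",   # picks the task-specific LoRA adapter
        "dimensions": 256,           # MRL truncation done server-side
        "input": ["what is the baggage weight limit?"],
    },
)
print(resp.json())
# Task names described in the paper: retrieval.query, retrieval.passage,
# separation, classification, text-matching.
```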

I think the main idea to get from here is the idea of task-specific adapters. Does anyone have questions? I haven't been looking at the chat. Okay, people are still talking; it's mostly the debate on dimension size. What? It's mostly the debate on dimension size. I was hoping that they would have done an ablation of the adapters.

They did an ablation of one versus two adapters, but they didn't do one on no adapter versus adapter, which, in my opinion, is more interesting. No adapter versus adapter, you mean on their old model? Isn't that table six? Table six in the results. Yeah, if you look at it, the second row: Jina V2 has no adapters.

And then Jina V3 with one star is pair training; I think that's no adapter. And then they have a retriever adapter, which is the last row, where you can see a huge boost, specifically for retrieval. At least that's how I interpreted it. Please let me know if I misread it.

Yeah. Okay, it's intuitive that it's a fairly big lift. I can't remember the actual wording, but the most influential thing somebody said to me was that blindly applying embedding models to any arbitrary task, without actually reading the paper and how they were trained, is asking for failure, because embedding models have very specific assumptions baked into them.

And so it makes sense that splitting out the assumptions into the top five use cases would have a very material impact on how the embeddings work. And you can look at the numbers here; they're pretty big lifts. Take the results here with a pinch of salt, though.

If you scroll down a little bit more, swyx, in the second paragraph on the left, you can see that their evaluation set size is fewer than 10 examples. So they added synthetically generated data, et cetera. We'll see. But it's a very good result, and I'm glad people are pushing on using adapters more.

Yeah. So I guess the other thing, from the other paper, the Nomic paper: they used prefixes, which apparently has been done in training for a while. And the prefix is, in my mind, similar to the adapter, in that you're just training different tagging things.

But the difference with the adapter is you have a different loss function per type, right? And so I would have been interested to see an ablation there as well. I mean, I would agree with you that prefixes are a standard part of the toolkit; therefore, that would be kind of covered in the base V2 versus the V3s.

But wouldn't you have to... this was another point that I was trying to understand: I couldn't find any place in the Nomic paper where they show putting the prefix in at query time. It seems like you would be able to improve a task-specific embedding by putting the prefix into the query, right?
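
For what it's worth, the released nomic-embed-text model documents task prefixes that you apply to both documents and queries. A small sketch (prefixes as I recall them from the model card, so double-check):

```python
from sentence_transformers import SentenceTransformer

# nomic-embed-text expects a task prefix on every input, queries included.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = model.encode([
    "search_document: The MTEB leaderboard ranks text embedding models.",
])
query = model.encode([
    "search_query: which benchmark ranks embedding models?",
])
print(query @ docs.T)  # dot-product similarity
```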

Because it was trained on that prefix, presumably it would be better if you also used it in the query. Yeah, no comment on that. Excuse me, I have a question about figure one in the Jina paper. It shows two inputs. Does that mean that the input is both the query and the supplied classification?

Or is there a classification model that is part of this embeddings model that auto-classifies? No, it's just loading in the classification LoRA. Exactly, there's no separate classification model. It's really just embedding the text, and then if there's a classification label, embedding the classification label, and doing the classification.

So that means the program or user supplies the class, the adapter task? Yes, if you look at the API for this. So the program has to match it with one of the five that they give you. Does Jina supply a classification model, or is it up to you?

No, you can only use what they give you. They have a classification LoRA, but you have to provide your own labels. Okay, cool. One thing, for those who are more familiar with LoRAs: isn't this wrong? Why is it side by side? Shouldn't the LoRAs be the last layer? The LoRAs are usually on the MLP layers.

So it's on every MLP layer or every query-key-value attention layer. At least how I use it, I apply LoRAs on all the MLP and query-key-value layers, so it's not just the last layer. I think what you're thinking of is fine-tuning by adding a special last layer, where you freeze all the weights and fine-tune that last layer for the specific task.

That's what I'm familiar with for LoRAs, but I guess I might be very focused on diffusion LoRAs. Yeah, I think that could be it. I think with LoRAs, it's mostly all the weights except for the embedding weights. Okay, got it. That's not very low-rank to me, but okay.
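
For the record, a typical LoRA setup targets the attention (and often MLP) projections in every layer, not a single final layer. A sketch with the PEFT library, where the module names are for a BERT-style encoder and will differ for other architectures:

```python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("bert-base-uncased")

config = LoraConfig(
    r=16,                                      # the low-rank dimension
    lora_alpha=32,
    target_modules=["query", "key", "value"],  # attention projections in every layer
    lora_dropout=0.05,
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```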

Well, it's low-rank in the sense that the LoRA dimension, the rank, is very small, so you can compress it. Yeah. All right, I'll throw an honorable mention to CLIP even though I didn't include it, just because this was also part of the reason why I chose this topic for this week: there's a whole bunch of embedding stuff that just came out in the last two weeks.

So I just wanted to say: here's the state of the art, here's everything I know, and then just kind of discuss it. So they took embeddings v3 and jammed it into CLIP. It's the same weights: they froze embeddings v3, and then they added this other vision adapter here, and that's it.

Text embeddings, vision embeddings, you get a CLIP model. We've covered CLIP in the past, but basically, as a refresher for those who don't remember... where is the goddamn CLIP paper? I hate it when they don't show everything that's important. Okay, I have to go back to my summarization again.

Oh, okay, it was a different paper, unfortunately. This is the Apple one, but I just really love this example. Every paper should have a qualitative example of the output compared to competitors, right? Because then you understand fundamentally what they're going for, because they're showing you how they want it to be used.

So, for example, visual QA, stuff like this, is really cool, because you can definitely see yourself having an image like this, where there's a number on the screen, and you ask, what is the weight of the luggage? OpenAI CLIP gets it wrong, SigLIP gets it wrong, and your model gets it correct, right?

Obviously, these are all going to be cherry-picked, but at least it gives you an idea of what's in the damn dataset, which I otherwise find hard to get. So Jina, unfortunately, did not do this, as far as I can tell, but they do publish a lot of really technical quantitative specs, and it's based on embeddings v3.

So this is how foundational embedding models are. Okay, I want to move on to the last one, unless people have questions. I haven't been looking at the comments here. Oh, anyone have interesting comments? Yes, yes, you have to click on my screen. Zoom has made it easy to miss the screen.

Okay. Oh, quick question: is Jina, like, a research lab, or... It's a startup. A company? Okay, a startup. The founder is Chinese and lives in Germany. I met him in Vienna. Very nice guy, a big fan of Latent Space. We'll have him on at some point. For me, there are like 10 of these companies, and it's hard to figure out who to talk to, but Jina seems to do solid work, and they're very, very serious. I mean, look at the quality of their stuff.

Like, it's obvious that they're serious about it. So yeah, they're a startup trying to make it. Has anyone experimented with medical embedding models? Okay, I'm going to go ahead and guess no, but can you show up, can you tell us your interest? Amy? Yeah, hey, good to hear you.

Hi. Yeah, so I'm currently working with the Qwen multimodal model. I'm working on a retrieval system for medical papers, and I'm currently trying different models, and there's this BioBERT. That's Qwen, from Alibaba? This new Qwen, yeah. And BioBERT was one, but I'm not 100% sure. It's not really good for my use case, and there's not a ton out there.

So there's John Snow Labs; they are quite active in this area. John Snow, like from Game of Thrones. Yeah, but I was wondering if any of you have experience with models that were trained on biomedical data and have high embedding quality? Especially for chemical or biochemical things, like protein pathways?

So I don't think anyone here does medical stuff, but Tanishq in our Discord does. Tanishq, iScienceLuvr, I think. Do you have any recs on biomedical embeddings? And he's got you; if it doesn't exist, he'll train it for you. Yeah, I was also thinking of maybe starting on my own embedding model over the weekend, just with a budget of a few...

Let's see what comes out. I'd be very interested to see if the Nomic code works for you, because it's supposed to be the kind of thing where you just swap out the dataset and run the same code again. Yeah, but that's the theory; in practice...

You know, it will not work on the first go, but yeah. Nav also says you can just fine-tune a generic one; I definitely agree with that. Anyone from the fast.ai community will be horrified at the idea of starting from random weights. You should just start from something with decent weights.

Okay. Oh, Khaled says MedGem and AI healthcare. What is that? Is that a thing? Oh, okay. Yeah, you know, Google keeps doing this stuff and then I just ignore it, because I don't do any medical stuff, but this sounds awesome. Oh, Sam, Sam has a med model. I forgot.

Yeah, but SAM is for segmentation; it's not a foundation model. They have a MedSAM, which is good for segmentation. Oh, no, no, when I say Sam, I mean Sam Julian, who's in the chat. Oh, okay, I'm sorry, not Segment Anything. There is a SAM that is a segment anything model, MedSAM.

Yeah, and we've interviewed them twice, actually. I think it's SAM 2. Yeah, these guys. Nikhila is a friend, and Roboflow is a very different friend of ours. But the good thing is that they also have a specific model for medical imaging, which is what I work on.

But it's a coincidence that you mentioned SAM, as we were talking about MedGem and AI and the... Yes, people have used it for medical applications. I believe Joseph in this podcast actually mentions it. I don't have the domain expertise to go beyond that, but yes, people have fine-tuned SAM to do medical segmentation.

No, you can just search MedSAM, and you will get the GitHub. Okay, yeah, sorry. All right, cool. Let me round out the other stuff, and then we can jump to Q&A, because I'm also keen to hear Eugene's take on embeddings.

So the last thing I'll highlight for you guys is contextual embeddings. I'm basically trying to organize this in terms of the progression of what I've seen in embeddings this year. There was Nomic at the start of the year; Jina, which introduced task LoRAs; and contextual embeddings now introduce this idea of a two-stage adaptation, where they specifically train the model to be conditioned on the corpus first, to then be used in an embedding context.

Which is a little bit weird, but it also helps them be very, very OP in one specific aspect, which is efficiency. They're still up here somewhere; it's very hard to keep up. So they are a 143-million-parameter model competing with 7-billion-parameter models, because of this adaptation.

So it's a little bit cheating to put them in an apples-to-apples comparison with these guys, because their deployment model is a little bit different. What they're doing is basically: first you consume a context. Let me see if I can show the code. Where is it? I don't know if I saw it here; I think it might be in the GitHub.

Where's the GitHub? Sorry, I don't think I put it in my notes here. Contextual embeddings. There we go. Okay, yeah, here. This is what I wanted to show you. So instead of just a single shot of "here's a bunch of text, embed this please."

That's basically what all the other models did. In the Jina model, you maybe specify a task, right, to load the LoRA. Here you actually kind of construct the equivalent of the LoRA as you go: you feed in the corpus first, all of it, and then you get dataset embeddings from the first stage on the whole thing.

Then in the second stage, you use that to actually run your queries, which makes loading kind of slow. But then you can understand why this domain-adapts so much better than basically every other method out there. And it's such a simple idea: you just train your model in a two-stage process.

So these guys worked it out. The technique is above my pay grade, it's a whole bunch of math, but that conditional aspect makes a ton of sense to me. And in my mind, if this method proves popular enough, basically everyone is going to do it, because it's such a cheap win, especially for the efficiency.
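
Conceptually, the call pattern looks like the toy sketch below. The encoder here is a fake bag-of-words model and the "conditioning" is just concatenation; it's only meant to show the stateful two-stage flow, not how cde-small actually conditions on the corpus:

```python
import numpy as np

def toy_encode(texts: list[str]) -> np.ndarray:
    """Stand-in encoder: hashed bag-of-words, normalized (toy only)."""
    out = np.zeros((len(texts), 256))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            out[i, hash(word) % 256] += 1.0
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-9)

def first_stage(corpus: list[str]) -> np.ndarray:
    """Stage 1: summarize a corpus sample into 'dataset embeddings' (the state)."""
    return toy_encode(corpus).mean(axis=0, keepdims=True)

def second_stage(texts: list[str], dataset_embeddings: np.ndarray) -> np.ndarray:
    """Stage 2: embed texts conditioned on the stage-1 state (here: concatenate)."""
    x = toy_encode(texts)
    context = np.repeat(dataset_embeddings, len(texts), axis=0)
    out = np.concatenate([x, context], axis=1)
    return out / np.linalg.norm(out, axis=1, keepdims=True)

corpus = ["quarterly revenue grew ten percent", "net income fell on currency headwinds"]
state = first_stage(corpus)                              # computed once, then cached
docs = second_stage(corpus, state)                       # index time
query = second_stage(["how did revenue change"], state)  # query time
print((query @ docs.T).round(3))
```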

So I'll pause there. Is that the contextual embedding paper that you mentioned? Yeah. I think even Jina and Nomic actually adopt part of that methodology. I think there are two things: one is updates to the architecture, and another is updates to the training methodology. Essentially, they say that they do some clustering, and then they build the batches from the same cluster.
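
A rough sketch of that batch construction idea: cluster first, then draw each contrastive batch from a single cluster so the in-batch negatives come from the same source (the clustering step itself is assumed to have happened elsewhere):

```python
import random
from collections import defaultdict

def cluster_batches(examples, cluster_ids, batch_size=32, seed=0):
    """Yield training batches drawn from one cluster at a time."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for example, cid in zip(examples, cluster_ids):
        by_cluster[cid].append(example)
    cluster_order = list(by_cluster)
    rng.shuffle(cluster_order)
    for cid in cluster_order:
        members = by_cluster[cid]
        rng.shuffle(members)
        for i in range(0, len(members), batch_size):
            yield members[i:i + batch_size]
```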

I think Jina and Nomic also do that, where they construct the batch to make sure the data comes from the same dataset; they don't actually mix data sources across different datasets. But I think what's unique... their inference is only one run. You know what I mean?

Like, they don't let you domain-adapt the thing. Yeah, that's true. Exactly. So what's unique is their architecture, whereby at inference they actually allow you to provide some priors on your existing domain. That's quite interesting to me, and that was new to me as well. So I think it's a very good idea.

I would love for other people to adopt it. This might be one of those things where it just takes one of the big labs to read the paper and figure out that this makes sense. I think the other deployment issue is that it's basically a stateful API. So you cannot...

Like, all these other ones are stateless, which is great. You just call an endpoint, right? All the model labs love this kind of model. But here you're going to have to call an endpoint to embed the corpus first, and then get back a new endpoint that you can actually do the embeddings on.

So it might be a little bit annoying for these guys to figure out, but if it's a big enough deal, they'll figure it out. And the lifts are very, very big. If you look at some of the data that they have, across all these models, keep in mind that they're at least an order of magnitude smaller than all these guys.

They actually perform better on basically every task. It's pretty crazy. So it would be interesting... there's no reason for it to be small; you could just make it big and keep the technique the same. This was trained on a grad student budget. If you just scale it up, I think it would work.

I think people would use it. It's basically a more generalized version of this task adapter API, right? Instead of having only five task adapters, what if you could just come up with your own task adapter arbitrarily, by feeding in the corpus that you're trying to embed?

To me, that's a big idea. Anyway, should I pause there? I don't know if there are any other questions. You can see the first-stage data. Isn't that just fine-tuning? No, because there are no gradient updates here. It's more like context caching, maybe. I'll liken it to that: the initial context is pre-processed as a KV cache, and you just keep the KV cache around.

That's effectively what you do for context caching. I think we can pause the recording, then let Eugene do his hot takes. Eugene, hot takes, let's go!