[Paper Club] Embeddings in 2024: OpenAI, Nomic Embed, Jina Embed, cde-small-v1 - with swyx
00:00:00.000 |
there was a whole bunch of interesting embedding work piling up, and I figured it'd be good to 00:00:06.320 |
have a state of embeddings overview. And so we have basically one blog post and three papers 00:00:13.200 |
that I've sort of defined in scope. They're all listed in the meeting notes here. 00:00:19.760 |
And I would consider this basically everything that is relevant for understanding embeddings 00:00:28.160 |
as of today. And so I think that the first thing is to understand MTEB, which is Massive Text 00:00:34.800 |
Embedding Benchmark. This is the sort of de facto benchmark. There are criticisms of it, 00:00:39.360 |
but it's a pretty-- if you don't know-- if you use embeddings and you don't know 00:00:46.320 |
MTEB, you don't know embeddings at all. This changes a lot. It used to be that the Chinese 00:00:54.160 |
models were completely dominating the top 10. Now we have American models, 00:00:59.920 |
Chinese models, and some I don't know. So I wouldn't pay strict attention 00:01:08.240 |
to the ranking of these things, but just to know the main benchmarks that people care about, 00:01:16.320 |
as well as the trade-offs in model size and memory usage. This becomes extremely, 00:01:23.040 |
extremely relevant when it comes to efficiency comments. Even though Stella is ranked as number 00:01:30.000 |
six, they're at least an order of magnitude more efficient in model size for the same amount of 00:01:36.320 |
performance that you might get from a much larger model. So practically, you might just use this 00:01:42.320 |
instead of something that's higher ranked. 00:01:49.280 |
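If you want to sanity-check a model on the benchmark yourself rather than trusting the leaderboard, the mteb package will run any sentence-transformers-style encoder on individual tasks. A minimal sketch, with arbitrary task and model choices (the exact API has shifted a bit across mteb versions):

```python
# Minimal sketch: score one embedding model on a single MTEB task.
# Assumes `pip install mteb sentence-transformers`; task/model are examples.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
evaluation = MTEB(tasks=["Banking77Classification"])  # one small, fast task
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```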
The other thing I think is also relevant is-- I'm not sure, but I don't know if they actually-- yeah, so they have everything 00:01:55.840 |
in here, including CDE-small, which we're going to cover. What is text-embedding for? Oh, 00:02:01.440 |
this is for text. Got it. Where's the-- so I don't know where the OpenAI models land, but I think 00:02:10.320 |
definitely people should understand that this is the latest update for the OpenAI offering. 00:02:16.480 |
Typically, you want to at least be familiar with OpenAI offerings, just because that tends to be 00:02:21.600 |
the starting point. That's the API key that everyone already has, rather than adding a new 00:02:26.320 |
API key for other models. So they're not going to be the best in the world, but they're going 00:02:32.560 |
to be very, very good, and usually that's good enough. I would say the other thing to be aware 00:02:37.920 |
of is for the first time, they're offering two different sizes and also Matryoshka embeddings, 00:02:45.520 |
which-- Matryoshka. Oh. Do you know how to-- where do I find the document? I think they 00:02:56.240 |
didn't mention it. They did. I think they did. They added it to the blog post at the end. 00:03:08.080 |
Okay, you're going to see all my emails. Okay, well, never mind. Maybe-- so I'm just 00:03:16.000 |
setting up this browser for the first time, so OpenAI text embeddings. There we go. Here. No. 00:03:24.320 |
It's 2022. Does anyone have that link? Are you looking for Matryoshka or the one 00:03:34.880 |
where they referenced Matryoshka? Where they refreshed it. It would be really 00:03:39.840 |
awesome if they actually had it here. Nope. Embeddings. Okay, I can't find it. 00:03:50.240 |
Okay, it's actually the first drop-down. If you scroll to this link, I'm pasting in the chat here. 00:04:01.760 |
Give me a second. Okay, I'm pasting this in the chat here. If you open it, it's the link that you 00:04:07.360 |
had. 2024? No, no. The link that you shared. Yeah, click on that. And if you scroll down, 00:04:17.760 |
scroll down a little bit. Scroll down a little bit more. Ah, reducing embedding dimensions. 00:04:23.520 |
There we go. Do you see that? It's actually hidden in there. I don't know if actually 00:04:28.480 |
the word Matryoshka shows up. Whoever commented in the chat, 00:04:34.320 |
Kishore sent the blog post that I was looking for, and they did put it in the footnotes because the 00:04:42.480 |
author complained that they were not credited, which is very, very shady of OpenAI. So yeah, 00:04:49.040 |
these guys were the first to offer it, and it is very good. We'll see later in one of the Jina 00:04:56.640 |
posts how efficient it is. I don't think they communicated it very well here, but let me just skip 00:05:03.360 |
ahead to the Jina post, and then we'll show you. So Matryoshka embeddings let you reduce 00:05:09.120 |
the amount of data that you store. This is so annoying. Wait, when I have the image in my head... 00:05:26.400 |
Wow, they got rid of it. Okay, so I guess I have to refer to my own blog post about it 00:05:32.320 |
because they got rid of it. Oh, maybe it was in the paper. Ah, okay, yeah, it was the paper. Sorry. 00:05:56.960 |
there we go. So when you offer Matrioshka, you can do something like this, where you compress, 00:06:06.720 |
like, let's say the original output dimensions is 1024. You can compress it to 64, so you're 00:06:12.480 |
reducing the amount of storage space by 94%, and that only results in an 8% drop. So basically, 00:06:19.040 |
from here, 1024, down to 64, your performance drops from 75 to, like, 69 or whatever, 00:06:27.280 |
which is pretty good. So accuracy at 1 would be, like, 47.2, going down to 41.3, so that's, 00:06:38.080 |
like, a six-point drop, and then accuracy at 5 would be 75.3, going down to 69.4. So, like, that's a huge 00:06:47.440 |
amount of information, that is, storage that is saved, as well as compute and everything, for a really 00:06:57.040 |
good drop. And basically, OpenAI pioneered this. They were the first to acknowledge that this is 00:07:05.120 |
relevant, and now, basically, every offering should do it. And, yeah, I think that's, those 00:07:12.160 |
are, those are state-of-the-art. I don't know if anyone else has played with the OpenAI embedding 00:07:16.320 |
models enough to offer any more notes before I move on to the Open models, but I just want to 00:07:21.280 |
start with OpenAI. And the thing is, training these Matryoshka embeddings essentially comes 00:07:25.680 |
for free. You just need to update the loss function. Yes. You can do it, and I've tried 00:07:30.720 |
something like this, where I cut embeddings by a quarter of the size, and it's almost, 00:07:34.640 |
it's almost as good fidelity. And the thing is, okay, you might think that 1024 to 64, okay, 00:07:40.320 |
that's not such a big drop, but 1024 is just not usable in production, depending on your production 00:07:45.680 |
use cases, you may not be able to meet the latency, but 64, 128, those are amazing. So it's 00:07:52.400 |
essentially the boundary between what's usable and what's not. What exactly, so Dan says 1024 00:07:57.760 |
is enormous. I mean, I don't have a, like, what do you mean, it's just, it's just more numbers 00:08:02.640 |
to store. Like, if you think about it this way, as your embedding size increases, your approximate 00:08:09.440 |
nearest neighbors lookup will increase as well. So this is more about, like, the n squared 00:08:14.320 |
explosion. It's like, yeah, looking up, doing dot products, etc. So it just costs more to compute. 00:08:19.520 |
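To make the trade-off concrete, here is a minimal sketch of what Matryoshka truncation looks like at query time; the numbers mirror the 1024-to-64 example above, and OpenAI's text-embedding-3 models expose the same idea server-side via a dimensions parameter:

```python
# Minimal sketch: Matryoshka-style truncation at query time. Works for any
# MRL-trained model; the 1024 -> 64 numbers mirror the example above.
import numpy as np

def truncate_embedding(full_emb: np.ndarray, k: int = 64) -> np.ndarray:
    """Keep the first k dimensions and re-normalize to unit length."""
    truncated = full_emb[:k]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(1024)            # stand-in for a full-width model output
full = full / np.linalg.norm(full)      # embeddings are typically unit-normalized
small = truncate_embedding(full, k=64)  # ~94% less storage, cheaper dot products
# Training makes this work "for free": the loss is summed over nested prefixes
# (e.g. the first 64, 128, 256, ... dims), so every prefix is a usable embedding.
```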
Vibhu, are you there? He's like, I think you have a lot more to add to the embedding search space 00:08:26.400 |
segment. I'm not sure, Vibhu's, I know he's in San Diego with family, so I don't really 00:08:35.840 |
know if he's able to comment. I think he dropped out. Yeah, like, we've already done a session on 00:08:42.560 |
MRL, so we can refer people to that MRL paper if they want to. I was just more just, like, 00:08:49.680 |
you know, what should you know with, like, state-of-the-art end-of-2024 embeddings? 00:08:54.880 |
This would be it. There's probably different sizes that you should be aware of, you should 00:09:01.360 |
know the models, you should know the costs. It's very cheap. I feel like they're basically 00:09:11.040 |
embedding this for you at cost, mostly because embeddings are a fantastic form of lock-in for 00:09:16.720 |
any API provider, because once you've embedded something, you still have to get it back. 00:09:21.680 |
So let me just continue, unless Sam has other questions that people can answer. 00:09:30.720 |
So then we're going to move on to the papers. I think the first one I would highlight is 00:09:35.520 |
NOMIC, because the reason I picked this was because someone was asking whether there's a 00:09:42.800 |
good paper on the full training process of a model, and NOMIC is the closest that I can 00:09:52.560 |
find, that I know of. Definitely, and there's a US bias here, because there's a whole bunch of 00:10:00.880 |
Chinese embedding papers that probably have some detail on their training process. 00:10:05.680 |
But NOMIC has open source code, open data, open training code, and full reproducibility, which 00:10:12.640 |
in my mind is good enough, if you wanted to deep dive into that. The main thing I would highlight 00:10:22.400 |
is the "actually use" part, which is a good follow-up from last week. Basically, what we 00:10:33.680 |
call the Noam Shazeer stack is pretty standard. These are all basically state-of-the-art in terms 00:10:40.560 |
of training processes and training tech, as far as I understand from every single model trainer 00:10:47.040 |
that I've talked to. I'm not sure about the masking. I actually did not understand. I thought 00:10:54.560 |
that you just kind of mask individual tokens. I thought that was standard. I didn't know there 00:10:58.560 |
was a hyperparameter here around mask rate of 30% versus 15%, and it's not something that I'm 00:11:05.840 |
familiar with, and neither am I familiar with a lot of these other types of sort of BERT-based 00:11:12.400 |
models. But I'm curious if anyone has thoughts or questions around what you would like to-- 00:11:19.520 |
Dr. Charles already came and saw you, right? Yes. 00:11:21.360 |
You're unmuted. I don't know if that's on purpose. Sorry. 00:11:27.280 |
Has anyone checked out NOMIC? Are you interested in going into any detail? I'm fairly friendly 00:11:38.000 |
with that team. Kishore says, "Original BERT paper masks 15%." Oh, I didn't know that. 00:11:42.960 |
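For reference, the mask rate is literally a single knob on a standard masked-language-modeling collator. A minimal sketch with Hugging Face's collator, where the checkpoint is just an example:

```python
# Minimal sketch: the mask rate is one hyperparameter on an MLM data collator.
# Assumes `pip install transformers`; the BERT checkpoint is an example.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # Nomic-style 30% masking; original BERT used 0.15
)
batch = collator([tokenizer("embeddings are vectors")])
print(batch["labels"])  # -100 everywhere except the randomly masked positions
```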
Yeah. So, yeah. I mean, that was a new finding for me. The rest of it, I think, was relatively 00:11:51.360 |
unsurprising for its time. I would say the interesting thing-- I mean, one of the reasons 00:12:00.560 |
that NOMIC is investing in this is because they sell a cluster visualization tool, which is NOMIC 00:12:06.880 |
Atlas. And so, they're interested in basically just building tools for embeddings for you to 00:12:15.520 |
explore your datasets. RJ says, "Is there a study of vector retrieval speed versus embedding size?" 00:12:21.840 |
No, but I guess they're correlated. I don't know what specifically you would want. 00:12:29.520 |
Yeah. More detail would be great. But yeah. So, I would say if you want the sort of state-of-the-art 00:12:37.120 |
process, or paper, or data, or code, I would just come and grab it off of here. They've found this 00:12:45.200 |
a lot. RJ says, "Discussing large embeddings equals bad, so you want to quantify." Yeah. 00:12:55.760 |
How would you quantify it, Eugene? It sounds like you've had some-- 00:12:58.800 |
I got you, RJ. And I can address this when you finish, whatever you want to say, 00:13:05.680 |
and we turn off recordings. Okay. All right. Keep that in mind as we go. 00:13:12.000 |
But yeah. I mean, and you can go through the NOMIC paper here. I would say pretty straightforward 00:13:21.600 |
training stuff here. I just think it's nice to have a starting document where you just have all 00:13:27.120 |
the tech choices, the hyperparameters, and reproducible code. I don't know. To me, 00:13:34.720 |
this is where you start. I also think that these prefixes and stuff basically completely reflect 00:13:43.600 |
BERT. This is just updating BERT in every shape and form, which is kind of nice. I never really 00:13:50.800 |
thought about that. These are all the Chinese models that I talked about. If you want their 00:13:56.960 |
papers, I'm sure they're all reflected here as well. I have not read them. But yeah. 00:14:01.440 |
I was a little surprised that it was just BERT updated and modified slightly. 00:14:08.240 |
But I wonder if that's because the true value out of this neural net would be in the data 00:14:18.720 |
that's coming into it, meaning it's more dependent on the data than the architecture. 00:14:23.760 |
I say that, but it's likely both. >> Yeah. Yeah. I've got nothing for you 00:14:31.920 |
there. One comment I'll share as well about this that I have had other founders tell me, 00:14:40.800 |
which is that it's very surprising that all embedding models are effectively 00:14:46.560 |
general purpose, and there are no code embedding models. If you look at the NOMIC datasets, 00:14:57.200 |
code is number 10 down the list for less than 1% of the dataset. 00:15:07.280 |
And in StackExchange, it's maybe down here. The StackExchange is not even a code-specific thing. 00:15:11.680 |
The Codiums and the Cursors of the world and MorphLabs, they've had to create 00:15:20.880 |
their own code embedding models that they don't release, which is surprising. It's IP, 00:15:27.760 |
but it sounds like a high-potential thing for some PhD person to publish as their open research 00:15:36.080 |
article, because as of right now, every single embedding model is just general purpose language, 00:15:40.800 |
and obviously that is different. We'll cover a little bit of how to change that with CDEs, 00:15:47.840 |
but I think I'll move on to Jina, unless anyone has issues. 00:15:51.520 |
Okay. So, NOMIC was very focused on single language, English. I think they have some 00:16:05.280 |
multilingual capability. I don't know what the... Spanish. It wasn't covered in here, 00:16:20.320 |
but I did talk to them about this. But anyway, so Jina, specifically as a European company, 00:16:24.880 |
very, very focused on multilinguality, so this would be their update. I would also say that 00:16:32.400 |
I've been very impressed by their out-of-the-box offering. So, when we talked about CLIP embeddings, 00:16:40.320 |
right? So, this is one of the AI News articles from last week. You can see that... You can see 00:16:48.320 |
the difference between a paper that is very focused on research technique and algorithms, 00:16:55.200 |
and a paper that is focused on being a technical specification for an API that they intend to 00:17:01.360 |
offer. And so, Jina CLIP v2 is basically... I'm just gonna chuck that in here as part of the reading. 00:17:08.400 |
Jina CLIP v2 came out of the box. This is actually what I was looking for, by the way. 00:17:14.960 |
So, Jina CLIP v2 came out of the box with, like, here's how you deploy to AWS, Azure, Google Cloud. 00:17:21.520 |
You won't get this from an Apple paper. And that's just because they're trying to make money off of 00:17:27.520 |
their API calls, right? But let's rewind to embeddings. So, basically, they have... They've 00:17:34.400 |
been running their own embeddings for a while. They updated this in September. And their focus 00:17:40.400 |
has been multilinguality. So, there's a variant of MTEB for multilinguality. I don't know if it's 00:17:45.280 |
here. It's just Chinese. French. Yeah. There's some Japanese somewhere as well. I don't think 00:17:54.960 |
it's this specific leaderboard, but there's a different leaderboard as well. 00:17:58.080 |
And it's mostly a functional dataset. I don't think there's anything particularly I'll call 00:18:05.520 |
out here apart from, like, they also, you know, have, like, really, really good thoughts on 00:18:11.120 |
scaling laws and the kind of dataset that, like, works well for cross-language transfer. 00:18:20.240 |
So, they support 89 languages, which is pretty massive. And I think they're also very practical 00:18:26.160 |
around, like, the size of model. If you look at the size of the models here, some of them are, 00:18:31.600 |
like, 7B models, which is huge. And so, like, yeah. I don't know if people are actually 00:18:39.120 |
interested in using these 7B models. This is definitely sort of benchmark maxing compared to 00:18:45.440 |
the more practical-oriented people who are, like, no, like, you actually want to use, 00:18:49.440 |
like, a RoBERTa and keep it to, like, sub-1B for actual sort of inferencing for embeddings. 00:18:57.840 |
The other thing that I will call out here is just, like, the LoRA adapters. They also introduce 00:19:10.240 |
these, like, this concept of, like, task-specific LoRA adapters. So, let me see if they cover it. 00:19:16.240 |
Yeah. So, this is where you start to see, like, instead of the traditional single model type of 00:19:30.160 |
embedding model they use, where it's, like, this is, like, basically everything that we had up 00:19:37.680 |
till 2024 was just, like, single embedding model. Here we have task-specific adapters, 00:19:43.520 |
and we'll see another form of adapters with the last paper today. But the – where am I looking at? 00:19:53.120 |
Where are the adapters? Okay. Yeah. So, they have 00:20:05.600 |
retrieval for documents, retrieval for queries, separation of documents and clustering them, 00:20:11.760 |
classifying them, and then text matching, which is, I think, the classic workload. 00:20:16.080 |
And I think, like, we tend to use, at least in traditional RAG, and how I learned it and how I 00:20:22.560 |
think most people use it, we tend to use the same embedding model in the same mode for all these 00:20:26.480 |
things and maybe try to prompt differently or preprocess differently to get performance out of 00:20:32.000 |
them. But training LoRAs for individual models for the different RAG tasks, I think, is very 00:20:39.440 |
interesting and probably, like, a very good idea, because they're basically doing different things. 00:20:46.640 |
They have different tasks over here, and I think there's, like, implications for, like, 00:20:51.760 |
the accuracy and precision of each of these things. But, like, you know, I would say the 00:20:58.640 |
main contribution of this paper is just the idea that you should have task adapters. They also have 00:21:03.040 |
MRL. So, like, I don't think we should just, you know, I'm just going to leave the MRL discussion 00:21:07.600 |
aside. Like, we all know that it's good. I think the main idea to get from here is the, sort of, 00:21:12.560 |
idea of task-specific adapters. Does anyone have questions? I haven't been looking at the chat. 00:21:21.600 |
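As an aside on what that looks like in use, here is a minimal sketch assuming the jina-embeddings-v3 Hugging Face checkpoint's custom encode(..., task=...) interface; check the model card for the exact task names and signature:

```python
# Minimal sketch: selecting a task-specific LoRA adapter at encode time.
# Assumes jina-embeddings-v3's custom `encode(..., task=...)` interface;
# the task names below follow the model card, but verify before relying on them.
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

queries = model.encode(["how do I reset my password?"], task="retrieval.query")
docs = model.encode(["To reset your password, open Settings..."], task="retrieval.passage")
# Same frozen backbone, different LoRA per workload; the other documented
# tasks are "separation", "classification", and "text-matching".
```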
It's mostly the debate on the dimension size. 00:21:32.320 |
I was hoping that they would have done an ablation of the adapters. They did an ablation 00:21:38.400 |
of one versus two adapters, but they didn't do one on no adapter versus adapter, which is, I think, 00:21:49.040 |
No adapter versus adapter. You mean on their old model? 00:21:54.320 |
Table six in the results. Yeah, it's like, they did, if you look at it, 00:22:03.920 |
the second row from the, yeah, Jina is not, Jina V2 has no adapters. And then, you know, Jina V3 00:22:10.320 |
one star is, like, pair training. I think that's no adapter. And then they have a retriever adapter, 00:22:15.680 |
which is the last row, which you can see a huge boost, well, specifically for retriever. 00:22:20.720 |
Yeah, at least that's how I interpreted it. Please let me know if I misread it. 00:22:25.680 |
Yeah. Okay, it's intuitive that it's a fairly big lift. I mean, like, 00:22:32.320 |
I think, I can't remember the actual wording, but the most influential thing somebody said to 00:22:37.440 |
me was, like, you know, like, blindly applying embedding models onto any arbitrary task without 00:22:44.640 |
actually reading the paper and how it's trained, like, it's asking for failure, because, like, 00:22:48.560 |
embedding models have very specific assumptions into it. And so, it makes sense that splitting 00:22:54.720 |
out the assumptions into, like, the top five use cases and splitting them out would have very 00:22:59.200 |
material impact on how the embedding works. And, I mean, you can look at the numbers here. They're 00:23:05.600 |
pretty big lifts. That said, take the results here with a pinch of salt. If you scroll 00:23:10.240 |
down a little bit more, swyx, in the second paragraph on the left, you can see that their 00:23:14.720 |
evaluation set size is only 10, fewer than 10 examples. So, they added synthetically generated 00:23:22.080 |
data, et cetera, et cetera. Yeah. So, we'll see. But it's a very good result, and I'm glad people 00:23:35.760 |
So, I guess the other thing from the other paper, the NOMIC paper, they had used prefixes, 00:23:41.360 |
which apparently is a thing that has been done in training for a little bit. And the prefix 00:23:48.480 |
is kind of, in my mind, similar to the adapter in that you're just training different 00:23:54.480 |
tagging things. But the difference with the adapter is you have a different loss function, 00:24:00.080 |
right, per type. And so, I would have been interested to see an ablation there as well. 00:24:06.480 |
I mean, I would say, I would agree with you that prefixes are a standard part of the toolkit. 00:24:14.640 |
Therefore, that would be kind of covered in the base V2 versus the V3s. 00:24:20.720 |
So, but wouldn't you have to, like, I was, this was another point that I was 00:24:26.320 |
trying to understand is I couldn't find any evidence that there's any place where people, 00:24:30.560 |
they were, like, putting the prefix in the NOMIC paper. Like, it seems like you would be able to 00:24:35.760 |
improve a task-specific embedding by putting the prefix into the query, right? So, because it was 00:24:42.880 |
trained on that prefix. So, presumably, it would be better if you also used it in the query. 00:24:53.040 |
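To make the prefix point concrete: nomic-embed documents prefix strings that must be applied at both indexing and query time. A minimal sketch, assuming the nomic-embed-text-v1 model card's conventions:

```python
# Minimal sketch: prefixes must match between indexing and query time.
# Prefix strings follow the nomic-embed-text-v1 model card; double-check
# the card for your exact model version.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

doc_emb = model.encode("search_document: Matryoshka losses nest prefix dimensions.")
query_emb = model.encode("search_query: what is matryoshka representation learning?")
# Other trained prefixes include "classification: " and "clustering: ".
# Dropping the prefix at query time quietly degrades retrieval quality.
```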
Excuse me. I have a question about figure one. Does that, it shows two in the Jina, 00:25:01.840 |
in the Jina paper. Does that mean that the input's both the query and the supplied 00:25:07.920 |
classification? Or is there a classification model that is a part of this embeddings model 00:25:15.760 |
that auto classifies? No, it's just loading in the classification LoRA. 00:25:21.840 |
Exactly. There's no separate classification model. It's really just embedding the text, 00:25:27.520 |
embedding the, yeah, like, right now, and then if there's a classification label, 00:25:32.240 |
embedding a classification label, and just doing the classification. 00:25:34.400 |
So, that means that the program or user supplies the class, the adapter task? 00:25:49.040 |
Yes, the program supplies it, and it has to match one of the five that they give you. 00:25:58.960 |
Do they, does Jina supply a classification model, or is it up to the? 00:26:08.000 |
They have a classification LoRA, but you have to provide your own labels. 00:26:14.240 |
One thing, for those who are more familiar with LoRAs, isn't this wrong? 00:26:20.400 |
Why is it side by side? Shouldn't the LoRAs be the last layer? 00:26:24.560 |
The LoRAs are usually on the MLP layers. So, it's on, like, every MLP layer or every, 00:26:35.760 |
like, query-key-value attention layer. So, at least how I use it is I apply LoRAs on all the 00:26:42.800 |
MLP query key value layers. So, it's not just the last layer. 00:26:46.480 |
I think maybe what you're thinking of is maybe fine-tuning by adding a special last layer to 00:26:51.760 |
fine-tune that you freeze all the weights and fine-tune that special last layer for the specific 00:26:56.320 |
task. That's what I'm familiar with for LoRAs, but I guess I might be very focused on, like, 00:27:01.760 |
diffusion LoRAs. Yeah, oh, I think that could be it. Yeah, that could be it. I think in LoRAs, 00:27:08.240 |
it's mostly all the weights except for the embedding weights. Okay, got it. That's not 00:27:14.960 |
very low-rank to me, but okay. Well, it's low-rank in the sense that the LoRA dimension is very small, in the sense you can compress it. Yeah, yeah. 00:27:21.760 |
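For those less familiar, here is a from-scratch sketch of a LoRA-wrapped linear layer, the kind applied to the MLP and query/key/value projections discussed above. This is illustrative, not Jina's actual implementation:

```python
# Minimal sketch of a LoRA-wrapped linear layer. The frozen base weight W is
# adjusted by a low-rank update B @ A, so only a tiny number of parameters train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + scale * B (A x): the update has rank <= `rank`
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=8)  # e.g. a q/k/v projection
out = layer(torch.randn(2, 768))
```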
All right, I'll throw an honorable mention to 00:27:27.200 |
CLIP even though I didn't mention this just because this was also part of the reason why 00:27:31.920 |
I chose this topic for this week because there's a whole bunch of embedding shit that just came 00:27:35.840 |
out in the last two weeks. So, I just wanted to, like, here's the state-of-the-art, here's 00:27:39.840 |
everything I know, and then just kind of discuss it. So, they took Embeddings v3 and then jammed 00:27:44.960 |
it into CLIP. So, here it's the same weights, where Embeddings v3 is frozen, and then they have 00:27:50.480 |
this other vision, sorry, this other vision adapter here, and that's it. Text embeddings, 00:27:59.760 |
vision embeddings, you get a CLIP model. We've covered CLIP in the past, but basically for, 00:28:05.600 |
you know, for a refresher for those people who don't remember, where is the goddamn CLIP paper? 00:28:10.880 |
I hate it when they don't show everything that's important. Okay, I have to go back to my 00:28:18.080 |
summarization again. Oh, okay, it was a different paper, unfortunately. Okay, but this is the Apple 00:28:28.400 |
one, but, like, I just really love this example. Every paper should have a qualitative 00:28:33.600 |
example of the output compared to competitors, right? Because then you understand fundamentally 00:28:39.680 |
what they're going for, because they are showing you how they want it to be used. And so, for 00:28:43.760 |
example, visual QA, stuff like this, is really cool, because you can definitely see yourself 00:28:49.920 |
having an image like this, where, you know, there's a number on the screen, and you say, 00:28:54.640 |
what is the weight of the luggage? OpenAI CLIP gets it wrong, SigLIP gets it wrong, and, you know, 00:28:59.840 |
your model gets it correct, right? Obviously, these are all going to be cherry-picked, but at 00:29:03.200 |
least it gives you an idea of what's in the damn data set, which I find hard to get. So Jina, 00:29:11.920 |
unfortunately, did not do this, at least that I can tell, but at least, you know, they publish a 00:29:17.440 |
lot of really sort of technical quantitative specs, and it's based on embeddings v3. So this is how 00:29:22.960 |
foundational embedding models are. Okay, I want to move on to the last one, unless people have 00:29:28.240 |
questions. I haven't been looking at the comments here. Oh, anyone have interesting comments? 00:29:36.240 |
Yes, yes, you have to click on my screen. Zoom has made it easy to miss the screen. Okay. 00:29:48.640 |
Oh, quick. So is Jina, like, a research lab, or... 00:29:57.600 |
It's a Chinese founder, lives in Germany. I met him in Vienna. Very nice guy, a big fan 00:30:03.840 |
of latent space. We'll have him on at some point. For me, there's like 10 of these, you know, and 00:30:12.080 |
it's hard for me to, like, figure out who to talk to, but Jina seems to do solid work, and they're 00:30:16.320 |
very, very serious, and, I mean, look at the quality of their stuff. Like, it's obvious that 00:30:19.520 |
they're serious about it. So yeah, they're a startup trying to make it. Has anyone experimented 00:30:29.440 |
with medical embedding models? Okay, I'm going to go ahead and guess no, but can you show up, 00:30:41.360 |
Hi. Yeah, so I'm currently working on, like, with the Qwen multimodal model. I'm working on 00:30:54.080 |
a retrieval system for medical papers, and I'm currently trying different models, 00:31:03.360 |
and there's this BioBERT. That's a Qwen from Alibaba. This new Qwen, yeah. And BioBERT was one, 00:31:14.800 |
but I'm not 100%. Yeah, it's not really good for my use case, so, and there is not a ton. So, 00:31:28.000 |
there's John Snow Labs. They are quite active in this area. So, Jon Snow, like, from Game of Thrones. 00:31:33.520 |
Yeah, but I was wondering if any of you have experience with 00:31:40.000 |
models that were trained on, like, biomedical data and have high embedding quality? Also, 00:31:47.200 |
for, like, the, especially, like, chemical or biochemical, like, protein pathways, like this? 00:31:56.880 |
So, I don't think anyone here does medical stuff, but Tanishq in our Discord does. So, 00:32:01.600 |
Tanishq, ilovescience, I think. Do you have any recs on biomedical embeddings? 00:32:12.720 |
And he got you. If he, if it doesn't exist, he'll train it for you. 00:32:17.680 |
Yeah, I was also maybe starting on the weekend, like, 00:32:23.520 |
my own embedding model, just with a budget of a few. Let's see what comes out. 00:32:30.480 |
I'd be very interested to see if the NOMIC code works for you, because this is supposed to be 00:32:39.840 |
the, like, you just swap all the data set and you, you know, just run the same code again. 00:32:43.760 |
Yeah, but that's, that's the theory. In practice, you know, it will not work on the first go, 00:32:52.480 |
but, yeah. Someone also, Nav also says you can just fine-tune a generic one. I definitely agree 00:32:58.320 |
with that. Yeah, anyone from the, sort of, fast.ai community would be horrified if you were to 00:33:05.120 |
start from random weights. You should just start from something that's a decent weight. 00:33:08.400 |
Okay. Oh, Khaled says MedGem and AI Healthcare. What is that? Is that, is that, is that a thing? 00:33:19.920 |
Oh, okay. Oh, yeah, you know, Google keeps doing this stuff and then I just ignore it, 00:33:25.280 |
because I don't do any medical stuff, but, yeah, this sounds awesome. Oh, Sam, Sam, 00:33:30.800 |
Sam has a med model. I forgot. Yeah, but SAM is for segmentation. It's not a foundation model. 00:33:41.760 |
They have a MedSAM, which is good for segmentation. Oh, no, no, no, no. When I say Sam, 00:33:51.680 |
I mean Sam Julian, who's in the, in the chat. Oh, okay. I'm sorry. Not, not segment anything. 00:33:56.560 |
There is a SAM that is a segment anything model from Meta. Yeah, and we've, we've interviewed 00:34:06.080 |
them twice, actually. So we have, I think it's, I think it's SAM 2. Yeah, these guys. Nikhila is a 00:34:13.760 |
friend, and Roboflow is a very different friend of ours. But the, the good, the good thing is that 00:34:20.240 |
they also have a specific model for medical imaging, which is what I work on. But since 00:34:28.640 |
it's a coincidence that you mentioned Sam, as we were talking about MedGem and AI and 00:34:34.880 |
the... Yes, people have used it for medical applications. I believe Joseph in this podcast 00:34:43.040 |
actually mentions it. But it's, I don't have the domain expertise to go beyond that. But yes, 00:34:48.880 |
people have fine-tuned SAM to do medical segmentation. No, you can just search MedSAM, 00:34:55.120 |
and you will get the GitHub. Okay, yeah, yeah, sorry. All right. Cool. All right. Let me, 00:35:01.360 |
let me round out the other stuff, and then, and then we can sort of jump to Q&A, because I'm also 00:35:05.360 |
keen to hear Eugene's take on, on embeddings. So the last thing I'll highlight for you guys 00:35:12.320 |
is contextual embeddings. So I'm basically trying to organize it in terms of the progression of what 00:35:17.200 |
I've seen in embeddings this year. So there was Nomic, which is the start of the year. Jina, which 00:35:23.600 |
introduced task LoRAs. And contextual embeddings now introduce this idea of a two-stage adaptation, 00:35:30.560 |
where they specifically help you, they specifically train the model to be conditioned 00:35:37.120 |
on the corpus first, to then be used in an embedding context. Which is, which is a little 00:35:43.360 |
bit weird. But it also helps them be very, very OP in one specific aspect, which is efficiency. 00:35:48.800 |
So they are, I think if we go to, they're still up here somewhere. It's very hard to like keep up. 00:35:56.320 |
So they are a 143-million-parameter model, competing with 7-billion-parameter models, 00:36:02.560 |
because of this adaptation. So it's a little bit cheating to put them in an apples-to-apples 00:36:09.920 |
comparison with these guys, because their deployment model is a little bit different. 00:36:13.680 |
What they're doing is basically, first you consume a context. Let me see if I can show the code. 00:36:20.560 |
Where is it? I don't know if I saw it here inside the, I think there might be the GitHub. 00:36:39.760 |
Sorry, I don't, I don't think I put it in my notes here. Contextual embeddings. 00:36:54.000 |
Okay, so yeah, this is what I wanted to show you. So instead of just like a single shot, 00:37:04.400 |
here's a bunch of text, embed this please. That's basically what all the other models did. 00:37:09.360 |
In the Jina model, you maybe specify like a task, right? So to load the LoRA. Here you actually 00:37:16.160 |
kind of construct the LoRA as you go, right? So you feed in the corpus first, feed all of it, 00:37:22.240 |
and then you get dataset embeddings for the first stage on the whole thing. Then the second stage, 00:37:27.040 |
you use it to actually do your queries, which is kind of slow for loading. 00:37:35.200 |
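The dataflow is the interesting part. Here is a toy sketch of the two-stage idea; the hash_embed encoder and the mixing step are stand-ins for the real model, not cde-small-v1's actual interface or math, so see the model card for the real calls:

```python
# Toy sketch of the two-stage contextual-embedding flow. Everything here is a
# stand-in: the real model's stage 2 attends over dataset embeddings.
import numpy as np

def hash_embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a transformer encoder: hashed bag-of-words vector."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def first_stage(corpus: list[str]) -> np.ndarray:
    """Stage 1: condense the corpus into 'dataset embeddings'.
    Slow and stateful; run once per corpus and cache the result."""
    return np.stack([hash_embed(d) for d in corpus]).mean(axis=0)

def second_stage(text: str, dataset_emb: np.ndarray) -> np.ndarray:
    """Stage 2: embed a query/document *conditioned on* the corpus state."""
    mixed = hash_embed(text) + 0.5 * dataset_emb  # toy conditioning step
    return mixed / np.linalg.norm(mixed)

corpus = ["the sky is blue due to rayleigh scattering", "sunsets look red"]
state = first_stage(corpus)                      # the stateful, cacheable part
q = second_stage("why is the sky blue", state)   # fast per-query call
```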
But then you can understand why this domain adapted so much better than basically every 00:37:41.360 |
other method out there. And it's such a simple idea that you just train your model in a sort 00:37:46.720 |
of two-stage process. So these guys worked it out. And the technique is, you know, above my pay grade, 00:37:53.360 |
but it's a whole bunch of math, whatever. But like that conditional aspect, I think, 00:37:59.120 |
makes a ton of sense to me. And this, in my mind, like, if this method proves popular enough, 00:38:04.960 |
basically everyone is going to do it, because it's such a cheap win, especially for the efficiency. 00:38:14.880 |
embedding paper that you mentioned? Yeah, yeah. 00:38:16.960 |
I think most, I think even Jina and Nomic, they actually adopt that methodology. I think there's 00:38:25.120 |
two things. One is updates to the architecture. Another one is updates to the training methodology. 00:38:29.600 |
Essentially, they say that they do some clustering, and then they feed in the batches 00:38:32.720 |
from the same cluster. I think Jina and Nomic also do that, where they say that they feed in 00:38:38.480 |
the batch to make sure that the data comes from the same data set. They don't actually mix data 00:38:44.000 |
sources across different data sets. But I think what's unique... 00:38:47.440 |
The inference is only one run. Like, you know what I mean? Like, they don't let you domain 00:38:53.840 |
adapt this thing. Yeah, that's true. That's true. Exactly. So 00:38:56.640 |
what's unique is their architecture, whereby the inference, they actually allow you to 00:39:00.720 |
provide some priors on your existing domain. That's quite interesting to me, 00:39:06.000 |
and that was new to me as well. Yeah. So I think it's a very good idea. 00:39:11.360 |
I would love for other people to adopt it. This might be one of those things where 00:39:15.200 |
it just takes one of the big labs to read the paper and figure out that this makes sense. 00:39:22.000 |
I think the other deployment issue is that it's basically a stateful API. So you cannot... Like, 00:39:29.360 |
all these are stateless, which is great. Sorry, this is stateless. So you just call an endpoint, 00:39:34.240 |
right? All the model labs love this kind of model. But here you're going to have to, like, 00:39:38.080 |
call an endpoint to embed the corpus first, and then return a new endpoint that you can actually 00:39:46.080 |
do the embeddings on. So it might be a little bit annoying for these guys to figure out, but 00:39:51.040 |
if it's a big enough deal, they'll figure it out. But the lifts are very, very great. If you look at 00:40:00.480 |
some of the data that they have... Like, yeah. Just across all these models, keep in mind that 00:40:11.760 |
they're at least an order of magnitude smaller than all these guys. They actually perform better 00:40:16.080 |
on basically every task. It's pretty crazy. So it would be interesting... There's no reason for it 00:40:26.400 |
to be small if you can just make it big, but you just keep the technique the same. This was trained 00:40:31.440 |
on a grad student budget. If you just scale this up, I think it would work. I think people would 00:40:37.760 |
use it. It basically is a more generalized version of this task adapter API, right? 00:40:44.960 |
So instead of having only five task adapters, what if you could just come up with your own task 00:40:53.040 |
adapters just arbitrarily by feeding in the corpus that you're trying to embed? 00:40:56.880 |
To me, that's a big idea. Anyway, should I pause there? I don't know if there's any other questions. 00:41:04.480 |
You can see the first data. Isn't that just fine-tuning? No, because there are no gradient 00:41:14.160 |
updates here. It's more like context caching, maybe. I'll liken it to that, where the initial 00:41:29.920 |
context is pre-processed as a KV cache, and you just keep the KV cache around. That's effectively 00:41:36.080 |
what you do for context caching. I think we can pause the recording, 00:41:40.160 |
then let Eugene do his hot takes. Eugene, hot takes, let's go!