
[Paper Club] Embeddings in 2024: OpenAI, Nomic Embed, Jina Embed, cde-small-v1 - with swyx



00:00:00.000 | there was a whole bunch of interesting embedding work piling up, and I figured it'd be good to
00:00:06.320 | have a state of embeddings overview. And so we have basically one blog post and three papers
00:00:13.200 | that I've sort of defined in scope. They're all listed in the meeting notes here.
00:00:19.760 | And I would consider this basically everything that is relevant for understanding embeddings
00:00:28.160 | as of today. And so I think that the first thing is to understand MTEB, which is Massive Text
00:00:34.800 | Embedding Benchmark. This is the sort of de facto benchmark. There are criticisms of it,
00:00:39.360 | but it's a pretty-- if you don't know-- if you use embeddings and you don't know
00:00:46.320 | MTEB, you don't know embeddings at all. This changes a lot. It used to be that the Chinese
00:00:54.160 | models were completely dominating the top 10. Now we have American models,
00:00:59.920 | other Chinese models, and I don't know some of these guys. So I wouldn't pay strict attention
00:01:08.240 | to the ranking of these things, but just to know the main benchmarks that people care about,
00:01:16.320 | as well as the trade-offs in model size and memory usage. This becomes extremely,
00:01:23.040 | extremely relevant when it comes to efficiency concerns. Even though Stella is ranked as number
00:01:30.000 | six, they're at least an order of magnitude more efficient in model size for the same amount of
00:01:36.320 | performance that you might get from a much larger model. So practically, you might just use this
00:01:42.320 | instead of something that's higher ranked. The other thing I think is also relevant
00:01:49.280 | is I think-- I'm not sure, but I don't know if they actually-- yeah, so they have everything
00:01:55.840 | in here, including CDE-small, which we're going to cover. What is text-embedding for? Oh,
00:02:01.440 | this is for text. Got it. Where's the-- so I don't know where the OpenAI models land, but I think
00:02:10.320 | definitely people should understand that this is the latest update for the OpenAI offering.
00:02:16.480 | Typically, you want to at least be familiar with OpenAI offerings, just because that tends to be
00:02:21.600 | the starting point. That's the API key that everyone already has, rather than adding a new
00:02:26.320 | API key for other models. So they're not going to be the best in the world, but they're going
00:02:32.560 | to be very, very good, and usually that's good enough. I would say the other thing to be aware
00:02:37.920 | of is for the first time, they're offering two different sizes and also Matryoshka embeddings,
00:02:45.520 | which-- Matryoshka. Oh. Do you know how to-- where do I find the document? I think they
00:02:56.240 | didn't mention it. They did. I think they did. They added it to the blog post at the end.
00:03:08.080 | Okay, you're going to see all my emails. Okay, well, never mind. Maybe-- so I'm just
00:03:16.000 | setting up this browser for the first time, so OpenAI text embeddings. There we go. Here. No.
00:03:24.320 | It's 2022. Does anyone have that link? Are you looking for Matryoshka or the one
00:03:34.880 | where they referenced Matryoshka? Where they refreshed it. It would be really
00:03:39.840 | awesome if they actually had it here. Nope. Embeddings. Okay, I can't find it.
00:03:50.240 | Okay, it's actually the first drop-down. If you scroll to this link, I'm pasting in the chat here.
00:04:01.760 | Give me a second. Okay, I'm pasting this in the chat here. If you open it, it's the link that you
00:04:07.360 | had. 2024? No, no. The link that you shared. Yeah, click on that. And if you scroll down,
00:04:17.760 | scroll down a little bit. Scroll down a little bit more. Ah, reducing embedding dimensions.
00:04:23.520 | There we go. Do you see that? It's actually hidden in there. I don't know if actually
00:04:28.480 | the word Matryoshka shows up. Whoever commented in the chat,
00:04:34.320 | Kishore sent the blog post that I was looking for, and he did put it in the footnotes because the
00:04:42.480 | author complained that they were not credited, which is very, very shady of OpenAI. So yeah,
00:04:49.040 | these guys were the first to offer it, and it is very good. We'll see later in one of the Jina
00:04:56.640 | postings how efficient it is. I don't think they communicated very well here, but let me just skip
00:05:03.360 | ahead to the Jina posting, and then we'll show you. So Matryoshka embeddings let you reduce
00:05:09.120 | the amount of data that you store. This is so annoying. Wait, when I have the image in my head...
00:05:18.320 | Wait, did it get rid of it?
00:05:26.400 | Wow, they got rid of it. Okay, so I guess I have to refer to my own blog post about it
00:05:32.320 | because they got rid of it. Oh, maybe it was in the paper. Ah, okay, yeah, it was the paper. Sorry.
00:05:38.480 | I'm so sorry. Let me refer to my own notes.
00:05:53.200 | Here, here,
00:05:56.960 | there we go. So when you offer Matryoshka, you can do something like this, where you compress,
00:06:06.720 | like, let's say the original output dimensions is 1024. You can compress it to 64, so you're
00:06:12.480 | reducing the amount of storage space by 94%, and that only results in an 8% drop. So basically,
00:06:19.040 | from here, 1024, down to 64, your performance drops from 75 to, like, 69 or whatever,
00:06:27.280 | which is pretty good. So accuracy at 1 would be, like, 47.2, going down to 41.3, so that's,
00:06:38.080 | like, a 6% drop, and then accuracy at 5 would be 75.3, going down to 69.4. So, like, that's a huge
00:06:47.440 | amount of information, that is, storage that is saved, as well as compute everything, for a really
00:06:57.040 | good drop. And basically, OpenAI pioneered this. They were the first to acknowledge that this is
00:07:05.120 | relevant, and now, basically, every offering should do it. And, yeah, I think that's, those
00:07:12.160 | are, those are state-of-the-art. I don't know if anyone else has played with the OpenAI embedding
00:07:16.320 | models enough to offer any more notes before I move on to the Open models, but I just want to
00:07:21.280 | start with OpenAI. And the thing is, training these Matryoshka embeddings essentially comes
00:07:25.680 | for free. You just need to update the loss function. Yes. You can do it, and I've tried
00:07:30.720 | something like this, where I cut embeddings by a quarter of the size, and it's almost,
00:07:34.640 | it's almost as good fidelity. And the thing is, okay, you might think that 1024 to 64, okay,
00:07:40.320 | that's not such a big drop, but 1024 is just not usable in production, depending on your production
00:07:45.680 | use cases, you may not be able to meet the latency, but 64, 128, those are amazing. So it's
00:07:52.400 | essentially the boundary between what's usable and what's not. What exactly, so Dan says 1024
00:07:57.760 | is enormous. I mean, I don't have a, like, what do you mean, it's just, it's just more numbers
00:08:02.640 | to store. Like, if you think about it this way, as your embedding size increases, your approximate
00:08:09.440 | nearest neighbors lookup will increase as well. So this is more about, like, the n squared
00:08:14.320 | explosion. It's like, yeah, looking up, doing dot products, etc. So it just costs more to compute.
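A minimal sketch of the truncation idea being discussed, assuming a Matryoshka-trained model whose leading dimensions carry most of the information. OpenAI exposes this as a `dimensions` parameter on the text-embedding-3 models; the manual version is just slice-and-renormalize. The vector and sizes below are made up for illustration.

```python
import numpy as np

def truncate_embedding(full_embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    truncated = full_embedding[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a 1024-dim Matryoshka embedding from some model (illustrative).
rng = np.random.default_rng(0)
full = rng.normal(size=1024)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 64)  # ~94% less storage per vector
print(small.shape)  # (64,)

# Smaller vectors also make brute-force retrieval (dot products against the
# whole corpus) proportionally cheaper, which is the latency point above.
```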
00:08:19.520 | Vibhu, are you there? He's like, I think you have a lot more to add to the embedding search space
00:08:26.400 | segment. I'm not sure, Vibhu's, I know he's in San Diego with family, so I don't really
00:08:35.840 | know if he's able to comment. I think he dropped out. Yeah, like, we've already done a session on
00:08:42.560 | MRL, so we can refer people to that MRL paper if they want to. I was just more just, like,
00:08:49.680 | you know, what should you know with, like, state-of-the-art end-of-2024 embeddings?
00:08:54.880 | This would be it. There's probably different sizes that you should be aware of, you should
00:09:01.360 | know the models, you should know the costs. It's very cheap. I feel like they're basically
00:09:11.040 | embedding this for you at cost, mostly because embeddings are a fantastic form of lock-in for
00:09:16.720 | any API provider, because once you've embedded something, you still have to get it back.
00:09:21.680 | So let me just continue, unless Sam has other questions that people can answer.
00:09:30.720 | So then we're going to move on to the papers. I think the first one I would highlight is
00:09:35.520 | Nomic, because the reason I picked this was because someone was asking whether there's a
00:09:42.800 | good paper on the full training process of a model, and Nomic is the closest that I can
00:09:52.560 | find, that I know of. Definitely, and there's a US bias here, because there's a whole bunch of
00:10:00.880 | Chinese embedding papers that probably have some detail on their training process.
00:10:05.680 | But Nomic has open-source code, open data, open training code, and full reproducibility, which
00:10:12.640 | in my mind is good enough, if you wanted to deep dive into that. The main thing I would highlight
00:10:22.400 | is the "actually use" part, which is a good follow-up from last week. Basically, what we
00:10:33.680 | call the Noam Shazeer stack is pretty standard. These are all basically state-of-the-art in terms
00:10:40.560 | of training processes and training tech, as far as I understand from every single model trainer
00:10:47.040 | that I've talked to. I'm not sure about the masking. I actually did not understand. I thought
00:10:54.560 | that you just kind of mask individual tokens. I thought that was standard. I didn't know there
00:10:58.560 | was a hyperparameter here around mask rate of 30% versus 15%, and it's not something that I'm
00:11:05.840 | familiar with, and neither am I familiar with a lot of these other types of sort of BERT-based
00:11:12.400 | models. But I'm curious if anyone has thoughts or questions around what you would like to--
00:11:19.520 | Dr. Charles already came and saw you, right? Yes.
00:11:21.360 | You're unmuted. I don't know if that's on purpose. Sorry.
00:11:27.280 | Has anyone checked out Nomic? Are you interested in going into any detail? I'm fairly friendly
00:11:38.000 | with that team. Kishore says, "Original BERT paper masks 15%." Oh, I didn't know that.
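A rough sketch of what that masking hyperparameter means in code, assuming standard BERT-style masked language modeling. This is illustrative, not Nomic's actual training code; real BERT also leaves masked positions unchanged or swaps in random tokens 20% of the time, which is omitted here.

```python
import random

MASK_TOKEN_ID = 103  # [MASK] in the standard BERT vocab

def mask_tokens(token_ids: list[int], mask_rate: float = 0.30):
    """Return (masked_input, labels); labels are -100 where not masked."""
    masked, labels = [], []
    for tok in token_ids:
        if random.random() < mask_rate:
            masked.append(MASK_TOKEN_ID)  # hide the token from the model
            labels.append(tok)            # ...and ask it to predict it back
        else:
            masked.append(tok)
            labels.append(-100)           # conventionally ignored by the loss
    return masked, labels

inputs, labels = mask_tokens([101, 7592, 2088, 102], mask_rate=0.30)
```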
00:11:42.960 | Yeah. So, yeah. I mean, that was a new finding for me. The rest of it, I think, was relatively
00:11:51.360 | unsurprising for its time. I would say the interesting thing-- I mean, one of the reasons
00:12:00.560 | that Nomic is investing in this is because they sell a cluster visualization tool, which is Nomic
00:12:06.880 | Atlas. And so, they're interested in basically just building tools for embeddings for you to
00:12:15.520 | explore your datasets. RJ says, "Is there a study of vector retrieval speed versus embedding size?"
00:12:21.840 | No, but I guess they're correlated. I don't know what specifically you would want.
00:12:29.520 | Yeah. More detail would be great. But yeah. So, I would say if you want the sort of state-of-the-art
00:12:37.120 | process of paper or data or code, I would just come and grab it off of here. They've found this
00:12:45.200 | a lot. RJ says, "Discussing large embeddings equals bad, so you want to quantify." Yeah.
00:12:55.760 | How would you quantify it, Eugene? It sounds like you've had some--
00:12:58.800 | I got you, RJ. And I can address this when you finish, whatever you want to say,
00:13:05.680 | and we turn off recordings. Okay. All right. Keep that in mind as we go.
00:13:12.000 | But yeah. I mean, and you can go through the Nomic paper here. I would say pretty straightforward
00:13:21.600 | training stuff here. I just think it's nice to have a starting document where you just have all
00:13:27.120 | the tech choices, the hyperparameters, and reproducible code. I don't know. To me,
00:13:34.720 | this is where you start. I also think that these prefixes and stuff basically completely reflect
00:13:43.600 | BERT. This is just updating BERT in every shape and form, which is kind of nice. I never really
00:13:50.800 | thought about that. These are all the Chinese models that I talked about. If you want their
00:13:56.960 | papers, I'm sure they're all reflected here as well. I have not read them. But yeah.
00:14:01.440 | I was a little surprised that it was just BERT updated and modified slightly.
00:14:08.240 | But I wonder if that's because the true value out of this neural net would be in the data
00:14:18.720 | that's coming into it, meaning it's more dependent on the data than the architecture.
00:14:23.760 | I say that, but it's likely both. >> Yeah. Yeah. I've got nothing for you
00:14:31.920 | there. One comment I'll share as well about this that I have had other founders tell me,
00:14:40.800 | which is that it's very surprising that all embedding models are effectively
00:14:46.560 | general purpose, and there are no code embedding models. If you look at the Nomic datasets,
00:14:57.200 | code is number 10 down the list for less than 1% of the dataset.
00:15:07.280 | And in StackExchange, it's maybe down here. The StackExchange is not even a code-specific thing.
00:15:11.680 | People like the Codiums and the Cursors of the world, and MorphLabs, have had to create
00:15:20.880 | their own code embedding models that they don't release, which is surprising. It's IP,
00:15:27.760 | but it sounds like a high-potential thing for some PhD person to publish as their open research
00:15:36.080 | article, because as of right now, every single embedding model is just general purpose language,
00:15:40.800 | and obviously that is different. We'll cover a little bit of how to change that with CDEs,
00:15:47.840 | but I think I'll move on to Jina, unless anyone has issues.
00:15:51.520 | Okay. So, Nomic was very focused on a single language, English. I think they have some
00:16:05.280 | multilingual capability. I don't know what the... Spanish. It wasn't covered in here,
00:16:20.320 | but I did talk to them about this. But anyway, so Jina, specifically as a European company,
00:16:24.880 | very, very focused on multilinguality, so this would be their update. I would also say that
00:16:32.400 | I've been very impressed by their out-of-the-box offering. So, when we talked about CLIP embeddings,
00:16:40.320 | right? So, this is one of the AI News articles from last week. You can see that... You can see
00:16:48.320 | the difference between a paper that is very focused on research technique and algorithms,
00:16:55.200 | and a paper that is focused on being a technical specification for an API that they intend to
00:17:01.360 | offer. And so, Jina CLIP v2 is basically... I'm just gonna chuck that in here as part of the reading.
00:17:08.400 | Jina CLIP v2 came out of the box. This is actually what I was looking for, by the way.
00:17:14.960 | So, Jina CLIP v2 came out of the box with, like, here's how you deploy to AWS, Azure, Google Cloud.
00:17:21.520 | You won't get this from an Apple paper. And that's just because they're trying to make money off of
00:17:27.520 | their API calls, right? But let's rewind to embeddings. So, basically, they have... They've
00:17:34.400 | been running their own embeddings for a while. They updated this in September. And their focus
00:17:40.400 | has been multilinguality. So, there's a variant of MTEB for multilinguality. I don't know if it's
00:17:45.280 | here. It's just Chinese. French. Yeah. There's some Japanese somewhere as well. I don't think
00:17:54.960 | it's this specific leaderboard, but there's a different leaderboard as well.
00:17:58.080 | And it's mostly a functional dataset. I don't think there's anything particularly I'll call
00:18:05.520 | out here apart from, like, they also, you know, have, like, really, really good thoughts on
00:18:11.120 | scaling laws and the kind of dataset that, like, works well for cross-language transfer.
00:18:20.240 | So, they support 89 languages, which is pretty massive. And I think they're also very practical
00:18:26.160 | around, like, the size of model. If you look at the size of the models here, some of them are,
00:18:31.600 | like, 7D models, which is huge. And so, like, yeah. I don't know if the people are actually
00:18:39.120 | interested in using these 7D models. This is definitely sort of benchmark maxing compared to
00:18:45.440 | the more practical oriented people who are, like, no, like, you actually want to use,
00:18:49.440 | like, a Roberta and keep it to, like, sub 1B for actual sort of inferencing for embeddings.
00:18:57.840 | The other thing that I will call out here is just, like, the LoRa adapters. They also introduce
00:19:10.240 | these, like, this concept of, like, task-specific LoRa adapters. So, let me see if they cover it.
00:19:16.240 | Yeah. So, this is where you start to see, like, instead of the traditional single model type of
00:19:30.160 | embedding model they use, where it's, like, this is, like, basically everything that we had up
00:19:37.680 | till 2024 was just, like, single embedding model. Here we have task-specific adapters,
00:19:43.520 | and we'll see another form of adapters with the last paper today. But the – where am I looking at?
00:19:53.120 | Where are the adapters? Okay. Yeah. So, they have
00:20:05.600 | retrieval for documents, retrieval for queries, separation of documents and clustering them,
00:20:11.760 | classifying them, and then text matching, which is, I think, the classic workload.
00:20:16.080 | And I think, like, we tend to use, at least in traditional RAG, and how I learned it and how I
00:20:22.560 | think most people use it, we tend to use the same embedding model in the same mode for all these
00:20:26.480 | things and maybe try to prompt differently or preprocess differently to get performance out of
00:20:32.000 | them. But training Loras for individual models for the different RAG tasks, I think, is very
00:20:39.440 | interesting and probably, like, a very good idea, because they're basically doing different things.
00:20:46.640 | They have different tasks over here, and I think there's, like, implications for, like,
00:20:51.760 | the accuracy and precision of each of these things. But, like, you know, I would say the
00:20:58.640 | main contribution of this paper is just the idea that you should have task adapters. They also have
00:21:03.040 | MRLs. So, like, I don't think we should just, you know, I'm just going to leave the MRL discussion
00:21:07.600 | aside. Like, we all know that it's good. I think the main idea to get from here is the, sort of,
00:21:12.560 | idea of task-specific adapters. Does anyone have questions? I haven't been looking at the chat.
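For a sense of what the task-adapter idea looks like in practice, here is a hedged sketch of a call in the style of Jina's embeddings endpoint. The endpoint, model name, task values, and response shape are recalled from their docs and should be verified against the current API before use.

```python
import requests

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": "Bearer <JINA_API_KEY>"},
    json={
        "model": "jina-embeddings-v3",
        # One task per call; it selects which LoRA adapter is loaded.
        # Other documented values include retrieval.passage, separation,
        # classification, and text-matching.
        "task": "retrieval.query",
        "input": ["what is the weight limit for carry-on luggage?"],
    },
)
# Assumes an OpenAI-style response body; check the actual schema.
query_vec = resp.json()["data"][0]["embedding"]
```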
00:21:16.080 | Okay, people are still talking.
00:21:21.600 | It's mostly the debate on the dimension size.
00:21:24.800 | What?
00:21:26.800 | It's mostly the debate on dimension size.
00:21:32.320 | I was hoping that they would have done an ablation of the adapters. They did an ablation
00:21:38.400 | of one versus two adapters, but they didn't do one on no adapter versus adapter, which is, I think,
00:21:44.240 | in my opinion, more interesting.
00:21:49.040 | No adapter versus adapter. You mean on their old model?
00:21:52.320 | Isn't that table six?
00:21:54.320 | Table six in the results. Yeah, it's like, they did, if you look at it,
00:22:03.920 | the second row from the, yeah, Jina V2 has no adapters. And then, you know, Jina V3
00:22:10.320 | one-star is, like, pair training. I think that's no adapter. And then they have a retriever adapter,
00:22:15.680 | which is the last row, which you can see a huge boost, well, specifically for retriever.
00:22:20.720 | Yeah, at least that's how I interpreted it. Please let me know if I misread it.
00:22:25.680 | Yeah. Okay, it's intuitive that it's a fairly big lift. I mean, like,
00:22:32.320 | I think, I can't remember the actual wording, but the most influential thing somebody said to
00:22:37.440 | me was, like, you know, like, blindly applying embedding models onto any arbitrary task without
00:22:44.640 | actually reading the paper and how it's trained, like, it's asking for failure, because, like,
00:22:48.560 | embedding models have very specific assumptions into it. And so, it makes sense that splitting
00:22:54.720 | out the assumptions into, like, the top five use cases and splitting them out would have very
00:22:59.200 | material impact on how the embedding works. And, I mean, you can look at the numbers here. They're
00:23:05.600 | pretty big lifts. Also, that's it. Take the results here with a pinch of salt, though. If you scroll
00:23:10.240 | down a little bit more, swyx, in the second paragraph on the left, you can see that their
00:23:14.720 | evaluation set size is only 10, fewer than 10 examples. So, they added synthetically generated
00:23:22.080 | data, et cetera, et cetera. Yeah. So, we'll see. But it's a very good result, and I'm glad people
00:23:29.440 | are pushing on using adapters more. Yeah.
00:23:35.760 | So, I guess the other thing from the other paper, the Nomic paper, they had used prefixes,
00:23:41.360 | which apparently is a thing that has been done in training for a little bit. And the prefix
00:23:48.480 | is kind of, in my mind, similar to the adapter in that you're just training different
00:23:54.480 | tagging things. But the difference with the adapter is you have a different loss function,
00:24:00.080 | right, per type. And so, I would have been interested to see an ablation there as well.
00:24:06.480 | I mean, I would say, I would agree with you that prefixes are a standard part of the toolkit.
00:24:14.640 | Therefore, that would be kind of covered in the base V2 versus the V3s.
00:24:20.720 | So, but wouldn't you have to, like, I was, this was another point that I was
00:24:26.320 | trying to understand is I couldn't find any evidence that there's any place where people,
00:24:30.560 | they were, like, putting the prefix in the Nomic paper. Like, it seems like you would be able to
00:24:35.760 | improve a task-specific embedding by putting the prefix into the query, right? So, because it was
00:24:42.880 | trained on that prefix. So, presumably, it would be better if you also used it in the query.
00:24:50.400 | Yeah, yeah, no comment on that.
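For comparison, a short sketch of the prefix convention in practice, using the prefix strings published on the Nomic model card; the model ID and prefixes should still be checked against that card, and the example texts here are made up.

```python
from sentence_transformers import SentenceTransformer

# Per the Nomic convention, documents and queries get different literal
# prefixes, matching the prefixes the model saw during training.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1",
                            trust_remote_code=True)

docs = ["search_document: Stella is a compact embedding model."]
query = ["search_query: small efficient embedding models"]

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vecs = model.encode(query, normalize_embeddings=True)
print(query_vecs @ doc_vecs.T)  # cosine similarity, since both are normalized
```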
00:24:53.040 | Excuse me. I have a question about figure one. Does that, it shows two in the Jina,
00:25:01.840 | in the Jina paper. Does that mean that the input's both the query and the supplied
00:25:07.920 | classification? Or is there a classification model that is a part of this embeddings model
00:25:15.760 | that auto-classifies? No, it's just loading in the classification LoRA.
00:25:21.840 | Exactly. There's no separate classification model. It's really just embedding the text,
00:25:27.520 | embedding the, yeah, like, right now, and then if there's a classification label,
00:25:32.240 | embedding a classification label, and just doing the classification.
00:25:34.400 | So, that means that the program or user supplies the class, the adapter task?
00:25:45.600 | Yes, if you look at the API for this.
00:25:49.040 | So, that's the program that has to match with one of the five that they give you.
00:25:58.960 | Do they, does Jina supply a classification model, or is it up to the user?
00:26:04.560 | No, you can only use what they give you.
00:26:08.000 | They have a classification LoRA, but you have to provide your own labels.
00:26:13.040 | Okay, cool.
00:26:14.240 | One thing, for those who are more familiar with LoRAs, isn't this wrong?
00:26:20.400 | Why is it side by side? Shouldn't the LoRAs be the last layer?
00:26:24.560 | The LoRAs are usually on the MLP layers. So, it's on, like, every MLP layer or every,
00:26:35.760 | like, query-key-value attention layer. So, at least how I use it is I apply LoRAs on all the
00:26:42.800 | MLP and query-key-value layers. So, it's not just the last layer.
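To make that concrete, a minimal LoRA sketch: the trainable low-rank update rides alongside a frozen linear projection, and you wrap many projections throughout the network rather than bolting one onto the end. This is a generic illustration, not Jina's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base(x) is unchanged; the A @ B path starts at zero and learns.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

# Typical use: wrap q/k/v and MLP projections in every transformer block,
# e.g. block.attn.q_proj = LoRALinear(block.attn.q_proj, rank=8)
```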
00:26:46.480 | I think maybe what you're thinking of is maybe fine-tuning by adding a special last layer to
00:26:51.760 | fine-tune that you freeze all the weights and fine-tune that special last layer for the specific
00:26:56.320 | task. That's what I'm familiar with for LoRAs, but I guess I might be very focused on, like,
00:27:01.760 | diffusion LoRAs. Yeah, oh, I think that could be it. Yeah, that could be it. I think in LoRAs,
00:27:08.240 | it's mostly all the weights except for the embedding weights. Okay, got it. That's not
00:27:14.960 | very low-rank to me, but okay. Well, it's low-rank in the sense that the LORA dimension is very small
00:27:21.760 | in the sense you can compress it. Yeah, yeah. All right, I'll throw an honorable mention to
00:27:27.200 | CLIP, even though I didn't mention this, just because this was also part of the reason why
00:27:31.920 | I chose this topic for this week because there's a whole bunch of embedding shit that just came
00:27:35.840 | out in the last two weeks. So, I just wanted to, like, here's the state-of-the-art, here's
00:27:39.840 | everything I know, and then just kind of discuss it. So, they took Embeddings v3 and then jammed
00:27:44.960 | it into CLIP. So, here are the same weights, with Embeddings v3 frozen, and then they have
00:27:50.480 | this other vision, sorry, this other vision adapter here, and that's it. Text embeddings,
00:27:59.760 | vision embeddings, you get a CLIP model. We've covered CLIP in the past, but basically for,
00:28:05.600 | you know, for a refresher of those people who don't remember, where is the goddamn CLIP paper?
00:28:10.880 | I hate it when they don't show everything that's important. Okay, I have to go back to my
00:28:18.080 | summarization again. Oh, okay, it was a different paper, unfortunately. Okay, but this is the Apple
00:28:28.400 | one, but, like, I just really love this example. Every example, every paper should have qualitative
00:28:33.600 | example of the output compared to competitors, right? Because then you understand fundamentally
00:28:39.680 | what they're going for, because they are showing you how they want it to be used. And so, for
00:28:43.760 | example, visual QA, stuff like this, is really cool, because you can definitely see yourself
00:28:49.920 | having an image like this, where, you know, there's a number on the screen, and you say,
00:28:54.640 | what is the weight of the luggage? OpenAI CLIP gets it wrong, SigLIP gets it wrong, and, you know,
00:28:59.840 | your model gets it correct, right? Obviously, these are all going to be cherry-picked, but at
00:29:03.200 | least it gives you an idea of what's in the damn data set that I find it hard to get. So Jina,
00:29:11.920 | unfortunately, did not do this, at least that I can tell, but at least, you know, they publish a
00:29:17.440 | lot of really sort of technical quantitative specs, and it's based on embeddings v3. So this is how
00:29:22.960 | foundational embedding models are. Okay, I want to move on to the last one, unless people have
00:29:28.240 | questions. I haven't been looking at the comments here. Oh, anyone have interesting comments?
00:29:36.240 | Yes, yes, you have to click on my screen. Zoom has made it easy to miss the screen. Okay.
00:29:48.640 | Oh, quick. So is Jina, like, a research lab, or...
00:29:53.920 | It's a startup.
00:29:55.680 | Company? Okay, startup.
00:29:57.600 | It's a Chinese founder, lives in Germany. I met him in Vienna. Very nice guy, a big fan
00:30:03.840 | of latent space. We'll have him on at some point. For me, there's like 10 of these, you know, and
00:30:12.080 | it's hard for me to, like, figure out who to talk to, but Jina seems to do solid work, and they're
00:30:16.320 | very, very serious, and, I mean, look at the quality of their stuff. Like, it's obvious that
00:30:19.520 | they're serious about it. So yeah, they're a startup trying to make it. Has anyone experimented
00:30:29.440 | with medical embedding models? Okay, I'm going to go ahead and guess no, but can you show up,
00:30:34.560 | can you tell us your interest?
00:30:39.840 | Yeah, hey, good to hear you.
00:30:41.360 | Hi. Yeah, so I'm currently working, like, with the Qwen multimodal model. I'm working on
00:30:54.080 | a retrieval system for medical papers, and I'm currently trying different models,
00:31:03.360 | and there's this BioBERT. That's Qwen, from Alibaba. This new Qwen, yeah. And BioBERT was one,
00:31:14.800 | but I'm not 100%. Yeah, it's not really good for my use case, so, and there is not a ton. So,
00:31:28.000 | there's John Snow Labs. They are quite active in this area. So, John Snow, like, from Game of Thrones.
00:31:33.520 | Yeah, but I was wondering if any of you have experience with
00:31:40.000 | models that were trained on, like, biomedical data and have high embedding quality? Also,
00:31:47.200 | for, like, the, especially, like, chemical or biochemical, like, protein pathways, like this?
00:31:56.880 | So, I don't think anyone here does medical stuff, but Tanishq in our Discord does. So,
00:32:01.600 | Tanishq, ilovescience, I think. Do you have any recs on biomedical embeddings?
00:32:12.720 | And he got you. If he, if it doesn't exist, he'll train it for you.
00:32:17.680 | Yeah, I was also maybe starting on the weekend, like,
00:32:23.520 | my own embedding model, just with a budget of a few. Let's see what comes out.
00:32:30.480 | I'd be very interested to see if the NOMIC code works for you, because this is supposed to be
00:32:39.840 | the, like, you just swap all the data set and you, you know, just run the same code again.
00:32:43.760 | Yeah, but that's, that's a theory in practice. You know, there's, it will not work in the first go,
00:32:52.480 | but, yeah. Someone also, Nav also says you can just fine-tune a generic one. I definitely agree
00:32:58.320 | with that. Yeah, anyone from the, sort of, fast.ai community will be horrified if you
00:33:05.120 | start from random weights. You should just start from something that's a decent weight.
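A hedged sketch of that fine-tune-a-generic-model suggestion, using the sentence-transformers v2-style training API with in-batch negatives; the base model choice and the medical pair shown are placeholders, not a recommendation.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # any decent generic base

# In practice you'd load thousands of in-domain (query, passage) pairs.
pairs = [
    InputExample(texts=[
        "protein kinase inhibitors in oncology",
        "Kinase inhibitors block phosphorylation and are used in ...",
    ]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("my-biomedical-embedder")
```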
00:33:08.400 | Okay. Oh, Khaled says MedGem and AI Healthcare. What is that? Is that, is that, is that a thing?
00:33:19.920 | Oh, okay. Oh, yeah, you know, Google keeps doing this stuff and then I just ignore it,
00:33:25.280 | because I don't do any medical stuff, but, yeah, this sounds awesome. Oh, Sam, Sam,
00:33:30.800 | Sam has a med model. I forgot. Yeah, but SAM is for segmentation. It's not a foundation model.
00:33:41.760 | They have a MedSAM, which is good for segmentation. Oh, no, no, no, no. When I say Sam,
00:33:51.680 | I mean Sam Julian, who's in the, in the chat. Oh, okay. I'm sorry. Not, not segment anything.
00:33:56.560 | There is a SAM that is a segment anything model from Meta. Yeah, and we've, we've interviewed
00:34:06.080 | them twice, actually. So we have, I think it's, I think it's SAM 2. Yeah, these guys. Nikhila is a
00:34:13.760 | friend, and Roboflow is a very different friend of ours. But the, the good, the good thing is that
00:34:20.240 | they also have a specific model for medical imaging, which is what I work on. But since
00:34:28.640 | it's a coincidence that you mentioned Sam, as we were talking about MedGem and AI and
00:34:34.880 | the... Yes, people have used it for medical applications. I believe Joseph in this podcast
00:34:43.040 | actually mentions it. But it's, I don't have the domain expertise to go beyond that. But yes,
00:34:48.880 | people have fine-tuned SAM to do medical segmentation. No, you can just write MedSAM,
00:34:55.120 | and you will get the GitHub. Okay, yeah, yeah, sorry. All right. Cool. All right. Let me,
00:35:01.360 | let me round out the other stuff, and then, and then we can sort of jump to Q&A, because I'm also
00:35:05.360 | keen to hear Eugene's take on, on embeddings. So the last thing I'll highlight for you guys
00:35:12.320 | is contextual embeddings. So I'm basically trying to organize it in terms of the progression of what
00:35:17.200 | I've seen in embeddings this year. So there was Nomic, which is start of the year. Jina, which
00:35:23.600 | introduced task LoRAs. And contextual embeddings now introduce this idea of a two-stage adaptation,
00:35:30.560 | where they specifically help you, they specifically train the model to be conditioned
00:35:37.120 | on the corpus first, to then be used in an embedding context. Which is, which is a little
00:35:43.360 | bit weird. But it also helps them be very, very OP in one specific aspect, which is efficiency.
00:35:48.800 | So they are, I think if we go to, they're still up here somewhere. It's very hard to like keep up.
00:35:56.320 | So they are a 143-million-parameter model, competing with 7-billion-parameter models,
00:36:02.560 | because of this adaptation. So it's a little bit cheating to put them on the apples to apples
00:36:09.920 | comparison with these guys, because their deployment model is a little bit different.
00:36:13.680 | What they're doing is basically, first you consume a context. Let me see if I can show the code.
00:36:20.560 | Where is it? I don't know if I saw it here inside the, I think there might be the GitHub.
00:36:33.200 | Where's the GitHub? GitHub, GitHub, GitHub.
00:36:39.760 | Sorry, I don't, I don't think I put it in my notes here. Contextual embeddings.
00:36:49.920 | There we go. Okay. Yeah, here.
00:36:54.000 | Okay, so yeah, this is what I wanted to show you. So instead of just like a single shot,
00:37:04.400 | here's a bunch of text, embed this please. That's basically what all the other models did.
00:37:09.360 | In the Jina model, you maybe specify, like, a task, right? So to load the LoRA. Here you actually
00:37:16.160 | kind of construct the LoRa as you go, right? So you feed in the corpus first, feed all of it,
00:37:22.240 | and then you get dataset embeddings for the first stage on the whole thing. Then the second stage,
00:37:27.040 | you use it to actually do your prompts and queries, which is kind of slow for loading.
00:37:35.200 | But then you can understand why this domain adapted so much better than basically every
00:37:41.360 | other method out there. And it's such a simple idea that you just train your model in a sort
00:37:46.720 | of two-stage process. So these guys worked it out. And the technique is, you know, above my pay grade,
00:37:53.360 | but it's a whole bunch of math, whatever. But like that conditional aspect, I think,
00:37:59.120 | makes a ton of sense to me. And this, in my mind, like, if this method proves popular enough,
00:38:04.960 | basically everyone is going to do it, because it's such a cheap win, especially for the efficiency.
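To show the two-stage shape of the idea without claiming CDE's actual API, here is a runnable toy: stage one derives dataset-level state from the whole corpus offline, and stage two conditions every later embedding on that state. CDE learns this conditioning with a trained first-stage model; the mean-centering below merely mimics the control flow (see the jxm/cde-small-v1 repo for the real usage).

```python
import numpy as np

rng = np.random.default_rng(0)
corpus_vecs = rng.normal(size=(1000, 256))  # pretend base embeddings

# Stage 1 (offline, once per corpus): derive dataset-level state.
dataset_state = corpus_vecs.mean(axis=0)

def second_stage(vec: np.ndarray, state: np.ndarray) -> np.ndarray:
    """Condition an embedding on the corpus state, then re-normalize."""
    out = vec - state
    return out / np.linalg.norm(out)

# Stage 2 (online): every document and query embedding is conditioned
# on the same corpus-derived state before retrieval.
query_vec = second_stage(rng.normal(size=256), dataset_state)
```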
00:38:11.040 | So I'll pause there. Is that the contextual
00:38:14.880 | embedding paper that you mentioned? Yeah, yeah.
00:38:16.960 | I think most, I think even Jina and Nomic, they actually adopt that methodology. I think there's
00:38:25.120 | two things. One is updates to the architecture. Another one is updates to the training methodology.
00:38:29.600 | Essentially, they say that they do some clustering, and then they feed in the batches
00:38:32.720 | from the same cluster. I think Jina and Nomic also do that, where they say that they feed in
00:38:38.480 | the batch to make sure that the data comes from the same data set. They don't actually mix data
00:38:44.000 | sources across different data sets. But I think what's unique...
00:38:47.440 | The inference is only one run. Like, you know what I mean? Like, they don't let you domain
00:38:53.840 | adapt this thing. Yeah, that's true. That's true. Exactly. So
00:38:56.640 | what's unique is their architecture, whereby the inference, they actually allow you to
00:39:00.720 | provide some priors on your existing domain. That's quite interesting to me,
00:39:06.000 | and that was new to me as well. Yeah. So I think it's a very good idea.
00:39:11.360 | I would love for other people to adopt it. This might be one of those things where
00:39:15.200 | it just takes one of the big labs to read the paper and figure out that this makes sense.
00:39:22.000 | I think the other deployment issue is that it's basically a stateful API. So you cannot... Like,
00:39:29.360 | all these are stateless, which is great. Sorry, this is stateless. So you just call an endpoint,
00:39:34.240 | right? All the model labs love this kind of model. But here you're going to have to, like,
00:39:38.080 | call an endpoint to embed this model first, and then return a new endpoint that you can actually
00:39:46.080 | do the embeddings on. So it might be a little bit annoying for these guys to figure out, but
00:39:51.040 | if it's a big enough deal, they'll figure it out. But the lifts are very, very great. If you look at
00:40:00.480 | some of the data that they have... Like, yeah. Just across all these models, keep in mind that
00:40:11.760 | they're at least an order of magnitude smaller than all these guys. They actually perform better
00:40:16.080 | on basically every task. It's pretty crazy. So it would be interesting... There's no reason for it
00:40:26.400 | to be small if you can just make it big, but you just keep the technique the same. This was trained
00:40:31.440 | on a grad student budget. If you just scale this up, I think it would work. I think people would
00:40:37.760 | use it. It basically is a more generalized version of this task adapter API, right?
00:40:44.960 | So instead of having only five task adapters, what if you could just come up with your own task
00:40:53.040 | adapters just arbitrarily by feeding in the corpus that you're trying to embed?
00:40:56.880 | To me, that's a big idea. Anyway, should I pause there? I don't know if there's any other questions.
00:41:04.480 | You can see the first data. Isn't that just fine-tuning? No, because there's no gradient
00:41:14.160 | updates here. It's more like context caching, maybe. I'll liken it to that, where the initial
00:41:29.920 | context is pre-processed as a KV cache, and you just keep the KV cache around. That's effectively
00:41:36.080 | what you do for context caching. I think we can pause the recording,
00:41:40.160 | then let Eugene do his hot takes. Eugene, hot takes, let's go!