Evo 2: Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale

00:00:05.000 |
Okay. All right. So, great. I stayed up late last night putting these slides together. 00:00:15.000 |
Didn't have a ton of time to prepare, so hopefully they're clear. If you have questions or want 00:00:23.000 |
to dig in, go ahead. I have about -- I think I have about 45 minutes worth of slides, probably. 00:00:29.000 |
But there's a lot of -- like, there's probably four hours worth of information in these papers 00:00:35.000 |
that I could cover. So, if we do want to dig into any of the -- especially the bio side 00:00:41.000 |
of things, I can describe a lot of what's going on in more detail. But I focus mostly on the 00:00:47.000 |
machine learning stuff. But I'm going to give a little bit of a, you know, genetics -- two-slide 00:00:54.000 |
introduction to genetics just for people who don't have -- don't have experience with it. 00:01:02.000 |
Okay. Great. So, whoops. Why is it not? Okay. So, this is sort of the why. So, Evo is a 00:01:11.000 |
foundation model. It's in a series -- there was HyenaDNA, and then there was Evo, and then Evo 2. 00:01:20.000 |
It's a 7B and a 40B model. And the purpose of a model like this -- so, it's nucleotide-based, meaning it's 00:01:31.000 |
actually modeling the DNA sequences. And the purpose -- these are some of the applications on this slide for that. 00:01:41.000 |
So, like, there's a lot of prediction things where you can -- like, for just like -- I'll go through these. 00:01:46.000 |
Maybe not all of them, but I'll go through a lot of them. So, like -- so, like, if I -- if I knock out genes, is it important? 00:01:55.000 |
Does it impact -- you know, does it impact, like, the health of the person or some -- the function of something? 00:02:02.000 |
Or does it not matter? Or, like, a portion of the gene? What -- this one is actually really, really important and hard to do. 00:02:12.000 |
So, they seem to have made progress. I don't know enough about the biology to know how realistic what they're saying -- they're claiming 00:02:19.000 |
is and how well they've tested it. But, you know, being able to predict the effect of your genetic variations on, like, drug treatments, 00:02:29.000 |
your health, other things like that, is sort of the precision medicine promise. And it's quite difficult. 00:02:38.000 |
And -- but this kind of model seems to be helpful for that. Like, designing genome scale things, I didn't -- I don't really understand that use case super well. 00:02:50.000 |
in terms of, in terms of practical things, maybe for a study. But, like, looking at -- predicting, like, sort of -- 00:02:59.000 |
protein mutations based on gene changes. You know, sort of things in the protein. Like, what feature -- like, identifying structures in the feature -- 00:03:11.000 |
uh, structures in protein folds. You know, sort of designing proteins. Predicting, you know, sort of, like, in the RNA, like, being able to -- 00:03:23.000 |
find mutations. You know, sort of, predict whether a RNA strand will be stable or not. Sort of, like -- 00:03:34.000 |
sort of, like -- and then, um, things in the epigenome. The epigenome is, like -- 00:03:40.000 |
the other physical mechanisms that cause genes to be more or less expressed. 00:03:46.000 |
And so -- and chromatin is, like, a major -- one of the most important things with that. And so, um, like, sort of, making predictions about that. 00:03:55.000 |
So, like, all of these things, they have these really -- either really important scientific or sort of -- scientific consequences, like being able to understand or monitor or measure things better. 00:04:09.000 |
or they actually have impact on your ability to, um, sort of, sort of, make clinical decisions with. So, I think -- so, there's, like -- doing this well is exciting. And it does seem like -- 00:04:21.000 |
like, um, this does seem like a bit of a step change to me, um, with not knowing a whole lot. There's been a lot of attempts at models like this, and they do work. But this seems to -- you know, because it's well-funded and done, um, by a large team and, um, you know, sort of, like, has support from NVIDIA and other things, it seems to be, like, a pretty -- pretty impressive effort. 00:04:47.000 |
Um, so, the interesting thing about Evo is that this is not just for the human genome. They incorporated many genomes. So, each point on this UMAP is a separate genome, or at least part of a separate genome. And -- 00:05:09.000 |
Um, and, um, so, you can see there's, like, a ton of different, uh, ton of different genomes here. And I think this core data, that's where humans would be, um, somewhere in there. So, uh, um, so you can see -- and that's, I guess, this one, probably. So, um, this area. So, like, uh, you can see that it's -- you know, that there's a lot of other stuff that got included. 00:05:23.000 |
And that the purpose of that is to be able to, uh, transfer your understanding of the behavior of genetics from these other species into humans and also to study those other species. 00:05:35.000 |
Okay. Okay. Is that clear before I dive into this? Please interrupt if you guys have questions. Um, okay. So, uh, this is -- I say, probably wrong, because this is -- 00:05:53.000 |
-- this is just my pretty lay understanding based on, like, you know, you know, college biology and, uh, and, you know, conversations with, you know, um, computational biologists and things like that. So, um, but I think that for AI, you know, 00:06:21.000 |
I think that for AI engineers, we don't need to understand it too well. There's, like, sort of -- so I'm going to give a conceptual view and then maybe one level deeper. Um, that -- so, um, there's -- genetics kind of deals with, um, these three factors. And then there's lots of other stuff that is related, but DNA, RNA, and then proteins. So -- and then proteins turn into 00:06:49.000 |
larger structures and signaling and all sorts of stuff. But, you know, at its core, at least from my understanding, you have DNA, and this is, like, the long-term storage. And I like to think about this as the software, right? Like, this is the program that your cells follow in order to create proteins, 00:07:17.000 |
which are sort of like hardware and sort of like software. So, like, either -- and you could call it software 00:07:23.000 |
or RTL. So, like, you're, you know, describing how to synthesize a piece of hardware, maybe. Um, 00:07:29.000 |
and then RNA is like -- it's kind of very similar to DNA, but less stable, 00:07:37.000 |
more reactive, and it gets copied from DNA. And I think one of the important things about it is that it actually gets repeatedly copied, right? So the amount of expression of -- 00:07:57.000 |
the genes in RNA impacts the behavior of the cell. So, that could be because it affects the behavior -- 00:08:09.000 |
or the -- the expression of other genes. It could be because it signals to other places in the cell or outside the cell. 00:08:15.000 |
um, and it also -- um, the RNA is sort of the mechanism by which proteins are created, um, in the -- in the cell. 00:08:25.000 |
And I -- I think that, um, when I previously understood -- until, like, the sort of modern era, when I -- like, you know, when you learn this way back 00:08:37.000 |
when I was learning it in college or whatever, um, I learned -- you know, I just had this concept of, like, these abstract things 00:08:45.000 |
and their proteins and, you know, they're just, like, these strands and they kind of fold. 00:08:49.000 |
And, like, I didn't really get what they -- like, how -- how that, like, sort of fits into -- like, creates all these, like, structures, right? 00:09:01.000 |
And, like, these are enormously complex cells. So, I -- I -- I think -- I want to just click on this -- this -- um, this video really quick. 00:09:10.000 |
So, this is -- this is what's called an enzyme. An enzyme -- excuse me -- is a -- is a molecular machine. 00:09:20.000 |
It's created by -- from many proteins, right? And this particular one is called ATP synthase. 00:09:27.000 |
It's what turns -- it, like, sort of refreshes -- I think it's ADP -- and it turns it into ATP by, like, sort of mechanically putting two pieces of -- of -- of protein together. 00:09:48.000 |
And that that -- putting the proteins together sort of, like, causes -- you know, it gives the energy that can be released later, like a spring or something. 00:09:56.000 |
And -- and so they -- like, there's -- so this video -- sorry. 00:10:16.000 |
You cannot. Oh, yeah, that's okay. So, like, I -- the important thing here is that -- like, so these are ATPs that are going into here, right? And then -- 00:10:25.000 |
So, like, this is all just a bunch of proteins together, right? And you can see the -- there's this -- there's this -- there's this sort of -- 00:10:37.000 |
thing that's turning in here. And there's actually literally an electric motor. So, you can see it down at the bottom here. So, the -- there's, like -- there's a membrane, and then there's, like, a bunch of proteins -- or, sorry, a bunch of hydrogen ions down at the bottom -- or -- 00:10:55.000 |
So, protons or hydrogen ions. And they're those, like, sort of, you know, like, sort of super energized things at the bottom. And that causes this to turn. And that causes you -- it to re -- like, refresh the ATP. Right? So, like, this is -- this is what -- this is, like, one example, one of the more impressive ones, but still, like, just one example of the crazy stuff that gets constructed by -- 00:11:13.000 |
-- by basically folding by -- by, like, sort of putting together these proteins into these molecular machines, enzymes, plus signaling, plus other structures. So, I -- I just -- I -- I -- I -- I wanted to just go over that just to be clear. 00:11:31.000 |
that there's, like, there's, like, this -- this is, like, what's being constructed. It's not just, like, some soup of -- of protein that somehow magically turns into -- into, like, a cell, and then cells into people. 00:11:58.000 |
-- Okay. Sorry. So, that's -- that's, like, my -- okay. That's, like, sort of the conceptual view. And then, there's -- so, then, there's, like, a logical view that I have that is, like, basically -- so, for DNA, there's basically four nucleotides. 00:12:19.000 |
-- and this is kind of an approximation, actually, but it's more or less -- there's four nucleotides. So, and they have names, but you just remember A, T, C, G. 00:12:28.000 |
And they -- they kind of -- they -- they fit together. Like, I think A and T and C and G fit together. So, they're called base pairs. 00:12:38.000 |
But you -- they can also fit to -- they can, like, bind to each other next to each other. So, you can just have these -- from a conceptual standpoint, you just have a long sentence of -- with four -- four letters in them. 00:12:50.000 |
-- right? For -- so, there's four tokens, A, T, C, and G. And then, you just have a sequence of those. And that's what code -- so, those, sort of, are in your long-term storage. 00:13:02.000 |
And then, DNA gets transcribed into RNA, which is basically the same, but the T is a U in RNA. So, that has different chemical properties. 00:13:12.000 |
It's less stable, meaning -- and -- and so, that allows it to be more -- and it doesn't turn into a double helix. So, that it -- it allows it to be more easily, like, chopped up and -- and shuttled around. 00:13:25.000 |
And, like, it's just more convenient for transcribing. So, that what happens is DNA gets copied into RNA. You get these, like, sequences. 00:13:34.000 |
And then, you know, shorthanding quite a bit, the RNA turns into -- there's, like, sequences of three RNA nucleotides. 00:13:44.000 |
And those are called codons. And the codons attach to amino acids. And so, that these codons, sort of, like, these patterns of codons, 00:13:52.000 |
then turn into amino acids, like -- or bind to amino acids and stick them together into these long, what's called a polymer chain. 00:13:59.000 |
And that -- that polymer chain is a protein. And then, the proteins fold because of their chemical properties. 00:14:07.000 |
And they turn into those amazing molecular machines. Okay. So, like, that's very, very, like, high-level summary. 00:14:15.000 |
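The DNA to RNA to protein steps just described can be sketched in a few lines of Python. This is a toy illustration only: the `CODON_TABLE` here is a deliberately partial subset of the real 64-entry genetic code, the function names are made up for this sketch, and transcription is simplified to "copy the coding strand and swap T for U".

```python
# Toy sketch of the DNA -> RNA -> protein pipeline described above.
# CODON_TABLE is deliberately partial: only a handful of the 64 real codons.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "UAA": None,   # stop codon
}

def transcribe(dna):
    """Copy the coding strand into RNA: same letters, with T replaced by U."""
    return dna.upper().replace("T", "U")

def codons(rna):
    """Split RNA into consecutive, non-overlapping triplets."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]

def translate(rna):
    """Map codons to amino acids until a stop codon is reached."""
    chain = []
    for codon in codons(rna):
        amino_acid = CODON_TABLE.get(codon, "?")  # "?" = not in this toy table
        if amino_acid is None:
            break
        chain.append(amino_acid)
    return chain

rna = transcribe("ATGTTTGGCGCTTAA")
protein = translate(rna)
print(rna, protein)
```

The resulting list of amino acids stands in for the polymer chain; the folding step that turns the chain into a working machine is the hard part and is not modeled here.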
So, but I think the important thing for an AI engineer to know about this is that there's kind of two main things that I've been calling out here. 00:14:26.000 |
One is that the genes are, like, really long, right? Like, you have these long, long, long, long sequences, you know, in the 1,000 to 1 million token range. 00:14:37.000 |
Right? And then, the vocab is obviously very small. So, your vocabulary is four tokens. And then, they actually added some other special tokens because they're just using a single-byte encoding. 00:14:49.000 |
So, what do you do with the rest of the possible tokens? And so, they have them, like, one token per species, basically. 00:14:56.000 |
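To make the single-byte idea concrete, here is a minimal sketch of a byte-level nucleotide tokenizer. It is an assumption-laden illustration, not the model's actual tokenizer: the `SPECIES_TAGS` mapping and the specific byte values chosen for it are hypothetical, picked only to show how unused byte values could carry extra information.

```python
# Byte-level tokenization sketch: each nucleotide is its own one-byte token.
# The species-tag bytes below are hypothetical, chosen only for illustration.
NUCLEOTIDES = "ACGT"
SPECIES_TAGS = {"human": ord("@"), "e_coli": ord("#")}  # hypothetical IDs

def encode(sequence, species=None):
    """Map a DNA string to byte IDs, optionally prefixed by a species tag."""
    for base in sequence:
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected symbol: {base!r}")
    ids = list(sequence.encode("ascii"))
    if species is not None:
        ids.insert(0, SPECIES_TAGS[species])
    return ids

def decode(ids):
    """Invert encode, dropping any species tag."""
    tags = set(SPECIES_TAGS.values())
    return bytes(i for i in ids if i not in tags).decode("ascii")

ids = encode("GATTACA", species="human")
print(ids, decode(ids))
```

Because every token fits in one byte, a million-nucleotide sequence is a million tokens, which is why the long-context efficiency discussed next matters so much.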
Okay. Is that -- so, that's my introduction to genetics. Is -- I hope that's enough for you guys to understand the rest of the discussion. Do you guys want to ask any questions? 00:15:08.000 |
I just want to say, I love the effort put into translating things for AI engineers. I think it's very -- this slide, quite important. And, yeah. I like it. 00:15:17.000 |
Okay. Excellent. Thank you. Okay. All right. So, okay. So, let's -- now we're going to, you know, here's like a, you know, like a, like a gear shift with no clutch here. 00:15:30.000 |
So, I'm going to switch to the sort of -- okay, this is the main sort of mechanism by which this model is able to have long, like, efficient inference on very long context. 00:15:46.000 |
So, like, with that -- so, I'll build up to why that -- this mechanism works. But, basically, starting in this -- this has been research that has been developed at Stanford over the past several years. 00:16:01.000 |
So, this -- this part of it is a little old. But -- and this refers to a previous paper that isn't in the list that we read today. 00:16:11.000 |
So, basically, they developed this -- what they call a Hyena operator. I don't know the origin of the name. But, basically, as a surrogate -- 00:16:23.000 |
originally, as a surrogate for attention, right? So, how can we -- and the sort of research question, I think, was basically, how can we replace attention with something that is -- is based on convolutions, which is kind of similar -- like, you've probably heard of state-space models. 00:16:41.000 |
So, this is an example of a type of state-space model. And so, because -- the reason why you would want to do that is because you can -- convolutions are more efficient to calculate than a full attention -- self-attention matrix, right? So -- so that they can be -- and it's not obvious why from this diagram. I'll show you why. 00:17:08.000 |
But that's sort of the why of it: they're more efficient to calculate. And, just to be clear, what's going on here is that you have -- this is a convolution. And this is called a Toeplitz matrix, which is just, you know, sort of an unrolling of the convolution into a matrix. And then, you have weightings on those convolutions. 00:17:26.000 |
And so, you have these pairs of like a weight plus the convolution plus a weight plus a convolution and some like arbitrary number of those stuck together. And then, you can get this thing that kind of sort of approximates the behavior of attention. So, this H U is basically a replacement or a surrogate for attention. 00:17:44.000 |
But it's important to note that there's two ways to represent -- and, like, there's probably more, but there's at least two ways to represent this convolution. One is as this Toeplitz matrix. But the other is this -- quote unquote -- they call it implicit -- 00:18:00.000 |
parameterization, which means -- think IIR filter, if you're familiar with those, or recurrent model, if you're familiar with those, where you're taking the previous state and you're just calculating an update, 00:18:16.000 |
rather than having all of the updates sort of written out for you. 00:18:36.000 |
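The two representations can be compared directly on a toy example. The sketch below assumes the simplest possible recurrence, a single scalar state with decay `a`; a real Hyena filter is far richer, and the helper names here are made up for illustration. The point is only that the explicit Toeplitz matrix and the recurrent update compute the same causal convolution.

```python
def toeplitz(h, n):
    """Lower-triangular n x n Toeplitz matrix with T[i][j] = h[i - j]."""
    return [[h[i - j] if 0 <= i - j < len(h) else 0.0 for j in range(n)]
            for i in range(n)]

def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def recurrent(a, x):
    """Implicit form: carry one state, s_t = a * s_{t-1} + x_t."""
    out, state = [], 0.0
    for value in x:
        state = a * state + value
        out.append(state)
    return out

x = [1.0, 2.0, 0.0, -1.0, 3.0]
a = 0.5
h = [a ** t for t in range(len(x))]        # explicit taps: 1, a, a^2, ...
explicit = matvec(toeplitz(h, len(x)), x)  # every tap written out in a matrix
implicit = recurrent(a, x)                 # only the previous state is kept
print(explicit)
print(implicit)
```

The explicit form stores a tap for every lag, while the implicit form stores one number and an update rule, which is what makes very long filters affordable.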
Okay. So -- and then, here's the why again. Convolution is faster than attention. 00:18:42.000 |
But, you know, note -- and they actually talk about this in the paper. 00:18:48.000 |
And that's true in the limit. It's not true at short sequence lengths. Right? So, if you look at Hyena here, 00:18:56.000 |
it's actually slower than FlashAttention up until you get to this, like, 10 to the 3.8, which -- that's probably 00:19:04.000 |
like about, you know, maybe 5,000 or something -- context length. 00:19:16.000 |
And also notice the -- the performance, although, you know, they claim is, you know, sort of like a surrogate for attention. 00:19:26.000 |
It doesn't actually do as well as GPT-Neo, which is a regular attention-based model, or RWKV. 00:19:36.000 |
Hopefully, Eugene's on the call. So, Eugene, you're waiting to -- 00:19:40.000 |
I'm waiting to help answer questions for this part if needed. 00:19:46.000 |
Yeah, yeah. Please. I -- I don't have any questions yet. But if other people do -- and -- and especially, like, 00:20:00.000 |
sort of details on state-space models; I know you know a lot about them in comparison to RWKV. 00:20:08.000 |
Okay, so then, this is a slide on StripedHyena -- and also StripedHyena 2. 00:20:24.000 |
This is what they use for Evo 2. So here, in this Toeplitz matrix here, you see that, like, the question is sort of like, 00:20:34.000 |
oh, well, I have this full attention block, and then I have, like, my fully realized Toeplitz matrix. 00:20:43.000 |
So, like, how am I actually saving in memory or whatever, right, or computation? 00:20:47.000 |
And so, like, part of the answer is, well, you know, you can -- you can perform -- perform this convolution in the frequency domain, and it's faster. 00:21:01.000 |
So that is a good answer. But also, part of it is, well, actually, it's both, like, more efficient and -- and more -- 00:21:14.000 |
both more efficient and better results. If you actually use different -- different convolutional -- or these hyena operators -- so this, like, sort of -- this, like, sort of -- 00:21:24.000 |
pair of, like, this guy plus this guy, you know, and then you have, like, a chain of those. 00:21:33.000 |
So you can do that in different ways. And so they have three different ways that they propose to do this. 00:21:38.000 |
One is this short explicit, and -- disappointingly, they didn't have a good Toeplitz matrix diagram of this. 00:21:45.000 |
But you can see, like, this is basically a line here. So it's -- it's, like, sort of -- you can think of it as -- as, like -- 00:21:53.000 |
being somewhat related to local attention, right, where you have just a few tokens of context close by that you're looking at, and everything else is zero. 00:22:06.000 |
And then you have this medium regularized, which is, like, more context in the -- 00:22:13.000 |
in the range of a couple hundred tokens of context, and everything else is zero. 00:22:19.000 |
And then you have this long implicit, which -- which means I'm going to use this implicit representation. 00:22:26.000 |
So maybe I'll have to do recurrence in, you know, in -- in the -- in this block, or maybe I will, like, sort of hybridize that with, you know, sort of some implicit and some explicit representation. 00:22:42.000 |
But in any event, you have this -- it has a more efficient way of calculating using FFTs. 00:22:51.000 |
So this -- even if it's, you know, like, sort of fully realized, it still is more efficient than attention to calculate. 00:23:00.000 |
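The frequency-domain claim can be checked with a small self-contained sketch. Everything here is illustrative: the recursive radix-2 `fft` is a textbook implementation written out so the example has no dependencies, and the function names are made up. A production kernel would use a tuned FFT library instead.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft_causal_conv(h, x):
    """Causal convolution of x with filter h via the frequency domain."""
    n = len(x)
    size = 1
    while size < n + len(h) - 1:   # zero-pad so circular == linear convolution
        size *= 2
    H = fft(list(h) + [0.0] * (size - len(h)))
    X = fft(list(x) + [0.0] * (size - n))
    y = fft([p * q for p, q in zip(H, X)], invert=True)
    return [(v / size).real for v in y[:n]]   # normalize the inverse once

def direct_causal_conv(h, x):
    """Reference implementation: y[t] = sum_k h[k] * x[t - k]."""
    return [sum(h[k] * x[t - k] for k in range(min(len(h), t + 1)))
            for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
h = [1.0, -1.0, 0.5]
fast = fft_causal_conv(h, x)
slow = direct_causal_conv(h, x)
print(fast)
print(slow)
```

For a filter as long as the sequence, the direct sum costs on the order of L squared operations while the FFT route costs on the order of L log L, which is the asymptotic advantage over a full attention matrix. At short lengths the constant factors dominate, consistent with the crossover point mentioned earlier.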
So -- and then, you know, so the idea is that you have this stack. 00:23:05.000 |
So instead of having attention block, attention block, attention block, you know, you have this SE, MR, LI, and then an attention block, and then SE, MR, LI. 00:23:17.000 |
And then, like, so you kind of stack these up so that you end up with, you know, sort of your different layers calculating sort of their attention surrogates in different ways. 00:23:31.000 |
And it turns out -- well, first of all, just to be a little more clear, this -- so this is like a FIR filter, you know, four to seven tokens in depth. 00:23:44.000 |
This is like a FIR filter, and they, quote, unquote, regularize this. 00:23:48.000 |
I actually -- so Eugene, this is actually one place where I had a question. 00:23:53.000 |
They say that this alpha is swept across channels. 00:23:58.000 |
I don't actually quite understand what that means. 00:24:01.000 |
Do you have a -- did you -- I don't know if you read the paper or not, but, like, did you understand what this is? 00:24:06.000 |
So I think the way to view these segments, right -- and this makes sense once you understand how, like, state-space models work, right -- is to visualize it as you process the tokens. 00:24:22.000 |
So let's just say -- and you can view this as a genetic sequence, left to right. 00:24:31.000 |
The way these tokens are being merged is -- at the start, like, these tokens get processed and get merged together. 00:24:40.000 |
So a state-space model builds the state by merging everything together over the layers. 00:24:49.000 |
So -- and -- and so -- so what this means here, right, in this case, right, they had these few -- first few layers that process the tokens that come in. 00:24:57.000 |
So we're talking about four to seven tokens, and then it gets regularized and then merged into other chunks of tokens. 00:25:07.000 |
I can't remember what's the exact size for state-space models and the models at that time. 00:25:14.000 |
And therefore, like -- therefore, like, the layers that deal with the longer implicit attention, they get -- they view the summarized information from the previous layers that then gets merged in. 00:25:25.000 |
So, hence, why -- why -- why they show that separation? 00:25:27.000 |
Because at this point, they were still testing and trying to figure out how to best manage long context length. 00:25:38.000 |
And part of the evaluation was that if they just keep one state, and then they keep merging in, it performed worse in the case of state space 00:25:49.000 |
than to have it, like, pre-processed with the short explicit, and then subsequently merged into medium, and then merged into large. 00:26:03.000 |
But I -- so, that's a really good way to think about it. 00:26:09.000 |
What I was actually asking about was this alpha parameter is -- it -- it says in the paper that it's swept across the channels, and they don't really go into what that means. 00:26:22.000 |
And that means that I -- I think I understand what channels are, is just basically the different dimensions in your hidden dimension. 00:26:30.000 |
So -- but then, like, I wouldn't understand -- I don't really understand why sweeping this would be a good idea. 00:26:39.000 |
Or, like, what -- or if I'm completely misunderstanding what they're saying. 00:26:44.000 |
I -- I think it's just view it as -- it's applying that decay as the information flows in, to force the model to, like, just summarize it even further. 00:26:57.000 |
Because even in RWKV, we have that decay mechanism, where, by default, every piece of information will be tried to be forgotten, unless the model has decided that this is important to remember. 00:27:15.000 |
So -- so -- so -- so that decay, right, kicks in to help force the model to forget things by default. 00:27:23.000 |
Unless -- as part of the matrix multiplication calculations, it decides that, hey, I should amplify this signal instead. 00:27:28.000 |
And I think genetic modeling is a very good example for understanding why recurrent models are highly favorable for this use case, or sometimes can even be used for text. 00:27:42.000 |
It's because at the end of the day, right, like, if you look at the genetic code, right, it's one of the cases where it's highly repetitive, the data. 00:27:50.000 |
but there is, like, slight changes that we need to capture meaning. 00:27:54.000 |
And if you view it from a compression point, we were talking, like, zip compression and things like -- it's highly compressible. 00:28:01.000 |
And so -- so what you actually -- what you want the model to actually learn is the pattern and the anomalies, the peaks. 00:28:11.000 |
I know I'm -- this is not the accurate way to say it, but in this use case, right, so by default, if we discard everything towards decay, the peaks -- like, hey, this is weird. 00:28:22.000 |
Why is genome 364, for example, ABBBA, and not what the repetition should be? 00:28:32.000 |
Then the matrix multiplication will pick that value, and then it will try to, like, conserve it into memory. 00:28:39.000 |
So that's why -- it may sound vague, but it's more of, like, the decay's there by default and how it works from there. 00:28:54.000 |
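One possible reading of "alpha swept across channels" is sketched below. This is an interpretation, not something confirmed in the discussion: it assumes the regularizer is an exponential envelope `exp(-alpha * t)` multiplied onto each channel's filter taps, and that "swept" means each channel gets a different `alpha` from a log-spaced range. The range endpoints and function names are hypothetical.

```python
import math

def decay_rates(num_channels, alpha_min=0.01, alpha_max=1.0):
    """Log-spaced decay rates, one per channel (a guess at what 'swept' means)."""
    if num_channels == 1:
        return [alpha_min]
    step = math.log(alpha_max / alpha_min) / (num_channels - 1)
    return [alpha_min * math.exp(step * c) for c in range(num_channels)]

def regularized_filters(raw_filters, alphas):
    """Apply h[c][t] = raw[c][t] * exp(-alpha[c] * t) to each channel's taps."""
    return [[tap * math.exp(-alpha * t) for t, tap in enumerate(taps)]
            for taps, alpha in zip(raw_filters, alphas)]

num_channels, filter_len = 4, 128
raw = [[1.0] * filter_len for _ in range(num_channels)]  # flat taps, for clarity
alphas = decay_rates(num_channels)
filters = regularized_filters(raw, alphas)

for alpha, taps in zip(alphas, filters):
    print(f"alpha={alpha:.3f}  tap[10]={taps[10]:.4f}  tap[100]={taps[100]:.6f}")
```

Under this reading, channels with small `alpha` keep information over the whole filter window while channels with large `alpha` forget within a few tokens, so the layer covers several memory horizons at once. That would fit the "forget by default" intuition above.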
So -- so basically -- and this is just, like, sort of -- this is kind of an ablation. 00:29:03.000 |
They -- they -- they studied, okay, the multi-head attention with their -- you know, sort of, like, if they -- you just make these all multi-head attention, you -- you train for 400 billion tokens, and then you get a perplexity of 3.09, right? 00:29:20.000 |
And then they looked at this LI LI LI LI, which is this one. 00:29:23.000 |
This was the original design of StripedHyena. 00:29:34.000 |
And then, interestingly, if you just replace these two first LIs with these SEs, you get the same basic perplexity, so these LI -- so it doesn't hurt to have this more efficient, very short convolution instead. 00:29:51.000 |
And then they found that, like, you do even slightly better if you stick an MR in it. 00:29:56.000 |
So -- so -- so -- and -- and, you know, like, for what it's worth at -- at this -- you know, in this model at -- with this data, like, at this point in the training process, beating multi-head attention. 00:30:10.000 |
And then -- and then I actually -- here's another place I don't understand what this ABF positional embedding is. 00:30:24.000 |
I stuck it in here more as a hope that someone else would explain this slide. 00:30:29.000 |
I -- if not, I don't think it -- they didn't seem to think it was super important, but that, like, basically, this -- it's a more scalable, you know, this ABF is more scalable, and they get up to a million tokens. 00:30:44.000 |
It seems -- one -- one thing I was a little unclear about is why is the perplexity going down as the context length goes up, you guys? 00:30:51.000 |
Is it just data distribution difference, or -- I'm not sure. Did anyone pick up on that? 00:30:56.000 |
For genetic sequencing, that is -- should be expected. 00:31:00.000 |
Mostly because, like, say, majority of the data is going to be repetitive. 00:31:04.000 |
So once you've seen the pattern, you should be able to at least answer the pattern. 00:31:15.000 |
So training -- so training here is -- is, you know, sort of, like, on both 7B and -- 00:31:25.000 |
and 40B. Notice this is log scale. It's significantly faster than the dense transformer. 00:31:32.000 |
You know, like, you can see it here, but I think maybe easier to understand is that, you know, like, one would be the same amount of time as StripedHyena 2. 00:31:43.000 |
So they beat, you know -- sorry, StripedHyena 1, plus they beat -- oops, sorry -- they beat the dense transformers by a factor of three or more here, and a little more than three and a little less than three here at the long context, right? 00:31:59.000 |
So you can -- this is what we expect, but it's sort of validation of this was one of the primary design points, and they're saying we kind of met the design point. 00:32:08.000 |
Okay. And then they get into a lot of detail. I'm not -- I'm not going to go into a ton of detail here. 00:32:17.000 |
But there's -- there's this sort of efficient two-stage block convolution algorithm that works well for these explicit parameterizations. 00:32:32.000 |
So, like, you know, from the upper left, here's, like, just a Toeplitz matrix, and you can envision this as just blocks of matrices, right? 00:32:43.000 |
So, you know, so this is just the full Toeplitz, you know, lower triangular matrix. 00:32:49.000 |
I'm just making it into a smaller -- or I'm just chunking it into these different H -- capital Hs. 00:32:55.000 |
And then -- so what they're saying is that if you -- so you have this, like, a convolution, could -- like, if it's not a full -- 00:33:06.000 |
if it's, you know, like, a short FIR filter, it's only L sub H long, then I have all these zeros here, right? 00:33:15.000 |
And so that I can actually -- if -- if this is kind of short, then I can actually represent this with just two -- 00:33:22.000 |
like, the diagonal H0 and then H1 on the -- right next to the diagonal, right? 00:33:29.000 |
And so I -- this might not be clear from the diagram. I'm going to go into that. 00:33:34.000 |
But just notice there's zeros here, so you don't need -- in this -- you don't need any -- any of the Hs in this lower left-hand corner. 00:33:43.000 |
And that's sort of -- so, you know, this is a pretty good example. 00:33:47.000 |
So if you just -- you know, let's say your sequence length is 6 and your -- and your filter length is 4, then -- then you have this, you know, sort of, like, on the diagonal. 00:34:01.000 |
So you have this -- you can look at, like, this here is this, right? And then it's repeated here, right? 00:34:10.000 |
And then you have this off -- right adjacent to the diagonal, you have this guy, and that's right there. 00:34:16.000 |
And so that could be decomposed into this plus this, right? And so what this allows you to do is much more efficient multiplication. 00:34:30.000 |
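The decomposition in this example can be verified numerically. The sketch below uses the same sizes as the slide (sequence length 6, filter length 4) with a block size of 3, which is an assumption made here so that only two distinct blocks are needed. It illustrates the algebra only, not the GPU kernel from the paper, and all names are made up.

```python
def toeplitz(h, n):
    """Full lower-triangular n x n Toeplitz matrix with T[i][j] = h[i - j]."""
    return [[h[i - j] if 0 <= i - j < len(h) else 0.0 for j in range(n)]
            for i in range(n)]

def block(h, size, offset):
    """size x size block with entries h[offset + i - j], zero where out of range."""
    return [[h[offset + i - j] if 0 <= offset + i - j < len(h) else 0.0
             for j in range(size)] for i in range(size)]

def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def two_stage_conv(h, x, size):
    """y_n = H0 @ x_n + H1 @ x_{n-1}; needs len(h) <= size + 1."""
    if len(h) > size + 1 or len(x) % size != 0:
        raise ValueError("filter too long for block size, or ragged sequence")
    h0 = block(h, size, 0)      # diagonal block: taps inside the current chunk
    h1 = block(h, size, size)   # adjacent block: taps reaching into the previous chunk
    y, previous = [], [0.0] * size
    for start in range(0, len(x), size):
        chunk = x[start:start + size]
        local = matvec(h0, chunk)
        spill = matvec(h1, previous)
        y.extend(p + q for p, q in zip(local, spill))
        previous = chunk
    return y

h = [1.0, 2.0, 3.0, 4.0]               # filter length 4
x = [1.0, 0.0, -1.0, 2.0, 0.5, 3.0]    # sequence length 6
blocked = two_stage_conv(h, x, size=3)
dense = matvec(toeplitz(h, len(x)), x)
print(blocked)
print(dense)
```

Every chunk reuses the same two small dense blocks, so the work becomes a batch of identical small matrix multiplications, which is the shape of computation that tensor cores handle well. All the zero blocks in the lower-left corner are never touched.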
So, and they -- so they, you know, study this, and it's indeed significantly faster in terms of the throughput, especially in terms of teraflops per second. 00:34:47.000 |
You know, it's basically double or more, two and a half times, I guess, at scale. 00:34:55.000 |
So that's a good thing. And so this can be used definitely for the SE, the short explicit, and then the medium regularized as well, depending on the block size. 00:35:10.000 |
And so, like -- and then this is a comparison of those blocks to other types of blocks. 00:35:20.000 |
So this is -- this is FlashAttention; this SDPA is, like, non-flash attention, I guess; and then it compares to Mamba, and then xLSTM and DeltaNet, which I don't -- I'm not familiar with these two, but presumably they're state-space models. 00:35:49.000 |
I did not understand this diagram, so I stuck it here in case someone wants to explain it to me. 00:35:55.000 |
So this context parallelism, you know, so basically taking your sequence of, in this case, genes, of tokens, and, you know, sort of inferring it in parallel, and then sharing the -- you know, sort of doing a -- 00:36:11.000 |
sort of, like, scatter-gather approach or whatever. 00:36:15.000 |
I think this is scatter-gather, or the all-to-all, to, you know, sort of, like, communicate the pieces that are missing. 00:36:26.000 |
I don't -- I have trouble understanding this diagram. 00:36:30.000 |
I did -- this one is more, you know, sort of obvious to me. 00:36:35.000 |
And, like, fortunately, we just went over the ultra-scaling handbook. 00:36:40.000 |
And, you know, this is covered in -- in, like, sort of -- or context parallelism is covered in the ultra-scaling handbook that we just went over. 00:36:52.000 |
But, like, it's a little bit -- I think it's a little bit simpler for convolution than it is for attention. 00:36:58.000 |
But with that said, you know, this -- this seems pretty -- this is basically ringing attention. 00:37:05.000 |
what -- and you just, like -- this overlap has to be forwarded to the next guy before they can compute. 00:37:23.000 |
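A rough sketch of why the overlap has to be forwarded, simulated on one machine with NumPy. The function names and the sharding scheme are illustrative assumptions, not the paper's implementation. Each "rank" owns a contiguous shard of the sequence and only needs the last `len(h) - 1` tokens of the previous shard to compute its own outputs.

```python
import numpy as np

def causal_conv(x, h):
    """Reference: causal convolution over the whole sequence at once."""
    return np.convolve(x, h)[:len(x)]

def context_parallel_conv(x, h, num_ranks):
    """Simulate context parallelism: shard the sequence, forward the overlap
    from each rank to the next one, then convolve every shard locally."""
    x = np.asarray(x, dtype=float)
    halo = len(h) - 1                       # tokens needed from the previous rank
    shards = np.array_split(x, num_ranks)
    outputs = []
    for rank, shard in enumerate(shards):
        assert len(shard) >= halo, "each shard must cover the filter overlap"
        if rank == 0:
            received = np.zeros(halo)       # nothing precedes the first shard
        else:
            prev = shards[rank - 1]
            received = prev[len(prev) - halo:]   # stands in for a send/recv
        padded = np.concatenate([received, shard])
        outputs.append(np.convolve(padded, h)[halo:halo + len(shard)])
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
h = rng.standard_normal(4)
# Sharded computation agrees with the single-device reference.
assert np.allclose(context_parallel_conv(x, h, 4), causal_conv(x, h))
```

For a short filter the forwarded piece is tiny compared with the shard, which is one way to read the remark that this is simpler for convolution than for attention.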
I think the way to simplify it is to think of it as tensor parallel. 00:37:27.000 |
So, basically, the bulk of the calculations are done in parallel on multiple GPUs. 00:37:32.000 |
And then, once again, with state-space models, at the end it's just a compression of the states as they merge together. 00:37:39.000 |
So: done in parallel, merge, done in parallel. 00:37:45.000 |
So the first part is the first chunk done in parallel, split along the context. 00:37:49.000 |
And then, after that, the information is synced across, and then they get merged, and then you get your outputs. 00:37:55.000 |
Ah, okay. Yeah, yeah, yeah. So, that -- I think I see now. Okay. Yeah. 00:38:01.000 |
One thing I actually do admire about the state-space team's formulation is, well, the high-level parts. The math can be rather complicated, right? 00:38:11.000 |
But they actually design it in a way where you can reduce the operations to a very few brilliant operations, 00:38:20.000 |
in a way that is extremely counterintuitive, but mathematically right. 00:38:27.000 |
So they can prove that this is a cheaper and faster way to do those steps, even though it may seem weird. 00:38:32.000 |
Yeah, no. I actually spent quite a bit of time going through the background papers here to understand, especially, the mapping from their surrogate attention to these Hyena operators. 00:38:50.000 |
And actually the math, which I didn't put in the talk, is pretty understandable. You don't need differential equations. If you wanted to do the continuous version, you would need differential equations. 00:39:09.000 |
But for the discrete version, it's all just basic algebra. If anyone really wants to understand what's going on under the hood here, I recommend checking out the Hyena paper. 00:39:21.000 |
I can give a link. It's actually linked up above in the talk. 00:39:26.000 |
Okay. So, that's sort of, like, the machinery. Does anyone want to ask or comment on the machinery part of this? 00:39:40.000 |
Okay. Good. So they did a whole bunch of evals. 00:39:50.000 |
I think part of that is because, unlike natural language, there are not so many benchmarks. There are some, but a lot of what they do is benchmark against real-world data sets that are somewhat 00:40:19.000 |
commonly used for evals, or in some cases they just came up with data sets to test against. So they basically developed their own benchmarks for the study in a lot of cases, and it's a little bit hard for me to understand how well they're actually doing. This is an exception: needle in a haystack, I guess we're all familiar with. And it looks like they 00:40:48.000 |
do pretty well. There's not any red except for this guy: the 7B model seems to have trouble at a million context length with depth at 100%. I don't really understand why; I didn't look into it. But in any event, they seem to do pretty well. And the way they do this is they stick a generated 100 base pair sequence somewhere in the context, at 10%, 20%, and so on. And then they 00:41:17.000 |
perturb it and see how much that impacts the likelihood of the same sequence at the end of the context. So I think this closely matches what is typically done for a needle in a haystack test. One thing that was questionable to me, though, was that they used 00:41:44.000 |
a generated 100 base pair sequence instead of an actual out-of-distribution, held-out sequence from a genome that is not part of the training set. I find this questionable because, since it's in distribution, of course you're naturally going to have a 00:42:13.000 |
lower likelihood if you perturb the needle. Right? Does that make sense? 00:42:25.000 |
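To make the test procedure concrete, here is a toy sketch of it. Everything in it is an assumption for illustration: `toy_log_likelihood` is a k-mer counting scorer standing in for the real model, and `perturb` is one arbitrary way to mutate the needle. The structure follows the description above: place a 100 base pair needle at some depth, score the same sequence at the end of the context, then perturb the needle and see how much the score drops.

```python
import math
import random
from collections import Counter, defaultdict

BASES = "ACGT"

def random_dna(n, rng):
    return "".join(rng.choice(BASES) for _ in range(n))

def perturb(seq, every=5):
    """Substitute every `every`-th base with the next base in ACGT order."""
    out = list(seq)
    for i in range(0, len(out), every):
        out[i] = BASES[(BASES.index(out[i]) + 1) % 4]
    return "".join(out)

def toy_log_likelihood(context, query, k=8):
    """Stand-in for the model: score `query` using k-mer continuation counts
    gathered from `context`, with add-one smoothing."""
    counts = defaultdict(Counter)
    for i in range(len(context) - k):
        counts[context[i:i + k]][context[i + k]] += 1
    total = 0.0
    for i in range(k, len(query)):
        seen = counts.get(query[i - k:i], Counter())
        total += math.log((seen[query[i]] + 1) / (sum(seen.values()) + 4))
    return total

def needle_score(haystack, needle, depth, k=8):
    """Drop in the score of the needle at the end of the context when the
    copy placed at `depth` is perturbed."""
    pos = int(depth * len(haystack))
    intact = haystack[:pos] + needle + haystack[pos:]
    broken = haystack[:pos] + perturb(needle) + haystack[pos:]
    return toy_log_likelihood(intact, needle, k) - toy_log_likelihood(broken, needle, k)

rng = random.Random(0)
haystack = random_dna(2000, rng)
needle = random_dna(100, rng)
for depth in (0.1, 0.5, 0.9):
    # A positive score means the scorer was relying on the needle in context.
    print(depth, needle_score(haystack, needle, depth) > 0)
```

Swapping `random_dna` for a held-out genomic sequence would be the out-of-distribution variant of the test argued for above.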
So unless I'm misunderstanding, I don't feel like this is the right test. They should have used something out of distribution. In other places they were really careful about this: they held out certain viruses, I think for other reasons, and they test their performance on these out-of-distribution viruses, and it's really poor. 00:42:49.080 |
And they had actually intended that by design, because I think they didn't want to bring the genetics from those viruses into the model for some reason. Or it might have been just for monitoring. But in any event, they should have used something that was held out from the training set. 00:43:10.600 |
Okay, anyway. One thing I love about these bioinformatics papers is that their diagrams are often really good and really interesting. 00:43:24.940 |
I can talk about a few of these. I think the important thing is the title here: they're able to predict mutational effects on proteins, measured in various ways, on RNA, and on organismal fitness, 00:43:48.500 |
and across different domains of life, not only in humans. 00:43:52.720 |
If anyone's interested, I can go into them; I think I understand most of these diagrams. But one interesting thing to note is that these are all zero-shot. 00:44:08.120 |
Zero-shot is really clear in language modeling. In this case, I was a little confused about what they meant by a zero-shot prediction. 00:44:21.320 |
What they're typically doing in all these studies is looking at, in some cases, the change in likelihood, or the likelihood, of a sequence, 00:44:37.120 |
and using a threshold on that likelihood, or other more complicated things, to make a prediction. 00:44:45.800 |
So in this case, they're using the change in likelihood when they mutate a sequence as an indicator of whether it's essential or not. 00:44:56.120 |
I don't understand the biology of why that's a good test in this case, but you can see that here as well. 00:45:08.480 |
Can you guys read this, by the way? It might be hard to read. But the delta likelihood is used to predict whether or not it's an essential gene. 00:45:24.360 |
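A small sketch of what a zero-shot prediction from a change in likelihood can look like. The first-order Markov scorer, the training strings, and the threshold value are all made up for illustration; in the paper the likelihood would come from the model itself, and the exact scoring may be more involved.

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"

def fit_toy_markov(training_seqs):
    """Return a log-likelihood function from a first-order Markov model
    (a stand-in for the real sequence model)."""
    counts = defaultdict(Counter)
    for s in training_seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1

    def log_likelihood(seq):
        total = 0.0
        for a, b in zip(seq, seq[1:]):
            row = counts[a]
            # Add-one smoothing over the four bases.
            total += math.log((row[b] + 1) / (sum(row.values()) + len(BASES)))
        return total

    return log_likelihood

def delta_log_likelihood(log_likelihood, ref, pos, alt):
    """Change in log-likelihood when position `pos` of `ref` is mutated to `alt`."""
    mutant = ref[:pos] + alt + ref[pos + 1:]
    return log_likelihood(mutant) - log_likelihood(ref)

def predict_disruptive(delta, threshold=-1.0):
    """Zero-shot call: a large drop in likelihood is read as a harmful change.
    The threshold is an arbitrary illustrative value."""
    return delta < threshold

score = fit_toy_markov(["ATG" * 20])
delta = delta_log_likelihood(score, "ATGATGATG", pos=3, alt="C")
print(delta < 0, predict_disruptive(delta))
```

No labels for the downstream task are used anywhere; the only choice made after the fact is where to put the threshold.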
Right. So is that kind of clear, at least from a high-level standpoint? You can see each one of these is a section in the paper, and I think they did a very complete job of asking: okay, we're trying to build a foundation model, 00:45:40.480 |
so does it generalize to a lot of different use cases without being explicitly trained on those use cases? And I think they did a good job of making that claim. 00:45:50.400 |
Okay. And this is more of the same, but it enables human clinical variant effect prediction. 00:46:07.840 |
So basically, similar stuff, but with human clinical effects. And this is probably one of the most important use cases for this model: like I was saying in the beginning, being able to predict, given the specific genetics of this patient, what their response to this medication will be, or something like that. 00:46:34.480 |
Or how their specific set of mutations will impact the way a disease progresses, right? 00:46:45.240 |
Those are super useful things. You can imagine everyone gets a blood prick when they go on medication, and they sequence your genome if it's not already on file. 00:46:59.120 |
And then you can pick which medication to take based on your specific genome, right? 00:47:06.560 |
So this is kind of enabling that. 00:47:09.300 |
They claim they're doing a good job. 00:47:14.860 |
Whether that bears out from a clinical standpoint, I don't know, but it looks very promising. 00:47:27.720 |
Okay. Sorry, does anyone want to ask questions about either of these two slides? 00:47:43.960 |
So this is information on how they trained the model. 00:47:47.760 |
Interestingly, the way they thought about this was, as is common with long-context models, 00:48:05.900 |
they do a base training and then they extend the context in what they're calling mid-training. And the way they think about this is: okay, what are the genetic units that we can fit in a certain context length, like an 8K context length? 00:48:28.220 |
And so you have these, I think tRNA is transfer RNA, you have certain bacterial genes, and non-coding RNA sequences, which help with the mechanism of transcription and other things like that. 00:48:43.240 |
But things you don't have are these larger eukaryote genes, phages, these transcript domains for humans or yeast cells. 00:49:00.240 |
So it's interesting because I think, unlike language modeling, the things that are here and the things that are here are more qualitatively different. 00:49:13.620 |
I found this to be an interesting difference from language modeling. 00:49:18.720 |
And this shows the token counts for the two models. 00:49:33.860 |
Then there were two really clever things that they did; clearly they thought carefully about the evals here. 00:49:45.720 |
This is part of a larger diagram that I can show, but it's super hard to understand, really chaotic, and would take like 10 minutes to talk through. 00:49:58.060 |
But the gist of it is all captured here: they train a sparse autoencoder, they extract features, and then they identify the biological features that those correspond to. 00:50:11.980 |
And they were using this partly as a way to verify that their model is actually learning biologically relevant things. 00:50:23.900 |
So there's the operational standpoint that we already saw, where, yeah, we're able to predict things that are important, but this is more of a conceptual standpoint. 00:50:34.220 |
Like, can we learn features that correspond to the boundaries of genes, different types of genes, and other things like that? 00:50:50.880 |
So I thought this was interesting to include in an eval. 00:50:56.300 |
I think it's something that maybe we could do in language modeling as well, now that sparse autoencoders are becoming easier to work with. 00:51:05.520 |
And the other thing that I thought was pretty clever involves chromatin accessibility. So this is epigenetics, 00:51:23.120 |
meaning heritable differences in the way that your genes are expressed that are impacted by the environment. 00:51:31.460 |
A geneticist will probably throw up if I say it wrong, 00:51:41.660 |
but it's basically a way that your genome can learn about its environment, right? 00:51:49.300 |
So it impacts how much, or the ways in which, your genome is actually expressed based on your environment. 00:51:58.060 |
And you can pass this in some cases onto your children. 00:52:01.180 |
So it's also heritable, but it's a different mechanism. 00:52:06.880 |
So what's going on here is they basically want to be able to either do computer simulation 00:52:28.880 |
of chromatin accessibility, using their model or other things, or actually design genes that they can use CRISPR and other additive mechanisms to stick into the genome, 00:52:48.340 |
and be able to control this quote-unquote chromatin accessibility. 00:52:56.260 |
What they did was they basically sampled sequences, and then they use this. 00:53:02.020 |
You can just think of this as a classifier, basically: is this region accessible or not? 00:53:09.660 |
And then they accept some and reject some according to the criteria, and they chop out the rejected part and keep only the accepted part. 00:53:25.660 |
And then they generate from there on, and just keep iterating on that. 00:53:29.800 |
So it's sort of a constrained generation, basically using this really complex model to be the signal that decides whether or not to reject, like the constraint. 00:53:49.280 |
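The accept/reject loop described above can be sketched as follows. Both helper functions are placeholders chosen for illustration: `toy_generate` stands in for sampling a continuation from the language model, and `toy_accessibility_score` (GC fraction) stands in for the accessibility predictor. The chunk length and threshold are arbitrary.

```python
import random

def toy_generate(prefix, n, rng):
    """Placeholder for sampling n more bases from the model given `prefix`."""
    return "".join(rng.choice("ACGT") for _ in range(n))

def toy_accessibility_score(chunk):
    """Placeholder for the scoring model: fraction of G/C bases in the chunk."""
    return sum(base in "GC" for base in chunk) / len(chunk)

def guided_generation(target_len, chunk_len, threshold, rng, max_tries=1000):
    """Grow a sequence chunk by chunk, keeping only chunks the scorer accepts."""
    seq = ""
    tries = 0
    while len(seq) < target_len:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("scorer rejected too many candidates")
        chunk = toy_generate(seq, chunk_len, rng)
        if toy_accessibility_score(chunk) >= threshold:
            seq += chunk          # accepted: keep it and continue from here
        # rejected: the chunk is dropped and a new one is sampled
    return seq[:target_len]

rng = random.Random(0)
designed = guided_generation(target_len=60, chunk_len=10, threshold=0.6, rng=rng)
print(len(designed))
```

The generator never sees the scorer's internals; the scorer only acts as a filter on what is kept, which is what makes it easy to swap in a more sophisticated model as the constraint.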
So I thought this was also a very clever take on constrained generation. 00:53:55.120 |
And it got me thinking: right now we have these very deterministic things that we do constrained generation with. What are the more sophisticated things, like models and downstream things, that you could use for constrained generation? 00:54:17.780 |
So like this, you know, again, like sort of truck driver stop, but that's the end of the slides. 00:54:29.060 |
There's a question: did you get a chance to look into how the data collection was done? 00:54:36.260 |
What was done for Evo 2, specifically, I guess for the training and processes and all that. 00:54:45.960 |
They actually formed a reasonably large data set. 00:54:53.840 |
Um, here, let me see if I can grab the paper. 00:54:58.560 |
So, um, um, I think it's, there's a good table of contents. 00:55:05.140 |
But basically, um, the, the nutshell is that they had a, they had a data set that they reused 00:55:14.460 |
Um, they, they curated, there's a whole data collection section. 00:55:26.620 |
Or in the appendix, maybe here, um, inclusion, additional results. 00:55:33.720 |
No, sorry, I can't pull it up. But, yeah, so they augmented. 00:55:43.280 |
So the thing that stood out for me was that they use both 00:55:52.960 |
DNA and RNA to train, and they did that explicitly because they wanted 00:56:03.060 |
to be able to model RNA sequences and protein as well, or do downstream 00:56:13.900 |
So what they did was they switched the U symbol to the T symbol, so it 00:56:21.240 |
looks like DNA, and so that they keep the vocabulary the same. 00:56:24.900 |
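The symbol swap itself is a one-liner. The function name here is just an illustrative choice:

```python
def rna_to_dna_alphabet(seq):
    """Rewrite an RNA string in the DNA alphabet by mapping U to T,
    so both kinds of sequence share the same four-letter vocabulary."""
    return seq.upper().replace("U", "T")

print(rna_to_dna_alphabet("AUGGCU"))  # ATGGCT
```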
And the other thing that was notable in my mind was that they 00:56:33.400 |
left out certain viruses, or many viruses, I think bacterial viruses, from the training 00:56:40.820 |
And I actually don't recall whether that was because they didn't want to pollute the data 00:56:48.520 |
set with genes that would be problematic, or whether it was just because it's 00:56:54.540 |
different and we want to be able to measure the difference between in and out of distribution 00:57:07.140 |
It's well-documented how they collected the data. 00:57:12.080 |
I guess, did they collect the data or did they filter existing public data sets? 00:57:18.000 |
It's all, I mean, it would be enormously expensive to collect the data yourself. 00:57:25.540 |
Actually, there's one thing that I found very interesting. Can you scroll 00:57:30.200 |
up to the part where they do the 8K to the larger context thing? 00:57:55.240 |
To be fair, I am not a genome expert. 00:57:59.100 |
So there were some things that I found interesting in that one, 00:58:02.680 |
which maybe someone else who's a genome expert can chime in on. 00:58:09.080 |
So in text modeling, for example, 00:58:11.180 |
when we want to train on the smaller context, 00:58:14.320 |
one of the things that we do is just chunk the data. 00:58:19.100 |
We'll take the large Wikipedia pages, let's just say the one on bacteria, and we just 00:58:26.240 |
And then we just throw it into the training for 8K. 00:58:28.820 |
But in this case, from what you explained, I think you kind of said that they didn't do 00:58:34.780 |
They basically just use all the genome sequences that were less 00:58:49.520 |
And that to me was surprising, because it just means that the vocabulary, 00:58:54.140 |
like the grammar, so the equivalent would be like the English grammar here, 00:59:00.440 |
is consistent enough at 8K that it even generalizes to the larger context. 00:59:08.360 |
Let's just say there are basic grammatical rules of 00:59:13.980 |
Anglo-Saxon languages, basically English, French, that whole category, not 00:59:19.800 |
including Chinese and Japanese, because that's a different category altogether. 00:59:23.100 |
They all share the same lineage. 00:59:26.540 |
And so they have some approximate equivalent. 00:59:30.000 |
And if you just learn the early Latin languages equivalent, and then you follow 00:59:37.860 |
So that to me was interesting, because it just means that the fundamental concept is maybe at 00:59:44.380 |
the end of their base pairs, there's just some meaning and rules that we haven't 00:59:49.180 |
fully understood, and it's captured in 8K. 00:59:53.860 |
No, I thought that was really interesting too. 00:59:59.140 |
But once again, I'm no expert. I may have gotten that completely wrong. 01:00:10.440 |
I don't want to hold people past the close. 01:00:14.680 |
I'm happy to continue chatting on Discord or whatever. 01:00:24.800 |
I hope this was interesting to people. 01:00:32.540 |
I'm working with somebody to actually put together a class on single-cell AI 01:00:40.720 |
models and the sort of things downstream from those. 01:00:44.320 |
So if you're interested, let me know, and I'll try to include you. 01:00:48.300 |
I actually think it's interesting because I've seen papers on genetic modeling. 01:00:56.040 |
You can conduct a class to build reasonable models on a laptop. 01:01:04.620 |
Well, and especially in single cell, the models are much smaller too, right? 01:01:09.420 |
Because you're looking at the gene level, not the nucleotide level.