[AI in Action] Towards a Generic Cell Typer — RJ Honicky, PhD with MIRAMICS

Let me do that real quick. This is my friend: he has a consultancy, he's trying to stand up some AI work and win some big contracts, and he needed some help, so I'm helping him with this. It's all open source, though, and we intend to keep it that way; we're just going to make money on consulting. Okay, let's see. Actually, this is a good slide for everyone.
So, cell typing. Let's back up. One thing that has become pretty prevalent in biology in the last five years or so is what's called single-cell transcriptomics, which means you can sequence the RNA (not the DNA, but the RNA) that is being expressed in single cells, one cell at a time. DNA is sort of like the copy of the software on your hard drive, and the RNA is like the running software in memory. It's not a perfect analogy, and maybe hardware analogies work as well, but the RNA is basically what the cell is doing right now, as opposed to all the proteins the cell could express.

There are various technologies, but they all basically boil down to this: for a given cell, you have a row in a matrix of roughly 50,000 genes, and the entries count how many of each gene you could detect in that one cell. These matrices are generally very sparse: in any given cell, only about one to two percent of the genes are being expressed at the time of transcription. The process is also highly dependent on the experimental setup, so it's very subject to what are called batch effects: the person running it spent a little longer during the sequencing step, or they used a different brand of reagent, and so on. There's a lot of randomness in the collection process, and these batch effects cause a lot of headaches.
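As a concrete illustration, here is a minimal sketch (not anything from the talk's code) of what one cell's expression vector looks like: mostly zeros, with small integer counts at the roughly one percent of genes that were detected.

```python
import numpy as np
from scipy import sparse

n_genes = 50_000
rng = np.random.default_rng(0)

# Pretend ~1% of genes were detected in this cell, with small integer counts.
expressed = rng.choice(n_genes, size=500, replace=False)
counts = np.zeros(n_genes, dtype=np.int32)
counts[expressed] = rng.poisson(3, size=500) + 1

cell = sparse.csr_matrix(counts)  # one row of the cells-by-genes matrix
print(cell.nnz / n_genes)         # roughly 0.01: the matrix is very sparse
```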
So we want models that are able to account for batch effects, extract the gene expression information, and then figure out, for this exact cell, what type of cell it is. That could be as coarse as: is this a neuron or a muscle cell? Or it could be really specific: I know this is a neuron because it came from a chunk of brain, but I don't know what type of neuron. Basically, typing the cell. And this sits upstream of a ton of really important biological experiments and diagnoses.

We've identified this as a really core problem. There's tons of academic research on it, even from just a couple of years ago, and people got to the point where they're fairly happy with the solutions from an academic standpoint, but it really hasn't made its way all the way into practice. It's pretty tough to use these models; probably not for you, but for a biologist it's kind of tough. So we're focused on taking this and wrapping it up into something that is really easy for people to use. That's the point of this, and I'll give a little more detail on that.
Okay. So, like I said, the state of the art is either pretty coarse-grained or really specific to particular types of cells. On the right-hand side is a hierarchy; this is an ontology. There's a database out there you can use, and the datasets we work with are normalized to it. The problem is this: if your data is labeled with something coarse like "lymphocyte" or "innate", and you predict "CD16-positive", is that an incorrect prediction, or how correct is it? It's not a simple classification problem. And of course the labels can be wrong as well. If the prediction is a different kind of mature natural killer cell, how wrong is that, really? That should be seen as different from predicting, I don't know, a muscle cell or something.
Okay, so, cell typing. You already know this part: this is a transformer-based model, scGPT. Without getting into too much detail about how it works: these datasets are not genetic sequences. With nucleotides you have the four bases, A, T, C, and G, they go in sequences, they turn into double helices, and so on; a gene is a string of anywhere from a couple of thousand to maybe a couple of hundred thousand nucleotides in a particular order, and that defines the gene. This is not that kind of data. For one cell, I have a vector over 50,000 possible genes and the number of times each gene was seen. Because it's just a count matrix, it doesn't really matter what order the genes are in, so you don't have sequence-like, one-to-one data.

The authors did some work so they could express the data in something like an autoregressive model. It's a little unclear to me why they wanted to do that instead of a more encoder-only model, and I think there's some controversy about whether the gains come from the other things they did and whether a BERT-style model would have been the better choice, but it does perform well. Anyway, for what it is, it's a more or less autoregressive model, similar to GPT, and it was designed explicitly to remove those batch effects; that's going to matter in a minute. It does pretty well, but the benchmark datasets (you can't really see this column) are small, specific types of cells. They're not asking "out of all 700 or so cell types, which is this?", and even on a big subset of that they don't do very well. Okay, I know this is a whirlwind tour and I don't want to dwell on anything, but I do want to make sure people get the gist. Any questions?
Okay, cool, so let's talk about what I did. One other piece of background: there's this other model, GenePT, and this is really why I wanted to share this, because I think it's a super cool use of embeddings. It's a little counterintuitive that it works so well (or maybe some of you will say "yeah, obviously"), but it's surprisingly great, really simple to implement, and super efficient, so I wanted to share it, along with the exploration I've done, a small model we've built on top of it, and where we're going with it.

GenePT is built by taking, for every one of my roughly 50,000 genes, a description of that gene. You start with the NCBI Gene database, and for every gene you can get a description: this gene is used for creating these kinds of proteins, and so on; a small summary of what the scientific literature says about that gene. Then you take that description and pump it into the OpenAI text embeddings (the paper's authors did this, not me), and you get out an embedding: a vector of somewhere between 1,024 and 3,072 dimensions. These are, by the way, Matryoshka embeddings, so you can chop off the higher dimensions and they still work well.
So you take that embedding, and then for the cell you look at the expression count for each of those genes and multiply the gene's embedding by that scalar. If in this particular cell I saw six of this gene and seven of that one, you take the six and multiply it by the embedding of that particular gene (every component of the embedding gets multiplied by six), and for another gene I saw a hundred of, I multiply its embedding by a hundred, and so on. Then you add all of those up, normalize to length one, and you end up with an embedding of that cell based on its gene expression.
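Here is a minimal sketch of the construction just described, assuming you already have the per-gene text embeddings stacked into a matrix (this is my paraphrase of the idea, not the GenePT authors' code):

```python
import numpy as np

def cell_embedding(counts: np.ndarray, gene_embeddings: np.ndarray) -> np.ndarray:
    """counts: (n_genes,) expression counts for one cell.
    gene_embeddings: (n_genes, d) text embedding of each gene's description."""
    # Weight each gene's direction by how strongly it is expressed, then sum.
    v = counts @ gene_embeddings  # shape (d,)
    # Normalize to unit length so cells are comparable regardless of total counts.
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```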
I want everyone to understand that, because I think it's obvious if I explained it right, but maybe that wasn't a very good explanation, so I want to make sure people are on the same page. Any questions? I'm not watching the chat... actually, I can watch the chat. Whoops.
Okay, that one's just about transcription. Anyone have questions? I want to make sure everyone's clear here.

"Where do you get the gene expression data from?" Okay, I haven't covered that yet, so that's why you don't know. The answer is that there are two databases. One was compiled by the Chan Zuckerberg Initiative and is called CELLxGENE (v2). They collected about 100 million cells: roughly a thousand files that sum up to about 100 million samples, or cells, of this kind of expression data, and it's publicly available. It's probably referenced in one of the links here, and I'll share the slides when I'm done, but anyone can download it. This company, MIRAMICS, also compiles these kinds of datasets, so we have access to larger, non-publicly-available data, but anyone can download the data I'm talking about. Does that make sense?

Then there's another dataset that is very carefully curated to be a good benchmark for single-cell work. It's actually not a subset; it's removed from that dataset, and it's a benchmark for the performance of whatever algorithm you're evaluating: you're supposed to use it not for training your models but for benchmarking them. That's called Tabula Sapiens, and both of those are Chan Zuckerberg projects. If you don't see a link here and you want to know how to download it, I can share that. Does that answer your question? "Yeah, thank you." Great. Okay, go ahead, please.
"Do you have an intuitive way to think about the calculation you're showing on the slide right now? Like, is it projecting from one place to another?"

Yeah. The way I think about it: each gene is represented by a direction in a high-dimensional space, those 3,072 dimensions, so each gene will be somewhat, but not completely, orthogonal to the others. There's a general "I'm describing a gene" direction; all of them point roughly that way, but each one in a slightly different direction. When you add them all up and normalize to one, you're kind of getting the average expression of all these different genes. The interesting thing is that, because it's a text embedding, if the description says something like "experiments show this gene is related to cancer," then there will be some bit of a cancer direction in there too, and likewise drug interactions and so on; anything that's mentioned. So there's a sense in which the direction the cell ends up pointing is the aggregate of the scientific literature about all the genes this particular cell expresses. Does that make sense? It's a little abstract by its nature, but that's the way I think about it.

"That makes sense. And so when you multiply it by that matrix, the green and blue one..." Yes, exactly, that's the gene information from the previous slide, the big block. "And then what you get in the output is, for that specific cell, all of that gene information, but weighted for that cell." That's right, exactly. "Okay, cool, thank you."
And it actually turns out that this technique in some cases exceeds the performance of, for example, scGPT, which is kind of incredible, because the scGPT transformer was trained on the actual genetic data, so presumably there's a lot of relationship information between different genes in there, whereas in this case we're just taking these text embeddings, creating a new embedding space, and looking at how well we do. You can see this UMAP over here (I have a few more UMAPs; actually, they're in my notebooks, which I want to make sure I get to): the colors are the cell types, and the clustering is kind of capturing the different cell types in different clusters.

And since this is a Latent Space event, we have to have a graph of performance. My summary in a nutshell: in some of these datasets scGPT does better, and in some datasets GenePT does better (I'm not going to go into the difference between these two variants), and in some datasets, like this one, the ensemble of the three does better. One important thing is that even when it's not the top, the ensemble is always pretty close to the top, and what that tells us is that the ensemble will probably do better overall. Also interesting: sometimes scGPT does significantly better on accuracy, precision, recall, and F1, and sometimes GenePT does significantly better than scGPT, so they're capturing different information. That makes an ensemble a great approach, because you can hopefully get close to the best of both worlds.
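A simple way to get that "close to the best of both worlds" behavior is to average predicted class probabilities from the two models. This is a hypothetical sketch (the classifiers and feature matrices are placeholders), not necessarily how the ensemble in the plot was built:

```python
import numpy as np

def ensemble_predict(clf_scgpt, clf_genept, X_scgpt, X_genept):
    # Assumes both classifiers are sklearn-style and were fit on the same label
    # encoding, so their predict_proba columns line up.
    p = (clf_scgpt.predict_proba(X_scgpt) + clf_genept.predict_proba(X_genept)) / 2
    return p.argmax(axis=1)
```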
Okay, so let me explain this plot a little bit, and then I'm going to dive into some notebooks, because this is AI in Action, not a paper club. What we did was add to those descriptions. Remember these descriptions: in the original paper they basically just use the descriptions from the NCBI Gene database. What we did was get GPT-4 to give us more information about tissue type, drug associations, and pathways, to improve the embedding. So we generated text for each gene using a prompt, did that over all 50,000 or so genes, took the results, and embedded the full descriptions.
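The prompt and model name below are hypothetical, but this is roughly the shape of that augmentation step: ask a chat model for tissue, drug, and pathway context for each gene, then embed the combined text.

```python
from openai import OpenAI

client = OpenAI()

def augment_description(symbol: str, ncbi_summary: str) -> str:
    prompt = (
        f"Gene: {symbol}\nNCBI summary: {ncbi_summary}\n\n"
        "Add a short paragraph describing the tissue types where this gene is "
        "expressed, known drug associations, and the pathways it participates in."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder: use whatever model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return ncbi_summary + "\n" + resp.choices[0].message.content
```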
So we came up with our own embeddings. This plot is not super well labeled, but: for this smaller test set there are about 20 donors of cells. I picked three that I thought would make good test sets, and then we just did leave-one-out: train on all the donors except one, test on the held-out donor, and repeat that for each donor.
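A sketch of that leave-one-donor-out protocol with scikit-learn, assuming X is the matrix of cell embeddings, y the cell-type labels, and donors the donor ID for each cell (the actual classifiers and metrics in the talk varied):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_donor_out(X, y, donors):
    """X: (n_cells, d) embeddings; y: cell-type labels; donors: donor id per cell."""
    X, y, donors = np.asarray(X), np.asarray(y), np.asarray(donors)
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=donors):
        clf = RandomForestClassifier(n_estimators=200).fit(X[train_idx], y[train_idx])
        held_out = donors[test_idx][0]
        scores[held_out] = f1_score(y[test_idx], clf.predict(X[test_idx]), average="macro")
    return scores
```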
There are two things I think are noteworthy here. One is that this red one is basically the original embedding, and these others are the improved embeddings, so adding more information about each gene does actually seem to help; these are progressively adding more information to the prompt, and it does continue to get better, maybe not in this case, but in many cases. The other thing to note is that this is scGPT, which does significantly better in some cases, as we knew; in this case it doesn't. And if you look at the mean weighted by the number of cells, then for this donor we actually do significantly better than scGPT, and even in the unweighted case we're a little better. Okay, so that's that. Now I want to talk a little about how I did this.
I'm going to switch over to... let's see, I have a couple of workbooks; not this one, I want this one. One other thing, since this is AI in Action: I want to talk about my workflow. I'm actually curious whether anyone else is using notebooks in Cursor or another VS Code-based editor, and what your workflows with notebooks look like. I have a love-hate relationship with them. I love that they're so flexible, and I especially love that I can flip back and forth between cells and render things inline, so for exploratory work they're awesome, they're amazing. Even for some pipeline work they're amazing, because you can see plotted output for your pipeline. But they're also total garbage for rigorous software development, because they basically have global state, and you're programming in a way that has essentially no concept of isolation between functions. So I want to pause and ask: is anyone else using notebooks, and if so, what do you do? Any takers? Is my mic working? Okay.
"Jeremy Howard has a whole workflow for working with Python notebooks where they do production-level stuff. I haven't quite adopted it yet, but I believe they produce some good software over there; it's fastai. nbdev, if that's something you're interested in, you may want to look into. I really respect what they do. I think it's a cool way to develop, but I don't personally use it."

Yeah, I've definitely done the fastai course, at least one of them, and liked it, but does he have a specifically "here's how I do notebooks in production" kind of thing?

"nbdev, I think, is what they build on top of, and if you look at FastHTML, it's built with it."

Okay, I'm definitely going to check that out, thank you. And I know slono hates them.

"I can't get used to them, bro, if I'm being honest. Making that switch is tough, I think. But I think it's a cool way to develop if you've adopted it, and in nbdev they have some cool primitives where you kind of just develop in the notebook and then all your docs are auto-generated, the README is auto-generated, and I think even the docstrings get auto-generated. There are a bunch of cool primitives in there to make it easy to go from notebook to actual production stuff. But I haven't done enough of a deep dive to say 'yeah, I'm ready to switch.' I respect it, though, and it shows there's another way it can be done."

Yeah, great, I love Jeremy Howard and all that stuff, so I'm definitely going to check it out. Did you want to say something?
"Yeah, I mean, I hate them. I worked a lot with materials scientists, chemists, and mechanical engineers who would produce these absolutely horrid IPython notebooks and then ask 'can you push this to prod?' And, oh no. One of the techniques I developed is to just have a Python file on the side that you import into the notebook; then everything that's stable enough, you move over into the package and reuse from the notebook. The other one I would often use is to replace the notebook with a script that's the same as the notebook, something that isn't meant to be reusable functions or a program, but add decorators that cache the expensive-to-compute data, so the first time you run it it does the work, and after that you can always run the whole script. That way you start with a fresh Python process and don't have objects left over from 14 cells ago that are effectively gone. People would develop on that; it looks kind of like a notebook and it runs as fast as a notebook for getting to the part you're working on, but it's actually reproducible. Those were two techniques I used."

Sorry, where does the decorator save the state? Onto disk?

"Yeah, just something like .state as JSON; I'm sure there are more elaborate options."

Okay, but it's saving the state to persistent storage, so that when you re-run the whole script you start fresh without redoing the expensive parts.
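A minimal sketch of that caching trick: decorate the expensive steps of a plain script so the whole file can be re-run from a fresh interpreter without redoing the slow work (the participant mentioned JSON; pickle is used here just to handle arbitrary objects):

```python
import pickle
from functools import wraps
from pathlib import Path

def cache_to_disk(path: str):
    """Cache a function's return value on disk so the script can re-run from scratch cheaply."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            p = Path(path)
            if p.exists():
                return pickle.loads(p.read_bytes())
            result = fn(*args, **kwargs)
            p.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator

@cache_to_disk("expression_matrix.pkl")  # hypothetical file name
def load_expression_matrix():
    ...  # the slow part: download, parse, normalize
```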
The other thing that I really like, and maybe you have a thought about this too, is the ability to have rich output in between cells. We use Databricks, and some of our pipelines are written as notebooks, and the nice thing is you can say: here's the output of this intermediate step, and here's a plot or a table. So when I go back and review because of a failure or whatever, I can just look at the plot and see the intermediate state.

"There's this lightweight Python thing where it's a plain Python file, but you use a marker, the '# %%' percent-percent syntax, which, when you open it in VS Code for example, looks just like a notebook, and I think it can render plot output if you have a matplotlib call in there. But it's still a .py file; you don't have the .ipynb JSON monstrosity lying around. I found that pretty useful."
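For reference, a tiny example of that percent cell format: a plain .py file whose "# %%" markers editors like VS Code and Jupytext treat as notebook cells.

```python
# %%
import numpy as np
import matplotlib.pyplot as plt

# %% [markdown]
# Intermediate plots still render inline when the cells are run interactively.

# %%
x = np.linspace(0, 10, 200)
plt.plot(x, np.sin(x))
plt.show()
```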
"And then, this was quite a long time ago, but I remember Airbnb had a notebook publishing system that would make some really nice renderings of what were basically Python scripts. That's what I had the data team I was leading use; I think it's called the Airbnb Knowledge Repo. I thought it had been deprecated back then, but it looks like it's picking back up again; well, it's two years old, but it allowed you to publish a Python script as a rich, interesting output."

Okay, cool. All right, great; that was a really helpful sidebar for me.
Hopefully this is being recorded so I can go back and review anything I forgot. And actually, just as a side note: that workflow you mentioned, where you take the useful side pieces and move them into actual code, I've done that very much. For example, the inference for those GenePT cell embeddings I've codified into this set of libraries: here's how you do the inference, this is the embeddings piece. It's all open source, and I'll share the link. There are a bunch of utils in there; I call them from my notebooks, and as I develop things and they become more common, or I think people would benefit from them, I add them there.
Okay, good. So, let's see, where do we start. I did this EDA... no, this is not it; I think this is the notebook I want. As I mentioned, this is the Tabula Sapiens dataset; CZ is Chan Zuckerberg, and here's a link. I think the easiest thing, if anyone's interested in following up on this, is that I'll just post a link to the repo I'm presenting from, and you can find all of this information there.

This is an example of the embedding; sorry for the whiplash. This might even be the plot from the paper, or a version of it, except that it's interactive. You can see, again, these different cell types, and they're embedded into reasonably distinct groups, so you can imagine that a classifier in this high-dimensional space is probably going to do a pretty good job, and that's indeed what we find.
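For anyone who wants to reproduce that kind of picture, a sketch of a UMAP of cell embeddings colored by annotated cell type (assumes `embeddings` is an (n_cells, d) array and `cell_types` the matching labels; the UMAP parameters are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
import umap

# embeddings: (n_cells, d) array of cell embeddings; cell_types: matching labels.
coords = umap.UMAP(n_neighbors=30, min_dist=0.3).fit_transform(embeddings)
labels = np.asarray(cell_types)
for ct in np.unique(labels):
    mask = labels == ct
    plt.scatter(coords[mask, 0], coords[mask, 1], s=2, label=ct)
plt.legend(markerscale=4, fontsize=6)
plt.show()
```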
So that code produced the embeddings, and then there's this analysis notebook; to be clear, this is what you do after you have the embeddings. This is scGPT, and a couple of interesting things to note: it has lots of different, well-isolated clusters, whereas in this one the discrimination between classes is better in general, but there are these really messy regions that are going to be basically impossible to differentiate. And then this one... the difference doesn't really matter here. These were the different versions of the embedding, and they all look pretty similar; this one gets a little better, and honestly it's hard to tell visually, but this final one is a little less noisy than the earlier ones. So you can kind of tell visually that it seems to be doing a little better, but not dramatically better.

Then I chose these donors, TSP1, TSP2, and TSP14, because they have good representation across all the cell types. If you look, there aren't a lot of missing cell types, although this one cell type only shows up in 30, so the classifier probably won't do well there, and maybe this type as well; that's this ecto-epithelial cell.
Then I basically just run this in a loop: I train with leave-one-out, and I look at three different algorithms, random forest, XGBoost (or LightGBM, actually), and maybe one other. Then lots and lots of compute, then some tabular results, and eventually you get to this plot. I don't think you care a lot about the details, but that's how we ended up with the plot from earlier: we embedded the cells and trained classifiers on top. Any questions about that? This is partly about my workflow, so if you have workflow questions, that's great too. Okay, so how did we do this?
Again, no rocket science here, but if you look here: basically, for every gene we submit batch requests to OpenAI. The OpenAI embedding models are all fairly old now, but they have a couple of advantages. One is that they're Matryoshka embeddings. Are you all familiar with Matryoshka embeddings? It's a really good concept to know about.

"There was one question in the chat about how you're quantifying which one is better, as in which metric, for that last graphic." Oh, okay, this one: that's the F1 score. And that's actually an insightful question. "Yeah, I meant to ask: when you were looking at the UMAP graphs you were saying that somehow one embedding is better; are you quantifying that with the same F1 score and the same models?" Yeah. And the F1 score is a little bit awkward here: it only really makes sense if all of the cell types are at the same level in the hierarchy. Way back at the beginning of the presentation you saw this hierarchy; if some cell types are at this level and some at that level, then even if you're correct you might get it quote-unquote wrong. So the F1 score doesn't work very well for broad datasets, but it works well for the dataset I'm using to test here, because in Tabula Sapiens all of the cell types are at the same level in the hierarchy. Does that make sense? "Yeah. So for broad cell types, what kind of metric do you use, other than F1?" F1, precision, recall... oh, I see, you mean when the labels come from different families. Let me answer that in a little bit; that's a great question, actually one of our most important questions.
I will answer it. Okay, so, yes. There's no magic here, and most of this crowd probably knows how to do this, but briefly: to embed things with OpenAI you usually upload all your text to OpenAI in batch. You create these files, then you create a batch job, and for something like half the price it will happily churn away at it for up to a day, depending on the size of the dataset. Interestingly, for larger datasets the throughput is much higher, but you can see that for about 30,000 descriptions it took me, what, not even an hour, maybe 30 or 40 minutes, and it cost me about a dollar, so it's amazingly cheap. And because their batch API is designed for massive projects, it's super robust: you send it stuff, you get stuff back a little while later, and they've thought through the failure conditions and how to restart. I was really pleasantly surprised at how well it works versus trying to manage all of that yourself, and it's half the price.
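A sketch of that batch-embedding flow with the OpenAI Python SDK; the file name, model choice, and the `descriptions` dict are placeholders, and the exact request fields are worth checking against the current Batch API docs:

```python
import json
from openai import OpenAI

client = OpenAI()

# descriptions: {gene_id: description_text}, assumed to exist already.
with open("gene_descriptions.jsonl", "w") as f:
    for gene_id, text in descriptions.items():
        f.write(json.dumps({
            "custom_id": gene_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": "text-embedding-3-large", "input": text},
        }) + "\n")

# Upload the file and start a batch job (half price, up to ~24h turnaround).
batch_file = client.files.create(file=open("gene_descriptions.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h",
)

# Later: poll until done, then download the results file (one JSON object per line).
done = client.batches.retrieve(batch.id)
if done.status == "completed":
    results = client.files.content(done.output_file_id).text
```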
So we generate all these embeddings and try the different embedding variants. My hypothesis beforehand was that adding extraneous, less gene-specific information, more about diseases, pathways, drug interactions, and so on, would either not help or would hurt the performance of the classifier. My hypothesis was wrong: just adding more information about the gene does tend to help, and it helped close the gap with scGPT. I've uploaded all of this to Hugging Face, so if you want to just use it for some reason, you can go and use it.
Good. Any questions? Oh, I was going to mention Matryoshka embeddings; they're also this useful thing. There are 3,072 dimensions in the OpenAI large embeddings, but they were designed in such a way (you can read the Matryoshka embeddings paper to understand the details) that you can chop off the higher dimensions and most of the information is still contained in the lower dimensions. That lets you adjust how much data you want to store and process against the performance level you need, and there are even cases where fewer dimensions do better than more. So we ended up chopping the dimensionality, but starting from the 3,072 dimensions.
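The Matryoshka property makes the truncation step almost trivial; a sketch (the target dimensionality here is arbitrary):

```python
import numpy as np

def truncate_embedding(v: np.ndarray, dims: int = 512) -> np.ndarray:
    """Keep only the leading dimensions of a Matryoshka-style embedding and renormalize."""
    v = np.asarray(v)[:dims]
    return v / np.linalg.norm(v)
```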
Okay, cool. Sorry, someone was in the driveway. So the next thing we did: at this point we believe this works and we have some embeddings. We weren't very rigorous about it yet, because the work is very pragmatically focused on helping customers get good results. Not that we shouldn't be doing evals, we absolutely have to, but our focus was to learn how these models work before we start doing really rigorous evals, and I think we've now gotten to the point, and I'll discuss where, that we need to start doing more rigorous evals.

So let's start here: now let's load CELLxGENE v2. This is the hundred million cells; I don't want to get this wrong, but I think it's maybe five or six terabytes of data, so it's pretty big, not massive, but not small either. One thing about this type of data is that it's saved in an interesting file format that is sort of a file system in a file: it lets you store the sparse gene expression data, plus all the vectors of labels and other things, plus all sorts of metadata. That's the standard way this data is represented, but the consequence of such a flexible format is that, first of all, there's a lot of data that in any given circumstance you just don't care about, not cruft exactly, but irrelevant; and different people put different names on the same things; it's not very well standardized. So there's a lot of pre-processing you have to do.
I actually ended up writing a script to extract certain metadata and write it to Parquet files, so that I can do quick analysis on the metadata for the whole dataset. There are a thousand or so files, and I wanted to be able to read the metadata and compute statistics over them so that we can curate a good training set. So I created this metadata; again, it's an open-source project, it's right here, it's called anndata metadata, it's on GitHub, and I can share that as well. What it does is create these Parquet files, which you can query with DuckDB, or load in pandas, or whatever your tool of choice is.
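Once the metadata is in Parquet, queries over the whole corpus become one-liners. A sketch with DuckDB; the path and column names are hypothetical, not the actual schema of the project:

```python
import duckdb

con = duckdb.connect()
df = con.execute("""
    SELECT cell_type, COUNT(*) AS n_cells
    FROM read_parquet('metadata/*.parquet')
    GROUP BY cell_type
    ORDER BY n_cells DESC
""").df()
print(df.head())
```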
So basically we can ask: which genes are being expressed in different files, how many genes there are, and so on. I also had to do some post-processing on the distribution of gene counts: for a given file, how many genes are expressed, the total number of gene expressions, and then the number of unique genes expressed. I plotted this in a manual way for three files, and then went and got the distribution across, sorry to scroll, across all of them. So this is that same data, but for many of the files in the dataset, in order to understand how similar the data looks visually. This is something I almost always do with large datasets: I let my eyeballs get a feel for the data and find the outliers in its shape. The things I pull out from here: most of these actually seem pretty uniform, but some have very small, tight distributions and others are wide, though they all have a reasonable cutoff; and these ones are really heavy-tailed, so I cut them off, and this last column is the rest. The point is that I got familiar with the shape of the data. We were originally going to set some thresholds to throw away bad data, and just from looking at this we realized: let's train on everything; for any given file, we're not going to apply any discriminatory factors that exclude data. Does that make sense? I want to make sure I'm not boring you with data analysis details, but I wanted to talk about how we're selecting the training set. Any questions?
"It looks like yikes did ask whether you have any intuition on why it ended up helping, but maybe in the interest of time; you only have a couple of minutes left, and it was a while back, I just missed it." Okay, yikes, let me answer that in the chat so I can finish up, but I will answer it.

So, from this crowd's perspective, the interesting thing about this dataset and my training process, or really the data selection process, is that these files are in that weird format I mentioned, and they're all in S3. The standard tools are just no good at using objects directly in S3: they want to download the whole file and then start analyzing it, and the upshot is that you end up with a very prohibitive IO problem. So what I ended up doing was building a couple of tools along the way that go and fetch file fragments, and, this is where LLMs really shine, I ended up writing an integer linear program to select the best fragments from the files based on their coverage of the cell types I wanted. The vast majority of the training work was actually just figuring out how to get the training set out of the 100 million cells in a way that wasn't prohibitively expensive. I got it down to about five million cells.
Then I built that subset. We were going to use XGBoost for the classifier, but XGBoost has problems: it doesn't scale well to large datasets that don't need a corresponding amount of processing. You can scale it in various ways, but none of the ways it scales match this dataset very well; if people are interested, I can explain why. So we ended up using a multi-layer perceptron, again loading only the data we need, being very selective about what we load into memory, and using Optuna to tune the hyperparameters. We ended up with a classifier, stored in Weights & Biases, that can classify our test set of about 100,000 cells in roughly 15 seconds. So the inference is super efficient in the end, but it took a lot of work to get there.
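A sketch of Optuna driving a hyperparameter search; the real model was a custom MLP trained on the embedding subset, so the scikit-learn MLP, the search ranges, and `X_train`/`y_train` here are stand-ins:

```python
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# X_train, y_train: the curated embedding subset and its labels (assumed to exist).
def objective(trial):
    clf = MLPClassifier(
        hidden_layer_sizes=(trial.suggest_int("width", 128, 1024, log=True),),
        learning_rate_init=trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        alpha=trial.suggest_float("l2", 1e-6, 1e-2, log=True),
        max_iter=50,
    )
    return cross_val_score(clf, X_train, y_train, cv=3, scoring="f1_macro").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```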
Let me just share the Weights & Biases run really quick; in fact, I think it's referenced from here, so let me grab it. Yeah, here. If you're not aware of Weights & Biases, you definitely should be using it. Maybe the final thing I'll mention is that I ended up learning a lot about training models in this process; I don't have a ton of experience with neural networks. The interesting thing: look at these two training runs, the blue one and the orange (or yellow) one. I won't talk too much about my metrics, because I also want to talk about the metric we're developing, so I'll be brief on this. You can see that this run keeps having loss spikes and never really gets to the point where it can converge before it has another spike. Let me make this larger: it keeps getting lower and then, oh shit, another loss spike, and it's periodic. What this corresponds to is different datasets: the distribution across the datasets is different, so whenever training reaches a new dataset it's suddenly mispredicting everything and has to recover from there. It gets down to a certain point, it generalizes a little, the spikes get a bit less bad and the loss gets a bit lower, but the network never improves smoothly. That run was using the full 3,072-dimensional embeddings. What we did was just chop them down so everything fits in memory, so we can easily shuffle the entire dataset and pick random subsamples for the batches, and then suddenly the loss is dramatically better. These other measures aren't super important.
I'll stay on for a few minutes afterwards and answer questions, but I want to answer the earlier question about the metric. The metric we used... let's see, it's written out somewhere here. Yeah, here. If you look at the standard macro metrics, like macro F1, they're total garbage: the F1 score here is about nine percent, and precision and recall are equally bad. The reason is that even if I'm predicting correctly but landing in the wrong place in the hierarchy, it still counts as a misprediction. So these next ones are ranking metrics traditionally used in information retrieval: is the correct answer in my top ten, top five, top two; and these others basically weight how far from the top the correct answer is. Those do better, but I find them very unsatisfying, because in practice "it's one of these five" doesn't help me; I'm doing science, and I want to know my best guess. The other thing is that they don't take the hierarchy into account very well; where did that slide go... So I think we're converging on a hierarchical metric that is basically: how far in this hierarchy is my prediction from the correct label?
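One way to implement the hierarchy-aware metric being described: load the cell ontology as a graph and score each prediction by its distance, in edges, to the true label. The graph construction is assumed to exist already; this is a sketch of the scoring, not the team's final metric:

```python
import networkx as nx

def ontology_distance(ontology: nx.Graph, true_label: str, predicted_label: str) -> int:
    """0 means exactly right; 1 means an immediate parent or child of the truth, and so on."""
    return nx.shortest_path_length(ontology, true_label, predicted_label)

def mean_ontology_distance(ontology: nx.Graph, y_true, y_pred) -> float:
    return sum(ontology_distance(ontology, t, p) for t, p in zip(y_true, y_pred)) / len(y_true)
```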
Okay, this part is biology-related and probably not that interesting for this crowd, but I thought it was really fascinating, because this is very much a process of discovering the problem: I'm learning about the problem as I learn about the data and as I learn about the specific models. The process here was really exciting to me, and I wanted to share it. Did I answer the earlier question? "Yeah." Okay, good.

And I'm curious: have you ever dealt with something like this, and if so, what did you do? "I haven't worked with bioinformatics data at all." Okay, but what about hierarchical data in general? "Not really, no. The data I've worked with doesn't really have hierarchical connections like that. I also wanted to ask: how important do you think having domain knowledge in omics is for getting into this field, for doing this kind of work?" I think having an expert to ask questions of, and to guide the decision-making process, is pretty critical: the intuition for what will work and what's important. You need an expert, but I don't think you need formal training yourself. I have a friend who is a very experienced bioinformatician, who has done drug development and related work his whole career, so we work together on this: I bring the AI, he brings the bio, and we end up working really well together. "And do you have any suggestions for quick exploratory projects following up on your work?" I'd love to work together on something, and I can definitely come up with specific ideas. Why don't you DM me and we can talk about ways you could help, and if what we're doing isn't appealing but you want something else in the same domain, I can probably come up with something for that too. "Thank you." Awesome.
that well thank you awesome um yeah okay good so i guess that's probably all i have time for do you 00:57:29.820 |
guys want to um uh are there any other questions i'm happy to stay after uh and talk um but i want to 00:57:39.260 |
like sort of you know at least have a informal ending there or a formal ending i guess 00:57:48.940 |
okay uh okay um okay cool well thank you guys for joining i hope that was helpful and interesting um 00:57:57.420 |
uh please if you uh if you have questions or follow-ups or need links or whatever just hit me in discord 00:58:06.300 |
and i will um i will share what you need and answer questions and you know would love to collaborate and all that 00:58:16.860 |
okay take care guys and i guess what uh uh yikes or flow or whomever do we usually we're starting to 00:58:25.580 |
publish these on on the latent space tv right i i have a bunch of recordings that i'm overdue to publish 00:58:32.460 |
um so yes they will be published at some point but if you need it sooner just let me know and i'll upload it 00:58:38.220 |
i mean i like to after i do a talk i like to just you know i like to get as much credit for all the 00:58:44.540 |
work i do as possible so i like to post it on x and linkedin um just to say so i can send you the raw if 00:58:51.660 |
you want and then once it's on youtube i'll send you the link too okay i mean i think i probably and 00:58:56.220 |
honestly won't do anything with the raw so like what as soon as you can get the youtube up that would be 00:59:01.420 |
awesome yeah yeah no i'll do that today sounds good okay okay cool thanks awesome all right guys thanks