
[AI in Action] Towards a Generic Cell Typer — RJ Honicky, PhD with MIRAMICS


Whisper Transcript | Transcript Only Page

00:00:00.000 | Let me do that real quick. This is my friend; he has a consultancy and he's trying to stand up
00:00:06.360 | some AI stuff and win some big contracts, and he needed some help, so I'm helping him with this.
00:00:11.340 | It's all open source, though, and we intend to keep it that way;
00:00:19.220 | we're just going to make money on consulting. Okay, so let's see. Actually, this is a good slide
00:00:27.080 | for everyone. So, cell typing. Okay, let's back up. One thing that
00:00:37.500 | has become pretty prevalent in the world of biology in the last five years or so is what's
00:00:44.760 | called single-cell transcriptomics, which means you can sequence the RNA, not the DNA but the
00:00:52.200 | RNA, that is being expressed in single cells, one cell at a time. First of all, DNA is sort of
00:01:01.320 | like the copy of the software on your hard drive, and the RNA is like the running software
00:01:10.340 | in memory; that's a good way to think about it. It's not a perfect analogy, and maybe
00:01:15.980 | hardware analogies would be appropriate as well, but the RNA is basically
00:01:21.180 | what's being done right now in the cell, as opposed to all the proteins that
00:01:28.740 | the cell can express. There are various technologies, but they all
00:01:35.060 | basically boil down to this: for a given cell, you have a matrix of 50,000 or so genes, and you
00:01:44.180 | just count how many of each of those genes you can detect in that one cell. These matrices are
00:01:51.120 | generally very sparse: in any given cell, only about one to two
00:01:58.420 | percent of the genes are being expressed at the time of transcription. This process is also
00:02:06.440 | highly dependent on the experimental setup, so it's very subject to what are called batch effects,
00:02:15.040 | meaning the person who ran it spent a little bit longer during the sequencing
00:02:22.380 | process, or they used a different brand of reagent; basically, there's a lot of
00:02:28.780 | randomness in the collection process. These batch effects cause a lot of headaches,
00:02:35.940 | so we want models that are able to account for batch effects, that are able to
00:02:42.820 | extract the gene expression information and then figure
00:02:52.420 | out, for this exact cell, what type of cell it is. That could be as coarse as "is this a neuron or
00:02:59.660 | a muscle cell," or it could be super specific, like "I know this is a neuron because
00:03:06.040 | it's from a chunk of the brain, but I don't know what type of neuron."
00:03:12.740 | So, basically, typing the cell. This is upstream of a ton of really
00:03:19.900 | important biological experiments and diagnoses. So basically
00:03:29.320 | we've identified this as a really core problem that has a lot of academic attention;
00:03:38.340 | there's tons of research, even from a couple of years ago, about this problem and how
00:03:44.360 | to solve it, and people got to the point where they're kind of happy with the solutions from an
00:03:49.120 | academic standpoint, but it really hasn't made its way all the way into practice. And
00:03:56.460 | it's pretty tough to use these models; not that it would be tough for you, but
00:04:00.920 | for a biologist it's kind of tough. So we're focused on taking this
00:04:09.760 | and wrapping it up into something that is really easy for people to use. That's the point of this,
00:04:15.600 | and I'll give a little more detail on that. Okay, so, like I said,
00:04:23.700 | the state of the art is either pretty coarse-grained or really specific to particular
00:04:31.940 | types of cells. On the right-hand side is a hierarchy; this is an ontology.
00:04:40.380 | There's a database out there that you can use, and the data sets that we use are
00:04:45.900 | normalized to it. The problem you have is that
00:04:51.660 | if your data is labeled as, say, lymphocyte or innate lymphoid cell, and then you predict CD16-
00:04:57.920 | positive, is that an incorrect prediction, or how correct is it? It's not a simple
00:05:04.920 | classification problem. And of course the labels can be wrong as well. But maybe if it's
00:05:10.920 | a different kind of mature natural killer cell, how wrong is that really? That should be
00:05:19.940 | seen as different from predicting it as, I don't know, a muscle cell or something.
00:05:24.440 | Okay, so, cell typing. You guys already know this; this is a model, a
00:05:35.640 | transformer-based model. Without getting into too much detail about how it
00:05:41.340 | works: the data sets are not genetic sequences. With nucleotide sequences,
00:05:51.840 | you have the four nucleotides, and they go in these sequences and turn into double
00:05:56.500 | helixes and so on. This is not that kind of data. This is, for one cell, a matrix of 50,000
00:06:03.100 | possible genes and the number of times I saw each gene. A gene is a string of
00:06:10.480 | maybe anywhere from a couple thousand to, probably, a couple hundred
00:06:17.940 | thousand nucleotides; they all string together, A, T, C, G, in different orders, and that
00:06:25.440 | defines what a gene is. Anyway, you get these counts, and obviously, because it's just
00:06:34.280 | a matrix, it doesn't really matter what order the genes are in, so you don't quite have a one-to-one
00:06:41.740 | sequence type of data. So the authors did some work so they could express the data in something like an
00:06:56.000 | autoregressive model, and it's a little unclear to me why they really wanted to do that instead of
00:07:01.160 | more of an encoder-only model, but it does perform well. I think
00:07:08.780 | in my mind there's a lot of controversy around whether it's because of the other stuff
00:07:13.260 | they did, and whether they should have used a BERT-type model. Anyway, for what it is, it's an
00:07:19.140 | autoregressive, more or less GPT-style model, and it was designed explicitly to remove
00:07:28.120 | these batch effects, and this is going to matter in a minute. It does pretty well, but you can see
00:07:34.780 | these data sets, if you look at them (well, you can't see this column), the data sets are
00:07:43.600 | small, specific types of cells. So they're not doing "give me, of all
00:07:52.020 | the cells, the 700 or so types," and even on a big subset of that,
00:07:59.840 | they don't do very well. Okay, I know this is a whirlwind tour. I don't want to
00:08:08.200 | dwell on anything, but I do want to make sure people get the gist of this. Are there any
00:08:12.800 | questions to start with?
00:08:14.840 | Okay, cool, so let's talk about what I did. One other background piece: there's this other model, and
00:08:24.020 | the reason I wanted to share this is because I think it's a super cool
00:08:28.600 | use of embeddings, and it's a little bit counterintuitive that it works so well,
00:08:36.160 | or maybe some of you might be like "yeah, obviously," but I think it's awesome that it works so well.
00:08:42.360 | It's actually surprisingly great, really simple to implement, and super efficient, so I wanted
00:08:53.060 | to share it with you, along with the exploration I've done with it, a small model
00:08:58.280 | that we've built on top of it, and where we're going with it. So this GenePT is basically built
00:09:08.000 | by taking, first of all, for every gene in my 50,000 or so genes, a description
00:09:18.740 | of that gene. You start with this thing called the NCBI Gene database, and for every one of those I can
00:09:24.700 | just get a description of the gene. I could maybe even zoom in; no, I
00:09:31.740 | can't zoom in, but it says things like "this gene is used for creating these kinds of proteins,"
00:09:38.100 | and so on. It's a small summary of what the
00:09:44.100 | scientific literature says about this gene. Then you take that and you pump it into (I didn't do this part)
00:09:52.860 | the OpenAI embeddings, and you get out an embedding,
00:10:03.040 | a vector that is somewhere between 1024 and 3072 dimensions.
00:10:12.200 | These are, by the way, Matryoshka embeddings, so you can chop off the higher dimensions and they still work
00:10:16.900 | well. So you take that embedding, and then you look at, for the cell, what the gene expression is for
00:10:27.200 | each one of those genes, and then you multiply the embedding by that scalar. So if,
00:10:37.080 | in this particular cell, I have six of this gene and seven of that one, then you take the six and
00:10:45.760 | multiply it by the embedding of that particular gene, so all the embedding's components get
00:10:52.680 | multiplied by six in that case. Then for another gene, maybe I saw a hundred of it, so
00:10:58.880 | I'm going to multiply that one by a hundred, and so on. Then you add
00:11:05.600 | all this up, you normalize to one, and you end up with an embedding of that cell based on its gene
00:11:13.620 | expression. I want everyone to understand that, because it's obvious
00:11:20.620 | if I explained it right, but maybe that wasn't a very good explanation, so I want to make sure people are on the
00:11:25.680 | same page here. Any questions? I'm not watching the chat; in fact, I can watch the chat, whoops.
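To make the construction concrete, here is a minimal sketch of that weighted-sum cell embedding in Python. It assumes you already have a gene-to-embedding lookup; the function and variable names are illustrative, not GenePT's actual API.

```python
import numpy as np

def cell_embedding(gene_counts: dict, gene_embeddings: dict) -> np.ndarray:
    """Weighted average of per-gene text embeddings, weighted by expression counts."""
    dim = len(next(iter(gene_embeddings.values())))
    total = np.zeros(dim)
    for gene, count in gene_counts.items():
        if gene in gene_embeddings and count > 0:
            total += count * gene_embeddings[gene]   # e.g. saw this gene 6 times -> weight 6
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total       # normalize to unit length

# usage (illustrative gene names): counts from one cell's sparse expression vector
# cell_vec = cell_embedding({"CD3D": 6, "CD8A": 7, "GAPDH": 100}, gene_embeddings)
```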
00:11:33.360 | Okay, the chat is just about transcription. Okay, anyone have questions? I want to make sure everyone's
00:11:39.920 | clear here. Where do you get the gene expression data from? Okay, I haven't covered
00:11:47.360 | that, so that's why you don't know it. The answer is that there are two
00:11:56.880 | databases: one compiled by the Chan Zuckerberg Initiative, called CELLxGENE (v2),
00:12:05.580 | where they collected 100 million cells, so about a thousand files that sum up to about 100
00:12:13.340 | million samples, or cells, of this type of expression data. It's publicly available;
00:12:20.720 | it's probably referenced in one of the links here, and I'll share the slides when I'm done, but
00:12:26.140 | anyone can download it. This company, MIRAMICS, also compiles these
00:12:35.140 | data sets, and we have access to larger, non-public data, but
00:12:41.940 | anyone can download the data that I'm talking about. Does that make sense? And then there's
00:12:46.760 | another data set that is very carefully curated to be good as a
00:12:55.780 | benchmark for single-cell work; actually it's not a subset, it's removed from that data set, and
00:13:02.440 | it's a benchmark for the performance of whatever algorithms, but you're supposed to use it not for
00:13:08.100 | training your models but for actually benchmarking your models. That's called Tabula Sapiens,
00:13:14.200 | and both of those are from the Chan Zuckerberg Initiative. If you don't see a link here and
00:13:22.260 | you want to know how to download it, I can share that with you. Does that answer
00:13:28.100 | your question? Yeah, thank you. Great. Okay, hopefully that... yeah, go ahead, please, question.
00:13:36.000 | Do you have an intuitive way to think about the calculation you're showing on the
00:13:43.980 | slide right now? Like, is it projecting from one place to another? Yeah, so I think that
00:13:51.960 | each gene is represented by a direction, and this is in a high-dimensional
00:14:00.640 | space, 3072 dimensions, so each gene will be somewhat, but not completely, orthogonal
00:14:10.980 | to the others. There'll be a general "I'm describing a gene" direction;
00:14:17.280 | all of them will be in that direction, but each one will point in a slightly different direction,
00:14:22.700 | and when you add them all up you're kind of getting the average, and then, normalized to one,
00:14:29.740 | you're kind of getting the average expression of all these different genes. But the interesting thing
00:14:35.420 | is that, because it's a text embedding, let's say the information in
00:14:42.720 | the description is something like "experiments show that this is related to cancer,"
00:14:48.220 | then there'll be some bit of a cancer direction in there too, and
00:14:55.900 | likewise drug interactions, etc., anything that's mentioned in there. So there's a
00:15:02.140 | sense in which the direction that the cell is pointing will be
00:15:13.420 | sort of the aggregate of the scientific literature about all the genes of this particular cell. Does
00:15:19.740 | that make sense? It's a little bit abstract by its nature, but that's the way I think
00:15:26.060 | about it. That makes sense. And so when you multiply it by the matrix that's kind of green and blue...
00:15:33.980 | Yeah, that's the cell, sorry, the gene information you had on the previous slide, the big block.
00:15:41.900 | Yes, exactly. And then what you get in the output is, for that specific cell, all that specific gene
00:15:48.620 | information, but for that cell. Yeah, that's right, exactly. Okay, cool, thank you. Yeah, and it
00:15:57.500 | actually turns out that this technique in some cases exceeds the performance of others, like scGPT
00:16:09.100 | for example, which is kind of incredible, because the scGPT transformer was trained on
00:16:18.620 | actual gene expression data, where presumably
00:16:26.620 | there's a lot of relationship information between different genes, whereas in this
00:16:31.900 | case we're just saying, oh, we have these different text embeddings, and we're just
00:16:38.380 | going to create a new embedding space and then look at how well we do. And you
00:16:45.100 | can see this UMAP over here; I have a few more UMAPs, or no, they're
00:16:52.620 | not in here, they're in my notebooks, which I want to make sure I get to. But you can see that
00:16:57.900 | your clustering is kind of capturing... the colors are the cell types,
00:17:06.780 | and it is kind of capturing different cell types in different clusters in the UMAP, so that's
00:17:12.220 | promising.
00:17:15.500 | And then, since this is a Latent Space event, we have to have a graph of
00:17:25.820 | performance. My nutshell is that
00:17:33.660 | in some of these data sets scGPT does better, and in some data sets GenePT does better;
00:17:42.380 | I'm not going to go into the difference between these two, and then there's this exact model, in this bones one. And
00:17:47.980 | in some data sets the ensemble of the three of these does
00:17:55.420 | better. One thing that is important is that even when it's not the top, the ensemble is
00:18:02.060 | always pretty close to the top. What this tells us is that the ensemble
00:18:08.540 | will probably do better. And, maybe also interestingly,
00:18:15.500 | sometimes scGPT does well, pretty significantly better than GenePT,
00:18:22.620 | in accuracy, precision, recall, and F1, and then sometimes GenePT does the best
00:18:29.340 | and does pretty significantly better than scGPT. So they're capturing different information, and
00:18:35.900 | the ensemble is a great approach, because you can hopefully
00:18:43.100 | get close to the best of both worlds. Okay, so let's talk about... I'm
00:18:53.100 | going to explain this plot a little bit, then I'm going to dive into some notebooks, because this is AI in Action,
00:18:57.180 | not a paper club. So what we did was we added to those descriptions;
00:19:11.020 | remember, in the original paper they basically just use the
00:19:17.900 | descriptions from the NCBI Gene database. What we did was we got GPT-4 to
00:19:27.580 | give us more information about tissue type, drug associations, and pathways, to improve
00:19:35.660 | the embedding. So we generated text on each gene using a prompt, we did
00:19:45.900 | that over all 50,000 or so genes, and then we took those and embedded the full descriptions.
00:19:52.220 | So we came up with our own embeddings. This plot is not super well labeled, but: we have
00:19:59.260 | this smaller test set, with I think 20 or so donors of cells, and
00:20:11.340 | I picked three that I thought would be good for test sets, and then what we did was just leave one out:
00:20:16.940 | train on all the donors except this one, then test on the last one, and then do the same
00:20:23.980 | for each donor. And you can see... I think there are two things that are
00:20:30.860 | noteworthy here. One is that this red one is basically the original embedding, and
00:20:40.140 | then these are the improved embeddings, so it does actually seem to help to add more information
00:20:46.300 | about the gene, and these are progressively adding more information to the prompt,
00:20:52.780 | and it does continue to get better, maybe not in this case, but in many cases. The
00:21:01.020 | other thing to note is that this is scGPT, and it does pretty significantly better in some cases,
00:21:08.380 | which we knew; in this case it doesn't. And then if you look at the weighted mean, weighted
00:21:15.580 | by the number of cells, then actually in this case, for this donor, we do
00:21:23.260 | pretty significantly better than scGPT, and even in the unweighted case it does a little better as well.
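For reference, here is a rough sketch of the leave-one-donor-out evaluation described above, using scikit-learn's LeaveOneGroupOut. The classifier and metrics are illustrative stand-ins, not the exact pipeline; `X`, `y`, and `donors` are assumed to be arrays of cell embeddings, cell-type labels, and donor IDs.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def leave_one_donor_out(X, y, donors):
    """Train on all donors except one, test on the held-out donor, repeat for each donor."""
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=donors):
        held_out = np.unique(donors[test_idx])[0]
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        scores[held_out] = {
            "f1_macro": f1_score(y[test_idx], pred, average="macro"),       # unweighted
            "f1_weighted": f1_score(y[test_idx], pred, average="weighted"), # weighted by cell count
        }
    return scores
```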
00:21:29.100 | Okay, so that's that. Now I want to talk a little bit about how I did this,
00:21:35.740 | so I'm going to switch to... let's see, I have a couple of workbooks.
00:21:46.300 | Not this one, I want this one. Okay, so one other thing: since this is AI in Action, I just want
00:21:51.820 | to talk about my workflow. I'm actually curious to hear if anyone else is using
00:21:57.740 | notebooks in Cursor or another VS Code-based editor. I'm
00:22:07.900 | curious to hear about other people's workflows with notebooks. I have a love-hate relationship with them.
00:22:14.140 | I love that they're so flexible, and I especially love that
00:22:23.100 | I can flip back and forth between cells and render things, so for exploratory stuff
00:22:29.180 | they're awesome, they're amazing. Even for some pipeline stuff they're amazing, because
00:22:34.860 | you can see plotted output for your pipeline. But they're also total garbage for
00:22:42.220 | rigorous software development, because they basically have global state, so
00:22:48.460 | you're programming in something that basically has no concept of isolation
00:22:54.940 | from other functions. So I want to just pause and ask: is anyone else using
00:23:02.380 | notebooks, and if so, what do you do? Any takers? Okay, my mic's working. Yeah,
00:23:17.420 | Jeremy Howard has a whole workflow for working with Python notebooks where they do
00:23:23.180 | production-level stuff. I have not quite adopted it yet, but I actually believe they
00:23:30.860 | produce some good software over there at fast.ai. Yeah, nbdev, if that's something
00:23:37.100 | you're interested in, you want to look into it. I really respect what they do over there. I think
00:23:41.180 | that's a cool way to develop, but I don't personally use it. Yeah, I mean, I've definitely
00:23:46.380 | done the fast.ai course, at least one of them, and I like it. But does he have
00:23:57.740 | specifically a "here's how I do notebooks in production" kind of thing? Yeah, nbdev I think
00:24:03.980 | is what they build on top of, and if you look at FastHTML, it's built with that.
00:24:10.300 | Okay, yeah, definitely going to check that out, thank you. And I know slono hates
00:24:18.540 | it; I can't get used to it, bro, if I'm being honest. Making that switch
00:24:26.140 | is tough, I think, but I think it's a cool way to develop if you've adopted it, and in nbdev they
00:24:32.780 | have some cool primitives where you kind of just develop in the notebook, and then all your
00:24:37.740 | docs are auto-generated, the README is auto-generated, I think
00:24:44.060 | even the docstrings get auto-generated. There's a bunch of cool primitives they have in there to
00:24:48.380 | make it easy to go from notebook to actual production stuff. But I haven't done enough of a
00:24:54.460 | deep dive to really say, oh yeah, I'm ready to switch, but I respect it, and I think it just
00:25:00.700 | shows that there's another way it can be done. Yeah, okay, great, I love Jeremy Howard and all this
00:25:05.660 | stuff, so I'm definitely going to check that out. You want to say something? Yeah, I mean, I hate
00:25:12.060 | them, but I worked a lot with material scientists and chemists and mechanical
00:25:19.580 | engineers who would produce these absolutely horrid IPython notebooks, and then it's "can you push this to
00:25:24.860 | prod?" and, oh no. One of the techniques that I developed is to just have a Python file on
00:25:33.820 | the side that you import into the notebook, and then everything that's stable enough, move it over
00:25:38.620 | to the package and reuse it in the notebook. And the other one that I would often use is to
00:25:44.460 | replace the notebook with a script that's the same as the notebook, something that is not meant to be
00:25:50.940 | functions or a program that you reuse, but add decorators to cache the expensive-to-compute data,
00:25:57.900 | so that the first time you run it, it will do the computation, and then you can always run the whole script.
00:26:04.620 | That way you start with a fresh Python process and don't have objects left over from
00:26:10.540 | 14 cells ago that are kind of gone. They would develop on that, so it looks kind
00:26:17.820 | of like a notebook, and it runs as fast as a notebook to get to the stuff you're working on,
00:26:23.580 | but it is actually reproducible. Those were two techniques that I used. How did you... sorry, where
00:26:29.740 | does the decorator save the state, onto disk? Yeah, just something like .state as JSON or something like
00:26:36.300 | that; I'm sure there are more elaborate options. Okay, but it's saving it to persistent
00:26:41.180 | storage, so that when you rerun, you pick it back up.
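A minimal sketch of that cache-to-disk decorator idea. The exact implementation used isn't specified; this version pickles results keyed by the function name and arguments, and joblib.Memory is a ready-made alternative.

```python
import hashlib
import pickle
from functools import wraps
from pathlib import Path

def cache_to_disk(cache_dir=".cache"):
    """Cache a function's return value on disk so rerunning the whole script stays cheap."""
    Path(cache_dir).mkdir(exist_ok=True)
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            key = hashlib.md5(pickle.dumps((fn.__name__, args, kwargs))).hexdigest()
            path = Path(cache_dir) / f"{fn.__name__}_{key}.pkl"
            if path.exists():
                return pickle.loads(path.read_bytes())   # later runs: reuse the stored result
            result = fn(*args, **kwargs)
            path.write_bytes(pickle.dumps(result))       # first run: persist to disk
            return result
        return wrapper
    return decorator

@cache_to_disk()
def load_expression_matrix(path):
    ...  # expensive load / preprocessing step (illustrative)
```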
00:26:46.300 | But maybe you have a thought about this too: the other thing that I really like is the ability to have rich output in between cells, because
00:26:52.620 | then I can see it. We use Databricks, and some of the pipelines
00:27:00.300 | are written with notebooks, and the nice thing about that is that you can say,
00:27:07.180 | okay, here's the output of this intermediate step, and here's a plot or a table or
00:27:13.340 | something, so that when I go back and review because of a failure or whatever, I can just look at the plot
00:27:19.420 | and see the intermediate state. There's this lightweight Python thing where it's a Python
00:27:26.460 | file but you can use percent-percent markers; there's a syntax that
00:27:31.420 | eludes me right now, which, when you open it in VS Code for example, looks just like a notebook, and I think
00:27:37.820 | it can render graph outputs if you have a matplotlib call in there. But it's still a .py file;
00:27:45.420 | you don't have this .ipynb JSON monstrosity lying around, and I found that pretty useful.
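The syntax being described here is most likely the "percent" cell format, where a plain .py file is divided into cells with `# %%` markers that VS Code (and Jupytext) render like a notebook. A tiny illustrative file:

```python
# %% [markdown]
# Exploratory analysis of cell embeddings.
# This is a plain .py file, but editors that support the percent format
# treat each "# %%" block as a runnable notebook cell.

# %%
import matplotlib.pyplot as plt
import numpy as np

# %%
x = np.linspace(0, 10, 200)
plt.plot(x, np.sin(x))   # the plot renders inline under this cell
plt.show()
```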
00:27:52.460 | And then, this was quite a long time ago, but I remember that Airbnb had a notebook
00:27:58.380 | publishing system which would make some really nice renderings of Python scripts. That's
00:28:10.620 | what I had the data team I was leading use; I think it's called the Airbnb Knowledge Repo.
00:28:20.140 | I think it was deprecated back then, but it looks like it's picking back up again. I mean,
00:28:28.220 | it's two years old, but it allowed you to just publish a Python script as a rich, kind of
00:28:34.220 | interesting output. Yeah, okay, cool. All right, great, so that was a really helpful sidebar for me;
00:28:41.820 | hopefully this is being recorded so I can go review anything I forgot. Okay, so,
00:28:47.660 | actually, just as a side note: that workflow you mentioned, where
00:28:53.340 | things that turn out to be useful you move into actual code,
00:29:00.860 | I have done that very much. So the inference for GenePT,
00:29:08.060 | for those GenePT cell embeddings, I've codified into this set of libraries:
00:29:19.340 | here's how you do the inference, this is embeddings, and it's all open source; I'll share
00:29:24.380 | the link. You can see there are a bunch of utils,
00:29:33.660 | and I call these from my notebooks, and as I develop them, when they become more like,
00:29:40.620 | oh yeah, this is a common thing, or people would benefit from this, I add them there. Okay, good, so
00:29:49.020 | let's see, where do we start? So I did this... I think this is the right notebook.
00:29:57.340 | So I did this EDA... oh no, this is not it; I think this one is the one I want.
00:30:04.460 | So this, as I mentioned, is the Tabula Sapiens data set; CZ is Chan Zuckerberg,
00:30:12.060 | the Chan Zuckerberg Initiative. This is a link; I think the easiest thing, if anyone's
00:30:19.660 | interested in following up on this more, is that I'll just post a link to the repo that I'm
00:30:26.540 | presenting from, and then you can find all this information. So this
00:30:34.540 | is an example of embedding, sorry for the whiplash here. This might even be the plot,
00:30:42.780 | or a version of the plot, that she did, except that it's interactive, and you can see that, again,
00:30:49.980 | you have these different cell types and they're kind of sort of embedded such that
00:30:56.940 | you can get pretty good classes. So you can imagine that a classifier in high-dimensional
00:31:03.420 | space is probably going to be able to do a pretty good job with this, and that's indeed what we find.
00:31:07.980 | So this code was the embedding, and then there's this analysis one. To be clear,
00:31:16.620 | this is about what you do after you have the embeddings. So this
00:31:22.860 | is scGPT, and a couple of interesting things to note are that there are
00:31:29.100 | lots of different, well-isolated clusters, whereas this one is more... maybe the
00:31:36.700 | discrimination between the classes is better in general here, but you can
00:31:45.340 | see that it has these really horrible parts that are going to be completely impossible to
00:31:50.620 | differentiate. And then this one is... let's see, what's the difference? Well, it doesn't really
00:31:57.580 | matter. So we do these... oh, these were the different versions of
00:32:06.140 | the embedding, and you can see they're all kind of the same. This one gets a little better;
00:32:11.980 | I don't know, visually it's hard to tell, but maybe this one, and then this one, slightly
00:32:18.460 | better probably, or actually this one. So this one's a little noisier, I would say, than the
00:32:26.060 | final one. So you can kind of tell visually that it seems to be doing a little better, not
00:32:32.380 | dramatically, but better. So, let's see. And then I chose this donor,
00:32:40.780 | this one, and this one, so TSP1, 2, and 14, because they have good representation across all the cells, and if
00:32:49.340 | you look you'll see that basically there aren't a lot of missing cell types, although
00:32:55.900 | this cell type is only in 30, so the classifier is probably not going to do well on 30,
00:33:01.340 | and maybe this type as well. What is that? That's this ecto-epithelial cell. So
00:33:11.260 | then I basically just run training in a loop. If you look
00:33:19.980 | at what I did, it was leave-one-out, and then I'm looking
00:33:29.580 | at three different algorithms, which are random forest, XGBoost or LightGBM actually, and maybe one
00:33:39.580 | other. And then, blah blah blah, lots and lots of compute, and then
00:33:48.860 | here's some tabular data, and eventually you get to this, and the other plot that you get to
00:33:56.300 | is maybe... well, okay, so I did this. I don't think you really care about the details a lot, but
00:34:05.420 | okay, first of all, this is how we ended up with this plot: we embedded that
00:34:10.780 | stuff. Any questions about that? This is partly about my workflow and stuff, so if you have
00:34:18.700 | workflow questions or whatever, that's great too. And so, okay, how did we do this?
00:34:28.140 | Well, again, no rocket science here, but if you look here, basically,
00:34:39.020 | for every gene we submit batch requests to OpenAI. The OpenAI embeddings
00:34:49.580 | are all old, but they have a couple of advantages. One is that they're Matryoshka embeddings.
00:34:55.580 | Are you guys familiar with Matryoshka embeddings? It's a really good concept to know about.
00:35:03.740 | Yeah, but there was one question in the chat about how you are quantifying which one is better,
00:35:10.860 | as in which metric, thinking about that last graphic. Oh, okay, this one. Yeah, so this is the
00:35:16.540 | F1 score. And this is actually an insightful question. Yeah, go ahead. I meant
00:35:24.940 | to ask: when you were looking at the UMAP graphs, you were saying that somehow the
00:35:30.140 | embeddings are better, so are you quantifying that using that F1 score, with the same
00:35:35.500 | models? Yeah, yeah. And the F1 score is a little bit... it's not great. F1 score only makes
00:35:46.780 | sense if you have all of the cell types at the same level in the hierarchy. So, way back
00:35:54.860 | at the beginning of the presentation, you have this hierarchy, and if you have
00:36:03.020 | some cell types here and some cell types there, even if you're correct you might get it, quote unquote,
00:36:09.740 | wrong. So the F1 score doesn't work super well for broad data sets, but it works well
00:36:16.140 | for the data set I'm using to test here, because in that Tabula Sapiens data set all of the cell types
00:36:23.340 | are at the same level in the hierarchy. Does that make sense? Yeah, so for broad cell types, what kind of metrics
00:36:29.420 | do you use? Sorry, when you are taking from different families, what kind of metrics
00:36:37.020 | do you use other than F1? So, F1, precision, recall... No, I meant, you said F1 is good for this
00:36:45.980 | particular data set. Oh, I see, okay. Let me answer that in a
00:36:52.220 | little bit; that's a great question, it's actually one of our most important questions.
00:36:59.180 | I will answer that. Okay, so, let's see. There's
00:37:09.900 | no magic here; most of this crowd probably knows how to do this. If you don't, I'll go over it really
00:37:14.940 | briefly: in order to embed things with OpenAI, you basically
00:37:23.420 | upload all your text to OpenAI, usually in batch. So you create these files
00:37:31.420 | and then you just create a batch job, and, for I think half the price,
00:37:37.500 | it will just happily churn away at it for up to a day, depending on the size of the data set. I think
00:37:46.060 | you can see here, interestingly, that for larger data sets the throughput is much higher, but you can see
00:37:52.460 | that for 30,000 descriptions it took me, what, not even an hour, half an hour or 40
00:38:02.140 | minutes, and it cost me a dollar or something, so it's pretty amazingly cheap.
00:38:09.500 | And because their batch API is designed for massive projects, it's
00:38:19.180 | super robust: you just send it stuff and you get back stuff in a little while, and it works well;
00:38:25.900 | they've thought through all the failure conditions and how to restart and whatever, so
00:38:30.780 | I was really pleasantly surprised at how well it works versus trying to manage all that
00:38:38.220 | yourself, and it's half the price.
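Roughly, the batch embedding flow looks like the sketch below: write one JSONL request per gene description, upload the file, and create a batch job against the embeddings endpoint. The model name and the `gene_descriptions` dict are assumptions for illustration, not necessarily what was used.

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSONL request per gene description; custom_id lets you join results back later.
with open("gene_descriptions.jsonl", "w") as f:
    for gene, description in gene_descriptions.items():   # assumed dict: gene symbol -> text
        f.write(json.dumps({
            "custom_id": gene,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": "text-embedding-3-large", "input": description},
        }) + "\n")

batch_file = client.files.create(file=open("gene_descriptions.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h",   # batch jobs run at roughly half the synchronous price
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until "completed"
```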
00:38:47.100 | Okay, so that's how we generated them. We generate all these embeddings, we try different embedding variants, and
00:38:53.500 | my hypothesis beforehand was that adding extraneous, or
00:39:00.380 | less gene-specific, information, more about diseases and pathways and drug
00:39:07.660 | interactions and whatever, would not help, or would hurt, the performance of the classifier,
00:39:14.060 | but my hypothesis was wrong: just adding more information
00:39:19.260 | about the gene does tend to help, and it helped close the gap with scGPT. And I've
00:39:28.940 | uploaded all of this to Hugging Face, so if you want to just use it for some reason you can
00:39:33.340 | just go and use it. Okay, so that's how we generated the embeddings.
00:39:44.140 | Good. Any questions? Oh, I was going to mention Matryoshka embeddings. They're this useful thing
00:39:50.700 | where there are 3072 dimensions in the OpenAI large embeddings, but they were designed in such a way,
00:40:02.940 | and you can read the Matryoshka embeddings paper to understand the details,
00:40:09.340 | that you can chop off the higher dimensions and most of the information is contained in the lower
00:40:14.860 | dimensions, so that you can adjust how much data you want to store and process versus
00:40:22.300 | what performance level you need. There are even some cases where a lower
00:40:30.220 | number of dimensions does better than a higher number of dimensions. So we ended up
00:40:39.580 | chopping the dimensionality, I'll explain why, but starting
00:40:45.980 | with the 3072 dimensions. Okay, cool. So, let's see, where are we? Sorry, someone's in the driveway.
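Truncating a Matryoshka embedding is just slicing off the trailing dimensions and re-normalizing; a small sketch, where the 256-dimension target is an arbitrary example:

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the leading `dims` dimensions and re-normalize each row to unit length."""
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# full 3072-dim OpenAI embeddings -> compact 256-dim versions for memory-bound training
# small = truncate_matryoshka(full_embeddings, dims=256)
```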
00:40:57.900 | Sorry, guys. Okay, so then the next thing that we did: now we have this,
00:41:06.860 | we kind of believe this works, we have some embeddings.
00:41:13.100 | We weren't very rigorous on this, because it's very pragmatically focused on helping
00:41:20.140 | customers get good results. Not that we shouldn't be doing evals, we absolutely have
00:41:26.540 | to be doing evals, but our focus is, okay, let's just learn how these models work before we start
00:41:33.180 | doing really rigorous evals, and I think we've gotten to the point now, and I'll discuss where, that we
00:41:38.860 | need to start doing more rigorous evals. So here, let's start here. Now let's
00:41:47.580 | load this CELLxGENE v2 data. This is a hundred million cells; it's about, I don't
00:41:55.660 | want to get this wrong, but I think it's maybe five or six terabytes of data, so it's pretty big data.
00:42:01.580 | It's not massive, but it's not small either. And one thing about this
00:42:11.100 | type of data is that it's saved in this kind of interesting file format that's like
00:42:18.380 | a file system in a file, and it allows you to store the sparse gene expression data, but then
00:42:27.500 | also all the vectors of labels and other things, and then all sorts of metadata. This is
00:42:35.020 | the standard way that you represent this data, but the consequence of this
00:42:41.340 | very flexible format is that, first of all, there's a lot of cruft data that,
00:42:49.260 | in any given circumstance, you don't really care about, or it's not cruft but it's irrelevant data,
00:42:55.580 | and it also means that different people put different names on the same things,
00:43:01.900 | and it's not super well standardized, so there's a lot of
00:43:07.580 | pre-processing you have to do. So I actually ended up writing a script to extract
00:43:16.620 | certain metadata and write it to Parquet files, so that I can do quick
00:43:25.100 | analysis on the metadata for the whole data set. There are a thousand or so files, and I
00:43:32.540 | wanted to be able to read the metadata and do statistics on them, so that we can curate a
00:43:39.180 | good training set. So I created the metadata in this file, and again, that's an open
00:43:50.140 | source project; it's right here, it's called anndata metadata, it's on GitHub, and I can share that as well.
00:43:55.900 | What it does is create these Parquet files that you can
00:44:07.740 | query with DuckDB, or load in pandas or whatever your tool of choice is.
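As an illustration of the kind of quick analysis this enables, here is a hypothetical DuckDB query over the extracted Parquet metadata; the column names and paths are made up for the example and won't match the real schema exactly.

```python
import duckdb

con = duckdb.connect()

# Cell counts per cell type across all extracted per-file metadata (illustrative columns).
per_type = con.execute("""
    SELECT cell_type,
           COUNT(*)                AS n_cells,
           COUNT(DISTINCT file_id) AS n_files
    FROM read_parquet('metadata/*.parquet')
    GROUP BY cell_type
    ORDER BY n_cells DESC
""").df()

print(per_type.head(20))
```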
00:44:15.260 | So basically we can ask: okay, what genes are being
00:44:25.340 | expressed in different files, what's the number of genes, and so on. And then I also had to do
00:44:34.620 | some post-processing here on the distribution of gene counts, so for a given
00:44:41.580 | file, it has this many genes expressed, and then the
00:44:50.300 | total number of gene expressions, and then the number of unique
00:44:57.180 | genes that were expressed. So I plot this in a manual way with three files, and
00:45:04.140 | then go and get the distribution, sorry, I'm going to skip that, across,
00:45:16.540 | sorry for the scrolling. So now this is that data I was just talking about, but this is
00:45:22.620 | not every single file but many of the files in the data set, in order to understand,
00:45:27.340 | okay, how similar are the data, visually? This is a thing I almost always
00:45:35.260 | do with large data sets: I let my eyeballs sort of understand the data and
00:45:42.620 | find the outliers in the shape of the data. So the things that I pull out from here are: okay,
00:45:48.060 | these actually seem to be pretty uniform, but there are some that are small,
00:45:53.260 | they have very tight distributions, and others are wide, but they all kind of have a
00:46:00.140 | reasonable cutoff; these ones are really heavy-tailed and I cut them off,
00:46:06.140 | so this last column is the rest. The point I'm making is: I got familiar with the
00:46:13.900 | shape of the data, and we were originally going to set some thresholds so that we throw
00:46:18.860 | away bad data, and we just realized from looking at this: we're going to train on everything;
00:46:23.820 | for any given file, we're not going to have any
00:46:30.140 | discriminatory factors that exclude data. Does that make sense? Okay, I want to make sure I'm not
00:46:39.340 | boring you guys with data analysis details, but I wanted to
00:46:45.900 | talk about how we're selecting the training set. Any questions?
00:46:51.740 | Yeah, it looks like Yikes did ask if you had any intuition on why it ended up helping, but maybe, in
00:46:58.940 | the interest of time, you only have a couple of minutes left, and I think it was a while back too, I just
00:47:02.540 | missed it. Okay, yeah. Yikes, let me answer that in the chat so I can finish up, but I will
00:47:09.020 | answer it. So, okay. From this crowd's perspective, the
00:47:15.660 | interesting thing about this data set and my training process, or the data
00:47:22.300 | selection process, was that these files are in this weird format that I mentioned, and
00:47:28.140 | they're all in S3, and I had to rewrite things, because the standard tools are
00:47:37.740 | just no good at using objects directly in S3; they want to download the whole file and then start analyzing
00:47:46.540 | it, and the upshot is that you end up with a very prohibitive I/O problem. So
00:47:53.980 | what I ended up doing was developing a couple of tools in this process that were
00:48:01.100 | able to go and get file fragments, and then,
00:48:08.460 | and this is where LLMs really shine, I ended up writing an integer linear program
00:48:14.860 | to select the best fragments from the files based on their coverage of the
00:48:22.060 | cell types that I wanted. So there's a whole bunch of work there; the vast majority
00:48:28.300 | of the training work that I did was actually just figuring out how to get the training
00:48:34.700 | set out of the 100 million cells in such a way that it wasn't prohibitively expensive.
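A rough sketch of that integer linear program with PuLP: choose as few file fragments as possible while covering every target cell type with at least some minimum number of cells. The data structure, threshold, and objective here are illustrative, not the exact formulation used.

```python
import pulp

# cells_per_type[fragment][cell_type] = number of cells of that type in the fragment (assumed)
def select_fragments(cells_per_type, target_types, min_cells_per_type=10_000):
    prob = pulp.LpProblem("fragment_selection", pulp.LpMinimize)
    pick = {frag: pulp.LpVariable(f"pick_{i}", cat="Binary")
            for i, frag in enumerate(cells_per_type)}

    # Objective: use as few fragments as possible (could instead weight by fragment size).
    prob += pulp.lpSum(pick.values())

    # Coverage constraints: enough cells of every target type across the chosen fragments.
    for cell_type in target_types:
        prob += pulp.lpSum(cells_per_type[frag].get(cell_type, 0) * pick[frag]
                           for frag in cells_per_type) >= min_cells_per_type

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [frag for frag in cells_per_type if pick[frag].value() == 1]
```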
00:48:41.980 | So I got down to about five million cells, and then I ended up
00:48:57.740 | building the subset. We were going to use XGBoost for the classifier, but
00:49:09.980 | XGBoost has problems because it doesn't scale well to large data sets
00:49:18.140 | that don't need a corresponding amount of processing; you can scale it in
00:49:25.580 | various ways, but none of the ways that it scales match this data set very well, and if people are
00:49:32.380 | interested I can explain why. So we ended up using a multi-layer perceptron, again
00:49:40.380 | loading only the data we need, being very selective about the data we load into memory, using
00:49:49.180 | Optuna to tune hyperparameters, and we ended up with this classifier that is
00:49:56.380 | stored in Weights & Biases and can classify our test set, 100,000 cells, in about
00:50:03.900 | 15 seconds. So the inference is super efficient at the end, but it took a lot of
00:50:11.820 | work to get there. Let me just share on Weights & Biases really
00:50:19.020 | quick; in fact, I think it's referenced from here, so let me just grab it. Here we go.
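For flavor, here is a small sketch of Optuna-style hyperparameter tuning for an MLP on the truncated cell embeddings. The actual network, search space, and data loading were different; `X_train`, `y_train`, `X_val`, and `y_val` are assumed to exist, and the real runs were logged to Weights & Biases.

```python
import optuna
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier

def objective(trial):
    width = trial.suggest_int("width", 128, 1024, log=True)
    depth = trial.suggest_int("depth", 1, 3)
    clf = MLPClassifier(
        hidden_layer_sizes=(width,) * depth,
        alpha=trial.suggest_float("alpha", 1e-6, 1e-2, log=True),
        learning_rate_init=trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        max_iter=50,
    )
    clf.fit(X_train, y_train)   # truncated cell embeddings + cell-type labels (assumed)
    return f1_score(y_val, clf.predict(X_val), average="macro")

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```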
00:50:30.940 | So, if you guys are not aware of Weights & Biases, you definitely should be using it.
00:50:37.660 | Maybe the final thing I'll mention is that I ended up learning a lot about
00:50:47.020 | training models in this process; I don't have a ton of experience with neural networks. The
00:50:55.900 | interesting thing: you can look at these two different training runs, one is the blue one and
00:51:01.420 | the other is the orange one, or yellow one. I won't talk so much about my metrics;
00:51:10.700 | actually, I also want to talk about the metric that we're developing, so I'll be really brief on this,
00:51:15.980 | but you can see that this one keeps having these loss spikes and never really gets to a point
00:51:21.820 | where it can converge before it has another loss spike. Let me make this larger.
00:51:27.260 | It keeps going down, and then, oh shoot, another loss spike, and this
00:51:31.660 | is a periodic thing, and what it corresponds to is different data sets.
00:51:38.860 | The reason it's doing this is that the distribution in the data sets is different, so
00:51:44.380 | whenever it gets to a new data set it's like, oh crap, I'm mispredicting everything,
00:51:51.980 | and it has to restart from there, and then it
00:51:59.980 | gets down to a certain point, and it doesn't have enough generalization.
00:52:06.300 | You can see it generalizes a little bit, in that these loss spikes get a little less
00:52:12.780 | bad and it gets a little bit lower, but the neural network doesn't really smoothly improve. So what
00:52:19.500 | we did: this was using the 3072-dimensional embeddings, and we said, okay,
00:52:29.740 | just chop them down so everything can fit in memory, so we can easily shuffle the data set around, and then
00:52:35.420 | just shuffle the entire data set and pick random sub-samples for my batches, and then suddenly
00:52:42.060 | the loss is dramatically better. These other measures
00:52:48.860 | are not super important. I'll stay on for a few minutes after and
00:52:58.380 | answer questions, but I want to answer the question about the metric.
00:53:05.180 | The metric that we used, let's see, it's written out somewhere here,
00:53:11.660 | yeah, here. If you look at the macro metrics, like
00:53:21.580 | standard macro F1 score, they're total garbage; the F1 score here is nine percent,
00:53:31.660 | and precision and recall are equally bad. The reason is that
00:53:39.820 | even if I'm predicting correctly but I'm in the wrong place in the hierarchy, I'm still mispredicting.
00:53:46.140 | So these are ranking metrics that are traditionally used in information retrieval,
00:53:54.380 | like: is my correct answer in the top ten, top five, top two,
00:54:04.860 | and these others are basically weighting how far from the top the correct answer is.
00:54:11.660 | Those do better, but I find them very unsatisfying, because, one, in
00:54:19.580 | practice that doesn't help me; I'm doing science, I don't want to know
00:54:24.940 | that it's one of these five, I want to know what my best guess is. And the other thing is that
00:54:31.100 | it doesn't take the hierarchy into account very well. Where did that slide go?
00:54:38.940 | It doesn't take this hierarchy into account well. So I think we're coming up on basically a
00:54:47.180 | hierarchical metric that we want to use, which is basically: how far in this hierarchy am I from
00:54:55.180 | the correct answer?
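One simple way to get a metric like that is to treat the Cell Ontology as a graph and score a prediction by how many edges separate it from the true label. A hedged sketch with obonet and networkx; the ontology URL, term handling, and unweighted edge count are illustrative, and the real metric may weight edges or levels differently.

```python
import networkx as nx
import obonet

# Cell Ontology as a graph; obonet returns a directed graph of ontology terms.
onto = obonet.read_obo("http://purl.obolibrary.org/obo/cl.obo")
undirected = onto.to_undirected()

def hierarchy_distance(predicted: str, true: str) -> int:
    """Number of ontology edges between two Cell Ontology term IDs (0 = exact match)."""
    return nx.shortest_path_length(undirected, predicted, true)

# Predicting a direct parent (e.g. a CD16-positive NK cell called a mature NK cell)
# scores low, while predicting something like a muscle cell scores much higher.
```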
00:55:03.340 | Okay, this is kind of biology-related, so probably not that interesting for this crowd, but I thought this was really fascinating, because this is very much a
00:55:11.340 | process of discovery of the problem: I'm learning about the problem as I learn about the data, as
00:55:18.620 | I learn about the specific model. So the process here was really exciting to me, and I wanted
00:55:25.660 | to share it with you guys. Did I answer the question of the woman who was asking? Yeah, okay, good. All
00:55:35.260 | right. And I'm curious, have you ever dealt with something like this,
00:55:40.780 | and if so, what did you do? I haven't worked with bioinformatics data at all. Okay, but
00:55:48.700 | what about hierarchical data in general? Not really, no; I've worked with other kinds of data,
00:55:54.780 | and I don't see any hierarchical kind of connections in that. Yeah. I also need to ask:
00:55:59.580 | how important do you think having domain knowledge in omics is to get into this field,
00:56:05.020 | to do this? I think that having an expert to ask questions and guide the decision-
00:56:12.220 | making process is pretty critical, I guess; the intuition for what will work, what's important,
00:56:19.260 | those things, you need an expert for, but I don't think you need to have formal training. I don't. So I
00:56:28.300 | have a friend who is a very experienced bioinformatician who's done drug
00:56:33.500 | development and other related stuff his whole career, so we work
00:56:39.980 | together on this, and I bring the AI, he brings the bio, and we end up working really well
00:56:45.260 | together. Yeah. And do you have any suggestions for quick exploratory projects following up on
00:56:51.740 | your work right now? I mean, I'd love to work together on something; I can definitely
00:57:03.340 | come up with specific stuff. Why don't you just DM me and we can talk about ways that you
00:57:10.060 | could help. I could think about an answer if what we're doing is not super appealing but
00:57:15.500 | you want something else in the same domain or whatever; I can probably come up with something for
00:57:19.900 | that too. Well, thank you. Awesome. Yeah, okay, good. So I guess that's probably all I have time for. Do you
00:57:29.820 | guys want to... are there any other questions? I'm happy to stay after and talk, but I want to
00:57:39.260 | at least have an informal ending there, or a formal ending, I guess.
00:57:48.940 | Okay, cool. Well, thank you guys for joining; I hope that was helpful and interesting.
00:57:57.420 | Please, if you have questions or follow-ups or need links or whatever, just hit me up in Discord,
00:58:06.300 | and I will share what you need and answer questions, and I would love to collaborate and all that.
00:58:16.860 | Okay, take care, guys. And I guess, Yikes or Flow or whomever, we're starting to
00:58:25.580 | publish these on the Latent Space TV, right? I have a bunch of recordings that I'm overdue to publish,
00:58:32.460 | so yes, they will be published at some point, but if you need it sooner, just let me know and I'll upload it.
00:58:38.220 | I mean, after I do a talk, I like to get as much credit for all the
00:58:44.540 | work I do as possible, so I like to post it on X and LinkedIn. So I can send you the raw recording if
00:58:51.660 | you want, and then once it's on YouTube I'll send you the link too. Okay, I mean, I think I probably,
00:58:56.220 | honestly, won't do anything with the raw one, so as soon as you can get the YouTube version up, that would be
00:59:01.420 | awesome. Yeah, no, I'll do that today. Sounds good. Okay, cool, thanks. Awesome. All right, guys, thanks,
00:59:08.220 | take care, bye.