
[AI in Action] Towards a Generic Cell Typer — RJ Honicky, PhD with MIRAMICS


Transcript

Let me do that real quick. This is my friend's project: he has a consultancy, he's trying to stand up some AI work and win some big contracts, and he needed some help, so I'm helping him with it. It's all open source, and we intend to keep it that way; we're just going to make money on consulting.

Okay, this is actually a good slide for everyone. Cell typing. Let's back up. One thing that has become pretty prevalent in biology over the last five years or so is what's called single-cell transcriptomics, which means you can sequence the RNA, not the DNA but the RNA, that is being expressed in single cells, one cell at a time. A good way to think about it: DNA is like the copy of the software on your hard drive, and RNA is like the running software in memory. It's not a perfect analogy, but the RNA is basically what the cell is doing right now, as opposed to all the proteins the cell could potentially express.

There are various technologies, but they all boil down to the same thing: for a given cell you have a vector over roughly 50,000 genes, and you count how many transcripts of each gene you can detect in that one cell. These matrices are generally very sparse; in any given cell only about one to two percent of the genes are being expressed at the time of measurement. The process is also highly dependent on the experimental setup, so it's very subject to what are called batch effects: the person running the experiment spent a little longer during the sequencing step, or used a different brand of reagent, or whatever. There's just a lot of randomness in the collection process, and these batch effects cause a lot of headaches.

So we want models that can account for batch effects, extract the gene expression information, and figure out, for this exact cell, what type of cell it is. That could be as coarse as "is this a neuron or a muscle cell," or as specific as "I know this is a neuron because it came from a chunk of brain, but I don't know what type of neuron." Basically, typing the cell. And this is upstream of a ton of really important biological experiments and diagnoses.

We've identified this as a really core problem. There was tons of academic research on it even a couple of years ago, and people got to the point where they're fairly happy with the solutions from an academic standpoint, but it really hasn't made its way into practice. These models are pretty tough to use; probably not for you, but for a biologist they're tough. So we're focused on wrapping this up into something that is really easy for people to use. That's the point of this, and I'll give a little more detail on that.
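To make that data shape concrete, here is a minimal sketch of a single-cell count matrix; the sizes, sparsity, and count range are placeholders matching the rough numbers above, not any real dataset.

```python
import numpy as np
from scipy import sparse

n_cells, n_genes = 1_000, 50_000          # ~50k possible genes per cell
rng = np.random.default_rng(0)

# Roughly 1-2% of genes are detected per cell, with small integer counts.
counts = sparse.random(
    n_cells, n_genes, density=0.015, format="csr",
    random_state=0, data_rvs=lambda n: rng.integers(1, 20, size=n),
)

print(counts.shape, f"{counts.nnz / (n_cells * n_genes):.1%} nonzero")
```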
So, like I said, the state of the art is either pretty coarse-grained or really specific to particular types of cells. On the right-hand side is a hierarchy, an ontology: there's a database out there that the datasets we use are normalized to. The problem is that if your data is labeled at a coarse level, say "lymphocyte," and you predict "CD16-positive," is that an incorrect prediction, or how correct is it? It's not a simple classification problem. And of course the labels themselves can be wrong. If the prediction is a different kind of mature natural killer cell, how wrong is that really? It should be treated differently from predicting, I don't know, a muscle cell.

Okay, so cell typing. You guys already know this part: scGPT is a transformer-based model. Without getting into too much detail about how it works, this is not sequence data in the usual sense. With nucleotide sequences you have the four bases (A, T, C, G), they run in sequences, they form double helices, and so on. This is not that. For one cell you have a vector of 50,000 possible genes and the number of times you saw each gene. (A gene itself is a string of anywhere from a couple thousand to a couple hundred thousand nucleotides in a particular order; that's what defines a gene.) You just get these counts, and because it's a matrix, it doesn't matter what order the genes are in, so you don't have naturally sequential data. The authors did some work so they could express the data in something like an autoregressive model. It's a little unclear to me why they wanted to do that instead of an encoder-only model, but it does perform well. I think there's some legitimate controversy about whether the gains come from the other things they did and whether a BERT-style model would have been the better choice. Anyway, for what it is, it's a more or less autoregressive model, similar to GPT, and it was designed explicitly to remove these batch effects, which is going to matter in a minute. It does pretty well, but the evaluation datasets are small, specific types of cells. They're not doing "give me all 700 or so cell types," and they don't do very well even on a big subset of that.

I know this is a whirlwind tour and I don't want to dwell on anything, but I do want to make sure people get the gist. Any questions so far? Okay, cool. So let's talk about what I did. One other piece of background: there's this other model, and this is really why I wanted to share this, because I think it's a super cool use of embeddings. It's a little counterintuitive that it works so well, or maybe some of you will say "yeah, obviously," but I think it's awesome how well it works.
It's actually surprisingly great, really simple to implement, and super efficient, so I wanted to share it with you, along with the exploration I've done, a small model we've built on top of it, and where we're going with it.

This is GenePT. It's built by taking, for every one of my 50,000 or so genes, a text description of that gene. You start with the NCBI Gene database, and for every gene you can get a description: this gene is used for creating these kinds of proteins, and so on; a small summary of what the scientific literature says about that gene. You take that description and pump it into the OpenAI text embeddings (I didn't build that part), and you get out an embedding: a vector of somewhere between 1,024 and 3,072 dimensions depending on the model. These are, by the way, Matryoshka embeddings, so you can chop off the higher dimensions and they still work well.

Then, for a given cell, you look at the gene expression count for each of those genes and multiply the gene's embedding by that scalar. If in this particular cell I saw six transcripts of one gene, I take that gene's embedding and multiply every component by six; if I saw a hundred of another gene, I multiply its embedding by a hundred; and so on. You add all of those up, normalize to unit length, and you end up with an embedding of that cell based on its gene expression.

I want everyone to understand that, because I think it's obvious if I explained it right, but maybe that wasn't a very good explanation, so let me make sure people are on the same page. Any questions?
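In code, the construction described above is roughly the following. The function and variable names are mine, not from the GenePT codebase; this is just a sketch of the expression-weighted sum idea.

```python
import numpy as np

def cell_embedding(counts: dict[str, float],
                   gene_embeddings: dict[str, np.ndarray]) -> np.ndarray:
    """Expression-weighted sum of per-gene text embeddings, L2-normalized."""
    dim = len(next(iter(gene_embeddings.values())))
    vec = np.zeros(dim)
    for gene, count in counts.items():
        emb = gene_embeddings.get(gene)
        if emb is not None:          # skip genes with no description/embedding
            vec += count * emb
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# e.g. a cell that expressed 6 copies of one gene and 100 of another
# (hypothetical symbols):
# cell_vec = cell_embedding({"GENE_A": 6, "GENE_B": 100}, gene_embeddings)
```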
I haven't been watching the chat, so let me check it now. Okay, a question: where do you get the gene expression data from? Fair, I haven't covered that. There are two databases. One was compiled by the Chan Zuckerberg Initiative, called CELLxGENE: they collected about 100 million cells, roughly a thousand files that sum to about 100 million samples of this kind of expression data. It's publicly available, it's probably referenced in one of the links here, and I'll share the slides when I'm done; anyone can download it. This company, MIRAMICS, also compiles these datasets, so we have access to larger, non-public data as well, but everything I'm talking about here anyone can download. Then there's another dataset that's very carefully curated to be a good benchmark for single-cell work. It's not actually a subset; it's held out from that dataset, and you're supposed to use it for benchmarking your algorithms, not for training. That one is called Tabula Sapiens, and both of those come from the Chan Zuckerberg Initiative. If you don't see a link here and you want to know how to download it, I can share that. Does that answer your question? Great.

Another question: do you have an intuitive way to think about the calculation on this slide, like is it projecting from one space to another? So, each gene is represented by a direction in a high-dimensional space, those 3,072 dimensions. Each gene will be somewhat, but not completely, orthogonal to the others: there's a general "describing a gene" direction that they all share, and then each one points in a slightly different direction. When you add them all up and normalize to one, you're getting something like the average expression-weighted direction of all those genes. The interesting thing is that because it's a text embedding, if the description says something like "experiments show this gene is related to cancer," there's a bit of a cancer direction in there too, and likewise drug interactions and so on. So there's a sense in which the direction the cell points is the aggregate of the scientific literature about all the genes expressed in that particular cell. It's a little abstract by its nature, but that's how I think about it. And when you multiply by that green-and-blue matrix from the previous slide, the big block of gene information, what you get out for a specific cell is all of that gene information, weighted for that cell. Exactly.

And it actually turns out that this technique in some cases exceeds the performance of, for example, scGPT, which is kind of incredible, because the scGPT transformer was trained on actual expression data, where presumably there's a lot of relationship information between genes, whereas here we're just taking text embeddings and building a new embedding space from them and seeing how well we do. You can see in this UMAP (I have a few more in my notebooks, which I want to make sure I get to) that the clustering is capturing the cell types: the colors are cell types, and they do land in different clusters, so that's promising. And since this is a Latent Space event, we have to have a graph of performance.
In a nutshell: on some of these datasets scGPT does better, on some GenePT does better (I'm not going to go into the differences between the individual datasets), and on some the ensemble of the three does best. One important thing is that even when the ensemble isn't on top, it's always pretty close to the top. What that tells us is that the ensemble will probably do better overall. Also, interestingly, sometimes scGPT does significantly better on accuracy, precision, recall, and F1, and sometimes GenePT does significantly better than scGPT, so they're capturing different information, and that's why an ensemble is a great approach: you can hopefully get close to the best of both worlds.

Okay, let me explain this plot a little, and then I'm going to dive into some notebooks, because this is AI in Action, not a paper club. What we did was add to those gene descriptions. In the original paper they basically just use the descriptions from the NCBI Gene database. We got GPT-4 to give us more information about tissue type, drug associations, and pathways, to improve the embedding. So we generated text for each gene using a prompt, did that across all 50,000 or so genes, and then embedded the full descriptions.

This isn't super well labeled, but for the test sets: this smaller dataset has around 20 donors of cells, and I picked three that I thought would make good test sets. Then we do leave-one-out: train on all the donors except one, test on the one left out, and repeat for each donor. Two things are noteworthy here. One is that the red line is the original embedding and the others are the improved embeddings, which progressively add more information to the prompt, and it does actually seem to help; it continues to get better in many cases, though maybe not this one. The other thing to note is the comparison with scGPT: in some cases our approach does significantly better, which we knew; in this case it doesn't, but if you weight the mean by the number of cells, then for this donor we actually do significantly better than scGPT, and even in the unweighted case it's a little better.
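A rough sketch of what that description-enrichment step can look like. The prompt wording, model name, and helper function here are illustrative rather than the exact ones we used; the idea is just to ask a chat model to add tissue, pathway, and drug-association context to the NCBI summary before embedding it.

```python
from openai import OpenAI

client = OpenAI()

def enrich_gene_description(symbol: str, ncbi_summary: str) -> str:
    """Expand an NCBI gene summary with tissue, pathway, and drug context
    before embedding. Prompt text and model choice are hypothetical."""
    prompt = (
        f"Gene: {symbol}\n"
        f"NCBI summary: {ncbi_summary}\n\n"
        "In a short paragraph, describe the tissue types where this gene is "
        "typically expressed, the pathways it participates in, and any known "
        "drug associations."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return ncbi_summary + "\n\n" + resp.choices[0].message.content
```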
So that's that part. Now I want to talk a little about how I did this, so I'm going to switch over to a couple of notebooks. And since this is AI in Action, I also want to talk about workflow. I'm curious whether anyone else is using notebooks in Cursor or another VS Code-based editor, and what your workflows with notebooks look like. I have a love-hate relationship with them. I love that they're so flexible, I love that I can flip back and forth between cells and render output inline, so for exploratory work they're awesome. Even for some pipeline work they're amazing, because you can see plotted output for each stage of the pipeline. But they're also total garbage for rigorous software development, because they basically have global state, and you're programming with essentially no concept of isolation between functions. So let me pause and ask: is anyone else using notebooks, and if so, what do you do? Any takers? Is my mic working? Okay.

One answer from the group: Jeremy Howard has a whole workflow for working with Python notebooks where they do production-level work with them. I haven't adopted it, but they produce good software over at fast.ai; it's nbdev, if that's something you want to look into. I really respect what they do over there. I think it's a cool way to develop, but I don't personally use it. I've done the fast.ai course, at least one of them. Does he have a specifically "here's how I do notebooks in production" kind of thing? Yes, nbdev is what they build on top of, and if you look at FastHTML, it's built with it. Okay, I'm definitely going to check that out, thank you. I know slono hates it. I can't get used to it, if I'm being honest; making that switch is tough. But it's a cool way to develop if you've adapted to it, and nbdev has some nice primitives: you develop in the notebook and then your docs are auto-generated, the README is auto-generated, I think even the docstrings get auto-generated. There's a bunch of machinery in there to make it easy to go from notebook to actual production code. I haven't done enough of a deep dive to say I'm ready to switch, but I respect it, and it shows there's another way it can be done. Great, I love Jeremy Howard and all that stuff, so I'll definitely check it out.

Anyone else? Yeah, I hate them. I've worked a lot with material scientists, chemists, and mechanical engineers who would produce these absolutely horrid IPython notebooks, and then it's "can you push this to prod?" and, well, no. One technique I developed is to keep a Python file alongside the notebook that you import into it: anything that's stable enough gets moved over into the package and then reused from the notebook. The other one I often used is to replace the notebook with a script that plays the same role, something that isn't meant to be a library of reusable functions, but with decorators added to cache the expensive-to-compute data, so the first time you run it it does the work, and after that you can always re-run the whole script. You start from a fresh Python process, with no objects left over from 14 cells ago that are half gone, and people would develop on that. It looks kind of like a notebook and it gets you to the part you're working on as fast as a notebook, but it's actually reproducible. Those were two techniques I used.
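A minimal sketch of that cache-to-disk decorator idea; the cache path and the decorated function are just placeholders.

```python
import functools
import pickle
from pathlib import Path

def cache_to_disk(path: str):
    """Cache an expensive step's return value on disk, so re-running the
    whole script from a fresh interpreter is cheap after the first run."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            p = Path(path)
            if p.exists():
                return pickle.loads(p.read_bytes())
            result = fn(*args, **kwargs)
            p.parent.mkdir(parents=True, exist_ok=True)
            p.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator

@cache_to_disk(".state/embeddings.pkl")   # hypothetical cache location
def compute_embeddings():
    ...  # the expensive step lives here
```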
Where does the decorator save the state, onto disk? Yeah, just to disk, something like a .state file as JSON; I'm sure there are more elaborate options, but it's persistent storage, so when you re-run it's still there. The other thing I really like about notebooks, and maybe you have a thought on this too, is the ability to have rich output in between cells. We use Databricks, and some of our pipelines are written as notebooks, and the nice thing is that you can see the output of each intermediate step, a plot or a table, so when I go back and review because of a failure I can just look at the plot and see the intermediate state. There's also this lightweight Python thing where it's a plain .py file but you use a percent-percent marker, a syntax that eludes me right now, and when you open it in VS Code it looks just like a notebook; I think it can even render graph output if you have a matplotlib plot in there. But it's still a .py file, so you don't have the .ipynb JSON monstrosity lying around. I found that pretty useful.
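For reference, the syntax being described is the `# %%` cell marker, which VS Code's Python extension (and similar tools) recognize in a plain .py file. A tiny example, with a hypothetical file name:

```python
# analysis.py -- a plain .py file that VS Code treats as notebook-like cells

# %%
import pandas as pd
df = pd.read_parquet("metadata.parquet")   # hypothetical input file

# %% [markdown]
# Intermediate results render right below each cell when run interactively.

# %%
df["n_genes"].plot.hist(bins=50)
```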
And this was quite a long time ago, but I remember Airbnb had a notebook publishing system that would make really nice renderings of what were basically Python scripts. That's what I had the data team I was leading use; I think it's called the Airbnb Knowledge Repo. It was deprecated back then, but it looks like it's picking back up again, and it let you publish a Python script as rich, interesting output. Okay, cool. That was a really helpful sidebar for me; hopefully this is being recorded so I can go back and review anything I forgot.

As a side note, that workflow you mentioned, where useful pieces get moved out of the notebook into actual code, I've done that a lot. The inference for those GenePT cell embeddings I've codified into a set of libraries: here's how you do the inference, here are the embeddings, and it's all open source, I'll share the link. There are a bunch of utils in there that I call from my notebooks, and as things mature, or I realize something is a common operation people would benefit from, I add them there.

Okay, so where do we start? I did some EDA; let me find the right notebook. This is the one I want. As I mentioned, this is the Tabula Sapiens dataset; CZ is the Chan Zuckerberg Initiative, and there's a link here. If anyone wants to follow up on this, the easiest thing is that I'll post a link to the repo I'm presenting from and you can find all of this information there. This is an example of the embedding, sorry for the whiplash. It might even be a version of the plot from the paper, except it's interactive, and you can see again that you have these different cell types and they're reasonably well separated, so you can imagine that a classifier in this high-dimensional space is probably going to do a pretty good job, and that's indeed what we find.

That notebook produced the embeddings, and then there's this analysis one; to be clear, this is what you do after you have the embeddings. This one is scGPT, and a couple of interesting things to note: it has lots of well-isolated clusters, whereas this other one is more spread out; maybe the discrimination between classes is better in general here, but you can see it has these really messy regions that are going to be basically impossible to differentiate. These were the different versions of our embedding, and they're all roughly the same; it's hard to tell visually, but this one is a little less noisy than the others and probably slightly better. So you can kind of tell visually that it's doing a little better, but not dramatically better.

Then I chose these three donors, TSP1, TSP2, and TSP14, because they have good representation across all the cell types; if you look, there aren't a lot of missing cell types, although this one cell type is only in 30, so the classifier is probably not going to do well on it, and maybe this type as well (that one is an ecto-epithelial cell). Then I basically run a loop: leave one donor out, train on the rest, and compare three different algorithms, random forest, XGBoost (or LightGBM, actually), and maybe one other. Lots and lots of compute later, you get some tabular data, and eventually you get to this plot. I don't think you care about every detail, but that's how we ended up with it: we embedded everything and trained classifiers on top. Any questions about that? This is partly about my workflow, so workflow questions are great too.
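Concretely, the donor loop looks roughly like this; the variable names, donor ids as test sets, and LightGBM settings are placeholders, assuming you already have cell embeddings X, labels y, and a donor id per cell.

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import f1_score

def leave_one_donor_out(X: np.ndarray, y: np.ndarray, donors: np.ndarray,
                        test_donors=("TSP1", "TSP2", "TSP14")) -> dict:
    """Train on all donors except one, test on the held-out donor, repeat."""
    scores = {}
    for donor in test_donors:
        test_mask = donors == donor
        clf = lgb.LGBMClassifier(n_estimators=500)
        clf.fit(X[~test_mask], y[~test_mask])
        pred = clf.predict(X[test_mask])
        scores[donor] = f1_score(y[test_mask], pred, average="macro")
    return scores
```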
How did we do this? Again, no rocket science here. For every gene we submit batch requests to OpenAI. The OpenAI embedding models are all fairly old, but they have a couple of advantages. One is that they're Matryoshka embeddings; are you all familiar with those? It's a really good concept to know about.

There was a question in the chat about how we're quantifying which one is better, which metric, for that last graph. This is the F1 score, and that's actually an insightful question. A related one: when I was looking at the UMAP plots and saying the embeddings were somehow better, am I quantifying that with the same score and the same models? Yes. And the F1 score only really makes sense if all of the cell types are at the same level of the hierarchy. Remember the hierarchy from the beginning of the presentation: if some cell types are at one level and some at another, you can be "correct" and still get marked wrong, so the F1 score doesn't work well for broad datasets. It works well for the dataset I'm using to test here, though, because in Tabula Sapiens all of the cell types are at the same level of the hierarchy. For broader sets of cell types, what metrics do we use other than F1? That's a great question, actually one of our most important questions, and I'll get to it in a little bit.

Okay, so there's no magic here and most of this crowd probably knows how to do it, but briefly: to embed things with OpenAI in batch, you upload your text, create these request files, and then create a batch job, and for something like half the price it will happily churn away at it for up to a day, depending on the size of the dataset. Interestingly, for larger datasets the throughput is much higher. You can see here that for 30,000 descriptions it took maybe half an hour or forty minutes, and it cost something like a dollar, so it's amazingly cheap. And because their batch API is designed for massive projects, it's super robust: you send it stuff, you get stuff back a while later, and they've thought through all the failure conditions and how to restart. I was really pleasantly surprised at how well it works compared with trying to manage all of that yourself, and it's half the price. So that's roughly how we generated the embeddings.
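For reference, submitting that kind of embedding batch looks roughly like this; the model name and file layout are illustrative, and `descriptions` is a hypothetical dict mapping gene symbol to description text.

```python
import json
from openai import OpenAI

client = OpenAI()

def submit_embedding_batch(descriptions: dict[str, str],
                           path: str = "gene_batch.jsonl"):
    """Write one embedding request per gene to a JSONL file and submit it
    as a batch job (batch jobs run at roughly half price)."""
    with open(path, "w") as f:
        for gene, text in descriptions.items():
            f.write(json.dumps({
                "custom_id": gene,
                "method": "POST",
                "url": "/v1/embeddings",
                "body": {"model": "text-embedding-3-large", "input": text},
            }) + "\n")
    batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
    return client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/embeddings",
        completion_window="24h",
    )
```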
We tried different embedding variants, and my hypothesis beforehand was that adding extraneous, less gene-specific information, more about diseases, pathways, drug interactions and so on, would either not help or would hurt the classifier's performance. My hypothesis was wrong: just adding more information about the gene does tend to help, and it helped close the gap with scGPT. I've uploaded all of this to Hugging Face, so if you want to use it for some reason you can just go grab it. So that's how we generated the embeddings. Any questions?

Oh, I was going to mention Matryoshka embeddings. They're this useful thing: the OpenAI large embeddings have 3,072 dimensions, but they were designed in such a way (you can read the Matryoshka embeddings paper for the details) that you can chop off the higher dimensions and most of the information is contained in the lower ones. So you can adjust how much data you want to store and process against the performance level you need, and there are even cases where fewer dimensions do better than more. We ended up starting with the 3,072 dimensions and then chopping the dimensionality down; I'll explain why in a minute.
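A minimal illustration of that truncation; the target dimension here is arbitrary.

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dim: int = 512) -> np.ndarray:
    """Keep only the first `dim` dimensions of a Matryoshka embedding and
    re-normalize; most of the information lives in the leading dimensions."""
    truncated = emb[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)
```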
Okay, so the next thing we did. At this point we believe this works and we have some embeddings. We weren't very rigorous about it, because this is very pragmatically focused on helping customers get good results. Not that we shouldn't be doing evals, we absolutely have to, but the focus was on learning how these models work before doing really rigorous evals, and I think we've now gotten to the point, which I'll discuss, where we need to start doing them.

So let's start here: now we load CELLxGENE V2. This is a hundred million cells, and I don't want to get this wrong, but I think it's maybe five or six terabytes of data, so it's pretty big; not massive, but not small either. One thing about this type of data is that it's saved in an interesting file format that is essentially a file system in a file. It lets you store the sparse gene expression data, plus all the vectors of labels, plus all sorts of metadata, and it's the standard way to represent this data. The consequence of such a flexible format is, first, that any given file carries a lot of data you don't care about in a given circumstance (not cruft exactly, just irrelevant), and second, different people put different names on the same things; it's not well standardized, so there's a lot of pre-processing to do.

So I ended up writing a script to extract certain metadata and write it to Parquet files, so that I can do quick analysis on the metadata for the whole dataset. There are a thousand or so files, and I wanted to be able to read the metadata and compute statistics across them so we can curate a good training set. That's also an open source project; it's called anndata-metadata, it's on GitHub, and I can share that as well. What it does is create Parquet files that you can query with DuckDB, or load with pandas or whatever your tool of choice is, so we can ask: which genes are being expressed in different files, how many genes, and so on. I also had to do some post-processing on the distribution of gene counts: for a given file, how many genes are expressed, the total number of gene expressions, and the number of unique genes expressed. I plotted this manually with three files first, and then got the distribution across many of the files in the dataset, to understand how similar the data are, visually. This is something I almost always do with large datasets: I let my eyeballs understand the data and find the outliers in its shape. What I pull out from here is: these files seem pretty uniform, some have very small, tight distributions, others are wide but with a reasonable cutoff, and these ones are really heavy-tailed, so I cut them off and put everything beyond into this last "rest" column. The point is that I got familiar with the shape of the data. We were originally going to set thresholds to throw away bad data, and from looking at this we decided to just train on everything; we're not going to apply any discriminatory factors that exclude data from any given file.

Does that make sense? I want to make sure I'm not boring you with data analysis details, but I wanted to talk about how we're selecting the training set. Any questions? It looks like yikes asked whether I have any intuition on why the extra description information ended up helping; in the interest of time let me answer that in the chat so I can finish up, but I will answer it.
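A sketch of the kind of query this metadata layout enables; the column names and file glob are hypothetical stand-ins for the actual schema.

```python
import duckdb

def summarize_metadata(parquet_glob: str = "metadata/*.parquet"):
    """Aggregate per-file metadata (hypothetical columns: cell_type, n_cells,
    n_genes_expressed) without touching any expression data."""
    return duckdb.sql(f"""
        SELECT cell_type,
               COUNT(*)                AS n_files,
               SUM(n_cells)            AS total_cells,
               AVG(n_genes_expressed)  AS mean_genes_expressed
        FROM read_parquet('{parquet_glob}')
        GROUP BY cell_type
        ORDER BY total_cells DESC
    """).df()
```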
From this crowd's perspective, the interesting thing about this dataset and my training process was the data selection. These files are in that awkward format I mentioned, they all live in S3, and the standard tools are just no good at using objects directly in S3: they want to download the whole file and then start analyzing it. The upshot is that you end up with a really prohibitive IO problem. So what I ended up doing was developing a couple of tools along the way that go and fetch file fragments, and then, and this is where LLMs really shine, I ended up writing an integer linear program to select the best fragments from the files based on their coverage of the cell types I wanted. The vast majority of the training work was actually just figuring out how to get the training set out of the 100 million cells in a way that wasn't prohibitively expensive. I got it down to about five million cells and built the subset.

We were going to use XGBoost for the classifier, but XGBoost has problems: it doesn't scale well to large datasets that don't need a corresponding amount of processing. You can scale it in various ways, but none of the ways it scales match this dataset very well; if people are interested I can explain why. So we ended up using a multi-layer perceptron, again loading only the data we need and being very selective about what goes into memory, and using Optuna to tune hyperparameters. We ended up with a classifier, stored in Weights & Biases, that can classify our test set of 100,000 cells in about 15 seconds. The inference is super efficient at the end, but it took a lot of work to get there.

Let me share the Weights & Biases view really quickly; I think it's referenced from here. If you're not aware of Weights & Biases, you should definitely be using it. Maybe the final thing I'll mention is that I ended up learning a lot about training models in this process; I don't have a ton of experience with neural networks.
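A rough sketch of what that hyperparameter search can look like. This uses scikit-learn's MLPClassifier and synthetic stand-in data purely for illustration; it is not our actual training code, model, or search ranges, which were tracked in Weights & Biases.

```python
import optuna
from sklearn.datasets import make_classification      # stand-in data only
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Placeholder for the real (truncated) cell embeddings and cell-type labels.
X_train, y_train = make_classification(n_samples=2_000, n_features=256,
                                        n_classes=5, n_informative=50,
                                        random_state=0)

def objective(trial):
    clf = MLPClassifier(
        hidden_layer_sizes=(trial.suggest_int("width", 128, 1024),)
                           * trial.suggest_int("depth", 1, 3),
        alpha=trial.suggest_float("alpha", 1e-6, 1e-2, log=True),
        learning_rate_init=trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        max_iter=50,
    )
    return cross_val_score(clf, X_train, y_train, cv=3,
                           scoring="f1_macro").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```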
The interesting thing is in these two training runs, the blue one and the orange (or yellow) one. I won't talk much about my metrics here, because I also want to get to the metric we're developing, so I'll be brief. You can see this run keeps having loss spikes and never really gets to a point where it can converge before it has another one. Let me make this larger: it keeps getting lower, and then there's another loss spike, and it's periodic. What this corresponds to is different datasets: the distribution in each dataset is different, so whenever training reaches a new dataset it's suddenly mispredicting everything and has to restart from there. It does generalize a little over time, so the spikes get a bit less bad and the loss gets a bit lower, but it never smoothly improves.

This was using the 3,072-dimensional embeddings. What we did was chop them down so everything could fit in memory, which made it easy to shuffle the entire dataset and pick random sub-samples for the batches, and suddenly the loss is dramatically better. These other measures aren't super important.

I'll stay on for a few minutes afterward to answer questions, but I want to answer the metrics question now. The metric we used is written out somewhere here, yes, here. If you look at the standard macro metrics, like macro F1, they're total garbage: the F1 score here is something like nine percent, and precision and recall are equally bad. The reason is that even if I'm predicting "correctly" but at the wrong place in the hierarchy, it counts as a misprediction. So there are also these ranking metrics, traditionally used in information retrieval: is the correct answer in my top ten, top five, top two, plus others that basically weight how far from the top the correct answer sits. Those do better, but I find them very unsatisfying. One, in practice "it's one of these five" doesn't help me; I'm doing science, I want to know the best guess. And two, they don't take the hierarchy into account very well. So I think we're converging on a hierarchical metric: basically, how far in this hierarchy is my prediction from the correct answer? This is kind of biology-specific, so maybe not that interesting to this crowd, but I thought it was fascinating because this is very much a process of discovering the problem: I'm learning about the problem as I learn about the data and the specific models, and that process was really exciting to me, so I wanted to share it. Did I answer the question from earlier? Good.
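One possible shape for such a hierarchical metric (a sketch, not the metric we've settled on) is graph distance between the predicted and true terms in the ontology; the toy ontology below is made up for illustration.

```python
import networkx as nx

# A tiny stand-in for the real cell ontology: edges point from parent to child.
onto = nx.DiGraph([
    ("lymphocyte", "natural killer cell"),
    ("natural killer cell", "mature natural killer cell"),
    ("mature natural killer cell", "CD16-positive NK cell"),
    ("lymphocyte", "T cell"),
])

def ontology_distance(true_label: str, predicted_label: str) -> int:
    """Edges between the two terms, ignoring direction: 0 for an exact hit,
    small for a near-miss within the same lineage, large for a wild miss."""
    return nx.shortest_path_length(onto.to_undirected(),
                                   true_label, predicted_label)

# ontology_distance("mature natural killer cell", "CD16-positive NK cell") -> 1
# ontology_distance("mature natural killer cell", "T cell")                -> 3
```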
I'm curious, have you ever dealt with something like this? I haven't worked with bioinformatics data at all. What about hierarchical data in general? Not really, no; the data I've worked with didn't have hierarchical connections like this. I also wanted to ask: how important do you think having domain knowledge in omics is for getting into this field and doing this kind of work? I think having an expert to ask questions of, and to guide the decision-making process, is pretty critical: the intuition for what will work, what's important, those things. You need an expert, but I don't think you need formal training yourself. I have a friend who is a very experienced bioinformatician, who has done drug development and related work his whole career, so we work together on this; I bring the AI, he brings the bio, and we work really well together. Do you have any suggestions for quick exploratory projects following up on this work? I'd love to work together on something, and I can definitely come up with specific ideas, so just DM me and we can talk about ways you could help. And if what we're doing isn't appealing but you want something else in the same domain, I can probably come up with something for that too. Great, thank you.

Okay, I think that's all I have time for. Are there any other questions? I'm happy to stay after and talk, but I wanted to at least have an informal, or I guess formal, ending here. Cool, thank you all for joining; I hope that was helpful and interesting. If you have questions or follow-ups or need links, just message me on Discord and I'll share what you need, answer questions, and I'd love to collaborate. Take care. And I guess, yikes or flo or whoever, we're starting to publish these on Latent Space TV, right? I have a bunch of recordings I'm overdue to publish, so yes, they will be published at some point, but if you need it sooner just let me know and I'll upload it. After I do a talk I like to get as much credit for the work as possible, so I like to post it on X and LinkedIn; I can send you the raw recording if you want, and once it's on YouTube I'll send you the link too. Honestly, I probably won't do anything with the raw, so as soon as you can get the YouTube version up, that would be awesome. Yeah, I'll do that today. Sounds good. Okay, cool, thanks. Awesome, thanks everyone, take care, bye.