Back to Index

Stanford CS25: V5 I Multimodal World Models for Drug Discovery, Eshed Margalit of Noetik.ai


Transcript

All right, welcome everyone to CS25, I believe week eight. Today we're very excited to have Eshed Margalit, who is a neuroscientist and ML researcher working to understand biological systems with AI. He completed his PhD in neuroscience here at Stanford, where he constructed self-supervised neural networks that incorporate biologically inspired constraints to explain the structure, function, and development of primary visual cortex.

He's currently an ML scientist at Noetik, an AI-native biotech startup focused on curing cancer. In his work, he develops novel transformer model architectures and tasks that learn from large multimodal data sets of patient tumor biology and applies these models to drug discovery. As a reminder, please fill out the attendance form, and if you're online, you can ask any questions on Slido.

With that, I'll let you take it away. Awesome. Thank you for that introduction. I feel like that covered it all, so we're done here, we can wrap it up. I'm going to talk today about kind of a mix of basic ML ideas about how to do multimodal learning and a little bit of the kind of research we're doing at Noetik, where I work now.

I'm going to talk really in three parts. You can tell I'm fond of alliteration here. I'll do multimodal model madness and tell you a little bit about ideas for incorporating information streams from multiple modalities. Then I'll focus more specifically on cancer. So in that first part, you know, if you're not interested in biology or cancer, one, I'll convince you otherwise, and two, you'll get a lot out of that first part, I hope.

And then I have a few random things I threw in at the end about work in progress that might be exciting. So just to put it out there, my assumptions about this audience are that you're interested in research on transformers, given the course title, and specifically maybe novel transformer architectures that you haven't encountered before.

I'll assume that you're curious about how we're actually trying to use transformers outside of an academic context and to bring it into a clinically useful context. I will assume that you're kind of familiar with the basics of transformers in machine learning, but not necessarily experts. There will be no pop quiz about equations or formulas or architecture diagrams today.

And I will assume that you don't have any familiarity with cancer immunology. So this will hopefully be friendly to people that have never really thought about that world. But I hope you all agree that curing cancer is a worthwhile thing to do. And I'll tell you about the ways we're trying to bring these worlds together.

What you should know about me, this was covered a bit in the intro, but I did my PhD here at Stanford in neuroscience. So I'm actually not a cancer immunologist by training or background. I'm new to that field. My background is in computational neuroscience and specifically visual cortex and how the primate visual system works.

And when I was here, I mostly studied how we can build in biological constraints into, at the time, convolutional neural networks to make them more brain-like. And my broad interest is in understanding how complex biological systems get put together, how they work when they're working, and how they don't work when they stop working.

Before I go further, I should call out the other people that I work with at Noetik. We're a pretty small machine learning research team. But the stuff I'm going to show you today is due to their efforts as much as mine. And I'll do my best to call out when they contributed specific things.

But in case I forget, these are their names and faces. Okay, and the last bit of housekeeping. This is kind of my plan for roughly the next hour. I might try to go a little faster to save time for questions at the end. I'm going to try to convince you primarily that there's a lot of really exciting and creative work to be done in multimodal machine learning, especially with transformers.

And two, something that I became convinced of in the past year and a half, really, that cancer biology is just a really fantastic place to do that kind of research. But the kind of data we're generating is really unique and something that I wouldn't have considered as a substrate to do basic ML research.

For bonus points, I'll also try to convince you that we're actually making progress on drug discovery for cancer at Noetik. And then the way that I normally prefer to give presentations is I'm super happy to have clarifying questions. So please, if something doesn't make sense, it's probably I forgot a sentence or a definition.

Do let me know. If you have larger philosophical questions about why we're all here, I'd also love to talk about those. But I'll try to save a little time at the end for the more freewheeling discussion. Okay. So with that said, let me jump into my conceptual framework for thinking about multimodal model madness.

I think a unifying goal for much of AI, I won't say all of AI, but much of it is to build world models. And by that, I mean, a system that can simulate the future state of the world, conditioned on observations of the current state of the world, and also being able to simulate forward the effects that that system's actions will have on the future.

So this necessarily means you have the reality that is out there, you have sensors that collect information about that reality, you perceive it with your world model, and ideally, you're able to run a simulation of, okay, well, if I take this action, how will the world change in the future?

And that is a very powerful and generic framework for being able to reason about why the world is the way it is, how you should act, it obviously plays into planning and perception. So one thing that I think is obvious is that the world is perceived at least by us and the world models in our heads in a multimodal way.

So, you know, you have visual modality that will tell you, for example, that this roundabout that I biked through many times on my way to lab is very dangerous and chaotic when people are running late to class and panicking. I mean, but you're not just getting the visual information, you, of course, also have the audio modality, which comes from a different sensor, a microphone for a machine or ears for us, and you get an audio stream.

And often when you encounter video or image data, you could also have a text representation. Someone might have written a caption under their video saying, well, this roundabout is extremely chaotic and I'm going to try to avoid it. My thesis or my assumption is that the best world models are going to be able to incorporate all of these modalities and make decisions about what to do based not just on one of them, but all of them in a productive way.

So when you're running the world model in your head where you're perceiving the data from all of these modalities, you might have a question like, what will happen if I start walking forward into this roundabout? And because you have the combination of maybe someone warned you ahead of time, you can hear what's going on, you can see what's going on, you're able to predict that you might get crashed into by somebody on a bike who's running late for class.

So the task of that model is effectively multimodal learning. You need to know how to integrate these various information streams, which natively come in many different formats. And you need to learn from them in a way that gives you the best chance of making accurate simulations of what's going to happen in the world.

I want to make a distinction that I think is important and will come up a couple of times in this talk about two reasons you might care about combining multimodal streams. One of them is the idea that you could do multimodality as translation. You could say there's information in the visual domain, there's information in the audio domain.

And what I want to do is have a representation of the world that merges those things so that I've captured all of the information about one modality in the other modality. The other thing that you might want to do is disambiguation. So there could be information that just is not available in one modality.

And so you kind of have a blind spot, both figuratively and literally here, where if you only had a camera, you wouldn't know everything about the state of the world. And you're going to need to incorporate this other modality to get a full picture of what is happening in the world around you.

So I'll try to talk a little bit as I'm giving examples in the literature and in our work about whether I think we're doing translation or disambiguation. OK, so translation has a very familiar shape. It's about capturing all of the information in one modality and another modality. So when you're building, for example, a text to image system, what you want is for all of the information in the prompt that you're sending to the system to be reflected in the image.

You want to basically push these things together so that when I include a bit of information in the text stream, I want it reflected back to me in the image stream. And when I talk about disambiguation, what I mean is that you need to piece together information that is only going to be available in one modality or another.

So for example, let's say you're approaching this building and you see everyone sprinting out of it. Now there's some ambiguity here. There are actually many reasons why people could be running out of the building. Now, if you combine that with your sense of hearing, you might hear that there is a fire alarm going off inside.

And that rapidly disambiguates which state of the world you're in, that there's a fire and everyone's running away from it. But you could also have a second scenario where you hear an announcement that there's free boba for the first 10 students that get to it. And now everyone is sprinting out of the building and your world model should update, right?

If you want boba, you need to turn around and go one direction, maybe to where the announcement is saying to go. If there's a fire, you know, you head to the nearest evacuation point or whatever it is we're all supposed to do if that happens. OK, the next few slides will be kind of a tour that I will acknowledge is both brief and incomplete about how people have tried to incorporate multimodal streams in the literature.

And I'm really going to go over five things here. You don't have to memorize them now. I'll go through them one by one. But one thing that I want to call out is when people talk about multimodality or merging streams from one modality and another, there's often this nomenclature of whether you do an early fusion or late fusion.

And what people usually mean by that is are you bringing the modalities together kind of as soon as possible and then running your processing forward from there? Or do you process each information stream separately and then bring them together at the very end when you actually need to make a decision about the state of the world and your actions in it?

So I have to admit that this framework doesn't help me think about this space too much, but I've tried to map onto these five things where I think they fall. And there's kind of a range here from super early fusion to what I would think of as late fusion and everything in between.

So we'll see if you agree with me as I go through it. OK, the first thing I'm going to talk about is learning joint embedding spaces. So a model that many of you might be familiar with is the CLIP approach for combining the modalities of image and text. The primary idea here is to do contrastive learning where you have two encoders, an image encoder.

Let's see if my mouse shows up. You have an image encoder that pushes images down into an embedding. And you have a text encoder that takes corresponding captions and pushes those down into an embedding. And the objective for the model is to make the embeddings for corresponding image and text pairs as similar as possible, while pushing away the representations of mismatched pairs.

For example, the similarity between the embedding for image two and caption one should be as low as possible, while increasing the similarity between the embedding for image one and text one. And this is kind of explicitly trying to learn a single joint representation space, where if you pick a point in that space, there should be kind of an image that corresponds to that point, and also a text sample that also corresponds to that point.

And that way you can kind of translate between the text modality and the image modality. You can see this explicitly in the paper. On the right of this figure, you can see that what they're doing is taking an image, figuring out what the embedding of that image is with the image encoder, and then searching over many possible text inputs that will bring you as close as possible to that point in the representation space.
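
To make the contrastive setup concrete, here is a minimal PyTorch-style sketch of a CLIP-like symmetric contrastive loss. This is an illustration rather than the actual CLIP code; the encoders, batch pairing convention, and temperature value are assumptions.

```python
# A minimal sketch of a CLIP-style symmetric contrastive objective (PyTorch).
# Not the actual CLIP implementation; the temperature and shapes are placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) outputs of separate image/text encoders,
    where row i of each tensor comes from the same image-caption pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal; all mismatched pairs act as negatives.
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)      # images -> captions
    loss_t2i = F.cross_entropy(logits.t(), targets)  # captions -> images
    return (loss_i2t + loss_t2i) / 2
```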

So I would consider this actually relatively late fusion, because you're learning the image encoder pretty much separately from the text encoder. Those don't interact until there's pressure through the loss function to make those embeddings as similar as possible. I would say kind of like, I don't want to call it the grown-up version of this, but an expanded version of this is the ImageBind approach.

And if people aren't familiar with this, it is still a contrastive learning objective between modalities. But instead of just doing images and captions, it uses images as kind of an anchor and learns pairwise contrastive objectives between images and text, images and depth maps, and images and audio. And the underlying objective here is the same, that you want a single representation space where it doesn't matter which modality you came in through.

Once you know where you are, you can kind of back out what all the modalities should look like. And in fact, in this paper, they do some interesting demonstrations of that space by doing things like embedding space arithmetic or prompting with one modality and seeing where you fall in that space on other modalities.

Okay. The other option that I have done occasionally but I think is not popular for technical reasons is super, super, super early fusion, where you can just directly concatenate raw inputs. So consider if you had, for example, a three-channel RGB color image, and you also had a depth image.

What you could do is just staple on the depth image as a fourth channel and then train a convnet or train your vision transformer and pretend that you had four channels all along. And that is a totally valid way to incorporate information from multiple modalities. The reason that I think it is not super popular is it requires that there's something to concatenate on.

So people typically will first kind of project raw inputs down to tokens and then concatenate token streams, which I'll talk about in a second. But if you want to jump really, really, really to the front of that encoding stack, sometimes you have your height and width dimensions on images line up and you can just concatenate directly.
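
As an illustration of that "super early" fusion, here is a minimal sketch of concatenating a depth map onto an RGB image as a fourth channel; the toy shapes and conv stem are assumptions, not any particular model.

```python
# A minimal sketch of early fusion by channel concatenation (PyTorch).
# Shapes and the toy conv stem are illustrative assumptions.
import torch
import torch.nn as nn

rgb   = torch.rand(8, 3, 224, 224)   # batch of RGB images
depth = torch.rand(8, 1, 224, 224)   # spatially aligned depth maps

# Staple the depth map on as a fourth channel; downstream layers just see 4 channels.
x = torch.cat([rgb, depth], dim=1)   # (8, 4, 224, 224)

# Any model whose first layer expects 4 input channels can consume this directly,
# e.g. a conv stem (or a ViT patch embedding built with in_chans=4).
stem = nn.Conv2d(in_channels=4, out_channels=64, kernel_size=7, stride=2, padding=3)
features = stem(x)                   # (8, 64, 112, 112)
```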

Okay, the third option that you could do is cross attention. So now we're getting much more transformer specific. The idea here is you have two token streams, one from each modality. In the example from the Lu et al. 2019 paper, you have one token stream that comes from images, and then you also have a second token stream that comes from text.

So those are your two modalities. You can have as many self attention layers as you want within each stream. But then you can also have these cross attention layers, which are shown here in the diagram, where you're basically giving each modality the chance to be kind of the primary.

The primary will generate the queries in the attention operation, and the secondary stream will provide the keys and values. And because that is an asymmetric procedure, you will often see these cross attention layers duplicated on either side of the processing stream. So that in one example, the images get to provide the queries, and in the other side of it, the text stream gets to provide the queries.

I broke this out just because when I was first implementing this, it was helpful to think about. But if you look at what is actually happening in an attention layer, you have your input x, you will have a linear projection to your queries q, your keys k, and your values v.

And then those go into your scaled dot-product attention operator, and then linear projection and dropout. All you're doing in a cross attention layer is you're taking input from another stream y, and you're using that to generate the keys and values. But everything else stays as is. This gives you a chance to have the two information streams mingle somewhere in the middle of the network.
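
Here is a minimal PyTorch sketch of that cross-attention layer: queries from the primary stream x, keys and values from the secondary stream y. The head count, dropout rate, and dimensions are assumptions, not the implementation from any specific paper.

```python
# A minimal sketch of a cross-attention layer: queries come from stream x,
# keys and values come from stream y; everything else matches self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8, dropout=0.1):
        super().__init__()
        self.num_heads = num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, y):
        """x: primary stream (B, Nx, D); y: secondary stream (B, Ny, D)."""
        B, Nx, D = x.shape
        h, d = self.num_heads, D // self.num_heads
        q = self.q_proj(x).view(B, Nx, h, d).transpose(1, 2)           # from x
        k = self.k_proj(y).view(B, y.size(1), h, d).transpose(1, 2)    # from y
        v = self.v_proj(y).view(B, y.size(1), h, d).transpose(1, 2)    # from y
        out = F.scaled_dot_product_attention(q, k, v)                  # (B, h, Nx, d)
        out = out.transpose(1, 2).reshape(B, Nx, D)
        return self.dropout(self.out_proj(out))

# Because the operation is asymmetric, a co-attention block typically instantiates
# this twice: once with images as x and text as y, and once with the roles swapped.
```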

So I would call this intermediate fusion, depending which layers you're doing it in. Okay, the fourth option which I alluded to when I was talking about concatenation is you can just slap tokens onto a stream. I think one thing that has really resonated with me is, as we move into using transformers in more and more domains, the fact that everything is just tokens all the way down makes it really easy to add more tokens to a sequence, and they can come from other modalities if you'd like.

One way that this happens all the time, but I think we don't really think of it as multimodal learning is, if you have, for example, a vision transformer, you might add on, in this case you're prepending a class token that really is like, you could think of it as text, I guess, because the class is literally a text label.

But you have an embedding for the text, and that just goes and participates in the stream of tokens, and participates in the self-attention operations, and there you go, you've integrated a text label with your image data. The other place you see this is actually in the DALL-E 1 implementation. So the way that this worked is you have text tokens, which are captions for images.

You also have image tokens, which in this case come from learning a code book with a discrete VAE, and you concatenate on the image tokens after the text tokens, and then at inference time you can provide only the text tokens and autoregressively predict what the image tokens should be.
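
A minimal sketch of that concatenate-the-tokens idea is below: once each modality has been projected to tokens of a common width, one transformer can attend over the combined sequence. The shapes and toy encoder are illustrative assumptions, not the DALL-E 1 architecture.

```python
# A minimal sketch of fusion by concatenating token streams (PyTorch).
# Shapes and the toy encoder are illustrative assumptions.
import torch
import torch.nn as nn

dim = 256
text_tokens  = torch.rand(4, 32, dim)    # e.g. embedded caption tokens
image_tokens = torch.rand(4, 196, dim)   # e.g. patch tokens or discrete-VAE codes
cls_token    = torch.zeros(4, 1, dim)    # e.g. a prepended class/label token

# Once everything is tokens, the model doesn't need to know which came from where.
sequence = torch.cat([cls_token, text_tokens, image_tokens], dim=1)  # (4, 229, dim)

layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(sequence)   # self-attention freely mixes text and image tokens
```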

So again, you've solved the multimodality problem by first boiling everything down into token space, and then once they're tokens, your decoder doesn't need to know which ones are image and which ones are text. And finally, something that we have increasingly been using at Noetik is this adaptive layer norm idea.

So I really like this figure. This is from the diffusion transformer paper. What I like about it is that they also talk about two other things I've already mentioned. The middle column there shows how you might do this with cross-attention. I guess I should back up and say, in the diffusion transformer case, the extra modality you're feeding in is a conditioning token.

You want to guide the diffusion in a particular direction. So you have some text label. And you want that to interact with the main transformer block that is going to be doing the diffusion. So you could do that with cross-attention, which is shown here. You could do it with what is called in this paper in-context conditioning, which is just the concatenation approach.

You can actually see that here, concatenate on the sequence dimension, as we just saw. But the thing introduced in this paper, which is very parameter efficient and cool that it works, is you can take that conditioning information and use effectively a linear layer to push it into small scalar parameters.

So here what they're doing is predicting an alpha, beta, and a gamma. They actually predict two triplets of alpha, beta, gamma. And use those in the layer norm operation. So this is just kind of another way, as far as I understand, to push information from one modality and have it steer or influence or guide what is happening in the processing of the primary sensory stream.

I have a little code snippet here, probably not worth lingering on, but it's just striking how straightforward this approach is. But really all that's happening is you have a single linear layer in here hiding after a non-linearity. And then you're using that to determine the parameters for shifting and scaling operations in layer norm.
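
Since the snippet itself isn't reproduced in this transcript, here is a minimal sketch of what such an adaptive layer norm module might look like; the single (scale, shift, gate) triplet and the dimensions are simplifications in the spirit of the DiT design, not the code shown in the talk.

```python
# A minimal sketch of adaptive layer norm (adaLN) conditioning: a nonlinearity plus
# one linear layer maps the conditioning embedding to per-channel modulation params.
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Predict gamma (scale), beta (shift), and alpha (a gate for the residual branch).
        self.to_params = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 3 * dim))

    def forward(self, x, cond):
        """x: (B, N, dim) token stream; cond: (B, cond_dim) conditioning embedding."""
        gamma, beta, alpha = self.to_params(cond).chunk(3, dim=-1)
        x_mod = self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
        return x_mod, alpha   # alpha typically scales the block output before the residual add
```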

And this turns out to be a very effective way to get information about a text-based conditioning label into the image domain. Okay. This is a pretty good place to pause and see if anyone has any questions before I move on. Yeah. Yeah, particularly when we're talking about cross-attention.

Could cross-attention itself be used in place of the linear layer in the adaptive layer norm? Yeah, if I understand correctly, I think you can. And the thing that you run into here that's both a blessing and a curse is like you can do pretty much anything. Like at some point if you think that there's more useful processing to do before you derive these layer norm shift and scale parameters, this doesn't have to be a linear layer.

It can be an arbitrarily complicated function. It could be a cross-attention layer that is using the information from two streams that way and then also doing the layer norm. The question is, is it parameter efficient? Does it work empirically? And do you have a good principled reason to do it?

And I think often that third thing is missing in a lot of these approaches, which is just like, yeah, it seems to work. So we're going to roll with it. Yeah. Yeah. Can you speak at a conceptual level, maybe at a big-picture level, about why transformers are important in the cancer research space? In the sense that, does it combine, let's say, an image of a patient, let's say a biopsy, and maybe a text description of it?

So what makes a transformer suited for cancer research, rather than a different architecture? Yeah. So the question is why are transformers important for cancer research? And I think the answers could have basically two shapes. One of them is if you format your data and your task properly, I think transformers are both simpler to scale and empirically scale better than other approaches.

The second reason I would say is not all data modalities are amenable to other architectures. So sometimes you do have image data. And I'll show examples of that in a second. Sometimes you have data that looks a lot more like text. And so you want to use the approaches that have been successful in modeling text.

And then the third thing I would say is that specifically for multimodal learning, it's extremely helpful to be able to push everything into token soup and bring it back together however you want. So it turns out the transformers are just a very good substrate for combining kind of arbitrarily creative ideas about how to integrate various measurements.

But in a couple of slides, I'll show exactly what those measurements are. And I'll walk through architectures we're using to learn from patient data. Cool. OK, I will go ahead and do that. OK, so as a reminder, this is kind of my mental model for how things work in the macroscopic world.

And I think the analogy in the microscopic world is now we want to understand the world of cancer biology. And the queries that we want to pose to this world model are not things like what happens if I walk into a roundabout. There are things like what will happen if I give this drug to this patient?

Will the tumor go away or not? And you want effectively a simulator that can help you answer those questions. I will fill in the question marks momentarily about what these sensors are and what the data looks like. But I want to put this up front as like our goal is to take this world modeling approach and to build a system that we can use to simulate answers to clinically relevant questions.

OK, this is my like crash course in cancer immunology. I was talking to a card-carrying immunologist this morning and apologized for trying to boil down their entire career into a few bullet points. So please know that this is a gross simplification. The basic idea behind cancer immunotherapy is that we know that in some cases, your immune system can detect and destroy cancers.

But tumors are sneaky and they evolve to either hide from the immune system or to actively suppress the immune system so that the immune system can't do its job and destroy the tumor. The idea behind immunotherapy as a type of treatment is you want to basically get rid of those immune evasion mechanisms or the immune suppression mechanisms and help reactivate immune cells to help them do their job.

So you already have the machinery in your body to fight cancer. It's just losing that fight and you want to find a way to tip the scales so that the immune system can start winning. The way that we do that could have two shapes and we're interested in both of them at Noetik.

One of them is you just find new drugs that no one has thought of before. But the other thing that I was interested to learn when I started working in this space is actually there's a lot of drugs that have progressed in various clinical trials and they fail because there's just a mixed response.

For some patients, they work really well. For other patients, they don't work at all. And we don't have the right way of targeting the right drug to the right patient. So for both of these, the thing you would want is, again, a world model that can help you simulate the answer to the question, given what you know about the world of this patient's biology, if I take this action, does the tumor go away or not?

OK, so here's the oversimplified schematic again of you want something that tells you given a patient's data. You want to be able to simulate kind of rapidly a bunch of different treatment options and get a reasonable prediction of whether they'll be effective or ineffective. So our approach to this is, first, we need to deploy our sensors and collect a bunch of data.

This is a video of our lab up in South San Francisco in action. So what we do is we get human lung tumor specimens and a few other tumors from a few other organs. These numbers are also a little bit outdated at this point. But we source our tissue blocks.

We have a lab that processes them. We punch out these cores from the tissue that are about one millimeter in diameter. So they're very tiny. But I'll show you in a second that there's an unthinkable amount of data packaged in that small one millimeter core. And then we will basically put those through four different data processing pipelines that I'll show you in a second.

One of those pipelines, which is a fairly common one to run on tissue samples, is we collect these H&E images. H&E has the benefit of being relatively cheap and easy to acquire, and a lot of people have it. So most hospitals and research centers are sitting on a ton of H&E.

There's a lot of public H&E. And if you want to do machine learning on biological data, it basically already looks like images. So you kind of know what to do with it. It's like a three-channel RGB image that you can work with. Technically, it's more like a two-channel thing because there are two stains, but you can deconvolve them if you want.

It's not super relevant to the rest of this talk. The other thing that we do is, instead of just looking at the gross morphology, which is what you mostly get out of this H&E stain, is we have a 16-plex protein immunofluorescence panel. So that means that we're able to detect proteins, 16 different kinds of proteins, by designing fluorescent antibodies that bind to them.

And we can see the composition of this tissue. So in this pseudo-color map, what I'm showing is green are T cells, which are part of your immune system. Blue is B cells, which is another part of your immune system. And then the red blob in the middle is a tumor.

We're able to go ask what's actually in the sample. This is kind of like an RGB image, but with 16 channels instead of three channels. The other thing to call out here is, you'll notice that the white box has moved with us. So these are spatially aligned samples. And if you remember what I was saying about RGB in depth, you might already be thinking, oh, well, you could just staple these together in the channel dimension and do multimodal learning that way.

One of the cooler kinds of data that we have is spatial transcriptomics. So on, again, these exact same samples, you'll see the white boxes moving with us again. We are able to detect in a 1,000-plex panel. So we're looking for 1,000 different genes worth of RNA transcripts. And those transcripts are spatially localized.

So we don't have to boil down the sample into soup before feeding it in and asking which RNA is in the sample. We get an xy coordinate and, well, it's really like an xy by gene tuple that tells us this gene has RNA expressed at this location. And you'll get anywhere between a few thousand to a few million such transcripts in just one millimeter of data, which I'll try to show in a second if it doesn't crash my computer.

So this data is very rich. It's also very complicated. It's also very expensive. So collecting this data is the hardest, and it's the thing that the fewest people have. In fact, on one of the previous slides, I had mentioned that we at Noetik estimate that we have well over 1% and probably over 2% of, like, the world's spatial transcriptomics data using this platform, which is called CosMx.

Okay, and the last thing we do is genetic sequencing. We do whole exome sequencing of the patient's data so we know if they have genetic mutations that could be relevant to understanding their tumor biology. Okay, so this is the data that lives out there. And the question is, how are we going to learn from it?

Before we do that, I just want to talk a little bit more about the spatial transcriptomics data because it's so cool. So these are what the raw transcripts look like in 3D. We do actually detect them in X, Y, and Z, but the Z plane is 4 micrometers, 4 microns.

So it's very, very thin. It's effectively just X, Y. And the colors here are different genes. I'm going to try to run a live demo of just what this raw data looks like because it's a lot of fun to play around with. But there's an overwhelming amount of data jammed into this one millimeter diameter chunk of tissue.

I promise this computer is not underpowered. It's just loading 11 million points. So this is what it looks like at the micro level. You have this incredibly rich tissue. And we are actually showing here one of the first whole transcriptome runs we've been able to do. So instead of detecting a thousand gene panel, we're able to try to detect over 18,000 genes.

And we ran out of colors after the first 20 or so, but you'll have to take my word for it. So what do the colors denote here? The colors are different genes. Different genes? Yeah. Okay. I will spare my computer from trying to render that for too long. Here's another example from our thousand-plex panel just to give you another sense of the kind of data that we work with.

So at the local scale, now we're talking about tens of microns. You have these highly localized reads of what RNA is being transcribed where in the cell. And I'll try to render the labels. But again, the plotting software struggles a bit with it. Okay. That's kind of an aside.

I just think those data are so cool that I like showing it whenever I have the chance. Okay. So let me talk about how we're trying to model this data and the kinds of world models that we're trying to construct. So we have thousands, at this point over 10,000 cores from different patients.

Each of those cores is one of those one millimeter diameter circles. In each core, there will be thousands of cells. And we can assign these transcripts to cells with a segmentation algorithm to say, in a given cell, you can count up, you know, how often do you get a read of this gene, CD3E, within the borders of a cell.

And so the data you get is effectively tabular. It's also very, very sparse. Most cells don't express most genes. And we're going to try to model this so that we can simulate things and ask, well, does this tilt the tumor immune microenvironment in a more or less favorable direction?

Our general approach is to use masked autoencoders. This ends up being a transformer backbone. The thing worth calling out is that it's not autoregressive transformer processing, so there's no real sequence to the genes to care about. It's more like masked language modeling, where what we'll do is take the input, we'll mask out some of the tokens, and we predict the tokens we mask out.

I'll show on the next slide what that actually looks like operationally. So if this is your input, imagine this has 1,000 rows, soon 18,000 rows. You have every gene and how often it was detected in a single cell. So the input is one cell here. We have a tokenization process that assigns a token to each combination of gene identity and the expression level.

So you're encoding both how much of that thing you have and also what the identity of it is. And then we do partial masking. In our work, we tend to do pretty aggressive masking. So the number you should have in your head is like north of 90% of the tokens are removed and replaced with a learned mask token.

So almost everything is removed from the sample. And the job of the model is to predict the things we removed. This is intentionally a very hard task. We want it to be basically as difficult as possible while still learning the biology. Because if you are able to do this, what it means is that you're able to use a very small amount of context to infer, OK, well, I see I have a lot of CD3.

I don't get to see how much CD8 I have because that got masked out in the masking process. So I have to guess based on my knowledge of biology from learning from this dataset, is it likely that I express CD8, which would make me a cytotoxic or a killer T cell, if I express CD3, which is kind of a more generic marker for T cells that may or may not be killer T cells.

So at inference time, what you can do with this kind of model is provide a couple of things that are not masked, like two tokens worth of information. And then staple on a bunch of mask tokens, you know, the other 998 genes in this panel. And then you can predict everything that's masked out.

So you can say, given that you're conditioned on knowing you're a T cell and you're not a tumor, what else do you think is happening? And, you know, if you've learned from this huge dataset, you might think maybe the best you can do is say, well, OK, what does the average T cell look like?

Because all I've told you is you have this T cell marker, CD3E, and you're definitely not a tumor cell because you're expressing none of this keratin marker. I don't want to say that this model is boring. There's a lot of interesting stuff you can do with it, but it's distinctly not multimodal.
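
Before moving on, here is a minimal sketch of the masked-autoencoding setup just described: each (gene, binned expression) pair becomes a token, roughly 90% of tokens are swapped for a learned mask token, and the model predicts what was removed. The gene count, bin count, and dimensions are illustrative assumptions, not the production model.

```python
# A minimal sketch of masked autoencoding over single-cell gene expression.
# Gene count, bin count, masking details, and dims are illustrative assumptions.
import torch
import torch.nn as nn

n_genes, n_bins, dim = 1000, 16, 256

gene_emb   = nn.Embedding(n_genes, dim)       # which gene a token refers to
level_emb  = nn.Embedding(n_bins, dim)        # how strongly it is expressed (binned)
mask_token = nn.Parameter(torch.zeros(1, 1, dim))

def tokenize_cells(gene_ids, expr_bins):
    """gene_ids, expr_bins: (B, n_genes) integer tensors for a batch of cells."""
    return gene_emb(gene_ids) + level_emb(expr_bins)

def apply_masking(tokens, mask_ratio=0.9):
    """Replace ~90% of tokens with the learned mask token; return what survived."""
    B, N, D = tokens.shape
    keep = torch.rand(B, N) > mask_ratio
    masked = torch.where(keep.unsqueeze(-1), tokens, mask_token.expand(B, N, D))
    return masked, keep

# Training: run `masked` through a transformer encoder and predict the expression
# bin at every masked position. At inference, the "prompt" is hand-built the same
# way: a couple of real tokens (e.g. high CD3E, no keratin) plus mask tokens for
# the other ~998 genes, and the model fills in the rest.
```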

And I promised you multimodality. So I'm going to keep going with this background in mind and tell you about one kind of model that we have gotten a lot of leverage out of, which is to say, this task is very hard and there's a lot of ambiguity. And if you remember, I was talking about multimodality as a way to resolve ambiguity in predictions.

So one thing that we do is we will take the local spatial neighborhood of a cell, the one we're trying to make predictions about. We'll ask, who are your eight nearest neighbors? And we will force the information about what those eight nearest neighbors are expressing through a bottleneck. That bottleneck is just another transformer, but it is required to boil down basically all of that rich spatial context about what's happening in the local environment down into one token.

And then we use this adaptive layer norm approach or we've tried the append tokens thing. We've tried cross attention. They all seem to work. Some are just more efficient than others. And that feeds in as an additional input to what we refer to as the backbone. So now, if you have a prompt in your training that's ambiguous, you have a little bit of help because you might say, well, I don't know if I'm a killer T cell or not, but seven of my neighbors are killer T cells.

And maybe I'm also a killer T cell. And that should help you do a little bit better. And the thing that I should have added here and forgot is showing that if you're just monitoring like the training loss of the backbone, it really likes having the spatial context. It helps quite a lot in lowering that loss curve.
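
Here is a minimal sketch of that spatial-context bottleneck: the eight nearest-neighbor cells are encoded by a small transformer and squeezed into one summary token, which can then condition the backbone via adaLN, appended tokens, or cross-attention. The dimensions and layer counts are assumptions.

```python
# A minimal sketch of squeezing the 8-nearest-neighbor context into a single token.
# Dimensions and depth are illustrative assumptions.
import torch
import torch.nn as nn

class NeighborBottleneck(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.summary = nn.Parameter(torch.zeros(1, 1, dim))   # learned summary slot

    def forward(self, neighbor_tokens):
        """neighbor_tokens: (B, 8, dim) embeddings of the 8 nearest cells."""
        B = neighbor_tokens.size(0)
        seq = torch.cat([self.summary.expand(B, -1, -1), neighbor_tokens], dim=1)
        out = self.encoder(seq)
        # One token carrying the whole local neighborhood, ready to feed into the
        # backbone through adaLN, token concatenation, or cross-attention.
        return out[:, 0]
```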

Once you have a model like that, you can do very large scale inference with it. So what we will do is take a real patient sample. We'll pick a location in that sample. Look up the nearest neighbors. And then we will ask the model to do inference with a specified prompt.

In this case, I'm telling it, you are a killer T cell. You express these markers for regular T cells and for cytotoxicity. And I'm asking it to predict the other 990 plus genes conditioned on the spatial context. And on the right, I'm going to slowly build up at each position here what the model thinks is happening for one such gene, IL-7R.

And you will already get a sense once this starts to speed up of how many simulations we're doing here. As of last week, I think we had run on the order of 6 billion virtual cell simulations like this. So we are able to ask basically arbitrary queries about what is happening in patient tissue by saying, hypothetically, if this cell type were in this location in the patient tissue, what do I think it's doing?

And to close the loop in a second, I'll tell you about how you can do counterfactual simulations that will help you think about potential drugs. But for now, just appreciating that once you have this additional input stream to the model, you can pull out a lot of richness about what it's learned, about how spatial context affects what is happening at the single cell level.

So we built a little web UI that lets you explore some of this data. I won't spend too long on it. But it lets you kind of pick a cell type, fly around a real patient sample, see what the cell identities of those nearest neighbors is, and then see how the model's prediction of different genes changes as you move around the space.

So I could spend all day kind of mucking around in this. But it's public. Feel free to poke around. Let us know if you find anything interesting that we should turn into a drug. Yeah, the link is cellaporter.ai. We think of it as like teleporting cells around in patient data.

May I ask you a question? Sure. So masking seems like a really interesting approach and strategy that you've applied. Is it to narrow things down to one particular pathway? Is the 90% masking so you can actually trace, like, every single one? Is that the core function of masking?

I think it does two things for us. The first is that at training time, it makes the task very, very hard. So that the only way you could answer it is by learning about fundamental patient biology. If you provided, you know, go to the other extreme, you provide 999 genes, you only need to predict one, you will probably identify that, oh, I can usually predict this gene by the combination of these other two.

The other thing that it helps us do is, I'll show this in a second, but if you train with a very high masking ratio, at inference time, you can bump the masking ratio to 100%, and the model doesn't freak out. It's like, oh, yeah, I'm used to seeing almost nothing.

And then you can do interesting things like, well, what if I give you no information about that center cell and you only get to use the local spatial context, which is a mode we actually operate in quite often. Yeah. OK, this is another kind of thing we would do with these virtual cells.

So the thing that I showed on this slide is the surrounding context is real data. We go to a real location in a real patient's tissue, look at the eight nearest cells, feed that expression in. But the model doesn't know what's real data and what isn't. So you can go create a synthetic neighborhood and say, OK, I'm going to start from the real data, and I'm going to simulate the effect of a drug by knocking down a particular target I'm interested in.

And say the neighborhood is basically the same, except that I've removed a target of interest. I've done a synthetic knockout in the surrounding context. And I want to know, when I do that knockout, do I make this hypothetical center T cell more or less likely to be in tumor killing mode?

So that's what this animation shows. The gene identities are redacted. But the y-axis is the increase in the production or predicted production of Granzyme K, which is part of the T cell cytotoxic arsenal for attacking tumors. And we can do these simulations for each of these 10,000-plus patient samples for each of the 1,000 genes in the panel.
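
Conceptually, each of those counterfactual runs looks something like the sketch below: take the real neighborhood, zero out a target gene to mimic a knockout, and compare the model's predicted readout (e.g. Granzyme K) with and without the perturbation. The `model` interface, gene indices, and readout here are hypothetical placeholders.

```python
# A minimal sketch of a synthetic-knockout counterfactual. The `model` call
# signature, gene indices, and readout are hypothetical placeholders.
import torch

def knockout_effect(model, center_prompt, neighbor_expr, target_gene, readout_gene):
    """neighbor_expr: (8, n_genes) expression of the 8 nearest neighbor cells."""
    with torch.no_grad():
        baseline = model(center_prompt, neighbor_expr)    # real spatial context

        perturbed = neighbor_expr.clone()
        perturbed[:, target_gene] = 0.0                   # simulated knockout
        counterfactual = model(center_prompt, perturbed)

    # Positive delta: the knockout pushes the hypothetical center cell toward
    # producing more of the readout (e.g. a cytotoxicity marker like Granzyme K).
    return counterfactual[readout_gene] - baseline[readout_gene]
```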

And you start to see how, with the right scale, you can very quickly search over hypotheses for what's happening in the patient tissue. Yeah? I would just like to ask a question. We have data for about 7,000 to 8,000 people. Yeah. Good question. So the question is, is 8,000 patients worth of data enough?

Or do you need more? Yeah. Could you augment with synthetic data? I think the answer is maybe. I would say if you are staying within a single indication, like we often think about non-small cell lung cancer, it's possible 8,000 patients covers the diversity of different patient biology or different modes that the tumor might be using to escape the immune system.

I think there are things you could learn from expanding to more indications, different cancer types in different organs that could be generalizable even back to non-small cell lung cancer. And so we are actively collecting right now more data and more indications and more organs to make sure we're not ignoring a pocket of how the tumors might be learning to evade the immune system.

Yeah. I think there probably are ways that you generate synthetic data. I'm not sure learning from it directly would help. Like if you're using a model that has trained on real data, I'm not sure if it's going to do better when it's generating its own synthetic data. But there are ways you can use your own knowledge about the biological system to produce things like these synthetic neighborhoods where you're saying, I'm going to go make a targeted change and then run inference and see what the model thinks will happen when I do a very specific perturbation in the data.

Thanks. Okay. The thing about this data that I mentioned earlier is it's expensive and it's rare and it's hard to collect. I also mentioned that this H&E data is image based. It's pretty cheap. It's fairly ubiquitous. A lot of people have it. And one thing that we've been doing increasingly is seeing if we can use our model, which we informally call Octo in part because it kind of looks like an octopus with a bunch of arms that sample different modalities, to just translate from one to the other.

So now we're back to multimodality as translation. The way that you do this is take this model that we're now familiar with. And now this diagram should look very familiar, except instead of gene expression data from the local neighborhood, we're passing in the surrounding H&E morphology of what is happening local to that cell in the patient's tissue.

You know, we can only do this because the samples are spatially aligned as precisely as they are. And now the model gets to use this additional input as help. It's saying, I'm facing ambiguity. What can I do? Oh, well, I can look at the morphology of the cells around me and try to understand if that makes me, for example, a cytotoxic T cell or not.

Once you've trained this model, you can then go to full masking. Just say, OK, I'm going to effectively ablate the primary input and just pass in 1,000 mask tokens. And now the only thing the model can do is use the local spatial context to say, well, I don't know anything about what the expression should be.

I have to use this local context to try to make a prediction of what the gene expression of a hypothetical cell at the center of that spatial window is doing. And because we've trained it at a very high masking ratio, the model is not totally clueless about this. And we were curious if it makes reasonable predictions.

So what we will do at inference time is take one of these H&E images, which, again, are fairly easy to acquire. Pick a spatial window and run the model forward in inference mode and say, OK, I'm not going to tell you anything about the expression of the cell.

In this case, because I don't have it. Like, we don't know what's going on there. It could be someone sent me this H&E image, and they never ran spatial transcriptomics. The model will make a prediction about what the expression of all 1,000 genes in our panel is at the center of that spatial window.

And here it's giving me some value for keratin 19, which is a tumor marker. And you can just run this as many times as you want. So in practice, we'll run this 1,000 times. Yeah. One question online is, are you only predicting the transcriptomics of non-cancer cells? No. We are predicting the transcriptomics for every cell, including immune cells, tumor cells, fibroblasts, stromal cells, whatever's in the data.

Yeah. So if you run this process a few thousand times, you get basically a prediction heat map of where is it likely that there's a tumor, because you're asking the model to predict one of the 1,000 things it's predicting at each of these locations is a tumor marker. So here's the heat map you get.

And then this is now going back. And for this core, we actually do have cell typing from transcriptomics data. And the red dots are where there's actually tumor. You can see that the model is predicting where there's tumor markers pretty effectively. I won't pretend this is perfect. There are definitely examples where it doesn't do a great job.

So here's another sample that has a ton of B cells. That's the blue in this diagram. And it does, you know, it mostly predicts that there isn't a lot of tumor marker, which is mostly accurate. There isn't a lot of tumor in the sample, but it's still missing the few areas where there is.

But for other markers, even within the same sample, it's doing a very good job of predicting from just this morphology image what the expression of various genes is. And, you know, imagine 997 more columns of this diagram and then 10,000 more slides with other samples. And that's how we're evaluating the ability of this model to do that translation from the images to the rich spatial transcriptomics.
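
The H&E-only inference loop just described can be sketched roughly as below: slide a window across the H&E image, give the model only the morphology crop plus a fully masked expression prompt, and record one predicted marker per location to build a heat map. The `model` interface and window geometry are hypothetical placeholders.

```python
# A minimal sketch of building a marker heat map from H&E alone. The `model`
# interface, crop size, and stride are hypothetical placeholders.
import torch

def predict_marker_heatmap(model, he_image, window=64, stride=32, marker_idx=0):
    """he_image: (3, H, W) tensor; returns a (rows, cols) grid of predictions."""
    _, H, W = he_image.shape
    rows = []
    with torch.no_grad():
        for top in range(0, H - window + 1, stride):
            row = []
            for left in range(0, W - window + 1, stride):
                crop = he_image[:, top:top + window, left:left + window]
                # 100% masking: the model sees only the local H&E morphology.
                pred = model(he_crop=crop, expression_prompt=None)
                row.append(pred[marker_idx].item())   # e.g. keratin 19, a tumor marker
            rows.append(row)
    return torch.tensor(rows)
```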

So essentially this is what you'd call masking? Right. We're leveraging the fact that we have the paired data in-house so we can teach the model to do that translation. And then at inference time, the model is happy if you mask everything out. It'll just use H&E. Yeah. Yeah, totally.

So the thing I'm going to show on the next slide is actually I've been playing with ideas for, you know, what would an interface look like for you have H&E data. You don't have the ability to generate spatial transcriptomics. You can imagine an interface where someone uploads their own H&E.

You predict a full thousand dimensional panel of what the expression is of a bunch of different genes at every location. We know what different gene programs look like. And you can actually use a large language model to find for you, you know, common areas where things are popping out and descriptions of what's happening.

So you get a fully end-to-end automated, you upload a sample, we predict what the gene expression would be in that sample, and then you can go explore the data, talk with an LLM, explore whatever you want, ask questions about it, and eventually run simulations in browser. So basically what we have built here is a system that attempts to translate relatively easy-to-acquire data into rich patient representations.

So this is kind of a pretty PCA projection of the first few principal components of the thousand-dimensional signal at each location. But what we will actually do with that is then each dot in the diagram to the right is one sample from one patient. And we're embedding it in a two-dimensional space.

In this case, I think we're using t-SNE, and we're putting patient samples together if their predicted gene expression profiles are similar. And what I'm labeling by here is actually the patient genetics. So the model has no idea about whether a patient has a mutation in cancer driver genes.

But it turns out that because the model has learned patient biology, when you cluster together predictions from different patients, the ones that have common genetic mutations end up clustering in this space. So you can recover patient biology starting, again, just from this H&E image because we have trained this model on a bunch of paired data that has H&E and spatial transcriptomics.

Yeah, we spend a lot of time asking what the structure of the space is telling us about the data. I'm just going to give a quick tour of another kind of model that will look familiar because it's using similar masked autoencoding. This is using that protein data. So in this example, the model is not predicting gene expression.

It's predicting protein images. And we're feeding in the other three modalities as extra tokens. So this is now not the adaptive layer norm thing I mentioned. It's just stapling on bonus tokens into the token stream and letting them interact with the primary token stream, which comes from these 16-channel protein images.

So in this model, if I'm creating a new dataset given the earlier process, can that become another input dataset? Yeah. Yeah. So you could only collect H&E, impute or predict what the spatial transcriptomics would have looked like if you collected it, feed that in. The question is if that helps you learn anything.

And I don't have an answer to that yet. We're exploring it. But it's an interesting idea. Yeah. OK. So what we do in this space is ask questions like, well, how do we want to simulate the effect of a drug? The effect of a drug simulated might look like modifying the spatial transcriptomics data, where you say, this drug suppresses this protein.

And so you could say that the way that would look in the spatial transcriptomics data is suppressing the leads of one gene or a combination of genes. And you're going to run this counterfactual, a what if, of, well, what if I had applied that drug, changed the spatial transcriptomics data in this way, and run that forward to a prediction of saying, well, how does that change the output of the model trying to predict the protein image on the other side?

And I have a small animation showing how that might happen. So this is a sample from a patient. The yellow here are immune cells. The pink is tumor. Now you can run two what ifs. In one what if simulation, you're cranking up the expression of this gene whose identity is redacted here.

And in the other one, you're suppressing the activity of that gene. So now I'm back to creating synthetic data on my multimodal input arms. And I'm going to now run both of those parallel universes forward. And what I'm looking for is the readout is one of these 16 channels that's being predicted.

In this case, I'm looking for a tumor immunogenicity marker where higher would indicate more favorable. And you can see that the simulation is telling us, yeah, you should crank up the expression of gene one if you can, because it leads to a more favorable outcome on the other side.

So one way that we often are using multimodal models is to say, we want this backbone that makes predictions to tell us what would change if we could impart a delta or a modification or a perturbation on another one of the input streams. OK.

OK. So if you have a-- if you're able to express your hypothesis for what a drug should do in the input space of the model, and you have a readout on the other side of what is good or bad, then yeah, you could compare multiple different ideas and see which one looks like it'd be most effective.

OK. I showed like a relatively small fraction of the research that we're doing here. But in general, we're very open about sharing the science we're doing. So we have three blog posts up. You can get to them at noetik.ai/research. This first one is about the protein-image-based models that I just showed.

There's more detail on how we're generating data in lab. And then some of this new spatial transcriptomic stuff is in a more recent post. OK. I just have a couple of more slides. And then it looks like we'll have a bit of time for further questions. One thing that we're moving toward doing, if you'll excuse kind of the chaotic animation here, is instead of trying to assign these RNA detections to cells, we're just going straight to the raw data and seeing what we can do with that point cloud of 10 million points that my computer choked on earlier.

And so we're training models that are able to look at the local spatial context in this raw point cloud of RNA transcripts and predict what is going to happen at many points spatially in the sample. So we're breaking free of the need to think about cells as the atomic unit of simulating the biology and going basically as close to the raw data as we can get.

So this trajectory, you know, in biology a cell would not drive through a sample like this. But for a visualization, I'm showing in each position here we're running one inference step. So I ran this model a few thousand times moving this hypothetical cell a few microns at a time.

And then simulating what the readout would be of saying, OK, well, when you're surrounded by this cloud of transcripts, you're probably also seeing this cloud near you and this other cloud over there. And so this is now the kind of simulation engine where basically any hypothesis you have about what a drug might be doing or what's happening in patient biology, if this model has really learned about the world of the patient's biology, you should be able to get a reasonable simulation out on the other side.

So in summary, oh, I have one more slide. I noticed last week was on the biology of a large language model, which is a paper I really liked. So I have a tongue-in-cheek slide on the biology of a large multimodal model for biology, which is just a little teaser.

I don't have a ton of time again. But we've been doing some interpretability research on these models to try to understand what it is they have learned about the patient biology. So we've used things like sparse autoencoders to pull out consistent features that keep popping up in the data and have used those to build interfaces for biologists at the company to basically do things like automatic semantic segmentation with text labels from identifying persistent concepts that keep appearing in the dataset.

So again, this is starting from just the H&E. So you upload H&E and you get back semantic segmentation with the different parts of the sample and what's happening in each of them. OK. This is basically my last slide, which is just saying I hope one thing I've communicated today is that this is just a big playground.

There's a ton of data. There's a ton of ways you can try to combine that data. And especially in biology, you have data spanning many spatial scales, many levels of abstraction, and also many modalities. So we talked today about really the three of them, the H&E imaging, the immunofluorescence, and RNA sequencing.

There's more to be done in looking at the patient genotyping. There's immunohistochemistry. So the field of omics keeps getting bigger and bigger. And so the question that I think we have to solve and that I don't think has a concrete answer yet is what do you do with all of this?

How do you build a world model that incorporates all of this information? And how do you craft your training tasks, your information bottlenecks, and your fusion points so that you're forcing a model to really learn real patient biology that is specific to each patient? And I won't pretend to have the answer today.

We don't know. We're trying a bunch of stuff and kind of throwing a lot of ideas at the wall and then having a lot of fun playing with it. But I just want to leave people with the idea that there's just a ton of things to try here. And I think it's going to be a very exciting few years ahead.

So with that, I'll now fill in the picture. This is our idea for building a world model for tumor biology. And I'll save the rest of the time for questions. You know, there's HIPAA, there are a lot of privacy protection laws that prevent, let's say, researchers like you from having access to real data.

So my question is, let's say data is localized at each hospital. How do you see future research in this area leveraging the vast data sets that are siloed in different hospitals, so that this can advance, so to speak, to a place where it can be diffused into the healthcare space?

Yeah, that's a very important question. I think there are a few different answers. One of them is open sourcing of methods, where you could say, you know, we aren't going to ask you to send us your confidential data, but we're going to tell you how you can run it yourself.

There are probably also ways that I'm not familiar with to just have a secure endpoint, basically, and say, we're going to, as a service, process this data for you, and we'll be compliant with the privacy regulations about how to get the data from your hospital to the servers and back. It's very much not my area of expertise, so I don't know exactly what that looks like.

But I imagine someone is thinking very hard about it. And I think the third thing is partnership, that you can always have, you know, a hospital or research institution say, we want access to a bunch of these models. We're going to go into a formal partnership where we have access to the compute and the models that you've trained and we're going to work together to run these samples so that it isn't just kind of an arbitrary like, yeah, email us your H&E data and we'll process it.

Yeah. Amazing. I just can't wait to see if it can be expanded to larger data sets than just 8,000 patients. My other question is, I know for cells and molecules and proteins, there are so many models beginning to appear. Are we able to leverage a lot of the AlphaFold-type models that already exist at the protein layer, and also exosomes?

In fact, the two new things which are getting to be very interesting are bioelectrics, you know, which is beginning to be another holy grail, not just cells. Is there any way you can also capture that kind of data in, you know, early experimentation, on top of the four, and add like two other layers?

Yeah. And exosomes, particularly as a communication channel, that can add value. Yeah. I think that's a great question. One thing I've come to believe is that most forms of data can be jammed into these frameworks fairly easily. And the question is, empirically, does that help with anything or not? And I don't know.

But I think, you know, to be totally candid, we're at a place in research where we can try a lot of stuff very quickly and see what helps us make better predictions. So in some cases, I think what we're doing is probably upstream of models like AlphaFold where, you know, once we have run our simulations and said, well, OK, our world model is telling us if you could only make reality look like this, then this patient would respond to treatment.

But then we might still have a challenge of, OK, well, what is the actual pharmacological path by which we get there? And then you might need to engage with something at the molecular level. So, yeah, I think it's possible that one of the big challenges that I'm thinking about is, indeed, how do you integrate across these spatial scales?

And we've tried some stuff, but we have not spanned this entire range yet in a single model. Is the approach right now to start with patient data first, or to just build a broad predictive model and then, you know, figure out later who the patient is?

You know, because there's so much of the early prediction that's happening where you're not a patient yet. Right. You know, there are no markers, there's no tumor yet, but you're able to predict. Yeah. Long shot. Yeah. Right now we are working from patient data, and one limitation of that data set, which I think is what you're pointing out, is that healthy people don't get parts of their bodies removed and sent to hospitals or to, you know, research labs.

So really the thing we would want in a world model is evidence of, well, what does it look like when everything's working? When there's no tumor? When the immune system has identified a threat and eliminated it? And we just don't know because what we're getting are these snapshots of a battlefield.

And if we're lucky, we will see a snapshot and then find out that that patient received some treatment and responded well to it. But we're missing probably the vast majority of the system operating in kind of normal operational mode. And I don't see an immediate way out of that challenge other than acknowledging it and recognizing that, you know, what we're trying to do is learn as much as we can about the snapshots we get of disease state and learn from that as a starting point.

Because the Broad Institute has some really interesting projects where they're just taking, like, dead tumor specimens and doing a bunch of computational modeling. It's really cool to see, you know, if that can integrate into this overarching framework. Yeah, our attitude is very much, like, as much data as we can get, we're going to throw it at this thing.

So that's all. Very cool. Thank you. Thank you. Any other questions? So is Noetik just a model company? Do you outsource the drug discovery to other people and just let them use your service as a platform? Or how does it work? No, we are effectively as full stack as it gets.

We generate our data in house. We do the machine learning in house and then we have biologists and immunologists in house. And I didn't talk about this at all, but we have an in vivo platform with mice where you can actually go test hypotheses about saying, here's a mouse model for this patient population.

We think that this drug will be effective. Let's go test it. So we right now actually do not provide any of these. You know, even if you wanted to pay me today to go use one of these, I don't have an API endpoint for you. But we are using these internally to do drug discovery and basic discovery every day.

So the model is like a lower latency version and then the mouse is just to-- so you do most of your testing with the model, I assume, and then the mouse is just if you-- after you do the initial filtering with the AI model? Or how does that work?

Yeah, so the model can tell you two things. In the best case, it tells you, hey, you should go try this drug in mice. And then we don't pretend that this model is going to understand all of the complex biology that happens when you administer a drug. So you still need to go to that in vivo test case and say, is this actually working to get rid of tumor?

The other thing the models tell us is maybe there's already a drug people have tried, but they're trying it in the wrong population. Like, they're enrolling too broad of a clinical trial because they don't know that actually this small sliver of the patient space is the one that's going to respond well to your drug.

And so we are using the model in that capacity to give us ideas for things to try in our in vivo platform or to reexamine clinical data and see if we can understand why clinical trials have failed, where other people are kind of shrugging and going, well, I don't know, like 10 people were totally cured, but the other 190 weren't and we don't know why.

That makes sense. So is the end game to completely replace clinical trials or--? I don't know. I mean, maybe in many, many years, I think ultimately, you know, safety and efficacy proven out in humans will need to happen at some point. If anything, I would assume that the earlier part of that funnel is going to be increasingly done synthetically.

But maybe we say, you know, you can in some cases skip a mouse model if you have a very clear mechanism of action and you want to go straight to checking for safety and efficacy. Thanks. Yeah, thank you. Thank you. You have many different scenarios of simulations to test with many different combinations.

I'm wondering how you narrow it down to, like, prioritize, like, which one to test first? Yeah, that's a fantastic question. So the way I think of what we're doing is, like, let's say you had a really good world model, a simulator of the world, even in the macroscopic case. You still need an agent to tell you what is worth simulating.

You can't just try things at random. And I think there's two promising things here. One of them is you still have subject matter experts that say, oh, great, you've given me a perfect simulator of patient biology. I know the first 30 things I'm going to try because I'm familiar with the things that have been considered in this space.

We have hypotheses for how the tumor is evading the immune system and we want to go see if we can disrupt those. The other thing that I don't know how much weight I want to put behind right now but would be a little foolish to dismiss is that we have these LLMs now that are trained on basically the entirety of scientific literature know about a bunch of studies.

And if they have access to, effectively, tool use, you can say, OK, what experiments do you want to run, and start thinking about, you know, virtual scientists or AI scientist agents that are trying to decide, this is the next experiment I would do. And maybe at some point you close the loop and you say, great, we're going to run that experiment and we'll be back to you tomorrow.

And then tell me the next thing to try. But either way you need somebody who's familiar with the domain, whether that's an LLM or a scientist to make the decisions about how to use the simulator. Got it. Thanks. Yeah, thank you. So I know your grand vision is to cure cancer.

Do you have some thoughts on cancer prevention, like probably vaccines or stuff like that? Yeah, that is a good question. I'm not sure I have a fantastic answer about, you know, how are we thinking about cancer prevention? It relates to this problem of we don't have healthy data to look at.

And if there isn't an indicator right now in the clinic of, yeah, you look healthy today, but we're still going to go biopsy this and send it off and fit you into this framework, then I think you're going to be limited in what you can do about simulating the biology of patients who right now are totally asymptomatic.

I could imagine that changing. I mean, it's possible that data acquisition is going to be broad enough and cheap enough that it is just routine to have a little bit of H&E taken. And maybe there is information kind of in the latent space of this model that says, yeah, this looks like it's over in this part of the patient distribution.

Maybe we keep an eye on this patient over the next few years. Same question. Great. Well, I have a follow-up question. Google scientists have a forward on all this stuff. Yesterday, Microsoft released a co-pilot or semi-autopilot for science discovery tool. I'm very curious how we can run a billion experiments autonomously, maybe modeling, instead of modeling what scenario, maybe model a person, right, as an agent, and figure out what drugs could potentially you know, cure the context we just lost someone like two weeks ago to cancer.

So this is a very big thing for me. Yeah. How do we prevent or cure intermediate cancer? So that's the way I'm thinking about where agents and science intersect. Right. Yeah. I have a few thoughts here. I'm not sure how coherent they are. One of them is, I am not sure yet that speed and scale are the bottleneck on running these simulations.

I also have not seen an example yet of an agent coming up with a really brilliant experiment that our scientists didn't write down in, like, the first four minutes of thinking about how to use this thing. So that doesn't mean I'm not optimistic about it. I think these systems are going to get better.

And if anything, I have been wrong every time on saying like, yeah, that's five years away. And then five months later, it turns out that we can do it. So I would like to remain optimistic that we're going to be able to frame these problems, integrate things like tool use in a way that does allow agents to go run a bunch of experiments.

And then, you know, for us, we probably still go test those out in mice and then think about them in that capacity. But yeah, it's a very interesting direction to think about. Give another hand to Eshed for the very interesting talk. Yeah, so you'll notice this talk was sort of different from a lot of the more recent talks, which focused more on LLMs.

But I think this is a very important application area where machine learning and AI can make a lot of impact, change the world, positively impact the world, and save lives. So we should definitely encourage more work in machine learning and AI for this area of cancer research and healthcare in general.

Thank you.