Back to Index

Stanford CS25: V2 I Biomedical Transformers


Chapters

0:00 Introduction
1:10 Overview
9:32 MultiMedQA
17:43 Selective Prediction
26:39 Evaluation
31:35 Qualitative Examples
33:19 Clinical Language Models
34:34 Key takeaways
39:39 Questions
47:59 Proteins
53:23 Genomics
58:39 DeepMind

Transcript

So making you all see the speaker notes was not part of the plan, but I'm glad to be here. And my name is Vivek Natarajan, and I am a research scientist in the Health AI team at Google. A little bit more about me. Growing up in India, my parents always wanted me to be a doctor, to be precise, a medical doctor, but unfortunately I was probably not good enough to memorize all of the biology textbooks that you had to do in case you wanted to crack the medical entrance examinations.

So I ended up becoming a computer scientist instead. But as a great man once said, you can't connect the dots looking forward, you only join them looking backwards. So through a rather long-winded path, not too dissimilar from how we actually train our neural networks, I ended up working in medicine again, this time armed with this magical new tool of AI.

And I can tell you that my parents are far more happy with my life choices right now. But they're never truly satisfied. But digressions aside, my goal for this talk is to peel back the curtains and give you a flavor of all the innovation that is happening at the intersection of AI and biomedicine and how that is being catalyzed by transformers and large language models in particular.

So we will spend the first few minutes trying to work up from first principles why transformers and large language models are a particularly good fit for biomedical data. And then we will deep dive into a few papers covering a bunch of different biomedical application settings. And finally, I'll present my views on how this field is likely going to evolve in the next few years.

And even though my voice or tone may not exactly sound that way, I am incredibly excited by the possibilities of AI and biomedicine. And I think we have an incredible opportunity in front of us to advance human health and human potential. And my hope at the end of this talk is you all will feel the same way as I do today and perhaps join me.

So yeah, let's jump straight in. Why transformers and biomedicine? And sorry, I'm going to pick people who are in person to answer. So maybe if one of you could volunteer. Go for it. Why not? That's a good answer. Sure, go for it. We have a lot of biological data.

So it's like biological data is a problem, but we have a lot of it. Yeah, sure. Medical doctors are expensive, and a well-paid part of the job is just memorizing, as you said. Yeah, it's an important application setting. Yeah, great one. So I think all of you are on the right track.

And so maybe if you just look at different kinds of biomedical data, for example, what are clinical notes? I think it's a sequence of doctor gibberish. OK, I did not say that, but let's just call it a sequence of doctor speak or doctor notes. Similarly, if you were to look at electronic medical records, what are they?

They are essentially a sequence of a person's encounters with the medical system. What about proteins going deeper into the biological stack? They are nothing but a sequence of amino acids linked together by peptide bonds. And does anybody know what this is? Go for it. I think that's how we store medical records, from them.

Sorry, again? Well, it looks like from them. It looks like they were like from the virus. You're getting close. Anyone else? So this is in the Wellcome Collection in London, and this is actually a printout of the full human genome. And no, they did not cheat over here. The font is super small.

And as you can see, there's a bunch of ATGCs. The entire printout contains, I think, over 130 volumes on that shelf, and each page is printed on both sides. And it's a four-point font with precisely 43,000 characters per page. So that is how big the human reference genome is, more than three billion base pairs.

And so again, the genome is nothing but a sequence of nucleotide base pairs. So what we are essentially seeing over here is sequences are everywhere in biomedical data. And what is the best neural network architecture for modeling them? And I guess since you are all in this course, I don't have to convince you that the answer is transformers, right?

OK, that's good. But maybe I'll just offer a few reasons over here. Firstly, as you can see, the data itself is multimodal in nature. And we just saw a few examples. And as someone pointed out, transformers have proven remarkable at gobbling up pretty much any kind of data. And we are really seeing this remarkable convergence across fields, whether that's speech, or NLP, or vision, or robotics.

I mean, pretty much everywhere we are using transformers, and I think biomedicine is no different. I think secondly, transformers are far more effective at modeling complex long-range interactions over sequences. And this property is particularly important in the biomedical domain. And we will cover this in more detail later in the talk.

And finally, as again, someone pointed out, these data sets can be quite big. And you can easily get into the billions of tokens territory. And this is where transformers with all the parallelizable operations and the relative ease of training-- and maybe someone should try training an LSTM or an RNN on these kinds of data sets-- you'll realize that transformers are much better suited for the kind of data sets that we have in this domain over here.

So yeah, I think there are a few more reasons as well, but I think these are the key ones as to why transformers are particularly well-suited for biomedical data sets and tasks. Any questions so far? OK, great. So now in the next part of this talk, we will dive deep into a few papers applying transformers to biomedical data.

We'll start with clinical applications first, and then go gradually deeper into the biology stack looking at proteins and genomic applications as well. And what you will observe is that while transformers and large language models by extension are a great fit, often you have to innovate not just on the modeling side, but also on the data and evaluation side to make these application scenarios really work.

And so the first paper I want to talk about over here is this recent work from our team called Large Language Models Encode Clinical Knowledge. The motivation for this work is actually quite straightforward. So if you look at medicine, it is a humane endeavor, and language is at the heart of it, facilitating interactions between people and those who provide care for them.

Unfortunately, if you look at a lot of medical AI systems developed to date, these are all narrow, single-task, single-domain models lacking interactive and expressive capabilities. And as a result, what has happened is there is this discordance between what these models can do and what is expected of them by patients and care providers and others.

And this in turn has, I think, prevented broad uptake of medical AI. And you can see that, for example, we don't really have AI in many clinics out there, like helping us with diagnosis and so on and so forth. But the recent progress with transformer-based large language models, it offers us an opportunity to change all of this and redesign and rethink medical AI systems with language at the heart of it, mediating human-AI interactions between doctors, researchers, and patients.

And I will be honest if I don't point out that there has been a large volume of work in this space, particularly in the last few years. There have been various attempts to train language models in the biomedical domain with models of various different sizes on different corpuses of biomedical data.

And while this is exciting, the quality bar for applications in the medical domain is actually quite high. And so what is missing is that there are actually not many good evaluation benchmarks, evaluation protocols, and frameworks. So we don't have the equivalent of a BIG-bench in medicine. And hopefully, you guys have covered BIG-bench before.

And so BIG-bench is this benchmark where you can assess large language models across a variety of task domains and settings. But we don't have an equivalent of that in the medical domain. And further, if you look at the evaluations that are typically used in these previous studies, they only look at objective metrics like accuracy or natural language generation metrics like BLEU or CIDEr.

But these fail to capture the nuances of real-world use cases in clinical settings. So what we essentially needed was a good benchmark and task, and also a good evaluation framework for evaluating these models. And so to address this unmet need and assess the potential of LLMs in medicine, in our team, we decided to focus on the medical question answering task.

Why? Because answering medical questions is actually quite challenging. It requires reading comprehension skills, ability to accurately recall medical knowledge, and also manipulate and reason about it. And furthermore, the Q&A task is general enough and can subsume a bunch of different application settings such as summarization of clinical notes, clinical decision support, and also primary care triaging of patient concerns and so on.

So we've identified the task. The next question was what data set? And so when we looked at the literature over here, what we saw was that there were several data sets floating around assessing model capabilities in a bunch of different settings. So what we decided was we should probably just unify all of them and put them together in one benchmark.

And so we did that, and we called it MultiMedQA. And so if you look at it, this benchmark now covers medical question answering data sets from a bunch of different settings, such as professional medical questions, like US Medical Licensing Exam-style questions. It also includes medical research questions, those based on PubMed abstracts and so on, and also questions from live users and consumers asking about medical information.

And also the setting changes, it could be closed domain or open domain, and the model may be expected to produce a long form answer in one setting and maybe a short form answer in another setting. And finally, we saw that while the Q&A data sets which covered consumer questions, yeah, go for it.

I have a quick question, how do you evaluate long form answers? I'll come back to this. Okay. So yeah, very quickly. Finally, when we looked at... Sorry, one other thing. People on Zoom might not be able to hear the questions asked in person, so could you repeat the questions? Okay, cool.

So the question was, how do we evaluate long form answers? And I'll come back to this a bit later. And so very quickly, when we looked at the data sets that actually provided consumer medical questions, we found them to be quite small in size. So we decided to augment them.

And so we went out to Google and looked at the most frequently asked consumer medical questions. And so we curated a data set, and we added that to the benchmark, and we call that HealthSearchQA over here. And so, yeah, again. How big is the composite? I'll come back to the statistics later, I'm sure.

So here are a few examples. So if you look at the consumer medical questions, they are quite short in nature. And so they come from the HealthSearchQA and the LiveQA data sets, whereas I think if you look at the USMLE-style questions, these are like long vignettes. And so doctors have to really, really carefully read through them and come up with the right answer, which often involves a process of elimination.

So again, very, very different application settings, and so the model has to really adapt and understand the task to do well in all these settings across the board. And LiveQA is interesting because the answers, the reference answers over here, were actually provided by librarians. So that's another good comparison point for us.

And so in terms of statistics, we had a total of seven data sets in this benchmark. As I said, we cover professional medicine, medical research, and consumer medical questions. They're, again, of various different sizes and can be long form, short form, open domain, and closed domain. So very diverse, and I think it provides a very comprehensive evaluation of models in this medical question answering setting.

So we have a task on the benchmark. The next question, again, I think I asked was, how do we evaluate these models? And as I mentioned before, automated metrics are actually deeply unsatisfactory because they fail to capture the nuances of real-world clinical applications. So what we did was actually heavily inspired by some of Stephen's work over here, was to put together a human evaluation framework for assessing these long-form answers.

And this had two parts. The first part was evaluation by clinicians, and we asked them to rate the model responses along 12 axes pertaining to factuality of the responses, the ability to recall medical knowledge, medical reasoning, and also the potential for harm and bias in these responses. But if you look at the potential end users of such medical Q&A systems, these are likely going to be non-expert lay users.

So it is also important to get these answers evaluated by them as well. And so we also additionally asked a pool of lay users as to how helpful and actionable they thought the answers were. And so that was our evaluation framework, and we also had the benchmark fixed. So now we move on to the fun part of building and aligning LLMs to the medical domain and task.

So in this work, we decided to build on the PaLM family of language models. Has that been covered in the course before? Okay, great. So, but very quickly, I believe this is still the largest publicly announced densely activated decoder-only large language model, with the largest one being 540 billion parameters in total.

A few more details, the model is trained on 780 billion tokens, 25% of which is multilingual. The data comes from a bunch of different sources, including social media conversations, web pages, books, GitHub, and Wikipedia, and so on and so forth. And at the time of release, the model was state-of-the-art on many NLP reasoning benchmarks, and also was the first model to exceed the average human performance on BIG-bench.

Further, over the last year, PaLM-derived models were shown to be super useful in a bunch of different application settings, including for code generation, which was the PaLM-Coder model, in robotics, the PaLM-SayCan model, and also for answering math and science questions, which was the Minerva models. And so we thought PaLM was a very good foundation model for us to build on and use in the medical domain as well.

And overall, I think PaLM is a true marvel of engineering, but I will refer you all back to Aakanksha's paper on this for more details. I think it's a must-read. And again, in late October last year, Jason Wei and a few others at Google Brain came out with the Flan-PaLM variant of the PaLM model, and this is basically the instruction-tuned counterpart.

And this model was even better than PaLM, and I believe this is still the state-of-the-art on many benchmarks such as MMLU and TyDi QA, and I think it exceeds PaLM performance by an average of 9.4% across BIG-bench tasks. So we decided to build on the Flan-PaLM model, and we applied a combination of prompting strategies including few-shot prompting, chain-of-thought reasoning, and also self-consistency to the 540-billion-parameter variant, and we evaluated it on the MultiMedQA datasets that had the short-form MCQ questions.
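To make the prompting setup concrete, here is a minimal sketch of how a few-shot chain-of-thought prompt for a multiple-choice medical question might be assembled. The exemplar text, wording, and function name are hypothetical illustrations, not the actual prompts used in the paper.

```python
# Hypothetical few-shot chain-of-thought prompt builder for USMLE-style MCQs.
# The exemplar content below is invented for illustration only.
EXEMPLARS = [
    {
        "question": "A 54-year-old presents with crushing chest pain ... (A) GERD (B) MI (C) PE (D) Costochondritis",
        "rationale": "The pain is exertional and radiates to the left arm, with ST elevation on ECG, pointing to (B).",
        "answer": "(B)",
    },
]

def build_cot_prompt(exemplars, new_question):
    """Concatenate worked exemplars (question, rationale, answer) before the new question."""
    parts = ["The following are multiple-choice questions about medical knowledge.\n"]
    for ex in exemplars:
        parts.append(f"Question: {ex['question']}\nExplanation: {ex['rationale']}\nAnswer: {ex['answer']}\n")
    parts.append(f"Question: {new_question}\nExplanation:")  # the model continues with reasoning, then an answer
    return "\n".join(parts)

print(build_cot_prompt(EXEMPLARS, "A 23-year-old presents with ... (A) ... (B) ... (C) ... (D) ..."))
```

Self-consistency then amounts to sampling several completions of a prompt like this at non-zero temperature and majority-voting the final answer choices; a sketch of that voting step appears a bit further below.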

And we found that this model was really, really good. At the time of publication, this model on the MedQA (USMLE) dataset exceeded the previous state-of-the-art by over 17%. Is this specific to this one dataset? It's only for the MedQA (USMLE) dataset. That's right. And so you see that the accuracy over the previous state-of-the-art at the time of publication went up by over 17%.

And I believe this was the first LLM-based AI system to obtain a passing equivalent score, which was 60% or above on this benchmark. And similarly, when we looked at other MCQ datasets in the benchmark, for example, MedMCQA, which is a dataset of Indian medical entrance examination questions, the model was again the state-of-the-art.

On PubMedQA, which was question answering based on PubMed abstracts, again, the model was state-of-the-art at the time of publication. And same story on MMLU clinical topics as well, which include genetics, anatomy, professional medicine, clinical knowledge, and a bunch of other topics in there. So all this was great. And then when we started looking at the scaling plots, what we again saw was that the performance seemed to be improving as we scaled the model from 8 billion to 62 billion to 540 billion.

And so what this basically suggested was that these general purpose large language models trained on the public internet seem to encode clinical knowledge pretty well. And their medical reasoning abilities tend to scale with model parameter size. We also did another experiment where we looked at selective prediction. And we used the self-consistency votes to determine when to defer.

And this is important in clinical settings because doctors communicate when they don't know something. And if our AI systems are going to be used in clinical settings, for example, for diagnosis, they should be able to tell you when they don't know something. And so this is what we observed with this fairly crude mechanism.

So we were getting a linear improvement in performance as we changed the deferral threshold. And this was quite nice. But in practice, it's actually quite inefficient because you're generating multiple decoding samples to be able to compute this metric. So we need a better method. Just to be clear, what is the deferral fraction?

That's the fraction of questions the model defers on. Yeah. It basically says, I'm uncertain about this one. And that's determined based on the self-consistency votes. I see. OK. So if there's variance in the self-consistency samples, you-- Exactly. --defer. Exactly. So are models trained to be capable of deferring themselves? Are they generating outputs saying, I don't know?

No. Because they're just trained on this expert prediction task. And that depends on the data set. The PubMed QA has some answers which are maybe. But again, we don't explicitly fine-tune the models over here. So no, the models are not trained. Yeah. So does it imply that this metric runs in the wrong ??

So then how-- basically, how do you determine whether the output is correct or not? So this is primarily based on the references in the data sets-- these are all accuracy metrics. So we already know, between the four options or five options, which one's the right one.

And so we just do that classification. Yeah. So I'll come back to the clinician evaluation a bit later. Sorry. Maybe I missed something. How are you measuring the uncertainty? So if you know about self-consistency prompting, what we do is we generate multiple decodes from the same model. And then we see the number of votes the highest-ranking answer gets.

And based on that, you can fix a threshold and say, if it's below this number, I'm going to defer. So if, say, the majority answer comes up in your self-consistency decode only like n times out of k or whatever, then if that n is too small, then it's very likely the model's uncertain.

So that's how we defer. So we don't really see a taper-off in this plot, so it's natural to ask what the rest of the curve would look like. I think if you plot it further, it will flatline. But again, that's not useful. I mean, if you're saying no to every question, that's not useful at all.

So you want to have a reasonable deferral percentage over here. Yeah, but like 0.4 or 0.5 is not bad, right? I think that's high. I think that's still high. 50% is quite high. But again, this is a very contrived setting. But in real-world use cases, probably, I think that number should be much lower.
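To make the deferral mechanism just described concrete, here is a minimal sketch, assuming a `sample_answer` function that stands in for one temperature-sampled decode from the model and returns a short answer string. It illustrates the vote-fraction rule, not the paper's implementation.

```python
from collections import Counter

def answer_or_defer(sample_answer, question, k=11, defer_threshold=0.5):
    """Self-consistency with deferral: sample k answers, majority-vote them,
    and defer (return None) when the winning vote fraction is below a threshold."""
    votes = Counter(sample_answer(question) for _ in range(k))
    top_answer, count = votes.most_common(1)[0]
    vote_fraction = count / k
    if vote_fraction < defer_threshold:
        return None, vote_fraction  # too little agreement: the model is uncertain
    return top_answer, vote_fraction
```

Sweeping `defer_threshold` is what produces the curve discussed above: a higher threshold defers more questions but raises accuracy on the ones the model still answers.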

That's right. I think balanced accuracy might be a better metric. But we looked at some of these data sets. And one data set, the skew was pretty bad, the PubMed QA data set, and I think no one should use it. So if anyone's reporting SOTA numbers on that data set, you should just distrust them.

And I'm talking about very specific people. But again, I think, as I mentioned, these accuracy metrics are good for publicity and pushing up benchmark numbers and so on and so forth. But the real evaluation is human evaluation of the long-form answers. And that's what I'll come to in the next part.

So so far, so good, right? I mean, we were getting SOTA results on these benchmarks, and we were very happy. And so what we did was-- I mean, one thing you'll observe that I have so far only reported results on multiple choice questions, short-form answers. So what was left for us to do was to take these answers, take these models, and generate long-form answers to the other data sets that we had and get them human-evaluated.

And I think that is where the real project began. When we looked at the evals by experts and laypeople, they revealed key gaps and limitations in the Flan-PaLM responses. We were often seeing that these models were hallucinating or producing incomplete responses. And when we asked experts whether they preferred clinician-generated answers or these model-generated answers, they almost always preferred clinician-generated answers.

So it was very clear that-- Sorry, I didn't catch this earlier, but who are these evaluators? Are these laypeople, or are these-- They're both. OK. They're both. OK. So what these previous results showed was, while these models already encode some degree of clinical knowledge, to be really used in actual real-world settings, you need to align these models better to the safety-critical requirements of the medical domain.

But a big challenge is we did not have any kind of supervised or feedback data. And so we really needed the alignment technique to be data-efficient. But thankfully, we had prompt tuning, which was introduced by Brian Lester and a few others at Google a couple of years back.

And how this method works is it essentially freezes the big LLM model and only learns an additional small set of prompt vectors, which can then be used to condition the model at inference when doing the generation. And the nice thing about this is it allows very easy reuse of the model across tasks and domains.

And you only need to carry around these additional prompt parameters. And these tend to be much smaller than the billions of parameters that you have in the LLM. And the other good thing is this is very computationally efficient as well. So if you were to do end-to-end fine-tuning, often in our compute infrastructure, even with a few thousand examples, that would take a few days.

Whereas with instruction prompt tuning, A, the data set size is reduced-- the number of examples that you need is quite small. And B, you're just updating the prompt token vectors. It meant that we were able to get model updates in a few hours. And so that was really fast and enabled really quick iterations for us.
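As a rough sketch of the prompt tuning idea (frozen LLM, a handful of trainable soft prompt vectors prepended to the input embeddings), here is what the trainable wrapper might look like in PyTorch. The wrapped `frozen_lm` is assumed to accept an `inputs_embeds` argument; this is an illustration of the technique from Lester et al., not the Med-PaLM training code.

```python
import torch
import torch.nn as nn

class PromptTunedLM(nn.Module):
    """Freeze the backbone LLM and learn only a small matrix of soft prompt vectors."""

    def __init__(self, frozen_lm, embed_dim, num_prompt_tokens=20):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():
            p.requires_grad = False  # the billions of backbone parameters stay fixed
        # The only trainable parameters: a few "virtual token" embeddings.
        self.soft_prompt = nn.Parameter(0.02 * torch.randn(num_prompt_tokens, embed_dim))

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, embed_dim) from the frozen embedding table.
        batch_size = token_embeddings.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        # Condition the frozen model by prepending the learned prompt vectors.
        return self.lm(inputs_embeds=torch.cat([prompt, token_embeddings], dim=1))
```

Only `soft_prompt` receives gradients, which is why updates fit in hours rather than days and why the learned vectors can be swapped per task while the backbone is shared.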

So this was how we put together the final Med-PaLM model. So we used instructions and exemplars from a panel of expert clinicians. And these were on the order of hundreds, not thousands or tens of thousands. And you see a few examples over there. There's an instruction, followed by a model answer, followed by an explanation.

And we use that to learn the prompt vectors. And so the final Med-PaLM model is basically all of Flan-PaLM plus these additional soft prompt vector parameters, which are used to align the model to the requirements of the medical domain. And why this works well is because, as we have seen before, the model already has medical knowledge encoded in it.

All we need is to teach the model how to use it properly in the given application setting. And that's what these prompt parameters do for us. So the question I wanted to ask is, nowadays, you've probably seen a lot about RLHF. And given the fact that you have all of these human preferences expressed by your evaluators, have you guys tried playing with a reward or preference model and using that to fine-tune your model?

Yeah, I think you can think about different stages of model development. So this is pre-deployment and release in the real world. So you can't put a crappy model out there in the real world. So even before doing that, if you can get maybe 100 examples from whatever experts you can get hold of and use that to prompt-tune your model, that's better.

That's a much better starting point before you expose the model to the real world and collect preferences from real users at scale. And so I think RLHF is also much less sample-efficient compared to instruction prompt tuning, again, because you're probably trying to update your entire model as well. So I think this is a very good starting point.

And so they can both be combined depending on the lifecycle of the model. Are your evaluations, human evaluations, public? The data set is public. I'll talk about the results in a bit. No, sorry. Are the human evaluations contained publicly within the data set? You mean the model responses and what the human evaluations are?

That's a good point. So far, not considering releasing them, but maybe we can. Do you see a use case for that? Well, I was thinking, you have a bunch of data in the preferences, so train the model to express those preferences, and then use that model for RLHF in a medical community.

So if I wanted to train a reward model, that data is what I would need to train that reward model. Yeah, that's a good point. I think the evaluation data set is-- I'll talk about this a bit later. It's still small. But I think if we scale it up-- and we are doing it right now-- I think we can release that.

And that will be, I think, a good resource for what you're trying to do. Cool. So we have the Med-PaLM model, as I said. And now we took the long-form answers from it and compared them to the Flan-PaLM model, as well as to answers generated by expert clinicians. And as I said, we have two parts to the human evaluation.

One is by expert clinicians, and then the other one is by lay users. And so what do these results look like? On the 140-odd questions that we got these evaluation results on, what we observed typically across the board was, when we looked at different axes, while the Flan-PaLM model would be quite terrible, honestly, the Med-PaLM model would do much better and typically close the gap to expert clinicians.

So on this axis, you see that the Flan-PaLM model has probably a 60% accuracy in terms of scientific consensus. The Med-PaLM model improves on that quite a bit and closes the gap to clinicians over here. A similar story on other axes as well. Over here, you see the clinicians' ratings on the axes of how well the model can retrieve medical knowledge and how well it can reason about it.

And again, we see the same trend as in the previous slide. Yeah. So the left columns-- the left two columns are evidence of correct comprehension, retrieval, and reasoning? So it's evidence of correct comprehension. Yes. And the right-- And on the right-hand side, it's incorrect. So I think it would just be 1 minus-- No.

It can be present at the same time. So you can have evidence of correct comprehension, also evidence of incorrect comprehension. Sometimes you see-- Exactly. So that's why they're not 1 minus over here. I see. Oh. But the trends are the same. So that's why I skipped over. But that's a detail.

Good point. Yeah. Again, so there's a typo over here. But this one pertains to incorrect or missing content. But this was an interesting one, because when we were doing this prompt tuning thing, we were teaching the Med-PaLM model to produce longer and more complete answers. And so you'll see a few qualitative examples later.

But what ended up happening in the process was sometimes the model was maybe producing more incorrect information. So that's why you see that maybe on this particular axis, the Flan-PaLM model was slightly better. But again, this was much worse compared to clinicians. It's a good question. It is more like it's something completely out of context. So it may be irrelevant to the question. So that's what I would say.

So it may be irrelevant to the question. So that's what I would say. So we also looked at possible and extent and likelihood of harm. And again, we see that with the instruction from tuning, we're able to close the gap to expert clinicians over here. Same on the bias axis as well.

Sure. Can you interpret the top one? Exactly. So I think, basically, the category of death-- and then the clinicians at, like, 6%-- could you talk more about and clarify exactly what that means and what you're talking about? Yeah. So it's basically-- so there might be certain conditions or pathologies or diagnoses, right?

Like cancer. And if, for example, the clinician has not caught that or has maybe given a response that does not appropriately convey the severity of the condition, then that could potentially lead to severe harm or death. And so that's what we were trying to capture over here. So that's a very high-level overview.

This is, I think, a very nuanced topic. And there's a framework for it called the AHRQ framework. And so we've linked that in the paper as well. And I think that gives you a very detailed notion of harm and bias, so I would refer you to that. But at a high level, this is what we're trying to capture over here.

All right. So if, later, I read the paper and I see the clinicians had 5.7% on extent of possible harm, what does that mean? Does that mean that they recommended something that could kill the patient? Maybe they failed to recommend something. Yeah. So it's basically a misdiagnosis, or maybe failing to capture the severity of a diagnosis.

This is typical in life-threatening conditions. So it's more often not outright mistakes, but rather just missing out on details. Yeah. So I talked about bias as well. And then, as I said, the other axis of human evaluation was with lay users. And so we asked them, how well does the model address the intent of the question?

And again, we saw that with instruction prompt tuning, Med-PaLM closed the gap to clinicians. And then we asked them how helpful the responses were. And what we see is that while Flan-PaLM responses were considered to be helpful, like, 60% of the time, the number improved to 80% for Med-PaLM, but it was still fairly low compared to clinicians at 90%.

So here are a few qualitative examples. And so what you see is that physicians-- and this is typically because they work in time-constrained settings-- their answers tend to be precise and succinct. But sometimes it's very hard, as lay users or patients, to decipher and decode the answer and get all the full set of details.

And so what I think language models like Med-PaLM can help with is actually converting the physician speak to something that's more easily digestible by lay users. And so this is where I think these models will likely fit in clinical settings in the near term, where they are going to augment physicians in terms of interacting with patients and other physicians and researchers as well.

So-- go ahead. On this point, then-- if I look at this example, I think it actually looks like the physician's answer is more understandable. And if I read it as a patient, I'm not sure the model's answer is better.

And it's my-- That's right. I think it's subjective. And so that's why I think we're still seeing lay users rate Med-PaLM answers as helpful only 80% of the time, whereas that's much higher for physicians. So it's not perfect by any means, but I think this is where there is a complementarity element, we feel, over here.

And we've asked that. And so when we ask people, how easy is it to interpret doctor notes or recommendations, and they often say, oh, it's very hard. I need to go back to Google, search for what these terms mean, what these abbreviations mean. And so I think this is where a language model can come and take that note and convert that into something that's more easily digestible.

So I think that's the opportunity over here, I feel. So that was all on our paper. But I also want to maybe very quickly point out a very recent work which came out last week with this rather provocative title, Do We Still Need Clinical Language Models? And by clinical language models, they meant smaller models which are trained in domain with clinical data such as medical notes and records and so on and so forth.

And what this paper basically suggests is that smaller, fine-tuned, in-domain LLMs are likely better than general-purpose LLMs. In this paper, I think they evaluated GPT-3 with in-context learning. So I think that's a pretty interesting and neat observation. I think there's a lot of value for smaller in-domain LLMs such as PubMedGPT and a few other variants.

But I think one thing that this paper does not do is consider in-context learning-- sorry, prompt tuning. And I think that's where some of the benefits of these larger general-purpose LLMs shine. And again, we haven't done any in-domain LLM pre-training on these large general-purpose models. But that's, again, an option for us to do down the line.

So you can take these 540 billion parameters and then still train it on medical notes or whatever domain-specific data that you can get hold of. And hopefully, that will probably further improve the performance. So key takeaways so far-- what I wanted to convey was general-purpose LLMs, it looks like they do encode medical knowledge.

And performance on medical reasoning does seem to improve with scale. However, these models, I don't think, can be directly used out-of-the-box in clinical settings. And they need to be aligned with the safety-critical requirements of the medical domain. And I think instruction prompt tuning is an extremely efficient technique, both on the data side and also on the compute side.

And we should probably use it more often, depending on-- and hopefully, the API starts supporting it as well. And these models appear to be closing the gap to expert clinicians, at least on this medical question-answering task. And while this is hugely exciting and has profound implications-- you can all probably dream up and imagine the application scenarios over here-- I think comprehensive benchmarks and evaluation frameworks are necessary in order to further assess and improve these models for real-world use cases.

So I'll stop over here. Any questions? I think-- A lot of it is because these data sets tend to get locked in silos with privacy and other kinds of regulations, which prevent them from being put out there in the real world. So you have to have HIPAA-compliant systems for storage and so on and so forth.

So it's very difficult to get data out of these silos and put together an open benchmark. So honestly, I feel like that's probably not going to improve the scale of these data sets. At least the open version of these data sets are going to remain quite small compared to the big LM training data sets or the computer vision data sets on natural images and so on and so forth.

But what may happen in the future is we may have more distributed, federated evaluation settings where you take the model into these private silos and get it evaluated there. So the data sets are never exposed and put out there in the public. But rather, we can have these federated evaluation settings.

So I think that there's some work on that already. There's a system called MedPerf. And we'll probably see more of them. Sure. So the question over here was why medical data sets are smaller compared to natural image data sets in computer vision or LM training data sets and so on and so forth.

What do you think are some of the earliest applications of medical LLMs deployed in the industry? I think the first set of use cases are probably going to be not diagnostic in nature. Sorry. The question was, what do you think are the use cases of medical LLMs in medical industry settings?

And so the answer is I think the first set of use cases that we are going to see are probably going to be non-diagnostic in nature, but more around if a patient comes in and interacts with a doctor, can you generate summary notes? And can you do workflow tasks such as generating letters for insurance, for medications, for referrals, and so on and so forth?

I think these tasks are right up the alley of large language models. And I think if not already, in the next six months to a year, we'll see a lot of these use cases coming up. And I think that's going to make doctors' life, care providers' life much easier because right now they're spending a lot of time doing these things and not actually providing care and attending to the patient.

Diagnostic use cases, I think, will take a lot more time. We need a lot more evaluation. The data sets, as we can see, are probably not there. Evaluation frameworks are not there. But I think in the long run-- and that is the dream setting, right? And then maybe a follow-up is-- I'm assuming Med-PaLM is not open source.

What do you think the best open source model is for medical data? And I think it depends on the-- so the question is, what is the best open source model for medical data? I think it depends on the evaluation setting. So I think the PubMedGPT model from the Stanford foundation models group is quite strong.

I think GPT-3 or 3.5 or whatever variant, if you can bring in some domain-specific medical data and do some in-domain tuning, I think that model can also improve quite a bit. So I think those two would be my favorite starting points over here. So I was curious, like, what do the soft prompts actually look like, what do they correspond to, or-- It's-- you can just think of them as vectors corresponding to a few additional tokens.

So it's not really human legible. So the question was, what do the soft prompt vectors look like? And are they human legible? And yeah, the answer is, no, they're not. Just a follow-up. You mentioned federated learning for models of this scale. If you use models with billions of parameters, they usually have to run on-prem at the hospital sites.

And those sites often have low-quality infrastructure and on-prem data sets. Do you really believe federated learning is going to work there? Will we have, like, on-site hardware, on-site data sets, and teams that can train these models, such that it's going to work? Sure. So the question was, given a lot of the hospital systems and providers' networks are quite low-tech and don't have good enough hardware, do you really think federated learning could be used for distributed training of large-scale LLMs?

I think we are increasingly seeing a trend towards cloud. And so a lot of these hospital systems are moving their storage and data and compute to standard cloud providers like AWS or Azure or Google Cloud. And so I think that helps, because these systems on the back-end side do have the compute to be able to train these kind of models.

I think it's going to be a very gradual process. So systems that have high-quality infrastructure, we're probably going to start with those first, and then gradually work our way into the long tail. But it also feels like something that will inevitably exist in the world. So 10 years down the line, or 15 years down the line, when we have these distributed large-scale LLM training systems, we'll always think back, "Why did I even doubt that this would exist?" It's so obvious it's something that has to exist, because that's where all the patient data is, all the interesting data is, right?

So I think that'll just happen. It's just not clear whether that's going to be done by one company, whether that's going to be done by a consortium of academic or industry groups, or whether governments are going to be involved, and so on and so forth. It's interesting. You mentioned cloud computing, but essentially, you say you're doing it federated and distributed, but we're still uploading the data, probably to the same compute warehouse, right?

That's right. So the question over here is, we're seeing cloud computing, but we are pretty much uploading the data to the same warehouse. The answer is true. But again, I think these are all going to be separate buckets with their own access controls, and so on and so forth.

So that is how you can differentiate between them. Sure. So the question was, have there been any studies with Med-PaLM looking at private information in these data sets?

And the short answer is no. One of the criteria for selecting the data sets that we used in the study was to not include any kind of personally identifiable data or clinical data of that sort. And that helped get this paper out on time. But I think that's an important point.

It's unlikely that we're going to have a lot of PHI data in the public data sets that we are training on. But even when you're training on, say, one private corpus and then you're using it in another application setting, you want to ensure that the model does not leak out any kind of PHI information during its generation.

So I think those sort of studies are necessary. We haven't got into them yet. So the question is, what are the next steps in terms of improving these models further? Yeah. Retrieval is a very important one. Being able to cite sources and especially take in authoritative sources and use that in generating the answers and also communicating that to the users is very important.

I think how you communicate uncertainty is very important. So we've gotten somewhere using instruction prompt tuning, but I think that can be much, much better. So I think that's another big bucket. Again, I would stress the evaluation side: looking at more data sets, which for example may do Q&A on health records or other kinds of medical data, I think that will be important.

And also extending the evaluation both in terms of scale, having a diverse panel of clinicians involved, and also in terms of the data that you're using. Maybe adversarially modifying the questions to include demographic confounders or something like that. I think those could all be interesting directions. I think on the modeling side, the interesting question for me is, again, this interplay between smaller domain-specific LLMs versus large general-purpose LLMs and how that's going to play out.

There seems to be some evidence of emergence over here, especially with medical reasoning. And so as you can see at lower scales, sometimes the performance is not good enough. I mean, 50%. I mean, that's a good number, but that's just not viable. But when you get to like 80%, 90%, products really become useful.

And so that we are seeing at bigger parameter sizes of these models. But I don't know. I think it's still an open question over here. Yeah, the question was, is hallucination an issue? I think it still is. But I believe that you can control that fairly well with instruction prompting, like any kind of feedback data.

I think it's not terribly difficult to do. And so I think it might have been overblown generally. So especially when you are doing it in a particular domain, I think it's easier to control. I'm just curious to what extent the model hallucinates and what that looks like. I just think recently there's been a lot of talk about it, so I'm just curious, because for this particular domain it's very, very relevant. Yeah, so the question was, there is a lot of talk and noise around hallucinations and general purpose LLMs.

And in this particular application domain, it seems particularly relevant. And so can you expand on that a little bit further? Sure. So what we are seeing is, even with an order of a few hundred examples from expert clinicians, teaching the model how to communicate medical information, that is good enough to get the model to maybe stop hallucinating, or at least communicate its uncertainty in a better way.

So at least in this particular domain or this setting, it feels more tractable to us. And the reason I'm saying this is we've looked at the answers qualitatively, and we are seeing that the model does not tend to generate super long answers or make very confident predictions, but rather the tone itself becomes very reserved.

And it starts using terms like, maybe this needs to be done further, or something like that, which communicates uncertainty. So how well that actually correlates with the model's underlying uncertainty is still, I think, an area of research. But I think this is already promising for us, that it feels controllable in limited application settings like medicine.

But if you have a general purpose LLM trying to answer pretty much everything about the world, I think that's a much harder problem. Do you think that would be a feature of the domain data? Like, in medical situations, doctors are more reserved, perhaps, and don't speak in absolutes when handling uncertainty?

Or do you think it's more that you have just specialized the model? Like, it could be something else entirely, also. I'm just curious what you might think. Yeah. So the question is, do you think the way the model is performing in this domain is a feature of the data sets in the medical domain, and typically based on how doctors communicate?

And I think that's true. And I think that's something we need to build on and use over here. And I think that's extremely helpful. And hopefully, this kind of behavior is general enough and can be transmitted to the model, even when it's used in non-medical settings, to be more reserved when it's communicating and hallucinate less, and so on and so forth.

So I believe that that's one of the opportunities over here: to use these benchmarks, come up with methods that reduce hallucination and communicate uncertainty better, and then use that as a bidirectional learning opportunity to improve the general-purpose LLMs also. So if you have any further questions, I'll come back again at the end of the talk.

But I want to cover the rest of the applications as well. So the next domain I want to talk about is proteins. And the papers from here on, I'm going to zip through a little bit, given time. But the first one I want to talk about is this paper from a few folks at Google Research back in 2020, called Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers.

So the problem here is that modeling long range biological sequences requires efficient transformer architectures. And so in this particular paper, what they introduced was this performer architecture, which approximates the softmax attention kernel via low rank decomposition. And so this does not incorporate any sparsity priors, say, like other methods like the reformer, or there are many others.

And this is good, because sparsity priors may not be appropriate for biological data such as proteins, which require global interactions to be modeled. And then the other thing is this model, the Performer, scales linearly rather than quadratically with the sequence length, L. And the number of random features that you need to approximate the softmax attention kernel, M, is completely independent of the input sequence length.

So just to very quickly visualize the speedups and the space complexity improvements, what you're having with this low rank decomposition is, instead of having fat matrices in your softmax attention kernel, you now have thinner matrices, which are determined by the size of the random features, M. And that basically reduces your quadratic complexity to something that is more linear in nature, and also leads to space improvements.
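To illustrate the linear-attention idea, here is a small NumPy sketch comparing exact softmax attention with a random-feature approximation computed as phi(Q) (phi(K)^T V), so the L x L attention matrix is never materialized. The feature map below is a simplified positive random-feature construction in the spirit of FAVOR+ (single head, no masking), not the exact mechanism or code from the paper.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Exact attention: O(L^2) time and memory in the sequence length L."""
    A = np.exp(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (A / A.sum(axis=1, keepdims=True)) @ V

def performer_style_attention(Q, K, V, m=256, seed=0):
    """Approximate attention in O(L * m * d): map Q and K through m random
    features, then compute phi(Q) @ (phi(K).T @ V) and normalize row-wise."""
    d = Q.shape[-1]
    W = np.random.default_rng(seed).standard_normal((d, m))

    def phi(X):
        # Positive random features whose inner products approximate exp(q.k / sqrt(d)).
        Xs = X / d ** 0.25
        return np.exp(Xs @ W - 0.5 * (Xs ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

    Qp, Kp = phi(Q), phi(K)
    numerator = Qp @ (Kp.T @ V)        # (L, m) @ (m, d): the L x L matrix never appears
    denominator = Qp @ Kp.sum(axis=0)  # per-row normalizer
    return numerator / denominator[:, None]

# Tiny check that the approximation tracks exact attention on random inputs.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((64, 16)) for _ in range(3))
print(np.abs(softmax_attention(Q, K, V) - performer_style_attention(Q, K, V)).max())
```

With m fixed, doubling the sequence length roughly doubles the cost of the approximate version, whereas the exact version quadruples, which is what makes very long protein sequences tractable.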

So I would-- yeah, there are more theoretical analysis and details in the paper, and I would refer you all back to it. But what we see in terms of results when doing protein language modeling is that the accuracy of this model is on par with transformers while reducing computational costs quite a bit.

So what this suggests is that the approximation of the softmax attention kernel is a tight approximation. So that is good. And then when you compare that with other methods, such as the reformer or the linformer, the accuracy is much higher, at least on this task. So it seems that, compared to other methods that approximate-- like, try to build more efficient transformers, this one is much better for biological sequence data, at least in this setting.

And finally, if you look at the amino acid similarity matrix estimated from the attention, you can see that the Performer model recognizes highly similar amino acid pairs, such as (D, E) and (F, Y) over here. So that suggests that the model is learning the right set of information that we really want over here.

So that was a two-minute overview of that paper. But I want to talk about another one, which I also think is really, really cool. So this one is called ProtNLM, again by a few other folks at Google Research. And what this does is model-based natural language protein annotation.

And why this problem is important is because protein information is in very high demand. For over 50% of all known proteins that have been sequenced, we don't actually know what they do. So it's important that we're able to decipher that, to some degree at least. And then the second thing is we may want to, for example, find protein sequences with given functions.

And this is particularly important in the CRISPR domain. And so if you can train bidirectional models that can do this, I think that will be incredibly helpful. And the reason I say this, again, is that the UniProt database has, I think, millions of researchers worldwide using it today.

And so getting this information populated in that database would be incredibly useful and accelerate a lot of research in this space. And so the European Bioinformatics Institute, they have curated this free text data about proteins. And so basically, you can use this protein record to train these models. And so what you want to do is you want to maybe learn to directly map from amino acid sequences to natural language descriptions of them.

And this problem is not too different from an image captioning problem, where instead of having a sequence of pixels-- I don't know if sequence is right. But again, if you have pixels, instead you have a sequence of amino acids. And they can range in number from 2 to 40k.

And then what you want to generate out is a description of the protein. And in this paper, the way they do this is they train a T5 model on protein sequence annotation tasks. So the tasks are set up in a bunch of different ways. And the supervised data comes from a bunch of different sources in the protein record that they have.

And this model is an encoder decoder T5 model. So it's a very cool application. And the results are that out of the 56 million proteins in that UniProt database that were previously uncharacterized, 49 million of them now have associated textual descriptions. So we now have a handle on what they do.
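As a rough illustration of this sequence-to-description framing, the snippet below feeds a spaced-out amino-acid string through an off-the-shelf T5-style checkpoint via Hugging Face Transformers. The `t5-small` checkpoint and the prompt prefix are placeholders that know nothing about proteins; the paper's actual model, tokenization, and training data are not reproduced here.

```python
# Sketch only: a generic T5 checkpoint standing in for a protein-annotation model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Amino-acid sequence written with spaces so each residue becomes its own piece.
sequence = "M K T A Y I A K Q R Q I S F V K S H F S R Q L E E R L G L I E V Q"
inputs = tokenizer("describe protein: " + sequence, return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

In the real system, the encoder-decoder is trained on pairs of sequences and curated free-text protein records, so the decoder learns to emit protein names and functional descriptions rather than generic text.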

And so that's really cool. And then the other one, I think, which is probably even more interesting, is now you can run queries like, find me a smaller version of this CRISPR-Cas9 protein so that it can target certain tissue spectra. And now the model can come back with sequences.

And so I think this is, again, going to be incredibly useful and going to accelerate a lot of research in this space. Already, there's a lot of momentum. I think these models are going to further help. So that was on proteins. The last class of applications that I want to cover is on the genomics side.

Again, the first paper over here was some work last year from our genomics team at Health AI at Google, which is building gap-aware sequence transformers for sequence correction. So this model is called DeepConsensus. And so what role does this model play, and why does it matter? So if you look at the sequencing data lifecycle, what you do is you go from basically atoms to bits.

So you have this physical specimen, which hopefully has some DNA in it. And you put it through a sequencing machine, such as a PacBio sequencer. And that comes out with the raw data. And that raw data gets mapped to a reference genome. And then sometimes there might be diffs between an individual and the reference genome.

And those can be identified through this model called DeepVariant that was introduced by our team a few years back. And that's open source. And then once you have this sequence, you can then use it for a bunch of different analyses, such as ancestry or just basic biomedical research.

So where DeepConsensus fits in is it takes the raw DNA reads that come out from the PacBio sequencer and tries to make them more accurate. And so how the PacBio sequencer actually works is it uses this circular consensus sequencing algorithm where the DNA molecule is read several times.

And it produces multiple different sub-reads. And these sub-reads do contain some errors. And so they are finally assembled together. And so what DeepConsensus tries to do is improve on the errors over here, basically, that come out from just this circular consensus sequencing algorithm.

And so how does this model work? So as I said, the basic task for DeepConsensus is to use the CCS data and the sub-reads associated with it to generate a corrected sequence. And so in this example, when we run it through the model, what we see is that while the CCS identity was at 95.7%, the DeepConsensus prediction identity was at 100%.

So it's a fairly simple task where you're trying to reduce the errors that come out of the PacBio sequencer with the CCS algorithm. And so the very natural question is, where do these labels come from? So each CCS sequence that you have is aligned to a high-quality assembly. And this high-quality assembly is created by having many CCS reads stitched together.

And so that ends up having fewer errors. And so you can then try to use that high-quality stitched assembly and map that back to the CCS read for a given block and use that as the label. So that results in stronger ground truth. And you can use that to train the model to improve the accuracy further.

And so this is what the model is trained on. And so the model looks like this. It's a transformer architecture. It takes these sub-reads and this CCS read as well. And it has a bunch of additional context features that come in from the sequencer itself, the sequencing instrument as well.

And these are all fed into the transformer model. It produces a polished segment. And these segments are then stitched together to produce the final polished read over here. One thing I will point out over here is that in order to train this model, you can't use a cross-entropy loss.

And this is because you often have insertions in DNA sequences. And so that can, when you use a cross-entropy loss, really throw off the model. Even a single error, as you can see over here, can propagate throughout the sequence and make it really, really worse. So what you need is a special kind of alignment loss based on distance that can really capture this error much, much better.
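As a toy illustration of that motivation (not the paper's differentiable loss), the snippet below shows how a single spurious inserted base makes almost every position look wrong under a position-wise, cross-entropy-style comparison, while an edit-distance-style alignment counts it as just one error.

```python
def positionwise_errors(pred, truth):
    """Naive position-by-position comparison: the view a cross-entropy loss takes."""
    return sum(p != t for p, t in zip(pred, truth)) + abs(len(pred) - len(truth))

def edit_distance(pred, truth):
    """Levenshtein distance: the alignment-style view of the same mistake."""
    dp = list(range(len(truth) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, t in enumerate(truth, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (p != t))
    return dp[-1]

truth = "ACGTACGTACGT"
pred = "AACGTACGTACGT"  # one spurious inserted base near the start

print(positionwise_errors(pred, truth))  # 12: every downstream position looks wrong
print(edit_distance(pred, truth))        # 1: the alignment sees a single insertion
```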

And so making this alignment loss work on TPUs and making it differentiable is, I think, the real meat of this paper. And so, again, go back to the paper if you're interested in that kind of topic. I think that's really, really cool. But at a very high level, how well does this model work?

So if you look at the final output, you have the read name. You have the base predictions and also the predicted quality, which can be thought of as a confidence score. And these base predictions are often quite long. And so you can see that it continues offscreen because it's 10K to 20K bases long over here.

And when you look at the quality, it improved quite a bit over the vanilla CCS algorithm over here. The per-read accuracy over here improved quite a bit. And so you may ask, what is the real-world impact of this kind of model? So the answer is this model is already being used in the real world.

So at Stanford, in the genomics team led by Dr. Ashley and a few others, there was this recent ultra-rapid nanopore genome sequencing paper where they set a world record for the fastest genome sequencing. And this DeepConsensus transformer architecture was used in that sequencing pipeline. And so in this particular study, they were able to very quickly diagnose that Matthew over here had a heart condition due to genetic reasons.

And so they were very quickly able to put Matthew on the transplant waiting list. That's the kind of real-world impact you can have with these biomedical transformer models and AI systems in general. And very quickly, the last paper that I want to talk about is this paper from DeepMind on effective gene expression prediction from sequence by integrating long-range interactions.

This was published in Nature Methods. The motivation for this work is that, since the Human Genome Project, there have been thousands of genome-wide association study hits, where the goal is to map genetic variants to different kinds of disease phenotypes. But a lot of this involves real-world, wet-lab experimentation, and that takes a lot of time.

If you can do that with machine learning models instead, that's really great, and that's what they set out to do in this paper. If you look at the variants themselves, roughly 10% of them are coding variants, and these influence protein function.

The way they can cause disease is by disrupting the structure of the proteins that are produced or by affecting protein-protein interactions. The good part about these coding variants is that they tend to be close to the gene, so they're easier to interpret. On the other hand, the remaining 90% are non-coding variants.

The way they work is by influencing protein expression; they sit in regulatory sequences. So the way variants there can lead to disease is by disrupting transcription, and therefore how much protein gets made. And given that these non-coding variants can be very far away from the gene and the coding variants, they are much harder to interpret.

So the question is, can we train transformer models that predict the influence of these non-coding variants? That is the task here. This is a visualization of the biology: the paper focuses on transcription, which is the first step of converting DNA into RNA.

The way this works is that you have RNA polymerase, which gets recruited at the beginning of the gene by proteins called transcription factors. These transcription factors have binding sites that correspond to promoters, which are quite close to the gene. But you also have enhancers, which can be very far away from these promoters in linear sequence space and also influence transcription.

You may ask, how can these enhancers influence that activity? It's because, while they may be far away in linear space, when the sequence folds into its 3D structure they end up being quite close to each other, and so they can strongly affect the transcription process.

That's a very high-level overview of the biology. The point is that if there are variants in these non-coding regions, such as the enhancers, they may disrupt transcription factor binding, which in turn can lead to little or no protein being produced and finally to disease.

So we want to be able to predict that from the DNA sequence itself. The problem is quite straightforward: it's a supervised learning problem, where the setup is to predict experimental data from DNA sequences. That experimental data can take many different forms; the primary one here is gene expression.

But there are also other tasks, such as DNA accessibility, histone modifications, transcription factor binding, and so on. As you can imagine, the baseline model for this task for many years was a CNN, and as you stack more CNN layers, you can increase the receptive field.
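To see why stacking convolutions alone struggles at genomic scales, here is a back-of-the-envelope calculation; the kernel sizes and dilation scheme are just illustrative assumptions.

```python
# Back-of-the-envelope receptive-field growth for stacked 1D convs (illustrative only).
def receptive_field(num_layers: int, kernel: int = 3, dilation_growth: int = 1) -> int:
    rf, dilation = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * dilation
        dilation *= dilation_growth
    return rf

print(receptive_field(10))                      # 21 bp with plain kernel-3 convs
print(receptive_field(10, dilation_growth=2))   # ~2 kb with dilations doubling each layer
# Covering ~100 kb this way needs pooling or very deep dilation stacks,
# whereas a single attention layer sees the whole sequence in one step.
```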

But there's a practical limit to how far that scales. So in this work, what they showed was that you can use transformers instead and model these long-range interactions much better. The final model is called Enformer, which is a portmanteau of enhancer and transformer. If you look at the model itself, it has a few CNN layers at the beginning.

But then it has a bunch of transformer blocks stacked on top. The input is a roughly 200 kb DNA sequence, and the model is trained on on the order of 30,000 such examples. The output is a set of genomic tracks, such as RNA expression, predicted along the sequence. And they have organism-specific heads, one for human and one for mouse.

Finally, one key detail is the relative position encodings used in this model. These relative position encodings were designed to model a power-law decay of interactions with distance, and as a result, combined with the transformer blocks, the model is able to capture interactions from over 100 kb away.
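Putting those pieces together, here is a heavily simplified PyTorch-style sketch of an Enformer-like model: a convolutional stem that downsamples the one-hot DNA, a stack of transformer blocks over the resulting bins, and separate output heads per organism. The dimensions and number of blocks are my assumptions, and I omit the relative position encodings discussed above for brevity, even though the real model relies on them; this is not DeepMind's released code.

```python
# Heavily simplified Enformer-like sketch (assumed dimensions, not the real model).
import torch
import torch.nn as nn

class MiniEnformer(nn.Module):
    def __init__(self, d_model=256, n_blocks=4, human_tracks=16, mouse_tracks=8):
        super().__init__()
        # Conv stem: one-hot DNA (4 channels) -> embeddings, coarsely pooled into bins.
        self.stem = nn.Sequential(
            nn.Conv1d(4, d_model, kernel_size=15, padding=7),
            nn.GELU(),
            nn.MaxPool1d(kernel_size=128),   # stand-in for the real pooling tower
        )
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_blocks)
        # Organism-specific heads predicting per-bin genomic tracks.
        self.human_head = nn.Linear(d_model, human_tracks)
        self.mouse_head = nn.Linear(d_model, mouse_tracks)

    def forward(self, dna_onehot, organism="human"):
        # dna_onehot: (batch, 4, seq_len); the real model uses ~200 kb inputs.
        x = self.stem(dna_onehot)            # (batch, d_model, bins)
        x = x.transpose(1, 2)                # (batch, bins, d_model)
        x = self.transformer(x)              # long-range interactions across bins
        head = self.human_head if organism == "human" else self.mouse_head
        return torch.nn.functional.softplus(head(x))  # non-negative track predictions

model = MiniEnformer()
tracks = model(torch.zeros(1, 4, 32_768))    # short toy input -> (1, 256, 16) human tracks
```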

You can see that in the results here. You have the experimental data in green and the CNN baseline alongside it, and you can see that as soon as you move far away from the gene, the CNN model is no longer able to capture these gene expression signals.

But the Enformer model is able to pick them up: even far away from the gene, Enformer captures these signals where the CNN cannot. And finally, one very interesting experiment in the paper was that they were also able to predict promoter-enhancer interactions.

And those predictions were actually on par with experimental methods. This suggests that, using this kind of machine learning model, we can sidestep a lot of wet-lab experiments and still get key details, which could be super useful. So yeah, I'm sorry I had to cram through the proteins and genomics applications over here.

But I think what you would see overall, when you look at clinical, protein, and genomics applications, is that transformers have incredible potential in biomedicine. With clinical applications, I think the challenges are perhaps more centered around data and evaluation. But on the proteins and genomics side, I think there are some extremely interesting opportunities to innovate on the architecture itself.

And finally, as I said, there are incredible bidirectional learning opportunities. The problem of modeling long-range interactions is useful beyond proteins and beyond genomics, so any architecture improvement here can inspire wider progress in AI. I think that's a big reason to work on this.

Any questions so far? Sorry, I covered a lot of ground over here. Apologies for that. But I think these are super cool papers, and you should go back and read them. So finally, I want to maybe spend a couple of minutes touching upon how I see the future of biomedical AI evolving.

Overall, I believe it's not a question of if AI will transform biomedicine; it's rather a question of when and how. The very specific thesis I have is that, given how multimodal biomedical data is in nature, and with all the progress in transformers, self-supervised learning, and large language models, we have an incredibly powerful framework to leverage all this richness at scale and truly build foundational medical AI models.

So I think that is incredibly exciting. You've already been here for far too long, so I'm not going to ask you to recognize these people, but they're famous physician-scientists, and some of them went on to win Nobel Prizes. What I want to say here is that there's no reason for a scientist to be separate from a physician.

The two can be combined, and that's what I also want to convey about our AI systems. We don't have to separate clinical applications and biological applications. When we combine them, we are going to discover a lot of new insights, and that's going to accelerate biomedical research, lead to new discoveries, and ultimately be used to eradicate diseases, advance human healthspan, and drive human potential forward.

So-- Good question. I don't actually know who these three are. Sure. I think the rightmost one is Alexander Fleming, and then Jonas Salk, and then Paul Ehrlich. So Fleming is penicillin, Salk is the polio vaccine, and Ehrlich did a bunch of different things. And so maybe I'll ask this question to all of you.

Which field do you think AI will win its first Nobel Prize in? You don't have to answer, just think about it. What's the complete set of Nobel Prize fields? I think there's six. No, economics is not a Nobel Prize. No, there's like eight Nobel Prizes.

Oh, OK. You could say economics is a Nobel Prize. Some would say it's not a real Nobel field, but it's at least associated with the Nobel Prize.
