Stanford CS25: V2 | Biomedical Transformers
Chapters
0:00 Introduction
1:10 Overview
9:32 MultiMedQA
17:43 Selective Prediction
26:39 Evaluation
31:35 Qualitative Examples
33:19 Clinical Language Models
34:34 Key takeaways
39:39 Questions
47:59 Proteins
53:23 Genomics
58:39 DeepMind
00:00:00.000 |
So making you all see the speaker notes was not part of the plan, but I'm glad to be here. 00:00:12.720 |
And my name is Vivek Natarajan, and I am a research scientist in the Health AI team at Google. 00:00:21.520 |
Growing up in India, my parents always wanted me to be a doctor, to be precise, a medical 00:00:26.640 |
doctor, but unfortunately I was probably not good enough to memorize all of the biology 00:00:31.480 |
textbooks that you had to do in case you wanted to crack the medical entrance examinations. 00:00:37.400 |
So I ended up becoming a computer scientist instead. 00:00:42.000 |
But as a great man once said, you can't connect the dots looking forward, you can only connect them looking backwards. 00:00:48.640 |
So through a rather long-winded path, not too dissimilar from how we actually train 00:00:52.920 |
our neural networks, I ended up working in medicine again, this time armed with the tools of AI. 00:01:01.700 |
And I can tell you that my parents are far happier with my life choices right now. 00:01:11.400 |
But digressions aside, my goal for this talk is to peel back the curtains and give you 00:01:17.760 |
a flavor of all the innovation that is happening at the intersection of AI and biomedicine 00:01:22.440 |
and how that is being catalyzed by transformers and large language models in particular. 00:01:29.160 |
So we will spend the first few minutes trying to work up from first principles why transformers 00:01:34.180 |
and large language models are a particularly good fit for biomedical data. 00:01:40.120 |
And then we will deep dive into a few papers covering a bunch of different biomedical application settings. 00:01:47.680 |
And finally, I'll present my views on how this field is likely going to evolve in the future. 00:01:54.440 |
And even though my voice or tone may not exactly sound that way, I am incredibly excited by this field. 00:02:03.240 |
And I think we have an incredible opportunity in front of us to advance human health and wellbeing. 00:02:07.980 |
And my hope at the end of this talk is you all will feel the same way as I do today. 00:02:20.040 |
And sorry, I'm going to pick people who are in person to answer. 00:02:39.000 |
So it's like biological data is a problem, but I have a lot of them. 00:02:47.120 |
Medical doctors are expensive, and a well-paid job is just like memorizing, as you said. 00:03:04.160 |
So I think all of you are on the right track. 00:03:06.920 |
And so maybe if you just look at different kinds of biomedical data, for example, what are doctor notes? 00:03:16.720 |
OK, I did not say that, but let's just call it a sequence of doctor-speak, or doctor notes. 00:03:24.760 |
Similarly, if you were to look at electronic medical records, what are they? 00:03:27.640 |
They are essentially a sequence of a person's encounters with the medical system. 00:03:34.600 |
What about proteins going deeper into the biological stack? 00:03:37.800 |
They are nothing but a sequence of amino acids linked together by peptide bonds. 00:03:47.080 |
I think that's how we store medical records, from them. 00:04:05.320 |
So this is in the Wellcome Collection in London, and this is actually a printout of the full human reference genome. 00:04:18.320 |
And as you can see, there's a bunch of ATGCs. 00:04:21.640 |
The entire printout contains, I think, over 130 volumes on that shelf. 00:04:29.320 |
And it's a four-point font with precisely 43,000 characters per page. 00:04:34.720 |
So that is how big the human reference genome is: over three billion base pairs. 00:04:42.160 |
And so again, the genome is nothing but a sequence of nucleotide base pairs. 00:04:47.180 |
So what we are essentially seeing over here is sequences are everywhere in biomedical data. 00:04:53.040 |
And what is the best neural network architecture for modeling them? 00:04:59.360 |
And I guess since you are all in this course, I don't have to convince you that the answer is transformers. 00:05:09.120 |
But maybe I'll just offer a few reasons over here. 00:05:12.120 |
Firstly, as you can see, the data itself is multimodal in nature. 00:05:19.880 |
And as someone pointed out, transformers have proven remarkably good at gobbling up pretty much any kind of data. 00:05:26.680 |
And we are really seeing this remarkable convergence across fields, whether that's speech, NLP, or vision. 00:05:33.080 |
I mean, pretty much everywhere we are using transformers, and I think biomedicine is no exception. 00:05:38.680 |
I think secondly, transformers are far more effective at modeling complex long-range interactions in sequences. 00:05:45.800 |
And this property is particularly important in the biomedical domain. 00:05:49.200 |
And we will cover this in more detail later in the talk. 00:05:53.520 |
And finally, as again, someone pointed out, these data sets can be quite big. 00:05:58.040 |
And you can easily get into the billions of tokens territory. 00:06:01.240 |
And this is where transformers with all the parallelizable operations and the relative 00:06:04.960 |
ease of training-- and maybe someone should try training an LSTM or an RNN on these kinds 00:06:09.160 |
of data sets-- you'll realize that transformers are much better suited for the kind of data sets we have in biomedicine. 00:06:16.920 |
So yeah, I think there are a few more reasons as well, but I think these are the key ones 00:06:20.440 |
as to why transformers are particularly well-suited for biomedical data sets and tasks. 00:06:31.640 |
So now in the next part of this talk, we will dive deep into a few papers applying transformers to biomedical data. 00:06:38.960 |
We'll start with clinical applications first, and then go gradually deeper into the biology 00:06:44.440 |
stack looking at proteins and genomic applications as well. 00:06:49.000 |
And what you will observe is that while transformers and large language models by extension are 00:06:54.400 |
a great fit, often you have to innovate not just on the modeling side, but also on the 00:07:00.280 |
data and evaluation side to make these application scenarios really work. 00:07:07.400 |
And so the first paper I want to talk about over here is this recent work from our team 00:07:10.960 |
called Large Language Models Encode Clinical Knowledge. 00:07:15.760 |
The motivation for this work is actually quite straightforward. 00:07:18.660 |
So if you look at medicine, it is a humane endeavor, and language is at the heart of 00:07:23.000 |
it, facilitating interactions between people and those who provide care for them. 00:07:27.360 |
Unfortunately, if you look at a lot of medical AI systems developed to date, these are all 00:07:31.760 |
narrow, single-task, single-domain models lacking interactive and expressibility capabilities. 00:07:38.240 |
And as a result, what has happened is there is this discordance between what these models 00:07:42.400 |
can do and what is expected of them by patients and care providers and others. 00:07:49.680 |
And this in turn has, I think, prevented broad uptake of medical AI. 00:07:55.200 |
And you can see that, for example, we don't really have AI in many clinics out there, 00:07:58.000 |
like helping us with diagnosis and so on and so forth. 00:08:01.600 |
But the recent progress with transformer-based large language models, it offers us an opportunity 00:08:06.560 |
to change all of this and redesign and rethink medical AI systems with language at the heart 00:08:11.880 |
of it, mediating human-AI interactions between doctors, researchers, and patients. 00:08:19.920 |
And I would be remiss if I did not point out that there has been a large volume of work 00:08:22.960 |
in this space, particularly in the last few years. 00:08:25.320 |
There have been various attempts to train language models in the biomedical domain with 00:08:31.560 |
models of various different sizes on different corpuses of biomedical data. 00:08:37.320 |
And while this is exciting, the quality bar for applications in the medical domain is quite high. 00:08:43.680 |
And so what is missing is that there are actually not many good evaluation benchmarks and evaluation frameworks. 00:08:51.120 |
So we don't have the equivalent of a BIG-bench in medicine. 00:08:54.320 |
And hopefully, you guys have covered BIG-bench before. 00:08:57.880 |
And so BIG-bench is this benchmark where you can assess large language models across a wide variety of tasks. 00:09:03.440 |
But we don't have an equivalent of that in the medical domain. 00:09:07.680 |
And further, if you look at the evaluations that are typically used in these previous 00:09:11.480 |
studies, they only look at objective metrics like accuracy or natural language generation metrics. 00:09:19.720 |
But these fail to capture the nuances of real-world use cases in clinical settings. 00:09:24.920 |
So what we essentially needed was a good benchmark and task, and also a good evaluation framework. 00:09:33.700 |
And so to address this unmet need and assess the potential of LLMs in medicine, in our 00:09:39.040 |
team, we decided to focus on the medical question answering task. 00:09:44.000 |
Because answering medical questions is actually quite challenging. 00:09:47.320 |
It requires reading comprehension skills, the ability to accurately recall medical knowledge, and the ability to reason over it. 00:09:55.920 |
And furthermore, the Q&A task is general enough and can subsume a bunch of different application 00:10:00.280 |
settings such as summarization of clinical notes, clinical decision support, and also 00:10:05.880 |
primary care triaging of patient concerns and so on. 00:10:14.160 |
And so when we looked at the literature over here, what we saw was that there were several 00:10:17.080 |
data sets floating around assessing model capabilities in a bunch of different settings. 00:10:21.940 |
So what we decided was we should probably just unify all of them and put them together in one benchmark. 00:10:26.800 |
And so we did that, and we called it MultiMedQA. 00:10:29.040 |
And so if you look at it, this benchmark now covers medical question answering data sets 00:10:32.960 |
from a bunch of different settings, such as professional medical questions, like the USMLE-style questions. 00:10:39.940 |
It also includes medical research questions, those based on PubMed abstracts and so on, 00:10:44.240 |
and also questions from live users and consumers asking about medical information. 00:10:50.080 |
And also the setting changes, it could be closed domain or open domain, and the model 00:10:53.160 |
may be expected to produce a long form answer in one setting and maybe a short form answer in another. 00:10:58.400 |
And finally, we saw that the Q&A data sets which covered consumer questions-- yeah? 00:11:06.680 |
I have a quick question: how do you evaluate long form answers? 00:11:15.200 |
People on Zoom might not be able to hear the questions asked in person, so please repeat the questions. 00:11:22.000 |
So the question was, how do we evaluate long form answers? 00:11:28.320 |
And so very quickly, when we looked at the data sets that actually provided consumer 00:11:33.760 |
medical questions, we found them to be quite small in size. 00:11:37.840 |
And so we went out to Google and looked at the most frequently asked consumer medical questions. 00:11:42.520 |
And so we curated a data set, and we added that to the benchmark, and we call that the HealthSearchQA data set. 00:11:50.400 |
I'll come back to the statistics later, I'm sure. 00:11:58.620 |
So if you look at the consumer medical questions, they are quite short in nature. 00:12:04.080 |
And so they come from the HealthSearchQA and the LiveQA data sets, whereas I think if you 00:12:06.800 |
look at the USMLE-style questions, these are like long vignettes. 00:12:10.400 |
And so doctors have to really, really carefully read through them and come up with the right 00:12:13.600 |
answer, which often involves a process of elimination. 00:12:16.200 |
So again, very, very different application settings, and so the model has to really adapt 00:12:21.920 |
and understand the task to do well in all these settings across the board. 00:12:27.000 |
And LiveQA is interesting because the answers, the reference answers over here, were actually curated by experts. 00:12:33.440 |
So that's another good comparison point for us. 00:12:37.240 |
And so in terms of statistics, we had a total of seven data sets in this benchmark. 00:12:43.320 |
As I said, we cover professional medicine, medical research, and consumer medical questions. 00:12:47.320 |
They're, again, of various different sizes and can be long form, short form, open domain, or closed domain. 00:12:54.680 |
So very diverse, and I think it provides a very comprehensive evaluation of models in this domain. 00:13:06.400 |
The next question, which again I think someone asked earlier, was: how do we evaluate these models? 00:13:10.880 |
And as I mentioned before, automated metrics are actually deeply unsatisfactory because 00:13:15.440 |
they fail to capture the nuances of real-world clinical applications. 00:13:19.440 |
So what we did was actually heavily inspired by some of Stephen's work over here, was to 00:13:23.560 |
put together a human evaluation framework for assessing these long-form answers. 00:13:31.380 |
The first part was evaluation by clinicians, and we asked them to rate the model responses 00:13:36.200 |
along 12 axes pertaining to factuality of the responses, ability to recall medical knowledge, 00:13:42.120 |
medical reasoning, and also the potential of harm and bias in these responses. 00:13:48.960 |
But if you look at the potential end users of such medical Q&A systems, these are likely going to be lay users and patients. 00:13:54.720 |
So it is also important to get these answers evaluated by them as well. 00:13:58.300 |
And so we also additionally asked a pool of lay users as to how helpful and actionable they found the answers. 00:14:07.200 |
And so that was our evaluation framework, and we also had the benchmark fixed. 00:14:12.480 |
So now we move on to the fun part of building and aligning LLMs to the medical domain task. 00:14:19.000 |
So in this work, we decided to build on the PaLM family of language models. 00:14:26.360 |
So, but very quickly, I believe this is still the largest publicly announced densely activated 00:14:32.480 |
decoder-only large language model, with the largest one being 540 billion parameters in 00:14:38.120 |
A few more details: the model is trained on 780 billion tokens, 25% of which is multilingual. 00:14:46.160 |
The data comes from a bunch of different sources, including social media conversations, web 00:14:51.400 |
pages, books, GitHub, and Wikipedia, and so on and so forth. 00:14:55.280 |
And at the time of release, the model was state-of-the-art on many NLP reasoning benchmarks, and also 00:14:59.920 |
was the first model to exceed the average human performance on BIG-bench. 00:15:03.800 |
Further, over the last year, PaLM-derived models were shown to be super useful in a 00:15:08.560 |
bunch of different application settings, including for code generation, which was the PaLM-Coder 00:15:12.640 |
model, in robotics, the PaLM-SayCan model, and also for answering math and science questions, which was the Minerva model. 00:15:19.240 |
And so we thought PaLM was a very good foundation model for us to build on and use in the medical domain. 00:15:24.520 |
And overall, I think PaLM is a true marvel of engineering, but I will refer you all back 00:15:28.080 |
to Aakanksha's paper on this for more details. 00:15:34.960 |
And again, in late October last year, Jason Wei and a few others at Google Brain came 00:15:39.400 |
out with the Flan-PaLM variant of the PaLM model, and this is basically the instruction-tuned counterpart. 00:15:45.480 |
And this model was even better than PaLM, and I believe this is still the state-of-the-art 00:15:49.200 |
on many benchmarks such as MMLU and TyDi QA, and I think it exceeds PaLM performance by a significant margin. 00:15:58.520 |
So we decided to build on the Flan-Palm model, and we applied a combination of prompting 00:16:03.720 |
strategies including few-shot prompting, chain-of-thought reasoning, and also self-consistency to the 00:16:09.220 |
540-billion-parameter variant, and we evaluated it on the MultiMedQA datasets that had the multiple-choice format. 00:16:17.040 |
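To make that prompting recipe concrete, here is a minimal Python sketch of few-shot chain-of-thought prompting combined with self-consistency sampling. The `generate` argument stands in for whatever LLM sampling call is available, and the exemplar format and sampling parameters are illustrative assumptions, not the actual Flan-PaLM setup.

```python
import collections

def build_cot_prompt(exemplars, question):
    """Assemble a few-shot chain-of-thought prompt: each exemplar pairs a
    question with a worked rationale and a final answer choice."""
    parts = []
    for ex in exemplars:
        parts.append(f"Question: {ex['question']}\n"
                     f"Explanation: {ex['rationale']}\n"
                     f"Answer: {ex['answer']}\n")
    parts.append(f"Question: {question}\nExplanation:")
    return "\n".join(parts)

def self_consistency_votes(generate, prompt, num_samples=11):
    """Sample several chain-of-thought decodes at non-zero temperature and
    tally how often each final answer choice appears."""
    votes = collections.Counter()
    for _ in range(num_samples):
        decode = generate(prompt, temperature=0.7)
        answer = decode.split("Answer:")[-1].strip().split()[0]  # e.g. "(A)"
        votes[answer] += 1
    return votes
```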
And we found that this model was really, really good. 00:16:19.600 |
At the time of publication, this model on the USMLE MedQA dataset exceeded the previous state-of-the-art. 00:16:34.300 |
And so you see that the accuracy improved over the previous state-of-the-art at the time of publication by more than 17%. 00:16:39.840 |
And I believe this was the first LLM-based AI system to obtain a passing equivalent score, which is around 60%, on these questions. 00:16:49.400 |
And similarly, when we looked at other MCQ datasets in the benchmark, for example, MedMCQA, 00:16:54.320 |
which is a dataset of Indian medical entrance examination questions, the model was again state-of-the-art. 00:16:58.560 |
On PubMedQA, which was question answering based on PubMed abstracts, again, the model 00:17:02.960 |
was state-of-the-art at the time of publication. 00:17:05.760 |
And same story on MMLU clinical topics as well, which include genetics, anatomy, professional 00:17:12.240 |
medicine, clinical knowledge, and a bunch of other topics in there. 00:17:19.840 |
And then when we started looking at the scaling plots, what we again saw was that the performance 00:17:23.800 |
seemed to be improving as we scaled the model from 8 billion to 62 billion to 540 billion. 00:17:31.000 |
And so what this basically suggested was that these general purpose large language models 00:17:34.320 |
trained on public internet seemed to encode clinical knowledge pretty well. 00:17:38.520 |
And their medical reasoning abilities tend to scale with model parameter size. 00:17:43.480 |
We also did another experiment when we looked at selective prediction. 00:17:48.360 |
And we used the self-consistency votes to determine when to defer. 00:17:52.920 |
And this is important in clinical settings because doctors communicate when they don't know something. 00:17:59.920 |
And if our AI systems are going to be used in clinical settings, for example, for diagnosis, 00:18:02.960 |
they should be able to tell you when they don't know something. 00:18:06.000 |
And so what we observed here was, even with this fairly crude metric, 00:18:08.920 |
we were getting a linear improvement in performance as we changed the deferral threshold. 00:18:16.880 |
But in practice, it's actually quite inefficient because you're generating multiple decoding samples. 00:18:23.960 |
Just to be clear, what is the deferral fraction? 00:18:31.360 |
And that's determined based on the self-consistency votes. 00:18:37.360 |
So if you apply variance in self-consistency samples, you don't-- 00:18:42.360 |
So are models trained to be capable of deferring themselves? 00:18:43.360 |
They're generating outputs saying, I don't know. 00:18:45.360 |
Because they're just trained on this expert prediction task. 00:18:47.360 |
The PubMed QA has some answers which are maybe. 00:18:48.360 |
But again, we don't explicitly fine-tune the models over here. 00:18:51.360 |
So does it imply that this metric runs in the wrong [INAUDIBLE]?? 00:18:52.360 |
So then how-- and then basically, how do you put the benchmark [INAUDIBLE] where the output 00:19:18.620 |
So this is primarily based on the reference in the data sets, which is-- so this is all multiple choice. 00:19:18.620 |
So we already know between the four options or five options which one's the right one. 00:19:29.160 |
So I'll come back to the clinician evaluation a bit later. 00:19:35.240 |
So if you know about self-consistency prompting, what we do is we generate multiple decodes 00:19:42.180 |
And then we see the number of times the highest-ranking answer is voted. 00:19:47.760 |
And based on that, you can fix a threshold and say, if it's below this number, I'm going 00:19:52.240 |
So if, say, the majority answer comes up in your self-consistency decode only like n times 00:19:56.760 |
out of k or whatever, then if that n is too small, then it's very likely the model's uncertain. 00:20:10.080 |
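As a rough sketch of the deferral rule being described, assuming the vote counts come from a self-consistency run like the one sketched earlier (the threshold value here is illustrative, not the one used in the paper):

```python
def answer_or_defer(votes, num_samples, deferral_threshold=0.5):
    """Selective prediction via self-consistency: answer only if the majority
    answer received at least `deferral_threshold` of the sampled votes,
    otherwise defer (e.g., hand the question to a clinician)."""
    answer, count = votes.most_common(1)[0]
    if count / num_samples < deferral_threshold:
        return None  # defer
    return answer
```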
So we don't really see a taper-off in this plot, so it's natural to ask what the rest of the curve looks like. 00:20:10.080 |
I think if you plot it further, it will flatline. 00:20:22.280 |
I mean, if you're saying no to every question, that's not useful at all. 00:20:24.400 |
So you want to have a reasonable deferral percentage over here. 00:20:36.680 |
But in real-world use cases, probably, I think that number should be much lower. 00:21:00.700 |
I think balanced accuracy might be a better metric. 00:21:03.720 |
And in one data set, the PubMed QA data set, the skew was pretty bad, and I think no one really accounts for that. 00:21:08.840 |
So if anyone's reporting SOTA numbers on that data set, you should just distrust them. 00:21:15.320 |
But again, I think, as I mentioned, these accuracy metrics are good for publicity and 00:21:20.560 |
pushing up benchmark numbers and so on and so forth. 00:21:22.440 |
But the real evaluation is human evaluation of the long-form answers. 00:21:25.080 |
And that's what I'll come to in the next part. 00:21:32.400 |
I mean, we were getting SOTA results on these benchmarks, and we were very happy. 00:21:37.160 |
And so what we did was-- I mean, one thing you'll observe is that I have so far only reported 00:21:40.680 |
results on multiple choice questions, short-form answers. 00:21:44.340 |
So what was left for us to do was to take these answers, take these models, and generate 00:21:47.920 |
long-form answers to the other data sets that we had and get them human-evaluated. 00:21:52.400 |
And I think that is where the real project began. 00:21:55.440 |
When we looked at the evals by experts and laypeople, it revealed very key gaps and limitations in these models. 00:22:07.320 |
We were often seeing that these models were hallucinating or producing incomplete responses. 00:22:11.920 |
And when we asked experts whether they preferred clinician-generated answers or these model-generated 00:22:16.600 |
answers, they almost always preferred clinician-generated answers. 00:22:23.360 |
Sorry, I didn't get to this earlier, but I've got these evaluators. 00:22:33.200 |
So what these previous results showed was, while these models already encode some degree 00:22:37.120 |
of clinical knowledge, to be really used in actual real-world settings, you need to align 00:22:41.640 |
these models better to the safety-critical requirements of the medical domain. 00:22:45.680 |
But a big challenge is we did not have any kind of supervised or feedback data. 00:22:49.560 |
And so we really need the alignment technique to be data-efficient. 00:22:53.920 |
But thankfully, we had prompt tuning, which was introduced by Brian Lester and a few others at Google. 00:23:01.600 |
And how this method works is it essentially freezes the big LLM model and only learns 00:23:08.680 |
an additional small set of prompt vectors, which can then be used to condition the model 00:23:17.260 |
And the nice thing about this is it allows very easy reuse of the model across tasks and domains. 00:23:24.120 |
And you only need to carry on these additional prompt parameters. 00:23:28.520 |
And these tend to be much smaller than the billions of parameters that you have in the base LLM. 00:23:34.760 |
And the other good thing is this is very computationally efficient as well. 00:23:37.720 |
So if you were to do end-to-end fine-tuning, often in our compute infrastructure, even 00:23:42.120 |
with a few thousand examples, that would take a few days. 00:23:45.000 |
Whereas with instruction prompt tuning, A, given the data set size is also reduced, the number of examples is in the hundreds. 00:23:52.160 |
And B, you're just updating the prompt token vectors. 00:23:55.120 |
It meant that we were able to get model updates in a few hours. 00:23:57.880 |
And so that was really fast and enabled really quick iterations for us. 00:24:02.960 |
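As a minimal PyTorch-style sketch of the underlying prompt tuning idea (in the spirit of Lester et al., not the actual Med-PaLM instruction prompt tuning code): the base LLM is frozen and only a small matrix of soft prompt embeddings receives gradients. The wrapper class, dimensions, and initialization here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepends trainable soft prompt vectors to the input embeddings of a
    frozen language model. `base_lm` is a placeholder for any model that
    accepts an `inputs_embeds` argument."""
    def __init__(self, base_lm, embed_dim, num_prompt_tokens=100):
        super().__init__()
        self.base_lm = base_lm
        for p in self.base_lm.parameters():
            p.requires_grad = False          # freeze the billions of LLM weights
        self.soft_prompt = nn.Parameter(
            torch.randn(num_prompt_tokens, embed_dim) * 0.01)

    def forward(self, input_embeds):
        # input_embeds: [batch, seq_len, embed_dim]
        batch_size = input_embeds.shape[0]
        prompt = self.soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)
        # Only self.soft_prompt receives gradient updates during tuning.
        return self.base_lm(inputs_embeds=torch.cat([prompt, input_embeds], dim=1))
```

Because only the soft prompt (a few hundred thousand values rather than billions of weights) is updated, each tuning run can plausibly finish in hours rather than days, which is the iteration speed being described here.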
So this was how we put together the final Med-PaLM model. 00:24:07.500 |
So we used instructions and exemplars from a panel of expert clinicians. 00:24:19.560 |
And these were in the order of hundreds, not like thousands or tens of thousands. 00:24:24.920 |
There's an instruction, followed by a model answer, followed by an explanation. 00:24:31.160 |
And so the final Med-PaLM model is basically all of Flan-PaLM plus these additional soft 00:24:36.560 |
prompt vector parameters, which are used to align the model to the requirements of the medical domain. 00:24:41.480 |
And why this works well is because, as we have seen before, the model already has the medical knowledge in it. 00:24:45.800 |
All we need is to teach the model how to use it properly in the given application setting. 00:24:49.720 |
And that's what these prompt parameters do for us. 00:24:53.760 |
So the question I wanted to ask is, nowadays, you've probably seen a lot about RLHF. 00:24:58.560 |
And given the fact that you have all of these human preferences expressed by your evaluators, 00:25:02.520 |
can you guys explain, have you guys tried playing with a reward or preference model for RLHF? 00:25:11.360 |
Yeah, I think you can think about different stages of model development. 00:25:14.480 |
So this is pre-deployment and release in the real world. 00:25:17.880 |
So you can't put a crappy model out there in the real world. 00:25:20.460 |
So even before doing that, if you can get maybe 100 examples from whatever experts that 00:25:25.020 |
you can get hold of and use that to prompt-tune your model, that's better. 00:25:27.420 |
That's a much better starting point before you expose the model to the real world and 00:25:30.740 |
collect preferences from real users at scale. 00:25:33.180 |
And so I think RLHF is also much less sample-efficient compared to instruction prompting, again, 00:25:39.540 |
because you're probably trying to update your entire model as well. 00:25:43.160 |
So I think this is a very good starting point. 00:25:44.580 |
And so they can both be combined depending on the lifecycle of the model. 00:25:48.900 |
Are your evaluations, human evaluations, public? 00:25:59.020 |
Are the human evaluations contained publicly within the data set? 00:26:03.820 |
You mean the model responses and what the human evaluations are? 00:26:09.380 |
So far, not considering releasing them, but maybe we can. 00:26:14.700 |
Well, I was thinking, you have a bunch of data in the preferences, so train the model 00:26:19.420 |
to express those preferences, and then use that model for RLHF in a medical community. 00:26:24.020 |
So if I wanted to train a reward model, that data is what I would need to train that reward model. 00:26:29.820 |
I think the evaluation data set is-- I'll talk about this a bit later. 00:26:32.980 |
But I think if we scale it up-- and we are doing it right now-- I think we can release it. 00:26:36.420 |
And that will be, I think, a good resource of what you're trying to do. 00:26:42.420 |
And now we took the long-form answers from Med-PaLM and compared them to the Flan-PaLM model, 00:26:46.780 |
as well as to answers generated by expert clinicians. 00:26:49.820 |
And as I said, we have two parts to the human evaluation. 00:26:52.020 |
One is by expert clinicians, and then the other one is by lay users. 00:26:58.460 |
On the 140-odd questions that we got these evaluation results on, what we observed typically 00:27:05.420 |
across the board was when we looked at different axes, while the Flan-PaLM model would be quite 00:27:11.420 |
terrible, honestly, the Med-PaLM model would do much better and typically close the gap to clinicians. 00:27:17.280 |
So on this axis, you see that the Flan-PaLM model has probably a 60% accuracy in terms of agreement with the scientific consensus. 00:27:26.900 |
The Med-PaLM model improves on that quite a bit and closes the gap to clinicians over here. 00:27:34.940 |
Over here, you see the clinician's rating on the axes of how well the model can retrieve 00:27:41.460 |
medical knowledge, how well it can reason about it. 00:27:44.980 |
And again, we see the same trend as in the previous slide. 00:27:50.020 |
So the left column-- the left two columns are evidence of correct comprehension, retrieval, and reasoning. 00:28:05.200 |
So you can have evidence of correct comprehension, also evidence of incorrect comprehension. 00:28:26.000 |
But this one pertains to incorrect or missing content. 00:28:30.740 |
But this was an interesting one, because what-- when we were doing this instruction prompt tuning thing, 00:28:34.740 |
we were teaching the Med-PaLM model to produce longer and more complete answers. 00:28:40.340 |
And so you'd see a few qualitative examples later. 00:28:42.700 |
But what ended up happening in the process was sometimes the model was maybe producing some incorrect content as well. 00:28:47.580 |
So that's why you see that maybe in this particular axis, the Flan-PaLM model was slightly better. 00:28:51.900 |
But again, this was much worse compared to clinicians. 00:29:01.940 |
It is more like it's something completely out of context. 00:29:09.780 |
So we also looked at possible and extent and likelihood of harm. 00:29:20.220 |
And again, we see that with the instruction prompt tuning, we're able to close the gap to expert clinicians. 00:29:33.660 |
So I think, basically, the category is death, and then the clinicians are at, like, 6%-- so would 00:29:41.180 |
you talk more about and clarify exactly what that means and what you're talking about? 00:29:49.180 |
So it's basically-- so there might be certain conditions or pathologies or diagnosis, right? 00:29:55.100 |
And if, for example, the clinician has not caught that or has maybe given a response 00:30:00.700 |
that does not appropriately convey the severity of the condition, then that could potentially 00:30:07.580 |
And so that's what we were trying to capture over here. 00:30:16.020 |
And there's a framework for it called the AHRQ framework. 00:30:19.500 |
And so we've linked that in the paper as well. 00:30:21.460 |
And so I think that gives you a very detailed notion of harm and bias, and I would refer you to that. 00:30:26.460 |
But at a high level, this is what I'm talking about over here. 00:30:29.380 |
So later, when I read the chart, and I see the clinicians had 5.7% on extent of possible harm-- 00:30:38.180 |
Does that mean that, like, they recommend something that could kill the patient? 00:30:45.140 |
So it's basically a misdiagnosis or maybe failing to capture the severity of a diagnosis. 00:30:50.300 |
This is typical in life-threatening conditions. 00:30:53.420 |
So more often than not, it's not outright mistakes, but rather just missing out on details. 00:31:05.940 |
And then, as I said, the other axis of human evaluation was with lay users. 00:31:10.740 |
And so we asked them, how well does the model address the intent of the question? 00:31:16.020 |
And again, we saw, with instruction prompt tuning, Med-PaLM closing the gap to clinicians. 00:31:20.100 |
And then we asked them how helpful the responses were. 00:31:24.300 |
And what we see is that while Flan-PaLM responses were considered to be helpful, like, 60% of 00:31:28.100 |
the time, the number improved to 80% for Med-PaLM, but it was still fairly low compared to clinician-generated answers. 00:31:38.020 |
And so what you see is that physicians-- and this is typically because they work in time-constrained 00:31:42.100 |
settings-- their answers tend to be precise and succinct. 00:31:48.420 |
But sometimes it's very hard, as lay users or patients, to decipher and decode the answer 00:31:53.940 |
And so what I think language models like Med-PaLM can help with is actually converting the physician's 00:31:58.700 |
speak to something that's more easily digestible by lay users. 00:32:01.980 |
And so this is where I think how these models will likely fit in clinical settings in the 00:32:06.980 |
near term, where they are going to augment physicians in terms of interacting with patients 00:32:10.660 |
and other physicians and researchers as well. 00:32:16.620 |
In this example, though, if I look at it, it actually looks like 00:32:22.700 |
the physician's answer is more understandable. 00:32:27.500 |
And as a patient, it's not obvious I would prefer the model's answer. 00:32:38.860 |
And so that's why I think we're still seeing lay users rate Flan-PaLM answers to be helpful only around 60% of the time. 00:32:45.140 |
So it's not perfect by any means, but I think this is where there is a complementarity element over here. 00:32:53.340 |
And so when we ask people how easy it is to interpret doctor notes or recommendations, often the answer is: 00:33:01.100 |
I need to go back to Google and search for what these terms mean, what these abbreviations are. 00:33:05.900 |
And so I think this is where a language model can come and take that note and convert that 00:33:07.860 |
into something that's more easily digestible. 00:33:10.020 |
So I think that's the opportunity over here, I feel. 00:33:21.940 |
But I also want to maybe very quickly point out a very recent work which came out last 00:33:25.140 |
week with this rather provocative title, Do We Still Need Clinical Language Models? 00:33:29.660 |
And by clinical language models, they meant smaller models which are trained in domain 00:33:34.820 |
with clinical data such as medical notes and records and so on and so forth. 00:33:41.180 |
And what this paper basically suggests is that smaller, fine-tuned, in-domain LLMs are still better than larger general-purpose LLMs used with in-context learning. 00:33:49.340 |
In this paper, I think they evaluated on GPT-3 with in-context learning. 00:33:54.540 |
So I think that's a pretty interesting and neat observation. 00:33:56.540 |
I think there's a lot of value for smaller in-domain LLMs such as PubMedGPT and a few others. 00:34:02.820 |
But I think one thing that this paper does not do is consider in-context learning-- sorry, instruction prompt tuning. 00:34:08.100 |
And I think that's where some of the benefits of these larger general-purpose LLMs shine. 00:34:12.740 |
And again, we haven't done any in-domain LLM pre-training on these large general-purpose models. 00:34:21.060 |
But that's, again, an option for us as well to do down the line. 00:34:24.700 |
So you can take these 540 billion parameters and then still train it on medical notes or 00:34:27.940 |
whatever domain-specific data that you can get hold of. 00:34:29.780 |
And hopefully, that will probably further improve the performance. 00:34:34.460 |
So key takeaways so far-- what I wanted to convey was general-purpose LLMs, it looks like they encode clinical knowledge surprisingly well. 00:34:42.940 |
And performance on medical reasoning does seem to improve with scale. 00:34:46.340 |
However, these models, I don't think, can be directly used out-of-the-box in clinical 00:34:50.940 |
And they need to be aligned with the safety-critical requirements of the medical domain. 00:34:55.020 |
And I think instruction prompt tuning is an extremely efficient technique, both on the data and the compute side. 00:35:00.780 |
And we should probably use it more often, depending on the application-- and hopefully, APIs start supporting it. 00:35:07.140 |
And these models appear to be closing the gap to expert clinicians, at least on this benchmark. 00:35:13.300 |
And while this is hugely exciting and has profound implications-- you can all probably 00:35:17.460 |
dream up and imagine the application scenarios over here-- I think comprehensive benchmarks 00:35:22.640 |
and evaluation frameworks are necessary in order to further assess and improve these models. 00:35:35.460 |
A lot of it is because these data sets tend to get locked in silos with privacy and other 00:35:54.100 |
kinds of regulations, which prevent them from being put out there in the real world. 00:35:57.460 |
So you have to have HIPAA-compliant systems for storage and so on and so forth. 00:36:01.020 |
So it's very difficult to get data out of these silos and put together an open benchmark. 00:36:06.820 |
So honestly, I feel like that's probably not going to improve the scale of these data sets. 00:36:11.540 |
At least the open version of these data sets are going to remain quite small compared to 00:36:17.220 |
the big LM training data sets or the computer vision data sets on natural images and so 00:36:21.780 |
But what may happen in the future is we may have more distributed federated evaluation 00:36:25.940 |
settings where you take the model into these private silos and get them evaluated on. 00:36:31.900 |
So they are never exposed and put out there in the public. 00:36:34.380 |
But rather, we can have these federated evaluation settings. 00:36:37.660 |
So I think that there's some work on that already. 00:36:43.300 |
So the question over here was why medical data sets are smaller compared to natural 00:36:55.060 |
image data sets in computer vision or LM training data sets and so on and so forth. 00:36:58.460 |
What do you think are some of the earliest applications of medical LLMs deployed in the real world? 00:37:06.460 |
I think the first set of use cases are probably going to be not diagnostic in nature. 00:37:13.180 |
The question was, what do you think are the use cases of medical LLMs in medical industry 00:37:19.580 |
And so the answer is I think the first set of use cases that we are going to see are 00:37:23.380 |
probably going to be non-diagnostic in nature, but more around if a patient comes in and 00:37:28.980 |
interacts with a doctor, can you generate summary notes? 00:37:33.020 |
And can you do workflow tasks such as generating letters for insurance, for medications, for referrals, and so on? 00:37:41.100 |
I think these tasks are right up the alley of large language models. 00:37:43.780 |
And I think if not already, in the next six months to a year, we'll see a lot of these applications. 00:37:48.340 |
And I think that's going to make doctors' life, care providers' life much easier because 00:37:51.900 |
right now they're spending a lot of time doing these things and not actually providing care to patients. 00:37:59.100 |
Diagnostic use cases, I think, will take a lot more time. 00:38:01.860 |
The data sets, as we can see, are probably not there. 00:38:05.380 |
But I think in the long run-- and that is the dream setting, right? 00:38:08.780 |
And then maybe a follow-up is-- I'm assuming Med-PaLM is not open source. 00:38:14.740 |
What do you think the best open source model is for medical data? 00:38:20.700 |
And I think it depends on the-- so the question is, what is the best open source model for medical data? 00:38:27.380 |
I think it depends on the evaluation setting. 00:38:30.460 |
So I think the PubMedGPT model from the Stanford Foundation Models Group is quite strong. 00:38:36.380 |
I think GPT-3 or 3.5 or whatever variant, if you can bring in some domain-specific medical 00:38:41.340 |
data and do some in-domain tuning, I think that model can also improve quite a bit. 00:38:44.620 |
So I think those two would be my favorite starting points over here. 00:38:47.620 |
So I was curious, like, what do the soft prompts look like, with updates such as [INAUDIBLE] 00:38:59.660 |
It's-- you can just think of them as vectors corresponding to a few additional tokens. 00:39:06.820 |
So the question was, what do the soft prompt vectors look like? 00:39:15.340 |
You said-- you mentioned federated learning for the medical domain. 00:39:17.340 |
If you use models with billions of parameters, they usually have to run on-prem at hospital sites, 00:39:22.620 |
with low-quality infrastructure and on-prem data sets. 00:39:23.620 |
Do you really believe that federated learning-- 00:39:24.620 |
with, like, on-site hardware and on-site data sets-- 00:39:25.620 |
for all the teams that want to train these models is going to work? 00:39:39.740 |
So the question was, given a lot of the hospital systems and providers' networks are quite 00:39:46.340 |
low-tech and don't have good enough hardware, do you really think federated learning could 00:39:50.380 |
be used for distributed training of large-scale LLMs? 00:39:54.580 |
I think we are increasingly seeing a trend towards cloud. 00:39:57.980 |
And so a lot of these hospital systems are moving their storage and data and compute 00:40:03.700 |
to standard cloud providers like AWS or Azure or Google Cloud. 00:40:08.740 |
And so I think that helps, because these systems on the back-end side do have the compute to support this. 00:40:16.300 |
I think it's going to be a very gradual process. 00:40:18.420 |
So systems that have high-quality infrastructure, probably we're going to start with that first, 00:40:23.540 |
and then gradually work our way into the long tail. 00:40:26.300 |
But it also feels like something that will inevitably exist in the world. 00:40:30.380 |
So 10 years down the line, or 15 years down the line, when we have these distributed large-scale 00:40:33.980 |
LLM training systems, we'll always think back, "Why did I even doubt that this will not exist?" 00:40:40.140 |
It's so obvious it's something that has to exist, because that's where all the patient 00:40:46.180 |
It's just not clear whether that's going to be done by one company, whether that's going 00:40:48.900 |
to be done by a consortium of academic or industry groups, or whether governments are 00:40:53.620 |
going to be involved, and so on and so forth. 00:40:56.340 |
You mentioned cloud computing, but essentially, you say you're doing it federated and distributed, 00:41:00.660 |
but we're still uploading the data to probably the same compute warehouse, right? 00:41:06.420 |
So the question over here is, we're seeing cloud computing, but we are pretty much uploading all the data to the same warehouses. 00:41:14.380 |
But again, I think these are all going to be separate buckets with their own access 00:41:19.700 |
So that is how you can differentiate between them. 00:41:25.220 |
It doesn't seem like that's a good thing, but it makes sense that we're going to be 00:41:54.420 |
So the question was, have there been any studies in Med-PaLM looking at private information leakage. 00:41:59.020 |
One of the criteria for selecting the data sets that we used in the study was to not 00:42:03.460 |
include any kind of personally identifiable data or clinical data of that sort. 00:42:13.140 |
It's unlikely that we're going to have a lot of PHI data in the public data sets that we used. 00:42:20.060 |
But even when you're training on, say, one private corpus and then you're using it in 00:42:24.860 |
another application setting, you want to ensure that the model does not leak out any kind of private information. 00:42:31.140 |
So I think those sort of studies are necessary. 00:42:35.020 |
So the question is, what are the next steps in terms of improving these models further? 00:42:55.180 |
Being able to cite sources and especially take in authoritative sources and use that 00:43:00.260 |
in generating the answers and also communicating that to the users is very important. 00:43:03.580 |
I think how you communicate uncertainty is very important. 00:43:07.580 |
So we've gotten there to some extent using instruction prompt tuning, but I think that can be much, much better. 00:43:15.980 |
Again, I would stress on the evaluation side, looking at more data sets, which for example 00:43:20.900 |
may do a Q&A on health records or other kinds of medical data, I think that will be important. 00:43:26.940 |
And also extending the evaluation both in terms of scale, having a diverse panel of 00:43:31.380 |
clinicians support, and also in terms of the data that you're using. 00:43:34.740 |
Maybe adversarially modifying the questions to include demographic confounders or something like that. 00:43:39.820 |
I think those could all be interesting directions. 00:43:42.180 |
I think on the modeling side, the interesting question for me is again, this interplay between 00:43:46.880 |
smaller domain-specific LLMs versus large general-purpose LLMs and how that's going to play out. 00:43:54.460 |
There seems to be some evidence of emergence over here, especially with medical reasoning. 00:44:00.540 |
And so as you can see at lower scales, sometimes the performance is not good enough. 00:44:05.460 |
I mean, that's a good number, but that's just not viable. 00:44:07.920 |
But when you get to like 80%, 90%, products really become useful. 00:44:11.340 |
And so that we are seeing at bigger parameter sizes of these models. 00:44:16.920 |
I think it's still an open question over here. 00:44:20.740 |
Yeah, the question was, is hallucination an issue? 00:44:27.500 |
But I believe that you can control that fairly well with instruction prompt tuning and similar techniques. 00:44:36.860 |
And so I think it might have been overblown generally. 00:44:42.660 |
So especially when you are doing it in a particular domain, I think it's easier to control. 00:44:46.540 |
I'm just curious [INAUDIBLE] the extent to which the method reader or what it looks like. 00:44:54.820 |
I just think recently there's been a lot of [INAUDIBLE] 00:45:05.140 |
So I'm just curious, because this particular [INAUDIBLE] very, very relevant, and [INAUDIBLE] 00:45:12.340 |
Yeah, so the question was, there is a lot of talk and noise around hallucinations and large language models. 00:45:26.940 |
And in this particular application domain, it seems particularly relevant. 00:45:30.140 |
And so can you expand on that a little bit further? 00:45:35.180 |
So what we are seeing is, even with an order of a few hundred examples from expert clinicians, 00:45:40.380 |
teaching the model how to communicate medical information, that is good enough to get the 00:45:46.460 |
model to maybe stop hallucinating, or at least communicate its uncertainty in a better way. 00:45:53.760 |
So at least in this particular domain or this setting, it feels more tractable to us. 00:45:59.940 |
And the reason I'm saying this is we've looked at the answers qualitatively, and we are seeing 00:46:03.220 |
that the model does not tend to generate super long answers or make very confident predictions, 00:46:11.820 |
but rather the tone itself becomes very reserved. 00:46:16.100 |
And it starts using terms like, maybe this needs to be done further, or something like that. 00:46:22.420 |
So how well that actually correlates with the model's underlying representation of uncertainty 00:46:26.060 |
is still, I think, an open area of research. 00:46:29.260 |
But I think this is already promising for us, that it feels controllable in limited domains. 00:46:35.340 |
But if you have a general purpose LLM trying to answer pretty much everything about the world, it's probably much harder. 00:46:39.100 |
Do you think that would be a feature of the domain data itself? 00:46:40.100 |
Like, in medical situations, doctors are more reserved, perhaps, and don't speak with absolute certainty? 00:46:42.100 |
Or do you think it's more that you have just specialized? 00:46:43.100 |
Like, it could be something else entirely, also. 00:47:09.220 |
So my question is, do you think the way the model is behaving in this domain, is 00:47:15.540 |
that a feature of the data sets in the medical domain, and typically based on how doctors communicate? 00:47:24.060 |
And I think that's something we need to build on and use over here. 00:47:27.900 |
And hopefully, this kind of behavior is general enough and can be transmitted to the model, 00:47:33.380 |
even when it's used in non-medical settings, to be more reserved when it's communicating 00:47:38.940 |
and hallucinate less, and so on and so forth. 00:47:41.020 |
So I believe that that's one of the opportunities over here to use these benchmarks, come up 00:47:44.900 |
with methods that reduce hallucination, communicate uncertainty better, and then use that as a 00:47:49.500 |
bidirectional learning opportunity to improve the general-purpose LLMs as well. 00:47:54.580 |
So if you have any further questions, I'll come back again at the end of the talk. 00:47:56.700 |
But I want to cover the rest of the applications as well. 00:48:00.940 |
So the next domain I want to talk about is proteins. 00:48:05.780 |
And the papers from now on, I'm going to zip through a little bit, given time. 00:48:11.500 |
But the first one I want to talk about is this paper from a few folks at Google Research back in 00:48:17.220 |
2020, called Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers. 00:48:24.380 |
So the problem here is that modeling long-range biological sequences requires efficient attention mechanisms. 00:48:31.860 |
And so in this particular paper, what they introduced was this performer architecture, 00:48:36.540 |
which approximates the softmax attention kernel via low rank decomposition. 00:48:42.040 |
And so this does not incorporate any sparsity priors, say, like other methods such as the Reformer do. 00:48:51.780 |
And this is good, because sparsity priors may not be appropriate for biological data 00:48:57.060 |
such as protein, which require global interactions to be modeled. 00:49:01.500 |
And then the other thing is this model, the Performer, scales linearly rather than quadratically 00:49:06.020 |
with the sequence length, L. And the number of random features that you need to approximate 00:49:10.700 |
this softmax attention kernel, M, is completely independent of the input sequence length. 00:49:16.420 |
So just to very quickly visualize the speedups and the space complexity improvements, what 00:49:20.780 |
you're having with this low rank decomposition is, instead of having fat matrices in your 00:49:24.820 |
softmax attention kernel, you now have thinner matrices, which are determined by the size 00:49:29.400 |
of the random features, M. And that basically reduces your quadratic complexity to something 00:49:34.620 |
that is more linear in nature, and also leads to space improvements. 00:49:38.860 |
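A rough NumPy sketch of this kernel trick (simplified relative to the actual FAVOR+ mechanism, which uses orthogonal random features and extra numerical stabilization): queries and keys are pushed through M random features, so the expensive L-by-L attention matrix is never formed.

```python
import numpy as np

def positive_random_features(x, omega):
    """Map inputs through positive random features that approximate the
    softmax kernel (simplified version of the Performer feature map)."""
    m = omega.shape[0]
    proj = x @ omega.T                                            # [L, M]
    return np.exp(proj - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

def linear_attention(q, k, v, num_features=256, seed=0):
    """Approximate softmax attention in O(L * M * d) time and memory
    instead of the exact O(L^2 * d)."""
    d = q.shape[-1]
    rng = np.random.default_rng(seed)
    omega = rng.normal(size=(num_features, d))                    # random projections
    q_prime = positive_random_features(q / d**0.25, omega)        # [L, M]
    k_prime = positive_random_features(k / d**0.25, omega)        # [L, M]
    kv = k_prime.T @ v                                            # [M, d], no L x L matrix
    normalizer = q_prime @ k_prime.sum(axis=0)                    # [L]
    return (q_prime @ kv) / normalizer[:, None]
```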
So I would-- yeah, there is more theoretical analysis and detail in the paper, and I would refer you to that. 00:49:44.980 |
But what we see in terms of results when doing protein language modeling is that the accuracy 00:49:51.060 |
of this model is on par with transformers while reducing computational costs quite a bit. 00:49:57.260 |
So what this suggests is that the approximation of the softmax attention kernel is a tight one. 00:50:02.940 |
And then when you compare that with other methods, such as the Reformer or the Linformer, 00:50:06.960 |
the accuracy is much higher, at least on this task. 00:50:09.400 |
So it seems that, compared to other methods that approximate-- like, try to build more 00:50:13.220 |
efficient transformers, this one is much better for biological sequence data, at least in this setting. 00:50:20.140 |
And finally, if you look at the attention-based amino acid similarity matrix, you can 00:50:28.660 |
see that the Performer model recognizes highly similar amino acid pairs, such as (D, E) and (F, Y). 00:50:34.700 |
So that suggests that the model is learning the right set of information that we really 00:50:39.940 |
So that was a two-minute overview of that paper. 00:50:43.940 |
But I want to talk about another one, which also I think is really, really cool. 00:50:49.440 |
So this one is called Protein LM, again, by a few other folks at Google Research. 00:50:54.200 |
And what this does is model-based natural language protein annotation. 00:51:00.460 |
And why this problem is important is because the protein information is in very high demand. 00:51:07.460 |
So over 50% of all known proteins that have been sequenced, we don't actually know what they do. 00:51:12.260 |
So it's important that we're able to decipher that, to some degree at least. 00:51:16.060 |
And then the second thing is we may want to, for example, find protein sequences with given functions or properties. 00:51:21.020 |
And this is particularly important in the CRISPR domain. 00:51:23.520 |
And so if you can train bidirectional models that can do this, I think that will be incredibly useful. 00:51:31.320 |
And the reason I say this, again, is that the UniProt database has, I think, millions of uncharacterized proteins in it. 00:51:39.220 |
And so getting this information populated in that database would be incredibly useful 00:51:43.720 |
and accelerate a lot of research in this space. 00:51:47.300 |
And so the European Bioinformatics Institute, they have curated this free text data about proteins. 00:51:53.920 |
And so basically, you can use this protein record to train these models. 00:51:58.340 |
And so what you want to do is you want to maybe learn to directly map from amino acid 00:52:03.020 |
sequences to natural language descriptions of them. 00:52:06.700 |
And this problem is not too different from an image captioning problem, where instead 00:52:10.120 |
of having a sequence of pixels-- I don't know if sequence is right. 00:52:14.260 |
But again, if you have pixels, instead you have a sequence of amino acids. 00:52:20.720 |
And then what you want to generate out is a description of the protein. 00:52:26.660 |
And in this paper, the way they do this is they train a T5 model on protein sequence data paired with natural language annotations. 00:52:32.960 |
So the tasks are set up in a bunch of different ways. 00:52:35.920 |
And the supervised data comes from a bunch of different sources in the protein record 00:52:41.840 |
And this model is an encoder decoder T5 model. 00:52:46.220 |
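A hedged sketch of what this sequence-to-description setup could look like with off-the-shelf Hugging Face T5 classes; the checkpoint path and the space-separated amino acid tokenization are purely illustrative assumptions, not the actual model or data pipeline from the paper.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Hypothetical checkpoint name -- stand-in for a protein-annotation model.
MODEL_NAME = "path/to/protein-annotation-t5"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

# Treat the amino acid sequence like the "image" in image captioning:
# it goes into the encoder, and the decoder generates a free-text description.
amino_acids = "M K T A Y I A K Q R Q I S F V K S H F S R Q L E E R"
inputs = tokenizer(amino_acids, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```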
And the results are that out of the 56 million proteins in that UniProt database that were 00:52:51.780 |
previously uncharacterized, 49 million of them now have associated textual descriptions. 00:53:01.580 |
And then the other one, I think, which is probably even more interesting, is now you 00:53:04.100 |
can run queries like, find me a smaller version of this CRISPR-Cas9 protein so that it can 00:53:10.740 |
And now the model can come back with sequences. 00:53:12.960 |
And so I think this is, again, going to be incredibly useful and going to accelerate 00:53:19.220 |
I think these models are going to further help. 00:53:25.940 |
The last class of applications that I want to cover is on the genomics side. 00:53:29.980 |
Again, the first paper over here was some work last year from our genomics team at Health 00:53:36.260 |
AI at Google, which is building gap-aware sequence transformers for sequence correction. 00:53:45.620 |
And so what role does this model play, and why does it matter? 00:53:49.520 |
So if you look at the sequencing data lifecycle, what you do is you go from basically atoms to bits. 00:53:57.200 |
So you have this physical specimen, which hopefully has some DNA in it. 00:54:01.640 |
And you put it through a sequencing machine, such as a PacBio sequencer. 00:54:07.660 |
And that raw data gets mapped to a reference genome. 00:54:11.840 |
And then sometimes there might be diffs between an individual and the reference genome. 00:54:15.640 |
And that can be corrected through this model called DeepVariant that was introduced by Google a few years ago. 00:54:21.120 |
And then once you have this sequence, you can then use it for a bunch of different analysis, 00:54:25.720 |
such as ancestry or just basic biomedical research. 00:54:32.440 |
So where DeepConsensus fits in is it actually makes the raw DNA reads that come out from the sequencer more accurate. 00:54:42.320 |
And so how the PacBio sequencer actually works is it uses this circular consensus sequencing 00:54:49.560 |
algorithm where the DNA molecule is read several times. 00:54:54.280 |
And it produces multiple different sub-reads. 00:54:56.720 |
And these sub-reads are-- they do contain some errors. 00:55:02.600 |
And so what DeepConsensus tries to do is it tries to improve on the errors over here, 00:55:07.760 |
basically, that comes out from just this circular consensus sequencing algorithm. 00:55:14.280 |
So as I said, the basic task for DeepConsensus is to use the CCS data and the sub-reads associated with it to produce corrected reads. 00:55:23.760 |
And so in this example, when we run through the model, what we see is that while the CCS 00:55:27.140 |
identity was at 95.7%, the Deep Consensus prediction identity was at 100%. 00:55:32.280 |
So it's a fairly simple task where you're trying to reduce errors that come out from 00:55:39.280 |
And so the very natural question is, where do these labels come from? 00:55:43.720 |
So each CCS sequence that you have, that is aligned to a high-quality assembly. 00:55:50.080 |
And this high-quality assembly is created by having many CCS reads stitched together. 00:55:59.360 |
And so you can then try to use that high-quality stitched assembly and map that back to the 00:56:06.160 |
CCS read for a given block and use that as the label. 00:56:13.720 |
And you can use that to train the model to improve the accuracy further. 00:56:23.200 |
It takes these sub-reads and this CCS read as well. 00:56:27.440 |
And it has a bunch of additional context features that come in from the sequencer itself. 00:56:34.960 |
And these are all fed into the transformer model. 00:56:39.360 |
And these segments are then stitched together to produce the final polished read over here. 00:56:45.400 |
One thing I will point out over here is that in order to train this model, you can't use a standard cross-entropy loss. 00:56:51.280 |
And this is because you often have insertions in DNA sequences. 00:56:58.280 |
And so that can, when you use a cross-entropy loss, really throw off the model. 00:57:02.360 |
Even a single error, as you can see over here, can propagate throughout the sequence and shift all the downstream positions. 00:57:08.080 |
So what you need is a special kind of alignment loss based on distance that can really capture this. 00:57:15.920 |
And so making this alignment loss work on TPUs and making it differentiable is, I think, quite non-trivial. 00:57:21.600 |
And so, again, go back to the paper if you're interested in that kind of topic. 00:57:26.680 |
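A toy illustration of the problem (not the paper's differentiable alignment loss): a single spurious insertion makes a naive position-by-position comparison, which is roughly what a cross-entropy loss over aligned positions sees, penalize nearly every downstream base, while an edit-distance-style alignment counts it as one error.

```python
def per_position_errors(pred, label):
    """Naive position-wise comparison, analogous to what a cross-entropy loss
    over fixed positions would penalize."""
    return sum(p != l for p, l in zip(pred, label)) + abs(len(pred) - len(label))

def edit_distance(pred, label):
    """Levenshtein distance: insertions/deletions/substitutions each cost 1."""
    dp = list(range(len(label) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, l in enumerate(label, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (p != l))
    return dp[-1]

label = "ACGTACGTACGT"
pred  = "ACGGTACGTACGT"   # one spurious inserted G near the start

print(per_position_errors(pred, label))  # 10: nearly every downstream base "wrong"
print(edit_distance(pred, label))        # 1: a single insertion
```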
But at a very high level, how well does this model work? 00:57:30.320 |
So if you look at the final output, you have the read name. 00:57:32.800 |
You have the base predictions and also the predicted quality, which can be thought of as a per-base confidence. 00:57:37.320 |
And these base predictions are often quite long. 00:57:40.240 |
And so you can see that it continues offscreen because it's 10K to 20K bases long over here. 00:57:45.400 |
And when you look at the quality, it improved quite a bit over the vanilla CCS algorithm: 00:57:51.160 |
the per-read accuracy over here went up substantially. 00:57:54.960 |
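As a side note on reading those quality values: per-base quality scores are conventionally reported on the Phred scale, which maps to an estimated error probability. This is the general genomics convention rather than anything specific to this paper.

```python
# Standard Phred convention (not specific to this paper): quality Q maps to
# an estimated per-base error probability via Q = -10 * log10(p_error),
# so Q30 means roughly a 1-in-1000 chance that the base is wrong.
import math

def phred_to_error_prob(q):
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    return -10 * math.log10(p)

print(phred_to_error_prob(30))      # 0.001
print(error_prob_to_phred(0.0001))  # 40.0
```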
And so you may ask, what is the real-world impact of this kind of model? 00:58:01.480 |
So the answer is this model is already being used in the real world. 00:58:04.400 |
So at Stanford, in the genomics team led by Dr. Ashley and a few others, there was this recent 00:58:08.680 |
ultra-rapid nanopore genome sequencing paper where they set a world record for the fastest human genome sequencing. 00:58:14.800 |
And this DeepConsensus transformer architecture was used in that assembly pipeline. 00:58:18.880 |
And so in this particular study, they were able to very quickly diagnose that Matthew 00:58:23.120 |
over here had a heart condition due to genetic reasons. 00:58:27.160 |
And so they were very quickly able to put Matthew on the heart transplant list over here. 00:58:32.320 |
So that's the kind of real-world impact you can have with these biomedical transformer 00:58:40.400 |
And very quickly, the last paper that I want to talk about is this paper from DeepMind 00:58:47.060 |
on effective gene expression prediction from sequences by integrating long-range interactions. 00:58:56.840 |
And the motivation for this work is, again, that since the Human Genome Project, there 00:59:00.440 |
have been thousands of genome-wide association study hits, where the goal is to map genetic 00:59:08.000 |
variants to different kinds of disease phenotypes. 00:59:13.200 |
And validating these associations with real-world experimentation takes a lot of time. 00:59:16.040 |
And so if you can do that with machine learning models, that's really, really great. 00:59:20.320 |
And so that's what they set out to do in this paper. 00:59:23.880 |
And so if you look at the variants themselves, roughly 10% of them are going to be coding variants. 00:59:33.800 |
And then the way they can cause diseases is by disrupting the structure of proteins that 00:59:37.520 |
are generated or by affecting the protein-protein interactions. 00:59:42.760 |
The good part about these coding variants is that they tend to be closer to the gene. 00:59:48.060 |
On the other hand, the remaining 90% are non-coding variants. 00:59:52.600 |
And the way they work is they influence how much protein gets expressed. 01:00:00.760 |
And so the way they can lead to diseases, if there are variants in them, is by disrupting this regulation of gene expression. 01:00:09.360 |
And given that these non-coding variants can be very, very far away from the gene and the 01:00:14.040 |
coding variants, it's very difficult to interpret them. 01:00:16.800 |
And so the question is, can we train transformer models that can predict the influence of these non-coding variants on gene expression? 01:00:27.320 |
So the paper, again, focuses on transcription, which is the first step of gene expression. 01:00:37.160 |
And the way this is done is you have RNA polymerase, which gets recruited at the beginning of the 01:00:42.440 |
gene by these proteins called transcription factors. 01:00:46.640 |
And these transcription factors have binding sites which correspond to these promoter regions near the gene. 01:00:51.840 |
But then you also have these enhancers, which can be very, very far away from these promoters 01:00:56.720 |
in terms of the linear space, also influencing this transcription. 01:01:02.160 |
And you may ask, how can these enhancers influence the activity over here? 01:01:07.360 |
This is because while they may be far away in the linear space, when the sequence folds 01:01:12.280 |
into its 3D structure, they will end up being quite close to each other. 01:01:15.800 |
And so they can completely affect the transcription process over here. 01:01:19.040 |
So that's a very high-level overview of what's happening over here. 01:01:19.040 |
And so the point is, if there are any variants in these non-coding regions and in these 01:01:29.880 |
enhancers, they may disrupt the transcription factor binding. 01:01:33.600 |
And this can, in turn, lead to no protein being produced and then finally to diseases. 01:01:37.420 |
So we want to be able to predict that based on the DNA sequences that have been generated. 01:01:47.400 |
The setup is to predict experimental data from these DNA sequences. 01:01:53.560 |
The primary task is gene expression over here. 01:01:55.340 |
But then there are also other tasks, such as DNA accessibility, histone modifications, 01:01:59.960 |
and transcription factor binding, and so on and so forth. 01:02:04.240 |
So as you can imagine, the baseline model for this task for many years was the CNN model. 01:02:11.000 |
And as you stack more CNN layers, you can increase the receptive field. 01:02:16.360 |
So in this work, what they showed was you can use transformers instead and do better at modeling these long-range interactions. 01:02:25.120 |
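As a rough illustration of that receptive-field argument, here is a back-of-the-envelope sketch with made-up layer counts and kernel sizes, not the numbers from the paper: stacking convolutions (dilated ones, as one common choice) grows the receptive field layer by layer, whereas a single self-attention layer already connects any two positions in the window.

```python
# Back-of-the-envelope sketch (my own numbers, not the paper's) of why
# receptive field is the bottleneck for CNNs on long DNA sequences:
# each dilated conv layer widens the field by (kernel_size - 1) * dilation,
# so reaching ~100 kb takes many layers, while self-attention links any
# two positions in one step.
def receptive_field(kernel_size, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

dilations = [2 ** i for i in range(11)]     # 1, 2, 4, ..., 1024
print(receptive_field(3, dilations))        # 4095 positions after 11 layers
```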
So the final model is called Enformer, which is a combination of the words enhancer and transformer. 01:02:30.920 |
And so if you look at the model itself, it has a few CNN layers at the beginning. 01:02:34.360 |
But then it has a bunch of transformer blocks that are stacked together. 01:02:42.120 |
And the model is trained on tens of thousands of such examples. 01:02:44.820 |
And the output is genomic tracks, such as this RNA expression track over here. 01:02:48.600 |
And they have organism-specific heads, so one for humans and one for mouse. 01:02:54.360 |
And finally, one key detail is the relative position encodings that were used in this model. 01:02:59.680 |
And these relative position encodings were modeling this power-law decay of interactions with distance. 01:03:04.320 |
And as a result of using these relative position encodings with the transformer block architecture, 01:03:08.640 |
they were now able to model interactions over 100 kb away. 01:03:13.500 |
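Putting those pieces together, here is a minimal PyTorch sketch of the overall shape of such a model, written from the description above rather than from the actual Enformer code; all sizes are made up, and it omits details such as Enformer's relative position encodings and the organism-specific heads.

```python
# A minimal sketch (not the actual Enformer code) of the architectural idea:
# conv layers downsample the one-hot DNA sequence, transformer blocks model
# long-range interactions, and a linear head predicts per-position tracks.
import torch
import torch.nn as nn

class TinyEnformerSketch(nn.Module):
    def __init__(self, n_tracks=10, d_model=64):
        super().__init__()
        # conv stem: one-hot DNA (4 channels) -> d_model channels, 4x downsampling
        self.stem = nn.Sequential(
            nn.Conv1d(4, d_model, kernel_size=15, padding=7),
            nn.MaxPool1d(2),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.MaxPool1d(2),
        )
        # transformer blocks capture interactions between distant positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # per-position head predicting genomic tracks (expression, accessibility, ...)
        self.head = nn.Linear(d_model, n_tracks)

    def forward(self, x):           # x: (batch, 4, seq_len) one-hot DNA
        h = self.stem(x)            # (batch, d_model, seq_len / 4)
        h = h.transpose(1, 2)       # (batch, seq_len / 4, d_model)
        h = self.transformer(h)
        return self.head(h)         # (batch, seq_len / 4, n_tracks)

seq = torch.zeros(1, 4, 1024)
seq[0, torch.randint(0, 4, (1024,)), torch.arange(1024)] = 1.0  # random one-hot DNA
print(TinyEnformerSketch()(seq).shape)  # torch.Size([1, 256, 10])
```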
And so you see that in the results over here. 01:03:20.920 |
And you see that as soon as you go far away, the CNN model is no longer able to pick up these interactions. 01:03:31.800 |
But you can see that the Enformer model is still able to capture them, even at these longer distances. 01:03:44.360 |
And finally, one, I think, very interesting experiment that they had in the paper was 01:03:48.480 |
they were also able to predict promoter-enhancer interactions. 01:03:53.900 |
And that prediction was actually on par with experimental data. 01:03:57.040 |
So this suggests that using this machine learning model, we can sidestep a lot of these wet 01:04:00.480 |
lab experiments and get key details, which could be super useful. 01:04:06.040 |
So yeah, I'm sorry I had to cram through the proteins and genomics applications so quickly. 01:04:12.560 |
But I think what you would see is that overall, when you look at clinical, protein, and genomics 01:04:16.240 |
applications, we see that transformers have incredible potential in biomedicine. 01:04:21.600 |
And with clinical applications, I think the challenges are perhaps more centered around the data and the evaluation rather than the modeling. 01:04:26.400 |
But on the proteins and genomics side, I think there are some extremely interesting opportunities on the modeling side as well. 01:04:32.920 |
And finally, as I said, there are incredible bi-directional learning opportunities. 01:04:35.920 |
I think the problem of modeling long-range interactions, that's useful beyond proteins and genomics. 01:04:42.600 |
And so I think any architecture improvement over here can inspire wider progress in AI. 01:04:46.480 |
So I think that's a big reason to work on this. 01:04:56.840 |
But I think these are super cool papers, and you should go back and read them. 01:05:01.120 |
So finally, I want to maybe spend a couple of minutes touching upon how I see the future of this field evolving. 01:05:06.920 |
Overall, I believe it's not a question of if AI will transform biomedicine. 01:05:12.240 |
I think it's rather a question of when and how. 01:05:14.840 |
And I think the very specific thesis I have over here is, given the nature of biomedical 01:05:20.600 |
data and how multimodal it is in nature, and with all the progress in transformers, self-supervised 01:05:24.240 |
learning, and large language models, I think we have an incredibly powerful framework to leverage 01:05:29.360 |
all this richness at scale and truly build foundational medical AI models. 01:05:39.040 |
And so I'm not going to quiz you on who these people are, since you've already been here for far too long. 01:05:46.480 |
But they're actually famous physician-scientists. 01:05:50.880 |
And so I think what I want to say over here is that there's no reason for a scientist to be confined to either the clinic or the lab. 01:05:57.800 |
And that's what I also want to convey with our AI systems as well. 01:06:00.520 |
We don't have to separate clinical applications and biological applications. 01:06:03.540 |
I think when we combine them together, we are going to discover a lot of new insights. 01:06:06.980 |
And I think that's going to accelerate biomedical research and ultimately lead to new discoveries, 01:06:11.680 |
which are going to be used to eradicate diseases, advance the human health span, and generally benefit everyone. 01:06:24.440 |
I think the rightmost one is Alexander Fleming. 01:06:30.840 |
So Fleming discovered penicillin, Salk developed the polio vaccine, and Ehrlich worked on a bunch of different things. 01:06:40.360 |
And so maybe I'll ask this question to all of you. 01:06:44.480 |
In which field do you think AI will win its first Nobel Prize? 01:06:51.680 |
What's the complete set of Nobel Prize fields?