back to index

Stanford CS25: V2 I Biomedical Transformers


Chapters

0:00 Introduction
1:10 Overview
9:32 MultiMedQA
17:43 Selective Prediction
26:39 Evaluation
31:35 Qualitative Examples
33:19 Clinical Language Models
34:34 Key takeaways
39:39 Questions
47:59 Proteins
53:23 Genomics
58:39 DeepMind

Whisper Transcript | Transcript Only Page

00:00:00.000 | So making you all see the speaker notes was not part of the plan, but I'm glad to be here.
00:00:12.720 | And my name is Vivek Natarajan, and I am a research scientist in the Health AI team at
00:00:19.520 | Google.
00:00:20.520 | A little bit more about me.
00:00:21.520 | Growing up in India, my parents always wanted me to be a doctor, to be precise, a medical
00:00:26.640 | doctor, but unfortunately I was probably not good enough to memorize all of the biology
00:00:31.480 | textbooks that you had to do in case you wanted to crack the medical entrance examinations.
00:00:37.400 | So I ended up becoming a computer scientist instead.
00:00:42.000 | But as a great man once said, you can't connect the dots looking forward, you only join them
00:00:46.980 | looking backwards.
00:00:48.640 | So through a rather long-winded path, not too dissimilar from how we actually train
00:00:52.920 | our neural networks, I ended up working in medicine again, this time armed with this
00:00:58.520 | magical new tool of AI.
00:01:01.700 | And I can tell you that my parents are far more happy with my life choices right now.
00:01:09.280 | But they're never truly satisfied.
00:01:11.400 | But digressions aside, my goal for this talk is to peel back the curtains and give you
00:01:17.760 | a flavor of all the innovation that is happening at the intersection of AI and biomedicine
00:01:22.440 | and how that is being catalyzed by transformers and large language models in particular.
00:01:29.160 | So we will spend the first few minutes trying to work up from first principles why transformers
00:01:34.180 | and large language models are a particularly good fit for biomedical data.
00:01:40.120 | And then we will deep dive into a few papers covering a bunch of different biomedical application
00:01:45.840 | settings.
00:01:47.680 | And finally, I'll present my views on how this field is likely going to evolve in the
00:01:52.120 | next few years.
00:01:54.440 | And even though my voice or tone may not exactly sound that way, I am incredibly excited by
00:01:59.560 | the possibilities of AI and biomedicine.
00:02:03.240 | And I think we have an incredible opportunity in front of us to advance human health and
00:02:06.640 | human potential.
00:02:07.980 | And my hope at the end of this talk is you all will feel the same way as I do today and
00:02:12.040 | perhaps join me.
00:02:15.320 | So yeah, let's jump straight in.
00:02:17.120 | Why transformers and biomedicine?
00:02:20.040 | And sorry, I'm going to pick people who are in person to answer.
00:02:23.080 | So maybe if one of you could volunteer.
00:02:25.400 | Go for it.
00:02:28.360 | Why not?
00:02:30.040 | That's a good answer.
00:02:32.920 | Sure, go for it.
00:02:34.920 | We have a lot of biological data.
00:02:39.000 | So it's like biological data is a problem, but we have a lot of it.
00:02:46.120 | Yeah, sure.
00:02:47.120 | Medical doctors are expensive, and a well-paid job is just like memorizing, as you said.
00:02:55.120 | Yeah, it's an important application setting.
00:03:00.120 | Yeah, great one.
00:03:04.160 | So I think all of you are on the right track.
00:03:06.920 | And so maybe if you just look at different kinds of biomedical data, for example, what
00:03:13.520 | are clinical notes?
00:03:14.880 | I think it's a sequence of doctor gibberish.
00:03:16.720 | OK, I did not say that, but let's just call it a sequence of doctor speak or doctor notes.
00:03:24.760 | Similarly, if you were to look at electronic medical records, what are they?
00:03:27.640 | They are essentially a sequence of a person's encounters with the medical system.
00:03:34.600 | What about proteins going deeper into the biological stack?
00:03:37.800 | They are nothing but a sequence of amino acids linked together by peptide bonds.
00:03:42.800 | And does anybody know what this is?
00:03:45.080 | Go for it.
00:03:47.080 | I think that's how we store medical records, from them.
00:03:52.600 | Sorry, again?
00:03:53.880 | Well, it looks like from them.
00:03:56.880 | It looks like they were like from the virus.
00:04:00.920 | You're getting close.
00:04:01.920 | Anyone else?
00:04:05.320 | So this is in the Wellcome Collection in London, and this is actually a printout of the full
00:04:10.600 | human genome.
00:04:13.680 | And no, they did not cheat over here.
00:04:16.600 | The font is super small.
00:04:18.320 | And as you can see, there's a bunch of ATGCs.
00:04:21.640 | The entire printout contains, I think, over 130 volumes in that shelf, and each page is
00:04:28.040 | printed on both sides.
00:04:29.320 | And it's a four-point font with precisely 43,000 characters per page.
00:04:34.720 | So that is how big the human reference genome is, over three billion base pairs.
00:04:42.160 | And so again, the genome is nothing but a sequence of nucleotide base pairs.
00:04:47.180 | So what we are essentially seeing over here is sequences are everywhere in biomedical
00:04:51.880 | data.
00:04:53.040 | And what is the best neural network architecture for modeling them?
00:04:59.360 | And I guess since you are all in this course, I don't have to convince you that the answer
00:05:03.120 | is transformers, right?
00:05:05.720 | OK, that's good.
00:05:09.120 | But maybe I'll just offer a few reasons over here.
00:05:12.120 | Firstly, as you can see, the data itself is multimodal in nature.
00:05:17.800 | And we just saw a few examples.
00:05:19.880 | And as someone pointed out, transformers have proven remarkable at guzzling up pretty much
00:05:24.560 | any kind of data.
00:05:26.680 | And we are really seeing this remarkable convergence across fields, whether that's speech, or NLP,
00:05:31.800 | or vision, or robotics.
00:05:33.080 | I mean, pretty much everywhere we are using transformers, and I think biomedicine is no
00:05:37.680 | different.
00:05:38.680 | I think secondly, transformers are far more effective at modeling complex long-range interactions
00:05:43.760 | over sequences.
00:05:45.800 | And this property is particularly important in the biomedical domain.
00:05:49.200 | And we will cover this in more detail later in the talk.
00:05:53.520 | And finally, as again, someone pointed out, these data sets can be quite big.
00:05:58.040 | And you can easily get into the billions of tokens territory.
00:06:01.240 | And this is where transformers with all the parallelizable operations and the relative
00:06:04.960 | ease of training-- and maybe someone should try training an LSTM or an RNN on these kinds
00:06:09.160 | of data sets-- you'll realize that transformers are much better suited for the kind of data sets
00:06:13.360 | that we have in this domain over here.
00:06:16.920 | So yeah, I think there are a few more reasons as well, but I think these are the key ones
00:06:20.440 | as to why transformers are particularly well-suited for biomedical data sets and tasks.
00:06:27.440 | Any questions so far?
00:06:28.920 | OK, great.
00:06:31.640 | So now in the next part of this talk, we will dive deep into a few papers applying transformers
00:06:36.880 | to biomedical data.
00:06:38.960 | We'll start with clinical applications first, and then go gradually deeper into the biology
00:06:44.440 | stack looking at proteins and genomic applications as well.
00:06:49.000 | And what you will observe is that while transformers and large language models by extension are
00:06:54.400 | a great fit, often you have to innovate not just on the modeling side, but also on the
00:07:00.280 | data and evaluation side to make these application scenarios really work.
00:07:07.400 | And so the first paper I want to talk about over here is this recent work from our team
00:07:10.960 | called Large Language Models Encode Clinical Knowledge.
00:07:15.760 | The motivation for this work is actually quite straightforward.
00:07:18.660 | So if you look at medicine, it is a humane endeavor, and language is at the heart of
00:07:23.000 | it, facilitating interactions between people and those who provide care for them.
00:07:27.360 | Unfortunately, if you look at a lot of medical AI systems developed to date, these are all
00:07:31.760 | narrow, single-task, single-domain models lacking interactive and expressive capabilities.
00:07:38.240 | And as a result, what has happened is there is this discordance between what these models
00:07:42.400 | can do and what is expected of them by patients and care providers and others.
00:07:49.680 | And this in turn has, I think, prevented broad uptake of medical AI.
00:07:55.200 | And you can see that, for example, we don't really have AI in many clinics out there,
00:07:58.000 | like helping us with diagnosis and so on and so forth.
00:08:01.600 | But the recent progress with transformer-based large language models, it offers us an opportunity
00:08:06.560 | to change all of this and redesign and rethink medical AI systems with language at the heart
00:08:11.880 | of it, mediating human-AI interactions between doctors, researchers, and patients.
00:08:19.920 | And I will be honest if I don't point out that there has been a large volume of work
00:08:22.960 | in this space, particularly in the last few years.
00:08:25.320 | There have been various attempts to train language models in the biomedical domain with
00:08:31.560 | models of various different sizes on different corpuses of biomedical data.
00:08:37.320 | And while this is exciting, the quality bar for applications in the medical domain is
00:08:41.360 | actually quite high.
00:08:43.680 | And so what is missing is that there are actually not many good evaluation benchmarks and evaluation
00:08:49.680 | protocols and frameworks.
00:08:51.120 | So we don't have the equivalent of a BIG-bench in medicine.
00:08:54.320 | And hopefully, you guys have covered BIG-bench before.
00:08:57.880 | And so BIG-bench is this benchmark where you can assess large language models across a
00:09:01.240 | variety of task domains and settings.
00:09:03.440 | But we don't have an equivalent of that in the medical domain.
00:09:07.680 | And further, if you look at the evaluations that are typically used in these previous
00:09:11.480 | studies, they only look at objective metrics like accuracy or natural language generation
00:09:17.960 | metrics like BLEU or CIDEr.
00:09:19.720 | But these fail to capture the nuances of real-world use cases in clinical settings.
00:09:24.920 | So what we essentially needed was a good benchmark and a task and also a good evaluation framework
00:09:30.880 | for evaluating these models.
00:09:33.700 | And so to address this unmet need and assess the potential of LLMs in medicine, in our
00:09:39.040 | team, we decided to focus on the medical question answering task.
00:09:44.000 | Because answering medical questions is actually quite challenging.
00:09:47.320 | It requires reading comprehension skills, ability to accurately recall medical knowledge,
00:09:52.960 | and also manipulate and reason about it.
00:09:55.920 | And furthermore, the Q&A task is general enough and can subsume a bunch of different application
00:10:00.280 | settings such as summarization of clinical notes, clinical decision support, and also
00:10:05.880 | primary care triaging of patient concerns and so on.
00:10:10.520 | So we've identified the task.
00:10:12.200 | The next question was what data set?
00:10:14.160 | And so when we looked at the literature over here, what we saw was that there were several
00:10:17.080 | data sets floating around assessing model capabilities in a bunch of different settings.
00:10:21.940 | So what we decided was we should probably just unify all of them and put together in
00:10:25.800 | one benchmark.
00:10:26.800 | And so we did that, and we called it MultiMedQA.
00:10:29.040 | And so if you look at it, this benchmark now covers medical question answering data sets
00:10:32.960 | from a bunch of different settings, such as professional medical questions, like the US
00:10:37.760 | medical license exam style questions.
00:10:39.940 | It also includes medical research questions, those based on PubMed abstracts and so on,
00:10:44.240 | and also questions from live users and consumers asking about medical information.
00:10:50.080 | And also the setting changes, it could be closed domain or open domain, and the model
00:10:53.160 | may be expected to produce a long form answer in one setting and maybe a short form answer
00:10:56.480 | in another setting.
00:10:58.400 | And finally, we saw that while the Q&A data sets which covered consumer questions, yeah,
00:11:05.680 | go for it.
00:11:06.680 | I have a quick question, how do you evaluate long form answers?
00:11:09.280 | I'll come back to this.
00:11:11.200 | Okay.
00:11:12.200 | So yeah, very quickly.
00:11:13.200 | Finally, when we looked at...
00:11:14.200 | Sorry, one other thing.
00:11:15.200 | People on Zoom might not be able to hear the questions asked in person, so can you repeat the questions?
00:11:21.000 | Okay, cool.
00:11:22.000 | So the question was, how do we evaluate long form answers?
00:11:24.720 | And I'll come back to this a bit later.
00:11:28.320 | And so very quickly, when we looked at the data sets that actually provided consumer
00:11:33.760 | medical questions, we found them to be quite small in size.
00:11:36.160 | So we decided to augment them.
00:11:37.840 | And so we went out to Google and looked at the most frequently asked consumer medical
00:11:41.520 | questions.
00:11:42.520 | And so we curated a data set, and we added that to the benchmark, and we call that HealthSearchQA
00:11:46.360 | over here.
00:11:48.400 | And so, yeah, again.
00:11:49.400 | How big is the composite?
00:11:50.400 | I'll come back to the statistics later, I'm sure.
00:11:56.800 | So here are a few examples.
00:11:58.620 | So if you look at the consumer medical questions, they are quite short in nature.
00:12:04.080 | And so they come from the HealthSearchQA and the LiveQA data sets, whereas I think if you
00:12:06.800 | look at the USMLE-style questions, these are like long vignettes.
00:12:10.400 | And so doctors have to really, really carefully read through them and come up with the right
00:12:13.600 | answer, which often involves a process of elimination.
00:12:16.200 | So again, very, very different application settings, and so the model has to really adapt
00:12:21.920 | and understand the task to do well in all these settings across the board.
00:12:27.000 | And LiveQA is interesting because the answers, the reference answers over here, were actually
00:12:32.120 | provided by librarians.
00:12:33.440 | So that's another good comparison point for us.
00:12:37.240 | And so in terms of statistics, we had a total of seven data sets in this benchmark.
00:12:43.320 | As I said, we cover professional medicine, medical research, and consumer medical questions.
00:12:47.320 | They're, again, of various different sizes and can be long form, short form, open domain,
00:12:53.680 | and closed domain.
00:12:54.680 | So very diverse, and I think it provides a very comprehensive evaluation of models in
00:12:58.360 | this medical question answering setting.
00:13:04.520 | So we have a task on the benchmark.
00:13:06.400 | The next question, again, I think I asked was, how do we evaluate these models?
00:13:10.880 | And as I mentioned before, automated metrics are actually deeply unsatisfactory because
00:13:15.440 | they fail to capture the nuances of real-world clinical applications.
00:13:19.440 | So what we did was actually heavily inspired by some of Stephen's work over here, was to
00:13:23.560 | put together a human evaluation framework for assessing these long-form answers.
00:13:28.840 | And this had two parts.
00:13:31.380 | The first part was evaluation by clinicians, and we asked them to rate the model responses
00:13:36.200 | along 12 axes pertaining to factuality of the responses, ability to recall medical knowledge,
00:13:42.120 | medical reasoning, and also the potential for harm and bias in these responses.
00:13:48.960 | But if you look at the potential end users of such medical Q&A systems, these are likely
00:13:52.720 | going to be non-expert lay users.
00:13:54.720 | So it is also important to get these answers evaluated by them as well.
00:13:58.300 | And so we also additionally asked a pool of lay users as to how helpful and actionable
00:14:03.560 | they thought the answers were.
00:14:07.200 | And so that was our evaluation framework, and we also had the benchmark fixed.
00:14:12.480 | So now we move on to the fun part of building and aligning LLMs to the medical domain task.
00:14:19.000 | So in this work, we decided to build on the PaLM family of language models.
00:14:23.720 | Has that been covered in the course before?
00:14:25.360 | Okay, great.
00:14:26.360 | So, but very quickly, I believe this is still the largest publicly announced densely activated
00:14:32.480 | decoder-only large language model, with the largest one being 540 billion parameters in
00:14:37.120 | total.
00:14:38.120 | A few more details, the model is trained on 780 billion tokens, 25% of which is multilingual.
00:14:46.160 | The data comes from a bunch of different sources, including social media conversations, web
00:14:51.400 | pages, books, GitHub, and Wikipedia, and so on and so forth.
00:14:55.280 | And at the time of release, the model was state-of-the-art on many NLP reasoning benchmarks, and also
00:14:59.920 | was the first model to exceed the average human performance on BIG-bench.
00:15:03.800 | Further, over the last year, PaLM-derived models were shown to be super useful in a
00:15:08.560 | bunch of different application settings, including for code generation, which was the PaLM-Coder
00:15:12.640 | model, in robotics, the PaLM-SayCan model, and also for answering math and science questions,
00:15:17.760 | which was the Minerva models.
00:15:19.240 | And so we thought PaLM was a very good foundation model for us to build on and use in the
00:15:22.640 | medical domain as well.
00:15:24.520 | And overall, I think PaLM is a true marvel of engineering, but I will refer you all back
00:15:28.080 | to Aakanksha's paper on this for more details.
00:15:31.200 | I think it's a must-read.
00:15:34.960 | And again, in late October last year, Jason Wei and a few others at Google Brain came
00:15:39.400 | out with the Flan-PaLM variant of the PaLM model, and this is basically the instruction-tuned
00:15:44.080 | counterpart.
00:15:45.480 | And this model was even better than PaLM, and I believe this is still the state-of-the-art
00:15:49.200 | on many benchmarks such as MMLU, TyDi QA, and I think it exceeds PaLM performance by an
00:15:53.720 | average of 9.4% across BIG-bench tasks.
00:15:58.520 | So we decided to build on the Flan-PaLM model, and we applied a combination of prompting
00:16:03.720 | strategies including few-shot prompting, chain-of-thought reasoning, and also self-consistency to the
00:16:09.220 | 540-billion-parameter variant, and we evaluated it on the MultiMedQA datasets that had the
00:16:13.880 | short-form MCQ questions.
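
As a rough illustration of the prompting setup just described, here is a sketch of how a few-shot chain-of-thought prompt for a medical multiple-choice question might be assembled; the exemplar content and formatting are hypothetical, not the prompts used in the paper:

```python
# Hypothetical few-shot chain-of-thought prompt construction for a medical MCQ.
FEW_SHOT_EXEMPLARS = [
    {
        "question": "Which vitamin deficiency causes scurvy? (A) Vitamin A (B) Vitamin C (C) Vitamin D (D) Vitamin K",
        "rationale": "Scurvy results from impaired collagen synthesis, which requires ascorbic acid.",
        "answer": "(B)",
    },
]

def build_cot_prompt(question: str) -> str:
    """Concatenate exemplars with their rationales, then append the test question."""
    parts = []
    for ex in FEW_SHOT_EXEMPLARS:
        parts.append(f"Question: {ex['question']}\n"
                     f"Explanation: {ex['rationale']}\n"
                     f"Answer: {ex['answer']}\n")
    # The model is expected to continue with its own reasoning, then an answer.
    parts.append(f"Question: {question}\nExplanation:")
    return "\n".join(parts)
```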
00:16:17.040 | And we found that this model was really, really good.
00:16:19.600 | At the time of publication, this model on the USMLE-MedQA dataset exceeded the previous
00:16:24.040 | state-of-the-art by over 17%.
00:16:26.000 | Is this specific to the short-form values?
00:16:29.480 | It's only for the USMLE-MedQA dataset.
00:16:32.880 | That's right.
00:16:34.300 | And so you see that the accuracy over the previous state-of-the-art at the time of publication
00:16:37.440 | went up by over 17%.
00:16:39.840 | And I believe this was the first LLM-based AI system to obtain a passing equivalent score,
00:16:44.480 | which was 60% or above on this benchmark.
00:16:49.400 | And similarly, when we looked at other MCQ datasets in the benchmark, for example, MedMCQA,
00:16:54.320 | which is a dataset of Indian medical entrance examination questions, the model was again
00:16:57.560 | the state-of-the-art.
00:16:58.560 | On PubMedQA, which was question answering based on PubMed abstracts, again, the model
00:17:02.960 | was state-of-the-art at the time of publication.
00:17:05.760 | And same story on MMLU clinical topics as well, which include genetics, anatomy, professional
00:17:12.240 | medicine, clinical knowledge, and a bunch of other topics in there.
00:17:17.480 | So all this was great.
00:17:19.840 | And then when we started looking at the scaling plots, what we again saw was that the performance
00:17:23.800 | seemed to be improving as we scaled the model from 8 billion to 62 billion to 540 billion.
00:17:31.000 | And so what this basically suggested was that these general-purpose large language models
00:17:34.320 | trained on the public internet seemed to encode clinical knowledge pretty well.
00:17:38.520 | And their medical reasoning abilities tend to scale with model parameter size.
00:17:43.480 | We also did another experiment when we looked at selective prediction.
00:17:48.360 | And we used the self-consistency votes to determine when to differ.
00:17:52.920 | And this is important in clinical settings because doctors communicate when they don't
00:17:58.920 | know about something.
00:17:59.920 | And if our AI systems are going to be used in clinical settings, for example, for diagnosis,
00:18:02.960 | they should be able to tell you when they don't know something.
00:18:06.000 | And so what we observed here, with this fairly crude metric, was that
00:18:08.920 | we were getting a linear improvement in performance as we changed the deferral threshold.
00:18:15.720 | And this was quite nice.
00:18:16.880 | But in practice, it's actually quite inefficient because you're generating multiple decoding
00:18:21.240 | samples to be able to compute this metric.
00:18:22.960 | So we need a better method.
00:18:23.960 | Just to be clear, what is the deferral fraction?
00:18:26.000 | That's the fraction of questions the model defers on.
00:18:29.000 | Yeah.
00:18:30.000 | It basically says I'm uncertain around one.
00:18:31.360 | And that's determined based on the self-consistency votes.
00:18:35.360 | I see.
00:18:37.360 | So if there's low variance in the self-consistency samples, you don't--
00:18:39.360 | Exactly.
00:18:40.360 | --defer.
00:18:41.360 | Exactly.
00:18:42.360 | So are models trained to be capable of deferring themselves?
00:18:43.360 | Are they generating outputs saying, I don't know?
00:18:45.360 | Because they're just trained on this next-word prediction task.
00:18:46.360 | And that depends on the data set.
00:18:47.360 | The PubMed QA has some answers which are maybe.
00:18:48.360 | But again, we don't explicitly fine-tune the models over here.
00:18:49.360 | So no, the models are not trained.
00:18:50.360 | Yeah.
00:18:51.360 | So does it imply that this metric runs in the wrong [INAUDIBLE]??
00:18:52.360 | So then how-- and then basically, how do you put the benchmark [INAUDIBLE] where the output
00:19:13.520 | is correct or not, like how [INAUDIBLE]??
00:19:18.620 | So this is primarily based on the reference in the data sets, which is-- so this is all
00:19:22.060 | accuracy metrics.
00:19:23.060 | So we already know between the four options or five options which one's the right one.
00:19:26.440 | And so we just do that classification one.
00:19:28.160 | Yeah.
00:19:29.160 | So I'll come back to the clinician evaluation a bit later.
00:19:30.160 | Sorry.
00:19:31.160 | Maybe I missed something.
00:19:32.160 | How are you measuring the [INAUDIBLE]??
00:19:35.240 | So if you know about self-consistency prompting, what we do is we generate multiple decodes
00:19:40.520 | from the same model.
00:19:42.180 | And then we see the number of times the highest-ranking answer is voted.
00:19:47.760 | And based on that, you can fix a threshold and say, if it's below this number, I'm going
00:19:51.240 | to defer.
00:19:52.240 | So if, say, the majority answer comes up in your self-consistency decode only like n times
00:19:56.760 | out of k or whatever, then if that n is too small, then it's very likely the model's uncertain.
00:20:02.380 | So that's how we defer.
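
A minimal sketch of that deferral rule, assuming a hypothetical sample_answer function that returns the final answer from one stochastic chain-of-thought decode:

```python
# Self-consistency-based deferral: sample k decodes, take the majority-voted
# answer, and defer when the vote fraction falls below a chosen threshold.
from collections import Counter

def answer_or_defer(sample_answer, question, k=11, threshold=0.5):
    votes = Counter(sample_answer(question) for _ in range(k))
    top_answer, top_count = votes.most_common(1)[0]
    confidence = top_count / k          # fraction of decodes agreeing with the majority
    if confidence < threshold:
        return None, confidence         # defer: the model is too uncertain
    return top_answer, confidence
```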
00:20:10.080 | So we don't really see a taper-off in this plot, so it's natural to ask what the rest
00:20:16.280 | would look like?
00:20:17.960 | I think if you plot it further, it will flatline.
00:20:21.280 | But again, that's not useful.
00:20:22.280 | I mean, if you're saying no to every question, that's not useful at all.
00:20:24.400 | So you want to have a reasonable deferral percentage over here.
00:20:26.880 | Yeah, but like 0.4 or 0.5 is not bad, right?
00:20:29.680 | I think that's high.
00:20:30.680 | [INAUDIBLE]
00:20:31.680 | I think that's still high.
00:20:32.680 | 50% is quite high.
00:20:33.680 | But again, this is a very contrived setting.
00:20:36.680 | But in real-world use cases, probably, I think that number should be much lower.
00:20:40.080 | [INAUDIBLE]
00:20:59.700 | That's right.
00:21:00.700 | I think balanced accuracy might be a better metric.
00:21:01.700 | But we looked at some of these data sets.
00:21:03.720 | And one data set, the skew was pretty bad, the PubMed QA data set, and I think no one
00:21:07.840 | should use it.
00:21:08.840 | So if anyone's reporting SOTA numbers on that data set, you should just distrust them.
00:21:12.160 | And I'm talking about very specific people.
00:21:15.320 | But again, I think, as I mentioned, these accuracy metrics are good for publicity and
00:21:20.560 | pushing up benchmark numbers and so on and so forth.
00:21:22.440 | But the real evaluation is human evaluation of the long-form answers.
00:21:25.080 | And that's what I'll come to in the next part.
00:21:31.400 | So so far, so good, right?
00:21:32.400 | I mean, we were getting SOTA results on these benchmarks, and we were very happy.
00:21:37.160 | And so what we did was-- I mean, one thing you'll observe that I have so far only reported
00:21:40.680 | results on multiple choice questions, short-form answers.
00:21:44.340 | So what was left for us to do was to take these answers, take these models, and generate
00:21:47.920 | long-form answers to the other data sets that we had and get them human-evaluated.
00:21:52.400 | And I think that is where the real project began.
00:21:55.440 | When we looked at the evals by experts and laypeople, it revealed key gaps and limitations
00:22:05.520 | in the Flan-PaLM responses.
00:22:07.320 | We were often seeing that these models were hallucinating or producing incomplete responses.
00:22:11.920 | And when we asked experts whether they preferred clinician-generated answers or these model-generated
00:22:16.600 | answers, they almost always preferred clinician-generated answers.
00:22:21.520 | So it was very clear that--
00:22:23.360 | Sorry, I didn't get to this earlier, but I've got these evaluators.
00:22:26.480 | Are these laypeople, or are these--
00:22:29.200 | They're both.
00:22:31.200 | They're both.
00:22:33.200 | So what these previous results showed was, while these models already encode some degree
00:22:37.120 | of clinical knowledge, to be really used in actual real-world settings, you need to align
00:22:41.640 | these models better to the safety-critical requirements of the medical domain.
00:22:45.680 | But a big challenge is we did not have any kind of supervised or feedback data.
00:22:49.560 | And so we really need the alignment technique to be data-efficient.
00:22:53.920 | But thankfully, we had prompt tuning, which was introduced by Brian Lester and a
00:22:59.280 | few others at Google a couple of years back.
00:23:01.600 | And how this method works is it essentially freezes the big LLM model and only learns
00:23:08.680 | an additional small set of prompt vectors, which can then be used to condition the model
00:23:14.160 | at inference when doing the generation.
00:23:17.260 | And the nice thing about this is it allows very easy reuse of the model across tasks
00:23:22.080 | and domains.
00:23:24.120 | And you only need to carry on these additional prompt parameters.
00:23:28.520 | And these tend to be much smaller than the billions of parameters that you have in the LLM itself.
00:23:34.760 | And the other good thing is this is very computationally efficient as well.
00:23:37.720 | So if you were to do end-to-end fine-tuning, often in our compute infrastructure, even
00:23:42.120 | with a few thousand examples, that would take a few days.
00:23:45.000 | Whereas with instruction prompt tuning, A, given the data set size is also reduced, the number
00:23:49.800 | of examples that you need is quite small.
00:23:52.160 | And B, you're just updating the prompt token vectors.
00:23:55.120 | It meant that we were able to get model updates in a few hours.
00:23:57.880 | And so that was really fast and enabled really quick iterations for us.
00:24:02.960 | So this was how we put together the final Med-PaLM model.
00:24:07.500 | So we used instructions and exemplars from a panel of expert clinicians.
00:24:19.560 | And these were in the order of hundreds, not like thousands or tens of thousands.
00:24:23.600 | And you see a few examples over there.
00:24:24.920 | There's an instruction, followed by a model answer, followed by an explanation.
00:24:28.880 | And we use that to learn the prompt vectors.
00:24:31.160 | And so the final Med-PaLM model is basically all of Flan-PaLM plus these additional soft
00:24:36.560 | prompt vector parameters, which are used to align the model to the requirements of the
00:24:40.320 | medical domain.
00:24:41.480 | And why this works well is because, as we have seen before, the model already has medical
00:24:44.520 | knowledge encoded in it.
00:24:45.800 | All we need is to teach the model how to use it properly in the given application setting.
00:24:49.720 | And that's what these prompt parameters do for us.
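
For intuition, here is a minimal PyTorch-style sketch of the soft-prompt idea described above; frozen_lm and its embed / inputs_embeds interface are hypothetical placeholders, not the actual PaLM or Med-PaLM implementation:

```python
# Soft prompt tuning sketch: the big model stays frozen, and only a small
# matrix of learnable prompt embeddings is prepended to the input embeddings.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, frozen_lm, embed_dim, num_prompt_tokens=100):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():
            p.requires_grad = False                      # the big model is never updated
        self.soft_prompt = nn.Parameter(
            torch.randn(num_prompt_tokens, embed_dim) * 0.02
        )                                                # the only trainable parameters

    def forward(self, input_ids):
        tok_emb = self.lm.embed(input_ids)               # (batch, seq, dim); hypothetical interface
        prompt = self.soft_prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        return self.lm(inputs_embeds=torch.cat([prompt, tok_emb], dim=1))
```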
00:24:53.760 | So the question I wanted to ask is, nowadays, you've probably seen a lot about RLHF.
00:24:58.560 | And given the fact that you have all of these human preferences expressed by your evaluators,
00:25:02.520 | can you guys explain, have you guys tried playing with a reward or preference model
00:25:08.320 | and using that to find a new model?
00:25:11.360 | Yeah, I think you can think about different stages of model development.
00:25:14.480 | So this is pre-deployment and release in the real world.
00:25:17.880 | So you can't put a crappy model out there in the real world.
00:25:20.460 | So even before doing that, if you can get maybe 100 examples from whatever experts that
00:25:25.020 | you can get hold of and use that to prompt your new model, that's better.
00:25:27.420 | That's a much better starting point before you expose the model to the real world and
00:25:30.740 | collect preferences from real users at scale.
00:25:33.180 | And so I think RLHF is also much less sample-efficient compared to instruction prompting, again,
00:25:39.540 | because you're probably trying to update your entire model as well.
00:25:43.160 | So I think this is a very good starting point.
00:25:44.580 | And so they can both be combined depending on the lifecycle of the model.
00:25:48.900 | Are your evaluations, human evaluations, public?
00:25:54.300 | The data set is public.
00:25:55.300 | I'll talk about the results in a bit.
00:25:58.020 | No, sorry.
00:25:59.020 | Are the human evaluations contained publicly within the data set?
00:26:03.820 | You mean the model responses and what the human evaluations are?
00:26:07.580 | That's a good point.
00:26:09.380 | So far, not considering releasing them, but maybe we can.
00:26:13.380 | Do you see a use case for that?
00:26:14.700 | Well, I was thinking, you have a bunch of data in the preferences, so train the model
00:26:19.420 | to express those preferences, and then use that model for RLHF in a medical community.
00:26:24.020 | So if I wanted to train a reward model, that data is what I would need to train that reward
00:26:27.820 | model.
00:26:28.820 | Yeah, that's a good point.
00:26:29.820 | I think the evaluation data set is-- I'll talk about this a bit later.
00:26:31.980 | It's still small.
00:26:32.980 | But I think if we scale it up-- and we are doing it right now-- I think we can release
00:26:35.420 | that.
00:26:36.420 | And that will be, I think, a good resource of what you're trying to do.
00:26:40.420 | Cool.
00:26:41.420 | So we have the metform model, as I said.
00:26:42.420 | And now we took the long-form answers from it and compared that to the planform model,
00:26:46.780 | as well as to answers generated by expert clinicians.
00:26:49.820 | And as I said, we have two parts to the human evaluation.
00:26:52.020 | One is by expert clinicians, and then the other one is by lay users.
00:26:55.780 | And so what do these results look like?
00:26:58.460 | On the 140-odd questions that we got these evaluation results on, what we observed typically
00:27:05.420 | across the board was when we looked at different axes, while the planform model would be quite
00:27:11.420 | terrible, honestly, the metform model would do much better and typically close the gap
00:27:15.620 | to expert clinicians.
00:27:17.280 | So on this axis, you see that the planform model has probably a 60% accuracy in terms
00:27:23.840 | of scientific consensus.
00:27:26.900 | The metform model improves on that quite a bit and closes the gap to clinicians over
00:27:30.260 | here.
00:27:31.260 | A similar story on other axes, as well.
00:27:34.940 | Over here, you see the clinician's rating on the axes of how well the model can retrieve
00:27:41.460 | medical knowledge, how well it can reason about it.
00:27:44.980 | And again, we see the same trend as in the previous slide.
00:27:48.020 | [INAUDIBLE]
00:27:49.020 | Yeah.
00:27:50.020 | So the left column-- the left two columns is correct comprehension of people's reasoning.
00:27:56.020 | So it's evidence of correct comprehension.
00:27:58.500 | And the right--
00:27:59.500 | And on the right-hand side, it's incorrect.
00:28:00.500 | So I think it would just be 1 minus--
00:28:03.700 | It can be present at the same time.
00:28:05.200 | So you can have evidence of correct comprehension, also evidence of incorrect comprehension.
00:28:10.100 | Sometimes you see--
00:28:11.100 | [INAUDIBLE]
00:28:12.100 | Exactly.
00:28:13.100 | So that's why they're not 1 minus over here.
00:28:15.540 | I see.
00:28:17.540 | But the trends are the same.
00:28:18.540 | So that's why I skipped over.
00:28:19.540 | But that's a detail.
00:28:20.540 | Good point.
00:28:21.540 | Yeah.
00:28:22.540 | Again, so there's a typo over here.
00:28:26.000 | But this one pertains to incorrect or missing content.
00:28:30.740 | But this was an interesting one, because what happened when we were doing this prompt tuning thing
00:28:34.740 | was we were teaching the Med-PaLM model to produce longer and more complete answers.
00:28:40.340 | And so you'd see a few qualitative examples later.
00:28:42.700 | But what ended up happening in the process was sometimes the model was maybe producing
00:28:46.000 | more incorrect information.
00:28:47.580 | So that's why you see that maybe in this particular axis, the Flan-PaLM model was slightly better.
00:28:51.900 | But again, this was much worse compared to clinicians.
00:28:54.740 | [INAUDIBLE]
00:29:00.140 | It's a good question.
00:29:01.940 | It is more like it's something completely out of context.
00:29:05.540 | So it may be irrelevant to the question.
00:29:08.780 | So that's what I would say.
00:29:09.780 | So we also looked at possible and extent and likelihood of harm.
00:29:20.220 | And again, we see that with the instruction prompt tuning, we're able to close the gap to
00:29:23.380 | expert clinicians over here.
00:29:26.020 | Same on the bias axis as well.
00:29:27.660 | [INAUDIBLE]
00:29:28.660 | Sure.
00:29:29.660 | Can you interpret the top?
00:29:32.660 | Exactly.
00:29:33.660 | So I think, basically, the category of death, and then the clinicians at, like, 6%-- so would
00:29:41.180 | you talk more about how to clarify exactly what that means and what you're talking about?
00:29:48.180 | Yeah.
00:29:49.180 | So it's basically-- so there might be certain conditions or pathologies or diagnosis, right?
00:29:53.220 | Like cancer.
00:29:55.100 | And if, for example, the clinician has not caught that or has maybe given a response
00:30:00.700 | that does not appropriately convey the severity of the condition, then that could potentially
00:30:05.740 | lead to severe harm or death.
00:30:07.580 | And so that's what we were trying to capture over here.
00:30:10.340 | So that's a very high-level overview.
00:30:12.700 | This is, I think, a very nuanced topic.
00:30:16.020 | And there's a framework for it called the AHRQ framework.
00:30:19.500 | And so we've linked that in the paper as well.
00:30:21.460 | And so I think that gives you a very detailed notion of harm and bias, as I would refer
00:30:25.460 | to that.
00:30:26.460 | But at a high level, this is what I'm talking about, how that helps.
00:30:28.380 | All right.
00:30:29.380 | So when, later, I look at this chart, and I see the clinician had 5.7% on extent of possible
00:30:35.540 | harm, which means-- like, what would I say?
00:30:38.180 | Does that mean that, like, they recommend something that could kill the patient?
00:30:42.340 | Maybe they fail to recommend something.
00:30:44.140 | Yeah.
00:30:45.140 | So it's basically a misdiagnosis or maybe failing to capture the severity of a diagnosis.
00:30:50.300 | This is typical in life-threatening conditions.
00:30:53.420 | So more often than not it's not outright mistakes, but rather just missing out on details.
00:31:00.460 | Yeah.
00:31:03.860 | So I talked about bias as well.
00:31:05.940 | And then, as I said, the other axis of human evaluation was with lay users.
00:31:10.740 | And so we asked them, how well does the model address the intent of the question?
00:31:16.020 | And again, we saw, with instruction prompt tuning, Med-PaLM closing the gap to clinicians.
00:31:20.100 | And then we asked them how helpful the responses were.
00:31:24.300 | And what we see is that while Flan-PaLM responses were considered to be helpful, like, 60% of
00:31:28.100 | the time, the number improved to 80% for Med-PaLM, but it was still fairly low compared to
00:31:32.300 | clinicians at 90%.
00:31:36.140 | So here are a few qualitative examples.
00:31:38.020 | And so what you see is that physicians-- and this is typically because they work in time-constrained
00:31:42.100 | settings-- their answers tend to be precise and succinct.
00:31:48.420 | But sometimes it's very hard, as lay users or patients, to decipher and decode the answer
00:31:52.620 | and get all the full set of details.
00:31:53.940 | And so what I think language models like Med-PaLM can help with is actually converting the physician's
00:31:58.700 | speak to something that's more easily digestible by lay users.
00:32:01.980 | And so this is where I think how these models will likely fit in clinical settings in the
00:32:06.980 | near term, where they are going to augment physicians in terms of interacting with patients
00:32:10.660 | and other physicians and researchers as well.
00:32:13.180 | So I go ahead.
00:32:16.620 | A question on this, then, because if I look at this example, I think actually, it looks like
00:32:22.700 | the physician's answer is actually more understandable.
00:32:27.500 | And whether, as a patient, you find it helpful or not, it really is subjective.
00:32:35.860 | And it's my--
00:32:36.860 | That's right.
00:32:37.860 | I think it's subjective.
00:32:38.860 | And so that's why I think we're still seeing lay users rate Flan-PaLM answers as helpful less often,
00:32:44.140 | while that's much higher for physicians.
00:32:45.140 | So it's not perfect by any means, but I think this is where there is a complementarity element
00:32:50.220 | we feel over here.
00:32:52.340 | And we've asked that.
00:32:53.340 | And so when we ask people, how easy is it to interpret doctor notes or recommendations,
00:32:59.900 | and they often say, oh, it's very hard.
00:33:01.100 | I need to go back to Google, search for what these terms mean, what these abbreviations
00:33:04.900 | mean.
00:33:05.900 | And so I think this is where a language model can come and take that note and convert that
00:33:07.860 | into something that's more easily digestible.
00:33:10.020 | So I think that's the opportunity over here, I feel.
00:33:19.860 | So that was all on our paper.
00:33:21.940 | But I also want to maybe very quickly point out a very recent work which came out last
00:33:25.140 | week with this rather provocative title, Do We Still Need Clinical Language Models?
00:33:29.660 | And by clinical language models, they meant smaller models which are trained in domain
00:33:34.820 | with clinical data such as medical notes and records and so on and so forth.
00:33:41.180 | And what this paper basically suggests is that smaller, fine-tuned, in-domain LLMs are
00:33:46.620 | likely better than general-purpose LLMs.
00:33:49.340 | In this paper, I think they evaluated on GPT-3 with in-context learning.
00:33:54.540 | So I think that's a pretty interesting and neat observation.
00:33:56.540 | I think there's a lot of value for smaller in-domain LLMs such as PubMedGPT and a
00:34:01.820 | few other variants.
00:34:02.820 | But I think one thing that this paper does not do is consider in-context learning-- sorry,
00:34:06.940 | prompt tuning.
00:34:08.100 | And I think that's where some of the benefits of these larger general-purpose LLMs shine.
00:34:12.740 | And again, we haven't done any in-domain LLM pre-training on these large general-purpose
00:34:20.060 | models.
00:34:21.060 | But that's, again, an option for us as well to do down the line.
00:34:24.700 | So you can take these 540 billion parameters and then still train it on medical notes or
00:34:27.940 | whatever domain-specific data that you can get hold of.
00:34:29.780 | And hopefully, that will probably further improve the performance.
00:34:34.460 | So key takeaways so far-- what I wanted to convey was general-purpose LLMs, it looks
00:34:40.420 | like they do encode medical knowledge.
00:34:42.940 | And performance on medical reasoning does seem to improve with scale.
00:34:46.340 | However, these models, I don't think, can be directly used out-of-the-box in clinical
00:34:49.940 | settings.
00:34:50.940 | And they need to be aligned with the safety-critical requirements of the medical domain.
00:34:55.020 | And I think instruction prompt tuning is an extremely efficient technique, both on the
00:34:58.700 | data side and also on the compute side.
00:35:00.780 | And we should probably use it more often, depending on-- and hopefully, the API starts
00:35:04.660 | supporting it as well.
00:35:07.140 | And these models appear to be closing the gap to expert clinicians, at least on this
00:35:11.420 | medical question-answering task.
00:35:13.300 | And while this is hugely exciting and has profound implications-- you can all probably
00:35:17.460 | dream up and imagine the application scenarios over here-- I think comprehensive benchmarks
00:35:22.640 | and evaluation frameworks are necessary in order to further assess and improve these
00:35:26.740 | models for real-world use cases.
00:35:30.460 | So I'll stop over here.
00:35:31.460 | Any questions?
00:35:32.460 | [INAUDIBLE]
00:35:33.460 | I think--
00:35:34.460 | [INAUDIBLE]
00:35:35.460 | A lot of it is because these data sets tend to get locked in silos with privacy and other
00:35:54.100 | kinds of regulations, which prevent them from being put out there in the real world.
00:35:57.460 | So you have to have HIPAA-compliant systems for storage and so on and so forth.
00:36:01.020 | So it's very difficult to get data out of these silos and put together an open benchmark.
00:36:06.820 | So honestly, I feel like that's probably not going to improve the scale of these data sets.
00:36:11.540 | At least the open version of these data sets are going to remain quite small compared to
00:36:17.220 | the big LM training data sets or the computer vision data sets on natural images and so
00:36:20.580 | on and so forth.
00:36:21.780 | But what may happen in the future is we may have more distributed federated evaluation
00:36:25.940 | settings where you take the model into these private silos and get them evaluated on.
00:36:31.900 | So they are never exposed and put out there in the public.
00:36:34.380 | But rather, we can have these federated evaluation settings.
00:36:37.660 | So I think that there's some work on that already.
00:36:39.300 | There's a system called MedPerf.
00:36:40.300 | And we'll probably see more of them.
00:36:41.300 | [INAUDIBLE]
00:36:42.300 | Sure.
00:36:43.300 | So the question over here was why medical data sets are smaller compared to natural
00:36:55.060 | image data sets in computer vision or LM training data sets and so on and so forth.
00:36:58.460 | What do you think are some of the earliest applications of medical LLMs deployed in the
00:37:05.460 | industry?
00:37:06.460 | I think the first set of use cases are probably going to be not diagnostic in nature.
00:37:12.180 | Sorry.
00:37:13.180 | The question was, what do you think are the use cases of medical LLMs in medical industry
00:37:18.580 | settings?
00:37:19.580 | And so the answer is I think the first set of use cases that we are going to see are
00:37:23.380 | probably going to be non-diagnostic in nature, but more around if a patient comes in and
00:37:28.980 | interacts with a doctor, can you generate summary notes?
00:37:33.020 | And can you do workflow tasks such as generating letters for insurance, for medications, for
00:37:40.020 | referrals, and so on and so forth?
00:37:41.100 | I think these tasks are right up the alley of large language models.
00:37:43.780 | And I think if not already, in the next six months to a year, we'll see a lot of these
00:37:47.340 | use cases coming up.
00:37:48.340 | And I think that's going to make doctors' life, care providers' life much easier because
00:37:51.900 | right now they're spending a lot of time doing these things and not actually providing care
00:37:55.740 | and attending to the patient.
00:37:59.100 | Diagnostic use cases, I think, will take a lot more time.
00:38:00.500 | We need a lot more evaluation.
00:38:01.860 | The data sets, as we can see, are probably not there.
00:38:04.140 | Evaluation frameworks are not there.
00:38:05.380 | But I think in the long run-- and that is the dream setting, right?
00:38:08.780 | And then maybe a follow-up is-- I'm assuming Med-PaLM is not open source.
00:38:14.740 | What do you think the best open source model is for medical data?
00:38:20.700 | And I think it depends on the-- so the question is, what is the best open source model for
00:38:25.700 | medical data?
00:38:27.380 | It I think depends on the evaluation setting.
00:38:30.460 | So I think the PubMed GPT model from the Stanford Foundation Models Group is quite strong.
00:38:36.380 | I think GPT-3 or 3.5 or whatever variant, if you can bring in some domain-specific medical
00:38:41.340 | data and do some in-domain tuning, I think that model can also improve quite a bit.
00:38:44.620 | So I think those two would be my favorite starting points over here.
00:38:47.620 | So I was curious, like, what do the soft prompts look like, with updates such as [INAUDIBLE]
00:38:59.660 | It's-- you can just think of them as vectors corresponding to a few additional tokens.
00:39:04.020 | So it's not really human legible.
00:39:06.820 | So the question was, what do the soft prompt vectors look like?
00:39:10.780 | And are they human legible?
00:39:12.380 | And yeah, the answer is, no, they're not.
00:39:14.340 | Just a follow-up.
00:39:15.340 | You said-- you mentioned federated learning for a margin of error.
00:39:17.340 | If you use files of thirds of parameters, they're usually on-the-prem sites that hospitalize
00:39:21.620 | I heard that.
00:39:22.620 | Low-quality infrastructure, on-the-demand data set.
00:39:23.620 | Do you really believe that's really learning?
00:39:24.620 | Will we have, like, on-site hardware, on-site data sets?
00:39:25.620 | That all the teams that can play these models is going to work?
00:39:38.740 | Sure.
00:39:39.740 | So the question was, given a lot of the hospital systems and providers' networks are quite
00:39:46.340 | low-tech and don't have good enough hardware, do you really think federated learning could
00:39:50.380 | be used for distributed training of large-scale LLMs?
00:39:54.580 | I think we are increasingly seeing a trend towards cloud.
00:39:57.980 | And so a lot of these hospital systems are moving their storage and data and compute
00:40:03.700 | to standard cloud providers like AWS or Azure or Google Cloud.
00:40:08.740 | And so I think that helps, because these systems on the back-end side do have the compute to
00:40:13.620 | be able to train these kind of models.
00:40:16.300 | I think it's going to be a very gradual process.
00:40:18.420 | So systems that have high-quality infrastructure, probably we're going to start with that first,
00:40:23.540 | and then gradually work our way into the long tail.
00:40:26.300 | But it also feels like something that will inevitably exist in the world.
00:40:30.380 | So 10 years down the line, or 15 years down the line, when we have these distributed large-scale
00:40:33.980 | LLM training systems, we'll always think back, "Why did I even doubt that this would exist?"
00:40:40.140 | It's so obvious it's something that has to exist, because that's where all the patient
00:40:42.580 | data is, all the interesting data is, right?
00:40:44.460 | So I think that'll just happen.
00:40:46.180 | It's just not clear whether that's going to be done by one company, whether that's going
00:40:48.900 | to be done by a consortium of academic or industry groups, or whether governments are
00:40:53.620 | going to be involved, and so on and so forth.
00:40:55.340 | It's interesting.
00:40:56.340 | You mentioned cloud computing, but essentially, you say you're doing it federated and distributed,
00:41:00.660 | but we're still uploading the data, probably the same compute warehouse, right?
00:41:05.420 | That's right.
00:41:06.420 | So the question over here is, we're seeing cloud computing, but we are pretty much uploading
00:41:10.660 | the data to the same warehouse.
00:41:13.380 | The answer is true.
00:41:14.380 | But again, I think these are all going to be separate buckets with their own access
00:41:18.500 | controls, and so on and so forth.
00:41:19.700 | So that is how you can differentiate between them.
00:41:21.760 | There's not a lot of [inaudible 00:36.06]
00:41:25.220 | It doesn't seem like that's a good thing, but it makes sense that we're going to be
00:41:35.620 | [inaudible 00:36.20]
00:41:53.420 | Sure.
00:41:54.420 | So the question was, has there been any studies in MedPalm looking at private information
00:41:56.540 | in these data sets?
00:41:58.020 | And the short answer is no.
00:41:59.020 | One of the criteria for selecting the data sets that we used in the study was to not
00:42:03.460 | include any kind of personally identifiable data or clinical data of that sort.
00:42:07.100 | And that helped get this paper out on time.
00:42:10.100 | But I think that's an important point.
00:42:13.140 | It's unlikely that we're going to have a lot of PHI data in the public data sets that we
00:42:19.060 | are training on.
00:42:20.060 | But even when you're training on, say, one private corpus and then you're using it in
00:42:24.860 | another application setting, you want to ensure that the model does not leak out any kind
00:42:28.780 | of PHI information during its generation.
00:42:31.140 | So I think those sort of studies are necessary.
00:42:33.020 | We haven't got into them yet.
00:42:34.020 | [inaudible 00:37.57]
00:42:35.020 | So the question is, what are the next steps in terms of improving these models further?
00:42:51.860 | Yeah.
00:42:52.860 | Retrieval is a very important one.
00:42:55.180 | Being able to cite sources and especially take in authoritative sources and use that
00:43:00.260 | in generating the answers and also communicating that to the users is very important.
00:43:03.580 | I think how you communicate uncertainty is very important.
00:43:07.580 | So we've gotten to some extent using instruction prompt tuning, but I think that can be much,
00:43:11.700 | much better.
00:43:13.500 | So I think that's another big bucket.
00:43:15.980 | Again, I would stress on the evaluation side, looking at more data sets, which for example
00:43:20.900 | may do a Q&A on health records or other kinds of medical data, I think that will be important.
00:43:26.940 | And also extending the evaluation both in terms of scale, having a diverse panel of
00:43:31.380 | clinicians support, and also in terms of the data that you're using.
00:43:34.740 | Maybe adversarially modifying the questions to include demographic confounders or something
00:43:38.820 | like that.
00:43:39.820 | I think those all could be interesting directions.
00:43:42.180 | I think on the modeling side, the interesting question for me is again, this interplay between
00:43:46.880 | smaller domain-specific LLMs versus large general-purpose LLMs and how that's going
00:43:52.420 | to play out.
00:43:54.460 | There seems to be some evidence of emergence over here, especially with medical reasoning.
00:44:00.540 | And so as you can see at lower scales, sometimes the performance is not good enough.
00:44:04.460 | I mean, 50%.
00:44:05.460 | I mean, that's a good number, but that's just not viable.
00:44:07.920 | But when you get to like 80%, 90%, products really become useful.
00:44:11.340 | And so that we are seeing at bigger parameter sizes of these models.
00:44:15.920 | But I don't know.
00:44:16.920 | I think it's still an open question over here.
00:44:20.740 | Yeah, the question was, is hallucination an issue?
00:44:25.660 | I think it still is.
00:44:27.500 | But I believe that you can control that fairly well with instruction prompting, like any
00:44:32.460 | kind of feedback data.
00:44:33.460 | I think it's not terribly difficult to do.
00:44:36.860 | And so I think it might have been overblown generally.
00:44:42.660 | So especially when you are doing it in a particular domain, I think it's easier to control.
00:44:46.540 | I'm just curious [INAUDIBLE] the extent to which the method reader or what it looks like.
00:44:54.820 | I just think recently there's been a lot of [INAUDIBLE]
00:45:05.140 | So I'm just curious, because this particular [INAUDIBLE] very, very relevant, and [INAUDIBLE]
00:45:12.340 | Yeah, so the question was, there is a lot of talk and noise around hallucinations and
00:45:25.060 | general purpose LLMs.
00:45:26.940 | And in this particular application domain, it seems particularly relevant.
00:45:30.140 | And so can you expand on that a little bit further?
00:45:34.180 | Sure.
00:45:35.180 | So what we are seeing is, even with an order of a few hundred examples from expert clinicians,
00:45:40.380 | teaching the model how to communicate medical information, that is good enough to get the
00:45:46.460 | model to maybe stop hallucinating, or at least communicate its uncertainty in a better way.
00:45:53.760 | So at least in this particular domain or this setting, it feels more tractable to us.
00:45:59.940 | And the reason I'm saying this is we've looked at the answers qualitatively, and we are seeing
00:46:03.220 | that the model does not tend to generate super long answers or make very confident predictions,
00:46:11.820 | but rather the tone itself becomes very reserved.
00:46:16.100 | And it starts using terms like, maybe this needs to be done further, or something like
00:46:20.540 | that, which communicates uncertainty.
00:46:22.420 | So how well is that actually correlated with the representation underlying uncertainty
00:46:26.060 | that we have is still, I think, an area of research.
00:46:29.260 | But I think this is already promising for us, that it feels controllable in limited
00:46:32.700 | application settings like medicine.
00:46:35.340 | But if you have a general purpose LLM trying to answer pretty much everything about the
00:46:38.100 | world, I think that's a much harder problem.
00:46:39.100 | Do you think that would be a feature of the domain data?
00:46:40.100 | Like, in medical settings, doctors are more reserved, perhaps, and don't speak in
00:46:41.100 | absolutes when handling uncertainty?
00:46:42.100 | Or do you think it's more that you have just specialized?
00:46:43.100 | Like, it could be something else entirely, also.
00:47:05.780 | I'm just curious what you might think.
00:47:08.220 | Yeah.
00:47:09.220 | So my question is, do you think the way the model is performing in this domain is
00:47:15.540 | a feature of the datasets in the medical domain, which typically reflect how doctors
00:47:21.820 | communicate?
00:47:22.820 | And I think that's true.
00:47:24.060 | And I think that's something we need to build on and use over here.
00:47:26.580 | And I think that's extremely helpful.
00:47:27.900 | And hopefully, this kind of behavior is general enough and can be transferred to the model,
00:47:33.380 | even when it's used in non-medical settings, so that it's more reserved when it's communicating
00:47:38.940 | and hallucinates less, and so on and so forth.
00:47:41.020 | So I believe that that's one of the opportunities over here: to use these benchmarks, come up
00:47:44.900 | with methods that reduce hallucination, communicate uncertainty better, and then use that as a
00:47:49.500 | bidirectional learning opportunity to improve the general-purpose LLMs as well.
00:47:54.580 | So if you have any further questions, I'll come back again at the end of the talk.
00:47:56.700 | But I want to cover the rest of the applications as well.
00:48:00.940 | So the next domain I want to talk about is proteins.
00:48:05.780 | And the papers, from now, I'm going to zip through them a little bit, given time.
00:48:11.500 | But the first one I want to talk about is this paper from a few folks at Google Research back in
00:48:17.220 | 2020, called Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers.
00:48:24.380 | So the problem here is that modeling long range biological sequences requires efficient
00:48:29.180 | transformer architectures.
00:48:31.860 | And so in this particular paper, what they introduced was this performer architecture,
00:48:36.540 | which approximates the softmax attention kernel via low rank decomposition.
00:48:42.040 | And so this does not incorporate any sparsity priors, say, like other methods like the reformer,
00:48:48.620 | or there are many others.
00:48:51.780 | And this is good, because sparsity priors may not be appropriate for biological data
00:48:57.060 | such as protein, which require global interactions to be modeled.
00:49:01.500 | And then the other thing is that this model, the Performer, scales linearly rather than quadratically
00:49:06.020 | with the sequence length, L. And the number of random features that you need to approximate
00:49:10.700 | this softmax attention kernel, M, is completely independent of the input sequence length.
00:49:16.420 | So just to very quickly visualize the speedups and the space complexity improvements: what
00:49:20.780 | you get with this low-rank decomposition is that, instead of having fat L-by-L matrices in your
00:49:24.820 | softmax attention kernel, you now have thinner matrices, whose width is determined by the number
00:49:29.400 | of random features, M. And that basically reduces your quadratic complexity to something
00:49:34.620 | that is more linear in nature, and also leads to space improvements.
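To make that concrete, here is a minimal NumPy sketch of random-feature attention in the spirit of the Performer's FAVOR+ mechanism. The function names, the feature count M, and the exact feature map are illustrative simplifications rather than the paper's implementation (which, for example, uses orthogonal random features), but the key point carries over: no L-by-L attention matrix is ever materialized, so cost and memory scale with L times M instead of L squared.

```python
import numpy as np

def softmax_kernel_features(x, projection, eps=1e-6):
    # Positive random features approximating exp(q . k / sqrt(d)).
    # x: (L, d); projection: (d, M), entries drawn i.i.d. from N(0, 1).
    d = x.shape[-1]
    x = x / (d ** 0.25)                                # folds in the 1/sqrt(d) softmax scaling
    proj = x @ projection                              # (L, M)
    sq_norm = 0.5 * np.sum(x * x, axis=-1, keepdims=True)
    return np.exp(proj - sq_norm) + eps                # (L, M), non-negative features

def performer_attention(q, k, v, num_features=256, seed=0):
    # Linear-complexity attention: O(L * M * d) instead of O(L^2 * d).
    rng = np.random.default_rng(seed)
    d = q.shape[-1]
    projection = rng.standard_normal((d, num_features))
    q_f = softmax_kernel_features(q, projection)       # (L, M)
    k_f = softmax_kernel_features(k, projection)       # (L, M)
    kv = k_f.T @ v                                     # (M, d_v) -- never forms an L x L matrix
    normalizer = q_f @ k_f.sum(axis=0, keepdims=True).T  # (L, 1)
    return (q_f @ kv) / normalizer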
00:49:38.860 | So I would-- yeah, there are more theoretical analysis and details in the paper, and I would
00:49:42.700 | refer you all back to it.
00:49:44.980 | But what we see in terms of results when doing protein language modeling is that the accuracy
00:49:51.060 | of this model is on par with transformers while reducing computational costs quite a bit.
00:49:57.260 | So what this suggests is that the approximation of the softmax attention kernel is a tight
00:50:00.720 | approximation.
00:50:01.720 | So that is good.
00:50:02.940 | And then when you compare that with other methods, such as the reformer or the linformer,
00:50:06.960 | the accuracy is much higher, at least on this task.
00:50:09.400 | So it seems that, compared to other methods that try to build more
00:50:13.220 | efficient transformers, this one is much better for biological sequence data, at least in
00:50:17.500 | this setting.
00:50:20.140 | And finally, if you look at the attention-derived amino acid similarity matrix, you can
00:50:28.660 | see that the Performer model recognizes highly similar amino acid pairs, such as (D, E) and
00:50:33.700 | (F, Y) over here.
00:50:34.700 | So that suggests that the model is learning the right set of information that we really
00:50:37.680 | want over here.
00:50:39.940 | So that was a two-minute overview of that paper.
00:50:43.940 | But I want to talk about another one, which also I think is really, really cool.
00:50:49.440 | So this one is called Protein LM, again, by a few other folks at Google Research.
00:50:54.200 | And what this does is model-based natural language protein annotation.
00:51:00.460 | And why this problem is important is because the protein information is in very high demand.
00:51:07.460 | So over 50% of all known proteins that have been sequenced, we don't actually know what
00:51:11.260 | they do.
00:51:12.260 | So it's important that we're able to decipher that, to some degree at least.
00:51:16.060 | And then the second thing is we may want to, for example, find protein sequences with given
00:51:20.020 | functions.
00:51:21.020 | And this is particularly important in the CRISPR domain.
00:51:23.520 | And so if you can train bidirectional models that can do this, I think that will be incredibly
00:51:28.160 | helpful.
00:51:31.320 | And the reason I say this, again, is that the UniProt database has, I think, millions
00:51:37.540 | of researchers worldwide using it today.
00:51:39.220 | And so getting this information populated in that database would be incredibly useful
00:51:43.720 | and accelerate a lot of research in this space.
00:51:47.300 | And so the European Bioinformatics Institute, they have curated this free text data about
00:51:52.920 | proteins.
00:51:53.920 | And so basically, you can use this protein record to train these models.
00:51:58.340 | And so what you want to do is you want to maybe learn to directly map from amino acid
00:52:03.020 | sequences to natural language descriptions of them.
00:52:06.700 | And this problem is not too different from an image captioning problem, where instead
00:52:10.120 | of having a grid of pixels,
00:52:14.260 | you have a sequence of amino acids.
00:52:17.420 | And they can range in number from 2 to 40k.
00:52:20.720 | And then what you want to generate out is a description of the protein.
00:52:26.660 | And in this paper, the way they do this is they train a T5 model on protein sequence
00:52:31.700 | annotation tasks.
00:52:32.960 | So the tasks are set up in a bunch of different ways.
00:52:35.920 | And the supervised data comes from a bunch of different sources in the protein record
00:52:40.840 | that they have.
00:52:41.840 | And this model is an encoder decoder T5 model.
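As a rough illustration of this seq2seq framing, here is a hedged sketch using a generic T5 checkpoint from the Hugging Face transformers library. The checkpoint name, the prompt prefix, and the spaced-residue tokenization are assumptions made for illustration; the paper's actual models, training data, and preprocessing differ.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Generic checkpoint, only to show the task framing -- not the paper's trained model.
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# An amino-acid sequence, spaced so each residue becomes its own token
# (one common convention for protein language models; an assumption here).
protein = "M K T A Y I A K Q R Q I S F V K S H F S R Q L E E R L G L I E V Q"
inputs = tokenizer("describe protein: " + protein, return_tensors="pt")

# Encoder-decoder generation of a free-text description of the protein.
output_ids = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```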
00:52:45.020 | So it's a very cool application.
00:52:46.220 | And the results are that out of the 56 million proteins in that UniProt database that were
00:52:51.780 | previously uncharacterized, 49 million of them now have associated textual descriptions.
00:52:56.860 | So we now have a handle on what they do.
00:53:00.580 | And so that's really cool.
00:53:01.580 | And then the other one, I think, which is probably even more interesting, is now you
00:53:04.100 | can run queries like, find me a smaller version of this CRISPR-Cas9 protein so that it can
00:53:09.260 | target certain tissue spectra.
00:53:10.740 | And now the model can come back with sequences.
00:53:12.960 | And so I think this is, again, going to be incredibly useful and going to accelerate
00:53:17.220 | a lot of research in this space.
00:53:18.220 | Already, there's a lot of momentum.
00:53:19.220 | I think these models are going to further help.
00:53:24.940 | So that was on proteins.
00:53:25.940 | The last class of applications that I want to cover is on the genomics side.
00:53:29.980 | Again, the first paper over here was some work last year from our genomics team at Health
00:53:36.260 | AI at Google, which is building gap-aware sequence transformers for sequence correction.
00:53:43.140 | So this model is called Deep Consensus.
00:53:45.620 | And so what role does this model play, and why does it matter?
00:53:49.520 | So if you look at the sequencing data lifecycle, what you do is you go from basically atoms
00:53:56.200 | to bits.
00:53:57.200 | So you have this physical specimen, which hopefully has some DNA in it.
00:54:01.640 | And you put it through a sequencing machine, such as a PacBio sequencer.
00:54:05.680 | And that comes out with the raw data.
00:54:07.660 | And that raw data gets mapped to a reference genome.
00:54:11.840 | And then sometimes there might be diffs between an individual and the reference genome.
00:54:15.640 | And that can be corrected through this model called Deep Variant that was introduced by
00:54:18.400 | our team a few years back.
00:54:19.760 | And that's open source.
00:54:21.120 | And then once you have this sequence, you can then use it for a bunch of different analysis,
00:54:25.720 | such as ancestry or just basic biomedical research.
00:54:32.440 | So where Deep Consensus fits in is that it takes the raw DNA reads that come out from
00:54:37.600 | the PacBio sequencer
00:54:39.440 | and tries to make them more accurate.
00:54:42.320 | And so how the PacBio sequencer actually works is it uses this circular consensus sequencing
00:54:49.560 | algorithm where the DNA molecule is read several times.
00:54:54.280 | And it produces multiple different sub-reads.
00:54:56.720 | And these sub-reads are-- they do contain some errors.
00:55:00.200 | And so they are finally assembled together.
00:55:02.600 | And so what Deep Consensus tries to do is it tries to improve on the errors over here,
00:55:07.760 | basically, that come out from just this circular consensus sequencing algorithm.
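As a toy illustration of the consensus idea, here is a per-position majority vote over already-aligned subreads. A real CCS algorithm has to align the subreads and handle insertions and deletions, so this is only a sketch of the intuition, not what PacBio's software or Deep Consensus actually does.

```python
from collections import Counter

def toy_consensus(subreads):
    # Toy majority-vote consensus over aligned, equal-length subreads.
    # Assumes alignment has already been done; real subreads contain indels.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*subreads))

subreads = ["ACGTACGT", "ACGTACCT", "ACGAACGT"]  # each with a different random error
print(toy_consensus(subreads))                   # "ACGTACGT"
```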
00:55:13.060 | And so how does this model work?
00:55:14.280 | So as I said, the basic task for Deep Consensus is to use the CCS data and the sub-reads associated
00:55:20.640 | with them to generate a corrected sequence.
00:55:23.760 | And so in this example, when we run through the model, what we see is that while the CCS
00:55:27.140 | identity was at 95.7%, the Deep Consensus prediction identity was at 100%.
00:55:32.280 | So it's a fairly simple task where you're trying to reduce errors that come out from
00:55:36.320 | the PacBio sequencer with the CCS algorithm.
00:55:39.280 | And so the very natural question is, where do these labels come from?
00:55:43.720 | So each CCS sequence that you have is aligned to a high-quality assembly.
00:55:50.080 | And this high-quality assembly is created by having many CCS reads stitched together.
00:55:56.800 | And so that ends up having fewer errors.
00:55:59.360 | And so you can then try to use that high-quality stitched assembly and map that back to the
00:56:06.160 | CCS read for a given block and use that as the label.
00:56:09.120 | So that results in stronger ground truth.
00:56:13.720 | And you can use that to train the model to improve the accuracy further.
00:56:16.600 | And so this is what the model is trained on.
00:56:19.120 | And so the model looks like this.
00:56:21.000 | It's a transformer architecture.
00:56:23.200 | It takes these sub-reads and this CCS read as well.
00:56:27.440 | And it has a bunch of additional context features that come in from the sequencer itself, the
00:56:32.560 | sequencing instrument as well.
00:56:34.960 | And these are all fed into the transformer model.
00:56:37.440 | It produces a polished segment.
00:56:39.360 | And these segments are then stitched together to produce the final polished read over here.
00:56:45.400 | One thing I will point out over here is that in order to train this model, you can't use
00:56:49.520 | a cross-entropy loss.
00:56:51.280 | And this is because you often have insertions in DNA sequences.
00:56:58.280 | And so that can, when you use a cross-entropy loss, really throw off the model.
00:57:02.360 | Even a single error, as you can see over here, can propagate throughout the sequence and
00:57:07.080 | make things much, much worse.
00:57:08.080 | So what you need is a special kind of alignment loss based on distance that can really capture
00:57:13.280 | this error much, much better.
00:57:15.920 | And so making this alignment loss work on TPUs and making it differentiable is, I think,
00:57:20.520 | the real meat of this paper.
00:57:21.600 | And so, again, go back to the paper if you're interested in that kind of topic.
00:57:24.160 | I think that's really, really cool.
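To see why a position-by-position loss struggles here, consider this small illustration. It uses plain Levenshtein edit distance as a stand-in for the paper's differentiable alignment loss, which is a different (and differentiable) construction; the point is only that one insertion looks like a single error to an alignment-based measure but like many errors to a naive per-position comparison.

```python
def per_position_mismatches(pred, label):
    # Naive position-by-position comparison, a hard proxy for per-base cross-entropy.
    return sum(p != l for p, l in zip(pred, label)) + abs(len(pred) - len(label))

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming (single-row version).
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

label = "ACGTACGTACGT"
pred  = "ACGTTACGTACGT"   # one spurious inserted T after position 4

print(per_position_mismatches(pred, label))  # large: everything after the insertion "mismatches"
print(edit_distance(pred, label))            # 1: a single insertion
```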
00:57:26.680 | But at a very high level, how well does this model work?
00:57:30.320 | So if you look at the final output, you have the read name.
00:57:32.800 | You have the base predictions and also the predicted quality, which can be thought of
00:57:36.240 | as a confidence score.
00:57:37.320 | And these base predictions are often quite long.
00:57:40.240 | And so you can see that it continues offscreen because it's 10K to 20K bases long over here.
00:57:45.400 | And when you look at the quality, it improved quite a bit over the vanilla CCS algorithm
00:57:50.160 | over here.
00:57:51.160 | The per-read accuracy over here improved quite a bit.
00:57:54.960 | And so you may ask, what is the real-world impact of this kind of model?
00:58:01.480 | So the answer is this model is already being used in the real world.
00:58:04.400 | So at Stanford, in the genomics team led by Dr. Ashley and a few others, there was this recent
00:58:08.680 | ultra-rapid nanopore genome sequencing paper where they set a world record for the fastest
00:58:13.400 | genome sequencing.
00:58:14.800 | And this deep consensus transformer architecture was used in that assembly sequence.
00:58:18.880 | And so in this particular study, they were able to very quickly diagnose that Matthew
00:58:23.120 | over here had a heart condition due to genetic reasons.
00:58:27.160 | And so they were very quickly able to put Matthew on the heart transplant list over
00:58:31.320 | here.
00:58:32.320 | So that's the kind of real-world impact you can have with these biomedical transformer
00:58:35.920 | models and AI systems in general.
00:58:40.400 | And very quickly, the last paper that I want to talk about is this paper from DeepMind
00:58:47.060 | on effective gene expression prediction from sequence by integrating long-range interactions.
00:58:53.220 | This was published in Nature Methods.
00:58:56.840 | And the motivation for this work is, again, that since the Human Genome Project, there
00:59:00.440 | have been thousands of genome-wide association study hits, where the goal is to map genetic
00:59:08.000 | variants to different kinds of disease phenotypes.
00:59:11.080 | But a lot of this involves experimentation.
00:59:13.200 | And experimentation, like real-world experimentation, takes a lot of time.
00:59:16.040 | And so if you can do that with machine learning models, that's really, really great.
00:59:20.320 | And so that's what they set out to do in this paper.
00:59:23.880 | And so if you look at the genetic variants themselves, roughly 10% of them are going to be coding
00:59:29.920 | variants.
00:59:30.920 | And these influence protein function.
00:59:33.800 | And then the way they can cause diseases is by disrupting the structure of proteins that
00:59:37.520 | are generated or by affecting the protein-protein interactions.
00:59:42.760 | The good part about these coding variants is that they tend to be closer to the gene.
00:59:45.720 | And so they're easier to interpret.
00:59:48.060 | On the other hand, the other 90% are non-coding variants.
00:59:52.600 | And the way they work is they influence protein expression.
00:59:56.000 | So they are more like regulatory sequences.
01:00:00.760 | And so the way they can lead to diseases, if they have any variants, is by disrupting
01:00:05.880 | the transcription of proteins.
01:00:09.360 | And given that these non-coding variants can be very, very far away from the gene and the
01:00:14.040 | coding variants, it's very difficult to interpret them.
01:00:16.800 | And so the question is, can we train transformer models that can predict the influence of these
01:00:20.880 | non-coding variants?
01:00:22.000 | And so that is the task over here.
01:00:24.040 | And so this is a visualization, again.
01:00:27.320 | So the paper focuses on transcription, which is the first step
01:00:34.080 | in converting DNA into RNA.
01:00:37.160 | And the way this is done is you have RNA polymerase, which gets recruited at the beginning of the
01:00:42.440 | gene by these proteins called transcription factors.
01:00:46.640 | And these transcription factors have a binding site which correspond to these promoters,
01:00:50.200 | which are quite close to the gene.
01:00:51.840 | But then you also have these enhancers, which can be very, very far away from these promoters
01:00:56.720 | in terms of the linear space, also influencing this transcription.
01:01:02.160 | And you may ask, how can these enhancers influence the activity over here?
01:01:07.360 | This is because while they may be far away in the linear space, when the sequence folds
01:01:12.280 | and in the 3D structure, they will end up being quite close to each other.
01:01:15.800 | And so they can completely affect the transcription process over here.
01:01:19.040 | So it's a very high-level overview of what's happening over here.
01:01:21.680 | And then in terms of the biology.
01:01:24.120 | And so the question is, if there are any variants in these non-coding variants and in these
01:01:29.880 | enhancers, they may disrupt the transcription factor binding.
01:01:33.600 | And this can, in turn, lead to no proteins and then finally to diseases.
01:01:37.420 | So we want to be able to predict that based on the DNA sequences that have been generated.
01:01:42.280 | So the problem is quite straightforward.
01:01:45.000 | It's a supervised learning problem.
01:01:47.400 | The setup is predict experimental data from these DNA sequences.
01:01:51.360 | And this can take many different forms.
01:01:53.560 | The primary one is gene expression over here.
01:01:55.340 | But then there are also other tasks, such as DNA accessibility, histone modifications,
01:01:59.960 | and transcription factor binding, and so on and so forth.
01:02:04.240 | So as you can imagine, the baseline model for this task for many years was the CNN model.
01:02:11.000 | And as you start to build different CNN layers, you can increase the receptive field.
01:02:14.560 | But there's a limit to that.
01:02:16.360 | So in this work, what they showed was you can use transformers instead and do better
01:02:21.960 | modeling of these long-range interactions.
01:02:25.120 | So the final model is called Enformer, which is a combination of this enhancer and transformer.
01:02:30.920 | And so if you look at the model itself, it has a few CNN layers at the beginning.
01:02:34.360 | But then it has a bunch of transformer blocks that are stacked together.
01:02:37.800 | And the input is 200 kb DNA sequences.
01:02:42.120 | And there are approximately 30,000 examples that it's trained on.
01:02:44.820 | And the output is genomic tracks of RNA expression.
01:02:48.600 | And they have organism-specific heads, so one for humans and one for mouse.
01:02:54.360 | And finally, one key detail is that the relative position encodings that were used in this
01:02:58.080 | model were actually very important.
01:02:59.680 | And these relative position encodings were modeling this power law of interactions.
01:03:04.320 | And as a result of using these relative position encodings with the transformer block architecture,
01:03:08.640 | they were now able to model interactions over 100 kb away.
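As a minimal sketch of what power-law relative position information could look like, here is a distance-dependent attention bias that decays with the log of the distance between positions. Enformer's actual relative positional basis functions are richer (multiple basis functions with symmetric and antisymmetric components), so treat this purely as an illustration of the idea that distant positions, such as far-away enhancers, still contribute, just more weakly.

```python
import numpy as np

def power_law_position_bias(seq_len: int, alpha: float = 1.0) -> np.ndarray:
    # Pairwise bias b[i, j] = -alpha * log(|i - j| + 1), added to attention
    # logits before the softmax, so the attention weight decays roughly like
    # |i - j| ** (-alpha) with distance.
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :]) + 1
    return -alpha * np.log(dist)

def attention_with_bias(q, k, v, bias):
    # Standard scaled dot-product attention plus the positional bias term.
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```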
01:03:13.500 | And so you see that in the results over here.
01:03:16.440 | So you have the experimental data in green.
01:03:18.960 | And you can see the CNN baseline over here.
01:03:20.920 | And you see that as soon as you go far away from the gene, the CNN model is no longer able
01:03:28.600 | to capture these gene expression signals,
01:03:31.800 | whereas the Enformer model is still able to pick them up.
01:03:44.360 | And finally, one, I think, very interesting experiment that they had in the paper was
01:03:48.480 | they were also able to predict promoter-enhancer interactions.
01:03:53.900 | And that prediction was actually on par with experimental data.
01:03:57.040 | So this suggests that using this machine learning model, we can sidestep a lot of these wet
01:04:00.480 | lab experiments and get key details, which could be super useful.
01:04:06.040 | So yeah, so very quickly, I'm sorry I had to cram through proteins and genomics applications
01:04:10.840 | over here.
01:04:12.560 | But I think what you would see is that overall, when you look at clinical, protein, and genomics
01:04:16.240 | applications, we see that transformers have incredible potential in biomedicine.
01:04:21.600 | And with clinical applications, I think the challenges are perhaps more centered around
01:04:24.240 | data and evaluation.
01:04:26.400 | But on the proteins and genomics side, I think there are some extremely interesting opportunities
01:04:30.080 | to innovate on the architecture.
01:04:32.920 | And finally, as I said, there are incredible bi-directional learning opportunities.
01:04:35.920 | I think the problem of modeling long-range interactions, that's useful beyond proteins,
01:04:40.220 | beyond genomics.
01:04:41.280 | I think it's useful in general.
01:04:42.600 | And so I think any architecture improvement over here can inspire wider progress in AI.
01:04:46.480 | So I think that's a big reason to work on this.
01:04:50.600 | Any questions so far?
01:04:52.200 | Sorry, I covered a lot of ground over here.
01:04:55.840 | Apologies for that.
01:04:56.840 | But I think these are super cool papers, and you should go back and read them.
01:05:01.120 | So finally, I want to maybe spend a couple of minutes touching upon how I see the future
01:05:04.920 | of biomedical AI evolving.
01:05:06.920 | Overall, I believe it's not a question of if AI will transform biomedicine.
01:05:12.240 | I think it's rather a question of when and how.
01:05:14.840 | And I think the very specific thesis I have over here is, given how multimodal biomedical
01:05:20.600 | data is in nature, and with all the progress in transformers, self-supervised
01:05:24.240 | learning, and large language models, I think we have an incredibly powerful framework to leverage
01:05:29.360 | all this richness at scale and truly build foundational medical AI models.
01:05:34.780 | So I think that is incredibly exciting.
01:05:39.040 | And I think you've already been over here for far too long, so I'm not
01:05:44.400 | going to ask you to recognize these people.
01:05:46.480 | But they're actually famous physician scientists.
01:05:48.480 | Some of them went on to win Nobel Prizes.
01:05:50.880 | And so I think what I want to say over here is there's no reason for a scientist to be
01:05:55.360 | different from a physician.
01:05:56.360 | They can be combined together.
01:05:57.800 | And that's what I also want to convey with our AI systems as well.
01:06:00.520 | We don't have to separate clinical applications and biological applications.
01:06:03.540 | I think when we combine them together, we are going to discover a lot of new insights.
01:06:06.980 | And I think that's going to accelerate biomedical research and in turn lead to new discoveries,
01:06:11.680 | which are going to be used to eradicate diseases, advance human health span, and generally
01:06:16.920 | drive human potential forward.
01:06:20.280 | Good question.
01:06:21.280 | I don't actually know who these three are.
01:06:23.440 | Sure.
01:06:24.440 | I think the rightmost one is Alexander Fleming.
01:06:27.480 | But then Jonas Salk, and then Paul Ehrlich.
01:06:30.840 | So Fleming is penicillin, Salk is polio, and Ehrlich did a bunch of different things.
01:06:40.360 | And so maybe I'll ask this question to all of you.
01:06:44.480 | Which field of AI do you think will-- which field do you think AI will win the first Nobel
01:06:48.680 | Prize in?
01:06:49.680 | You don't have to answer.
01:06:50.680 | Just think.
01:06:51.680 | What's the complete set of Nobel Prize fields?
01:06:58.340 | I think there's six.
01:06:59.340 | No, economics is not a Nobel Prize.
01:07:00.340 | No, there's like eight Nobel Prizes.
01:07:01.340 | Oh, OK.
01:07:02.340 | But it's like--
01:07:03.340 | You can say it's a Nobel Prize.
01:07:04.340 | It's like, this is not a real field.
01:07:05.340 | I'm asking for equal prize.
01:07:06.340 | I think economics is a Nobel Prize.
01:07:07.340 | Where will we put it?
01:07:08.340 | Like, it's associated with the Nobel Prize.