
DeepMind solves protein folding | AlphaFold 2


Chapters

0:00 What happened?
1:03 How big is this accomplishment?
4:39 Proteins and amino acids
5:17 Protein folding
8:26 How AlphaFold 1 works
9:45 How AlphaFold 2 works
12:09 Why is this breakthrough important?
13:19 Long-term future impact

Whisper Transcript

00:00:00.000 | I think it's fair to say that this year, 2020,
00:00:02.600 | has thrown quite a few challenges at human civilization.
00:00:06.440 | So it's really nice to get some positive news
00:00:09.680 | about truly marvelous accomplishments
00:00:12.480 | of engineering and science.
00:00:14.320 | One was SpaceX, I would argue,
00:00:16.880 | launching a new era of space exploration.
00:00:20.480 | And now, a couple of days ago,
00:00:22.040 | DeepMind has announced that its second iteration
00:00:24.640 | of the AlphaFold system has, quote unquote,
00:00:27.960 | solved the 50-year-old grand challenge problem
00:00:31.480 | of protein folding.
00:00:33.200 | Solved here means that these computational methods
00:00:36.480 | were able to achieve prediction performance
00:00:39.480 | similar to much slower, much more expensive
00:00:42.120 | experimental methods like X-ray crystallography.
00:00:45.440 | In 2018, which is the previous iteration
00:00:47.600 | of the CASP competition, AlphaFold achieved a score of 58
00:00:52.320 | on the hardest class of proteins.
00:00:54.280 | And this year, it achieved a score of 87,
00:00:57.480 | which is a huge improvement,
00:00:59.320 | and it's still 26 points better
00:01:01.820 | than the closest competition.
00:01:03.520 | So this is definitely a big leap,
00:01:05.280 | but it's also fair to say that the internet
00:01:07.440 | is full of hype about this breakthrough.
00:01:10.120 | And so let me indulge in the fun a bit.
00:01:12.000 | Some of it is definitely a little bit subjective,
00:01:14.720 | but I think the case could be made on the life science side
00:01:17.200 | that this is the biggest advancement
00:01:18.880 | in structural biology of the past one or two decades.
00:01:22.960 | And in my field of artificial intelligence,
00:01:25.200 | I think a strong case could be made
00:01:27.000 | that this is one of the biggest advancements
00:01:29.240 | in recent history of the field.
00:01:31.100 | So of course, the competition is pretty steep,
00:01:33.040 | and I'll talk with excitement about each of these entries.
00:01:37.640 | Of course, the ImageNet moment itself,
00:01:39.360 | or the AlexNet moment that launched
00:01:41.160 | a deep learning revolution in the space of computer vision.
00:01:44.360 | So many people are now comparing this breakthrough
00:01:48.560 | of AlphaFold2 to the ImageNet moment,
00:01:51.660 | but now in the life sciences field.
00:01:53.960 | I think the good old argument over beers
00:01:56.320 | about which is the biggest breakthrough
00:01:58.640 | comes down to the importance you place
00:02:01.160 | on how much real world direct impact a breakthrough has.
00:02:05.720 | Of course, AlexNet was ultimately trained on a toy dataset
00:02:09.280 | of a very simplistic image classification problem,
00:02:12.120 | which does not have a direct application to the real world,
00:02:15.920 | but it did demonstrate the ability of deep neural networks
00:02:19.320 | to learn from a large amount of data in a supervised way.
00:02:22.720 | But anyway, this is probably a very long conversation
00:02:25.320 | over many beers, with AlphaZero
00:02:28.520 | and reinforcement learning self-play
00:02:30.600 | obviously in contention for the biggest breakthrough.
00:02:32.560 | The recent breakthroughs in the application of transformers
00:02:35.080 | in the natural language processing space,
00:02:37.280 | with GPT-3 being the most kind of recent iteration
00:02:41.600 | of state-of-the-art performance.
00:02:43.640 | The actual deployment of robots in the field
00:02:46.160 | used by real humans, which is Tesla Autopilot.
00:02:48.680 | Deployment of massive fleet learning,
00:02:51.920 | of massive machine learning in safety-critical systems.
00:02:55.520 | And then other kinds of robots,
00:02:57.340 | like the Google self-driving car, Waymo systems
00:03:01.220 | that are taking even a further leap
00:03:03.160 | of removing the human from the picture,
00:03:05.200 | being able to drive the car autonomously
00:03:06.960 | without human supervision.
00:03:09.160 | Smart speakers in the home.
00:03:10.460 | There's a lot of actual in-the-wild
00:03:12.760 | natural language processing that I think
00:03:14.800 | doesn't get enough credit
00:03:16.400 | from the artificial intelligence community
00:03:17.960 | for how much amazing stuff is out there.
00:03:20.440 | And depending how much value you put
00:03:22.040 | in engineering achievements,
00:03:24.840 | especially in the hardware space,
00:03:26.200 | Boston Dynamics with the Spot Mini robot
00:03:29.440 | is just, one could argue,
00:03:31.880 | one of the great accomplishments
00:03:33.440 | in the artificial intelligence field,
00:03:35.280 | especially when you maybe look 20 or 50 years down the line
00:03:39.940 | when the entire world is populated by robot dogs
00:03:43.240 | and the humans have gone extinct.
00:03:45.020 | Anyway, I say all that for fun,
00:03:47.640 | but really this is one of the big breakthroughs
00:03:49.760 | in our field and something to truly be excited about.
00:03:53.220 | And I'll talk about some of the possible future impact
00:03:55.400 | I see from this breakthrough
00:03:57.120 | in just a couple of slides here.
00:03:59.080 | Anyway, my prediction is that there will be at least one,
00:04:02.080 | potentially several Nobel prizes
00:04:04.040 | that will result from derivative work
00:04:07.760 | launched directly from these computational methods.
00:04:10.800 | It's kind of exciting to think that it's possible also
00:04:13.520 | that we'll see the first Nobel prize awarded
00:04:18.160 | where much of the work is done
00:04:19.800 | by a machine learning system.
00:04:21.560 | Of course, the Nobel prize is awarded
00:04:23.000 | to the humans behind the system,
00:04:24.840 | but it's exciting to think that a computational approach
00:04:27.120 | or machine learning system will play a big role
00:04:30.040 | in a Nobel prize level discovery
00:04:33.960 | in a field like medicine and physiology
00:04:36.120 | or chemistry or even physics.
00:04:39.600 | Okay, let's talk a bit about proteins and protein folding,
00:04:42.380 | why this whole space is really fascinating.
00:04:44.740 | First of all, there are amino acids,
00:04:47.160 | which are the basic building blocks of life.
00:04:49.600 | In eukaryotes, which is what we're talking about here
00:04:51.960 | with humans, there are 21 of them.
00:04:54.720 | Proteins are chains of amino acids
00:04:57.240 | and are the workhorses of living organisms and cells.
00:05:01.360 | And they do all kinds of stuff, from structural to functional:
00:05:04.080 | they serve as catalysts for chemical reactions,
00:05:06.520 | they move stuff around, they do all kinds of things.
00:05:09.040 | So they're both the building blocks of life
00:05:11.320 | and the doers and movers of life.
00:05:14.480 | Hopefully I'm not being too poetic.
00:05:17.160 | So protein folding is the fascinating process
00:05:20.120 | of going from the amino acid sequence to a 3D structure.
00:05:24.960 | There's a lot that could be said here,
00:05:26.740 | there's a lot of lectures on this topic,
00:05:28.800 | but let me quickly say some of the more fascinating
00:05:30.840 | and important things that I remember
00:05:33.220 | from a few biology classes I took in high school and college.
00:05:36.560 | Okay, so first is there's a fascinating property
00:05:40.580 | of uniqueness that a particular sequence
00:05:44.040 | usually maps one-to-one to a 3D structure,
00:05:46.800 | not always, but usually.
00:05:48.880 | To me from an outsider's perspective,
00:05:50.560 | that's just weird and fascinating.
00:05:53.520 | The other thing to say is that the 3D structure
00:05:55.840 | determines the function of the protein.
00:05:58.000 | So one of the corollaries of that
00:05:59.740 | is that the underlying cause of many diseases
00:06:02.640 | is the misfolding of proteins.
00:06:04.920 | Now back to the weirdness of the uniqueness of the folding,
00:06:07.860 | there's a lot of ways for a protein to fold
00:06:11.960 | based on the sequence of amino acids,
00:06:14.280 | something like 10 to the power of 143. For comparison,
00:06:18.760 | there's I think only 10 to the power of 80 atoms in the universe.
00:06:22.840 | And you can look at Levinthal's paradox,
00:06:25.120 | which is one of the early formulations
00:06:26.600 | of just how hard this problem is
00:06:29.640 | and why it's really weird that a protein
00:06:31.240 | is able to do it so quickly.
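To make that scale concrete, here's a quick back-of-the-envelope calculation in Python. The numbers are illustrative assumptions in the spirit of Levinthal's estimate (roughly 3 backbone conformations per residue for a 300-residue chain), not figures from the video:

```python
# Back-of-the-envelope illustration of Levinthal's paradox.
# Assumptions (illustrative): ~3 backbone conformations per residue,
# and a 300-residue protein chain.
conformations_per_residue = 3
num_residues = 300

total = conformations_per_residue ** num_residues
print(f"~10^{len(str(total)) - 1} possible conformations")  # ~10^143

# Even sampling one conformation per picosecond (1e12 per second),
# exhaustive search would take on the order of 10^123 years --
# yet real proteins fold in milliseconds to seconds.
samples_per_second = 1e12
seconds_per_year = 3.15e7
years = total / (samples_per_second * seconds_per_year)
print(f"exhaustive search: ~{years:.1e} years")
```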
00:06:33.320 | As a completely irrelevant side note,
00:06:36.160 | I wonder how many possible chess games there are.
00:06:40.400 | I think I remember it being 10 to the power of 100,
00:06:44.040 | something like that.
00:06:46.480 | I think that would also necessitate
00:06:48.040 | removing certain kinds of infinite games.
00:06:50.800 | Anyway, off the top of my head,
00:06:52.620 | I would venture to say that the protein folding problem
00:06:55.360 | just in the number of possible combinations
00:06:58.520 | is much, much harder than the game of chess,
00:07:02.380 | but it's also much weirder.
00:07:04.200 | You know, they say that life imitates chess,
00:07:06.460 | but I think that from a biological perspective,
00:07:10.180 | life is way weirder than chess.
00:07:12.500 | Anyway, to say once again what I said before
00:07:14.560 | is that the misfolding of proteins
00:07:16.740 | is the underlying cause of many diseases.
00:07:19.700 | And again, I'll talk about the implications
00:07:21.300 | of that a little bit later.
00:07:22.980 | From a computational, from a machine learning,
00:07:24.940 | from a dataset perspective,
00:07:26.660 | what we're looking at currently is 200 million proteins
00:07:29.780 | that have been mapped and 170,000 protein 3D structures,
00:07:35.580 | so much, much fewer.
00:07:36.700 | And that's our training data for the learning-based
00:07:39.540 | approaches for the protein folding problem.
00:07:42.100 | Now, the way those 3D structures were determined
00:07:44.780 | is through experimental methods.
00:07:46.740 | One of the most accurate is X-ray crystallography;
00:07:49.500 | I saw a University of Toronto study
00:07:51.460 | showing that it costs about $120,000 per protein.
00:07:55.220 | It takes about one year to determine the 3D structure.
00:07:58.980 | So because it costs a lot and it's very slow,
00:08:01.340 | that's why you only have
00:08:02.180 | 170,000 3D structures determined.
00:08:05.060 | Now, that's one of the big things
00:08:07.140 | that the AlphaFold2 system might be able to provide:
00:08:11.220 | at least for a large class of proteins,
00:08:13.780 | being able to determine the 3D structure with high accuracy,
00:08:16.820 | enough to be able to sort of open up
00:08:19.180 | the structural biology field entirely
00:08:21.880 | with sort of several orders of magnitude
00:08:23.540 | more protein 3D structures to play with.
00:08:25.960 | There's not currently a paper out
00:08:28.800 | that describes the details of the AlphaFold2 system,
00:08:32.180 | but I think it's clear that it's heavily based
00:08:34.180 | on the AlphaFold1 system from two years ago.
00:08:37.380 | So I think it's useful to look at how that system works.
00:08:40.060 | And then we can hypothesize, speculate
00:08:42.060 | about the kind of methodological improvements
00:08:45.580 | in the AlphaFold2 system.
00:08:47.540 | Okay, so for AlphaFold1 system,
00:08:50.580 | there's two steps in the process.
00:08:52.500 | The first includes machine learning, the second does not.
00:08:55.420 | The first step includes a convolutional neural network
00:08:58.220 | that takes as input the amino acid residue sequences
00:09:02.540 | plus a ton of different features that their paper describes,
00:09:05.700 | including the multiple sequence alignment
00:09:08.140 | of evolutionarily related sequences.
00:09:10.660 | And the output of the network is this distance matrix
00:09:13.180 | with the rows and columns being the amino acid residues.
00:09:16.420 | Each entry gives a confidence distribution over the distance
00:09:20.680 | between the two amino acids
00:09:21.980 | in the final geometric 3D structure of the protein.
00:09:25.540 | Then once you have the distance matrix,
00:09:27.220 | then you have a non-learning-based gradient descent
00:09:30.340 | optimization of folding this 3D structure to figure out
00:09:34.580 | how you can as closely as possible match the distances
00:09:37.740 | between the amino acid residues that are specified
00:09:41.140 | by the distance matrix.
00:09:43.620 | Okay, that's it at a high level.
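To give a feel for that second, non-learned stage, here's a minimal sketch in PyTorch: gradient descent on 3D coordinates until their pairwise distances match a target distance matrix. This is my own toy illustration, not DeepMind's code; the real system optimized a smooth potential built from the full predicted distance distributions, plus torsion-angle and steric terms, rather than the point estimates used here:

```python
# Toy sketch (my own illustration, not DeepMind's code) of AlphaFold 1's
# second stage: descend on 3D coordinates so that pairwise residue
# distances match a predicted distance matrix.
import torch

def fold_to_distances(target_dist, steps=2000, lr=0.05):
    n = target_dist.shape[0]
    coords = torch.randn(n, 3, requires_grad=True)  # random initial structure
    opt = torch.optim.Adam([coords], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        diff = coords.unsqueeze(1) - coords.unsqueeze(0)  # (n, n, 3)
        dist = (diff.pow(2).sum(-1) + 1e-8).sqrt()        # pairwise distances
        loss = ((dist - target_dist) ** 2).mean()         # match the prediction
        loss.backward()
        opt.step()
    return coords.detach()

# Usage: a distance matrix from a random "true" structure as the target.
true_coords = torch.randn(16, 3)
target = torch.cdist(true_coords, true_coords)
folded = fold_to_distances(target)
```

Note that pairwise distances only pin the structure down up to rigid rotation, translation, and mirror reflection, which is part of why the real potential included more than distances.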
00:09:45.820 | Now, how does AlphaFold2 work?
00:09:48.340 | First of all, we don't know for sure.
00:09:50.180 | There's only a blog post
00:09:51.420 | and some little speculation here and there.
00:09:54.080 | But one thing is clear: there are attention mechanisms.
00:09:57.920 | So I think convolutional neural networks are out
00:10:00.180 | and transformers are in.
00:10:01.780 | The same kind of process that's been happening
00:10:04.140 | in the natural language processing space
00:10:05.940 | and really most of the deep learning space.
00:10:08.660 | It's clear that attention mechanisms are going
00:10:11.540 | to be taking over every aspect of machine learning.
00:10:16.060 | So I think the big change is ConvNets are out,
00:10:19.060 | transformers are in.
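For reference, here's roughly what the generic mechanism looks like over a sequence of per-residue feature vectors. This is a minimal sketch of scaled dot-product self-attention, not AlphaFold 2's actual (unpublished) architecture; the relevant property is that every residue attends to every other residue in a single layer, so long-range contacts don't have to propagate through the local receptive fields of a convolution:

```python
# Minimal scaled dot-product self-attention over residue features --
# a sketch of the generic mechanism, not AlphaFold 2's (unpublished)
# architecture.
import torch

def self_attention(x, wq, wk, wv):
    # x: (num_residues, d) -- one feature vector per residue
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / k.shape[-1] ** 0.5  # (n, n) pairwise affinities
    weights = scores.softmax(dim=-1)       # how much each residue attends
    return weights @ v                     # mix information across residues

n, d = 64, 32
x = torch.randn(n, d)
wq, wk, wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
out = self_attention(x, wq, wk, wv)        # (64, 32)
```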
00:10:20.780 | The rest is more in the speculation space.
00:10:23.900 | It does seem that the MSA,
00:10:25.140 | the multiple sequence alignment,
00:10:26.820 | is part of the learning process now,
00:10:29.020 | as opposed to part of the feature engineering,
00:10:31.100 | which it was in the original system.
00:10:33.060 | I believe it was only a source of features.
00:10:35.020 | Please correct me if I'm wrong on that.
00:10:36.780 | But it does seem like here it's now part
00:10:39.460 | of the learning process.
00:10:40.700 | And there's something iterative about it,
00:10:42.700 | at least in the blog post,
00:10:44.220 | where there's a constant passing of learned information
00:10:48.260 | between the sequence residue representation,
00:10:51.680 | which is the evolutionarily related sequence side of things,
00:10:55.020 | and then the amino acid residue-to-residue distances
00:10:58.180 | that are more akin to the AlphaFold 1 system.
00:11:01.700 | How that iterative process works, it's unclear,
00:11:04.700 | whether it's part of one giant neural network
00:11:06.860 | or whether several neural networks are involved, I don't know.
00:11:10.100 | But it does seem that the evolutionarily related sequences
00:11:12.740 | are now part of the learning process.
00:11:14.720 | It does seem that there's some kind
00:11:15.980 | of iterative passing of information,
00:11:17.660 | and of course, attention is involved
00:11:19.660 | in the entire picture.
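Speculating even further, the iterative exchange hinted at in the blog post might have roughly this shape. Everything below is made up to illustrate the idea of passing information back and forth between an MSA representation and a pairwise representation; the update rules are stand-ins, not the real (unpublished) modules:

```python
# Purely speculative sketch of the iterative exchange hinted at in the
# blog post: information flows back and forth between an MSA
# (evolutionarily related sequences) representation and a pairwise
# residue representation. The update rules here are made-up stand-ins.
import torch

def update_msa(msa, pair):
    # Attention over residues, biased by the current pairwise beliefs.
    bias = pair.mean(-1)                                   # (n, n)
    attn = (msa @ msa.transpose(-1, -2) + bias).softmax(-1)
    return attn @ msa                                      # (s, n, d)

def update_pair(pair, msa):
    # Outer-product-style update from the pooled sequence view.
    seq = msa.mean(0)                                      # (n, d)
    return pair + seq.unsqueeze(0) * seq.unsqueeze(1)      # (n, n, d)

def trunk_sketch(msa, pair, iterations=8):
    for _ in range(iterations):
        msa = update_msa(msa, pair)    # sequences informed by pair beliefs
        pair = update_pair(pair, msa)  # pair beliefs refined from sequences
    return msa, pair                   # a structure module would follow

s, n, d = 8, 64, 32                    # MSA depth, residues, channels
msa, pair = trunk_sketch(torch.randn(s, n, d), torch.randn(n, n, d))
```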
00:11:21.140 | Now, at least in the blog post,
00:11:23.140 | the term spatial graph is used as opposed
00:11:25.660 | to sort of a distance matrix or adjacency matrix.
00:11:28.820 | So I don't know if there's some magical tricks involved
00:11:32.120 | in some interesting generalization of an adjacency matrix
00:11:36.180 | that's involved in a spatial graph representation,
00:11:39.340 | or if it's simply just using the term spatial graph
00:11:42.260 | because there is more than just pairwise distances involved
00:11:46.600 | in this version of the learning architecture.
00:11:48.900 | I think there are two lessons from the recent history
00:11:51.260 | of deep learning. First, if you involve attention,
00:11:53.020 | if you involve transformers,
00:11:54.180 | you're gonna get a big boost.
00:11:55.980 | And the other lesson is that if you make
00:11:57.860 | as much of the problem learnable as possible,
00:12:00.460 | you're often going to see quite significant benefits.
00:12:03.940 | This is something I've definitely seen
00:12:05.460 | in computer vision,
00:12:06.780 | especially on the semantic segmentation side of things.
00:12:10.100 | Okay, why is this breakthrough important?
00:12:12.700 | Allow this computer scientist, AI person,
00:12:16.040 | to wax poetic about some biology for a bit.
00:12:19.140 | So because the protein structure
00:12:22.020 | gives us the protein function,
00:12:25.080 | figuring out the structure for maybe millions of proteins
00:12:30.080 | might allow us to learn unknown functions
00:12:32.380 | of genes encoded in DNA.
00:12:34.380 | Also, as I mentioned before,
00:12:35.780 | it might allow us to understand the cause of many diseases
00:12:38.620 | that are the result of misfolded proteins.
00:12:41.620 | Other applications will stem from the ability
00:12:44.480 | to quickly design new proteins that in some way
00:12:47.540 | alter the function of other proteins.
00:12:49.380 | So for treatments, for drugs,
00:12:51.180 | that means designing proteins
00:12:52.500 | that fix other misfolded proteins.
00:12:55.020 | Again, those are the causes of many diseases.
00:12:57.980 | I read a paper that was talking about
00:12:59.520 | agriculture applications of being able to engineer
00:13:02.540 | insecticidal proteins or frost protective coating,
00:13:05.580 | stuff I know nothing about.
00:13:07.040 | I read it, it's out there.
00:13:08.740 | Tissue regeneration through self-assembling proteins,
00:13:11.980 | supplements for improved health and anti-aging,
00:13:15.120 | and all kinds of biomaterials,
00:13:17.140 | for textiles and just materials in general.
00:13:19.880 | Now in the long-term or the super long-term future impact
00:13:24.160 | of this breakthrough might be just the advancement
00:13:27.140 | of end-to-end learning of really complicated problems
00:13:30.160 | in the life sciences.
00:13:31.680 | So protein folding is looking at the folding
00:13:33.920 | of a single protein.
00:13:35.320 | So being able to predict multi-protein interactions
00:13:39.640 | or protein complex formation,
00:13:41.920 | which even in my limited knowledge of biology,
00:13:44.240 | I think is a much, much, much harder problem
00:13:46.840 | as far as I understand.
00:13:48.280 | And just being able to incorporate the environment
00:13:51.840 | into the modeling of the folding of the protein
00:13:55.560 | and also seeing how the function of that protein
00:13:58.240 | might change given the environment.
00:13:59.800 | All those kinds of things,
00:14:00.840 | incorporating that into the end-to-end learning problem.
00:14:04.400 | Then taking a step even further: this is physics,
00:14:08.440 | biophysics, so being able to accurately do
00:14:12.120 | physics-based simulation of biological systems.
00:14:15.320 | So if we think of a protein
00:14:18.120 | as one of the most basic biological systems,
00:14:20.760 | then taking a step out further and further
00:14:22.880 | and increasing the complexity of the biological systems,
00:14:25.440 | you can start to think of something crazy
00:14:27.960 | like being able to do accurate physics-based simulation
00:14:31.160 | of cells, for example, or entire organs.
00:14:34.760 | Or maybe one day being able to do
00:14:36.680 | an accurate physics-based simulation
00:14:39.160 | of the very over-caffeinated organ
00:14:42.160 | that's producing this very video.
00:14:45.280 | In fact, how do we know this is not a physics-based simulation
00:14:50.280 | of a biological system whose assigned name happens to be Lex?
00:14:56.000 | I guess we'll never know.
00:14:58.600 | And of course, we can go farther out
00:15:00.200 | into super long-term sci-fi kind of ideas
00:15:03.280 | of biological life and artificial life,
00:15:06.160 | which are fascinating ideas of being able to play
00:15:08.840 | with simulation and prediction of organisms
00:15:13.360 | that are biologically-based or non-biologically-based.
00:15:16.160 | I mean, that's the exciting future
00:15:18.080 | of end-to-end learning systems
00:15:20.080 | that step outside the game-playing world
00:15:22.840 | of StarCraft, of Chess and Go,
00:15:24.960 | and go into the life sciences of real-world systems
00:15:28.520 | that operate in the real world.
00:15:29.920 | That's where Tesla Autopilot is really exciting.
00:15:32.400 | That's where any robots that use machine learning
00:15:35.560 | are really exciting.
00:15:36.480 | And that's where this big breakthrough
00:15:38.880 | in the space of structural biology is super exciting.
00:15:42.840 | And truly, to me, as one humble human,
00:15:45.920 | inspiring beyond words.
00:15:48.240 | Speaking of words, for me,
00:15:49.720 | these quick videos are fun and easy to make,
00:15:52.560 | and I hope it's at least somewhat useful to you.
00:15:56.160 | If it is, I'll make more.
00:15:58.280 | It's fun. I enjoy it.
00:15:59.600 | I love it, really.
00:16:01.000 | Quick shout-out to podcast sponsors.
00:16:03.120 | Vincero Watches, the maker
00:16:04.880 | of classy, well-performing watches.
00:16:06.640 | I'm wearing one now.
00:16:08.200 | And Four Sigmatic, the maker of delicious mushroom coffee.
00:16:11.840 | I drink it every morning and all day,
00:16:14.080 | as you can probably tell from my voice now.
00:16:16.440 | Please check out these sponsors in the description
00:16:18.680 | to get a discount and to support this channel.
00:16:21.360 | All right, love you all.
00:16:23.160 | And remember, try to learn something new every day.
00:16:25.920 | (upbeat music)