DeepMind solves protein folding | AlphaFold 2
Chapters
0:00 What happened?
1:03 How big is this accomplishment?
4:39 Proteins and amino acids
5:17 Protein folding
8:26 How AlphaFold 1 works
9:45 How AlphaFold 2 works
12:09 Why is this breakthrough important?
13:19 Long-term future impact
00:00:00.000 |
I think it's fair to say that this year, 2020, 00:00:02.600 |
has thrown quite a few challenges at human civilization. 00:00:06.440 |
So it's really nice to get some positive news 00:00:22.040 |
DeepMind has announced that its second iteration 00:00:27.960 |
solved the 50-year-old grand challenge problem 00:00:33.200 |
Solved here means that these computational methods 00:00:42.120 |
experimental methods like X-ray crystallography. 00:00:47.600 |
of the CASP competition, AlphaFold achieved a score of 58 00:01:12.000 |
Some of it is definitely a little bit subjective, 00:01:14.720 |
but I think the case could be made on the life science side 00:01:18.880 |
in structural biology of the past one or two decades. 00:01:31.100 |
So of course, the competition is pretty steep, 00:01:33.040 |
and I talk with excitement about each of these entries. 00:01:41.160 |
a deep learning revolution in the space of computer vision. 00:01:44.360 |
So many people are now comparing this breakthrough 00:02:01.160 |
on how much real world direct impact a breakthrough has. 00:02:05.720 |
Of course, AlexNet was ultimately on a toy dataset 00:02:09.280 |
of a very simplistic image classification problem, 00:02:12.120 |
which does not have a direct application to the real world, 00:02:15.920 |
but it did demonstrate the ability of deep neural networks 00:02:19.320 |
to learn from a large amount of data in a supervised way. 00:02:22.720 |
But anyway, this is probably a very long conversation 00:02:30.600 |
obviously in contention for the biggest breakthrough. 00:02:32.560 |
The recent breakthroughs in the application of transformers 00:02:37.280 |
with GPT-3 being the most kind of recent iteration 00:02:46.160 |
used by real humans, which is Tesla Autopilot. 00:02:51.920 |
of massive machine learning in safety critical systems. 00:02:57.340 |
like the Google self-driving car, Waymo systems 00:03:35.280 |
especially when you maybe look 20 and 50 years down the line 00:03:39.940 |
when the entire world is populated by robot dogs 00:03:47.640 |
but really this is one of the big breakthroughs 00:03:49.760 |
in our field and something to truly be excited about. 00:03:53.220 |
And I'll talk about some of the possible future impact 00:03:59.080 |
Anyway, my prediction is that there will be at least one, 00:04:07.760 |
launched directly with these computational methods. 00:04:10.800 |
It's kind of exciting to think that it's possible also 00:04:13.520 |
that we'll see a first Nobel prize that is awarded 00:04:24.840 |
but it's exciting to think that a computational approach 00:04:27.120 |
or machine learning system will play a big role 00:04:39.600 |
Okay, let's talk a bit about proteins and protein folding, 00:04:49.600 |
in eukaryotes, which is what we're talking about here 00:04:57.240 |
and are the workhorses of living organisms, of cells. 00:05:01.360 |
And they do all kinds of stuff from structural to functional 00:05:04.080 |
they serve as catalysts for chemical reactions, 00:05:06.520 |
they move stuff around, they do all kinds of things. 00:05:17.160 |
So protein folding is the fascinating process 00:05:20.120 |
of going from the amino acid sequence to a 3D structure. 00:05:28.800 |
but let me quickly say some of the more fascinating 00:05:33.220 |
from a few biology classes I took in high school and college. 00:05:36.560 |
Okay, so first is there's a fascinating property 00:05:53.520 |
The other thing to say is that the 3D structure 00:05:59.740 |
is that the underlying cause of many diseases 00:06:04.920 |
Now back to the weirdness of the uniqueness of the folding, 00:06:14.280 |
There's I think 10 to the power of 80 atoms in the universe, 00:06:36.160 |
I wonder how many possible chess games there are. 00:06:40.400 |
I think I remember it being 10 to the power of 100, 00:06:52.620 |
I would venture to say that the protein folding problem 00:07:06.460 |
but I think that from a biological perspective, 00:07:22.980 |
From a computational, from a machine learning, 00:07:26.660 |
what we're looking at currently is 200 million proteins 00:07:29.780 |
that have been mapped and 170,000 protein 3D structures, 00:07:36.700 |
And that's our training data for the learning-based 00:07:42.100 |
Now, the way those 3D structures were determined 00:07:46.740 |
One of the most accurate being X-ray crystallography, 00:07:51.460 |
showing that it costs about $120,000 per protein. 00:07:55.220 |
It takes about one year to determine the 3D structure. 00:07:58.980 |
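To put the figures just quoted side by side, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the quoted $120,000 applies uniformly to every protein, which is a simplification, and the variable names are made up for illustration.

```python
# Figures quoted in the talk; applying one flat cost to every protein
# is a simplifying assumption made only for this estimate.
known_sequences = 200_000_000      # proteins that have been mapped
solved_structures = 170_000        # experimentally determined 3D structures
cost_per_structure_usd = 120_000   # quoted cost of X-ray crystallography

unsolved = known_sequences - solved_structures
naive_total_cost_usd = unsolved * cost_per_structure_usd
coverage = solved_structures / known_sequences

print(f"Sequences without a structure: {unsolved:,}")
print(f"Naive cost to solve them all: ${naive_total_cost_usd:,}")
print(f"Share of sequences with a structure: {coverage:.4%}")
```

Under these assumptions, well under one percent of known sequences have a structure, and closing the gap experimentally would cost on the order of tens of trillions of dollars.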
So because it costs a lot and it's very slow, 00:08:07.140 |
that the AlphaFold 2 system might be able to provide 00:08:13.780 |
be able to determine the 3D structure with a high accuracy, 00:08:28.800 |
that describes the details of the AlphaFold 2 system, 00:08:32.180 |
but I think it's clear that it's heavily based 00:08:37.380 |
So I think it's useful to look at how that system works. 00:08:42.060 |
about the kind of methodological improvements 00:08:52.500 |
The first includes machine learning, the second does not. 00:08:55.420 |
The first step includes a convolutional neural network 00:08:58.220 |
that takes as input the amino acid residue sequences 00:09:02.540 |
plus a ton of different features that their paper describes, 00:09:10.660 |
And the output of the network is this distance matrix 00:09:13.180 |
with the rows and columns being the amino acid residues. 00:09:16.420 |
They're giving a confidence distribution of the distance 00:09:21.980 |
in the final geometric 3D structure of the protein. 00:09:27.220 |
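As a minimal sketch of what such an output could look like, the snippet below builds a stand-in for the network's prediction: for every pair of residues, a probability distribution over distance bins. The bin range, the random values, and the names `expected_distances` and `bin_centers` are assumptions for illustration, not details taken from the AlphaFold paper.

```python
import numpy as np

def expected_distances(distogram, bin_centers):
    """Collapse an (L, L, B) array of per-pair bin probabilities
    into an (L, L) matrix of expected distances."""
    return (distogram * bin_centers).sum(axis=-1)

rng = np.random.default_rng(0)
num_residues, num_bins = 5, 8
bin_centers = np.linspace(2.0, 20.0, num_bins)  # hypothetical bins, in angstroms

# Random scores stand in for the network output; symmetrize them because
# the distance from residue i to j equals the distance from j to i.
logits = rng.normal(size=(num_residues, num_residues, num_bins))
logits = (logits + logits.transpose(1, 0, 2)) / 2

# Softmax over the bin axis turns scores into a distribution per residue pair.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

mean_dist = expected_distances(probs, bin_centers)   # (L, L) expected distance
most_likely = bin_centers[probs.argmax(axis=-1)]     # (L, L) most probable bin
```

A sharply peaked distribution for a pair means the prediction is confident about that distance, while a flat one means it is not.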
then you have a non-learning based gradient descent 00:09:30.340 |
optimization of folding this 3D structure to figure out 00:09:34.580 |
how you can as closely as possible match the distances 00:09:37.740 |
between the amino acid residues that are specified 00:09:54.080 |
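The idea of this second step can be illustrated with a toy version. The sketch below runs plain gradient descent on 3D coordinates so that their pairwise distances approach a target distance matrix. It makes several assumptions: it uses a single target distance per pair instead of a distribution, it optimizes Cartesian coordinates directly, and the coordinates, step size, and function names are made up for the example. It is not the actual AlphaFold procedure.

```python
import numpy as np

def pairwise_distances(coords, eps=1e-9):
    """Return the (N, N) distance matrix and the (N, N, 3) difference vectors."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + eps)  # eps avoids division by zero
    return dist, diff

def fit_coordinates(target, steps=500, lr=0.01, seed=0):
    """Gradient descent on sum over i != j of (d_ij - target_ij)^2."""
    rng = np.random.default_rng(seed)
    n = target.shape[0]
    coords = rng.normal(size=(n, 3))        # random starting structure
    off_diagonal = ~np.eye(n, dtype=bool)   # ignore each residue's distance to itself
    history = []
    for _ in range(steps):
        dist, diff = pairwise_distances(coords)
        err = (dist - target) * off_diagonal
        history.append(float((err ** 2).sum()))
        # Each unordered pair appears twice in the sum, hence the factor of 4.
        grad = 4.0 * ((err / dist)[:, :, None] * diff).sum(axis=1)
        coords -= lr * grad
    return coords, history

# Hypothetical "true" positions, used only to produce a target distance matrix.
true_coords = np.array([
    [0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [5.0, 3.6, 0.0],
    [3.0, 6.0, 2.0], [0.0, 7.0, 4.0], [-3.0, 5.0, 5.0],
])
target, _ = pairwise_distances(true_coords)

coords, history = fit_coordinates(target)
print(f"mismatch before: {history[0]:.2f}, after: {history[-1]:.4f}")
```

Distances pin down a structure only up to rotation, translation, and reflection, so the fitted coordinates need not equal `true_coords` even when the mismatch is small, and a descent like this can also settle in a local minimum.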
But one thing is clear that there's attentional mechanisms. 00:09:57.920 |
So I think convolutional neural networks are out 00:10:01.780 |
The same kind of process that's been happening 00:10:08.660 |
It's clear that attention mechanisms are going 00:10:11.540 |
to be taking over every aspect of machine learning. 00:10:29.020 |
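For readers who have not seen an attention mechanism written out, here is a minimal sketch of generic scaled dot-product self-attention over a short sequence. It is the textbook form used in transformers, not the AlphaFold 2 architecture, whose details are not described here. The sizes and random weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Every position attends to every other position in the sequence."""
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, dim = 6, 4                            # e.g. 6 residues, 4 features each
x = rng.normal(size=(seq_len, dim))            # placeholder per-residue features
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
```

Unlike a convolution, which only mixes information within a fixed local window, each row of `weights` can place mass on any position in the sequence, which matters when residues far apart in the chain end up close together in 3D.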
as opposed to part of the feature engineering, 00:10:44.220 |
where there's a constant passing of learned information 00:10:51.680 |
which is the evolution related sequence side of things, 00:10:55.020 |
and then the amino acid residue to residue distances 00:10:58.180 |
that are more akin to the AlphaFold 1 system. 00:11:01.700 |
How that iterative process works, it's unclear, 00:11:04.700 |
whether it's part of one giant neural network 00:11:06.860 |
or whether several neural networks are involved, I don't know. 00:11:10.100 |
But it does seem that the evolution related sequences 00:11:25.660 |
to sort of a distance matrix or adjacency matrix. 00:11:28.820 |
So I don't know if there's some magical tricks involved 00:11:32.120 |
in some interesting generalization of an adjacency matrix 00:11:36.180 |
that's involved in a spatial graph representation, 00:11:39.340 |
or if it's simply just using the term spatial graph 00:11:42.260 |
because there is more than just pairwise distances involved 00:11:46.600 |
in this version of the learning architecture. 00:11:48.900 |
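To make the relationship between a distance matrix and an adjacency matrix concrete, here is a minimal sketch that turns 3D positions into a graph by connecting residues closer than a cutoff. The 8 angstrom cutoff, the toy coordinates, and the function name `contact_graph` are assumptions for illustration. Whether AlphaFold 2's spatial graph works this way is not known from the material discussed here.

```python
import numpy as np

def contact_graph(coords, cutoff=8.0):
    """Return the pairwise distance matrix and a boolean adjacency matrix
    linking residues that lie closer together than `cutoff`."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    no_self_loops = ~np.eye(len(coords), dtype=bool)
    adjacency = (dist < cutoff) & no_self_loops
    return dist, adjacency

# Four hypothetical residue positions on a line, in angstroms.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.0, 0.0],
    [20.0, 0.0, 0.0],
])
dist, adjacency = contact_graph(coords)
print(adjacency.astype(int))
```

The adjacency matrix is a lossy summary of the distance matrix: it keeps who is near whom and drops how near. A richer graph could attach the distance, or other learned features, to each edge.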
I think the two lessons of the recent history 00:11:57.860 |
as much of the problem learnable as possible, 00:12:00.460 |
you're often going to see quite significant benefits. 00:12:06.780 |
especially the semantic segmentation side of things. 00:12:25.080 |
figuring out the structure for maybe millions of proteins 00:12:35.780 |
it might allow us to understand the cause of many diseases 00:12:41.620 |
Other applications will stem from the ability 00:12:44.480 |
to quickly design new proteins that in some way 00:12:55.020 |
Again, those are the causes of many diseases. 00:12:59.520 |
agriculture applications of being able to engineer 00:13:02.540 |
insecticidal proteins or frost protective coating, 00:13:08.740 |
Tissue regeneration through self-assembling proteins, 00:13:11.980 |
supplements for improved health and anti-aging, 00:13:19.880 |
Now, the long-term, or the super long-term, future impact 00:13:24.160 |
of this breakthrough might be just the advancement 00:13:27.140 |
of end-to-end learning of really complicated problems 00:13:35.320 |
So being able to predict multi-protein interaction 00:13:41.920 |
which even in my limited knowledge of biology, 00:13:48.280 |
And just being able to incorporate the environment 00:13:51.840 |
into the modeling of the folding of the protein 00:13:55.560 |
and also seeing how the function of that protein 00:14:00.840 |
incorporating that into the end-to-end learning problem. 00:14:04.400 |
Then taking a step even further, this is physics, 00:14:12.120 |
physics-based simulation of biological systems. 00:14:20.760 |
so then taking a step out further and further 00:14:22.880 |
and increasing the complexity of the biological systems, 00:14:27.960 |
like being able to do accurate physics-based simulation 00:14:45.280 |
In fact, how do we know this is not a physics-based simulation 00:14:50.280 |
of a biological system whose assigned name happens to be Lex? 00:15:06.160 |
which are fascinating ideas of being able to play 00:15:13.360 |
that are biologically-based or non-biologically-based. 00:15:24.960 |
and go into the life sciences of real-world systems 00:15:29.920 |
That's where Tesla Autopilot is really exciting. 00:15:32.400 |
That's where any robots that use machine learning 00:15:38.880 |
in the space of structural biology is super exciting. 00:15:52.560 |
and I hope it's at least somewhat useful to you. 00:16:08.200 |
And Four Sigmatic, the maker of delicious mushroom coffee. 00:16:16.440 |
Please check out these sponsors in the description 00:16:18.680 |
to get a discount and to support this channel. 00:16:23.160 |
And remember, try to learn something new every day.