
DeepMind solves protein folding | AlphaFold 2


Chapters

0:00 What happened?
1:03 How big is this accomplishment?
4:39 Proteins and amino acids
5:17 Protein folding
8:26 How AlphaFold 1 works
9:45 How AlphaFold 2 works
12:09 Why is this breakthrough important?
13:19 Long-term future impact

Transcript

I think it's fair to say that this year, 2020, has thrown quite a few challenges at human civilization. So it's really nice to get some positive news about truly marvelous accomplishments of engineering and science. One was SpaceX, I would argue, launching a new era of space exploration. And now, a couple of days ago, DeepMind has announced that its second iteration of the AlphaFold system has, quote unquote, solved the 50-year-old grand challenge problem of protein folding.

Solved here means that these computational methods were able to achieve prediction performance similar to much slower, much more expensive experimental methods like X-ray crystallography. In 2018, at the previous iteration of the CASP competition, AlphaFold achieved a GDT score of 58 on the hardest class of proteins. And this year, it achieved a score of 87, which is a huge improvement, and is still 26 points better than the closest competition.

So this is definitely a big leap, but it's also fair to say that the internet is full of hype about this breakthrough. And so let me indulge in the fun a bit. Some of it is definitely a little bit subjective, but I think the case could be made on the life science side that this is the biggest advancement in structural biology of the past one or two decades.

And in my field of artificial intelligence, I think a strong case could be made that this is one of the biggest advancements in the recent history of the field. Of course, the competition is pretty steep, and I could talk with excitement about each of these entries. There's, of course, the ImageNet moment itself, or the AlexNet moment, that launched the deep learning revolution in the space of computer vision.

So many people are comparing now this breakthrough of AlphaFold2 to the ImageNet moment, but now in the life sciences field. I think the good old argument over beers about which is the biggest breakthrough comes down to the importance you place on how much real world direct impact a breakthrough has.

Of course, AlexNet was ultimately trained on a toy dataset for a very simplistic image classification problem, which does not have a direct application to the real world, but it did demonstrate the ability of deep neural networks to learn from a large amount of data in a supervised way. But anyway, this is probably a very long conversation over many beers, with AlphaZero and reinforcement learning self-play obviously in contention for the biggest breakthrough.

The recent breakthroughs in the application of transformers in the natural language processing space, with GPT-3 being the most recent iteration of state-of-the-art performance. The actual deployment of robots in the field, used by real humans, which is Tesla Autopilot: the deployment of massive fleet learning, of massive machine learning, in safety-critical systems.

And then other kinds of robots, like the Google self-driving car, the Waymo systems, that take an even further leap of removing the human from the picture, being able to drive the car autonomously without human supervision. Smart speakers in the home. There's a lot of actual in-the-wild natural language processing that I think doesn't get enough credit from the artificial intelligence community for how much amazing stuff is out there.

And depending on how much value you put in engineering achievements, especially in the hardware space, Boston Dynamics with the Spot Mini robot is, one could argue, one of the great accomplishments in the artificial intelligence field, especially when you look maybe 20 or 50 years down the line, when the entire world is populated by robot dogs and the humans have gone extinct.

Anyway, I say all that for fun, but really this is one of the big breakthroughs in our field and something to truly be excited about. And I'll talk about some of the possible future impact I see from this breakthrough in just a couple of slides. Anyway, my prediction is that there will be at least one, potentially several, Nobel Prizes that will result from derivative work launched directly by these computational methods.

It's kind of exciting to think that it's possible we'll also see the first Nobel Prize awarded where much of the work is done by a machine learning system. Of course, the Nobel Prize is awarded to the humans behind the system, but it's exciting to think that a computational approach or a machine learning system will play a big role in a Nobel Prize-level discovery in a field like physiology or medicine, or chemistry, or even physics.

Okay, let's talk a bit about proteins and protein folding, and why this whole space is really fascinating. First of all, there are amino acids, which are the basic building blocks of life; in eukaryotes, which is what we're talking about here with humans, there are 21 of them. Proteins are chains of amino acids, and they are the workhorses of living organisms, of cells.

And they do all kinds of stuff, from structural to functional: they serve as catalysts for chemical reactions, they move stuff around, they do all kinds of things. So they're both the building blocks of life and the doers and movers of life. Hopefully I'm not being too poetic. So protein folding is the fascinating process of going from the amino acid sequence to a 3D structure.

There's a lot that could be said here, and there are a lot of lectures on this topic, but let me quickly say some of the more fascinating and important things that I remember from a few biology classes I took in high school and college. Okay, so first, there's a fascinating property of uniqueness: a particular sequence usually maps one-to-one to a 3D structure. Not always, but usually.

To me, from an outsider's perspective, that's just weird and fascinating. The other thing to say is that the 3D structure determines the function of the protein. So one of the corollaries of that is that the underlying cause of many diseases is the misfolding of proteins. Now, back to the weirdness of the uniqueness of the folding: there are an astronomical number of ways for a protein to fold based on the sequence of amino acids, something like 10 to the power of 143 possible conformations by one common estimate.

There are, I think, 10 to the power of 80 atoms in the universe, so 10 to the power of 143 is a lot. And you can look at Levinthal's paradox, which is one of the early formulations of just how hard this problem is and why it's really weird that a protein is able to fold so quickly.
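To make that number concrete, here's a back-of-the-envelope version of the Levinthal-style estimate. The "roughly three conformations per residue, about 300 residues" assumption is the common textbook simplification, not an exact figure.

```python
# Back-of-the-envelope version of the Levinthal-style estimate behind that 10^143 figure:
# assume roughly 3 backbone conformations per residue for a ~300-residue chain
# (a common simplification, not an exact number).
conformations = 3 ** 300
atoms_in_universe = 10 ** 80                       # rough, commonly quoted figure

print(f"{conformations:.1e}")                      # ~1.4e+143 possible conformations
print(f"{conformations / atoms_in_universe:.1e}")  # still ~10^63 times the number of atoms
```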

As a completely irrelevant side note, I wonder how many possible chess games there are. I think I remember it being 10 to the power of 100, something like that. I think that would also necessitate removing certain kinds of infinite games. Anyway, off the top of my head, I would venture to say that the protein folding problem just in the number of possible combinations is much, much harder than the game of chess, but it's also much weirder.

You know, they say that life imitates chess, but I think that from a biological perspective, life is way weirder than chess. Anyway, to say once again what I said before is that the misfolding of proteins is the underlying cause of many diseases. And again, I'll talk about the implications of that a little bit later.

From a computational, from a machine learning, from a dataset perspective, what we're looking at currently is 200 million proteins that have been mapped and 170,000 protein 3D structures, so much, much fewer. And that's our training data for the learning-based approaches for the protein folding problem. Now, the way those 3D structures were determined is through experimental methods.

One of the most accurate is X-ray crystallography; I saw a University of Toronto study showing that it costs about $120,000 per protein and takes about one year to determine the 3D structure. So because it costs a lot and it's very slow, that's why you only have 170,000 3D structures determined.

Now, that's one of the big things that the AlphaFold 2 system might be able to provide: at least for a large class of proteins, determining the 3D structure with high enough accuracy to open up the structural biology field entirely, with several orders of magnitude more protein 3D structures to play with.

There's currently no paper out that describes the details of the AlphaFold 2 system, but I think it's clear that it's heavily based on the AlphaFold 1 system from two years ago. So I think it's useful to look at how that system works, and then we can hypothesize and speculate about the kind of methodological improvements in the AlphaFold 2 system.

Okay, so for the AlphaFold 1 system, there are two steps in the process. The first involves machine learning, the second does not. The first step uses a convolutional neural network that takes as input the amino acid residue sequences plus a ton of different features that their paper describes, including the multiple sequence alignment of evolutionarily related sequences.

And the output of the network is a distance matrix, with the rows and columns being the amino acid residues. Each entry gives a confidence distribution over the distance between the two amino acid residues in the final 3D structure of the protein. Then once you have the distance matrix, there's a non-learning-based gradient descent optimization that folds the 3D structure, trying to match as closely as possible the distances between the amino acid residues specified by the distance matrix.
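To make the two-step pipeline concrete, here's a minimal sketch with made-up dimensions and random inputs: a small convolutional network predicts a distogram over residue pairs, and then plain gradient descent fits 3D coordinates to those predicted distances. This is an illustration of the idea, not DeepMind's actual architecture or feature set.

```python
# Minimal sketch of the two-step AlphaFold 1 idea (toy dimensions, random inputs).
import torch
import torch.nn as nn

L, FEATURES, BINS = 64, 40, 32   # sequence length, pairwise input features, distance bins (made up)

# Step 1: a 2D convolutional network predicts a distogram over residue pairs.
distogram_net = nn.Sequential(
    nn.Conv2d(FEATURES, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, BINS, kernel_size=1),           # per-pair logits over distance bins
)

pair_features = torch.randn(1, FEATURES, L, L)    # MSA-derived features in the real system
probs = distogram_net(pair_features).softmax(dim=1)
bin_centers = torch.linspace(2.0, 22.0, BINS)     # distance bins in Angstroms
target_dist = (probs * bin_centers.view(1, BINS, 1, 1)).sum(dim=1).squeeze(0).detach()  # (L, L)

# Step 2: non-learned folding -- optimize 3D coordinates so their pairwise distances
# match the predicted distance matrix as closely as possible.
coords = torch.randn(L, 3, requires_grad=True)
opt = torch.optim.Adam([coords], lr=0.05)
mask = ~torch.eye(L, dtype=torch.bool)            # ignore the zero self-distances
for _ in range(500):
    diff = coords.unsqueeze(0) - coords.unsqueeze(1)
    dist = torch.sqrt((diff ** 2).sum(-1) + 1e-8)
    loss = ((dist - target_dist)[mask] ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```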

Okay, that's it at a high level. Now, how does AlphaFold 2 work? First of all, we don't know for sure. There's only a blog post and a little speculation here and there. But one thing is clear: there are attention mechanisms involved. So I think convolutional neural networks are out and transformers are in.

The same kind of process that's been happening in the natural language processing space, and really most of the deep learning space. It's clear that attention mechanisms are going to be taking over every aspect of machine learning. So I think the big change is convnets are out, transformers are in.
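For intuition, here's a generic self-attention step over per-residue embeddings, the kind of building block a transformer-style model would use in place of a convolution. This is a textbook attention layer with made-up dimensions, not AlphaFold 2's actual architecture.

```python
# Generic self-attention over per-residue embeddings (illustration only, made-up sizes).
import torch
import torch.nn as nn

L, D = 64, 128                                  # residues, embedding width (hypothetical)
residue_repr = torch.randn(1, L, D)             # one protein, L residue embeddings

attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
updated, weights = attn(residue_repr, residue_repr, residue_repr)

# Unlike a 3x3 convolution with its local receptive field, residue i can attend directly
# to residue j even when they are far apart in the sequence -- exactly the long-range
# contacts that matter when a chain folds back on itself in 3D.
print(weights.shape)                            # (1, L, L): attention over all residue pairs
```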

The rest is more in the speculation space. It does seem that the MSA, the multiple sequence alignment, is part of the learning process now, as opposed to part of the feature engineering, which it was in the original system. I believe it was only a source of features there. Please correct me if I'm wrong on that.

But it does seem like here it's now part of the learning process. And there's something iterative about it, at least in the blog post, where there's a constant passing of learned information between the sequence-residue representation, which is the evolutionarily related sequence side of things, and the amino acid residue-to-residue distances that are more akin to the AlphaFold 1 system.

How that iterative process works is unclear; whether it's part of one giant neural network or whether several neural networks are involved, I don't know. But it does seem that the evolutionarily related sequences are now part of the learning process. It does seem that there's some kind of iterative passing of information, and of course, attention is woven into the entire picture.
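Here's one speculative sketch of what that iterative exchange could look like: a sequence (MSA-derived) representation and a pairwise residue-residue representation repeatedly pass information to each other. The names, shapes, and update rules below are my own illustration under that assumption, not DeepMind's published architecture.

```python
# Speculative sketch of iterative exchange between a sequence representation and a
# pair representation (illustration only; not the published AlphaFold 2 design).
import torch
import torch.nn as nn

L, D = 64, 128
seq_repr = torch.randn(1, L, D)        # per-residue features from evolutionarily related sequences
pair_repr = torch.randn(1, L, L, D)    # features for every residue pair

seq_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
pair_update = nn.Linear(2 * D, D)      # hypothetical: build pair features from residue pairs
seq_update = nn.Linear(D, D)           # hypothetical: fold pair info back into residues

for _ in range(4):                     # a few rounds of passing information back and forth
    # residues attend to each other (sequence side)
    seq_repr, _ = seq_attn(seq_repr, seq_repr, seq_repr)
    # update each pair (i, j) from the current embeddings of residues i and j
    i_feat = seq_repr.unsqueeze(2).expand(-1, L, L, -1)
    j_feat = seq_repr.unsqueeze(1).expand(-1, L, L, -1)
    pair_repr = pair_repr + pair_update(torch.cat([i_feat, j_feat], dim=-1))
    # feed a summary of each residue's pair row back into the sequence representation
    seq_repr = seq_repr + seq_update(pair_repr.mean(dim=2))
```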

Now, at least in the blog post, the term spatial graph is used as opposed to a distance matrix or adjacency matrix. So I don't know if there are some magical tricks involved, some interesting generalization of an adjacency matrix that's involved in a spatial graph representation, or if it's simply using the term spatial graph because there is more than just pairwise distances involved in this version of the learning architecture.

I think there are two lessons from the recent history of deep learning. One: if you involve attention, if you involve transformers, you're going to get a big boost. And the other is that if you make as much of the problem learnable as possible, you're often going to see quite significant benefits.

This is something I've definitely seen in the computer vision space, especially on the semantic segmentation side of things. Okay, why is this breakthrough important? Allow this computer scientist, AI person, to wax poetic about some biology for a bit. Because the protein structure gives us the protein function, figuring out the structure for maybe millions of proteins might allow us to learn unknown functions of genes encoded in DNA.

Also, as I mentioned before, it might allow us to understand the cause of many diseases that are the result of misfolded proteins. Other applications will stem from the ability to quickly design new proteins that in some way alter the function of other proteins. So for treatments, for drugs, that means designing proteins that fix other misfolded proteins.

Again, those are the causes of many diseases. I read a paper that was talking about agriculture applications, being able to engineer insecticidal proteins or frost-protective coatings, stuff I know nothing about; I read it, it's out there. Tissue regeneration through self-assembling proteins, supplements for improved health and anti-aging, and all kinds of biomaterials, for textiles and just materials in general.

Now, the long-term or super long-term future impact of this breakthrough might be the advancement of end-to-end learning for really complicated problems in the life sciences. Protein folding looks at the folding of a single protein. So the next step is being able to predict multi-protein interaction or protein complex formation, which, even with my limited knowledge of biology, I think is a much, much harder problem.

And just being able to incorporate the environment into the modeling of the folding of the protein, and also seeing how the function of that protein might change given the environment. All those kinds of things, incorporating that into the end-to-end learning problem. Then taking a step even further: this is physics, biophysics, so being able to accurately do physics-based simulation of biological systems.

So if we think of a protein as one of the most basic biological systems, so then taking a step out further and further and increasing the complexity of the biological systems, you can start to think of something crazy like being able to do accurate physics-based simulation of cells, for example, or entire organs.

Or maybe one day being able to do an accurate physics-based simulation of the very over-caffeinated organ that's producing this very video. In fact, how do we know this is not a physics-based simulation of a biological system whose assigned name happens to be Lex? I guess we'll never know. And of course, we can go farther out into super long-term sci-fi kind of ideas of biological life and artificial life, which are fascinating ideas of being able to play with simulation of prediction of organisms that are biologically-based or non-biologically-based.

I mean, that's the exciting future of end-to-end learning systems that step outside the game-playing world of StarCraft, of Chess and Go, and go into the life sciences of real-world systems that operate in the real world. That's where Tesla Autopilot is really exciting. That's where any robots that use machine learning are really exciting.

And that's where this big breakthrough in the space of structural biology is super exciting. And truly, to me, as one humble human, inspiring beyond words. Speaking of words, for me, these quick videos are fun and easy to make, and I hope it's at least somewhat useful to you. If it is, I'll make more.

It's fun. I enjoy it. I love it, really. Quick shout-out to podcast sponsors. Vincero Watches, the maker of classy, well-performing watches. I'm wearing one now. And Four Sigmatic, the maker of delicious mushroom coffee. I drink it every morning and all day, as you can probably tell from my voice now.

Please check out these sponsors in the description to get a discount and to support this channel. All right, love you all. And remember, try to learn something new every day. (upbeat music)