Back to Index

Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series


Chapters

0:00 Introduction
0:54 Privacy preserving AI talk overview
1:28 Key question: Is it possible to answer questions using data we cannot see?
5:56 Tool 1: remote execution
8:44 Tool 2: search and example data
11:35 Tool 3: differential privacy
28:09 Tool 4: secure multi-party computation
36:37 Federated learning
39:55 AI, privacy, and society
46:23 Open data for science
50:35 Single-use accountability
54:29 End-to-end encrypted services
59:51 Q&A: privacy of the diagnosis
62:49 Q&A: removing bias from data when data is encrypted
63:40 Q&A: regulation of privacy
64:27 Q&A: OpenMined
66:16 Q&A: encryption and nonlinear functions
67:53 Q&A: path to adoption of privacy-preserving technology
71:44 Q&A: recommendation systems

Transcript

Today, we're very happy to have Andrew Trask. He's a brilliant writer, researcher, tweeter, that's a word, in the world of machine learning and artificial intelligence. He is the author of Grokking Deep Learning, the book that I highly recommended in the lecture on Monday. He's the leader and creator of OpenMined, which is an open-source community that strives to make our algorithms, our data, and our world in general more privacy-preserving.

He is coming to us by way of Oxford, but without that rich, complex, beautiful, sophisticated British accent, unfortunately. He is one of the best educators, and truly one of the nicest people I know. So please give him a warm welcome. >> Thanks. That was a very generous introduction. So yeah, today we're going to be talking about privacy-preserving AI.

This talk is going to come in two parts. So the first is going to be looking at privacy tools from the context of a data scientist or a researcher, like how their actual UX might change. Because I think that's the best way to communicate some of the new technologies that are coming about in that context.

Then we're going to zoom out and look at, under the assumption that these kinds of technologies become mature, what is that going to do to society? What consequences or side effects could these kinds of tools have, both positive and negative? So first, let's ask the question, is it possible to answer questions using data that we cannot see?

This is going to be the key question that we look at today. Let's start with an example. So first, if we wanted to answer the question, what do tumors look like in humans? Well, this is a pretty complex question. Tumors are pretty complicated things. So we might train an AI classifier.

If we wanted to do that, we would first need to download a dataset of tumor-related images, so we'd be able to study them statistically and recognize what tumors look like in humans. But this kind of data is not very easy to come by. It's rarely collected, it's difficult to move around, and it's highly regulated.

So we're probably going to have to buy it from a relatively small number of sources that are able to actually collect and manage this kind of information. The scarcity and constraints around this data are likely to make it a relatively expensive purchase. If it's going to be an expensive purchase for us to answer this question, well, then we're going to need to find someone to finance our project.

If we need someone to finance our project, we have to come up with a business plan for how we're going to pay them back. If we're going to create a business plan, then we have to find a business partner. And if we're going to find a business partner, we have to spend a lot of time on LinkedIn looking for someone to start a business with us.

And all of this is because we wanted to answer the question, what do tumors look like in humans? What if we wanted to answer a different question? What if we wanted to answer the question, what do handwritten digits look like? Well, this would be a totally different story. We'd download a dataset, we'd download a state-of-the-art training script from GitHub, we'd run it, and a few minutes later, we'd have the ability to classify handwritten digits with potentially superhuman ability, if such a thing exists.

Why is this so different between these two questions? The reason is that getting access to private data, data about people, is really, really hard. As a result, we spend most of our time working on problems and tasks like this: ImageNet, MNIST, CIFAR-10. Anybody who's trained a classifier on MNIST before?

Raise your hand. I expect pretty much everybody. Instead of working on problems like this, has anyone trained a classifier to predict dementia, diabetes, Alzheimer's? Depression? Anxiety? No one. So why is it that we spend all our time on tasks like this, when tasks like these represent our friends and loved ones, and problems in society that really, really matter?

Not to say that there aren't people working on this; there absolutely are, there are whole fields dedicated to it. But for the machine learning community at large, these tasks are pretty inaccessible. In fact, in order to work on one of these, you'd have to dedicate a portion of your life just to getting access to the data, whether it's doing a startup or joining a hospital or what have you.

Whereas other kinds of datasets are simply readily accessible. This brings us back to our question: is it possible to answer questions using data that we cannot see? In this talk, we're going to walk through a few different techniques. If the answer to this question is yes, the combination of these techniques should make it so that we can actually pip install access to datasets like these in the same way that we pip install access to other deep learning tools.

The idea here is to lower the barrier to entry, to increase the accessibility of some of the most important problems that we would like to address. So as Lex mentioned, I lead a community called OpenMined, which is an open-source community of a little over 6,000 people who are focused on lowering the barrier to entry to privacy-preserving AI and machine learning.

Specifically, one of the tools that we're working on, and that we're talking about today, is called PySyft. PySyft extends the major deep learning frameworks with the ability to do privacy-preserving machine learning. So specifically today, we're going to be looking at the extensions into PyTorch. PyTorch, people generally familiar with PyTorch? Yeah, quite a few users.

It's my hope that by walking through a few of these tools, it'll become clear how we can start to be able to do data science, the act of answering questions, using data that we don't actually have direct access to. Then in the second half of the talk, we're going to generalize this to answering questions even if you're not necessarily a data scientist.

So the first tool is remote execution. Let me just walk you through this. We're going to jump into code for a minute, but hopefully this is line by line and relatively simple. Even if you aren't familiar with PyTorch, I think it's relatively intuitive. We're looking at lists of numbers and these kinds of things.

So up at the top, we import Torch as a deep learning framework. PySyft extends Torch with this thing called TorchHook. All it's doing is iterating through the library and basically monkey-patching in lots of new functionality. Most deep learning frameworks are built around one core primitive, and that core primitive is the tensor.

For those of you who don't know what tensors are, just think of them as nested lists of numbers for now, and that'll be good enough for this talk. But for us, we introduce a second core primitive, which is the worker. A worker is a location within which computation is going to be occurring.

So in this case, we have a virtualized worker that is pointing to say a hospital data center. The assumption that we have is that this worker will allow us to run computation inside of the data center without us actually having direct access to that worker itself. It gives us a limited, white-listed set of methods that we can use on this remote machine.

So just to give you an example, so there's that core primitive we talked about a minute ago. We have the torch tensor, so 1, 3, 4, 5. The first method that we added is called just dot send. This does exactly what you might expect. Takes the tensor, serializes it, sends it into the hospital data center, and returns back to me a pointer.

This pointer is really, really special. For those of you who are actually familiar with deep learning frameworks, I hope that this will really resonate with you. Because it has the full PyTorch API as a part of it, but whenever you execute something using this pointer, instead of it running locally, even though it looks like and feels like it's running locally, it actually executes on the remote machine and returns back to you another pointer to the result.

The idea here being that I can now coordinate remote executions, remote computations, without necessarily having direct access to the machine. Of course, I can also do a .get request, and we'll see that permissions around when you can do a .get request, and actually ask for data from a remote machine to be sent back to you, are really, really important.
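To make this concrete, here's a minimal sketch of the flow just described, written in the style of the PySyft API from around this era; exact names and signatures may differ between versions, and the "hospital" worker here is just a local stand-in for a real remote data center.

```python
# Sketch of remote execution in the style of the PySyft 0.2-era API (hedged:
# method names may differ across versions); the worker is a local stand-in.
import torch
import syft as sy

hook = sy.TorchHook(torch)                        # monkey-patch torch with PySyft extensions
hospital = sy.VirtualWorker(hook, id="hospital")  # stand-in for a hospital data center

x = torch.tensor([1, 3, 4, 5])
x_ptr = x.send(hospital)      # serialize the tensor, send it away, keep only a pointer

y_ptr = x_ptr + x_ptr         # executes on the remote worker, returns another pointer
y = y_ptr.get()               # explicitly request the result back; this is the choke point
print(y)                      # tensor([ 2,  6,  8, 10])
```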

So just remember that. Cool. So this is where we start. In Pareto-principle terms, 80 percent of the value for 20 percent of the effort, this is the first big cut. So pros: data remains on the remote machine. We can now, in theory, do data science on a machine that we don't have access to, that we don't own.

But the problem, the first con we want to address, is how can we actually do good data science without physically seeing the data? It's all well and good to say, "I'm going to train a deep learning classifier." But the process of answering questions is inherently iterative. It's inherently give and take.

I learn a little bit and I ask a little bit, I learn a little bit and I ask a little bit. This brings me to the second tool. So search and example data. Again, we're starting really simple. It will get more complex here in a minute. So in this case, let's say we have what's called a grid.

So PyGrid: if PySyft is a library, PyGrid is the platform version. Again, this is all open-source Apache 2.0 stuff. We have what's called a grid client, which could be an interface to a large number of datasets inside of a big hospital. So let's say I wanted to train a classifier to do something with diabetes.

It's going to predict diabetes, or predict a certain kind of diabetes, or a certain attribute of diabetes. I should be able to perform remote search. I get back pointers to the remote information. I can get back detailed descriptions of what the information is without me actually looking at it.

How it was collected, what the rows and columns are, what the types of the different fields are, what ranges the values can take on, things that allow me to do remote normalization, these kinds of things. Then, in some cases, I can even look at samples of this data.

So these samples could be human-curated. They could be generated from a GAN. They could be short snippets from the actual dataset; maybe it's okay to release small amounts but not large amounts. The reason I highlight this is that it isn't crazy complex stuff. Prior to going back to school, I used to work for a company called Digital Reasoning.

We did on-prem data science. We delivered AI services to corporations behind the firewall. We worked with classified information, and we worked with investment banks helping prevent insider trading. Doing data science on data that your home team, back in Nashville in our case, is not able to see is really, really challenging.

But there are some things that can give you the first big jump before you move into the more complex tools for the more challenging use cases. Cool. So basic remote execution, that is, remote procedure calls, basic private search, and the ability to look at sample data give us enough general context to be able to start doing things like feature engineering and evaluating quality.
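As an illustration only, here's a hypothetical sketch of what that search-and-sample workflow could look like; the GridClient and DatasetHandle classes below are invented for this example and are not the real PyGrid API.

```python
# Hypothetical sketch of the search-and-sample workflow (not the real PyGrid API):
# the data scientist gets handles, metadata, and curated samples, never raw rows.
class DatasetHandle:
    def __init__(self, name, description, schema, sample_rows):
        self.name = name                # e.g. "hospital-a/diabetes"
        self.description = description  # how it was collected, value ranges, etc.
        self.schema = schema            # column names and types
        self.sample_rows = sample_rows  # human-curated, GAN-generated, or tiny real snippets

class GridClient:
    def __init__(self, datasets):
        self._datasets = datasets

    def search(self, *tags):
        """Return handles (not data) for remote datasets matching any tag."""
        return [d for d in self._datasets if any(t in d.name for t in tags)]

grid = GridClient([DatasetHandle(
    "hospital-a/diabetes",
    "Collected 2015-2019; HbA1c in percent, age in years",
    {"age": "int", "hba1c": "float", "readmitted": "bool"},
    sample_rows=[(54, 7.1, False), (61, 8.3, True)],   # toy sample, not real patients
)])

for handle in grid.search("diabetes"):
    print(handle.name, handle.schema)   # enough context for remote feature engineering
```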

So now the data remains on the remote machine. We can do some basic feature engineering. Here's where things get a little more complicated. If you remember, on the very first slide where I showed you some code, at the bottom I called .get on the tensor. What that did was take the pointer to some remote information and say, "Hey, send that information to me." That is an incredibly important bottleneck.

Unfortunately, despite the fact that I'm doing all my remote execution, if that's just naively implemented, well, I can just steal all the data that I want. I just call .get on whatever pointers I want, and there's no real added security. So what are we going to do about this?

This brings us to tool number three, called differential privacy. Differential privacy, anyone come across it? Raise your hands a little higher? Okay, cool. Awesome. Good. So I'm going to do a quick high-level overview of the intuition of differential privacy, and then we're going to jump into how it can look, and is looking, in code, and I'll give you resources for a deeper dive into differential privacy at the end of the talk, should you be interested.

So differential privacy, loosely stated, is a field that allows you to do statistical analysis without compromising the privacy of the dataset. More specifically, it allows you to query a database, while making certain guarantees about the privacy of the records contained within the database. So let me show you what I mean.

Let's say we have an example database; this is the canonical DB if you look in the literature for differential privacy. It'll have one row per person, and one column of zeros and ones, which correspond to true and false. We don't actually care what those zeros and ones are indicating.

It could be the presence of a disease, it could be male or female, it could be just some sensitive attribute, something that's worth protecting. Our goal is to ensure that statistical analysis doesn't compromise privacy. What we're going to do is query this database: we're going to run some function over the entire database, look at the result, and then ask a very important question.

We're going to ask, if I were to remove someone from this database, say John, would the output of my function change? If the answer to that is no, then intuitively, we can say that, well, this output is not conditioned on John's private information. Now, if we could say that about everyone in the database, well then, okay, it would be a perfectly privacy-preserving query, but it might not be that useful.

But this intuitive definition, I think, is quite powerful: the notion of how we can construct queries that are invariant to removing someone, or replacing them with someone else. The maximum amount that the output of a function can change as a result of removing or replacing one of the individuals is known as the sensitivity.
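As a toy illustration of my own, not from the talk, here's what sensitivity means for a simple counting query: the most the output can change when any one person is removed.

```python
# Sensitivity of a counting query: the largest change in the output when any
# single person is removed from the database. (Toy example for illustration.)
db = [1, 0, 1, 1, 0, 1]        # one row per person; 1/0 is the sensitive attribute

def query(database):
    return sum(database)        # a simple counting query

full = query(db)
sensitivity = max(abs(full - query(db[:i] + db[i + 1:])) for i in range(len(db)))
print(sensitivity)              # 1: removing one person changes the count by at most 1
```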

That's important: if you're reading the literature and you come across sensitivity, that's what it's referring to. So what do we do when we have a really sensitive function? We're going to take a bit of a sidestep for a minute. I have a twin sister who's finishing a PhD in political science.

In political science, they often need to answer questions about very taboo behavior, things that people are likely to lie about. So let's say I wanted to survey everyone in this room and answer the question: what percentage of you are secretly serial killers? Not because I think any one of you is, but because I genuinely want to understand this trend.

I'm not trying to arrest people, I'm not trying to be an instrument of the criminal justice system. I'm trying to be a sociologist or political scientist and understand this actual trend. The problem is if I sit down with each one of you in a private room and I say, "I promise, I promise, I promise, I won't tell anybody," I'm still going to get a skewed distribution.

Some people are just going to be like, "Why would I risk telling you this private information?" So what sociologists can do is this technique called randomized response, where I should have brought a coin. You take a coin and you give it to each person before you survey them, and you ask them to flip it twice somewhere that you cannot see.

So I would ask each one of you to flip a coin twice somewhere that I cannot see. Then I would instruct you: if the first coin flip is heads, answer honestly. But if the first coin flip is tails, answer yes or no based on the second coin flip.

So roughly half the time, you'll be honest, and the other half of the time, you'll be giving me a perfect 50-50 coin flip. And the cool thing is that what this is actually doing is taking whatever the true mean of the distribution is and averaging it with a 50-50 coin flip, right?

So if, say, 55 percent of you answered yes, that you are a serial killer, then I know that the true center of the distribution is actually 60 percent, because the observed result is 60 percent averaged with a 50-50 coin flip. Does that make sense? However, despite the fact that I can recover the center of the distribution, given enough samples, each individual person has plausible deniability.

If you said yes, it could have been because you actually are, or it could have been because you just happened to flip a certain sequence of coin flips, okay? Now, this concept of adding noise to data to give plausible deniability is sort of the secret weapon of differential privacy.
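Here's a small simulation of the randomized-response mechanism and the de-biasing step just described, assuming a fair coin and an illustrative 10 percent true rate; since observed = 0.5 * true + 0.25, the true rate is recovered as 2 * observed - 0.5.

```python
# Randomized response: each person flips a coin; heads -> answer honestly,
# tails -> answer with a second fair coin flip. The surveyor then de-biases.
import random

def randomized_response(truth: bool) -> bool:
    if random.random() < 0.5:        # first flip: heads -> honest answer
        return truth
    return random.random() < 0.5     # tails -> answer from a second fair flip

true_rate = 0.10                     # illustrative fraction with the sensitive attribute
n = 100_000
answers = [randomized_response(random.random() < true_rate) for _ in range(n)]
observed = sum(answers) / n          # ~0.30: the true rate averaged with a 50-50 coin
print(round(2 * observed - 0.5, 3))  # ~0.10: recovered rate, yet each "yes" is deniable
```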

And the field itself is a set of mathematical proofs for trying to do this as efficiently as possible: adding the smallest amount of noise to get the most accurate results with the best possible privacy protections. There is a meaningful base trade-off here, a kind of Pareto trade-off.

And we're trying to push that trade-off down. So the field of research that is differential privacy is looking at how to add noise to data, and to the resulting queries, to give plausible deniability to the members of a database or a training dataset.

Does that make sense? Now, a few terms that you should be familiar with. There's local and there's global differential privacy. Local differential privacy adds noise to data before it's sent to the statistician. So in this case, the one with the coin flip, that was local differential privacy.

It affords you the best protection because you never actually reveal your information in the clear to someone. And then there's global differential privacy, which says: okay, we're going to put everything in the database, perform a query, and then before the output of the query gets published, we're going to add a little bit of noise to the output of the query.
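As a sketch of what a global-differential-privacy release might look like, assuming a simple counting query with sensitivity 1, the trusted curator computes the true answer and adds Laplace noise scaled to sensitivity / epsilon before publishing it.

```python
# Global differential privacy via the Laplace mechanism (sketch): the curator sees
# the raw data, but only releases a noisy answer. Assumes a counting query.
import numpy as np

def private_count(database, epsilon, sensitivity=1.0):
    true_answer = sum(database)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_answer + noise

db = [1, 0, 1, 1, 0, 1]
print(private_count(db, epsilon=0.5))   # noisy count; smaller epsilon means more noise
```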

This tends to have a much better privacy trade-off, but you have to trust the database owner not to compromise the results. And we'll see there are some other things we can do there. With me so far? This is a good point for questions, if you have any.

Got it. So the question is: is this verifiable? Is any of this process of differential privacy verifiable? That is a fantastic question, and one that absolutely comes up in practice. So first, with local differential privacy, the nice thing is everyone's doing it for themselves.

So in that sense, if you're flipping your own coins and answering your own questions, that's your verification; you're trusting yourself. For global differential privacy, stay tuned for the next tool and we'll come back to that. All right. So what does this look like in code?

So first, we have a pointer to a remote private dataset and we call .get. Whoa, we get a big fat error. You just asked to see the raw value of some private data point, which you cannot do. Instead, you pass an epsilon into .get to add the appropriate amount of noise.

So one thing I haven't mentioned yet about differential privacy: I mentioned sensitivity, which was related to the type of query, the type of function that we want to run, and its invariance to removing or replacing individual entries in the database.

Epsilon is a measure of what we call our privacy budget. And what our privacy budget is saying is: what's the upper bound on the amount of statistical uniqueness that I'm going to allow to come out of this database?

And actually, I'm going to take one more side track here, because I think it's really worth mentioning data anonymization. Anyone familiar with data anonymization, come across this term before? Taking a document and redacting the social security numbers and all that kind of stuff? By and large, it does not work.

If you don't remember anything else from this talk: it is very dangerous to do just dataset anonymization. And differential privacy, in some respects, is the formal version of data anonymization, where instead of just saying, okay, I'm going to redact out these pieces and then I'll be fine, it's saying we can do a lot better.

For example, the Netflix Prize, the Netflix machine learning prize, if you remember it: a big million-dollar prize, maybe some people in here competed in it. For this prize, Netflix published an anonymized dataset of movies and users. They took all the movies and replaced them with numbers, they took all the users and replaced them with numbers, and you just had sparsely populated movie ratings in this matrix.

Seemingly anonymous, right? There are no names of any kind. But the problem is that each row is statistically unique, meaning it's kind of its own fingerprint. And so two months after the dataset was published, some researchers at UT Austin, I think it was, were able to go and scrape IMDb, basically create the same matrix from IMDb, and then just compare the two.

And it turns out that people who were into movie rating were into movie rating, watching movies at similar times, with similar patterns and similar tastes. And they were able to de-anonymize the first dataset with a high degree of accuracy. It happened again in a famous case of medical records, for what I think was a Massachusetts senator.

It was someone in the Northeast being de-anonymized through very similar techniques. So someone goes and buys an anonymized medical dataset over here that has, say, birthdate and zip code, and this one has zip code and gender, and this one has zip code, gender, and whether or not you have cancer.

And when you get all these together, you can start to use the uniqueness in each one to relink it all back together. This is so doable, to the extreme that I unfortunately know of companies whose business model is to buy anonymized datasets, de-anonymize them, and sell market intelligence to insurance companies.

Ooh, right? But it can be done. And the reason it can be done is that just because the dataset you are publishing, the one that you are physically looking at, doesn't seem like it has social security numbers and such in it, does not mean that there isn't enough unique statistical signal for it to be linked to something else.
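Here's a toy version of that kind of linkage attack, with made-up rating events standing in for the Netflix and IMDb matrices; the overlap between rows acts as the fingerprint.

```python
# Toy linkage attack: two "anonymized" tables with no names, joined on a
# statistically unique fingerprint of rating events. All data here is made up.
anon_ratings = {                      # user_id -> set of (movie_id, week) events
    101: {(7, 1), (12, 1), (31, 2)},
    102: {(9, 2), (44, 3), (58, 3)},
}
public_reviews = {                    # public IMDb-style activity under real names
    "alice": {(7, 1), (12, 1), (31, 2), (90, 5)},
    "bob":   {(9, 2), (44, 3), (58, 3)},
}

def link(anon, public, threshold=2):
    matches = {}
    for uid, events in anon.items():
        best = max(public, key=lambda name: len(events & public[name]))
        if len(events & public[best]) >= threshold:
            matches[uid] = best       # the shared pattern re-identifies the "anonymous" row
    return matches

print(link(anon_ratings, public_reviews))   # {101: 'alice', 102: 'bob'}
```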

And so when I say maximum amount of epsilon: epsilon is an upper bound on the statistical uniqueness that you're publishing from a dataset. And what this tool represents is saying, okay, apply however much noise you need, given whatever computational graph led back to private data for this tensor,

to ensure, to put an upper bound on, the potential for linkage attacks. Now, if you set epsilon to zero, that's effectively saying I'm only going to allow patterns that have occurred at least twice, meaning two different people had this pattern and thus it's not unique to either one.

Yes? So the question is what happens if you perform the query twice. The random noise would be re-randomized and sent again, and you're absolutely correct: this epsilon is how much I'm spending with this query. So if I ran this three times, I would spend an epsilon of 0.3.

Does that make sense? So this is a 0.1 query; if I ran it multiple times, the epsilons would sum. And so for any given data science project, what we're advocating is that you're given an epsilon budget that you're not allowed to exceed, no matter how many queries you run.
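A minimal sketch of that budgeting idea, using the 0.1-per-query numbers from the example; the PrivacyBudget class below is illustrative, not part of any particular library.

```python
# Epsilon budget accounting (sketch): spends compose additively across queries,
# and the project is cut off once the total budget is exhausted.
class PrivacyBudget:
    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float):
        if self.spent + epsilon > self.total:
            raise PermissionError("privacy budget exceeded")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=0.3)
for _ in range(3):
    budget.spend(0.1)          # three epsilon-0.1 queries spend 0.3 in total
try:
    budget.spend(0.1)          # a fourth query would exceed the budget
except PermissionError as err:
    print(err)
```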

Now, there's another subfield of differential privacy that's looking at single-query approaches, which is all about synthetic datasets: how can I perform one query against the whole dataset and create a synthetic dataset that has certain invariances that are desirable,

so that I can do good statistics on it, but then query it as many times as I want? Anyway, we don't have to get into that now. Does that answer your question? Cool. Awesome. So now you might think, okay, this is like a lost cause.

Like, how can we be answering questions while protecting privacy; won't the statistical signal be gone? But it's the difference between, if I have a dataset and I want to know what causes cancer, being able to query the dataset and learn that smoking causes cancer without learning whether specific individuals are or are not smokers.

Does that make sense? And the reason for that is that I'm specifically looking for patterns that occur multiple times across different people. This actually happens to closely mirror the type of generalization that we want in machine learning and statistics anyway. Does that make sense?

As machine learning practitioners, we're actually not really interested in the one-offs. I mean, sometimes our models memorize things; this happens. But we're more interested in the things that are not specific to you. I want the things that are going to work, the heart treatments that are going to work for everyone in this room, and not just, well, obviously if you need a heart treatment, it'd be cool for you to have one.

But what we're chiefly interested in are the things that generalize, which is why this is realistic, and why, with continued effort on both the tooling and the theory side, we can have a much better reality than today. Cool.

So, pros, just to review. First, remote execution allows data to remain on the remote machine. Search and sampling: we can feature engineer using toy data. Differential privacy: we have a formal, rigorous privacy budgeting mechanism. Now, shoot, how is the privacy budget set?

Is it defined by the user, or is it defined by the dataset owner, or someone else? This is a really, really interesting question, actually. So first, it's definitely not set by the data scientist, because that would be a bit of a conflict of interest. And at first, you might say it should be the data owner,

so the hospital. They're trying to cover their butt and make sure that their assets are protected both legally and commercially; they're trying to make money off this. So there are proper incentives there. But the interesting thing, and this gets back to your question, is what happens if I have, say, a radiology scan in two different hospitals?

And they both spend one epsilon's worth of my privacy in each of these hospitals. That means that actually two epsilon of my private information is out there, and it just takes one person clever enough to go to both places to get the join.

This is actually the exact same mechanism we were talking about a second ago, when someone went from Netflix to IMDb. And so the true answer to who should be setting epsilon budgets, although logistically it's going to be challenging, and we'll talk about a little bit of this in part two of the talk, but I'm going a little bit slow,

is that it should be us. It should be people, and it should be people setting it around their own information. You should be setting your personal epsilon budget. Does that make sense? That's an aspirational goal; we've got a long way to go before we can get to that level of infrastructure around these kinds of things.

We can definitely talk about more of that in the question-and-answer session as well. But I think, in theory, that's what we want. Okay. The two cons, the two weaknesses of this approach that we still have: someone asked this question. I think it was you. Yeah, you asked the question.

I think it was you. Yeah, yeah, you asked the question. Um, so first, the data is safe, but the model is put at risk. Uh, and what if we need to do a join? Actually, actually, yours is the third one, which I should totally add to the slide. Um, so, so first, um, if I'm sending my, my computations, my model into the hospital to learn how to be a better cancer classifier, right?

My model is put at risk. It's kind of a bummer if, like, you know, this is a $10 million healthcare model. I'm just sending it to a thousand different hospitals to get learned, to learn. So that's potentially risky. Second, um, what if I need to do a join a computation across multiple different data owners, who don't trust each other, right?

Who sends whose data to whom, right? And thirdly, um, as you pointed out, how do I trust that these computations are actually happening the way that I am telling the remote machine that they should happen? This brings me to my absolute favorite tool, secure multi-party computation. Come across this before?

Raise them high. Okay, cool. A little bit above average. Most machine learning people have not heard about this yet, and this is the coolest thing I've learned about since learning about AI and machine learning. This is a really, really cool technique.

Encrypted computation, how about homomorphic encryption? Have you come across homomorphic encryption? Okay, a few more. Yeah, this is related to that. So first, the textbook definition is like this. If you go on Wikipedia, you'd see that secure MPC allows multiple people to combine their private inputs to compute a function, without revealing their inputs to each other.

But in the context of machine learning, the implication of this is that multiple different individuals can share ownership of a number. Share ownership of a number. Let me show you what I mean. Let's say I have the number five, my happy smiling face, and I split it into two shares, a two and a three.

I've got two friends, Marianne and Bob, and I give them these shares. They are now the shareholders of this number. Now I'm going to go away, and this number is shared between them. This gives us several desirable properties. First, it's encrypted in the sense that neither Bob nor Marianne can tell what number is encrypted between them by looking at their own share by itself.

Now, for those of you who are familiar with cryptographic math, I'm hand-waving over this a little bit. Decryption would typically be adding the shares together modulo a large prime, so the shares would typically look like large pseudo-random numbers.

But for the sake of making it intuitive, I've picked numbers that are convenient to the eyes. So first, these two values are encrypted, and second, we get shared governance, meaning that we cannot decrypt these numbers, or do anything with these numbers, unless all of the shareholders agree.

But the truly extraordinary part is that while this number is encrypted between these individuals, we can actually perform computation. So in this case, let's say we wanted to multiply the encrypted number by two: each person can multiply their share by two, and now they have an encrypted number 10.

And there's a whole variety of protocols allowing you to do different functions, such as the functions needed for machine learning, while numbers are in this encrypted state. I'll give you some more resources at the end if you're interested in learning more about this as well.
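Here's a minimal sketch of additive secret sharing over a prime field; the talk's friendly numbers 2 and 3 become large pseudo-random shares here, and the modulus and two-shareholder setup are just illustrative choices.

```python
# Additive secret sharing (sketch): shares sum to the secret modulo Q, no single
# share reveals anything, and multiplying by a public constant works share-wise.
import random

Q = 2**31 - 1                                    # a large prime modulus

def share(secret, n_shareholders=2):
    shares = [random.randrange(Q) for _ in range(n_shareholders - 1)]
    shares.append((secret - sum(shares)) % Q)    # so that all shares sum to the secret
    return shares

def reconstruct(shares):
    return sum(shares) % Q                       # requires every shareholder to cooperate

marianne, bob = share(5)
print(reconstruct([marianne, bob]))              # 5

doubled = [(s * 2) % Q for s in (marianne, bob)] # each shareholder doubles their own share
print(reconstruct(doubled))                      # 10, computed while the number stayed hidden
```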

Now, the big tie-in: models and datasets are just large collections of numbers, which we can individually encrypt, which we can individually share governance over. Now, specifically to reference your question, there are two configurations of secure MPC, active and passive security. In the active security model, you can tell if anyone does computation that you did not independently authorize, which is great.

So what does this look like in practice when you go back to the code? In this case, we don't need just one worker; it's not just one hospital, because we're looking to have shared governance, shared ownership, amongst multiple different individuals. So let's say we have Bob, Alice, and Tao, and a crypto provider, which we won't go into now.

And I can take a tensor, and instead of calling .send and sending that tensor to someone else, now I call .share, and that splits each value into multiple different shares and distributes those amongst the shareholders, in this case Bob, Alice, and Tao. However, in the frameworks that we're working on, you still get the same PyTorch-like interface, and all the cryptographic protocol happens under the hood.

And the idea here is to make it so that we can do encrypted machine learning without you necessarily having to be a cryptographer. And vice versa: cryptographers can improve the algorithms, and machine learning people automatically inherit them. So it's the classic open-source machine learning library story, making complex intelligence more accessible to people, if that makes sense.
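In code, and again in the style of the PySyft API of that era (method names and keyword arguments may differ across versions), the flow might look something like this:

```python
# Sketch of secret-sharing a tensor across workers, PySyft 0.2-era style
# (hedged: exact API may differ by version; workers here are local stand-ins).
import torch
import syft as sy

hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
alice = sy.VirtualWorker(hook, id="alice")
tao = sy.VirtualWorker(hook, id="tao")
crypto_provider = sy.VirtualWorker(hook, id="crypto_provider")

x = torch.tensor([1, 3, 4, 5])
x_enc = x.share(bob, alice, tao, crypto_provider=crypto_provider)  # split into secret shares

y_enc = x_enc + x_enc         # the SMPC protocol runs under the hood, PyTorch-like on top
print(y_enc.get())            # shareholders cooperate to decrypt: tensor([ 2,  6,  8, 10])
```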

And what we can do on tensors, we can also do on models. So we can do encrypted training and encrypted prediction, and we're going to get into what kind of awesome use cases this opens up in a bit. And this is a nice set of features. In my opinion, this is sort of the MVP of doing privacy-preserving data science.

The idea being that I can have remote access to a remote dataset. I can learn high-level latent patterns, like what causes cancer, without learning whether individuals have cancer. I can pull back just that high-level information, with formal mathematical guarantees over what sort of filter the results are coming back through.

And I can work with datasets from multiple different data owners while making sure that each individual data owner is protected. Now, what's the catch? The first is computational complexity. Encrypted computation, secure MPC, involves sending lots of information over the network.

I think the state of the art for deep learning prediction is about a 13x slowdown over plaintext, which is inconvenient but not deadly. But you do have to understand that that assumes something like two AWS machines talking to each other, which are relatively fast.

We also haven't had any hardware optimization yet; to the extent that NVIDIA did a lot for deep learning, there will probably be some sort of Cisco-like player that does something similar for encrypted or secure MPC-based deep learning. Let's see.

So this brings us back to the fundamental question: is it possible to answer questions using data we cannot see? The theory is absolutely there. That's something I feel reasonably confident saying about the theoretical frameworks that we have.

And actually, the other thing that's really worth mentioning here is that these techniques come from totally different fields, which is why they haven't necessarily been combined that much yet. I'll get more into that in a second. But it's my hope that, by considering what these tools can do,

you'll see the potential that, in general, we can have this new ability to answer questions using information that we don't actually own ourselves. Because from a sociological standpoint, that's net new for us as a species, if that makes sense. Previously, we had to have a trusted third party who would take all the information in themselves and make some sort of neutral decision.

So we'll come back to that in a second. One of the big long-term goals of our community is to make infrastructure for this secure enough and robust enough, and of course free and Apache 2.0 open source, so that information on the world's most important problems will be this accessible,

so that we spend less time working on tasks like that and more time working on tasks like this. So this is going to be the breaking point between part one and part two. Part two will be a bit shorter. But if you're interested in diving deeper on the technicals of this, here's a six- or seven-hour course that I taught just on these concepts and on the tools.

It's free, on Udacity. Feel free to check it out. So the question was about how I specified that a model can be encrypted during training: is that the same as homomorphic encryption, or is that something else? A couple of years ago, there was a big burst in the literature around training on encrypted data, where you would homomorphically encrypt the dataset.

And it turned out that some of the statistical regularities of homomorphic encryption allowed you to actually train on that dataset without decrypting it. So this is similar to that, except the downside there is that in order to use that model in the future, you still have to be able to encrypt data with the same key, which is often constraining in practice, and there's also a pretty big hit to accuracy, because you're training on data that inherently has a lot of noise added to it.

What I'm advocating for here is that instead we actually encrypt both the model and the dataset during training, but inside the encryption, inside the box, it's actually performing the same computations that it would be doing in plaintext. So you don't get any degradation in accuracy, and you don't get tied to one particular public-private key pair.

Yeah. So the question was, can I comment on federated learning, specifically Google's implementation? I think Google's implementation is great, and obviously the fact that they've shown this can be done with hundreds of millions of users is incredibly powerful.

They even invented the term and created momentum in that direction. One thing that's worth mentioning is that there are two forms of federated learning. One is the one where your model is- federated learning, sorry, I've got to talk about what that is first.

Okay, yes, I'll do that quickly. Federated learning is basically the first thing I talked about: remote execution. If everyone has a smartphone, whether you've got Android or iOS, you plug your phone in at night and attach it to Wi-Fi.

You know when you text and it recommends the next word, next-word prediction? That model is trained using federated learning, meaning that it learns on your device to do that better, and then that model gets uploaded to the cloud, as opposed to uploading all of your tweets to the cloud and training one global model.

Does that make sense? So: plug your phone in at night, the model comes down, trains locally, goes back up. It's federated. That's basically what federated learning is in a nutshell. It was pioneered by the Quark team at Google, and they do really fantastic work.

They've paid down a lot of the technical debt, a lot of the technical risk around it, and they publish really great papers outlining how they do it, which is fantastic. What I outlined here is actually a slightly different style of federated learning.

Because there's federated learning with a fixed dataset and a fixed model, and lots of users, where the data is very ephemeral: phones are constantly logging in and logging off, you're plugging your phone in at night and then taking it out.

That's one style of federated learning. It's really useful for product development: if you want to build a smartphone app that has a piece of intelligence in it, but getting access to the data to train that intelligence would be prohibitively difficult, or you just want the value proposition of protecting privacy.

That's what that style of federated learning is good for. What I've outlined here is a bit more exploratory federated learning, where instead of the model being hosted in the cloud and data owners showing up and making it a bit smarter every once in a while, now the data is hosted in a variety of different private clouds.

And data scientists are going to show up and say, "Hmm, I want to do something with diabetes today," or, "Hmm, I want to do something with studying dementia today," something like that. This is much more difficult, because the attack surface for this is much larger.

I'm trying to be able to answer arbitrary questions about arbitrary datasets in a protected environment. So yeah, those are my general thoughts on that. Does federated learning leak any information? Federated learning by itself is not a secure protocol,

and that's why I presented this ensemble of techniques. So the question was, does federated learning leak information? It is perfectly possible for a federated learning model to simply memorize the dataset and then spit it back out later.

You have to combine it with something like differential privacy in order to prevent that from happening. Does that make sense? Just because the training is happening on my device does not mean it's not memorizing my data. Does that make sense?
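For concreteness, here's a bare-bones federated-averaging sketch in plain PyTorch, with two simulated devices and no secure aggregation or differential privacy, just to show that weights travel while the data stays put.

```python
# Federated averaging (sketch): the global model goes to each device, trains on
# local data that never leaves, and only the weight updates come back to be averaged.
import copy
import torch
import torch.nn as nn

global_model = nn.Linear(10, 2)                    # stand-in for, say, a next-word model

def local_update(model, data, targets, lr=0.01):
    model = copy.deepcopy(model)                   # train a private copy on-device
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()(model(data), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return model.state_dict()                      # only updated weights leave the device

# Two simulated phones, each with its own local data.
devices = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(2)]
updates = [local_update(global_model, x, y) for x, y in devices]

# The server averages the updates into the new global model.
avg_state = {k: torch.stack([u[k] for u in updates]).mean(dim=0) for k in updates[0]}
global_model.load_state_dict(avg_state)
```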

Okay. So now I want to zoom out and move a little away from the data science practitioner perspective, and take more the perspective of an economist or political scientist, someone looking globally at: okay, if this becomes mature, what happens?

And this is where it gets really exciting. Anyone entrepreneurial? Anyone? Everyone? I don't know. No one? Okay. Cool. Well, this is the part for you. The big difference is this ability to answer questions using data you can't see. Because as it turns out, most people spend a great deal of their life just answering questions, and a lot of it involves personal data.

I mean, whether it's minute things like where's my water, where are my keys, or what movie should I watch tonight, or what kind of diet should I have to be able to sleep well. A wide variety of different questions.

And we're limited in our answering ability by the information that we have. So this ability to answer questions using data we don't have is, I think, sociologically quite important. There are four different areas that I want to highlight as big groups of use cases for this kind of technology, to help inspire you to see where this infrastructure can go.

And actually, before I jump into that, has anyone been to Edinburgh? Cool. Just to the castle and stuff like that. So my wife and I, this is my wife, Amber, we went to Edinburgh for the first time, six months ago? September? September.

And we did the underground, was it the- We did a ghost tour. Yeah. We did the ghost tour, and that was really cool. There was one thing I took away from it. There was this point where we were standing, we had just walked out of the tunnels, and she was pointing out some of the architecture.

Then she started talking about the cobblestone streets and why the cobblestone streets are there. One of the main purposes of cobblestone streets was to lift you out of the muck. And the reason there was muck is that they didn't have any internal plumbing, so the sewage just poured out into the street,

because you live in a big city. And this was the norm everywhere. Actually, I think she even implied that the invention, or popularization, of the umbrella had less to do with actual rain and a bit more to do with buckets of stuff coming down from on high, which is a whole different world when you think about what that means.

But the reason I bring this up is that, however many hundred years ago, people were walking through sludge; sewage was just everywhere. It was all over the place, people were walking through it everywhere they went, and they were wondering why they got sick.

And in many cases, it wasn't because they wanted it to be that way; it was just a natural consequence of the technology that they had at the time. This is not malice, this is not anyone being good or bad or evil or whatever; it's just the way things were.

And I think there's a strong analogy to be made with how our data is handled as a society at the moment. We've just walked into a society where new inventions have come up, new things that are practical, new uses for them, and now everywhere we go, we're constantly spreading and spewing our data all over the place.

I mean, every camera that sees me walking down the street; goodness, there's a company that takes a whole picture of the Earth by satellite every day. How am I supposed to do anything without everyone following me around all the time?

And I imagine, whoever it was, I'm not a historian so I don't really know, but whoever it was that said, "What if we ran plumbing from every single apartment, business, school, maybe even some public toilets, underground, under our city, all to one location, and then processed it, used chemical treatments, and turned that into usable drinking water?" How laughable would that have been?

It would have been just the most massive logistical infrastructure problem ever: to take a working city, dig up the whole thing, take already constructed buildings, and run pipes through all of them. I mean, at Oxford, gosh, there's a building there that's so old it doesn't have showers, because they didn't want to run the plumbing for them.

You have to ladle water over yourself. It's in Merton College; it's quite famous. I mean, the infrastructure challenges must have seemed absolutely massive. And so, as I walk through four broad areas where things could theoretically be different based on this technology, it's probably going to hit you like, "Whoa, that's a lot of code," or, "Whoa, that's a lot of change." But I think the need is sufficiently great.

I mean, if you view our lives as just one long process of answering important questions, whether it's where we're going to get food or what causes cancer, then making sure that the right people can answer questions, without data just getting spewed everywhere so that the wrong people can answer their questions, is important.

And yeah, anyway, I know this is going to sound like there's a certain ridiculousness to maybe what some of this will be. But I hope that you will at least see that, theoretically, the basic building blocks are there, and that what really stands between us and a world that's fundamentally different is adoption, maturing of the technology, and good engineering.

Because I think, you know, once Sir Thomas Crapper invented the toilet, I do remember that one, at that point the basics were there, and what stood between them and that world was implementation, adoption, and engineering. And I think that's where we are.

And the best part is that we have companies like Google that have already paved the way with some very, very large rollouts of the early pieces of this technology. Cool. So what are the big categories? One I've already talked about: open data for science.

Okay. So this one is a really big deal. And the reason it's a really big deal is mostly because everyone gets excited about making AI progress; everyone gets super excited about superhuman ability in X, Y, Z. When I started my PhD at Oxford, I worked for a professor named Phil Blunsom.

The first thing he told me when I sat my butt down in his office on my first day as a student was, "Andrew, everyone's going to want to work on models. But if you look historically, the biggest jumps in progress have happened when we had new big datasets, or the ability to process new big datasets." And just to give a few anecdotes: ImageNet, right?

ImageNet, GPUs allowing us to process larger datasets. Even things like AlphaGo: that's synthetically generated, effectively infinite datasets. Or, I don't know, did anyone watch the AlphaStar live stream on YouTube? It talked about how it had trained on something like 200 years of StarCraft.

Or if you look at Watson playing Jeopardy, this was on the heels of a new, large, structured dataset based on Wikipedia. Or if you look at Garry Kasparov and IBM's Deep Blue, this was on the heels of the largest open dataset of chess matches having been published online.

There's this echo: big new dataset, big new breakthrough; big new dataset, big new breakthrough. And what we're talking about here is potentially several orders of magnitude more data, relatively quickly. And the reason for that is that I'm not saying we're going to invent a new machine, and that machine is going to collect the data, and then it's going to go online.

I'm saying there are thousands and thousands of enterprises, millions of smartphones, and hundreds of governments that all already have this data sitting inside of data warehouses, largely untapped, for two reasons: one, legal risk, and two, commercial viability. If I give you a dataset, all of a sudden I've just doubled the supply.

What does that do to my ability to charge for it? And there's the legal risk that you might do something bad with it that comes back to hurt me. With this category, I know it's just one phrase, but this is like ImageNet for every data task that's already been established.

For example, we're working with a professor at Oxford in the psychology department who wants to study dementia. The problem with dementia is that every hospital has, like, five cases. It's not a very centralized disease; it's not as if all the patients go to one big center where all the technology is, the way cancer patients do.

Dementia is sprinkled everywhere. And so the big thing that's blocking him as a dementia researcher is access to data. So he's investing in private data science platforms. And I didn't persuade him to; I found him after he was already looking to do that.

But pick any challenge where data is already being collected, and this can unlock not larger amounts of data existing, but larger amounts of data that can be used together. Does that make sense? This is like a thousand startups right here.

Instead of going out and trying to buy as many datasets as you can, which is a really hard and really expensive task, talk to anyone in Silicon Valley right now trying to do a data science startup, you go to each individual person that has a dataset and you say, "Hey, let me create a gateway between you and the rest of the world that's going to keep your data safe and allow people to leverage it."

That's a repeatable business model. Pick a use case: be the radiology network gatekeeper. Okay, so enough on that one. But does it make sense how, on a huge variety of tasks, just the ability to have a data silo that you can do data science against is going to increase the accuracy of a huge variety of models really, really quickly?

Cool? All right. Second one. Oh, that's not right. Single-use accountability. This one's a little bit tricky. You get to the airport and you get your bag checked, right? Everyone's familiar with this process, I assume. What happens? Someone's sitting at a monitor, and they see all the objects in your bag,

so that occasionally they can spot objects that are dangerous or illicit. There's a lot of extra information leakage, owing to the fact that they have to sit and look at all of the objects, basically searching every single person's bag totally and completely, just so that occasionally they can find that one.

The question they actually want to answer is: is there anything dangerous in this bag? But in order to answer it, they have to basically acquire access to the whole bag. So let's think about the same approach of answering questions using data we can't see.

The best example of this in the analog world is a sniffing dog. Familiar with sniffing dogs, the ones that give your bag a whiff at the airport? This is actually a really privacy-preserving thing, because dogs don't speak English or any other language. And so the benefit is, the dog comes by, "Nope, everything's fine," and moves on.

The dog has the ability to reveal only one bit of information, without anyone having to search every single bag. Okay? That is what I mean when I say a single-use accountability system. It means I am looking at some data stream because I'm holding someone accountable, right? And we want to make it so that I can only answer the question that I claim to be looking into.

So say this is a video feed, for example. I could get access to the raw video feed, with the millions of bits of information about every single person in the frame of view, walking around doing whatever, which, even if I'm a good person, I technically could use for other purposes.

But instead, I build a system with, say, a machine learning classifier, right? An auditable piece of technology that looks for whatever I'm supposed to be looking for, right? And I only see the frames, I only open up the bags, that I actually have to. Okay. This does two things.

One, it makes all of our accountability systems more privacy-preserving, which is great; it mitigates any potential dual or multi-use, right? And two, it means that holding people accountable in some areas that were simply too off-limits might become possible, right? Here's one thing that was really challenging: at Digital Reasoning, we used to do email surveillance, right?

It was basically helping investment banks find insider traders, right? Because they want to help enforce the laws; they get billion-dollar fines if anyone causes an infraction. But one of the things that was really difficult about developing these kinds of systems was that the data is so sensitive, right?

We're talking about hundreds of millions of emails at some massive investment bank. There's so much private information in there that barely any of our data scientists were able to actually work with the data and try to make the system better, right? And this makes it really, really difficult.
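As a rough illustration of the single-use idea, here is a minimal Python sketch of an accountability filter. The `detector` callable and the item names are hypothetical placeholders, not any real system; the point is only that the reviewer receives a one-bit-per-item signal instead of the raw data.

```python
# Minimal sketch of a single-use accountability filter (hypothetical example).
# Instead of exposing a raw bag scan or video feed to a human reviewer, an
# auditable classifier inspects each item and releases only one bit per item:
# "flag for human review" or not.

from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Item:
    item_id: str
    raw_data: bytes  # never shown to the reviewer directly


def single_use_filter(items: Iterable[Item],
                      detector: Callable[[bytes], float],
                      threshold: float = 0.9) -> List[str]:
    """Return only the IDs of items the reviewer is allowed to open."""
    flagged = []
    for item in items:
        score = detector(item.raw_data)  # runs inside the sealed system
        if score >= threshold:           # only one bit leaves: flagged or not
            flagged.append(item.item_id)
    return flagged


# Hypothetical usage: the reviewer only ever sees the flagged IDs.
# flagged_ids = single_use_filter(checked_bags, dangerous_object_detector)
```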

Anyway, cool. So enough on that. Third one, and this is the one I think is just incredibly exciting: end-to-end encrypted services. Everyone familiar with WhatsApp, Telegram, any of these? These are messaging apps, right, where a message is encrypted on your phone, and it's sent directly to someone else's phone, and only that person's phone can decrypt it?

Which means that someone can provide a service, messaging, without the service provider seeing any of the information that they're actually providing the service over, right? Very powerful idea. The intuition here is that, with a combination of machine learning, encrypted computation, and differential privacy, we could do the same thing for entire services.

So imagine going to the doctor, okay? You go to the doctor. This is really a computation between two different datasets. On the one hand, you have the dataset that the doctor has, which is their medical background, their knowledge of different procedures and diseases and tests and all this kind of stuff.

And then you have your dataset, which is your symptoms, your medical history, the things you've eaten recently, your genes, your genetic predispositions, your heritage, those kinds of things, right? And you're bringing these two datasets together to compute a function. And that function is: what treatment should you have, if any?

Okay? And the idea here, I should probably mention, is that there's this new field called structured transparency. I'm not even sure you can call it a new field yet, because it's not in the literature, but it's been bouncing around a few different circles.

And it's f(x, y), I'm not very good with chalk, sorry, and then this is z. Okay. So this is two different people providing their data, computing a function together, and producing an output. Differential privacy protects the output; encrypted computation, like the MPC we talked about earlier, protects the inputs, right?

So it allows them to compute f(x, y) without revealing their inputs. Remember this? Basically: encrypt x, encrypt y, compute the function while it's encrypted. Do we remember this? Right? So there are three pieces here: there's input privacy, which is MPC; there's the logic itself; and then there's output privacy.
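To make the three pieces concrete, here is a minimal Python sketch of two-party additive secret sharing over a prime field, for the simple case f(x, y) = x + y. It is an illustration only, not the protocol of any particular library; real systems need extra machinery (for example, multiplication triples for products), and the Gaussian noise at the end is just a stand-in for a properly calibrated differential privacy mechanism.

```python
# Minimal sketch of input privacy (secret sharing), logic (computing on
# shares), and output privacy (noise before public release). Illustrative
# only; all arithmetic is mod a large prime Q.

import random
from typing import Tuple

Q = 2_147_483_647


def share(secret: int) -> Tuple[int, int]:
    """Split a secret into two additive shares: a + b = secret (mod Q)."""
    a = random.randrange(Q)
    b = (secret - a) % Q
    return a, b


def reconstruct(a: int, b: int) -> int:
    return (a + b) % Q


# Input privacy: each party holds one share of x and one share of y,
# so neither party alone learns either input.
x, y = 5, 7
x_a, x_b = share(x)
y_a, y_b = share(y)

# Logic: for f(x, y) = x + y, each party just adds its own shares locally.
z_a = (x_a + y_a) % Q
z_b = (x_b + y_b) % Q

# Output privacy: if the result goes to the public, add calibrated noise
# before release; if it goes back to one input owner, the other party simply
# sends that owner its share of z.
z = reconstruct(z_a, z_b)
assert z == x + y
noisy_z = z + round(random.gauss(0, 1.0))  # stand-in for a real DP mechanism
```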

And this is what you need to be able to do end-to-end encrypted services. Okay. So imagine: there are machine learning models that can now do skin cancer prediction, right? I can take a picture of my arm, send it through a machine learning model, and it'll predict whether or not I have melanoma on my arm, right?

Okay. So in this case, one input is the machine learning model, perhaps owned by a hospital or a startup, and the other is the image of my arm, okay? Encrypt both; the logic is done by the machine learning model. If the prediction is going to be published to the rest of the world, you use differential privacy, but in this case the prediction can come back to me, and only I see the decrypted result, okay?

The implication being that the doctor role, facilitated by machine learning, can classify whether or not I have cancer, can provide this service, without anyone seeing my medical information. I can go to the doctor and get a prognosis without ever revealing my medical records to anyone, including the doctor, right?

Does that make sense? And if you believe that the services that are repeatable, that we do for millions and millions of people, can create a training dataset that we can then train a classifier on, then we should be able to upgrade those services to be end-to-end encrypted.

Does that make sense? So again, it's kind of big. It assumes that AI is smart enough to do it. There are lots of questions around quality, and quality assurance, and all these kinds of things that have to be addressed. There are very likely different institutions that we'll need.

But I hope that these three big categories, and this is by no means comprehensive, will be sufficient for helping lay the groundwork for how each person could be empowered with sole control over the only copies of their information, while still receiving the same goods and services they've become accustomed to.

Cool. Thanks. Questions? Let's do it. >> First, please give Andrew a big hand. Andrew, that was fascinating, really fascinating, an amazing set of ideas, and hope that this can really get rid of the sewage of data. This vision of end-to-end encrypted services: if I understand correctly, the algorithm would run on two or more servers, and the skin image would go to them, and then you would get the diagnosis.

But the diagnosis itself is not private, though, because the output of that is being revealed to the service provider. >> So it could optionally be revealed to the service provider. In this case, oh yeah, there's something I didn't say.

For secure MPC, for encrypted computation, with some exceptions, when you perform a computation between two encrypted numbers, the result is encrypted among the same shareholders, if that makes sense. Meaning that by default, z is still encrypted with the same keys as x and y, and then it's up to the key holders to decide who they want to decrypt it for.

So they could decrypt it for the general public, in which case they should apply differential privacy. They could decrypt it for the input owner, in which case the input owner isn't going to hurt anybody else by knowing whether he has a certain diagnosis. Or it could be decrypted for the model owner, perhaps to allow them to do more training or some other use case.

So it can be, but it's not a strict requirement. >> Just to be sure: if z is being computed by, say, two parties, then to send z back to y in this case, the machine knows what z is. So in that sense, even if you encrypt z with the key of y, there's no way to protect the output itself.

>> Maybe I haven't described this correctly. When we perform the encrypted computation, we split y into shares, so we'll say y1 and y2, right? y2 goes up here, right? And what actually happens is that this creates z1 and z2 at the end, right? Which are still owned by the person who owns y.

So we'll say this is Alice and this is Bob, right? So we have Bob's share and Alice's share. What gets produced is shares of z. So if Bob sends his share of z down to Alice, only Alice can decrypt the result. Does that make more sense?
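A tiny numeric illustration of this point, under the same simplified two-party additive sharing as in the earlier sketch (illustrative only): the result z only ever exists as two shares, each of which looks uniformly random on its own, and z is revealed only to whoever collects both shares.

```python
# The computation's result z exists only as shares (z_alice, z_bob); neither
# share alone reveals z. Only when Bob sends his share to Alice can she
# reconstruct it. Field size and sharing scheme are simplified for clarity.

import random

Q = 2_147_483_647  # arithmetic mod a large prime

z = 42                          # the (still hidden) result of the computation
z_alice = random.randrange(Q)   # Alice's share: uniformly random on its own
z_bob = (z - z_alice) % Q       # Bob's share: also uniformly random on its own

# Bob decides to decrypt the result *for Alice* by sending her his share.
reconstructed_by_alice = (z_alice + z_bob) % Q
assert reconstructed_by_alice == z  # Alice learns z; Bob alone learned nothing
```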

Okay, cool. >> So even the answer is- >> Even the answer is protected, yeah. And you would only need to use differential privacy in the case where you're planning to decrypt the result for some unknown audience to see. >> Some models are biased because of biases in the real data, and society tries to make unbiased models, with respect to gender, race, and so on.

How does that work with privacy, especially when everything is encrypted? How can you de-bias models when you cannot see the biases in the data? >> That's a great question. So the first gimme for that is that people don't ever really de-bias a model by physically reading the weights.

Right? So the fact that the weights are encrypted doesn't necessarily help or hurt you. Really, what it's about is just making sure that you provision enough of your privacy budget to allow you to do the introspection you need to measure and adjust for bias.
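As a rough sketch of what "provisioning part of the privacy budget" might look like in code, here is a toy epsilon accountant. The numbers and query names are made up, and real deployments would use a proper accountant from a differential privacy library rather than this simple counter.

```python
# Toy privacy-budget accountant: reserve part of the total epsilon for
# bias audits, and refuse queries once the budget is exhausted.

class PrivacyBudget:
    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float, purpose: str) -> None:
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError(f"Budget exceeded; cannot run query for {purpose}")
        self.spent += epsilon
        remaining = self.total_epsilon - self.spent
        print(f"Spent epsilon={epsilon} on {purpose}; remaining={remaining:.2f}")


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.7, "model training")                     # main training queries
budget.spend(0.2, "bias audit: accuracy by subgroup")   # reserved introspection
budget.spend(0.1, "bias audit: error-rate gap")
# Any further query would now raise, enforcing the overall guarantee.
```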

Is that sufficient? >> Yeah. >> Cool. Awesome. Great question. >> How far away do you think we are from organizations like the FDA requiring differential privacy to be used in regulating medical algorithms? >> So, the best answer I can give to that: one, I don't know.

And even laws in the UK regarding privacy, like GDPR, are not prescriptive about things like differential privacy. But I think the best and most relevant data point I have for you is that the US Census Bureau this year is going to be protecting the 2020 census data using differential privacy.

Some of the leading work on actually applying differential privacy in the world is going on at the US Census Bureau, and I'm sure they'd be interested in more helpers if anyone is interested in joining them. >> I guess her question was kind of one of my questions, but it was more just: how much buy-in are you getting in terms of adoption for OpenMined? Do you have any hospitals that are participating?

>> Yeah. So actually there are a few things I probably should have mentioned. OpenMined is about two and a half years old. In the very beginning, we had very little buy-in, because it was just so early. It was kind of like, who cares about privacy? No one's ever going to really care about that.

Post-GDPR, total change, right? Everyone's scrambling to protect their data. But the truth is, it's not just privacy, it's also commercial viability. Right now, if you're selling data, every time you sell it you lower the price, because you increase the supply and you increase the number of people that are also selling it.

So I think people are also waking up to the commercial reasons for protecting their own datasets and protecting the unique statistical signal they have. It's also worth mentioning that the PyTorch team recently sponsored $250,000 in open-source grants to fund people to work on our PySyft library, which is really good.

We're hoping to announce more grants of a similar size later in the year. So if you like working on open-source code and would like to get paid to do so, that option is there. To that extent, that's a big vote of confidence and a sign of buy-in as far as our community is concerned. So this year is when I hope to see the first pilots rolling out.

There are some that are in the works, but they're not public yet. But yeah, I think basically this is the year for pilots. That's about as far along as we are. >> And then I have another question that's on the opposite end of the spectrum, a little more into the technical weeds.

>> Cool. >> So when you're doing the encryption where you separate everything into shares across the different owners, how does that work for non-linear functions? Because you need that linearity to add it back together and for it to maintain the- >> Totally. So the non-linear functions are the most performance-intensive.

You get the biggest performance hit when you have to do them. For deep learning specifically, there are kind of two trends. One line of research is around using polynomial approximations, and the other line is around doing discrete comparison functions, which is good for ReLUs and is good for lopping off the ends of your polynomials so that the unstable tails can be flat.

And I would say that's about it. The science of trying to relax your security assumptions strategically here and there to get more performance is about where we're at. One thing worth mentioning, though, is that what I described was secure MPC on integers and fixed-precision numbers.

You can also do it on binary values, but then you take a huge performance hit doing it with binary, although you also get the ability to do things more classically, as in ordinary computing. Encrypted computation is sort of like doing computing in the 70s; you get a lot of the same kinds of constraints.
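To make those two ideas concrete, here is a small sketch in Python: fit a low-degree polynomial to the sigmoid over a bounded interval (cheap under MPC, since it only needs additions and multiplications) and clamp inputs to that interval so the polynomial's unstable tails never take over. The coefficients come from a plain least-squares fit and are illustrative, not tuned for any particular protocol.

```python
# Polynomial approximation of a non-linearity for encrypted computation:
# fit sigmoid with a degree-3 polynomial on [-5, 5] and clip inputs so we
# never evaluate the polynomial in its untrustworthy tails.

import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


# Least-squares fit over the interval where we intend to use it.
fit_xs = np.linspace(-5, 5, 1001)
coeffs = np.polyfit(fit_xs, sigmoid(fit_xs), deg=3)


def poly_sigmoid(x: np.ndarray) -> np.ndarray:
    x = np.clip(x, -5.0, 5.0)     # "lop off the ends" outside the fitted range
    return np.polyval(coeffs, x)  # only additions and multiplications remain


test = np.linspace(-8, 8, 17)
print(np.max(np.abs(sigmoid(test) - poly_sigmoid(test))))  # worst-case error
```

In a real protocol the clipping itself needs a secure comparison, which is why the comparison-based line of work he mentions pairs naturally with the polynomial one.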

>> Thank you very much for your talk, Andrew. >> Yeah. >> I'm wondering about your objective to ultimately allow every individual to assign a privacy budget. You mentioned that it would take a lot of work to provide the infrastructure for that to be possible. Do you have an idea for what kind of infrastructure is necessary?

And also, what about when people are reluctant, or even lazy, and they don't really care whether their data is protected? >> Yeah. >> I guess it takes some training, but what are your thoughts on building that infrastructure? >> I think it's going to come in waves.

It's the kind of thing where people don't usually invest money, time, and resources into things that aren't a straight shot to value. So I think there are probably going to be multiple discrete jumps. The first one is going to be just enterprise adoption. Enterprises are the ones that already have all the data.

So they're the natural ones to start adopting privacy-preserving technologies. I think that adoption is going to be driven primarily by commercial reasons, meaning my data is inherently more valuable if I can keep it scarce while still allowing people to answer questions with it.

Does that make sense? It's more profitable for me not to send copies of my data to people if I can instead have them bring their question-answering mechanisms to me and just get their questions answered. Does that make sense? That's not a privacy narrative, but I think that narrative is going to mature privacy technology quite quickly.

Post enterprise adoption, I think that's when the next wave comes. Encrypted services are still really hard at this point. The reason is that they require lots of compute and lots of network overhead, which means you probably want to have some machine that you control in the cloud, or have the Internet get a lot faster.

But there's this question of how we actually get to a world where each individual person knows about, or has notional control over, their own personal privacy budget. Let's say you had perfect enterprise adoption and everyone's tracking their stuff with differential privacy. The piece that's actually missing is just some communication between all the different enterprises; it's just an accounting mechanism.

It's a lot like the IRS: someone has to be there to make sure that you're not double-spending in different places, that your epsilon budget over here versus over here versus over here is all coming from the same place. It's not totally clear who this actor would be.

Maybe there's an app that just does it for you. Maybe there has to be an institution around it. Maybe it won't happen at all, maybe it'll just be decentralized, whatever. Another option is that there will actually be data banks. There's been some literature in the last couple of years saying, "Okay, maybe the institutions that currently handle your money might also be the banks where all of your information lives," and that becomes the gateway to your data, or something like that.

So there are different possibilities, because that would obviously make the accounting much easier, and it would also give you that cloud-to-cloud performance increase. But I think it's clear we wouldn't go to data banks or these kinds of centralized accounting registries directly, because you have to have the initial adoption first.

But if I had to guess, it's something like that. We won't see it for a while, and it's not even clear what it would look like, but I think it is possible; we just have to get through non-trivial adoption first. >> Thank you. >> Yeah. So it's kind of hazy, but that's predicting the future for you.

So I guess that's how that goes. >> I was just wondering if you could comment briefly on what you think is the biggest mistake being made with respect to recommendation system transparency, and on what you think might be the best solution. >> So I don't know if this is a mistake.

I would say the biggest opportunity for recommendation systems is that they have the potential to be more holistic. For example, if you recommend a movie to me based on whether or not it's most likely to keep me engaged, to keep me watching movies, that's not really a holistic recommendation.

It's not saying, "Hey, you should watch this because it's going to make your life more fulfilling, more satisfying, whatever"; it's just going to glue me to my television more. So I think the biggest opportunity, particularly with privacy-preserving machine learning, is that if a recommender system had the ability to access private data without actually seeing it, it could answer the question: how do I give the best recommendation so that this person gets a good night's sleep, or has more meaningful friendships, or whatever?

These are attributes that are particularly sensitive, but they're the things we actually want to optimize for. We could have vastly more beneficial recommendation systems than we do now, just by virtue of having better infrastructure for dealing with private data. So as far as the biggest limitation of recommender systems right now, it's just that they don't have access to enough information to have good targets.

Does that make sense? We would like for them to have better targets, but in order to do that, they have to have access to information about those targets. I think that's what privacy-preserving technologies could bring to bear on recommendation systems. >> Thanks. >> Yeah. Great question, by the way.

>> One more time, please give Andrew a big hand. Thank you so much. >> Thanks. >> Thanks for having me. >> Thank you.