Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series
Chapters
0:00 Introduction
0:54 Privacy preserving AI talk overview
1:28 Key question: Is it possible to answer questions using data we cannot see?
5:56 Tool 1: remote execution
8:44 Tool 2: search and example data
11:35 Tool 3: differential privacy
28:09 Tool 4: secure multi-party computation
36:37 Federated learning
39:55 AI, privacy, and society
46:23 Open data for science
50:35 Single-use accountability
54:29 End-to-end encrypted services
59:51 Q&A: privacy of the diagnosis
62:49 Q&A: removing bias from data when data is encrypted
63:40 Q&A: regulation of privacy
64:27 Q&A: OpenMined
66:16 Q&A: encryption and nonlinear functions
67:53 Q&A: path to adoption of privacy-preserving technology
71:44 Q&A: recommendation systems
00:00:00.000 |
Today, we're very happy to have Andrew Trask. 00:00:09.240 |
in the world of machine learning and artificial intelligence. 00:00:15.040 |
the book that I highly recommended in the lecture on Monday. 00:00:23.920 |
which is an open-source community that strives to make our algorithms, 00:00:27.820 |
our data, and our world in general more privacy-preserving. 00:00:36.480 |
complex, beautiful, sophisticated British accent, unfortunately. 00:00:47.880 |
>> Thanks. That was a very generous introduction. 00:00:54.440 |
So yeah, today we're going to be talking about privacy-preserving AI. 00:00:59.160 |
So the first is going to be looking at privacy tools 00:01:02.400 |
from the context of a data scientist or a researcher, 00:01:07.600 |
Because I think that's the best way to communicate 00:01:09.960 |
some of the new technologies that are coming about in that context. 00:01:14.800 |
under the assumption that these kinds of technologies become mature, 00:01:20.760 |
What consequences or side effects could these kinds of tools 00:01:29.860 |
is it possible to answer questions using data that we cannot see? 00:01:33.480 |
This is going to be the key question that we look at today. 00:01:39.840 |
So first, if we wanted to answer the question, 00:01:51.640 |
we would first need to download a dataset of tumor-related images. 00:01:54.680 |
So we'd be able to statistically study these and be able to 00:01:59.440 |
But this kind of data is not very easy to come by. 00:02:05.040 |
it's difficult to move around, it's highly regulated. 00:02:08.000 |
So we're probably going to have to buy it from 00:02:12.640 |
are able to actually collect and manage this kind of information. 00:02:19.200 |
likely to make this a relatively expensive purchase. 00:02:21.640 |
If it's going to be an expensive purchase for us to answer this question, 00:02:24.040 |
well, then we're going to find someone to finance our project. 00:02:27.700 |
we have to come up with a way of how we're going to pay them back. 00:02:35.240 |
looking for someone to start a business with us. 00:02:37.040 |
Now, it's because we wanted to answer the question, 00:02:41.600 |
What if we wanted to answer a different question? 00:02:48.360 |
Well, this would be a totally different story. 00:02:55.800 |
we download a state-of-the-art training script from GitHub, 00:03:02.120 |
handwritten digits with potentially superhuman ability, 00:03:07.000 |
Why is this so different between these two questions? 00:03:11.320 |
The reason is that getting access to private data, 00:03:21.120 |
our time working on problems and tasks like this. 00:03:25.780 |
Anybody who's trained a classifier on MNIST before? 00:03:28.680 |
Raise your hand. I expect pretty much everybody. 00:03:36.480 |
does anyone train a classifier to predict dementia, 00:03:50.960 |
So why is it that we spend all our time on tasks like this, 00:03:56.520 |
when these tasks, these represent our friends and loved ones, 00:04:00.040 |
and problems in society that really, really matter. 00:04:02.600 |
Not to say that there aren't people working on this. 00:04:04.480 |
It's absolutely, there are whole fields dedicated to it. 00:04:21.840 |
whether it's doing a startup or joining a hospital or what have you. 00:04:33.360 |
Is it possible to answer questions using data that we cannot see? 00:04:39.960 |
So in this talk, we're going to walk through a few different techniques. 00:04:46.920 |
the combination of these techniques is going to try to make it so that we can 00:04:54.760 |
pip install access to other deep learning tools. 00:04:57.280 |
The idea here is to lower the barrier to entry, 00:05:01.080 |
the most important problems that we would like to address. 00:05:09.720 |
which is an open-source community of a little over 6,000 people 00:05:14.600 |
the barrier to entry to privacy-preserving AI and machine learning. 00:05:18.880 |
working on that we're talking about today is called PySyft. 00:05:21.200 |
PySyft extends the major deep learning frameworks 00:05:24.800 |
with the ability to do privacy-preserving machine learning. 00:05:26.680 |
So specifically today, we're going to be looking 00:05:29.760 |
So PyTorch, people generally familiar with PyTorch, 00:05:34.440 |
It's my hope that by walking through a few of these tools, 00:05:39.480 |
it'll become clear how we can start to be able to do data science, 00:05:47.520 |
data that we don't actually have direct access to. 00:05:54.240 |
questions even if you're not necessarily a data scientist. 00:06:01.240 |
So we're going to jump into code for a minute, 00:06:03.800 |
but hopefully this is line by line and relatively simple. 00:06:08.880 |
We're looking at lists of numbers and these kinds of things. 00:06:12.360 |
we import Torch as a deep learning framework. 00:06:14.560 |
SIFT extends Torch with this thing called Torch Hook. 00:06:17.280 |
All it's doing is just iterating through the library and 00:06:19.200 |
basically monkey-patching in lots of new functionality. 00:06:22.000 |
Most deep learning frameworks are built around one core primitive, 00:06:27.320 |
For those of you who don't know what tensors are, 00:06:29.320 |
just think of them as nested list of numbers for now, 00:06:34.000 |
But for us, we introduce a second core primitive, 00:06:38.080 |
A worker is a location within which computation is going to be occurring. 00:06:42.680 |
So in this case, we have a virtualized worker that 00:06:49.280 |
The assumption that we have is that this worker will allow us to run 00:06:52.760 |
computation inside of the data center without us 00:06:55.200 |
actually having direct access to that worker itself. 00:06:59.480 |
white-listed set of methods that we can use on this remote machine. 00:07:06.040 |
so there's that core primitive we talked about a minute ago. 00:07:12.000 |
The first method that we added is called just dot send. 00:07:24.040 |
For those of you who are actually familiar with deep learning frameworks, 00:07:25.920 |
I hope that this will really resonate with you. 00:07:28.180 |
Because it has the full PyTorch API as a part of it, 00:07:32.200 |
but whenever you execute something using this pointer, 00:07:36.480 |
even though it looks like and feels like it's running locally, 00:07:39.080 |
it actually executes on the remote machine and 00:07:41.720 |
returns back to you another pointer to the result. 00:07:45.280 |
The idea here being that I can now coordinate remote executions, 00:07:52.040 |
necessarily having to have direct access to the machine. 00:07:55.120 |
Of course, I can also call dot get, and it's 00:08:00.000 |
really important to get permissions right around when you can do a dot get request 00:08:02.640 |
and actually ask for data from a remote machine to come back to you. 00:08:19.960 |
do data science on a machine that we don't have access to, that we don't own. 00:08:23.640 |
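To make the remote-execution idea concrete, here is a minimal sketch in the style of the PySyft API described above (TorchHook, VirtualWorker, send, get); the exact names follow the 0.2-era library and may differ in newer releases:

```python
import torch
import syft as sy

# Monkey-patch torch with the privacy-preserving functionality
hook = sy.TorchHook(torch)

# A (virtual) worker standing in for the hospital's data center
hospital = sy.VirtualWorker(hook, id="hospital")

# Send a tensor to the remote worker; we keep only a pointer to it
x = torch.tensor([1, 2, 3, 4, 5]).send(hospital)

# Operations on the pointer execute remotely and return another pointer
y = x + x

# Only an explicit (and permission-gated) .get() moves data back to us
print(y.get())  # tensor([ 2,  4,  6,  8, 10])
```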
But the problem is, the first con we want to address, 00:08:32.640 |
"I'm going to train a deep learning classifier." 00:08:34.200 |
But the process of answering questions is inherently iterative. 00:08:50.840 |
So in this case, let's say we have what's called a grid. 00:08:57.180 |
So again, this is all open-source Apache 2 stuff. 00:09:00.640 |
This is, we have what's called a grid client. 00:09:06.160 |
a large number of datasets inside of a big hospital. 00:09:12.040 |
to train a classifier to do something with diabetes. 00:09:16.760 |
certain kind of diabetes or certain attribute of diabetes. 00:09:23.040 |
I get back pointers to the remote information. 00:09:30.240 |
what the information is without me actually looking at it. 00:09:37.840 |
what the various ranges of the values can take on, 00:09:40.160 |
things that allow me to do remote normalization, 00:09:51.040 |
They could be actually short snippets from the actual dataset. 00:09:56.920 |
Maybe it's okay to release small amounts but not large amounts. 00:10:06.000 |
I used to work for a company called Digital Reasoning. 00:10:11.720 |
So we delivered AI services to corporations behind the firewall. 00:10:19.120 |
We worked with investment banks helping prevent insider trading. 00:10:22.600 |
Doing data science on data that your home team, 00:10:27.280 |
is not able to see is really, really challenging. 00:10:29.200 |
But there are some things that can give you the first big jump 00:10:33.680 |
before you jump into the more complex tools to 00:10:35.680 |
handle some of the more challenging use cases. 00:10:42.080 |
basic private search, and the ability to look at sample data, 00:10:48.960 |
to start doing things like feature engineering and evaluating quality. 00:10:52.880 |
So now the data remains in the remote machine. 00:10:58.520 |
Here's where things get a little more complicated. 00:11:18.840 |
Unfortunately, despite the fact that I'm doing all my remote execution, 00:11:25.200 |
well, I can just steal all the data that I want to. 00:11:27.400 |
I just call dot get on whatever pointers I want, 00:11:29.640 |
and there's no real additional security. 00:11:32.960 |
So what are we going to do about this? This brings us to 00:11:36.480 |
tool number three called differential privacy. 00:11:38.280 |
Differential privacy - anyone come across it? 00:11:55.320 |
and I'll give you resources for deeper dive in 00:11:57.480 |
differential privacy at the end of the talk, should you be interested. 00:12:05.880 |
statistical analysis without compromising the privacy of the dataset. 00:12:10.280 |
More specifically, it allows you to query a database, 00:12:14.120 |
while making certain guarantees about the privacy 00:12:16.880 |
of the records contained within the database. 00:12:23.480 |
look in the literature for differential privacy. 00:12:27.840 |
one row per person, and one column of zeros and ones, 00:12:32.120 |
We don't actually really care what those zeros and ones are indicating. 00:12:36.760 |
could be male-female, could be just some sensitive attributes, 00:12:46.640 |
statistical analysis doesn't compromise privacy. 00:12:48.520 |
What we're going to do is query this database. 00:12:50.400 |
We're going to run some function over the entire database, 00:12:55.480 |
and then we're going to ask a very important question. 00:12:57.760 |
We're going to ask, if I were to remove someone from this database, 00:13:04.240 |
say John, would the output of my function change? 00:13:17.960 |
well, this output is not conditioned on John's private information. 00:13:21.320 |
Now, if we could say that about everyone in the database, 00:13:25.020 |
well then, okay, it would be a perfectly privacy-preserving query, 00:13:32.440 |
But this intuitive definition, I think, is quite powerful. 00:13:35.680 |
The notion of how can we construct queries that are 00:13:37.840 |
invariant to removing someone or replacing them with someone else. 00:13:46.480 |
the output of a function can change as a result of removing or 00:13:50.320 |
replacing one of the individuals is known as the sensitivity. 00:13:54.640 |
So important, so if you're reading literature, 00:13:59.840 |
So what do we do when we have a really sensitive function? 00:14:03.880 |
We're going to take a bit of a sidestep for a minute. 00:14:07.120 |
I have a twin sister who's finishing a PhD in political science. 00:14:11.560 |
Political science, often they need to answer questions about 00:14:16.240 |
very taboo behavior, something that people are likely to lie about. 00:14:20.440 |
So let's say I wanted to survey everyone in this room and I wanted to answer 00:14:24.920 |
the question what percentage of you are secretly serial killers? 00:14:35.920 |
but because I genuinely want to understand this trend. 00:14:40.280 |
I'm not trying to be an instrument of the criminal justice system. 00:14:47.400 |
political scientist and understand this actual trend. 00:14:49.640 |
The problem is if I sit down with each one of you in a private room and I say, 00:14:53.960 |
I won't tell anybody," I'm still going to get a skewed distribution. 00:14:57.680 |
Some people are just going to be like, "Why would I risk 00:15:02.640 |
So what sociologists can do is this technique called randomized response, 00:15:08.720 |
You take a coin and you give it to each person before you survey them, 00:15:13.040 |
and you ask them to flip it twice somewhere that you cannot see. 00:15:16.280 |
So I would ask each one of you to flip a coin twice somewhere that I cannot see. 00:15:23.480 |
If the first coin flip is heads, answer honestly. 00:15:33.400 |
If it's tails, answer yes or no based on the second coin flip. 00:15:39.840 |
So half the time you'll be honest, and the other half of the time, 00:15:43.000 |
you'll be giving me a perfect 50-50 coin flip. 00:15:47.240 |
And the cool thing is that what this is actually doing, 00:15:49.880 |
is taking whatever the true mean of the distribution 00:15:51.840 |
is and averaging it with a 50-50 coin flip, right? 00:16:00.520 |
So if, say, 55 percent of you answered yes, that you are a serial killer, 00:16:05.960 |
then I know that the true center of the distribution is actually 60 percent, 00:16:10.000 |
because 55 percent is 60 percent averaged with a 50-50 coin flip. 00:16:12.680 |
Does that make sense? However, despite the fact that I can 00:16:15.720 |
recover the center of the distribution, right, 00:16:21.480 |
each individual person has plausible deniability. 00:16:27.360 |
or it could have been because you just happened to 00:16:33.520 |
Now this concept of adding noise to data to give 00:16:37.760 |
plausible deniability is sort of the secret weapon of differential privacy, right? 00:16:44.960 |
a set of mathematical proofs for trying to do this as efficiently as possible, 00:16:49.280 |
to give sort of the smallest amount of noise to get the most accurate results, 00:16:53.200 |
right, um, with the best possible privacy protections, right? 00:16:58.920 |
There's sort of a base trade-off that you run into - 00:17:02.360 |
there's kind of a Pareto trade-off, right? 00:17:10.160 |
the, the field of research that is differential privacy, um, 00:17:17.280 |
and resulting queries to give plausible deniability to the, 00:17:22.080 |
of a database or a training dataset. Does that make sense? 00:17:29.560 |
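As a small illustration of the coin-flip mechanic described above, here is a plain NumPy simulation (not part of any library mentioned in the talk) showing how the true rate is recovered from the noised answers:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
true_rate = 0.60                            # fraction who would truthfully answer "yes"
truth = rng.random(n) < true_rate

first_flip_heads = rng.random(n) < 0.5      # heads -> answer honestly
second_flip_heads = rng.random(n) < 0.5     # tails -> answer with this second coin

answers = np.where(first_flip_heads, truth, second_flip_heads)

noised_mean = answers.mean()                # roughly 0.5*true_rate + 0.25
recovered = 2 * noised_mean - 0.5           # invert the averaging with the 50-50 coin

print(round(noised_mean, 3), round(recovered, 3))   # ~0.55, ~0.60
```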
So there's local and there's global differential privacy. 00:17:31.760 |
So local differential privacy adds noise to data before it's sent to the statistician. 00:17:40.240 |
It affords you the best amount of protection because you never actually 00:17:43.400 |
reveal sort of in the clear your information to someone, right? 00:17:47.480 |
And then there's global differential privacy, 00:17:49.640 |
which says, okay, we're gonna put everything in the database, 00:17:52.160 |
perform a query, and then before the output of the query gets published, 00:17:55.440 |
we're gonna add a little bit of noise to the output of the query, okay? 00:17:58.160 |
This tends to have a much better privacy trade-off, 00:18:00.260 |
but you have to trust the database owner to not compromise the results, okay? 00:18:03.680 |
And we'll see there's some other things we can do there. 00:18:06.920 |
this is a good, good point for questions if you had any questions. 00:18:12.720 |
>> Is any of this process of differential privacy verifiable? 00:18:12.720 |
and one that actually absolutely comes up in practice. 00:18:22.640 |
local differential privacy, the nice thing is everyone's doing it for themself, right? 00:18:26.040 |
So in that sense, if you're flipping your own coins and answering your own questions, 00:18:35.680 |
stay tuned for the next tool and we'll, we'll come back to that. 00:18:38.880 |
All right. So what does this look like in code? 00:18:42.680 |
So first, we have a pointer to a remote private dataset we call dot get. 00:18:50.680 |
the raw value of some private data point which you cannot do, right? 00:18:53.560 |
Instead, pass in dot get epsilon to add the appropriate amount of noise. 00:18:59.520 |
uh, differential privacy. So I mentioned sensitivity, right? 00:19:05.520 |
the type of function that we wanted to do and it's in variance to, 00:19:07.840 |
um, removing or replacing individual entries in the, in the database. 00:19:10.880 |
Um, so epsilon is a measure of what we call our privacy budget, right? 00:19:15.800 |
And what our privacy budget is, is saying, okay, what, what's the, 00:19:18.400 |
what's the amount of, of statistical uniqueness that I'm going to sort of limit? 00:19:22.720 |
What's the upper bound for the amount of statistical uniqueness that I'm going to 00:19:25.080 |
allow to come out of this, out of this database? 00:19:27.160 |
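The dot get epsilon call is the PySyft-style interface; underneath, the standard global differential privacy recipe is to add noise scaled to the query's sensitivity divided by epsilon. A generic sketch of the Laplace mechanism (illustrative, not the library's internals):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=np.random.default_rng()):
    """Release a query result with epsilon-differential privacy.

    sensitivity: max change in the output if one person is added or removed
                 (e.g. 1 for a counting query).
    epsilon:     privacy budget spent on this single release.
    """
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_answer + noise

# e.g. "how many patients are in this cohort?" -- a count has sensitivity 1
private_count = laplace_mechanism(true_answer=4213, sensitivity=1, epsilon=0.3)
```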
And actually, I'm gonna take one more aside, 00:19:30.480 |
because I think it's really worth mentioning, um, data anonymization. 00:19:33.640 |
Anyone familiar with data anonymization come across this term before? 00:19:39.120 |
the social security numbers and like all this kind of stuff? 00:19:44.520 |
If you don't remember anything else from this talk, 00:19:46.280 |
it is very dangerous to do just data set anonymization, okay? 00:19:50.080 |
And differential privacy in, in some respects is, 00:19:52.480 |
is, is the formal version of data anonymization, 00:19:55.120 |
where instead of, instead of just saying, okay, 00:19:56.840 |
I'm just gonna redact out these pieces and then I'll be fine, um, 00:19:59.720 |
this is saying, okay, um, that we, we can do a lot better. 00:20:11.760 |
Netflix published an anonymized dataset, right? 00:20:18.040 |
And they took all the movies and replaced them with numbers, 00:20:20.440 |
and they took all the users and replaced them with numbers, 00:20:22.840 |
and then we just had sparsely populated movie ratings in this matrix, right? 00:20:37.960 |
meaning it, it kind of is its own fingerprint. 00:20:41.640 |
And so two months after the dataset was published, 00:20:48.080 |
um, I think it was, I think it was UT Austin, um, 00:20:54.960 |
and basically create the same matrix in IMDb, 00:20:59.840 |
And it turns out people that were into movie rating, 00:21:04.400 |
and, and were watching movies at similar times, 00:21:06.960 |
and similar, similar patterns, and similar tastes, right? 00:21:11.080 |
this first dataset with a high degree of accuracy. 00:21:17.440 |
I think it was a Massachusetts governor whose records ended up 00:21:21.080 |
being de-anonymized through very similar techniques. 00:21:23.640 |
So someone goes and buys an anonymized medical dataset over here that has, 00:21:32.560 |
gender, and whether or not you have cancer, right? 00:21:34.920 |
And, and when you get all these together, um, 00:21:37.600 |
you can start to sort of use the uniqueness in each one to, 00:21:46.880 |
unfortunately know of companies whose business model is to buy anonymized datasets, 00:21:52.480 |
de-anonymize them, and sell market intelligence to insurance companies. 00:22:00.080 |
And, and the reason that it can be done is that just 00:22:11.120 |
does not mean that there's enough unique statistical signal 00:22:20.560 |
on the, the statistical uniqueness that you're publishing in a dataset, right? 00:22:25.320 |
And so what, what this tool represents is saying, okay, 00:22:33.400 |
whatever computational graph led back to private data for this tensor, right? 00:22:49.520 |
there's, I'm only going to allow patterns that have occurred at least twice, right? 00:22:54.200 |
Okay. So meaning, meaning two different people had 00:22:57.000 |
this pattern and thus it's not unique to either one. Yes. 00:22:59.320 |
So what happens if you perform the query twice? 00:23:01.320 |
So the random noise would be re-randomized and, 00:23:03.440 |
and sent again and you're absolutely, absolutely correct. 00:23:05.960 |
So this epsilon, this is how much I'm spending with this query. 00:23:10.800 |
I would spend epsilon of 0.3. Does that make sense? 00:23:14.640 |
If I did this multiple times, the epsilons would sum. 00:23:16.920 |
And so for any given data science project, right? 00:23:21.800 |
an epsilon budget that you're not allowed to exceed, right? 00:23:24.160 |
No matter how many queries you participate in. 00:23:26.520 |
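Because the epsilons of repeated queries add up under basic composition, a data owner can enforce a project-level budget with bookkeeping as simple as the sketch below (illustrative only, not PySyft's actual accountant):

```python
class PrivacyBudget:
    """Track cumulative epsilon spent by a data scientist (basic composition)."""

    def __init__(self, max_epsilon):
        self.max_epsilon = max_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        if self.spent + epsilon > self.max_epsilon:
            raise PermissionError("privacy budget exceeded")
        self.spent += epsilon

budget = PrivacyBudget(max_epsilon=1.0)
budget.charge(0.3)   # first query
budget.charge(0.3)   # second query: total spent is now 0.6
# a further budget.charge(0.5) would raise, since 1.1 > 1.0
```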
Now, there's, there's another sort of subfield of differential privacy that's 00:23:34.400 |
So how can I perform sort of one query against 00:23:36.080 |
the whole data set and create a synthetic data set that has, 00:23:39.000 |
um, certain invariances that are desirable, right? 00:23:43.520 |
But then I can query this as many times as I want. 00:23:50.480 |
but we, we, we don't have to get into that now. 00:23:51.840 |
Does that answer your question? Cool. Awesome. 00:23:56.320 |
Like, how can we be answering questions while protecting individuals, 00:23:58.440 |
while keeping the statistical signal we need? 00:24:00.200 |
But like it's, it's the difference between, um, 00:24:02.440 |
it's the difference between if I have a data set and I wanna know what causes cancer, right? 00:24:07.540 |
I could query a data set and learn that smoking causes cancer 00:24:14.760 |
without learning which individual people are or are not smokers. Does that make sense? 00:24:19.560 |
is that I'm, I'm, I'm specifically looking for 00:24:22.080 |
patterns that are occurring multiple times across different people. 00:24:29.440 |
generalization that we want in machine learning statistics anyways. 00:24:32.400 |
Does that make sense? Like as machine learning practitioners, 00:24:35.280 |
we're actually not really interested in the one-offs, right? 00:24:39.200 |
I mean, sometimes our models memorize things. 00:24:42.360 |
But we're actually more interested in the things that are, 00:24:46.720 |
I want, I want the things that are gonna work, you know, 00:24:48.680 |
the, the heart treatments that are gonna work for everyone in this room, 00:24:52.960 |
I'd be happy, that'd be cool for you to have one. 00:24:54.440 |
But like what we're chiefly interested in are, 00:24:58.280 |
Which, which is why this is realistic, um, um, 00:25:01.400 |
and why with, with continued effort on both tooling and, 00:25:06.120 |
we can, we can have a much better, uh, reality than today. 00:25:15.640 |
that allows data to remain in the remote machine. 00:25:17.320 |
Search and sampling, we can feature engineer using toy data. 00:25:19.740 |
Differential privacy, we can have a formal rigorous privacy budgeting mechanism, right? 00:25:26.440 |
Is it defined by the user or is it defined by the dataset owner or, or someone else? 00:25:31.760 |
Um, this is a really, really interesting question actually. 00:25:34.480 |
Um, so first, it's definitely not set by the data scientist, 00:25:38.680 |
um, because that would be a bit of a conflict of interest. 00:25:43.520 |
you might say it should be the data owner, okay? 00:25:50.060 |
And make sure that their assets are protected both legally and, 00:25:54.580 |
So they're, they're trying to make, make money off this. 00:26:00.980 |
But the interesting thing, and this gets back to your question, 00:26:06.300 |
a radiology scan in two different hospitals, right? 00:26:13.700 |
of, of my privacy in each of these hospitals. 00:26:17.380 |
Right? That means that actually two epsilon of my private information is out there. 00:26:22.120 |
Right? And it just means that one person has to be 00:26:25.840 |
clever enough to go to both places to get the join. 00:26:28.680 |
This is actually the exact same mechanism we were talking about a second ago when 00:26:34.160 |
And so the, the true answer of who should be setting epsilon budgets, 00:26:38.600 |
although, um, logistically, it's gonna be challenging. 00:26:40.640 |
We'll talk about a little bit of this in, in, 00:26:41.880 |
in part two of the talk, but I'm going a little bit slow. 00:26:49.660 |
and it should be people around their own information, right? 00:26:52.860 |
You should be setting your personal epsilon budget. That makes sense? 00:26:58.140 |
Um, we've got a long way before we can get to that level of, 00:27:01.520 |
of infrastructure around these kinds of things. 00:27:06.620 |
and we can definitely talk about more of that in 00:27:18.000 |
Okay. Um, the two cons that we still- two weaknesses of 00:27:20.440 |
this approach that we still lack are- someone asked this question. 00:27:23.040 |
I think it was you. Yeah, yeah, you asked the question. 00:27:35.880 |
my model into the hospital to learn how to be a better cancer classifier, right? 00:27:44.580 |
I'm just sending it to a thousand different hospitals to get learned, to learn. 00:27:50.940 |
a computation across multiple different data owners, 00:28:01.060 |
how do I trust that these computations are actually happening the way 00:28:03.780 |
that I am telling the remote machine that they should happen? 00:28:18.400 |
Most machine learning people have not heard about this yet, 00:28:23.720 |
this is the coolest thing I've learned about since learning about, 00:28:26.680 |
This is a, this is a really, really cool technique. 00:28:28.600 |
Encrypted computation - anyone come across it? How about homomorphic encryption? 00:28:32.280 |
Okay, a few more. Yeah, this is related to that. 00:28:34.640 |
Um, so first, the kind of textbook definition is, is like this. 00:28:45.340 |
a function without revealing their inputs to each other, okay? 00:28:50.580 |
the implication of this is that multiple different individuals can 00:28:57.220 |
share ownership of a number. Let me show you what I mean. 00:29:15.720 |
They are now the shareholders of this number, okay? 00:29:20.160 |
and this number is shared between them, okay? 00:29:24.960 |
And this, this gives us several desirable properties. 00:29:27.000 |
First, it's encrypted from the standpoint that neither Bob, 00:29:34.880 |
encrypted between them by looking at their own share by itself. 00:29:48.240 |
decryption would be adding the shares together, 00:29:52.000 |
Um, so these would typically look like sort of 00:29:56.040 |
But for the sake of making it sort of intuitive, 00:29:58.080 |
I've picked pseudo-random numbers that are convenient to the eyes. 00:30:01.560 |
Um, so first, these two values are encrypted, 00:30:08.080 |
meaning that we cannot decrypt these numbers or do anything with 00:30:11.320 |
these numbers unless all of the shareholders agree, okay? 00:30:21.660 |
while this number is encrypted between these individuals, 00:30:26.120 |
So in this case, let's say we wanted to multiply 00:30:28.020 |
the shares times- or the encrypted number times two, 00:30:30.520 |
each person can multiply their share times two, 00:30:32.520 |
and now they have an encrypted number 10, right? 00:30:35.160 |
And there's a whole variety of protocols allowing you to do different functions, 00:30:39.360 |
um, such as the functions needed for machine learning, 00:30:41.840 |
um, while numbers are in this encrypted state, okay? 00:30:45.360 |
Um, and I'll give some more resources for you- for you if you're 00:30:47.440 |
interested in kind of learning more about this at the end as well. 00:30:50.280 |
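For intuition, additive secret sharing fits in a few lines of plain Python; this toy sketch uses two parties and a single modulus, whereas real protocols such as SPDZ add a lot more (multiplication triples, MACs, malicious-security checks):

```python
import random

Q = 2**31 - 1   # all arithmetic happens modulo a large number

def share(secret, n_parties=2):
    """Split `secret` into random-looking shares that sum to it mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

bob, alice = share(5)                  # neither share alone reveals the 5
assert reconstruct([bob, alice]) == 5

# Multiplying by a public constant: each party scales its own share locally
bob2, alice2 = (bob * 2) % Q, (alice * 2) % Q
assert reconstruct([bob2, alice2]) == 10
```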
Now, the big tie-in. Models and datasets are just large collections of numbers, 00:31:00.400 |
Um, now, specifically to reference your question, 00:31:07.680 |
you can tell if anyone does computation that you did 00:31:09.640 |
not sort of independently authorize, which is great. 00:31:12.920 |
So what does this look like in practice when you go back to the code? 00:31:20.960 |
it's not just one hospital because we're looking to have a shared governance, 00:31:23.240 |
shared ownership amongst multiple different individuals. 00:31:32.200 |
and instead of calling dot send and sending that tensor to someone else, 00:31:40.920 |
multiple different shares and distributes those amongst the shareholders, right? 00:31:46.680 |
However, in the frameworks that we're working on, 00:31:49.880 |
you still get kind of the same PyTorch-like interface, 00:31:52.760 |
and all the cryptographic protocol happens under the hood. 00:31:55.600 |
And the idea here is to make it so that we can sort of do 00:31:58.360 |
encrypted machine learning without you necessarily having to be a cryptographer, right? 00:32:04.160 |
the algorithms and machine learning people can automatically inherit them, right? 00:32:06.960 |
So kind of classic sort of open-source machine learning library, 00:32:10.280 |
making complex intelligence more accessible to people, if that makes sense. 00:32:20.800 |
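Concretely, in the 0.2-era PySyft interface this looks roughly like the following; names such as fix_precision, share, and crypto_provider follow that version of the library and may have changed since:

```python
import torch
import syft as sy

hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
alice = sy.VirtualWorker(hook, id="alice")
crypto = sy.VirtualWorker(hook, id="crypto_provider")   # helps generate multiplication triples

# Encode floats as fixed-precision integers, then secret-share across the workers
x = torch.tensor([0.1, 0.2, 0.3]).fix_precision().share(bob, alice, crypto_provider=crypto)
w = torch.tensor([2.0, 2.0, 2.0]).fix_precision().share(bob, alice, crypto_provider=crypto)

# Looks and feels like ordinary PyTorch, but runs under an SMPC protocol
y = x * w

# Decryption requires the shareholders' cooperation
print(y.get().float_precision())   # tensor([0.2000, 0.4000, 0.6000])
```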
encrypted prediction, and we're going to get into, 00:32:23.280 |
what kind of awesome use cases this opens up in a bit. 00:32:35.880 |
the MVP of doing privacy-preserving data science, right? 00:32:39.520 |
The idea being that I can have remote access to a remote dataset. 00:32:46.240 |
like, you know, what causes cancer without learning whether individuals have cancer. 00:32:53.760 |
just that sort of high-level information with 00:32:59.320 |
you know, what's sort of the filter that's coming back through here, right? 00:33:02.920 |
And I can work with datasets from multiple different data owners while making 00:33:06.320 |
sure that each, each individual data owners are protected. 00:33:20.000 |
this, this involves sending lots of information over, over the network. 00:33:27.400 |
is that this is a 13x slowdown over plaintext, 00:33:36.600 |
were like talking to each other, you know, they're relatively fast. 00:33:39.000 |
But we also haven't had any hardware optimization yet - to the extent that, 00:33:42.300 |
you know, NVIDIA did a lot for deep learning, 00:33:45.960 |
there will probably be some sort of Cisco-like player that does something similar 00:33:48.560 |
for doing kind of encrypted or secure MPC-based deep learning, right? 00:33:55.080 |
So this brings us back to kind of the fundamental question, 00:33:57.400 |
is it possible to answer questions using data we cannot see? 00:34:05.920 |
the sort of the theoretical frameworks that we have. 00:34:07.680 |
And actually, the other thing that's really worth mentioning here is that these come 00:34:10.240 |
from totally different fields which is why they kind of 00:34:12.400 |
haven't been necessarily combined that much yet. 00:34:13.920 |
I'll get, I'll get more into that in a second. 00:34:22.840 |
That'll open up your eyes to the potential that in general, 00:34:27.200 |
questions using information that we don't actually own ourselves. 00:34:32.760 |
that's net new for like us as a species, if that makes sense. 00:34:39.120 |
Before, we had to have like a 00:34:40.360 |
trusted third party who would then take all the information in 00:34:42.880 |
themselves and make some sort of neutral decision, right? 00:34:49.040 |
Um, and so one of the big sort of long-term goals of 00:34:53.160 |
infrastructure for this secure enough and robust enough. 00:34:55.440 |
And of course in like a free Apache 2 open source license kind of way, 00:35:01.440 |
information on the world's most important problems will be this accessible, right? 00:35:05.360 |
I'm gonna spend sort of less time working on, 00:35:08.680 |
um, tasks like that and more time working on tasks like this. 00:35:14.760 |
the breaking point between sort of part one and part two. 00:35:20.120 |
in sort of diving deeper on the technicals of this, um, 00:35:22.240 |
here's a, like a six or seven hour course that I 00:35:24.760 |
taught just on these concepts and on the tools. 00:35:28.600 |
Feel free to check it out. So the question was, 00:35:34.160 |
I specified that a model can be encrypted during training. 00:35:36.360 |
Is that same as homomorphic encryption or is that something else? 00:35:40.940 |
there was a, a big burst in literature around training on encrypted data, 00:35:44.680 |
um, where you would homomorphically encrypt the dataset. 00:35:46.800 |
And it turned out that some of the statistical regularities 00:35:48.960 |
of homomorphic encryption allowed you to actually train on 00:35:50.920 |
that dataset without, um, without decrypting it. 00:35:59.440 |
the one downside to that is that in order to use that model in the future, 00:36:04.040 |
you have to still be able to encrypt data with the same key, um, 00:36:10.000 |
is sort of constraining in practice and also there's a pretty big hit to privacy 00:36:12.680 |
because you're, you're training on data that inherently has a lot of noise added to it. 00:36:24.280 |
during training but inside the encryption, inside the box, right? 00:36:28.320 |
It's actually performing the same computations 00:36:31.680 |
So you don't get any degradation in accuracy, um, 00:36:33.880 |
and you don't get tied to one particular public private key pair. 00:36:38.840 |
specifically- so the question was can I comment on 00:36:40.480 |
federated learning specifically Google's implementation? 00:36:42.640 |
Um, so I think Google's implementation is, is great. 00:36:46.720 |
the fact that they've shown that this can be done hundreds of 00:36:53.280 |
um, uh, and creating momentum in that direction. 00:36:57.520 |
one thing that's worth mentioning is that there are two forms of federated learning. 00:37:00.960 |
Uh, one is sort of the one where your model is- federated learning, sorry. 00:37:13.000 |
um, basically the first thing I talked about. 00:37:22.280 |
you plug your phone in at night, attach to Wi-Fi. 00:37:24.120 |
You know when you text and it recommends the next word, 00:37:28.520 |
Um, that model is trained using federated learning. 00:37:35.080 |
and then that model gets uploaded to the Cloud as opposed to 00:37:37.800 |
uploading all of your tweets to the Cloud and training one global model. 00:37:40.120 |
Does that make sense? So, so plug your phone at night, 00:37:42.880 |
model comes down, trains locally, goes back up. 00:37:46.160 |
that's basically what federated learning is in a nutshell. 00:37:49.600 |
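In rough pseudocode, that "model comes down, trains locally, goes back up" loop is federated averaging. A toy PyTorch sketch, with no secure aggregation or differential privacy (which you would layer on top in practice):

```python
import copy
import torch

def federated_round(global_model, clients, local_steps=1, lr=0.01):
    """One round: each client trains a copy of the global model on its own
    local data, and the server averages the resulting weights."""
    client_states = []
    for data, target in clients:                     # each client holds (x, y) locally
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _ in range(local_steps):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(local(data), target)
            loss.backward()
            opt.step()
        client_states.append(local.state_dict())

    # Average parameters across clients: only model updates leave the device, not data
    avg = {k: torch.stack([s[k] for s in client_states]).mean(0)
           for k in client_states[0]}
    global_model.load_state_dict(avg)
    return global_model

model = torch.nn.Linear(3, 1)
clients = [(torch.randn(8, 3), torch.randn(8, 1)) for _ in range(4)]
model = federated_round(model, clients)
```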
it was pioneered, uh, by the Quark team at Google. 00:37:55.600 |
They've, they've paid down a lot of the technical debt, 00:38:03.040 |
outlining sort of how they do it, which is fantastic. 00:38:08.400 |
actually a slightly different style of federated learning. 00:38:10.560 |
Because there, there's federated learning with like a fixed dataset and a fixed model. 00:38:17.360 |
ephemeral like phones are constantly logging in and logging off. 00:38:22.000 |
you're plugging your phone at night and then you're taking it out, right? 00:38:29.280 |
It's, it's really useful for like product development, right? 00:38:33.640 |
a smartphone app that has a piece of intelligence in it, 00:38:37.760 |
prohibitively difficult for you to get access to the data for, 00:38:40.320 |
um, or you, you want to just have a value prop of protecting privacy, right? 00:38:43.460 |
That's what federated learning is good for. 00:38:45.760 |
What I've outlined here is a bit more exploratory federated learning, 00:38:50.980 |
instead of, um, the model being hosted in the Cloud and 00:38:53.820 |
data owners showing up and making it a bit smarter every once in a while, 00:38:56.840 |
now the data is going to be hosted at a variety of different private Clouds, right? 00:39:01.220 |
And data scientists are going to show up and say, "Hmm, 00:39:03.100 |
I want to do something with di- with diabetes today," or, "Hmm, 00:39:06.340 |
with, um, studying dementia today," something like that, right? 00:39:09.900 |
This is much more difficult because the attack vectors are much broader: 00:39:14.300 |
I'm trying to be able to answer arbitrary questions about 00:39:17.300 |
arbitrary datasets in a protected environment, right? 00:39:22.620 |
that's kind of my, my, my general thoughts on. 00:39:25.220 |
Does federated learning leak any information? 00:39:27.180 |
So federated learning by itself is not a secure protocol, right? 00:39:32.340 |
and that's why I sort of this ensemble of techniques that I've- so 00:39:35.860 |
the question was does federated learning leak information? 00:39:40.340 |
It's entirely possible for a federated learning model to simply memorize the dataset, 00:39:44.100 |
You have to combine it with something like differential privacy 00:39:46.260 |
in order to be able to prevent that from happening. Does that make sense? 00:39:50.580 |
the training is happening on my device does not mean it's not memorizing my data. 00:40:05.100 |
someone looking kind of globally at like, okay, 00:40:07.300 |
what- if, if this becomes mature, what happens? 00:40:10.020 |
Right? And, and this is where it gets really exciting. 00:40:17.220 |
Cool. Well, this is, this is the, this is the part for you. 00:40:23.820 |
is this ability to answer questions using data you can't see. 00:40:28.820 |
most people spend a great deal of their life just answering questions, 00:40:33.380 |
and a lot of it is involving sort of personal data. 00:40:35.500 |
I mean, whether it's minute things like, you know, 00:40:50.700 |
I mean, a, a wide variety of different questions, right? 00:40:55.380 |
our answering ability to the information that we have, right? 00:40:58.860 |
So this ability to answer questions using data we don't have, 00:41:01.540 |
sociologically I think is quite, quite important. 00:41:04.100 |
Um, and, um, there's four different areas that I want to 00:41:08.180 |
highlight as like big groups of use cases for this kind of technology, 00:41:13.780 |
um, to help kind of inspire you to see where this infrastructure can go. 00:41:16.540 |
And actually before, before I jump into that, 00:41:21.220 |
Cool. Uh, just to like the castle and stuff like that. 00:41:42.660 |
We did the ghost tour and, um, that was really cool. 00:41:45.460 |
Um, [LAUGHTER] there was one thing I took away from it. 00:41:50.740 |
we just walked out of the tunnels and she was pointing up some of the architecture. 00:42:02.220 |
basically the cobblestone streets and why the cobblestone streets are there. 00:42:07.300 |
Cobblestone streets, one of the main purposes of them was to sort of lift you out of the muck. 00:42:11.740 |
And the reason there was muck is that they didn't have 00:42:13.900 |
any internal plumbing, and so the sewage just poured out into the street, 00:42:21.620 |
And actually, I think she even sort of implied that like the invention or 00:42:24.100 |
popularization of the umbrella had less to do with actual rain, 00:42:26.860 |
and a bit more to do with buckets of stuff coming down from on high, 00:42:33.300 |
it's a whole different world like when you think about what that is. 00:42:36.180 |
Um, but the, the reason that I bring this up, um, 00:42:51.180 |
It was all over the place and people were walking through it everywhere they go, 00:42:54.680 |
and they were wondering why they got sick, right? 00:42:59.940 |
and it wasn't because they wanted it to be that way, 00:43:02.140 |
it's just because it was a natural consequence 00:43:03.700 |
of the technology that they had at the time, right? 00:43:05.580 |
This is not malice, this is not anyone being good or bad or, 00:43:11.900 |
Um, and I think that there's a strong analogy to be made with, 00:43:17.800 |
with kind of how our data is handled as society at the moment, right? 00:43:23.860 |
we've had new inventions come up and new things that are practical, 00:43:28.540 |
we're constantly spreading and spewing our data all over the place, right? 00:43:33.100 |
I mean, every, every camera that sees me walking down the street, you know, 00:43:36.660 |
goodness, there's a, there's a company that takes 00:43:38.300 |
a whole picture of the Earth by satellite every day. 00:43:40.580 |
Like, how the hell am I supposed to do anything without, 00:43:43.580 |
without, you know, everyone following me around all the time, right? 00:43:54.940 |
so I don't really know, but whoever it was that said, 00:43:57.620 |
"What if, what if we ran plumbing from every single apartment, 00:44:03.140 |
business, school, maybe even some public toilets, 00:44:10.820 |
used chemical treatments, and then turned that into usable drinking water?" 00:44:17.860 |
the most massive logistical infrastructure problem 00:44:24.900 |
to take already, already constructed buildings, 00:44:34.500 |
so old they don't have showers because they didn't want to run the plumbing for the head. 00:44:45.220 |
it just must have seemed absolutely massive. 00:44:49.100 |
And so, as I'm about to walk through kind of like 00:44:51.620 |
four broad areas where things could be different, 00:44:55.740 |
and I think it's probably going to hit you like, 00:45:00.140 |
Or like, "Whoa, that's, that's a lot of change." 00:45:10.460 |
if you view our lives as just one long process of answering important questions, 00:45:15.100 |
whether it's where we're going to get food or what causes cancer, 00:45:17.820 |
like making sure that, that the right people can answer questions without, 00:45:20.980 |
without, you know, data just getting spewed everywhere so that 00:45:23.620 |
the wrong people can answer their questions, right, is important. 00:45:30.220 |
so I know this is going to sound like there's 00:45:35.940 |
But I, I hope that, that you will at least see that, 00:45:41.460 |
And that really what stands between us and a world that's fundamentally better is the 00:45:47.620 |
maturing of the technology and good engineering. 00:45:52.420 |
once, you know, Sir Thomas Crapper invented the toilet, right? 00:45:58.220 |
at, at that point, the, the basics were there, right? 00:46:02.980 |
was implementation, adoption, and engineering, right? 00:46:05.980 |
And I, I think that that's, that's where we are. 00:46:16.460 |
of the, the early pieces of this technology, right? 00:46:19.340 |
Cool. So what are the, what are the big categories? 00:46:22.980 |
One I've already talked about, open data for science. 00:46:38.820 |
And the reason it's a really big deal is mostly 00:46:41.420 |
because everyone gets excited about making AI progress, right? 00:46:45.900 |
Everyone gets super excited about superhuman ability in X, Y, Z. 00:46:51.940 |
I worked for - my professor's name is Phil Blunsom. 00:46:54.660 |
The first thing he told me when I sat my butt down in his office in my first day as a student, 00:46:58.140 |
he said, "Andrew, everyone's gonna want to work on models. 00:47:01.620 |
the biggest jumps in progress have happened when we had 00:47:03.980 |
new big datasets or the ability to process new big datasets." 00:47:12.060 |
ImageNet, GPUs allowing us to process larger datasets. 00:47:18.860 |
This is synthetically generated infinite datasets. 00:47:26.740 |
It talked about how it had trained on like 200 years of, 00:47:35.620 |
Watson, the playing, playing Jeopardy, right? 00:47:40.420 |
of a new large structured dataset based on Wikipedia. 00:47:48.980 |
This was on the heels of the largest open dataset of chess, 00:47:53.260 |
matches ever having been published online, right? 00:47:53.260 |
There's this, there's this echo where like big new dataset, 00:47:59.660 |
big new dataset, big new breakthrough, right? 00:48:04.180 |
is potentially, you know, several orders of magnitude, 00:48:10.460 |
that we're not, I'm not saying we're gonna invent a new machine, 00:48:16.140 |
I'm saying there's thousands and thousands of enterprises, 00:48:21.940 |
and hundreds of, of governments that are all already have 00:48:24.340 |
this data sitting inside of data warehouses, right? 00:48:34.740 |
all of a sudden I just doubled the supply, right? 00:48:41.740 |
something bad with it that comes back to hurt me. 00:48:44.860 |
With this category, I know it's like just one phrase, 00:48:50.640 |
but for every data task that's already been established, right? 00:49:01.820 |
the psychology department who wants to study dementia, right? 00:49:06.660 |
is, is every hospital has like five cases, right? 00:49:11.540 |
it's not like all the, all the cancer patients go to, 00:49:25.220 |
And so he's investing in private data science platforms. 00:49:37.740 |
and this can unlock not larger amounts of data that exists, 00:49:47.980 |
Whereas instead of going out and trying to buy as many datasets as you can, 00:49:51.540 |
which is a really hard and really expensive task. 00:49:53.860 |
Talk to anyone who's in Silicon Valley right now, 00:49:57.100 |
Instead, you go to each individual person that has a dataset and you say, 00:50:00.660 |
"Hey, let me create a gateway between you and 00:50:04.420 |
the rest of the world that's gonna keep your data safe and allow people to leverage it." 00:50:07.340 |
Right? That's like a repeatable business model. 00:50:14.220 |
Be, be the radiology network gatekeeper, right? 00:50:21.820 |
But like, does it make sense how like on a huge variety of tasks, 00:50:26.940 |
a data box silo that you can do data science against, 00:50:32.060 |
a huge variety of models really, really, really quickly. 00:50:39.260 |
Oh, that's not right. Single use accountability. 00:51:02.960 |
Um, get to the airport and you get your bag checked, right? 00:51:08.120 |
Everyone's familiar with this process, I assume. 00:51:10.520 |
What happens? Someone's sitting at a monitor, 00:51:16.960 |
So that occasionally, they can spot objects that are dangerous or illicit, right? 00:51:27.980 |
that they have to sit and look at thousands of, 00:51:32.020 |
basically searching every single person's bag totally and completely, 00:51:34.960 |
just so that occasionally, they can find that one. 00:51:37.460 |
The question they actually want to answer is, "Is there anything dangerous in this bag?" But to answer that, 00:51:37.460 |
they have to basically acquire access to the whole bag, right? 00:51:52.260 |
the same approach of answering questions using data we can't see. 00:51:55.860 |
The best example of this in the analog world is a sniffing dog. 00:52:01.220 |
so give your bag a whiff at the airport, right? 00:52:03.640 |
This is actually a really privacy-preserving thing, 00:52:06.300 |
because dogs don't speak English or any other language. 00:52:12.160 |
the dog comes by, "Nope, everything's fine," moves on. 00:52:14.600 |
The dog has the ability to only reveal one bit of 00:52:19.940 |
information without you having to search every single bag. 00:52:24.300 |
Okay? That is what I mean when I say a single-use accountability system. 00:52:32.860 |
some data stream because I'm holding someone accountable, right? 00:52:37.060 |
And we want to make it so that I can only answer 00:52:39.380 |
the question that I claim to be looking into. 00:52:42.020 |
So if this is a video feed, right, for example, right? 00:52:44.980 |
Instead of getting access to the raw video feed, 00:52:48.940 |
information every single person in the frame of view, 00:52:51.620 |
walking around doing whatever, which I could use for, 00:52:55.820 |
I technically could use for, for other purposes. 00:53:07.820 |
that looks for whatever I'm supposed to be looking for, right? 00:53:12.780 |
you know, I only open up bags that actually have to. 00:53:25.220 |
our accountability systems more privacy-preserving, which is great. 00:53:27.340 |
Mitigates any potential dual or multi-use, right? 00:53:38.700 |
some things that were simply too off-limits for us to, 00:53:42.980 |
to properly hold people accountable might be possible, right? 00:53:46.500 |
One of the things that was really challenging, 00:53:48.580 |
so we used to do email surveillance at Digital Reasoning, right? 00:53:48.580 |
investment banks find insider traders, right? 00:53:57.940 |
they, you know, they get billion-dollar fines if, 00:54:03.500 |
But one of the things that was really difficult about developing 00:54:05.220 |
these kinds of systems was that it's so sensitive, right? 00:54:11.820 |
hundreds of millions of emails at some massive investment bank. 00:54:14.660 |
There's so much private information in there that say, 00:54:21.300 |
work with the data and try to make it better, right? 00:54:26.060 |
yeah, this makes it really, really difficult. 00:54:29.060 |
Third one, and this is the one I think is just incredibly exciting. 00:54:35.820 |
WhatsApp - everyone familiar with WhatsApp, Telegram, any of these? 00:54:57.100 |
and it's sent directly to someone else's phone, 00:54:59.780 |
and only that person's phone can decrypt it, right? 00:55:02.540 |
Which means that someone can provide a service, you know, 00:55:05.700 |
messaging without the service provider seeing any of 00:55:08.500 |
the information that they're actually providing the service over, right? 00:55:12.340 |
Very powerful idea. What if the intuition here is that, 00:55:21.660 |
encrypted computation, and differential privacy, 00:55:24.460 |
that we could do the same thing for entire services. 00:55:30.220 |
This is really a computation between two different datasets. 00:55:32.980 |
On the one hand, you have dataset that the doctor has, 00:55:53.100 |
your, your genes, your genetic predisposition, 00:55:56.780 |
And you're bringing these two datasets together, 00:56:02.500 |
what, what treatment should you have, if any? 00:56:13.380 |
this new field called structured transparency, 00:56:17.100 |
I'm not sure, I'm not even sure you can call it a new field yet, 00:56:25.180 |
because it's not in the literature, but it's been 00:56:48.660 |
So this, two different people providing their data together, 00:56:54.820 |
So differential privacy protects the output, 00:57:04.860 |
and secure multi-party computation, which we talked about earlier, protects the input, right? 00:57:10.340 |
So it allows them to, to, to compute F of X of Y, 00:57:13.660 |
right, without revealing their inputs. Remember this? 00:57:17.940 |
encrypt X, compute the function while it's encrypted. 00:57:22.100 |
Right? And so there's, there's three processes here, right? 00:57:26.980 |
there's input privacy, there's logic, and then there's output privacy. 00:57:32.060 |
And this is what you need to be able to do end-to-end encrypted services. 00:57:42.140 |
so there are machine learning models that can now do skin cancer detection - I can take a picture of my arm, 00:57:48.040 |
send it through a machine learning model, 00:57:50.100 |
and it'll predict whether or not I have melanoma on my arm, right? 00:57:56.940 |
machine learning model, perhaps owned by a hospital or a startup, 00:58:05.380 |
Encrypt both, the logic is done by the machine learning model. 00:58:09.940 |
The prediction, if it's gonna be published to the output, 00:58:15.620 |
you use differential privacy, but in this case, 00:58:26.100 |
the doctor role facilitated by machine learning can classify whether or not I have cancer, 00:58:31.700 |
can provide this service without anyone seeing my medical information. 00:58:35.260 |
I can go to the doctor and get a prognosis without ever revealing 00:58:39.220 |
my medical records to anyone including the doctor, right? 00:58:47.660 |
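A rough sketch of that encrypted-prediction flow in the same 0.2-era PySyft style; the model and feature tensors here are hypothetical stand-ins, and a real service would also apply differential privacy before publishing any output broadly:

```python
import torch
import syft as sy

hook = sy.TorchHook(torch)
patient = sy.VirtualWorker(hook, id="patient")
hospital = sy.VirtualWorker(hook, id="hospital")
crypto = sy.VirtualWorker(hook, id="crypto_provider")

# Hypothetical pre-trained skin-lesion classifier owned by the hospital
model = torch.nn.Linear(64, 2)

# Both the model's weights and the patient's features get secret-shared
enc_model = model.fix_precision().share(patient, hospital, crypto_provider=crypto)
enc_features = (torch.randn(1, 64)               # stand-in for the patient's image features
                .fix_precision()
                .share(patient, hospital, crypto_provider=crypto))

# The prediction is computed entirely on encrypted values
enc_prediction = enc_model(enc_features)

# The shareholders then decide for whom to reconstruct the result
prediction = enc_prediction.get().float_precision()
```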
if you believe that sort of the services that are repeatable, 00:58:51.860 |
that we do for millions and millions of people, right? 00:58:54.100 |
Can create a training dataset that we can then train a classifier to do, 00:58:58.540 |
then we should be able to upgrade it to be end-to-end encrypted. 00:59:06.860 |
It assumes that, that AI is smart enough to do it. 00:59:12.900 |
and like quality assurance, and all these kinds of things, 00:59:17.020 |
There's very likely to be different institutions that we need. 00:59:19.540 |
But I hope that at least these three sort of big categories, 00:59:23.940 |
but I hope at least these three big categories will be sort of 00:59:26.540 |
sufficient for helping sort of lay the groundwork for how sort of 00:59:31.740 |
sole control over the only copies of their information, 00:59:34.140 |
while still receiving the same goods and services they've become accustomed to. 00:59:47.700 |
Andrew, it was fascinating, really, really fascinating. 01:00:00.100 |
and the hope is that this can really get rid of the sewage of data. 01:00:04.500 |
This vision of end-to-end encrypted services, 01:00:11.700 |
the algorithm would also run on two or more services, 01:00:21.180 |
But the diagnosis itself is not private though, 01:00:24.020 |
because the output of that is being revealed to the service provider. 01:00:29.780 |
>> So it could optionally be revealed to the service provider. 01:00:32.580 |
So in this case, oh yeah, something I didn't say. 01:00:39.980 |
when you perform computation between two encrypted numbers, 01:00:43.260 |
the result is encrypted between the same shareholders, if that makes sense. 01:00:48.420 |
Z is still encrypted with the same keys as X and Y, 01:00:51.900 |
and then it's up to the key holders to decide who they want to decrypt it for. 01:00:55.120 |
So they could decrypt it for the general public, 01:00:56.740 |
in which case they should apply differential privacy. 01:01:02.100 |
in which case the input owner is not going to hurt anybody else by him knowing 01:01:09.900 |
or it could be decrypted for the model owner, 01:01:14.900 |
perhaps allow them to do more training or some other arbitrary use case. 01:01:18.700 |
So it can be, but not as a strict requirement. 01:01:22.660 |
>> Just to be sure, if Z is being computed by say two parties, 01:01:34.820 |
So in that sense, even if you encrypt Z with the key of Y, 01:01:47.360 |
So when we perform the encrypted computation, 01:02:06.960 |
what actually happens is this creates Z1 and Z2 at the end, right? 01:02:15.960 |
you know, person Y. So we'll say this is Alice and this is Bob. 01:02:22.680 |
Right? So we have Bob's share and Alice's share. 01:02:29.160 |
So if Alice or if Bob sends his share of Z down to Alice, 01:02:43.080 |
differential privacy in the case you're planning to 01:02:45.340 |
decrypt the result for some unknown audience to be able to see. 01:02:50.200 |
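In the additive-sharing picture used earlier, "decrypting Z for Alice" literally just means Bob sends Alice his share of Z. Continuing the earlier toy example (illustrative only):

```python
import random

Q = 2**31 - 1

def share(secret, n_parties=2):
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

# X and Y are each shared between Alice and Bob; adding shares pointwise shares Z = X + Y
x_alice, x_bob = share(5)
y_alice, y_bob = share(7)
z_alice, z_bob = (x_alice + y_alice) % Q, (x_bob + y_bob) % Q

# Bob "decrypts Z for Alice" by sending her his share of Z (and nothing else)
z = (z_alice + z_bob) % Q        # Alice alone can now reconstruct 12
assert z == 12
```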
>> Some models are biased because of biases in the real data - 01:03:08.920 |
how do you remove those biases when you cannot see the data, and so on? 01:03:15.400 |
the first gimme for that is that people don't 01:03:18.820 |
ever really de-bias a model by physically reading the weights. 01:03:21.480 |
Right? So the fact that the weights are encrypted 01:03:25.800 |
So really what it's about is just making sure that you provision 01:03:29.640 |
enough of your privacy budget to allow you to do 01:03:31.280 |
the introspection that you need to be able to 01:03:40.760 |
>> How far away do you think we are from organizations like 01:03:44.880 |
the FDA requiring differential privacy to be used 01:03:52.360 |
>> So I think the best answer I can give to that, 01:04:03.600 |
prescriptive about things like differential privacy. 01:04:06.360 |
But I think the best and most relevant data point I have for you on 01:04:09.440 |
that is that the US Census this year is going to be 01:04:14.000 |
the 2020 census data using differential privacy. 01:04:19.640 |
applying differential privacy in the world is going on at 01:04:27.840 |
>> So I guess her question was kind of one of my questions, 01:04:32.880 |
but it was more just like how much buy-in are you getting in terms of 01:04:40.880 |
like, do you have like any hospitals that are like participating or? 01:04:49.040 |
So the, so open mind is about two and a half years old. 01:04:55.600 |
In the very beginning, we had very little buy-in. 01:04:58.680 |
Because it was just so early, it was kind of like, 01:05:02.240 |
No one's ever going to sort of really, really care about that. 01:05:17.440 |
you lower the price because you increase the supply and you 01:05:19.520 |
increase the number of people that are also selling it. 01:05:22.000 |
So I think that there's people also waking up to kind of 01:05:26.800 |
your own datasets and protecting the unique statistical signal that they have. 01:05:31.720 |
It's also worth mentioning, so the PyTorch team recently 01:05:35.760 |
sponsored $250,000 in open-source grants to fund 01:05:39.280 |
people to work on our PySyft library, which is really good. 01:05:45.000 |
more grants of similar size later in the year. 01:05:49.120 |
open-source code and like to get paid to do so, 01:05:57.640 |
and buy-in as far as our community is concerned. 01:06:00.480 |
So this year is when I hope to see kind of the first pilots rolling out. 01:06:05.720 |
There are some that are sort of in the works, 01:06:10.800 |
But yeah, so I think basically this is the year for like pilots. 01:06:16.680 |
>> And then I have another question that's kind of on 01:06:19.240 |
the opposite end of the spectrum that's a little more technical weeds. 01:06:28.000 |
When you separate everything into shares across the different owners, 01:06:34.120 |
you need that linearity to add it back together - so what do you do about nonlinear functions? 01:06:47.840 |
So you get the biggest performance hit when you have to do them. 01:06:58.440 |
So one line of research is around using polynomial approximations. 01:07:03.160 |
And then the other line is around doing sort of discrete comparison functions. 01:07:18.520 |
And then like the science of kind of like trying to relax 01:07:20.600 |
your security assumptions strategically here and 01:07:22.280 |
there to get more performance is about where we're at. 01:07:24.520 |
But one thing that is worth mentioning though is that 01:07:31.840 |
we do secure MPC sort of on integers and fixed-precision numbers. 01:07:40.880 |
In that sense you get a huge performance gain over doing it with binary, 01:07:43.920 |
and you also get the ability to do things sort of more classically with computing. 01:07:47.120 |
Encrypted computation is sort of like doing computing in the 70s. 01:07:49.840 |
Like you get a lot of the same kind of constraints. 01:07:53.000 |
>> Thank you very much for your talk, Andrew. 01:07:55.880 |
>> I'm wondering about your objective to ultimately 01:08:00.640 |
allow every individual to assign a privacy budget. 01:08:05.720 |
You mentioned that it would take a lot of work to provide 01:08:14.720 |
Do you have an idea for what kind of infrastructure is necessary? 01:08:17.840 |
And also when people are reluctant, even perhaps lazy, 01:08:22.960 |
and they don't really care and they don't want their data to be protected. 01:08:31.600 |
what are your thoughts on building that infrastructure? 01:08:37.280 |
It's the kind of thing where people don't usually invest money and 01:08:39.440 |
time and resources into things that aren't like a straight shot to value. 01:08:42.560 |
So I think there's probably going to be multiple individual discrete jumps. 01:08:46.040 |
The first one is going to be just enterprise adoption. 01:08:48.480 |
Enterprises are the ones that already have all the data. 01:08:51.840 |
start adopting privacy-preserving technologies. 01:08:54.080 |
I think that that adoption is going to be driven 01:08:56.200 |
primarily by commercial reasons, 01:08:59.160 |
meaning my data is inherently more valuable if I can 01:09:01.360 |
keep it scarce while allowing people to answer questions with it. 01:09:03.800 |
Does that make sense? So it's more profitable for me to not send copies of 01:09:08.480 |
my data to people if I can actually have them bring 01:09:11.640 |
me and just get their questions answered. Does that make sense? 01:09:15.940 |
but I think that that narrative is going to mature 01:09:20.320 |
Post enterprise adoption, I think that's when, 01:09:30.480 |
and encrypted services are still really hard at this point. 01:09:35.640 |
lots of compute and lots of network overhead, 01:09:37.600 |
which means that you probably want to have something in the Cloud. 01:09:44.320 |
the Cloud or have the Internet get a lot faster. 01:09:47.120 |
But there's this question of how do we actually get to a world 01:09:55.920 |
notional control over their own personal privacy budget. 01:10:06.360 |
tracking their stuff with differential privacy. 01:10:08.360 |
The piece that you're actually missing here is just some communication between 01:10:13.320 |
all the different enterprises that are joining up, making sure 01:10:13.320 |
that you're not double-spending your epsilon in different places. 01:10:24.480 |
Your Epsilon budget that's over here versus over here, 01:10:31.480 |
versus over here is all coming from the same place. 01:10:34.000 |
It's not totally clear who this actor would be. 01:10:37.760 |
Maybe there's an app that just does it for you. 01:10:41.280 |
Maybe there has to be an institution around it. 01:10:48.240 |
Another option is that there'll actually be data banks. 01:10:51.440 |
There's been some literature in the last couple of years around saying, 01:10:53.880 |
"Okay, maybe institutions that they currently handle your money 01:10:58.840 |
might also be the bank where all of your information lives." 01:11:02.440 |
That becomes the gateway to your data or something like that. 01:11:07.080 |
because that would obviously make the accounting much easier. 01:11:09.600 |
Also, that would give you that Cloud-to-Cloud performance increase. 01:11:13.400 |
So I think it's clear we wouldn't go to data banks or 01:11:17.640 |
these kinds of centralized accounting registries 01:11:19.760 |
directly because you have to have the initial adoption first. 01:11:22.720 |
But if I had to guess, it's something like that. 01:11:28.400 |
It's not even clear what that would look like. 01:11:44.000 |
>> I was just wondering if you can comment briefly on what you 01:11:52.080 |
respect to recommendation systems transparency, 01:11:54.880 |
and if you can comment briefly on what you think might 01:12:09.600 |
if you recommended a movie to me based on whether or not it's 01:12:20.640 |
It's not saying, "Hey, you should do this because it's going 01:12:22.360 |
to make your life more fulfilling, more satisfied, whatever. 01:12:25.800 |
It's just going to glue me to my television more." 01:12:30.560 |
and particularly with privacy-preserving machine learning, 01:12:34.680 |
the ability to access private data without actually seeing it, 01:12:40.280 |
the best recommendation so that this person gets 01:12:45.320 |
Like these attributes that are actually particularly sensitive, 01:12:47.920 |
but there are things that we actually want to optimize for, 01:12:52.400 |
beneficial recommendation systems than we do now, 01:12:55.920 |
infrastructure for dealing with private data. 01:12:59.080 |
like the biggest limitation of recommender systems right now, 01:13:03.200 |
it's just that they don't have access to enough 01:13:07.320 |
Does that make sense? We would like for them to have better targets, 01:13:11.000 |
but in order to do that, they have to have access 01:13:14.480 |
I think that's what privacy-preserving technologies 01:13:16.920 |
could bring to bear on recommendation systems. 01:13:21.920 |
>> One more time, please give Andrew a big hand. Thank you so much.