Michael Kearns: Differential Privacy
Chapters
0:00 Differential Privacy
0:58 Counterfactual
1:42 What is Differential Privacy
3:25 Mechanism of Differential Privacy
00:00:00.000 |
So is there hope for any kind of privacy in a world where a few likes can identify you? 00:00:12.440 |
Yeah, so differential privacy basically is an alternate, much stronger notion of privacy 00:00:21.320 |
And it's a technical definition, but the spirit of it is we compare two alternate worlds. 00:00:31.360 |
So let's suppose I'm a researcher and I want to do... 00:00:36.160 |
There's a database of medical records and one of them is yours. 00:00:39.340 |
And I want to use that database of medical records to build a predictive model for some disease. 00:00:45.400 |
I want to know people's symptoms and test results and the like, I want to build a model 00:00:51.360 |
predicting the probability that people have the disease. 00:00:53.240 |
So this is the type of scientific research that we would like to be allowed to continue. 00:00:59.120 |
And in differential privacy, you ask a very particular counterfactual question. 00:01:08.540 |
One is when I build this model on the database of medical records, including your medical record. 00:01:18.240 |
And the other one is where I do the same exercise with the same database with just your medical record removed. 00:01:27.160 |
So basically, it's two databases, one with N records in it and one with N minus one records in it. 00:01:35.000 |
So the N minus one records are the same, and the only one that's missing in the second database is yours. 00:01:41.480 |
So differential privacy basically says that any harms that might come to you from the 00:01:51.640 |
analysis in which your data was included are essentially nearly identical to the harms 00:01:58.720 |
that would have come to you if the same analysis had been done without your medical record included. 00:02:04.760 |
So in other words, this doesn't say that bad things cannot happen to you as a result of the analysis. 00:02:10.840 |
It just says that these bad things were going to happen to you already, even if your data hadn't been included. 00:02:17.120 |
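For reference, the standard formal statement of this two-worlds guarantee (the symbols ε, M, D, D', and S below are the conventional textbook ones, not terms used in the conversation) says a randomized mechanism M is ε-differentially private if, for every pair of neighboring databases D and D' that differ in a single record, and every set S of possible outputs:

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```

A small ε means the two alternate worlds, with and without your record, induce nearly the same distribution over outcomes, so any particular harm is essentially as likely either way.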
And to give a very concrete example, right, you know, like we discussed at some length, 00:02:23.400 |
the study, you know, that was done in the '50s that established the link between smoking and lung cancer. 00:02:31.040 |
And we make the point that like, well, if your data was used in that analysis and, you 00:02:36.280 |
know, the world kind of knew that you were a smoker because, you know, there was no stigma 00:02:40.040 |
associated with smoking before those findings, real harm might've come to you as 00:02:45.800 |
a result of that study that your data was included in. 00:02:48.840 |
In particular, your insurer now might have a higher posterior belief that you might have lung cancer. 00:02:58.880 |
But the point is that if the same analysis had been done with all the other 00:03:05.000 |
N minus one medical records and just yours missing, the outcome would have been the same. 00:03:09.840 |
Your data wasn't idiosyncratically crucial to establishing the link between smoking and 00:03:16.600 |
lung cancer, because the link between smoking and lung cancer is like a fact about the world 00:03:21.480 |
that can be discovered with any sufficiently large database of medical records. 00:03:29.320 |
So that's showing that very little harm is done. 00:03:32.520 |
But what is the mechanism of differential privacy? 00:03:35.800 |
So that's the kind of beautiful statement of it. 00:03:38.280 |
But what's the mechanism by which privacy is preserved? 00:03:42.320 |
So it's basically by adding noise to computations, right? 00:03:45.640 |
So the basic idea is that every differentially private algorithm, first of all, or every 00:03:51.440 |
good differentially private algorithm, every useful one is a probabilistic algorithm. 00:03:56.400 |
So it's not deterministic: if you gave the algorithm the same input multiple times, 00:04:02.000 |
it would give different outputs each time, drawn from some distribution. 00:04:06.760 |
And the way you achieve differential privacy algorithmically is by kind of carefully and 00:04:10.840 |
tastefully adding noise to a computation in the right places. 00:04:16.400 |
And you know, to give a very concrete example, if I want to compute the average of a set 00:04:20.840 |
of numbers, right, the non-private way of doing that is to take those numbers and average 00:04:26.480 |
them and release like a numerically precise value for the average. 00:04:33.080 |
In differential privacy, you wouldn't do that. 00:04:35.240 |
You would first compute that average to numerical precision, and then you'd add some noise to it. 00:04:41.540 |
You'd add some kind of zero mean, you know, Gaussian or exponential noise to it so that 00:04:47.800 |
the actual value you output is not the exact mean, but it'll be close to the mean, but not exactly it. 00:04:55.160 |
The noise that you add will provably ensure that nobody can kind of reverse engineer any individual's data from the output. 00:05:07.280 |
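As a rough sketch of that noisy-average idea, here is a minimal illustrative implementation of the standard Laplace mechanism for a mean, assuming the values are clipped to a known range so the sensitivity is bounded; the function name and parameters are made up for this example and don't come from any particular library:

```python
import numpy as np

def private_mean(values, epsilon, lower=0.0, upper=1.0):
    """Release an epsilon-differentially private estimate of the mean.

    Each value is clipped to [lower, upper], so adding or removing one
    record changes the true mean by at most (upper - lower) / n (the
    sensitivity). Laplace noise with scale sensitivity / epsilon then
    masks any single record's contribution.
    """
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    n = len(values)
    true_mean = values.mean()
    sensitivity = (upper - lower) / n
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_mean + noise

# Repeated calls on the same data give different, but close, answers --
# exactly the probabilistic behavior described above.
data = [0.2, 0.4, 0.9, 0.5, 0.7]
print(private_mean(data, epsilon=0.5))
print(private_mean(data, epsilon=0.5))
```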
How many algorithms can be aided by adding noise? 00:05:12.720 |
Yeah, so I'm a relatively recent member of the differential privacy community. 00:05:18.120 |
My coauthor, Aaron Roth, is, you know, really one of the founders of the field and has done 00:05:22.920 |
a great deal of work, and I've learned a tremendous amount working with him on it. 00:05:29.560 |
But I must admit, the first time I saw the definition of differential privacy, my reaction 00:05:33.140 |
was like, "Wow, that is a clever definition, and it's really making very strong promises." 00:05:39.440 |
And my, you know, I first saw the definition in much earlier days, and my first reaction 00:05:44.640 |
was like, "Well, my worry about this definition would be that it's a great definition of privacy, 00:05:49.800 |
but that it'll be so restrictive that we won't really be able to use it." 00:05:53.520 |
Like, you know, we won't be able to compute many things in a differentially private way. 00:05:58.280 |
So one of the great successes of the field, I think, is in showing that the opposite 00:06:02.980 |
is true and that, you know, most things that we know how to compute absent any privacy 00:06:10.480 |
considerations can be computed in a differentially private way. 00:06:13.980 |
So for example, pretty much all of statistics and machine learning can be done differentially privately. 00:06:20.380 |
So pick your favorite machine learning algorithm: back propagation in neural networks, you 00:06:25.140 |
know, CART for decision trees, support vector machines, boosting, you name it, as well as 00:06:31.640 |
classic hypothesis testing and the like in statistics. 00:06:35.220 |
None of those algorithms are differentially private in their original form. 00:06:40.820 |
All of them have modifications that add noise to the computation in different places in 00:06:46.780 |
different ways that achieve differential privacy. 00:06:50.180 |
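To make that pattern concrete, here is a minimal toy sketch in the spirit of noisy gradient descent (often called DP-SGD) for logistic regression: per-example gradients are clipped to bound any one record's influence, and Gaussian noise is added before each update. The hyperparameters are illustrative and the privacy accounting (turning the noise level into an ε) is omitted, so this is a sketch of the general idea rather than the exact modification of any algorithm named above:

```python
import numpy as np

def dp_sgd_step(weights, X, y, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One noisy gradient step for logistic regression (DP-SGD-style sketch).

    Clipping each per-example gradient bounds how much any single record
    can move the update; Gaussian noise proportional to that bound then
    hides the remaining per-record contribution.
    """
    preds = 1.0 / (1.0 + np.exp(-X @ weights))        # sigmoid predictions
    per_example_grads = (preds - y)[:, None] * X      # one gradient per record
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=weights.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / len(X)
    return weights - lr * noisy_grad

# Toy usage on synthetic data: the model still learns the signal,
# even though every update has been noised.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(3)
for _ in range(50):
    w = dp_sgd_step(w, X, y)
print(w)
```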
So this really means that to the extent that, you know, we've become a scientific 00:06:56.420 |
community very dependent on the use of machine learning and statistical modeling and data 00:07:01.580 |
analysis, we really do have a path to kind of provide privacy guarantees to those methods. 00:07:10.040 |
And so we can still, you know, enjoy the benefits of kind of the data science era while providing, 00:07:17.840 |
you know, rather robust privacy guarantees to individuals.