
Michael Kearns: Differential Privacy


Chapters

0:00 Differential Privacy
0:58 Counterfactual
1:42 What is Differential Privacy
3:25 Mechanism of Differential Privacy

Whisper Transcript

00:00:00.000 | So is there hope for any kind of privacy in a world where a few likes can identify you?
00:00:09.440 | So there is differential privacy, right?
00:00:11.440 | What is differential privacy?
00:00:12.440 | Yeah, so differential privacy basically is an alternate, much stronger notion of privacy
00:00:17.900 | than these anonymization ideas.
00:00:21.320 | And it's a technical definition, but the spirit of it is we compare two alternate worlds.
00:00:31.360 | So let's suppose I'm a researcher and I want to do...
00:00:36.160 | There's a database of medical records and one of them is yours.
00:00:39.340 | And I want to use that database of medical records to build a predictive model for some
00:00:44.400 | disease.
00:00:45.400 | I want to take people's symptoms and test results and the like, and build a model
00:00:51.360 | predicting the probability that people have the disease.
00:00:53.240 | So this is the type of scientific research that we would like to be allowed to continue.
00:00:59.120 | And in differential privacy, you ask a very particular counterfactual question.
00:01:04.480 | We basically compare two alternatives.
00:01:08.540 | One is when I build this model on the database of medical records, including your medical
00:01:16.920 | record.
00:01:18.240 | And the other one is where I do the same exercise with the same database with just your medical
00:01:25.920 | record removed.
00:01:27.160 | So basically, it's two databases, one with N records in it and one with N minus one records
00:01:34.000 | in it.
00:01:35.000 | So the N minus one records are the same and the only one that's missing in the second
00:01:39.000 | case is your medical record.
00:01:41.480 | So differential privacy basically says that any harms that might come to you from the
00:01:51.640 | analysis in which your data was included are essentially nearly identical to the harms
00:01:58.720 | that would have come to you if the same analysis had been done without your medical record
00:02:03.480 | included.
00:02:04.760 | So in other words, this doesn't say that bad things cannot happen to you as a result
00:02:09.240 | of data analysis.
00:02:10.840 | It just says that these bad things were going to happen to you already, even if your data
00:02:16.120 | wasn't included.
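
Formally, the standard definition this two-worlds comparison leads to is: a randomized algorithm M is epsilon-differentially private if, for every pair of databases D and D' differing in a single record (the N-record and (N-1)-record databases above) and every set S of possible outputs,

\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S],

and symmetrically with D and D' swapped. A small epsilon means that, from the output alone, the world in which your record was included is nearly indistinguishable from the world in which it was left out.
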
00:02:17.120 | And to give a very concrete example, like we discussed at some length,
00:02:23.400 | the study that was done in the '50s that established
00:02:28.220 | the link between smoking and lung cancer.
00:02:31.040 | And we make the point that, well, if your data was used in that analysis and the
00:02:36.280 | world kind of knew that you were a smoker, because there was no stigma
00:02:40.040 | associated with smoking before those findings, real harm might've come to you as
00:02:45.800 | a result of that study that your data was included in.
00:02:48.840 | In particular, your insurer now might have a higher posterior belief that you might have
00:02:53.520 | lung cancer and raise your premium.
00:02:55.440 | So you've suffered economic damage.
00:02:58.880 | But the point is that if the same analysis had been done with all the other
00:03:05.000 | N minus one medical records and just yours missing, the outcome would have been the same.
00:03:09.840 | Your data wasn't idiosyncratically crucial to establishing the link between smoking and
00:03:16.600 | lung cancer, because the link between smoking and lung cancer is like a fact about the world
00:03:21.480 | that can be discovered with any sufficiently large database of medical records.
00:03:25.880 | But that's a very low value of harm.
00:03:28.320 | Yeah.
00:03:29.320 | So that's showing that very little harm is done.
00:03:31.520 | Great.
00:03:35.800 | But what is the mechanism of differential privacy?
00:03:35.800 | So that's the kind of beautiful statement of it.
00:03:38.280 | But what's the mechanism by which privacy is preserved?
00:03:41.320 | Yeah.
00:03:42.320 | So it's basically by adding noise to computations, right?
00:03:45.640 | So the basic idea is that every differentially private algorithm, first of all, or every
00:03:51.440 | good differentially private algorithm, every useful one is a probabilistic algorithm.
00:03:56.400 | So it's not deterministic on a given input; if you gave the algorithm the same input multiple times,
00:04:02.000 | it would give different outputs each time, drawn from some distribution.
00:04:06.760 | And the way you achieve differential privacy algorithmically is by kind of carefully and
00:04:10.840 | tastefully adding noise to a computation in the right places.
00:04:16.400 | And you know, to give a very concrete example, if I want to compute the average of a set
00:04:20.840 | of numbers, right, the non-private way of doing that is to take those numbers and average
00:04:26.480 | them and release like a numerically precise value for the average.
00:04:32.080 | Okay.
00:04:33.080 | In differential privacy, you wouldn't do that.
00:04:35.240 | You would first compute that average to numerical precision, and then you'd add some noise
00:04:40.540 | to it, right?
00:04:41.540 | You'd add some kind of zero-mean, you know, Gaussian or exponential noise to it so that
00:04:47.800 | the actual value you output is not the exact mean, but it'll be close to
00:04:54.160 | the mean.
00:04:55.160 | The noise that you add will sort of ensure that nobody can kind of reverse engineer any
00:05:01.640 | particular value that went into the average.
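
A minimal sketch of that noisy-average recipe in Python, using zero-mean Laplace (double-exponential) noise; the clipping range [lo, hi], the function name, and the parameter values are illustrative assumptions, not anything specified in the conversation:

import numpy as np

def dp_mean(values, lo, hi, epsilon, rng=None):
    # Differentially private mean via the Laplace mechanism (a sketch).
    # Each value is clipped to the assumed range [lo, hi], so a single record
    # can change the mean by at most (hi - lo) / n: that is the sensitivity.
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(values, dtype=float), lo, hi)
    sensitivity = (hi - lo) / len(x)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return x.mean() + noise

# Example: release an average age with privacy parameter epsilon = 0.5.
ages = [34, 29, 58, 41, 63, 47, 52, 38]
print(dp_mean(ages, lo=0, hi=100, epsilon=0.5))

The smaller epsilon is, the larger the noise scale and the stronger the guarantee that no individual value can be reverse engineered from the released average.
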
00:05:04.520 | So noise is the savior.
00:05:07.280 | How many algorithms can be aided by adding noise?
00:05:12.720 | Yeah, so I'm a relatively recent member of the differential privacy community.
00:05:18.120 | My coauthor, Aaron Roth, is, you know, really one of the founders of the field and has done
00:05:22.920 | a great deal of work, and I've learned a tremendous amount working with him on it.
00:05:26.560 | It's a pretty grown-up field already.
00:05:28.280 | Yeah, by now it's pretty mature.
00:05:29.560 | But I must admit, the first time I saw the definition of differential privacy, my reaction
00:05:33.140 | was like, "Wow, that is a clever definition, and it's really making very strong promises."
00:05:39.440 | And, you know, I first saw the definition in much earlier days, and my first reaction
00:05:44.640 | was like, "Well, my worry about this definition would be that it's a great definition of privacy,
00:05:49.800 | but that it'll be so restrictive that we won't really be able to use it."
00:05:53.520 | Like, you know, we won't be able to compute many things in a differentially private way.
00:05:58.280 | So that's one of the great successes of the field, I think, is in showing that the opposite
00:06:02.980 | is true and that, you know, most things that we know how to compute absent any privacy
00:06:10.480 | considerations can be computed in a differentially private way.
00:06:13.980 | So for example, pretty much all of statistics and machine learning can be done differentially
00:06:18.900 | privately.
00:06:20.380 | So pick your favorite machine learning algorithm: backpropagation in neural networks, you
00:06:25.140 | know, CART for decision trees, support vector machines, boosting, you name it, as well as
00:06:31.640 | classic hypothesis testing and the like in statistics.
00:06:35.220 | None of those algorithms are differentially private in their original form.
00:06:40.820 | All of them have modifications that add noise to the computation in different places in
00:06:46.780 | different ways that achieve differential privacy.
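
As one concrete illustration of how such a modification typically looks, here is a rough sketch in the spirit of noisy gradient descent (DP-SGD style), not the exact algorithms referenced above; the logistic-regression setup, clipping norm, and noise multiplier are assumptions, and the formal (epsilon, delta) accounting is omitted:

import numpy as np

def noisy_gradient_descent(X, y, epochs=20, lr=0.1, clip_norm=1.0,
                           noise_mult=1.1, seed=0):
    # Sketch of differentially private training for logistic regression:
    # clip each example's gradient so no single record can dominate,
    # then add Gaussian noise calibrated to that clipping norm.
    # X is an (n, d) array of features, y an array of 0/1 labels.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))           # predicted probabilities
        grads = (p - y)[:, None] * X               # per-example gradients, shape (n, d)
        norms = np.linalg.norm(grads, axis=1, keepdims=True)
        grads = grads / np.maximum(1.0, norms / clip_norm)   # bound each record's influence
        noisy_sum = grads.sum(axis=0) + rng.normal(0.0, noise_mult * clip_norm, size=d)
        w -= lr * noisy_sum / n
    return w

The privacy comes from the same two ingredients as the average example: bound each individual's contribution, then add noise proportional to that bound.
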
00:06:50.180 | So this really means that to the extent that, you know, we've become a scientific
00:06:56.420 | community very dependent on the use of machine learning and statistical modeling and data
00:07:01.580 | analysis, we really do have a path to kind of provide privacy guarantees to those methods.
00:07:10.040 | And so we can still, you know, enjoy the benefits of kind of the data science era while providing,
00:07:17.840 | you know, rather robust privacy guarantees to individuals.
00:07:21.180 | Thank you.
00:07:21.680 | [end]