Michael Kearns: Differential Privacy
Chapters
0:00 Differential Privacy
0:58 Counterfactual
1:42 What is Differential Privacy
3:25 Mechanism of Differential Privacy
00:00:00.000 |
So is there hope for any kind of privacy in a world where a few likes can identify you? 00:00:12.440 |
Yeah, so differential privacy basically is an alternate, much stronger notion of privacy 00:00:21.320 |
And it's a technical definition, but the spirit of it is we compare two alternate worlds. 00:00:31.360 |
So let's suppose I'm a researcher and I want to do... 00:00:36.160 |
There's a database of medical records and one of them is yours. 00:00:39.340 |
And I want to use that database of medical records to build a predictive model for some disease. 00:00:45.400 |
I want to know people's symptoms and test results and the like, I want to build a model 00:00:51.360 |
predicting the probability that people have the disease. 00:00:53.240 |
So this is the type of scientific research that we would like to be allowed to continue. 00:00:59.120 |
And in differential privacy, you ask a very particular counterfactual question. 00:01:08.540 |
One is when I build this model on the database of medical records, including your medical record. 00:01:18.240 |
And the other one is where I do the same exercise with the same database with just your medical record removed. 00:01:27.160 |
So basically, it's two databases, one with N records in it and one with N minus one records in it. 00:01:35.000 |
So the N minus one records are the same, and the only one that's missing in the second database is yours. 00:01:41.480 |
So differential privacy basically says that any harms that might come to you from the 00:01:51.640 |
analysis in which your data was included are essentially nearly identical to the harms 00:01:58.720 |
that would have come to you if the same analysis had been done without your medical record included. 00:02:04.760 |
So in other words, this doesn't say that bad things cannot happen to you as a result of the analysis. 00:02:10.840 |
It just says that these bad things were going to happen to you already, even if your data hadn't been included. 00:02:17.120 |
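For reference, the standard formal statement of this two-worlds guarantee (the symbols ε, M, D, D', and S below are the conventional textbook ones, not terms used in the conversation) says a randomized mechanism M is ε-differentially private if, for every pair of neighboring databases D and D' that differ in a single record, and every set S of possible outputs:

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
```

A small ε means the two alternate worlds, with and without your record, induce nearly the same distribution over outcomes, so any particular harm is essentially as likely either way.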
And to give a very concrete example, right, you know, like we discussed at some length, 00:02:23.400 |
the study, you know, that was done in the '50s that established the link between smoking and lung cancer. 00:02:31.040 |
And we make the point that like, well, if your data was used in that analysis and, you 00:02:36.280 |
know, the world kind of knew that you were a smoker because, you know, there was no stigma 00:02:40.040 |
associated with smoking before those findings, real harm might've come to you as 00:02:45.800 |
a result of that study that your data was included in. 00:02:48.840 |
In particular, your insurer now might have a higher posterior belief that you might have lung cancer. 00:02:58.880 |
But the point is that if the same analysis had been done with all the other 00:03:05.000 |
N minus one medical records and just yours missing, the outcome would have been the same. 00:03:09.840 |
Your data wasn't idiosyncratically crucial to establishing the link between smoking and 00:03:16.600 |
lung cancer, because the link between smoking and lung cancer is like a fact about the world 00:03:21.480 |
that can be discovered with any sufficiently large database of medical records. 00:03:29.320 |
So that's showing that very little harm is done. 00:03:32.520 |
But what is the mechanism of differential privacy? 00:03:35.800 |
So that's the kind of beautiful statement of it. 00:03:38.280 |
But what's the mechanism by which privacy is preserved? 00:03:42.320 |
So it's basically by adding noise to computations, right? 00:03:45.640 |
So the basic idea is that every differentially private algorithm, first of all, or every 00:03:51.440 |
good differentially private algorithm, every useful one is a probabilistic algorithm. 00:03:56.400 |
So it's not deterministic: if you gave the algorithm the same input multiple times, 00:04:02.000 |
it would give different outputs each time, drawn from some distribution. 00:04:06.760 |
And the way you achieve differential privacy algorithmically is by kind of carefully and 00:04:10.840 |
tastefully adding noise to a computation in the right places. 00:04:16.400 |
And you know, to give a very concrete example, if I want to compute the average of a set 00:04:20.840 |
of numbers, right, the non-private way of doing that is to take those numbers and average 00:04:26.480 |
them and release like a numerically precise value for the average. 00:04:33.080 |
In differential privacy, you wouldn't do that. 00:04:35.240 |
You would first compute that average to numerical precision, and then you'd add some noise to it. 00:04:41.540 |
You'd add some kind of zero mean, you know, Gaussian or exponential noise to it so that 00:04:47.800 |
the actual value you output is not the exact mean, but it'll be close to the mean, but not exactly it. 00:04:55.160 |
The noise that you add will provably ensure that nobody can kind of reverse engineer any individual's data from the output. 00:05:07.280 |
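As a rough sketch of that noisy-average idea, here is a minimal illustrative implementation of the standard Laplace mechanism for a mean, assuming the values are clipped to a known range so the sensitivity is bounded; the function name and parameters are made up for this example and don't come from any particular library:

```python
import numpy as np

def private_mean(values, epsilon, lower=0.0, upper=1.0):
    """Release an epsilon-differentially private estimate of the mean.

    Each value is clipped to [lower, upper], so adding or removing one
    record changes the true mean by at most (upper - lower) / n (the
    sensitivity). Laplace noise with scale sensitivity / epsilon then
    masks any single record's contribution.
    """
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    n = len(values)
    true_mean = values.mean()
    sensitivity = (upper - lower) / n
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_mean + noise

# Repeated calls on the same data give different, but close, answers --
# exactly the probabilistic behavior described above.
data = [0.2, 0.4, 0.9, 0.5, 0.7]
print(private_mean(data, epsilon=0.5))
print(private_mean(data, epsilon=0.5))
```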
How many algorithms can be aided by adding noise? 00:05:12.720 |
Yeah, so I'm a relatively recent member of the differential privacy community. 00:05:18.120 |
My coauthor, Aaron Roth, is, you know, really one of the founders of the field and has done 00:05:22.920 |
a great deal of work, and I've learned a tremendous amount working with him on it. 00:05:29.560 |
But I must admit, the first time I saw the definition of differential privacy, my reaction 00:05:33.140 |
was like, "Wow, that is a clever definition, and it's really making very strong promises." 00:05:39.440 |
And my, you know, I first saw the definition in much earlier days, and my first reaction 00:05:44.640 |
was like, "Well, my worry about this definition would be that it's a great definition of privacy, 00:05:49.800 |
but that it'll be so restrictive that we won't really be able to use it." 00:05:53.520 |
Like, you know, we won't be able to compute many things in a differentially private way. 00:05:58.280 |
So one of the great successes of the field, I think, is in showing that the opposite 00:06:02.980 |
is true and that, you know, most things that we know how to compute absent any privacy 00:06:10.480 |
considerations can be computed in a differentially private way. 00:06:13.980 |
So for example, pretty much all of statistics and machine learning can be done differentially privately. 00:06:20.380 |
So pick your favorite machine learning algorithm: back propagation in neural networks, you 00:06:25.140 |
know, CART for decision trees, support vector machines, boosting, you name it, as well as 00:06:31.640 |
classic hypothesis testing and the like in statistics. 00:06:35.220 |
None of those algorithms are differentially private in their original form. 00:06:40.820 |
All of them have modifications that add noise to the computation in different places in 00:06:46.780 |
different ways that achieve differential privacy. 00:06:50.180 |
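To make that pattern concrete, here is a minimal toy sketch in the spirit of noisy gradient descent (often called DP-SGD) for logistic regression: per-example gradients are clipped to bound any one record's influence, and Gaussian noise is added before each update. The hyperparameters are illustrative and the privacy accounting (turning the noise level into an ε) is omitted, so this is a sketch of the general idea rather than the exact modification of any algorithm named above:

```python
import numpy as np

def dp_sgd_step(weights, X, y, lr=0.1, clip_norm=1.0, noise_mult=1.0):
    """One noisy gradient step for logistic regression (DP-SGD-style sketch).

    Clipping each per-example gradient bounds how much any single record
    can move the update; Gaussian noise proportional to that bound then
    hides the remaining per-record contribution.
    """
    preds = 1.0 / (1.0 + np.exp(-X @ weights))        # sigmoid predictions
    per_example_grads = (preds - y)[:, None] * X      # one gradient per record
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=weights.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / len(X)
    return weights - lr * noisy_grad

# Toy usage on synthetic data: the model still learns the signal,
# even though every update has been noised.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(3)
for _ in range(50):
    w = dp_sgd_step(w, X, y)
print(w)
```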
So this really means that to the extent that, you know, we've become a scientific 00:06:56.420 |
community very dependent on the use of machine learning and statistical modeling and data 00:07:01.580 |
analysis, we really do have a path to kind of provide privacy guarantees to those methods. 00:07:10.040 |
And so we can still, you know, enjoy the benefits of kind of the data science era while providing, 00:07:17.840 |
you know, rather robust privacy guarantees to individuals.