Back to Index

Michael Kearns: Differential Privacy


Chapters

0:00 Differential Privacy
0:58 Counterfactual
1:42 What is Differential Privacy
3:25 Mechanism of Differential Privacy

Transcript

So is there hope for any kind of privacy in a world where a few likes can identify you? So there is differential privacy, right? What is differential privacy? Yeah, so differential privacy is basically an alternative, much stronger notion of privacy than these anonymization ideas. It's a technical definition, but the spirit of it is that we compare two alternate worlds.

So let's suppose I'm a researcher. There's a database of medical records, and one of them is yours, and I want to use that database to build a predictive model for some disease. Given people's symptoms and test results and the like, I want to build a model predicting the probability that they have the disease.

So this is the type of scientific research that we would like to be allowed to continue. And in differential privacy, you ask a very particular counterfactual question. We basically compare two alternatives. One is when I build this model on the database of medical records, including your medical record. And the other one is where I do the same exercise with the same database with just your medical record removed.

So basically, it's two databases, one with N records in it and one with N minus one records in it. The N minus one records are the same, and the only one that's missing in the second case is your medical record. So differential privacy basically says that any harms that might come to you from the analysis in which your data was included are essentially identical to the harms that would have come to you if the same analysis had been done without your medical record included.
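For readers who want the formal statement behind that comparison, this is the standard epsilon-differential-privacy condition; the notation M, D, D', S, and epsilon is the field's usual convention, not language from the conversation.

```latex
% M is the randomized analysis, D and D' are the two databases that differ
% only in your record, S is any set of possible outcomes, and \varepsilon is
% the privacy parameter (smaller means stronger privacy).
\Pr\big[M(D) \in S\big] \;\le\; e^{\varepsilon}\,\Pr\big[M(D') \in S\big]
```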

So in other words, this doesn't say that bad things cannot happen to you as a result of data analysis. It just says that these bad things were going to happen to you already, even if your data wasn't included. And to give a very concrete example, like we discussed at some length, there was the study done in the '50s that established the link between smoking and lung cancer.

And we make the point that if your data was used in that analysis, and the world knew that you were a smoker (there was no stigma associated with smoking before those findings), real harm might have come to you as a result of that study your data was included in.

In particular, your insurer might now have a higher posterior belief that you have lung cancer and raise your premium. So you've suffered economic damage. But the point is that if the same analysis had been done with all the other N minus one medical records and just yours missing, the outcome would have been the same.

Your data wasn't idiosyncratically crucial to establishing the link between smoking and lung cancer, because that link is a fact about the world that can be discovered with any sufficiently large database of medical records. But that's a very low value of harm.

Yeah. So that's showing that very little harm is done. Great. But what is the mechanism of differential privacy? That's the kind of beautiful statement of it, but what's the mechanism by which privacy is preserved? Yeah, so it's basically by adding noise to computations. The basic idea is that every differentially private algorithm, or at least every good, useful one, is a probabilistic algorithm.

So it doesn't give a deterministic answer on a given input: if you gave the algorithm the same input multiple times, it would give different outputs each time, drawn from some distribution. And the way you achieve differential privacy algorithmically is by carefully and tastefully adding noise to a computation in the right places.

To give a very concrete example, if I want to compute the average of a set of numbers, the non-private way of doing that is to take those numbers, average them, and release a numerically precise value for the average. In differential privacy, you wouldn't do that.

You would first compute that average to numerical precision, and then you'd add some noise to it. You'd add some kind of zero-mean Gaussian or exponential noise to it, so that the actual value you output is not the exact mean, but it will be close to it.
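A minimal sketch of that noisy-average idea in Python, assuming the values are known to lie in a bounded range [lo, hi] (needed to calibrate the noise) and using Laplace noise, the standard two-sided-exponential choice for this mechanism. The function name and parameters are illustrative, not something from the conversation.

```python
import numpy as np

def dp_mean(values, lo, hi, epsilon, rng=np.random.default_rng()):
    """Release an epsilon-differentially-private estimate of the mean.

    Assumes every value lies in [lo, hi]; clipping enforces that, which is
    what bounds how much any single record can shift the average.
    """
    values = np.clip(np.asarray(values, dtype=float), lo, hi)
    n = len(values)
    exact_mean = values.mean()               # the non-private answer
    sensitivity = (hi - lo) / n              # max change from one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return exact_mean + noise                # close to, but not exactly, the mean

# Example: a private average of ages, assumed to lie between 0 and 100.
ages = [34, 51, 29, 62, 45, 38, 57, 41]
print(dp_mean(ages, lo=0, hi=100, epsilon=0.5))
```

Smaller epsilon (or fewer records) means more noise, which is the privacy-accuracy trade-off the conversation alludes to.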

The noise that you add guarantees that nobody can reverse engineer any particular value that went into the average. So noise is the savior. How many algorithms can be aided by adding noise? Yeah, so I'm a relatively recent member of the differential privacy community. My coauthor, Aaron Roth, is really one of the founders of the field and has done a great deal of work in it, and I've learned a tremendous amount working with him on it.

It's a pretty grown-up field already. Yeah, it's pretty mature now. But I must admit, the first time I saw the definition of differential privacy, my reaction was, "Wow, that is a clever definition, and it's really making very strong promises." I first saw the definition in much earlier days, and my first reaction was, "Well, my worry about this definition would be that it's a great definition of privacy, but that it'll be so restrictive that we won't really be able to use it." That we won't be able to compute many things in a differentially private way.

So one of the great successes of the field, I think, is in showing that the opposite is true: most things that we know how to compute absent any privacy considerations can be computed in a differentially private way. So, for example, pretty much all of statistics and machine learning can be done differentially privately.

So pick your favorite machine learning algorithm: backpropagation in neural networks, CART for decision trees, support vector machines, boosting, you name it, as well as classic hypothesis testing and the like in statistics. None of those algorithms is differentially private in its original form. All of them have modifications that add noise to the computation, in different places and in different ways, to achieve differential privacy.
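As one illustration of what such a modification can look like, here is a rough sketch in the spirit of noisy gradient descent (the idea behind DP-SGD-style training): each example's gradient is clipped and Gaussian noise is added before each update. The clipping norm, noise scale, and the toy logistic-regression setup are illustrative assumptions, not something described in the conversation, and a real implementation would also track the cumulative privacy budget.

```python
import numpy as np

def noisy_gradient_descent(X, y, steps=200, lr=0.1, clip=1.0,
                           noise_std=0.5, rng=np.random.default_rng(0)):
    """Logistic regression trained with clipped, noised per-example gradients.

    Clipping bounds each record's influence on the update; Gaussian noise
    masks it. (Illustrative only: calibrating noise_std to a target epsilon
    requires a proper privacy accountant, which is omitted here.)
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        preds = 1.0 / (1.0 + np.exp(-X @ w))          # sigmoid predictions
        per_example = (preds - y)[:, None] * X        # one gradient per record
        norms = np.linalg.norm(per_example, axis=1, keepdims=True)
        clipped = per_example / np.maximum(1.0, norms / clip)
        noise = rng.normal(0.0, noise_std * clip, size=d)
        grad = (clipped.sum(axis=0) + noise) / n      # noisy average gradient
        w -= lr * grad
    return w

# Toy usage with synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)
print(noisy_gradient_descent(X, y))
```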

So this really means that, to the extent that we've become a scientific community very dependent on the use of machine learning, statistical modeling, and data analysis, we really do have a path to provide privacy guarantees for those methods. And so we can still enjoy the benefits of the data science era while providing rather robust privacy guarantees to individuals.

Thank you.