Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series

00:00:00.000 | Today, we're very happy to have Andrew Trask.

00:00:04.240 | He's a brilliant writer, researcher,

00:00:06.720 | tweeter, that's a word,

00:00:09.240 | in the world of machine learning and artificial intelligence.

00:00:12.480 | He is the author of Grok and Deep Learning,

00:00:15.040 | the book that I highly recommended in the lecture on Monday.

00:00:21.200 | He's the leader and creator of OpenMind,

00:00:23.920 | which is an open-source community that strives to make our algorithms,

00:00:27.820 | our data, and our world in general more privacy-preserving.

00:00:31.660 | He is coming to us by way of Oxford,

00:00:34.680 | but without that rich,

00:00:36.480 | complex, beautiful, sophisticated British accent, unfortunately.

00:00:40.600 | He is one of the best educators,

00:00:43.200 | and truly one of the nicest people I know.

00:00:45.240 | So please give him a warm welcome.

00:00:47.880 | >> Thanks. That was a very generous introduction.

00:00:54.440 | So yeah, today we're going to be talking about privacy-preserving AI.

00:00:57.280 | This talk is going to come in two parts.

00:00:59.160 | So the first is going to be looking at privacy tools

00:01:02.400 | from the context of a data scientist or a researcher,

00:01:05.440 | like how their actual UX might change.

00:01:07.600 | Because I think that's the best way to communicate

00:01:09.960 | some of the new technologies that are coming about in that context.

00:01:12.720 | Then we're going to zoom out and look at,

00:01:14.800 | under the assumption that these kinds of technologies become mature,

00:01:17.840 | what is that going to do to society?

00:01:20.760 | What consequences or side effects could these kinds of tools

00:01:24.080 | have, both positive and negative?

00:01:27.440 | So first, let's ask the question,

00:01:29.860 | is it possible to answer questions using data that we cannot see?

00:01:33.480 | This is going to be the key question that we look at today.

00:01:36.760 | Let's start with an example.

00:01:39.840 | So first, if we wanted to answer the question,

00:01:41.540 | what do tumors look like in humans?

00:01:43.460 | Well, this is a pretty complex question.

00:01:46.240 | Tumors are pretty complicated things.

00:01:47.900 | So we might train an AI classifier.

00:01:50.000 | If we wanted to do that,

00:01:51.640 | we would first need to download a dataset of tumor-related images.

00:01:54.680 | So we'd be able to statistically study these and be able to

00:01:56.920 | recognize what tumors look like in humans.

00:01:59.440 | But this kind of data is not very easy to come by.

00:02:02.200 | So it's very rarely that it's collected,

00:02:05.040 | it's difficult to move around, it's highly regulated.

00:02:08.000 | So we're probably going to have to buy it from

00:02:09.960 | relatively small number of sources that

00:02:12.640 | are able to actually collect and manage this kind of information.

00:02:15.860 | The scarcity and constraints around this,

00:02:19.200 | likely to make this a relatively expensive purchase.

00:02:21.640 | If it's going to be an expensive purchase for us to answer this question,

00:02:24.040 | well, then we're going to find someone to finance our project.

00:02:26.400 | If we need someone to finance our project,

00:02:27.700 | we have to come up with a way of how we're going to pay them back.

00:02:30.320 | If we're going to create a business plan,

00:02:31.680 | then we have to find a business partner.

00:02:32.920 | We're going to find a business partner,

00:02:33.800 | we have to spend all our cost in LinkedIn,

00:02:35.240 | looking for someone to start a business with us.

00:02:37.040 | Now, it's because we wanted to answer the question,

00:02:39.000 | what do tumors look like in humans?

00:02:41.600 | What if we wanted to answer a different question?

00:02:44.200 | What if we wanted to answer the question,

00:02:46.020 | what do handwritten digits look like?

00:02:48.360 | Well, this would be a totally different story.

00:02:51.920 | We download a dataset,

00:02:55.800 | we download a state-of-the-art training script from GitHub,

00:02:58.080 | we'd run it, and a few minutes later,

00:02:59.520 | we'd have an ability to classify

00:03:02.120 | handwritten digits with potentially superhuman ability,

00:03:04.560 | if such a thing exists.

00:03:07.000 | Why is this so different between these two questions?

00:03:11.320 | The reason is that getting access to private data,

00:03:13.800 | data about people, is really, really hard.

00:03:18.280 | As a result, we spend most of

00:03:21.120 | our time working on problems and tasks like this.

00:03:24.120 | So ImageNet, MNIST, IFR10.

00:03:25.780 | Anybody who's trained a classifier on MNIST before?

00:03:28.680 | Raise your hand. I expect pretty much everybody.

00:03:32.080 | Instead of working on problems like this,

00:03:36.480 | does anyone train a classifier to predict dementia,

00:03:40.320 | diabetes, Alzheimer's?

00:03:46.000 | Looks like she's going. Depression?

00:03:48.560 | Anxiety? No one.

00:03:50.960 | So why is it that we spend all our time on tasks like this,

00:03:56.520 | when these tasks, these represent our friends and loved ones,

00:04:00.040 | and problems in society that really, really matter.

00:04:02.600 | Not to say that there aren't people working on this.

00:04:04.480 | It's absolutely, there are whole fields dedicated to it.

00:04:07.080 | But the machine learning community at large,

00:04:10.440 | these tasks are pretty inaccessible.

00:04:13.000 | In fact, in order to work on one of these,

00:04:16.840 | just getting access to the data,

00:04:18.160 | you'd have to dedicate a portion of

00:04:19.720 | your life just to getting access to it,

00:04:21.840 | whether it's doing a startup or joining a hospital or what have you.

00:04:26.680 | Whereas for other kinds of datasets,

00:04:28.280 | they're just simply readily accessible.

00:04:30.960 | This brings us back to our question.

00:04:33.360 | Is it possible to answer questions using data that we cannot see?

00:04:39.960 | So in this talk, we're going to walk through a few different techniques.

00:04:43.880 | If the answer to this question is yes,

00:04:46.920 | the combination of these techniques is going to try to make it so that we can

00:04:50.320 | actually pip install access to datasets like

00:04:52.880 | these in the same way that we

00:04:54.760 | pip install access to other deep learning tools.

00:04:57.280 | The idea here is to lower the barrier to entry,

00:04:59.240 | to increase the accessibility to some of

00:05:01.080 | the most important problems that we would like to address.

00:05:05.080 | So as Lex mentioned,

00:05:08.400 | I lead a community called OpenMind,

00:05:09.720 | which is an open-source community of a little over 6,000 people

00:05:12.800 | who are focused on lowering

00:05:14.600 | the barrier to entry to privacy-preserving AI and machine learning.

00:05:17.080 | Specifically, one of the tools that we're

00:05:18.880 | working on we're talking about today is called PySift.

00:05:21.200 | PySift extends the major deep learning frameworks

00:05:24.800 | with the ability to do privacy-preserving machine learning.

00:05:26.680 | So specifically today, we're going to be looking

00:05:28.000 | at the extensions into PyTorch.

00:05:29.760 | So PyTorch, people generally familiar with PyTorch,

00:05:32.160 | yeah, quite a few users.

00:05:34.440 | It's my hope that by walking through a few of these tools,

00:05:39.480 | it'll become clear how we can start to be able to do data science,

00:05:45.680 | the active answering questions using

00:05:47.520 | data that we don't actually have direct access to.

00:05:50.720 | Then in the second half of the talk,

00:05:52.440 | we're going to generalize this to answering

00:05:54.240 | questions even if you're not necessarily a data scientist.

00:05:56.960 | So first, first tool is remote execution.

00:05:59.280 | So let's just walk me through this.

00:06:01.240 | So we're going to jump into code for a minute,

00:06:03.800 | but hopefully this is line by line and relatively simple.

00:06:06.080 | Even if you aren't familiar with PyTorch,

00:06:07.480 | I think it's relatively intuitive.

00:06:08.880 | We're looking at lists of numbers and these kinds of things.

00:06:11.080 | So up at the top,

00:06:12.360 | we import Torch as a deep learning framework.

00:06:14.560 | SIFT extends Torch with this thing called Torch Hook.

00:06:17.280 | All it's doing is just iterating through the library and

00:06:19.200 | basically monkey-patching in lots of new functionality.

00:06:22.000 | Most deep learning frameworks are built around one core primitive,

00:06:25.240 | and that core primitive is the tensor.

00:06:27.320 | For those of you who don't know what tensors are,

00:06:29.320 | just think of them as nested list of numbers for now,

00:06:31.760 | and that'll be good enough for this talk.

00:06:34.000 | But for us, we introduce a second core primitive,

00:06:36.760 | which is the worker.

00:06:38.080 | A worker is a location within which computation is going to be occurring.

00:06:42.680 | So in this case, we have a virtualized worker that

00:06:45.640 | is pointing to say a hospital data center.

00:06:49.280 | The assumption that we have is that this worker will allow us to run

00:06:52.760 | computation inside of the data center without us

00:06:55.200 | actually having direct access to that worker itself.

00:06:58.200 | It gives us a limited,

00:06:59.480 | white-listed set of methods that we can use on this remote machine.

00:07:04.440 | So just to give you an example,

00:07:06.040 | so there's that core primitive we talked about a minute ago.

00:07:08.760 | We have the torch tensor,

00:07:10.040 | so 1, 3, 4, 5.

00:07:12.000 | The first method that we added is called just dot send.

00:07:15.120 | This does exactly what you might expect.

00:07:16.920 | Takes the tensor, serializes it,

00:07:18.600 | sends it into the hospital data center,

00:07:20.220 | and returns back to me a pointer.

00:07:22.400 | This pointer is really, really special.

00:07:24.040 | For those of you who are actually familiar with deep learning frameworks,

00:07:25.920 | I hope that this will really resonate with you.

00:07:28.180 | Because it has the full PyTorch API as a part of it,

00:07:32.200 | but whenever you execute something using this pointer,

00:07:35.040 | instead of it running locally,

00:07:36.480 | even though it looks like and feels like it's running locally,

00:07:39.080 | it actually executes on the remote machine and

00:07:41.720 | returns back to you another pointer to the result.

00:07:45.280 | The idea here being that I can now coordinate remote executions,

00:07:49.660 | remote computations without

00:07:52.040 | necessarily having to have direct access to the machine.

00:07:55.120 | Of course, I can get a dot get request and we'll

00:07:57.200 | see that this is actually really,

00:08:00.000 | really important to getting permissions around when you can do dot get request,

00:08:02.640 | and actually ask for data from a remote machine back to you.

00:08:05.360 | So just remember that.

00:08:07.000 | Cool. So this is where we start.

00:08:09.820 | So in the Pareto principle,

00:08:11.600 | 80 percent for 20 percent,

00:08:13.360 | this is like the first big cut.

00:08:15.840 | So pros, data remains on a remote machine.

00:08:18.680 | We can now, in theory,

00:08:19.960 | do data science on a machine that we don't have access to, that we don't own.

00:08:23.640 | But the problem is, the first column we want to address,

00:08:27.560 | is how can we actually do good data science

00:08:29.560 | without physically seeing the data?

00:08:31.360 | So it's all well and good to say,

00:08:32.640 | "I'm going to train a deep learning classifier."

00:08:34.200 | But the process of answering questions is inherently iterative.

00:08:38.080 | It's inherently give and take.

00:08:40.760 | I learn a little bit and I ask a little bit,

00:08:42.240 | I learn a little bit and I ask a little bit.

00:08:44.280 | This brings me to the second tool.

00:08:46.000 | So search and example data.

00:08:47.200 | Again, we're starting really simple.

00:08:48.680 | It will get more complex here in a minute.

00:08:50.840 | So in this case, let's say we have what's called a grid.

00:08:53.200 | So PyGrid, if PySIFT is a library,

00:08:55.640 | PyGrid is the platform version.

00:08:57.180 | So again, this is all open-source Apache 2 stuff.

00:09:00.640 | This is, we have what's called a grid client.

00:09:03.520 | So this could be a interface to

00:09:06.160 | a large number of datasets inside of a big hospital.

00:09:10.000 | So let's say I wanted

00:09:12.040 | to train a classifier to do something with diabetes.

00:09:14.680 | So it's going to predict diabetes or predict

00:09:16.760 | certain kind of diabetes or certain attribute of diabetes.

00:09:20.080 | I should be able to perform remote search.

00:09:23.040 | I get back pointers to throw the remote information.

00:09:27.040 | I can get back detailed descriptions of

00:09:30.240 | what the information is without me actually looking at it.

00:09:32.400 | So how it was collected,

00:09:33.680 | what the rows and columns are,

00:09:35.480 | what the types of different information is,

00:09:37.840 | what the various ranges of the values can take on,

00:09:40.160 | things that allow me to do remote normalization,

00:09:42.280 | these kinds of things. Then in some cases,

00:09:44.760 | even look at samples of this data.

00:09:46.320 | So these samples could be human curated.

00:09:49.120 | They could be generated from a GAN.

00:09:51.040 | They could be actually short snippets from the actual dataset.

00:09:56.920 | Maybe it's okay to release small amounts but not large amounts.

00:09:59.760 | The reason I highlight this,

00:10:01.920 | this isn't crazy complex stuff.

00:10:03.640 | So prior to going back to school,

00:10:06.000 | I used to work for a company called Digital Reasoning.

00:10:07.800 | We did on-prem data science.

00:10:11.720 | So we delivered AI services to corporations behind the firewall.

00:10:17.120 | So we did classified information.

00:10:19.120 | We worked with investment banks helping prevent insider trading.

00:10:22.600 | Doing data science on data that your home team,

00:10:25.520 | back in Nashville in our case,

00:10:27.280 | is not able to see is really, really challenging.

00:10:29.200 | But there are some things that can give you the first big jump

00:10:33.680 | before you jump into the more complex tools to

00:10:35.680 | handle some of the more challenging use cases.

00:10:37.960 | Cool. So basic remote execution,

00:10:40.240 | so remote procedure calls,

00:10:42.080 | basic private search, and the ability to look at sample data,

00:10:46.480 | gives us enough general context to be able

00:10:48.960 | to start doing things like feature engineering and evaluating quality.

00:10:52.880 | So now the data remains in the remote machine.

00:10:56.600 | We can do some basic feature engineering.

00:10:58.520 | Here's where things get a little more complicated.

00:11:01.360 | So if you remember, in the very first slide,

00:11:04.960 | where I show you some code at the bottom,

00:11:06.600 | I called dot get on the tensor.

00:11:09.360 | What that did was it took the pointer to

00:11:12.200 | some remote information and said, "Hey,

00:11:13.480 | send that information to me."

00:11:15.080 | That is an incredibly important bottleneck.

00:11:18.840 | Unfortunately, despite the fact that I'm doing all my remote execution,

00:11:23.640 | if that's just naively implemented,

00:11:25.200 | well, I can just steal all the data that I want to.

00:11:27.400 | I just call dot get on whatever pointers I want,

00:11:29.640 | and there's no additional added real security.

00:11:32.960 | So what are we going to do about this? This brings us to

00:11:36.480 | tool number three called differential privacy.

00:11:38.280 | Differential privacy, you want to come across?

00:11:40.240 | A little higher? Okay, cool.

00:11:42.760 | Awesome. Good. So I'm going to do

00:11:48.240 | a quick high-level overview of

00:11:49.880 | the intuition of differential privacy,

00:11:51.280 | and then we're going to jump into how it can

00:11:53.400 | and is looking in the code,

00:11:55.320 | and I'll give you resources for deeper dive in

00:11:57.480 | differential privacy at the end of the talk, should you be interested.

00:12:01.520 | So differential privacy, loosely stated,

00:12:04.040 | is a field that allows you to do

00:12:05.880 | statistical analysis without compromising the privacy of the dataset.

00:12:10.280 | More specifically, it allows you to query a database,

00:12:14.120 | while making certain guarantees about the privacy

00:12:16.880 | of the records contained within the database.

00:12:18.720 | So let me show you what I mean.

00:12:20.000 | Let's say we have an example database,

00:12:21.760 | and so this is the canonical DB if you

00:12:23.480 | look in the literature for differential privacy.

00:12:25.640 | It'll have one row for person,

00:12:27.840 | one row for person, and one column of zeros and ones,

00:12:30.600 | which corresponds to true and false.

00:12:32.120 | We don't actually really care what those zeros and ones are indicating.

00:12:34.640 | It could be presence of a disease,

00:12:36.760 | could be male-female, could be just some sensitive attributes,

00:12:39.560 | something that's worth protecting.

00:12:42.040 | What we're going to do is,

00:12:45.360 | our goal is to ensure

00:12:46.640 | statistical analysis doesn't compromise privacy.

00:12:48.520 | What we're going to do is query this database.

00:12:50.400 | We're going to run some function over the entire database,

00:12:53.720 | and we're going to look at the result,

00:12:55.480 | and then we're going to ask a very important question.

00:12:57.760 | We're going to ask, if I were to remove someone from this database,

00:13:04.240 | say John, would the output of my function change?

00:13:10.880 | If the answer to that is no,

00:13:14.800 | then intuitively, we can say that,

00:13:17.960 | well, this output is not conditioned on John's private information.

00:13:21.320 | Now, if we could say that about everyone in the database,

00:13:25.020 | well then, okay, it would be a perfectly privacy-preserving query,

00:13:30.300 | but it might not be that useful.

00:13:32.440 | But this intuitive definition, I think, is quite powerful.

00:13:35.680 | The notion of how can we construct queries that are

00:13:37.840 | invariant to removing someone or replacing them with someone else.

00:13:43.160 | The notion of the maximum amount that

00:13:46.480 | the output of a function can change as a result of removing or

00:13:50.320 | replacing one of the individuals is known as the sensitivity.

00:13:54.640 | So important, so if you're reading literature,

00:13:56.800 | you find it's come across sensitivity,

00:13:58.360 | that's what we're talking about.

00:13:59.840 | So what do we do when we have a really sensitive function?

00:14:03.880 | We're going to take a bit of a sidestep for a minute.

00:14:07.120 | I have a twin sister who's finishing a PhD in political science.

00:14:11.560 | Political science, often they need to answer questions about

00:14:16.240 | very taboo behavior, something that people are likely to lie about.

00:14:20.440 | So let's say I wanted to survey everyone in this room and I wanted to answer

00:14:24.920 | the question what percentage of you are secretly serial killers?

00:14:30.880 | Not because I think any one of you are,

00:14:35.920 | but because I genuinely want to understand this trend.

00:14:38.960 | I'm not trying to arrest people,

00:14:40.280 | I'm not trying to be an instrument of the criminal justice system.

00:14:45.760 | I'm trying to be a sociologist or

00:14:47.400 | political scientist and understand this actual trend.

00:14:49.640 | The problem is if I sit down with each one of you in a private room and I say,

00:14:52.680 | "I promise, I promise, I promise,

00:14:53.960 | I won't tell anybody," I'm still going to get a skewed distribution.

00:14:57.680 | Some people are just going to be like, "Why would I risk

00:15:00.360 | telling you this private information?"

00:15:02.640 | So what sociologists can do is this technique called randomized response,

00:15:06.440 | where I should have brought a coin.

00:15:08.720 | You take a coin and you give it to each person before you survey them,

00:15:13.040 | and you ask them to flip it twice somewhere that you cannot see.

00:15:16.280 | So I would ask each one of you to flip a coin twice somewhere that I cannot see.

00:15:20.520 | Then I would instruct you to,

00:15:23.480 | um, if the first coin flip is a heads, answer honestly.

00:15:29.880 | But if the first coin flip is a tails,

00:15:33.400 | answer yes or no based on the second coin flip.

00:15:37.840 | So roughly half the time,

00:15:39.840 | you'll be honest and the other half of the time,

00:15:43.000 | you'll be, uh, you'll be giving me a perfect 50-50 coin flip.

00:15:47.240 | And the cool thing is that what this is actually doing,

00:15:49.880 | is taking whatever the true mean of the distribution

00:15:51.840 | is and averaging it with a 50-50 coin flip, right?

00:15:55.400 | So if say, um,

00:15:57.680 | 55 percent of you, uh,

00:16:00.520 | answered yes, that, that you are a serial killer,

00:16:05.960 | um, then I know that the true center of the distribution is actually 60 percent,

00:16:10.000 | because it was 60 percent averaged with a 50-50 coin flip.

00:16:12.680 | Does that make sense? However, despite the fact that I can

00:16:15.720 | recover the center of the distribution, right,

00:16:18.920 | given enough samples, um,

00:16:21.480 | each individual person has plausible deniability.

00:16:24.040 | If you said yes,

00:16:25.320 | it could have been because you actually are,

00:16:27.360 | or it could have been because you just happened to

00:16:29.560 | flip a certain sequence of coin flips, okay?

00:16:33.520 | Now this concept of adding noise to data to give

00:16:37.760 | plausible deniability is sort of the secret weapon of differential privacy, right?

00:16:42.120 | And, and the field itself is a,

00:16:44.960 | a set of mathematical proofs for trying to do this as efficiently as possible,

00:16:49.280 | to give sort of the smallest amount of noise to get the most accurate results,

00:16:53.200 | right, um, with the best possible privacy protections, right?

00:16:57.480 | There is a meaningful, um,

00:16:58.920 | sort of base trade-off that you, you, you, you know,

00:17:02.360 | um, there's kind of a Pareto trade-off, right?

00:17:05.320 | And we're trying to, to push that,

00:17:06.680 | push that trade-off down.

00:17:08.080 | Um, um, but so the, the,

00:17:10.160 | the, the field of research that is differential privacy, um,

00:17:13.760 | is looking at how to add noise to data and,

00:17:17.280 | and resulting queries to give plausible deniability to the,

00:17:20.280 | and to the, the members of a,

00:17:22.080 | of a database or a training dataset. Does that make sense?

00:17:25.160 | Now, um, a few,

00:17:27.520 | um, terms that you should be familiar with.

00:17:29.560 | So there's local and there's global differential privacy.

00:17:31.760 | So local differential privacy adds noise to data before it's sent to the statistician.

00:17:37.280 | So in this case, the one with the coin flip,

00:17:38.960 | this was local differential privacy.

00:17:40.240 | It affords you the best amount of protection because you never actually

00:17:43.400 | reveal sort of in the clear your information to someone, right?

00:17:47.480 | And then there's global differential privacy,

00:17:49.640 | which says, okay, we're gonna put everything in the database,

00:17:52.160 | perform a query, and then before the output of the query gets published,

00:17:55.440 | we're gonna add a little bit of noise to the output of the query, okay?

00:17:58.160 | This tends to have a much better privacy trade-off,

00:18:00.260 | but you have to trust the database owner to not compromise the results, okay?

00:18:03.680 | And we'll see there's some other things we can do there.

00:18:05.480 | Um, but with, with me so far,

00:18:06.920 | this is a good, good point for questions if you had any questions.

00:18:09.440 | Got it. So the question is, um,

00:18:11.240 | is this verifiable?

00:18:12.720 | Um, any of this, this process of differential privacy verifiable?

00:18:15.840 | Um, so that is a fantastic question, um,

00:18:18.120 | and one that actually absolutely comes up in practice.

00:18:20.840 | Um, um, so first,

00:18:22.640 | local differential privacy, the nice thing is everyone's doing it for themself, right?

00:18:26.040 | So in that sense, if you're flipping your own coins and answering your own questions,

00:18:29.680 | um, that's, that's your verification, right?

00:18:31.960 | You're, you're kind of trusting yourself.

00:18:33.200 | For global differential privacy, um,

00:18:35.680 | stay tuned for the next tool and we'll, we'll come back to that.

00:18:38.880 | All right. So what does this look like in code?

00:18:42.680 | So first, we have a pointer to a remote private dataset we call dot get.

00:18:46.600 | Whoa, we get big fat error, right?

00:18:48.840 | You just asked to sort of see

00:18:50.680 | the raw value of some private data point which you cannot do, right?

00:18:53.560 | Instead, pass in dot get epsilon to add the appropriate amount of noise.

00:18:57.040 | So one thing I haven't mentioned yet, um,

00:18:59.520 | uh, differential privacy. So I mentioned sensitivity, right?

00:19:02.240 | So sensitivity was, um, uh,

00:19:04.080 | related to the type of query,

00:19:05.520 | the type of function that we wanted to do and it's in variance to,

00:19:07.840 | um, removing or replacing individual entries in the, in the database.

00:19:10.880 | Um, so epsilon is a measure of what we call our privacy budget, right?

00:19:15.800 | And what our privacy budget is, is saying, okay, what, what's the,

00:19:18.400 | what's the amount of, of statistical uniqueness that I'm going to sort of limit?

00:19:22.720 | What's the upper bound for the amount of statistical uniqueness that I'm going to

00:19:25.080 | allow to come out of this, out of this database?

00:19:27.160 | Um, and actually I'm gonna take one more side,

00:19:29.120 | side track here, um,

00:19:30.480 | because I think it's really worth mentioning, um, data anonymization.

00:19:33.640 | Anyone familiar with data anonymization come across this term before?

00:19:36.600 | Taking a document like redacting the,

00:19:39.120 | the social security numbers and like all this kind of stuff?

00:19:42.400 | By and large, it does not work.

00:19:44.520 | If you don't remember anything else from this talk,

00:19:46.280 | it is very dangerous to do just data set anonymization, okay?

00:19:50.080 | And differential privacy in, in some respects is,

00:19:52.480 | is, is the formal version of data anonymization,

00:19:55.120 | where instead of, instead of just saying, okay,

00:19:56.840 | I'm just gonna redact out these pieces and then I'll be fine, um,

00:19:59.720 | this is saying, okay, um, that we, we can do a lot better.

00:20:02.120 | So for example, a Netflix prize,

00:20:03.440 | Netflix machine learning prize,

00:20:04.520 | if you remember this,

00:20:05.880 | a big million dollar prize,

00:20:07.200 | maybe some people in here competed in it.

00:20:08.880 | So in this prize, right, um,

00:20:11.760 | Netflix published an anonymized dataset, right?

00:20:14.600 | And that was, um,

00:20:15.960 | movies and users, right?

00:20:18.040 | And they took all the movies and replaced them with numbers,

00:20:20.440 | and they took all the users and replaced them with numbers,

00:20:22.840 | and then we just had sparsely populated movie ratings in this matrix, right?

00:20:27.360 | Seemingly anonymous, right?

00:20:29.480 | There's no names of any kind.

00:20:31.480 | Um, but the problem is,

00:20:33.800 | is that each row is statistically unique,

00:20:37.960 | meaning it, it kind of is its own fingerprint.

00:20:41.640 | And so two months after the dataset was published,

00:20:44.560 | some researchers at, uh, UT Austin,

00:20:48.080 | um, I think it was, I think it was UT Austin, um,

00:20:51.320 | were able to go and scrape IMDb,

00:20:54.960 | and basically create the same matrix in IMDb,

00:20:58.000 | and then just compare the two.

00:20:59.840 | And it turns out people that were into movie rating,

00:21:02.280 | were into movie rating, and,

00:21:04.400 | and, and were watching movies at similar times,

00:21:06.960 | and similar, similar patterns, and similar tastes, right?

00:21:09.280 | And they were able to de-anonymize

00:21:11.080 | this first dataset with a high degree of accuracy.

00:21:13.000 | Uh, it happened again with, there's a,

00:21:14.240 | there's a famous case of like, uh,

00:21:15.520 | medical records for like, uh, I think,

00:21:17.440 | I think it'd been a Massachusetts senator, I think.

00:21:19.200 | It was someone in the Northeast, um,

00:21:21.080 | being de-anonymized, uh, through very similar techniques.

00:21:23.640 | So some- one person goes and buys a anonymized medical dataset over here that has,

00:21:27.760 | you know, birthdate and zip code,

00:21:29.360 | and this one does zip code and,

00:21:30.640 | and gender, and this one does zip code,

00:21:32.560 | gender, and whether or not you have cancer, right?

00:21:34.920 | And, and when you get all these together, um,

00:21:37.600 | you can start to sort of use the uniqueness in each one to,

00:21:41.480 | to relink it all back together.

00:21:42.760 | I mean, I, um, this is so doable to,

00:21:45.560 | to the extreme that I, I,

00:21:46.880 | unfortunately know of companies whose business model is to buy anonymized datasets,

00:21:52.480 | de-anonymize them, and sell market intelligence to insurance companies.

00:21:56.120 | Ooh, right? But it can be done, okay?

00:22:00.080 | And, and the reason that it can be done is that just

00:22:02.800 | because the dataset that you are publishing,

00:22:05.200 | the one that you are physically looking at,

00:22:07.320 | doesn't seem like it has, you know,

00:22:09.800 | social security number and stuff in it,

00:22:11.120 | does not mean that there's enough unique statistical signal

00:22:13.960 | for it to be linked to something else.

00:22:15.600 | And so when I say maximum amount of epsilon,

00:22:18.380 | epsilon is an upper bound on,

00:22:20.560 | on the, the statistical uniqueness that you're publishing in a dataset, right?

00:22:25.320 | And so what, what this tool represents is saying, okay,

00:22:29.160 | apply however much noise you need to given

00:22:33.400 | whatever computational graph led back to private data for this tensor, right?

00:22:38.320 | To ensure that, you know,

00:22:40.040 | to, to put an upper bound on,

00:22:41.320 | on the potential for linkage attacks, right?

00:22:43.040 | Now, if you said epsilon zero, okay,

00:22:44.720 | then that's, that's saying, um,

00:22:46.480 | effectively like, uh, um,

00:22:49.520 | there's, I'm only going to allow patterns that have occurred at least twice, right?

00:22:54.200 | Okay. So meaning, meaning two different people had

00:22:57.000 | this pattern and thus it's not unique to either one. Yes.

00:22:59.320 | So what happens if you perform the query twice?

00:23:01.320 | So the random noise would be re-randomized and,

00:23:03.440 | and sent again and you're absolutely, absolutely correct.

00:23:05.960 | So this epsilon, this is how much I'm spending with this query.

00:23:09.240 | So if I ran this three times,

00:23:10.800 | I would spend epsilon of 0.3. Does that make sense?

00:23:13.200 | So this is, this is a 0.1 query.

00:23:14.640 | If I did this multiple times, the epsilons would sum.

00:23:16.920 | And so for any given data science project, right?

00:23:19.400 | I should, I, I, what we're,

00:23:20.600 | we're advocating is that you're given

00:23:21.800 | an epsilon budget that you're not allowed to exceed, right?

00:23:24.160 | No matter how many queries that you, you participate.

00:23:26.520 | Now, there's, there's another sort of subfield of differential privacy that's

00:23:29.320 | looking at sort of single query approaches,

00:23:32.720 | which is all around synthetic data sets.

00:23:34.400 | So how can I perform sort of one query against

00:23:36.080 | the whole data set and create a synthetic data set that has,

00:23:39.000 | um, certain invariances that are desirable, right?

00:23:41.800 | So I can do good statistics on it.

00:23:43.520 | But then I can query this as many times as I want.

00:23:45.520 | Because they're basically, um, you can't,

00:23:48.240 | um, uh, yeah, anyway,

00:23:50.480 | but we, we, we don't have to get into that now.

00:23:51.840 | Does that answer your question? Cool. Awesome.

00:23:53.800 | So now you might think, okay,

00:23:55.160 | this is like a lossless cause.

00:23:56.320 | Like how can we be answering questions while protecting,

00:23:58.440 | while, while keeping statistical signal gone.

00:24:00.200 | But like it's, it's the difference between, um,

00:24:02.440 | it's the difference between if I have a data set and I wanna know what causes cancer, right?

00:24:07.540 | I could query a data set and learn that smoking causes cancer

00:24:12.120 | without learning that individuals are,

00:24:14.760 | are or are not smokers. Does that make sense?

00:24:17.360 | Right? And the reason for that is,

00:24:19.560 | is that I'm, I'm, I'm specifically looking for

00:24:22.080 | patterns that are occurring multiple times across different people.

00:24:25.000 | And this actually happens to really, um,

00:24:27.560 | closely mirror the type of

00:24:29.440 | generalization that we want in machine learning statistics anyways.

00:24:32.400 | Does that make sense? Like as machine learning practitioners,

00:24:35.280 | we're actually not really interested in the one-offs, right?

00:24:39.200 | I mean, sometimes our models memorize things.

00:24:40.840 | This, this happens, right?

00:24:42.360 | But we're actually more interested in the things that are,

00:24:45.160 | the things that are not specific to you.

00:24:46.720 | I want, I want the things that are gonna work, you know,

00:24:48.680 | the, the heart treatments that are gonna work for everyone in this room,

00:24:50.640 | and not just, I mean, you know,

00:24:51.880 | obviously if you need a heart treatment,

00:24:52.960 | I'd be happy, that'd be cool for you to have one.

00:24:54.440 | But like what we're chiefly interested in are,

00:24:56.720 | are the things that generalize, right?

00:24:58.280 | Which, which is why this is realistic, um, um,

00:25:01.400 | and why with, with continued effort on both tooling and,

00:25:04.600 | and the theory side, um,

00:25:06.120 | we can, we can have a much better, uh, reality than today.

00:25:09.040 | Cool. So, um, pros, just to review.

00:25:13.080 | So first, uh, remote execution allows us,

00:25:15.640 | that allows data to remain in the remote machine.

00:25:17.320 | Search and sampling, we can feature engineer using toy data.

00:25:19.740 | Differential privacy, we can have a formal rigorous privacy budgeting mechanism, right?

00:25:23.680 | Now, shoot, how is the privacy budget set?

00:25:26.440 | Is it defined by the user or is it defined by the dataset owner or, or someone else?

00:25:31.760 | Um, this is a really, really interesting question actually.

00:25:34.480 | Um, so first, it's definitely not set by the data scientist,

00:25:38.680 | um, because that would be a bit of a conflict of interest.

00:25:40.720 | Um, and at, at first,

00:25:43.520 | you might say it should be the data owner, okay?

00:25:46.600 | So the hospital, right?

00:25:48.000 | They're trying to cover their butt, right?

00:25:50.060 | And make sure that their assets are protected both legally and,

00:25:53.540 | and commercially, right?

00:25:54.580 | So they're, they're trying to make, make money off this.

00:25:56.140 | So there's, there's, um,

00:25:58.140 | um, there's sort of proper incentives there.

00:26:00.980 | But the interesting thing, and this gets back to your question,

00:26:04.020 | is what happens if I have, say,

00:26:06.300 | a radiology scan in two different hospitals, right?

00:26:10.700 | And they both spend one epsilon worth of,

00:26:13.700 | of, of my privacy in each of these hospitals.

00:26:17.380 | Right? That means that actually two epsilon of my private information is out there.

00:26:22.120 | Right? And it just means that one person has to be

00:26:25.840 | clever enough to go to both places to get the join.

00:26:28.680 | This is actually the exact same mechanism we were talking about a second ago when

00:26:31.120 | someone went from Netflix to IMDb, right?

00:26:34.160 | And so the, the true answer of who should be setting epsilon budgets,

00:26:38.600 | although, um, logistically, it's gonna be challenging.

00:26:40.640 | We'll talk about a little bit of this in, in,

00:26:41.880 | in part two of the talk, but I'm going a little bit slow.

00:26:43.840 | Um, but okay. Um, is, um,

00:26:47.540 | it should be us. It should be people,

00:26:49.660 | and it should be people around their own information, right?

00:26:52.860 | You should be setting your personal epsilon budget. That makes sense?

00:26:56.440 | That's an aspirational goal.

00:26:58.140 | Um, we've got a long way before we can get to that level of,

00:27:01.520 | of infrastructure around these kinds of things.

00:27:04.500 | Um, and we can talk about that,

00:27:06.620 | and we can definitely talk about more of that in

00:27:07.860 | the kind of question-answer session as well.

00:27:09.060 | But I think in, in theory,

00:27:10.780 | in theory, that's what, what we want.

00:27:13.620 | [NOISE]

00:27:18.000 | Okay. Um, the two cons that we still- two weaknesses of

00:27:20.440 | this approach that we still lack are- someone asked this question.

00:27:23.040 | I think it was you. Yeah, yeah, you asked the question.

00:27:24.880 | Um, so first, the data is safe,

00:27:26.320 | but the model is put at risk.

00:27:27.440 | Uh, and what if we need to do a join?

00:27:28.680 | Actually, actually, yours is the third one,

00:27:29.840 | which I should totally add to the slide.

00:27:31.200 | Um, so, so first, um,

00:27:34.040 | if I'm sending my, my computations,

00:27:35.880 | my model into the hospital to learn how to be a better cancer classifier, right?

00:27:39.720 | My model is put at risk.

00:27:40.840 | It's kind of a bummer if, like, you know,

00:27:42.960 | this is a $10 million healthcare model.

00:27:44.580 | I'm just sending it to a thousand different hospitals to get learned, to learn.

00:27:47.540 | So that's potentially risky.

00:27:48.980 | Second, um, what if I need to do a join

00:27:50.940 | a computation across multiple different data owners,

00:27:52.900 | who don't trust each other, right?

00:27:54.440 | Who sends whose data to whom, right?

00:27:57.460 | And thirdly, um, as you pointed out,

00:28:01.060 | how do I trust that these computations are actually happening the way

00:28:03.780 | that I am telling the remote machine that they should happen?

00:28:07.420 | This brings me to my absolute favorite tool,

00:28:11.980 | secure multi-party computation.

00:28:13.680 | Come across this before? Raise them high.

00:28:15.640 | Okay, cool. Little bit above average.

00:28:18.400 | Most machine learning people have not heard about this yet,

00:28:20.480 | and I absolutely- this is the coolest,

00:28:23.720 | this is the coolest thing I've learned about since learning about,

00:28:25.720 | like, AI machine learning.

00:28:26.680 | This is a, this is a really, really cool technique.

00:28:28.600 | Encrypted computation, how about homomorphic encryption?

00:28:30.960 | You come across homomorphic encryption?

00:28:32.280 | Okay, a few more. Yeah, this is related to that.

00:28:34.640 | Um, so first, the kind of textbook definition is, is like this.

00:28:39.920 | If you go on Wikipedia, you'd see, uh,

00:28:41.700 | secure NPC allows multiple people to combine

00:28:43.740 | their private inputs to compute

00:28:45.340 | a function without revealing their inputs to each other, okay?

00:28:48.460 | Um, but in the context of machine learning,

00:28:50.580 | the implication of this is multiple different individuals

00:28:53.340 | can share ownership of a number, okay?

00:28:57.220 | Share ownership of a number. Show you what I mean.

00:29:00.580 | So let's say I have the number five,

00:29:02.620 | my happy smiling face,

00:29:04.340 | and I split this into two shares,

00:29:06.540 | two and a three, okay?

00:29:09.720 | I've got two friends, Marianne and Bobby,

00:29:13.480 | and I give them these shares.

00:29:15.720 | They are now the shareholders of this number, okay?

00:29:18.840 | And now I'm gonna go away,

00:29:20.160 | and this number is shared between them, okay?

00:29:24.960 | And this, this gives us several desirable properties.

00:29:27.000 | First, it's encrypted from the standpoint that neither Bob,

00:29:32.800 | nor Marianne can tell what number is

00:29:34.880 | encrypted between them by looking at their own share by itself.

00:29:37.720 | Now, I've, um, um,

00:29:40.760 | for those of you who are familiar with,

00:29:42.760 | uh, kind of cryptographic math,

00:29:44.480 | um, I'm hand-waving over this a little bit.

00:29:46.360 | This would typically be, so in- in- uh,

00:29:48.240 | decryption would be adding the shares together,

00:29:49.980 | modulus, uh, a large prime.

00:29:52.000 | Um, so these would typically look like sort of

00:29:53.880 | large pseudo-random numbers, right?

00:29:56.040 | But for the sake of making it sort of intuitive,

00:29:58.080 | I've picked pseudo-random numbers that are convenient to the eyes.

00:30:01.560 | Um, so first, these two values are encrypted,

00:30:05.520 | and second, we get shared governance,

00:30:08.080 | meaning that we cannot decrypt these numbers or do anything with

00:30:11.320 | these numbers unless all of the shareholders agree, okay?

00:30:17.120 | But the truly extraordinary part is that

00:30:21.660 | while this number is encrypted between these individuals,

00:30:23.840 | we can actually perform computation, right?

00:30:26.120 | So in this case, let's say we wanted to multiply

00:30:28.020 | the shares times- or the encrypted number times two,

00:30:30.520 | each person can multiply their share times two,

00:30:32.520 | and now they have a encrypted number 10, right?

00:30:35.160 | And there's a whole variety of protocols allowing you to do different functions,

00:30:39.360 | um, such as the functions needed for machine learning,

00:30:41.840 | um, while numbers are in this encrypted state, okay?

00:30:45.360 | Um, and I'll give some more resources for you- for you if you're

00:30:47.440 | interested in kind of learning more about this at the end as well.

00:30:50.280 | Now, the big tie-in. Models and datasets are just large collections of numbers,

00:30:55.360 | which we can individually encrypt,

00:30:56.980 | which we can individually, uh,

00:30:58.840 | share governance over.

00:31:00.400 | Um, now, specifically to reference your question,

00:31:02.440 | there's two configurations of,

00:31:04.360 | of SecureNPC, active and passive security.

00:31:06.440 | In the active security model,

00:31:07.680 | you can tell if anyone does computation that you did

00:31:09.640 | not sort of independently authorize, which is great.

00:31:12.920 | So what does this look like in practice when you go back to the code?

00:31:18.440 | So in this case,

00:31:19.880 | we don't need just one worker,

00:31:20.960 | it's not just one hospital because we're looking to have a shared governance,

00:31:23.240 | shared ownership amongst multiple different individuals.

00:31:25.280 | So let's say we have Bob,

00:31:26.520 | Alice, and Tao, and a crypto provider,

00:31:28.800 | which we won't go into now.

00:31:30.080 | Um, and I can take a tensor,

00:31:32.200 | and instead of calling dot send and sending that tensor to someone else,

00:31:35.440 | now I call dot share,

00:31:37.200 | and that splits each value into

00:31:40.920 | multiple different shares and distributes those amongst the shareholders, right?

00:31:44.600 | So in this case, Bob, Alice, and Tao.

00:31:46.680 | However, in the frameworks that we're working on,

00:31:49.880 | you still get kind of the same PyTorch-like interface,

00:31:52.760 | and all the cryptographic protocol happens under the hood.

00:31:55.600 | And the idea here is to make it so that we can sort of do

00:31:58.360 | encrypted machine learning without you necessarily having to be a cryptographer, right?

00:32:02.160 | And vice versa, cryptographers can improve

00:32:04.160 | the algorithms and machine learning people can automatically inherit them, right?

00:32:06.960 | So kind of classic sort of open-source machine learning library,

00:32:10.280 | making complex intelligence more accessible to people, if that makes sense.

00:32:14.800 | And what we can do on tensors,

00:32:18.280 | we can also do on models.

00:32:19.240 | So we can do encrypted training,

00:32:20.800 | encrypted prediction, and we're going to get into,

00:32:23.280 | what kind of awesome use cases this opens up in a bit.

00:32:30.000 | And this is a nice set of features, right?

00:32:33.840 | In my opinion, this is, this is sort of

00:32:35.880 | the MVP of doing privacy-preserving data science, right?

00:32:39.520 | The idea being that I can have remote access to a remote dataset.

00:32:43.600 | I can learn high-level latent patterns like,

00:32:46.240 | like, you know, what causes cancer without learning whether individuals have cancer.

00:32:51.000 | I can pull back just,

00:32:53.760 | just that sort of high-level information with

00:32:55.760 | formal mathematical guarantees over, over,

00:32:59.320 | you know, what's sort of the filter that's coming back through here, right?

00:33:02.920 | And I can work with datasets from multiple different data owners while making

00:33:06.320 | sure that each, each individual data owners are protected.

00:33:09.120 | Now, what's the catch?

00:33:10.880 | Okay. So first, um,

00:33:15.200 | is computational complexity, right?

00:33:17.520 | So encrypted computation, secure MPC, um,

00:33:20.000 | this, this involves sending lots of information over, over the network.

00:33:23.040 | I think this is the state of the art for,

00:33:24.520 | for training, for deep learning prediction,

00:33:27.400 | is that this is a 13x slowdown over plaintext,

00:33:30.480 | which is inconvenient but not deadly, right?

00:33:32.960 | But you do have to understand that,

00:33:34.520 | that assumes like it's like two AWS machines

00:33:36.600 | were like talking to each other, you know, they're relatively fast.

00:33:39.000 | But we also haven't had any like hardware optimization to the extent that,

00:33:42.300 | that, you know, NVIDIA did a lot for deep learning,

00:33:44.840 | like, that there'll be, you know,

00:33:45.960 | probably like some sort of Cisco player that's similar for,

00:33:48.560 | for doing kind of encrypted or secure MPC-based deep learning, right?

00:33:53.040 | Um, let's see.

00:33:55.080 | So this brings us back to kind of the fundamental question,

00:33:57.400 | is it possible to answer questions using data we cannot see?

00:33:59.640 | Um, the theory is absolutely there.

00:34:01.600 | Um, that's, that's something that,

00:34:03.000 | that I feel reasonably confident saying,

00:34:04.720 | that like, like the, the,

00:34:05.920 | the sort of the theoretical frameworks that we have.

00:34:07.680 | And actually, the other thing that's really worth mentioning here is that these come

00:34:10.240 | from totally different fields which is why they kind of

00:34:12.400 | haven't been necessarily combined that much yet.

00:34:13.920 | I'll get, I'll get more into that in a second.

00:34:15.480 | Um, but it's, it's my hope that,

00:34:17.960 | that by sort of,

00:34:20.280 | by considering what these tools can do.

00:34:22.840 | That'll open up your eyes to the potential that in general,

00:34:25.600 | we can have this new ability to answer

00:34:27.200 | questions using information that we don't actually own ourselves.

00:34:29.880 | Um, because from a sociological standpoint,

00:34:32.760 | that's net new for like us as a species, if that makes sense.

00:34:37.360 | If ever previously we want,

00:34:39.120 | we want to just, we had to have, we had to have like a,

00:34:40.360 | a trusted third party who would then take all the information in

00:34:42.880 | themselves and, and make some sort of neutral decision, right?

00:34:46.280 | Um, so we'll come to that in a second.

00:34:49.040 | Um, and so one of the big sort of long-term goals of

00:34:51.760 | our community is to make

00:34:53.160 | infrastructure for this secure enough and robust enough.

00:34:55.440 | And of course in like a free Apache 2 open source license kind of way,

00:34:59.040 | um, that, you know,

00:35:01.440 | information on the world's most important problems will be this accessible, right?

00:35:05.360 | I'm gonna spend sort of less time working on,

00:35:08.680 | um, tasks like that and more time working on tasks like this.

00:35:12.480 | So, um, this is gonna be kind of the,

00:35:14.760 | the breaking point between sort of part one and part two.

00:35:17.040 | Um, part two will be a bit shorter.

00:35:18.640 | Um, but if you're interested in,

00:35:20.120 | in sort of diving deeper on the technicals of this, um,

00:35:22.240 | here's a, like a six or seven hour course that I

00:35:24.760 | taught just on these concepts and on the tools.

00:35:26.860 | It's free, uh, on Udacity.

00:35:28.600 | Feel free to check it out. Um, so the question was, um, uh,

00:35:33.040 | he's asking about how I, I,

00:35:34.160 | I specified that a model can be encrypted during training.

00:35:36.360 | Is that same as homomorphic encryption or is that something else?

00:35:38.600 | So, um, uh, a couple of years ago,

00:35:40.940 | there was a, a big burst in literature around training on encrypted data,

00:35:44.680 | um, where you would homomorphically encrypt the dataset.

00:35:46.800 | And it turned out that some of the statistical regularities

00:35:48.960 | of homomorphic encryption allowed you to actually train on

00:35:50.920 | that dataset without, um, without decrypting it.

00:35:54.720 | Um, so this is similar to that except, um,

00:35:59.440 | the one downside to that is that in order to use that model in the future,

00:36:04.040 | you have to still be able to encrypt data with the same key, um,

00:36:07.760 | which often is, is,

00:36:10.000 | is sort of constraining in practice and also there's a pretty big hit to privacy

00:36:12.680 | because you're, you're training on data that inherently has a lot of noise added to it.

00:36:16.000 | What I'm advocating for here, um,

00:36:19.280 | is instead we actually encrypt, um,

00:36:22.200 | both the model and the dataset, uh,

00:36:24.280 | during training but inside the encryption, inside the box, right?

00:36:28.320 | It's actually performing the same computations

00:36:30.000 | that it would be doing in plain text.

00:36:31.680 | So you don't get any degradation in accuracy, um,

00:36:33.880 | and you don't get tied to one particular public private key pair.

00:36:37.080 | Yeah, yeah, yeah. So, uh,

00:36:38.840 | specifically- so the question was can I comment on

00:36:40.480 | federated learning specifically Google's implementation?

00:36:42.640 | Um, so I think Google's implementation is, is great.

00:36:45.160 | So, um, and obviously the,

00:36:46.720 | the fact that they've shown that this can be done hundreds of

00:36:49.600 | millions of users is incredibly powerful.

00:36:51.320 | I mean, uh, and even inventing the term,

00:36:53.280 | um, uh, and creating momentum in that direction.

00:36:55.520 | Um, I think that there's, um,

00:36:57.520 | one thing that's worth mentioning is that there are two forms of federated learning.

00:37:00.960 | Uh, one is sort of the one where your model is- federated learning, sorry.

00:37:05.720 | Uh, ooh, gotta talk about what that is.

00:37:07.920 | Okay. Um, yes, I'll do that quickly.

00:37:10.800 | Um, so federated learning is,

00:37:13.000 | um, basically the first thing I talked about.

00:37:14.680 | So remote execution.

00:37:15.720 | So if, if everyone has a smartphone, um,

00:37:18.520 | when you plug your phone in at night,

00:37:19.800 | if you've got, you know, Android or iOS,

00:37:22.280 | you plug your phone in at night, attach to Wi-Fi.

00:37:24.120 | You know when you text and it recommends the next word,

00:37:26.880 | um, next word prediction?

00:37:28.520 | Um, that model is trained using federated learning.

00:37:31.320 | Um, meaning that it,

00:37:32.680 | it learns on your device to do that better,

00:37:35.080 | and then that model gets uploaded to the Cloud as opposed to

00:37:37.800 | uploading all of your tweets to the Cloud and training one global model.

00:37:40.120 | Does that make sense? So, so plug your phone at night,

00:37:42.880 | model comes down, trains locally, goes back up.

00:37:44.640 | It's federated, right? That's, that's,

00:37:46.160 | that's basically what federated learning is in a nutshell.

00:37:48.040 | And, and, um, uh,

00:37:49.600 | it was pioneered, uh, by the Quark team at Google.

00:37:51.960 | And, um, and they're,

00:37:54.000 | they're, they do really fantastic work.

00:37:55.600 | They've, they've paid down a lot of the technical debt,

00:37:57.520 | a lot of the, the, the risk,

00:37:59.280 | or technical risk around it.

00:38:00.640 | Um, and they publish really great papers

00:38:03.040 | outlining sort of how they do it, which is fantastic.

00:38:06.040 | Um, what I outlined here is

00:38:08.400 | actually a slightly different style of federated learning.

00:38:10.560 | Because there, there's federated learning with like a fixed dataset and a fixed model.

00:38:14.280 | Um, and lots of users where the,

00:38:16.040 | the data is very, um,

00:38:17.360 | ephemeral like phones are constantly logging in and logging off.

00:38:20.560 | Um, you know, you're, you're,

00:38:22.000 | you're plugging your phone at night and then you're taking it out, right?

00:38:24.400 | Um, um, this is sort of the,

00:38:26.720 | the, the one style of federated learning.

00:38:29.280 | It's, it's really useful for like product development, right?

00:38:31.520 | So it's useful for like if you want to do

00:38:33.640 | a smartphone app that has a piece of intelligence in it,

00:38:35.900 | but train that intelligence is going to be

00:38:37.760 | prohibitively difficult for you to get access to the data for,

00:38:40.320 | um, or you, you want to just have a value prop of protecting privacy, right?

00:38:43.460 | That's a federated learning, that's how federated learning is good for.

00:38:45.760 | What I've outlined here is a bit more exploratory federated learning,

00:38:49.140 | where it's saying, okay, instead of,

00:38:50.980 | instead of, um, the model being hosted in the Cloud and

00:38:53.820 | data owners showing up and making it a bit smarter every once in a while,

00:38:56.840 | now the data is going to be hosted at a variety of different private Clouds, right?

00:39:01.220 | And data scientists are going to show up and say, "Hmm,

00:39:03.100 | I want to do something with di- with diabetes today," or, "Hmm,

00:39:05.300 | I want to do something with,

00:39:06.340 | with, um, studying dementia today," something like that, right?

00:39:09.900 | This is much more difficult because the attack factors for

00:39:12.340 | this are much larger, right?

00:39:14.300 | I'm trying to be able to answer arbitrary questions about

00:39:17.300 | arbitrary datasets, um, in, in a protected environment, right?

00:39:21.280 | So I think, um, yeah, that's,

00:39:22.620 | that's kind of my, my, my general thoughts on.

00:39:25.220 | Does federated learning leak any information?

00:39:27.180 | So federated learning by itself is not a secure protocol, right?

00:39:30.100 | To the, the, to the extent that, um,

00:39:32.340 | and that's why I sort of this ensemble of techniques that I've- so

00:39:35.860 | the question was does federated learning leak information?

00:39:38.260 | Um, so it is perfectly possible for

00:39:40.340 | a federated learning model to simply memorize the dataset,

00:39:42.620 | uh, and then spit that back out later.

00:39:44.100 | You have to combine it with something like differential privacy

00:39:46.260 | in order to be able to prevent that from happening. Does that make sense?

00:39:48.940 | Um, so just, just because the,

00:39:50.580 | the training is happening on my device does not mean it's not memorizing my data.

00:39:53.060 | Does that, does that make sense? Okay.

00:39:55.660 | So now I want to zoom out and,

00:39:57.300 | and go a little less from the kind of

00:39:58.460 | the data science practitioner perspective.

00:40:00.540 | And now take more the perspective of like a,

00:40:02.860 | an economist or political scientist or some,

00:40:05.100 | someone looking kind of globally at like, okay,

00:40:07.300 | what- if, if this becomes mature, what happens?

00:40:10.020 | Right? And, and this is where it gets really exciting.

00:40:12.100 | Anyone entrepreneurial? Anyone? Everyone?

00:40:15.420 | I don't know. No one? Okay.

00:40:17.220 | Cool. Well, this is, this is the, this is the part for you.

00:40:19.700 | So, um, the big difference

00:40:23.820 | is this ability to answer questions using data you can't see.

00:40:26.700 | Because as it turns out,

00:40:28.820 | most people spend a great deal of their life just answering questions,

00:40:33.380 | and a lot of it is involving sort of personal data.

00:40:35.500 | I mean, whether it's minute things like, you know,

00:40:37.980 | where's my water, where are my keys,

00:40:39.740 | or you know, um,

00:40:41.420 | what movie should I watch tonight,

00:40:42.780 | or, or, um, you know,

00:40:46.460 | what kind of diet should I have to,

00:40:49.100 | to be able to sleep well, right?

00:40:50.700 | I mean, a, a wide variety of different questions, right?

00:40:53.220 | And, and we're limited in

00:40:55.380 | our answering ability to the information that we have, right?

00:40:58.860 | So this ability to answer questions using data we don't have,

00:41:01.540 | sociologically I think is quite, quite important.

00:41:04.100 | Um, and, um, there's four different areas that I want to

00:41:08.180 | highlight as like big groups of use cases for this kind of technology,

00:41:13.780 | um, to help kind of inspire you to see where this infrastructure can go.

00:41:16.540 | And actually before, before I jump into that,

00:41:18.300 | um, has anyone been to Edinburgh, Edinburgh?

00:41:21.220 | Cool. Uh, just to like the castle and stuff like that.

00:41:24.580 | Um, so my wife and I, um,

00:41:26.660 | this is my wife, Amber, um,

00:41:28.380 | we went to Edinburgh for the first time,

00:41:30.060 | um, six months ago?

00:41:32.620 | September? September. Um, and, uh,

00:41:36.500 | we did the underground, was it the-

00:41:40.660 | We did a ghost tour. Yeah, yeah, yeah.

00:41:42.660 | We did the ghost tour and, um, that was really cool.

00:41:45.460 | Um, [LAUGHTER] there was one thing that took away from it.

00:41:48.500 | There was this point we were standing, um,

00:41:50.740 | we just walked out of the tunnels and she was pointing up some of the architecture.

00:41:54.580 | Um, and, uh, then, uh,

00:41:58.660 | she started talking about, um,

00:42:02.220 | basically the cobblestone streets and why the cobblestone streets are there.

00:42:07.300 | Cobblestone streets, one of the main purposes of them was to sort of lift you out of the muck.

00:42:11.740 | And the reason there was muck was there is that they didn't have

00:42:13.900 | any in- internal plumbing and so the sewage just poured out into the street,

00:42:17.140 | right? Because you live in a big city.

00:42:18.980 | Um, and this was the norm everywhere, right?

00:42:21.620 | And actually, I think she even sort of implied that like the invention or

00:42:24.100 | popularization of the umbrella had less to do with actual rain,

00:42:26.860 | a bit more would do with buckets of stuff coming down from on high,

00:42:30.820 | um, which is, uh, uh,

00:42:33.300 | it's a whole different world like when you think about what that is.

00:42:36.180 | Um, but the, the reason that I bring this up, um,

00:42:39.620 | is that, you know,

00:42:42.180 | however many hundred years ago,

00:42:44.220 | people were, were walking through,

00:42:46.900 | you know, like sludge,

00:42:49.660 | sewage was just everywhere, right?

00:42:51.180 | It was all over the place and people were walking through it everywhere they go,

00:42:54.680 | and they were wondering why they got sick, right?

00:42:57.900 | And in many cases,

00:42:59.940 | and it wasn't because they wanted it to be that way,

00:43:02.140 | it's just because it was a natural consequence

00:43:03.700 | of the technology that they had at the time, right?

00:43:05.580 | This is not malice, this is not anyone being good or bad or,

00:43:08.940 | or evil or whatever, it's just,

00:43:10.280 | it's just the way things were.

00:43:11.900 | Um, and I think that there's a strong analogy to be made with,

00:43:17.800 | with kind of how our data is handled as society at the moment, right?

00:43:21.720 | We've just sort of walked into a society,

00:43:23.860 | we've had new inventions come up and new things that are practical,

00:43:25.840 | new uses for it, and now everywhere we go,

00:43:28.540 | we're constantly spreading and spewing our data all over the place, right?

00:43:33.100 | I mean, every, every camera that sees me walking down the street, you know,

00:43:36.660 | goodness, there's a, there's a company that takes

00:43:38.300 | a whole picture of the Earth by satellite every day.

00:43:40.580 | Like, how the hell am I supposed to do anything without,

00:43:43.580 | without, you know, everyone following me around all the time, right?

00:43:46.940 | And, um, I imagine that, um,

00:43:52.300 | whoever it was, I'm not a historian,

00:43:54.940 | so I don't really know, but whoever it was that said,

00:43:57.620 | "What if, what if we ran plumbing from every single apartment,

00:44:03.140 | business, school, maybe even some public toilets,

00:44:06.540 | underground, under our city,

00:44:08.720 | all to one location and then processed it,

00:44:10.820 | used chemical treatments, and then turned that into usable drinking water?"

00:44:14.500 | Like, how laughable would that have been?

00:44:16.540 | Would have been just the,

00:44:17.860 | the most massive logistical infrastructure problem

00:44:21.580 | ever to take a working city,

00:44:23.340 | dig up the whole thing,

00:44:24.900 | to take already, already constructed buildings,

00:44:27.700 | and run pipes through all of them.

00:44:29.460 | I mean, uh, so, so Oxford, uh, gosh.

00:44:32.340 | Um, I, there's a building there that's, um,

00:44:34.500 | so old they don't have showers because they didn't want to run the plumbing for the head.

00:44:37.700 | You have to ladle water over yourself.

00:44:39.100 | It's in, uh, Merton College.

00:44:40.140 | It's quite, quite famous, right?

00:44:41.500 | I mean, the, the, the infrastructure,

00:44:43.420 | anyway, the infrastructure challenges,

00:44:45.220 | um, it just must have seen absolutely massive.

00:44:49.100 | And so, as I'm about to walk through kind of like

00:44:51.620 | four broad areas where things could be different,

00:44:54.260 | theoretically, based on this technology,

00:44:55.740 | and I think it's probably going to hit you like,

00:44:57.740 | "Whoa, that's a lot of code."

00:44:59.420 | [LAUGHTER]

00:45:00.140 | Or like, "Whoa, that's, that's a lot of change."

00:45:03.380 | Um, but, but I think that the,

00:45:05.780 | the need is sufficiently great.

00:45:08.340 | I think that, that, I mean,

00:45:10.460 | if you view our lives as just one long process of answering important questions,

00:45:15.100 | whether it's where we're going to get food or what causes cancer,

00:45:17.820 | like making sure that, that the right people can answer questions without,

00:45:20.980 | without, you know, data just getting spewed everywhere so that

00:45:23.620 | the wrong people can answer their questions, right, is important.

00:45:27.180 | And, um, yeah, anyway,

00:45:30.220 | so I know this is going to sound like there's

00:45:32.980 | a certain ridiculousness to, to, to,

00:45:34.580 | to maybe what some of this will be.

00:45:35.940 | But I, I hope that, that you will at least see that,

00:45:38.420 | that theoretically, like the,

00:45:39.580 | the basic blocks are there.

00:45:41.460 | And, and that really what stands between us and a world that's fundamentally

00:45:45.180 | different is, is adoption,

00:45:47.620 | maturing of the technology, and, and, and, and good engineering.

00:45:50.700 | Um, because I think, you know,

00:45:52.420 | once, you know, Sir Thomas Crapper invented the toilet, right?

00:45:55.700 | I do remember that one. Um, um,

00:45:58.220 | at, at that point, the, the basics were there, right?

00:46:01.300 | And, and what stood between them was,

00:46:02.980 | was implementation, adoption, and engineering, right?

00:46:05.980 | And I, I think that that's, that's where we are.

00:46:08.500 | And, and the best part is we have, you know,

00:46:10.540 | companies like Google that have already,

00:46:12.060 | already paved the way with some very,

00:46:14.100 | very large rollouts of,

00:46:16.460 | of the, the early pieces of this technology, right?

00:46:19.340 | Cool. So what are the, what are the big categories?

00:46:22.980 | One I've already talked about, open data for science.

00:46:28.100 | Okay. So this one is a really big deal.

00:46:38.820 | And the reason it's a really big deal is mostly

00:46:41.420 | because everyone gets excited about making AI progress, right?

00:46:45.900 | Everyone gets super excited about superhuman ability in X, Y, Z.

00:46:49.860 | Um, when I started my PhD at Oxford,

00:46:51.940 | I, I worked for, my professor's name is, uh, Phil Bluntsum.

00:46:54.660 | The first thing he told me when I sat my butt down in his office in my first day as a student,

00:46:58.140 | he said, "Andrew, everyone's gonna want to work on models.

00:47:00.060 | But if you look historically,

00:47:01.620 | the biggest jumps in progress have happened when we had

00:47:03.980 | new big datasets or the ability to process new big datasets."

00:47:08.740 | And just to give a few anecdotes,

00:47:10.700 | ImageNet, right?

00:47:12.060 | ImageNet, GPUs allowing us to process larger datasets.

00:47:16.140 | Um, even, even things like AlphaGo.

00:47:18.860 | This is synthetically generated infinite datasets.

00:47:21.260 | Or, or, or if, I don't know, did you guys,

00:47:22.740 | anyone watch the, um,

00:47:23.860 | the AlphaStar, uh, live stream on YouTube?

00:47:26.740 | It talked about how it had trained on like 200 years of,

00:47:29.700 | of like, uh, of StarCraft, right?

00:47:32.980 | Um, or if you look at, um, um,

00:47:35.620 | Watson, the playing, playing Jeopardy, right?

00:47:38.620 | Um, this, this was on the heels of,

00:47:40.420 | of a new large structured dataset based on Wikipedia.

00:47:43.820 | Or if you look at, um,

00:47:45.900 | um, Garry Kasparov and IBM's Deep Blue.

00:47:48.980 | This was on the heels of the largest open dataset of chess,

00:47:53.260 | um, matches haven't been published online, right?

00:47:55.820 | There's this, there's this echo where like big new dataset,

00:47:58.380 | big, big new breakthrough,

00:47:59.660 | big new dataset, big new breakthrough, right?

00:48:01.500 | And what we're talking about here is,

00:48:04.180 | is potentially, you know, several orders of magnitude,

00:48:07.140 | more data relatively quickly.

00:48:08.820 | And the reason for that is that,

00:48:10.460 | that we're not, I'm not saying we're gonna invent a new machine,

00:48:13.940 | and that machine is gonna collect this,

00:48:15.260 | and then it's gonna go online.

00:48:16.140 | I'm saying there's thousands and thousands of enterprises,

00:48:18.960 | millions of smartphones,

00:48:20.420 | there's, there's, uh,

00:48:21.940 | and hundreds of, of governments that are all already have

00:48:24.340 | this data sitting inside of data warehouses, right?

00:48:27.220 | Largely untapped for two reasons.

00:48:29.340 | One, legal risk, and two,

00:48:31.540 | commercial viability, right?

00:48:33.420 | If I give you a dataset,

00:48:34.740 | all of a sudden I just doubled the supply, right?

00:48:36.900 | What does that do to my billing ability?

00:48:39.220 | And there's the legal risk that you might do

00:48:41.740 | something bad with it that comes back to hurt me.

00:48:44.860 | With this category, I know it's like just one phrase,

00:48:47.620 | but, but this is like ImageNet,

00:48:50.640 | but for every data task that's already been established, right?

00:48:57.900 | This is us, like, I mean, I, I,

00:49:00.140 | we're working with a professor at Oxford in

00:49:01.820 | the psychology department who wants to study dementia, right?

00:49:04.380 | He is, the problem with dementia,

00:49:06.660 | is, is every hospital has like five cases, right?

00:49:10.060 | It's not like a very centralized disease,

00:49:11.540 | it's not like all the, all the cancer patients go to,

00:49:14.060 | you know, one big center and like,

00:49:15.620 | it's where all the technology is.

00:49:16.740 | Like dementia, um, it's,

00:49:18.260 | it's, it's sprinkled everywhere.

00:49:20.940 | And so the big thing that's blocking him as

00:49:23.300 | a dementia researcher is access to data.

00:49:25.220 | And so he's investing in private data science platforms.

00:49:28.340 | And I didn't persuade him to, I,

00:49:29.860 | I found him after he was,

00:49:31.020 | he was already looking to do that.

00:49:32.700 | Um, but, but pick,

00:49:34.220 | pick any challenge that, that,

00:49:35.820 | where data is already being collected and,

00:49:37.740 | and this can unlock not larger amounts of data that exists,

00:49:40.980 | but larger amounts of data that can be,

00:49:42.820 | can be used together. Does that make sense?

00:49:45.380 | This is like a thousand startups right here.

00:49:47.980 | Whereas instead of going out and trying to buy as many datasets as you can,

00:49:51.540 | which is a really hard and really expensive task.

00:49:53.860 | Talk to anyone who's in Silicon Valley right now,

00:49:55.620 | trying to do a data science startup, right?

00:49:57.100 | Instead, you go to each individual person that has a dataset and you say,

00:50:00.660 | "Hey, let me create a gateway between you and

00:50:04.420 | the rest of the world that's gonna keep your data safe and allow people to leverage it."

00:50:07.340 | Right? That's like repeatable business model.

00:50:12.580 | Pick a use case, right?

00:50:14.220 | Be, be the radiology network gatekeeper, right?

00:50:18.420 | Um, okay. So enough on that one.

00:50:21.820 | But like, does it make sense how like on a huge variety of tasks,

00:50:25.740 | just the ability to have a,

00:50:26.940 | a data box silo that you can do data science against,

00:50:30.360 | is gonna increase the accuracy of

00:50:32.060 | a huge variety of models really, really, really quickly.

00:50:34.980 | Cool? All right. Second one.

00:50:39.260 | Oh, that's not right. Single use accountability.

00:50:55.780 | Um, this one's a little bit tricky.

00:51:02.960 | Um, get to the airport and you get your bag checked, right?

00:51:08.120 | Everyone's familiar with this process, I assume.

00:51:10.520 | What happens? Someone's sitting at a monitor,

00:51:13.680 | and they see all the objects in your bag.

00:51:16.960 | So that occasionally, they can spot objects that are dangerous or illicit, right?

00:51:24.320 | There's a lot of extra information leakage,

00:51:26.900 | or to the fact that they have,

00:51:27.980 | that they have to sit and look at thousands of,

00:51:30.460 | of all of the objects, you know,

00:51:32.020 | basically searching every single person's bag totally and completely,

00:51:34.960 | just so that occasionally, they can find that one.

00:51:37.460 | Answer that, the question they actually want to answer is,

00:51:39.880 | is there anything dangerous in this bag?

00:51:42.300 | But in order to answer it,

00:51:44.060 | they have to basically acquire access to the whole bag, right?

00:51:48.620 | So let's, let's, let's think about

00:51:52.260 | the same approach of answering questions using data we can't see.

00:51:55.860 | The best example of this in the analog world is a sniffing dog.

00:51:59.880 | Familiar with like sniffing dogs,

00:52:01.220 | so give your bag a whiff at the airport, right?

00:52:03.640 | This is actually a really privacy-preserving thing,

00:52:06.300 | because dogs don't speak English or any other language.

00:52:09.860 | Um, and so the benefit is,

00:52:12.160 | the dog comes by, "Nope, everything's fine," moves on.

00:52:14.600 | The dog has the ability to only reveal one bit of

00:52:19.940 | information without you having to search every single bag.

00:52:24.300 | Okay? That is what I mean when I say a single-use accountability system.

00:52:30.600 | It means I am looking at

00:52:32.860 | some data stream because I'm holding someone accountable, right?

00:52:37.060 | And we want to make it so that I can only answer

00:52:39.380 | the question that I claim to be looking into.

00:52:42.020 | So if this is a video feed, right, for example, right?

00:52:44.980 | Instead of getting access to the raw video feed,

00:52:47.220 | and, and you know, the millions of bits of

00:52:48.940 | information every single person in the frame of view,

00:52:51.620 | walking around doing whatever, which I could use for,

00:52:54.220 | you know, even if I'm a good person,

00:52:55.820 | I technically could use for, for other purposes.

00:52:58.340 | But instead, build a system where I build,

00:53:02.180 | say, a machine learning classifier, right?

00:53:04.560 | That is an auditable piece of technology,

00:53:07.820 | that looks for whatever I'm supposed to be looking for, right?

00:53:10.660 | And I only see frames,

00:53:12.780 | you know, I only open up bags that actually have to.

00:53:16.300 | Okay. This does two things.

00:53:20.460 | One, it makes all of

00:53:25.220 | our accountability systems more privacy-preserving, which is great.

00:53:27.340 | Mitigates any potential dual or multi-use, right?

00:53:31.980 | And two, it means that

00:53:38.700 | some things that were simply too off-limits for us to,

00:53:42.980 | to properly hold people accountable might be possible, right?

00:53:46.500 | One of the things that was really challenging,

00:53:48.580 | so we used to do email surveillance, digital reasoning, right?

00:53:52.540 | And, and it was basically help

00:53:54.260 | investment banks find insider traders, right?

00:53:56.420 | Because they want to help enforce the laws,

00:53:57.940 | they, you know, they get fine billion dollar fines if,

00:54:00.100 | if, if anyone cause an infraction.

00:54:03.500 | But one of the things that was really difficult about developing

00:54:05.220 | these kinds of systems was that it's so sensitive, right?

00:54:10.180 | We're talking about, you know,

00:54:11.820 | hundreds of millions of emails at some massive investment bank.

00:54:14.660 | There's so much private information in there that say,

00:54:17.420 | none of our data scientists,

00:54:19.420 | barely any of them were able to actually

00:54:21.300 | work with the data and try to make it better, right?

00:54:24.220 | And, and, and this, this,

00:54:26.060 | yeah, this makes it really, really difficult.

00:54:27.420 | Anyway, cool. So enough on that.

00:54:29.060 | Third one, and this is the one I think is just incredibly exciting.

00:54:32.700 | End-to-end encrypted services.

00:54:35.820 | What's up? Everyone familiar with WhatsApp, Telegram, any of these?

00:54:51.500 | These are messaging apps, right?

00:54:53.540 | Where a message is encrypted on your phone,

00:54:57.100 | and it's sent directly to someone else's phone,

00:54:59.780 | and only that person's phone can decrypt it, right?

00:55:02.540 | Which means that someone can provide a service, you know,

00:55:05.700 | messaging without the service provider seeing any of

00:55:08.500 | the information that they're actually providing the service over, right?

00:55:12.340 | Very powerful idea. What if the intuition here is that,

00:55:20.220 | with a combination of machine learning,

00:55:21.660 | encrypted computation, and differential privacy,

00:55:24.460 | that we could do the same thing for entire services.

00:55:27.180 | So imagine going to the doctor, okay?

00:55:28.980 | So you go to the doctor.

00:55:30.220 | This is really a computation between two different datasets.

00:55:32.980 | On the one hand, you have dataset that the doctor has,

00:55:36.220 | which is their, you know,

00:55:38.420 | medical background, their knowledge of,

00:55:40.220 | of, of different procedures, and diseases,

00:55:42.780 | and tests, and all this kind of stuff.

00:55:44.860 | And then you have your dataset,

00:55:46.180 | which is your symptoms,

00:55:47.880 | your, your medical history, um, you know,

00:55:50.780 | your recent things that you've eaten, um,

00:55:53.100 | your, your genes, your genetic predisposition,

00:55:55.100 | your heritage, those kinds of things, right?

00:55:56.780 | And you're bringing these two datasets together,

00:55:58.900 | to compute a function.

00:56:00.540 | And that function is, what,

00:56:02.500 | what, what treatment should you have, if any?

00:56:05.860 | Okay? And the idea here is that,

00:56:10.820 | so there's this new, um,

00:56:13.380 | this new field called structured transparency,

00:56:15.220 | I guess I should probably mention.

00:56:17.100 | I'm not sure, I'm not even sure you can call it a new field yet,

00:56:25.180 | because it's not in the literature, but it's been

00:56:26.660 | bouncing around a few different circles.

00:56:28.620 | And the, um, and it's,

00:56:31.980 | it's, uh, F, X,

00:56:37.780 | Y, I'm not very good with chalk, sorry.

00:56:41.180 | Um, and then this is Z. Okay.

00:56:48.660 | So this, two different people providing their data together,

00:56:52.940 | computing a function, and an output.

00:56:54.820 | So, um, um, so differential privacy protects the output,

00:57:02.500 | encrypted computation, so like MPC,

00:57:04.860 | which we talked about earlier, protects the input, right?

00:57:10.340 | So it allows them to, to, to compute F of X of Y,

00:57:13.660 | right, without revealing their inputs. Remember this?

00:57:16.300 | So basically, encrypt Y,

00:57:17.940 | encrypt X, compute the function while it's encrypted.

00:57:20.260 | Do, do we remember, do we remember this?

00:57:22.100 | Right? And so there's, there's three processes here, right?

00:57:25.020 | There's input privacy, which is MPC,

00:57:26.980 | there's logic, and then there's output privacy.

00:57:32.060 | And this is what you need to be able to do end-to-end encrypted services.

00:57:39.580 | Okay. So imagine, imagine, um,

00:57:42.140 | so there, there are machine learning models that can now do,

00:57:44.580 | um, skin cancer prediction, right?

00:57:46.260 | So I can take a picture of my, of my arm,

00:57:48.040 | and send it through machine, machine learning model,

00:57:50.100 | and it'll predict whether or not I have melanoma on my arm, right?

00:57:53.020 | Okay. So in this case,

00:57:56.940 | machine learning model, perhaps owned by a hospital or a startup,

00:58:02.140 | image of my arm, okay?

00:58:05.380 | Encrypt both, the logic is done by the machine learning model.

00:58:09.940 | The prediction, if it's gonna be published to the output,

00:58:14.380 | to the, to the rest of the world,

00:58:15.620 | you use differential privacy, but in this case,

00:58:17.180 | the prediction can come back to me,

00:58:19.580 | and only I see the decrypted result, okay?

00:58:23.780 | The implication being that the, the,

00:58:26.100 | the doctor role facilitated by machine learning can classify whether or not I have cancer,

00:58:31.700 | can provide this service without anyone seeing my medical information.

00:58:35.260 | I can go to the doctor and get a prognosis without ever revealing

00:58:39.220 | my medical records to anyone including the doctor, right?

00:58:43.480 | Does that make sense? And if you believe,

00:58:47.660 | if you believe that sort of the services that are repeatable,

00:58:51.860 | that we do for millions and millions of people, right?

00:58:54.100 | Can create a training dataset that we can then train a classifier to do,

00:58:58.540 | then we should be able to upgrade it to be end-to-end encrypted.

00:59:02.940 | Does that make sense? So again,

00:59:05.140 | it's kind of, it's kind of big.

00:59:06.860 | It assumes that, that AI is smart enough to do it.

00:59:09.740 | There's lots of questions around quality,

00:59:12.900 | and like quality assurance, and all these kinds of things,

00:59:15.260 | that have to be addressed.

00:59:17.020 | There's very likely to be different institutions that we need.

00:59:19.540 | But I hope that at least these three sort of big categories,

00:59:22.340 | this is by no means comprehensive,

00:59:23.940 | but I hope at least these three big categories will be sort of

00:59:26.540 | sufficient for helping sort of lay the groundwork for how sort of

00:59:29.980 | each person could be empowered with

00:59:31.740 | sole control over the only copies of their information,

00:59:34.140 | while still receiving the same goods and services they've become accustomed to.

00:59:38.860 | Cool. Thanks. Questions. Let's do it.

00:59:43.100 | >> First, please give Andrew a big hand.

00:59:47.700 | Andrew, it was fascinating, really, really fascinating.

00:59:57.780 | Amazing, amazing set of ideas,

01:00:00.100 | and hope for this can really get rid of the sewage of data.

01:00:04.500 | This vision of end-to-end encrypted services,

01:00:10.540 | if I understand correctly,

01:00:11.700 | the algorithm would also run on two or more services,

01:00:15.780 | and the skin image would go to them,

01:00:18.060 | and then you would get the diagnosis.

01:00:21.180 | But the diagnosis itself is not private though,

01:00:24.020 | because the output of that is being revealed to the service provider.

01:00:29.780 | >> So it could optionally be revealed to the service provider.

01:00:32.580 | So in this case, oh yeah, something I didn't say.

01:00:34.380 | For a secure NPC for encrypted computation,

01:00:36.420 | except with some exceptions.

01:00:38.580 | But for secure NPC,

01:00:39.980 | when you perform computation between two encrypted numbers,

01:00:43.260 | the result is encrypted between the same shareholders, if that makes sense.

01:00:46.700 | Meaning that by default,

01:00:48.420 | Z is still encrypted with the same keys as X and Y,

01:00:51.900 | and then it's up to the key holders to decide who they want to decrypt it for.

01:00:55.120 | So they could decrypt it for the general public,

01:00:56.740 | in which case they should apply differential privacy.

01:00:58.720 | They could decrypt it for the input owner,

01:01:02.100 | in which case the input owner is not going to hurt anybody else by him knowing

01:01:07.040 | whether he has a certain diagnosis,

01:01:09.900 | or it could be decrypted for the model owner,

01:01:14.900 | perhaps allow them to do more training or some other arbitrary use case.

01:01:18.700 | So it can be, but not as a strict requirement.

01:01:22.660 | >> Just to be sure, if Z is being computed by say two parties,

01:01:27.580 | to send Z back to Y in this case,

01:01:32.380 | the machine knows what Z is.

01:01:34.820 | So in that sense, even if you encrypt Z with the key of Y,

01:01:40.600 | there's no way to protect the output itself.

01:01:44.680 | >> I haven't described this correctly.

01:01:47.360 | So when we perform the encrypted computation,

01:01:54.360 | we split this into shares.

01:01:55.640 | So we'll say Y1 and Y2.

01:02:01.680 | Right? Y2 goes up here, right?

01:02:04.760 | And then this populates,

01:02:06.960 | what actually happens is this creates Z1 and Z2 at the end, right?

01:02:13.600 | Which is still, which is still owned by,

01:02:15.960 | you know, person Y. So we'll say this is Alice and this is Bob.

01:02:22.680 | Right? So we have Bob's share and Alice's share.

01:02:26.320 | What gets populated is, is shares of Z.

01:02:29.160 | So if Alice or if Bob sends his share of Z down to Alice,

01:02:33.560 | only Alice can decrypt the result.

01:02:35.680 | So does that make more sense? Okay, cool.

01:02:37.960 | >> So even the answer is-

01:02:39.680 | >> Even the answer is protected.

01:02:41.680 | Yeah. And you would only need to use

01:02:43.080 | differential privacy in the case you're planning to

01:02:45.340 | decrypt the result for some unknown audience to be able to see.

01:02:50.200 | >> Some models are biased based on real data biases,

01:02:55.200 | and society tries to make unbiased models,

01:02:59.600 | like on gender, race, and so on.

01:03:01.800 | How does it work with privacy,

01:03:04.120 | especially when everything is encrypted?

01:03:06.040 | So how can you unbiased models

01:03:08.920 | when you do not see biases in the data and so on?

01:03:11.960 | >> That's a great question. So the first,

01:03:15.400 | the first gimme for that is that people don't

01:03:18.820 | ever really de-bias a model by physically reading the weights.

01:03:21.480 | Right? So the fact that the weights are encrypted

01:03:23.600 | doesn't necessarily help or hurt you.

01:03:25.800 | So really what it's about is just making sure that you provision

01:03:29.640 | enough of your privacy budget to allow you to do

01:03:31.280 | the introspection that you need to be able to

01:03:33.000 | measure and adjust for bias.

01:03:34.760 | So I think that's, is that sufficient?

01:03:37.400 | >> Yeah.

01:03:37.960 | >> Cool. Awesome. Great question then.

01:03:40.760 | >> How far away do you think we are from organizations like

01:03:44.880 | the FDA requiring differential privacy to be used

01:03:47.800 | in regulating medical algorithms?

01:03:52.360 | >> So I think the best answer I can give to that,

01:03:57.560 | so one, I don't know. And even laws in

01:04:01.120 | the UK regarding privacy GDPR are not

01:04:03.600 | prescriptive about things like differential privacy.

01:04:06.360 | But I think the best and most relevant data point I have for you on

01:04:09.440 | that is that the US Census this year is going to be

01:04:12.720 | protecting the census data,

01:04:14.000 | the 2020 census data using differential privacy.

01:04:16.560 | And some of the leading work on actually

01:04:19.640 | applying differential privacy in the world is going on at

01:04:21.320 | the US Census and I'm sure they'd be

01:04:23.840 | interested in more helpers

01:04:25.840 | if anyone was interested in joining them.

01:04:27.840 | >> So I guess her question was kind of one of my questions,

01:04:32.880 | but it was more just like how much buy-in are you getting in terms of

01:04:36.160 | adoption for open mind or any of,

01:04:40.880 | like, do you have like any hospitals that are like participating or?

01:04:45.760 | >> Yeah. So actually there's a few things

01:04:47.440 | I probably should have mentioned.

01:04:49.040 | So the, so open mind is about two and a half years old.

01:04:55.600 | In the very beginning, we had very little buy-in.

01:04:58.680 | Because it was just so early, it was kind of like,

01:05:00.880 | who cares about privacy?

01:05:02.240 | No one's ever going to sort of really, really care about that.

01:05:04.760 | Post GDPR, total, total change, right?

01:05:08.480 | Everyone's scrambling to protect the data.

01:05:11.080 | But the truth is, it's not just privacy,

01:05:12.920 | it's also commercial usability.

01:05:14.600 | Right now, if you're selling data,

01:05:16.120 | every time you sell it,

01:05:17.440 | you lower the price because you increase the supply and you

01:05:19.520 | increase the number of people that are also selling it.

01:05:22.000 | So I think that there's people also waking up to kind of

01:05:24.200 | the commercial reasons for protecting

01:05:26.800 | your own datasets and protecting the unique statistical signal that they have.

01:05:31.720 | It's also worth mentioning, so the PyTorch team recently

01:05:35.760 | sponsored $250,000 in open-source grants to fund

01:05:39.280 | people to work on our PySyth library, which is really good.

01:05:43.200 | We're hoping to announce sort of

01:05:45.000 | more grants of similar size later in the year.

01:05:47.280 | So if you guys like working on

01:05:49.120 | open-source code and like to get paid to do so,

01:05:52.880 | to that extent, that's sort of a big vote

01:05:57.640 | and buy-in as far as our community is concerned.

01:06:00.480 | So this year is when I hope to see kind of the first pilots rolling out.

01:06:05.720 | There are some that are sort of in the works,

01:06:08.200 | but they're not public yet.

01:06:10.800 | But yeah, so I think basically this is the year for like pilots.

01:06:14.480 | I think it's about as far as we are.

01:06:16.680 | >> And then I have another question that's kind of on

01:06:19.240 | the opposite end of the spectrum that's a little more technical weeds.

01:06:22.440 | >> Cool.

01:06:23.080 | >> So when you're doing the encryption where

01:06:28.000 | you separate everything into each of the different owners,

01:06:31.600 | how does that work for non-linear functions?

01:06:34.120 | Because you need that linearity to add it back and for it to maintain the-

01:06:40.960 | >> Totally. So the non-linear functions are

01:06:44.280 | the most performance intensive.

01:06:47.840 | So you get the biggest performance hit when you have to do them.

01:06:51.760 | The- for deep learning specifically,

01:06:56.600 | there's kind of two trends.

01:06:58.440 | So one line of research is around using polynomial approximations.

01:07:03.160 | And then the other line is around doing sort of discrete comparison functions.

01:07:08.520 | So which is good for ReLUs and it's good for

01:07:10.560 | lopping off the ends of your polynomials so

01:07:12.400 | that your unstable tails can be flat.

01:07:15.320 | And I would say that's about that.

01:07:18.520 | And then like the science of kind of like trying to relax

01:07:20.600 | your security assumptions strategically here and

01:07:22.280 | there to get more performance is about where we're at.

01:07:24.520 | But as far as the one thing is worth mentioning though is that,

01:07:27.680 | there are kind of what I described was

01:07:31.840 | SecureNPC sort of on integers and fixed precision numbers.

01:07:38.960 | You can also do it on sort of binary,

01:07:40.880 | but in that sense you get a huge performance ever doing it with binary,

01:07:43.920 | but you also get the ability to do things sort of more classically with computing.

01:07:47.120 | Encrypted computation is sort of like doing computing in the 70s.

01:07:49.840 | Like you get a lot of the same kind of constraints.

01:07:53.000 | >> Thank you very much for your talk, Andrew.

01:07:55.520 | >> Yeah.

01:07:55.880 | >> I'm wondering about your objective to ultimately

01:08:00.640 | allow every individual to assign a privacy budget.

01:08:05.720 | You mentioned that it would take a lot of work to provide

01:08:10.720 | the infrastructure for that to be possible.

01:08:14.720 | Do you have an idea for what kind of infrastructure is necessary?

01:08:17.840 | And also when people are reluctant, even perhaps lazy,

01:08:22.960 | and they don't really care and they don't want their data to be protected.

01:08:27.240 | >> Yeah.

01:08:27.740 | >> I guess it takes some training, but yeah,

01:08:31.600 | what are your thoughts on building that infrastructure?

01:08:34.880 | >> I think it's going to come in waves.

01:08:37.280 | It's the kind of thing where people don't usually invest money and

01:08:39.440 | time and resources into things that aren't like a straight shot to value.

01:08:42.560 | So I think there's probably going to be multiple individual discrete jumps.

01:08:46.040 | The first one is going to be just enterprise adoption.

01:08:48.480 | Enterprises are the ones that already have all the data.

01:08:50.360 | So they're the ones who are most natural to

01:08:51.840 | start adopting privacy-preserving technologies.

01:08:54.080 | I think that that adoption is going to be driven

01:08:56.200 | primarily by commercial reasons for commercial reasons,

01:08:59.160 | meaning my data is inherently more valuable if I can

01:09:01.360 | keep it scarce while allowing people to answer questions with it.

01:09:03.800 | Does that make sense? So it's more profitable for me to not send copies of

01:09:08.480 | my data to people if I can actually have them bring

01:09:10.280 | their question answering mechanisms to

01:09:11.640 | me and just get their questions answered. Does that make sense?

01:09:14.400 | That's not a privacy narrative,

01:09:15.940 | but I think that that narrative is going to mature

01:09:17.280 | privacy technology quite quickly.

01:09:20.320 | Post enterprise adoption, I think that's when,

01:09:30.480 | and encrypted services are still really hard at this point.

01:09:33.960 | The reason for that is that they require

01:09:35.640 | lots of compute and lots of network overhead,

01:09:37.600 | which means that you probably want to have something in the Cloud.

01:09:42.320 | Some machine that you can control in

01:09:44.320 | the Cloud or have the Internet get a lot faster.

01:09:47.120 | But there's this question of how do we actually get to a world

01:09:51.920 | where each individual person knows or has

01:09:55.920 | notional control over their own personal privacy budget.

01:10:00.040 | Let's just say you had

01:10:04.520 | perfect enterprise adoption and everyone's

01:10:06.360 | tracking their stuff with differential privacy.

01:10:08.360 | The piece that you're actually missing here is just some communication between

01:10:13.320 | all the different enterprises that are joining up and making,

01:10:16.200 | it's just an accounting mechanism.

01:10:17.840 | It's a lot like the IRS.

01:10:20.960 | It's just someone to be there to make sure

01:10:24.480 | that you're not double spending in different places.

01:10:28.080 | Your Epsilon budget that's over here versus over here,

01:10:31.480 | versus over here is all coming from the same place.

01:10:34.000 | It's not totally clear who this actor would be.

01:10:37.760 | Maybe there's an app that just does it for you.

01:10:41.280 | Maybe there has to be an institution around it.

01:10:43.000 | Maybe it won't happen at all.

01:10:44.640 | Maybe it'll just be decentralized,

01:10:46.120 | but whatever.

01:10:48.240 | Another option is that there'll actually be data banks.

01:10:51.440 | There's been some literature in the last couple of years around saying,

01:10:53.880 | "Okay, maybe institutions that they currently handle your money

01:10:58.840 | might also be the bank where all of your information lives."

01:11:02.440 | That becomes the gateway to your data or something like that.

01:11:05.600 | So there's different things that are,

01:11:07.080 | because that would obviously make the accounting much easier.

01:11:09.600 | Also, that would give you that Cloud-to-Cloud performance increase.

01:11:13.400 | So I think it's clear we wouldn't go to data banks or

01:11:17.640 | these kinds of centralized accounting registries

01:11:19.760 | directly because you have to have the initial adoption first.

01:11:22.720 | But if I had to guess, it's something like that.

01:11:25.880 | We won't see that for a while.

01:11:28.400 | It's not even clear what that would look like.

01:11:32.120 | But I think it is possible we just have to

01:11:34.920 | get through non-trivial adoption first.

01:11:37.200 | >> Thank you.

01:11:37.680 | >> Yeah. So it's kind of a hazy,

01:11:39.680 | but it's predicting the future.

01:11:41.680 | So I guess that's how that goes.

01:11:44.000 | >> I was just wondering if you can comment briefly on what you

01:11:48.840 | think is the biggest mistake being made with

01:11:52.080 | respect to recommendation systems transparency,

01:11:54.880 | and if you can comment briefly on what you think might

01:11:58.080 | be the best solution.

01:12:01.520 | >> So I don't know if this is a mistake.

01:12:02.920 | I would say the biggest opportunity for

01:12:04.240 | recommendation systems is that they

01:12:05.400 | have the potential to be more holistic.

01:12:06.880 | So for example,

01:12:09.600 | if you recommended a movie to me based on whether or not it's

01:12:14.800 | most likely to keep me engaged,

01:12:16.960 | keep me watching movies,

01:12:18.720 | it's not really a holistic recommendation.

01:12:20.640 | It's not saying, "Hey, you should do this because it's going

01:12:22.360 | to make your life more fulfilling, more satisfied, whatever.

01:12:25.800 | It's just going to glue me to my television more."

01:12:28.400 | So I think the biggest opportunity,

01:12:30.560 | and particularly with privacy-preserving machine learning,

01:12:32.480 | is that if a recommender system could have

01:12:34.680 | the ability to access private data without actually seeing it,

01:12:37.720 | and answer the question, how do I give

01:12:40.280 | the best recommendation so that this person gets

01:12:42.200 | a good night's sleep or has

01:12:43.680 | more meaningful friendships or whatever.

01:12:45.320 | Like these attributes that are actually particularly sensitive,

01:12:47.920 | but there are things that we actually want to optimize for,

01:12:50.440 | that we could have vastly more

01:12:52.400 | beneficial recommendation systems than we do now,

01:12:54.760 | just by virtue of having better

01:12:55.920 | infrastructure for dealing with private data.

01:12:57.360 | So they actually, as far as

01:12:59.080 | like the biggest limitation of recommender systems right now,

01:13:03.200 | it's just that they don't have access to enough

01:13:04.520 | information to have good targets.

01:13:07.320 | Does that make sense? We would like for them to have better targets,

01:13:11.000 | but in order to do that, they have to have access

01:13:12.520 | to information about those targets.

01:13:14.480 | I think that's what privacy-preserving technologies

01:13:16.920 | could bring to bear on recommendation systems.

01:13:18.640 | >> Thanks.

01:13:19.600 | >> Yeah. Great question, by the way.

01:13:21.920 | >> One more time, please give Andrew a big hand. Thank you so much.

01:13:24.720 | >> Thanks.

01:13:26.160 | >> Thanks for having me.

01:13:29.760 | >> Thank you.

01:13:31.760 | >> Thank you.

01:13:33.760 | >> Thank you.

01:13:35.760 | >> Thank you.

01:13:37.760 | >> Thank you.

01:13:39.760 | >> Thank you.

01:13:41.760 | >> Thank you.

01:13:43.760 | >> Thank you.

01:13:45.760 | [BLANK_AUDIO]

Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series

Chapters