
Privacy Preserving AI (Andrew Trask) | MIT Deep Learning Series


Chapters

0:00 Introduction
0:54 Privacy preserving AI talk overview
1:28 Key question: Is it possible to answer questions using data we cannot see?
5:56 Tool 1: remote execution
8:44 Tool 2: search and example data
11:35 Tool 3: differential privacy
28:09 Tool 4: secure multi-party computation
36:37 Federated learning
39:55 AI, privacy, and society
46:23 Open data for science
50:35 Single-use accountability
54:29 End-to-end encrypted services
59:51 Q&A: privacy of the diagnosis
62:49 Q&A: removing bias from data when data is encrypted
63:40 Q&A: regulation of privacy
64:27 Q&A: OpenMined
66:16 Q&A: encryption and nonlinear functions
67:53 Q&A: path to adoption of privacy-preserving technology
71:44 Q&A: recommendation systems

Whisper Transcript

00:00:00.000 | Today, we're very happy to have Andrew Trask.
00:00:04.240 | He's a brilliant writer, researcher,
00:00:06.720 | tweeter, that's a word,
00:00:09.240 | in the world of machine learning and artificial intelligence.
00:00:12.480 | He is the author of Grokking Deep Learning,
00:00:15.040 | the book that I highly recommended in the lecture on Monday.
00:00:21.200 | He's the leader and creator of OpenMined,
00:00:23.920 | which is an open-source community that strives to make our algorithms,
00:00:27.820 | our data, and our world in general more privacy-preserving.
00:00:31.660 | He is coming to us by way of Oxford,
00:00:34.680 | but without that rich,
00:00:36.480 | complex, beautiful, sophisticated British accent, unfortunately.
00:00:40.600 | He is one of the best educators,
00:00:43.200 | and truly one of the nicest people I know.
00:00:45.240 | So please give him a warm welcome.
00:00:47.880 | >> Thanks. That was a very generous introduction.
00:00:54.440 | So yeah, today we're going to be talking about privacy-preserving AI.
00:00:57.280 | This talk is going to come in two parts.
00:00:59.160 | So the first is going to be looking at privacy tools
00:01:02.400 | from the context of a data scientist or a researcher,
00:01:05.440 | like how their actual UX might change.
00:01:07.600 | Because I think that's the best way to communicate
00:01:09.960 | some of the new technologies that are coming about in that context.
00:01:12.720 | Then we're going to zoom out and look at,
00:01:14.800 | under the assumption that these kinds of technologies become mature,
00:01:17.840 | what is that going to do to society?
00:01:20.760 | What consequences or side effects could these kinds of tools
00:01:24.080 | have, both positive and negative?
00:01:27.440 | So first, let's ask the question,
00:01:29.860 | is it possible to answer questions using data that we cannot see?
00:01:33.480 | This is going to be the key question that we look at today.
00:01:36.760 | Let's start with an example.
00:01:39.840 | So first, if we wanted to answer the question,
00:01:41.540 | what do tumors look like in humans?
00:01:43.460 | Well, this is a pretty complex question.
00:01:46.240 | Tumors are pretty complicated things.
00:01:47.900 | So we might train an AI classifier.
00:01:50.000 | If we wanted to do that,
00:01:51.640 | we would first need to download a dataset of tumor-related images.
00:01:54.680 | So we'd be able to statistically study these and be able to
00:01:56.920 | recognize what tumors look like in humans.
00:01:59.440 | But this kind of data is not very easy to come by.
00:02:02.200 | So it's very rarely collected,
00:02:05.040 | it's difficult to move around, and it's highly regulated.
00:02:08.000 | So we're probably going to have to buy it from
00:02:09.960 | a relatively small number of sources that
00:02:12.640 | are able to actually collect and manage this kind of information.
00:02:15.860 | The scarcity and constraints around this data are
00:02:19.200 | likely to make this a relatively expensive purchase.
00:02:21.640 | If it's going to be an expensive purchase for us to answer this question,
00:02:24.040 | well, then we're going to find someone to finance our project.
00:02:26.400 | If we need someone to finance our project,
00:02:27.700 | we have to come up with a way of how we're going to pay them back.
00:02:30.320 | If we're going to create a business plan,
00:02:31.680 | then we have to find a business partner.
00:02:32.920 | If we're going to find a business partner,
00:02:33.800 | we have to spend our time on LinkedIn,
00:02:35.240 | looking for someone to start a business with us.
00:02:37.040 | Now, all of this is because we wanted to answer the question,
00:02:39.000 | what do tumors look like in humans?
00:02:41.600 | What if we wanted to answer a different question?
00:02:44.200 | What if we wanted to answer the question,
00:02:46.020 | what do handwritten digits look like?
00:02:48.360 | Well, this would be a totally different story.
00:02:51.920 | We download a dataset,
00:02:55.800 | we download a state-of-the-art training script from GitHub,
00:02:58.080 | we'd run it, and a few minutes later,
00:02:59.520 | we'd have an ability to classify
00:03:02.120 | handwritten digits with potentially superhuman ability,
00:03:04.560 | if such a thing exists.
00:03:07.000 | Why is this so different between these two questions?
00:03:11.320 | The reason is that getting access to private data,
00:03:13.800 | data about people, is really, really hard.
00:03:18.280 | As a result, we spend most of
00:03:21.120 | our time working on problems and tasks like this.
00:03:24.120 | So ImageNet, MNIST, CIFAR-10.
00:03:25.780 | Anybody who's trained a classifier on MNIST before?
00:03:28.680 | Raise your hand. I expect pretty much everybody.
00:03:32.080 | Instead of working on problems like this,
00:03:36.480 | has anyone trained a classifier to predict dementia,
00:03:40.320 | diabetes, Alzheimer's?
00:03:46.000 | Looks like she's going. Depression?
00:03:48.560 | Anxiety? No one.
00:03:50.960 | So why is it that we spend all our time on tasks like this,
00:03:56.520 | when these other tasks represent our friends and loved ones,
00:04:00.040 | and problems in society that really, really matter?
00:04:02.600 | Not to say that there aren't people working on this.
00:04:04.480 | Absolutely, there are whole fields dedicated to it.
00:04:07.080 | But for the machine learning community at large,
00:04:10.440 | these tasks are pretty inaccessible.
00:04:13.000 | In fact, in order to work on one of these,
00:04:16.840 | you'd have to dedicate a portion of
00:04:18.160 | your life just to getting
00:04:19.720 | access to the data,
00:04:21.840 | whether it's doing a startup or joining a hospital or what have you.
00:04:26.680 | Whereas for other kinds of datasets,
00:04:28.280 | they're just simply readily accessible.
00:04:30.960 | This brings us back to our question.
00:04:33.360 | Is it possible to answer questions using data that we cannot see?
00:04:39.960 | So in this talk, we're going to walk through a few different techniques.
00:04:43.880 | If the answer to this question is yes,
00:04:46.920 | the combination of these techniques is going to try to make it so that we can
00:04:50.320 | actually pip install access to datasets like
00:04:52.880 | these in the same way that we
00:04:54.760 | pip install access to other deep learning tools.
00:04:57.280 | The idea here is to lower the barrier to entry,
00:04:59.240 | to increase the accessibility to some of
00:05:01.080 | the most important problems that we would like to address.
00:05:05.080 | So as Lex mentioned,
00:05:08.400 | I lead a community called OpenMined,
00:05:09.720 | which is an open-source community of a little over 6,000 people
00:05:12.800 | who are focused on lowering
00:05:14.600 | the barrier to entry to privacy-preserving AI and machine learning.
00:05:17.080 | Specifically, one of the tools that we're
00:05:18.880 | working on that we're talking about today is called PySyft.
00:05:21.200 | PySyft extends the major deep learning frameworks
00:05:24.800 | with the ability to do privacy-preserving machine learning.
00:05:26.680 | So specifically today, we're going to be looking
00:05:28.000 | at the extensions into PyTorch.
00:05:29.760 | So PyTorch, people generally familiar with PyTorch,
00:05:32.160 | yeah, quite a few users.
00:05:34.440 | It's my hope that by walking through a few of these tools,
00:05:39.480 | it'll become clear how we can start to be able to do data science,
00:05:45.680 | the act of answering questions using
00:05:47.520 | data that we don't actually have direct access to.
00:05:50.720 | Then in the second half of the talk,
00:05:52.440 | we're going to generalize this to answering
00:05:54.240 | questions even if you're not necessarily a data scientist.
00:05:56.960 | So first, the first tool is remote execution.
00:05:59.280 | So let me just walk you through this.
00:06:01.240 | So we're going to jump into code for a minute,
00:06:03.800 | but hopefully this is line by line and relatively simple.
00:06:06.080 | Even if you aren't familiar with PyTorch,
00:06:07.480 | I think it's relatively intuitive.
00:06:08.880 | We're looking at lists of numbers and these kinds of things.
00:06:11.080 | So up at the top,
00:06:12.360 | we import Torch as our deep learning framework.
00:06:14.560 | PySyft extends Torch with this thing called TorchHook.
00:06:17.280 | All it's doing is just iterating through the library and
00:06:19.200 | basically monkey-patching in lots of new functionality.
00:06:22.000 | Most deep learning frameworks are built around one core primitive,
00:06:25.240 | and that core primitive is the tensor.
00:06:27.320 | For those of you who don't know what tensors are,
00:06:29.320 | just think of them as nested list of numbers for now,
00:06:31.760 | and that'll be good enough for this talk.
00:06:34.000 | But for us, we introduce a second core primitive,
00:06:36.760 | which is the worker.
00:06:38.080 | A worker is a location within which computation is going to be occurring.
00:06:42.680 | So in this case, we have a virtualized worker that
00:06:45.640 | is pointing to say a hospital data center.
00:06:49.280 | The assumption that we have is that this worker will allow us to run
00:06:52.760 | computation inside of the data center without us
00:06:55.200 | actually having direct access to that worker itself.
00:06:58.200 | It gives us a limited,
00:06:59.480 | white-listed set of methods that we can use on this remote machine.
00:07:04.440 | So just to give you an example,
00:07:06.040 | so there's that core primitive we talked about a minute ago.
00:07:08.760 | We have the torch tensor,
00:07:10.040 | so 1, 3, 4, 5.
00:07:12.000 | The first method that we added is called just dot send.
00:07:15.120 | This does exactly what you might expect.
00:07:16.920 | Takes the tensor, serializes it,
00:07:18.600 | sends it into the hospital data center,
00:07:20.220 | and returns back to me a pointer.
00:07:22.400 | This pointer is really, really special.
00:07:24.040 | For those of you who are actually familiar with deep learning frameworks,
00:07:25.920 | I hope that this will really resonate with you.
00:07:28.180 | Because it has the full PyTorch API as a part of it,
00:07:32.200 | but whenever you execute something using this pointer,
00:07:35.040 | instead of it running locally,
00:07:36.480 | even though it looks like and feels like it's running locally,
00:07:39.080 | it actually executes on the remote machine and
00:07:41.720 | returns back to you another pointer to the result.
00:07:45.280 | The idea here being that I can now coordinate remote executions,
00:07:49.660 | remote computations without
00:07:52.040 | necessarily having to have direct access to the machine.
00:07:55.120 | Of course, I can make a dot get request, and we'll
00:07:57.200 | see that this is actually really,
00:08:00.000 | really important: getting permissions around when you can do a dot get request,
00:08:02.640 | and actually ask for data from a remote machine to come back to you.
00:08:05.360 | So just remember that.
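As a rough illustration, here is a minimal sketch of the remote-execution workflow just described, assuming the older PySyft 0.2-style API from around the time of this talk (names like `VirtualWorker` and the exact signatures may differ between versions):

```python
# Minimal sketch of remote execution with pointers, assuming the older
# PySyft 0.2-style API (names and signatures may differ by version).
import torch
import syft as sy

hook = sy.TorchHook(torch)                        # monkey-patches torch with PySyft functionality
hospital = sy.VirtualWorker(hook, id="hospital")  # stand-in for a hospital data center

x = torch.tensor([1, 3, 4, 5])
x_ptr = x.send(hospital)      # serialize and send the tensor; keep only a pointer locally

y_ptr = x_ptr + x_ptr         # looks and feels local, but executes on the remote worker
print(y_ptr)                  # a pointer to the remote result, not the data itself

y = y_ptr.get()               # explicitly request the result back (this is the gated step)
print(y)                      # tensor([2, 6, 8, 10])
```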
00:08:07.000 | Cool. So this is where we start.
00:08:09.820 | So in the Pareto principle,
00:08:11.600 | 80 percent for 20 percent,
00:08:13.360 | this is like the first big cut.
00:08:15.840 | So pros, data remains on a remote machine.
00:08:18.680 | We can now, in theory,
00:08:19.960 | do data science on a machine that we don't have access to, that we don't own.
00:08:23.640 | But the problem is, the first column we want to address,
00:08:27.560 | is how can we actually do good data science
00:08:29.560 | without physically seeing the data?
00:08:31.360 | So it's all well and good to say,
00:08:32.640 | "I'm going to train a deep learning classifier."
00:08:34.200 | But the process of answering questions is inherently iterative.
00:08:38.080 | It's inherently give and take.
00:08:40.760 | I learn a little bit and I ask a little bit,
00:08:42.240 | I learn a little bit and I ask a little bit.
00:08:44.280 | This brings me to the second tool.
00:08:46.000 | So search and example data.
00:08:47.200 | Again, we're starting really simple.
00:08:48.680 | It will get more complex here in a minute.
00:08:50.840 | So in this case, let's say we have what's called a grid.
00:08:53.200 | So PyGrid, if PySyft is a library,
00:08:55.640 | PyGrid is the platform version.
00:08:57.180 | So again, this is all open-source Apache 2 stuff.
00:09:00.640 | This is, we have what's called a grid client.
00:09:03.520 | So this could be an interface to
00:09:06.160 | a large number of datasets inside of a big hospital.
00:09:10.000 | So let's say I wanted
00:09:12.040 | to train a classifier to do something with diabetes.
00:09:14.680 | So it's going to predict diabetes or predict
00:09:16.760 | certain kind of diabetes or certain attribute of diabetes.
00:09:20.080 | I should be able to perform remote search.
00:09:23.040 | I get back pointers to the remote information.
00:09:27.040 | I can get back detailed descriptions of
00:09:30.240 | what the information is without me actually looking at it.
00:09:32.400 | So how it was collected,
00:09:33.680 | what the rows and columns are,
00:09:35.480 | what the types of the different information are,
00:09:37.840 | what ranges the values can take on,
00:09:40.160 | things that allow me to do remote normalization,
00:09:42.280 | these kinds of things. Then in some cases,
00:09:44.760 | even look at samples of this data.
00:09:46.320 | So these samples could be human curated.
00:09:49.120 | They could be generated from a GAN.
00:09:51.040 | They could be actually short snippets from the actual dataset.
00:09:56.920 | Maybe it's okay to release small amounts but not large amounts.
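To make the workflow concrete, here is a hypothetical, self-contained sketch of the search-and-sample idea; the `GridClient` and `RemoteDatasetPointer` names and fields are illustrative stand-ins, not PyGrid's actual API, and the metadata values are invented for the example:

```python
# Hypothetical sketch: search by tag, inspect metadata and curated samples,
# never pull the raw rows. Illustrative interface only, not PyGrid's real API.
from dataclasses import dataclass

@dataclass
class RemoteDatasetPointer:
    tags: list
    description: str        # how it was collected, columns, types, value ranges
    curated_sample: list    # human-curated, GAN-generated, or a tiny real snippet

    def sample(self):
        return self.curated_sample

class GridClient:
    """Hypothetical interface to a hospital's collection of private datasets."""
    def __init__(self, datasets):
        self._datasets = datasets

    def search(self, tag):
        return [d for d in self._datasets if tag in d.tags]

grid = GridClient([
    RemoteDatasetPointer(
        tags=["#diabetes"],
        description="10,000 rows; columns: age (18-90), bmi (15-60), hba1c (4-14)",
        curated_sample=[{"age": 54, "bmi": 31.2, "hba1c": 7.9}],
    )
])

for ptr in grid.search("#diabetes"):
    print(ptr.description)   # enough to plan feature engineering and remote normalization
    print(ptr.sample())      # without ever seeing the real records
```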
00:09:59.760 | The reason I highlight this,
00:10:01.920 | this isn't crazy complex stuff.
00:10:03.640 | So prior to going back to school,
00:10:06.000 | I used to work for a company called Digital Reasoning.
00:10:07.800 | We did on-prem data science.
00:10:11.720 | So we delivered AI services to corporations behind the firewall.
00:10:17.120 | So we worked with classified information.
00:10:19.120 | We worked with investment banks helping prevent insider trading.
00:10:22.600 | Doing data science on data that your home team,
00:10:25.520 | back in Nashville in our case,
00:10:27.280 | is not able to see is really, really challenging.
00:10:29.200 | But there are some things that can give you the first big jump
00:10:33.680 | before you jump into the more complex tools to
00:10:35.680 | handle some of the more challenging use cases.
00:10:37.960 | Cool. So basic remote execution,
00:10:40.240 | so remote procedure calls,
00:10:42.080 | basic private search, and the ability to look at sample data,
00:10:46.480 | gives us enough general context to be able
00:10:48.960 | to start doing things like feature engineering and evaluating quality.
00:10:52.880 | So now the data remains in the remote machine.
00:10:56.600 | We can do some basic feature engineering.
00:10:58.520 | Here's where things get a little more complicated.
00:11:01.360 | So if you remember, in the very first slide,
00:11:04.960 | where I show you some code at the bottom,
00:11:06.600 | I called dot get on the tensor.
00:11:09.360 | What that did was it took the pointer to
00:11:12.200 | some remote information and said, "Hey,
00:11:13.480 | send that information to me."
00:11:15.080 | That is an incredibly important bottleneck.
00:11:18.840 | Unfortunately, despite the fact that I'm doing all my remote execution,
00:11:23.640 | if that's just naively implemented,
00:11:25.200 | well, I can just steal all the data that I want to.
00:11:27.400 | I just call dot get on whatever pointers I want,
00:11:29.640 | and there's no additional added real security.
00:11:32.960 | So what are we going to do about this? This brings us to
00:11:36.480 | tool number three called differential privacy.
00:11:38.280 | Differential privacy, anyone come across it?
00:11:40.240 | A little higher? Okay, cool.
00:11:42.760 | Awesome. Good. So I'm going to do
00:11:48.240 | a quick high-level overview of
00:11:49.880 | the intuition of differential privacy,
00:11:51.280 | and then we're going to jump into how it can
00:11:53.400 | and is looking in the code,
00:11:55.320 | and I'll give you resources for deeper dive in
00:11:57.480 | differential privacy at the end of the talk, should you be interested.
00:12:01.520 | So differential privacy, loosely stated,
00:12:04.040 | is a field that allows you to do
00:12:05.880 | statistical analysis without compromising the privacy of the dataset.
00:12:10.280 | More specifically, it allows you to query a database,
00:12:14.120 | while making certain guarantees about the privacy
00:12:16.880 | of the records contained within the database.
00:12:18.720 | So let me show you what I mean.
00:12:20.000 | Let's say we have an example database,
00:12:21.760 | and so this is the canonical DB if you
00:12:23.480 | look in the literature for differential privacy.
00:12:25.640 | It'll have one row for person,
00:12:27.840 | one row for person, and one column of zeros and ones,
00:12:30.600 | which corresponds to true and false.
00:12:32.120 | We don't actually really care what those zeros and ones are indicating.
00:12:34.640 | It could be presence of a disease,
00:12:36.760 | could be male-female, could be just some sensitive attributes,
00:12:39.560 | something that's worth protecting.
00:12:42.040 | What we're going to do is,
00:12:45.360 | our goal is to ensure
00:12:46.640 | statistical analysis doesn't compromise privacy.
00:12:48.520 | What we're going to do is query this database.
00:12:50.400 | We're going to run some function over the entire database,
00:12:53.720 | and we're going to look at the result,
00:12:55.480 | and then we're going to ask a very important question.
00:12:57.760 | We're going to ask, if I were to remove someone from this database,
00:13:04.240 | say John, would the output of my function change?
00:13:10.880 | If the answer to that is no,
00:13:14.800 | then intuitively, we can say that,
00:13:17.960 | well, this output is not conditioned on John's private information.
00:13:21.320 | Now, if we could say that about everyone in the database,
00:13:25.020 | well then, okay, it would be a perfectly privacy-preserving query,
00:13:30.300 | but it might not be that useful.
00:13:32.440 | But this intuitive definition, I think, is quite powerful.
00:13:35.680 | The notion of how can we construct queries that are
00:13:37.840 | invariant to removing someone or replacing them with someone else.
00:13:43.160 | The notion of the maximum amount that
00:13:46.480 | the output of a function can change as a result of removing or
00:13:50.320 | replacing one of the individuals is known as the sensitivity.
00:13:54.640 | That's important, so if you're reading the literature
00:13:56.800 | and you come across sensitivity,
00:13:58.360 | that's what we're talking about.
00:13:59.840 | So what do we do when we have a really sensitive function?
00:14:03.880 | We're going to take a bit of a sidestep for a minute.
00:14:07.120 | I have a twin sister who's finishing a PhD in political science.
00:14:11.560 | Political science, often they need to answer questions about
00:14:16.240 | very taboo behavior, something that people are likely to lie about.
00:14:20.440 | So let's say I wanted to survey everyone in this room and I wanted to answer
00:14:24.920 | the question what percentage of you are secretly serial killers?
00:14:30.880 | Not because I think any one of you are,
00:14:35.920 | but because I genuinely want to understand this trend.
00:14:38.960 | I'm not trying to arrest people,
00:14:40.280 | I'm not trying to be an instrument of the criminal justice system.
00:14:45.760 | I'm trying to be a sociologist or
00:14:47.400 | political scientist and understand this actual trend.
00:14:49.640 | The problem is if I sit down with each one of you in a private room and I say,
00:14:52.680 | "I promise, I promise, I promise,
00:14:53.960 | I won't tell anybody," I'm still going to get a skewed distribution.
00:14:57.680 | Some people are just going to be like, "Why would I risk
00:15:00.360 | telling you this private information?"
00:15:02.640 | So what sociologists can do is this technique called randomized response,
00:15:06.440 | where I should have brought a coin.
00:15:08.720 | You take a coin and you give it to each person before you survey them,
00:15:13.040 | and you ask them to flip it twice somewhere that you cannot see.
00:15:16.280 | So I would ask each one of you to flip a coin twice somewhere that I cannot see.
00:15:20.520 | Then I would instruct you to,
00:15:23.480 | um, if the first coin flip is a heads, answer honestly.
00:15:29.880 | But if the first coin flip is a tails,
00:15:33.400 | answer yes or no based on the second coin flip.
00:15:37.840 | So roughly half the time,
00:15:39.840 | you'll be honest and the other half of the time,
00:15:43.000 | you'll be giving me a perfect 50-50 coin flip.
00:15:47.240 | And the cool thing is that what this is actually doing,
00:15:49.880 | is taking whatever the true mean of the distribution
00:15:51.840 | is and averaging it with a 50-50 coin flip, right?
00:15:55.400 | So if, say,
00:15:57.680 | 55 percent of you
00:16:00.520 | answered yes, that you are a serial killer,
00:16:05.960 | then I know that the true center of the distribution is actually 60 percent,
00:16:10.000 | because it was 60 percent averaged with a 50-50 coin flip.
00:16:12.680 | Does that make sense? However, despite the fact that I can
00:16:15.720 | recover the center of the distribution, right,
00:16:18.920 | given enough samples, um,
00:16:21.480 | each individual person has plausible deniability.
00:16:24.040 | If you said yes,
00:16:25.320 | it could have been because you actually are,
00:16:27.360 | or it could have been because you just happened to
00:16:29.560 | flip a certain sequence of coin flips, okay?
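A quick numerical sketch of the randomized-response trick described above (the 60 percent figure is just the illustrative number from the example):

```python
# Randomized response: each respondent answers honestly if their first (hidden)
# coin flip is heads, otherwise answers with a second 50-50 flip. The surveyor
# recovers the population mean; each individual keeps plausible deniability.
import random

def randomized_response(true_answer: bool) -> bool:
    if random.random() < 0.5:       # first flip: heads -> answer honestly
        return true_answer
    return random.random() < 0.5    # tails -> answer with a fresh 50-50 flip

true_rate = 0.60                    # the (unknown) true proportion in the example above
n = 100_000
answers = [randomized_response(random.random() < true_rate) for _ in range(n)]

observed = sum(answers) / n                  # ~ 0.5 * true_rate + 0.25  ->  ~0.55
estimated_true_rate = 2 * observed - 0.5     # invert the averaging with the 50-50 flip
print(observed, estimated_true_rate)         # roughly 0.55 and 0.60
```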
00:16:33.520 | Now this concept of adding noise to data to give
00:16:37.760 | plausible deniability is sort of the secret weapon of differential privacy, right?
00:16:42.120 | And, and the field itself is a,
00:16:44.960 | a set of mathematical proofs for trying to do this as efficiently as possible,
00:16:49.280 | to give sort of the smallest amount of noise to get the most accurate results,
00:16:53.200 | right, um, with the best possible privacy protections, right?
00:16:57.480 | There is a meaningful
00:16:58.920 | sort of base trade-off there, you know,
00:17:02.360 | there's kind of a Pareto trade-off, right?
00:17:05.320 | And we're trying to push
00:17:06.680 | that trade-off down.
00:17:08.080 | But so the
00:17:10.160 | field of research that is differential privacy
00:17:13.760 | is looking at how to add noise to data
00:17:17.280 | and the resulting queries, to give plausible deniability
00:17:20.280 | to the members of a
00:17:22.080 | database or a training dataset. Does that make sense?
00:17:25.160 | Now, um, a few,
00:17:27.520 | um, terms that you should be familiar with.
00:17:29.560 | So there's local and there's global differential privacy.
00:17:31.760 | So local differential privacy adds noise to data before it's sent to the statistician.
00:17:37.280 | So in this case, the one with the coin flip,
00:17:38.960 | this was local differential privacy.
00:17:40.240 | It affords you the best amount of protection because you never actually
00:17:43.400 | reveal sort of in the clear your information to someone, right?
00:17:47.480 | And then there's global differential privacy,
00:17:49.640 | which says, okay, we're gonna put everything in the database,
00:17:52.160 | perform a query, and then before the output of the query gets published,
00:17:55.440 | we're gonna add a little bit of noise to the output of the query, okay?
00:17:58.160 | This tends to have a much better privacy trade-off,
00:18:00.260 | but you have to trust the database owner to not compromise the results, okay?
00:18:03.680 | And we'll see there's some other things we can do there.
00:18:05.480 | Um, but with, with me so far,
00:18:06.920 | this is a good, good point for questions if you had any questions.
00:18:09.440 | Got it. So the question is, um,
00:18:11.240 | is this verifiable?
00:18:12.720 | Um, any of this, this process of differential privacy verifiable?
00:18:15.840 | Um, so that is a fantastic question, um,
00:18:18.120 | and one that actually absolutely comes up in practice.
00:18:20.840 | Um, um, so first,
00:18:22.640 | local differential privacy, the nice thing is everyone's doing it for themself, right?
00:18:26.040 | So in that sense, if you're flipping your own coins and answering your own questions,
00:18:29.680 | um, that's, that's your verification, right?
00:18:31.960 | You're, you're kind of trusting yourself.
00:18:33.200 | For global differential privacy, um,
00:18:35.680 | stay tuned for the next tool and we'll, we'll come back to that.
00:18:38.880 | All right. So what does this look like in code?
00:18:42.680 | So first, we have a pointer to a remote private dataset we call dot get.
00:18:46.600 | Whoa, we get big fat error, right?
00:18:48.840 | You just asked to sort of see
00:18:50.680 | the raw value of some private data point which you cannot do, right?
00:18:53.560 | Instead, pass in dot get epsilon to add the appropriate amount of noise.
00:18:57.040 | So one thing I haven't mentioned yet, um,
00:18:59.520 | uh, differential privacy. So I mentioned sensitivity, right?
00:19:02.240 | So sensitivity was
00:19:04.080 | related to the type of query,
00:19:05.520 | the type of function that we wanted to run, and its invariance to
00:19:07.840 | removing or replacing individual entries in the database.
00:19:10.880 | Um, so epsilon is a measure of what we call our privacy budget, right?
00:19:15.800 | And what our privacy budget is saying is, okay,
00:19:18.400 | what's the amount of statistical uniqueness that I'm going to limit?
00:19:22.720 | What's the upper bound for the amount of statistical uniqueness that I'm going to
00:19:25.080 | allow to come out of this database?
00:19:27.160 | Um, and actually I'm gonna take one more side,
00:19:29.120 | side track here, um,
00:19:30.480 | because I think it's really worth mentioning, um, data anonymization.
00:19:33.640 | Anyone familiar with data anonymization come across this term before?
00:19:36.600 | Taking a document and redacting the
00:19:39.120 | social security numbers and all this kind of stuff?
00:19:42.400 | By and large, it does not work.
00:19:44.520 | If you don't remember anything else from this talk,
00:19:46.280 | it is very dangerous to do just data set anonymization, okay?
00:19:50.080 | And differential privacy, in some respects,
00:19:52.480 | is the formal version of data anonymization,
00:19:55.120 | where instead of just saying, okay,
00:19:56.840 | I'm just gonna redact out these pieces and then I'll be fine,
00:19:59.720 | this is saying, okay, we can do a lot better.
00:20:02.120 | So for example, a Netflix prize,
00:20:03.440 | Netflix machine learning prize,
00:20:04.520 | if you remember this,
00:20:05.880 | a big million dollar prize,
00:20:07.200 | maybe some people in here competed in it.
00:20:08.880 | So in this prize, right, um,
00:20:11.760 | Netflix published an anonymized dataset, right?
00:20:14.600 | And that was, um,
00:20:15.960 | movies and users, right?
00:20:18.040 | And they took all the movies and replaced them with numbers,
00:20:20.440 | and they took all the users and replaced them with numbers,
00:20:22.840 | and then we just had sparsely populated movie ratings in this matrix, right?
00:20:27.360 | Seemingly anonymous, right?
00:20:29.480 | There's no names of any kind.
00:20:31.480 | Um, but the problem is,
00:20:33.800 | is that each row is statistically unique,
00:20:37.960 | meaning it, it kind of is its own fingerprint.
00:20:41.640 | And so two months after the dataset was published,
00:20:44.560 | some researchers at UT Austin,
00:20:48.080 | I think it was UT Austin,
00:20:51.320 | were able to go and scrape IMDb,
00:20:54.960 | and basically create the same matrix in IMDb,
00:20:58.000 | and then just compare the two.
00:20:59.840 | And it turns out people that were into movie rating on Netflix
00:21:02.280 | were into movie rating on IMDb,
00:21:04.400 | and were watching movies at similar times,
00:21:06.960 | with similar patterns and similar tastes, right?
00:21:09.280 | And they were able to de-anonymize
00:21:11.080 | this first dataset with a high degree of accuracy.
00:21:13.000 | It happened again,
00:21:14.240 | there's a famous case of
00:21:15.520 | medical records for, I think,
00:21:17.440 | a Massachusetts governor.
00:21:19.200 | It was someone in the Northeast
00:21:21.080 | being de-anonymized through very similar techniques.
00:21:23.640 | So someone goes and buys an anonymized medical dataset over here that has,
00:21:27.760 | you know, birthdate and zip code,
00:21:29.360 | and this one does zip code and,
00:21:30.640 | and gender, and this one does zip code,
00:21:32.560 | gender, and whether or not you have cancer, right?
00:21:34.920 | And, and when you get all these together, um,
00:21:37.600 | you can start to sort of use the uniqueness in each one to,
00:21:41.480 | to relink it all back together.
00:21:42.760 | I mean, this is so doable,
00:21:45.560 | to the extreme that I
00:21:46.880 | unfortunately know of companies whose business model is to buy anonymized datasets,
00:21:52.480 | de-anonymize them, and sell market intelligence to insurance companies.
00:21:56.120 | Ooh, right? But it can be done, okay?
00:22:00.080 | And, and the reason that it can be done is that just
00:22:02.800 | because the dataset that you are publishing,
00:22:05.200 | the one that you are physically looking at,
00:22:07.320 | doesn't seem like it has, you know,
00:22:09.800 | social security number and stuff in it,
00:22:11.120 | does not mean that there isn't enough unique statistical signal
00:22:13.960 | for it to be linked to something else.
00:22:15.600 | And so when I say maximum amount of epsilon,
00:22:18.380 | epsilon is an upper bound on
00:22:20.560 | the statistical uniqueness that you're publishing in a dataset, right?
00:22:25.320 | And so what, what this tool represents is saying, okay,
00:22:29.160 | apply however much noise you need to given
00:22:33.400 | whatever computational graph led back to private data for this tensor, right?
00:22:38.320 | To ensure that, you know,
00:22:40.040 | to, to put an upper bound on,
00:22:41.320 | on the potential for linkage attacks, right?
00:22:43.040 | Now, if you said epsilon zero, okay,
00:22:44.720 | then that's saying,
00:22:46.480 | effectively,
00:22:49.520 | I'm only going to allow patterns that have occurred at least twice, right?
00:22:54.200 | Okay. So meaning two different people had
00:22:57.000 | this pattern and thus it's not unique to either one. Yes.
00:22:59.320 | So what happens if you perform the query twice?
00:23:01.320 | So the random noise would be re-randomized and,
00:23:03.440 | and sent again and you're absolutely, absolutely correct.
00:23:05.960 | So this epsilon, this is how much I'm spending with this query.
00:23:09.240 | So if I ran this three times,
00:23:10.800 | I would spend epsilon of 0.3. Does that make sense?
00:23:13.200 | So this is, this is a 0.1 query.
00:23:14.640 | If I did this multiple times, the epsilons would sum.
00:23:16.920 | And so for any given data science project, right?
00:23:19.400 | What we're
00:23:20.600 | advocating is that you're given
00:23:21.800 | an epsilon budget that you're not allowed to exceed, right?
00:23:24.160 | No matter how many queries you run.
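A minimal sketch of how a global differential privacy mechanism and a privacy budget could fit together, using the standard Laplace mechanism with noise scaled to sensitivity over epsilon; this is an illustration, not PySyft's implementation, and the epsilon values and data are arbitrary:

```python
# Global DP with a privacy budget: noise drawn from a Laplace distribution
# calibrated to sensitivity / epsilon, and each query charged against the budget.
import numpy as np

class PrivateDatabase:
    def __init__(self, values, epsilon_budget):
        self.values = np.array(values, dtype=float)
        self.budget = epsilon_budget     # total epsilon the data owner allows
        self.spent = 0.0

    def noisy_mean(self, epsilon, lower, upper):
        if self.spent + epsilon > self.budget + 1e-9:   # tolerance for float rounding
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon
        clipped = np.clip(self.values, lower, upper)
        # replacing one record changes a bounded mean by at most (upper - lower) / n
        sensitivity = (upper - lower) / len(self.values)
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return clipped.mean() + noise

db = PrivateDatabase([0, 1, 1, 0, 1, 0, 0, 1], epsilon_budget=0.3)
print(db.noisy_mean(epsilon=0.1, lower=0, upper=1))  # spends 0.1
print(db.noisy_mean(epsilon=0.1, lower=0, upper=1))  # spends 0.1 more
print(db.noisy_mean(epsilon=0.1, lower=0, upper=1))  # spends the last 0.1; a fourth call errors
```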
00:23:26.520 | Now, there's, there's another sort of subfield of differential privacy that's
00:23:29.320 | looking at sort of single query approaches,
00:23:32.720 | which is all around synthetic data sets.
00:23:34.400 | So how can I perform sort of one query against
00:23:36.080 | the whole data set and create a synthetic data set that has,
00:23:39.000 | um, certain invariances that are desirable, right?
00:23:41.800 | So I can do good statistics on it.
00:23:43.520 | But then I can query this as many times as I want.
00:23:45.520 | Because basically,
00:23:48.240 | yeah, anyway,
00:23:50.480 | we don't have to get into that now.
00:23:51.840 | Does that answer your question? Cool. Awesome.
00:23:53.800 | So now you might think, okay,
00:23:55.160 | this is like a lossless cause.
00:23:56.320 | Like how can we be answering questions while protecting,
00:23:58.440 | while, while keeping statistical signal gone.
00:24:00.200 | But like it's, it's the difference between, um,
00:24:02.440 | it's the difference between if I have a data set and I wanna know what causes cancer, right?
00:24:07.540 | I could query a data set and learn that smoking causes cancer
00:24:12.120 | without learning that individuals are,
00:24:14.760 | are or are not smokers. Does that make sense?
00:24:17.360 | Right? And the reason for that is,
00:24:19.560 | is that I'm, I'm, I'm specifically looking for
00:24:22.080 | patterns that are occurring multiple times across different people.
00:24:25.000 | And this actually happens to really, um,
00:24:27.560 | closely mirror the type of
00:24:29.440 | generalization that we want in machine learning statistics anyways.
00:24:32.400 | Does that make sense? Like as machine learning practitioners,
00:24:35.280 | we're actually not really interested in the one-offs, right?
00:24:39.200 | I mean, sometimes our models memorize things.
00:24:40.840 | This, this happens, right?
00:24:42.360 | But we're actually more interested in the things that are,
00:24:45.160 | the things that are not specific to you.
00:24:46.720 | I want, I want the things that are gonna work, you know,
00:24:48.680 | the, the heart treatments that are gonna work for everyone in this room,
00:24:50.640 | and not just, I mean, you know,
00:24:51.880 | obviously if you need a heart treatment,
00:24:52.960 | I'd be happy, that'd be cool for you to have one.
00:24:54.440 | But like what we're chiefly interested in are,
00:24:56.720 | are the things that generalize, right?
00:24:58.280 | Which, which is why this is realistic, um, um,
00:25:01.400 | and why with, with continued effort on both tooling and,
00:25:04.600 | and the theory side, um,
00:25:06.120 | we can, we can have a much better, uh, reality than today.
00:25:09.040 | Cool. So, um, pros, just to review.
00:25:13.080 | So first, remote execution,
00:25:15.640 | that allows data to remain on the remote machine.
00:25:17.320 | Search and sampling, we can feature engineer using toy data.
00:25:19.740 | Differential privacy, we can have a formal rigorous privacy budgeting mechanism, right?
00:25:23.680 | Now, shoot, how is the privacy budget set?
00:25:26.440 | Is it defined by the user or is it defined by the dataset owner or, or someone else?
00:25:31.760 | Um, this is a really, really interesting question actually.
00:25:34.480 | Um, so first, it's definitely not set by the data scientist,
00:25:38.680 | um, because that would be a bit of a conflict of interest.
00:25:40.720 | Um, and at, at first,
00:25:43.520 | you might say it should be the data owner, okay?
00:25:46.600 | So the hospital, right?
00:25:48.000 | They're trying to cover their butt, right?
00:25:50.060 | And make sure that their assets are protected both legally and,
00:25:53.540 | and commercially, right?
00:25:54.580 | So they're trying to make money off this.
00:25:56.140 | So there's
00:25:58.140 | sort of proper incentives there.
00:26:00.980 | But the interesting thing, and this gets back to your question,
00:26:04.020 | is what happens if I have, say,
00:26:06.300 | a radiology scan in two different hospitals, right?
00:26:10.700 | And they both spend one epsilon worth of,
00:26:13.700 | of, of my privacy in each of these hospitals.
00:26:17.380 | Right? That means that actually two epsilon of my private information is out there.
00:26:22.120 | Right? And it just means that one person has to be
00:26:25.840 | clever enough to go to both places to get the join.
00:26:28.680 | This is actually the exact same mechanism we were talking about a second ago when
00:26:31.120 | someone went from Netflix to IMDb, right?
00:26:34.160 | And so the true answer of who should be setting epsilon budgets,
00:26:38.600 | although, logistically, it's gonna be challenging,
00:26:40.640 | and we'll talk about a little bit of this
00:26:41.880 | in part two of the talk, but I'm going a little bit slow,
00:26:43.840 | okay, the true answer is,
00:26:47.540 | it should be us. It should be people,
00:26:49.660 | and it should be people around their own information, right?
00:26:52.860 | You should be setting your personal epsilon budget. That makes sense?
00:26:56.440 | That's an aspirational goal.
00:26:58.140 | Um, we've got a long way before we can get to that level of,
00:27:01.520 | of infrastructure around these kinds of things.
00:27:04.500 | Um, and we can talk about that,
00:27:06.620 | and we can definitely talk about more of that in
00:27:07.860 | the kind of question-answer session as well.
00:27:09.060 | But I think in, in theory,
00:27:10.780 | in theory, that's what, what we want.
00:27:13.620 | [NOISE]
00:27:18.000 | Okay. The two cons, the two weaknesses of
00:27:20.440 | this approach that we still have are- someone asked this question.
00:27:23.040 | I think it was you. Yeah, yeah, you asked the question.
00:27:24.880 | Um, so first, the data is safe,
00:27:26.320 | but the model is put at risk.
00:27:27.440 | Uh, and what if we need to do a join?
00:27:28.680 | Actually, actually, yours is the third one,
00:27:29.840 | which I should totally add to the slide.
00:27:31.200 | Um, so, so first, um,
00:27:34.040 | if I'm sending my, my computations,
00:27:35.880 | my model into the hospital to learn how to be a better cancer classifier, right?
00:27:39.720 | My model is put at risk.
00:27:40.840 | It's kind of a bummer if, like, you know,
00:27:42.960 | this is a $10 million healthcare model.
00:27:44.580 | I'm just sending it to a thousand different hospitals to get learned, to learn.
00:27:47.540 | So that's potentially risky.
00:27:48.980 | Second, what if I need to do a join,
00:27:50.940 | a computation across multiple different data owners,
00:27:52.900 | who don't trust each other, right?
00:27:54.440 | Who sends whose data to whom, right?
00:27:57.460 | And thirdly, um, as you pointed out,
00:28:01.060 | how do I trust that these computations are actually happening the way
00:28:03.780 | that I am telling the remote machine that they should happen?
00:28:07.420 | This brings me to my absolute favorite tool,
00:28:11.980 | secure multi-party computation.
00:28:13.680 | Come across this before? Raise them high.
00:28:15.640 | Okay, cool. Little bit above average.
00:28:18.400 | Most machine learning people have not heard about this yet,
00:28:20.480 | and I absolutely- this is the coolest
00:28:23.720 | thing I've learned about since learning about
00:28:25.720 | AI and machine learning.
00:28:26.680 | This is a really, really cool technique.
00:28:28.600 | Encrypted computation, how about homomorphic encryption?
00:28:30.960 | You come across homomorphic encryption?
00:28:32.280 | Okay, a few more. Yeah, this is related to that.
00:28:34.640 | Um, so first, the kind of textbook definition is, is like this.
00:28:39.920 | If you go on Wikipedia, you'd see, uh,
00:28:41.700 | secure MPC allows multiple people to combine
00:28:43.740 | their private inputs to compute
00:28:45.340 | a function without revealing their inputs to each other, okay?
00:28:48.460 | Um, but in the context of machine learning,
00:28:50.580 | the implication of this is multiple different individuals
00:28:53.340 | can share ownership of a number, okay?
00:28:57.220 | Share ownership of a number. Show you what I mean.
00:29:00.580 | So let's say I have the number five,
00:29:02.620 | my happy smiling face,
00:29:04.340 | and I split this into two shares,
00:29:06.540 | two and a three, okay?
00:29:09.720 | I've got two friends, Marianne and Bobby,
00:29:13.480 | and I give them these shares.
00:29:15.720 | They are now the shareholders of this number, okay?
00:29:18.840 | And now I'm gonna go away,
00:29:20.160 | and this number is shared between them, okay?
00:29:24.960 | And this, this gives us several desirable properties.
00:29:27.000 | First, it's encrypted from the standpoint that neither Bob,
00:29:32.800 | nor Marianne can tell what number is
00:29:34.880 | encrypted between them by looking at their own share by itself.
00:29:37.720 | Now, I've, um, um,
00:29:40.760 | for those of you who are familiar with,
00:29:42.760 | uh, kind of cryptographic math,
00:29:44.480 | um, I'm hand-waving over this a little bit.
00:29:46.360 | This would typically be,
00:29:48.240 | decryption would be adding the shares together,
00:29:49.980 | modulo a large prime.
00:29:52.000 | Um, so these would typically look like sort of
00:29:53.880 | large pseudo-random numbers, right?
00:29:56.040 | But for the sake of making it sort of intuitive,
00:29:58.080 | I've picked pseudo-random numbers that are convenient to the eyes.
00:30:01.560 | Um, so first, these two values are encrypted,
00:30:05.520 | and second, we get shared governance,
00:30:08.080 | meaning that we cannot decrypt these numbers or do anything with
00:30:11.320 | these numbers unless all of the shareholders agree, okay?
00:30:17.120 | But the truly extraordinary part is that
00:30:21.660 | while this number is encrypted between these individuals,
00:30:23.840 | we can actually perform computation, right?
00:30:26.120 | So in this case, let's say we wanted to multiply
00:30:28.020 | the shares times- or the encrypted number times two,
00:30:30.520 | each person can multiply their share times two,
00:30:32.520 | and now they have an encrypted number 10, right?
00:30:35.160 | And there's a whole variety of protocols allowing you to do different functions,
00:30:39.360 | um, such as the functions needed for machine learning,
00:30:41.840 | um, while numbers are in this encrypted state, okay?
00:30:45.360 | Um, and I'll give some more resources for you- for you if you're
00:30:47.440 | interested in kind of learning more about this at the end as well.
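A toy sketch of the additive secret sharing just described, where decryption is adding the shares together modulo a large prime and both shareholders can jointly compute on the hidden value:

```python
# Additive secret sharing: a value is split into random shares modulo a large
# prime; no single share reveals anything, and the shareholders can each
# multiply their share by a public constant to compute on the shared value.
import random

Q = 2 ** 61 - 1          # a large prime field

def share(secret, n_shares=2):
    shares = [random.randrange(Q) for _ in range(n_shares - 1)]
    shares.append((secret - sum(shares)) % Q)   # shares sum to the secret mod Q
    return shares

def reconstruct(shares):
    return sum(shares) % Q                      # requires every shareholder to cooperate

marianne, bob = share(5)
print(marianne, bob)                            # two large pseudo-random numbers

# multiply the shared (encrypted) value by the public constant 2:
marianne2, bob2 = (marianne * 2) % Q, (bob * 2) % Q
print(reconstruct([marianne2, bob2]))           # 10
```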
00:30:50.280 | Now, the big tie-in. Models and datasets are just large collections of numbers,
00:30:55.360 | which we can individually encrypt,
00:30:56.980 | which we can individually, uh,
00:30:58.840 | share governance over.
00:31:00.400 | Um, now, specifically to reference your question,
00:31:02.440 | there's two configurations of
00:31:04.360 | secure MPC, active and passive security.
00:31:06.440 | In the active security model,
00:31:07.680 | you can tell if anyone does computation that you did
00:31:09.640 | not sort of independently authorize, which is great.
00:31:12.920 | So what does this look like in practice when you go back to the code?
00:31:18.440 | So in this case,
00:31:19.880 | we don't need just one worker,
00:31:20.960 | it's not just one hospital because we're looking to have a shared governance,
00:31:23.240 | shared ownership amongst multiple different individuals.
00:31:25.280 | So let's say we have Bob,
00:31:26.520 | Alice, and Tao, and a crypto provider,
00:31:28.800 | which we won't go into now.
00:31:30.080 | Um, and I can take a tensor,
00:31:32.200 | and instead of calling dot send and sending that tensor to someone else,
00:31:35.440 | now I call dot share,
00:31:37.200 | and that splits each value into
00:31:40.920 | multiple different shares and distributes those amongst the shareholders, right?
00:31:44.600 | So in this case, Bob, Alice, and Tao.
00:31:46.680 | However, in the frameworks that we're working on,
00:31:49.880 | you still get kind of the same PyTorch-like interface,
00:31:52.760 | and all the cryptographic protocol happens under the hood.
00:31:55.600 | And the idea here is to make it so that we can sort of do
00:31:58.360 | encrypted machine learning without you necessarily having to be a cryptographer, right?
00:32:02.160 | And vice versa, cryptographers can improve
00:32:04.160 | the algorithms and machine learning people can automatically inherit them, right?
00:32:06.960 | So kind of classic sort of open-source machine learning library,
00:32:10.280 | making complex intelligence more accessible to people, if that makes sense.
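And a sketch of what the `.share()` interface described here could look like in code, again assuming the older PySyft 0.2-style API (exact signatures may differ between versions):

```python
# Sketch of secret-sharing a tensor across several workers, assuming the
# older PySyft 0.2-style API (version-dependent; shown for illustration).
import torch
import syft as sy

hook = sy.TorchHook(torch)
bob = sy.VirtualWorker(hook, id="bob")
alice = sy.VirtualWorker(hook, id="alice")
tao = sy.VirtualWorker(hook, id="tao")
crypto_provider = sy.VirtualWorker(hook, id="crypto_provider")

x = torch.tensor([25])
x_shared = x.share(bob, alice, tao, crypto_provider=crypto_provider)  # split into additive shares

y_shared = x_shared + x_shared      # the SMPC protocol runs under the hood
print(y_shared.get())               # tensor([50]) once the shareholders cooperate to decrypt
```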
00:32:14.800 | And what we can do on tensors,
00:32:18.280 | we can also do on models.
00:32:19.240 | So we can do encrypted training,
00:32:20.800 | encrypted prediction, and we're going to get into,
00:32:23.280 | what kind of awesome use cases this opens up in a bit.
00:32:30.000 | And this is a nice set of features, right?
00:32:33.840 | In my opinion, this is, this is sort of
00:32:35.880 | the MVP of doing privacy-preserving data science, right?
00:32:39.520 | The idea being that I can have remote access to a remote dataset.
00:32:43.600 | I can learn high-level latent patterns like,
00:32:46.240 | like, you know, what causes cancer without learning whether individuals have cancer.
00:32:51.000 | I can pull back just,
00:32:53.760 | just that sort of high-level information with
00:32:55.760 | formal mathematical guarantees over, over,
00:32:59.320 | you know, what's sort of the filter that's coming back through here, right?
00:33:02.920 | And I can work with datasets from multiple different data owners while making
00:33:06.320 | sure that each, each individual data owners are protected.
00:33:09.120 | Now, what's the catch?
00:33:10.880 | Okay. So first, um,
00:33:15.200 | is computational complexity, right?
00:33:17.520 | So encrypted computation, secure MPC, um,
00:33:20.000 | this, this involves sending lots of information over, over the network.
00:33:23.040 | I think the state of the art for
00:33:24.520 | deep learning prediction
00:33:27.400 | is about a 13x slowdown over plaintext,
00:33:30.480 | which is inconvenient but not deadly, right?
00:33:32.960 | But you do have to understand that,
00:33:34.520 | that assumes like it's like two AWS machines
00:33:36.600 | were like talking to each other, you know, they're relatively fast.
00:33:39.000 | But we also haven't had any hardware optimization yet; to the extent that
00:33:42.300 | NVIDIA did a lot for deep learning,
00:33:44.840 | there'll probably be
00:33:45.960 | some sort of Cisco-like player that does something similar for
00:33:48.560 | doing kind of encrypted or secure MPC-based deep learning, right?
00:33:53.040 | Um, let's see.
00:33:55.080 | So this brings us back to kind of the fundamental question,
00:33:57.400 | is it possible to answer questions using data we cannot see?
00:33:59.640 | Um, the theory is absolutely there.
00:34:01.600 | Um, that's, that's something that,
00:34:03.000 | that I feel reasonably confident saying,
00:34:04.720 | that like, like the, the,
00:34:05.920 | the sort of the theoretical frameworks that we have.
00:34:07.680 | And actually, the other thing that's really worth mentioning here is that these come
00:34:10.240 | from totally different fields which is why they kind of
00:34:12.400 | haven't been necessarily combined that much yet.
00:34:13.920 | I'll get, I'll get more into that in a second.
00:34:15.480 | But it's my hope that,
00:34:17.960 | by sort of
00:34:20.280 | considering what these tools can do,
00:34:22.840 | that'll open up your eyes to the potential that in general,
00:34:25.600 | we can have this new ability to answer
00:34:27.200 | questions using information that we don't actually own ourselves.
00:34:29.880 | Um, because from a sociological standpoint,
00:34:32.760 | that's net new for like us as a species, if that makes sense.
00:34:37.360 | Previously,
00:34:39.120 | we had to have a
00:34:40.360 | trusted third party who would then take all the information in
00:34:42.880 | themselves and make some sort of neutral decision, right?
00:34:46.280 | Um, so we'll come to that in a second.
00:34:49.040 | Um, and so one of the big sort of long-term goals of
00:34:51.760 | our community is to make
00:34:53.160 | infrastructure for this secure enough and robust enough.
00:34:55.440 | And of course in like a free Apache 2 open source license kind of way,
00:34:59.040 | um, that, you know,
00:35:01.440 | information on the world's most important problems will be this accessible, right?
00:35:05.360 | So we're gonna spend sort of less time working on
00:35:08.680 | tasks like that and more time working on tasks like this.
00:35:12.480 | So, um, this is gonna be kind of the,
00:35:14.760 | the breaking point between sort of part one and part two.
00:35:17.040 | Um, part two will be a bit shorter.
00:35:18.640 | Um, but if you're interested in,
00:35:20.120 | in sort of diving deeper on the technicals of this, um,
00:35:22.240 | here's a, like a six or seven hour course that I
00:35:24.760 | taught just on these concepts and on the tools.
00:35:26.860 | It's free, uh, on Udacity.
00:35:28.600 | Feel free to check it out. So the question was,
00:35:33.040 | he's asking about how I
00:35:34.160 | specified that a model can be encrypted during training.
00:35:36.360 | Is that the same as homomorphic encryption or is that something else?
00:35:38.600 | So, um, uh, a couple of years ago,
00:35:40.940 | there was a, a big burst in literature around training on encrypted data,
00:35:44.680 | um, where you would homomorphically encrypt the dataset.
00:35:46.800 | And it turned out that some of the statistical regularities
00:35:48.960 | of homomorphic encryption allowed you to actually train on
00:35:50.920 | that dataset without, um, without decrypting it.
00:35:54.720 | Um, so this is similar to that except, um,
00:35:59.440 | the one downside to that is that in order to use that model in the future,
00:36:04.040 | you have to still be able to encrypt data with the same key, um,
00:36:07.760 | which often is
00:36:10.000 | sort of constraining in practice, and also there's a pretty big hit to accuracy
00:36:12.680 | because you're training on data that inherently has a lot of noise added to it.
00:36:16.000 | What I'm advocating for here, um,
00:36:19.280 | is instead we actually encrypt, um,
00:36:22.200 | both the model and the dataset, uh,
00:36:24.280 | during training but inside the encryption, inside the box, right?
00:36:28.320 | It's actually performing the same computations
00:36:30.000 | that it would be doing in plain text.
00:36:31.680 | So you don't get any degradation in accuracy, um,
00:36:33.880 | and you don't get tied to one particular public private key pair.
00:36:37.080 | Yeah, yeah, yeah. So, uh,
00:36:38.840 | specifically- so the question was can I comment on
00:36:40.480 | federated learning specifically Google's implementation?
00:36:42.640 | Um, so I think Google's implementation is, is great.
00:36:45.160 | So, um, and obviously the,
00:36:46.720 | the fact that they've shown that this can be done with hundreds of
00:36:49.600 | millions of users is incredibly powerful.
00:36:51.320 | I mean, uh, and even inventing the term,
00:36:53.280 | um, uh, and creating momentum in that direction.
00:36:55.520 | Um, I think that there's, um,
00:36:57.520 | one thing that's worth mentioning is that there are two forms of federated learning.
00:37:00.960 | Uh, one is sort of the one where your model is- federated learning, sorry.
00:37:05.720 | Uh, ooh, gotta talk about what that is.
00:37:07.920 | Okay. Um, yes, I'll do that quickly.
00:37:10.800 | Um, so federated learning is,
00:37:13.000 | um, basically the first thing I talked about.
00:37:14.680 | So remote execution.
00:37:15.720 | So if, if everyone has a smartphone, um,
00:37:18.520 | when you plug your phone in at night,
00:37:19.800 | if you've got, you know, Android or iOS,
00:37:22.280 | you plug your phone in at night, attach to Wi-Fi.
00:37:24.120 | You know when you text and it recommends the next word,
00:37:26.880 | um, next word prediction?
00:37:28.520 | Um, that model is trained using federated learning.
00:37:31.320 | Um, meaning that it,
00:37:32.680 | it learns on your device to do that better,
00:37:35.080 | and then that model gets uploaded to the Cloud as opposed to
00:37:37.800 | uploading all of your texts to the Cloud and training one global model.
00:37:40.120 | Does that make sense? So, so plug your phone at night,
00:37:42.880 | model comes down, trains locally, goes back up.
00:37:44.640 | It's federated, right? That's, that's,
00:37:46.160 | that's basically what federated learning is in a nutshell.
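A generic sketch of the pattern just described, local training followed by averaging the updated weights, in plain PyTorch; this is an illustration of the idea, not Google's implementation:

```python
# Federated averaging sketch: each device trains the shared model locally on
# its own data, and only the updated weights are averaged on the server.
import copy
import torch

def local_update(global_model, local_data, local_targets, lr=0.01, epochs=1):
    model = copy.deepcopy(global_model)           # model comes down to the device
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(local_data), local_targets)
        loss.backward()
        opt.step()
    return model.state_dict()                     # only weights go back up, not the data

def federated_average(state_dicts):
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return avg

global_model = torch.nn.Linear(10, 1)
devices = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(3)]  # toy local datasets
updates = [local_update(global_model, x, y) for x, y in devices]
global_model.load_state_dict(federated_average(updates))
```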
00:37:48.040 | And, and, um, uh,
00:37:49.600 | it was pioneered, uh, by the Quark team at Google.
00:37:51.960 | And, um, and they're,
00:37:54.000 | they're, they do really fantastic work.
00:37:55.600 | They've, they've paid down a lot of the technical debt,
00:37:57.520 | a lot of the, the, the risk,
00:37:59.280 | or technical risk around it.
00:38:00.640 | Um, and they publish really great papers
00:38:03.040 | outlining sort of how they do it, which is fantastic.
00:38:06.040 | Um, what I outlined here is
00:38:08.400 | actually a slightly different style of federated learning.
00:38:10.560 | Because there, there's federated learning with like a fixed dataset and a fixed model.
00:38:14.280 | Um, and lots of users where the,
00:38:16.040 | the data is very, um,
00:38:17.360 | ephemeral like phones are constantly logging in and logging off.
00:38:20.560 | Um, you know, you're, you're,
00:38:22.000 | you're plugging your phone at night and then you're taking it out, right?
00:38:24.400 | Um, um, this is sort of the,
00:38:26.720 | the, the one style of federated learning.
00:38:29.280 | It's, it's really useful for like product development, right?
00:38:31.520 | So it's useful for like if you want to do
00:38:33.640 | a smartphone app that has a piece of intelligence in it,
00:38:35.900 | but training that intelligence is going to be
00:38:37.760 | prohibitively difficult because of getting access to the data,
00:38:40.320 | or you want to just have a value prop of protecting privacy, right?
00:38:43.460 | That's what federated learning is good for.
00:38:45.760 | What I've outlined here is a bit more exploratory federated learning,
00:38:49.140 | where it's saying, okay, instead of,
00:38:50.980 | instead of, um, the model being hosted in the Cloud and
00:38:53.820 | data owners showing up and making it a bit smarter every once in a while,
00:38:56.840 | now the data is going to be hosted at a variety of different private Clouds, right?
00:39:01.220 | And data scientists are going to show up and say, "Hmm,
00:39:03.100 | I want to do something with di- with diabetes today," or, "Hmm,
00:39:05.300 | I want to do something with,
00:39:06.340 | with, um, studying dementia today," something like that, right?
00:39:09.900 | This is much more difficult because the attack vectors for
00:39:12.340 | this are much larger, right?
00:39:14.300 | I'm trying to be able to answer arbitrary questions about
00:39:17.300 | arbitrary datasets, um, in, in a protected environment, right?
00:39:21.280 | So I think, um, yeah, those are
00:39:22.620 | kind of my general thoughts on that.
00:39:25.220 | Does federated learning leak any information?
00:39:27.180 | So federated learning by itself is not a secure protocol, right?
00:39:30.100 | Um,
00:39:32.340 | and that's why I described this sort of ensemble of techniques. So
00:39:38.260 | the question was, does federated learning leak information?
00:39:38.260 | Um, so it is perfectly possible for
00:39:40.340 | a federated learning model to simply memorize the dataset,
00:39:42.620 | uh, and then spit that back out later.
00:39:44.100 | You have to combine it with something like differential privacy
00:39:46.260 | in order to be able to prevent that from happening. Does that make sense?
00:39:48.940 | Um, so just because the
00:39:50.580 | training is happening on my device does not mean it's not memorizing my data.
00:39:53.060 | Does that make sense? Okay.
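To make that loop concrete, here is a minimal sketch of federated averaging in Python. The linear model, toy data, and noise scale are illustrative assumptions rather than any production protocol; the point is simply that raw data never leaves a device, only model updates do, and that adding noise to those updates gestures at the combination with differential privacy mentioned above.

```python
# Minimal federated averaging sketch: the model goes out to each device,
# trains locally on private data, and only the (optionally noised) update
# comes back. Model, data, and noise scale are illustrative assumptions.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Train locally on one device's private data; return only the weight delta."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w - weights                       # raw data never leaves the device

def federated_round(weights, devices, noise_std=0.0):
    """One round: send the model out, average the returned updates."""
    updates = []
    for X, y in devices:
        delta = local_update(weights, X, y)
        if noise_std > 0:                    # crude stand-in for DP noise on updates
            delta = delta + np.random.normal(0, noise_std, size=delta.shape)
        updates.append(delta)
    return weights + np.mean(updates, axis=0)

# Toy example: three "phones", each holding a private slice of data.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
devices = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    devices.append((X, y))

weights = np.zeros(2)
for _ in range(20):
    weights = federated_round(weights, devices, noise_std=0.01)
print(weights)  # approaches [2, -1] without any device sharing its raw data
```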
00:39:55.660 | So now I want to zoom out and,
00:39:57.300 | and go a little less from the kind of
00:39:58.460 | the data science practitioner perspective.
00:40:00.540 | And now take more the perspective of like a,
00:40:02.860 | an economist or political scientist or some,
00:40:05.100 | someone looking kind of globally at like, okay,
00:40:07.300 | what- if, if this becomes mature, what happens?
00:40:10.020 | Right? And, and this is where it gets really exciting.
00:40:12.100 | Anyone entrepreneurial? Anyone? Everyone?
00:40:15.420 | I don't know. No one? Okay.
00:40:17.220 | Cool. Well, this is, this is the, this is the part for you.
00:40:19.700 | So, um, the big difference
00:40:23.820 | is this ability to answer questions using data you can't see.
00:40:26.700 | Because as it turns out,
00:40:28.820 | most people spend a great deal of their life just answering questions,
00:40:33.380 | and a lot of it is involving sort of personal data.
00:40:35.500 | I mean, whether it's minute things like, you know,
00:40:37.980 | where's my water, where are my keys,
00:40:39.740 | or you know, um,
00:40:41.420 | what movie should I watch tonight,
00:40:42.780 | or, or, um, you know,
00:40:46.460 | what kind of diet should I have to,
00:40:49.100 | to be able to sleep well, right?
00:40:50.700 | I mean, a, a wide variety of different questions, right?
00:40:53.220 | And, and we're limited in
00:40:55.380 | our answering ability to the information that we have, right?
00:40:58.860 | So this ability to answer questions using data we don't have,
00:41:01.540 | sociologically I think is quite, quite important.
00:41:04.100 | Um, and, um, there's four different areas that I want to
00:41:08.180 | highlight as like big groups of use cases for this kind of technology,
00:41:13.780 | um, to help kind of inspire you to see where this infrastructure can go.
00:41:16.540 | And actually before, before I jump into that,
00:41:18.300 | um, has anyone been to Edinburgh, Edinburgh?
00:41:21.220 | Cool. Uh, just to like the castle and stuff like that.
00:41:24.580 | Um, so my wife and I, um,
00:41:26.660 | this is my wife, Amber, um,
00:41:28.380 | we went to Edinburgh for the first time,
00:41:30.060 | um, six months ago?
00:41:32.620 | September? September. Um, and, uh,
00:41:36.500 | we did the underground, was it the-
00:41:40.660 | We did a ghost tour. Yeah, yeah, yeah.
00:41:42.660 | We did the ghost tour and, um, that was really cool.
00:41:45.460 | Um, [LAUGHTER] there was one thing that I took away from it.
00:41:48.500 | There was this point where we were standing, um,
00:41:50.740 | we had just walked out of the tunnels and she was pointing out some of the architecture.
00:41:54.580 | Um, and, uh, then, uh,
00:41:58.660 | she started talking about, um,
00:42:02.220 | basically the cobblestone streets and why the cobblestone streets are there.
00:42:07.300 | Cobblestone streets, one of the main purposes of them was to sort of lift you out of the muck.
00:42:11.740 | And the reason there was muck there is that they didn't have
00:42:13.900 | any internal plumbing, and so the sewage just poured out into the street,
00:42:17.140 | right? Because you live in a big city.
00:42:18.980 | Um, and this was the norm everywhere, right?
00:42:21.620 | And actually, I think she even sort of implied that like the invention or
00:42:24.100 | popularization of the umbrella had less to do with actual rain,
00:42:26.860 | and a bit more to do with buckets of stuff coming down from on high,
00:42:30.820 | um, which is, uh, uh,
00:42:33.300 | it's a whole different world like when you think about what that is.
00:42:36.180 | Um, but the, the reason that I bring this up, um,
00:42:39.620 | is that, you know,
00:42:42.180 | however many hundred years ago,
00:42:44.220 | people were, were walking through,
00:42:46.900 | you know, like sludge,
00:42:49.660 | sewage was just everywhere, right?
00:42:51.180 | It was all over the place and people were walking through it everywhere they went,
00:42:54.680 | and they were wondering why they got sick, right?
00:42:57.900 | And in many cases,
00:42:59.940 | and it wasn't because they wanted it to be that way,
00:43:02.140 | it's just because it was a natural consequence
00:43:03.700 | of the technology that they had at the time, right?
00:43:05.580 | This is not malice, this is not anyone being good or bad or,
00:43:08.940 | or evil or whatever, it's just,
00:43:10.280 | it's just the way things were.
00:43:11.900 | Um, and I think that there's a strong analogy to be made with,
00:43:17.800 | with kind of how our data is handled as society at the moment, right?
00:43:21.720 | We've just sort of walked into a society,
00:43:23.860 | we've had new inventions come up and new things that are practical,
00:43:25.840 | new uses for it, and now everywhere we go,
00:43:28.540 | we're constantly spreading and spewing our data all over the place, right?
00:43:33.100 | I mean, every, every camera that sees me walking down the street, you know,
00:43:36.660 | goodness, there's a, there's a company that takes
00:43:38.300 | a whole picture of the Earth by satellite every day.
00:43:40.580 | Like, how the hell am I supposed to do anything without,
00:43:43.580 | without, you know, everyone following me around all the time, right?
00:43:46.940 | And, um, I imagine that, um,
00:43:52.300 | whoever it was, I'm not a historian,
00:43:54.940 | so I don't really know, but whoever it was that said,
00:43:57.620 | "What if, what if we ran plumbing from every single apartment,
00:44:03.140 | business, school, maybe even some public toilets,
00:44:06.540 | underground, under our city,
00:44:08.720 | all to one location and then processed it,
00:44:10.820 | used chemical treatments, and then turned that into usable drinking water?"
00:44:14.500 | Like, how laughable would that have been?
00:44:16.540 | Would have been just the,
00:44:17.860 | the most massive logistical infrastructure problem
00:44:21.580 | ever to take a working city,
00:44:23.340 | dig up the whole thing,
00:44:24.900 | to take already, already constructed buildings,
00:44:27.700 | and run pipes through all of them.
00:44:29.460 | I mean, uh, so, so Oxford, uh, gosh.
00:44:32.340 | Um, I, there's a building there that's, um,
00:44:34.500 | so old they don't have showers because they didn't want to run the plumbing for the head.
00:44:37.700 | You have to ladle water over yourself.
00:44:39.100 | It's in, uh, Merton College.
00:44:40.140 | It's quite, quite famous, right?
00:44:41.500 | I mean, the, the, the infrastructure,
00:44:43.420 | anyway, the infrastructure challenges,
00:44:45.220 | um, they just must have seemed absolutely massive.
00:44:49.100 | And so, as I'm about to walk through kind of like
00:44:51.620 | four broad areas where things could be different,
00:44:54.260 | theoretically, based on this technology,
00:44:55.740 | and I think it's probably going to hit you like,
00:44:57.740 | "Whoa, that's a lot of code."
00:44:59.420 | [LAUGHTER]
00:45:00.140 | Or like, "Whoa, that's, that's a lot of change."
00:45:03.380 | Um, but, but I think that the,
00:45:05.780 | the need is sufficiently great.
00:45:08.340 | I think that, that, I mean,
00:45:10.460 | if you view our lives as just one long process of answering important questions,
00:45:15.100 | whether it's where we're going to get food or what causes cancer,
00:45:17.820 | like making sure that, that the right people can answer questions without,
00:45:20.980 | without, you know, data just getting spewed everywhere so that
00:45:23.620 | the wrong people can answer their questions, right, is important.
00:45:27.180 | And, um, yeah, anyway,
00:45:30.220 | so I know this is going to sound like there's
00:45:32.980 | a certain ridiculousness to
00:45:34.580 | maybe what some of this will be.
00:45:35.940 | But I, I hope that, that you will at least see that,
00:45:38.420 | that theoretically, like the,
00:45:39.580 | the basic blocks are there.
00:45:41.460 | And really what stands between us and a world that's fundamentally
00:45:45.180 | different is adoption,
00:45:47.620 | maturing of the technology, and good engineering.
00:45:50.700 | Um, because I think, you know,
00:45:52.420 | once, you know, Sir Thomas Crapper invented the toilet, right?
00:45:55.700 | I do remember that one. Um, um,
00:45:58.220 | at, at that point, the, the basics were there, right?
00:46:01.300 | And, and what stood between them was,
00:46:02.980 | was implementation, adoption, and engineering, right?
00:46:05.980 | And I, I think that that's, that's where we are.
00:46:08.500 | And, and the best part is we have, you know,
00:46:10.540 | companies like Google that have already,
00:46:12.060 | already paved the way with some very,
00:46:14.100 | very large rollouts of,
00:46:16.460 | of the, the early pieces of this technology, right?
00:46:19.340 | Cool. So what are the, what are the big categories?
00:46:22.980 | One I've already talked about, open data for science.
00:46:28.100 | Okay. So this one is a really big deal.
00:46:38.820 | And the reason it's a really big deal is mostly
00:46:41.420 | because everyone gets excited about making AI progress, right?
00:46:45.900 | Everyone gets super excited about superhuman ability in X, Y, Z.
00:46:49.860 | Um, when I started my PhD at Oxford,
00:46:51.940 | I, I worked for, my professor's name is, uh, Phil Blunsom.
00:46:54.660 | The first thing he told me when I sat my butt down in his office on my first day as a student,
00:46:58.140 | he said, "Andrew, everyone's gonna want to work on models.
00:47:00.060 | But if you look historically,
00:47:01.620 | the biggest jumps in progress have happened when we had
00:47:03.980 | new big datasets or the ability to process new big datasets."
00:47:08.740 | And just to give a few anecdotes,
00:47:10.700 | ImageNet, right?
00:47:12.060 | ImageNet, GPUs allowing us to process larger datasets.
00:47:16.140 | Um, even, even things like AlphaGo.
00:47:18.860 | That's a synthetically generated, effectively infinite dataset.
00:47:21.260 | Or, or, or if, I don't know, did you guys,
00:47:22.740 | anyone watch the, um,
00:47:23.860 | the AlphaStar, uh, live stream on YouTube?
00:47:26.740 | It talked about how it had trained on like 200 years of,
00:47:29.700 | of like, uh, of StarCraft, right?
00:47:32.980 | Um, or if you look at, um, um,
00:47:35.620 | Watson playing Jeopardy, right?
00:47:38.620 | Um, this, this was on the heels of,
00:47:40.420 | of a new large structured dataset based on Wikipedia.
00:47:43.820 | Or if you look at, um,
00:47:45.900 | um, Garry Kasparov and IBM's Deep Blue.
00:47:48.980 | This was on the heels of the largest open dataset of chess
00:47:53.260 | matches that had ever been published online, right?
00:47:55.820 | There's this, there's this echo where like big new dataset,
00:47:58.380 | big, big new breakthrough,
00:47:59.660 | big new dataset, big new breakthrough, right?
00:48:01.500 | And what we're talking about here is,
00:48:04.180 | is potentially, you know, several orders of magnitude,
00:48:07.140 | more data relatively quickly.
00:48:08.820 | And the reason for that is that,
00:48:10.460 | that we're not, I'm not saying we're gonna invent a new machine,
00:48:13.940 | and that machine is gonna collect this,
00:48:15.260 | and then it's gonna go online.
00:48:16.140 | I'm saying there's thousands and thousands of enterprises,
00:48:18.960 | millions of smartphones,
00:48:20.420 | there's, uh,
00:48:21.940 | hundreds of governments that all already have
00:48:24.340 | this data sitting inside of data warehouses, right?
00:48:27.220 | Largely untapped for two reasons.
00:48:29.340 | One, legal risk, and two,
00:48:31.540 | commercial viability, right?
00:48:33.420 | If I give you a dataset,
00:48:34.740 | all of a sudden I just doubled the supply, right?
00:48:36.900 | What does that do to my billing ability?
00:48:39.220 | And there's the legal risk that you might do
00:48:41.740 | something bad with it that comes back to hurt me.
00:48:44.860 | With this category, I know it's like just one phrase,
00:48:47.620 | but, but this is like ImageNet,
00:48:50.640 | but for every data task that's already been established, right?
00:48:57.900 | This is us, like, I mean, I, I,
00:49:00.140 | we're working with a professor at Oxford in
00:49:01.820 | the psychology department who wants to study dementia, right?
00:49:04.380 | The problem with dementia
00:49:06.660 | is every hospital has like five cases, right?
00:49:10.060 | It's not like a very centralized disease,
00:49:11.540 | it's not like all the, all the cancer patients go to,
00:49:14.060 | you know, one big center and like,
00:49:15.620 | it's where all the technology is.
00:49:16.740 | Like dementia, um, it's,
00:49:18.260 | it's, it's sprinkled everywhere.
00:49:20.940 | And so the big thing that's blocking him as
00:49:23.300 | a dementia researcher is access to data.
00:49:25.220 | And so he's investing in private data science platforms.
00:49:28.340 | And I didn't persuade him to, I,
00:49:29.860 | I found him after he was,
00:49:31.020 | he was already looking to do that.
00:49:32.700 | Um, but, but pick,
00:49:34.220 | pick any challenge that, that,
00:49:35.820 | where data is already being collected and,
00:49:37.740 | and this can unlock not larger amounts of data that exists,
00:49:40.980 | but larger amounts of data that can be,
00:49:42.820 | can be used together. Does that make sense?
00:49:45.380 | This is like a thousand startups right here.
00:49:47.980 | Whereas instead of going out and trying to buy as many datasets as you can,
00:49:51.540 | which is a really hard and really expensive task.
00:49:53.860 | Talk to anyone who's in Silicon Valley right now,
00:49:55.620 | trying to do a data science startup, right?
00:49:57.100 | Instead, you go to each individual person that has a dataset and you say,
00:50:00.660 | "Hey, let me create a gateway between you and
00:50:04.420 | the rest of the world that's gonna keep your data safe and allow people to leverage it."
00:50:07.340 | Right? That's like a repeatable business model.
00:50:12.580 | Pick a use case, right?
00:50:14.220 | Be, be the radiology network gatekeeper, right?
00:50:18.420 | Um, okay. So enough on that one.
00:50:21.820 | But like, does it make sense how like on a huge variety of tasks,
00:50:25.740 | just the ability to have a,
00:50:26.940 | a data box silo that you can do data science against,
00:50:30.360 | is gonna increase the accuracy of
00:50:32.060 | a huge variety of models really, really, really quickly.
00:50:34.980 | Cool? All right. Second one.
00:50:39.260 | Oh, that's not right. Single use accountability.
00:50:55.780 | Um, this one's a little bit tricky.
00:51:02.960 | Um, get to the airport and you get your bag checked, right?
00:51:08.120 | Everyone's familiar with this process, I assume.
00:51:10.520 | What happens? Someone's sitting at a monitor,
00:51:13.680 | and they see all the objects in your bag.
00:51:16.960 | So that occasionally, they can spot objects that are dangerous or illicit, right?
00:51:24.320 | There's a lot of extra information leakage,
00:51:26.900 | owing to the fact that they have
00:51:27.980 | to sit and look at thousands of bags,
00:51:30.460 | all of the objects, you know,
00:51:32.020 | basically searching every single person's bag totally and completely,
00:51:34.960 | just so that occasionally, they can find that one.
00:51:37.460 | The question they actually want to answer is,
00:51:39.880 | is there anything dangerous in this bag?
00:51:42.300 | But in order to answer it,
00:51:44.060 | they have to basically acquire access to the whole bag, right?
00:51:48.620 | So let's, let's, let's think about
00:51:52.260 | the same approach of answering questions using data we can't see.
00:51:55.860 | The best example of this in the analog world is a sniffing dog.
00:51:59.880 | Familiar with like sniffing dogs,
00:52:01.220 | so they give your bag a whiff at the airport, right?
00:52:03.640 | This is actually a really privacy-preserving thing,
00:52:06.300 | because dogs don't speak English or any other language.
00:52:09.860 | Um, and so the benefit is,
00:52:12.160 | the dog comes by, "Nope, everything's fine," moves on.
00:52:14.600 | The dog has the ability to only reveal one bit of
00:52:19.940 | information without you having to search every single bag.
00:52:24.300 | Okay? That is what I mean when I say a single-use accountability system.
00:52:30.600 | It means I am looking at
00:52:32.860 | some data stream because I'm holding someone accountable, right?
00:52:37.060 | And we want to make it so that I can only answer
00:52:39.380 | the question that I claim to be looking into.
00:52:42.020 | So if this is a video feed, right, for example, right?
00:52:44.980 | Instead of getting access to the raw video feed,
00:52:47.220 | and, and you know, the millions of bits of
00:52:48.940 | information every single person in the frame of view,
00:52:51.620 | walking around doing whatever, which I could use for,
00:52:54.220 | you know, even if I'm a good person,
00:52:55.820 | I technically could use for, for other purposes.
00:52:58.340 | But instead, build a system where I build,
00:53:02.180 | say, a machine learning classifier, right?
00:53:04.560 | That is an auditable piece of technology,
00:53:07.820 | that looks for whatever I'm supposed to be looking for, right?
00:53:10.660 | And I only see the frames,
00:53:12.780 | you know, I only open up the bags, that I actually have to.
00:53:16.300 | Okay. This does two things.
00:53:20.460 | One, it makes all of
00:53:25.220 | our accountability systems more privacy-preserving, which is great.
00:53:27.340 | Mitigates any potential dual or multi-use, right?
00:53:31.980 | And two, it means that
00:53:38.700 | some things that were simply too off-limits for us to,
00:53:42.980 | to properly hold people accountable might be possible, right?
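A toy sketch of that "sniffing dog" pattern, assuming a hypothetical stand-in classifier: the auditor only ever receives one bit per item, and the raw item would be opened only when that bit fires.

```python
# Single-use accountability sketch: the auditor sees one bit per item, never the
# raw stream. `flags_threat` is a hypothetical stand-in for an audited model.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class AuditResult:
    item_id: int
    flagged: bool                      # the single bit the auditor is allowed to learn

def single_use_audit(stream: Iterable[Tuple[int, object]],
                     flags_threat: Callable[[object], bool]) -> List[AuditResult]:
    results = []
    for item_id, raw_item in stream:
        flagged = flags_threat(raw_item)
        # raw_item is never stored or returned; only flagged items would be
        # escalated for a human to open, like the dog that just barks or doesn't.
        results.append(AuditResult(item_id, flagged))
    return results

# Toy usage: "bags" are lists of strings; the stand-in classifier looks for one token.
bags = [(0, ["laptop", "socks"]), (1, ["water", "scissors"]), (2, ["book"])]
report = single_use_audit(bags, flags_threat=lambda bag: "scissors" in bag)
print([r.item_id for r in report if r.flagged])   # -> [1]
```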
00:53:46.500 | One of the things that was really challenging,
00:53:48.580 | so we used to do email surveillance at Digital Reasoning, right?
00:53:52.540 | And it was basically helping
00:53:54.260 | investment banks find insider traders, right?
00:53:56.420 | Because they want to help enforce the laws,
00:53:57.940 | they, you know, they get billion-dollar fines if
00:54:00.100 | anyone causes an infraction.
00:54:03.500 | But one of the things that was really difficult about developing
00:54:05.220 | these kinds of systems was that it's so sensitive, right?
00:54:10.180 | We're talking about, you know,
00:54:11.820 | hundreds of millions of emails at some massive investment bank.
00:54:14.660 | There's so much private information in there that, say,
00:54:17.420 | none of our data scientists,
00:54:19.420 | or barely any of them, were able to actually
00:54:21.300 | work with the data and try to make it better, right?
00:54:24.220 | And, and, and this, this,
00:54:26.060 | yeah, this makes it really, really difficult.
00:54:27.420 | Anyway, cool. So enough on that.
00:54:29.060 | Third one, and this is the one I think is just incredibly exciting.
00:54:32.700 | End-to-end encrypted services.
00:54:35.820 | WhatsApp? Everyone familiar with WhatsApp, Telegram, any of these?
00:54:51.500 | These are messaging apps, right?
00:54:53.540 | Where a message is encrypted on your phone,
00:54:57.100 | and it's sent directly to someone else's phone,
00:54:59.780 | and only that person's phone can decrypt it, right?
00:55:02.540 | Which means that someone can provide a service, you know,
00:55:05.700 | messaging without the service provider seeing any of
00:55:08.500 | the information that they're actually providing the service over, right?
00:55:12.340 | Very powerful idea. So the intuition here is that,
00:55:20.220 | with a combination of machine learning,
00:55:21.660 | encrypted computation, and differential privacy,
00:55:24.460 | that we could do the same thing for entire services.
00:55:27.180 | So imagine going to the doctor, okay?
00:55:28.980 | So you go to the doctor.
00:55:30.220 | This is really a computation between two different datasets.
00:55:32.980 | On the one hand, you have dataset that the doctor has,
00:55:36.220 | which is their, you know,
00:55:38.420 | medical background, their knowledge of,
00:55:40.220 | of, of different procedures, and diseases,
00:55:42.780 | and tests, and all this kind of stuff.
00:55:44.860 | And then you have your dataset,
00:55:46.180 | which is your symptoms,
00:55:47.880 | your, your medical history, um, you know,
00:55:50.780 | your recent things that you've eaten, um,
00:55:53.100 | your, your genes, your genetic predisposition,
00:55:55.100 | your heritage, those kinds of things, right?
00:55:56.780 | And you're bringing these two datasets together,
00:55:58.900 | to compute a function.
00:56:00.540 | And that function is, what,
00:56:02.500 | what, what treatment should you have, if any?
00:56:05.860 | Okay? And the idea here is that,
00:56:10.820 | so there's this new, um,
00:56:13.380 | this new field called structured transparency,
00:56:15.220 | I guess I should probably mention.
00:56:17.100 | I'm not sure, I'm not even sure you can call it a new field yet,
00:56:25.180 | because it's not in the literature, but it's been
00:56:26.660 | bouncing around a few different circles.
00:56:28.620 | And the, um, and it's,
00:56:31.980 | it's, uh, F, X,
00:56:37.780 | Y, I'm not very good with chalk, sorry.
00:56:41.180 | Um, and then this is Z. Okay.
00:56:48.660 | So this, two different people providing their data together,
00:56:52.940 | computing a function, and an output.
00:56:54.820 | So, um, um, so differential privacy protects the output,
00:57:02.500 | encrypted computation, so like MPC,
00:57:04.860 | which we talked about earlier, protects the input, right?
00:57:10.340 | So it allows them to compute F of X and Y,
00:57:13.660 | right, without revealing their inputs. Remember this?
00:57:16.300 | So basically, encrypt Y,
00:57:17.940 | encrypt X, compute the function while it's encrypted.
00:57:20.260 | Do, do we remember, do we remember this?
00:57:22.100 | Right? And so there's, there's three processes here, right?
00:57:25.020 | There's input privacy, which is MPC,
00:57:26.980 | there's logic, and then there's output privacy.
00:57:32.060 | And this is what you need to be able to do end-to-end encrypted services.
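A toy sketch of those three pieces, assuming simple additive secret sharing, an addition-only function F, and Laplace noise for the output; the field size, function, and noise scale are illustrative, not a real protocol.

```python
# Structured transparency sketch: input privacy via additive secret sharing,
# the logic F(X, Y), and output privacy via differential privacy noise if the
# result is published. All parameters here are illustrative assumptions.
import numpy as np

Q = 2**31 - 1                           # arithmetic happens modulo a large prime
rng = np.random.default_rng(42)

def share(x):
    a = int(rng.integers(Q))
    return a, (x - a) % Q               # two additive shares; neither reveals x alone

def reconstruct(a, b):
    return (a + b) % Q

x, y = 42, 7                            # inputs held by two different owners

x0, x1 = share(x)                       # shares of X go to party 0 and party 1
y0, y1 = share(y)                       # shares of Y likewise

# Logic: F(X, Y) = X + Y can be evaluated share-wise, so neither party
# ever sees the other's input (input privacy).
z0 = (x0 + y0) % Q
z1 = (x1 + y1) % Q

z = reconstruct(z0, z1)                 # decrypted only for whoever receives both shares
print(z)                                # 49

# Output privacy: if Z is going to be published to the world, add DP noise first.
sensitivity, epsilon = 1.0, 0.5
published = z + rng.laplace(0, sensitivity / epsilon)
print(published)
```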
00:57:39.580 | Okay. So imagine, imagine, um,
00:57:42.140 | so there, there are machine learning models that can now do,
00:57:44.580 | um, skin cancer prediction, right?
00:57:46.260 | So I can take a picture of my, of my arm,
00:57:48.040 | and send it through machine, machine learning model,
00:57:50.100 | and it'll predict whether or not I have melanoma on my arm, right?
00:57:53.020 | Okay. So in this case,
00:57:56.940 | machine learning model, perhaps owned by a hospital or a startup,
00:58:02.140 | image of my arm, okay?
00:58:05.380 | Encrypt both, the logic is done by the machine learning model.
00:58:09.940 | The prediction, if it's gonna be published
00:58:14.380 | to the, to the rest of the world,
00:58:15.620 | you use differential privacy, but in this case,
00:58:17.180 | the prediction can come back to me,
00:58:19.580 | and only I see the decrypted result, okay?
00:58:23.780 | The implication being that the, the,
00:58:26.100 | the doctor role facilitated by machine learning can classify whether or not I have cancer,
00:58:31.700 | can provide this service without anyone seeing my medical information.
00:58:35.260 | I can go to the doctor and get a prognosis without ever revealing
00:58:39.220 | my medical records to anyone including the doctor, right?
00:58:43.480 | Does that make sense? And if you believe,
00:58:47.660 | if you believe that sort of the services that are repeatable,
00:58:51.860 | that we do for millions and millions of people, right?
00:58:54.100 | Can create a training dataset that we can then train a classifier to do,
00:58:58.540 | then we should be able to upgrade it to be end-to-end encrypted.
00:59:02.940 | Does that make sense? So again,
00:59:05.140 | it's kind of, it's kind of big.
00:59:06.860 | It assumes that, that AI is smart enough to do it.
00:59:09.740 | There's lots of questions around quality,
00:59:12.900 | and like quality assurance, and all these kinds of things,
00:59:15.260 | that have to be addressed.
00:59:17.020 | There's very likely to be different institutions that we need.
00:59:19.540 | But I hope that at least these three sort of big categories,
00:59:22.340 | this is by no means comprehensive,
00:59:23.940 | but I hope at least these three big categories will be sort of
00:59:26.540 | sufficient for helping sort of lay the groundwork for how sort of
00:59:29.980 | each person could be empowered with
00:59:31.740 | sole control over the only copies of their information,
00:59:34.140 | while still receiving the same goods and services they've become accustomed to.
00:59:38.860 | Cool. Thanks. Questions. Let's do it.
00:59:43.100 | >> First, please give Andrew a big hand.
00:59:47.700 | Andrew, it was fascinating, really, really fascinating.
00:59:57.780 | Amazing, amazing set of ideas,
01:00:00.100 | and hope that this can really get rid of the sewage of data.
01:00:04.500 | This vision of end-to-end encrypted services,
01:00:10.540 | if I understand correctly,
01:00:11.700 | the algorithm would also run on two or more servers,
01:00:15.780 | and the skin image would go to them,
01:00:18.060 | and then you would get the diagnosis.
01:00:21.180 | But the diagnosis itself is not private though,
01:00:24.020 | because the output of that is being revealed to the service provider.
01:00:29.780 | >> So it could optionally be revealed to the service provider.
01:00:32.580 | So in this case, oh yeah, something I didn't say.
01:00:34.380 | For secure MPC, for encrypted computation,
01:00:36.420 | with some exceptions.
01:00:38.580 | But for secure MPC,
01:00:39.980 | when you perform computation between two encrypted numbers,
01:00:43.260 | the result is encrypted between the same shareholders, if that makes sense.
01:00:46.700 | Meaning that by default,
01:00:48.420 | Z is still encrypted with the same keys as X and Y,
01:00:51.900 | and then it's up to the key holders to decide who they want to decrypt it for.
01:00:55.120 | So they could decrypt it for the general public,
01:00:56.740 | in which case they should apply differential privacy.
01:00:58.720 | They could decrypt it for the input owner,
01:01:02.100 | in which case the input owner is not going to hurt anybody else by him knowing
01:01:07.040 | whether he has a certain diagnosis,
01:01:09.900 | or it could be decrypted for the model owner,
01:01:14.900 | perhaps allow them to do more training or some other arbitrary use case.
01:01:18.700 | So it can be, but not as a strict requirement.
01:01:22.660 | >> Just to be sure, if Z is being computed by say two parties,
01:01:27.580 | to send Z back to Y in this case,
01:01:32.380 | the machine knows what Z is.
01:01:34.820 | So in that sense, even if you encrypt Z with the key of Y,
01:01:40.600 | there's no way to protect the output itself.
01:01:44.680 | >> I haven't described this correctly.
01:01:47.360 | So when we perform the encrypted computation,
01:01:54.360 | we split this into shares.
01:01:55.640 | So we'll say Y1 and Y2.
01:02:01.680 | Right? Y2 goes up here, right?
01:02:04.760 | And then this populates,
01:02:06.960 | what actually happens is this creates Z1 and Z2 at the end, right?
01:02:13.600 | Which is still, which is still owned by,
01:02:15.960 | you know, person Y. So we'll say this is Alice and this is Bob.
01:02:22.680 | Right? So we have Bob's share and Alice's share.
01:02:26.320 | What gets populated is, is shares of Z.
01:02:29.160 | So if Alice or if Bob sends his share of Z down to Alice,
01:02:33.560 | only Alice can decrypt the result.
01:02:35.680 | So does that make more sense? Okay, cool.
01:02:37.960 | >> So even the answer is-
01:02:39.680 | >> Even the answer is protected.
01:02:41.680 | Yeah. And you would only need to use
01:02:43.080 | differential privacy in the case you're planning to
01:02:45.340 | decrypt the result for some unknown audience to be able to see.
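A small sketch of that share flow, assuming additive secret sharing and a public constant standing in for the model's logic (a real protocol would secret-share the model too and needs extra machinery, such as Beaver triples, to multiply two secrets): Z only ever exists as shares, and only the party who ends up holding both shares can reconstruct it.

```python
# The result Z exists only as shares Z1 (Alice's) and Z2 (Bob's); it is revealed
# only to whoever the shareholders send their shares to. Values are illustrative.
import random

Q = 2**61 - 1                          # a large prime modulus

def share(x):
    r = random.randrange(Q)
    return (x - r) % Q, r              # (Alice's share, Bob's share)

y = 1234                               # Alice's private input
y_alice, y_bob = share(y)              # Y1 stays with Alice, Y2 goes to Bob

w = 3                                  # public constant standing in for the model's logic
z_alice = (w * y_alice) % Q            # each party computes on its own share
z_bob = (w * y_bob) % Q                # Bob holds only a share of Z, which reveals nothing

# Bob sends his share of Z down to Alice; only Alice can put Z back together.
z = (z_alice + z_bob) % Q
print(z)                               # 3702 -- Bob never learned Y or Z
```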
01:02:50.200 | >> Some models are biased based on real data biases,
01:02:55.200 | and society tries to make unbiased models,
01:02:59.600 | like on gender, race, and so on.
01:03:01.800 | How does it work with privacy,
01:03:04.120 | especially when everything is encrypted?
01:03:06.040 | So how can you unbiased models
01:03:08.920 | when you do not see biases in the data and so on?
01:03:11.960 | >> That's a great question. So the first,
01:03:15.400 | the first gimme for that is that people don't
01:03:18.820 | ever really de-bias a model by physically reading the weights.
01:03:21.480 | Right? So the fact that the weights are encrypted
01:03:23.600 | doesn't necessarily help or hurt you.
01:03:25.800 | So really what it's about is just making sure that you provision
01:03:29.640 | enough of your privacy budget to allow you to do
01:03:31.280 | the introspection that you need to be able to
01:03:33.000 | measure and adjust for bias.
01:03:34.760 | So I think that's, is that sufficient?
01:03:37.400 | >> Yeah.
01:03:37.960 | >> Cool. Awesome. Great question then.
01:03:40.760 | >> How far away do you think we are from organizations like
01:03:44.880 | the FDA requiring differential privacy to be used
01:03:47.800 | in regulating medical algorithms?
01:03:52.360 | >> So I think the best answer I can give to that,
01:03:57.560 | so one, I don't know. And even laws in
01:04:01.120 | the UK regarding privacy, like GDPR, are not
01:04:03.600 | prescriptive about things like differential privacy.
01:04:06.360 | But I think the best and most relevant data point I have for you on
01:04:09.440 | that is that the US Census this year is going to be
01:04:12.720 | protecting the census data,
01:04:14.000 | the 2020 census data using differential privacy.
01:04:16.560 | And some of the leading work on actually
01:04:19.640 | applying differential privacy in the world is going on at
01:04:21.320 | the US Census and I'm sure they'd be
01:04:23.840 | interested in more helpers
01:04:25.840 | if anyone was interested in joining them.
01:04:27.840 | >> So I guess her question was kind of one of my questions,
01:04:32.880 | but it was more just like how much buy-in are you getting in terms of
01:04:36.160 | adoption for OpenMined or any of,
01:04:40.880 | like, do you have like any hospitals that are like participating or?
01:04:45.760 | >> Yeah. So actually there's a few things
01:04:47.440 | I probably should have mentioned.
01:04:49.040 | So the, so OpenMined is about two and a half years old.
01:04:55.600 | In the very beginning, we had very little buy-in.
01:04:58.680 | Because it was just so early, it was kind of like,
01:05:00.880 | who cares about privacy?
01:05:02.240 | No one's ever going to sort of really, really care about that.
01:05:04.760 | Post GDPR, total, total change, right?
01:05:08.480 | Everyone's scrambling to protect the data.
01:05:11.080 | But the truth is, it's not just privacy,
01:05:12.920 | it's also commercial usability.
01:05:14.600 | Right now, if you're selling data,
01:05:16.120 | every time you sell it,
01:05:17.440 | you lower the price because you increase the supply and you
01:05:19.520 | increase the number of people that are also selling it.
01:05:22.000 | So I think that there's people also waking up to kind of
01:05:24.200 | the commercial reasons for protecting
01:05:26.800 | your own datasets and protecting the unique statistical signal that they have.
01:05:31.720 | It's also worth mentioning, so the PyTorch team recently
01:05:35.760 | sponsored $250,000 in open-source grants to fund
01:05:39.280 | people to work on our PySyft library, which is really good.
01:05:43.200 | We're hoping to announce sort of
01:05:45.000 | more grants of similar size later in the year.
01:05:47.280 | So if you guys like working on
01:05:49.120 | open-source code and like to get paid to do so,
01:05:52.880 | to that extent, that's sort of a big vote
01:05:57.640 | and buy-in as far as our community is concerned.
01:06:00.480 | So this year is when I hope to see kind of the first pilots rolling out.
01:06:05.720 | There are some that are sort of in the works,
01:06:08.200 | but they're not public yet.
01:06:10.800 | But yeah, so I think basically this is the year for like pilots.
01:06:14.480 | I think it's about as far as we are.
01:06:16.680 | >> And then I have another question that's kind of on
01:06:19.240 | the opposite end of the spectrum that's a little more technical weeds.
01:06:22.440 | >> Cool.
01:06:23.080 | >> So when you're doing the encryption where
01:06:28.000 | you separate everything into each of the different owners,
01:06:31.600 | how does that work for non-linear functions?
01:06:34.120 | Because you need that linearity to add it back and for it to maintain the-
01:06:40.960 | >> Totally. So the non-linear functions are
01:06:44.280 | the most performance intensive.
01:06:47.840 | So you get the biggest performance hit when you have to do them.
01:06:51.760 | The- for deep learning specifically,
01:06:56.600 | there's kind of two trends.
01:06:58.440 | So one line of research is around using polynomial approximations.
01:07:03.160 | And then the other line is around doing sort of discrete comparison functions.
01:07:08.520 | So which is good for ReLUs and it's good for
01:07:10.560 | lopping off the ends of your polynomials so
01:07:12.400 | that your unstable tails can be flat.
01:07:15.320 | And I would say that's about that.
01:07:18.520 | And then like the science of kind of like trying to relax
01:07:20.600 | your security assumptions strategically here and
01:07:22.280 | there to get more performance is about where we're at.
01:07:24.520 | But as far as the one thing is worth mentioning though is that,
01:07:27.680 | there are kind of what I described was
01:07:31.840 | secure MPC sort of on integers and fixed-precision numbers.
01:07:38.960 | You can also do it on sort of binary,
01:07:40.880 | but in that case you take a huge performance hit doing it with binary,
01:07:43.920 | but you also get the ability to do things sort of more classically with computing.
01:07:47.120 | Encrypted computation is sort of like doing computing in the 70s.
01:07:49.840 | Like you get a lot of the same kind of constraints.
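A quick sketch of the polynomial-approximation line mentioned above, with an illustrative interval and degree: replace a non-linearity such as sigmoid with a low-degree polynomial so the encrypted computation only ever needs additions and multiplications.

```python
# Polynomial approximation of sigmoid for encrypted evaluation. Interval and
# degree are illustrative; real systems tune these and clip the unstable tails.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fit a degree-3 polynomial to sigmoid on [-5, 5] by least squares.
xs = np.linspace(-5, 5, 1001)
coeffs = np.polyfit(xs, sigmoid(xs), deg=3)
poly_sigmoid = np.poly1d(coeffs)        # only additions and multiplications needed

test = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(np.round(sigmoid(test), 3))
print(np.round(poly_sigmoid(test), 3))  # close inside the interval, diverges outside,
                                        # which is why comparison protocols clip the tails
```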
01:07:53.000 | >> Thank you very much for your talk, Andrew.
01:07:55.520 | >> Yeah.
01:07:55.880 | >> I'm wondering about your objective to ultimately
01:08:00.640 | allow every individual to assign a privacy budget.
01:08:05.720 | You mentioned that it would take a lot of work to provide
01:08:10.720 | the infrastructure for that to be possible.
01:08:14.720 | Do you have an idea for what kind of infrastructure is necessary?
01:08:17.840 | And also when people are reluctant, even perhaps lazy,
01:08:22.960 | and they don't really care and they don't want their data to be protected.
01:08:27.240 | >> Yeah.
01:08:27.740 | >> I guess it takes some training, but yeah,
01:08:31.600 | what are your thoughts on building that infrastructure?
01:08:34.880 | >> I think it's going to come in waves.
01:08:37.280 | It's the kind of thing where people don't usually invest money and
01:08:39.440 | time and resources into things that aren't like a straight shot to value.
01:08:42.560 | So I think there's probably going to be multiple individual discrete jumps.
01:08:46.040 | The first one is going to be just enterprise adoption.
01:08:48.480 | Enterprises are the ones that already have all the data.
01:08:50.360 | So they're the ones who are most natural to
01:08:51.840 | start adopting privacy-preserving technologies.
01:08:54.080 | I think that that adoption is going to be driven
01:08:56.200 | primarily by commercial reasons for commercial reasons,
01:08:59.160 | meaning my data is inherently more valuable if I can
01:09:01.360 | keep it scarce while allowing people to answer questions with it.
01:09:03.800 | Does that make sense? So it's more profitable for me to not send copies of
01:09:08.480 | my data to people if I can actually have them bring
01:09:10.280 | their question answering mechanisms to
01:09:11.640 | me and just get their questions answered. Does that make sense?
01:09:14.400 | That's not a privacy narrative,
01:09:15.940 | but I think that that narrative is going to mature
01:09:17.280 | privacy technology quite quickly.
01:09:20.320 | Post enterprise adoption, I think that's when it starts reaching individuals.
01:09:30.480 | And encrypted services are still really hard at this point.
01:09:33.960 | The reason for that is that they require
01:09:35.640 | lots of compute and lots of network overhead,
01:09:37.600 | which means that you probably want to have something in the Cloud.
01:09:42.320 | Some machine that you can control in
01:09:44.320 | the Cloud or have the Internet get a lot faster.
01:09:47.120 | But there's this question of how do we actually get to a world
01:09:51.920 | where each individual person knows or has
01:09:55.920 | notional control over their own personal privacy budget.
01:10:00.040 | Let's just say you had
01:10:04.520 | perfect enterprise adoption and everyone's
01:10:06.360 | tracking their stuff with differential privacy.
01:10:08.360 | The piece that you're actually missing here is just some communication between
01:10:13.320 | all the different enterprises that are joining up.
01:10:16.200 | It's just an accounting mechanism.
01:10:17.840 | It's a lot like the IRS.
01:10:20.960 | It's just someone to be there to make sure
01:10:24.480 | that you're not double spending in different places.
01:10:28.080 | Your Epsilon budget that's over here versus over here,
01:10:31.480 | versus over here is all coming from the same place.
01:10:34.000 | It's not totally clear who this actor would be.
01:10:37.760 | Maybe there's an app that just does it for you.
01:10:41.280 | Maybe there has to be an institution around it.
01:10:43.000 | Maybe it won't happen at all.
01:10:44.640 | Maybe it'll just be decentralized,
01:10:46.120 | but whatever.
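A toy sketch of that accounting mechanism, with made-up budget values and organization names: one ledger tracks the epsilon spent on a person's data across organizations and refuses any query that would overspend it.

```python
# One global privacy-budget ledger per person; each organization's query spends
# from the same epsilon pool, so the budget cannot be double-spent elsewhere.
class PrivacyBudgetLedger:
    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = {}                               # organization -> epsilon spent

    def remaining(self) -> float:
        return self.total_epsilon - sum(self.spent.values())

    def authorize(self, organization: str, epsilon: float) -> bool:
        """Approve a query only if it fits in the person's remaining global budget."""
        if epsilon > self.remaining():
            return False
        self.spent[organization] = self.spent.get(organization, 0.0) + epsilon
        return True

ledger = PrivacyBudgetLedger(total_epsilon=1.0)
print(ledger.authorize("hospital", 0.4))     # True  -- 0.6 left
print(ledger.authorize("census", 0.4))       # True  -- 0.2 left
print(ledger.authorize("ad_network", 0.4))   # False -- would overspend the shared budget
```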
01:10:48.240 | Another option is that there'll actually be data banks.
01:10:51.440 | There's been some literature in the last couple of years around saying,
01:10:53.880 | "Okay, maybe institutions that they currently handle your money
01:10:58.840 | might also be the bank where all of your information lives."
01:11:02.440 | That becomes the gateway to your data or something like that.
01:11:05.600 | So there's different things that are,
01:11:07.080 | because that would obviously make the accounting much easier.
01:11:09.600 | Also, that would give you that Cloud-to-Cloud performance increase.
01:11:13.400 | So I think it's clear we wouldn't go to data banks or
01:11:17.640 | these kinds of centralized accounting registries
01:11:19.760 | directly because you have to have the initial adoption first.
01:11:22.720 | But if I had to guess, it's something like that.
01:11:25.880 | We won't see that for a while.
01:11:28.400 | It's not even clear what that would look like.
01:11:32.120 | But I think it is possible we just have to
01:11:34.920 | get through non-trivial adoption first.
01:11:37.200 | >> Thank you.
01:11:37.680 | >> Yeah. So it's kind of a hazy,
01:11:39.680 | but it's predicting the future.
01:11:41.680 | So I guess that's how that goes.
01:11:44.000 | >> I was just wondering if you can comment briefly on what you
01:11:48.840 | think is the biggest mistake being made with
01:11:52.080 | respect to recommendation systems transparency,
01:11:54.880 | and if you can comment briefly on what you think might
01:11:58.080 | be the best solution.
01:12:01.520 | >> So I don't know if this is a mistake.
01:12:02.920 | I would say the biggest opportunity for
01:12:04.240 | recommendation systems is that they
01:12:05.400 | have the potential to be more holistic.
01:12:06.880 | So for example,
01:12:09.600 | if you recommended a movie to me based on whether or not it's
01:12:14.800 | most likely to keep me engaged,
01:12:16.960 | keep me watching movies,
01:12:18.720 | it's not really a holistic recommendation.
01:12:20.640 | It's not saying, "Hey, you should do this because it's going
01:12:22.360 | to make your life more fulfilling, more satisfied, whatever."
01:12:25.800 | It's just going to glue me to my television more.
01:12:28.400 | So I think the biggest opportunity,
01:12:30.560 | and particularly with privacy-preserving machine learning,
01:12:32.480 | is that if a recommender system could have
01:12:34.680 | the ability to access private data without actually seeing it,
01:12:37.720 | and answer the question, how do I give
01:12:40.280 | the best recommendation so that this person gets
01:12:42.200 | a good night's sleep or has
01:12:43.680 | more meaningful friendships or whatever.
01:12:45.320 | Like these attributes that are actually particularly sensitive,
01:12:47.920 | but there are things that we actually want to optimize for,
01:12:50.440 | that we could have vastly more
01:12:52.400 | beneficial recommendation systems than we do now,
01:12:54.760 | just by virtue of having better
01:12:55.920 | infrastructure for dealing with private data.
01:12:57.360 | So they actually, as far as
01:12:59.080 | like the biggest limitation of recommender systems right now,
01:13:03.200 | it's just that they don't have access to enough
01:13:04.520 | information to have good targets.
01:13:07.320 | Does that make sense? We would like for them to have better targets,
01:13:11.000 | but in order to do that, they have to have access
01:13:12.520 | to information about those targets.
01:13:14.480 | I think that's what privacy-preserving technologies
01:13:16.920 | could bring to bear on recommendation systems.
01:13:18.640 | >> Thanks.
01:13:19.600 | >> Yeah. Great question, by the way.
01:13:21.920 | >> One more time, please give Andrew a big hand. Thank you so much.
01:13:24.720 | >> Thanks.
01:13:26.160 | >> Thanks for having me.
01:13:29.760 | >> Thank you.