
What is Statistics? (Michael I. Jordan) | AI Podcast Clips


Chapters

0:03 What Is Statistics
1:38 Inverse Probability
4:44 Decision Theory
5:25 Bayesian vs. Frequentist
7:26 Empirical Bayes
8:26 False Discovery Rate

Whisper Transcript

00:00:00.000 | - An absurd question, but what is statistics?
00:00:05.000 | - So here it's a little bit,
00:00:08.640 | it's somewhere between math and science and technology.
00:00:10.960 | It's somewhere in that convex hull.
00:00:12.200 | So it's a set of principles that allow you
00:00:14.780 | to make inferences that have got some reason to be believed.
00:00:17.700 | And also principles that allow you to make decisions
00:00:20.220 | where you can have some reason to believe
00:00:22.160 | you're not gonna make errors.
00:00:23.880 | So all that requires some assumptions:
00:00:25.320 | what do you mean by an error?
00:00:26.400 | What do you mean by the probabilities?
00:00:28.800 | But, you know,
00:00:31.640 | after you start making some assumptions,
00:00:33.120 | you're led to conclusions that yes,
00:00:36.760 | I can guarantee that, you know,
00:00:38.520 | if you do this in this way,
00:00:39.560 | your probability of making an error will be small.
00:00:42.360 | Your probability of continuing to not make errors
00:00:45.000 | over time will be high.
00:00:46.480 | And the probability you found something that's real
00:00:49.320 | will be high.
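
One canonical form such a guarantee takes, as a sketch (the notation here is illustrative, not the speaker's): a procedure that builds a confidence set C(X) from the data is required to capture the unknown truth with high probability, whatever that truth happens to be.

```latex
% Illustrative frequentist guarantee: for every state of nature \theta,
% the data-driven set C(X) contains \theta with probability at least 1 - \alpha.
\[
\inf_{\theta}\; \mathbb{P}_{\theta}\bigl(\theta \in C(X)\bigr) \;\ge\; 1 - \alpha
\]
```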
00:00:51.480 | - So decision-making is a big part.
00:00:53.320 | - Decision-making is a big part, yeah.
00:00:54.520 | So statistics, you know,
00:00:57.320 | the short history is that
00:00:58.720 | it goes back as a formal discipline
00:01:01.560 | about 250 years or so.
00:01:03.720 | It was called inverse probability
00:01:05.280 | because around that era, probability was developed
00:01:08.200 | sort of especially to explain gambling situations.
00:01:12.160 | - Of course. - And--
00:01:13.560 | - Interesting.
00:01:14.400 | - So you would say, well, given the state of nature is this,
00:01:17.040 | there's a certain roulette board
00:01:18.000 | that has a certain mechanism in it.
00:01:19.720 | What kind of outcomes do I expect to see?
00:01:22.280 | And especially if I do things over long amounts of time,
00:01:25.680 | what outcomes will I see?
00:01:26.520 | And the physicists started to pay attention to this.
00:01:29.360 | And then people said, well,
00:01:30.800 | let's turn the problem around.
00:01:32.240 | What if I saw certain outcomes?
00:01:34.120 | Could I infer what the underlying mechanism was?
00:01:36.200 | That's an inverse problem.
00:01:37.300 | And in fact, for quite a while,
00:01:38.720 | statistics was called inverse probability.
00:01:40.760 | That was the name of the field.
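
To make the "inverse" direction concrete, here is a minimal sketch in Python (the two candidate wheel mechanisms and the observed outcomes are made up for illustration): the forward direction assigns probabilities to outcomes given a mechanism, and Bayes' rule inverts that to weigh mechanisms given outcomes.

```python
from math import prod

# Forward probability: each hypothesized mechanism assigns a
# probability to the outcome "red" on a roulette-style wheel.
mechanisms = {"fair wheel": 18 / 38, "biased wheel": 0.60}
prior = {"fair wheel": 0.5, "biased wheel": 0.5}  # illustrative prior

# Observed outcomes: 1 = red, 0 = not red.
outcomes = [1, 1, 0, 1, 1, 1, 0, 1]

# Inverse probability: likelihood of the observed outcomes under each mechanism...
likelihood = {
    m: prod(p if y else 1 - p for y in outcomes)
    for m, p in mechanisms.items()
}

# ...combined with the prior via Bayes' rule to get a posterior over
# mechanisms, i.e. an inference from outcomes back to the underlying cause.
evidence = sum(likelihood[m] * prior[m] for m in mechanisms)
posterior = {m: likelihood[m] * prior[m] / evidence for m in mechanisms}
print(posterior)
```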
00:01:42.680 | And I believe that it was Laplace,
00:01:46.200 | who was working in Napoleon's government,
00:01:48.280 | who needed to do a census of France,
00:01:51.880 | to learn about the people there.
00:01:53.080 | So he went and gathered data
00:01:54.760 | and he analyzed that data to determine policy
00:01:58.800 | and said, well, let's call this field
00:02:01.680 | that does this kind of thing statistics
00:02:03.400 | 'cause the word state is in there.
00:02:06.200 | In French, that's état.
00:02:07.480 | But it's the study of data for the state.
00:02:10.780 | So anyway, that caught on
00:02:13.460 | and it's been called statistics ever since.
00:02:15.960 | But by the time it got formalized,
00:02:19.280 | it was sort of in the 30s.
00:02:20.720 | And around that time, there was game theory
00:02:24.000 | and decision theory developed nearby.
00:02:26.220 | People in that era didn't think of themselves
00:02:28.640 | as either computer science or statistics
00:02:30.480 | or control or econ.
00:02:31.480 | They were all the above.
00:02:33.240 | And so, Von Neumann is developing game theory,
00:02:35.400 | but also thinking of that as decision theory.
00:02:38.040 | Wald was an econometrician developing decision theory
00:02:41.280 | and then turned that into statistics.
00:02:43.680 | And so it's all about, here's not just data
00:02:46.880 | and you analyze it.
00:02:47.720 | Here's a loss function.
00:02:48.960 | Here's what you care about.
00:02:49.800 | Here's the question you're trying to ask.
00:02:51.680 | Here is a probability model
00:02:53.640 | and here is the risk you will face
00:02:55.200 | if you make certain decisions.
00:02:56.700 | And to this day, in most advanced statistical curricula,
00:03:01.840 | you teach decision theory as the starting point.
00:03:04.160 | And then it branches out into the two branches
00:03:06.200 | of Bayesian and Frequentist.
00:03:07.240 | But it's all about decisions.
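
In symbols, a sketch of that Wald-style setup (the notation is illustrative): the data X follow a model indexed by a state of nature theta, a decision rule delta maps data to actions, and the risk is the expected loss the rule faces at each theta.

```latex
% Illustrative Wald-style decision theory: loss L, rule \delta, data X ~ P_\theta.
\[
R(\theta, \delta) \;=\; \mathbb{E}_{X \sim P_\theta}\bigl[\, L\bigl(\theta, \delta(X)\bigr) \,\bigr]
\qquad \text{(the risk of rule } \delta \text{ at state of nature } \theta\text{)}
\]
```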
00:03:09.380 | - In statistics, what is the most beautiful,
00:03:15.040 | mysterious, maybe surprising idea that you've come across?
00:03:18.940 | - Yeah, good question.
00:03:23.100 | I mean, there's a bunch of surprising ones.
00:03:26.340 | There's something that's way too technical for this thing,
00:03:28.420 | but something called James-Stein estimation,
00:03:30.240 | which is kind of surprising
00:03:32.340 | and really takes time to wrap your head around.
00:03:34.740 | - Can you try to maybe--
00:03:35.980 | - Nah, I think I don't even wanna try.
00:03:38.740 | Let me just say a colleague, Steven Stigler
00:03:41.540 | at the University of Chicago, wrote a really beautiful paper
00:03:43.460 | on James-Stein estimation, which helps.
00:03:45.940 | It's viewed as a paradox.
00:03:47.260 | It kind of defeats the mind's attempts to understand it,
00:03:49.460 | but you can, and Steve has a nice perspective on that.
00:03:52.420 | So one of the troubles with statistics
00:03:57.220 | is that it's like in physics, or in quantum physics,
00:03:59.580 | you have multiple interpretations.
00:04:01.300 | There's a wave and particle duality in physics.
00:04:03.580 | And you get used to that over time,
00:04:06.100 | but it still kind of haunts you
00:04:07.380 | that you don't really quite understand the relationship.
00:04:10.500 | The electron's a wave and the electron's a particle.
00:04:12.900 | Well, the same thing happens here.
00:04:15.500 | There's Bayesian ways of thinking and Frequentist,
00:04:17.900 | and they are different.
00:04:19.260 | They sometimes become sort of the same in practice,
00:04:22.620 | but they are philosophically different.
00:04:23.740 | And then in some cases, they are not the same at all.
00:04:26.460 | They give you rather different answers.
00:04:29.220 | And so it is very much like wave and particle duality,
00:04:31.780 | and that is something you have to kind of
00:04:33.140 | get used to in the field.
00:04:34.580 | - Can you define Bayesian and Frequentist?
00:04:36.460 | - Yeah. I have a video
00:04:37.740 | that people could see,
00:04:40.120 | called 'Are You a Bayesian or a Frequentist?',
00:04:42.060 | which kind of helps make it really clear.
00:04:44.860 | It comes from decision theory.
00:04:45.940 | So, decision theory, you're talking about loss functions,
00:04:50.180 | which are a function of data X and parameter theta.
00:04:53.700 | So, they're a function of two arguments.
00:04:55.700 | Neither one of those arguments is known.
00:04:58.680 | You don't know the data a priori, it's random,
00:05:01.700 | and the parameter's unknown.
00:05:03.260 | So you have this function of two things you don't know,
00:05:05.100 | and you're trying to say, I want that function to be small.
00:05:07.060 | I want small loss.
00:05:08.100 | Well, what are you gonna do?
00:05:12.220 | So you sort of say, well, I'm gonna average
00:05:13.980 | over these quantities or maximize over them or something,
00:05:16.780 | so that I turn that uncertainty into something certain.
00:05:20.840 | So you could look at the first argument and average over it,
00:05:24.260 | or you could look at the second argument, average over it.
00:05:25.780 | That's Bayesian and Frequentist.
00:05:26.780 | So the Frequentist says, I'm gonna look at the X, the data,
00:05:31.020 | and I'm gonna take that as random,
00:05:32.540 | and I'm gonna average over the distribution.
00:05:34.100 | So I take the expectation of loss under X.
00:05:37.420 | Theta's held fixed, all right?
00:05:39.420 | That's called the risk.
00:05:40.860 | And so it's looking at all the datasets you could get,
00:05:43.820 | and saying how well will a certain procedure do
00:05:46.980 | under all those datasets?
00:05:48.860 | That's called a Frequentist guarantee.
00:05:51.300 | So I think it is very appropriate
00:05:52.900 | when you're building a piece of software,
00:05:54.740 | and you're shipping it out there,
00:05:55.820 | and people are using it on all kinds of datasets.
00:05:58.020 | You wanna have a stamp, a guarantee on it,
00:05:59.700 | that as people run it on many, many datasets
00:06:01.460 | that you never even thought about,
00:06:02.540 | that 95% of the time it will do the right thing.
00:06:05.000 | Perfectly reasonable.
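
As a sketch of that frequentist computation (the Gaussian model, sample size, and theta value are all illustrative): hold theta fixed, simulate many datasets from it, and average the loss a procedure incurs across them.

```python
import random

def frequentist_risk(theta, procedure, n=20, trials=10_000, seed=0):
    """Monte Carlo estimate of E_X[L(theta, procedure(X))] with theta held
    fixed: average squared-error loss over many datasets drawn from P_theta."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        data = [rng.gauss(theta, 1.0) for _ in range(n)]  # one dataset X ~ P_theta
        total += (procedure(data) - theta) ** 2           # squared-error loss
    return total / trials

sample_mean = lambda xs: sum(xs) / len(xs)

# The risk of the sample mean at theta = 2 under this model is 1/n = 0.05;
# the Monte Carlo average should land close to that.
print(frequentist_risk(2.0, sample_mean))
```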
00:06:07.260 | The Bayesian perspective says, well, no,
00:06:10.540 | I'm gonna look at the other argument of the loss function,
00:06:12.500 | the theta part, okay?
00:06:13.820 | That's unknown, and I'm uncertain about it.
00:06:16.300 | So I could have my own personal probability for what it is.
00:06:19.500 | How many tall people are there out there?
00:06:20.980 | I'm trying to infer the average height of the population.
00:06:22.860 | Well, I have an idea of roughly what the height is.
00:06:26.140 | So I'm gonna average over the theta.
00:06:30.860 | So now, again, one argument
00:06:34.260 | of that loss function is gone.
00:06:35.940 | Now it's a function of X.
00:06:37.580 | And that's what a Bayesian does, is they say,
00:06:39.180 | well, let's just focus on the particular X we got,
00:06:41.180 | the dataset we got.
00:06:42.020 | We condition on that.
00:06:43.780 | Condition on the X, I say something about my loss.
00:06:46.940 | That's a Bayesian approach to things.
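
And a matching sketch of the Bayesian computation (a conjugate normal model with made-up numbers): condition on the one dataset you actually observed, and average the loss over theta under the resulting posterior.

```python
# Illustrative conjugate model: theta ~ N(m0, v0), X_i | theta ~ N(theta, 1).
# Conditioning on the observed dataset gives a normal posterior in closed form.
def posterior(data, m0=0.0, v0=100.0, noise_var=1.0):
    n = len(data)
    v_post = 1.0 / (1.0 / v0 + n / noise_var)
    m_post = v_post * (m0 / v0 + sum(data) / noise_var)
    return m_post, v_post

data = [1.8, 2.2, 2.5, 1.9, 2.1]   # the one dataset we condition on
m_post, v_post = posterior(data)

# Posterior expected squared-error loss of reporting the estimate a:
#   E_theta[(a - theta)^2 | X] = (a - m_post)^2 + v_post,
# which is minimized by a = m_post, the posterior mean.
a = m_post
expected_loss = (a - m_post) ** 2 + v_post
print(m_post, expected_loss)
```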
00:06:49.140 | And the Bayesian will argue that it's not relevant
00:06:52.060 | to look at all the other datasets you could have gotten
00:06:54.700 | and average over them, the frequentist approach.
00:06:57.500 | It's really only the dataset you got, all right?
00:07:00.780 | And I do agree with that,
00:07:02.340 | especially in situations where you're working
00:07:04.020 | with a scientist, you can learn a lot about the domain,
00:07:06.380 | and you really only focus on certain kinds of data,
00:07:08.420 | and you've gathered your data, and you make inferences.
00:07:11.140 | I don't agree with it though, in the sense that
00:07:14.780 | there are needs for frequentist guarantees.
00:07:16.980 | You're writing software, people are using it out there,
00:07:18.820 | you wanna say something.
00:07:19.700 | So these two things have got to fight each other
00:07:21.740 | a little bit, but they have to blend.
00:07:23.620 | So long story short, there's a set of ideas
00:07:25.580 | that are right in the middle,
00:07:26.420 | that are called empirical Bayes.
00:07:28.420 | And empirical Bayes sort of starts
00:07:30.340 | with the Bayesian framework.
00:07:31.740 | It's kind of arguably philosophically more reasonable
00:07:37.940 | and kosher. You write down a bunch of the math
00:07:40.780 | that kind of flows from that,
00:07:42.260 | and then realize there's a bunch of things you don't know,
00:07:44.180 | because it's the real world,
00:07:45.540 | and you don't know everything,
00:07:46.820 | so you're uncertain about certain quantities.
00:07:48.900 | At that point, ask, is there a reasonable way
00:07:50.940 | to plug in an estimate for those things?
00:07:52.940 | And in some cases, there's quite a reasonable thing
00:07:57.980 | to plug in.
00:07:58.820 | There's a natural thing you can observe in the world
00:08:00.620 | that you can plug in,
00:08:01.980 | and then do a little bit more mathematics
00:08:03.500 | and assure yourself it's really good.
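
A minimal sketch of that plug-in move, in a normal-means setup close in spirit to James-Stein shrinkage (all numbers illustrative): the Bayes formula needs a prior variance you don't know, but the marginal distribution of the data lets you estimate it and plug it in.

```python
# Illustrative empirical Bayes for normal means:
#   theta_i ~ N(0, tau2) with tau2 unknown,  X_i | theta_i ~ N(theta_i, 1).
# The Bayes estimate would be (tau2 / (tau2 + 1)) * X_i, but tau2 is unknown.
def empirical_bayes_shrink(xs):
    n = len(xs)
    # Marginally X_i ~ N(0, tau2 + 1), so the mean of X_i^2 estimates tau2 + 1:
    # a natural quantity you can observe in the world and plug in.
    tau2_hat = max(sum(x * x for x in xs) / n - 1.0, 0.0)
    shrink = tau2_hat / (tau2_hat + 1.0)
    return [shrink * x for x in xs]

xs = [2.1, -0.3, 1.4, 0.2, -1.8, 0.9, 2.6, -0.5]
print(empirical_bayes_shrink(xs))  # every estimate pulled toward zero
```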
00:08:05.180 | - So based on math or based on human expertise,
00:08:07.580 | what are good--
00:08:08.860 | - They're both going in.
00:08:09.700 | The Bayesian framework allows you
00:08:10.740 | to put a lot of human expertise in,
00:08:12.540 | but the math kind of guides you along that path,
00:08:16.660 | and then kind of reassures you at the end
00:08:17.860 | that you can put that stamp of approval on it.
00:08:19.340 | Under certain assumptions, this thing will work.
00:08:21.540 | So you asked the question, what's my favorite,
00:08:23.620 | what's the most surprising nice idea?
00:08:24.900 | So one that is more accessible
00:08:26.500 | is something called false discovery rate,
00:08:28.460 | which is: you're making not just one hypothesis test
00:08:33.260 | or one decision, you're making a whole bag of them.
00:08:35.900 | And in that bag of decisions,
00:08:38.180 | you look at the ones where you made a discovery,
00:08:39.980 | you announced that something interesting had happened.
00:08:42.620 | All right, that's gonna be some subset of your big bag.
00:08:45.860 | Among the ones where you made a discovery,
00:08:47.340 | which subset of those are bad,
00:08:49.540 | that is, false discoveries?
00:08:52.060 | You'd like the fraction of your false discoveries
00:08:53.980 | among your discoveries to be small.
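
Written out, a sketch of that criterion in standard notation: if R is the number of discoveries announced out of the whole bag of tests, and V of them are false, the false discovery rate is the expected fraction.

```latex
% Standard definition (the max with 1 handles the case of no discoveries):
\[
\mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right]
\]
```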
00:08:56.340 | That's a different criterion than accuracy,
00:08:58.680 | or precision, or recall, or sensitivity and specificity.
00:09:01.660 | It's a different quantity.
00:09:03.580 | Those latter ones, or almost all of them,
00:09:05.680 | have more of a frequentist flavor.
00:09:08.740 | They say, given that the null hypothesis is true,
00:09:12.220 | here's the accuracy I would get.
00:09:14.560 | Or given that the alternative is true,
00:09:15.940 | here's what I would get.
00:09:17.340 | So it's kind of going forward from the state of nature
00:09:19.860 | to the data.
00:09:21.180 | The Bayesian goes the other direction from the data
00:09:23.340 | back to the state of nature.
00:09:24.660 | And that's actually what false discovery rate is.
00:09:26.960 | It says, given you made a discovery,
00:09:29.420 | okay, that's conditioned on your data,
00:09:31.380 | what's the probability of the hypothesis?
00:09:33.820 | It's going the other direction.
00:09:35.380 | And so the classical frequentist looks at that and says,
00:09:38.380 | I can't know that, there are some priors needed in that.
00:09:41.300 | And the empirical Bayesian goes ahead and plows forward
00:09:44.700 | and starts writing down these formulas
00:09:46.420 | and realizes at some point,
00:09:48.180 | some of those things can actually be estimated
00:09:49.700 | in a reasonable way.
00:09:51.380 | And so it's a beautiful set of ideas.
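
As a concrete sketch of the Benjamini-Hochberg side of this (the p-values below are made up for illustration): given a whole bag of p-values, a step-up rule picks a data-dependent threshold so that the expected fraction of false discoveries among the discoveries is at most a target level q.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of the
    hypotheses rejected (the "discoveries") at target FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # ranks, smallest p first
    # Find the largest rank k with p_(k) <= (k/m) * q,
    # then reject every hypothesis up to that rank.
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_star = rank
    return sorted(order[:k_star])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
print(benjamini_hochberg(pvals, q=0.05))  # -> [0, 1]
```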
00:09:53.040 | So this kind of line of argument has come out,
00:09:55.460 | it's certainly not mine,
00:09:56.720 | but it sort of came out from Robbins around 1960.
00:10:01.100 | Brad Efron has written beautifully about this
00:10:03.820 | in various papers and books.
00:10:05.100 | And the FDR is, you know, Benjamini in Israel;
00:10:10.100 | John Storey did this Bayesian interpretation, and so on.
00:10:13.560 | So I've just absorbed these things over the years
00:10:15.820 | and find it a very healthy way to think about statistics.