
What is Statistics? (Michael I. Jordan) | AI Podcast Clips


Chapters

0:3 What Is Statistics
1:38 Inverse Probability
4:44 Decision Theory
5:25 Bayesian Frequentist
7:26 Empirical Bayes
8:26 False Discovery Rate

Transcript

- An absurd question, but what is statistics? - So here it's a little bit, it's somewhere between math and science and technology. It's somewhere in that convex hull. So it's a set of principles that allow you to make inferences that have got some reason to be believed. And also principles that allow you to make decisions where you can have some reason to believe you're not gonna make errors.

So all that requires some assumptions about what do you mean by an error? What do you mean by the probabilities? And, but, you know, after you start making some assumptions, you're led to conclusions that yes, I can guarantee that, you know, if you do this in this way, your probability of making an error will be small.

Your probability of continuing to make errors over time will be small. And the probability you found something that's real will be high. - So decision-making is a big part. - Decision-making is a big part, yeah. So statistics, you know, the short history was that it kind of goes back as a formal discipline, you know, 250 years or so.

It was called inverse probability because around that era, probability was developed sort of especially to explain gambling situations. - Of course. - And-- - Interesting. - So you would say, well, given the state of nature is this, there's a certain roulette board that has a certain mechanism in it.

What kind of outcomes do I expect to see? And especially if I do things for long amounts of time, what outcomes will I see? And the physicists started to pay attention to this. And then people said, well, let's turn the problem around. What if I saw certain outcomes? Could I infer what the underlying mechanism was?

That's an inverse problem. And in fact, for quite a while, statistics was called inverse probability. That was the name of the field. And I believe that it was Laplace, who was working in Napoleon's government and needed to do a census of France, to learn about the people there.

So he went and gathered data and he analyzed that data to determine policy and said, well, let's call this field that does this kind of thing statistics 'cause the word state is in there. In French, that's état. But it's the study of data for the state. So anyway, that caught on and it's been called statistics ever since.

But by the time it got formalized, it was sort of in the '30s. And around that time, there was game theory and decision theory developed nearby. People in that era didn't think of themselves as either computer science or statistics or control or econ. They were all of the above. And so von Neumann is developing game theory, but also thinking of that as decision theory.

Wald is an econometrician developing decision theory and then turning that into statistics. And so it's all about, here's not just data and you analyze it. Here's a loss function. Here's what you care about. Here's the question you're trying to ask. Here is a probability model and here is the risk you will face if you make certain decisions.

And to this day, in most advanced statistical curricula, you teach decision theory as the starting point. And then it branches out into the two branches of Bayesian and Frequentist. But it's all about decisions. - In statistics, what is the most beautiful, mysterious, maybe surprising idea that you've come across?

- Yeah, good question. I mean, there's a bunch of surprising ones. There's something that's way too technical for this thing, but something called James-Stein estimation, which is kind of surprising and really takes time to wrap your head around. - Can you try to maybe-- - Nah, I think I don't even wanna try.

Let me just say a colleague, Stephen Stigler at the University of Chicago, wrote a really beautiful paper on James-Stein estimation, which helps to, it's viewed as a paradox. It kind of defeats the mind's attempts to understand it, but you can, and Steve has a nice perspective on that.

So one of the troubles with statistics is that it's like in physics, or in quantum physics, you have multiple interpretations. There's a wave and particle duality in physics. And you get used to that over time, but it still kind of haunts you that you don't really quite understand the relationship.

The electron's a wave and the electron's a particle. Well, the same thing happens here. There's Bayesian ways of thinking and Frequentist, and they are different. They sometimes become sort of the same in practice, but they are philosophically different. And then in some practice, they are not the same at all.

They give you rather different answers. And so it is very much like wave and particle duality, and that is something you have to kind of get used to in the field. - Can you define Bayesian and Frequentist? - Yeah, in decision theory, you can make... I have a video that people could see.

It's called "Are You a Bayesian or a Frequentist?" and it kind of tries to make it really clear. It comes from decision theory. So, in decision theory, you're talking about loss functions, which are a function of data X and parameter theta. So, they're a function of two arguments. Neither one of those arguments is known.

You don't know the data a priori, it's random, and the parameter's unknown. So you have this function of two things you don't know, and you're trying to say, I want that function to be small. I want small loss. Well, what are you gonna do? So you sort of say, well, I'm gonna average over these quantities or maximize over them or something, so that I turn that uncertainty into something certain.

So you could look at the first argument and average over it, or you could look at the second argument, average over it. That's Bayesian and Frequentist. So the Frequentist says, I'm gonna look at the X, the data, and I'm gonna take that as random, and I'm gonna average over the distribution.

So I take the expectation of loss under X. Theta's held fixed, all right? That's called the risk. And so it's looking at all the datasets you could get, and saying how well will a certain procedure do under all those datasets? That's called a Frequentist guarantee. So I think it is very appropriate when you're building a piece of software, and you're shipping it out there, and people are using it on all kinds of datasets.

You wanna have a stamp, a guarantee on it, that as people run it on many, many datasets that you never even thought about, that 95% of the time it will do the right thing. Perfectly reasonable. The Bayesian perspective says, well, no, I'm gonna look at the other argument of the loss function, the theta part, okay?

That's unknown, and I'm uncertain about it. So I could have my own personal probability for what it is. How many tall people are there out there? I'm trying to infer the average height of the population. Well, I have an idea of roughly what the height is. So I'm gonna average over the theta.

So now that loss function, again, has one argument gone. Now it's a function of X. And that's what a Bayesian does, is they say, well, let's just focus on the particular X we got, the dataset we got. We condition on that. Conditioned on the X, I say something about my loss.

That's a Bayesian approach to things. And the Bayesian will argue that it's not relevant to look at all the other datasets you could have gotten and average over them, the frequentist approach. It's really only the dataset you got, all right? And I do agree with that, especially in situations where you're working with a scientist, you can learn a lot about the domain, and you really only focus on certain kinds of data, and you've gathered your data, and you make inferences.
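[Editor's sketch, not part of the conversation.] The two ways of averaging the loss described above can be illustrated with a small simulation. Everything here is an assumption made for illustration: normally distributed data with a known standard deviation, squared-error loss, a normal prior on theta, and the function names themselves. It is a rough sketch of the idea, not a definitive treatment.

```python
import random


def squared_loss(estimate, theta):
    """Loss as a function of an estimate (computed from data X) and the parameter theta."""
    return (estimate - theta) ** 2


def sample_mean(x):
    return sum(x) / len(x)


def frequentist_risk(estimator, theta, n, sigma=1.0, n_datasets=20000, seed=0):
    """Hold theta fixed and average the loss over many simulated datasets X."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_datasets):
        x = [rng.gauss(theta, sigma) for _ in range(n)]
        total += squared_loss(estimator(x), theta)
    return total / n_datasets


def normal_posterior(x, prior_mean, prior_sd, sigma=1.0):
    """Conjugate normal-normal update: posterior mean and sd of theta given x."""
    precision = 1.0 / prior_sd**2 + len(x) / sigma**2
    mean = (prior_mean / prior_sd**2 + sum(x) / sigma**2) / precision
    return mean, precision ** -0.5


def posterior_expected_loss(estimate, x, prior_mean, prior_sd,
                            sigma=1.0, n_draws=20000, seed=0):
    """Hold the observed x fixed and average the loss over the posterior for theta."""
    post_mean, post_sd = normal_posterior(x, prior_mean, prior_sd, sigma)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        total += squared_loss(estimate, rng.gauss(post_mean, post_sd))
    return total / n_draws


if __name__ == "__main__":
    # Frequentist: theta is fixed (a hypothetical true average height in cm).
    print("risk of sample mean:", frequentist_risk(sample_mean, theta=170.0, n=10, sigma=7.0))

    # Bayesian: condition on the one dataset we actually observed (made-up values).
    heights = [168.0, 175.5, 171.2, 166.9, 173.4]
    post_mean, _ = normal_posterior(heights, prior_mean=170.0, prior_sd=10.0, sigma=7.0)
    print("posterior expected loss:",
          posterior_expected_loss(post_mean, heights, prior_mean=170.0, prior_sd=10.0, sigma=7.0))
```

The first function never looks at a particular dataset; the second never looks at datasets other than the one in hand. That is the distinction being drawn in the conversation.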

I don't agree with it though, in the sense that there are needs for frequentist guarantees. You're writing software, people are using it out there, you wanna say something. So these two things have got to fight each other a little bit, but they have to blend. So long story short, there's a set of ideas that are right in the middle, that are called empirical Bayes.

And empirical Bayes sort of starts with the Bayesian framework. It's kind of arguably philosophically more reasonable and kosher. You write down a bunch of the math that kind of flows from that, and then realize there's a bunch of things you don't know, because it's the real world, and you don't know everything, so you're uncertain about certain quantities.

At that point, ask, is there a reasonable way to plug in an estimate for those things? And in some cases, there's quite a reasonable thing to do, to plug in. There's a natural thing you can observe in the world that you can plug in, and then do a little bit more mathematics and assure yourself it's really good.
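[Editor's sketch, not part of the conversation.] One textbook instance of this plug-in idea is the normal means problem. Assume, purely for illustration, that each observation is x_i ~ N(theta_i, 1) and each theta_i ~ N(0, tau^2). The Bayes estimate of theta_i is (tau^2 / (tau^2 + 1)) * x_i, but tau^2 is unknown; since the marginal variance of x_i is tau^2 + 1, it can be estimated from the data and plugged in. The model, the function names, and the simulated data below are all assumptions. This kind of shrinkage is closely related to the James-Stein estimator mentioned earlier.

```python
import random


def empirical_bayes_shrinkage(x, noise_var=1.0):
    """Plug-in posterior means for x_i ~ N(theta_i, noise_var), theta_i ~ N(0, tau2).

    tau2 is not known, so it is estimated from the observed spread of x:
    marginally Var(x_i) = tau2 + noise_var.
    """
    marginal_var = sum(v * v for v in x) / len(x)
    tau2_hat = max(0.0, marginal_var - noise_var)  # variance cannot be negative
    shrink = tau2_hat / (tau2_hat + noise_var)
    return [shrink * v for v in x], tau2_hat


def mean_squared_error(estimates, truth):
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)


def simulate(n=2000, tau=2.0, seed=0):
    """Draw hypothetical true parameters and one noisy observation of each."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, tau) for _ in range(n)]
    x = [rng.gauss(t, 1.0) for t in theta]
    return theta, x


if __name__ == "__main__":
    theta, x = simulate()
    shrunk, tau2_hat = empirical_bayes_shrinkage(x)
    print("estimated tau^2:", tau2_hat)
    print("MSE of raw observations:", mean_squared_error(x, theta))
    print("MSE of empirical Bayes estimates:", mean_squared_error(shrunk, theta))
```

The "little bit more mathematics" in this setting is the check that the plug-in estimate behaves well, for example that it has lower average squared error than using the raw observations.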

- So based on math or based on human expertise, what are good-- - They're both going in. The Bayesian framework allows you to put a lot of human expertise in, but the math kind of guides you along that path, and then kind of reassures you at the end, you could put that stamp of approval.

Under certain assumptions, this thing will work. So you asked the question, what's my favorite, what's the most surprising nice idea? So one that is more accessible is something called false discovery rate, which is you're making not just one hypothesis test, or making one decision, you're making a whole bag of them.

And in that bag of decisions, you look at the ones where you made a discovery, you announced that something interesting had happened. All right, that's gonna be some subset of your big bag. Among the ones where you made a discovery, which subset of those are bad, that are false, false discoveries?

You'd like the fraction of your false discoveries among your discoveries to be small. That's a different criterion than accuracy, or precision, or recall, or sensitivity and specificity. It's a different quantity. Those latter ones, or almost all of them, have more of a frequentist flavor. They say, given the truth is that the null hypothesis is true, here's what accuracy I would get.

Or given that the alternative is true, here's what I would get. So it's kind of going forward from the state of nature to the data. The Bayesian goes the other direction from the data back to the state of nature. And that's actually what false discovery rate is. It says, given you made a discovery, okay, that's conditioned on your data, what's the probability of the hypothesis?

It's going the other direction. And so the classical frequentist looks at that and says, I can't know that, there's some priors needed in that. And the empirical Bayesian goes ahead and plows forward and starts writing down these formulas and realizes at some point, some of those things can actually be estimated in a reasonable way.

And so it's a beautiful set of ideas. So this kind of line of argument has come out, it's certainly not mine, but it sort of came out from Robbins around 1960. Brad Efron has written beautifully about this in various papers and books. And the FDR is, you know, Benjamini in Israel, John Storey did this Bayesian interpretation and so on.
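[Editor's sketch, not part of the conversation.] The procedure itself is not described in the clip. One standard way to control the false discovery rate is the Benjamini-Hochberg step-up procedure: sort the p-values, find the largest rank k with p_(k) <= q * k / m, and declare the k smallest p-values discoveries. The function names and the p-values and truth labels below are made up for illustration; in real data the truth labels are of course unknown.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is declared a discovery."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_max = rank  # keep the largest rank that passes its threshold
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


def false_discovery_proportion(rejected, is_null):
    """Fraction of discoveries that are false, given (normally unknown) truth labels."""
    discoveries = sum(rejected)
    if discoveries == 0:
        return 0.0
    false_discoveries = sum(1 for r, null in zip(rejected, is_null) if r and null)
    return false_discoveries / discoveries


if __name__ == "__main__":
    p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    rejected = benjamini_hochberg(p_values, q=0.05)
    print("discoveries:", [p for p, r in zip(p_values, rejected) if r])
```

Note that the quantity in the second function is conditioned on having made a discovery, which is the "other direction" described above, as opposed to error rates that condition on the null or the alternative being true.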

So I've just absorbed these things over the years and find it a very healthy way to think about statistics.