Complete Statistical Theory of Learning (Vladimir Vapnik) | MIT Deep Learning Series
Chapters
0:00 Introduction
0:46 Overview: Complete Statistical Theory of Learning
3:47 Part 1: VC Theory of Generalization
11:04 Part 2: Target Functional for Minimization
27:13 Part 3: Selection of Admissible Set of Functions
37:26 Part 4: Complete Solution in Reproducing Kernel Hilbert Space (RKHS)
53:16 Part 5: LUSI Approach in Neural Networks
59:28 Part 6: Examples of Predicates
70:39 Conclusion
76:10 Q&A: Overfitting
77:18 Q&A: Language
00:00:00.000 |
Today we're happy and honored to have Vladimir Vapnik with us, 00:00:04.160 |
co-inventor of support vector machines, support vector clustering, the VC theory of 00:00:08.480 |
statistical learning, and author of Statistical Learning Theory. 00:00:12.560 |
He's one of the greatest and most impactful statisticians and computer 00:00:17.120 |
scientists of our time. Plus he grew up in the Soviet Union, 00:00:22.080 |
eventually heading the computer science department at the Institute of Control Sciences 00:00:26.160 |
in Moscow. So he will give today's lecture in 00:00:29.440 |
Russian and I will translate. Just kidding, right? 00:00:35.520 |
It's an honor and a pleasure to have Vladimir with us today, so please give him a warm welcome. 00:00:49.120 |
About 50 years ago, Professor Chervonenkis and I started statistical 00:00:57.120 |
learning theory. The problem was to answer the question of 00:01:02.000 |
when, if you do well on the training data, 00:01:06.400 |
if you have a small number of errors on the training data, 00:01:10.320 |
you will also do well on the test data, that is, you will minimize the expectation of error. 00:01:19.920 |
So we solved this problem. This theory is now presented in almost all 00:01:30.800 |
languages, and the presentations mostly follow this line: the 00:01:38.160 |
law of large numbers is not enough; we need a uniform law of large numbers, 00:01:43.200 |
uniform convergence, and so on. But because we started this discussion 00:01:49.840 |
with the empirical error, how well you do on the 00:01:54.480 |
training data, and the bound shows that the better you do on the training data, 00:01:59.360 |
the better you will do on the test data, people decided 00:02:03.120 |
that the only way is to take training data and do something 00:02:08.800 |
with this training data to minimize the number of errors. 00:02:13.360 |
And all algorithms were constructed based on this principle. 00:02:19.680 |
About five years ago, I found that there exists another principle. 00:02:28.960 |
The first principle is a brute-force principle: 00:02:32.320 |
give me more data and I will give you a better answer. 00:02:35.360 |
The second principle is an intelligent principle. 00:02:38.960 |
I will talk today about this. I will start with statistical 00:02:45.200 |
learning theory, and then I will introduce this new theory. 00:02:50.000 |
Because there are only two ways for generalization, one with data 00:02:56.160 |
and another which I will show you, I call it complete statistical learning theory: there is no 00:03:09.040 |
third way for generalization, so you should use both of them. That is why it is a 00:03:16.080 |
complete theory. But it is not so bad, because 00:03:21.200 |
you will see that learning theory moves in a different direction, 00:03:27.920 |
in the direction of intelligence, towards understanding what intelligence is. 00:03:36.560 |
Turing said that you should imitate an intelligent person. 00:03:40.960 |
But now the question is: what is intelligence? We will discuss this. So let me start. 00:03:49.360 |
The first part is the theory of generalization. 00:04:03.280 |
The problem is to minimize a functional, and this is a pretty general functional: 00:04:07.360 |
instead of y, f(x), and a loss function L, I can consider a general loss. 00:04:18.960 |
It is what we are doing in regression and pattern recognition and 00:04:24.560 |
all this stuff, but this is a more general setting. What is 00:04:28.240 |
more important is that when we consider minimization of a 00:04:32.720 |
functional, we should say over which set of functions. 00:04:37.040 |
In a given set of functions, we have to minimize the functional 00:04:41.920 |
when the probability measure is unknown, but we are given 00:04:45.840 |
i.i.d. data pairs. That is the exact setting of both the pattern recognition 00:04:54.160 |
and the regression estimation problem. For pattern recognition the set of functions is 00:05:02.640 |
indicator functions; for regression it is 00:05:09.520 |
continuous functions, but the statement is the same. 00:05:15.840 |
The answer is very simple. We can minimize 00:05:20.240 |
this functional using data if and only if the VC dimension 00:05:27.280 |
h of the set of functions is finite. Everything depends 00:05:31.440 |
on the set of functions over which you have to minimize the functional, 00:05:35.680 |
and you cannot avoid that. We call it capacity, 00:05:40.560 |
but maybe it is better called the diversity of the set of functions. 00:05:45.280 |
And the measure of this capacity, or diversity, is the VC dimension. 00:05:51.360 |
What is the VC dimension? I will give you the definition. 00:05:56.000 |
First, for a set of indicator functions: θ is the step function. 00:06:00.160 |
You consider a continuous function f and put θ in front of it: where the 00:06:10.640 |
function is positive you say one, and where it is not positive you say zero. 00:06:17.040 |
That is the indicator. The VC dimension of a set of indicator functions 00:06:23.200 |
equals h if h is the maximal number of vectors 00:06:31.040 |
that can be shattered, that is, separated into all possible 2^h subsets, 00:06:35.200 |
using indicator functions from this set. 00:06:50.480 |
So there exist h vectors which you can shatter in all possible ways, 00:06:54.480 |
but you cannot shatter any h plus one vectors. Then the VC dimension 00:07:00.480 |
of this set of functions is h. This is a purely combinatorial definition. 00:07:07.520 |
And if you can shatter vectors for any number l, then we say that the VC dimension is infinite. 00:07:15.520 |
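To make the shattering definition concrete, here is a small brute-force sketch (my own illustration, not from the lecture): indicator functions of linear functions in the plane can shatter three points in general position but not four, consistent with their VC dimension of three.

```python
# Brute-force check of shattering, illustrating the combinatorial definition of VC dimension.
# Hypothetical example: indicators of linear functions in the plane, f(x) = step(w.x + b).
import numpy as np

def can_shatter(points, n_trials=20000, seed=0):
    """Return True if random linear indicators realize all 2^h labelings of `points`."""
    rng = np.random.default_rng(seed)
    h = len(points)
    X = np.asarray(points, dtype=float)
    realized = set()
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        b = rng.normal()
        labels = tuple((X @ w + b > 0).astype(int))
        realized.add(labels)
        if len(realized) == 2 ** h:
            return True
    return False

three_points = [(0, 0), (1, 0), (0, 1)]          # shatterable by lines
four_points = [(0, 0), (1, 1), (1, 0), (0, 1)]   # the XOR labeling is not realizable
print(can_shatter(three_points))   # True  -> consistent with VC dimension 3
print(can_shatter(four_points))    # False (at most 14 of the 16 labelings are realizable)
```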
Now I will give you two theorems which we will use; 00:07:19.680 |
the first is the main theorem of the theory. 00:07:24.880 |
If a set of functions has VC dimension h, then with probability 1 − η, 00:07:31.360 |
for all functions in the set, the bound holds true. 00:07:38.080 |
Because this bound holds true for all functions, 00:07:41.200 |
and you would like to have a minimal estimate, a minimal right-hand 00:07:49.040 |
side, you will choose the function which minimizes the empirical loss. 00:07:56.240 |
The second theorem: if you have a set of linear functions, 00:08:01.360 |
indicators built from the set of linear functions, and it so happens 00:08:05.280 |
that your vectors x lie inside a sphere of radius one 00:08:10.560 |
and the norm of w is bounded by C, then the VC dimension is bounded by the minimum of two 00:08:18.400 |
values, one determined by C and the other being n, the dimensionality of the vector, plus one. 00:08:24.960 |
That means that the VC dimension can be smaller than the dimensionality of the space, 00:08:33.200 |
and you can somehow control the VC dimension; I will show you how to do it. 00:08:42.560 |
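For reference, a commonly quoted simplified form of the bound being described (as given in Vapnik's books; the exact constants on the slide may differ):

```latex
% With probability at least 1 - \eta, simultaneously for all functions f in a set of
% indicator functions with VC dimension h, given l i.i.d. examples:
R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\;
\sqrt{\frac{h\left(\ln\frac{2l}{h}+1\right)-\ln\frac{\eta}{4}}{l}},
\qquad
R(f)=\int L\big(y,f(x)\big)\,dP(x,y),\quad
R_{\mathrm{emp}}(f)=\frac{1}{l}\sum_{i=1}^{l} L\big(y_i,f(x_i)\big).
```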
But what is the general way to adjust the VC dimension when 00:08:47.760 |
searching for a function? You have a set of functions; 00:08:52.320 |
maybe the set of functions has infinite VC dimension. 00:08:55.920 |
Then you make a structure on the set of functions. 00:08:59.920 |
You choose a subset of functions, a small subset, 00:09:04.080 |
with VC dimension h1, then another subset which includes the small one, 00:09:10.000 |
with VC dimension h2, and so on: we have nested subsets of functions 00:09:15.840 |
with nested VC dimensions. Then, when you are minimizing the 00:09:20.320 |
functional, you are doing two things: you choose an appropriate subset, 00:09:27.040 |
and then in this subset you pick the function which minimizes the empirical loss. 00:09:37.040 |
You can see that the epsilon in the bound depends on the VC dimension, 00:09:47.680 |
on the ratio of the VC dimension to l, and also on the empirical loss 00:09:52.960 |
which you achieve. So, as soon as you make a structure, 00:09:59.680 |
you can do whatever you want, even if the initial set of functions has infinite VC dimension. 00:10:11.120 |
This is more or less all that the main result of VC theory contains. 00:10:19.120 |
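A toy sketch of this structure-based minimization (structural risk minimization), with invented data, polynomial subsets as the nested structure, and the square-root capacity term from the bound above:

```python
# Structural risk minimization sketch (illustrative, not the lecture's algorithm):
# nested subsets = polynomials of increasing degree; for each subset, add the
# square-root capacity term from the VC bound to the empirical loss.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = (np.sin(3 * x) + 0.3 * rng.normal(size=x.size) > 0).astype(float)  # toy labels

def emp_risk(deg):
    coeffs = np.polyfit(x, y, deg)                   # least-squares fit inside the subset
    pred = (np.polyval(coeffs, x) > 0.5).astype(float)
    return np.mean(pred != y)

def capacity(h, l, eta=0.05):
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

l = x.size
for deg in range(1, 8):
    h = deg + 1                                      # VC dim of degree-`deg` polynomial indicators on a line
    print(deg, round(emp_risk(deg), 3), round(emp_risk(deg) + capacity(h, l), 3))
# Choose the degree minimizing the last column (empirical risk + capacity term).
```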
But VC theory does not answer four very important questions. 00:10:30.480 |
I told that any function. Have to select admissible set of function, 00:10:35.920 |
f of x. I told you that given set of function, but maybe this is a stupid 00:10:42.960 |
set of function. Have to construct good set of function. 00:10:50.080 |
Have to construct structure on the admissible set of function. 00:10:55.440 |
And then have to minimize functional to construct the structure. 00:11:00.160 |
In this talk, I will try to answer all these questions. 00:11:15.120 |
What is the setting of the pattern recognition problem? 00:11:19.360 |
In this talk I will consider just the pattern recognition 00:11:23.840 |
problem for two-class classification, but the generalization is straightforward. 00:11:31.920 |
So what is the setting of the pattern recognition problem? 00:11:37.840 |
There is a generator, nature, which generates randomly 00:11:42.160 |
and independently situations x, which come to an object. 00:11:48.000 |
And this object is a conditional probability of y 00:11:52.320 |
given x: it says what the probability is that y equals one given x and that y equals zero given x. 00:12:01.840 |
The object knows this conditional probability 00:12:05.440 |
and plays dice: it has the conditional probability function, it gets x 00:12:12.880 |
on the input, it plays dice and says y. That is the most general setting of the 00:12:19.200 |
learning problem; the deterministic case is a particular case of this. 00:12:26.640 |
The learning machine has a set of functions, and it can choose any of them after 00:12:35.520 |
observing l observations, (x1, y1), ..., (xl, yl), to pick 00:12:44.240 |
the function for classification. That means: given 00:12:52.800 |
observations generated by the probability p(x, y) = 00:12:58.160 |
p(y | x) p(x), and that is the whole scheme, 00:13:05.760 |
find the rule that minimizes the functional. In my case, when 00:13:10.640 |
y is zero or one and the indicator function is also zero or one, 00:13:17.280 |
the loss functional just counts how many errors I make: 00:13:25.920 |
the expectation of error. I would like to find the function which guarantees 00:13:32.320 |
the smallest expectation of error. But this is not very good. Why is it not 00:13:38.640 |
very good? Because this functional 00:13:44.560 |
is everywhere zero, except at some points where it is one 00:13:50.960 |
or minus one. So if I take the modulus, it is 00:13:58.000 |
zero everywhere and equal to one at some points. 00:14:05.360 |
So I cannot use a gradient in this case; I should do something smarter. 00:14:15.760 |
So I replace the modulus involving the indicator function with just (y − f(x)) squared, 00:14:26.480 |
instead of what I formulated before. It is not such a bad choice, because it so 00:14:33.840 |
happens that the minimum of this functional gives the 00:14:38.960 |
conditional probability function, the probability that y 00:14:42.640 |
equals one given x. And when we can find this conditional probability, we have 00:14:54.320 |
our decision rule: if 00:15:01.520 |
the conditional probability exceeds 0.5, say first class; if it is less than 0.5, 00:15:08.080 |
say second class. And this is the optimal solution. 00:15:18.400 |
Let us rewrite the first line. Inside the bracket I will subtract 00:15:26.160 |
the regression from the first term, and then add the regression back. 00:15:36.320 |
Then I take the square. One of the integrals does not depend 00:15:44.320 |
on the function which I am looking for, so I have to minimize my functional 00:15:51.200 |
over f, over the set of functions, as just the sum of the two last terms. 00:16:00.240 |
How did I get this? It is just the usual binomial expansion of two terms: 00:16:07.440 |
the square of the first, the square of the second, and twice their cross product. 00:16:20.960 |
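Written out, the decomposition being described (with P(1|x) denoting the conditional probability P(y = 1 | x), the regression) is:

```latex
\int \big(y-f(x)\big)^2\, dP(x,y)
= \int \big(y-P(1\,|\,x)\big)^2\, dP(x,y)
+ 2\int \big(y-P(1\,|\,x)\big)\big(P(1\,|\,x)-f(x)\big)\, dP(x,y)
+ \int \big(P(1\,|\,x)-f(x)\big)^2\, dP(x).
```

The first term does not depend on f, and the cross term vanishes in expectation because E[y | x] = P(1 | x); this is why the minimizer of the least-squares functional is the conditional probability.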
What we would like, though, is to find a function which is close to the conditional probability function, 00:16:26.320 |
that is, to minimize one integral, not the sum of two integrals. We can show that the second one is the obstacle: 00:16:38.240 |
to have a high rate of convergence, we need to find a way to minimize the 00:16:46.160 |
first integral, not the sum of these two. And that means that least squares is not the best method. 00:17:01.600 |
First of all, when y is zero or one, the probability that 00:17:11.600 |
y equals one given x is some real-valued function between zero and one. 00:17:26.080 |
And we know that the conditional probability of y 00:17:30.400 |
equal to one given x, multiplied by the density p(x), is the joint density p(y = 1, x); 00:17:42.400 |
this is always true. Now, if I multiply both sides by some function g(x − x*) 00:17:54.560 |
which belongs to L2 space, and take the integral, I will have an equation. 00:18:04.400 |
And I can say that the conditional probability is the solution of this equation. 00:18:16.240 |
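In symbols, the identity and the resulting integral equation are (g is any function from L2, and x* is a free parameter):

```latex
P(y=1\,|\,x)\,p(x) \;=\; p(y=1,\,x)
\;\;\Longrightarrow\;\;
\int g(x-x^{*})\,P(y=1\,|\,x)\,dP(x)
\;=\;
\int g(x-x^{*})\,dP(y=1,\,x)
\quad\text{for all } x^{*}.
```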
You see, I put in the conditional probability and obtained something like that. But I would 00:18:29.120 |
like to solve it when I don't know the probability measure but am given data. 00:18:35.680 |
Given observations generated according to p(x), p(y | x), 00:18:43.040 |
I would like to solve this equation. But solving an equation is an ill-posed 00:18:48.560 |
problem. Okay, let's do that. But before I do, let me show 00:19:04.560 |
how to replace the unknown probability measure with the 00:19:11.200 |
empirical measure. That is the most important part; 00:19:15.760 |
it is the main induction step in statistics. In statistics, we are given 00:19:24.000 |
data, and no matter how much data we see, it is not equivalent to 00:19:32.720 |
knowing the distribution. In classical statistics, people suggested approximating the cumulative distribution 00:19:39.200 |
function by the empirical cumulative distribution 00:19:46.000 |
function. In the 1930s, mathematicians tried to prove that one 00:19:55.920 |
can do that. And in 1933, Kolmogorov found an exact bound, 00:20:01.680 |
which is shown on the last line. 00:20:07.200 |
And then people proved that we have bounds like that. 00:20:12.560 |
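The kind of bound being referred to is of the Glivenko-Cantelli / Kolmogorov type; a standard finite-sample version (the Dvoretzky-Kiefer-Wolfowitz inequality, not necessarily the exact form on the slide) reads:

```latex
F_{l}(x)=\frac{1}{l}\sum_{i=1}^{l}\theta\!\left(x-x_{i}\right),
\qquad
P\Big\{\sup_{x}\,\big|F(x)-F_{l}(x)\big| > \varepsilon\Big\}
\;\le\; 2\,e^{-2 l \varepsilon^{2}} .
```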
Now, if we can replace the unknown measure with the empirical measure, 00:20:24.320 |
back to our problem, our constructive problem: what should we do? 00:20:29.360 |
Let us replace, in the functional which we would like 00:20:33.360 |
to minimize, the real measure with the empirical measure. We will have the 00:20:42.080 |
empirical least-squares functional, which we have to minimize to solve 00:20:52.960 |
our problem, to find our conditional probability or decision rule 00:20:58.560 |
or whatever we want. But let me consider the new constructive 00:21:03.520 |
setting, where we also replace the unknown probability measure 00:21:09.200 |
with the empirical probability measure obtained on the training data. You see that 00:21:18.640 |
to find the conditional probability, we have to solve this equation. 00:21:30.240 |
The right-hand side is known, because our function g is known; 00:21:37.920 |
our goal is to find, in the set of functions, the function that solves it. 00:21:43.840 |
In classical statistics, there was one algorithm, called the Nadaraya-Watson 00:21:50.480 |
estimator, which shows how to estimate the conditional probability 00:21:54.960 |
or the regression function. They just somehow define it: 00:22:00.640 |
it is a function which has a numerator and a denominator. 00:22:15.520 |
So this is an estimate of the regression, a very general way of estimating the regression 00:22:21.920 |
or the conditional probability. And there is this whole business of how, 00:22:26.800 |
if the kernel is Gaussian, to find the best value of the variance to approximate the 00:22:33.840 |
conditional probability well. They spent a lot of time on this subject. Now look at 00:22:43.760 |
the line in the middle: the Nadaraya-Watson estimator comes from a 00:22:52.080 |
corrupted equation, not from the equation which we derived. 00:22:56.560 |
Here, in the middle, it is f(x), not f(x_i) 00:23:04.000 |
as in the last, correct, line. So you can 00:23:07.520 |
pull f(x) out of the sum and you will get this estimator. 00:23:12.080 |
So actually the classical Nadaraya-Watson estimator 00:23:16.720 |
is the solution of a corrupted version of the equation which we obtained. 00:23:24.240 |
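For concreteness, a minimal sketch of the classical Nadaraya-Watson estimator with a Gaussian kernel (the bandwidth and data are invented for illustration):

```python
# Nadaraya-Watson kernel regression: f(x*) = sum_i K(x*, x_i) y_i / sum_i K(x*, x_i).
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.3):
    """Estimate E[y | x] (the conditional probability when y is 0/1) at x_query."""
    d2 = (x_query[:, None] - x_train[None, :]) ** 2   # pairwise squared distances
    K = np.exp(-d2 / (2 * bandwidth ** 2))            # Gaussian kernel weights
    return (K @ y_train) / K.sum(axis=1)              # weighted average of labels

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
p = 1 / (1 + np.exp(-3 * x))                          # true conditional probability
y = (rng.uniform(size=x.size) < p).astype(float)      # 0/1 labels drawn from it
xq = np.linspace(-2, 2, 5)
print(np.round(nadaraya_watson(x, y, xq), 2))         # should roughly track p at xq
```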
But what does it mean to solve the equation? To solve the equation 00:23:28.960 |
means I take the difference between the left-hand side and the right-hand side, 00:23:35.600 |
define the area where I would like my function to operate, 00:23:42.800 |
take the square, integrate over some probability measure, 00:23:49.920 |
and minimize this functional. So let us do that. If you do 00:23:56.320 |
simple algebra, it is very simple and you can check it, 00:24:02.080 |
you will have a functional which is a sum of terms (y_i − f(x_i)) 00:24:09.840 |
(y_j − f(x_j)) multiplied by some coefficients. 00:24:22.960 |
We can estimate these coefficients: each coefficient is 00:24:29.680 |
the integral of g(x_i − x) g(x_j − x) over the measure, and together they form a matrix. 00:24:38.400 |
If you take the exact formula known from Nadaraya-Watson, we know this matrix. 00:24:53.440 |
If you replace this integral with the empirical integral, 00:24:59.120 |
you will have an estimate of the V matrix. If you use 00:25:10.880 |
a measure mu(x) that is uniformly distributed over an interval 00:25:24.000 |
and a Gaussian g, you will have this V matrix in closed form. So the V matrix is computable. 00:25:36.880 |
Now let me switch to vector notation. I will call Y the vector of the elements 00:25:49.520 |
y_1, ..., y_l: the vector of all the y's. 00:25:57.360 |
I will create F(f), capital F of f, which is also an l-dimensional vector: 00:26:05.040 |
I pick a function f, and the first element is the value of this function 00:26:11.360 |
at the point x_1, and the last element is the value of this function 00:26:15.280 |
at the point x_l. So this is F. And I have the V matrix. 00:26:22.400 |
Then I can rewrite this functional in 00:26:26.000 |
matrix form as (Y − F(f))ᵀ 00:26:32.800 |
V (Y − F(f)). If I write the 00:26:39.840 |
least squares method in this notation, I will have (Y − F(f))ᵀ 00:26:46.240 |
(Y − F(f)), with the identity matrix here instead of V. 00:26:53.600 |
So I got some improvement over the least squares method, 00:27:01.280 |
and I hope that it will give me a better rate of convergence. 00:27:13.440 |
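A toy numerical sketch of the comparison, under the assumption that V_ij ≈ ∫ g(x_i − t) g(x_j − t) dμ(t) with a Gaussian g and a uniform measure μ (estimated here by a crude numerical integral); the basis functions and data are invented:

```python
# Ordinary least squares (V = I) versus V-matrix least squares,
# minimizing (Y - Phi a)^T V (Y - Phi a) over a linear-in-parameters model Phi a.
import numpy as np

rng = np.random.default_rng(2)
l = 100
x = rng.uniform(0, 1, l)
p = 0.5 + 0.4 * np.sin(2 * np.pi * x)                      # true conditional probability
y = (rng.uniform(size=l) < p).astype(float)

def v_matrix(x, width=0.1, grid=np.linspace(0, 1, 200)):
    # V_ij ~ integral over a uniform measure of g(x_i - t) g(x_j - t), Gaussian g.
    G = np.exp(-(x[:, None] - grid[None, :]) ** 2 / (2 * width ** 2))
    return (G @ G.T) / grid.size

Phi = np.column_stack([np.ones(l), x, x ** 2, np.sin(2 * np.pi * x)])  # toy basis
V = v_matrix(x)
a_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)            # V = identity
a_v = np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V @ y)      # V-matrix version
print(np.round(a_ols, 2), np.round(a_v, 2))
```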
Let's see. But this is not the major point, because, okay, I improve the rate of 00:27:19.600 |
convergence, but it is still essentially the least squares method. 00:27:23.280 |
Now the most important part: selection of the admissible set of 00:27:28.160 |
functions. What does it mean? When you are constructing a neural network, 00:27:37.200 |
you say that you are designing a smart structure. 00:27:40.720 |
What does that mean? It means that you are constructing a 00:27:45.200 |
smart admissible set of functions. You know something, you are smart guys: 00:27:52.160 |
you just construct it, and then you minimize over this set of functions. 00:27:58.960 |
But let me consider from a mathematical perspective 00:28:04.240 |
what this is. If you consider a Hilbert space, there are two 00:28:15.760 |
ways of convergence. Strong convergence is convergence of functions; 00:28:22.240 |
it is the first line: my sequence of functions f_l 00:28:27.680 |
converges to f_0 if this integral of the squared difference converges to zero as l 00:28:34.640 |
goes to infinity. But there also exists weak convergence: 00:28:45.920 |
f_l converges to f_0 if this inner product converges to that inner 00:28:52.640 |
product for all functions phi from the Hilbert space. 00:28:59.440 |
You can think of this inner product as describing a 00:29:03.840 |
property of the function; if all such properties coincide, the functions coincide. 00:29:15.600 |
It is easy to show that if you have strong convergence, 00:29:21.920 |
you also have weak convergence; it follows from the 00:29:26.080 |
Cauchy-Schwarz inequality. 00:29:32.000 |
But one can also prove that if you have weak convergence 00:29:38.480 |
and your set of functions belongs to a compact set, you also have strong convergence. 00:29:44.240 |
So in some sense the two kinds of convergence are equivalent. 00:29:50.000 |
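In symbols, the two modes of convergence being contrasted are:

```latex
\textbf{Strong:}\quad
\lim_{l\to\infty}\int \big(f_{l}(x)-f_{0}(x)\big)^{2}\,d\mu(x)=0;
\qquad
\textbf{Weak:}\quad
\lim_{l\to\infty}\int f_{l}(x)\,\phi(x)\,d\mu(x)
=\int f_{0}(x)\,\phi(x)\,d\mu(x)
\;\;\forall\,\phi\in H.
```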
In our treatment of machine learning, we consider 00:29:56.080 |
strong convergence everywhere, 100%. But what about weak convergence? 00:30:13.600 |
For the pattern recognition case, in the first line I am just 00:30:17.600 |
writing the definition of weak convergence, which equals the second line. 00:30:22.560 |
That is, using the same equation as before, the integral of phi against the conditional probability over dP(x) 00:30:30.560 |
equals the integral of phi over dP(y = 1, x), for all functions phi. 00:30:35.760 |
So it must converge for all functions phi. 00:30:48.800 |
If it holds for all functions, I will have exactly one function, which is the desired one. 00:30:54.080 |
But that is not realistic. Let us do the following: 00:30:58.160 |
let us select from the Hilbert space m functions phi. 00:31:05.600 |
We will talk a lot about how to select these functions. 00:31:14.560 |
We require the equality not for all functions, but just for these m functions. 00:31:22.960 |
This defines the set of functions which satisfies these equalities. 00:31:33.520 |
The desired function satisfies this equality for any phi, any phi, because of weak convergence. 00:31:53.200 |
And if we replace the cumulative distribution function by its empirical estimate, 00:31:59.600 |
we will have, instead of an integral property like here, an empirical equality. 00:32:09.120 |
Again, let me use the same notation, the matrix scheme. 00:32:16.320 |
I will use the vector Y and the function F(f), capital F of f, 00:32:22.160 |
which is the vector of f(x_1), ..., f(x_l). 00:32:42.320 |
Because phi is a function, I can consider the vector of its values 00:32:59.680 |
at the training points. I would like my admissible set of functions 00:33:32.760 |
to satisfy the duck test: if it looks like a duck, swims like a duck, 00:33:36.760 |
and quacks like a duck, then it probably is a duck. 00:34:01.920 |
What this means is that we have statistical invariants, predicates 00:34:06.600 |
which explain what "looks", "swims", and "quacks" mean, 00:34:23.200 |
and then your admissible functions will be the functions which keep these invariants. 00:34:35.960 |
The concept of a predicate is very different from a feature. 00:35:04.520 |
With every invariant you add, the VC dimension of the admissible set of functions decreases, because the set of candidate decisions becomes less and less diverse. 00:35:11.680 |
So what is the exact setting of the complete learning problem? 00:35:24.480 |
It is minimization of the least-squares functional, subject to this constraint, 00:35:28.080 |
and this constraint is what you would like to see: the invariants. 00:35:40.160 |
But existing classical methods of pattern recognition do not use such conditional minimization. 00:36:14.680 |
The approximation I use is an unconditional optimization: 00:36:21.440 |
instead of minimizing subject to this constraint, I will do the following: 00:36:52.700 |
I combine the two terms with a weight that says how important the data are and how important it is for me to satisfy the constraint. 00:37:11.900 |
And here P is just the covariance matrix of your predicates. 00:37:58.120 |
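One way to write down what is being described, following the structure of Vapnik's LUSI formulation (the l × m matrix Φ of predicate values φ_k(x_i), the weight τ, and the normalization of P are my assumptions about the slide's notation):

```latex
\text{Exact: }\;
\min_{f}\; \big(Y-F(f)\big)^{\!\top} V \big(Y-F(f)\big)
\quad\text{s.t.}\quad
\Phi^{\top} F(f) \;=\; \Phi^{\top} Y ;
\qquad
\text{Approximate: }\;
\min_{f}\; \big(Y-F(f)\big)^{\!\top} \big(V+\tau P\big)\big(Y-F(f)\big),
\;\; P=\tfrac{1}{m}\,\Phi\,\Phi^{\top}.
```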
But so far this was written for any set of functions. 00:38:30.960 |
Now, this is the definition of a reproducing kernel Hilbert space. 00:38:36.340 |
You have some kernel, which is a Mercer kernel; 00:38:41.740 |
if you take its inner product with a function f(x), you get the same function back. 00:38:48.420 |
That is why it is called a reproducing kernel Hilbert space. 00:38:53.020 |
And it is known that a Mercer kernel has an expansion 00:38:57.240 |
with coefficients lambda, where the lambdas are non-negative values. 00:39:41.460 |
If you take this set of functions and introduce a 00:39:46.020 |
special inner product, an inner product of this type, 00:39:55.500 |
then you will have a reproducing kernel Hilbert space. 00:40:07.280 |
In such a space you have a great theorem, called the representer theorem: 00:40:17.000 |
if you minimize a functional over a subset of functions, and the subset of functions 00:40:28.960 |
is defined by a bounded norm, then your solution has a linear representation 00:40:33.600 |
over the kernel with a finite number of parameters. 00:40:36.840 |
So let me introduce a matrix and a vector of functions, 00:41:14.040 |
functions from the reproducing kernel Hilbert space, 00:41:23.160 |
and this is F(f) for f from the reproducing kernel Hilbert space. 00:41:31.400 |
The subset of functions with bounded norm has finite VC dimension, 00:41:48.000 |
and to control the VC dimension you should control just C, the bound on the norm. 00:42:04.880 |
Minimization in a reproducing kernel Hilbert space has a closed-form solution: 00:42:09.080 |
you minimize this functional, subject to this constraint, 00:42:12.900 |
the constraint being the bound on the norm, and this will give you 00:42:17.900 |
your solution. It is a linear expansion 00:42:23.960 |
over this vector of functions, and the value of the coefficients 00:42:28.960 |
is given in closed form: 00:42:39.320 |
it is a function of matrices, a multiplication of matrices. 00:42:52.340 |
With the identity matrix instead of V you have the classical solution. 00:43:12.380 |
So minimization in a reproducing kernel Hilbert space has a closed-form solution, 00:43:32.540 |
and this is how the closed-form solution looks. 00:43:35.280 |
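A small numerical sketch of a closed-form kernel solution of this type. I assume the expansion f(x) = Σ_i a_i K(x_i, x) with coefficients A = (V K + γ I)⁻¹ V Y, which reduces to ordinary kernel ridge regression when V is the identity; the kernel, the regularization γ, and the data are illustrative, and the invariant constraint is omitted here.

```python
# Closed-form kernel solution sketch: f(x) = sum_i a_i K(x_i, x),
# with coefficients A = (V K + gamma I)^{-1} V Y  (V = I gives kernel ridge regression).
import numpy as np

def gaussian_kernel(A, B, width=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (80, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)            # toy two-class labels

K = gaussian_kernel(X, X)
V = np.eye(len(X))                                   # plug in a V-matrix here if desired
gamma = 1e-2
A = np.linalg.solve(V @ K + gamma * np.eye(len(X)), V @ y)

X_test = rng.uniform(-1, 1, (5, 2))
f_test = gaussian_kernel(X_test, X) @ A              # estimated conditional probability
print(np.round(f_test, 2), (f_test > 0.5).astype(int))
```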
Your solution is the set of coefficients of this expansion; 00:44:21.620 |
in a particular case you will get exactly the support vector machine. 00:44:26.620 |
If you look at how the support vector machine is derived, 00:44:38.380 |
I treat the data like in the support vector machine: 00:44:44.300 |
y_i minus (K A)_i, that is, I would like to approximate the data; 00:44:50.180 |
there is a norm term like here, but I would also like to keep the invariants. 00:45:34.660 |
If I make the requirement strong for the invariants 00:45:39.660 |
and not too strong for approximating the function, 00:45:59.180 |
I will find that my coefficients A are defined by the invariants, 00:46:08.620 |
while my knowledge enters through the predicates, through how I choose the predicates. 00:46:16.060 |
That means that I can choose even one good function, one predicate, 00:46:38.280 |
and for estimating the coefficients I do not need much data. 00:47:03.360 |
This shows that the philosophy of this learning is very different. 00:47:29.480 |
The solutions belong to a one-layer network, not a multi-layer network, 00:47:34.480 |
because of the reproducing kernel Hilbert space. 00:47:44.080 |
Maybe it is not the best idea to use a deep network. 00:47:49.080 |
Deep networks, okay, we will discuss this. 00:47:55.120 |
The observation vectors and the kernel define the basis for the expansion. 00:48:12.320 |
And what it means, if you write it in this form, 00:48:24.520 |
where Phi is the predicate matrix, so you have a vector here, 00:48:38.200 |
is that you have a different formulation of the learning problem, 00:48:56.720 |
since any function from the Hilbert space can be used as a predicate. 00:49:13.380 |
So we have a chance to pick several smart functions, 00:49:18.940 |
or maybe even one, and it will be enough for our training. 00:49:23.460 |
And then we will talk about what "smart" means. 00:49:26.020 |
It is intelligent learning, not just brute-force learning. 00:49:30.300 |
So let me give you an illustration, to get a feeling for it. 00:49:48.780 |
Here, if I introduce invariants, I do better; 00:49:53.220 |
here, I use both the invariants and the V matrix. 00:50:01.400 |
This is for 48 points, and to be sure that it is a difficult problem, 00:50:07.320 |
we took 16 points for one class and 32 from the other class. 00:51:39.260 |
If the existing solution does not keep some invariant, then you have a chance to improve your performance. 00:51:53.940 |
You find the situation, the box, where the existing solution 00:52:03.180 |
contradicts the evidence, where it does not keep the invariant, 00:52:14.660 |
and you obtain a new approximation which resolves this contradiction. 00:52:22.780 |
So you are looking for, you are inventing, invariants; it is the same method 00:52:31.060 |
that physicists use to discover the laws of nature: 00:52:39.180 |
they look for situations where the existing theory contradicts observations, 00:52:58.460 |
and then build a theory which does not contain the observed contradiction. 00:53:02.260 |
And the most important part in physics, as here, 00:53:06.340 |
is also the most difficult part of scientific discovery. 00:53:22.620 |
But we can use our theory for neural nets as well. 00:53:31.980 |
In a neural net you are minimizing the least-squares error. 00:53:35.600 |
I would instead minimize the functional with this approximation of the invariants, 00:53:48.140 |
with the V plus P matrix, where the matrix P is the matrix built from the invariants. 00:53:53.140 |
And then I do the same back-propagation. 00:53:58.700 |
It so happens that I can easily do back-propagation 00:54:02.780 |
not just for the identity matrix but also for this matrix: 00:54:10.380 |
I need only one correction in back-propagation, 00:54:18.860 |
in the place where you have y minus what you have on the last layer. 00:54:45.500 |
The forward step is okay, it is the same. 00:54:49.420 |
The back-propagation step gives the boundary condition, 00:54:54.260 |
and in the back-propagation step you should do this small correction. 00:55:03.020 |
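A minimal sketch of this modified training objective, assuming the loss is the quadratic form (Y − F)ᵀ(V + τP)(Y − F) computed per batch; the network, the choice of predicate (here the all-ones zero-order predicate), and V = I follow the simplifications mentioned in the talk, but the details are illustrative and autograd replaces the hand-derived back-propagation correction.

```python
# Train a small network with the (V + P)-weighted quadratic loss instead of plain MSE.
import torch

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).float()            # toy binary labels

net = torch.nn.Sequential(
    torch.nn.Linear(20, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1), torch.nn.Sigmoid()
)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

def lusi_loss(f, y, phi, tau=1.0):
    """(y - f)^T (V + tau * P) (y - f) with V = I and P built from predicate values phi."""
    diff = (y - f).unsqueeze(1)                       # column vector, shape (b, 1)
    P = phi @ phi.T / phi.shape[1]                    # covariance-style predicate matrix
    M = torch.eye(len(y)) + tau * P
    return (diff.T @ M @ diff).squeeze() / len(y)

for step in range(200):
    idx = torch.randint(0, len(X), (64,))
    xb, yb = X[idx], y[idx]
    phi = torch.ones(len(xb), 1)                      # zero-order predicate phi(x) = 1
    f = net(xb).squeeze(1)
    loss = lusi_loss(f, yb, phi)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```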
So I came to NSE and asked the guys who have a neural network to try it. 00:55:14.720 |
We got a small improvement; I took just one invariant, 00:55:27.420 |
and I will show you that it is not such a simple invariant. 00:55:33.620 |
And we set V equal to the identity; we did not use the V matrix. 00:55:42.460 |
We used a thousand examples, a hundred per class, batch size six. 00:55:55.020 |
It is a deep neural network; they have a good neural network. 00:56:17.940 |
The predicate is a cosine coefficient: I take the picture of my digit 00:56:22.940 |
and make a cosine expansion, so I have Fourier coefficients, 00:56:42.140 |
and I use a predicate built from this one coefficient. 00:57:33.760 |
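As an illustration of such a predicate, here is how one low-order cosine (DCT-type) coefficient of a digit image could be computed and used as φ(x); the specific frequency indices are my choice:

```python
# A predicate phi(x) = one low-frequency cosine coefficient of the image x.
import numpy as np

def cosine_coefficient(img, u=1, v=2):
    """Project a 2-D image onto the (u, v) cosine basis function (a DCT-II style coefficient)."""
    h, w = img.shape
    cy = np.cos(np.pi * (np.arange(h) + 0.5) * u / h)
    cx = np.cos(np.pi * (np.arange(w) + 0.5) * v / w)
    return float(cy @ img @ cx)

# The invariant then requires that, on the training set,
#   (1/l) * sum_i phi(x_i) f(x_i)  ==  (1/l) * sum_i phi(x_i) y_i .
img = np.random.default_rng(4).random((28, 28))      # stand-in for a digit image
print(cosine_coefficient(img))
```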
The statistical part of learning theory is complete, 00:57:43.480 |
because there exist only two ways for convergence: 00:57:49.160 |
convergence in functions and convergence in functionals. 00:58:01.400 |
You can play one game, or the other, or the two games together. 00:58:22.880 |
But then you should ask yourself how to choose the invariants, 00:58:28.000 |
and the invariant is something about intelligence. 00:59:32.760 |
I think that this is a many-hundred-year-old story; 00:59:38.040 |
it is the subject of major philosophy, from Plato to Hegel to Wigner. 00:59:48.320 |
So my claim is that we should talk not about imitation of intelligence, but about its essence. 01:00:19.880 |
The expected number of elements of class y equal to one 01:00:29.200 |
should equal the number of training examples of the first class. 01:00:38.080 |
With this invariant you will pick a function which gives you 01:00:48.940 |
a number of representatives of the first class exactly the same 01:01:04.160 |
as in the training data; you can use it for a neural net and 10-class digit recognition. 01:01:21.800 |
Here the expectation is taken with respect to the conditional probability. 01:02:07.320 |
Using predicates, I construct invariants 01:02:21.000 |
like the correlations which I observed on my training data. 01:02:39.200 |
I am not so smart as to construct the most general predicate, 01:02:47.800 |
like a function u(x_1, x_2) of two variables, 01:02:57.120 |
together with the labels y. 01:03:00.360 |
But let us consider a predicate like this: I take the image, 01:03:16.360 |
and require that its correlation with the decision be the same as I see on my training data. 01:03:25.480 |
By the way, a convolutional neural network is one predicate: 01:03:33.160 |
it is this convolution over the points of the image. 01:03:41.460 |
You can use wavelets, you can use whatever you want. 01:04:00.160 |
To understand intelligence means to understand which predicates are involved in it. 01:04:04.140 |
But also, see the difference between a predicate and an invariant. 01:04:17.360 |
The invariant is what makes your abstract concept specific. 01:04:37.640 |
Let us consider just digit recognition in pixel space. 01:04:57.040 |
Take a small linear transformation of your pixel space. 01:05:00.720 |
If you have a small linear transformation of pixel space, 01:05:11.720 |
you can also describe the transformation using a Lie derivative. 01:05:23.440 |
We show this transformation in the first line: 01:05:29.400 |
in pixel space you make a linear transformation, 01:06:07.240 |
and the image is transformed according to the Lie derivative. 01:06:26.360 |
And from any data you can get this predicate. 01:06:36.560 |
Here it is horizontal translation: the first coordinate, x, plus a. 01:06:50.360 |
Here is the standard geometric formula for rotation. 01:07:05.840 |
And this is, again, an illustration from Patrice Simard: 01:07:19.360 |
you choose one, two, three, four, five Lie derivatives, 01:07:41.960 |
and we take invariants with respect to the Lie derivatives. 01:07:47.640 |
That is, I would like to learn using statistical invariants 01:07:58.160 |
which keep the invariance with respect to all the Lie derivatives. 01:08:12.680 |
So, suppose I have this set of clones of my digit, 01:08:17.680 |
and this is the set of clones of another digit: 01:08:37.080 |
they are obtained by massaging the originals with small linear transformations. 01:08:51.800 |
And now, I believe that symmetry is a general concept. 01:09:13.340 |
What I can do: here is my digit three. 01:09:22.240 |
I take the first line here, the second line here, and compare them. 01:10:01.480 |
The tangent distance between these two different pictures 01:10:08.920 |
is the coefficient of symmetry of this three. 01:10:12.660 |
And I can speak of horizontal symmetry, vertical symmetry, and so on. 01:10:43.160 |
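A simple sketch of one such predicate, a horizontal-symmetry coefficient comparing an image with its left-right mirror (a simplification of the tangent-distance-based coefficient described in the talk):

```python
# Predicate phi(x) = degree of horizontal symmetry of the image x,
# here simply the negative mean squared difference between x and its mirror image.
import numpy as np

def horizontal_symmetry(img):
    mirrored = img[:, ::-1]                          # flip left-right
    return -float(np.mean((img - mirrored) ** 2))    # 0 for a perfectly symmetric image

rng = np.random.default_rng(5)
half = rng.random((28, 14))
symmetric = np.hstack([half, half[:, ::-1]])
print(horizontal_symmetry(symmetric))                # 0.0
print(horizontal_symmetry(rng.random((28, 28))))     # negative
```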
To summarize: we say that we can minimize this functional, 01:10:47.280 |
which is slightly better than, say, plain least squares, 01:10:50.960 |
subject to this constraint, which is the serious part, 01:10:54.800 |
because this constraint defines the admissible set of functions. 01:11:04.680 |
So I can consider, say, all continuous functions, 01:11:13.400 |
and then select by smart predicates the admissible set of functions. 01:11:18.400 |
And I can do it because of weak convergence. 01:11:56.120 |
The theory goes in the direction of searching for predicates. 01:12:21.200 |
How many are needed? I believe it is not a lot, and I will show you why. 01:12:27.280 |
And I think that this line of thought is a very old line. 01:13:09.180 |
Classical German philosophy was about that. 01:13:16.260 |
Hegel thought that whatever is reasonable exists, and whatever exists is reasonable. 01:13:28.900 |
But more recently, 60 years ago, Wigner wrote an article 01:13:35.500 |
about the unreasonable effectiveness of mathematics: 01:13:48.460 |
you should look into the equations and you will see how the world works. 01:14:02.900 |
These two concepts reflect the essence of intelligence, 01:14:08.940 |
not just its imitation, which is what artificial intelligence does. 01:14:30.220 |
On digits, people achieved a 0.5% error rate using 60,000 observations. 01:15:07.540 |
Propp found a small set of units that allow one to synthesize all Russian folk tales, 01:15:24.100 |
and the same units apply to film, to television, to television series. 01:15:47.240 |
Probably there exists a small number of predicates, 01:15:57.860 |
and that is what I believe learning should be about. 01:16:20.960 |
...that you recommend for language classification tasks. 01:16:25.560 |
And the second question is, do you have any strategies 01:16:49.300 |
given that you have a small amount of data for selecting the function? 01:17:03.580 |
So the more predicates, the less overfitting happens. 01:17:09.420 |
- And if you look at the theory, the mathematics says 01:17:11.760 |
that if you have an infinite number of predicates, 01:17:14.140 |
you are left with one function, the one which you want. 01:17:24.080 |
The next question was about natural language processing and the Turing test. 01:17:41.860 |
- Whatever I am talking about is very trivial, very simple. 01:17:52.940 |
And you can think, like this guy, Vladimir Propp. 01:18:19.620 |
Yeah, but don't take on the immediately hard problems. 01:18:25.420 |
- Yeah, I try; they are very simple, just step by step. 01:18:29.460 |
It so happens that this is the main line of philosophy, 01:18:38.380 |
from Plato to this guy, who says that there are not too many ideas. 01:18:54.020 |
- Vladimir, thank you so much for coming today.