Complete Statistical Theory of Learning (Vladimir Vapnik) | MIT Deep Learning Series


Chapters

0:00 Introduction
0:46 Overview: Complete Statistical Theory of Learning
3:47 Part 1: VC Theory of Generalization
11:04 Part 2: Target Functional for Minimization
27:13 Part 3: Selection of Admissible Set of Functions
37:26 Part 4: Complete Solution in Reproducing Kernel Hilbert Space (RKHS)
53:16 Part 5: LUSI Approach in Neural Networks
59:28 Part 6: Examples of Predicates
70:39 Conclusion
76:10 Q&A: Overfitting
77:18 Q&A: Language

Transcript

Today we're happy and honored to have Vladimir Vapnik with us, co-inventor of support vector machines, support vector clustering, and the VC theory of statistical learning, and author of Statistical Learning Theory. He's one of the greatest and most impactful statisticians and computer scientists of our time. He grew up in the Soviet Union, eventually heading the computer science research department of the Institute of Control Sciences in Moscow.

So he will give today's lecture in Russian and I will translate. Just kidding. It's an honor and a pleasure to have Vladimir with us today, so please give him a warm welcome. — Thank you. About 50 years ago, Professor Chervonenkis and I started statistical learning theory. The problem was to answer the question: if you do well on the training data — if you make a small number of errors on the training data — when will you also do well on the test data?

That is, when will you minimize the expectation of error? We solved this problem, and this theory appears in almost all books. They may be written in different languages, but mostly they follow the same line: the law of large numbers alone is not enough; we need a uniform law of large numbers, uniform convergence, and so on.

But because we started this discussion from the empirical error — how well you do on the training data — and the bound shows that the better you do on the training data, the better you will do on the test data, people decided that the only way to learn is to take training data and do something with it to minimize the number of errors.

And all algorithms were constructed on this principle. About five years ago, I found that there exists another principle, even more interesting than this one, because the first principle is a brute-force principle: give me more data and I will give a better answer. The second principle is an intelligent principle.

And I will talk about this today. I will start with statistical learning theory, and then I will introduce the new theory. There are only two ways to generalize: one with data, and the other I will show you. I call it the complete statistical theory of learning, because there is no other way to do something.

There is no third way of generalization; you should use both of them. That is why it is a complete theory. And it is not so bad, because you will see that learning theory moves in a different direction — in the direction of intelligence, toward understanding what intelligence is. It is not the same as what Turing described.

Turing said that you should imitate an intelligent person. But now the question is: what is intelligence? We will discuss this. So let me start. The first part is the theory of generalization. And the question is: when, in a given set of functions, can you minimize a functional?

This is a pretty general functional: R(f) = ∫ L(y, f(x)) dP(x, y), where L is a loss function measuring the difference between y and the value of the function. It is what we do in regression, in pattern recognition, and all this stuff, but this is a more general setting. And what is more important: when we consider minimization of a functional, we must say over which set of functions.

In a given set of functions, we have to minimize the functional when the probability measure is unknown, but we are given i.i.d. data pairs. That is exactly the setting of both the pattern recognition and the regression estimation problems. For pattern recognition, the set of functions is indicator functions; for regression estimation, it is continuous functions, but the statement is the same.

And the answer is very simple: we can minimize this functional using data if and only if the VC dimension h of the set of functions is finite. Everything depends on the set of functions over which you minimize the functional, and you cannot avoid that. We call it capacity, but maybe it is better called the diversity of the set of functions.

The measure of this capacity, or diversity, is the VC dimension. And what is the VC dimension? I will give you the definition, first for a set of indicator functions. θ is the step function: you take a continuous function f and put θ in front of it to obtain an indicator. If the function is positive, you say one; if it is not positive, you say zero.

That is the indicator. And the VC dimension of a set of indicator functions equals h if h is the maximal number of vectors that can be shattered — separated, in all 2^h possible ways, into two subsets — using indicator functions from this set. So you have a set, and you should find h vectors which you can shatter in every possible way.

But you cannot shatter any h + 1 vectors. Then the VC dimension of this set of functions is h. This is a purely combinatorial definition. And if you can shatter ℓ vectors for any ℓ, then we say that the VC dimension is infinite. Now I will give you two theorems which we will use — the main theorems of the theory.

First: if a set of functions has VC dimension h, then with probability 1 − η, for all functions in the set, the bound holds true. And because this bound holds for all functions, and you would like to have the minimal estimate — the minimal right-hand side — you choose the function which minimizes the empirical loss.
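
As a concrete illustration, here is a minimal Python sketch of this first theorem's bound. The exact constants vary between statements of the theorem, so this particular form (a standard textbook variant for indicator functions) is an assumption of the sketch, not a transcription of the slide.

```python
import math

def vc_bound(emp_risk, h, l, eta=0.05):
    """Upper bound on the expected risk from the empirical risk.

    One common form of the VC bound for indicator functions:
        R(f) <= R_emp(f) + sqrt((h * (ln(2l/h) + 1) + ln(4/eta)) / l),
    holding with probability at least 1 - eta. Constants differ
    slightly between textbook statements.
    """
    epsilon = math.sqrt((h * (math.log(2 * l / h) + 1) + math.log(4 / eta)) / l)
    return emp_risk + epsilon

# The bound tightens as the ratio h/l shrinks:
print(vc_bound(emp_risk=0.05, h=10, l=1000))   # small h/l: tight bound
print(vc_bound(emp_risk=0.05, h=500, l=1000))  # large h/l: vacuous bound
```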

The second theorem: if you have a set of indicator functions built from linear functions, and it so happens that your vectors x lie inside a sphere of radius one and w is bounded by C, then the VC dimension is bounded by the minimum of two values — C² and n, where n is the dimensionality of the vector — plus one.

That means the VC dimension can be smaller than the dimensionality of the space. And you can control the VC dimension somehow — I will show you how to do it. It is a very important theorem. But what is the general way to adjust the VC dimension when searching for a function? You have a set of functions.

Maybe the set of functions has infinite VC dimension. Then you create a structure on the set of functions: you choose a small subset of functions with VC dimension h₁, then another subset which includes the small one, with VC dimension h₂, and so on — we have nested subsets of functions.

The VC dimensions are nested as well. Then, when you minimize the functional, you do two things: you choose an appropriate subset, and then, within this subset, you pick the function which minimizes the empirical loss. But notice that the ε in the bound depends on the VC dimension of the subset you choose — on the ratio of the VC dimension to ℓ.

And it also depends on the empirical loss which you achieve. So as soon as you create a structure, you can do whatever you want, even if the initial set of functions has infinite VC dimension. This scheme is known as structural risk minimization, and it is more or less all that the main results of VC theory contain.
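
A minimal sketch of that structure-based scheme, reusing `vc_bound` from the sketch above; the `(train_fn, h)` interface is hypothetical, invented here for illustration.

```python
def structural_risk_minimization(subsets, data, eta=0.05):
    """Choose, over a nested structure S1 in S2 in ..., the element that
    minimizes the guaranteed risk: empirical loss plus the VC term.

    `subsets` is a list of (train_fn, h) pairs: train_fn(data) returns
    (model, empirical_risk) for that subset, and h is its VC dimension.
    """
    l = len(data)
    best_model, best_guarantee = None, float("inf")
    for train_fn, h in subsets:
        model, emp_risk = train_fn(data)
        guarantee = vc_bound(emp_risk, h, l, eta)  # from the sketch above
        if guarantee < best_guarantee:
            best_model, best_guarantee = model, guarantee
    return best_model
```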

But VC theory does not answer four very important questions. How to choose the loss function L(y, f(x))? I said it could be any function. How to select the admissible set of functions f(x)? I told you the set of functions is given, but maybe it is a stupid set of functions — how do we construct a good set of functions?

How to construct a structure on the admissible set of functions? And then, how to minimize the functional over that structure? In this talk, I will try to answer all these questions. Now, the target functional for minimization — and this is an important slide: God plays dice. What is the setting of the pattern recognition problem? In this talk I will consider just the pattern recognition problem for two-class classification, but the generalization is straightforward.

So what is the setting of the pattern recognition problem? There is a generator — nature — which generates, randomly and independently, situations x, which come to an object. And this object holds a conditional probability: the probability that y equals one given x, and that y equals zero given x. The object knows this conditional probability and plays dice.

So it has the conditional probability function, it receives x on the input, it plays dice, and it says y. That is the most general setting of the learning problem; the deterministic case is a particular case of it. And what does the learning machine do? A learning machine has a set of functions, and it can choose any function from this set.

The problem is, observing ℓ observations (x₁, y₁), ..., (x_ℓ, y_ℓ), to pick the function used for classification. That means: given observations generated by the joint probability P(x, y) = P(y | x) P(x) — that is the whole scheme — find the rule that minimizes the functional. In my case, where y is zero or one and the model f is also a zero-one indicator function, the loss function just counts how many errors I make.

And since I have a probability measure, the functional collects the expectation of error; I would like to find the function which guarantees the smallest expectation of error. But this is not very good. Why not? Because the loss of this model is piecewise constant: it is zero everywhere except at some points, where it jumps to one or minus one.

So the gradient is zero everywhere and undefined at some points, and I cannot use gradient methods in this case. I should do something smarter. And what do people do? They replace the model and the indicator loss with just (y − f(x))², the least-squares error, instead of what I formulated before.

It is not such a bad choice, because it so happens that the minimum of this functional gives the conditional probability function P(y = 1 | x), and once we have found P(y = 1 | x), we can easily construct our decision rule: if the conditional probability exceeds 0.5, we say first class.

If it is less than 0.5, we say second class. And this is the optimal solution. But something is wrong with this replacement. Let us rewrite the first line: inside the bracket, I subtract the regression from the first term and then add the regression back, so I have two brackets instead of one, and then I take the square.

Expanding the square — it is just the usual binomial: the square of the first bracket, the square of the second, and twice their product — the last line shows that the first integral does not depend on the function f which I am looking for, so I have to minimize over my set of functions just the sum of the two last terms.

But our goal is to minimize the first of those integrals — to find a function close to the conditional probability function — not the sum of the two. We can show that the second integral eventually goes to zero, but it slows down the rate of convergence. To get a good rate of convergence, we need to find a way to minimize the first integral alone, not the sum of the two.
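
Written out under the notation above (with P(1|x) short for P(y = 1 | x), the regression of y on x), the decomposition the slide describes is:

```latex
\begin{align*}
\int \big(y - f(x)\big)^2 \, dP(x,y)
  &= \underbrace{\int \big(y - P(1\mid x)\big)^2 \, dP(x,y)}_{\text{does not depend on } f} \\
  &\quad + \underbrace{\int \big(P(1\mid x) - f(x)\big)^2 \, dP(x)}_{\text{the integral we actually want to minimize}} \\
  &\quad + \underbrace{2\int \big(y - P(1\mid x)\big)\big(P(1\mid x) - f(x)\big)\, dP(x,y)}_{\text{cross term}}
\end{align*}
```

The cross term vanishes in expectation, since E[y | x] = P(1 | x), but under the empirical measure it does not vanish, and that is what slows down the rate of convergence.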

And that means we need not the least-squares loss but something else. What can we do? First of all, when y is zero or one, the probability P(y = 1 | x) is some real-valued function between zero and one. And from the Bayes formula, we know that the conditional probability P(y = 1 | x) times the density p(x) is the joint density p(y = 1, x).

That is always true. Now, if I multiply both sides by some function g(x − x*) belonging to the L2 space and take the integral, I obtain the equation ∫ g(x − x*) f(x) dP(x) = ∫ g(x − x*) dP(y = 1, x), and I can say that the conditional probability is a solution of this equation. It was constructed like that — you can see I put the conditional probability in, and it works.

But I would like to solve this equation to find the function when I don't know the probability measure but am given data: given observations generated according to P(x, y), I would like to solve this equation. However, solving such an equation is an ill-posed problem. Okay, let's do it anyway. But before that, I would like to mention that in classical statistics there is a way to replace the unknown probability measure with an empirical measure.

And that is the most important part — the main induction step in statistics. In statistics, we are given data and would like to estimate a function, and no amount of data we see is, by itself, equivalent to the function. So in classical statistics, people suggested approximating the cumulative distribution function by the empirical cumulative distribution function.

That is the empirical cumulative distribution function. For 30 years, mathematicians tried to prove that this is a good idea — that we can do that. In 1933, Kolmogorov found an exact bound, the one on the last line, and then people proved bounds almost the same as that one.
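
A minimal sketch of the empirical CDF and the kind of guarantee in question; the quoted inequality is the Dvoretzky-Kiefer-Wolfowitz form of the bound, used here as a stand-in for the bound on the slide.

```python
import numpy as np

def empirical_cdf(sample):
    """Build the empirical cumulative distribution function F_l from an
    i.i.d. sample and return it as a callable."""
    xs = np.sort(np.asarray(sample))
    l = len(xs)
    def F_l(x):
        # Fraction of observations not exceeding x.
        return np.searchsorted(xs, x, side="right") / l
    return F_l

# Kolmogorov-type guarantee (Dvoretzky-Kiefer-Wolfowitz form):
#   P( sup_x |F(x) - F_l(x)| > eps ) <= 2 * exp(-2 * l * eps**2)
sample = np.random.randn(1000)
F_l = empirical_cdf(sample)
print(F_l(0.0))  # close to the true value 0.5 for a standard normal
```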

Now, if we can replace the unknown measure with the empirical measure, we can make our problem constructive. Let us replace the real measure in the functional we want to minimize with the empirical measure. Then we have the empirical least-squares functional, which we minimize to find our conditional probability, our decision rule, or whatever we want.

But let me consider a new constructive setting, where we also replace the unknown probability measure in the equation with the empirical probability measure obtained from the training data. You see the last equation: to find the conditional probability, we have to solve this equation in a set of functions f(x). The right-hand side is known, because our function g is known.

On the left-hand side we don't know f(x), so our goal is to find this function in the set of functions. In classical statistics, there was one algorithm, the Nadaraya-Watson estimator, which shows how to estimate the conditional probability or the regression function. They just somehow defined it — a formula with a numerator and a denominator.

Here g is a special kernel, say a Gaussian. This is the estimate of the regression — a very general way of estimating the regression or the conditional probability. And the whole business was: if g is Gaussian, how to find the best value of the variance to approximate the conditional probability well. People spent a lot of time on this subject, and this is what they have.
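
A minimal sketch of the estimator for one-dimensional data, with a Gaussian kernel; the bandwidth `sigma` is the variance parameter the lecture mentions tuning.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, sigma=1.0):
    """Nadaraya-Watson estimate of the regression / conditional
    probability with a Gaussian kernel:
        f(x) = sum_i y_i g(x - x_i) / sum_i g(x - x_i).
    """
    d2 = (x_query[:, None] - x_train[None, :]) ** 2   # pairwise squared distances
    g = np.exp(-d2 / (2.0 * sigma**2))                # kernel weights
    return (g @ y_train) / g.sum(axis=1)              # weighted average of labels

x = np.linspace(-1, 1, 200)
y = (x > 0).astype(float)                             # two-class labels
print(nadaraya_watson(x, y, np.array([-0.5, 0.5]), sigma=0.1))  # ~[0, 1]
```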

But look at the line in the middle: the Nadaraya-Watson estimator comes from a corrupted equation, not from the equation which we derived. In the middle it is f(x*), not f(x_i) as in the last, correct equation. You can then pull f(x*) out of the sum, and you get their formula.

So the classical Nadaraya-Watson estimator is actually the solution of a corrupted version of the equation we obtained. But what does it mean to solve an equation? It means I take the difference between the left-hand side and the right-hand side, define the area where I would like my function to operate, take the square, integrate over some probability measure, and minimize the resulting functional.

So let us do that. If you do the simple algebra — it is really very simple, and you can check it — you get a functional built from the terms (y_i − f(x_i)) and (y_j − f(x_j)), multiplied by coefficients V(x_i, x_j). So we can estimate this value.

And this value is V(x_i, x_j) = ∫ g(x_i − x) g(x_j − x) dμ(x), and these values form a matrix. Since we know the exact formula for g from the Nadaraya-Watson setting, we know each element of this matrix. So what is the V matrix? If we replace the integral with its empirical version, we get an estimate of the V matrix.

If you take, say, the line between minus one and one, with μ(x) uniformly distributed over this line and g a Gaussian, you get the V matrix explicitly. So the V matrix is easy to find.
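
A minimal sketch of that one-dimensional case — Gaussian g, μ uniform on [−1, 1] — with the integral evaluated by simple quadrature; the grid resolution and σ are arbitrary choices of the sketch.

```python
import numpy as np

def v_matrix(x_train, sigma=0.5, grid=np.linspace(-1.0, 1.0, 2001)):
    """V matrix for 1-D data:  V_ij = integral g(x_i - x) g(x_j - x) dmu(x),
    with Gaussian g and mu uniform on [-1, 1], computed by quadrature."""
    g = np.exp(-(x_train[:, None] - grid[None, :]) ** 2 / (2 * sigma**2))
    density = 1.0 / (grid[-1] - grid[0])          # uniform density on the interval
    # Integrate g_i(x) * g_j(x) * density over the grid for every pair (i, j).
    return np.trapz(g[:, None, :] * g[None, :, :], grid, axis=-1) * density

x_train = np.array([-0.5, 0.0, 0.5])
print(v_matrix(x_train))   # symmetric positive semi-definite 3x3 matrix
```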

Now I would like to use vector notation. I will collect the y elements of the training data into a vector Y = (y₁, ..., y_ℓ): I am given ℓ pairs, so I create the vector Y of all the y's. I will also create F(f), an ℓ-dimensional vector: I pick a function f, and the first component is the value of this function at the point x₁, and the last is its value at the point x_ℓ.

So this is F(f), and I have the V matrix. I can then rewrite the functional in matrix form: (Y − F(f))ᵀ V (Y − F(f)). If I write the least-squares method in the same notation, I have (Y − F(f))ᵀ I (Y − F(f)) — the identity matrix instead of V.

So I got some improvement over the least-squares method, and I hope it will give me a better rate of convergence. We will see. But this is not the major point — okay, I improved the rate of convergence, but it is still essentially the least-squares method. Now comes the most important part: selection of the admissible set of functions.

What does that mean? When you construct a neural network, you say that you are designing a smart structure. That means you claim you are constructing a smart admissible set of functions — you know something, you are smart guys. You construct the set, and then you minimize over it. But let me consider from a mathematical perspective what this is.

If you consider a Hilbert space — or a Euclidean space — there are two modes of convergence in this space. Strong convergence is convergence of functions; that is the first line: my sequence of functions f_ℓ converges to f₀ if the integral of their squared difference goes to zero as ℓ goes to infinity. But there also exists weak convergence.

We say that a sequence of functions f_ℓ converges weakly to f₀ if the inner product (f_ℓ, φ) converges to (f₀, φ) for every function φ from the Hilbert space. You can think of it this way: each inner product describes a property of the function, and if all the properties converge, that is weak convergence.

It is easy to show that if you have strong convergence, you also have weak convergence — it follows from the Cauchy-Schwarz inequality. But one can also prove that if you have weak convergence and your set of functions belongs to a compact set, you also have strong convergence. So in some sense the two modes of convergence are equivalent.
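
In symbols, the two modes of convergence read:

```latex
% Strong convergence (convergence of functions):
\lim_{\ell \to \infty} \int \big(f_\ell(x) - f_0(x)\big)^2 \, d\mu(x) = 0 .

% Weak convergence (convergence of functionals): for every \varphi in the space,
\lim_{\ell \to \infty} \int f_\ell(x)\,\varphi(x)\, d\mu(x)
  = \int f_0(x)\,\varphi(x)\, d\mu(x) .
```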

In machine learning, we consider strong convergence everywhere — 100%. But what about weak convergence? Let's explore this opportunity. Consider the pattern recognition case. On the first line I just write the definition of weak convergence; using the same Bayes equation as before, it becomes ∫ φ(x) P(y = 1 | x) dP(x) = ∫ φ(x) dP(y = 1, x), required for all functions φ.

It must hold for all functions φ. If it holds for all functions φ, then only one function remains — the desired one. But that is not realistic. So let us do the following: select m functions φ from the Hilbert space. We will talk a lot about how to select these functions.

Then we consider the equality not for all functions, but just for these m functions. And we call the admissible set of functions the set of functions which satisfies these equalities. We know that our desired function must satisfy the equality for any φ whatsoever, because of weak convergence.

But we select the properties that we want. Now, if you use, instead of the cumulative distribution function, its empirical estimate, you get, instead of the integral property here, the same property written as a sum. Again, let me use the same matrix notation: the vector Y, and the vector F(f) = (f(x₁), ..., f(x_ℓ)).

For any function f, I have a vector. And I also introduce a new vector, the vector of predicate values at x₁, ..., x_ℓ: since φ is a function, I can consider its values at these points. Then I can write my equation in this vector form: I would like my admissible set of functions to satisfy these m equations.
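
A minimal sketch of these empirical invariants: each predicate φₛ contributes one equation (1/ℓ) Σ φₛ(xᵢ) f(xᵢ) = (1/ℓ) Σ φₛ(xᵢ) yᵢ, and functions with residuals near zero form the admissible set.

```python
import numpy as np

def invariant_residuals(predicates, x_train, y_train, f_values):
    """Residuals of the m empirical invariants
        (1/l) sum_i phi_s(x_i) f(x_i)  =  (1/l) sum_i phi_s(x_i) y_i .
    `f_values` holds f(x_i) for a candidate function f."""
    phi = np.array([[p(x) for x in x_train] for p in predicates])  # m x l
    return phi @ (f_values - y_train) / len(x_train)

x = np.linspace(-1, 1, 100)
y = (x > 0).astype(float)
predicates = [lambda x: 1.0, lambda x: x]   # phi = 1 and phi = x
# A constant prediction satisfies the first invariant but not the second:
print(invariant_residuals(predicates, x, y, np.full_like(y, y.mean())))
```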

Now let me explain what we are talking about. There is the duck-test logic: if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. What this means here: we have statistical invariants, in the vector form of this line, which collect the admissible functions — the functions that identify an animal as a duck if it looks, swims, and quacks like a duck.

So if you choose predicates which express what it means to look, swim, and quack, then your admissible functions will be exactly those which classify as ducks the animals that swim, quack, and look like a duck. The concept of a predicate is very different from a feature. Why? Because with an increasing number of predicates, the VC dimension of the admissible set of functions decreases.

Why does it decrease? Because we have a set of functions, and from this set we select only the functions which satisfy the new predicate. Not all functions will satisfy it, and we keep only the set of functions which satisfy all the predicates. But with an increasing number of features, the VC dimension increases, because your decision rules become more and more diverse.

So what is the exact setting of the complete learning problem? Minimize the functional with the V matrix — a slight improvement of the least-squares functional — subject to these constraints. And the constraints express what you would like to see in the set of admissible functions. Existing classical methods of pattern recognition just minimize the least-squares functional.

We minimize this functional subject to these constraints — that is our setting. That is the exact setting, which in mathematics is called conditional optimization: optimization of a functional under constraints. The approximation is unconditional optimization: I would like to minimize the functional subject to the constraints, but instead I take this functional plus the squared differences between the two sides of each constraint, summed with some weights.

The weights should sum to one. I would like to do both — minimize over both — and the weights say how important it is for me to fit the data and how important it is to satisfy the constraints. And if I do that, I can rewrite the whole thing as a single functional to minimize,

where P is just the covariance matrix of your predicates, and everything is simple to compute. So that is the concept: we have to solve our problem using both weak and strong convergence — that means using invariants and minimizing functionals. And we can do it in the exact way or in the approximate way.
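
In symbols, the two settings are as follows; the particular τ / (1 − τ) weighting and the normalization of P are this editor's reading of the slide, not a verbatim transcription.

```latex
% Exact (conditional) setting:
\min_f \; (Y - F)^\top V \, (Y - F)
\quad \text{subject to} \quad
\Phi_s^\top F = \Phi_s^\top Y, \qquad s = 1, \dots, m .

% Approximate (unconditional) setting, with weights summing to one:
\min_f \; \tau\,(Y - F)^\top V (Y - F)
  + (1 - \tau) \sum_{s=1}^{m} \big(\Phi_s^\top (Y - F)\big)^2
= \min_f \; (Y - F)^\top \big(\tau V + (1 - \tau)\, m P\big)(Y - F),
\qquad P = \tfrac{1}{m} \sum_{s} \Phi_s \Phi_s^\top .
```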

So far this was written for an arbitrary set of functions; I did not say how to carry out the minimization. Just as the least-squares functional can in principle be minimized over any set of functions, the same is true here. But now let me show how to actually find the solution.

First of all, I will do it in a reproducing kernel Hilbert space. Here is the definition of a reproducing kernel Hilbert space: you have some kernel, which is a Mercer kernel; you take its inner product with a function f, and you get back the same function — it reproduces the function.

That is why it is called a reproducing kernel Hilbert space. It is known that a Mercer kernel has an expansion over λ, where the λ are non-negative values and the ψ are orthonormal functions. So a set of functions, with this inner product and this norm, forms a reproducing kernel Hilbert space.

It is very easy to check: if you take orthonormal functions ψ and expansions over them, and you introduce a special inner product of this type, then you get this reproducing property.

So you have a reproducing kernel Hilbert space. It is a pretty general space. And in a reproducing kernel Hilbert space there is a great theorem called the representer theorem. The representer theorem says: if you would like to minimize this functional over the subset of functions whose norm in the reproducing kernel Hilbert space is bounded,

then your solution has a linear representation over the kernel, with a finite number of parameters. So let us introduce a vector of functions K(x) = (K(x₁, x), ..., K(x_ℓ, x)) — the vector of the expansion. Then the square of the norm is AᵀKA, and the function you are looking for is f(x) = AᵀK(x).

AᵀKA is the norm of your function from the reproducing kernel Hilbert space; K is the matrix of elements K(xᵢ, xⱼ); and f, as a function from the reproducing kernel Hilbert space, is just linear in A. The subset of functions with bounded norm has finite VC dimension — the smaller C, the smaller the VC dimension.

That is according to the second theorem which I showed you before: to control the VC dimension, you should control just C — you should watch the norm of the function. So conditional minimization in a reproducing kernel Hilbert space has a closed-form solution: minimize this functional subject to the constraints and to the bound on the norm, and the solution is a linear expansion over this vector of functions, where the vector of coefficients is given in closed form as a product of matrices.

There is a γ_C here which depends on the bound C, and an identity matrix, and you have the solution. To find the multipliers μ, you have to solve a linear equation. So you solve a linear equation, and then you have the closed-form solution.

So the complete problem in a reproducing kernel Hilbert space has a closed-form solution. And what about the unconditional minimization, which approximates the constrained one? It also has a closed-form solution, and this is how it looks: your solution is the vector of coefficients of the expansion, and there is an explicit equation for finding A.
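
A minimal Python sketch of the penalized (unconditional) version: replacing the hard norm constraint with a quadratic penalty γ AᵀKA gives a closed form by setting the gradient to zero. This is kernel ridge regression with a V-matrix weighting, shown under that assumption rather than as the exact formula from the slide.

```python
import numpy as np

def rkhs_solution(K, V, y, gamma=1e-2):
    """Coefficients of f(x) = sum_i A_i K(x_i, x) minimizing
        (y - K A)^T V (y - K A) + gamma * A^T K A .
    Zero gradient gives K V y = K (V K + gamma I) A, so for invertible K
    the coefficients are A = (V K + gamma I)^{-1} V y."""
    return np.linalg.solve(V @ K + gamma * np.eye(len(y)), V @ y)

def gaussian_kernel(X, sigma=0.5):
    return np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * sigma**2))

X = np.linspace(-1, 1, 50)
y = (X > 0).astype(float)
K = gaussian_kernel(X)
A = rkhs_solution(K, np.eye(len(y)), y)  # V = I reduces to kernel ridge regression
print((K @ A).round(2))                  # fitted values approximate the labels
```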

So everything is computable. A very special role in this story is played by the support vector machine. What is a support vector machine? Given data, I would like, in a reproducing kernel Hilbert space with a bounded norm, to minimize this functional. And when you minimize this functional, you get exactly the support vector machine.

If you look at how the support vector machine is derived, it is just minimization of this functional over this set of functions. But here I will do something else. I will fit the data as in the support vector machine — yᵢ versus AᵀK(xᵢ) — and keep the norm term as here, but I also want my invariants to be good enough:

I want the left-hand side and the right-hand side of each invariant to be close. So I minimize this functional under the same constraint. And this is the solution: the coefficient vector A is a function of the predicate vectors Φ — and Φ, as you remember, is the vector of predicate values.

If I make the requirement on the invariants strong, and not too strong on approximating the data — if I rely just on weak convergence — then my A is defined by the invariants, by the predicate vectors. And as for how I choose the predicates: I can choose any functions I want.

That means I might choose one function which gives me the optimal solution — there may exist a predicate so smart that I will not need a lot of data. Why do I need data at all? I need data for the expansion over the kernel. But for estimating the coefficients, I don't need data; I use my predicates.

And what is a predicate? It is a property — an explanation of what I want. I will talk about this later. So this example with the support vector machine shows that the philosophy of this kind of learning is very different. According to the representer theorem, the solution of the learning problem in a reproducing kernel Hilbert space has a special property:

it defines a linear parametric function in the form of an expansion over kernel functions. That means the optimal expansion belongs to a one-layer network, not a multi-layer network. And a reproducing kernel Hilbert space is richer than that, so maybe it is not the best idea to use a deep network. Deep networks — okay, we will discuss this in depth.

The observation vectors and the kernel define the basis of the expansion for the optimal parametric solution, but the parameters are defined by the invariants. And you can use these invariants. What does it mean? If you write it out, you have K, a matrix depending on your training data, Φ, the predicate matrix, and the vectors Y and F.

And this is y, f, y is vector, f is vector. So you have element. You have different formulation of learning problem in term of this vector and these values. So since function in Hilbert space can be, since any function from Hilbert space can be used. Because when we told about weak convergence, it converge for all functions simultaneously of Hilbert space.

So any function can be used as a predicate, and we have a chance to pick several smart functions — or maybe even one — and it will be enough for our training. Later we will talk about what "smart" means. This is intelligent learning, not just brute-force learning. Now let me give you an illustration, to convey a feeling for it.

This is what the least-squares method does. This is the V-matrix method: a small improvement, and it is a little better. Here, if I introduce invariants, I do better. And here I use both the invariants and the V matrix. So you can see the difference.

This is for 48 points, and to be sure that it is a difficult problem, we took 16 from one class and 32 from the other — unequal classes. This is 98 points, the same story. This is 192 points, the same story. And this is a multi-dimensional case; we checked it there too, and it worked.

And here is a very interesting case. We introduced some invariants and got a 0.2273 error rate on the diabetes problem. Then we asked: can we find an invariant that improves performance? What we did: we looked for the area where our invariants do not hold — where they are violated. Then we took a predicate devoted to this area, one which counts one when a point belongs to this area and zero when it does not.

And using this invariant, we improved from 0.2273 to 0.227. So if you have a smart way of looking for invariants, you have a chance to improve your performance. This is exactly the same philosophy that physicists use: find the situation — the box in the figure — where the existing solution, the approximation we obtained, contradicts the evidence, does not keep the invariants.

The invariants are contradicted inside the box. Then modify the solution: obtain a new approximation which resolves this contradiction — which no longer has it. So you invent an invariant and look where you have a contradiction. And that is exactly the principle physicists use to discover laws of nature.

To discover a law of nature, a physicist first tries to find a situation where the existing theory contradicts observations — in invariant language, where the theoretical predictions are not supported by experiments. Then they try to reconstruct the theory and remove the contradiction: they construct a new approximation of the theory which does not contain the observed contradiction.

And the most important part in physics, just as here, is the most difficult part of scientific discovery: how to find the contradictory situation. Now let me show something about neural nets. I am not fond of neural nets, but we can use our theory for neural nets as well. What is a neural net?

In a neural net, you minimize the least-squares error. I will instead minimize the approximation with invariants, which contains the V plus P matrix, where P is the matrix built from the invariants. And then I do the same back-propagation. It so happens that I can easily do back-propagation not just for the identity matrix but for this matrix as well.

And doing back-propagation this way requires only one correction. Instead of back-propagating the error — y minus what you have on the last layer, over all ℓ observations — you multiply your vector of propagated errors by this matrix.

You correct what you propagate; that is all the change to back-propagation. You do the forward step — it is the same. You have the back-propagation step with its boundary condition; in the back-propagation step you make one correction, only at the very last layer, and then update your weights.
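
A minimal numpy sketch of that single correction; the τ-weighted combination of V and P, and the normalization of P from one predicate, are assumptions of the sketch.

```python
import numpy as np

def corrected_output_gradient(F, y, V, P, tau=0.5):
    """Gradient of L = (y - F)^T (tau V + (1 - tau) P) (y - F) with
    respect to the network outputs F. Ordinary least-squares backprop
    propagates 2 * (F - y); the only change is multiplying this error
    vector by the combined matrix at the last layer."""
    M = tau * V + (1.0 - tau) * P
    return 2.0 * M @ (F - y)

l = 6                                  # batch of six, as in the experiment
phi = np.ones(l)                       # the trivial predicate phi(x) = 1
P = np.outer(phi, phi) / l             # P built from this single predicate
V = np.eye(l)                          # V = I, no V-matrix weighting
F = np.random.rand(l)
y = (np.random.rand(l) > 0.5).astype(float)
grad = corrected_output_gradient(F, y, V, P)  # feed into standard backprop
```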

So I came to NEC and asked the guys who have a neural network to incorporate just this one small improvement. I took just one invariant, and a very trivial one: the predicate φ(x) ≡ 1. I will show you that it is not such a simple invariant;

we will discuss later what this invariant does. And we set V = I — we did not use the V matrix; here it is just the identity matrix. We used a thousand examples, a hundred per class, with batch size six. This line is what the deep neural network does — they have a good neural network.

And that is what the improved neural network does: instead of 3.1, they got 2.9. Okay, next let's take just one, not very important, cosine coefficient: I have the picture of my digit, I take its cosine expansion, so I have Fourier coefficients, and I use one Fourier coefficient as a predicate.

And again, you see: just this one invariant, this stupid cosine, makes an improvement. Okay, then we decided to do more: 16 Fourier coefficients — four frequencies for x₁ and four for x₂ — and we got 0.6, a bigger improvement. And I can do whatever I want — say, 1600 invariants; a lot of games can be played here.
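
A minimal sketch of such cosine-coefficient predicates for a 2-D image; the exact basis and normalization used in the experiment are not specified in the talk, so a separable DCT-style basis is assumed here.

```python
import numpy as np

def cosine_predicate(image, k1, k2):
    """The (k1, k2) cosine-expansion coefficient of a 2-D image, used as
    a predicate phi: one number per image."""
    n1, n2 = image.shape
    c1 = np.cos(np.pi * k1 * (np.arange(n1) + 0.5) / n1)
    c2 = np.cos(np.pi * k2 * (np.arange(n2) + 0.5) / n2)
    return c1 @ image @ c2   # inner product with the separable cosine basis

image = np.random.rand(28, 28)   # stand-in for a digit image
# 16 predicates: four frequencies along each axis.
phis = [cosine_predicate(image, k1, k2) for k1 in range(4) for k2 in range(4)]
print(len(phis))
```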

In the neural net we use only an approximation of the exact solution, but it works. The statistical part of learning theory is complete. Why is it complete? Because there exist only two modes of convergence in Hilbert space: convergence of functions and convergence of functionals. There is no third mode of convergence.

So from the conceptual point of view, you can play the first game, the second game, or both games together. That is why I call this complete: you cannot imagine something else. If you would like to do something to improve learning, you should ask yourself how to choose invariants. And that is what we will talk about next.

An invariant is something about intelligence. When I talked about the duck test, I used "looks like a duck", "swims like a duck", "quacks like a duck". But I could say "plays chess like a duck" — I can say whatever I want, and it will still be an invariant equality. Or, say, "sings opera like a duck", okay?

But among all the stupid predicates, there are smart predicates. And the subject of learning — of intelligent learning — is somehow to pick up these smart invariants. So let us discuss what a predicate is. I don't know the full answer. I think this is a story many hundreds of years old; I will show you that it is a continuation of a major line of philosophy, from Plato to Hegel to Wigner and so on.

But it is all in the predicate. My claim is that when you talk not about imitation but about the essence of intelligence, the essence is in predicates. A predicate is something abstract — I will show you later. Let's look at a predicate; that is, at an invariant that must hold true. Must, yes. So let's take φ(x) ≡ 1.

What does this predicate do? The expected number of elements of class y = 1, computed using this predicate, equals the number of training examples of the first class. When you use this predicate, you pick up functions which, on the training data, give exactly the same number of representatives of the first class as in your training data.

That is what this predicate does. And you saw how strong this predicate is for a neural net on ten-class digit recognition — because it captures something. Let's take another "stupid" predicate: just φ(x) = x. It gives the center of mass: I want the expected center of mass — expected with respect to the conditional probability — to equal the center of mass which I see on my training data.

That is what it looks like. But you can do smarter. I can consider x xᵀ, which makes a matrix — the correlation matrix — so I get about n²/2 predicates of this type. The correlation which I will get with the obtained solution will be the same as the correlation which I observed on my training data.
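
A minimal sketch of the three predicate families just named:

```python
import numpy as np

def moment_predicates(x):
    """phi(x) = 1 (class-frequency invariant), phi(x) = x (center of
    mass), and phi(x) = x x^T (correlation matrix; about n^2/2 distinct
    values by symmetry). The invariants equate the expectations of these
    quantities under the estimated conditional probability with their
    training-data averages."""
    return 1.0, x, np.outer(x, x)

x = np.array([0.2, -0.1, 0.7])
const, center_of_mass, correlation = moment_predicates(x)
print(const, center_of_mass, correlation.shape)
```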

That is what this predicate means. But I don't think we should go after fully general predicates — I am not smart enough to construct a very general predicate. Let's construct predicates for 2-D images: u(x₁, x₂), a function of two variables, the image of a digit, say, in our case.

And we have this function. Let's consider predicates like this: I take the image and compute its Fourier coefficients — that is my predicate. I want the expected value of my predicate with respect to the solution to be the same as what I see on my training data. I can also consider convolutions.

By the way, the convolutional neural network is built on one predicate — I will show you later which one — and this is the convolution at different points. You can use wavelets; you can use whatever you want, because whatever goes into the inner product, it is you who decides what to use.

And understanding image recognition means understanding which predicates are involved in it. Also, see the difference between a predicate and an invariant: a predicate is an abstract concept, while an invariant involves your training data — that is what makes your abstract concept specific. It is still general, but it is made specific. Now I want to show you an instrument for one special kind of predicate.

Let us consider digit recognition — keep digit recognition in mind — and suppose x is a vector in your pixel space. You make a small linear transformation of your pixel space: you first transform your picture.

You can transform the picture directly, but you can also transform it using the Lie derivative — I will show you what that is. To see it, I took this picture from a paper by Simard et al. In the first line, you make a linear transformation in pixel space.

And here you make the same transformation with the Lie derivative — this black image is the Lie derivative; it is just a computed Lie derivative, and I'll show you how to compute it. And α is a coefficient: using different coefficients — α = −2 gives you this, α = −1 gives you this — you get all this stuff.

So you can create clones of the digit two, transformed with respect to the Lie derivative. You don't need a lot of data: for digit recognition, you need predicates, and from a small amount of data you can get the predicate — and then I will show you the invariant with this predicate.

And this is the Lie derivative. Horizontal translation is x₁ plus a: you just move in the direction of x₁. Then vertical translation. This is rotation — the standard formula from geometry; for a small rotation you get this expression. And these are ∂/∂x₁, ∂/∂x₂, and so on.

You have all this stuff. And this, again, is an illustration from Patrice Simard: clones. You have this three, and you create all these clones — you just choose five Lie derivatives and two different coefficients for each, and you have all of them. But this was done by smart guys. Why not take the Lie derivative as a predicate, and take the invariant with respect to the Lie derivative?
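
A minimal sketch of generating clones along one Lie derivative (horizontal translation): for a small shift a, image(x₁ + a, x₂) ≈ image + a · ∂image/∂x₁, so the derivative image is the tangent vector. `np.gradient` stands in for the derivative used in Simard's construction.

```python
import numpy as np

def translation_clones(image, alphas=(-2.0, -1.0, 1.0, 2.0)):
    """Clones of an image along the horizontal-translation Lie
    derivative: image + a * d(image)/dx1 for small coefficients a."""
    d_dx1 = np.gradient(image, axis=1)        # derivative along x1
    return [image + a * d_dx1 for a in alphas]

image = np.zeros((28, 28))
image[10:18, 10:18] = 1.0                     # toy stand-in for a digit
clones = translation_clones(image)
print(len(clones), clones[0].shape)
```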

That is, I would like to learn using statistical invariants, trying to estimate the conditional probability which keeps the invariants with respect to all the Lie derivatives. But there is even more — again, this was already done. Suppose I have this set of clones of my digit, and this set of clones of another digit.

I call the tangent distance the distance between the closest elements of these two sets of clones. What I mean: I have two different digits, say a two and a three. I massage them with linear transformations to make them as close as possible, and measure their closeness. That is called the tangent distance.

And now I believe there is a general concept of predicate: symmetry. Given any picture, you can ask for the measure of symmetry of that picture. For example, you have a three — this is my digit three — and I want its horizontal symmetry.

I take the first line here, the second line here, the last line here. Then I do the following: I make another digit in which the last line becomes the first line, this line the second line, and so on. So I take the three and read it as a vector, but in reverse line order.

Now I have two different images. And I say: let me massage them — take the tangent distance between these two different pictures and see how close they are. If they are very close, I can say that this is the coefficient of symmetry of this three. And I can define horizontal symmetry, vertical symmetry, anti-symmetry.
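
A minimal sketch of such a symmetry coefficient; plain Euclidean distance stands in for the tangent distance, which would first "massage" both images with small linear transformations.

```python
import numpy as np

def horizontal_symmetry(image):
    """Compare an image with its copy read in reverse line order (last
    row first), as described above; 0 means perfectly symmetric."""
    flipped = image[::-1, :]                  # rows in reverse order
    return np.linalg.norm(image - flipped) / np.linalg.norm(image)

image = np.zeros((28, 28))
image[10:18, 10:18] = 1.0
print(horizontal_symmetry(image))             # 0.0 for this symmetric toy image
```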

What is anti-symmetry? The letter S has vertical anti-symmetry. I can have diagonal symmetry, everything — you can play many games with symmetry, and it is just one predicate. Now, my concluding remarks. What did we do? We said that we can minimize this functional — which is slightly better than least squares — subject to these constraints, and the constraints are the serious part.

Because the constraints define the admissible set of functions, which I try to construct by being smart. So I can consider, say, all continuous functions, and then from the continuous functions select the admissible set of functions by smart predicates. And I can do it because, according to weak convergence, the invariant holds for the desired function.

The invariant holds with any predicate. Also, this gives a unique solution in a reproducing kernel Hilbert space and an approximate solution for neural networks. But the true progress goes beyond statistical reasoning: it goes in the direction of searching for predicates, which form the basis for understanding the problems existing in the world.

And what does understanding mean? It means that in, say, 2-D image recognition, there exists the concept of symmetry, there exists the concept of structure. And if you know these concepts — and I believe there are not a lot of them; I will show you why I think it is not a lot — you understand this problem.

And I think this line of thought is a very old line. It starts from Plato. What does Plato say? Plato says that there is a world of ideas and a world of things, and the world of ideas makes the world of things. You see that I have ideas — the predicates, which are abstract and can be applied to different situations —

and I have the world of things. Then, 300 years ago, classical German philosophy was about exactly that: what ideas mean and what things mean. And Hegel said that whatever is reasonable exists — which is exactly what we said about predicates — and whatever exists is reasonable. So there are the two connections.

More recently, 60 years ago, Wigner wrote an article about the unreasonable effectiveness of mathematics, in which he says that mathematics knows something about reality: if you would like to understand reality, you should look into the equations, and you will see how it works. So predicates are abstract ideas, while the invariants built from them form elements of the solution.

These two concepts reflect the essence of intelligence, not just its imitation, which is what artificial intelligence does. That is the subject we should attack. And I have two more slides. One slide is a challenge. I know that people with deep networks reached a 0.5% error rate using 60,000 observations. The challenge is: use 1% of this data and get the same result

by inventing smart predicates and all these clones that exist. I think it is doable. And the very last slide: in 1928, Vladimir Propp published the book Morphology of the Folktale, where he describes 31 predicates that allow one to synthesize all Russian folk tales. Later his morphology — these 31 predicates — was applied to literature, to theater, to film, to television series, to games, et cetera.

And these 31 were enough. I read this on Wikipedia — you can check the Wikipedia article on Vladimir Propp. Propp found 31 predicates which describe different actions of people in the real world. Probably there exists a small number of predicates that describe 2-D images. And that is intelligence — knowing how to find them.

And that is what I believe learning should be about. Thank you. (audience applauding) - I think we have time for a few questions. - Hello, thank you. I have two questions. The first is: do you know of any predicates that you would recommend specifically for language classification tasks? And the second question: do you have any strategies for hedging against overfitting?

Like, if you specify too many predicates, then you might be sort of... - I did not hear you well, but the second question is about overfitting. - Overfitting, yes. - Yes, let me answer this. - Sure. - The more predicates you have — why does overfitting happen? Because your set of functions is big,

and you have a small amount of data for selecting a function, so you can select whatever you want. But if you increase the number of predicates, you decrease the set of functions. So the more predicates, the less overfitting happens. - Good point. - And the theory, the mathematics, says that if you have an infinite number of predicates, you are left with the one function which you want.

- You also asked about natural language: recommendations for predicates for language, for natural language processing, the Turing test — any good predicates? - You know, natural language is a very complicated story; I don't know. - Questions. - Whatever I am talking about is very trivial, simple. Everyone is familiar with 2-D images, and you can think, like this guy Vladimir Propp, about what the predicates are in these images.

Can we formulate — if you are smart guys — say, a couple of dozen, or maybe one dozen, predicates which will be enough? (audience member speaking faintly) - But language is harder than images. - Oh, yeah. (audience laughing) Yeah, but don't attack the immediately hard problem; try step by step — these are very simple.

It so happens that this is the main line of philosophy from Plato to this guy: there are not too many ideas that exist in the world of ideas. It could be like that. - Vladimir, thank you so much for coming today. - Thank you, thank you.

- And please give him a big hand. (audience applauding) Thank you. (upbeat music)