Complete Statistical Theory of Learning (Vladimir Vapnik) | MIT Deep Learning Series
Chapters
0:00 Introduction
0:46 Overview: Complete Statistical Theory of Learning
3:47 Part 1: VC Theory of Generalization
11:04 Part 2: Target Functional for Minimization
27:13 Part 3: Selection of Admissible Set of Functions
37:26 Part 4: Complete Solution in Reproducing Kernel Hilbert Space (RKHS)
53:16 Part 5: LUSI Approach in Neural Networks
59:28 Part 6: Examples of Predicates
70:39 Conclusion
76:10 Q&A: Overfitting
77:18 Q&A: Language
00:00:00.000 |
Today we're happy and honored to have Vladimir Vapnik with us, 00:00:04.160 |
co-inventor of support vector machines, support vector clustering, the VC theory of 00:00:08.480 |
statistical learning, and author of Statistical Learning Theory. 00:00:12.560 |
He's one of the greatest and most impactful statisticians and computer 00:00:17.120 |
scientists of our time. Plus he grew up in the Soviet Union, 00:00:22.080 |
eventually heading the computer science department at the Institute of Control Sciences 00:00:26.160 |
in Moscow. So he will give today's lecture in 00:00:29.440 |
Russian and I will translate. Just kidding, right? 00:00:35.520 |
It's an honor and a pleasure to have Vladimir with us today, so please give him a warm welcome. 00:00:49.120 |
About 50 years ago, Professor Chervonenkis and I started statistical 00:00:57.120 |
learning theory. The problem was to answer the question of 00:01:02.000 |
when, if you do well on the training data, 00:01:06.400 |
if you have a small number of errors on the training data, 00:01:10.320 |
you will also do well on the test data, that is, you will minimize the expectation of error. 00:01:19.920 |
So we solved this problem. This theory is now presented in almost all 00:01:30.800 |
languages, and the presentations mostly follow this line: the 00:01:38.160 |
law of large numbers is not enough; we need a uniform law of large numbers, 00:01:43.200 |
uniform convergence, and so on. But because we started this discussion 00:01:49.840 |
with the empirical error, how well you do on the 00:01:54.480 |
training data, and the bound shows that the better you do on the training data, 00:01:59.360 |
the better you will do on the test data, people decided 00:02:03.120 |
that the only way is to take training data and do something 00:02:08.800 |
with this training data to minimize the number of errors. 00:02:13.360 |
And all algorithms were constructed based on this principle. 00:02:19.680 |
About five years ago, I found that there exists another principle. 00:02:28.960 |
The first principle is a brute-force principle: 00:02:32.320 |
give me more data and I will give you a better answer. 00:02:35.360 |
The second principle is an intelligent principle. 00:02:38.960 |
I will talk today about this. I will start with statistical 00:02:45.200 |
learning theory, and then I will introduce this new theory. 00:02:50.000 |
Because there are only two ways for generalization, one with data 00:02:56.160 |
and another which I will show you, I call it complete statistical learning theory: there is no 00:03:09.040 |
third way for generalization, so you should use both of them. That is why it is a 00:03:16.080 |
complete theory. But it is not so bad, because 00:03:21.200 |
you will see that learning theory moves in a different direction, 00:03:27.920 |
in the direction of intelligence, towards understanding what intelligence is. 00:03:36.560 |
Turing said that you should imitate an intelligent person. 00:03:40.960 |
But now the question is: what is intelligence? We will discuss this. So let me start. 00:03:49.360 |
The first part is the theory of generalization. 00:04:03.280 |
The problem is to minimize a functional, and this is a pretty general functional: 00:04:07.360 |
instead of y, f(x), and a loss function L, I can consider a general loss. 00:04:18.960 |
It is what we are doing in regression and pattern recognition and 00:04:24.560 |
all this stuff, but this is a more general setting. What is 00:04:28.240 |
more important is that when we consider minimization of a 00:04:32.720 |
functional, we should say over which set of functions. 00:04:37.040 |
In a given set of functions, we have to minimize the functional 00:04:41.920 |
when the probability measure is unknown, but we are given 00:04:45.840 |
i.i.d. data pairs. That is the exact setting of both the pattern recognition 00:04:54.160 |
and the regression estimation problem. For pattern recognition the set of functions is 00:05:02.640 |
indicator functions; for regression it is 00:05:09.520 |
continuous functions, but the statement is the same. 00:05:15.840 |
The answer is very simple. We can minimize 00:05:20.240 |
this functional using data if and only if the VC dimension 00:05:27.280 |
h of the set of functions is finite. Everything depends 00:05:31.440 |
on the set of functions over which you have to minimize the functional, 00:05:35.680 |
and you cannot avoid that. We call it capacity, 00:05:40.560 |
but maybe it is better called the diversity of the set of functions. 00:05:45.280 |
And the measure of this capacity, or diversity, is the VC dimension. 00:05:51.360 |
What is the VC dimension? I will give you the definition. 00:05:56.000 |
First, for a set of indicator functions: θ is the step function. 00:06:00.160 |
You consider a continuous function f and put θ in front of it: where the 00:06:10.640 |
function is positive you say one, and where it is not positive you say zero. 00:06:17.040 |
That is the indicator. The VC dimension of a set of indicator functions 00:06:23.200 |
equals h if h is the maximal number of vectors 00:06:31.040 |
that can be shattered, that is, separated into all possible 2^h subsets, 00:06:35.200 |
using indicator functions from this set. 00:06:50.480 |
So there exist h vectors which you can shatter in all possible ways, 00:06:54.480 |
but you cannot shatter any h plus one vectors. Then the VC dimension 00:07:00.480 |
of this set of functions is h. This is a purely combinatorial definition. 00:07:07.520 |
And if you can shatter vectors for any number l, then we say that the VC dimension is infinite. 00:07:15.520 |
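To make the shattering definition concrete, here is a small brute-force sketch (my own illustration, not from the lecture): indicator functions of linear functions in the plane can shatter three points in general position but not four, consistent with their VC dimension of three.

```python
# Brute-force check of shattering, illustrating the combinatorial definition of VC dimension.
# Hypothetical example: indicators of linear functions in the plane, f(x) = step(w.x + b).
import numpy as np

def can_shatter(points, n_trials=20000, seed=0):
    """Return True if random linear indicators realize all 2^h labelings of `points`."""
    rng = np.random.default_rng(seed)
    h = len(points)
    X = np.asarray(points, dtype=float)
    realized = set()
    for _ in range(n_trials):
        w = rng.normal(size=X.shape[1])
        b = rng.normal()
        labels = tuple((X @ w + b > 0).astype(int))
        realized.add(labels)
        if len(realized) == 2 ** h:
            return True
    return False

three_points = [(0, 0), (1, 0), (0, 1)]          # shatterable by lines
four_points = [(0, 0), (1, 1), (1, 0), (0, 1)]   # the XOR labeling is not realizable
print(can_shatter(three_points))   # True  -> consistent with VC dimension 3
print(can_shatter(four_points))    # False (at most 14 of the 16 labelings are realizable)
```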
Now I will give you two theorems which we will use; 00:07:19.680 |
the first is the main theorem of the theory. 00:07:24.880 |
If a set of functions has VC dimension h, then with probability 1 − η, 00:07:31.360 |
for all functions in the set, the bound holds true. 00:07:38.080 |
Because this bound holds true for all functions, 00:07:41.200 |
and you would like to have a minimal estimate, a minimal right-hand 00:07:49.040 |
side, you will choose the function which minimizes the empirical loss. 00:07:56.240 |
The second theorem: if you have a set of linear functions, 00:08:01.360 |
indicators built from the set of linear functions, and it so happens 00:08:05.280 |
that your vectors x lie inside a sphere of radius one 00:08:10.560 |
and the norm of w is bounded by C, then the VC dimension is bounded by the minimum of two 00:08:18.400 |
values, one determined by C and the other being n, the dimensionality of the vector, plus one. 00:08:24.960 |
That means that the VC dimension can be smaller than the dimensionality of the space, 00:08:33.200 |
and you can somehow control the VC dimension; I will show you how to do it. 00:08:42.560 |
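For reference, a commonly quoted simplified form of the bound being described (as given in Vapnik's books; the exact constants on the slide may differ):

```latex
% With probability at least 1 - \eta, simultaneously for all functions f in a set of
% indicator functions with VC dimension h, given l i.i.d. examples:
R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\;
\sqrt{\frac{h\left(\ln\frac{2l}{h}+1\right)-\ln\frac{\eta}{4}}{l}},
\qquad
R(f)=\int L\big(y,f(x)\big)\,dP(x,y),\quad
R_{\mathrm{emp}}(f)=\frac{1}{l}\sum_{i=1}^{l} L\big(y_i,f(x_i)\big).
```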
But what is the general way to adjust the VC dimension when 00:08:47.760 |
searching for a function? You have a set of functions; 00:08:52.320 |
maybe the set of functions has infinite VC dimension. 00:08:55.920 |
Then you make a structure on the set of functions. 00:08:59.920 |
You choose a subset of functions, a small subset, 00:09:04.080 |
with VC dimension h1, then another subset which includes the small one, 00:09:10.000 |
with VC dimension h2, and so on: we have nested subsets of functions 00:09:15.840 |
with nested VC dimensions. Then, when you are minimizing the 00:09:20.320 |
functional, you are doing two things: you choose an appropriate subset, 00:09:27.040 |
and then in this subset you pick the function which minimizes the empirical loss. 00:09:37.040 |
You can see that the epsilon in the bound depends on the VC dimension, 00:09:47.680 |
on the ratio of the VC dimension to l, and also on the empirical loss 00:09:52.960 |
which you achieve. So, as soon as you make a structure, 00:09:59.680 |
you can do whatever you want, even if the initial set of functions has infinite VC dimension. 00:10:11.120 |
This is more or less all that the main result of VC theory contains. 00:10:19.120 |
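A toy sketch of this structure-based minimization (structural risk minimization), with invented data, polynomial subsets as the nested structure, and the square-root capacity term from the bound above:

```python
# Structural risk minimization sketch (illustrative, not the lecture's algorithm):
# nested subsets = polynomials of increasing degree; for each subset, add the
# square-root capacity term from the VC bound to the empirical loss.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = (np.sin(3 * x) + 0.3 * rng.normal(size=x.size) > 0).astype(float)  # toy labels

def emp_risk(deg):
    coeffs = np.polyfit(x, y, deg)                   # least-squares fit inside the subset
    pred = (np.polyval(coeffs, x) > 0.5).astype(float)
    return np.mean(pred != y)

def capacity(h, l, eta=0.05):
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

l = x.size
for deg in range(1, 8):
    h = deg + 1                                      # VC dim of degree-`deg` polynomial indicators on a line
    print(deg, round(emp_risk(deg), 3), round(emp_risk(deg) + capacity(h, l), 3))
# Choose the degree minimizing the last column (empirical risk + capacity term).
```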
But VC theory does not answer four very important questions. 00:10:30.480 |
I told that any function. Have to select admissible set of function, 00:10:35.920 |
f of x. I told you that given set of function, but maybe this is a stupid 00:10:42.960 |
set of function. Have to construct good set of function. 00:10:50.080 |
Have to construct structure on the admissible set of function. 00:10:55.440 |
And then have to minimize functional to construct the structure. 00:11:00.160 |
In this talk, I will try to answer all these questions. 00:11:15.120 |
What is the setting of the pattern recognition problem? 00:11:19.360 |
In this talk I will consider just the pattern recognition 00:11:23.840 |
problem for two-class classification, but the generalization is straightforward. 00:11:31.920 |
So what is the setting of the pattern recognition problem? 00:11:37.840 |
There is a generator, nature, which generates randomly 00:11:42.160 |
and independently situations x, which come to an object. 00:11:48.000 |
And this object is a conditional probability of y 00:11:52.320 |
given x: it says what the probability is that y equals one given x and that y equals zero given x. 00:12:01.840 |
The object knows this conditional probability 00:12:05.440 |
and plays dice: it has the conditional probability function, it gets x 00:12:12.880 |
on the input, it plays dice and says y. That is the most general setting of the 00:12:19.200 |
learning problem; the deterministic case is a particular case of this. 00:12:26.640 |
The learning machine has a set of functions, and it can choose any of them after 00:12:35.520 |
observing l observations, (x1, y1), ..., (xl, yl), to pick 00:12:44.240 |
the function for classification. That means: given 00:12:52.800 |
observations generated by the probability p(x, y) = 00:12:58.160 |
p(y | x) p(x), and that is the whole scheme, 00:13:05.760 |
find the rule that minimizes the functional. In my case, when 00:13:10.640 |
y is zero or one and the indicator function is also zero or one, 00:13:17.280 |
the loss functional just counts how many errors I make: 00:13:25.920 |
the expectation of error. I would like to find the function which guarantees 00:13:32.320 |
the smallest expectation of error. But this is not very good. Why is it not 00:13:38.640 |
very good? Because this functional 00:13:44.560 |
is everywhere zero, except at some points where it is one 00:13:50.960 |
or minus one. So if I take the modulus, it is 00:13:58.000 |
zero everywhere and equal to one at some points. 00:14:05.360 |
So I cannot use a gradient in this case; I should do something smarter. 00:14:15.760 |
So I replace the modulus involving the indicator function with just (y − f(x)) squared, 00:14:26.480 |
instead of what I formulated before. It is not such a bad choice, because it so 00:14:33.840 |
happens that the minimum of this functional gives the 00:14:38.960 |
conditional probability function, the probability that y 00:14:42.640 |
equals one given x. And when we can find this conditional probability, we have 00:14:54.320 |
our decision rule: if 00:15:01.520 |
the conditional probability exceeds 0.5, say first class; if it is less than 0.5, 00:15:08.080 |
say second class. And this is the optimal solution. 00:15:18.400 |
Let us rewrite the first line. Inside the bracket I will subtract 00:15:26.160 |
the regression from the first term, and then add the regression back. 00:15:36.320 |
Then I take the square. One of the integrals does not depend 00:15:44.320 |
on the function which I am looking for, so I have to minimize my functional 00:15:51.200 |
over f, over the set of functions, as just the sum of the two last terms. 00:16:00.240 |
How did I get this? It is just the usual binomial expansion of two terms: 00:16:07.440 |
the square of the first, the square of the second, and twice their cross product. 00:16:20.960 |
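Written out, the decomposition being described (with P(1|x) denoting the conditional probability P(y = 1 | x), the regression) is:

```latex
\int \big(y-f(x)\big)^2\, dP(x,y)
= \int \big(y-P(1\,|\,x)\big)^2\, dP(x,y)
+ 2\int \big(y-P(1\,|\,x)\big)\big(P(1\,|\,x)-f(x)\big)\, dP(x,y)
+ \int \big(P(1\,|\,x)-f(x)\big)^2\, dP(x).
```

The first term does not depend on f, and the cross term vanishes in expectation because E[y | x] = P(1 | x); this is why the minimizer of the least-squares functional is the conditional probability.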
What we would like, though, is to find a function which is close to the conditional probability function, 00:16:26.320 |
that is, to minimize one integral, not the sum of two integrals. We can show that the second one is the obstacle: 00:16:38.240 |
to have a high rate of convergence, we need to find a way to minimize the 00:16:46.160 |
first integral, not the sum of these two. And that means that least squares is not the best method. 00:17:01.600 |
First of all, when y is zero or one, the probability that 00:17:11.600 |
y equals one given x is some real-valued function between zero and one. 00:17:26.080 |
And we know that the conditional probability of y 00:17:30.400 |
equal to one given x, multiplied by the density p(x), is the joint density p(y = 1, x); 00:17:42.400 |
this is always true. Now, if I multiply both sides by some function g(x − x*) 00:17:54.560 |
which belongs to L2 space, and take the integral, I will have an equation. 00:18:04.400 |
And I can say that the conditional probability is the solution of this equation. 00:18:16.240 |
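In symbols, the identity and the resulting integral equation are (g is any function from L2, and x* is a free parameter):

```latex
P(y=1\,|\,x)\,p(x) \;=\; p(y=1,\,x)
\;\;\Longrightarrow\;\;
\int g(x-x^{*})\,P(y=1\,|\,x)\,dP(x)
\;=\;
\int g(x-x^{*})\,dP(y=1,\,x)
\quad\text{for all } x^{*}.
```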
You see, I put in the conditional probability and obtained something like that. But I would 00:18:29.120 |
like to solve it when I don't know the probability measure but am given data. 00:18:35.680 |
Given observations generated according to p(x), p(y | x), 00:18:43.040 |
I would like to solve this equation. But solving an equation is an ill-posed 00:18:48.560 |
problem. Okay, let's do that. But before I do, let me show 00:19:04.560 |
how to replace the unknown probability measure with the 00:19:11.200 |
empirical measure. That is the most important part; 00:19:15.760 |
it is the main induction step in statistics. In statistics, we are given 00:19:24.000 |
data, and no matter how much data we see, it is not equivalent to 00:19:32.720 |
knowing the distribution. In classical statistics, people suggested approximating the cumulative distribution 00:19:39.200 |
function by the empirical cumulative distribution 00:19:46.000 |
function. In the 1930s, mathematicians tried to prove that one 00:19:55.920 |
can do that. And in 1933, Kolmogorov found an exact bound, 00:20:01.680 |
which is shown on the last line. 00:20:07.200 |
And then people proved that we have bounds like that. 00:20:12.560 |
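The kind of bound being referred to is of the Glivenko-Cantelli / Kolmogorov type; a standard finite-sample version (the Dvoretzky-Kiefer-Wolfowitz inequality, not necessarily the exact form on the slide) reads:

```latex
F_{l}(x)=\frac{1}{l}\sum_{i=1}^{l}\theta\!\left(x-x_{i}\right),
\qquad
P\Big\{\sup_{x}\,\big|F(x)-F_{l}(x)\big| > \varepsilon\Big\}
\;\le\; 2\,e^{-2 l \varepsilon^{2}} .
```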
Now, if we can replace the unknown measure with the empirical measure, 00:20:24.320 |
back to our problem, our constructive problem: what should we do? 00:20:29.360 |
Let us replace, in the functional which we would like 00:20:33.360 |
to minimize, the real measure with the empirical measure. We will have the 00:20:42.080 |
empirical least-squares functional, which we have to minimize to solve 00:20:52.960 |
our problem, to find our conditional probability or decision rule 00:20:58.560 |
or whatever we want. But let me consider the new constructive 00:21:03.520 |
setting, where we also replace the unknown probability measure 00:21:09.200 |
with the empirical probability measure obtained on the training data. You see that 00:21:18.640 |
to find the conditional probability, we have to solve this equation. 00:21:30.240 |
The right-hand side is known, because our function g is known; 00:21:37.920 |
our goal is to find, in the set of functions, the function that solves it. 00:21:43.840 |
In classical statistics, there was one algorithm, called the Nadaraya-Watson 00:21:50.480 |
estimator, which shows how to estimate the conditional probability 00:21:54.960 |
or the regression function. They just somehow define it: 00:22:00.640 |
it is a function which has a numerator and a denominator. 00:22:15.520 |
So this is an estimate of the regression, a very general way of estimating the regression 00:22:21.920 |
or the conditional probability. And there is this whole business of how, 00:22:26.800 |
if the kernel is Gaussian, to find the best value of the variance to approximate the 00:22:33.840 |
conditional probability well. They spent a lot of time on this subject. Now look at 00:22:43.760 |
the line in the middle: the Nadaraya-Watson estimator comes from a 00:22:52.080 |
corrupted equation, not from the equation which we derived. 00:22:56.560 |
Here, in the middle, it is f(x), not f(x_i) 00:23:04.000 |
as in the last, correct, line. So you can 00:23:07.520 |
pull f(x) out of the sum and you will get this estimator. 00:23:12.080 |
So actually the classical Nadaraya-Watson estimator 00:23:16.720 |
is the solution of a corrupted version of the equation which we obtained. 00:23:24.240 |
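For concreteness, a minimal sketch of the classical Nadaraya-Watson estimator with a Gaussian kernel (the bandwidth and data are invented for illustration):

```python
# Nadaraya-Watson kernel regression: f(x*) = sum_i K(x*, x_i) y_i / sum_i K(x*, x_i).
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.3):
    """Estimate E[y | x] (the conditional probability when y is 0/1) at x_query."""
    d2 = (x_query[:, None] - x_train[None, :]) ** 2   # pairwise squared distances
    K = np.exp(-d2 / (2 * bandwidth ** 2))            # Gaussian kernel weights
    return (K @ y_train) / K.sum(axis=1)              # weighted average of labels

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
p = 1 / (1 + np.exp(-3 * x))                          # true conditional probability
y = (rng.uniform(size=x.size) < p).astype(float)      # 0/1 labels drawn from it
xq = np.linspace(-2, 2, 5)
print(np.round(nadaraya_watson(x, y, xq), 2))         # should roughly track p at xq
```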
But what does it mean to solve the equation? To solve the equation 00:23:28.960 |
means I take the difference between the left-hand side and the right-hand side, 00:23:35.600 |
define the area where I would like my function to operate, 00:23:42.800 |
take the square, integrate over some probability measure, 00:23:49.920 |
and minimize this functional. So let us do that. If you do 00:23:56.320 |
simple algebra, it is very simple and you can check it, 00:24:02.080 |
you will have a functional which is a sum of terms (y_i − f(x_i)) 00:24:09.840 |
(y_j − f(x_j)) multiplied by some coefficients. 00:24:22.960 |
We can estimate these coefficients: each coefficient is 00:24:29.680 |
the integral of g(x_i − x) g(x_j − x) over the measure, and together they form a matrix. 00:24:38.400 |
If you take the exact formula known from Nadaraya-Watson, we know this matrix. 00:24:53.440 |
If you replace this integral with the empirical integral, 00:24:59.120 |
you will have an estimate of the V matrix. If you use 00:25:10.880 |
a measure mu(x) that is uniformly distributed over an interval 00:25:24.000 |
and a Gaussian g, you will have this V matrix in closed form. So the V matrix is computable. 00:25:36.880 |
Now let me switch to vector notation. I will call Y the vector of the elements 00:25:49.520 |
y_1, ..., y_l: the vector of all the y's. 00:25:57.360 |
I will create F(f), capital F of f, which is also an l-dimensional vector: 00:26:05.040 |
I pick a function f, and the first element is the value of this function 00:26:11.360 |
at the point x_1, and the last element is the value of this function 00:26:15.280 |
at the point x_l. So this is F. And I have the V matrix. 00:26:22.400 |
Then I can rewrite this functional in 00:26:26.000 |
matrix form as (Y − F(f))ᵀ 00:26:32.800 |
V (Y − F(f)). If I write the 00:26:39.840 |
least squares method in this notation, I will have (Y − F(f))ᵀ 00:26:46.240 |
(Y − F(f)), with the identity matrix here instead of V. 00:26:53.600 |
So I got some improvement over the least squares method, 00:27:01.280 |
and I hope that it will give me a better rate of convergence. 00:27:13.440 |
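A toy numerical sketch of the comparison, under the assumption that V_ij ≈ ∫ g(x_i − t) g(x_j − t) dμ(t) with a Gaussian g and a uniform measure μ (estimated here by a crude numerical integral); the basis functions and data are invented:

```python
# Ordinary least squares (V = I) versus V-matrix least squares,
# minimizing (Y - Phi a)^T V (Y - Phi a) over a linear-in-parameters model Phi a.
import numpy as np

rng = np.random.default_rng(2)
l = 100
x = rng.uniform(0, 1, l)
p = 0.5 + 0.4 * np.sin(2 * np.pi * x)                      # true conditional probability
y = (rng.uniform(size=l) < p).astype(float)

def v_matrix(x, width=0.1, grid=np.linspace(0, 1, 200)):
    # V_ij ~ integral over a uniform measure of g(x_i - t) g(x_j - t), Gaussian g.
    G = np.exp(-(x[:, None] - grid[None, :]) ** 2 / (2 * width ** 2))
    return (G @ G.T) / grid.size

Phi = np.column_stack([np.ones(l), x, x ** 2, np.sin(2 * np.pi * x)])  # toy basis
V = v_matrix(x)
a_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)            # V = identity
a_v = np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V @ y)      # V-matrix version
print(np.round(a_ols, 2), np.round(a_v, 2))
```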
Let's see. But this is not the major point, because, okay, I improve the rate of 00:27:19.600 |
convergence, but it is still essentially the least squares method. 00:27:23.280 |
Now the most important part: selection of the admissible set of 00:27:28.160 |
functions. What does it mean? When you are constructing a neural network, 00:27:37.200 |
you say that you are designing a smart structure. 00:27:40.720 |
What does that mean? It means that you are constructing a 00:27:45.200 |
smart admissible set of functions. You know something, you are smart guys: 00:27:52.160 |
you just construct it, and then you minimize over this set of functions. 00:27:58.960 |
But let me consider from a mathematical perspective 00:28:04.240 |
what this is. If you consider a Hilbert space, there are two 00:28:15.760 |
ways of convergence. Strong convergence is convergence of functions; 00:28:22.240 |
it is the first line: my sequence of functions f_l 00:28:27.680 |
converges to f_0 if this integral of the squared difference converges to zero as l 00:28:34.640 |
goes to infinity. But there also exists weak convergence: 00:28:45.920 |
f_l converges to f_0 if this inner product converges to that inner 00:28:52.640 |
product for all functions phi from the Hilbert space. 00:28:59.440 |
You can think of this inner product as describing a 00:29:03.840 |
property of the function; if all such properties coincide, the functions coincide. 00:29:15.600 |
It is easy to show that if you have strong convergence, 00:29:21.920 |
you also have weak convergence; it follows from the 00:29:26.080 |
Cauchy-Schwarz inequality. 00:29:32.000 |
But one can also prove that if you have weak convergence 00:29:38.480 |
and your set of functions belongs to a compact set, you also have strong convergence. 00:29:44.240 |
So in some sense the two kinds of convergence are equivalent. 00:29:50.000 |
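In symbols, the two modes of convergence being contrasted are:

```latex
\textbf{Strong:}\quad
\lim_{l\to\infty}\int \big(f_{l}(x)-f_{0}(x)\big)^{2}\,d\mu(x)=0;
\qquad
\textbf{Weak:}\quad
\lim_{l\to\infty}\int f_{l}(x)\,\phi(x)\,d\mu(x)
=\int f_{0}(x)\,\phi(x)\,d\mu(x)
\;\;\forall\,\phi\in H.
```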
In our treatment of machine learning, we consider 00:29:56.080 |
strong convergence everywhere, 100%. But what about weak convergence? 00:30:13.600 |
For the pattern recognition case, in the first line I am just 00:30:17.600 |
writing the definition of weak convergence, which equals the second line. 00:30:22.560 |
That is, using the same equation as before, the integral of phi against the conditional probability over dP(x) 00:30:30.560 |
equals the integral of phi over dP(y = 1, x), for all functions phi. 00:30:35.760 |
So it must converge for all functions phi. 00:30:48.800 |
If it holds for all functions, I will have exactly one function, which is the desired one. 00:30:54.080 |
But that is not realistic. Let us do the following: 00:30:58.160 |
let us select from the Hilbert space m functions phi. 00:31:05.600 |
We will talk a lot about how to select these functions. 00:31:14.560 |
We require the equality not for all functions, but just for these m functions. 00:31:22.960 |
This defines the set of functions which satisfies these equalities. 00:31:33.520 |
The desired function satisfies this equality for any phi, any phi, because of weak convergence. 00:31:53.200 |
And if we replace the cumulative distribution function by its empirical estimate, 00:31:59.600 |
we will have, instead of an integral property like here, an empirical equality. 00:32:09.120 |
Again, let me use the same notation, the matrix scheme. 00:32:16.320 |
I will use the vector Y and the function F(f), capital F of f, 00:32:22.160 |
which is the vector of f(x_1), ..., f(x_l). 00:32:42.320 |
Because phi is a function, I can consider the vector of its values 00:32:59.680 |
at the training points. I would like my admissible set of functions 00:33:32.760 |
to satisfy the duck test: if it looks like a duck, swims like a duck, 00:33:36.760 |
and quacks like a duck, then it probably is a duck. 00:34:01.920 |
What this means is that we have statistical invariants, predicates 00:34:06.600 |
which explain what "looks", "swims", and "quacks" mean, 00:34:23.200 |
and then your admissible functions will be the functions which keep these invariants. 00:34:35.960 |
The concept of a predicate is very different from a feature. 00:35:04.520 |
With every invariant you add, the VC dimension of the admissible set of functions decreases, because the set of candidate decisions becomes less and less diverse. 00:35:11.680 |
So what is the exact setting of the complete learning problem? 00:35:24.480 |
It is minimization of the least-squares functional, subject to this constraint, 00:35:28.080 |
and this constraint is what you would like to see: the invariants. 00:35:40.160 |
But existing classical methods of pattern recognition do not use such conditional minimization. 00:36:14.680 |
The approximation I use is an unconditional optimization: 00:36:21.440 |
instead of minimizing subject to this constraint, I will do the following: 00:36:52.700 |
I combine the two terms with a weight that says how important the data are and how important it is for me to satisfy the constraint. 00:37:11.900 |
And here P is just the covariance matrix of your predicates. 00:37:58.120 |
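One way to write down what is being described, following the structure of Vapnik's LUSI formulation (the l × m matrix Φ of predicate values φ_k(x_i), the weight τ, and the normalization of P are my assumptions about the slide's notation):

```latex
\text{Exact: }\;
\min_{f}\; \big(Y-F(f)\big)^{\!\top} V \big(Y-F(f)\big)
\quad\text{s.t.}\quad
\Phi^{\top} F(f) \;=\; \Phi^{\top} Y ;
\qquad
\text{Approximate: }\;
\min_{f}\; \big(Y-F(f)\big)^{\!\top} \big(V+\tau P\big)\big(Y-F(f)\big),
\;\; P=\tfrac{1}{m}\,\Phi\,\Phi^{\top}.
```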
But so far this was written for any set of functions. 00:38:30.960 |
Now, this is the definition of a reproducing kernel Hilbert space. 00:38:36.340 |
You have some kernel, which is a Mercer kernel; 00:38:41.740 |
if you take its inner product with a function f(x), you get the same function back. 00:38:48.420 |
That is why it is called a reproducing kernel Hilbert space. 00:38:53.020 |
And it is known that a Mercer kernel has an expansion 00:38:57.240 |
with coefficients lambda, where the lambdas are non-negative values. 00:39:41.460 |
If you take this set of functions and introduce a 00:39:46.020 |
special inner product, an inner product of this type, 00:39:55.500 |
then you will have a reproducing kernel Hilbert space. 00:40:07.280 |
In such a space you have a great theorem, called the representer theorem: 00:40:17.000 |
if you minimize a functional over a subset of functions, and the subset of functions 00:40:28.960 |
is defined by a bounded norm, then your solution has a linear representation 00:40:33.600 |
over the kernel with a finite number of parameters. 00:40:36.840 |
So let me introduce a matrix and a vector of functions, 00:41:14.040 |
functions from the reproducing kernel Hilbert space, 00:41:23.160 |
and this is F(f) for f from the reproducing kernel Hilbert space. 00:41:31.400 |
The subset of functions with bounded norm has finite VC dimension, 00:41:48.000 |
and to control the VC dimension you should control just C, the bound on the norm. 00:42:04.880 |
Minimization in a reproducing kernel Hilbert space has a closed-form solution: 00:42:09.080 |
you minimize this functional, subject to this constraint, 00:42:12.900 |
the constraint being the bound on the norm, and this will give you 00:42:17.900 |
your solution. It is a linear expansion 00:42:23.960 |
over this vector of functions, and the value of the coefficients 00:42:28.960 |
is given in closed form: 00:42:39.320 |
it is a function of matrices, a multiplication of matrices. 00:42:52.340 |
With the identity matrix instead of V you have the classical solution. 00:43:12.380 |
So minimization in a reproducing kernel Hilbert space has a closed-form solution, 00:43:32.540 |
and this is how the closed-form solution looks. 00:43:35.280 |
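A small numerical sketch of a closed-form kernel solution of this type. I assume the expansion f(x) = Σ_i a_i K(x_i, x) with coefficients A = (V K + γ I)⁻¹ V Y, which reduces to ordinary kernel ridge regression when V is the identity; the kernel, the regularization γ, and the data are illustrative, and the invariant constraint is omitted here.

```python
# Closed-form kernel solution sketch: f(x) = sum_i a_i K(x_i, x),
# with coefficients A = (V K + gamma I)^{-1} V Y  (V = I gives kernel ridge regression).
import numpy as np

def gaussian_kernel(A, B, width=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (80, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)            # toy two-class labels

K = gaussian_kernel(X, X)
V = np.eye(len(X))                                   # plug in a V-matrix here if desired
gamma = 1e-2
A = np.linalg.solve(V @ K + gamma * np.eye(len(X)), V @ y)

X_test = rng.uniform(-1, 1, (5, 2))
f_test = gaussian_kernel(X_test, X) @ A              # estimated conditional probability
print(np.round(f_test, 2), (f_test > 0.5).astype(int))
```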
Your solution is the set of coefficients of this expansion; 00:44:21.620 |
in a particular case you will get exactly the support vector machine. 00:44:26.620 |
If you look at how the support vector machine is derived, 00:44:38.380 |
I treat the data like in the support vector machine: 00:44:44.300 |
y_i minus (K A)_i, that is, I would like to approximate the data; 00:44:50.180 |
there is a norm term like here, but I would also like to keep the invariants. 00:45:34.660 |
If I make the requirement strong for the invariants 00:45:39.660 |
and not too strong for approximating the function, 00:45:59.180 |
I will find that my coefficients A are defined by the invariants, 00:46:08.620 |
while my knowledge enters through the predicates, through how I choose the predicates. 00:46:16.060 |
That means that I can choose even one good function, one predicate, 00:46:38.280 |
and for estimating the coefficients I do not need much data. 00:47:03.360 |
This shows that the philosophy of this learning is very different. 00:47:29.480 |
The solutions belong to a one-layer network, not a multi-layer network, 00:47:34.480 |
because of the reproducing kernel Hilbert space. 00:47:44.080 |
Maybe it is not the best idea to use a deep network. 00:47:49.080 |
Deep networks, okay, we will discuss this. 00:47:55.120 |
The observation vectors and the kernel define the basis for the expansion. 00:48:12.320 |
And what it means, if you write it in this form, 00:48:24.520 |
where Phi is the predicate matrix, so you have a vector here, 00:48:38.200 |
is that you have a different formulation of the learning problem, 00:48:56.720 |
since any function from the Hilbert space can be used as a predicate. 00:49:13.380 |
So we have a chance to pick several smart functions, 00:49:18.940 |
or maybe even one, and it will be enough for our training. 00:49:23.460 |
And then we will talk about what "smart" means. 00:49:26.020 |
It is intelligent learning, not just brute-force learning. 00:49:30.300 |
So let me give you an illustration, to get a feeling for it. 00:49:48.780 |
Here, if I introduce invariants, I do better; 00:49:53.220 |
here, I use both the invariants and the V matrix. 00:50:01.400 |
This is for 48 points, and to be sure that it is a difficult problem, 00:50:07.320 |
we took 16 points for one class and 32 from the other class. 00:51:39.260 |
If the existing solution does not keep some invariant, then you have a chance to improve your performance. 00:51:53.940 |
You find the situation, the box, where the existing solution 00:52:03.180 |
contradicts the evidence, where it does not keep the invariant, 00:52:14.660 |
and you obtain a new approximation which resolves this contradiction. 00:52:22.780 |
So you are looking for, you are inventing, invariants; it is the same method 00:52:31.060 |
that physicists use to discover the laws of nature: 00:52:39.180 |
they look for situations where the existing theory contradicts observations, 00:52:58.460 |
and then build a theory which does not contain the observed contradiction. 00:53:02.260 |
And the most important part in physics, as here, 00:53:06.340 |
is also the most difficult part of scientific discovery. 00:53:22.620 |
But we can use our theory for neural nets as well. 00:53:31.980 |
In a neural net you are minimizing the least-squares error. 00:53:35.600 |
I would instead minimize the functional with this approximation of the invariants, 00:53:48.140 |
with the V plus P matrix, where the matrix P is the matrix built from the invariants. 00:53:53.140 |
And then I do the same back-propagation. 00:53:58.700 |
It so happens that I can easily do back-propagation 00:54:02.780 |
not just for the identity matrix but also for this matrix: 00:54:10.380 |
I need only one correction in back-propagation, 00:54:18.860 |
in the place where you have y minus what you have on the last layer. 00:54:45.500 |
The forward step is okay, it is the same. 00:54:49.420 |
The back-propagation step gives the boundary condition, 00:54:54.260 |
and in the back-propagation step you should do this small correction. 00:55:03.020 |
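A minimal sketch of this modified training objective, assuming the loss is the quadratic form (Y − F)ᵀ(V + τP)(Y − F) computed per batch; the network, the choice of predicate (here the all-ones zero-order predicate), and V = I follow the simplifications mentioned in the talk, but the details are illustrative and autograd replaces the hand-derived back-propagation correction.

```python
# Train a small network with the (V + P)-weighted quadratic loss instead of plain MSE.
import torch

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).float()            # toy binary labels

net = torch.nn.Sequential(
    torch.nn.Linear(20, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1), torch.nn.Sigmoid()
)
opt = torch.optim.SGD(net.parameters(), lr=0.1)

def lusi_loss(f, y, phi, tau=1.0):
    """(y - f)^T (V + tau * P) (y - f) with V = I and P built from predicate values phi."""
    diff = (y - f).unsqueeze(1)                       # column vector, shape (b, 1)
    P = phi @ phi.T / phi.shape[1]                    # covariance-style predicate matrix
    M = torch.eye(len(y)) + tau * P
    return (diff.T @ M @ diff).squeeze() / len(y)

for step in range(200):
    idx = torch.randint(0, len(X), (64,))
    xb, yb = X[idx], y[idx]
    phi = torch.ones(len(xb), 1)                      # zero-order predicate phi(x) = 1
    f = net(xb).squeeze(1)
    loss = lusi_loss(f, yb, phi)
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```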
So I came to NSE and asked the guys who have a neural network to try it. 00:55:14.720 |
We got a small improvement; I took just one invariant, 00:55:27.420 |
and I will show you that it is not such a simple invariant. 00:55:33.620 |
And we set V equal to the identity; we did not use the V matrix. 00:55:42.460 |
We used a thousand examples, a hundred per class, batch size six. 00:55:55.020 |
It is a deep neural network; they have a good neural network. 00:56:17.940 |
The predicate is a cosine coefficient: I take the picture of my digit 00:56:22.940 |
and make a cosine expansion, so I have Fourier coefficients, 00:56:42.140 |
and I use a predicate built from this one coefficient. 00:57:33.760 |
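As an illustration of such a predicate, here is how one low-order cosine (DCT-type) coefficient of a digit image could be computed and used as φ(x); the specific frequency indices are my choice:

```python
# A predicate phi(x) = one low-frequency cosine coefficient of the image x.
import numpy as np

def cosine_coefficient(img, u=1, v=2):
    """Project a 2-D image onto the (u, v) cosine basis function (a DCT-II style coefficient)."""
    h, w = img.shape
    cy = np.cos(np.pi * (np.arange(h) + 0.5) * u / h)
    cx = np.cos(np.pi * (np.arange(w) + 0.5) * v / w)
    return float(cy @ img @ cx)

# The invariant then requires that, on the training set,
#   (1/l) * sum_i phi(x_i) f(x_i)  ==  (1/l) * sum_i phi(x_i) y_i .
img = np.random.default_rng(4).random((28, 28))      # stand-in for a digit image
print(cosine_coefficient(img))
```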
The statistical part of learning theory is complete, 00:57:43.480 |
because there exist only two ways for convergence: 00:57:49.160 |
convergence in functions and convergence in functionals. 00:58:01.400 |
You can play one game, or the other, or the two games together. 00:58:22.880 |
But then you should ask yourself how to choose the invariants, 00:58:28.000 |
and the invariant is something about intelligence. 00:59:32.760 |
I think that this is a many-hundred-year-old story; 00:59:38.040 |
it is the subject of major philosophy, from Plato to Hegel to Wigner. 00:59:48.320 |
So my claim is that we should talk not about imitation of intelligence, but about its essence. 01:00:19.880 |
The expected number of elements of class y equal to one 01:00:29.200 |
should equal the number of training examples of the first class. 01:00:38.080 |
With this invariant you will pick a function which gives you 01:00:48.940 |
a number of representatives of the first class exactly the same 01:01:04.160 |
as in the training data; you can use it for a neural net and 10-class digit recognition. 01:01:21.800 |
Here the expectation is taken with respect to the conditional probability. 01:02:07.320 |
Using predicates, I construct invariants 01:02:21.000 |
like the correlations which I observed on my training data. 01:02:39.200 |
I am not so smart as to construct the most general predicate, 01:02:47.800 |
like a function u(x_1, x_2) of two variables, 01:02:57.120 |
together with the labels y. 01:03:00.360 |
But let us consider a predicate like this: I take the image, 01:03:16.360 |
and require that its correlation with the decision be the same as I see on my training data. 01:03:25.480 |
By the way, a convolutional neural network is one predicate: 01:03:33.160 |
it is this convolution over the points of the image. 01:03:41.460 |
You can use wavelets, you can use whatever you want. 01:04:00.160 |
To understand intelligence means to understand which predicates are involved in it. 01:04:04.140 |
But also, see the difference between a predicate and an invariant. 01:04:17.360 |
The invariant is what makes your abstract concept specific. 01:04:37.640 |
Let us consider just digit recognition in pixel space. 01:04:57.040 |
Take a small linear transformation of your pixel space. 01:05:00.720 |
If you have a small linear transformation of pixel space, 01:05:11.720 |
you can also describe the transformation using a Lie derivative. 01:05:23.440 |
We show this transformation in the first line: 01:05:29.400 |
in pixel space you make a linear transformation, 01:06:07.240 |
and the image is transformed according to the Lie derivative. 01:06:26.360 |
And from any data you can get this predicate. 01:06:36.560 |
Here it is horizontal translation: the first coordinate, x, plus a. 01:06:50.360 |
Here is the standard geometric formula for rotation. 01:07:05.840 |
And this is, again, an illustration from Patrice Simard: 01:07:19.360 |
you choose one, two, three, four, five Lie derivatives, 01:07:41.960 |
and we take invariants with respect to the Lie derivatives. 01:07:47.640 |
That is, I would like to learn using statistical invariants 01:07:58.160 |
which keep the invariance with respect to all the Lie derivatives. 01:08:12.680 |
So, suppose I have this set of clones of my digit, 01:08:17.680 |
and this is the set of clones of another digit: 01:08:37.080 |
they are obtained by massaging the originals with small linear transformations. 01:08:51.800 |
And now, I believe that symmetry is a general concept. 01:09:13.340 |
What I can do: here is my digit three. 01:09:22.240 |
I take the first line here, the second line here, and compare them. 01:10:01.480 |
The tangent distance between these two different pictures 01:10:08.920 |
is the coefficient of symmetry of this three. 01:10:12.660 |
And I can speak of horizontal symmetry, vertical symmetry, and so on. 01:10:43.160 |
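A simple sketch of one such predicate, a horizontal-symmetry coefficient comparing an image with its left-right mirror (a simplification of the tangent-distance-based coefficient described in the talk):

```python
# Predicate phi(x) = degree of horizontal symmetry of the image x,
# here simply the negative mean squared difference between x and its mirror image.
import numpy as np

def horizontal_symmetry(img):
    mirrored = img[:, ::-1]                          # flip left-right
    return -float(np.mean((img - mirrored) ** 2))    # 0 for a perfectly symmetric image

rng = np.random.default_rng(5)
half = rng.random((28, 14))
symmetric = np.hstack([half, half[:, ::-1]])
print(horizontal_symmetry(symmetric))                # 0.0
print(horizontal_symmetry(rng.random((28, 28))))     # negative
```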
To summarize: we say that we can minimize this functional, 01:10:47.280 |
which is slightly better than, say, plain least squares, 01:10:50.960 |
subject to this constraint, which is the serious part, 01:10:54.800 |
because this constraint defines the admissible set of functions. 01:11:04.680 |
So I can consider, say, all continuous functions, 01:11:13.400 |
and then select by smart predicates the admissible set of functions. 01:11:18.400 |
And I can do it because of weak convergence. 01:11:56.120 |
The theory goes in the direction of searching for predicates. 01:12:21.200 |
How many are needed? I believe it is not a lot, and I will show you why. 01:12:27.280 |
And I think that this line of thought is a very old line. 01:13:09.180 |
Classical German philosophy was about that. 01:13:16.260 |
Hegel thought that whatever is reasonable exists, and whatever exists is reasonable. 01:13:28.900 |
But more recently, 60 years ago, Wigner wrote an article 01:13:35.500 |
about the unreasonable effectiveness of mathematics: 01:13:48.460 |
you should look into the equations and you will see how the world works. 01:14:02.900 |
These two concepts reflect the essence of intelligence, 01:14:08.940 |
not just its imitation, which is what artificial intelligence does. 01:14:30.220 |
On digits, people achieved a 0.5% error rate using 60,000 observations. 01:15:07.540 |
Propp found a small set of units that allow one to synthesize all Russian folk tales, 01:15:24.100 |
and the same units apply to film, to television, to television series. 01:15:47.240 |
Probably there exists a small number of predicates, 01:15:57.860 |
and that is what I believe learning should be about. 01:16:20.960 |
...that you recommend for language classification tasks. 01:16:25.560 |
And the second question is, do you have any strategies 01:16:49.300 |
given that you have a small amount of data for selecting the function? 01:17:03.580 |
So the more predicates, the less overfitting happens. 01:17:09.420 |
- And if you look at the theory, the mathematics says 01:17:11.760 |
that if you have an infinite number of predicates, 01:17:14.140 |
you are left with one function, the one which you want. 01:17:24.080 |
The next question was about natural language processing and the Turing test. 01:17:41.860 |
- Whatever I am talking about is very trivial, very simple. 01:17:52.940 |
And you can think, like this guy, Vladimir Propp. 01:18:19.620 |
Yeah, but don't take on the immediately hard problems. 01:18:25.420 |
- Yeah, I try; they are very simple, just step by step. 01:18:29.460 |
It so happens that this is the main line of philosophy, 01:18:38.380 |
from Plato to this guy, who says that there are not too many ideas. 01:18:54.020 |
- Vladimir, thank you so much for coming today.