
Complete Statistical Theory of Learning (Vladimir Vapnik) | MIT Deep Learning Series


Chapters

0:00 Introduction
0:46 Overview: Complete Statistical Theory of Learning
3:47 Part 1: VC Theory of Generalization
11:04 Part 2: Target Functional for Minimization
27:13 Part 3: Selection of Admissible Set of Functions
37:26 Part 4: Complete Solution in Reproducing Kernel Hilbert Space (RKHS)
53:16 Part 5: LUSI Approach in Neural Networks
59:28 Part 6: Examples of Predicates
70:39 Conclusion
76:10 Q&A: Overfitting
77:18 Q&A: Language

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today we're happy and honored to have Vladimir Vapnik with us,
00:00:04.160 | co-inventor of support vector machines, support vector clustering, the VC theory of
00:00:08.480 | statistical learning, and author of Statistical Learning Theory.
00:00:12.560 | He's one of the greatest and most impactful statisticians and computer
00:00:17.120 | scientists of our time. Plus he grew up in the Soviet Union,
00:00:22.080 | eventually heading the computer science department, Institute of Control Sciences
00:00:26.160 | in Moscow. So he will give today's lecture in
00:00:29.440 | Russian and I will translate. Just kidding, right?
00:00:35.520 | It's an honor and a pleasure to have Vladimir with us today, so please give
00:00:41.200 | him a warm welcome.
00:00:44.320 | Thank you.
00:00:49.120 | About 50 years ago, Professor Chervonenkis and I started statistical
00:00:57.120 | learning theory. The problem was to answer the question:
00:01:02.000 | when, if you do well on the training data,
00:01:06.400 | if you have a small number of errors on the training data,
00:01:10.320 | will you also do well on the test data, minimizing the expectation of error?
00:01:19.920 | So we solved this problem. And this theory appears in almost all
00:01:28.320 | books; they may be written in different
00:01:30.800 | languages, but mostly they follow this line. The
00:01:34.720 | line is that the
00:01:38.160 | law of large numbers alone is not enough. We need a uniform law of large numbers,
00:01:43.200 | uniform convergence, and so on. But because we started this discussion
00:01:49.840 | with the empirical error, how well you do on the
00:01:54.480 | training data, and the bound shows that the better you do on the training data,
00:01:59.360 | the better you will do on the test data, people decided
00:02:03.120 | that the only way is to take training data and do something
00:02:08.800 | with this training data to minimize the number of errors.
00:02:13.360 | And all algorithms were constructed based on this principle.
00:02:19.680 | About five years ago, I found that there exists another principle,
00:02:25.680 | even more interesting than this one, because
00:02:28.960 | this first principle is a brute-force principle:
00:02:32.320 | give me more data, and I will give a better answer.
00:02:35.360 | The second principle is an intelligent principle.
00:02:38.960 | And I will talk about it today. So I will start with statistical
00:02:45.200 | learning theory, and then I will introduce this new theory.
00:02:50.000 | Because there are only two ways to generalize, one with data,
00:02:56.160 | and another which I will show you, I call it complete statistical learning
00:03:00.960 | theory, because there is no
00:03:05.200 | other way to do something. There is no
00:03:09.040 | third way of generalization. You should use both of them. So that is a
00:03:16.080 | complete theory. But it is not so bad, because
00:03:21.200 | you will see that learning theory moves in a different direction,
00:03:27.920 | in the direction of intelligence, toward understanding what intelligence is.
00:03:33.040 | It is not the same as what Turing explained.
00:03:36.560 | Turing said that you should imitate an intelligent person.
00:03:40.960 | But now the question is: what is intelligence? We will discuss this. So let me start.
00:03:49.360 | The first part is the theory of generalization.
00:03:55.520 | And the question is: when, in
00:03:59.680 | a given set of functions, can you minimize a
00:04:03.280 | functional? This is a pretty general functional:
00:04:07.360 | L(y, f(x)), where L is a loss function, and I
00:04:15.280 | consider the difference between y and the function.
00:04:18.960 | This is what we do in regression and pattern recognition and
00:04:24.560 | all this stuff, but this is a more general setting. What is
00:04:28.240 | more important is that when we consider minimization of a
00:04:32.720 | functional, we should say over which set of functions.
00:04:37.040 | In a given set of functions, we have to minimize the functional.
00:04:41.920 | The probability measure is unknown, but we are given
00:04:45.840 | i.i.d. data pairs. That is the exact setting of both the pattern recognition
00:04:54.160 | and the regression estimation problem. For pattern recognition the set of functions is
00:05:02.640 | the indicator functions. For regression estimation it is
00:05:09.520 | continuous functions, but the statement is the same.
00:05:15.840 | And the answer is very simple. We can minimize
00:05:20.240 | this functional using data if and only if the dimension
00:05:27.280 | h of the set of functions is finite. Everything depends
00:05:31.440 | on the set of functions over which you have to minimize this functional.
00:05:35.680 | And you cannot avoid that. So we call it capacity,
00:05:40.560 | but maybe it is better called diversity of the set of functions.
00:05:45.280 | And the measure of this capacity, or diversity, is the VC dimension.
00:05:51.360 | And what is the VC dimension? I will give you the definition.
00:05:56.000 | First, for a set of indicator functions: theta is the step function.
00:06:00.160 | You consider a real-valued function f, and in front of it you take the
00:06:06.720 | indicator. If the
00:06:10.640 | function is positive, you say one; if it is not positive, you say zero.
00:06:17.040 | That is the indicator. And the VC dimension of a set of indicator functions
00:06:23.200 | equals h if h is the maximal number of vectors
00:06:31.040 | that can be shattered, that is, separated in all possible ways into
00:06:35.200 | 2^h subsets, using indicator functions from this set.
00:06:43.120 | So you have a set; you should find h vectors
00:06:50.480 | which you can shatter in all possible ways, in all possible ways.
00:06:54.480 | But you cannot shatter any h plus one vectors. Then the VC dimension
00:07:00.480 | of this set of functions is h. This is a purely combinatorial definition.
00:07:07.520 | And if you can shatter ℓ vectors for any ℓ, then we say that the VC dimension is infinite.
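A minimal sketch of the shattering check, assuming a simple family of 1-D threshold indicators f(x) = 1[x > w]; the family and the parameter grid are illustrative choices, not from the lecture:

```python
import itertools
import numpy as np

def shatters(points, classify):
    """Check whether the family {x -> classify(x, w) : w} shatters the given points,
    i.e. realizes all 2^h labelings (brute force over a parameter grid)."""
    h = len(points)
    realized = set()
    for w in np.linspace(-2.0, 2.0, 401):          # sweep the single parameter
        labels = tuple(classify(x, w) for x in points)
        realized.add(labels)
    return len(realized) == 2 ** h

threshold = lambda x, w: int(x > w)                # indicator of a linear (1-D) function
print(shatters([0.5], threshold))                  # True: one point is always shattered
print(shatters([0.2, 0.8], threshold))             # False: the labeling (1, 0) is unreachable
# Hence the VC dimension of 1-D threshold indicators is 1.
```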
00:07:15.520 | And then I will give you two theorems which we will use,
00:07:19.680 | and then the main theorem of the theory.
00:07:24.880 | If a set of functions has VC dimension h, then with probability 1 - eta,
00:07:31.360 | for all functions in the set, the bound holds true.
00:07:38.080 | And because this bound holds true for all functions,
00:07:41.200 | and you would like to have the minimal estimate, the minimal right-hand
00:07:49.040 | side, you will choose the function which minimizes the empirical loss.
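The bound being referred to has, in standard statements of VC theory, roughly this form (a hedged reconstruction; the exact constants on the slide may differ):

```latex
% With probability at least 1 - \eta, simultaneously for all f in a set of VC dimension h:
R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\;
\sqrt{\frac{h\left(\ln\frac{2\ell}{h} + 1\right) - \ln\frac{\eta}{4}}{\ell}},
\qquad
R(f)=\int L\big(y,f(x)\big)\,dP(x,y),\quad
R_{\mathrm{emp}}(f)=\frac{1}{\ell}\sum_{i=1}^{\ell} L\big(y_i,f(x_i)\big).
```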
00:07:56.240 | And the second theorem: if you have a set of
00:08:01.360 | indicator functions built from linear functions, and it so happens
00:08:05.280 | that your vectors x lie inside a sphere of radius one,
00:08:10.560 | and w lies inside a ball of radius C, then the VC dimension is bounded by the minimum of two
00:08:18.400 | values, C and n, where n is the dimensionality of the vector, plus one.
00:08:24.960 | That means exactly that the VC dimension can be smaller than the dimensionality of the space.
00:08:33.200 | And you can somehow control the VC dimension, and I will show you how to do it.
00:08:38.640 | It is a very important theorem.
00:08:42.560 | But what is the general way to adjust the VC dimension while
00:08:47.760 | searching for a function? You have a set of functions.
00:08:52.320 | Maybe the set of functions has infinite VC dimension.
00:08:55.920 | Then you make a structure on the set of functions.
00:08:59.920 | You choose a subset of functions, a small subset,
00:09:04.080 | with VC dimension h1, then another subset which includes the small one,
00:09:10.000 | with VC dimension h2, and so on, so we have nested subsets of functions
00:09:15.840 | with nested VC dimensions. And then, when you are minimizing the
00:09:20.320 | functional, you are doing two things. You choose an appropriate subset,
00:09:27.040 | and then in this subset you pick the function which minimizes the
00:09:33.440 | empirical loss.
00:09:37.040 | But you can see that the epsilon in the bound depends on the VC dimension
00:09:44.720 | of the subset you choose,
00:09:47.680 | the ratio of the VC dimension to ℓ. And it also depends on the empirical loss
00:09:52.960 | which you achieve. So, as soon as you make a structure,
00:09:59.680 | you can do whatever you want, even if you have infinite VC dimension in the initial
00:10:06.480 | set of functions.
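As a sketch of how the nested structure is used in practice (structural risk minimization over subsets of growing capacity), the following assumes nested polynomial classes and a VC-style penalty; the capacity assignment h = d + 1 and the constants are illustrative assumptions, not the lecture's:

```python
import numpy as np

def vc_bound(emp_risk, h, n, eta=0.05):
    """Guaranteed risk: empirical risk plus a VC-style capacity term
    (constants follow one common form and may differ between texts)."""
    eps = np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)
    return emp_risk + eps

def srm_select(x, y, max_degree=10):
    """Structural risk minimization over nested polynomial classes S_1 ⊂ S_2 ⊂ ...:
    in each element of the structure, minimize empirical loss; across elements,
    pick the one with the smallest guaranteed risk."""
    n = len(y)
    best = None
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, d)                      # empirical minimizer in S_d
        emp = np.mean((np.polyval(coeffs, x) - y) ** 2)
        bound = vc_bound(emp, h=d + 1, n=n)               # assumed capacity of S_d
        if best is None or bound < best[0]:
            best = (bound, d, coeffs)
    return best                                           # (guaranteed risk, degree, params)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(40)
bound, degree, _ = srm_select(x, y)
print(f"SRM picks degree {degree} with guaranteed risk {bound:.3f}")
```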
00:10:11.120 | But this is more or less all that the main results of VC theory contain.
00:10:19.120 | And VC theory does not answer four very important questions.
00:10:26.080 | How to choose the loss function L(y, f)?
00:10:30.480 | I said: any function. How to select the admissible set of functions
00:10:35.920 | f(x)? I told you: a given set of functions, but maybe this is a stupid
00:10:42.960 | set of functions. How to construct a good set of functions?
00:10:50.080 | How to construct a structure on the admissible set of functions?
00:10:55.440 | And then, how to minimize the functional, having constructed the structure?
00:11:00.160 | In this talk, I will try to answer all these questions.
00:11:04.720 | The target functional for minimization.
00:11:08.640 | And this is an important slide. God plays dice.
00:11:15.120 | What is the setting of the pattern recognition problem?
00:11:19.360 | In this talk I will consider just the pattern recognition
00:11:23.840 | problem for two-class classification, but the generalization is straightforward.
00:11:31.920 | So what is the setting of the pattern recognition problem?
00:11:37.840 | There is a generator, nature, which generates randomly
00:11:42.160 | and independently situations x, which come to an object.
00:11:48.000 | And this object has a conditional probability of y
00:11:52.320 | given x: the probability that y equals one given x, and that y equals zero given x.
00:12:01.840 | The object knows this conditional probability
00:12:05.440 | and plays dice. It has the conditional probability function, it has x
00:12:12.880 | on the input, it plays dice and says y. That is the most general setting of the
00:12:19.200 | learning problem. The deterministic case is a particular case of this
00:12:23.280 | setting. And what does the learning machine do? A
00:12:26.640 | learning machine has a set of functions, and it can choose any
00:12:31.440 | function from this set. And the problem is,
00:12:35.520 | observing ℓ observations, (x1, y1), ..., (xℓ, yℓ), to pick up
00:12:44.240 | the function for classification. That means: given
00:12:52.800 | observations generated by the joint probability
00:12:58.160 | P(x, y) = P(y|x) P(x), that is the whole scheme,
00:13:05.760 | find the rule that minimizes the functional. In my case, when
00:13:10.640 | y is zero or one, and f is also zero or one, it is an indicator function,
00:13:17.280 | so the loss functional just counts how many errors I make.
00:13:22.240 | And with the probability measure, it is the
00:13:25.920 | expectation of error. I would like to find the function which guarantees
00:13:32.320 | the smallest expectation of error. But this is not very good. Why is it not
00:13:38.640 | very good? Because this functional involves an indicator which is
00:13:44.560 | zero everywhere except at some points, where it is one
00:13:50.960 | or minus one. So its derivative is
00:13:58.000 | zero everywhere except at those isolated points.
00:14:05.360 | So I cannot use the gradient in this case. I should do something smarter.
00:14:12.640 | And what do people do? They replace
00:14:15.760 | the indicator loss with just (y minus f(x)) squared,
00:14:23.520 | the least-squares error,
00:14:26.480 | instead of what I formulated before. It is not such a bad choice, because it so
00:14:33.840 | happens that the minimum of this functional is the
00:14:38.960 | conditional probability function, the probability that y
00:14:42.640 | equals one given x. And when we can find this probability that y
00:14:50.720 | equals one given x, we can easily construct
00:14:54.320 | our decision rule. We just take the function that says first class if
00:15:01.520 | the conditional probability exceeds 0.5, and second class if it is less than 0.5.
00:15:08.080 | And this is the optimal solution.
00:15:13.120 | But something is wrong with this replacement.
00:15:18.400 | Let us rewrite the first line. Inside the bracket, I subtract
00:15:26.160 | from the first term the regression, and then add the regression back. So I
00:15:32.320 | have two brackets instead of one.
00:15:36.320 | And then I square. The last line
00:15:40.240 | shows that the first integral does not depend
00:15:44.320 | on the function I am looking for, so I have to minimize my functional
00:15:51.200 | over f, over the set of functions, as just the sum of the two last terms.
00:16:00.240 | How did I get this? It is just the usual binomial expansion of two terms:
00:16:07.440 | the square of the first, the square of the second, and twice the cross term.
00:16:15.920 | But our goal is to minimize the first of these two terms,
00:16:20.960 | to find the function which is close to the conditional probability function,
00:16:26.320 | not the sum of the two. We can show that the second
00:16:29.680 | term eventually goes to zero,
00:16:33.200 | but it slows down the rate of convergence.
00:16:38.240 | To have a good rate of convergence, we need to find a way to minimize
00:16:46.160 | the first term alone, not the sum of these two. And that means not the least-squares
00:16:53.120 | loss, but something else. What can we do?
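A hedged reconstruction of the decomposition being described, written for the empirical functional, with r(x) = P(y = 1 | x) denoting the regression (conditional probability); the slide's exact form may differ:

```latex
\frac{1}{\ell}\sum_{i=1}^{\ell}\big(y_i-f(x_i)\big)^2
=\underbrace{\frac{1}{\ell}\sum_{i=1}^{\ell}\big(y_i-r(x_i)\big)^2}_{\text{does not depend on }f}
+\frac{1}{\ell}\sum_{i=1}^{\ell}\big(r(x_i)-f(x_i)\big)^2
+\frac{2}{\ell}\sum_{i=1}^{\ell}\big(y_i-r(x_i)\big)\big(r(x_i)-f(x_i)\big).
```

The middle term is the one we actually want to make small; the cross term vanishes only as ℓ grows, and that is what slows down the rate of convergence.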
00:16:58.640 | There exists the following.
00:17:01.600 | First of all, when y is zero or one, the probability of
00:17:11.600 | y equal to one given x is some real-valued function between zero
00:17:18.560 | and one. And from the Bayes formula,
00:17:26.080 | we know that the conditional probability of y
00:17:30.400 | equal to one given x, times the density p(x), is the joint density p(y = 1, x).
00:17:37.440 | That is just
00:17:42.400 | always true. Now if I multiply by some function g(x - x*),
00:17:54.560 | which belongs to L2 space, and take the integral, I will have this
00:18:01.840 | equation.
00:18:04.400 | And I can say that the conditional probability is a solution of this equation.
00:18:11.440 | It was constructed like that:
00:18:16.240 | you see, I put in the conditional probability and did something like that. But I would
00:18:22.800 | like to solve this equation to find the function
00:18:29.120 | when I do not know the probability measure, but I am given data.
00:18:35.680 | Given observations generated according to P(x), P(y|x),
00:18:43.040 | I would like to solve this equation. But solving an equation is an ill-posed
00:18:48.560 | problem.
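A hedged reconstruction of the equation being described, obtained from the Bayes identity P(y=1|x) p(x) = p(y=1, x) by multiplying with a kernel g in L2 and integrating:

```latex
\int g(x - x^{*})\, f(x)\, dP(x) \;=\; \int g(x - x^{*})\, dP(y{=}1, x)
\qquad \text{for all } x^{*},
```

whose solution f(x) in the given set of functions is the conditional probability P(y=1|x); the constructive setting below replaces P by its empirical estimate.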
00:18:53.680 | Okay, let's do that. But before I do
00:18:56.640 | that, I would like to mention that in
00:19:04.560 | classical statistics there is a way
00:19:11.200 | to replace the unknown probability measure with the
00:19:15.760 | empirical measure. And that is the most important part,
00:19:20.640 | the main induction step in statistics. In statistics, we are given
00:19:24.000 | data and would like to know a function. And it
00:19:28.080 | does not matter how many data points we see; they are not equivalent to a
00:19:32.720 | function. So in
00:19:39.200 | classical statistics, people suggest approximating the cumulative distribution
00:19:43.520 | function by the empirical cumulative distribution
00:19:46.000 | function. For about 30 years, mathematicians tried to
00:19:51.760 | prove that this is a good idea, that we
00:19:55.920 | can do that. And in 1933, Kolmogorov found the exact bound,
00:20:01.680 | which is on the last line,
00:20:07.200 | and then people proved that we have bounds like that.
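The kind of bound being referenced has the following familiar form (a hedged reconstruction in the Kolmogorov-Smirnov / Dvoretzky-Kiefer-Wolfowitz style; the slide's exact constant may differ):

```latex
P\Big\{ \sup_{x}\,\big| F(x) - F_{\ell}(x) \big| > \varepsilon \Big\}
\;\le\; 2\exp\{-2\varepsilon^{2}\ell\},
\qquad
F_{\ell}(x) = \frac{1}{\ell}\sum_{i=1}^{\ell} \mathbf{1}\{x_i \le x\}.
```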
00:20:12.560 | Now, if we can replace the unknown measure with the empirical measure,
00:20:19.760 | we can state our
00:20:24.320 | constructive problem: what should we do?
00:20:29.360 | Let us replace, in this functional which we would like
00:20:33.360 | to minimize, the real measure by the empirical measure. And we will have the
00:20:42.080 | empirical least-squares functional, which we have to minimize
00:20:49.440 | to solve
00:20:52.960 | our problem, to find our conditional probability, our decision rule,
00:20:58.560 | or whatever we want. But let me consider a new constructive
00:21:03.520 | setting, where we also replace the unknown probability measure
00:21:09.200 | with the empirical probability measure obtained from the training data. And you
00:21:14.960 | will see the last equation. And
00:21:18.640 | to find the conditional probability, we have to solve
00:21:24.080 | this equation in a set of functions f(x).
00:21:30.240 | The right-hand side is known, because our function g is known.
00:21:34.800 | On the left-hand side, we do not know f(x), so
00:21:37.920 | our goal is to find this function in the set of functions.
00:21:43.840 | In classical statistics, there was one algorithm, the Nadaraya-Watson
00:21:50.480 | estimator, which shows how to estimate the conditional probability
00:21:54.960 | or the regression function. They just define it directly.
00:22:00.640 | And it is shown as a function which has a numerator and a denominator,
00:22:08.800 | where g is a special kernel, say Gaussian.
00:22:15.520 | So this is an estimate of the regression, a very general way of estimating the regression
00:22:21.920 | or the conditional probability. And all the business is, if the kernel
00:22:26.800 | is Gaussian, how to find the best value of the variance to approximate the
00:22:33.840 | conditional probability well. So they spent a lot of time on this subject, and
00:22:38.720 | they have this. But you can see,
00:22:43.760 | from the line in the middle, that the Nadaraya-Watson estimator comes from a
00:22:52.080 | corrupted equation, not from the equation which we derived.
00:22:56.560 | Here, in the middle, it is f(x), not f(xi),
00:23:04.000 | as in the correct equation. So then you can
00:23:07.520 | take f(x) out of the sum, and you will get this estimator.
00:23:12.080 | So actually the classical Nadaraya-Watson estimator
00:23:16.720 | is the solution of a corrupted version of the equation which we obtained.
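For reference, a minimal sketch of the classical Nadaraya-Watson estimator just described; the Gaussian kernel and the bandwidth value are illustrative assumptions:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.3):
    """Nadaraya-Watson estimate of the regression / conditional probability:
    a kernel-weighted average of the training labels."""
    diffs = x_query[:, None] - x_train[None, :]
    weights = np.exp(-0.5 * (diffs / bandwidth) ** 2)       # Gaussian kernel g(x* - x_i)
    return (weights @ y_train) / weights.sum(axis=1)        # ratio of weighted labels to mass

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
p = 1.0 / (1.0 + np.exp(-4 * x))                            # true P(y=1|x) for the demo
y = (rng.uniform(size=200) < p).astype(float)
xq = np.linspace(-1, 1, 5)
print(np.round(nadaraya_watson(x, y, xq), 2))               # roughly tracks P(y=1|x)
```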
00:23:24.240 | But what does it mean to solve an equation? To solve the equation
00:23:28.960 | means I take the difference between the left-hand side and the right-hand side,
00:23:35.600 | define the area where I would like my function to operate,
00:23:42.800 | take the square, integrate over some probability measure,
00:23:49.920 | and minimize this functional. So let us do that. And if you do the
00:23:56.320 | simple algebra, it is very simple and you can check it,
00:24:02.080 | you will have a functional which is a double sum of (yi minus f(xi))
00:24:09.840 | times (yj minus f(xj)), multiplied by coefficients
00:24:19.520 | V(xi, xj).
00:24:22.960 | So we can estimate these values. And each value is the
00:24:29.680 | integral of g(xi - x) g(xj - x) over the measure, and this forms a matrix.
00:24:38.400 | As with the exact Nadaraya-Watson formula, we know this matrix,
00:24:45.440 | every element of this matrix. So what is the V matrix?
00:24:53.440 | If you replace this integral with the empirical integral,
00:24:59.120 | you will have an estimate of the V matrix. If you take,
00:25:04.720 | say, the line between minus one and one,
00:25:10.880 | with mu(x) uniformly distributed over this line,
00:25:24.000 | and g a Gaussian, you will have this V matrix explicitly. So the V matrix is
00:25:30.960 | easy to find.
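A small numerical sketch of the V-matrix computation described here, for a uniform measure on [-1, 1] and a Gaussian g; the bandwidth and the quadrature grid are assumptions, and the closed form on the slide would replace the numerical integral:

```python
import numpy as np

def v_matrix(x_train, sigma=0.5, n_grid=2001):
    """V_ij = ∫ g(x_i - x) g(x_j - x) dμ(x) for uniform μ on [-1, 1] and Gaussian g,
    approximated on a grid."""
    grid = np.linspace(-1.0, 1.0, n_grid)
    g = np.exp(-0.5 * ((x_train[:, None] - grid[None, :]) / sigma) ** 2)
    return (g * (1.0 / n_grid)) @ g.T              # sum over the grid with uniform weights

x = np.array([-0.5, 0.0, 0.7])
V = v_matrix(x)
print(np.round(V, 3))                              # symmetric; larger for nearby points
```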
00:25:36.880 | Now I would like to use vector notation. What I will do: I will call Y the vector of the
00:25:44.000 | training labels y1, ..., yℓ. I am given ℓ pairs,
00:25:49.520 | so I create the vector Y, which is the vector of all the y's.
00:25:57.360 | I will create F, capital F, which is also an ℓ-dimensional vector:
00:26:05.040 | I pick a function f, and this is the value of this function
00:26:11.360 | at the point x1, and the last value of this function
00:26:15.280 | is at the point xℓ. So this is F. And I have the V matrix.
00:26:22.400 | So I can rewrite this functional in matrix form:
00:26:26.000 | it is (Y - F) transposed, times V, times (Y - F).
00:26:32.800 | If I write the least-squares method in this notation,
00:26:39.840 | I have (Y - F) transposed times (Y - F):
00:26:46.240 | the same expression, but with the identity matrix instead of V.
00:26:53.600 | So I got some improvement over the least-squares method,
00:27:01.280 | and I hope that it will give me a rate of convergence
00:27:08.880 | better than the least-squares method.
00:27:13.440 | Let's see. But this is not the major point, because, okay, I improve the rate of
00:27:19.600 | convergence, but it is still essentially the least-squares method.
00:27:23.280 | Now the most important part: selection of the admissible set of
00:27:28.160 | functions. What does it mean? When you construct a neural network,
00:27:37.200 | you say that you are building a smart structure.
00:27:40.720 | What does that mean? You are saying that you are constructing a
00:27:45.200 | smart admissible set of functions. You know something, you are smart guys.
00:27:52.160 | You construct it, and then you minimize over this set of functions.
00:27:58.960 | But let me consider from a mathematical perspective
00:28:04.240 | what this is. If you consider Hilbert space, or also
00:28:11.200 | Euclidean space, in this space there are two
00:28:15.760 | modes of convergence. Strong convergence is convergence of functions;
00:28:22.240 | it is the first line. My sequence of functions fℓ
00:28:27.680 | converges to f0 if this integral converges to zero as ℓ
00:28:34.640 | goes to infinity. But there also exists weak convergence.
00:28:39.520 | You say that your sequence of functions fℓ
00:28:45.920 | converges to f0 if this inner product converges to that inner
00:28:52.640 | product for every function phi from the Hilbert space.
00:28:59.440 | You can think of this inner product as describing a
00:29:03.840 | property of the function. If, for all functions phi, the property converges,
00:29:11.200 | that is weak convergence.
00:29:15.600 | It is easy to show that if you have strong convergence,
00:29:21.920 | you also have weak convergence. That follows just from the
00:29:26.080 | Schwarz inequality, the Cauchy-Schwarz inequality.
00:29:32.000 | But one can also prove that if you have weak convergence
00:29:38.480 | and your set of functions belongs to a compact set, you also have strong convergence.
00:29:44.240 | So in some sense, the two modes of convergence are equivalent.
00:29:50.000 | In our considerations of machine learning, we use
00:29:56.080 | strong convergence everywhere, 100%. But what about weak convergence?
00:30:01.600 | Let's explore this opportunity.
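In formulas, the two modes of convergence just described are (a hedged restatement):

```latex
\text{strong:}\quad \lim_{\ell\to\infty}\int \big(f_{\ell}(x)-f_{0}(x)\big)^{2}\, d\mu(x)=0;
\qquad
\text{weak:}\quad \lim_{\ell\to\infty}\int f_{\ell}(x)\,\varphi(x)\,d\mu(x)
=\int f_{0}(x)\,\varphi(x)\,d\mu(x)\ \ \forall\,\varphi .
```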
00:30:05.440 | Let us consider the pattern recognition case.
00:30:13.600 | For the pattern recognition case, in the first line I am just
00:30:17.600 | writing the definition of weak convergence, which equals the second line.
00:30:22.560 | That is, using the Bayes equation, it is the integral of phi
00:30:30.560 | with respect to dP(y = 1, x), for every function phi.
00:30:35.760 | So it must converge for every function phi.
00:30:44.240 | If it converges for every function phi,
00:30:48.800 | I will have the one function which is the desired one.
00:30:54.080 | But that is not realistic. Let us do the following.
00:30:58.160 | Let us select from the Hilbert space m functions phi.
00:31:05.600 | We will talk a lot about how to select these functions.
00:31:09.280 | And then we will consider the equality
00:31:14.560 | not for all functions, but just for these m functions.
00:31:19.360 | And we will call the admissible set of functions
00:31:22.960 | the set of functions which satisfies these equalities.
00:31:27.120 | We know that our desired function must satisfy
00:31:33.520 | this equality for any phi, any phi, because of weak convergence.
00:31:39.920 | So we select the ones we want.
00:31:44.960 | Okay, if you use, instead of the
00:31:53.200 | cumulative distribution function, its empirical estimate,
00:31:59.600 | you will have, instead of an integral property like here,
00:32:04.880 | the property written as a sum.
00:32:09.120 | Again, let me use the same notation for the matrix scheme.
00:32:16.320 | I will use the vector Y, and I will use F, capital F,
00:32:22.160 | which is (f(x1), ..., f(xℓ)), a vector.
00:32:27.040 | For any function f, I have a vector.
00:32:30.080 | And also, I introduce a new vector,
00:32:35.080 | the vector of predicate values at the points x1, ..., xℓ.
00:32:42.320 | Because phi is a function, I can consider the values of this function
00:32:47.320 | at these points.
00:32:49.720 | Then I can write
00:32:55.680 | my equation in this form.
00:32:59.680 | I would like my admissible set of functions to
00:33:04.680 | satisfy, in vector form, these m equations.
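A hedged reconstruction of these m equations, the statistical invariants, in both sum and vector form, where Φk denotes the vector of values of the k-th predicate on the training points:

```latex
\frac{1}{\ell}\sum_{i=1}^{\ell}\varphi_{k}(x_i)\,f(x_i)
\;=\;
\frac{1}{\ell}\sum_{i=1}^{\ell}\varphi_{k}(x_i)\,y_i,
\qquad k=1,\dots,m;
\qquad\text{equivalently}\qquad
\Phi_{k}^{\top}F \;=\; \Phi_{k}^{\top}Y .
```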
00:33:11.680 | Let me explain what we are talking about.
00:33:22.680 | There is the duck-test logic.
00:33:27.680 | If it looks like a duck, swims like a duck,
00:33:32.760 | and quacks like a duck, then it probably is a duck.
00:33:36.760 | What this means: we have the statistical invariants,
00:33:41.880 | in vector form, on this line.
00:33:44.800 | What they do is collect the admissible functions
00:33:48.480 | which identify an animal as a duck
00:33:53.320 | if it looks, swims, and quacks like a duck.
00:33:56.160 | So if you choose predicates
00:34:01.920 | which explain what "looks", "swims", and "quacks" mean,
00:34:06.600 | then your admissible functions will be
00:34:11.920 | exactly those functions which classify as ducks the animals
00:34:18.200 | that look, swim, and quack like a duck.
00:34:23.200 | The concept of a predicate is very different from a feature.
00:34:29.540 | Why is that?
00:34:31.680 | With an increasing number of predicates,
00:34:35.960 | the VC dimension of the admissible set of functions decreases.
00:34:40.720 | Why does it decrease?
00:34:42.160 | Because we have a set of functions,
00:34:44.160 | and then, from this set of functions,
00:34:47.440 | we select the functions which satisfy the new predicate.
00:34:51.040 | Not all functions will satisfy it,
00:34:54.400 | and we consider only the set of functions
00:34:56.520 | which satisfy all these predicates.
00:34:58.560 | But with an increasing number of features,
00:35:02.480 | the VC dimension increases,
00:35:04.520 | because your decision rules become more and more diverse.
00:35:11.680 | So what is the exact setting of the complete learning problem?
00:35:15.360 | Minimize the functional which,
00:35:21.480 | with this V matrix, is a slight improvement
00:35:24.480 | of the least-squares functional, subject to these constraints.
00:35:28.080 | And these constraints express what you would like to see
00:35:34.400 | in the set of admissible functions.
00:35:40.160 | But existing classical methods of pattern recognition
00:35:45.000 | just minimize this functional,
00:35:49.320 | that is, the least-squares functional.
00:35:51.520 | We are minimizing this functional
00:35:54.800 | subject to these constraints.
00:35:56.160 | That is our setting.
00:35:57.800 | That was the exact setting,
00:36:01.280 | which in mathematics is called
00:36:07.240 | conditional optimization:
00:36:12.080 | optimization of a functional under constraints.
00:36:14.680 | But the approximation is unconditional optimization.
00:36:19.240 | I would like to minimize this functional
00:36:21.440 | subject to these constraints, but I will do the following.
00:36:24.800 | I will form a sum of this functional
00:36:29.520 | and the square of the difference
00:36:34.280 | between the two sides of each constraint,
00:36:38.400 | summed with some weights.
00:36:41.200 | The weights say
00:36:43.660 | how I balance the two goals:
00:36:46.680 | I minimize over both, and the weights express
00:36:50.680 | how important it is for me to fit the
00:36:52.700 | data, and how important it is for me to satisfy the constraints.
00:36:58.480 | And then, if I do that,
00:37:03.260 | I can rewrite this problem in this way,
00:37:05.920 | where you have to minimize this functional,
00:37:11.900 | and where P is just the covariance matrix of your predicates.
00:37:18.240 | And everything is simple to compute.
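Putting the two formulations side by side (a hedged reconstruction; the weight τ and the exact form of P follow from expanding the squared constraint violations and may differ in detail from the slide):

```latex
\text{exact:}\quad
\min_{F}\;(Y-F)^{\top}V(Y-F)
\quad\text{s.t.}\quad \Phi_{k}^{\top}F=\Phi_{k}^{\top}Y,\ \ k=1,\dots,m;
\qquad
\text{approximate:}\quad
\min_{F}\;(Y-F)^{\top}V(Y-F)+\tau\sum_{k=1}^{m}\big(\Phi_{k}^{\top}(F-Y)\big)^{2}
=\min_{F}\;(Y-F)^{\top}\big(V+\tau P\big)(Y-F),
\quad P=\sum_{k=1}^{m}\Phi_{k}\Phi_{k}^{\top}.
```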
00:37:29.660 | So that is the concept.
00:37:33.000 | What do we have to do?
00:37:38.000 | We have to solve our problem using both
00:37:40.520 | weak and strong convergence.
00:37:44.700 | That means using the invariants
00:37:47.980 | and minimizing the functional.
00:37:54.740 | And we can do it in an exact way,
00:37:58.120 | and in an approximation.
00:38:03.120 | But so far this was written for any set of functions.
00:38:07.020 | I did not say how I will minimize it.
00:38:09.960 | It is the same as with the least-squares method:
00:38:12.860 | you can write the least-squares functional
00:38:14.260 | for any set of functions.
00:38:15.420 | It is the same here.
00:38:20.060 | But now, let me show how I can find the solution.
00:38:26.680 | And first of all, I will do it
00:38:28.700 | in a reproducing kernel Hilbert space.
00:38:30.960 | This is the definition of a reproducing kernel Hilbert space.
00:38:36.340 | You have some kernel, which is a Mercer kernel.
00:38:39.420 | You take the inner product
00:38:41.740 | with a function f(x), and you get the same function back.
00:38:46.300 | It reproduces the same function.
00:38:48.420 | That is why it is called a reproducing kernel Hilbert space.
00:38:53.020 | And it is known that a Mercer kernel has an expansion
00:38:57.240 | with eigenvalues lambda, where the lambdas are non-negative values,
00:39:01.920 | and psi are orthonormal functions.
00:39:06.400 | So the set of functions,
00:39:11.360 | with this inner product and this norm,
00:39:13.960 | this is the inner product and that is the norm,
00:39:18.040 | forms a reproducing kernel Hilbert space.
00:39:20.760 | It is very easy to check.
00:39:23.300 | It means that if you use
00:39:25.860 | some orthonormal functions psi,
00:39:34.060 | an expansion with coefficients c, and if you have
00:39:41.460 | this set of functions, and if you introduce a
00:39:46.020 | special inner product, an inner product of this type,
00:39:51.020 | then you will have this definition.
00:39:55.500 | So you will have a reproducing kernel Hilbert space.
00:39:58.560 | It is a pretty general space.
00:40:00.320 | But in a reproducing kernel Hilbert space,
00:40:07.280 | you have a great theorem called the representer theorem.
00:40:10.000 | And the representer theorem says:
00:40:14.060 | if you would like to minimize this functional
00:40:17.000 | in a subset of functions, where the subset of functions
00:40:20.200 | is those whose norm in the reproducing
00:40:24.200 | kernel Hilbert space is bounded,
00:40:28.960 | then your solution has a linear representation
00:40:33.600 | over the kernel with a finite number of parameters.
00:40:36.840 | So let us introduce the vector of functions
00:40:44.800 | K(x) = (K(x1, x), ..., K(xℓ, x)); it is the vector of the expansion.
00:40:49.800 | And then the square of our norm will be
00:40:56.180 | A transposed K A.
00:41:02.060 | This is how
00:41:06.140 | the function which you are looking for looks:
00:41:08.420 | f(x) = A transposed K(x), and A transposed K A is the norm of your function,
00:41:14.040 | a function from the reproducing kernel Hilbert space.
00:41:16.960 | K is the matrix of values K(xi, xj),
00:41:23.160 | and in these terms f from the reproducing kernel Hilbert space
00:41:28.480 | is just a linear function.
00:41:31.400 | The subset of functions with bounded norm has finite VC dimension.
00:41:38.860 | The smaller C, the smaller the VC dimension.
00:41:43.360 | And that is according to the second theorem
00:41:45.240 | which I showed you before.
00:41:48.000 | And to control the VC dimension, you should control just C:
00:41:54.380 | you should look at the norm of this function.
00:42:00.540 | So conditional minimization in a reproducing
00:42:04.880 | kernel Hilbert space has a closed-form solution:
00:42:09.080 | minimize this functional, subject to these constraints
00:42:12.900 | and to the constraint bounding the norm.
00:42:17.900 | And this is your solution: it is a linear expansion
00:42:23.960 | over this vector of functions, and the vector of coefficients
00:42:28.960 | is like that; in closed form
00:42:39.320 | it is just a product of matrices.
00:42:43.360 | Here gamma_c depends on the bound:
00:42:48.360 | you see that gamma of C depends on this C.
00:42:52.340 | This is the identity matrix, and you have this solution.
00:42:57.340 | And to find the multipliers mu here, you have to solve a
00:43:01.100 | system of linear equations.
00:43:02.900 | So you solve linear equations,
00:43:05.300 | and then you have the closed-form solution.
00:43:08.300 | So the complete problem in a reproducing
00:43:12.380 | kernel Hilbert space has a closed-form solution.
00:43:17.380 | But what about the unconditional minimization,
00:43:24.220 | which approximates the minimization
00:43:27.060 | under these constraints?
00:43:29.740 | It also has a closed-form solution,
00:43:32.540 | and this is how the closed-form solution looks.
00:43:35.280 | Your solution is the vector of coefficients of this expansion,
00:43:39.180 | and you have this equation for how to find A.
00:43:44.180 | So everything is computable.
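A runnable sketch of the closed-form computation in an RKHS. This is not the lecture's exact formula: the RBF kernel, the regularization constant, the weight τ, and the form of P are assumptions; setting the derivative of (Y-KA)ᵀ(V+τP)(Y-KA) + reg·AᵀKA to zero gives the linear system solved below.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=2.0):
    """Gaussian (Mercer) kernel matrix K(x_i, x_j)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def lusi_fit(X, y, phis, v=None, tau=1.0, reg=1e-2):
    """Sketch: minimize (Y-KA)'(V+τP)(Y-KA) + reg*A'KA with P = Σ_k Φ_k Φ_k'."""
    K = rbf_kernel(X, X)
    n = len(y)
    V = np.eye(n) if v is None else v
    P = sum(np.outer(p, p) for p in phis)            # predicate matrix
    W = V + tau * P
    A = np.linalg.solve(W @ K + reg * np.eye(n), W @ y)
    return lambda Xq: rbf_kernel(Xq, X) @ A          # f(x) = Σ_i a_i K(x_i, x)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
phi1 = np.ones(60)                                   # the φ(x)=1 predicate from the talk
f = lusi_fit(X, y, phis=[phi1])
print(np.mean((f(X) > 0.5) == (y > 0.5)))            # training accuracy of the sketch
```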
00:43:48.540 | But a very special role in this explanation is played by the
00:43:56.940 | support vector machine.
00:43:58.380 | What is a support vector machine?
00:44:02.220 | Given data, I would like, in a reproducing
00:44:07.220 | kernel Hilbert space, with this bounded norm,
00:44:13.440 | to minimize this functional.
00:44:16.500 | And then, when I minimize this functional,
00:44:21.620 | exactly that is how you get the support vector machine.
00:44:26.620 | If you look at how the support vector machine is derived,
00:44:32.100 | it is just minimization of this functional
00:44:34.900 | over this set of functions.
00:44:36.500 | But here I will do something else.
00:44:38.380 | I will take the data term as in the support vector machine,
00:44:44.300 | yi minus A transposed K; that is, I would like to approximate the data,
00:44:50.180 | with a norm bound as before, but I also would like
00:44:56.580 | my invariants to hold well enough:
00:45:02.420 | to be close.
00:45:03.620 | The left-hand side and the right-hand side
00:45:05.380 | of each invariant should be close.
00:45:07.220 | So I would like to minimize this functional
00:45:10.260 | under the same constraint.
00:45:12.060 | And this is the solution.
00:45:14.060 | The solution is an A
00:45:16.900 | which is a function of Phi,
00:45:21.900 | and Phi, as you remember,
00:45:27.420 | is the vector of predicate values,
00:45:31.320 | and of the indicator vector.
00:45:34.660 | If I make the requirement on the invariants strong
00:45:39.660 | and the requirement on approximating the data not too strong,
00:45:51.020 | if I want to use just weak convergence,
00:45:59.180 | I find that my A is defined by the invariants,
00:46:04.180 | by the predicate functions.
00:46:08.620 | But as for my predicate functions, how I choose the predicates,
00:46:14.000 | I can choose any functions I want.
00:46:16.060 | That means that I could choose one function
00:46:22.120 | which will give me the optimal solution:
00:46:25.360 | there may exist such a smart predicate
00:46:29.000 | that I will not need a lot of data.
00:46:31.600 | Why do I need data?
00:46:34.680 | I need data for the expansion over the kernel.
00:46:38.280 | But for estimating the coefficients, I do not need data.
00:46:42.520 | I use my predicates.
00:46:44.160 | And that means, what is
00:46:50.120 | a predicate?
00:46:51.920 | It is a property.
00:46:53.420 | It is an explanation of what I want.
00:46:56.180 | I will talk about this later.
00:46:58.360 | So this example with the support vector machine
00:47:03.360 | shows that the philosophy of this learning is very different.
00:47:08.920 | According to the representer theorem,
00:47:12.360 | the solution of the learning problem in a reproducing
00:47:15.600 | kernel Hilbert space has a property:
00:47:18.180 | it defines a linear parametric function
00:47:21.040 | in the form of an expansion over kernel functions.
00:47:27.400 | That means that the optimal expansion
00:47:29.480 | belongs to a one-layer network, not a multi-layer network.
00:47:34.480 | And because a reproducing kernel Hilbert space
00:47:40.820 | is richer than that,
00:47:44.080 | maybe it is not the best idea to use a deep network.
00:47:49.080 | Deep networks, okay, we will discuss this in depth.
00:47:55.120 | The observation vectors and the kernel define the basis of the expansion
00:47:59.920 | for the optimal parametric solution.
00:48:03.280 | But the parameters are defined by the invariants.
00:48:08.280 | And you can use these invariants.
00:48:12.320 | And what does that mean? If you write it in the form
00:48:18.080 | where you have K, which is the matrix
00:48:22.520 | depending on your training data,
00:48:24.520 | and Phi is the predicate matrix, so you have a vector here,
00:48:29.520 | and Y and F, which are vectors,
00:48:35.520 | then you have these elements.
00:48:38.200 | You have a different formulation of the learning problem
00:48:44.280 | in terms of these vectors and these values.
00:48:52.140 | So, since any function from the Hilbert space can be used,
00:48:56.720 | because when we talked about weak convergence,
00:49:02.180 | it converges for all functions of the Hilbert space
00:49:06.220 | simultaneously,
00:49:09.240 | any function can be used as a predicate.
00:49:10.500 | So we have a chance to pick several smart functions,
00:49:13.380 | or maybe even one,
00:49:18.940 | and it will be enough for our training.
00:49:23.460 | And then we will talk about what "smart" means.
00:49:26.020 | It is intelligent learning, not just brute-force learning.
00:49:30.300 | So let me give you an illustration, to give a feeling.
00:49:37.420 | This is the least-squares method, what it does.
00:49:40.880 | This is the V-matrix method.
00:49:44.340 | If I make this small improvement,
00:49:46.780 | it is a little bit better.
00:49:48.780 | Here, if I introduce invariants, I do better.
00:49:53.220 | Here, I use both the invariants and the V matrix.
00:49:58.220 | So you can see the difference.
00:50:01.400 | This is for 48 points, and to be sure that it is a difficult problem,
00:50:07.320 | we took 16 from one class and 32 from the other class.
00:50:15.200 | It is not balanced.
00:50:17.380 | This is 98 points, the same story.
00:50:19.980 | This is 192 points, the same story.
00:50:22.660 | And this is a multi-dimensional case,
00:50:27.340 | so we checked something there as well.
00:50:30.700 | And here is a very interesting case.
00:50:33.780 | We introduced some invariants
00:50:37.300 | and got a .2273 error rate for the diabetes data.
00:50:44.020 | And then we asked: can we find an invariant
00:50:46.860 | to improve the performance?
00:50:48.900 | And what we did: we just looked for the area
00:50:51.780 | where our invariants do not hold,
00:50:56.780 | where they are violated.
00:50:57.620 | Then we just took a predicate
00:51:03.100 | which deals with this area,
00:51:07.300 | which counts one when a point
00:51:13.140 | belongs to this area,
00:51:15.180 | and zero when it does not.
00:51:17.540 | And using this invariant we improved
00:51:22.540 | on the .2273 error rate.
00:51:27.980 | So that means that if you have a smart way
00:51:35.180 | of looking for invariants,
00:51:39.260 | then you have a chance to improve your performance.
00:51:43.500 | And this is exactly the same philosophy
00:51:46.540 | that physicists use.
00:51:47.860 | Find the situation, the box in the figure,
00:51:53.940 | where the existing solution,
00:51:59.860 | the approximation which we obtained,
00:52:03.180 | contradicts the evidence, does not keep the invariant:
00:52:09.020 | it contradicts the invariant inside the box.
00:52:12.460 | And then modify the solution:
00:52:14.660 | obtain a new approximation which resolves this contradiction,
00:52:19.660 | which does not have this contradiction.
00:52:22.780 | So you are just looking: you invent an invariant
00:52:26.420 | and look for where you have a contradiction.
00:52:28.980 | And that is exactly the same principle
00:52:31.060 | that physicists use to discover laws of nature.
00:52:34.920 | To discover a law of nature,
00:52:36.180 | a physicist first tries to find a situation
00:52:39.180 | where the existing theory contradicts observations,
00:52:42.260 | in other words, a theoretical prediction
00:52:47.540 | not supported by experiments.
00:52:49.740 | That means an invariant is violated.
00:52:51.100 | Then they try to reconstruct the theory,
00:52:54.260 | to remove the contradiction.
00:52:55.940 | They construct a new approximation of the theory
00:52:58.460 | which does not contain the observed contradiction.
00:53:02.260 | But the most important part in physics, as here,
00:53:06.340 | the most difficult part of scientific discovery,
00:53:11.340 | is to find the contradictory situation.
00:53:14.080 | But let me show something about neural nets.
00:53:20.380 | I am not fond of neural nets.
00:53:22.620 | But we can use our theory for neural nets as well.
00:53:27.620 | What is a neural net?
00:53:31.980 | In a neural net, you minimize the least-squares error.
00:53:35.600 | I would instead minimize this approximation with the invariants,
00:53:42.540 | which contains the V plus P matrix,
00:53:48.140 | where the matrix P is the matrix built from the invariants.
00:53:53.140 | And then I do the same back-propagation.
00:53:58.700 | It so happens that I can easily do back-propagation
00:54:02.780 | not just for the identity matrix, but also for this matrix.
00:54:05.860 | And if I do back-propagation,
00:54:10.380 | I need to make only one correction in back-propagation.
00:54:14.340 | Instead of back-propagating the error,
00:54:18.860 | which is y minus what you have on the last layer,
00:54:23.860 | over all ℓ observations,
00:54:27.040 | you just take this matrix and multiply
00:54:32.040 | your vector of propagated errors by this matrix.
00:54:36.620 | You are correcting your propagation.
00:54:38.660 | So that is what the modified back-propagation is about.
00:54:45.500 | You do the forward step; it is okay, it is the same.
00:54:49.420 | You have the back-propagation step; it is the boundary condition.
00:54:54.260 | In the back-propagation step, you should make one correction,
00:54:57.720 | only one, at the very last layer,
00:55:00.860 | and then update your weights as usual.
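A minimal sketch of the modification just described, in plain NumPy: one hidden layer, full batch, V taken as the identity (as in the experiment mentioned next), P normalized following the (1/ℓ) form of the invariants; the architecture and all hyper-parameters are illustrative assumptions, not the lecture's setup. The only change to standard back-propagation is that the output-layer error is multiplied by V + τP before propagating.

```python
import numpy as np

def train_with_invariants(X, y, phis, tau=1.0, hidden=16, lr=0.1, epochs=500, seed=0):
    """Replace the least-squares loss (Y-F)'(Y-F) by (Y-F)'(V+τP)(Y-F)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.eye(n) + tau * sum(np.outer(p, p) for p in phis) / n ** 2   # V + τP
    W1 = rng.normal(0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    w2 = rng.normal(0, 0.5, hidden);      b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                  # forward step (unchanged)
        f = h @ w2 + b2
        err = W @ (f - y)                         # the single correction: weight by V+τP
        dh = np.outer(err, w2) * (1 - h ** 2)     # back-propagate as usual
        W1 -= lr * (X.T @ dh) / n; b1 -= lr * dh.mean(0)
        w2 -= lr * (h.T @ err) / n; b2 -= lr * err.mean()
    return lambda Xq: np.tanh(Xq @ W1 + b1) @ w2 + b2

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)
f = train_with_invariants(X, y, phis=[np.ones(200)])   # the φ(x)=1 invariant
print(np.mean((f(X) > 0.5) == (y > 0.5)))
```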
00:55:03.020 | So I went to NEC and asked the guys who have a neural network
00:55:08.020 | to incorporate just this improvement,
00:55:14.720 | a small improvement. I took just one invariant,
00:55:20.700 | and a very trivial invariant:
00:55:24.140 | the predicate phi(x) equal to one.
00:55:27.420 | I will show you that it is not such a simple invariant.
00:55:30.020 | We will discuss what this invariant does.
00:55:33.620 | And then we set V equal to the identity; we did not use the V matrix.
00:55:38.620 | Here it is just the identity matrix.
00:55:42.460 | We used a thousand examples, a hundred per class, batch size six.
00:55:50.020 | And this line is what the neural network does,
00:55:55.020 | the deep neural network; they have a good neural network.
00:56:02.400 | And that is what the improved neural network does.
00:56:08.140 | Instead of 3.1, they got 2.9.
00:56:10.580 | Okay, then let's add just one more, not very important, predicate:
00:56:17.940 | a cosine coefficient. I have the picture of my digit,
00:56:22.940 | I take the cosine expansion, so I have Fourier coefficients,
00:56:31.460 | one coefficient of the Fourier expansion.
00:56:37.420 | And I will use this as a predicate.
00:56:42.140 | So I will use a predicate with this one coefficient.
00:56:45.660 | And again, you see: just one more invariant,
00:56:48.740 | this simple cosine, makes an improvement.
00:56:52.020 | Okay, then we decided: let's do more.
00:56:55.220 | Let's take 16 Fourier coefficients,
00:56:59.460 | four frequencies in x1 and four in x2.
00:57:03.740 | And we got 0.6, a bigger improvement.
00:57:09.580 | But I can do whatever I want; I can take, say, 1600
00:57:13.760 | invariants; a lot of
00:57:17.480 | games can be played.
00:57:21.080 | In the neural net
00:57:26.080 | we use an approximation of the exact solution,
00:57:31.160 | but it works.
00:57:33.760 | The statistical part of learning theory is complete.
00:57:39.980 | Why is it complete?
00:57:43.480 | Because there exist only two modes of convergence
00:57:47.960 | in Hilbert space:
00:57:49.160 | convergence in functions and convergence in functionals.
00:57:54.840 | There is no third mode of convergence.
00:57:58.360 | So from a conceptual point of view,
00:58:01.400 | you can play one game, or the other, or the two games together.
00:58:04.400 | So that is why I call this complete.
00:58:09.160 | You cannot imagine something else.
00:58:12.560 | If you would like to do something
00:58:15.840 | to improve learning,
00:58:22.880 | you should ask yourself how to choose the invariants.
00:58:26.240 | And that is what we will talk about.
00:58:28.000 | And the invariant, it is something about intelligence.
00:58:40.960 | When I talked about the duck test,
00:58:45.960 | I used "looks like a duck", "swims like a duck",
00:58:51.240 | "quacks like a duck".
00:58:52.480 | But I could say "plays chess like a duck".
00:58:54.480 | I can say whatever I want,
00:58:57.520 | and it would still be an invariant equality.
00:59:02.520 | Or, say, "sings opera like a duck", okay?
00:59:08.800 | But among all these stupid predicates,
00:59:12.080 | there are smart predicates.
00:59:14.600 | And the subject of learning,
00:59:19.080 | the subject of intelligent learning,
00:59:21.440 | is somehow to pick up these smart invariants.
00:59:24.920 | Let us discuss what a predicate is.
00:59:31.280 | I do not know exactly.
00:59:32.760 | I think this is a story many hundreds of years old.
00:59:35.320 | I will show you that it is a continuation
00:59:38.040 | of a major line of philosophy from Plato to Hegel to Wigner
00:59:43.040 | and so on; I will show you.
00:59:45.360 | But it is in the predicates.
00:59:48.320 | So my claim is that when you are talking not about imitation,
00:59:53.320 | but about what the essence of intelligence is,
00:59:57.640 | the essence is in the predicates.
00:59:59.740 | A predicate is something abstract.
01:00:01.680 | Okay, I will show you later.
01:00:04.880 | Let's see the predicates.
01:00:07.400 | That is, the invariant must hold true.
01:00:12.400 | Must, must, yes.
01:00:14.360 | So let's take phi(x) equal to one.
01:00:16.920 | What does this predicate do?
01:00:19.880 | The expected number of elements of the class y equal to one,
01:00:26.760 | computed using this predicate,
01:00:29.200 | equals the number of training examples of the first class.
01:00:36.080 | When you use this predicate,
01:00:38.080 | you pick up functions which give you,
01:00:43.940 | on the training data, a number of
01:00:48.940 | representatives of the first class exactly the same
01:00:53.320 | as in your training data.
01:00:55.000 | So that is this predicate.
01:01:01.760 | And you saw how strong this predicate is
01:01:04.160 | for a neural net on 10-class digit recognition,
01:01:06.520 | because it captures something.
01:01:08.500 | Let's take another "stupid" predicate:
01:01:12.800 | just phi(x) equal to x. It looks like that.
01:01:15.840 | It is the center of mass.
01:01:17.380 | I want the expected center of mass,
01:01:21.800 | expected with respect to the conditional probability,
01:01:24.400 | to be equal to the center of mass
01:01:30.000 | which I see on my training data.
01:01:34.140 | That looks like that.
01:01:37.800 | But you can do something smarter, which looks like that.
01:01:41.680 | I can consider x times x transposed,
01:01:47.840 | which makes a matrix; it is the correlation,
01:01:51.960 | or covariance, matrix.
01:01:53.560 | And I can consider about n squared over two predicates
01:01:59.840 | of this type.
01:02:01.360 | The correlation which I will get
01:02:04.600 | using this predicate,
01:02:07.320 | using the function, sorry, the prediction which I will get
01:02:15.200 | with the obtained solution, will be the same
01:02:21.000 | as the correlation which I observed on my training data.
01:02:24.480 | That is what this predicate means.
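To make these three example predicates concrete, here is a small sketch (my own illustration, not from the slides) that computes, for a candidate conditional-probability estimate f, the gap between the predicate averages reweighted by f and the same averages reweighted by the labels; an admissible function drives every gap to zero:

```python
import numpy as np

def invariant_gap(X, y, f_vals):
    """Gaps of the φ(x)=1, φ(x)=x and φ(x)=x xᵀ invariants for candidate values f(x_i)."""
    gaps = {}
    gaps["count"] = f_vals.mean() - y.mean()                               # φ(x) = 1
    gaps["center_of_mass"] = (X * f_vals[:, None]).mean(0) - (X * y[:, None]).mean(0)
    second = lambda w: (X[:, :, None] * X[:, None, :] * w[:, None, None]).mean(0)
    gaps["correlation"] = second(f_vals) - second(y)                       # φ(x) = x xᵀ
    return gaps

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(float)
f_good = y.copy()                          # a candidate that reproduces the labels
print({k: np.round(np.max(np.abs(v)), 3) for k, v in invariant_gap(X, y, f_good).items()})
```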
01:02:27.040 | But I think that we should not go
01:02:32.040 | for fully general predicates.
01:02:36.840 | We can imagine many predicates,
01:02:39.200 | and I am not smart enough to construct very general predicates.
01:02:43.080 | But let's make predicates for 2D images.
01:02:47.800 | The image u(x1, x2) is a function of two variables;
01:02:53.280 | it is the image of a digit, say, in our case.
01:02:57.120 | And we have this label y for this function.
01:03:00.360 | And let's consider a predicate like this: I take the image,
01:03:05.080 | and I consider its Fourier coefficients.
01:03:09.080 | That is my predicate.
01:03:10.920 | I want the expected value of this predicate with respect to my solution
01:03:14.480 | over my data
01:03:16.360 | to be the same as I see on my training data.
01:03:21.280 | I can consider convolutions.
01:03:25.480 | By the way, a convolutional neural network is one predicate;
01:03:30.000 | I will show you later which predicate.
01:03:33.160 | And this is the convolution of the image
01:03:40.120 | at different points.
01:03:41.460 | You can use wavelets, you can use whatever you want,
01:03:47.120 | because whatever goes into the predicate
01:03:49.880 | is something you decide to use.
01:03:53.720 | And to understand image recognition
01:04:00.160 | means to understand which predicates are involved in it.
01:04:04.140 | But also, see the difference between a predicate and an invariant.
01:04:09.140 | A predicate is an abstract concept.
01:04:12.440 | But an invariant involves your training data.
01:04:17.360 | It is what makes your abstract concept specific.
01:04:21.680 | It is still general, but it is specific.
01:04:23.960 | And now I want to show you an
01:04:33.160 | instrument for constructing special predicates.
01:04:37.640 | Let us consider the vectors (x, y): just digit recognition.
01:04:46.560 | Keep digit recognition in mind.
01:04:49.200 | And suppose x is your pixel space.
01:04:53.000 | And you make a linear transformation,
01:04:57.040 | a small linear transformation, of your pixel space.
01:05:00.720 | If you have a small linear transformation of the pixel space,
01:05:07.360 | you first transform your picture.
01:05:10.360 | But you can transform the picture directly,
01:05:11.720 | and you can also transform it using the Lie derivative.
01:05:14.720 | I will show you what that is.
01:05:16.480 | But to see what it is, I took this picture
01:05:20.160 | from the paper by Simard et al.
01:05:23.440 | It shows this transformation in the first line.
01:05:29.400 | In pixel space, you make a linear transformation.
01:05:34.200 | And here, you make the same transformation
01:05:36.600 | with the Lie derivative.
01:05:38.120 | This is the Lie derivative, this black one.
01:05:42.360 | It is just a computed Lie derivative.
01:05:44.160 | I will show you how to compute it.
01:05:46.040 | And alpha is a coefficient.
01:05:48.360 | And using different coefficients,
01:05:50.320 | a equal to minus two, you just have this;
01:05:53.840 | a equal to minus one, you can have this,
01:05:58.480 | this, this, and all this stuff.
01:06:00.040 | So you can create clones of the digit two,
01:06:07.240 | transformed with respect to the Lie derivative.
01:06:14.160 | You do not need to have a lot of data.
01:06:18.240 | What you need to have, for digit recognition,
01:06:23.440 | is the predicate.
01:06:26.360 | And from any data you can get this predicate,
01:06:29.440 | and then, okay, I will show you the invariant
01:06:32.160 | with this predicate.
01:06:33.160 | And this is what the Lie derivatives are.
01:06:36.560 | This is horizontal translation: the first coordinate plus a.
01:06:42.920 | You just move in the direction x1.
01:06:45.880 | Then, vertical translation.
01:06:49.120 | This is rotation;
01:06:50.360 | the standard formula from geometry for rotation,
01:06:54.740 | for a small rotation, gives you that.
01:06:57.680 | And this is d/dx1.
01:07:02.040 | This is d/dx2, and so on.
01:07:04.280 | You have all of this.
01:07:05.840 | And this is, again, an illustration from Patrice Simard:
01:07:12.240 | clones. You have this three,
01:07:14.960 | and you create all these clones.
01:07:19.360 | You just choose one, two, three, four, five Lie derivatives,
01:07:24.360 | two different coefficients for each of them,
01:07:27.440 | and you have all of this.
01:07:28.840 | But here is something smarter.
01:07:37.280 | Why not take as a predicate
01:07:40.320 | just the Lie derivative of this,
01:07:41.960 | and take the invariant with respect to the Lie derivative?
01:07:47.640 | That is, I would like to learn, using statistical invariants,
01:07:52.640 | trying to estimate such a conditional probability
01:07:58.160 | which keeps the invariants with respect to all the Lie derivatives.
01:08:08.600 | But even more; again, this was done in that work.
01:08:12.680 | So suppose I have this set of clones of my digit,
01:08:17.680 | and this is the set of clones of another digit.
01:08:25.680 | I call the tangent distance the distance between the closest elements
01:08:31.640 | of these two sets of clones.
01:08:32.600 | What I mean: I have two digits.
01:08:34.920 | They are different, say a two and a three.
01:08:37.080 | I massage them with linear transformations
01:08:39.960 | to make them as close as possible,
01:08:42.800 | and measure their closeness.
01:08:45.480 | That is based on the Lie invariants;
01:08:48.680 | it is called the tangent distance.
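A small sketch of the tangent-distance idea: one-sided, with only two Lie derivatives (horizontal and vertical translation) approximated by finite differences; a full implementation such as Simard's uses more transformations and a two-sided distance:

```python
import numpy as np

def tangent_vectors(img):
    """Tangent vectors of an image under small horizontal/vertical translation,
    approximated by finite-difference derivatives d/dx1 and d/dx2."""
    dx = np.zeros_like(img); dx[:, 1:] = img[:, 1:] - img[:, :-1]
    dy = np.zeros_like(img); dy[1:, :] = img[1:, :] - img[:-1, :]
    return np.stack([dx.ravel(), dy.ravel()])            # (2, n_pixels)

def tangent_distance(img_a, img_b):
    """Distance from img_b to the plane {img_a + α·L : α} spanned by img_a's
    tangent vectors, i.e. to the closest small-transformation clone of img_a."""
    a, b = img_a.ravel(), img_b.ravel()
    L = tangent_vectors(img_a)
    alpha, *_ = np.linalg.lstsq(L.T, b - a, rcond=None)   # best small transformation
    return np.linalg.norm(a + L.T @ alpha - b)

rng = np.random.default_rng(0)
img = rng.random((8, 8))
shifted = np.roll(img, 1, axis=1)                         # the same image, translated
print(tangent_distance(img, shifted) < np.linalg.norm(img - shifted))  # True
```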
01:08:51.800 | And now, I believe that there is a general concept
01:08:57.760 | for predicates: symmetry.
01:09:01.900 | When you have any picture,
01:09:06.860 | you can ask what the measure of symmetry
01:09:09.320 | of this picture is.
01:09:10.800 | So, for example, you have a three.
01:09:13.340 | What can I do? This is my digit three.
01:09:18.340 | For horizontal symmetry,
01:09:22.240 | I take the first line here, the second line here,
01:09:25.680 | the last line here.
01:09:26.940 | Then I do the following.
01:09:28.880 | I make another image.
01:09:30.640 | I make the last line the first line,
01:09:33.040 | this line the second line, and so on.
01:09:36.820 | So what I am doing:
01:09:38.000 | I take the three,
01:09:44.040 | and I read it as a vector, like that.
01:09:49.800 | Now I do the same with the flipped image.
01:09:51.260 | These are two different images.
01:09:56.420 | And now I say: let me massage them.
01:10:01.480 | The tangent distance between these two different pictures tells me
01:10:04.360 | how close they are.
01:10:06.460 | If they are very close, I can say
01:10:08.920 | this is the coefficient of symmetry of this three.
01:10:12.660 | But I can consider horizontal symmetry, vertical symmetry,
01:10:17.800 | anti-symmetry.
01:10:19.240 | What does anti-symmetry mean?
01:10:20.960 | Take the letter S:
01:10:22.100 | it has vertical anti-symmetry.
01:10:26.620 | I can have diagonal symmetry, everything.
01:10:31.120 | So you can play many games with symmetry.
01:10:34.720 | And this is just one predicate.
01:10:36.400 | I have a concluding remark.
01:10:41.200 | What we did:
01:10:43.160 | we said that we can minimize this functional,
01:10:47.280 | which is slightly better than, say, least squares,
01:10:50.960 | subject to these constraints, and the constraints are the serious part,
01:11:00.480 | because the constraints define the admissible set of functions,
01:11:04.680 | which I am trying to construct by being smart.
01:11:10.800 | So I can consider, say, all continuous functions,
01:11:13.400 | and then, from these continuous functions,
01:11:18.400 | select by smart predicates the admissible set of functions.
01:11:24.080 | And I can do that, because according to weak convergence,
01:11:28.680 | any invariant holds for the desired function;
01:11:31.500 | an invariant holds with any predicate.
01:11:39.920 | And this also provides a unique solution
01:11:42.880 | in reproducing kernel Hilbert space,
01:11:46.200 | and an approximation for neural networks,
01:11:49.620 | an approximate solution.
01:11:51.120 | But the true progress
01:11:56.120 | goes beyond statistical reasoning.
01:11:59.200 | It goes in the direction of searching for predicates,
01:12:01.080 | which form the basis for understanding the problems existing in the world.
01:12:02.880 | And what does understanding mean?
01:12:06.720 | It means that in, say, 2D image recognition,
01:12:11.720 | there exists a concept of symmetry.
01:12:15.340 | There exists a concept of structure.
01:12:17.980 | And if you know these concepts,
01:12:21.200 | and I believe there are not many of them, I will show you why,
01:12:23.880 | I think that there are not a lot,
01:12:25.560 | then you understand this problem.
01:12:27.280 | And I think that this line is a very old line.
01:12:33.520 | It starts from Plato.
01:12:36.800 | What does Plato say?
01:12:38.240 | Plato says that there is a world of ideas
01:12:41.120 | and a world of things,
01:12:42.400 | and the world of ideas shapes the world of things.
01:12:48.020 | And you see that I have ideas,
01:12:55.580 | which are the predicates, which are abstract,
01:12:57.260 | which can be applied to different situations,
01:13:01.380 | and I have the world of things.
01:13:03.260 | Then, about 300 years ago,
01:13:09.180 | classical German philosophy was about exactly that:
01:13:12.660 | what ideas mean, and what things mean.
01:13:16.260 | And Hegel thought that whatever is reasonable exists,
01:13:20.980 | which is exactly what we said about predicates,
01:13:24.340 | and whatever exists is reasonable.
01:13:27.180 | So there are connections in both directions.
01:13:28.900 | But more recently, 60 years ago, Wigner wrote an article
01:13:35.500 | about the unreasonable effectiveness of mathematics,
01:13:41.940 | in which he says that mathematics
01:13:43.540 | knows something about reality.
01:13:45.260 | If you would like to understand reality,
01:13:48.460 | you should look at the equations, and you will see how it works.
01:13:53.100 | So predicates are abstract ideas,
01:13:55.820 | while the invariants that are built using them
01:14:01.180 | form elements of the solution.
01:14:02.900 | These two concepts reflect the essence of intelligence,
01:14:08.940 | not just its imitation, which is what artificial intelligence does.
01:14:13.560 | And that is the subject which we should attack.
01:14:19.520 | And also, I have two more slides.
01:14:23.020 | One slide is a challenge.
01:14:24.460 | I know that people with deep networks
01:14:30.220 | got a 0.5% error rate using 60,000 observations.
01:14:35.220 | The challenge is: use 1% of this data
01:14:39.380 | and get the same result.
01:14:41.340 | But invent smart predicates
01:14:44.740 | and all these clones which exist.
01:14:46.760 | I think that it is doable.
01:14:51.780 | And the very last slide.
01:14:53.980 | In 1928, Vladimir Propp published the book
01:15:00.020 | Morphology of the Folktale,
01:15:03.900 | where he describes 31 predicates
01:15:07.540 | that allow one to synthesize all Russian folk tales.
01:15:13.060 | Later, his morphology, these 31 predicates,
01:15:19.860 | was applied to literature, to theater,
01:15:24.100 | to film, to television, to television series,
01:15:29.040 | to games, et cetera.
01:15:31.020 | And these 31 were enough.
01:15:33.300 | And this I read from Wikipedia.
01:15:36.940 | You can check it: Wikipedia, Vladimir Propp.
01:15:39.140 | Propp found 31 predicates which describe
01:15:42.780 | different actions of people in the real world.
01:15:47.240 | Probably there exists a small number of predicates
01:15:50.400 | that describe 2D images.
01:15:52.180 | And that is intelligence:
01:15:56.080 | that is how to find them.
01:15:57.860 | And that is what I believe learning should be about.
01:16:03.920 | Thank you.
01:16:04.760 | (audience applauding)
01:16:07.920 | - I think we have time for a few questions.
01:16:14.440 | - Hello, thank you.
01:16:16.540 | I have two questions.
01:16:17.940 | First one is, do you know of any predicates
01:16:20.960 | that you recommend for language classification tasks
01:16:24.360 | specifically?
01:16:25.560 | And the second question is, do you have any strategies
01:16:28.480 | for hedging against overfitting?
01:16:30.920 | Like if you specify too many predicates,
01:16:33.320 | then you might be sort of--
01:16:35.400 | - I not hear you well.
01:16:36.800 | But second question is about overfitting.
01:16:39.480 | - Overfitting, yes.
01:16:40.320 | - Yes, let me answer this.
01:16:41.840 | - Sure.
01:16:42.680 | - The more predicates you have...
01:16:44.820 | Why does overfitting happen?
01:16:47.080 | Because your set of functions is big,
01:16:49.300 | and you have a small amount of data for selecting a function.
01:16:54.100 | So you can select whatever you want.
01:16:56.620 | But if you increase the number of predicates,
01:16:59.320 | you decrease the set of functions.
01:17:03.580 | So the more predicates, the less overfitting happens.
01:17:08.580 | - Good point.
01:17:09.420 | - And the theory, the mathematics, says
01:17:11.760 | that if you have an infinite number of predicates,
01:17:14.140 | you are left with one function, the one you want.
01:17:16.940 | - You also asked about natural language,
01:17:20.940 | recommendation for predicates for language,
01:17:24.080 | natural language processing, the Turing test,
01:17:28.020 | any good predicates?
01:17:29.240 | - You know, it is a very complicated story,
01:17:34.900 | natural language; I don't know.
01:17:36.600 | - Questions.
01:17:41.860 | - Whatever I am talking about is very trivial, simple.
01:17:46.020 | Everyone is familiar with 2D images.
01:17:52.940 | And you can think, like this guy Vladimir Propp:
01:17:58.880 | what are the predicates in these images?
01:18:00.840 | Can we formulate, if you are a smart guy,
01:18:04.940 | say, a couple of dozen, or maybe one dozen,
01:18:08.140 | predicates, which will be enough?
01:18:10.780 | (audience member speaking faintly)
01:18:13.380 | - But language is harder than images.
01:18:15.940 | - Oh, yeah.
01:18:16.780 | (audience laughing)
01:18:19.620 | Yeah, but don't go immediately to the hard problem,
01:18:23.660 | try to--
01:18:24.580 | - Try.
01:18:25.420 | - Yeah, try the very simple ones, just step by step.
01:18:29.460 | It so happens that it is the main line of philosophy,
01:18:38.380 | from Plato to this guy, which says that there are not
01:18:43.380 | too many ideas that exist
01:18:50.060 | in the world of ideas.
01:18:51.240 | It could be like that.
01:18:54.020 | - Vladimir, thank you so much for coming today.
01:18:57.740 | - Thank you, thank you.
01:18:58.580 | - And please give him a big hand.
01:18:59.900 | (audience applauding)
01:19:03.140 | Thank you.
01:19:03.980 | (audience applauding)
01:19:07.140 | (upbeat music)