Theano Tutorial (Pascal Lamblin, MILA)
Chapters
0:00
0:36 Objectives
9:06 The back-propagation algorithm
13:34 Computing values
16:23 Graph optimizations
27:41 Visualization, debugging and diagnostic tools
31:52 Examples
>> Okay. So today I'm going to briefly introduce you to Theano, how to use it, and go over the basic principles behind the library. If you paid attention during yesterday's presentation of TensorFlow, some concepts will be familiar to you, and if you paid attention to Hugo Larochelle's talk, you will have heard of some similar concepts as well. There are going to be four main parts. The first one is these slides, an introduction to the concepts of Theano. There is a companion IPython notebook on GitHub, so if you go to that page or clone the GitHub repository, there is an example of how to use it, and you can download the code snippets from the slides so that you can run them at the same time. Then we're going to have a more hands-on example, basically applying logistic regression to the MNIST data set. And then, if we have time, we'll go quickly over two more examples, and we'll talk about the concepts of Theano and how to use them.
So Theano is, we can say, a mathematical symbolic expression compiler. What does that mean? It means that it makes it possible to define expressions that represent mathematical computations. It's quite simple to use, and it supports all the basic mathematical operations, like min, max, addition, subtraction, all those kinds of basic things, not only larger blocks like layers of neural networks. Theano works with those expressions, doing graph substitutions, cloning and replacement, things like that, and it also makes it possible to go through the graph and perform things like automatic differentiation (symbolic differentiation, actually), and then to compile what we call a Theano function. Then it's possible to use the optimized graph and Theano's runtime to actually compute some output values, given inputs. We also have a couple of tools that help debug both Theano's code and the user's code, and to inspect the graph and see if there are any errors in the code. So let's talk about
Theano itself. Theano is currently more than eight years old. It started small, with only a couple of contributors from the ancestor of MILA, which was called LISA at the time, and it grew a lot. We now have contributors from all over the world, and Theano is used in prototypes and in industrial applications, in startups and in larger companies. Theano has also been the base of other software projects built on top of it, for instance Blocks, Keras, and Lasagne. Those provide a user interface at a higher level, with concepts of layers, of training algorithms, those kinds of things, whereas Theano is more the backend. There is also a converter to load Caffe models into one of those frameworks, and another project uses Theano not to do machine learning but probabilistic programming. And we have two other libraries, Platoon and Theano-MPI, which are layers on top of Theano that handle model parallelism and data parallelism.
So, how do you use Theano? First of all, we are working with symbolic expressions and symbolic variables. We define the expression first, then we compile a function, and then we execute that function on values. To define the expression, we start by defining inputs. The inputs are symbolic variables that have some type: when you declare, say, a matrix, you have to say what its data type is, floating point, integers, and so on, and things like the number of dimensions have to be known in advance. But the shape is not fixed, and the memory layout is not fixed, so you could have shapes that change between one mini-batch and the next, or between different calls. X and Y here are purely symbolic variables; we will give them values later, but for now they're just empty.
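A minimal sketch of those declarations (x and y follow the talk; everything else is an assumption):

    import theano
    import theano.tensor as T

    # symbolic inputs: the number of dimensions and dtype are fixed, the shape is not
    x = T.matrix('x')   # a mini-batch of inputs, dtype theano.config.floatX
    y = T.matrix('y')   # the corresponding targets, also a matrix in this sketch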
There's another kind of input variable: shared variables. They are symbolic, but they also hold a value, and that value is persistent across function calls. They are usually used, for instance, for storing the parameters of the model that you want to learn, and their values can be updated as well. So here we create two shared variables from values: this one has two dimensions, because its initial value has two dimensions, and this one has one dimension and will represent the bias. We can name variables by assigning to their name attribute. Shared variables do not have a fixed size either; they are usually kept fixed in most models, but it's not a requirement.
Then, from these inputs, we can define expressions that build new variables, intermediate variables, and so on. For instance, here we can take the product of X and W, add the bias, apply a sigmoid function to that, and say this is our output variable; and from the output variable and Y, we can define, say, the cost. Those new variables are connected to the previous ones through the operations that we defined, and we can visualize the graph structure by using, for instance, pydotprint, which is a helper function. Variables are the square boxes, and the other nodes, which we call apply nodes, represent the mathematical operations applied to them.
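Continuing the same sketch, the expressions and the pydotprint call might look like this (the output file name is made up):

    # new variables built from x, w, and b
    dot = T.dot(x, w)
    out = T.nnet.sigmoid(dot + b)       # output variable
    cost = ((out - y) ** 2).mean()      # a cost defined from out and y

    # draw the graph of variables and apply nodes
    theano.printing.pydotprint(cost, outfile='cost_graph.png')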
Input variables and shared variables do not have any ancestors: nothing feeds into them, but you can see the intermediate results that are built from them. Usually, when we visualize, we don't necessarily care about all the intermediate variables unless they have a name or something, so here we have exactly the same graph, but with the unnamed intermediate variables hidden; you can still see all the operations, and you can see the type of each variable. Now, once we have such a graph, say the forward computation of your model, we want to be able to use back-propagation to get gradients.
Here is just the basic concept of the chain rule. We have a scalar cost, we have intermediate variables that are vectors, and the chain rule is applied starting from the cost. The full Jacobian of one of those intermediate functions, say g, would be an M by N matrix if its input and output are vectors of size N and M. Usually you don't need that, and it's usually a bad idea to compute it explicitly unless you need it for some other purpose. The only thing you need is an expression that, given a vector representing the gradient of the cost with respect to the output, gives you the gradient of the cost with respect to the input: basically, the product between that vector and the whole Jacobian matrix. That's also sometimes called the L-operator. Almost all operations in Theano implement a grad method that returns that, and it returns not numbers, not a numerical value, but a symbolic expression that represents that computation, again usually without having to explicitly represent or define the whole Jacobian matrix.
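In symbols (my notation, not the speaker's), what each op's grad method has to provide is only the product

    \frac{\partial C}{\partial x} \;=\; J^{\top}\,\frac{\partial C}{\partial y},
    \qquad J_{ij} = \frac{\partial y_i}{\partial x_j},

so the full M by N Jacobian J never has to be materialized.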
So you can call theano.grad, which will back-propagate from the cost towards the inputs that you give. Along the way, it will call that grad method of each operation, starting from one for the cost and back-propagating through the whole graph, accumulating when the same variable is used more than once, and so on. Theano builds the same thing you would get if you had manually defined the gradient expression using Theano operations like the dot product, the sigmoid, and so on that we've seen earlier. So at that point we have symbolic, non-numerical values, and they are part of the computation graph: the graph was extended to add these gradient variables. We can then keep extending the graph from these variables, for instance to compute the expressions corresponding to gradient descent, like we do here.
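A sketch of that call, plus simple gradient-descent update expressions (the learning rate is arbitrary):

    # symbolic gradients of the scalar cost with respect to the shared variables
    grad_w, grad_b = theano.grad(cost, [w, b])

    lr = 0.1
    new_w = w - lr * grad_w   # expressions for the updated parameters,
    new_b = b - lr * grad_b   # still purely symbolic at this point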
So, for instance, this is what the extended graph for the gradients looks like. You can see that a lot of small operations have been inserted. The outputs you can see here are the gradient with respect to the bias, and an intermediate result that will help compute the gradient with respect to the weights. And here's the graph for the updated variables: basically just the gradients that we had on the previous slide, scaled by a constant learning rate. So once we have defined the whole graph, the whole expression that we actually care about, from the inputs and initial weights to the weight updates of our training algorithm, we want to compile a function that will actually compute those numbers, given inputs, and perform the computation.
To do that, we call theano.function, and we provide it with the input variables that we want to feed and the output variables that we want to get. You don't necessarily have to provide all of the inputs that you may have declared, especially if you don't want to go all the way back to the beginning: we can compile a function for a subset of the graph. For instance, we can have a predict function here that goes only from X to out. We don't need values for Y, and the cost, the gradients, and so on will not be computed; it just takes a small part of the graph and makes a function out of it.
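Sketches of compiling functions over different subsets of the graph (same names as above):

    # only the forward part of the graph, from x to out
    predict = theano.function([x], out)

    # a monitoring function with two outputs needs both inputs
    monitor = theano.function([x, y], [out, cost])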
When you call it, you have to provide values for all of the input variables that you declared. You don't have to provide values for shared variables, the W and B that we declared earlier; they are implicit inputs to all of the functions, and their values will be used automatically. You can declare other functions, like a monitoring function that returns not only the prediction but also the cost; since there are two outputs, you also need the second input, Y. You can also compile a function that does not start from the beginning: for instance, if I want an error function that only computes the mismatch between the prediction and the actual target, I don't have to start from the input, I can just start from those two variables.
The next thing we want to do is update shared variables, which is necessary for training. Again, you can pass to theano.function a list of updates. Updates are pairs of a shared variable and a symbolic expression that computes the new value for that shared variable. So here, W and B are updated as implicit outputs of the function, the same way W and B were implicit inputs: the new W and B are implicit outputs that will be computed at the same time as the cost, and once all the outputs are computed, the updates actually take effect.
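A sketch of a training function with updates, reusing the new_w and new_b expressions from before:

    train = theano.function([x, y], cost,
                            updates=[(w, new_w), (b, new_b)])

    # calling it returns the cost, then applies the updates to w and b
    # current_cost = train(x_batch, y_batch)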
Here, if we print the value of B before and after having called the train function, we see the value has changed. What also happens during graph compilation is that the subgraph we selected for that function gets optimized: it's going to be rewritten in parts, and some expressions will be substituted, and so on.
Some optimizations are quite simple. For instance, if the same computation is defined twice, we only want it to be executed once. There are other things that are not necessary to compute at all: for instance, if you have X divided by X, and X is not used anywhere else, we just want to replace that by one. There are numerical stability optimizations: for instance, log of (1 plus x) can lose precision when x is small, so it gets replaced by a more stable log1p operation, and things like log of softmax also get optimized into a more stable operation. It's also the time when in-place and destructive operations are inserted: for instance, if an operation is the last one to use some values, then instead of allocating new output memory it can reuse the input memory. And the transfer of parts of the graph to the GPU is also done during the optimization phase.
So, by default, Theano tries to apply most of the optimizations, so that you get a runtime that's almost as fast as possible, except for a couple of checks and assertions. But if you want fast feedback when compiling and don't care that much about runtime speed, you have a couple of ways of enabling and disabling some sets of optimizations, and you can do that either globally or function by function.
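A couple of hedged examples of how those knobs can be set (the names reuse the sketch above):

    # per function: compile quickly at the expense of runtime speed
    quick_predict = theano.function([x], out, mode='FAST_COMPILE')

    # or globally, before compiling anything:
    # theano.config.optimizer = 'fast_compile'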
So, to have a look at what happens during the optimization phase, here's the original, unoptimized graph going from the inputs X and W to the output prediction; it's the same one that we've seen before. If we compare that with the compiled function that goes from these input variables to out, the one we called predict, this is the graph that we get. I won't go into the details of what's happening in there, but here you have a Gemv operation, which basically calls an optimized BLAS routine that does the multiplication and the accumulation at the same time, and we have a sigmoid operation here that works in place. If you look at the optimized graph computing the expressions for the updated W and B, this was the original one, and the optimized one is much smaller. It also has in-place operations, and it has fused elementwise operations: for instance, if you have a whole tensor and you want to do an addition with a constant, then a sigmoid, then something else, and so on, you want to loop only once through the array, applying all the scalar operations to each element before going to the next one, instead of iterating over the whole array each time you apply a new operation. Those kinds of opportunities appear often in automatically generated graphs, like the ones for gradients. And here you see the updates for the shared variables, which are also inputs: you see the cost, and the implicit outputs for the updated W and B here and here.
Another graph visualization tool is debugprint, which prints a text-based structure of the graph. You can see the variable IDs and the variable names and so on, so here you can see the structure in more detail, for instance the inputs and the scaling constants. Once the function is compiled, we can actually run it: the compiled function is a callable Python object, and we've seen those examples before, for instance where we call train and so on.
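For instance, with the functions sketched earlier:

    # text dump of the optimized graph stored inside a compiled function
    theano.printing.debugprint(predict)

    # the compiled function is an ordinary Python callable
    # y_hat = predict(x_values)   # x_values: a NumPy array with the right dtype and ndim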
To get an optimized runtime, it's not just Python code that is running: we do on-the-fly code generation, generating C++ or CUDA code. For instance, for the elementwise loop fusion that I mentioned, we can't know in advance which elementwise operations will occur, and in which order, in whatever graph the user might be running, so we generate code for that: we generate a Python module, written in C++ or CUDA, that gets compiled and imported back so we can use it from Python. The runtime environment then calls, in the right order, the different operations that have to be executed, given the inputs that we get. We have a couple of different runtimes, and in particular there's one written in C++ that avoids having to switch context between the Python interpreter and the compiled code.
Something else that's really crucial for speed is GPU support. We wanted to make it as simple as possible to use in the usual cases. The new backend now supports a couple of different data types, not only float32, but double precision if you really need that, and integers as well. And you can manipulate GPU arrays from Python itself, so you can use Python code to handle GPU arrays outside of a Theano function if you like. All of that will be in the upcoming 0.9 release.
To use it, you select the device that you want with just a configuration flag: for instance, device=cuda to get the first GPU that's available, or a specific one. If you specify that in the configuration, then all shared variables will by default be created in GPU memory, and the optimizations that replace CPU operations by GPU operations are going to be applied. Usually, you want to make sure you use float32, or even float16 for storage (which is experimental), because most GPUs don't have good performance for double precision.
So how do you set those configuration flags? There is a configuration file (.theanorc), there is an environment variable (THEANO_FLAGS) where you can define them, and the environment variable overrides the config file; you can also set things directly from Python. But some flags have to be known before Theano is imported: if you want to set the device itself, you have to set it either in the configuration file or in the environment variable.
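Hedged examples of the three ways to set flags (device names depend on the machine):

    # 1) in the .theanorc configuration file:
    #      [global]
    #      device = cuda
    #      floatX = float32
    #
    # 2) in the environment, which overrides the file:
    #      THEANO_FLAGS=device=cuda0,floatX=float32 python train.py
    #
    # 3) from Python, for flags that can still be changed after import:
    import theano
    theano.config.floatX = 'float32'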
So I'm going to quickly go over some more advanced topics; if you want to learn more about them, there are other tutorials on the web and a lot of documentation on deeplearning.net. First, loops in the graph: we've seen that the expression graph is basically a directed acyclic graph, so we cannot have loops in there. One way, if you know the number of iterations in advance, is just to unroll the loop, using a for loop in Python that builds all the nodes. But that doesn't work if you want, for instance, a dynamic number of iterations, or for models that generate sequences of arbitrary length.
What we have for that in Theano is called scan. Basically, it's one node that encapsulates another whole Theano function, which represents the computation that has to be done at each time step. So you have a Theano function that performs the computation for one time step, and you have the scan node that calls it in a loop, taking care of the bookkeeping of indices and sequences and feeding the right slice at the right time. Having that structure also makes it possible to define a gradient for that node, which is basically another scan node, another loop that goes backwards and applies backprop through time. And it can be transferred to the GPU as well, in which case the inner function is transferred and recompiled on the GPU. We will see scan again in the LSTM example later.
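A minimal, hypothetical scan example (not the one on the slides): computing the cumulative sum of a vector, one addition per time step.

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.vector('x')

    def step(x_t, acc_tm1):
        # the computation done at one time step
        return acc_tm1 + x_t

    outputs, updates = theano.scan(step,
                                   sequences=x,
                                   outputs_info=np.asarray(0., dtype=theano.config.floatX))
    cumsum = theano.function([x], outputs, updates=updates)
    # cumsum(np.arange(3, dtype=theano.config.floatX)) -> [0., 1., 3.]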
There is a small example on the slides, but we don't really have time for that. We also have visualization, debugging, and diagnostic tools. One of the reasons this is important is that in Theano, like in TensorFlow, the definition of an expression is separate from its execution. So if something doesn't work during the execution, if you encounter errors and so on, it's not obvious how to connect that back to where the expression was actually defined. We try to have informative error messages, and we have some mechanisms like test values: you can assign test values to the symbolic variables so that each time you create a new intermediate variable, each time you define a new expression, the test value gets computed, and you effectively evaluate on one piece of data at the same time as you build the expression. That way you can catch mistakes early.
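A sketch of how test values can be enabled (names and shapes are assumptions):

    import numpy as np
    import theano
    import theano.tensor as T

    theano.config.compute_test_value = 'raise'   # compute test values, fail if one is missing

    x = T.matrix('x')
    x.tag.test_value = np.zeros((3, 4), dtype=theano.config.floatX)
    w = theano.shared(np.zeros((4, 2), dtype=theano.config.floatX), name='w')

    # the test value of each new expression is computed right away, so a shape
    # mismatch would be reported here, at definition time, not at run time
    out = T.nnet.softmax(T.dot(x, w))
    print(out.tag.test_value.shape)   # (3, 2)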
It's possible to extend Theano in a couple of ways. You can create an op just from Python, by wrapping existing efficient libraries. You can also add new graph optimizations, either for increased numerical stability, for more efficient computation, or for substituting your new ops for the naive versions that a user might write.
We have a couple of new features that have recently been added to Theano. I mentioned the new GPU backend, with support for more data types, and we've had some performance improvements, especially for convolutions, 2D and 3D, and especially on the GPU. We've made progress on the time taken by the graph optimization phase, and also improved the performance of the compiled graphs. We have new ways of avoiding recompiling the same graph over and over again, and we have new diagnostic tools that are quite useful: an interactive graph visualization tool, and a PDB breakpoint op that lets you monitor a couple of variables and only break if some condition is met, rather than monitoring something at every iteration. We are currently
working on new operations on the GPU: we still want to wrap more operations for better performance; in particular, the basic RNN operations should be completed in the following days, hopefully, as someone has been working on that a lot recently. And, of course, better support for 3D convolutions, still faster graph optimization, and more work on data parallelism as well. I want to thank my colleagues, the main Theano developers, the people who contributed one way or another, our lab and its software development team, and the organisers of this event. The slides are online with the companion notebook, along with more resources if you want to go further. And now I think it is time to move on to the demo. So, for those who have not
cloned the repository yet, this is the command line you want to launch. For those who have cloned it, you might want to do a git pull, just to make sure you have the latest version of the Jupyter notebook from the repository. We have three examples that we are going to go through: logistic regression, a ConvNet, and an LSTM. I've launched the Jupyter notebook here; the intro Theano notebook is the companion to the slides, and then we have the logistic regression, ConvNet, and LSTM notebooks. So let's go with the logistic regression. Is that big enough, or do I need to increase the font size? Okay. I'm going to skip over the text, because you probably already know about the model. We have some data that we want to load; the loading code comes with the repository on GitHub. So let's load the data.
And here, let's see how we define the model. It's basically the same as what we did in the slides. We define an input variable; here it's a matrix, because we want to use mini-batches. And we have shared variables initialised from zeros. Then we define our model, our predictor: the probability of each class given the input, an affine transformation with a softmax on top of it. And if you want a prediction, it's going to be the class of maximum probability, so an argmax over that axis, because we still want one prediction for each element of the mini-batch.
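Roughly what that cell looks like (a hedged sketch: names and sizes are assumptions; MNIST images are 28 x 28 = 784 pixels, 10 classes):

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix('x')      # a mini-batch of flattened images, one row per example
    y = T.lvector('y')     # the target classes, as integer indices

    W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
    b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

    p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)   # affine model plus softmax
    y_pred = T.argmax(p_y_given_x, axis=1)          # one prediction per example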
Then we define the loss function. Here it's going to be the negative log likelihood of the label given the input, the cross-entropy, and we define it very simply: we don't need to have one cross-entropy or log-likelihood operation by itself, you can just build it from the basic building blocks. You take the log of the probabilities, you index it by the actual target, and then you take the mean of that to get the mean loss over the mini-batch. Then you take the gradients and derive the update rules. Again, there isn't one gradient-descent object or anything like that; we just build whatever rule we want. We could use momentum, for instance, by defining other shared variables for the velocity, and then writing the expressions for updating both the velocity and the parameters.
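And a sketch of the loss and of plain SGD updates built from basic operations (the learning rate is arbitrary):

    # mean negative log likelihood over the mini-batch:
    # for each row, pick the log-probability of the true class
    nll = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

    g_W, g_b = theano.grad(nll, [W, b])
    lr = 0.5
    updates = [(W, W - lr * g_W), (b, b - lr * g_b)]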
Then we compile a train function going from X and Y to the loss, with the updates; that's when the graph gets optimized. The next step: we also want to monitor not only the log likelihood, but the actual misclassification rate on the validation and test sets. That's simply the mean of the mismatches over the mini-batch, and we compile another function for it, not doing any updates, of course. To train the model, we first need to process the data a little bit: we want to feed the model one mini-batch of data at a time. So here we have a helper function (not a Python generator, just a helper) that gives us mini-batch number i, and it's the same function used for the training, validation, and test sets. We also define a couple of parameters for early stopping in the training loop. It's not strictly necessary; it just makes the loop a little more complex than it would otherwise be. So let's define all that.
And this is the main training loop. It's a bit more complex than it could be, but that's because we use early stopping and we only want to validate every so often. There are a couple of parameters you can tune, but basically the most important part is this: you loop over the epochs unless you hit the early stopping condition, and during each epoch you loop over the mini-batches and call the train function. Every once in a while we get the validation error, so here we call the test model on the validation set, and we keep track of which model is currently the best.
And we save the best one. To save the best model, we save the values of all the parameters, which is more robust than trying to pickle the whole Python object, and it also makes it easier to transfer to other frameworks, to visualization frameworks, and so on.
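One plausible way to do that with get_value (the file name is made up):

    import numpy as np

    # copy the current parameter values out of the shared variables and save them
    np.savez('best_model.npz', W=W.get_value(), b=b.get_value())

    # to restore later: W.set_value(np.load('best_model.npz')['W'])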
So let's execute that. Of course, it's a simple model, and it's running on this machine at the same time as everything else, but it should not take too long. At the beginning, almost every iteration gives a better model, and then, after a while, the progress gets slower. So let's wait a little bit.
Now, if we want to visualize what filters were learned, we use a helper function here to plot them; that part is not really important. What matters is that we call get_value on the weights to access the internal value of the shared variable, and then we use that to visualize the filters. And we can see they are kind of reasonable: this is the filter for class 0, and you can see something like a 0; for the 2, what's important is to have an opening here, and so on.
If we look at the training error... well, do we see the training error? No, I'm not plotting it. But the validation and test errors are quite high, and we know that human-level error and the error of other models are much lower, so it really means that this model is too simple and we should use something more advanced. To use something more advanced, if you go back to the home of the Jupyter notebook, you can open the ConvNet example.
This new example uses the same data, but it's a bit more advanced and a bit more optimized, and it has the advantage of training fast even on an older laptop. This time we're going to use a convolutional net, with a couple of convolution layers, fully connected layers, and the final classifier. Let's see how we could use Theano to define helper classes, layers, that make it easier for a user to compose them if they want to replicate some results or use some classical architectures.
This is what is usually done in frameworks built on top of Theano: they develop their own mini-framework, with their own versions of layers and so on, that they find useful and intuitive. So, this logistic regression layer basically holds the parameters, weights and bias; it's a very simple classifier, with expressions for the class probabilities and the predictions, it holds the params, and it has expressions for the negative log likelihood and the errors. If you were to use only that class, it would do essentially the same as the previous example.
In the same way, we can define a layer that does convolution and pooling. Again, in the init method we pass it the filter shape, the image shape, the size of the pooling, and so on. We initialise the weights, we compute the convolution of the input with the filters, we then compute max pooling, and the output is tanh of the pooling result plus the bias. Here the bias is only one number for each channel, which means you don't have a different bias for each location in the image, so you could actually apply such a layer to images of various sizes without having to initialise new parameters or retrain it.
Then we have a hidden layer, which is just a fully connected layer. Again, it initialises weights and a bias, and builds the symbolic expression going from the input and the shared variables to the output after the activation. And again, we collect the parameters so that we can gather them across layers later.
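A stripped-down sketch of what such a fully connected layer class can look like (not the notebook's exact code):

    import numpy as np
    import theano
    import theano.tensor as T

    class HiddenLayer(object):
        def __init__(self, rng, input, n_in, n_out, activation=T.tanh):
            # shared parameters: small random weights and a zero bias
            w_val = rng.uniform(-0.1, 0.1, (n_in, n_out)).astype(theano.config.floatX)
            self.W = theano.shared(w_val, name='W')
            self.b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX), name='b')
            # symbolic expression from the input to the activated output
            self.output = activation(T.dot(input, self.W) + self.b)
            # collected so the caller can concatenate parameters across layers
            self.params = [self.W, self.b]

Here rng would be a numpy.random.RandomState, and layers are chained by passing one layer's output expression as the next layer's input.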
Then, here's the function that contains the main training loop. We have a mini-batch helper again, the same code as before, and here we are building the whole graph, always with the same process. We define the input variables: the input matrix, and an lvector for the targets, a vector of longs, because the targets here are class indices and not one-hot vectors or masks or something like that. Then we create the first layer, a convolution-and-pooling layer, passing it the image shape and the filter shape. Here, the image shape is mostly for efficiency; you don't really have to pass it for these particular models, but you do need the shape of the filters (you have the filters anyway). The convolution layers can handle arbitrary-sized images, but after that we want to flatten the feature maps and feed them into a fully connected layer and into the prediction layer, so this size has to be fixed: we have to know the input size of the fully connected layer. The feature maps have four dimensions, and we flatten them into a matrix.
Then we have the fully connected layer, and the output layer, the classifier, the same as before. We want the final cost to be the negative log likelihood. We have, again, the errors, and the parameters, which are the concatenation of the parameters of all the layers. Once we have that, we can build the gradients: just one call to grad, of the cost with respect to the params. Get the updates: again, just regular SGD, but we could have a class or something that implements momentum, whatever you need. Compile the function. And here we have, again, the early-stopping routine, with the same main loop over all the epochs until we're done, then a loop over the mini-batches, validating every once in a while and stopping when it's finished. So, let's just declare all that and load the data.
We get the same output as before, exactly the same process, and here we could actually run it. This was the result of a previous run; it took five minutes, so I will probably not have time to run it now, but here you can see basically what the result looks like. If you want to try it during the lunch break or later, you're welcome to play with it.
And after that, you can visualize the learned filters as well. Here you have the filters of the first layer, and here you have an example of the activations of the first layer for one example, so we get a little bit more information. So, let's go to the last example.
If you go back to the home of the Jupyter notebook and go to the LSTM notebook: this model is an LSTM that models text one character at a time, predicting each character given the previous ones. I'm not going to go into details, but you can see that the LSTM layer is defined here with variables for all the weight matrices that you need and the different biases for the parameters. So you have a lot of parameters. It would be possible, and sometimes more efficient, to define only one variable that contains the concatenation of a couple of those matrices, which can give a more efficient implementation.
And here's an example of how to use scan for the loop. We define a step function that takes a couple of different inputs: the previous activations and states, the current element of the input sequence, and so on. From them, here are basically the LSTM formulas, with the dot products and the sigmoid or tanh of the different connections inside the cell, and in the end you get the new hidden state and cell state; that is what the step function returns. Once you have that, the step function is passed to theano.scan, where the sequences are the mask and the input. The mask is useful because we are using mini-batches of sequences, and not all the sequences in the same batch have the same length. We group examples of similar lengths together, but they may not all be exactly the same length, so we pad to the longest sequence in the mini-batch (not the longest sequence in the whole set, just in the mini-batch), and we have to remember what the actual lengths of the different sequences are.
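A simplified, hypothetical sketch of that masking trick, using a plain RNN step rather than the full LSTM: where the mask is 0 (padding), the previous hidden state is carried over unchanged.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX
    n_in, n_hid = 10, 20
    W_x = theano.shared(0.01 * np.random.randn(n_in, n_hid).astype(floatX), name='W_x')
    W_h = theano.shared(0.01 * np.random.randn(n_hid, n_hid).astype(floatX), name='W_h')

    x = T.tensor3('x')       # (time, batch, features)
    mask = T.matrix('mask')  # (time, batch): 1 for real steps, 0 for padding

    def step(m_t, x_t, h_tm1):
        h_t = T.tanh(T.dot(x_t, W_x) + T.dot(h_tm1, W_h))
        # keep the old state where the sequence has already ended
        return m_t[:, None] * h_t + (1. - m_t)[:, None] * h_tm1

    h0 = T.zeros((x.shape[1], n_hid))
    hs, _ = theano.scan(step, sequences=[mask, x], outputs_info=h0)
    last_hidden = theano.function([x, mask], hs[-1])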
Here we define the cost function, the cross-entropy over the sequence, and here, again, you see the mask being used so that we don't take into account the predictions made after the end of a sequence. The output layer is a logistic regression, the same as before, using the same kind of cost.
For processing the data, we are using Fuel, which is another tool being developed by a couple of students at MILA. It's nice because it can read plain text data and do some preprocessing on the fly, including what I will show you in a second: grouping sequences of similar length, shuffling them, padding, and so on. That gives a generator that you can then feed into your main loop through a Theano function, so all of that preprocessing happens outside of Theano.
So, here we build our final Theano graph. We have symbolic inputs for the input and the mask; we create the recurrent layer and the output layer, define our cost, and gather all the parameters of both layers. We take the gradients, of course, with respect to all the parameters; as I mentioned, that is going to use backprop through time to get the gradient through the scan operation. The update rule is, again, simple SGD, no momentum, nothing; that's something you can add if you want to. And then we have a function to evaluate the model.
The main loop does the training, and we also have another function that generates one character at a time, given the previous ones; that's why we declare separate inputs here. We have a function that gets the predictions, and we normalize them, because we are working in float32 and sometimes, if you divide by the sum, it doesn't add up to exactly one, so we want higher precision just for that operation. And then we try to generate a sequence every once in a while.
So, during training, we generate a sequence every once in a while with the current model. For monitoring, we prompt it with "the meaning of life is", and then we let the network generate. If I tried to run it now, it would take a long time, but here are some examples from a previous run. The first one is from a model that had not been trained much, and it has a couple of unusual characters: it's not usual to have, say, one Chinese character in the middle of a word, or punctuation in the middle of words, and so on. As training goes on and we keep generating a sequence every once in a while, we see that it's getting slowly better and better: "the meaning of life is the that", and so on.
So, of course, this is not what's going to give you state-of-the-art results.
So, yeah, I interrupted the training at some point, but you can play with it a little bit, and here are some suggestions of things you might want to try: better training strategies, different non-linearities inside the LSTM cell, different initializations of the weights, trying to generate something other than "the meaning of life is", and so on.
So, I hope I could give you a good introduction to what Theano is, what it can be used for, and what you can build on top of it. If you have questions later, there's the theano-users mailing list, we answer questions on Stack Overflow as well, and we would be happy to help.
>> We have time for a few quick questions. There's one here. Could you go to the mic? >> Can you just give a quick example of what debugging might look like in Theano? Could you break something in there and show us what happens and how you figure out what it was? >> Sure. Actually, yeah, I can show you a few examples.
Okay. Let's go to a simple example; I'm just going to use the logistic regression one, and say, for instance, that when I execute this, I don't have the right shape. You can still build the whole symbolic graph, and at the time you actually execute it, you get an error message that says the shapes don't match: let's say that X has this many rows and columns, but Y has only that number of rows. The apply node that caused the error is that dot product, and it shows you the inputs again, but in this case it's not really able to tell you where the expression was defined. So we can go back to where the train function was defined, train_model, theano.function, and pass a mode with the optimizer disabled: mode equals theano.Mode, optimizer None. Is that right? So, let's do that and recompile everything.
And then the updated error message includes the backtrace from when the node was created; it points somewhere in my notebook kernel, and we can go back to that. Of course, there are a lot of things in there, but you know there's a dot product, and it's probably a mismatch between those inputs. So, that's one example. Then there are other techniques we can use: we can have the breakpoints, as I said, and so on. I don't have a tutorial about that right now, but I have some pointers in the slides.
>> One last question. >> I have some models I would like to distribute, and I don't want to require people to install Python and a bunch of compilers and stuff. Do you have any suggestions? >> Okay. So, unfortunately, at this time we're pretty limited in what we can do there. Most of the bookkeeping is done by Python: we use NumPy ndarrays for our intermediate values on the CPU, and a similar structure on the GPU, even though that one might be easier to convert. But all our C code deals with Python objects and does the incref and decref and so on, so that Python manages the memory. So, we're not able to do that; it would be a lot of work.
>> So, how about something like a Docker container? >> Something like that, yes. Recently, even for the GPU, nvidia-docker is quite efficient, and we don't see the slowdowns that we had seen earlier. So, it's not ideal, and if someone has some time and the will to help us disentangle that, they're welcome. >> We reconvene in 55 minutes for the next talk. Have a good lunch break.