Theano Tutorial (Pascal Lamblin, MILA)
Chapters
0:00
0:36 Objectives
9:06 The back-propagation algorithm
13:34 Computing values
16:23 Graph optimizations
27:41 Visualization, debugging and diagnostic tools
31:52 Examples
>> Okay. So today I'm going to briefly introduce you to Theano, how to use it, and go over the basic principles behind the library. If you paid attention during yesterday's presentation of TensorFlow, some concepts will be familiar to you, and if you paid attention to Hugo Larochelle's talk, you will have heard of some similar concepts as well. There are going to be four main parts. The first one is these slides, an introduction to the concepts of Theano. There is a companion IPython notebook on GitHub, so if you go to that page or clone the GitHub repository, there is an example of how to use it, and you can download the code snippets from the slides so that you can run them at the same time. Then we're going to have a more hands-on example, basically applying logistic regression to the MNIST data set. And then, if we have time, we'll go quickly over two more examples, and we'll talk about the concepts of Theano and how to use them.
So Theano is, we can say, a mathematical symbolic expression compiler. What does that mean? It means that it makes it possible to define expressions that represent mathematical computations. It's quite simple to use, and it supports all the basic mathematical operations, like min, max, addition, subtraction, all those kinds of basic things, not only larger blocks like layers of neural networks. Theano works with those expressions, doing graph substitutions, cloning and replacement, things like that, and it also makes it possible to go through the graph and perform things like automatic differentiation (symbolic differentiation, actually), and then to compile what we call a Theano function. Then it's possible to use the optimized graph and Theano's runtime to actually compute some output values, given inputs. We also have a couple of tools that help debug both Theano's code and the user's code, and to inspect the graph and see if there are any errors in the code. So let's talk about
Theano itself. Theano is currently more than eight years old. It started small, with only a couple of contributors from the ancestor of MILA, which was called LISA at the time, and it grew a lot. We now have contributors from all over the world, and Theano is used in prototypes and in industrial applications, in startups and in larger companies. Theano has also been the base of other software projects built on top of it, for instance Blocks, Keras, and Lasagne. Those provide a user interface at a higher level, with concepts of layers, of training algorithms, those kinds of things, whereas Theano is more the backend. There is also a converter to load Caffe models into one of those frameworks, and another project uses Theano not to do machine learning but probabilistic programming. And we have two other libraries, Platoon and Theano-MPI, which are layers on top of Theano that handle model parallelism and data parallelism.
So, how do you use Theano? First of all, we are working with symbolic expressions and symbolic variables. We define the expression first, then we compile a function, and then we execute that function on values. To define the expression, we start by defining inputs. The inputs are symbolic variables that have some type: when you declare, say, a matrix, you have to say what its data type is, floating point, integers, and so on, and things like the number of dimensions have to be known in advance. But the shape is not fixed, and the memory layout is not fixed, so you could have shapes that change between one mini-batch and the next, or between different calls. X and Y here are purely symbolic variables; we will give them values later, but for now they're just empty.
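A minimal sketch of those declarations (x and y follow the talk; everything else is an assumption):

    import theano
    import theano.tensor as T

    # symbolic inputs: the number of dimensions and dtype are fixed, the shape is not
    x = T.matrix('x')   # a mini-batch of inputs, dtype theano.config.floatX
    y = T.matrix('y')   # the corresponding targets, also a matrix in this sketch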
There's another kind of input variable: shared variables. They are symbolic, but they also hold a value, and that value is persistent across function calls. They are usually used, for instance, for storing the parameters of the model that you want to learn, and their values can be updated as well. So here we create two shared variables from values: this one has two dimensions, because its initial value has two dimensions, and this one has one dimension and will represent the bias. We can name variables by assigning to their name attribute. Shared variables do not have a fixed size either; they are usually kept fixed in most models, but it's not a requirement.
Then, from these inputs, we can define expressions that build new variables, intermediate variables, and so on. For instance, here we can take the product of X and W, add the bias, apply a sigmoid function to that, and say this is our output variable; and from the output variable and Y, we can define, say, the cost. Those new variables are connected to the previous ones through the operations that we defined, and we can visualize the graph structure by using, for instance, pydotprint, which is a helper function. Variables are the square boxes, and the other nodes, which we call apply nodes, represent the mathematical operations applied to them.
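Continuing the same sketch, the expressions and the pydotprint call might look like this (the output file name is made up):

    # new variables built from x, w, and b
    dot = T.dot(x, w)
    out = T.nnet.sigmoid(dot + b)       # output variable
    cost = ((out - y) ** 2).mean()      # a cost defined from out and y

    # draw the graph of variables and apply nodes
    theano.printing.pydotprint(cost, outfile='cost_graph.png')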
Input variables and shared variables do not have any ancestors: nothing feeds into them, but you can see the intermediate results that are built from them. Usually, when we visualize, we don't necessarily care about all the intermediate variables unless they have a name or something, so here we have exactly the same graph, but with the unnamed intermediate variables hidden; you can still see all the operations, and you can see the type of each variable. Now, once we have such a graph, say the forward computation of your model, we want to be able to use back-propagation to get gradients.
Here is just the basic concept of the chain rule. We have a scalar cost, we have intermediate variables that are vectors, and the chain rule is applied starting from the cost. The full Jacobian of one of those intermediate functions, say g, would be an M by N matrix if its input and output are vectors of size N and M. Usually you don't need that, and it's usually a bad idea to compute it explicitly unless you need it for some other purpose. The only thing you need is an expression that, given a vector representing the gradient of the cost with respect to the output, gives you the gradient of the cost with respect to the input: basically, the product between that vector and the whole Jacobian matrix. That's also sometimes called the L-operator. Almost all operations in Theano implement a grad method that returns that, and it returns not numbers, not a numerical value, but a symbolic expression that represents that computation, again usually without having to explicitly represent or define the whole Jacobian matrix.
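In symbols (my notation, not the speaker's), what each op's grad method has to provide is only the product

    \frac{\partial C}{\partial x} \;=\; J^{\top}\,\frac{\partial C}{\partial y},
    \qquad J_{ij} = \frac{\partial y_i}{\partial x_j},

so the full M by N Jacobian J never has to be materialized.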
So you can call theano.grad, which will back-propagate from the cost towards the inputs that you give. Along the way, it will call that grad method of each operation, starting from one for the cost and back-propagating through the whole graph, accumulating when the same variable is used more than once, and so on. Theano builds the same thing you would get if you had manually defined the gradient expression using Theano operations like the dot product, the sigmoid, and so on that we've seen earlier. So at that point we have symbolic, non-numerical values, and they are part of the computation graph: the graph was extended to add these gradient variables. We can then keep extending the graph from these variables, for instance to compute the expressions corresponding to gradient descent, like we do here.
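A sketch of that call, plus simple gradient-descent update expressions (the learning rate is arbitrary):

    # symbolic gradients of the scalar cost with respect to the shared variables
    grad_w, grad_b = theano.grad(cost, [w, b])

    lr = 0.1
    new_w = w - lr * grad_w   # expressions for the updated parameters,
    new_b = b - lr * grad_b   # still purely symbolic at this point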
So, for instance, this is what the extended graph for the gradients looks like. You can see that a lot of small operations have been inserted. The outputs you can see here are the gradient with respect to the bias, and an intermediate result that will help compute the gradient with respect to the weights. And here's the graph for the updated variables: basically just the gradients that we had on the previous slide, scaled by a constant learning rate. So once we have defined the whole graph, the whole expression that we actually care about, from the inputs and initial weights to the weight updates of our training algorithm, we want to compile a function that will actually compute those numbers, given inputs, and perform the computation.
To do that, we call theano.function, and we provide it with the input variables that we want to feed and the output variables that we want to get. You don't necessarily have to provide all of the inputs that you may have declared, especially if you don't want to go all the way back to the beginning: we can compile a function for a subset of the graph. For instance, we can have a predict function here that goes only from X to out. We don't need values for Y, and the cost, the gradients, and so on will not be computed; it just takes a small part of the graph and makes a function out of it.
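Sketches of compiling functions over different subsets of the graph (same names as above):

    # only the forward part of the graph, from x to out
    predict = theano.function([x], out)

    # a monitoring function with two outputs needs both inputs
    monitor = theano.function([x, y], [out, cost])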
When you call it, you have to provide values for all of the input variables that you declared. You don't have to provide values for shared variables, the W and B that we declared earlier; they are implicit inputs to all of the functions, and their values will be used automatically. You can declare other functions, like a monitoring function that returns not only the prediction but also the cost; since there are two outputs, you also need the second input, Y. You can also compile a function that does not start from the beginning: for instance, if I want an error function that only computes the mismatch between the prediction and the actual target, I don't have to start from the input, I can just start from those two variables.
The next thing we want to do is update shared variables, which is necessary for training. Again, you can pass to theano.function a list of updates. Updates are pairs of a shared variable and a symbolic expression that computes the new value for that shared variable. So here, W and B are updated as implicit outputs of the function, the same way W and B were implicit inputs: the new W and B are implicit outputs that will be computed at the same time as the cost, and once all the outputs are computed, the updates actually take effect.
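A sketch of a training function with updates, reusing the new_w and new_b expressions from before:

    train = theano.function([x, y], cost,
                            updates=[(w, new_w), (b, new_b)])

    # calling it returns the cost, then applies the updates to w and b
    # current_cost = train(x_batch, y_batch)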
Here, if we print the value of B before and after having called the train function, we see the value has changed. What also happens during graph compilation is that the subgraph we selected for that function gets optimized: it's going to be rewritten in parts, and some expressions will be substituted, and so on.
Some optimizations are quite simple. For instance, if the same computation is defined twice, we only want it to be executed once. There are other things that are not necessary to compute at all: for instance, if you have X divided by X, and X is not used anywhere else, we just want to replace that by one. There are numerical stability optimizations: for instance, log of (1 plus x) can lose precision when x is small, so it gets replaced by a more stable log1p operation, and things like log of softmax also get optimized into a more stable operation. It's also the time when in-place and destructive operations are inserted: for instance, if an operation is the last one to use some values, then instead of allocating new output memory it can reuse the input memory. And the transfer of parts of the graph to the GPU is also done during the optimization phase.
So, by default, Theano tries to apply most of the optimizations, so that you get a runtime that's almost as fast as possible, except for a couple of checks and assertions. But if you want fast feedback when compiling and don't care that much about runtime speed, you have a couple of ways of enabling and disabling some sets of optimizations, and you can do that either globally or function by function.
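A couple of hedged examples of how those knobs can be set (the names reuse the sketch above):

    # per function: compile quickly at the expense of runtime speed
    quick_predict = theano.function([x], out, mode='FAST_COMPILE')

    # or globally, before compiling anything:
    # theano.config.optimizer = 'fast_compile'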
So, to have a look at what happens during the optimization phase, here's the original, unoptimized graph going from the inputs X and W to the output prediction; it's the same one that we've seen before. If we compare that with the compiled function that goes from these input variables to out, the one we called predict, this is the graph that we get. I won't go into the details of what's happening in there, but here you have a Gemv operation, which basically calls an optimized BLAS routine that does the multiplication and the accumulation at the same time, and we have a sigmoid operation here that works in place. If you look at the optimized graph computing the expressions for the updated W and B, this was the original one, and the optimized one is much smaller. It also has in-place operations, and it has fused elementwise operations: for instance, if you have a whole tensor and you want to do an addition with a constant, then a sigmoid, then something else, and so on, you want to loop only once through the array, applying all the scalar operations to each element before going to the next one, instead of iterating over the whole array each time you apply a new operation. Those kinds of opportunities appear often in automatically generated graphs, like the ones for gradients. And here you see the updates for the shared variables, which are also inputs: you see the cost, and the implicit outputs for the updated W and B here and here.
Another graph visualization tool is debugprint, which prints a text-based structure of the graph. You can see the variable IDs and the variable names and so on, so here you can see the structure in more detail, for instance the inputs and the scaling constants. Once the function is compiled, we can actually run it: the compiled function is a callable Python object, and we've seen those examples before, for instance where we call train and so on.
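For instance, with the functions sketched earlier:

    # text dump of the optimized graph stored inside a compiled function
    theano.printing.debugprint(predict)

    # the compiled function is an ordinary Python callable
    # y_hat = predict(x_values)   # x_values: a NumPy array with the right dtype and ndim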
To get an optimized runtime, it's not just Python code that is running: we do on-the-fly code generation, generating C++ or CUDA code. For instance, for the elementwise loop fusion that I mentioned, we can't know in advance which elementwise operations will occur, and in which order, in whatever graph the user might be running, so we generate code for that: we generate a Python module, written in C++ or CUDA, that gets compiled and imported back so we can use it from Python. The runtime environment then calls, in the right order, the different operations that have to be executed, given the inputs that we get. We have a couple of different runtimes, and in particular there's one written in C++ that avoids having to switch context between the Python interpreter and the compiled code.
Something else that's really crucial for speed is GPU support. We wanted to make it as simple as possible to use in the usual cases. The new backend now supports a couple of different data types, not only float32, but double precision if you really need that, and integers as well. And you can manipulate GPU arrays from Python itself, so you can use Python code to handle GPU arrays outside of a Theano function if you like. All of that will be in the upcoming 0.9 release.
To use it, you select the device that you want with just a configuration flag: for instance, device=cuda to get the first GPU that's available, or a specific one. If you specify that in the configuration, then all shared variables will by default be created in GPU memory, and the optimizations that replace CPU operations by GPU operations are going to be applied. Usually, you want to make sure you use float32, or even float16 for storage (which is experimental), because most GPUs don't have good performance for double precision.
So how do you set those configuration flags? There is a configuration file (.theanorc), there is an environment variable (THEANO_FLAGS) where you can define them, and the environment variable overrides the config file; you can also set things directly from Python. But some flags have to be known before Theano is imported: if you want to set the device itself, you have to set it either in the configuration file or in the environment variable.
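Hedged examples of the three ways to set flags (device names depend on the machine):

    # 1) in the .theanorc configuration file:
    #      [global]
    #      device = cuda
    #      floatX = float32
    #
    # 2) in the environment, which overrides the file:
    #      THEANO_FLAGS=device=cuda0,floatX=float32 python train.py
    #
    # 3) from Python, for flags that can still be changed after import:
    import theano
    theano.config.floatX = 'float32'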
So I'm going to quickly go over some more advanced topics; if you want to learn more about them, there are other tutorials on the web and a lot of documentation on deeplearning.net. First, loops in the graph: we've seen that the expression graph is basically a directed acyclic graph, so we cannot have loops in there. One way, if you know the number of iterations in advance, is just to unroll the loop, using a for loop in Python that builds all the nodes. But that doesn't work if you want, for instance, a dynamic number of iterations, or for models that generate sequences of arbitrary length.
What we have for that in Theano is called scan. Basically, it's one node that encapsulates another whole Theano function, which represents the computation that has to be done at each time step. So you have a Theano function that performs the computation for one time step, and you have the scan node that calls it in a loop, taking care of the bookkeeping of indices and sequences and feeding the right slice at the right time. Having that structure also makes it possible to define a gradient for that node, which is basically another scan node, another loop that goes backwards and applies backprop through time. And it can be transferred to the GPU as well, in which case the inner function is transferred and recompiled on the GPU. We will see scan again in the LSTM example later.
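A minimal, hypothetical scan example (not the one on the slides): computing the cumulative sum of a vector, one addition per time step.

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.vector('x')

    def step(x_t, acc_tm1):
        # the computation done at one time step
        return acc_tm1 + x_t

    outputs, updates = theano.scan(step,
                                   sequences=x,
                                   outputs_info=np.asarray(0., dtype=theano.config.floatX))
    cumsum = theano.function([x], outputs, updates=updates)
    # cumsum(np.arange(3, dtype=theano.config.floatX)) -> [0., 1., 3.]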
There is a small example on the slides, but we don't really have time for that. We also have visualization, debugging, and diagnostic tools. One of the reasons this is important is that in Theano, like in TensorFlow, the definition of an expression is separate from its execution. So if something doesn't work during the execution, if you encounter errors and so on, it's not obvious how to connect that back to where the expression was actually defined. We try to have informative error messages, and we have some mechanisms like test values: you can assign test values to the symbolic variables so that each time you create a new intermediate variable, each time you define a new expression, the test value gets computed, and you effectively evaluate on one piece of data at the same time as you build the expression. That way you can catch mistakes early.
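A sketch of how test values can be enabled (names and shapes are assumptions):

    import numpy as np
    import theano
    import theano.tensor as T

    theano.config.compute_test_value = 'raise'   # compute test values, fail if one is missing

    x = T.matrix('x')
    x.tag.test_value = np.zeros((3, 4), dtype=theano.config.floatX)
    w = theano.shared(np.zeros((4, 2), dtype=theano.config.floatX), name='w')

    # the test value of each new expression is computed right away, so a shape
    # mismatch would be reported here, at definition time, not at run time
    out = T.nnet.softmax(T.dot(x, w))
    print(out.tag.test_value.shape)   # (3, 2)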
It's possible to extend Theano in a couple of ways. You can create an op just from Python, by wrapping existing efficient libraries. You can also add new graph optimizations, either for increased numerical stability, for more efficient computation, or for substituting your new ops for the naive versions that a user might write.
We have a couple of new features that have recently been added to Theano. I mentioned the new GPU backend, with support for more data types, and we've had some performance improvements, especially for convolutions, 2D and 3D, and especially on the GPU. We've made progress on the time taken by the graph optimization phase, and also improved the performance of the compiled graphs. We have new ways of avoiding recompiling the same graph over and over again, and we have new diagnostic tools that are quite useful: an interactive graph visualization tool, and a PDB breakpoint op that lets you monitor a couple of variables and only break if some condition is met, rather than monitoring something at every iteration. We are currently
working on new operations on the GPU: we still want to wrap more operations for better performance; in particular, the basic RNN operations should be completed in the following days, hopefully, as someone has been working on that a lot recently. And, of course, better support for 3D convolutions, still faster graph optimization, and more work on data parallelism as well. I want to thank my colleagues, the main Theano developers, the people who contributed one way or another, our lab and its software development team, and the organisers of this event. The slides are online with the companion notebook, along with more resources if you want to go further. And now I think it is time to move on to the demo. So, for those who have not
cloned the repository yet, this is the command line you want to launch. For those who have cloned it, you might want to do a git pull, just to make sure you have the latest version of the Jupyter notebook from the repository. We have three examples that we are going to go through: logistic regression, a ConvNet, and an LSTM. I've launched the Jupyter notebook here; the intro Theano notebook is the companion to the slides, and then we have the logistic regression, ConvNet, and LSTM notebooks. So let's go with the logistic regression. Is that big enough, or do I need to increase the font size? Okay. I'm going to skip over the text, because you probably already know about the model. We have some data that we want to load; the loading code comes with the repository on GitHub. So let's load the data.
And here, let's see how we define the model. It's basically the same as what we did in the slides. We define an input variable; here it's a matrix, because we want to use mini-batches. And we have shared variables initialised from zeros. Then we define our model, our predictor: the probability of each class given the input, an affine transformation with a softmax on top of it. And if you want a prediction, it's going to be the class of maximum probability, so an argmax over that axis, because we still want one prediction for each element of the mini-batch.
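Roughly what that cell looks like (a hedged sketch: names and sizes are assumptions; MNIST images are 28 x 28 = 784 pixels, 10 classes):

    import numpy as np
    import theano
    import theano.tensor as T

    x = T.matrix('x')      # a mini-batch of flattened images, one row per example
    y = T.lvector('y')     # the target classes, as integer indices

    W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
    b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

    p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)   # affine model plus softmax
    y_pred = T.argmax(p_y_given_x, axis=1)          # one prediction per example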
Then we define the loss function. Here it's going to be the negative log likelihood of the label given the input, the cross-entropy, and we define it very simply: we don't need to have one cross-entropy or log-likelihood operation by itself, you can just build it from the basic building blocks. You take the log of the probabilities, you index it by the actual target, and then you take the mean of that to get the mean loss over the mini-batch. Then you take the gradients and derive the update rules. Again, there isn't one gradient-descent object or anything like that; we just build whatever rule we want. We could use momentum, for instance, by defining other shared variables for the velocity, and then writing the expressions for updating both the velocity and the parameters.
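And a sketch of the loss and of plain SGD updates built from basic operations (the learning rate is arbitrary):

    # mean negative log likelihood over the mini-batch:
    # for each row, pick the log-probability of the true class
    nll = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

    g_W, g_b = theano.grad(nll, [W, b])
    lr = 0.5
    updates = [(W, W - lr * g_W), (b, b - lr * g_b)]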
Then we compile a train function going from X and Y to the loss, with the updates; that's when the graph gets optimized. The next step: we also want to monitor not only the log likelihood, but the actual misclassification rate on the validation and test sets. That's simply the mean of the mismatches over the mini-batch, and we compile another function for it, not doing any updates, of course. To train the model, we first need to process the data a little bit: we want to feed the model one mini-batch of data at a time. So here we have a helper function (not a Python generator, just a helper) that gives us mini-batch number i, and it's the same function used for the training, validation, and test sets. We also define a couple of parameters for early stopping in the training loop. It's not strictly necessary; it just makes the loop a little more complex than it would otherwise be. So let's define all that.
And this is the main training loop. It's a bit more complex than it could be, but that's because we use early stopping and we only want to validate every so often. There are a couple of parameters you can tune, but basically the most important part is this: you loop over the epochs unless you hit the early stopping condition, and during each epoch you loop over the mini-batches and call the train function. Every once in a while we get the validation error, so here we call the test model on the validation set, and we keep track of which model is currently the best.
And we save the best one. To save the best model, we save the values of all the parameters, which is more robust than trying to pickle the whole Python object, and it also makes it easier to transfer to other frameworks, to visualization frameworks, and so on.
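One plausible way to do that with get_value (the file name is made up):

    import numpy as np

    # copy the current parameter values out of the shared variables and save them
    np.savez('best_model.npz', W=W.get_value(), b=b.get_value())

    # to restore later: W.set_value(np.load('best_model.npz')['W'])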
So let's execute that. Of course, it's a simple model, and it's running on this machine at the same time as everything else, but it should not take too long. At the beginning, almost every iteration gives a better model, and then, after a while, the progress gets slower. So let's wait a little bit.
Now, if we want to visualize what filters were learned, we use a helper function here to plot them; that part is not really important. What matters is that we call get_value on the weights to access the internal value of the shared variable, and then we use that to visualize the filters. And we can see they are kind of reasonable: this is the filter for class 0, and you can see something like a 0; for the 2, what's important is to have an opening here, and so on.
If we look at the training error... well, do we see the training error? No, I'm not plotting it. But the validation and test errors are quite high, and we know that human-level error and the error of other models are much lower, so it really means that this model is too simple and we should use something more advanced. To use something more advanced, if you go back to the home of the Jupyter notebook, you can open the ConvNet example.
This new example uses the same data, but it's a bit more advanced and a bit more optimized, and it has the advantage of training fast even on an older laptop. This time we're going to use a convolutional net, with a couple of convolution layers, fully connected layers, and the final classifier. Let's see how we could use Theano to define helper classes, layers, that make it easier for a user to compose them if they want to replicate some results or use some classical architectures.
This is what is usually done in frameworks built on top of Theano: they develop their own mini-framework, with their own versions of layers and so on, that they find useful and intuitive. So, this logistic regression layer basically holds the parameters, weights and bias; it's a very simple classifier, with expressions for the class probabilities and the predictions, it holds the params, and it has expressions for the negative log likelihood and the errors. If you were to use only that class, it would do essentially the same as the previous example.
In the same way, we can define a layer that does convolution and pooling. Again, in the init method we pass it the filter shape, the image shape, the size of the pooling, and so on. We initialise the weights, we compute the convolution of the input with the filters, we then compute max pooling, and the output is tanh of the pooling result plus the bias. Here the bias is only one number for each channel, which means you don't have a different bias for each location in the image, so you could actually apply such a layer to images of various sizes without having to initialise new parameters or retrain it.
Then we have a hidden layer, which is just a fully connected layer. Again, it initialises weights and a bias, and builds the symbolic expression going from the input and the shared variables to the output after the activation. And again, we collect the parameters so that we can gather them across layers later.
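A stripped-down sketch of what such a fully connected layer class can look like (not the notebook's exact code):

    import numpy as np
    import theano
    import theano.tensor as T

    class HiddenLayer(object):
        def __init__(self, rng, input, n_in, n_out, activation=T.tanh):
            # shared parameters: small random weights and a zero bias
            w_val = rng.uniform(-0.1, 0.1, (n_in, n_out)).astype(theano.config.floatX)
            self.W = theano.shared(w_val, name='W')
            self.b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX), name='b')
            # symbolic expression from the input to the activated output
            self.output = activation(T.dot(input, self.W) + self.b)
            # collected so the caller can concatenate parameters across layers
            self.params = [self.W, self.b]

Here rng would be a numpy.random.RandomState, and layers are chained by passing one layer's output expression as the next layer's input.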
Then, here's the function that contains the main training loop. We have a mini-batch helper again, the same code as before, and here we are building the whole graph, always with the same process. We define the input variables: the input matrix, and an lvector for the targets, a vector of longs, because the targets here are class indices and not one-hot vectors or masks or something like that. Then we create the first layer, a convolution-and-pooling layer, passing it the image shape and the filter shape. Here, the image shape is mostly for efficiency; you don't really have to pass it for these particular models, but you do need the shape of the filters (you have the filters anyway). The convolution layers can handle arbitrary-sized images, but after that we want to flatten the feature maps and feed them into a fully connected layer and into the prediction layer, so this size has to be fixed: we have to know the input size of the fully connected layer. The feature maps have four dimensions, and we flatten them into a matrix.
Then we have the fully connected layer, and the output layer, the classifier, the same as before. We want the final cost to be the negative log likelihood. We have, again, the errors, and the parameters, which are the concatenation of the parameters of all the layers. Once we have that, we can build the gradients: just one call to grad, of the cost with respect to the params. Get the updates: again, just regular SGD, but we could have a class or something that implements momentum, whatever you need. Compile the function. And here we have, again, the early-stopping routine, with the same main loop over all the epochs until we're done, then a loop over the mini-batches, validating every once in a while and stopping when it's finished. So, let's just declare all that and load the data.
We get the same output as before, exactly the same process, and here we could actually run it. This was the result of a previous run; it took five minutes, so I will probably not have time to run it now, but here you can see basically what the result looks like. If you want to try it during the lunch break or later, you're welcome to play with it.
And after that, you can visualize the learned filters as well. Here you have the filters of the first layer, and here you have an example of the activations of the first layer for one example, so we get a little bit more information. So, let's go to the last example.
If you go back to the home of the Jupyter notebook and go to the LSTM notebook: this model is an LSTM that models text one character at a time, predicting each character given the previous ones. I'm not going to go into details, but you can see that the LSTM layer is defined here with variables for all the weight matrices that you need and the different biases for the parameters. So you have a lot of parameters. It would be possible, and sometimes more efficient, to define only one variable that contains the concatenation of a couple of those matrices, which can give a more efficient implementation.
And here's an example of how to use scan for the loop. We define a step function that takes a couple of different inputs: the previous activations and states, the current element of the input sequence, and so on. From them, here are basically the LSTM formulas, with the dot products and the sigmoid or tanh of the different connections inside the cell, and in the end you get the new hidden state and cell state; that is what the step function returns. Once you have that, the step function is passed to theano.scan, where the sequences are the mask and the input. The mask is useful because we are using mini-batches of sequences, and not all the sequences in the same batch have the same length. We group examples of similar lengths together, but they may not all be exactly the same length, so we pad to the longest sequence in the mini-batch (not the longest sequence in the whole set, just in the mini-batch), and we have to remember what the actual lengths of the different sequences are.
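A simplified, hypothetical sketch of that masking trick, using a plain RNN step rather than the full LSTM: where the mask is 0 (padding), the previous hidden state is carried over unchanged.

    import numpy as np
    import theano
    import theano.tensor as T

    floatX = theano.config.floatX
    n_in, n_hid = 10, 20
    W_x = theano.shared(0.01 * np.random.randn(n_in, n_hid).astype(floatX), name='W_x')
    W_h = theano.shared(0.01 * np.random.randn(n_hid, n_hid).astype(floatX), name='W_h')

    x = T.tensor3('x')       # (time, batch, features)
    mask = T.matrix('mask')  # (time, batch): 1 for real steps, 0 for padding

    def step(m_t, x_t, h_tm1):
        h_t = T.tanh(T.dot(x_t, W_x) + T.dot(h_tm1, W_h))
        # keep the old state where the sequence has already ended
        return m_t[:, None] * h_t + (1. - m_t)[:, None] * h_tm1

    h0 = T.zeros((x.shape[1], n_hid))
    hs, _ = theano.scan(step, sequences=[mask, x], outputs_info=h0)
    last_hidden = theano.function([x, mask], hs[-1])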
Here we define the cost function, the cross-entropy over the sequence, and here, again, you see the mask being used so that we don't take into account the predictions made after the end of a sequence. The output layer is a logistic regression, the same as before, using the same kind of cost.
For processing the data, we are using Fuel, which is another tool being developed by a couple of students at MILA. It's nice because it can read plain text data and do some preprocessing on the fly, including what I will show you in a second: grouping sequences of similar length, shuffling them, padding, and so on. That gives a generator that you can then feed into your main loop through a Theano function, so all of that preprocessing happens outside of Theano.
So, here we build our final Theano graph. We have symbolic inputs for the input and the mask; we create the recurrent layer and the output layer, define our cost, and gather all the parameters of both layers. We take the gradients, of course, with respect to all the parameters; as I mentioned, that is going to use backprop through time to get the gradient through the scan operation. The update rule is, again, simple SGD, no momentum, nothing; that's something you can add if you want to. And then we have a function to evaluate the model.
The main loop does the training, and we also have another function that generates one character at a time, given the previous ones; that's why we declare separate inputs here. We have a function that gets the predictions, and we normalize them, because we are working in float32 and sometimes, if you divide by the sum, it doesn't add up to exactly one, so we want higher precision just for that operation. And then we try to generate a sequence every once in a while.
So, during training, we generate a sequence every once in a while with the current model. For monitoring, we prompt it with "the meaning of life is", and then we let the network generate. If I tried to run it now, it would take a long time, but here are some examples from a previous run. The first one is from a model that had not been trained much, and it has a couple of unusual characters: it's not usual to have, say, one Chinese character in the middle of a word, or punctuation in the middle of words, and so on. As training goes on and we keep generating a sequence every once in a while, we see that it's getting slowly better and better: "the meaning of life is the that", and so on.
So, of course, this is not what's going to give you state-of-the-art results.
So, yeah, I interrupted the training at some point, but you can play with it a little bit, and here are some suggestions of things you might want to try: better training strategies, different non-linearities inside the LSTM cell, different initializations of the weights, trying to generate something other than "the meaning of life is", and so on.
So, I hope I could give you a good introduction to what Theano is, what it can be used for, and what you can build on top of it. If you have questions later, there's the theano-users mailing list, we answer questions on Stack Overflow as well, and we would be happy to help.
>> We have time for a few quick questions. There's one here. Could you go to the mic? >> Can you just give a quick example of what debugging might look like in Theano? Could you break something in there and show us what happens and how you figure out what it was? >> Sure. Actually, yeah, I can show you a few examples.
Okay. Let's go to a simple example; I'm just going to use the logistic regression one, and say, for instance, that when I execute this, I don't have the right shape. You can still build the whole symbolic graph, and at the time you actually execute it, you get an error message that says the shapes don't match: let's say that X has this many rows and columns, but Y has only that number of rows. The apply node that caused the error is that dot product, and it shows you the inputs again, but in this case it's not really able to tell you where the expression was defined. So we can go back to where the train function was defined, train_model, theano.function, and pass a mode with the optimizer disabled: mode equals theano.Mode, optimizer None. Is that right? So, let's do that and recompile everything.
And then the updated error message includes the backtrace from when the node was created; it points somewhere in my notebook kernel, and we can go back to that. Of course, there are a lot of things in there, but you know there's a dot product, and it's probably a mismatch between those inputs. So, that's one example. Then there are other techniques we can use: we can have the breakpoints, as I said, and so on. I don't have a tutorial about that right now, but I have some pointers in the slides.
>> One last question. >> I have some models I would like to distribute, and I don't want to require people to install Python and a bunch of compilers and stuff. Do you have any suggestions? >> Okay. So, unfortunately, at this time we're pretty limited in what we can do there. Most of the bookkeeping is done by Python: we use NumPy ndarrays for our intermediate values on the CPU, and a similar structure on the GPU, even though that one might be easier to convert. But all our C code deals with Python objects and does the incref and decref and so on, so that Python manages the memory. So, we're not able to do that; it would be a lot of work.
>> So, how about something like a Docker container? >> Something like that, yes. Recently, even for the GPU, nvidia-docker is quite efficient, and we don't see the slowdowns that we had seen earlier. So, it's not ideal, and if someone has some time and the will to help us disentangle that, they're welcome. >> We reconvene in 55 minutes for the next talk. Have a good lunch break.