
Theano Tutorial (Pascal Lamblin, MILA)


Chapters

0:00
0:36 Objectives
9:06 The back-propagation algorithm
13:34 Computing values
16:23 Graph optimizations
27:41 Visualization, debugging and diagnostic tools
31:52 Examples

Whisper Transcript

00:00:00.000 | >> Thank you. [ Applause ].
00:00:02.000 | >> Okay. So today I'm going to briefly introduce you to Theano,
00:00:09.000 | how to use it, and go over the basic principles behind the
00:00:13.000 | library, and if you paid attention during yesterday's
00:00:17.000 | presentation of TensorFlow, some concepts will be familiar to you
00:00:21.000 | as well. And if you paid attention to Hugo Larochelle's
00:00:28.000 | talk, you will already have heard some of this. I will talk about some
00:00:34.000 | similar concepts as well. So there's going to be four main
00:00:38.000 | parts. So the first one is, well, these slides, an
00:00:42.000 | introduction to what the concepts of Theano are. There is a
00:00:47.000 | companion IPython notebook that's on GitHub. So if you go
00:00:52.000 | to that page or clone that GitHub repository, there is an
00:00:57.000 | example of how to use it, and you can download code snippets
00:01:03.000 | from the slides so that you can run them at the same time.
00:01:07.000 | Then we're going to have a more hands-on example, basically
00:01:12.000 | applying logistic regression to the MNIST data set. And then, if
00:01:18.000 | we have time, we'll go quickly over two more examples:
00:01:24.000 | a convolutional network, and an LSTM used
00:01:30.000 | for character-level generation of text.
00:01:36.000 | So Theano is, we can say, a mathematical symbolic expression
00:01:41.000 | compiler. So what does that mean? It means that it makes it
00:01:45.000 | possible to define symbolic expressions that represent mathematical
00:01:51.000 | computations. The building blocks are quite fine-grained:
00:01:57.000 | it supports all the kind of basic mathematical operations,
00:02:03.000 | like min, max, addition, subtraction, all those kinds of
00:02:10.000 | basic things, not only larger blocks like layers of neural
00:02:16.000 | networks. Theano then manipulates
00:02:22.000 | those expressions, doing graph substitutions, cloning and
00:02:28.000 | replacement, things like that, and also makes it possible to go
00:02:33.000 | through that graph and perform things like automatic
00:02:38.000 | differentiation, symbolic differentiation actually, and to
00:02:43.000 | compile what we call a Theano function. Then
00:02:49.000 | it's possible to use that optimized graph and Theano's
00:02:56.000 | runtime to actually compute some values, some output values
00:03:00.000 | given inputs. We also have a couple of tools that help debug
00:03:07.000 | both Theano's code and the user's code, to inspect and
00:03:12.000 | see if there are any errors in the code. So let's talk about
00:03:17.000 | Theano. So Theano is currently more than eight years old. It
00:03:22.000 | started small with only a couple of contributors from the
00:03:29.000 | ancestor of Mila, which was called LISA at the time, and
00:03:34.000 | it grew a lot. We now have contributors from all over the
00:03:39.000 | world. Theano is used for research prototypes and for industrial
00:03:45.000 | applications in startups and in larger companies.
00:03:51.000 | Theano has also been the base of other software projects
00:03:57.000 | built on top of it, for instance Blocks, Keras,
00:04:04.000 | and so on. Those projects provide a user
00:04:12.000 | interface that is at a higher level, so they have concepts of
00:04:18.000 | layers, of training algorithms, those kinds of things,
00:04:23.000 | whereas Theano is more the backend. One of them is nice
00:04:27.000 | because it has a converter to load Caffe
00:04:32.000 | models. Another project
00:04:38.000 | uses Theano not to do
00:04:45.000 | machine learning, but probabilistic programming. And we have two
00:04:50.000 | other libraries, Platoon and Theano-MPI, which are layers on
00:04:57.000 | top of Theano that handle data parallelism and model
00:05:03.000 | parallelism across multiple devices.
00:05:09.000 | So, how to use Theano? Well, first of all, we are working
00:05:16.000 | with symbolic expressions and symbolic variables. The
00:05:22.000 | workflow is always the same: we define the symbolic expression
00:05:28.000 | first, then we compile a function from it, and then we execute
00:05:36.000 | that function on values. So to define the expression, we
00:05:40.000 | start by defining inputs. The inputs are symbolic
00:05:44.000 | variables that have some type. So you have to declare, for
00:05:49.000 | a matrix, what its data type is, floating point, integers,
00:05:55.000 | and so on. Things like the number of dimensions have to
00:06:00.000 | be known in advance. But the shape is not fixed. The memory
00:06:05.000 | layout is not fixed. So you could have shapes that change
00:06:10.000 | between one mini-batch and the next, or between different calls
00:06:14.000 | that are made. The symbolic expression is defined
00:06:18.000 | in general. So x and y are purely symbolic variables here.
00:06:24.000 | We will give them values later, but for now, that's just
00:06:29.000 | empty. There's another kind of input variable: shared
00:06:34.000 | variables. They are symbolic, but they also hold a
00:06:38.000 | value, and that value is persistent across function
00:06:42.000 | calls. Shared variables are
00:06:47.000 | usually used, for instance, for storing parameters of the model
00:06:51.000 | that you want to learn, and these values can be updated as
00:06:56.000 | well. So here we create two other variables --
00:07:01.000 | shared variables -- from values. This one has two dimensions
00:07:05.000 | because its initial value has two dimensions, and this one
00:07:09.000 | has one dimension, so we can use it to
00:07:13.000 | represent the bias. We can name variables by assigning to the
00:07:18.000 | name attribute. Shared variables do not have a fixed
00:07:22.000 | shape either. They are usually kept fixed in most models, but
00:07:26.000 | it's not a requirement. Then, from these inputs, we can define
00:07:31.000 | expressions that build new variables, intermediate
00:07:35.000 | variables, and so on. So, for instance, here we can
00:07:41.000 | define the product of x and W, add the bias, apply a
00:07:49.000 | function on that, and let's say this is our output variable; and
00:07:53.000 | from the output variable and y, we can define, say, the
00:07:59.000 | cost. Those new variables are connected to the previous
00:08:05.000 | ones through the operations that we defined, and we can
00:08:09.000 | visualize the graph structure by using, for
00:08:13.000 | instance, pydotprint, which is a helper function. Variables
00:08:17.000 | are the squared boxes, and we have other nodes,
00:08:21.000 | called apply nodes, that represent the mathematical
00:08:26.000 | operations applied to them.
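To make this concrete, here is a minimal sketch of those steps in Theano code. The shapes, the sigmoid activation, and the squared-error cost are illustrative choices, not necessarily the exact ones on the slides:

```python
import numpy as np
import theano
import theano.tensor as T

# Purely symbolic inputs: dtype and number of dimensions are fixed,
# but the actual shape can change from one call to the next.
x = T.matrix('x')
y = T.vector('y')

# Shared variables: symbolic, but they hold a persistent value.
W = theano.shared(np.random.randn(3, 1).astype(theano.config.floatX), name='W')
b = theano.shared(np.zeros(1, dtype=theano.config.floatX), name='b')

# Build new (intermediate and output) variables from the inputs.
out = T.nnet.sigmoid(T.dot(x, W) + b).flatten()
cost = T.mean((out - y) ** 2)   # an assumed squared-error cost

# Draw the graph structure of an expression.
theano.printing.pydotprint(cost, outfile='cost_graph.png',
                           var_with_name_simple=True)
```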
00:08:30.000 | So input variables and shared variables do not have any
00:08:36.000 | ancestors: they don't have any nodes leading into them, but then
00:08:40.000 | you see the intermediate results derived from them.
00:08:46.000 | Usually, when we visualize, we don't necessarily care about all
00:08:50.000 | the intermediate variables unless they have a name or
00:08:54.000 | something. So here, we have exactly the same graph where we
00:08:58.000 | hide the unnamed intermediate variables, but you can still see
00:09:02.000 | all the operations. And actually, you see the types on
00:09:08.000 | the edges. So once you have defined some
00:09:12.000 | graphs, say your forward computation for your model, we
00:09:16.000 | want to be able to use back-propagation to get gradients.
00:09:23.000 | So here, we have just the basic concept of the chain rule. We
00:09:29.000 | have a scalar cost, we have intermediate variables that are
00:09:35.000 | vectors. Here is just the chain rule starting from the cost.
00:09:42.000 | The whole derivative, the Jacobian, of an intermediate function g
00:09:50.000 | in that chain is
00:09:56.000 | an M-by-N matrix if its input and output are vectors of size
00:10:02.000 | N and M. Usually, you don't need that, and it's usually a
00:10:06.000 | bad idea to compute it explicitly unless you need it
00:10:10.000 | for some other purpose. The only thing you need is an
00:10:14.000 | expression that, given any vector representing the
00:10:18.000 | gradient of the cost with respect to the output, returns the
00:10:22.000 | gradient of the cost with respect to the input: basically,
00:10:26.000 | the dot product between that vector and the
00:10:30.000 | whole Jacobian matrix. That's also called the L
00:10:36.000 | operator sometimes. And so almost all operations in
00:10:43.000 | Theano implement a grad method that returns that. It actually
00:10:49.000 | returns not numbers, not a numerical value, but a symbolic
00:10:55.000 | expression that represents that computation, again, usually
00:11:00.000 | without having to explicitly represent or define that whole Jacobian matrix.
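Roughly in equations, this is a reading of the slide rather than a verbatim copy: a scalar cost with vector-valued intermediate steps, the chain rule, and the vector-Jacobian product (the "L-operator") that each operation actually needs:

```latex
% Scalar cost C, intermediate vectors: x -> v = h(x) -> u = g(v) -> C = f(u)
\frac{\partial C}{\partial x}
  = \frac{\partial C}{\partial u}\,
    \frac{\partial u}{\partial v}\,
    \frac{\partial v}{\partial x}

% If v \in \mathbb{R}^N and u = g(v) \in \mathbb{R}^M, the Jacobian
% \partial u / \partial v is an M x N matrix.  Each op only needs the
% vector-Jacobian product ("L-operator"), never the full Jacobian:
\delta \;\mapsto\; \delta^{\top}\,\frac{\partial u}{\partial v},
\qquad \delta = \frac{\partial C}{\partial u}
```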
00:11:05.000 | So you can call theano.grad, which will back-propagate
00:11:11.000 | from the cost to the inputs. It will start from the cost
00:11:17.000 | towards the inputs that you give, and along the way, it
00:11:21.000 | will call that grad method of each operation, back-
00:11:25.000 | propagating, starting from one for the cost and back-
00:11:29.000 | propagating through the whole graph, accumulating when the
00:11:33.000 | same variable is used more than once, and so on.
00:11:38.000 | That's basically what Theano does, the same way as if you
00:11:44.000 | had manually defined the gradient expression using Theano
00:11:48.000 | operations like the dot product, the sigmoid, and so on
00:11:52.000 | that we've seen earlier. So we have non-numerical values at
00:11:56.000 | that point: the gradients are part of the computation graph,
00:12:01.000 | which was extended to add these new
00:12:07.000 | variables. We can then keep
00:12:11.000 | extending the graph from these variables,
00:12:15.000 | for instance, to compute the expressions corresponding to
00:12:19.000 | gradient descent, something like that, like we do here.
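Continuing the same hypothetical example, a minimal sketch of that step (the learning rate value is illustrative):

```python
# Symbolic gradients of the scalar cost with respect to the parameters.
g_W, g_b = theano.grad(cost, [W, b])

# Gradient-descent expressions: these are new symbolic variables too,
# and they extend the same computation graph.
lr = 0.1  # assumed learning rate
new_W = W - lr * g_W
new_b = b - lr * g_b
```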
00:12:23.000 | So, for instance, this is what the extended graph for the
00:12:27.000 | gradients looks like. You see there are a lot of small
00:12:31.000 | operations that have been inserted. The outputs you
00:12:35.000 | can see here are the gradient expressions,
00:12:39.000 | and an intermediate result that helps compute the gradient
00:12:43.000 | with respect to the weights. And here's the graph for the update
00:12:49.000 | expressions. You have, as intermediate
00:12:54.000 | variables, the gradients that we had on the previous slide,
00:12:59.000 | and then basically just the scaled version, with a constant
00:13:04.000 | learning rate. So once we have defined the whole graph, the whole
00:13:10.000 | expression that we actually care about, from the inputs and
00:13:16.000 | initial weights to the weight updates for our training
00:13:21.000 | algorithm, we want to compile a function that will be able to
00:13:25.000 | actually compute those numbers, given inputs, and perform the
00:13:29.000 | computation. To do that,
00:13:35.000 | we call theano.function, and we provide it
00:13:39.000 | with the input variables that we want to feed, and the output
00:13:43.000 | variables that we want to get. And you don't necessarily have
00:13:46.000 | to provide values for all of the inputs that you might have
00:13:51.000 | declared, especially if you don't want to go all the way
00:13:55.000 | back to the beginning. So we can actually compile an
00:14:00.000 | expression for a subset of the graph. For instance, we can
00:14:04.000 | have a predict function here that goes only from x to out.
00:14:08.000 | We don't need values for y, and the cost, the
00:14:13.000 | gradients, and so on, will not be computed. It's just going to
00:14:17.000 | take a small part of the graph and make a function out of it.
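As a rough sketch, still assuming the variables from the earlier example, compiling and calling such functions looks like this:

```python
# Compile a function for a subset of the graph: predicting only needs x.
predict = theano.function([x], out)

# A monitoring function with two outputs also needs the target y.
monitor = theano.function([x, y], [out, cost])

# Call compiled functions with NumPy values; the shared variables W and b
# are implicit inputs and are fetched automatically.
x_val = np.random.rand(4, 3).astype(theano.config.floatX)
y_val = np.random.rand(4).astype(theano.config.floatX)
print(predict(x_val))
print(monitor(x_val, y_val))
```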
00:14:24.000 | To call such a function, you
00:14:28.000 | have to provide values for all of the input variables that
00:14:32.000 | you defined. You don't have to provide values for shared
00:14:36.000 | variables, the W and b that we declared earlier. They are
00:14:40.000 | implicit inputs to all of the functions, and their value will
00:14:45.000 | automatically be fetched when it's needed.
00:14:49.000 | You can declare other functions, like a monitoring function
00:14:53.000 | that outputs both the prediction and the cost. Since it has two
00:14:57.000 | outputs, it also needs the second input, y. You can also compile
00:15:02.000 | a function that does not start from the beginning. Like,
00:15:06.000 | for instance, if I want an error function that only computes the
00:15:11.000 | mismatch between the prediction and the actual target, then I
00:15:15.000 | don't have to start from the input; I can just start from
00:15:19.000 | the prediction. So that's the first thing.
00:15:23.000 | Then, the next thing that we want to do is update shared
00:15:28.000 | variables, which is necessary for training. Again, you can
00:15:32.000 | pass to theano.function a list of updates. Updates
00:15:37.000 | are pairs of a shared variable and a symbolic expression that
00:15:42.000 | will compute the new value for that shared variable.
00:15:46.000 | So, for example, the updated W and b here are
00:15:52.000 | like implicit outputs of the function, just like W and b were
00:15:56.000 | implicit inputs: they will
00:16:00.000 | be computed at the same time as the cost. Then, after all the
00:16:05.000 | outputs are computed, the updates actually take effect
00:16:10.000 | each time the train function is called.
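Continuing the same hypothetical example, a train function with updates might be compiled like this:

```python
# updates: a list of (shared_variable, new_value_expression) pairs.
train = theano.function(
    inputs=[x, y],
    outputs=cost,
    updates=[(W, W - lr * g_W),
             (b, b - lr * g_b)])

# Each call computes the cost, then applies the updates to W and b.
print(b.get_value())
train(x_val, y_val)
print(b.get_value())   # the value has changed
```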
00:16:15.000 | Here, if we print the value of b before and after having
00:16:21.000 | called the train function, we see that
00:16:25.000 | the value has changed. What also happens during graph
00:16:31.000 | compilation is that the subgraph we selected for that
00:16:41.000 | function is going to be rewritten in parts:
00:16:45.000 | some expressions will be substituted, and so on.
00:16:50.000 | And there are different goals for that.
00:16:53.000 | Some are quite simple. For instance, if we have the same
00:16:59.000 | computation defined twice, we only want it to be executed
00:17:03.000 | once. And some other things are not necessary at all and
00:17:07.000 | should not be computed. For instance, if you
00:17:12.000 | have x divided by x, and x is not used anywhere else, we just
00:17:17.000 | want to replace that by one. There are numerical stability
00:17:22.000 | optimizations. For instance, log of 1 plus x can lose precision
00:17:28.000 | when x is very small, and it gets
00:17:33.000 | replaced by a more stable form. Things like log of softmax get
00:17:38.000 | optimized into a more stable operation. It's also the time
00:17:42.000 | where in-place and destructive operations are inserted. For
00:17:46.000 | instance, if an operation is the last to be executed on some
00:17:50.000 | values, it can, instead of allocating new output memory,
00:17:54.000 | reuse the input memory and work in place. The transfer of the graph
00:18:00.000 | expressions to the GPU is also done during the optimization phase.
00:18:06.000 | So, by default, Theano tries to apply most of the optimizations
00:18:12.000 | so that you have a run time that's almost as fast as
00:18:15.000 | possible, except for a couple of checks and assertions. But if
00:18:19.000 | you want faster feedback during development and don't care that
00:18:25.000 | much about the run time speed, then you have a couple of ways
00:18:31.000 | of enabling and disabling some sets of optimizations, and you
00:18:37.000 | can do that either globally or function by function.
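A hedged sketch of the kind of knobs available; the flag values below are real Theano options, but which ones the talk used is not specified:

```python
# Per function: compile quickly, with only a few optimizations.
quick_predict = theano.function([x], out, mode='FAST_COMPILE')

# Per function: disable the optimizer entirely (handy when debugging).
raw_predict = theano.function([x], out, mode=theano.Mode(optimizer=None))

# Globally, the same can be done with a flag, e.g.:
#   THEANO_FLAGS="optimizer=fast_compile" python train.py
```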
00:18:41.000 | So, to have a look at, for instance, what happens during
00:18:47.000 | the optimization phase, here's the original unoptimized graph
00:18:53.000 | going from the inputs x and W to the output prediction.
00:18:59.000 | It's the same one that we've seen before. And we can compare
00:19:03.000 | that with the compiled function that goes from
00:19:09.000 | these input variables to out, which was called predict; this
00:19:13.000 | is the graph that we saw before. I won't go into detail
00:19:19.000 | about what's happening in there, but here you have a GEMV
00:19:25.000 | operation, which basically calls an optimized BLAS routine that
00:19:29.000 | can do multiplication and accumulation at the same time.
00:19:35.000 | We have a sigmoid operation here that works in place.
00:19:41.000 | If you have a look at the
00:19:47.000 | optimized graph computing the expressions for the updated W and b,
00:19:53.000 | this was the original one, and the optimized one is much
00:19:59.000 | smaller. It also has in-place operations. It has fused
00:20:05.000 | operations: for instance, if you have a whole tensor and
00:20:09.000 | you want to do an addition with a constant, and then a sigmoid, and
00:20:15.000 | then something else, and so on, you want to only loop once
00:20:19.000 | through the array and apply all the scalar operations on each
00:20:23.000 | element before going to the next, and not iterate each
00:20:27.000 | time you apply a new operation. Those kinds
00:20:31.000 | of fused operations are generated
00:20:35.000 | automatically. And here you see the updates for the shared
00:20:41.000 | variables, which are also inputs. You see the cost and the
00:20:47.000 | implicit outputs for the updated W and b here and here.
00:20:52.000 | Another graph visualization tool that exists is debug-
00:20:57.000 | print, which basically prints a text-based structure of the
00:21:01.000 | graph. So you can see the node IDs and the
00:21:07.000 | variable names and so on. Here you can see in more detail
00:21:13.000 | what the structure is, including things like the scaling
00:21:19.000 | constants.
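A one-line sketch of that call, assuming the compiled predict function from before:

```python
# Text dump of the optimized graph of a compiled function.
theano.printing.debugprint(predict)
```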
00:21:23.000 | So when the function is compiled, we can actually run it.
00:21:30.000 | The function is a callable Python object, and we've seen those examples
00:21:38.000 | here, for instance, where we call train and so on.
00:21:44.000 | But to have, say, an optimized run time, it's not
00:21:54.000 | just Python code that is running. We do on-the-fly
00:22:00.000 | code generation: we also generate C++ or CUDA code.
00:22:06.000 | For instance, for the elemwise loop fusion that I mentioned, we
00:22:10.000 | can't know in advance which element-wise operations will
00:22:15.000 | occur, in which order, in any graph that the user might
00:22:19.000 | be running, so we generate code for that. We
00:22:25.000 | generate a Python module written in C++ or CUDA that gets
00:22:29.000 | compiled and imported back so we can use it from Python.
00:22:34.000 | The run time environment then calls, in the right order, the
00:22:40.000 | different operations that have to be executed, from the inputs
00:22:44.000 | that we get, to produce the desired
00:22:49.000 | results. We have a couple of different runtimes, and in
00:22:53.000 | particular, there's one which is written in C++ and
00:22:57.000 | avoids having to switch context between the Python
00:23:01.000 | interpreter and the C++ execution engine.
00:23:05.000 | Something else that's really crucial for speed
00:23:09.000 | is GPU support. We wanted to make it as simple as
00:23:15.000 | possible in the usual cases. The new backend supports a couple of
00:23:23.000 | different data types, not only float32, but double precision
00:23:28.000 | if you really need that, and integers as well. It also
00:23:34.000 | exposes GPU arrays
00:23:40.000 | to Python itself, so
00:23:44.000 | you can just use Python code to handle GPU arrays outside of a
00:23:49.000 | Theano function if you like. All of that will be in the future 0.9
00:23:53.000 | release that we hope to get out soon.
00:23:56.000 | And to use it, well, you select the device that you want to
00:24:00.000 | use with just a configuration flag. For
00:24:06.000 | instance, device=cuda to get the first GPU that's available, or
00:24:12.000 | one specific one. And if you specify that in the
00:24:16.000 | configuration, then all shared variables will by default be
00:24:21.000 | created in GPU memory, and the optimizations that replace
00:24:26.000 | CPU operations by GPU operations are going to be applied.
00:24:32.000 | Usually, you want to make sure you use float32, or even float16
00:24:38.000 | for storage, which is experimental, because most
00:24:43.000 | GPUs don't have good performance for double
00:24:48.000 | precision. So how do you set those configuration flags? You can
00:24:54.000 | set them in the configuration file -- it's just a
00:25:03.000 | configuration file that Theano reads. You also have an environment
00:25:07.000 | variable where you can define them, and the environment
00:25:11.000 | variable overrides the config file, and you can also set
00:25:15.000 | things directly from Python. But some flags have to be known
00:25:19.000 | before Theano is imported. So if you want to set the device
00:25:24.000 | itself, you have to set it either in the configuration file
00:25:30.000 | or through the environment variable.
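A sketch of the three places configuration can come from; the device name and float settings are examples, not necessarily the ones used in the talk:

```python
# The device has to be chosen before Theano is imported, either in the
# ~/.theanorc configuration file:
#   [global]
#   device = cuda0
#   floatX = float32
# or through the environment variable, which overrides the file:
#   THEANO_FLAGS="device=cuda0,floatX=float32" python train.py

import theano
print(theano.config.device)   # fixed once Theano is loaded
print(theano.config.floatX)   # some other flags can still be changed here
```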
00:25:36.000 | So I'm going to quickly go over more advanced topics, and if
00:25:41.000 | you want to learn more about that, there are other tutorials
00:25:46.000 | on the web, and there's a lot of documentation on
00:25:51.000 | deeplearning.net. So to have loops in the graph, we've seen
00:25:55.000 | that the expression graph is basically a directed acyclic
00:26:00.000 | graph, and we cannot have loops in there. One way, if you know
00:26:05.000 | in advance the number of iterations, is just to unroll
00:26:09.000 | the loop, using a for loop in Python that builds all the nodes of
00:26:13.000 | the loop. But that doesn't work if you want, for instance, a
00:26:20.000 | dynamic number of iterations. For models that generate
00:26:26.000 | sequences, for instance, it can be an issue.
00:26:30.000 | What we have for that in Theano is called scan, and
00:26:36.000 | basically, it's one node that encapsulates another whole Theano
00:26:42.000 | function, and it's going to
00:26:48.000 | represent the computation that has to be done at each time
00:26:52.000 | step. So you have a Theano function that performs the
00:26:56.000 | computation for one time step, and you have the scan node that
00:27:00.000 | calls it in a loop, taking care of the bookkeeping of
00:27:04.000 | indices and sequences and feeding the right slice at the
00:27:08.000 | right time. Having that structure also makes it
00:27:13.000 | possible to define a gradient for that node, which is
00:27:17.000 | basically another scan node, another loop that goes
00:27:21.000 | backwards and applies backprop through time. And it can be
00:27:25.000 | transferred to GPU as well, in which case the internal
00:27:29.000 | function is going to be transferred and recompiled on
00:27:34.000 | the GPU. And we will talk about the LSTM example later.
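For illustration, a small sketch of theano.scan computing a cumulative sum over a vector; this is a made-up example, not the one from the slides:

```python
v = T.vector('v')

# The step function describes the work of a single time step: it gets one
# slice of the sequence and the output of the previous step.
def step(element, running_total):
    return running_total + element

outputs, updates = theano.scan(fn=step,
                               sequences=v,
                               outputs_info=T.zeros((), dtype=v.dtype))

cumulative_sum = theano.function([v], outputs, updates=updates)
print(cumulative_sum(np.arange(5).astype(theano.config.floatX)))
# -> [ 0.  1.  3.  6. 10.]
```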
00:27:40.000 | There is just a small example on the slides, but we don't really have time
00:27:44.000 | for that. We also have visualization, debugging, and
00:27:50.000 | diagnostic tools. One of the reasons they are important is that
00:27:54.000 | in Theano, like in TensorFlow, the definition of a function is
00:27:59.000 | separate from its execution. So if something doesn't work
00:28:05.000 | during the execution, if you encounter errors and so on,
00:28:09.000 | then it's not obvious how to connect that to where the
00:28:13.000 | expression was actually defined. So we try to have
00:28:20.000 | informative error messages, and we have some compilation
00:28:25.000 | modes that check for not-a-number or very
00:28:30.000 | large values. You can also assign test values to the symbolic
00:28:35.000 | variables so that each time you create a new symbolic
00:28:40.000 | intermediate variable, each time you define a new expression,
00:28:45.000 | the test value gets computed, and so you can
00:28:49.000 | evaluate on one piece of data at the same time as you build
00:28:53.000 | the new expression. So you can catch mistakes,
00:28:57.000 | errors, and things like that early.
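A small sketch of the test-value mechanism; the shapes are illustrative:

```python
theano.config.compute_test_value = 'raise'   # evaluate test values eagerly

a = T.matrix('a')
a.tag.test_value = np.random.rand(4, 3).astype(theano.config.floatX)

M = theano.shared(np.random.rand(5, 3).astype(theano.config.floatX), name='M')

# This raises immediately, at definition time, because of the shape
# mismatch (3 vs. 5), instead of failing later inside a compiled function.
h = T.dot(a, M)
```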
00:29:00.000 | It's possible to extend Theano in a couple of ways. You can
00:29:05.000 | create an op just from Python, for instance by wrapping
00:29:12.000 | existing efficient libraries. You can also extend Theano by writing
00:29:18.000 | new graph optimizations, either for increased numerical stability,
00:29:23.000 | for instance, or for more efficient computation, or for
00:29:28.000 | substituting your new ops for the naive versions that a
00:29:35.000 | user might have used.
00:29:39.000 | We have a couple of new features that have been
00:29:43.000 | added to Theano. I mentioned the new GPU backend, with support
00:29:49.000 | for many data types, and we've had some performance
00:29:54.000 | improvements, especially for convolution, 2D and 3D, and
00:29:59.000 | especially on GPU. We've made some progress on reducing the
00:30:05.000 | time of the graph optimization phase, and also
00:30:10.000 | improved the performance of the compiled functions. We have new ways of
00:30:15.000 | avoiding recompiling the same graph over and over again, and
00:30:19.000 | we have new diagnostic tools that are quite useful: an
00:30:24.000 | interactive graph visualization tool, and PDB breakpoints that
00:30:28.000 | enable you to monitor a couple of variables and only break if
00:30:33.000 | some condition is met, rather than monitoring something every
00:30:37.000 | time. In the future, while we're still
00:30:42.000 | working on new operations on GPU, we want to wrap more
00:30:49.000 | operations for better performance; in particular, wrapping the
00:30:55.000 | basic RNNs should be completed in the following days,
00:30:59.000 | hopefully. Someone has been working on that a lot recently.
00:31:04.000 | And, of course, better support for 3D convolutions, still
00:31:10.000 | faster optimization, and more work on data parallelism as
00:31:15.000 | well. So, yes, I want to thank
00:31:20.000 | most of my colleagues and the main Theano developers, the people who
00:31:26.000 | contributed one way or another, our lab and its software
00:31:31.000 | development team, and the
00:31:37.000 | organisers of this event. Now, yes, the slides are
00:31:43.000 | available online. As I mentioned, there is a
00:31:49.000 | companion notebook, and there are more
00:31:53.000 | resources if you want to go further. And now I think it is
00:31:57.000 | time to move on to the demo. So, for those who have not
00:32:06.000 | cloned the repository yet, then this is the command line you
00:32:11.000 | want to launch. For those who have cloned it, you might want
00:32:16.000 | to do a Git pull, just to get the latest - to make sure we
00:32:21.000 | have the latest version of the Jupyter notebook on the
00:32:29.000 | repository itself. So we have three examples that we are
00:32:33.000 | going to go over. Logistic regression,
00:32:39.000 | ConvNet, and LSTM. So I've launched the Jupyter
00:32:45.000 | notebook here. So, "intro Theano" was the companion notebook.
00:32:50.000 | Then we have the logistic regression
00:32:56.000 | example. So let's go with the logistic
00:33:00.000 | regression. Is that big enough, or do I need to increase the
00:33:06.000 | font size? Okay. So I'm going to skip over the text, because
00:33:13.000 | you probably know already about the model. We
00:33:19.000 | have some data that we want to load; it comes
00:33:28.000 | with the repository on GitHub. So let's load the
00:33:35.000 | data. And here, let's see how we define the model. It's
00:33:40.000 | basically the same way that we did in the slides. We define
00:33:46.000 | an input variable. Here it's a matrix, because we want to use
00:33:52.000 | mini-batches. And we have shared variables initialised to
00:33:59.000 | zeros. Then we define our model. So, here's our
00:34:10.000 | predictor, the probability of the class given the input:
00:34:17.000 | here, the affine transformation, and
00:34:25.000 | then the softmax on top of it. And the prediction, if you want
00:34:31.000 | to have a prediction, is going to be the class of maximum
00:34:37.000 | probability, so an argmax over that axis, because we still want one
00:34:43.000 | prediction for each element of the mini-batch.
00:34:49.000 | Then we define the loss function. Here it is going to be
00:34:53.000 | the negative log likelihood of the label given the input, or the
00:34:59.000 | cross-entropy, and we define it simply: we
00:35:03.000 | don't need to have one cross-entropy or log likelihood
00:35:09.000 | operation by itself. You can just build it from the basic
00:35:12.000 | building blocks. You take the log of the probabilities, you
00:35:17.000 | take the entry at the index of the actual target, and then you take the
00:35:24.000 | mean of that to have the mean loss over the mini-batch.
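A rough sketch of building that loss (and, further below, the error rate) from basic operations; variable names and sizes are illustrative:

```python
x = T.matrix('x')
y = T.lvector('y')   # integer class labels for the mini-batch

W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)   # class probabilities
y_pred = T.argmax(p_y_given_x, axis=1)          # one prediction per example

# Negative log likelihood from basic blocks: log, fancy indexing, mean.
nll = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

# Misclassification rate, useful later for monitoring.
misclass = T.mean(T.neq(y_pred, y))
```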
00:35:29.000 | And then you take the gradient and derive the update rules. So,
00:35:36.000 | again, we don't have, like, one gradient descent object or
00:35:41.000 | something like that. We just build whatever rule we want.
00:35:48.000 | So, yeah, we could use momentum by defining other shared
00:35:54.000 | variables, like a velocity, and then writing the update
00:35:59.000 | expressions for both the
00:36:04.000 | velocity and the shared variable itself.
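A sketch of what that could look like; the plain SGD rule follows the talk, while the momentum variant is just one possible way to write it (learning rate and momentum values are made up):

```python
params = [W, b]
grads = theano.grad(nll, params)

lr = 0.1
updates = [(p, p - lr * g) for p, g in zip(params, grads)]   # plain SGD

# Optional momentum: one extra shared variable (the velocity) per parameter.
momentum = 0.9
updates_momentum = []
for p, g in zip(params, grads):
    v = theano.shared(np.zeros_like(p.get_value()))
    updates_momentum.append((v, momentum * v - lr * g))   # new velocity
    updates_momentum.append((p, p + momentum * v - lr * g))  # p + new velocity
```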
00:36:09.000 | And then we compile a train function going from x and y,
00:36:14.000 | outputting the loss, and updating W and b.
00:36:19.000 | So, yeah, we compile the train function, and the
00:36:24.000 | graph is getting optimized.
00:36:27.000 | Let's see the next step. Well, we also want to monitor not
00:36:32.000 | only the log likelihood, but actually the misclassification
00:36:39.000 | rate on the validation and test sets. So it's simply the
00:36:45.000 | mismatch rate, the mean over the mini-batch, and
00:36:51.000 | we compile another function, not doing
00:36:57.000 | any updates, of course. So, to train the model, well, first,
00:37:02.000 | we need to process the data a little bit. We want to feed
00:37:07.000 | the model one mini-batch of data at a time. So here we have
00:37:12.000 | a helper function that gives us
00:37:18.000 | mini-batch number i; it's not a Python generator, just a helper,
00:37:23.000 | and it's going to be the same
00:37:27.000 | function used for the training, validation, and test sets.
00:37:32.000 | We define a couple of parameters for early stopping in the
00:37:36.000 | training loop. It's not strictly necessary; it just makes the
00:37:41.000 | loop a little bit more complex, and keeps track of the best model
00:37:46.000 | encountered during the optimization. So let's define that.
00:37:51.000 | And this is the main training loop. It's a bit more complex
00:37:56.000 | than it needs to be, but that's because we use this early
00:38:01.000 | stopping, and we only validate every once in
00:38:06.000 | a while. So, we have a couple of parameters
00:38:11.000 | that we could tune, but basically, the most important
00:38:16.000 | part is: you loop over the epochs unless you encounter the
00:38:23.000 | early stopping condition, and then, during each epoch, you
00:38:28.000 | loop over the mini-batches and call the train
00:38:33.000 | function. Then we want to get some measure of the validation
00:38:38.000 | error, so here we call the test model on the validation set for
00:38:43.000 | that, and then keep track of what the best model currently
00:38:50.000 | is, and get the test error as well.
00:38:57.000 | And save the best one. To save
00:39:02.000 | the best one, we save the values of all parameters,
00:39:07.000 | which is more robust than trying to pickle the whole Python
00:39:14.000 | object, and it also enables easier transfer to other
00:39:19.000 | frameworks, to visualization frameworks, and so on. So let's
00:39:22.000 | try to execute that. So, of course, it's a simple
00:39:27.000 | model, so it should not
00:39:34.000 | take that long. You see that at the
00:39:42.000 | beginning, well, almost at each iteration, we are better on the
00:39:46.000 | training set, and then, after a while, the progress is slower.
00:40:00.000 | So, wait a little bit more. It seems to stall more and more.
00:40:08.000 | Okay. And here is the end after 96 epochs.
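In connection with saving the best parameters mentioned above (and the get_value call used just below), a hedged sketch of how that can be done:

```python
# Pull the current parameter values out of the shared variables ...
best_W = W.get_value()
best_b = b.get_value()

# ... save them as plain NumPy arrays rather than pickling Python objects ...
np.savez('best_model.npz', W=best_W, b=best_b)

# ... and restore them later by setting the shared variables' values.
saved = np.load('best_model.npz')
W.set_value(saved['W'])
b.set_value(saved['b'])
```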
00:40:15.000 | So, now, if we want to visualize what filters were learned, or
00:40:22.000 | what we learned, we are using a helper function here to
00:40:28.000 | visualize the filters. It's not really important. But here,
00:40:32.000 | what we do is call get_value on the weights to access the
00:40:40.000 | internal value of the shared variable, and then we use that
00:40:46.000 | to visualize the filters. So, we have the
00:40:50.000 | filters, and we can see they're kind of reasonable, like,
00:40:54.000 | this is the filter for class 0, and you can see kind of like a
00:40:59.000 | 0, 1, what's important for the two is to have an opening here,
00:41:05.000 | and so on. So, yeah, if we have a look at the
00:41:19.000 | training error -- well, do we see the training error? No,
00:41:24.000 | I'm not plotting it. But the validation and the test errors are
00:41:29.000 | quite high, and we know that human-level error is
00:41:33.000 | much lower, and the error of other models is much lower, so
00:41:37.000 | it really means that the model is too simple, and we should use
00:41:44.000 | something more advanced. To use something more advanced, if
00:41:51.000 | you go back to the home of the Jupyter notebook, you can have a
00:42:00.000 | look at the ConvNet notebook, and run it.
00:42:00.000 | So, this new example uses basically the same data,
00:42:04.000 | but it's a bit more advanced and a bit more optimized,
00:42:09.000 | because it has the advantage of training fast even on an older
00:42:14.000 | laptop. This time, we're going to use a convolutional net
00:42:19.000 | with a couple of convolution layers, fully connected
00:42:24.000 | layers, and a final classifier. So I'm going to go
00:42:29.000 | through this fairly quickly. Let's see how we could use
00:42:35.000 | Theano to define helper classes, layers, that make it
00:42:41.000 | easier for a user to compose them if they want to replicate
00:42:47.000 | some results, or use some classical architectures.
00:42:53.000 | This is usually done in the frameworks built on top of Theano:
00:42:58.000 | they develop their own mini-framework with their
00:43:04.000 | own versions of layers and so on that they find useful and
00:43:10.000 | intuitive. So, this logistic regression layer
00:43:17.000 | basically holds, well, its parameters, weight and bias,
00:43:26.000 | and it's a very simple classifier. It has expressions for the
00:43:32.000 | class probabilities and the prediction, it holds the params, and it has expressions for
00:43:38.000 | the negative log likelihood and the errors. So, if you were to
00:43:44.000 | use only that class, it would do essentially the same as
00:43:50.000 | the previous example. In the same way, we can define a layer
00:43:57.000 | that does convolution and pooling. So, again, in the init
00:44:03.000 | method, we pass it, well, the filter shape, the image shape, the size of
00:44:09.000 | the pooling, and so on. We initialise the weights,
00:44:15.000 | and from the input
00:44:23.000 | we compute the convolution with the filters. We then
00:44:31.000 | compute max pooling, and output, well, tanh of the pooling plus
00:44:39.000 | the bias. Here, the bias is only
00:44:45.000 | one number for each channel, which means that you don't have
00:44:49.000 | a different bias for each location in the image. So, you
00:44:54.000 | could actually apply such a layer to images of various sizes
00:45:00.000 | without having to initialise new parameters or retrain it.
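A compressed sketch of such a convolution-and-pooling layer, in the spirit of the class described above; the names, initialisation, and hyperparameters are illustrative, not the exact notebook code:

```python
from theano.tensor.nnet import conv2d
from theano.tensor.signal.pool import pool_2d

class ConvPoolLayer(object):
    def __init__(self, rng, input, filter_shape, poolsize=(2, 2)):
        # rng: a numpy.random.RandomState
        # filter_shape: (n_filters, n_input_channels, filter_h, filter_w)
        self.W = theano.shared(
            rng.uniform(-0.1, 0.1, size=filter_shape).astype(theano.config.floatX),
            name='W')
        # One bias per output channel, shared across all spatial locations.
        self.b = theano.shared(
            np.zeros(filter_shape[0], dtype=theano.config.floatX), name='b')

        conv_out = conv2d(input=input, filters=self.W)
        pooled = pool_2d(conv_out, poolsize, ignore_border=True)
        self.output = T.tanh(pooled + self.b.dimshuffle('x', 0, 'x', 'x'))
        self.params = [self.W, self.b]
```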
00:45:06.000 | So, we have that layer, and we have a hidden layer, which is just
00:45:12.000 | a fully connected layer. Again, it initialises weights and bias,
00:45:17.000 | and builds a symbolic expression going from
00:45:22.000 | the input and the shared variables to the output after the
00:45:27.000 | activation. And, again, we collect the parameters so
00:45:31.000 | that we can take gradients with respect to all of them later.
00:45:37.000 | And then, here's a function that contains the
00:45:43.000 | main training loop. We have a mini-batch generator, again,
00:45:48.000 | the same code as before, and here, we are building the whole
00:45:52.000 | graph. Always the same process. We define the input
00:46:00.000 | variables. The lvector is a
00:46:06.000 | vector of longs, because the targets here are indices, and
00:46:12.000 | not one-hot vectors or masks or something like that. We
00:46:17.000 | create the first convolution layer, passing it the image shape
00:46:24.000 | and the filter shape. So,
00:46:31.000 | here, the image size changes. This is mostly for efficiency,
00:46:36.000 | actually. You don't really have to pass that for those
00:46:40.000 | particular models. But you still need the shape of the filters -- I
00:46:45.000 | mean, you have the filters anyway. The convolution
00:46:50.000 | layers can handle
00:46:55.000 | arbitrary-sized images. And then,
00:46:59.000 | after that, we want to flatten the whole feature maps and feed
00:47:04.000 | that into a fully connected layer and into the prediction
00:47:07.000 | layer. So this one has to be fixed: we have to know what
00:47:14.000 | the output shape of the convolutional part is; it has four dimensions. And here we
00:47:21.000 | go. A fully connected layer, and the output layer, that's
00:47:26.000 | just the classifier, the same as before. We want the final cost
00:47:32.000 | to be the negative log likelihood of that. We have, again, the errors,
00:47:39.000 | and the parameters, the concatenation of the parameters
00:47:44.000 | of all layers. And once we have that, we can build the
00:47:49.000 | gradients. So, just one call of grad of the cost with respect to
00:47:54.000 | the params. Get the updates. So, again, just regular SGD, but we
00:48:00.000 | could have a class or something that performs momentum,
00:48:06.000 | whatever you need. Compile the function. And here we have,
00:48:12.000 | again, the early stopping routine with the same main loop
00:48:17.000 | over all epochs until we're done. Then loop over the mini-
00:48:21.000 | batches, validate every once in a while, and stop when it's
00:48:25.000 | finished. So, let's just declare that. Loading the data.
00:48:31.000 | So, we have the same output, exactly the same as before. And
00:48:38.000 | here we can actually run that. So, this was the result of a
00:48:46.000 | previous run. That took five minutes, so I will probably
00:48:53.000 | not have time to do that, but here you can see basically what
00:48:58.000 | the result looks like. If you want to try it
00:49:04.000 | during the lunch break or later, you're welcome to play with it.
00:49:12.000 | And after that, yeah, you can visualize the learned filters as
00:49:19.000 | well. So, here you have the filters of the first layer, and
00:49:29.000 | here you have an example of the activations of the first
00:49:34.000 | layer for one example, so we have just a little bit more
00:49:43.000 | information. So, let's just go back to the tutorial -- I mean,
00:49:50.000 | example. So, if you go back to the home of the Jupyter
00:49:59.000 | notebook and go to LSTM -- so, this model is an LSTM
00:50:12.000 | model. It's a model of the next character given the previous
00:50:18.000 | ones. So, not going into details, but here you can see
00:50:26.000 | that the LSTM layer is defined here with variables for all the
00:50:36.000 | matrices that you need and the different biases for the
00:50:39.000 | parameters. So, you have a lot of parameters. It would be
00:50:45.000 | possible, and sometimes more efficient, to actually define,
00:50:51.000 | say, only one variable that contains the concatenation of a
00:50:57.000 | couple of matrices, and that way you can have a
00:51:03.000 | more efficient implementation.
00:51:09.000 | And here's an example of how to use scan for the loop. So,
00:51:15.000 | here we define a step function that takes, well, a couple of
00:51:26.000 | different inputs: you have the previous activations,
00:51:30.000 | you have the current sequence input, and so on, and
00:51:36.000 | from them, here's basically the LSTM formula, where you have the
00:51:43.000 | dot products and sigmoid or tanh of the different connections
00:51:49.000 | inside the cell, and in the end, you have the new hidden state,
00:51:59.000 | which is what you need. So, once you have that,
00:52:05.000 | that step function is going to be passed to theano.scan, where
00:52:13.000 | the sequences are the mask and the input. The mask is
00:52:19.000 | useful because we are using mini-batches of sequences, and not
00:52:23.000 | all the sequences in the same batch have the same length.
00:52:28.000 | So, we group examples of
00:52:34.000 | similar lengths together, but they may not always be exactly
00:52:38.000 | the same length. In that case, we pad only to the
00:52:42.000 | longest sequence in the mini-batch, not the longest
00:52:46.000 | sequence in the whole set, just for the mini-batch, but we have
00:52:50.000 | to pad and remember what the lengths of the different
00:52:54.000 | sequences are. So, let's define that.
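A much-simplified sketch of the masking idea, using a plain recurrent step rather than the full LSTM formula; mask, inputs, h0, W_x, and W_h are assumed to be symbolic variables defined elsewhere, and all names are illustrative:

```python
# x_t: input at time t (batch, dim); m_t: 1 for real data, 0 for padding.
def step(m_t, x_t, h_prev, W_x, W_h):
    h_new = T.tanh(T.dot(x_t, W_x) + T.dot(h_prev, W_h))
    # Where the mask is 0, keep the previous hidden state unchanged.
    h_t = m_t[:, None] * h_new + (1 - m_t[:, None]) * h_prev
    return h_t

h_seq, _ = theano.scan(step,
                       sequences=[mask, inputs],   # time is the first axis
                       outputs_info=h0,
                       non_sequences=[W_x, W_h])
```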
00:53:01.000 | Here, we define the cost function, that's the
00:53:07.000 | cross-entropy of the sequence, and here, again, you see that
00:53:10.000 | the mask is used so that we don't consider the predictions
00:53:14.000 | after the end of the sequence. The logistic regression layer on
00:53:19.000 | top is the same as before, using the same kind of cost.
00:53:26.000 | Here, for processing the data, we are using Fuel, which is
00:53:30.000 | another tool being developed by a couple of students at Mila,
00:53:34.000 | and it's nice because it can read from just plain text data and
00:53:40.000 | do some preprocessing on the fly, including the things that I
00:53:45.000 | will show you in a second. So, we are grouping sequences of
00:53:51.000 | similar length, and then shuffling them, and padding, and
00:53:55.000 | doing all of that. That gives a generator that you
00:54:03.000 | can then feed into your main loop through a Theano function. So,
00:54:07.000 | that whole preprocessing happens outside of Theano, and then the
00:54:12.000 | resulting data is fed into the main loop.
00:54:18.000 | So, yes, here we build our final Theano graph. We have symbolic
00:54:24.000 | inputs for, well, the input and the mask. We create the
00:54:33.000 | layers, define our cost, and collect all the
00:54:38.000 | parameters, including those of the recurrent layer. Take the gradients, of
00:54:43.000 | course, with respect to all the parameters. As I mentioned,
00:54:47.000 | it's going to use backprop through time to get the gradient
00:54:51.000 | through the scan operation. The update rule is, again, simple
00:55:01.000 | SGD, no momentum, nothing; that's something you can add if you
00:55:05.000 | want to. And then, we have a function to evaluate the model.
00:55:12.000 | So, here, the main loop is training, and we also have
00:55:19.000 | another function that generates one character at a time, given
00:55:23.000 | the previous ones. That's why we declare these inputs here. So
00:55:29.000 | we have a function that gets the predictions, and we normalize them,
00:55:36.000 | because we are working in float32, and sometimes, if you divide
00:55:40.000 | by the sum, the result doesn't quite add up to one. So we
00:55:45.000 | want higher precision for just that operation. And then we
00:55:51.000 | try to generate a sequence every once in a while.
00:55:58.000 | So, here are some sequences that were generated
00:56:04.000 | during a previous run. For
00:56:10.000 | monitoring, we seed the prediction with "the meaning of
00:56:15.000 | life is", and then we let the network generate. If I tried
00:56:19.000 | to run it now, it would take long, but here are some examples
00:56:24.000 | of how it works. So, the first one is from a model that was
00:56:29.000 | not trained much yet, and it has, like, a couple of
00:56:34.000 | unusual characters. I mean, it's not usual to
00:56:39.000 | have, like, one Chinese character in the middle of words, or
00:56:44.000 | punctuation in the middle of words, and so on.
00:56:55.000 | As training goes on, we see that it's getting slowly better and
00:57:02.000 | better: "the meaning of life is the that", and so on.
00:57:09.000 | So, of course, this is not what's going to give you the
00:57:15.000 | meaning of life.
00:57:22.000 | So, yeah, I interrupted the training at some point, but you
00:57:29.000 | can play with it a little bit, and here are some suggestions
00:57:34.000 | of things you might want to do, like better training
00:57:40.000 | strategies, different non-linearities inside the LSTM cell,
00:57:46.000 | different initialization of the weights, trying to generate
00:57:50.000 | something other than "the meaning of life is", and so on.
00:57:56.000 | So, I hope I could give you a good introduction to what Theano
00:58:02.000 | is, what it can be used for, and what you can build on top of
00:58:07.000 | it. If you have any questions
00:58:14.000 | later, we have the theano-users mailing list. We are answering
00:58:20.000 | questions on Stack Overflow as well. And we would be happy to
00:58:26.000 | have your feedback. [ Applause ]
00:58:33.000 | >> We have time for a few quick questions. There's one here.
00:58:39.000 | Could you go to the mic? >> Can you just give a quick
00:58:47.000 | example of what debugging might look like in Theano? Could you
00:58:51.000 | break something in there and show us what happens and how
00:58:55.000 | you figure out what it was? >> Sure. Actually, yeah, I
00:58:59.000 | can show you a few examples. Okay. So, let's go to, say, a
00:59:06.000 | simple example. Okay. So, I'm just going to go to the
00:59:13.000 | logistic regression one, and say, for instance, that when I
00:59:23.000 | execute this, I don't have the right shape. You can still
00:59:35.000 | build the whole symbolic graph, and at the time you want
00:59:44.000 | to actually execute it, you get an error message that
00:59:50.000 | says that the shapes don't match: let's say
00:59:56.000 | that x has this number of rows and columns, but y has only that number of
01:00:02.000 | rows. It shows the apply node that caused the error, that dot
01:00:08.000 | product, and gives the inputs again, but in that case, it
01:00:12.000 | tells you it's not really able to tell you where the expression was
01:00:17.000 | defined. So, we can go back to where the
01:00:26.000 | train function was defined, train_model, the theano.function call, and
01:00:33.000 | we can say, mode, optimizer equals none. Sorry, I have to
01:00:49.000 | do mode equals theano.Mode, optimizer, None. Is that
01:00:57.000 | right? So, let's do that. Let's re-run everything.
01:01:11.000 | And then, the updated error message shows a "backtrace when
01:01:17.000 | the node was created", and it points somewhere in my notebook
01:01:23.000 | kernel. So, we can go back to that. Of course, we
01:01:29.000 | have a lot of things in there, but you know that there's a dot
01:01:33.000 | product, and it's probably a mismatch between those. So,
01:01:37.000 | that's one example. Then, among the other techniques that we can
01:01:41.000 | use, we have the breakpoints, as I said, and so on. I
01:01:45.000 | don't have a tutorial about that right now, but I have some
01:01:49.000 | examples. So, I'm going to go back to that.
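A sketch of the knobs used in that answer (plus a related flag), assuming a train_model function like the one in the notebook; x, y, cost, and updates are the symbolic variables defined earlier:

```python
# Disable graph optimizations so error messages refer to the graph as written.
train_model = theano.function([x, y], cost, updates=updates,
                              mode=theano.Mode(optimizer=None))

# A related option: ask Theano to include more detail (including where the
# faulty node was created) in its error messages:
#   THEANO_FLAGS="exception_verbosity=high" python script.py
```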
01:01:55.000 | >> One last question. >> I have some models I would like
01:01:58.000 | to distribute, and I don't want to require people to install
01:02:02.000 | Python and a bunch of compilers and stuff. Do you have any
01:02:08.000 | support for compiling models into a binary?
01:02:11.000 | >> Okay. So, unfortunately, at this time, we're pretty
01:02:15.000 | limited in what we can do there. Most of
01:02:19.000 | the work is done by Python, and we use NumPy ndarrays for
01:02:24.000 | our intermediate values on the CPU, and a similar structure on
01:02:28.000 | the GPU, even though that one might be easier to convert. But
01:02:32.000 | yes, all our C code deals with Python and does the reference
01:02:36.000 | counting and so on, so that Python manages the memory. So,
01:02:40.000 | we're not able to do that yet. There would be a lot of work to do.
01:02:45.000 | >> So, how about something like a Docker container?
01:02:49.000 | >> Something like that works. Recently, even for the GPU, NVIDIA
01:02:53.000 | Docker is quite efficient, and we don't see the slowdowns
01:02:57.000 | that we had seen earlier. So, it's not ideal, and if, like,
01:03:03.000 | someone has some time and the will to help us disentangle
01:03:09.000 | that, we can do that.
01:03:13.000 | >> Okay. Let's thank Pascal again.
01:03:19.000 | [Applause]
01:03:20.000 | >> We convene in 55 minutes for the next talk. Have a good
01:03:23.000 | lunch.