
Theano Tutorial (Pascal Lamblin, MILA)


Chapters

0:00
0:36 Objectives
9:06 The back-propagation algorithm
13:34 Computing values
16:23 Graph optimizations
27:41 Visualization, debugging and diagnostic tools
31:52 Examples

Transcript

>> Thank you. >> Okay. So today I'm going to briefly introduce you to Theano, how to use it, and go over the basic principles behind the library, and if you paid attention during yesterday's presentation of TensorFlow, some concepts will be familiar to you as well. And if you paid attention to Hugo Larochelle's talk, you will have heard of some of this already.

And I will talk about some similar concepts as well. So there are going to be four main parts. The first one is, well, these slides, an introduction to the concepts of Theano. There is a companion IPython notebook that's on GitHub, so if you go to that page or clone that GitHub repository, there is an example of how to use it.

And you can download code snippets from the slides so that you can run them at the same time. Then we're going to have a more hands-on example, basically applying logistic regression on the MNIST data set. And then if we have time, we'll go quickly over two more examples: a convolutional network, and an LSTM used for character-level generation of text.

So Theano is, we can say, a mathematical symbolic expression compiler. So what does that mean? It means that it makes it possible to define graphs that represent mathematical expressions. It supports all the kinds of basic mathematical operations, like min, max, addition, subtraction, and so on, rather than only larger blocks like layers of neural networks.

It's then possible to manipulate those expressions, doing graph substitutions, cloning and replacement, things like that, and also to go through that graph and perform things like automatic differentiation, symbolic differentiation actually, and compile what we call a Theano function. And then it's possible to use that optimized graph and Theano's runtime to actually compute some output values given inputs.

We also have a couple of tools that help debug both Theano's code and the user's code, to inspect and see if there are any errors in the code. So let's talk about Theano. Theano is currently more than eight years old. It started small with only a couple of contributors from the ancestor of Mila, which was called LISA at the time, and it grew a lot.

We now have contributors from all over the world. Theano is used for prototypes and for industrial applications in startups and in larger companies. Theano has also been the base of other software projects built on top of it, for instance Blocks, Keras, and so on. Theano is more the backend, and those projects provide a user interface at a higher level.

So those have concepts of layers, of training algorithms, those kinds of things, whereas Theano is more the backend. One of them is also nice because it has a converter to load Caffe models.

There is also a library that uses Theano not to do machine learning, but probabilistic programming. And we have two other libraries, Platoon and Theano-MPI, which are layers on top of Theano and deal with data parallelism and model parallelism.

So, how to use Theano? Well, first of all, we are working with symbolic expressions, symbolic variables. The workflow is: we define the symbolic expression first, then we want to compile a function, and then execute that function on values.

So to define the expression, we start by defining inputs. The inputs are symbolic variables that have some type. So for a matrix, you have to define what its data type is, floating point, integers, and so on. And things like the number of dimensions have to be known in advance.

But the shape is not fixed. The memory layout is not fixed. So you could have shapes that change between one mini-batch and the next, or between different calls. So you define the symbolic expression in general. X and Y here are purely symbolic variables.

We will give them values later, but for now, they are just empty. There's another kind of input variable, shared variables, and they are symbolic, but they also hold a value, and that value is persistent across function calls. They are usually used, for instance, for storing parameters of the model that you want to learn, and these values can be updated as well.

So here we create two shared variables from values. This one, W, has two dimensions because its initial value has two dimensions, and this other one is a variable that will represent the bias. We can name variables by assigning to the name attribute.
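
For reference, a minimal sketch of what this looks like in code; the variable names and the 784-by-10 sizes are only illustrative assumptions, not the exact values from the slides:

```python
import numpy as np
import theano
import theano.tensor as T

# Purely symbolic inputs: data type and number of dimensions are fixed, shapes are not
x = T.matrix('x')   # a mini-batch of inputs
y = T.matrix('y')   # the corresponding targets

# Shared variables: symbolic, but they hold a persistent value across function calls
W = theano.shared(np.random.randn(784, 10).astype(theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')
```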

Shared variables do not have a fixed shape either. They are usually kept fixed in most models, but it's not a requirement. Then from these inputs, we can define expressions that will build new variables, intermediate variables, and so on. So, for instance, here we can define, well, the product of X and W, add the bias, apply a sigmoid on that, and let's say this is our output variable, and from the output variable and Y, we can define, say, the cost.

Those new variables are connected to the previous ones through the operations that we define, and we can visualize the graph structure like that by using, for instance, pydotprint, which is a helper function. So variables are those square boxes, and we have other nodes here, called apply nodes, that represent the application of a mathematical function.
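
Continuing the sketch above, the expression building and the visualization call might look like this; the squared-error cost is just an illustrative choice:

```python
# Build new (intermediate and output) variables from the inputs
out = T.nnet.sigmoid(T.dot(x, W) + b)   # affine transformation followed by a sigmoid
c = T.sqr(out - y).mean()               # an illustrative scalar cost
c.name = 'c'

# Draw the graph structure to an image file
theano.printing.pydotprint(c, outfile='cost_graph.png')
```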

So input variables and shared variables do not have any ancestors. They don't have anything leading into them, but then you see the intermediate results and so on. Usually, when we visualize, we don't necessarily care about all the intermediate variables unless they have a name or something. So here, we have exactly the same graph where we hide the unnamed intermediate variables, but you can still see all the operations.

And actually, you see the type on the edges. So once you have defined some graphs, say your forward computation for your model, we want to be able to use back propagation to get gradients. So here, we have just the basic concept of the chain rule. We have a scalar cost, we have intermediate variables that are vectors.

Here is just the chain rule starting from the cost. And so the whole derivative of, say, that function G is its Jacobian matrix, which is M by N if the input and output of G are vectors of size N and M. And usually, you don't need that.

And it's usually a bad idea to compute it explicitly unless you need it for some other purpose. The only thing you need is an expression that, given any vector representing the gradient of the cost with respect to the output, returns the gradient of the cost with respect to the input. So basically, the product between that vector and the whole Jacobian matrix.
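
In the notation being used here (a scalar cost C, an intermediate step h = g(x) with input x of size N and output h of size M), the quantity described is the vector-Jacobian product; this is only a restatement of the chain rule above:

```latex
\frac{\partial C}{\partial x_j}
  = \sum_{i=1}^{M} \frac{\partial C}{\partial h_i}\,\frac{\partial h_i}{\partial x_j}
\qquad\text{i.e.}\qquad
\nabla_x C = J_g^{\top}\,\nabla_h C,
\qquad J_g = \frac{\partial h}{\partial x} \in \mathbb{R}^{M \times N}
```

so the full M-by-N Jacobian never has to be formed explicitly, only its product with a vector.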

That's also called the L-operator sometimes. And almost all operations in Theano implement a grad method that returns that. And it actually returns not numbers, not a numerical value, but a symbolic expression that represents that computation. Again, usually without having to explicitly represent or define that whole Jacobian matrix.

So you can call theano.grad, which will back-propagate from the cost towards the inputs that you give. And along the way, it will call that grad method of each operation, starting from one for the cost and back-propagating through the whole graph, accumulating when the same variable is used more than once, and so on.

So that's basically what Theano does, the same way as if you had manually defined the gradient expression using Theano operations like the dot product, the sigmoid and so on that we've seen earlier. So we have symbolic expressions at that point, not numerical values. And they are part of the computation graph. So the computation graph was extended to add these variables.
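
Concretely, with the cost c and the shared variables W and b from the running sketch, getting the symbolic gradient expressions is one call (nothing is evaluated at this point):

```python
# Returns new symbolic variables; the graph is extended, nothing is computed yet
grad_W, grad_b = theano.grad(c, wrt=[W, b])
grad_W.name = 'grad_W'
grad_b.name = 'grad_b'
```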

So we can keep extending the graph from these variables, for instance to build the expressions corresponding to gradient descent, like we do here. So, for instance, this is what the extended graph for the gradients looks like. You see there are a lot of small operations that have been inserted.

And the outputs you can see here are the gradients, and an intermediate result that will help compute the gradient with respect to the weights. And here's the graph for the update expressions. So you have, as intermediate variables, the gradients that we had on the previous slide, and then basically just a scaled version with a constant learning rate.
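
And the gradient-descent expressions mentioned here are again just ordinary graph building; a sketch continuing the running example, with an assumed fixed learning rate:

```python
learning_rate = 0.1  # illustrative value

# New symbolic expressions for the updated parameter values
upd_W = W - learning_rate * grad_W
upd_b = b - learning_rate * grad_b
```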

So once we have defined the whole graph, the whole expression that we actually care about, from the inputs and initial weights to the weight updates for our training algorithm, we want to compile a function that will be able to actually compute those numbers, given inputs, and perform the computation.

So to do that, we call theano.function, and you provide it with the input variables that you want to feed, and the output variables that you want to get. And you don't necessarily have to provide all of the inputs that you might have declared, especially if you don't want to go all the way back to the beginning.

So we can actually compile a function for a subset of the graph. For instance, we can have a predict function here that goes only from X to out. We don't need values for Y, and so the gradients and so on will not be computed. It's just going to take a small part of the graph and make a function out of it.

Then, to call it, you have to provide values for all of the input variables that you declared. You don't have to provide values for shared variables, the W and B that we declared earlier. They are implicit inputs to all of the functions, and their value will automatically be fetched when it's needed.
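
Continuing the sketch above (x, out, W and b as defined earlier), compiling and calling the predict function might look like this:

```python
# Only x is an explicit input; W and b are implicit inputs fetched automatically
predict = theano.function(inputs=[x], outputs=out)

x_val = np.random.rand(16, 784).astype(theano.config.floatX)
print(predict(x_val).shape)   # (16, 10)
```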

You can declare other functions, like a monitoring function that returns both the prediction and the cost. Since you have two outputs, you also need the second input, Y. You can also compile a function that does not start from the beginning. For instance, if I want an error function that only computes the mismatch between the prediction and the actual target, then I don't have to start from the input.

I can just start from the prediction and the target. So that's the first thing. Then the next thing that we want to do is update shared variables, which is necessary for training. And, again, you can pass to theano.function a list of updates. Updates are pairs of a shared variable and a symbolic expression that will compute the new value for that shared variable.

So, for example, here we update W and B. Just like W and B were implicit inputs, the updated W and B are implicit outputs that will be computed at the same time as C. And then, after all the outputs are computed, the updates actually take effect when the train function is run.
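
A hedged sketch of the monitoring and training functions just described, reusing the names from the running example (the random data is only there to have something to call them with):

```python
# Monitoring: two outputs, needs both inputs, no updates
monitor = theano.function(inputs=[x, y], outputs=[out, c])

# Training: returns the cost and updates the shared variables as a side effect
train = theano.function(inputs=[x, y],
                        outputs=c,
                        updates=[(W, upd_W), (b, upd_b)])

x_val = np.random.rand(16, 784).astype(theano.config.floatX)
y_val = np.random.rand(16, 10).astype(theano.config.floatX)

print(b.get_value())     # value before training
train(x_val, y_val)      # computes the cost and applies the updates
print(b.get_value())     # the value has changed
```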

Here, if we print the value of B before and after having called the train function, we see the value has changed. What happens also during graph compilation is that the subgraph that we selected for that function is going to be optimized: it is rewritten in parts.

There are some expressions that will be substituted and so on. And there are different goals for that. Some are quite simple. For instance, if we have the same computation being defined twice, we only want it to be executed once. And there are some other computations that are not necessary.

You don't want to compute them at all. For instance, if you have X divided by X, and X is not used anywhere else, we just want to replace that by one. There are numerical stability optimizations. For instance, log of 1 plus X can lose precision when X is very small, and it gets replaced by the more stable log1p of X.

Things like log of softmax get optimized into a more stable operation. It's also the time when in-place and destructive operations are inserted. For instance, if an operation is the last one to use some buffer, it can, instead of allocating new output memory, reuse its input memory and overwrite it.

And the transfer of the graph expressions to the GPU is also done during the optimization phase. By default, Theano tries to apply most of the optimizations so that you have a runtime that's almost as fast as possible, except for a couple of checks and assertions. But if you prefer fast feedback and don't care that much about the runtime speed, you have a couple of ways of enabling and disabling some sets of optimizations, and you can do that either globally or function by function.
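
For example, one way to trade a bit of runtime speed for faster compilation, either globally through the THEANO_FLAGS environment variable or per function through the mode argument (reusing x and out from the sketch above):

```python
# Globally, e.g. before launching Python:
#   THEANO_FLAGS=optimizer=fast_compile python my_script.py

# Or per function, keeping the rest of the program fully optimized:
predict_quick = theano.function([x], out, mode='FAST_COMPILE')
```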

So, to have a look at what happens during the optimization phase, here's the original unoptimized graph going from the inputs X and W to the output prediction. It's the same one that we've seen before. And if we compare that with the compiled function that goes from these input variables to out, which was called predict, this is the graph that we get.

So, I won't go into details about what's happening in there, but here you have a GEMV operation, which basically calls an optimized BLAS routine that can do multiplication and accumulation at the same time. We have a sigmoid operation here that works in place. If you have a look at, for instance, the optimized graph computing the expressions for the updated W and B, this was the original one.

And the optimized one is much smaller. It also has in-place operations, and it has fused operations. For instance, if you have a whole tensor and you want to do an addition with a constant, then a sigmoid, then something else and so on, you want to loop only once through the array, applying all the scalar operations on each element before going to the next, and not iterate over the whole array each time you apply a new operation.

And those kinds of fused operations get generated automatically. And here you see the updates for the shared variables: you see the cost, and the implicit outputs for the updated W and B here and here. Another graph visualization tool that exists is debugprint, which basically prints a text-based structure of the graph.

So you can see the variable IDs and the variable names and so on. Here you can see more in detail what the structure is, and you see, for instance, the scaling constants and so on. So when the function is compiled, we can actually run it.

The compiled function is a callable Python object that we can call, and we've seen those examples here, for instance, where we call train and so on. But to get an optimized runtime, it's not just Python code that is running. We have on-the-fly code generation: we also generate C++ or CUDA code.

For instance, for the elemwise loop fusion that I mentioned, we can't know in advance which elementwise operations will occur in which order in the graphs that a user might be running, so we generate code for that. We generate a Python module written in C++ or CUDA that gets compiled and imported back so we can use it from Python.

The runtime environment then calls, in the right order, the different operations that have to be executed, starting from the inputs that we get. We have a couple of different runtimes, and in particular, there's one which is written in C++ and avoids having to switch context between the Python interpreter and the C++ execution engine.

Something else that's really crucial for speed is the GPU backend. We wanted to make it as simple as possible to use in the usual cases. The new backend now supports a couple of different data types, not only float32, but double precision if you really need that, and integers as well.

You can also manipulate GPU arrays from Python itself, so you can just use Python code to handle GPU arrays outside of a Theano function if you like. All of that will be in the future 0.9 release that we hope to get out soon. And to use it, well, you select the device that you want to use, and you can do that with just a configuration flag.

For instance, device=cuda to get the first GPU that's available, or device=cuda0 for one specific one. If you specify that in the configuration, then all shared variables will by default be created in GPU memory, and the optimizations that replace CPU operations by GPU operations are going to be applied.

Usually, you want to make sure you use float32, or even float16 for storage, which is experimental, because most GPUs don't have good performance for double precision. So how do you set those configuration flags? You can set them in the configuration file, .theanorc, which is just a plain configuration file read by Python.

You also have an environment variable, THEANO_FLAGS, where you can define those, and the environment variable overrides the config file, and you can also set some things directly from Python. But some flags have to be known before Theano is imported. So if you want to set the device itself, you have to set it either in the configuration file or through the environment variable.
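
A sketch of the three places configuration can come from (the values shown are only examples):

```python
# 1) In the configuration file ~/.theanorc, e.g.:
#      [global]
#      device = cuda0
#      floatX = float32
#
# 2) In the THEANO_FLAGS environment variable (overrides the file), e.g.:
#      THEANO_FLAGS=device=cuda0,floatX=float32 python train.py
#
# 3) From Python, for flags that do not need to be known at import time:
import theano
theano.config.floatX = 'float32'
print(theano.config.device)   # the device itself cannot be changed once Theano is loaded
```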

So I'm going to quickly go over more advanced topics, and if you want to learn more about that, there are other tutorials on the web, and there's a lot of documentation on deeplearning.net. So, about having loops in the graph: we've seen that the expression graph is basically a directed acyclic graph, so we cannot have loops in there.

One way, if you know the number of iterations in advance, is just to unroll the loop: use a for loop in Python that builds all the nodes of the loop. But that doesn't work if you want, for instance, a dynamic number of iterations. For models that generate sequences, for instance, that can be an issue.

So what we have for that in Theano is called scan, and basically, it's one node that encapsulates another whole Theano function, and it represents the computation that has to be done at each time step. So you have a Theano function that performs the computation for one time step, and you have the scan node that calls it in a loop, taking care of the bookkeeping of indices and sequences and feeding the right slice at the right time.

And having that structure also makes it possible to define a gradient for that node, which is basically another scan node, another loop that goes backwards and applies backpropagation through time. And it can be transferred to the GPU as well, in which case the internal function is going to be transferred and recompiled on the GPU.
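
As a small self-contained illustration of scan (this is the classic "A to the power k" example from the Theano documentation, not the LSTM itself):

```python
import theano
import theano.tensor as T

k = T.iscalar('k')
A = T.vector('A')

# One node that encapsulates the inner step: multiply the running result by A, k times
result, updates = theano.scan(fn=lambda prior_result, A: prior_result * A,
                              outputs_info=T.ones_like(A),
                              non_sequences=A,
                              n_steps=k)

power = theano.function(inputs=[A, k], outputs=result[-1], updates=updates)
print(power([1, 2, 3], 3))   # [ 1.  8. 27.]
```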

And we will talk about the LSTM example later. This is just a small example, but we don't really have time for that. We also have visualization, debugging, and diagnostic tools. One of the reasons they are important is that in Theano, like in TensorFlow, the definition of the expression is separate from its execution.

So if something doesn't work during the execution, if you encounter errors and so on, it's not obvious how to connect that back to where the expression was actually defined. So we try to have informative error messages, and we have some mechanisms like test values, and modes that detect things like NaNs or abnormally large values.

You can assign test values to the symbolic variables so that each time you create a new symbolic intermediate variable, each time you define a new expression, then the test value gets computed, and so you can evaluate on one piece of data at the same time as you build the new expression.
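
A sketch of that test-value mechanism (the shapes and names are arbitrary):

```python
import numpy as np
import theano
import theano.tensor as T

# Ask Theano to evaluate test values eagerly and raise if one is missing
theano.config.compute_test_value = 'raise'

x = T.matrix('x')
x.tag.test_value = np.random.rand(4, 784).astype(theano.config.floatX)

W = theano.shared(np.random.randn(784, 10).astype(theano.config.floatX), name='W')

# The test value of every new intermediate variable is computed as the graph is built,
# so a shape mismatch would be reported right here, at the line that creates it
h = T.dot(x, W)
print(h.tag.test_value.shape)   # (4, 10)
```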

So you can catch mistakes, errors, and things like that early. It's possible to extend Theano in a couple of ways. You can create an op just from Python, for instance by wrapping existing efficient libraries. You can also extend Theano by writing new graph optimizations, either for increased numerical stability, for more efficient computation, or for introducing your new ops instead of the naive versions that a user might have used.

We have a couple of new features that have been added to Theano recently. I mentioned the new GPU backend, with support for many data types, and we've had some performance improvements, especially for convolutions, 2D and 3D, and especially on GPU. We've made some progress on reducing the time of the graph optimization phase, and also improved the performance of the resulting graphs.

We have new ways of avoiding recompiling the same graph over and over again, and we have new diagnostic tools that are quite useful: an interactive graph visualization tool, and PDB breakpoints that enable you to monitor a couple of variables and only break if some condition is met, rather than monitoring something every time.

In the future, we're still working on new operations on GPU; we still want to wrap more operations for better performance. In particular, the basic RNNs should be completed in the following days, hopefully; someone has been working on that a lot recently. And, of course, more support for 3D convolutions, still faster graph optimization, and more work on data parallelism as well.

So, yes, I want to thank my colleagues and the main Theano developers, the people who contributed one way or another, our lab and its software development team, and the organisers of this event. So, yes, the slides are available online. As I mentioned, there is a companion notebook, and there are more resources if you want to go further.

And now I think it is time to move on to the demo. So, for those who have not cloned the repository yet, this is the command line you want to launch. For those who have cloned it, you might want to do a git pull, just to make sure you have the latest version of the Jupyter notebooks from the repository.

So we have three examples that we are going to go over: logistic regression, ConvNet, and LSTM. So I've launched the Jupyter notebook here. The intro Theano one was the companion notebook, and then we have the examples. So let's go with the logistic regression. Is that big enough, or do I need to increase the font size?

Okay. So I'm going to skip over the text, because you probably already know about the model. We have some data that we want to load, with the helper that comes with the repository on GitHub. So let's load the data. And here, let's see how we define the model.

So it's basically the same way that we did in the slides. We define an input variable; here it's a matrix, because we want to use mini-batches. And we have shared variables initialised from zeros. Then we define our model. So, here's our predictor, the probability of the class given the input: an affine transformation, and then the softmax on top of it. And the prediction, if you want to have a prediction, is going to be the class of maximum probability.

So, argmax over that axis, because we still want one prediction for each element of the mini-batch. Then we define the loss function. Here it is going to be the negative log likelihood of the label given the input, or the cross-entropy, and we define it simply; we don't need to have one cross-entropy or log likelihood operation by itself.

You can just build it from the basic building blocks. So you take the log of the probabilities, you index it with the actual target, and then you take the mean of that to have the mean loss over the mini-batch. Then you take the gradients and derive the update rules.
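
For reference, a sketch of the graph described in this part of the notebook; the MNIST sizes and variable names are assumptions, not the notebook's exact code:

```python
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')    # mini-batch of flattened images
y = T.lvector('y')   # integer class labels

n_in, n_out = 28 * 28, 10
W = theano.shared(np.zeros((n_in, n_out), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)   # affine transformation + softmax
y_pred = T.argmax(p_y_given_x, axis=1)          # one prediction per mini-batch element

# Mean negative log likelihood of the correct labels over the mini-batch
nll = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
```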

So, again, we don't have, like, one gradient descent object or something like that. We just build whatever rule we want. So, yeah, we could use momentum by defining other shared variables for the velocities, and then writing the update expressions for both the velocity and the parameter shared variable itself.

And then we compile a train function going from X and Y, outputting the loss, and updating W and B. So, yeah, we compile the train function, and the graph gets optimized. Let's see the next step. We also want to monitor not only the log likelihood, but actually the misclassification rate on the validation and test sets.
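
A sketch of what the update rules and the compiled functions could look like, continuing the snippet above and including the momentum variant just mentioned (the learning rate and momentum values are illustrative):

```python
learning_rate, momentum = 0.1, 0.9

g_W, g_b = theano.grad(nll, [W, b])

# Plain SGD: just pairs of (shared variable, new-value expression)
sgd_updates = [(W, W - learning_rate * g_W),
               (b, b - learning_rate * g_b)]

# Momentum variant: extra shared variables hold the velocities
v_W = theano.shared(np.zeros((n_in, n_out), dtype=theano.config.floatX), name='v_W')
v_b = theano.shared(np.zeros(n_out, dtype=theano.config.floatX), name='v_b')
new_v_W = momentum * v_W - learning_rate * g_W
new_v_b = momentum * v_b - learning_rate * g_b
momentum_updates = [(v_W, new_v_W), (W, W + new_v_W),
                    (v_b, new_v_b), (b, b + new_v_b)]

train_model = theano.function([x, y], nll, updates=sgd_updates)

# Misclassification rate, compiled without any updates
misclass = T.neq(y_pred, y).mean()
test_model = theano.function([x, y], misclass)
```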

So it's simply the mismatches between the prediction and the target, and the rate is the mean over the mini-batch, and we compile another function, not doing any updates, of course. So, to train the model, well, first, we need to process the data a little bit, because we want to feed the model one mini-batch of data at a time.

So here we have a helper for that; it's not a Python generator, but a helper function that gives us mini-batch number i, and it's going to be the same function used for the training, validation and test sets. We define a couple of parameters for early stopping in that training loop.
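
The helper being described is essentially just slicing; a hypothetical version:

```python
def get_minibatch(data_x, data_y, i, batch_size=128):
    """Return mini-batch number `i` of a dataset (an illustrative helper, not the notebook's exact code)."""
    start = i * batch_size
    stop = start + batch_size
    return data_x[start:stop], data_y[start:stop]
```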

It's not necessary; it just helps keep track of the best model that was encountered during the optimization. So let's define that. And this is the main training loop. It's a bit more complex than it could be, but that's because we use this early stopping, and we only validate every once in a while.

So, we have a couple of parameters that we can tune, but basically, the most important part is: you loop over the epochs unless you hit the early stopping condition, and then, during each epoch, you loop over the mini-batches and call the train function. Then, we want to get the validation error, so here we call the test model on the validation set for that, keep track of what the best model currently is, and get the test error as well.

And save the best one. To save the best one, we save the values of all the parameters, which is more robust than trying to pickle the whole Python object, and it also makes it easier to transfer to other frameworks, to visualization frameworks, and so on.

So let's try to execute that. Of course, it's a simple model, so it should not take that long. You see that at the beginning, well, almost at each iteration, we get a better model, and then, after a while, the progress gets slower.

So, let's wait a little bit more. It seems to stall more and more. Okay. And here is the end, after 96 epochs. Now, if we want to visualize what filters were learned, we use a helper function here to visualize them. It's not really important.

But here, what we do is call get_value on the weights to access the internal value of the shared variable, and then we use that to visualize the filters. And we can see it's kind of reasonable: this is the filter for class 0, and you can see kind of like a 0; what's important for the 2 is to have an opening here, and so on.

So, yeah, if we have a look at the training error -- well, do we see the training error? No, I'm not plotting it. But the validation and the test errors are quite high, and we know that the human-level error is quite low, and the error of other models is quite low, so it really means that the model is too simple, and we should use something more advanced.

So, to use something more advanced, if you go back to the home of the Jupyter notebook, you can have a look at the convnet one, and run it. This new example uses the same data, but it's a bit more advanced, while still small enough that it has the advantage of training fast even on an older laptop. This time, we're going to use a convolutional net with a couple of convolution layers, fully connected layers, and a final classifier.

So, let's see how we could use Theano to define helper classes, layers, that make it easier for a user to compose them if they want to replicate some results, or use some classical architectures. This is usually done in frameworks built on top of Theano, and people develop their own mini-framework with their own versions of layers and so on that they find useful and intuitive.

So, this logistic regression layer basically holds, well, parameters, weights and bias, and it's a very simple classifier. It builds expressions for the class probabilities and the prediction, holds the params, and has expressions for the negative log likelihood and the errors. So, if you were to use only that class, it would be doing essentially the same as the previous example.

And, in the same way, we can define a layer that does convolution and pooling. So, again, in the init method, we pass it, well, the filter shape, the image shape, the size of the pooling, and so on. We initialise the weights in the same way, and from the input, well, we compute the convolution with the filters.

We then compute max pooling, and output, well, tanh of the pooling result plus the bias. So, we have a bias, and here, the bias is only one number for each channel, which means that you don't have a different bias for each location in the image. So, you could actually apply such a layer on images of various sizes without having to initialise new parameters or retrain it.
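
A hedged sketch of the forward expression of such a convolution-pooling layer (names are illustrative, and the keyword for the pooling size has changed across Theano versions):

```python
import numpy as np
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d
from theano.tensor.signal.pool import pool_2d

def conv_pool_output(inp, filter_shape, poolsize=(2, 2), rng=np.random):
    """inp: 4-D tensor (batch, channels, rows, cols); filter_shape: (n_filters, channels, h, w)."""
    W = theano.shared(rng.randn(*filter_shape).astype(theano.config.floatX), name='W_conv')
    b = theano.shared(np.zeros(filter_shape[0], dtype=theano.config.floatX), name='b_conv')

    conv_out = conv2d(inp, W)                                    # convolution with the filters
    pooled = pool_2d(conv_out, ws=poolsize, ignore_border=True)  # max pooling
    # One bias per output channel, broadcast over the batch and spatial positions
    return T.tanh(pooled + b.dimshuffle('x', 0, 'x', 'x'))
```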

Then we have a hidden layer, which is just a fully connected layer. Again, initialising weights and bias, and a symbolic expression going from the input and the shared variables to the output after the activation. And, again, we want to collect the parameters so that we can compute gradients with respect to all of them later.

And then, here's a function that has the main training loop. We have a mini-batch helper, again the same code as before, and here, we are building the whole graph. So, always the same process: we define the input variables. The target is an lvector, a vector of long integers, because the targets here are indices, and not one-hot vectors or masks or something like that.

And we create the first convolution-pooling layer, and then a second one, where the image size changes. Passing the image shape is mostly for efficiency, actually. You don't really have to pass it for those particular models.

But you still need the shape of the filters; I mean, you have the filters anyway. The convolution layers can handle arbitrary-sized images. And then, after that, we want to flatten the whole feature maps and feed that into a fully connected layer and then into the prediction layer.

So the input size of this one has to be fixed, so we have to know the size of the flattened feature maps; the feature maps have four dimensions and get flattened. And here we go: a fully connected layer, and the output layer, which is just the classifier, the same as before. We want the final cost to be the log likelihood of that.

We have, again, the errors, and the parameters are the concatenation of the parameters of all the layers. And once we have that, we can build the gradients: just one call of grad of the cost with respect to params. Get the updates: again, just regular SGD, but we could have a class or something that performs momentum, whatever you need.

Compile the function. And here we have, again, the early stopping routine, with the same main loop over all epochs until we're done. Then loop over the mini-batches, validate every once in a while, and stop when it's finished. So, let's just declare that. Loading the data. So, we have the same output, exactly the same as before.

And here we can actually run that. So, this was the result of a previous run; that took five minutes, so I will probably not have time to do it now, but here you can see basically what the result looks like. If you want to try it during the lunch break or later, you're welcome to play with it.

And after that, yeah, you can visualize the learned filters as well. So, here you have the first layer, and here you have an example of the activations of the first layer for one input. So, let's just go back to the list of examples.

So, if you go back to the home of the Jupyter notebook and go to LSTM -- so, this model is an LSTM model. It's a model of the next character given the previous ones. I'm not going to go into details, but you can see that the LSTM layer is defined here, with shared variables for all the weight matrices that you need and the different biases as the parameters.

So, you have a lot of parameters. It would be possible, and sometimes more efficient, to actually define, say, only one variable that contains the concatenation of a couple of those matrices, and that way you can get a more efficient implementation. And here's an example of how to use scan for the loop.

So, here we define a step function that takes, well, a couple of different inputs: you have the activations from the previous step, you have the current sequence input, and so on, and from them, here's basically the LSTM formula, where you have the dot products and sigmoid or tanh of the different connections inside the cell, and in the end, you return the new hidden state and cell state.

So, once you have that, that step function is going to be passed to theano.scan, where the sequences are the mask and the input. The mask is useful because we are using mini-batches of sequences, and not all the sequences in the same batch have the same length.

We try to group examples of similar lengths together, but they may not always be exactly the same length. So, in that case, we pad up to the longest sequence in the mini-batch only, not the longest sequence in the whole set, just for the mini-batch, but we have to pad and remember what the lengths of the different sequences are.
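
The way the mask is typically applied inside the step function is a simple switch between the freshly computed state and the carried-over one; a small illustrative helper (new_h stands for whatever the LSTM formulas compute, not a Theano API):

```python
def apply_mask(m_t, new_h, h_tm1):
    """Keep the new state while the sequence is still running, carry the old state past its end.

    m_t:   (batch,) symbolic mask for this time step (1. or 0.)
    new_h: (batch, hidden) candidate state from the LSTM equations (computed elsewhere)
    h_tm1: (batch, hidden) state carried over from the previous step
    """
    m = m_t.dimshuffle(0, 'x')        # broadcast the mask over the hidden dimension
    return m * new_h + (1. - m) * h_tm1
```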

So, let's define that. Here, we define the cost function, that's the cross-entropy of the sequence, and here, again, you see that the mask is used so that we don't consider the predictions after the end of the sequence. The logistic regression on top is the same as before, and we use the same kind of cost.

Here, for processing the data, we are using Fuel, which is another tool being developed by a couple of students at Mila, and it's nice because it can read from just plain text data and do some preprocessing on the fly, including the things that I will show you in a second. So, we are grouping sequences by similar length, and then shuffling them, and padding, and doing all of that.

And so, that gives a generator that you can then feed, in your main loop, through a Theano function. So, that whole preprocessing happens outside of Theano, and then the data is fed into the main loop. So, yes, here we build our final Theano graph. We have symbolic inputs for, well, the input and the mask.

We create the recurrent layer, define our cost, and collect all the parameters. Take the gradients, of course, with respect to all parameters. As I mentioned, it's going to use backprop through time to get the gradient through the scan operation. The update rule, again, is simple SGD, no momentum, nothing.

It's something you can add if you want to. And then, we have a function to evaluate the model. So, here, the main loop is training, and we also have another function that generates one character at a time, given the previous ones. That's why we declare separate inputs here. So we have a function that gets the predictions, and we normalize them, because we are working in float32, and sometimes, even if you divide by the sum, it doesn't add up exactly to one.
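
The renormalization being described usually amounts to a cast to double precision before sampling; a sketch, where p stands for the float32 probability vector returned by the prediction function:

```python
import numpy as np

def sample_next_char(p, rng=np.random):
    """Sample an index from a float32 probability vector that may not sum exactly to 1."""
    p = np.asarray(p, dtype='float64')
    p /= p.sum()                       # renormalize in double precision
    return rng.multinomial(1, p).argmax()
```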

So we want higher precision for just that operation. And then we try to generate a sequence every once in a while. This is from a previous run: for monitoring, we seed the prediction with "the meaning of life is", and then we let the network generate.

So, if I try to run it now, it's going to take long, but here are some examples of how it works. The first one is a model that was not trained that much, and it has, like, a couple of unusual characters. I mean, it's not usual to have, like, one Chinese character in the middle of words.

You have, like, punctuation in the middle of words, and so on. And as training goes on, generating a sequence every once in a while, we see that it's getting slowly better and better. And "the meaning of life is the that", and so on. So, of course, this is not what's going to give you the meaning of life.

So, yeah, I interrupted the training at some point, but you can play with it a little bit, and here are some suggestions of things you might want to do, like better training strategies, different nonlinearities inside the LSTM cell, different initialization of the weights, trying to generate something else than "the meaning of life is", and, yeah.

So, I hope I could give you a good introduction to what Theano is, what it can be used for, and what you can build on top of it. If you have any questions later, we have the theano-users mailing list, and we are answering questions on Stack Overflow as well.

And we would be happy to have your feedback. >> We have time for a few quick questions. There's one here. Could you go to the mic? >> Can you just give a quick example of what debugging might look like in Theano? Could you break something in there and show us what happens and how you figure out what it was?

>> Sure. Actually, yeah, I can show you a few examples. Okay. So, let's go to, say, a simple example. I'm just going to go to the logistic regression one, and say, for instance, that when I execute this, I don't have the right shape. You can still build the whole symbolic graph, but at the time when you want to actually execute it, you get an error message that says that the shapes don't match: let's say that X has this many columns and rows, but Y has only that number of rows.

And the apply node that caused the error is that dot product, and it gives the inputs again, but in that case, it's not really able to tell you where it was defined. So, what we can do is go back to where the train function was defined, train_model, theano.function, and we can say, "mode, optimizer equals none." Sorry.

I have to do -- mode equals theano.Mode, optimizer, None. Is that right? So, let's do that. Let's recompile everything. And then, the updated error message says, "Backtrace when the node was created," and it points somewhere in my notebook kernel. So, we can go back to that. Of course, we have a lot of things in there, but you know that there's a dot product, and it's probably a mismatch between those inputs.
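
In code, the change being made here is roughly the following, assuming the train_model compilation from the logistic-regression sketch earlier:

```python
# Recompile without graph optimizations so error messages keep a backtrace
# pointing to where each node was created
train_model = theano.function(inputs=[x, y],
                              outputs=nll,
                              updates=sgd_updates,
                              mode=theano.Mode(optimizer=None))
```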

So, that's one example. Then, among the other techniques that we can use, we have the breakpoints, as I said, and so on. I don't have a tutorial about that right now, but there are some examples. So, I'm going to leave it at that. >> One last question. >> I have some models I would like to distribute, and I don't want to require people to install Python and a bunch of compilers and stuff.

Do you have any support for compiling models into a binary? >> Okay. So, unfortunately, at the moment, we're pretty limited in what we can do there. Most of the orchestration is done by Python, and we use NumPy ndarrays for our intermediate values on the CPU, and a similar structure on the GPU, even though that one might be easier to convert.

But yes, all our C code deals with Python objects and does the INCREF and DECREF and so on, so that Python manages the memory. So, we're not able to do that right now; it would be a lot of work. >> So, how about something like a Docker container? >> Something like that, yes.

Recently, even for GPU, NVIDIA Docker is quite efficient, and we don't see the slowdowns that we had seen earlier. So, it's not ideal, and if someone has some time and the will to help us disentangle that, that would be welcome. >> Okay. Let's thank Pascal again. >> We reconvene in 55 minutes for the next talk.

Have a good lunch.