
Torch Tutorial (Alex Wiltschko, Twitter)


Chapters

0:00
7:48 TORCH - ARITHMETIC
8:02 TORCH - BOOLEAN OPS
8:06 TORCH - SPECIAL FUNCTIONS
8:21 TORCH - RANDOM NUMBERS & PLOTTING
11:18 TORCH - WHERE DOES IT FIT? Is it for research or production? It can be for both, but it is mostly used for research.
16:25 TRAINING CYCLE
24:17 AUTOMATIC DIFFERENTIATION IS THE ABSTRACTION FOR GRADIENT-BASED ML
25:34 FORWARD MODE (SYMBOLIC VIEW)
28:47 REVERSE MODE (SYMBOLIC VIEW)
28:59 REVERSE MODE (PROGRAM VIEW) Right-to-left evaluation of partial derivatives is the right thing to do for optimization.
32:59 AUTOGRAD EXAMPLES
40:48 SO WHAT DIFFERENTIATES NEURAL NET LIBRARIES? What is the graph?
42:09 NEURAL NET THREE WAYS

Transcript

So I'm going to tell you about machine learning with Torch and with Torch Autograd. So the description of the talk isn't entirely correct. I'm going to do practical stuff for the first half. And then what I want to do is dive into Torch Autograd and some of the concepts that are behind it.

And those concepts also happen to be shared amongst all deep learning libraries. So I really want to give you a perspective of the common thread that links all deep learning software you could possibly use. And then also talk a bit about what makes each of the libraries different and why there's -- I will hypothesize why there's so many and the different choices.

So one thing I want to try -- there's been a lot of questions and we've gone over time. But if there's not questions that go over time in the room, there's a lot of people watching online. And if there's extra time, we'll, of course, prioritize people here. But if you ask a question with the #DLschool hashtag or if you tweet at me directly, I will try to answer those questions from online and I'll certainly answer them offline as well.

So ask if you're watching at home. Maybe that will kind of increase meaningful participation for people watching through the stream that aren't here today. A lot of this material was developed with Soumith Chintala at Facebook. He's kind of the czar of the Torch ecosystem these days. And Hugo Larochelle, who you heard from yesterday.

And also Ryan Adams, who's at Twitter with us. And all this material is available on this GitHub repository that you got actually on a printed sheet for installing Torch. So all the examples that I'll show you will be in one notebook. And then there's a separate notebook, which I actually won't reference in the talk, that's a full end-to-end walkthrough of how to train a convolutional neural network on CIFAR-10.

So that's kind of a self-paced tutorial notebook that you can work through on your own time. But I'm going to focus on the basics, on the fundamentals, and hopefully give you some of the concepts and vocabulary that you can use to really dive into Torch on your own time.

So let's get going. So Torch is an array programming language for Lua. So it's like NumPy, it's like MATLAB, but it's in the Lua language. So Torch is to Lua, as NumPy is to Python. So what you can do in Torch, you can do in any language. This is the absolute minimum basics.

You can grab strings and print them. You can put things in associative data types. In Python, there's tuples and lists and sets and dictionaries. In Lua, there's just one data type called a table. So you'll see that a lot. But you can do all those things that I mentioned before with a table.

And you've got for loops and if statements. The core type of Torch is the tensor. Just like in NumPy, when you have the nd array, which is a way of shaping sets of numbers into matrices or tensors, we have the tensor. And you can fill it up with random numbers.

You can multiply them. Standard stuff. But the tensor is the core data type of Torch. We've got plotting functionality. Going over at a very high level, I'll show you some more specific code in a moment. So you can do all the kind of standard stuff that you'd do in any other array-based language.
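To make that concrete, here is a minimal sketch of those basics -- a Lua table, a for loop, an if statement, and a couple of tensor operations. The variable names and sizes are just for illustration:

```lua
-- A few of the basics just mentioned: strings, tables, loops, and tensors.
local greeting = 'hello, Torch'
print(greeting)

-- Lua's single associative data type: the table (used as list and dict).
local t = {1, 2, 3, name = 'my table'}
for i = 1, #t do print(t[i]) end
if t.name == 'my table' then print('found it') end

-- The core Torch type: the tensor, filled with random numbers.
require 'torch'
local a = torch.rand(3, 4)       -- 3x4 tensor of uniform random numbers
local b = torch.rand(4, 2)
local c = a * b                  -- matrix multiply, giving a 3x2 tensor
print(c)
```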

There's all the tensor functions that you'd like to use, including all the linear algebra and convolutions and, you know, BLAS functions. And I'm leaving this link here. When the slides get uploaded, you can follow this and kind of dive into the documentation and see exactly what kind of tools you have at your disposal.

In the notebook, in the iTorch notebook, which is something that Soumith put together, you can prepend any Torch function with a question mark. And that gives you the help for that function. So it makes it really nice to discover functionality in the Torch library, in the notebook. So why is it in Lua?

It's kind of maybe a strange, maybe esoteric language to write things in. Lua is unreasonably fast for how convenient it is to use, especially a flavor of Lua called LuaJIT. For loops in LuaJIT are basically the same speed as C. So this for loop here is actually in production code in master in Torch.

It's not C code. But this is perfectly fast enough. So that's a really nice aspect of Lua, is you can depend on super high performance C code, and then on top of it, you've got this very convenient glue layer, but you don't pay much of a speed penalty to use that glue layer.

So that's one of the reasons why we've used Lua. Another advantage that some people might see as a plus is the language itself is quite small. There's 10,000 lines of C code that define the whole language of Lua. So you can really sit down with the manual in an afternoon and understand most of the language on your own that same day.

Another aspect which is pretty critical for deep learning, but also for other fields, is that it's really easy to interoperate with C libraries. It was designed originally to be embedded. So Lua was a language that was designed to run inside of another C program, but have a little scripting layer inside of it.

So it's very easy to call into C. It's very easy for C to call into Lua. So this is another reason why it's kind of an appropriate choice for deep learning libraries. The FFI call signature and the idea have been copied into many other languages. So CFFI in Python is a Python version of the LuaJIT FFI.
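As a small illustration of how low-friction that is, here is roughly what calling straight into libc looks like with the LuaJIT FFI (assuming you're running under LuaJIT, where the ffi module is built in):

```lua
-- Calling straight into C from LuaJIT with the built-in FFI (no wrapper code).
local ffi = require 'ffi'

-- Declare the C function we want, exactly as it appears in a C header.
ffi.cdef[[
int printf(const char *fmt, ...);
]]

-- ffi.C is the global C namespace; this calls libc's printf directly.
ffi.C.printf("hello from %s\n", "C")
```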

Julia has something similar as well. And as I mentioned, it was originally designed to be embedded. And it's in all kinds of crazy places that you maybe wouldn't expect Lua to be. So in World of Warcraft, all the graphics are in C++ or whatever they wrote it in. But like the boss battles or the quests.

So like when you go give the gem to the blacksmith or whatever and they give you back the magic sword, the scripting of those events happens in Lua. And if you write scripts for World of Warcraft to make your own quests, that's Lua. Adobe Lightroom is a photo processing app.

All the image processing is done in C++, but all the UI and everything was done in Lua. So again, it was used to bind together high-performance code with kind of a scripting layer. And Redis and Nginx, which are kind of workhorses in the field of web development, are both scriptable with Lua.

And in fact, if you go to GitHub pages, like mypage.github.io, if somebody's hosting a web page on GitHub, that's served in part by Lua. The apocryphal story of why it was originally chosen, maybe you could correct me, is Clément Farabet was trying to build an embedded machine learning application, some device he could wear on his bike helmet and classify the world with a CNN when he was a student of Yann LeCun's.

And he was trying to do this with Python. And it's incredibly frustrating to get Python to run on embedded chips. Maybe it's easier now with Raspberry Pi, but that just wasn't the case. And then he stumbled upon Lua, and it turns out people had been building Lua into embedded applications for years before that.

And so that kind of was the snowballing effect. So that's the hearsay for how we arrived at Lua. But maybe there's another story. Another really nice feature of Torch is we have first-class support for GPU computation, interactive GPU computation. So it's very, very easy to get some data from the CPU to the GPU, and then everything that you do with that data happens on the GPU without you having to worry about writing CUDA kernels.

So this has been a feature of Torch, which is becoming maybe a little bit less unique now, but this was a pretty solid feature when it first came out. So interactive GPU computing. And I'll go very quickly over some of the basic features. And all of these examples, again, are in a notebook, which you can do kind of at your own pace if you'd like.

So there's all the basic arithmetic, like creating matrices and doing arithmetic between them, taking maxes of numbers and arrays, clamping, building tensors out of ranges, Boolean operations over entire arrays, special functions. This is supported through a wrapper around the Cephes library. This is what NumPy uses to support things like tanh and atan2 and other kinds of functions that I guess are in the special class.

And then Soumith, again, has wrapped the Bokeh plotting library, which was originally built just for Python, and it provides really nice and beautiful plots in the iTorch notebook. And so we can, you know, draw random numbers from our favorite distributions and make nice histograms of these. So you can do nice data exploration in the iTorch notebook along with deep learning.

So one feature that is attractive to some folks, but just an interesting feature of the Torch ecosystem, is that although there's a lot of industry support, it is not industry owned. So at Twitter and at Facebook AI Research and at NVIDIA, we all contribute a lot to the Torch community, but we don't own it.

We can't really steer it to go one way or the other definitively. And there's a ton of other people that participate academically in this ecosystem, and that's a really nice feature. And along with -- I guess because of the really nice habits of people in deep learning, when a paper comes out, there's often a high quality code implementation that follows it.

Not always, but very often, at least compared with other fields. And Torch is one of the environments in which you'll often see high quality implementations of really cutting edge stuff. So if you just browse through GitHub and you kind of follow researchers on GitHub, you can see really high quality implementations of image captioning, of neural style transfer, so you can just clone this GitHub repository and run this yourself.

Sequence-to-sequence models, kind of whatever is the state of the art, there's usually a Torch implementation of it. Some of the recent work in generating very realistic synthetic images with generative adversarial networks also has great Torch code implementing it. So given that there's this active community on GitHub in deep learning for Torch, how does that stack up against other communities?

Just to give you some context. So the Python data science community is pretty enormous, and its focuses are also very varied. If you enter into the data science community in Torch and Lua, you'll likely find deep learning people, but not a lot of other people. So its strength in deep learning compared to its size is actually quite enormous.

And for those that are kind of thinking of switching between Python and Lua and giving Torch a try, the effort to switch from Python to Lua, you can probably do that in a day if you've tried some Python programming. So I was a Python programmer for a while, and getting started on Lua took me maybe a couple days, and I was actually productive at work in maybe a week or so.

But you can actually run your code and understand and write new things pretty quickly if you've worked in a scripting language like MATLAB or Python. So if you're intimidated or waiting to try it, you should just dive in. So how does Torch compare to other deep learning libraries specifically, as opposed to languages?

The first thing I'll say is there's really no silver bullet right now. There are a lot of deep learning libraries out there. I'd say TensorFlow is by far the largest. And this is a plot that was made by a colleague of Soumith's, and I wish it kind of had confidence intervals on it, because it's not strictly that these are, like, you know, points in deep learning space.

But maybe this is a good guess of where things kind of fit. It seems as if TensorFlow was engineered to be very good in an industrial production setting, and it seems like it's really fulfilling that. Theano seems to have always had a research goal in mind and has been really awesome in the research community for some time.

Torch tends to be more towards research than industry. I think Twitter maybe has pulled it a little bit towards production. We maybe are the only example -- I'd love to learn of others, but we're maybe the only example of a large company that uses Torch in production to serve models.

So every piece of media that comes in to Twitter goes through a Torch model at this point. So we're really dealing with an enormous amount of data in a live setting. The development of Torch, just to give you a sense of how we think about how it was built and how we're extending it, there's some kind of tenets of our core philosophy.

Really the first is that things should be interactive -- this isn't necessarily good or bad, but this is our choice. Whenever you hit enter on a particular line in your iTorch notebook or on the command line, you should get an answer back. And this is something that we've tried to stick to pretty tightly.

So no compilation time. Imperative programming, right? So just write your code and, you know, each line of code executes something and passes it to the next line. And minimal abstraction -- what I mean by minimal abstraction is if you want to reason about how your code is performing, it shouldn't take you that many jumps to get to the C code that's actually being run.

In fact, it usually is one or two jumps from the file that defines the function that you care about to the actual C code. So if you want to reason about performance or really understand what's going on, it's quite easy to do so in Torch. I want to take a little bit of a detour and tell you about how Torch thinks about its objects, how it thinks about the tensor, because this can help you also reason about performance.

A lot of the reason why people come to Torch is to build high-performance models very quickly and easily. So I mentioned tensors before. So a tensor is an n-dimensional array. And a tensor is actually just a pointer. It's a view into your data that's sitting in memory. So it's just a shape.

It's a view into what's actually being stored in your RAM. It's stored in a row major way. So that means if I go to the first element of my tensor in memory and I move over one, I'm moving over one in a row and not one in a column.

Column major memory storage does exist. It's just less common today. So you'll often see row major. So this tensor is defined by its link to some storage and its size, 4 by 6, and its stride, 6 by 1. And 6 by 1 means if I move one down in the column direction, I actually have to skip six elements in memory, right?

Whereas the 1 here means if I move over one in the second axis, the row axis, I just have to go over one in memory. So if I take a slice of this tensor using the select command, so I select along the first dimension, the third element, what it gives me back is a new tensor.

It doesn't give me new memory. This is a thing that happens a lot in Torch, is you'll deal with views into memory. You won't do memory copies. So you're usually working with kind of the raw data in RAM. And so this creates a new tensor with the size of 6 because there's six elements, a stride of 1 because we've pulled out a row, not a column, and an offset of 13.

That means I have to go 13 elements from the beginning of the original storage to find that piece of memory. So if I pull out a column, then something different happens, which is I still have a size of 4 here. And my stride is now 6 because in order to grab each element of the column, I have to skip 6.

And then the offset of 3 is because I grabbed the third element there. So that's kind of a view of the memory model. And if we actually run something like this, like we instantiate a tensor of double values and fill it with a uniform distribution and print it, we can see the values here.

And then if you grab a slice B and print it, it's just this row. And then we can fill B with just some number and print it. Now it's filled with that number. And if we go back and print A, we've actually overwritten the values there. So this is something you see a lot in Torch, is working on one big piece of shared memory.
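In code, that walkthrough looks roughly like this (a sketch using the same 4-by-6 example; the fill value is arbitrary):

```lua
require 'torch'

-- A 4x6 tensor of doubles, filled from a uniform distribution.
local A = torch.DoubleTensor(4, 6):uniform()
print(A:size())          -- 4 6
print(A:stride())        -- 6 1: moving down a column skips 6 elements, along a row skips 1

-- Select row 3: a new tensor, but a *view* into the same storage.
local B = A:select(1, 3)
print(B:size())          -- 6
print(B:storageOffset()) -- 13: row 3 starts 13 elements into the storage (1-based)

-- Filling the view overwrites the original tensor's memory too.
B:fill(7)
print(A[3])              -- the third row of A is now all 7s
```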

And as I mentioned before, working with CUDA is really, really easy. So if you just require cutorch, which is installed automatically if you have a CUDA GPU using the instructions on the GitHub repository, you can instantiate a tensor on the GPU and do the same thing. And it will just work.
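A minimal sketch of that, assuming a CUDA-capable GPU and the cutorch package:

```lua
require 'cutorch'

-- Move a CPU tensor to the GPU with one call...
local a = torch.rand(1000, 1000):cuda()
local b = torch.rand(1000, 1000):cuda()

-- ...and everything you do with it now runs on the GPU, no CUDA kernels to write.
local c = a * b          -- matrix multiply on the GPU
print(c:size())
```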

So now I want to talk a bit about the frameworks that you'll use to actually train neural networks in Torch. So this is a schematic kind of cartoon of how we-- of the pieces we typically need to train a neural network. So we've got our data stored on a hard drive or on a big distributed file system.

And we have some system for loading that data off of that file system, which goes into a nice queue. And then some training code which orchestrates a neural network, so the thing actually making the prediction, a cost function, which is a measure of how good our neural network is at any point in our training, and an optimizer, which is going to take the gradient of the cost with respect to the parameters in the neural network and try to make the neural network better.

So in the Torch ecosystem, we've got some packages that tackle each one of these separately. So I won't talk about threads here. There are actually several different libraries that will do each one of these things. But this one is maybe the most common or the easiest to start with.

And NN here will cover both the specification of the neural network and the cost function, as well as the mechanisms to push data through the neural network and the cost function and pull the gradients back from the cost to the parameters. And then the optimizer, which is something we've heard mentioned several times today -- stochastic gradient descent or Adam or AdaGrad.

So let me talk about NN first, give you a flavor of how it works and what the pieces are. So NN is a package for building feedforward neural networks, mostly feedforward neural networks, by clicking Lego blocks together. So you might start with your input and then click together a fully connected layer, and then another fully connected layer, and then maybe some output.

So here, I've defined a sequential container, which is going to be a container for all my Lego blocks. And then I might click in a spatial convolution. So I'm going to be working with images, maybe, a non-linearity, some max pooling, some other layers, as well, to kind of complete the whole neural network.

And then I might add a log soft max at the end to compute class probabilities. So this is kind of the structure that you'll build neural networks with in NN, is define a container and then one by one add pieces down a processing hierarchy. And I mentioned the sequential container, which is starting from inputs and then proceeding linearly.
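Here is roughly what clicking those Lego blocks together looks like. This is a sketch of a small convnet; the layer sizes are just placeholders, assuming 3-channel 32x32 inputs:

```lua
require 'nn'

-- A sequential container: Lego blocks clicked together one after another.
local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 16, 5, 5))  -- 3 input planes -> 16 feature maps, 5x5 kernels
model:add(nn.ReLU())                           -- a non-linearity
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))    -- 2x2 max pooling
model:add(nn.View(16 * 14 * 14))               -- flatten (assuming 32x32 inputs)
model:add(nn.Linear(16 * 14 * 14, 10))         -- a fully connected layer to 10 classes
model:add(nn.LogSoftMax())                     -- class log-probabilities
```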

There's two other types of containers that you might use. But generally, NN shines when your architecture is linear, not when it's got some crazy branches or anything like that. There's not a lot of API to the NN package. So if you learn these couple functions, which will be in the slides for later if you want to refer back to them, you'll understand all the mechanisms that you need to know to push data through a neural network and then to push it through a criterion or a loss function and then to pull those gradients back in order to make a gradient update to your model.
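Concretely, those few functions look something like this. This is a sketch of one training step, assuming the model from the sketch above, a matching input and target, and the cunn package if you want the GPU one-liner:

```lua
require 'nn'

local criterion = nn.ClassNLLCriterion()       -- pairs with a LogSoftMax output

-- input: a 3x32x32 image tensor, target: a class index (both assumed given).
local output = model:forward(input)            -- push data through the network
local loss = criterion:forward(output, target) -- how bad are we right now?
model:zeroGradParameters()
local gradOutput = criterion:backward(output, target)
model:backward(input, gradOutput)              -- pull gradients back to the parameters
model:updateParameters(0.01)                   -- vanilla SGD step, learning rate 0.01

-- And the one-liner to move the whole thing to the GPU (needs the cunn package):
-- model:cuda(); criterion:cuda()
```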

So these are really the APIs, the levers that you need to know to kind of drive your neural network. And, of course, we have a CUDA back end for NN. So in the same way that you'll just call CUDA on some data, you can call CUDA on a container.

And that will move the whole model onto the GPU. And then anything that you do with that model will occur on the GPU. So it's kind of a one-liner to start training models on a graphics processor. So for doing feedforward neural networks, NN is pretty great. But for starting to try weirder architectures, like Richard Socher yesterday mentioned, a pretty complicated NLP model that starts with GloVe vectors, which are kind of like shallow neural networks, and then a recursive neural network and then an attention mechanism, and all these things were interacting in strange ways -- that's actually pretty hard to specify in NN.

At Twitter, we have a package called Torch Autograd, which makes these kinds of gluing different model pieces together really easy. And, in fact, the pieces can be as small as addition, division, multiplication, and subtraction. So you can glue together any size piece of computation and still get a correct model out.

And we'll talk about that in a moment. The optim package is what you need in order to train models with stochastic gradient descent or AdaGrad or AdaDelta, whatever optimizer you favor. The API is pretty straightforward, but maybe a little bit different for people kind of coming from the Python world.

It's got a bit of a functional approach, where you'll pass a function to optim that will evaluate your neural network and pass back the gradients. So that's just something to be aware of. It's a little bit of a different style. Another gotcha with optim that you might run into, and you'll see in some of the notebooks that are online, is your parameters should be linear in memory.
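A minimal sketch of that functional pattern, assuming a model, criterion, input, and target already exist:

```lua
require 'optim'

-- Flatten all parameters (and their gradients) into single linear tensors.
local params, gradParams = model:getParameters()

-- optim wants a closure that returns (loss, gradient-of-loss).
local function feval(x)
   if x ~= params then params:copy(x) end
   gradParams:zero()
   local output = model:forward(input)
   local loss = criterion:forward(output, target)
   model:backward(input, criterion:backward(output, target))
   return loss, gradParams
end

local sgdState = {learningRate = 0.01}
optim.sgd(feval, params, sgdState)   -- one optimization step
```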

So if you want to optimize two neural networks that are interacting in some way, you actually need to first bring their parameters together into one tensor and then pass that to optim. It's just something to be aware of. So I want to talk for the rest of the talk about Torch Autograd, but also about some of the ideas that are behind Torch Autograd and how those link all the deep learning libraries that you possibly could choose.

So first I want to take a step back and say that -- just appreciate the wonderful stable abstractions that we have in scientific computing. So before Fortran, you know, back in '57 -- I don't think anybody uses FORTRAN 57, but people might actually still use Fortran 90 -- the idea of an array didn't exist on a computer.

And it really took some pretty crazy thinking, I think, to build a system that made array something we take for granted. Same with linear algebra. Over about a 20-year period, starting in the late '70s, people decided, oh, maybe we should think about linear algebra in a systematic way. And now we don't really worry about this.

If you want to multiply two matrices, that used to be a PhD's worth of work to do that at scale. And now we just -- we don't even actually import BLAS. There's so many wrappers of BLAS that we don't even think about this anymore. So this is another abstraction.

And also the idea that we should have all of the routines that we would possibly want to call in one place available that we don't have to write, that was kind of invented, I would say, by MATLAB in the mid '80s and then really popularized in the open source community by NumPy.

And we should take them for granted. We should totally forget about them. Because they make us faster; it's better for us to be able to assume these things will work. So machine learning has other abstractions besides these computational ones that we take for granted. All gradient-based optimization, that includes neural nets as a subset, relies on automatic differentiation to calculate those gradients.

And I like this definition from Barak Pearlmutter: automatic differentiation mechanically calculates derivatives of functions expressed as computer programs. So it doesn't derive things I write on a piece of paper with a pencil. It derives computer programs, at machine precision and with complexity guarantees. Those last two clauses differentiate it from finite differences, where you take the input to a program, you perturb it slightly, and you measure the gradient that way.

That's a very bad way to measure gradients. It's numerically very unstable. And it's not symbolic differentiation. So it's not writing down the symbolic expression of a neural network, putting it in Mathematica or Maple, and then asking for the derivative. Because your expression might go from this to this. So you get expression swell when you do naive symbolic differentiation.

And you don't get that with automatic differentiation. So automatic differentiation, I would say, is the abstraction for gradient-based machine learning. It's been rediscovered several times. There's a review by Widrow and Lehr. I think the first implementation where it actually operates on a computer program was by Bert Speelpenning in 1980, although it had been described back in 1964 by Wengert.

In neural networks, Rumelhart is the one that I suppose popularized it as backpropagation, although backpropagation is a special case of autodiff. This I think is important. In nuclear science and computational fluid dynamics and in weather modeling, these people have been using autodiff for years, decades. And their tools in many ways are much more sophisticated than what we have in machine learning.

There's a lot of ideas that we have yet to import from people that model the weather that would really benefit our ability to train larger and larger models. And I would clarify that our abstraction in machine learning is actually reverse mode automatic differentiation. There's two different types, two extremes I should say, forward mode and reverse mode.

You never hear about forward mode. And you never hear about forward mode in machine learning because it's a very bad idea to try forward mode in machine learning, and I'll show you why. So here is a cat picture from the internet. And my job at my job is to decide that that is in fact a cat picture.

This is actually something that we do do at Twitter. What I am doing is passing this cat through successive layers of transformations and eventually producing a probability over classes. I'm getting it wrong. My classifier thinks it's a dog, so I'd like to train my neural net to think it's a cat.

So I have a loss, a gradient of my loss, and I have it with respect to my parameters. And this is my gradient that will let me update my parameters. And it is composed of multiple pieces. And using the chain rule, I know that I can fold this together to actually compute the thing I want, which is the gradient of the loss with respect to the parameters.

The issue is I can do it either left to right or right to left. So going from left to right looks like this. Whoops, that was very fast. Okay. I'll do two big matrix-matrix multiplies. So this is bad. This is not good because we have these huge matrix-matrix products that we're keeping around.

It's actually worse than this, and I'll show you in another view of forward mode. So say I have a computer program, so no longer a symbolic representation of a neural net. This is just some computer program. And let's say I'd like to optimize A. A is the single parameter of my neural net.

It's a very silly, trivial example, but I think it will help illustrate the point. So I can execute this program and look at all of the arithmetic operations that occur and build what's called a trace. So I'll define, say, A is 3. I'll define B as 2. C is 1.

And then I'll start executing the code. I'm actually going to look if B is greater than C and choose a branch to operate on, but then ignore it in my trace. So I've chosen one of those branches, which is the first, because B is greater than C. And I have some output value D, and I'll return the output value.

So this is a trace execution of my program given some inputs. So to calculate in forward mode the derivative of my output D with respect to A, I'll define A as 3 and then initialize a gradient of A with respect to itself. And the idea is I eventually want the derivative of D with respect to A, and I'll build it up sequentially.

dA/dA, and then I'll do dB/dA, and then dC/dA, and dD/dA. So I'm moving from the left to the right, building up my gradient. I can't do much about the derivative of B with respect to A right now. So I'll define C and the derivative of C with respect to A.

And then I have my value D. And then I can define my target value, which is the gradient of D with respect to A. So if I wanted the gradient of D with respect to B-- so if I had a two-parameter neural network and I wanted to optimize both at once-- I would have to execute this whole thing again and initialize this guy here, dB/dB, as 1.

So if you have a million parameters in your neural network, or tens of millions, you have to do a million evaluations of forward mode, or tens of millions of evaluations of forward mode. It's a very bad idea to try forward mode automatic differentiation on neural network, and that's why you've probably never heard of it.
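To make that concrete, here is a toy forward-mode trace written out by hand in plain Lua. The actual program on the slide isn't reproduced in the transcript, so assume it computes d = b * c + a on the branch that gets taken:

```lua
-- Forward mode, by hand: carry (value, derivative-with-respect-to-a) through the program.
local a, da = 3, 1      -- seed: da/da = 1
local b, db = 2, 0      -- b and c don't depend on a
local c, dc = 1, 0

local d, dd
if b > c then           -- control flow is just evaluated; only the taken branch is traced
   d  = b * c + a
   dd = db * c + b * dc + da   -- product rule, plus the +a term
else
   d  = b - a
   dd = db - da
end
print(d, dd)            -- value of d and dd/da, from one left-to-right pass

-- To also get dd/db you would have to re-run the whole thing with db seeded to 1:
-- one pass per parameter, which is why forward mode is hopeless for networks
-- with millions of parameters.
```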

So now you can forget about it. But the alternative is reverse mode, and that's starting from the right to the left. So now I've got this nice matrix vector products, which are much smaller, and the complexity is much better. And there's an interesting difference when I actually go to do this in computer code.

And you'll see these words are closer together, and that's because for reverse mode, I actually have to evaluate the whole program before I can start deriving. Because I'm starting with the derivative of D with respect to D, and then computing the derivative of D with respect to C, with respect to B, with respect to A.

So I'm going the other way, but I have to have all the information first before I start that. So now I can initialize derivative of D with respect to D, and I can walk backwards and return both the value and the gradient. What's really nice about this is you'll notice here, I actually have all the information I need to calculate the derivatives of D with respect to these other parameters.

So that's why we really like reverse mode autodiff, aka back propagation for neural nets, is if you have a million of these guys, you really want to be ready to compute them all at once. And doing these with matrices is a very efficient thing to do on the computer.
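And here is the reverse-mode version of the same toy program (again assuming d = b * c + a): evaluate everything first, then sweep the trace once from right to left:

```lua
-- Reverse mode, by hand: run the whole program forward, then walk backwards.
local a, b, c = 3, 2, 1

-- Forward pass (remembering the intermediate values as we go).
local d = b * c + a

-- Backward pass: seed dd/dd = 1 and walk the trace right to left.
local dd_dd = 1
local dd_da = dd_dd * 1      -- d depends on a through the +a term
local dd_db = dd_dd * c      -- through the b*c term
local dd_dc = dd_dd * b

-- One backward sweep gives the derivative with respect to *every* input,
-- which is exactly what you want when you have millions of parameters.
print(d, dd_da, dd_db, dd_dc)
```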

So we've implemented this, trace-based automatic differentiation, in a package called autograd. And this is the entirety of a neural network. So this is how you'd specify and train a neural network in autograd. So I'll initialize my parameters. They'll just be some random numbers. And then here is my neural network function.

I'm multiplying my image that I'm passing in by my weight matrix, and adding a bias, non-linearity, doing it again, and then returning some probabilities. And I have a loss, which will take in an image and return a prediction. So just using this function. And then I'll just take the mean squared error, or the sum squared error.

In order to get the gradients of this function, the derivative of the loss with respect to these parameters, all I have to do is import this autograd package, and then call grad on this function. This returns a new function that returns the gradients of my original function. So it's what's called a higher order function.

Its inputs and its outputs are a function. So whenever you see that nabla, that upside down triangle, the grad triangle, this is the coding equivalent of that. And then to train, we'll just call our D loss function on our parameters, our image, and our label, which I'm just pretending like you already have a system to get here.
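Put together, the whole thing looks roughly like this. This is a sketch rather than the exact slide code: the layer sizes and learning rate are placeholders, image and label are assumed to come from your data pipeline, and it assumes torch-autograd's convention that the returned function gives you the gradients followed by the original function's value:

```lua
require 'torch'
local grad = require 'autograd'

-- Parameters are just plain tensors in a Lua table.
local params = {
   W1 = torch.randn(100, 50), b1 = torch.randn(1, 50),
   W2 = torch.randn(50, 10),  b2 = torch.randn(1, 10),
}

-- The neural net is just a function of (params, image): multiply, add, nonlinearity, repeat.
local function predict(params, image)
   local h = torch.tanh(image * params.W1 + params.b1)
   return torch.tanh(h * params.W2 + params.b2)
end

-- The loss is another plain function (sum squared error here).
local function loss(params, image, label)
   local err = predict(params, image) - label
   return torch.sum(torch.cmul(err, err))
end

-- grad is a higher-order function: give it the loss, get back a function
-- that returns the gradients of the loss with respect to its first argument.
local dloss = grad(loss)

-- One SGD step (image assumed 1x100, label assumed 1x10, from your data pipeline).
local grads, l = dloss(params, image, label)
for name, g in pairs(grads) do
   params[name] = params[name] - 0.01 * g
end
```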

And we have our gradients. And then we're updating with stochastic gradient descent here. So it's a very thin-- it's really just this. This is the interface with which you talk with autograd. So what's actually happening? So here's my simple function. As we evaluate it, we're actually keeping track of everything that you're doing in order to be able to reverse it.

So we're actually building that trace list that I described before and keeping track of it internally. So we'll start on line-- I guess that's 5. So we'll multiply some things. We'll keep track of the fact you multiplied and the inputs. We'll keep track of the addition and the inputs and also the output of addition.

We'll keep track of inputs, outputs, and the function every time. And we'll kind of walk down this function and build your compute graph just in time. So as you're running your code, we're learning what you've done. And the way we track that-- and I won't go into details-- we actually replace every function in Torch with like a spy function.

So instead of just running Torch.sum, our spy function says, oh, I hear you're running Torch.sum. Let me remember the parameters you gave me. Let me run sum on those parameters, remember the output, and then return it like nothing happened. But internally, we're remembering all those things. And the way we do this to actually compute the gradients is we're walking back this list like I described before.
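As a toy illustration of that idea -- this is just the concept, not the actual torch-autograd internals -- a spy wrapper records what was called, with what inputs, and what came out, while behaving exactly like the original function:

```lua
require 'torch'

-- A tape of everything that happened during the forward pass.
local tape = {}

-- Wrap a function so that calling it also records the call on the tape.
local function spy(name, fn)
   return function(...)
      local output = fn(...)                           -- run the real function
      table.insert(tape, {name = name, inputs = {...}, output = output})
      return output                                    -- return as if nothing happened
   end
end

local tracedSum = spy('torch.sum', torch.sum)
local s = tracedSum(torch.rand(3))   -- behaves just like torch.sum...
print(s, #tape)                      -- ...but the call is now on the tape
```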

And every time we get to a point where we need to calculate a partial derivative, we look it up. And we've written all of the partial derivatives for Torch functions. And really, every neural network library is going to do this at some level of granularity. So let me walk you through another couple examples just to show you what it could do.

So this is kind of a pretty vanilla one. We can add and multiply scalars and get the correct gradient. This is where things get a little bit more interesting if there's an if statement. So this control flow can be a little bit difficult or awkward in a lot of existing deep learning libraries.

Because we just listen to what arithmetic functions get run, we ignore control flow. So we just go right through this stuff. So we can get the correct gradient even with if statements. We actually care about tensors when we're doing optimization or machine learning. So everything I've shown you that works with scalars also works with tensors just as easily.

This is in the notebook that is on the GitHub repository if you want to play with it. This is where things get a little bit interesting. For loops also work just fine. And not just for loops that have a fixed length, which is something that is perhaps easy to unroll.

But for loops whose duration can depend on data you just computed. Or while loops whose stopping condition can depend on a computation that occurs in the while loop. We don't really care. We're building your graph dynamically. And when it's done, when you've returned some value, we'll calculate the derivatives of the graph that we have.
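Here is a small sketch of that kind of data-dependent control flow under torch-autograd. The function itself is made up for illustration, and it assumes the same gradients-then-value return convention as before:

```lua
require 'torch'
local grad = require 'autograd'

-- A function whose graph changes from sample to sample: the branch taken and
-- the number of loop iterations both depend on the input data x.
local function f(params, x)
   local y = torch.cmul(params.w, x)
   if torch.sum(x) > 0 then               -- branch chosen based on the data
      y = y * 2
   end
   local steps = math.max(1, math.floor(torch.sum(torch.abs(x)) * 3))
   for i = 1, steps do                    -- loop length depends on the data, too
      y = y * 1.5
   end
   return torch.sum(y)
end

local df = grad(f)
local params = {w = torch.rand(5)}
local x = torch.rand(5)
local grads, value = df(params, x)
print(value)
print(grads.w)   -- correct gradients for whatever graph was actually built
```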

You can turn any for loop into a recursive function. This is kind of wacky. I don't know how you would actually use this in practice. But you can cook up a lot of crazy things you might try with Autograd and they just work. So here we have a function f.

If b is at some stopping condition, we'll return a. Otherwise we'll call f. And we're going to differentiate this. So we're going to differentiate a fully recursive function. And it works just fine. Another aspect, which is coming up more and more as papers come out, is basically disrespecting the sanctity of the partial derivative, you know, of the gradient.

People are computing synthetic gradients. They're adding to gradients, they're clipping gradients. People are messing with kind of the internals of back propagation or of autodiff. It's actually pretty easy to start to engage with that in Autograd. So say I'm going to sum the floor of a to the third power. So the floor operation is piecewise constant.

So the derivative is zero almost everywhere except for where it's undefined. Why would I want to do this? For instance, if you wanted to build a differentiable JPEG encoder or differentiable MPEG encoder -- in compression algorithms like that, there's often a quantization step that will floor or round or truncate numbers.

And if you wanted to differentiate through that to build like a neural JPEG algorithm or something, you need to pass gradients through something that ordinarily does not pass them. And so if we look at what the gradient is, it's zero everywhere. I won't go into the details, but you can ask Autograd to use your own gradient for anything.

So if you have a new module that you want to define, and either you've written high-performance code for it and you want to use it, or you want to redefine or overwrite the gradients that we have, there's a pretty easy mechanism for doing that. And then when you call your special.floor, you can propagate gradients through it.

And here I was just saying basically ignore the gradient of floor. So this is a toy example, but there are real places where you have a non-differentiable bottleneck inside of your compute graph, and you want to either hop over it or find some approximation. And Autograd has a mechanism for very easily plugging those types of things in.

So that's a bit of what Autograd is and what it can do. And I want to turn our attention to how Autograd relates to other deep learning libraries and maybe how they're common and how they're similar and how they're different. So one big difference that I found between different deep learning libraries is the level of granularity at which you are allowed to specify your neural network.

So there's a lot of libraries where you say you get a convnet or you get a feedforward neural network, and that's it. So the menu is two items long. And that's fine. But I think Andrej really hit it on the head where if you want to solve a problem, don't be a hero.

Use somebody else's network. So maybe this is VGG that you've downloaded from the model zoo or something like that. So this is the don't-be-a-hero regime on the left. In the middle, there's a lot of really convenient neural net-specific libraries like Torch's NN and Keras and Lasagne. And you get to put together big layers.

And you don't really get to see what's inside those layers, but you get to click together linear layers or convolutions. And usually that's kind of what you want to do. And on the far end of the spectrum, the things you can click together are the numeric functions in your kind of host scientific computing library, right, like add, multiply, subtract.

And these are features of projects like Autograd and Theano and TensorFlow. And the reason why these boundaries are made is because the developers have chosen to give you partial derivatives at these interfaces. So this is how they've defined their APIs. These are the interfaces across which you as a user cannot pass.

If you want a new one of these modules for the type on the left or the type in the middle, you have to go in and build a whole new module and actually implement the partial derivatives. But with the types of libraries on the right, you can build your own modules by composing primitive operations.

So that's one difference that you can find. In practice, how these things are implemented under the hood usually means this is the totally shrink-wrapped stuff and maybe they implemented this whole thing by hand. Usually these guys in the middle are wrappers. They're wrapping some other library. And the guys on the right are usually actually implementing automatic differentiation.

So Autograd and Theano and TensorFlow all implement autodiff. And the guys in the middle are taking advantage of that to make more convenient wrappers. So another aspect that's different is how these graphs are built. So I'll remind you, in Autograd, we build these things just in time by listening to what you're doing and recording it.

But that's not how all neural network libraries are built. And this is an axis along which I think that they are differentiated meaningfully. So there's a lot of libraries that build these graphs explicitly, where you say, I'm going to click this Lego block into this Lego block, where I'm going to give you this YAML specification file.

The graph is totally static and you really have no opportunity for compiler optimizations there. And then there are the just-in-time libraries, so Autograd and Chainer is another one, where you get any graph. The graph can be anything. It can change from sample to sample. The length of the graph can be determined by the compute that occurs in the graph.

You have very little opportunity for compiler optimizations there. So speed can be an issue sometimes. And in the middle, there's ahead-of-time libraries like TensorFlow and Theano, where you construct your graph using a domain-specific language, you hand it off to their runtime, and then they can do crazy stuff to make it faster.

The problem with that is it can be awkward to work with control flow -- I guess that got cut off on the slide. And I think there's a reason why it can be awkward to work with control flow. And it's because of the types of graphs that these libraries are actually manipulating.

So we say compute graph a lot, we say data flow graph a lot. Data flow graph has a pretty restricted meaning, and it means that the nodes in your graph do computation and the edges are data. And there's no room for control flow in a graph that is a data flow graph.

So static data flow is the type of graph that NN and Caffe use, because all the ops are the nodes and the edges are just the data, and the graph can't change. Limited data flow -- just-in-time compiled data flow like Autograd and Chainer -- has the same characteristics, but the graph can change from iteration to iteration, because we wait until you're done computing the forward pass to build the graph.

In the middle, there's kind of a hybrid, and I don't know what to call that graph type. The ops are nodes, the edges are data, but then there's special information that the runtime gets in order to expand control flow or for loops. So scan in Theano is an instance of this, where the Theano runtime has special information that allows it to make scan work, but it's kind of, it's conspiring with the graph data type to do that.

There's actually another graph type that naturally expresses control flow and data flow together that I haven't seen implemented in a deep learning library. It's called sea of nodes, from Cliff Click's thesis in the mid-'90s. It seems like a really natural thing to try, and maybe that's something that comes up in the future, but that's kind of a big question mark.

Maybe one of you will try that out and see how well it works. So in practice, this level of granularity can sometimes slow us down. Having to work with addition and multiplication can be nice if you want to try crazy stuff, but if you know you want to make a convnet, why don't you just rush all the way over to the left?

If you want to take, you know, inception and add another layer, you want to use the type in the middle. And Autograd allows you to do that. So I'll just kind of walk through writing a neural net three ways very quickly and then close for questions shortly thereafter. So using the fully granular approach, there's a lot of text on the screen, but the top half is basically let's instantiate our parameters the way that we want to, and then here, just like I've showed you in previous slides, let's do a multiply and let's do an addition and put it through nonlinearity.

We're being very explicit, right? So we're breaking all the abstraction boundaries and we're just using primitive operations. We can use the layer-based approach. So in Autograd, we have a facility to turn all of the NN modules, of which there are a lot, maybe an exhaustive list for what you'd want to use for standard deep learning applications.

You can turn them into functions and then just use them. So linear one on the linear parameters and your input and some activation. You can go through your neural network this way. So you can use a layer-based approach if you want. And if you just want neural network, just a feedforward neural network, we've got a couple of these kind of standard models just ready to go.

So you can just say, give me a neural network, give me a Logsoft max and a loss, and let me glue these guys together. So you can do it any of those three ways. Autograd at Twitter has had a pretty cool impact. We use NN for a lot of stuff and we use Autograd as well, but being able to reach for Autograd to try something totally crazy and just knowing that you're going to get the right gradients has really accelerated the pace of high-risk, potentially high-payoff attempts that we make.

So one crazy thing you might want to try is experiment with loss functions. So instead of, I have 100 image classes and I want to have my convolutional neural network be good at classifying those 100 image classes, maybe you have a taxonomy of classes. Maybe you have a vehicle and then a bus, a car, and a motorcycle.

If you guess any one of those, you kind of want partial credit for vehicle, or if you guess motorcycle, you want partial credit for car. So building that kind of a tree loss is actually really straightforward in Autograd, and you can do that in just one sitting. But it might be more complicated to do that in other libraries where you have to crack open the abstraction barrier, write your own partial derivatives, glue it back together, and then use that module that you've built.

We've trained models that are in production in Autograd. So this is something that's battle-tested in a sense, and is running on a large amount of media at Twitter. In a sense, Autograd doesn't actually matter when you're running in production, because you have your function definition for your prediction of your neural network, and then the gradient part just goes away.

So all the fancy stuff where we replace Torch functions with our secret listener functions, all that just goes away and you just have some numerical code. So there's actually no speed penalty at test time at all. We have an optimized mode, which does a little bit of compiler stuff, still a work in progress.

But for the average model, it's as fast, sometimes faster than NN. And for really complicated stuff, if you wrote that by hand, it would probably be faster. But the time to first model fit using Autograd is dramatically reduced, because you don't have to worry about correctness. So this is a big wall of text, but it's meant to put in your head some ideas of things from automatic differentiation from that world that we don't have yet, that we really want, to be able to train models faster and better.

So the first is checkpointing. This is not checkpointing where you save your model every 10 iterations. This is checkpointing where, on your forward pass, in normal reverse mode automatic differentiation, you have to remember every single piece of computation you do, because you might need it to calculate the derivatives.

In checkpointing, you just delete them. You let them go away, because you think that some of those might actually be easier to recompute than to store. So for pointwise nonlinearities, for instance, it might be easier, once you've loaded your data, just to recompute the ReLU, as opposed to saving the result of ReLU and loading that back in again.

Mixing forward and reverse mode is something that you can imagine being important for kind of complicated architectures, although I don't really know how much impact that would have. So in the chain rule, you can either go from left to right, or you could start in the middle and go out.

You can do all kinds of crazy stuff if you want. And we really just do reverse mode. For diamond-shaped graphs, where your computation explodes out and then comes back in, it might be useful to start with forward mode and then finish with reverse mode. Or in an hourglass-shaped graph, you might want to start with reverse mode and end with forward mode.

Stencils are a generalization of convolutions that people use a lot in computer graphics. Basically, calculating really efficient derivatives of general image processing algorithms is under active investigation in the graphics world and in the computer vision world. So these are two references that are kind of neat papers.

Source-to-source transformations is something that hasn't really made it-- it basically has kind of been dormant for about 10 or 15 years. So the gold standard used to be, you take a piece of code as text, and you output another piece of code as text. What we're doing now in deep learning is we're always building runtimes.

We're always building some domain-specific layer that depends on you actually running code. It used to be that you would just read that text and kind of like a compiler, spit out the gradient. This was the gold standard. It might not be now, but I think it's worth reinvestigating. And then higher-order gradients.

So Hessian vector products and kind of Hessian-based optimization maybe doesn't always have full payoff. I actually don't recall hearing anything about this at this school so far, because it's very expensive and difficult to do, expensive computationally. Hessian is just if you take the grad of f, it gives you the gradients.

If you want the second derivative, so you take grad of grad of f. So there's efficient ways to do this. It's still kind of an open problem, but there are libraries out there. The Python version of Autograd does this well. DiffSharp and Hype both also do this as well.

So to kind of close out, you should just try it out. It's really easy to get it. If you have Anaconda, if you use Python, we've made it so that Lua is fully installable with Anaconda. So if you're already using it, it's very, very easy to get all of the tools that I've showed you today.

And that's kind of the single line to interface with it. And if you have any questions, you can find me on Twitter or email or GitHub, but I'm happy to answer any questions that you have. >> We have plenty of time for questions. >> Oh, yeah. I have no idea.

>> Hi. Thanks for the great talk. I was wondering what's the state of the data visualization facilities in Lua compared to, say, Python. >> If I'm frank, it's not as good. Python has been at this for, you know, five, ten years, really actively building matplotlib and, you know, seaborn and all these other libraries.

And in Lua, we're importing other people's work. So bokeh.js is really the best that I've seen so far. And that's something you can use in the notebook. So you have the full suite of that particular library. Yeah. >> Hi. Thanks for the talk. Is it possible to convert a model trained with Torch into a C model that's deployable in, like, you know, production?

>> Yeah, for sure. We just run Torch in production; we use the Lua model directly. But say you want to run it in C. So the whole layer of Torch that's actually doing the work is in C. And calling Torch from C, I don't have a specific website I can point you to.

But you can very easily call and execute a Lua script from C. It's like three or four lines of code in C. >> Cool. Thank you. >> I'd like to follow up on the question about C just now. Like, if I want to compile -- I mean, if I want to have Torch in my C++ code, what kind of overhead do I see?

Do I see a lot of overhead? Just now you mentioned there's, like, a 10,000-line Lua just-in-time compiler that I need to add in there, right? Or can I avoid that? Because, for example, I'm thinking about putting Lua in an embedded system that has a limited amount of resources.

>> During inference time -- sorry. During inference time, there's no appreciable overhead, if I'm understanding your question right. So you are embedding Lua. So in your C code, you're going to basically say, Lua, please run this Lua script. And that's going to call out into other C code.

So all this overhead I talked about with Autograd, that's training time. That doesn't exist at test time at all. >> So during test time, but the thing is I still need to have Lua compiled into my C code; right? >> Yeah. So this is something people have been doing for, like, 15, 20 years.

It's pretty mature. So Lua is in, like, microwaves, for instance. People have done very embedded applications of Lua. I think the binary for Lua is, like, I don't want to -- it's, like, kilobytes. That's very, very small. There's 10,000 lines of code. So when it compiles down, it's small.

So there's a question from the Twitters. It says, I'm using a combination of Keras and TensorFlow. Why should I use Torch or Autograd? If you're happy, then, you know, that's great. I guess -- so people tend to reach for Torch when they would like to be able to reason very easily about performance.

The more compiler infrastructure that gets added to a deep learning environment, the harder it can be for the end user -- right, the further away you are from the people that originally made the library, the harder it can be to reason about why is this slow, why is this not working.

You might eventually see some GitHub issue later. Why is my network slow in these conditions, and then it gets closed a year after you had to have shipped your project. I mean, these things can happen. It's not the fault of anybody. It's just that Torch was designed to basically be a thin layer over C code.

So if that's something that you care about, Torch is a really good thing to work with. If Keras and TensorFlow are working great for you, then keep deep learning. You know, that's awesome. So... I'm trying to see. Maybe. It's hard to filter. Where will the slides be posted? It's not a deep learning question, but...

They will be posted. That's the answer to that question. >> I have a question. How do I access these models? So normally all the web services in production generally are in other languages... in a Flask-based application in Python or Java-based web services, right? Or maybe on the cell phone through Android, which is also Java, right?

So how do you call these models which were trained in Torch? How would you actually access those? >> There's a couple different ways you can do that. If you're using a feedforward neural network, writing the Java code to do the matrix multiplies can be pretty straightforward. We've actually done that before.

Or it's just simpler to just write the deep learning code, load in the weights. We'll serialize it however it needs to be loaded. That's one approach. It's kind of hacking short-term. At Twitter, we've engineered a system where we actually have Lua virtual machines running inside of Java, and we talk over the JNI.

So we have a more permanent solution for that. But if you're using standard model architectures, you might try to serialize your weights and then use the native deep learning library that exists to load up those weights and then run for it. And with some debugging, I think that's a perfectly fair approach, if you have this split between testing and kind of deployment where you're constrained by language or environment.

>> That's generally the thing that, you know... I mean, you do basically just serialize your model and then try to read it. What about the latency, actually? So related to this... So when you serialize that hackish way, at least you can get that latency thing solved out. But is there any plan, basically, to have interfaces available for other languages so that you don't have to do this extra step of serializing and then loading it into a different language?

Because if you don't, like in your case, you were mentioning that in Twitter you have Lua available inside your Java JVM, access through the JVM using JNI. So what impact does it have on the latency for those models? >> And by latency, you mean time to ship the model, not the latency of how long it takes to make a prediction?

>> Predictions, basically. >> That's going to be very engineering-dependent. So if you're calling Torch from C code, the latency is not appreciable over if you're just running Lua code. And that can be extremely fast. If you're going through some wrapper, like through the JNI or something like that, you will incur an overhead, and you should just try to pick the interfaces that reduce that as much as possible, even if you incur engineering overhead to do so.

I don't know if that answers your question. >> So do you have any numbers that basically you have seen in the past, you know, the latency numbers? >> I'm a little bit distant from the service side, so I can't give you -- I just don't know. But generally, I think what I can say that's fair is we're constrained by machine learning, you know, model complexity latency.

We are not constrained by overhead of, like, figuring out how to actually get those predictions like to an HTTP request, for instance. That's not constraining. >> Yeah, like TensorFlow has TensorFlow serving, which is kind of sort of solving this problem. >> Yeah. >> Is there anything in line? Do you know?

>> Not that I'm aware of. Again, the Torch community is not centralized, and so people could be working on a totally awesome, you know, complement to the TensorFlow server, but I am not aware of it. >> Thank you. >> Okay. We are going to take a short break of 15 minutes.

Let's thank Alex again. >> Thank you. >> Thank you. >> Thank you.