Back to Index

Deep Learning for Speech Recognition (Adam Coates, Baidu)


Chapters

0:15 Speech recognition
4:47 Traditional ASR pipeline
10:49 Deep Learning in ASR
14:59 Scale model
15:49 Outline
17:23 Raw audio
18:28 Pre-processing
19:46 Spectrogram
21:57 Acoustic Model
35:34 Connectionist Temporal Classification (CTC)
39:07 Training tricks
48:13 Max Decoding
52:19 Language models
58:43 Decoding with LMs: Examples
59:14 Rescoring

Transcript

So I want to tell you guys about speech recognition and deep learning. I think deep learning has been playing an increasingly large role in speech recognition. And one of the things I think is most exciting about this field is that speech recognition is at a place right now where it's becoming good enough to enable really exciting applications that end up in the hands of users.

So for example, if we want to caption video content and make it accessible to everyone, it used to be that we would sort of try to do this, but you still need a human to get really good captioning for something like a lecture. But it's possible that we can do a lot of this with higher quality in the future with deep learning.

We can do things like hands-free interfaces in cars, make it safer to use technology while we're on the go and keep people's eyes on the road. Of course, it would make mobile devices, home devices much easier, much more efficient and enjoyable to use. But another actually sort of fun recent study that some folks at Baidu participated in, along with Stanford and UW, was to show that for even something straightforward that we sort of take for granted as an application of speech, which is just texting someone with voice or writing a piece of text, the study showed that you can actually go three times faster with voice recognition systems that are available today.

So it's not just like a little bit faster now, even with the errors that a speech recognition system can make. It's actually a lot faster. And the reason I wanted to highlight this result, which is pretty recent, is that the speech engine that was used for this study is actually powered by a lot of the deep learning methods that I'm going to tell you about.

So hopefully when you walk away today, you have an appreciation or an understanding of the sort of high-level ideas that make a result like this possible. So there are a whole bunch of different components that make up a complete speech application. So for example, there's speech transcription. So if I just talk, I want to come up with words that represent whatever I just said.

There's also other tasks, though, like word spotting or triggering. So for example, if my phone is sitting over there and I want to say, "Hey, phone, go do something for me," it actually has to be listening continuously for me to say that word. And likewise, there are things like speaker identification or verification, so that if I want to authenticate myself or I want to be able to tell apart different users in a room, I've got to be able to recognize your voice, even though I don't know what you're saying.

So these are different tasks. I'm not going to cover all of them today. Instead, I'm going to just focus on the bread and butter of speech recognition. We're going to focus on building a speech engine that can accurately transcribe audio into words. So that's our main goal. This is a very basic goal of artificial intelligence.

Historically, people are very, very good at listening to someone talk, just like you guys are listening to me right now. And you can very quickly turn audio into words and into meaning on your own, almost effortlessly. And for machines, this has historically been incredibly hard. So you think of this as like one of those sort of consummate AI tasks.

So the goal of building a speech pipeline is, if you just give me a raw audio wave, like you recorded on your laptop or your cell phone, I want to somehow build a speech recognizer that can do this very simple task of printing out "Hello, world" when I actually say "Hello, world." So before I dig into the deep learning part, I want to step back a little bit and spend maybe 10 minutes talking about how a traditional speech recognition pipeline is working, for two reasons.

If you're out in the wild, you're doing an internship, you're trying to build a speech recognition system with a lot of the tools that are out there, you're going to bump into a lot of systems that are built on technologies that look like this. So I want you to understand a little bit of the vocabulary and how those things are put together.

And also, this will sort of give you a story for what deep learning is doing in speech recognition today that is kind of special and that I think paves the way for much bigger results in the future. So traditional systems break the problem of converting an audio wave, of taking audio and turning it into a transcription, into a bunch of different pieces.

So I'm going to start out with my raw audio, and I'm just going to represent that by X. And then usually we have to decide on some kind of feature representation. We have to convert this into some other form that's easier to deal with than a raw audio wave.

And in a traditional speech system, I often have something called an acoustic model. And the job of the acoustic model is to learn the relationship between these features that represent my audio and the words that someone is trying to say. And then I'll often have a language model, which encapsulates all of my knowledge about what kinds of words, what spellings and what combinations of words are most likely in the language that I'm trying to transcribe.

And once you have all of these pieces, so these might be -- these different models might be driven by machine learning themselves, what you would need to build in a traditional system is something called a decoder. And the job of a decoder, which itself might involve some modeling efforts and machine learning algorithms, is to find the sequence of words W that maximizes this probability.

The probability of the particular sequence W, given your audio. That's straightforward. But that's equivalent to maximizing the product of the contributions from your acoustic model and from your language model. So a traditional speech system is broken down into these pieces, and a lot of the effort in getting that system to work is in developing this sort of portion that combines them all.
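Written out, that objective is just Bayes' rule with the P(X) term dropped, since it doesn't depend on the word sequence W:

```latex
W^* = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\, P(W)
```

Here P(X | W) is the acoustic model's contribution and P(W) is the language model's.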

So it turns out that if you want to just directly transcribe audio, you can't just go straight to characters. And the reason is, and it's especially apparent in English, that the way something is spelled in characters doesn't always correspond well to the way that it sounds. So if I give you the word "night," for example, without context, you don't really know whether I'm talking about a knight in armor or whether I'm talking about night as in evening.

And so a way to get around this, to abstract this problem away from a traditional system, is to replace this with a sort of intermediate representation. Instead of trying to predict characters, I'll just try to predict something called phonemes. So as an example, if I want to represent the word "hello," what I might try to do is break it down into these units of sound.

So the first one is like the "h," that H sound in "hello," and then an "uh" sound, which is actually only one possible pronunciation of an E, and then an L and an O sound. And that would be my string that I try to come up with using all of my different speech components.

So this, in one sense, makes the modeling problem easier. My acoustic model and so on can be simpler, because I don't have to worry about spelling. But it does have this problem that I have to think about where these things come from. So these phonemes are intuitively, they're the perceptually distinct units of sound that we can use to distinguish words.

And they're very approximate. This might be our imagination that these things actually exist. It's not clear how fundamental this is. But they're sort of standardized. There are a bunch of different conventions for how to define these. And if you end up working on a system that uses phonemes, one popular data set is called TIMIT.

And so this actually has a corpus of audio frames with examples of each of these phonemes. So once you have this phoneme representation, unfortunately, it adds even more complexity to this traditional pipeline. Because now, my acoustic model doesn't associate this audio feature with words. It actually associates them with another kind of transcription, with the transcription into phonemes.

And so I have to introduce yet another component into my pipeline that tries to understand how do I convert the transcriptions in phonemes into actual spellings. And so I need some kind of dictionary or a lexicon to tell me all of that. So this is a way of taking our knowledge about a language and baking it into this engineered pipeline.

And then once you've got all that, again, all of your work now goes into this decoder that has a slightly more complicated task in order to infer the most likely word transcription given the audio. So this is a tried and true pipeline. It's been around for a long time.

You'll see a whole bunch of these systems out there. And we're still using a lot of the vocabulary from these systems. But traditionally, the big advantage is that it's very tweakable. If you want to go add a new pronunciation for a word you've never heard before, you can just drop it right in.

That's great. But it's also really hard to get working well. If you start from scratch with this system and you have no experience in speech recognition, it's actually quite confusing and hard to debug. It's very difficult to know which of these various models is the one that's behind your error.

And especially once we start dealing with things like accents, heavy noise, different kinds of ambiguity, that makes the problem even harder to engineer around. Because trying to think ourselves about how do I tweak my pronunciation model, for example, to account for someone's accent that I haven't heard, that's a very hard engineering judgment for us to make.

So there are all kinds of design decisions that go into this pipeline, like choosing the feature representation, for example. So the first place that deep learning has started to make an impact in speech recognition, starting a few years ago, is to just take one of the core machine learning components of the system and replace it with a deep learning algorithm.

So I mentioned back in this previous pipeline that we had this little model here whose job is to learn the relationship between a sequence of phonemes and the audio that we're hearing. So this is called the acoustic model. And there are lots of different methods for training this thing.

So take your favorite machine learning algorithm. You can probably find someone who is trained in acoustic model with that algorithm, whether it's a Gaussian mixture model or a bunch of decision trees and random forests, anything for estimating these kinds of densities. There's a lot of work in trying to make better acoustic models.

So some work by George Dahl and co-authors took what was a state of the art deep learning system back in 2011, which is a deep belief network with some pre-training strategies, and dropped it into a state of the art pipeline in place of this acoustic model. And the results are actually pretty striking, because even though we had neural networks and these pipelines for a while, what ended up happening is that when you replace the Gaussian mixture model and HMM system that already existed with this deep belief network as an acoustic model, you actually got something between like a 10% and 20% relative improvement in accuracy, which is a huge jump.

This is highly noticeable to a person. And if you compare this to the amount of progress that had been made in preceding years, this is a giant leap for a single paper to make, compared to progress we'd been able to make previously. So this is in some sense the first generation of deep learning for speech recognition, which is I take one of these components and I swap it out for my favorite deep learning algorithm.

So the picture looks sort of like this. So with these traditional speech recognition pipelines, the problem that we would always run into is that if you gave me a lot more data, you gave me a much bigger computer so that I could train a huge model, that actually didn't help me because all the problems I had were in the construction of this pipeline.

And so eventually, if you gave me more data and a bigger computer, the performance of our speech recognition system would just kind of peter out. It would just reach a ceiling that was very hard to get over. And so we just start coming up with lots of different strategies.

We start specializing for each application. We try to specialize for each user and try to make things a little bit better around the edges. And what these deep learning acoustic models did was in some sense moved that barrier a little ways. It made it possible for us to take a bit more data, much faster computers that let us try a whole lot of models, and move that ceiling up quite a ways.

So the question that many in the research community, including folks at Baidu, have been trying to answer is, can we go to a next generation version of this insight? Can we, for instance, build a speech engine that is powered by deep learning all the way from the audio input to the transcription itself?

Can we replace as much of that traditional system with deep learning as possible so that over time, as you give researchers more data and bigger computers and the ability to try more models, their speech recognition performance just keeps going up and we can potentially solve speech for everybody? So the goal of this tutorial is not to get you up here, which requires a whole bunch of things that I'll tell you about near the end.

But what we want to try to do is give you enough to get a point on this curve. And then once you're on the curve, the idea is that what remains is now a problem of scale. It's about data and about getting bigger computers and coming up with ways to build bigger models.

So that's my objective, so that when you walk away from here, you have a picture of what you would need to build to get this point. And then after that, it's hopefully all about scale. So thanks to Vinay Rao, who's been helping put this tutorial together, there is going to be some starter code live for the basic pipeline, the deep learning part of the pipeline that we're talking about.

So there are some open source implementations of things like CTC, but we wanted to make sure that there's a system out there that's pretty representative of the acoustic models that I'm going to be talking about in the first half of the presentation here. So this will be enough that you can get a simple pipeline going with something called max decoding, which I'll tell you about later.

And the idea is that this is sort of a scale model of the acoustic models that Baidu and other places are powering real production speech engines. So this will get you that point on the curve. Okay. So here's what we're going to talk about. The first part, I'm just going to introduce a few preliminaries, talk about preprocessing.

So we still have a little bit of preprocessing around, but it's not really fundamental. I think it's probably going to go away in the long run. We'll talk about what is probably the most mature piece of sequence learning technologies for deep learning right now. So it turns out that one of the fundamental problems of doing speech recognition is how do I build a neural network that can map this audio signal to a transcription that can have a quite variable length.

And so CTC is one highly mature method for doing this. And I think you're actually going to hear about maybe some other solutions later today. Then I'll say a little bit about training and just what that looks like. And then finally say a bit about decoding and language models, which is sort of an addendum to the current acoustic models that we can build that make them perform a lot better.

And then once you have this, that's a picture of what you need to get this point on the curve. And then I'll talk a little bit about what's remaining. How do you scale up from this little scale model up to the full thing? What does that actually entail? And then time permitting, we'll talk a little bit about production.

How could you put something like this into a cloud server and actually serve real users with it? Great. So how is audio represented? This should be pretty straightforward, I think. Unlike a two-dimensional image where we normally have a 2D grid of pixels, audio is just a 1D signal. And there are a bunch of different formats for audio, but typically this one-dimensional wave that is actually me saying something like, "Hello, world," is something like 8,000 samples per second or 16,000 samples per second.

And each sample is quantized into 8 or 16 bits. So when we represent this audio signal that's going to go into our pipeline, you could just think of that as a one-dimensional vector. So when I had that box called x that represented my audio signal, you can think of this as being broken down into samples, x1, x2, and so forth.

And if I had a one-second audio clip, this vector would have a length of either, say, 8,000 or 16,000 samples. And each element would be, say, a floating point number that I'd extracted from this 8 or 16-bit sample. So it's really simple. Now once I have an audio clip, we'll do a little bit of preprocessing.
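As a minimal sketch of that representation (assuming a mono, 16-bit PCM WAV file; the file name is just an illustration), loading a clip into a 1-D float vector might look like this:

```python
import wave
import numpy as np

def load_audio(path):
    """Read a mono 16-bit PCM WAV file into a 1-D float vector in [-1, 1]."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2           # 16-bit samples
        rate = w.getframerate()                # e.g. 8,000 or 16,000 samples per second
        raw = w.readframes(w.getnframes())
    x = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    return x, rate

# x, rate = load_audio("hello_world.wav")
# A one-second clip at 16 kHz gives len(x) == 16000.
```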

So there are a couple of ways to start. The first is to just do some vanilla preprocessing, like convert to a simple spectrogram. So if you look at a traditional speech pipeline, you're going to see things like MFCCs, which are mel-frequency cepstral coefficients. You'll see a whole bunch of plays on spectrograms where you take differences in different kinds of features and try to engineer complex representations.

But for the stuff that we're going to do today, a simple spectrogram is just fine. And it turns out, as you'll see in a second, we lose a little bit of information when we do this, but it turns out not to be a huge difference. Now I said a moment ago that I think probably this is going to go away in the long run.

And that's because today you can actually find recent research in trying to do away with even this preprocessing part and having your neural network process the audio wave directly and just train its own feature transformation. So there's some references at the end that you can look at for this.

So here's a quick straw poll. How many people have seen a spectrogram or computed a spectrogram before? Pretty good. Maybe 50%. OK. So the idea behind a spectrogram is that it's sort of like a frequency domain representation, but instead of representing this entire signal in terms of frequencies, I'm just going to represent a small window in terms of frequencies.

So to process this audio clip, the first thing I'm going to do is cut out a little window that's typically about 20 milliseconds long. And when you get down to that scale, it's usually very clear that these audio signals are made up of sort of a combination of different frequencies of sine waves.

And then what we do is we compute an FFT. It basically converts this little signal into the frequency domain. And then we just take the log of the power at each frequency. And so if you look at what the result of this is, it basically tells us for every frequency of sine wave, what is the magnitude, what's the amount of power represented by that sine wave that makes up this original signal.

So over here in this example, we have a very strong low frequency component in the signal. And then we have differing magnitudes at the other frequencies. So we can just think of this as a vector. So now instead of representing this little 20 millisecond slice as sort of a sequence of audio samples, instead I'm going to represent it as a vector here where each element represents sort of the strength of each frequency in this little window.

And the next step beyond this is that if I just told you how to process one little window, you can of course apply this to a whole bunch of windows across the entire piece of audio. And that gives you what we call a spectrogram. And you can use either disjoint windows that are just sort of adjacent or you can apply them to overlapping windows if you like.
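Here is a minimal sketch of that computation with NumPy: cut roughly 20 ms windows, take the FFT of each, and keep the log power per frequency. The window length and hop are the tuning parameters mentioned above.

```python
import numpy as np

def spectrogram(x, rate, window_ms=20, hop_ms=10, eps=1e-10):
    """Log-power spectrogram: one column of FFT log magnitudes per window."""
    window = int(rate * window_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    frames = []
    for start in range(0, len(x) - window, hop):
        chunk = x[start:start + window] * np.hanning(window)  # taper the window edges
        power = np.abs(np.fft.rfft(chunk)) ** 2               # power at each frequency
        frames.append(np.log(power + eps))                    # log scale
    return np.stack(frames, axis=1)   # shape: (num_frequencies, num_frames)
```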

So there's a little bit of parameter tuning there. But this is an alternative representation of this audio signal that happens to be easier to use for a lot of purposes. So our goal, starting from this representation, is to build what I'm going to call an acoustic model, but which is really, to the extent we can make it happen, is really going to be an entire speech engine that is represented by a neural network.

So what we would like to do is build a neural net that if we could train it from a whole bunch of pairs, X, which is my original audio that I turn into a spectrogram, and Y star, that's the ground truth transcription that some human has given me. If I were to train this big neural network off of these pairs, what I'd like it to produce is some kind of output that I'm representing by the character C here, so that I could later extract the correct transcription, which I'm going to denote by Y.

So if I said hello, the first thing I'm going to do is run preprocessing to get all these spectrogram frames. And then I'm going to have a recurrent neural network that consumes each frame and processes them into some new representation called C. And hopefully, I can engineer my network in such a way that I can just read the transcription off of these output neurons.

So that's kind of the intuitive picture of what we want to accomplish. So as I mentioned back in the outline, there's one obvious fundamental problem here, which is that the length of the input is not the same as the length of the transcription. So if I say hello very slowly, then I can have a very long audio signal, even though I didn't change the length of the transcription.

Or if I say hello very quickly, then I can have a very short piece of audio. And so that means that this output of my neural network is changing length, and I need to come up with some way to map that variable length neural network output to this fixed length transcription, and also do it in a way that we can actually train this pipeline.

So the traditional way to deal with this problem, if you were building a speech engine several years ago, is to just try to bootstrap the whole system. So I'd actually train a neural network to correctly predict the sounds at every frame using some kind of data set like TIMIT, where someone has lovingly annotated all of the phonemes for me.

And then I'd try to figure out the alignment between my saying hello in a phonetic transcription with the input audio. And then once I've lined up all of the sounds with the input audio, now I don't care about length anymore, because I can just make a one-to-one mapping between the audio input and the phoneme outputs that I'm trying to target.

But this alignment process is horribly error-prone. You have to do a lot of extra work to make it work well, and so we really don't want to do this. We really want to have some kind of solution that lets us solve this straightaway. So there are multiple ways to do it.

And as I mentioned, there's some current research on how to use things like attentional models, sequence-to-sequence models that you'll hear about later, in order to solve this kind of problem. And then, as I said, we'll focus on something called connectionist temporal classification, or CTC, that is sort of current state-of-the-art for how to do this.

So here's the basic idea. So our recurrent neural network has these output neurons that I'm calling C. And the job of these output neurons is to encode a distribution over the output symbols. So because of the structure of the recurrent network, the length of this symbol sequence C is the same as the length of my audio input.

So if my audio input, say, was two seconds long, that might have 100 audio frames. And that would mean that the length of C is also 100 different values. So if we were working on a phoneme-based model, then C would be some kind of phoneme representation. And we would also include a blank symbol, which is special for CTC.

But if, as we'll do in the rest of this talk, we're trying to just predict the graphemes, trying to predict the characters in this language directly from the audio, then I would just let C take on a value that's in my alphabet, or take on a blank or a space, if my language has spaces in it.

And then the second thing I'm going to do, once my RNN gives me a distribution over these symbols C, is that I'm going to try to define some kind of mapping that can convert this long transcription C into the final transcription Y. That's like, hello. That's the actual string that I want.

And now, recognizing that C is itself a probabilistic creature, there's a distribution over choices of C that correspond to the audio. Once I apply this function, that also means that there's a distribution over Y. There's a distribution over the possible transcriptions that I could get. And what I'll want to do to train my network is to maximize the probability of the correct transcription given the audio.

So those are the three steps that we have to accomplish in order to make CTC work. So let's start with the first one. So we have these output neurons C, and they represent a distribution over the different symbols that I could be hearing in the audio. So I've got some audio signal down here.

You can see the spectrogram frames poking up. And this is being processed by this recurrent neural network. And the output is a big bank of softmax neurons. So for the first frame of audio, I have a neuron that corresponds to each of the symbols that C could represent. And this set of softmax neurons here, with the output summing to 1, represents the probability of, say, C1 having the value A, B, C, and so on, or this special blank character.

So for example, if I pick one of the neurons over here, then the first row, which represents the character B, and the 17th column, which is the 17th frame in time, this represents the probability that C17 represents the character B, given the audio. So once I have this, that also means that I can just define a distribution not just over the individual characters, but if I just assume that all of the characters are independent, which is kind of a naive assumption, but if I bake this into the system, I can define a distribution over all possible sequences of characters in this alphabet.

So if I gave you a specific instance, a specific character string using this alphabet, for instance, I represent the string hello as H-H-H-E, blank E, blank blank L-L, blank L-O, and then a bunch of blanks. This is a string in this alphabet for C, and I can just use this formula to compute the probability of this specific sequence of characters.
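Under that independence assumption, the probability of one specific symbol sequence c of the same length as the audio is just the product of the per-frame softmax outputs. A small sketch, where `probs` is the T x V matrix of softmax outputs and `seq` is a list of T symbol indices (for example, the H-H-H-E-blank... encoding of "hello"):

```python
import numpy as np

def path_probability(probs, seq):
    """P(c | x) for one length-T symbol sequence, assuming per-frame independence.
    probs: array of shape (T, V), each row a softmax over the alphabet plus blank."""
    assert len(seq) == probs.shape[0]
    return float(np.prod([probs[t, s] for t, s in enumerate(seq)]))
```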

So that's how we compute the probability for a sequence of characters when they have the same length as the audio input. So the second step, and this is in some sense the kind of neat trick in CTC, is to define a mapping from this long encoding of the audio into symbols that crunches it down to the actual transcription that we're trying to predict.

And the rule is this operator takes this character sequence, and it picks up all the duplicates, all of the adjacent characters that are repeated, and discards the duplicates and just keeps one of them, and then it drops all of the blanks. So in this example, you see you have three H's together, so I just keep one H, and then I have a blank, I throw that away, and I keep an E, and I have two L's, so I keep one of the L's over here, and then another blank, and an L-O.

And the one key thing to note is that when I have two characters that are different right next to each other, I just end up keeping those two characters in my output. But if I ever have a double character, like L-L in "hello," then I'll need to have a blank character that gets put in between.

But if our neural network gave me this transcription, told me that this was the right answer, we just have to apply this operator, and we get back the string "hello." So now that we have a way to define a distribution over these sequences of symbols that are the same length as the audio, and we now have a mapping from those strings into transcriptions, as I said, this gives us a probability distribution over the possible final transcriptions.
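The squeeze operator itself is only a few lines. A sketch over plain strings, using "_" as a stand-in for the CTC blank:

```python
def collapse(c, blank="_"):
    """CTC squeeze operator: merge adjacent duplicates, then drop blanks."""
    out = []
    prev = None
    for ch in c:
        if ch != prev:               # keep only the first of any run of duplicates
            out.append(ch)
        prev = ch
    return "".join(ch for ch in out if ch != blank)

# collapse("HHH_E__LL_LO")  ->  "HELLO"
```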

So if I look at the probability distribution over all the different sequences of symbols, I might have "hello" written out like on the last slide, and maybe that has probability .1, and then I might have "hello" but written a different way, by say replacing this H with a blank that has a smaller probability, and I have a whole bunch of different possible symbol sequences below that.

And what you'll notice is that if I go through every possible combination of symbols here, there are several combinations that all map to the same transcription. So here's one version of "hello," there's a second version of "hello," there's a third version of "hello." And so if I now ask, "What's the probability of the transcription 'hello'?" The way that I compute that is I go through all of the possible character sequences that correspond to the transcription "hello," and I add up all of their probabilities.

So I have to sum over all possible choices of C that could give me that transcription in the end. So you can kind of think of this as searching through all the possible alignments, right? I could shift these characters around a little bit, I could move them forward, backward, I could expand them by adding duplicates or squish them up, depending on how fast someone is talking, and that corresponds to every possible alignment between the audio and the characters that I want to transcribe.

It sort of solves the problem of the variable length. And the way that I get the probability of a specific transcription is to sum up, to marginalize over all the different alignments that could be feasible. And then if we have a whole bunch of other possibilities in here, like the word "yellow," I'd compute them in the same way.
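For a tiny toy example you can compute that marginal by brute force: enumerate every length-T symbol sequence, collapse each one, and add up the probabilities of the ones that collapse to the transcription you want. This sketch reuses `path_probability` and `collapse` from above and is exponential in T, which is exactly why the efficient CTC computation described next matters.

```python
from itertools import product

def transcription_probability(probs, alphabet, y, blank="_"):
    """P(y | x) by brute force: sum over all alignments c with collapse(c) == y.
    Only feasible for very small T and alphabets."""
    T = probs.shape[0]
    total = 0.0
    for seq in product(range(len(alphabet)), repeat=T):
        c = "".join(alphabet[s] for s in seq)
        if collapse(c, blank) == y:
            total += path_probability(probs, list(seq))
    return total
```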

So this equation just says to sum over all the character sequences C so that when I apply this little mapping operator, I end up with the transcription y. Oh, oh. I'm missing a double E. You're talking about this one? So when we apply this sort of squeezing operator here, we drop this double E to get a single E in "hello." And we remove all the duplicates.

So the same way we did for an H. Right. So whenever you see two characters together like this, where they're adjacent duplicates, you sort of squeeze all those duplicates out, and you just keep one of them. But here we have a blank in between. So if we drop all the duplicates first, then we still have two L's left, and then we remove all the blanks.

So this gives the algorithm a way to represent repeated characters in the transcription. There's another one in the back. Oh, oh, I see. Yeah. This is maybe-- I put a space in here. Really I should have put a space character in here instead of a blank. Really this could be H-E-L-L-O-H.

Yeah. So the space here is erroneous. OK. We good? OK. So once I've defined this, I just gave you a formula to compute the probability of a string given the audio. So as with every good starting to a machine learning algorithm, we go and we try to apply maximum likelihood.

I now give you the correct transcription, and your job is to tune the neural network to maximize the probability of that transcription using this model that I just defined. So in equations, what I'm going to do is I want to maximize the log probability of y star for a given example.

I want to maximize the probability of the correct transcription given the audio x. And then I'm just going to sum over all the examples. And then what I want to do is just replace this with the equation that I had on the last page that says in order to compute the probability of a given transcription, I have to sum over all of the possible symbol sequences that could have given me that transcription, sum over all the possible alignments that would map that transcription to my audio.

So Alex Graves and co-authors in 2006 actually show that because of this independence assumption, there is a clever way, there is a dynamic programming algorithm that can efficiently compute this summation for you. And not only compute this summation so that you can compute the objective function, but actually compute its gradient with respect to the output neurons of your neural network.

So if you look at the paper, the algorithm details are in there. What's cool right now in the history of speech and deep learning is that this is at the level of a technology. This is something that's now implemented in a bunch of places so that you can download a software package that efficiently will calculate this CTC loss function for you that can calculate this likelihood and can also just give you back the gradient.

So I won't go into the equations here. Instead, I'll tell you that there are a whole bunch of implementations on the web that you can now use as part of deep learning packages. So one of them from Baidu implements CTC on the GPU. It's called WarpCTC. Stanford and the group there, actually one of Andrew's students, has a CTC implementation.

And there's also now CTC losses implemented in packages like TensorFlow. So this is something that's sufficiently widely distributed that you can use these algorithms off the shelf. So the way that these work, the way that we go about training, is we start from our audio spectrogram. We have our neural network structure where you get to choose how it's put together.

And then it outputs this bank of softmax neurons. And then there are pieces of off-the-shelf software that will compute for you the CTC cost function. They'll compute this log likelihood given a transcription and the output neurons from your recurrent network. And then the software will also be able to tell you the gradient with respect to the output neurons.

And once you've got that, you're set. You can feed them back into the rest of your code and get the gradient with respect to all of these parameters. So as I said, this is all available now in sort of efficient, off-the-shelf software. So you don't have to do this work yourself.
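As one concrete example (a sketch, not the speaker's demo code), PyTorch ships a built-in CTC loss that does exactly this: you hand it the per-frame log-probabilities and the target character indices, and it returns the negative log-likelihood, with gradients coming back through autograd. The toy shapes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0)                     # index 0 reserved for the CTC blank

T, N, V = 100, 8, 29                               # frames, batch size, alphabet size (assumed)
log_probs = torch.randn(T, N, V, requires_grad=True).log_softmax(dim=2)  # stand-in for RNN outputs
targets = torch.randint(1, V, (N, 20), dtype=torch.long)                 # character indices 1..V-1
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                    # gradients w.r.t. the network outputs
```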

So that's pretty much all there is to the high-level algorithm. With this, it's actually enough to get a sort of working drosophila of speech recognition going. There are a few little tricks, though, that you might need along the way. On easy problems, you might not need these. But as you get to more difficult data sets with a lot of noise, they can become more and more important.

So the first one, which we've been calling "SortaGrad" in the vein of all of the grad algorithms out there, is basically a trick to help with recurrent neural networks. So it turns out that when you try to train one of these big RNN models on some off-the-shelf speech data, one of the things that can really get you is seeing very long utterances early in the process.

Because if you have a really long utterance, then if your neural network is badly initialized, you'll often end up with things like underflow and overflow as you try to go and compute the probabilities. And you end up with gradients exploding as you try to do back propagation. And it can make your optimization a real mess.

And it's coming from the fact that these utterances are really long and really hard, and the neural network just isn't ready to deal with those transcriptions. And so one of the fixes that you can use is, during the early parts of training, usually in the first epoch, is you just sort all of your audio by length.

And now, when you process a mini-batch, you just take the short utterances first so that you're working with really short RNNs that are quite easy to train and don't blow up and don't have a lot of catastrophic numerical problems. And then as time goes by, you start operating on longer and longer utterances that get more and more difficult.

So we call this "SortaGrad." It's basically a curriculum learning method. And so you can see some work from Yoshua Bengio and his team on a whole bunch of strategies for this. But you can think of the short utterances as being the easy ones. And if you start out with the easy utterances and move to the longer ones, your optimization algorithm can do better.
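A minimal sketch of that curriculum, assuming `dataset` is a list of (spectrogram, transcript) pairs (a hypothetical layout): sort by length for the first epoch only, then shuffle as usual.

```python
import random

def epochs(dataset, num_epochs, batch_size):
    """SortaGrad-style curriculum: shortest utterances first in epoch 0, shuffled afterwards."""
    for epoch in range(num_epochs):
        if epoch == 0:
            ordered = sorted(dataset, key=lambda ex: ex[0].shape[1])  # number of spectrogram frames
        else:
            ordered = list(dataset)
            random.shuffle(ordered)
        for i in range(0, len(ordered), batch_size):
            yield ordered[i:i + batch_size]   # one mini-batch
```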

So here's an example from one of the models that we've trained, where your CTC cost starts up here. And after a while, you optimize, and you sort of bottom out around, I don't know, what? A log likelihood of maybe 30. And then if you add this SortaGrad strategy, after the first epoch, you're actually doing better.

And you can reach a better optimum than you could without it. And in addition, another strategy that's extremely helpful for recurrent networks and very deep neural networks is batch normalization. So this is becoming very popular. And it's also available as sort of an off-the-shelf package inside of a lot of the different frameworks that are available today.

So if you start having trouble, you can consider putting batch normalization into your network. So our neural network now spits out this big bank of softmax neurons. We've got a training algorithm. We're just doing gradient descent. How do we actually get a transcription? This process, as I said, is meant to be as close to characters as possible.

But we still sort of need to decode these outputs. And you might think that one simple solution, which turns out to be approximate, to get the correct transcription is just go through here and pick the most likely sequence of symbols for C, and then apply our little squeeze operator to get back the transcription the way that we defined it.

So this turns out not to be the optimal thing. This actually doesn't give you the most likely transcription, because it's not accounting for the fact that every transcription might have multiple sequences of Cs, multiple alignments in this representation. But you can actually do this, and this is called the max decoding.

And so for this sort of contrived example here, I put little red dots on the most likely C. And if you see, there's a couple of blanks, a couple of Cs, there's another blank, A, more blanks, Bs, more blanks. And if you apply our little squeeze operator, you just get the word cab.

If you do this, it is often terrible. It will often give you a very strange transcription that doesn't look like English necessarily. But the reason I mention it is that this is a really handy diagnostic. If you're kind of wondering what's going on in the network, glancing at a few of these will often tell you if the network's starting to pick up any signal or if it's just outputting gobbledygook.
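Max decoding is just an argmax per frame followed by the squeeze operator from earlier, which is why it is such a cheap diagnostic. A sketch, reusing `collapse` from above:

```python
import numpy as np

def max_decode(probs, alphabet, blank="_"):
    """Greedy (max) decoding: most likely symbol per frame, then collapse.
    probs: (T, V) softmax outputs; alphabet: list of characters including the blank."""
    best = np.argmax(probs, axis=1)                  # most likely symbol at each frame
    c = "".join(alphabet[s] for s in best)
    return collapse(c, blank)
```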

So I'll give you a more detailed example in a second of how that happens. All right. So these are all the concepts of our very simple pipeline. And the demo code that we're going to put up on the web will basically let you work on all of these pieces.

So once we try to train these, I want to give you an example of the sort of data that we're training on. >> A tanker is a ship designed to carry large volumes of oil or other liquid cargo. >> So this is just a person sitting there reading the Wall Street Journal to us.

So this is a sort of simple data set. It's really popular in the speech research community. It's published by the Linguistic Data Consortium. There's also a free alternative called LibriSpeech that's very similar. But instead of people reading the Wall Street Journal, it's people reading Creative Commons audiobooks. So in the demo code that we have, a really simple network that works reasonably well looks like this.

So there's a sort of family of models that we've been working with, where you start from your spectrogram. You have maybe one layer or several of convolutional filters at the bottom. And then on top of that, you have some kind of recurrent neural network. It might just be a vanilla RNN, but you can also use LSTM or GRU cells, any of your favorite RNN creatures from the literature.

And then on top of that, we have some fully connected layers that produce these softmax outputs. And those are the things that go into CTC for training. So this is pretty straightforward. The implementation on the web uses the WarpCTC code. And then we would just train this big neural network with stochastic gradient descent, Nesterov's momentum, all the stuff that you've probably seen in a whole bunch of other talks so far.
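Here is a sketch of that family of models in PyTorch; the sizes, the choice of GRU, and the layer counts are assumptions for illustration, not the exact demo code. Note that nn.CTCLoss expects its input as (time, batch, characters), so you would transpose the output before computing the loss.

```python
import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    """Convolution over the spectrogram -> recurrent layers -> per-frame character log-probs."""
    def __init__(self, num_freq=161, hidden=512, alphabet_size=29):
        super().__init__()
        self.conv = nn.Conv1d(num_freq, hidden, kernel_size=11, stride=2, padding=5)
        self.rnn = nn.GRU(hidden, hidden, num_layers=3, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, alphabet_size)

    def forward(self, spec):                  # spec: (batch, num_freq, time)
        h = torch.relu(self.conv(spec))       # (batch, hidden, time / 2)
        h = h.transpose(1, 2)                 # (batch, time / 2, hidden) for the RNN
        h, _ = self.rnn(h)                    # (batch, time / 2, 2 * hidden)
        return self.fc(h).log_softmax(dim=2)  # per-frame log-probabilities for CTC
```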

All right. So if you actually run this, what is going on inside? So I mentioned that looking at the max decoding is kind of a handy way to see what's going on inside this creature. So I wanted to show you an example. So this is a picture. This is a visualization of those softmax neurons at the top of one of these big neural networks.

So this is the representation of C from all the previous slides. So on the horizontal axis, this is basically time. This is the frame number or which chunk of the spectrogram we're seeing. And then on the vertical axis here, you see these are all the characters in the English alphabet or a space or a blank.

So after 300 iterations of training, which is not very much, the system has learned something amazing, which is that it should just output blanks and spaces all the time. Because these are by far, because of all the silence and things in your data set, these are the most common characters.

I just want to fill up the whole space with blanks. But you can see it's kind of randomly poking out a few characters here. And if you run your little max decoding strategy to see what does the system think the transcription is, it thinks the transcription is "eh." But after 300 iterations, that's okay.

But this is a sign that the neural network's not going crazy. Your gradient isn't busted. It's at least learned what is the most likely characters. Then after maybe 1,500 or so, you start to get a little bit of structure. And if you try to like mouth these words, you might be able to sort of see that there's some English-like sounds in here, like "Beyar justinfrutin." Something kind of odd.

But it's actually looking much better than just "h." It's actually starting to output something. Go a little bit farther. It's a little bit more organized. You can start to see that we have sort of fragments of possibly words starting to form. And then after you're getting close to convergence, it's still not a real sentence.

But does this make sense to people? He guessed what the correct transcription might be. You might have a couple of candidates. The correct one is actually "there justinfrunt." And so you can see that sort of it's sort of sounding it out with English characters. I have a young son, and I kind of figure I'm eventually going to see him producing max-decoded outputs of English.

And you're just going to sound these things out and be like, "Is it there justinfrunt? There?" But this is why this max-decoding strategy is really handy. Because you can kind of look at this output and say, yeah, it's starting to get some actual signal out of the data. It's not just gobbledygook.

So because this is like my favorite speech recognition party game, I wanted to show you a few more of these. So here's the max-decoded output. "The poor little things," cried Cynthia, "think of them having been turned to the wall all these years." And so you can hear like the sound of the breath at the end.

Turns into a little bit of a word. "Cynthia" is sort of in this transcription. And you'll find that things like proper names and so on tend to get sounded out. But if those names are not in your audio data, there's no way the network could have learned how to say the name Cynthia.

And we'll come back to how to solve that later. But you see the true label is "The poor little things," cried Cynthia. And that the last word is actually "all these years." And there isn't a word hanging off at the end. So here's another one. >> That is true, bad dealt gray.

>> How many people figured out what this is? This is the max-decoded transcription. It sounds good to you. It sounds good to me. If you told me that this was the ground truth, I'd go, "That's weird. I have to go look up what this is." Here's the actual true label.

Turns out this is a French word that means something like "rubbernecking." I had no idea what this word was. So this is, again, the cool examples of what these neural networks are able to figure out with no knowledge of the language itself. Okay. So let's go back to decoding.

We just talked about max-decoding, which is sort of an approximate way of going from these probability vectors to a transcription Y. And if you want to find the actual most likely transcription Y, there's actually no algorithm in general that can give you the perfect solution efficiently. So the reason for that, remember, is that for a single transcription Y, I have an efficient algorithm to compute its probability.

But if I want to search over every possible transcription, I don't know how to do that because there are exponentially many possible transcriptions, and I'd have to run this algorithm to compute the probability of all of them. So we have to resort to some kind of generic search strategy.

And so one proposed in the original paper briefly is a sort of prefix-decoding strategy. So I don't want to spend a ton of time on this. Instead, I want to step to sort of the next piece of the picture. So there were a bunch of examples in there, right, like proper names, like Cynthia and things like Baddourie, where unless you had heard this word before, you have no hope of getting it right with your neural network.

And so there are lots of examples like this in the literature of things that are sort of spelled out phonetically but aren't legitimate English transcriptions. And so what we'd like to do is come up with a way to fold in just a little bit of that knowledge about the language, to take a small step backward from a perfect end-to-end system and make these transcriptions better.

So as I said, the real problem here is that you don't have enough audio available to learn all these things. If you had millions and millions of hours of audio sitting around, you could probably learn all these transcriptions because you just hear enough words that you know how to spell them all, maybe the way a human does.

But unfortunately, we just don't have enough audio for that. So we have to find a way to get around that data problem. There's also an example of something that in the AI lab we've dubbed the Tchaikovsky problem, which is that there are certain names in the world, right, like proper names, that if you've never heard of it before, you have no idea how it's spelled.

And the only way to know it is to have seen this word in text before and to see it in context. So part of the purpose of these language models is to get examples like this correct. So there are a couple of solutions. One would be to just step back to a more traditional pipeline, right, use phonemes, because then we can bake new words in along with their phonetic pronunciation and the system will just get it right.

But in this case, I want to focus on just fusing in a traditional language model that gives us the probability a priori of any sequence of words. So the reason that this is helpful is that using a language model, we can train these things from massive text corpora. We have way, way more text in the world than we have transcribed audio.

And so that makes it possible to train these giant language models with huge vocabulary, and they can also pick up the sort of contextual things that will tip you off to the fact that "Tchaikovsky concerto" is a reasonable thing for a person to ask, and that a phonetically spelled-out version of it, even though composed of legitimate English words, is nonsense.

So there's actually not much to see on the language modeling front for this, except that the reasons for sticking with traditional N-gram models are kind of interesting if you're excited about speech applications. So if you go use a package like KenLM on the web to go build yourself a giant N-gram language model, these are really simple and well supported.

And so that makes them easy to get working. And they'll let you train from lots of corpora, but for speech recognition in practice, one of the nice things about N-gram models as opposed to trying to, say, use like an RNN model is that we can update these things very quickly.

If you have a big distributed cluster, you can update that N-gram model very rapidly in parallel from new data to keep track of whatever the trending words are today that your speech engine might need to deal with. And we also have the need to query this thing very rapidly inside our decoding loop that you'll see in just a second.

And so being able to just look up the probabilities in a table the way an N-gram model is structured is very valuable. So I hope someday all of this will go away and be replaced with an amazing neural network. But this is a really best practice today. So in order to fuse this into the system, since to get the most likely transcription, right, probability of Y given X, to maximize that thing, we need to use a generic search algorithm anyway.

This opens up a door. Once we're using a generic search scheme to do our decoding and find the most likely transcription, we can add some extra cost terms. So in a previous piece of work from Awni Hannun and several co-authors, what you do is you take the probability of a given word sequence from your audio.

So this is what you would get from your giant RNN. And you can just multiply it by some extra terms, the probability of the word sequence according to your language model raised to some power, and then multiply by the length raised to another power. And you see that if you just take the log of this objective function, right, then you get the log probability that was your original objective.

You get alpha times the log probability of the language model, and beta times the log of the length. And these alpha and beta parameters let you sort of trade off the importance of getting a transcription that makes sense to your language model versus getting a transcription that makes sense to your acoustic model and actually sounds like the thing that you heard.

And the reason for this extra term over here is that as you're multiplying in all of these terms, you tend to penalize long transcriptions a bit too much. And so having a little bonus or penalty at the end to tweak to get the transcription length right is very helpful.
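In code, the score used during the search is just a weighted sum of log terms; alpha and beta are tuned on held-out data. A sketch, with `lm_log_prob` standing in for whatever language model score you look up:

```python
import math

def combined_score(acoustic_log_prob, words, lm_log_prob, alpha=1.0, beta=1.0):
    """log P(words | audio) + alpha * log P_LM(words) + beta * log(word count)."""
    return acoustic_log_prob + alpha * lm_log_prob + beta * math.log(max(len(words), 1))
```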

So the basic idea behind this is just to use BeamSearch. BeamSearch, really popular search algorithm, a whole bunch of instances of it. And the rough strategy is this. So starting from time 0, starting from t equals 1 at the very beginning of your audio input, I start out with an empty list that I'm going to populate with prefixes.

And these prefixes are just partial transcriptions that represent what I think I've heard so far in the audio up to the current time. And the way that this proceeds is I'm going to take at the current time step each candidate prefix out of this list. And then I'm going to try all of the possible characters in my softmax neurons that could possibly follow it.

So for example, I can try adding a blank. I can say if the next element of C is actually supposed to be a blank, then what that would mean is that I don't change my prefix, right, because the blanks are just going to get dropped later. But I need to incorporate the probability of that blank character into the probability of this prefix, right?

It represents one of the ways that I could reach that prefix. And so I need to sum that probability into that candidate. And likewise, whenever I add a space to the end of a prefix, that signals that this prefix represents the end of a word. And so in addition to adding the probability of the space into my current estimate, this gives me the chance to go look up that word in my language model and fold that into my current score.

And then if I try adding a new character onto this prefix, it's just straightforward. I just go and update the probabilities based on the probability of that character. And then at the end of this, I'm going to have a huge list of possible prefixes that could be generated. And this is where you would normally get the exponential blow up of trying all possible prefixes to find the best one.

And what BeamSearch does is it just says, take the k most probable prefixes after I remove all the duplicates in here, and then go and do this again. And so if you have a really large k, then your algorithm will be a bit more accurate in finding the best possible solution to this maximization problem, but it'll be slower.
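Here is a compact sketch of that prefix beam search, without the language-model term, using the standard bookkeeping of two probabilities per prefix (ending in a blank versus ending in a non-blank); folding in the LM means multiplying in the alpha-weighted word probability whenever a space is appended, as described above.

```python
from collections import defaultdict

def prefix_beam_search(probs, alphabet, blank=0, beam_size=10):
    """CTC prefix beam search (no language model).
    probs: (T, V) per-frame symbol probabilities; alphabet[blank] is the CTC blank.
    Each prefix keeps two scores: probability of ending in a blank (p_b) or a non-blank (p_nb)."""
    beam = {(): (1.0, 0.0)}                               # start with the empty prefix
    for t in range(len(probs)):
        next_beam = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beam.items():
            for s in range(len(alphabet)):
                p = probs[t][s]
                if s == blank:                            # blank: prefix is unchanged
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (nb_b + (p_b + p_nb) * p, nb_nb)
                elif prefix and prefix[-1] == s:
                    # Repeated symbol: it only extends the prefix via a preceding blank;
                    # otherwise it merges back into the same prefix.
                    nb_b, nb_nb = next_beam[prefix + (s,)]
                    next_beam[prefix + (s,)] = (nb_b, nb_nb + p_b * p)
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (nb_b, nb_nb + p_nb * p)
                else:                                     # a new character is appended
                    nb_b, nb_nb = next_beam[prefix + (s,)]
                    next_beam[prefix + (s,)] = (nb_b, nb_nb + (p_b + p_nb) * p)
        # Keep only the beam_size most probable prefixes.
        beam = dict(sorted(next_beam.items(),
                           key=lambda kv: kv[1][0] + kv[1][1],
                           reverse=True)[:beam_size])
    best_prefix, (p_b, p_nb) = max(beam.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return "".join(alphabet[s] for s in best_prefix), p_b + p_nb
```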

So here's what ends up happening. If you run this decoding algorithm, if you just run it on the RNN outputs, you'll see that you get actually better than straight max decoding. You find slightly better solutions. But you still make things like spelling errors, like Boston with an I. But once you add in a language model, it can actually tell you that Boston spelled with an O is much more probable than Boston spelled with an I.

So one place that you can also drop in deep learning that I wanted to mention very rapidly is just if you're not happy with your N-gram model, because it doesn't have enough context, or you've seen a really amazing neural language modeling paper that you'd like to fold in, one really easy way to do this and link it to your current pipeline is to do rescoring.

So when this decoding strategy finishes, it can give you the most probable transcription, but it also gives you this big list of the top k transcriptions in terms of probability. And what you can do is take your recurrent network and just rescore all of these, basically reorder them according to this new model.
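Rescoring is then just a reordering of the n-best list. In the sketch below, `neural_lm_log_prob` is a hypothetical stand-in for whatever neural language model you plug in, and `lam` trades its score off against the decoder's own log-probability:

```python
def rescore(nbest, neural_lm_log_prob, lam=0.5):
    """Reorder an n-best list of (transcript, decoder_log_prob) pairs with a neural LM."""
    rescored = [(text, score + lam * neural_lm_log_prob(text)) for text, score in nbest]
    return max(rescored, key=lambda pair: pair[1])
```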

So in the instance of a neural language model, let's say that this is my N best list. I have five candidates that were output by my decoding strategy. And the first one is I'm a connoisseur looking for wine and pork chops. Sounds good to me. I'm a connoisseur looking for wine and pork chops.

So this is actually quite subtle. And depending on what kind of connoisseur you are, it's sort of up to interpretation what you're looking for. But perhaps a neural language model is going to be a little bit better at figuring out that wine and pork are closely related. And if you're a connoisseur, you might be looking for wine and pork chops.

And so what you would hope to happen is that a neural language model trained on a bunch of text is going to correctly reorder these things and figure out that the second beam candidate is actually the correct one, even though your N-gram model didn't help you. So that is really the scale model.

That is the set of concepts that you need to get a working speech recognition engine based on deep learning. And so the thing that's left to go to state of the art performance and start serving users is scale. So I'm going to kind of run through quickly a bunch of the different tactics that you can use to try to get there.

So the two pieces of scale that I want to cover, of course, are data and computing power. Where do you get them? So the first thing to know, this is just a number you can keep in the back of your head for all purposes, which is that transcribing speech data is not cheap, but it's also not prohibitive.

It's about 50 cents to a dollar a minute, depending on the quality you want and who's transcribing it and the difficulty of the data. So typical speech benchmarks you'll see out there are maybe hundreds to thousands of hours. So like the LibriSpeech data set is maybe hundreds of hours.

There's another data set called VoxForge, and you can kind of cobble these together and get maybe hundreds to thousands of hours. But the real challenge is that the application matters a lot. So all the utterances I was playing for you are examples of read speech. People are sitting in a nice quiet room, they're reading something wonderful to me, and so I'm going to end up with a speech engine that's really awesome at listening to the Wall Street Journal, but maybe not so good at listening to someone in a crowded cafe.

So the application that you want to target really needs to match your data set. And so it's worth, at the outset, if you're thinking about going and buying a bunch of speech data, to think of what is the style of speech you're actually targeting. Are you worried about read speech, like the ones we're hearing, or do you care about conversational speech?

It turns out that when people talk in a conversation, when they're spontaneous, they're just coming up with what to say on the fly versus if they have something that they're just dictating and they already know what to say, they behave differently. And they can exhibit all of these effects like disfluency and stuttering.

And then in addition to that, we have all kinds of environmental factors that might matter for an application, like reverb and echo. We start to care about the quality of microphones and whether they have noise canceling. There's something called Lombard effect that I'll mention again in a second, and of course things like speaker accents, where you really have to think carefully about how you collect your data to make sure that you actually represent the kinds of cases you want to test on.

So the reason that read speech is really popular is because we can get a lot of it. And even if it doesn't perfectly match your application, it's cheap and getting a lot of it can still help you. So I wanted to say a few things about read speech, because for less than $10 an hour, often a lot less, you can get a whole bunch of data.

And it has the disadvantage that you lose a lot of things like inflection and conversationality, but it can still be helpful. So one of the things that we've tried doing, and I'm always interested to hear more clever schemes for this, is you can kind of engineer the way that people read to try to get the effects that you want.

So here's one, which is that if you want a little bit more conversationality, you want to get people out of that kind of humdrum dictation, you can start giving them reading material that's a little more exciting. You can give them movie scripts and books, and people will actually start voice acting for you.

>> Creep in, said the witch, and see if it is properly heated so that we can put the bread in. >> So these are really wonderful workers; right? They're kind of really getting into it to give you better data. >> The wolf is dead. The wolf is dead and danced for joy around about the well with their mother.

>> So you have people reading poetry. They get this sort of lyrical quality into it that you don't get from just reading the Wall Street Journal. And finally, there's something called the Lombard effect that happens when people are in noisy environments. So if you're in a noisy party and you're trying to talk to your friend who's a couple of chairs away, you'll catch yourself involuntarily going, "Hey, over there, what are you doing?" You raise your inflection, and you kind of -- you try to use different tactics to get your signal-to-noise ratio up.

You'll sort of work around the channel problem. And so this is very problematic when you're trying to do transcription in a noisy environment because people will talk to their phones using all these effects, even though the noise canceling and everything could actually help them. So one strategy we've tried with varying levels of success -- >> Then they fell asleep and evening passed, but no one came to the poor children.

>> -- is to actually play loud noise in people's headphones to try to get them to elicit this behavior. So this person is kind of raising their voice a little bit in a way that they wouldn't if they were just reading. And similarly, as I mentioned, there are a whole bunch of different augmentation strategies.

So there are all these effects of environment, like reverberation, echo, background noise, that we would like our speech engine to be robust to. And one way you could go about trying to solve this is to go collect a bunch of audio from those cases and then transcribe it, but getting that raw audio is really expensive.

So instead, an alternative is to take the really cheap read speech that's very clean and use some off-the-shelf open-source audio toolkit to synthesize all the things you want to be robust to. So for example, if we want to simulate noise in a cafe, here's just me talking to my laptop in a quiet room.

Hello, how are you? So I'm just asking, how are you? And then here's the sound of a cafe. So I can obviously collect these independently, very cheaply. Then I can synthesize this by just adding these signals together. Hello, how are you? Which actually sounds, I don't know, sounds to me like my talking to my laptop at a Starbucks or something.

And so for our work on deep speech, we actually take something like 10,000 hours of raw audio that sounds kind of like this, and then we pile on lots and lots of audio tracks from Creative Commons videos. It turns out there's a strange thing. People upload, like, noise tracks to the web that last for hours.

It's, like, really soothing to listen to the highway or something. And so you can download all this free found data, and you can just overlay it on this voice, and you can synthesize perhaps hundreds of thousands of hours of unique audio. And so the idea here is that it's just much easier to engineer your data pipeline to be robust than it is to engineer the speech engine itself to be robust.
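As a rough sketch of that overlaying step, assuming the clean utterance and the found noise track have already been loaded as float arrays at the same sample rate (the SNR handling here is an illustrative choice, not the exact pipeline):

```python
import numpy as np

def add_noise(speech, noise, snr_db=10.0):
    """Overlay a noise track on a clean utterance at a target signal-to-noise ratio."""
    # Tile or crop the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Randomizing the noise track, its offset, and the SNR per utterance is what lets a few thousand hours of clean speech turn into a much larger synthetic training set.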

So whenever you encounter an environment that you've never seen before and your speech engine is breaking down, you should shift your instinct away from trying to engineer the engine to fix it and toward this idea of how do I reproduce it really cheaply in my data. So here's that Wall Street Journal example again.

>> A tanker is a ship designed to carry large volumes of oil or other liquid cargo. >> And so if I wanted to, for instance, deal with a person reading Wall Street Journal on a tanker, maybe something like this. >> A tanker is a ship designed to carry large volumes of oil or other liquid cargo.

>> There's lots of reverb in this room, so you can't hear the reverb on the audio. But basically, you can synthesize these things with one line of SoX on the command line. So from some of our own work with building a large-scale speech engine with these technologies, this helps a ton.
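The talk doesn't show the exact command, but a plausible one-liner using SoX's built-in reverb effect (invoked here from Python; the file names and reverberance value are made up for illustration) would be:

```python
import subprocess

# Add a simple room reverb to a clean recording; "50" is the reverberance
# percentage, an illustrative value rather than anything from the talk.
subprocess.run(["sox", "clean.wav", "reverberant.wav", "reverb", "50"], check=True)
```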

And you can actually see that when we run on clean and noisy test utterances, as we add more and more data all the way up to about 10,000 hours and using a lot of these synthesis strategies, we can just steadily improve the performance of the engine. And in fact, on things like clean speech, you can get down well below 10% word error rate, which is a pretty strong engine.

Okay. Let's talk about computation. Because the caveat on that last slide is, yes, more data will help if you have a big enough model. And big models usually mean lots of computation. So what I haven't talked about is how big are these neural networks and how big is one experiment.

So if you actually want to train one of these things at scale, what are you in for? So here's the back of the envelope. It's going to take at least the number of connections in your neural network (take one slice of that RNN, the number of unique connections), multiplied by the number of frames once you unroll the recurrent network, once you unfold it, multiplied by the number of utterances you've got to process in your dataset, times the number of training epochs (the number of times you loop through the dataset), times 3, because you have to do forward prop, backward prop, and then a gradient update.

It's about a factor of 3 increase. And then 2 flops for every connection, because there's a multiply and an add. So if you multiply this out for some parameters from the deep speech engine at Baidu, you get something like 1.2 times 10 to the 19 flops. It's about 10 exaflops.

And if you run this on a Titan X card, this will take about a month. Now if you already know what the model is, that might be tolerable. If you're on your epic run to get your best performance so far, then this is OK. But if you don't know what model's going to work, you're targeting some new scenario, then you want it done now so that you can try lots and lots of models quickly.
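Written out, that back-of-the-envelope calculation looks like the sketch below; the specific numbers are illustrative stand-ins rather than Baidu's exact parameters, but they land in the same ballpark.

```python
connections    = 30e6     # unique connections in one time-slice of the RNN (assumed)
frames_per_utt = 1000     # frames after unrolling, roughly 10 s of audio (assumed)
utterances     = 3.6e6    # about 10,000 hours of data at ~10 s per utterance
epochs         = 20       # passes over the dataset (assumed)
passes         = 3        # forward prop, backward prop, gradient update
flops_per_conn = 2        # one multiply and one add per connection

total_flops = (connections * frames_per_utt * utterances
               * epochs * passes * flops_per_conn)
print(f"{total_flops:.1e} FLOPs")                # ~1.3e19, i.e. roughly 10 exaFLOPs

titan_x_sustained = 6e12                         # rough single-precision rate
days = total_flops / titan_x_sustained / 86400
print(f"{days:.0f} days on one card")            # on the order of a month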

So the easy fix is just to try using a bunch more GPUs with data parallelism. And the good news is that, so far, it looks like speech recognition allows us to use large mini-batch sizes. We can process enough utterances in parallel that this is actually efficient. So you'd like to keep maybe a bit more than 64 utterances on each GPU, and a total mini-batch size of up to 1,000 or maybe 2,000 is still useful.

And so if you're putting together your infrastructure, you can go out and you can buy a server that'll fit eight of these Titan GPUs in them, and that'll actually get you to less than a week training time, which is pretty respectable. So there are a whole bunch of ways to use GPUs.

At Baidu, we've been using synchronous SGD. It turns out that you've got to optimize things like your all-reduce code. Once you leave one node, you have to start worrying about your network. And if you want to keep scaling, then thinking about things like network traffic and the right strategy for moving all of your data becomes important.

But we've had success scaling really well all the way out to things like 64 GPUs and just getting linear speedups all of the way. So if you've got a big cluster available, these things scale really well. And there are a bunch of other solutions. For instance, asynchronous SGD is now kind of a mainstay of distributed deep learning.
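The in-house all-reduce code isn't public, but the shape of synchronous data-parallel SGD is simple. Here is a minimal sketch using PyTorch's torch.distributed as a stand-in, assuming the process group has already been initialized on every worker:

```python
import torch.distributed as dist

def synchronous_sgd_step(model, loss, optimizer):
    """One synchronous data-parallel step: each worker computes gradients on its
    own shard of the mini-batch, the gradients are averaged with an all-reduce,
    and then every worker applies the same update."""
    optimizer.zero_grad()
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size
    optimizer.step()
```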

There's also been some work recently of trying to go back to synchronous SGD that has a lot of nice properties, but using things like backup workers. So that's sort of the easy thing. Just throw more GPUs at it and go faster. One word of warning as you're trying to build these systems is to watch for code that isn't as optimized as you expected it to be.

And so this back-of-the-envelope calculation that we did of figuring out how many flops are involved in our network and then calculating how long it would take to run if our GPU were running at full efficiency, you should actually do this for your network. We call this the speed of light.

This is the fastest your code could ever run on one GPU. And if you find that you're just drastically underperforming that number, what could be happening to you is that you've hit a little edge case in one of the libraries that you're using and you're actually suffering a huge setback that you don't need to be feeling right now.
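A minimal version of that "speed of light" check, with placeholder numbers:

```python
total_flops    = 1.3e19        # from the earlier back-of-the-envelope estimate
peak_gpu_flops = 6e12          # rough peak for the card you're running on
speed_of_light = total_flops / peak_gpu_flops        # seconds at perfect efficiency

measured = 50 * 86400          # what the run is actually on track to take (seconds)
print(f"running at {speed_of_light / measured:.0%} of the speed of light")
# If this ratio is far below what you expect, suspect a library edge case or an
# I/O / communication bottleneck rather than the model itself.
```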

So one of the things we found back in November is that in libraries like cuBLAS, you can actually use mini-batch sizes that hit these weird catastrophic cases in the library, where you could be suffering like a factor of two or three performance reduction. So that might take your wonderful one-week training time and blow it up to, say, a three-week training time.

So that's why I wanted to go through this and ask you to keep in mind, while you're training these things, to try to figure out how long it ought to be taking. And if it's going a lot slower, be suspicious that there's some code you could be optimizing. Another good trick that's particular to speech (you can also use this for other recurrent networks) is to try to keep similar-length utterances together.

So if you look at your dataset, like a lot of things, you have this sort of distribution over possible utterance lengths. And so you see there's a whole bunch that are, you know, maybe within about 50% of each other, but there's also a large number of utterances that are very short.

And so what happens is when we want to process a whole bunch of these utterances in parallel, if we just randomly select, say, 1,000 utterances to go into a mini-batch, there's a high probability that we're going to get a whole bunch of these little short utterances along with some really long utterances.

And in order to make all the CTC libraries work and all of our recurrent network computations easy, what we have to do is pad these audio signals with zero. And that winds up meaning that we're wasting huge amounts of computation, maybe a factor of two or more. And so one way to get around it is just sort all of your utterances by length and then try to keep the mini-batches to be similar lengths so that you just don't end up with quite as much waste in each mini-batch.
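A minimal sketch of that sorting-and-batching trick, assuming each utterance is an (audio, transcript) pair:

```python
import random

def length_bucketed_batches(utterances, batch_size):
    """Group utterances of similar length so each mini-batch is padded only to
    the length of its own longest member, instead of the longest in the dataset."""
    by_length = sorted(utterances, key=lambda u: len(u[0]))
    batches = [by_length[i:i + batch_size]
               for i in range(0, len(by_length), batch_size)]
    # Shuffle the order of the batches (not their contents) so training
    # still sees a mix of lengths over an epoch.
    random.shuffle(batches)
    return batches
```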

And this kind of modifies your algorithm a little bit, but in the end is worthwhile. All right. So that's kind of all I want to say about computation. If you've got a few GPUs, keep an eye on your running time so that you know what to optimize and pay attention to the easy wins, like keeping your utterances together.

You can actually scale really well. And I think for a lot of the jobs we see, you can have your GPU running at something like 50% of the peak. And that's all in. With network time, with all the bandwidth-bound stuff, you can actually run at two to three teraflops on a GPU that can only do five teraflops in the perfect case.

So what can you actually do with this? One of my favorite results from one of our largest models is actually in Mandarin. So we have a whole bunch of labeled Mandarin data at Baidu. And so one of the things that we did was we scaled up this model, trained it on a huge amount of Mandarin data, and then, as we always do, we sit down and we do error analysis.

And what we would do is have a whole bunch of humans sitting around, debating the transcriptions to figure out ground truth labels that tend to be very high quality. And then we'd go and run a sort of holdout test on some new people and on the speech engine itself.

And so if you benchmark a single human being against this deep speech engine in Mandarin that's powered by all the technologies we were just talking about, it turns out that the speech engine can get an error rate that's down below 6% character error rate. So only about 6% of the characters are wrong.

And a single human sitting there listening to these transcriptions actually does quite a bit worse, getting almost 10%. If you give people a bit of an advantage, which is you now assemble a committee of people and you give them a fresh test set so that no one has seen it before, and we run this test again, it turns out that the two cases are actually really similar.

And you can end up with a committee of native Mandarin speakers sitting around debating, "No, no, I think this person said this," or "No, they have an accent. It's from the north. I think they're actually saying that." And then when you show them the deep speech transcription, they actually go, "Oh, that's what it was." And so you can actually get this technology up to a point where it's highly competitive with human beings, even human beings working together.

And this is sort of where I think all of the speech recognition systems are heading, thanks to deep learning and the technologies that we're talking about here. Any questions so far? Yeah, go ahead. So how do you know the actual label of the data? Yep. Sorry? Repeat the question.

Yeah. So the question is, if humans have such a hard time coming up with the correct transcription, how do you know what the truth is? And the real answer is you don't really. Sometimes you might have a little bit of user feedback, but in this instance, we have very high-quality transcriptions that are coming from many labelers teamed up with a speech engine.

And so that could be wrong. We do occasionally find errors where we just think that's a label error. But when you have a committee of humans around, the really astonishing thing is that you can look at the output of the speech engines, and the humans will suddenly jump ship and say, oh, no, no, no, no.

The speech engine is actually correct, because it'll often come up with an obscure word or place that they weren't aware of. Once they see the label, can they be biased towards that label? Yeah. So this is an inherently ambiguous result. But let's say that a committee of human beings tend to disagree with another committee of human beings about the same amount as a speech engine does.

Yeah. So this is basically doing a sequence-to-sequence sort of task, right? So we're going to hear about a really different approach to that later. Can you say anything about the -- Yeah. So this is using the CTC cost, right? That's really the core component of this system. It's how you deal with mapping one variable-length sequence to another.

The CTC cost is not perfect: it has this assumption of independence baked into the probabilistic model. And because of that assumption, we're introducing some bias into the system. And for languages like English, where the characters are obviously not independent of each other, this might be a limitation. In practice, the thing that we see is that as you add a lot of data and your model gets much more powerful, you can still find your way around it, but it might take more data and a bigger model than necessary.

And of course, we hope that all the new state-of-the-art methods coming out of the deep learning community are going to give us an even better solution. Okay. Go ahead. In this spectrogram, you're saying that there's a 20-millisecond sample that you take. Is there a reason for that, or could you have a bigger or smaller -- Empirically determined.

Yeah. So the question is, for a spectrogram with -- We talked about these little spectrogram frames being computed from 20 milliseconds of audio. And is that number special? Is there a reason for it? So this is really determined from years and years of experience. This is captured from the traditional speech community.

We know this works pretty well. There's actually some fun things you can do. You can take a spectrogram, go back and find the best audio that corresponds to that spectrogram to listen to it and see if you lost anything. And spectrograms of about this level of quantization, you can kind of tell what people are saying.

It's a little bit garbled, but it's still actually pretty good. So amongst all the hyperparameters you could choose, this one's kind of a good tradeoff in keeping the information, but also saving a little bit of the phase by doing it frequently. Yeah. Are you doing overlapping windows? I think in a lot of the models in the demo, for example, we don't use overlapping windows. They're just adjacent.
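For reference, here's a minimal sketch of those adjacent 20-millisecond frames; the sample rate and the log-magnitude FFT are conventional choices, not anything specific to the engine described here.

```python
import numpy as np

def spectrogram(signal, sample_rate=16000, window_ms=20):
    """Cut the waveform into adjacent (non-overlapping) frames and take a
    log-magnitude FFT of each one."""
    window = int(sample_rate * window_ms / 1000)      # 320 samples at 16 kHz
    n_frames = len(signal) // window
    frames = signal[:n_frames * window].reshape(n_frames, window)
    frames = frames * np.hanning(window)              # taper each frame
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)
```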

Yeah. You mentioned you get linear scale-up across GPUs. Is that off-the-shelf software, or what do you use to do that? Yeah. So those results are from in-house software at Baidu. If you use something like OpenMPI, for example, on a cluster of GPUs, it actually works pretty well on a bunch of machines.

But I think some of the all-reduce algorithms, once you start moving huge amounts of data, are not optimal. You'll suffer a hit once you start going to that many GPUs. Within a single box, if you use the CUDA libraries to move data back and forth just on a local box, that stuff is pretty well optimized, and you can often do it yourself.

Okay. So I want to take a few more questions at the end, and maybe we can run into the break a little bit. I wanted to just dive right through a few comments about production here. So of course, the ultimate goal of solving speech recognition is to improve people's lives and enable exciting products, and so that means even though so far we've trained a bunch of acoustic and language models, we also want to get these things in production.

And users tend to care about more than just accuracy. Accuracy of course matters a lot, but we also care about things like latency. Users want to see the engine send them some feedback very quickly so that they know that it's responding and that it's understanding what they're saying. And we also need this to be economical so that we can serve lots of users without breaking the bank.

So in practice, a lot of the neural networks that we use in research papers, because they're awesome for beating benchmark results, turn out not to work that well on a production engine. So one in particular that I think is worth keeping an eye on is that it's really common to use bidirectional recurrent neural networks.

And so throughout the talk, I've been drawing my RNN with connections that just go forward in time, but you'll see a lot of research results that also have a path that goes backward in time. And this works fine if you just want to process data offline. But the problem is that if I want to compute this neuron's output up at the top of my network, I have to wait until I see the entire audio segment so that I can compute this backward recurrence and get this response.

So this sort of anti-causal part of my neural network that gets to see the future means that I can't respond to a user on the fly because I need to wait for the end of their signal. So if you start out with these bidirectional RNNs that are actually much easier to get working and then you jump to using a recurrent network that is forward only, it'll turn out that you're going to lose some accuracy.

And you might kind of hope that CTC, because it doesn't care about the alignment, would somehow magically learn to shift the output over to get better accuracy and just artificially delay the response so that it could get more context on its own. But it kind of turns out to only do that a little bit in practice.

It's really tough to control it. And so if you find that you're doing much worse, sometimes you have to sort of engage in model engineering. So even though I've been talking about these recurrent networks, I want you to bear in mind that there's this dual optimization going on. You want to find a model structure that gives you really good accuracy, but you also have to think carefully about how you set up the structure so that this little neuron at the top can actually see enough context to get an accurate answer and not depend too much on the future.

So for example, what we could do is tweak this model so that this neuron at the top that's trying to output the character L in hello can see some future frames, but it doesn't have this backward recurrence. So it only gets to see a little bit of context. That lets us kind of contain the amount of latency in the model.
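One way to realize that "a little future context, no backward recurrence" idea is a small lookahead layer along the lines of the sketch below, written in PyTorch; the context size and shapes are illustrative, not the production configuration.

```python
import torch
import torch.nn as nn

class Lookahead(nn.Module):
    """Each output at time t sees only frames t..t+context, so a forward-only
    RNN below it regains a little future context without waiting for the whole
    utterance to finish."""
    def __init__(self, features, context=5):
        super().__init__()
        self.context = context
        # One weight per (feature, future offset); no mixing across features.
        self.weights = nn.Parameter(torch.randn(features, context + 1) * 0.01)

    def forward(self, x):            # x: [time, batch, features]
        T = x.shape[0]
        # Pad the end of the sequence so the last frames still have "future" input.
        pad = x.new_zeros(self.context, *x.shape[1:])
        xp = torch.cat([x, pad], dim=0)
        out = torch.zeros_like(x)
        for k in range(self.context + 1):
            out = out + xp[k:k + T] * self.weights[:, k]
        return out
```

The latency is now bounded by `context` frames instead of by the length of the whole utterance, which is the trade-off being described here.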

I'm going to skip over this. So in terms of other online aspects, of course, we want this to be efficient. We want to serve lots of users on a small number of machines if possible. And one of the things that you might find if you have a really big deep neural network or recurrent neural network is that it's really hard to deploy them on conventional CPUs.

CPUs are awesome for serial jobs. You just want to go as fast as you can for this one string of instructions. But as we've discovered with so much of deep learning, GPUs are really fantastic because when we work with neural networks, we love processing lots and lots of arithmetic in parallel.

But it's really only efficient if the batch that we're working on, the hunks of audio that we're working on, are in a big enough batch. So if we just process one stream of audio so that my GPU is multiplying matrices times vectors, then my GPU is going to be really inefficient.

So for example, on like a K1200 GPU, so something you could put in a server in the cloud, what you'll find is that you get really poor throughput considering the dollar value of this hardware if you're only processing one piece of audio at a time. Whereas if you could somehow batch up audio to have, say, 10 or 32 streams going at once, then you can actually squeeze out a lot more performance from that piece of hardware.

So one of the things that we've been working on that works really well and is not too bad to implement is to just batch all of the packets as data comes in. So if I have a whole bunch of users talking to my server and they're sending me little hundred millisecond packets of audio, what I can do is I can sit and I can listen to all these users, and when I catch a whole batch of utterances coming in or a whole bunch of audio packets coming in from different people that start around the same time, I plug those all into my GPU and I process those matrix multiplications together.

So instead of multiplying a matrix times only one little audio piece, I get to multiply it by a batch of, say, four audio pieces, and it's much more efficient. And if you actually do this on a live server and you plow a whole bunch of audio streams through it, you could support maybe 10, 20, 30 users in parallel, and as the load on that server goes up, I have more and more users piling on, what happens is that the GPU will naturally start batching up more and more packets into single matrix multiplications.
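A toy sketch of that batching loop on the server side follows; the queue, matrix size, and callback mechanism are all made up for illustration, and a real system would also carry per-user recurrent state between chunks.

```python
import queue
import numpy as np

requests = queue.Queue()                              # (feature_chunk, callback) pairs
W = np.random.randn(2048, 2048).astype(np.float32)    # stand-in for one layer's weights

def serve_forever(max_batch=32):
    while True:
        chunks, callbacks = [], []
        chunk, cb = requests.get()                    # block until at least one user sends audio
        chunks.append(chunk); callbacks.append(cb)
        while len(chunks) < max_batch:
            try:                                      # grab whatever else is already waiting
                chunk, cb = requests.get_nowait()
                chunks.append(chunk); callbacks.append(cb)
            except queue.Empty:
                break
        batch = np.stack(chunks)                      # [batch, features]
        outputs = batch @ W.T                         # one matrix-matrix multiply for everyone
        for cb, out in zip(callbacks, outputs):
            cb(out)                                   # hand each user back their own row
```

Under light load this degenerates to batch size one, and as more users pile on it naturally forms bigger batches.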

So as you get more users, you actually get much more efficient as well. And so in practice, when you have a whole bunch of users on one machine, you usually don't see matrix multiplications happening with fewer than maybe batch sizes of four. So the summary of all of this is that deep learning is really making the first steps to building a state-of-the-art speech engine easier than they've ever been.

So if you want to build a new state-of-the-art speech engine for some new language, all of the components that you need are things that we've covered so far. And the performance now is really significantly driven by data and models, and I think, as we were discussing earlier, I think future models from deep learning are going to make that influence of data and computing power even stronger.

And of course, data and compute are important so that we can try lots and lots of models and keep making progress. And I think this technology is now at a stage where it's not just a research system anymore. We're seeing that the end-to-end deep learning technologies are now mature enough that we can get them into production.

I think you guys are going to be seeing deep learning play a bigger, bigger role in the speech engines that are powering all the devices that we use. So thank you very much. So I think we're right at the end of time. Sounds good. All right, we had one in the back who was waiting patiently.

Go ahead. More than one voice simultaneously? So the question is, how does the engine handle more than one voice simultaneously? So right now, there's nothing in this formalism that allows you to account for multiple speakers. And so usually, when you listen to an audio clip in practice, it's clear that there's one dominant speaker.

And so this speech engine, of course, learns whatever it was taught from the labels. And it will try to filter out background speakers and just transcribe the dominant one. But if it's really ambiguous, then undefined results. Can you customize the transcription to the specific characteristics of a particular speaker?

So we're not doing that in these pipelines right now. But of course, a lot of different strategies have been developed in the traditional speech literature. There are things like iVectors that try to quantify someone's voice. And those make useful features for improving speech engines. You could also imagine taking a lot of the concepts like embeddings, for example, and tossing them in here.

So I think a lot of that is left open to future work. Adam, question? I think we have to break for time. But I'll step off stage here, and you guys can come to me with your questions. Thank you so much. Thanks, Adam. So we'll reconvene at 2:45 for a presentation by Alex.