Deep Learning for Speech Recognition (Adam Coates, Baidu)

00:00:00.000 | So I want to tell you guys about speech recognition and deep learning.

00:00:04.840 | I think deep learning has been playing an increasingly large role in speech recognition.

00:00:10.360 | And one of the things I think is most exciting about this field is that speech recognition

00:00:15.040 | is at a place right now where it's becoming good enough to enable really exciting applications

00:00:20.480 | that end up in the hands of users.

00:00:23.800 | So for example, if we want to caption video content and make it accessible to everyone,

00:00:28.680 | it used to be that we would sort of try to do this, but you still need a human to get

00:00:32.960 | really good captioning for something like a lecture.

00:00:36.480 | But it's possible that we can do a lot of this with higher quality in the future with

00:00:39.880 | deep learning.

00:00:40.920 | We can do things like hands-free interfaces in cars, make it safer to use technology while

00:00:46.040 | we're on the go and keep people's eyes on the road.

00:00:48.040 | Of course, it would make mobile devices, home devices much easier, much more efficient and

00:00:53.460 | enjoyable to use.

00:00:56.040 | But another actually sort of fun recent study that some folks at Baidu participated in,

00:01:02.680 | along with Stanford and UW, was to show that for even something straightforward that we

00:01:06.640 | sort of take for granted as an application of speech, which is just texting someone with

00:01:11.960 | voice or writing a piece of text, the study showed that you can actually go three times

00:01:16.720 | faster with voice recognition systems that are available today.

00:01:20.680 | So it's not just like a little bit faster now, even with the errors that a speech recognition

00:01:25.520 | system can make.

00:01:26.520 | It's actually a lot faster.

00:01:28.800 | And the reason I wanted to highlight this result, which is pretty recent, is that the

00:01:34.320 | speech engine that was used for this study is actually powered by a lot of the deep learning

00:01:39.040 | methods that I'm going to tell you about.

00:01:40.840 | So hopefully when you walk away today, you have an appreciation or an understanding of

00:01:45.120 | the sort of high-level ideas that make a result like this possible.

00:01:50.560 | So there are a whole bunch of different components that make up a complete speech application.

00:01:57.280 | So for example, there's speech transcription.

00:02:00.360 | So if I just talk, I want to come up with words that represent whatever I just said.

00:02:07.740 | There's also other tasks, though, like word spotting or triggering.

00:02:11.080 | So for example, if my phone is sitting over there and I want to say, "Hey, phone, go do

00:02:14.560 | something for me," it actually has to be listening continuously for me to say that word.

00:02:20.240 | And likewise, there are things like speaker identification or verification, so that if

00:02:25.120 | I want to authenticate myself or I want to be able to tell apart different users in a

00:02:28.920 | room, I've got to be able to recognize your voice, even though I don't know what you're

00:02:32.440 | saying.

00:02:33.440 | So these are different tasks.

00:02:34.440 | I'm not going to cover all of them today.

00:02:36.720 | Instead, I'm going to just focus on the bread and butter of speech recognition.

00:02:40.760 | We're going to focus on building a speech engine that can accurately transcribe audio

00:02:45.820 | into words.

00:02:47.420 | So that's our main goal.

00:02:48.880 | This is a very basic goal of artificial intelligence.

00:02:53.680 | Historically, people are very, very good at listening to someone talk, just like you guys

00:02:59.680 | are listening to me right now.

00:03:01.520 | And you can very quickly turn audio into words and into meaning on your own, almost effortlessly.

00:03:10.280 | And for machines, this has historically been incredibly hard.

00:03:13.480 | So you think of this as like one of those sort of consummate AI tasks.

00:03:18.160 | So the goal of building a speech pipeline is, if you just give me a raw audio wave,

00:03:23.040 | like you recorded on your laptop or your cell phone, I want to somehow build a speech recognizer

00:03:28.360 | that can do this very simple task of printing out "Hello, world" when I actually say "Hello,

00:03:33.920 | world."

00:03:35.040 | So before I dig into the deep learning part, I want to step back a little bit and spend

00:03:42.320 | maybe 10 minutes talking about how a traditional speech recognition pipeline is working, for

00:03:48.160 | two reasons.

00:03:49.600 | If you're out in the wild, you're doing an internship, you're trying to build a speech

00:03:54.800 | recognition system with a lot of the tools that are out there, you're going to bump into

00:03:59.160 | a lot of systems that are built on technologies that look like this.

00:04:02.880 | So I want you to understand a little bit of the vocabulary and how those things are put

00:04:07.040 | together.

00:04:08.200 | And also, this will sort of give you a story for what deep learning is doing in speech

00:04:13.440 | recognition today that is kind of special and that I think paves the way for much bigger

00:04:20.140 | results in the future.

00:04:22.900 | So traditional systems break the problem of converting an audio wave, of taking audio

00:04:31.180 | and turning it into a transcription, into a bunch of different pieces.

00:04:36.280 | So I'm going to start out with my raw audio, and I'm just going to represent that by X.

00:04:43.160 | And then usually we have to decide on some kind of feature representation.

00:04:47.280 | We have to convert this into some other form that's easier to deal with than a raw audio

00:04:52.380 | wave.

00:04:54.160 | And in a traditional speech system, I often have something called an acoustic model.

00:04:58.300 | And the job of the acoustic model is to learn the relationship between these features that

00:05:04.320 | represent my audio and the words that someone is trying to say.

00:05:10.040 | And then I'll often have a language model, which encapsulates all of my knowledge about

00:05:14.520 | what kinds of words, what spellings and what combinations of words are most likely in the

00:05:19.960 | language that I'm trying to transcribe.

00:05:22.640 | And once you have all of these pieces, so these might be -- these different models might

00:05:27.160 | be driven by machine learning themselves, what you would need to build in a traditional

00:05:31.160 | system is something called a decoder.

00:05:34.240 | And the job of a decoder, which itself might involve some modeling efforts and machine

00:05:39.160 | learning algorithms, is to find the sequence of words W that maximizes this probability.

00:05:47.780 | The probability of the particular sequence W, given your audio.

00:05:51.720 | That's straightforward.

00:05:53.460 | But that's equivalent to maximizing the product of the contributions from your acoustic model

00:05:58.560 | and from your language model.

00:06:00.840 | So a traditional speech system is broken down into these pieces, and a lot of the effort

00:06:05.180 | in getting that system to work is in developing this sort of portion that combines them all.

00:06:12.880 | So it turns out that if you want to just directly transcribe audio, you can't just go straight

00:06:19.120 | to characters.

00:06:20.720 | And the reason is, and it's especially apparent in English, that the way something is spelled

00:06:26.040 | in characters doesn't always correspond well to the way that it sounds.

00:06:30.700 | So if I give you the word "night," for example, without context, you don't really know whether

00:06:36.280 | I'm talking about a night in armor or whether I'm talking about night like an evening.

00:06:41.520 | And so a way to get around this, to abstract this problem away from a traditional system,

00:06:46.720 | is to replace this with a sort of intermediate representation.

00:06:50.920 | Instead of trying to predict characters, I'll just try to predict something called phonemes.

00:06:55.640 | So as an example, if I want to represent the word "hello," what I might try to do is break

00:07:01.520 | it down into these units of sound.

00:07:04.580 | So the first one is like the "h," that H sound in "hello," and then an "uh" sound, which

00:07:10.040 | is actually only one possible pronunciation of an E, and then an L and an O sound.

00:07:16.120 | And that would be my string that I try to come up with using all of my different speech

00:07:22.040 | components.

00:07:24.120 | So this, in one sense, makes the modeling problem easier.

00:07:27.680 | My acoustic model and so on can be simpler, because I don't have to worry about spelling.

00:07:33.080 | But it does have this problem that I have to think about where these things come from.

00:07:38.180 | So these phonemes are intuitively, they're the perceptually distinct units of sound that

00:07:45.920 | we can use to distinguish words.

00:07:48.640 | And they're very approximate.

00:07:51.680 | This might be our imagination that these things actually exist.

00:07:55.240 | It's not clear how fundamental this is.

00:07:57.880 | But they're sort of standardized.

00:07:58.880 | There are a bunch of different conventions for how to define these.

00:08:05.320 | And if you end up working on a system that uses phonemes, one popular data set is called

00:08:10.760 | TIMIT.

00:08:12.240 | And so this actually has a corpus of audio frames with examples of each of these phonemes.

00:08:19.640 | So once you have this phoneme representation, unfortunately, it adds even more complexity

00:08:28.120 | to this traditional pipeline.

00:08:30.580 | Because now, my acoustic model doesn't associate this audio feature with words.

00:08:35.720 | It actually associates them with another kind of transcription, with the transcription into

00:08:39.720 | phonemes.

00:08:41.000 | And so I have to introduce yet another component into my pipeline that tries to understand

00:08:46.680 | how do I convert the transcriptions in phonemes into actual spellings.

00:08:51.760 | And so I need some kind of dictionary or a lexicon to tell me all of that.

00:08:56.800 | So this is a way of taking our knowledge about a language and baking it into this engineered

00:09:01.660 | pipeline.

00:09:03.560 | And then once you've got all that, again, all of your work now goes into this decoder

00:09:08.560 | that has a slightly more complicated task in order to infer the most likely word transcription

00:09:14.560 | given the audio.

00:09:17.420 | So this is a tried and true pipeline.

00:09:20.040 | It's been around for a long time.

00:09:22.480 | You'll see a whole bunch of these systems out there.

00:09:25.660 | And we're still using a lot of the vocabulary from these systems.

00:09:30.920 | But traditionally, the big advantage is that it's very tweakable.

00:09:34.800 | If you want to go add a new pronunciation for a word you've never heard before, you

00:09:38.920 | can just drop it right in.

00:09:40.040 | That's great.

00:09:42.340 | But it's also really hard to get working well.

00:09:44.920 | If you start from scratch with this system and you have no experience in speech recognition,

00:09:50.360 | it's actually quite confusing and hard to debug.

00:09:53.240 | It's very difficult to know which of these various models is the one that's behind your

00:09:58.000 | error.

00:09:59.000 | And especially once we start dealing with things like accents, heavy noise, different

00:10:03.480 | kinds of ambiguity, that makes the problem even harder to engineer around.

00:10:08.080 | Because trying to think ourselves about how do I tweak my pronunciation model, for example,

00:10:13.540 | to account for someone's accent that I haven't heard, that's a very hard engineering judgment

00:10:18.180 | for us to make.

00:10:20.180 | So there are all kinds of design decisions that go into this pipeline, like choosing

00:10:24.800 | the feature representation, for example.

00:10:28.100 | So the first place that deep learning has started to make an impact in speech recognition,

00:10:35.260 | starting a few years ago, is to just take one of the core machine learning components

00:10:40.780 | of the system and replace it with a deep learning algorithm.

00:10:44.660 | So I mentioned back in this previous pipeline that we had this little model here whose job

00:10:50.500 | is to learn the relationship between a sequence of phonemes and the audio that we're hearing.

00:10:56.460 | So this is called the acoustic model.

00:10:59.500 | And there are lots of different methods for training this thing.

00:11:03.140 | So take your favorite machine learning algorithm.

00:11:06.140 | You can probably find someone who is trained in acoustic model with that algorithm, whether

00:11:09.740 | it's a Gaussian mixture model or a bunch of decision trees and random forests, anything

00:11:15.200 | for estimating these kinds of densities.

00:11:17.980 | There's a lot of work in trying to make better acoustic models.

00:11:21.700 | So some work by George Dahl and co-authors took what was a state of the art deep learning

00:11:28.940 | system back in 2011, which is a deep belief network with some pre-training strategies,

00:11:35.420 | and dropped it into a state of the art pipeline in place of this acoustic model.

00:11:41.140 | And the results are actually pretty striking, because even though we had neural networks

00:11:46.460 | and these pipelines for a while, what ended up happening is that when you replace the

00:11:52.140 | Gaussian mixture model and HMM system that already existed with this deep belief network

00:11:58.380 | as an acoustic model, you actually got something between like a 10% and 20% relative improvement

00:12:03.300 | in accuracy, which is a huge jump.

00:12:06.320 | This is highly noticeable to a person.

00:12:09.140 | And if you compare this to the amount of progress that had been made in preceding years, this

00:12:14.940 | is a giant leap for a single paper to make, compared to progress we'd been able to make

00:12:21.020 | previously.

00:12:22.780 | So this is in some sense the first generation of deep learning for speech recognition, which

00:12:28.380 | is I take one of these components and I swap it out for my favorite deep learning algorithm.

00:12:37.060 | So the picture looks sort of like this.

00:12:40.480 | So with these traditional speech recognition pipelines, the problem that we would always

00:12:46.180 | run into is that if you gave me a lot more data, you gave me a much bigger computer so

00:12:51.540 | that I could train a huge model, that actually didn't help me because all the problems I

00:12:56.420 | had were in the construction of this pipeline.

00:13:00.580 | And so eventually, if you gave me more data and a bigger computer, the performance of

00:13:04.780 | our speech recognition system would just kind of peter out.

00:13:08.340 | It would just reach a ceiling that was very hard to get over.

00:13:11.580 | And so we just start coming up with lots of different strategies.

00:13:14.180 | We start specializing for each application.

00:13:16.740 | We try to specialize for each user and try to make things a little bit better around

00:13:21.220 | the edges.

00:13:22.500 | And what these deep learning acoustic models did was in some sense moved that barrier a

00:13:28.540 | little ways.

00:13:29.740 | It made it possible for us to take a bit more data, much faster computers that let us try

00:13:34.860 | a whole lot of models, and move that ceiling up quite a ways.

00:13:40.300 | So the question that many in the research community, including folks at Baidu, have

00:13:44.780 | been trying to answer is, can we go to a next generation version of this insight?

00:13:51.780 | Can we, for instance, build a speech engine that is powered by deep learning all the way

00:13:56.740 | from the audio input to the transcription itself?

00:14:00.740 | Can we replace as much of that traditional system with deep learning as possible so that

00:14:05.380 | over time, as you give researchers more data and bigger computers and the ability to try

00:14:11.620 | more models, their speech recognition performance just keeps going up and we can potentially

00:14:16.340 | solve speech for everybody?

00:14:18.880 | So the goal of this tutorial is not to get you up here, which requires a whole bunch

00:14:26.180 | of things that I'll tell you about near the end.

00:14:29.100 | But what we want to try to do is give you enough to get a point on this curve.

00:14:33.460 | And then once you're on the curve, the idea is that what remains is now a problem of scale.

00:14:40.740 | It's about data and about getting bigger computers and coming up with ways to build bigger models.

00:14:47.500 | So that's my objective, so that when you walk away from here, you have a picture of what

00:14:51.700 | you would need to build to get this point.

00:14:54.900 | And then after that, it's hopefully all about scale.

00:14:59.180 | So thanks to Vinay Rao, who's been helping put this tutorial together, there is going

00:15:04.580 | to be some starter code live for the basic pipeline, the deep learning part of the pipeline

00:15:10.540 | that we're talking about.

00:15:12.300 | So there are some open source implementations of things like CTC, but we wanted to make

00:15:18.180 | sure that there's a system out there that's pretty representative of the acoustic models

00:15:22.220 | that I'm going to be talking about in the first half of the presentation here.

00:15:27.100 | So this will be enough that you can get a simple pipeline going with something called

00:15:31.760 | max decoding, which I'll tell you about later.

00:15:34.780 | And the idea is that this is sort of a scale model of the acoustic models that Baidu and

00:15:40.060 | other places are powering real production speech engines.

00:15:44.260 | So this will get you that point on the curve.

00:15:48.020 | Okay.

00:15:49.740 | So here's what we're going to talk about.

00:15:52.860 | The first part, I'm just going to introduce a few preliminaries, talk about preprocessing.

00:15:57.360 | So we still have a little bit of preprocessing around, but it's not really fundamental.

00:16:01.460 | I think it's probably going to go away in the long run.

00:16:04.660 | We'll talk about what is probably the most mature piece of sequence learning technologies

00:16:11.940 | for deep learning right now.

00:16:13.500 | So it turns out that one of the fundamental problems of doing speech recognition is how

00:16:17.740 | do I build a neural network that can map this audio signal to a transcription that can have

00:16:23.060 | a quite variable length.

00:16:25.580 | And so CTC is one highly mature method for doing this.

00:16:29.580 | And I think you're actually going to hear about maybe some other solutions later today.

00:16:33.820 | Then I'll say a little bit about training and just what that looks like.

00:16:40.020 | And then finally say a bit about decoding and language models, which is sort of an addendum

00:16:45.060 | to the current acoustic models that we can build that make them perform a lot better.

00:16:50.580 | And then once you have this, that's a picture of what you need to get this point on the

00:16:55.580 | curve.

00:16:56.660 | And then I'll talk a little bit about what's remaining.

00:16:59.300 | How do you scale up from this little scale model up to the full thing?

00:17:04.180 | What does that actually entail?

00:17:05.860 | And then time permitting, we'll talk a little bit about production.

00:17:08.740 | How could you put something like this into a cloud server and actually serve real users

00:17:13.540 | with it?

00:17:15.540 | Great.

00:17:17.260 | So how is audio represented?

00:17:20.760 | This should be pretty straightforward, I think.

00:17:24.100 | Unlike a two-dimensional image where we normally have a 2D grid of pixels, audio is just a

00:17:28.900 | 1D signal.

00:17:30.580 | And there are a bunch of different formats for audio, but typically this one-dimensional

00:17:35.180 | wave that is actually me saying something like, "Hello, world," is something like 8,000

00:17:42.580 | samples per second or 16,000 samples per second.

00:17:46.500 | And each wave is quantized into 8 or 16 bits.

00:17:51.020 | So when we represent this audio signal that's going to go into our pipeline, you could just

00:17:55.020 | think of that as a one-dimensional vector.

00:17:57.380 | So when I had that box called x that represented my audio signal, you can think of this as

00:18:02.540 | being broken down into samples, x1, x2, and so forth.

00:18:07.220 | And if I had a one-second audio clip, this vector would have a length of either, say,

00:18:12.300 | 8,000 or 16,000 samples.

00:18:14.840 | And each element would be, say, a floating point number that I'd extracted from this

00:18:19.460 | 8 or 16-bit sample.

00:18:21.140 | So it's really simple.

00:18:23.800 | Now once I have an audio clip, we'll do a little bit of preprocessing.

00:18:28.700 | So there are a couple of ways to start.

00:18:31.020 | The first is to just do some vanilla preprocessing, like convert to a simple spectrogram.

00:18:37.540 | So if you look at a traditional speech pipeline, you're going to see things like MFCCs, which

00:18:42.660 | are male frequency kepstral coefficients.

00:18:45.740 | You'll see a whole bunch of plays on spectrograms where you take differences in different kinds

00:18:50.860 | of features and try to engineer complex representations.

00:18:55.500 | But for the stuff that we're going to do today, a simple spectrogram is just fine.

00:18:59.700 | And it turns out, as you'll see in a second, we lose a little bit of information when we

00:19:04.180 | do this, but it turns out not to be a huge difference.

00:19:08.740 | Now I said a moment ago that I think probably this is going to go away in the long run.

00:19:14.260 | And that's because today you can actually find recent research in trying to do away

00:19:19.540 | with even this preprocessing part and having your neural network process the audio wave

00:19:23.860 | directly and just train its own feature transformation.

00:19:27.660 | So there's some references at the end that you can look at for this.

00:19:32.980 | So here's a quick straw poll.

00:19:35.460 | How many people have seen a spectrogram or computed a spectrogram before?

00:19:39.020 | Pretty good.

00:19:40.340 | Maybe 50%.

00:19:41.940 | OK.

00:19:43.060 | So the idea behind a spectrogram is that it's sort of like a frequency domain representation,

00:19:49.940 | but instead of representing this entire signal in terms of frequencies, I'm just going to

00:19:55.100 | represent a small window in terms of frequencies.

00:19:59.940 | So to process this audio clip, the first thing I'm going to do is cut out a little window

00:20:06.660 | that's typically about 20 milliseconds long.

00:20:09.020 | And when you get down to that scale, it's usually very clear that these audio signals

00:20:12.880 | are made up of sort of a combination of different frequencies of sine waves.

00:20:18.960 | And then what we do is we compute an FFT.

00:20:21.780 | It basically converts this little signal into the frequency domain.

00:20:26.460 | And then we just take the log of the power at each frequency.

00:20:31.260 | And so if you look at what the result of this is, it basically tells us for every frequency

00:20:39.420 | of sine wave, what is the magnitude, what's the amount of power represented by that sine

00:20:44.820 | wave that makes up this original signal.

00:20:48.200 | So over here in this example, we have a very strong low frequency component in the signal.

00:20:55.780 | And then we have differing magnitudes at different differing frequencies.

00:21:01.180 | So we can just think of this as a vector.

00:21:06.020 | So now instead of representing this little 20 millisecond slice as sort of a sequence

00:21:10.560 | of audio samples, instead I'm going to represent it as a vector here where each element represents

00:21:18.180 | sort of the strength of each frequency in this little window.

00:21:22.500 | And the next step beyond this is that if I just told you how to process one little window,

00:21:28.100 | you can of course apply this to a whole bunch of windows across the entire piece of audio.

00:21:34.960 | And that gives you what we call a spectrogram.

00:21:37.400 | And you can use either disjoint windows that are just sort of adjacent or you can apply

00:21:41.800 | them to overlapping windows if you like.

00:21:44.560 | So there's a little bit of parameter tuning there.

00:21:46.980 | But this is an alternative representation of this audio signal that happens to be easier

00:21:53.000 | to use for a lot of purposes.

00:21:57.760 | So our goal, starting from this representation, is to build what I'm going to call an acoustic

00:22:04.200 | model, but which is really, to the extent we can make it happen, is really going to

00:22:08.720 | be an entire speech engine that is represented by a neural network.

00:22:12.780 | So what we would like to do is build a neural net that if we could train it from a whole

00:22:18.740 | bunch of pairs, X, which is my original audio that I turn into a spectrogram, and Y star,

00:22:25.000 | that's the ground truth transcription that some human has given me.

00:22:29.000 | If I were to train this big neural network off of these pairs, what I'd like it to produce

00:22:35.600 | is some kind of output that I'm representing by the character C here, so that I could later

00:22:41.560 | extract the correct transcription, which I'm going to denote by Y.

00:22:46.920 | So if I said hello, the first thing I'm going to do is run preprocessing to get all these

00:22:51.540 | spectrogram frames.

00:22:53.340 | And then I'm going to have a recurrent neural network that consumes each frame and processes

00:22:58.440 | them into some new representation called C. And hopefully, I can engineer my network in

00:23:04.640 | such a way that I can just read the transcription off of these output neurons.

00:23:09.940 | So that's kind of the intuitive picture of what we want to accomplish.

00:23:15.960 | So as I mentioned back in the outline, there's one obvious fundamental problem here, which

00:23:21.920 | is that the length of the input is not the same as the length of the transcription.

00:23:28.840 | So if I say hello very slowly, then I can have a very long audio signal, even though

00:23:35.120 | I didn't change the length of the transcription.

00:23:36.760 | Or if I say hello very quickly, then I can have a very short piece of audio.

00:23:43.120 | And so that means that this output of my neural network is changing length, and I need to

00:23:47.640 | come up with some way to map that variable length neural network output to this fixed

00:23:53.440 | length transcription, and also do it in a way that we can actually train this pipeline.

00:23:58.940 | So the traditional way to deal with this problem, if you were building a speech engine several

00:24:07.020 | years ago, is to just try to bootstrap the whole system.

00:24:11.040 | So I'd actually train a neural network to correctly predict the sounds at every frame

00:24:16.160 | using some kind of data set like Timit, where someone has lovingly annotated all of the

00:24:21.320 | phonemes for me.

00:24:23.140 | And then I'd try to figure out the alignment between my saying hello in a phonetic transcription

00:24:28.320 | with the input audio.

00:24:30.140 | And then once I've lined up all of the sounds with the input audio, now I don't care about

00:24:35.080 | length anymore, because I can just make a one-to-one mapping between the audio input

00:24:40.320 | and the phoneme outputs that I'm trying to target.

00:24:43.440 | But this alignment process is horribly error-prone.

00:24:47.420 | You have to do a lot of extra work to make it work well, and so we really don't want

00:24:51.160 | to do this.

00:24:52.160 | We really want to have some kind of solution that lets us solve this straightaway.

00:24:57.920 | So there are multiple ways to do it.

00:24:59.680 | And as I mentioned, there's some current research on how to use things like attentional models,

00:25:04.280 | sequence-to-sequence models that you'll hear about later, in order to solve this kind of

00:25:10.960 | problem.

00:25:11.960 | And then, as I said, we'll focus on something called connectionist temporal classification,

00:25:17.280 | or CTC, that is sort of current state-of-the-art for how to do this.

00:25:23.000 | So here's the basic idea.

00:25:24.960 | So our recurrent neural network has these output neurons that I'm calling C. And the

00:25:31.680 | job of these output neurons is to encode a distribution over the output symbols.

00:25:39.560 | So because of the structure of the recurrent network, the length of this symbol sequence

00:25:45.480 | C is the same as the length of my audio input.

00:25:48.280 | So if my audio input, say, was two seconds long, that might have 100 audio frames.

00:25:55.500 | And that would mean that the length of C is also 100 different values.

00:26:01.200 | So if we were working on a phoneme-based model, then C would be some kind of phoneme representation.

00:26:07.640 | And we would also include a blank symbol, which is special for CTC.

00:26:12.640 | But if, as we'll do in the rest of this talk, we're trying to just predict the graphemes,

00:26:19.040 | trying to predict the characters in this language directly from the audio, then I would just

00:26:24.400 | let C take on a value that's in my alphabet, or take on a blank or a space, if my language

00:26:30.940 | has spaces in it.

00:26:33.520 | And then the second thing I'm going to do, once my RNN gives me a distribution over these

00:26:39.540 | symbols C, is that I'm going to try to define some kind of mapping that can convert this

00:26:45.240 | long transcription C into the final transcription Y.

00:26:50.480 | That's like, hello.

00:26:51.720 | That's the actual string that I want.

00:26:55.000 | And now, recognizing that C is itself a probabilistic creature, there's a distribution over choices

00:27:01.260 | of C that correspond to the audio.

00:27:04.680 | Once I apply this function, that also means that there's a distribution over Y.

00:27:08.760 | There's a distribution over the possible transcriptions that I could get.

00:27:12.680 | And what I'll want to do to train my network is to maximize the probability of the correct

00:27:17.760 | transcription given the audio.

00:27:20.600 | So those are the three steps that we have to accomplish in order to make CTC work.

00:27:26.400 | So let's start with the first one.

00:27:29.520 | So we have these output neurons C, and they represent a distribution over the different

00:27:36.240 | symbols that I could be hearing in the audio.

00:27:39.680 | So I've got some audio signal down here.

00:27:41.380 | You can see the spectrogram frames poking up.

00:27:44.740 | And this is being processed by this recurrent neural network.

00:27:48.440 | And the output is a big bank of softmax neurons.

00:27:53.320 | So for the first frame of audio, I have a neuron that corresponds to each of the symbols

00:27:59.820 | that C could represent.

00:28:02.680 | And this set of softmax neurons here, with the output summing to 1, represents the probability

00:28:10.620 | of, say, C1 having the value A, B, C, and so on, or this special blank character.

00:28:17.680 | So for example, if I pick one of the neurons over here, then the first row, which represents

00:28:24.280 | the character B, and the 17th column, which is the 17th frame in time, this represents

00:28:31.760 | the probability that C17 represents the character B, given the audio.

00:28:40.920 | So once I have this, that also means that I can just define a distribution not just

00:28:47.160 | over the individual characters, but if I just assume that all of the characters are independent,

00:28:53.080 | which is kind of a naive assumption, but if I bake this into the system, I can define

00:28:57.820 | a distribution over all possible sequences of characters in this alphabet.

00:29:04.520 | So if I gave you a specific instance, a specific character string using this alphabet, for

00:29:11.960 | instance, I represent the string hello as H-H-H-E, blank E, blank blank L-L, blank L-O,

00:29:21.080 | and then a bunch of blanks.

00:29:22.600 | This is a string in this alphabet for C, and I can just use this formula to compute the

00:29:28.960 | probability of this specific sequence of characters.

00:29:33.560 | So that's how we compute the probability for a sequence of characters when they have the

00:29:38.760 | same length as the audio input.

00:29:43.960 | So the second step, and this is in some sense the kind of neat trick in CTC, is to define

00:29:52.440 | a mapping from this long encoding of the audio into symbols that crunches it down to the

00:30:03.240 | actual transcription that we're trying to predict.

00:30:06.000 | And the rule is this operator takes this character sequence, and it picks up all the duplicates,

00:30:13.440 | all of the adjacent characters that are repeated, and discards the duplicates and just keeps

00:30:18.720 | one of them, and then it drops all of the blanks.

00:30:23.240 | So in this example, you see you have three H's together, so I just keep one H, and then

00:30:29.600 | I have a blank, I throw that away, and I keep an E, and I have two L's, so I keep one of

00:30:34.640 | the L's over here, and then another blank, and an L-O.

00:30:38.400 | And the one key thing to note is that when I have two characters that are different right

00:30:43.280 | next to each other, I just end up keeping those two characters in my output.

00:30:48.680 | But if I ever have a double character, like L-L in "hello," then I'll need to have a blank

00:30:54.880 | character that gets put in between.

00:30:58.880 | But if our neural network gave me this transcription, told me that this was the right answer, we

00:31:04.360 | just have to apply this operator, and we get back the string "hello."

00:31:12.020 | So now that we have a way to define a distribution over these sequences of symbols that are the

00:31:18.320 | same length as the audio, and we now have a mapping from those strings into transcriptions,

00:31:25.160 | as I said, this gives us a probability distribution over the possible final transcriptions.

00:31:30.920 | So if I look at the probability distribution over all the different sequences of symbols,

00:31:37.720 | I might have "hello" written out like on the last slide, and maybe that has probability

00:31:42.620 | .1, and then I might have "hello" but written a different way, by say replacing this H with

00:31:49.880 | a blank that has a smaller probability, and I have a whole bunch of different possible

00:31:55.160 | symbol sequences below that.

00:31:58.440 | And what you'll notice is that if I go through every possible combination of symbols here,

00:32:06.160 | there are several combinations that all map to the same transcription.

00:32:10.640 | So here's one version of "hello," there's a second version of "hello," there's a third

00:32:15.120 | version of "hello."

00:32:16.800 | And so if I now ask, "What's the probability of the transcription 'hello'?"

00:32:21.720 | The way that I compute that is I go through all of the possible character sequences that

00:32:28.080 | correspond to the transcription "hello," and I add up all of their probabilities.

00:32:33.240 | So I have to sum over all possible choices of C that could give me that transcription

00:32:38.680 | in the end.

00:32:40.400 | So you can kind of think of this as searching through all the possible alignments, right?

00:32:48.080 | I could shift these characters around a little bit, I could move them forward, backward,

00:32:52.360 | I could expand them by adding duplicates or squish them up, depending on how fast someone

00:32:56.500 | is talking, and that corresponds to every possible alignment between the audio and the

00:33:04.000 | characters that I want to transcribe.

00:33:05.000 | It sort of solves the problem of the variable length.

00:33:08.760 | And the way that I get the probability of a specific transcription is to sum up, to

00:33:14.360 | marginalize over all the different alignments that could be feasible.

00:33:20.920 | And then if we have a whole bunch of other possibilities in here, like the word "yellow,"

00:33:25.480 | I'd compute them in the same way.

00:33:27.400 | So this equation just says to sum over all the character sequences C so that when I apply

00:33:32.800 | this little mapping operator, I end up with the transcription y.

00:33:36.160 | Oh, oh.

00:33:47.880 | I'm missing a double E. You're talking about this one?

00:33:51.920 | So when we apply this sort of squeezing operator here, we drop this double E to get a single

00:33:59.120 | E in "hello."

00:34:00.120 | And we remove all the duplicates.

00:34:03.320 | So the same way we did for an H.

00:34:04.320 | [INAUDIBLE]

00:34:05.320 | Right.

00:34:09.160 | So whenever you see two characters together like this, where they're adjacent duplicates,

00:34:16.120 | you sort of squeeze all those duplicates out, and you just keep one of them.

00:34:19.520 | But here we have a blank in between.

00:34:21.880 | So if we drop all the duplicates first, then we still have two L's left, and then we remove

00:34:27.440 | all the blanks.

00:34:29.160 | So this gives the algorithm a way to represent repeated characters in the transcription.

00:34:33.280 | There's another one in the back.

00:34:38.120 | [INAUDIBLE]

00:34:39.120 | Oh, oh, I see.

00:34:42.640 | Yeah.

00:34:43.640 | This is maybe-- I put a space in here.

00:34:47.520 | Really I should have put a space character in here instead of a blank.

00:34:51.000 | Really this could be H-E-L-L-O-H.

00:34:54.920 | Yeah.

00:34:58.120 | So the space here is erroneous.

00:35:01.240 | OK.

00:35:03.240 | We good?

00:35:06.680 | OK.

00:35:08.040 | So once I've defined this, I just gave you a formula to compute the probability of a

00:35:13.600 | string given the audio.

00:35:16.360 | So as with every good starting to a machine learning algorithm, we go and we try to apply

00:35:22.120 | maximum likelihood.

00:35:23.720 | I now give you the correct transcription, and your job is to tune the neural network

00:35:28.440 | to maximize the probability of that transcription using this model that I just defined.

00:35:34.160 | So in equations, what I'm going to do is I want to maximize the log probability of y

00:35:40.960 | star for a given example.

00:35:46.280 | I want to maximize the probability of the correct transcription given the audio x.

00:35:51.680 | And then I'm just going to sum over all the examples.

00:35:56.160 | And then what I want to do is just replace this with the equation that I had on the last

00:36:02.040 | page that says in order to compute the probability of a given transcription, I have to sum over

00:36:07.320 | all of the possible symbol sequences that could have given me that transcription, sum

00:36:12.120 | over all the possible alignments that would map that transcription to my audio.

00:36:18.800 | So Alex Graves and co-authors in 2006 actually show that because of this independence assumption,

00:36:26.000 | there is a clever way, there is a dynamic programming algorithm that can efficiently

00:36:29.720 | compute this summation for you.

00:36:33.000 | And not only compute this summation so that you can compute the objective function, but

00:36:37.200 | actually compute its gradient with respect to the output neurons of your neural network.

00:36:42.240 | So if you look at the paper, the algorithm details are in there.

00:36:47.000 | What's cool right now in the history of speech and deep learning is that this is at the level

00:36:52.200 | of a technology.

00:36:53.540 | This is something that's now implemented in a bunch of places so that you can download

00:36:57.500 | a software package that efficiently will calculate this CTC loss function for you that can calculate

00:37:05.540 | this likelihood and can also just give you back the gradient.

00:37:08.980 | So I won't go into the equations here.

00:37:11.240 | Instead, I'll tell you that there are a whole bunch of implementations on the web that you

00:37:15.760 | can now use as part of deep learning packages.

00:37:19.140 | So one of them from Baidu implements CTC on the GPU.

00:37:23.120 | It's called WarpCTC.

00:37:25.960 | Stanford and the group there, actually one of Andrew's students, has a CTC implementation.

00:37:33.360 | And there's also now CTC losses implemented in packages like TensorFlow.

00:37:37.880 | So this is something that's sufficiently widely distributed that you can use these algorithms

00:37:44.480 | off the shelf.

00:37:46.880 | So the way that these work, the way that we go about training, is we start from our audio

00:37:51.480 | spectrogram.

00:37:52.680 | We have our neural network structure where you get to choose how it's put together.

00:37:58.060 | And then it outputs this bank of softmax neurons.

00:38:01.640 | And then there are pieces of off-the-shelf software that will compute for you the CTC

00:38:07.520 | cost function.

00:38:08.520 | They'll compute this log likelihood given a transcription and the output neurons from

00:38:13.640 | your recurrent network.

00:38:16.440 | And then the software will also be able to tell you the gradient with respect to the

00:38:20.800 | output neurons.

00:38:22.040 | And once you've got that, you're set.

00:38:23.440 | You can feed them back into the rest of your code and get the gradient with respect to

00:38:27.640 | all of these parameters.

00:38:29.980 | So as I said, this is all available now in sort of efficient, off-the-shelf software.

00:38:34.280 | So you don't have to do this work yourself.

00:38:37.520 | So that's pretty much all there is to the high-level algorithm.

00:38:41.840 | With this, it's actually enough to get a sort of working drosophila of speech recognition

00:38:48.480 | going.

00:38:49.800 | There are a few little tricks, though, that you might need along the way.

00:38:54.680 | On easy problems, you might not need these.

00:38:57.520 | But as you get to more difficult data sets with a lot of noise, they can become more

00:39:01.720 | and more important.

00:39:03.360 | So the first one that we've been calling "sort of grad" in the vein of all of the grad algorithms

00:39:09.080 | out there is basically a trick to help with recurrent neural networks.

00:39:16.600 | So it turns out that when you try to train one of these big RNN models on some off-the-shelf

00:39:22.360 | speech data, one of the things that can really get you is seeing very long utterances early

00:39:28.680 | in the process.

00:39:30.540 | Because if you have a really long utterance, then if your neural network is badly initialized,

00:39:36.680 | you'll often end up with things like underflow and overflow as you try to go and compute

00:39:41.000 | the probabilities.

00:39:42.400 | And you end up with gradients exploding as you try to do back propagation.

00:39:46.320 | And it can make your optimization a real mess.

00:39:49.200 | And it's coming from the fact that these utterances are really long and really hard, and the neural

00:39:53.440 | network just isn't ready to deal with those transcriptions.

00:39:57.280 | And so one of the fixes that you can use is, during the early parts of training, usually

00:40:02.400 | in the first epoch, is you just sort all of your audio by length.

00:40:07.040 | And now, when you process a mini-batch, you just take the short utterances first so that

00:40:12.040 | you're working with really short RNNs that are quite easy to train and don't blow up

00:40:16.760 | and don't have a lot of catastrophic numerical problems.

00:40:20.480 | And then as time goes by, you start operating on longer and longer utterances that get more

00:40:25.440 | and more difficult.

00:40:27.440 | So we call this "sort of grad."

00:40:28.760 | It's basically a curriculum learning method.

00:40:31.480 | And so you can see some work from Yoshio Bengio and his team on a whole bunch of strategies

00:40:36.520 | for this.

00:40:37.520 | But you can think of the short utterances as being the easy ones.

00:40:40.480 | And if you start out with the easy utterances and move to the longer ones, your optimization

00:40:44.640 | algorithm can do better.

00:40:46.640 | So here's an example from one of the models that we've trained, where your CTC cost starts

00:40:53.100 | up here.

00:40:54.720 | And after a while, you optimize, and you sort of bottom out around, I don't know, what?

00:40:59.560 | A log likelihood of maybe 30.

00:41:02.080 | And then if you add this sort of grad strategy, after the first epoch, you're actually doing

00:41:07.380 | better.

00:41:08.380 | And you can reach a better optimum than you could without it.

00:41:12.120 | And in addition, another strategy that's extremely helpful for recurrent networks and very deep

00:41:17.120 | neural networks is batch normalization.

00:41:20.760 | So this is becoming very popular.

00:41:22.880 | And it's also available as sort of an off-the-shelf package inside of a lot of the different frameworks

00:41:28.120 | that are available today.

00:41:29.520 | So if you start having trouble, you can consider putting batch normalization into your network.

00:41:36.280 | So our neural network now spits out this big bank of softmax neurons.

00:41:41.720 | We've got a training algorithm.

00:41:42.760 | We're just doing gradient descent.

00:41:45.720 | How do we actually get a transcription?

00:41:47.740 | This process, as I said, is meant to be as close to characters as possible.

00:41:53.000 | But we still sort of need to decode these outputs.

00:41:56.600 | And you might think that one simple solution, which turns out to be approximate, to get

00:42:01.720 | the correct transcription is just go through here and pick the most likely sequence of

00:42:06.960 | symbols for C, and then apply our little squeeze operator to get back the transcription the

00:42:13.280 | way that we defined it.

00:42:14.960 | So this turns out not to be the optimal thing.

00:42:17.560 | This actually doesn't give you the most likely transcription, because it's not accounting

00:42:22.320 | for the fact that every transcription might have multiple sequences of Cs, multiple alignments

00:42:28.640 | in this representation.

00:42:32.320 | But you can actually do this, and this is called the max decoding.

00:42:36.280 | And so for this sort of contrived example here, I put little red dots on the most likely

00:42:42.720 | C. And if you see, there's a couple of blanks, a couple of Cs, there's another blank, A,

00:42:50.560 | more blanks, Bs, more blanks.

00:42:52.680 | And if you apply our little squeeze operator, you just get the word cab.

00:42:58.800 | If you do this, it is often terrible.

00:43:02.200 | It will often give you a very strange transcription that doesn't look like English necessarily.

00:43:09.280 | But the reason I mention it is that this is a really handy diagnostic.

00:43:13.400 | If you're kind of wondering what's going on in the network, glancing at a few of these

00:43:17.160 | will often tell you if the network's starting to pick up any signal or if it's just outputting

00:43:21.900 | gobbledygook.

00:43:22.900 | So I'll give you a more detailed example in a second of how that happens.

00:43:28.360 | All right.

00:43:29.640 | So these are all the concepts of our very simple pipeline.

00:43:32.880 | And the demo code that we're going to put up on the web will basically let you work

00:43:37.480 | on all of these pieces.

00:43:39.200 | So once we try to train these, I want to give you an example of the sort of data that we're

00:43:44.400 | training on.

00:43:45.400 | >> A tanker is a ship designed to carry large volumes of oil or other liquid cargo.

00:43:51.920 | >> So this is just a person sitting there reading the Wall Street Journal to us.

00:43:55.960 | So this is a sort of simple data set.

00:43:58.520 | It's really popular in the speech research community.

00:44:02.240 | It's published by the Linguistic Data Consortium.

00:44:05.320 | There's also a free alternative called Libris Speech that's very similar.

00:44:08.720 | But instead of people reading the Wall Street Journal, it's people reading Creative Commons

00:44:12.360 | audiobooks.

00:44:15.200 | So in the demo code that we have, a really simple network that works reasonably well

00:44:22.680 | looks like this.

00:44:23.680 | So there's a sort of family of models that we've been working with, where you start from

00:44:28.120 | your spectrogram.

00:44:29.640 | You have maybe one layer or several of convolutional filters at the bottom.

00:44:35.380 | And then on top of that, you have some kind of recurrent neural network.

00:44:38.040 | It might just be a vanilla RNN, but you can also use LSTM or GRU cells, any of your favorite

00:44:47.360 | RNN creatures from the literature.

00:44:50.760 | And then on top of that, we have some fully connected layers that produce these softmax

00:44:54.800 | outputs.

00:44:55.800 | And those are the things that go into CTC for training.

00:44:59.680 | So this is pretty straightforward.

00:45:01.040 | The implementation on the web uses the warp CTC code.

00:45:05.000 | And then we would just train this big neural network with stochastic gradient descent,

00:45:08.880 | Nesterov's momentum, all the stuff that you've probably seen in a whole bunch of other talks

00:45:13.120 | so far.

00:45:14.120 | All right.

00:45:15.120 | So if you actually run this, what is going on inside?

00:45:21.680 | So I mentioned that looking at the max decoding is kind of a handy way to see what's going

00:45:26.960 | on inside this creature.

00:45:29.920 | So I wanted to show you an example.

00:45:31.840 | So this is a picture.

00:45:34.640 | This is a visualization of those softmax neurons at the top of one of these big neural networks.

00:45:40.860 | So this is the representation of C from all the previous slides.

00:45:45.880 | So on the horizontal axis, this is basically time.

00:45:48.680 | This is the frame number or which chunk of the spectrogram we're seeing.

00:45:52.800 | And then on the vertical axis here, you see these are all the characters in the English

00:45:56.460 | alphabet or a space or a blank.

00:45:59.280 | So after 300 iterations of training, which is not very much, the system has learned something

00:46:04.960 | amazing, which is that it should just output blanks and spaces all the time.

00:46:09.600 | Because these are by far, because of all the silence and things in your data set, these

00:46:14.040 | are the most common characters.

00:46:16.000 | I just want to fill up the whole space with blanks.

00:46:18.360 | But you can see it's kind of randomly poking out a few characters here.

00:46:23.300 | And if you run your little max decoding strategy to see what does the system think the transcription

00:46:28.440 | is, it thinks the transcription is "eh."

00:46:33.160 | But after 300 iterations, that's okay.

00:46:35.600 | But this is a sign that the neural network's not going crazy.

00:46:38.360 | Your gradient isn't busted.

00:46:40.240 | It's at least learned what is the most likely characters.

00:46:44.560 | Then after maybe 1,500 or so, you start to get a little bit of structure.

00:46:49.680 | And if you try to like mouth these words, you might be able to sort of see that there's

00:46:54.860 | some English-like sounds in here, like "Beyar justinfrutin."

00:47:00.760 | Something kind of odd.

00:47:01.760 | But it's actually looking much better than just "h."

00:47:03.600 | It's actually starting to output something.

00:47:07.120 | Go a little bit farther.

00:47:08.480 | It's a little bit more organized.

00:47:11.920 | You can start to see that we have sort of fragments of possibly words starting to form.

00:47:18.980 | And then after you're getting close to convergence, it's still not a real sentence.

00:47:23.880 | But does this make sense to people?

00:47:25.160 | He guessed what the correct transcription might be.

00:47:31.640 | You might have a couple of candidates.

00:47:34.160 | The correct one is actually "there justinfrunt."

00:47:38.440 | And so you can see that sort of it's sort of sounding it out with English characters.

00:47:44.220 | I have a young son, and I kind of figure I'm eventually going to see him producing max-decoded

00:47:48.600 | outputs of English.

00:47:51.520 | And you're just going to sound these things out and be like, "Is it there justinfrunt?

00:47:55.960 | There?"

00:47:56.960 | But this is why this max-decoding strategy is really handy.

00:48:00.200 | Because you can kind of look at this output and say, yeah, it's starting to get some actual

00:48:03.880 | signal out of the data.

00:48:05.080 | It's not just gobbledygook.

00:48:07.240 | So because this is like my favorite speech recognition party game, I wanted to show you

00:48:11.800 | a few more of these.

00:48:13.480 | So here's the max-decoded output.

00:48:15.920 | "The poor little things," cried Cynthia, "think of them having been turned to the wall all

00:48:20.720 | these years."

00:48:21.720 | And so you can hear like the sound of the breath at the end.

00:48:26.400 | Turns into a little bit of a word.

00:48:29.440 | "Cynthia" is sort of in this transcription.

00:48:34.040 | And you'll find that things like proper names and so on tend to get sounded out.

00:48:38.240 | But if those names are not in your audio data, there's no way the network could have learned

00:48:42.320 | how to say the name Cynthia.

00:48:45.080 | And we'll come back to how to solve that later.

00:48:47.000 | But you see the true label is "The poor little things," cried Cynthia.

00:48:52.160 | And that the last word is actually "all these years."

00:48:54.560 | And there isn't a word hanging off at the end.

00:48:58.280 | So here's another one.

00:48:59.280 | >> That is true, bad dealt gray.

00:49:03.920 | >> How many people figured out what this is?

00:49:06.160 | This is the max-decoded transcription.

00:49:08.680 | It sounds good to you.

00:49:12.080 | It sounds good to me.

00:49:13.080 | If you told me that this was the ground truth, I'd go, "That's weird.

00:49:16.720 | I have to go look up what this is."

00:49:19.480 | Here's the actual true label.

00:49:21.920 | Turns out this is a French word that means something like "rubbernecking."

00:49:26.840 | I had no idea what this word was.

00:49:29.440 | So this is, again, the cool examples of what these neural networks are able to figure out

00:49:34.040 | with no knowledge of the language itself.

00:49:38.120 | Okay.

00:49:40.480 | So let's go back to decoding.

00:49:42.280 | We just talked about max-decoding, which is sort of an approximate way of going from these

00:49:49.040 | probability vectors to a transcription Y.

00:49:52.240 | And if you want to find the actual most likely transcription Y, there's actually no algorithm

00:49:58.400 | in general that can give you the perfect solution efficiently.

00:50:03.840 | So the reason for that, remember, is that for a single transcription Y, I have an efficient

00:50:09.200 | algorithm to compute its probability.

00:50:11.320 | But if I want to search over every possible transcription, I don't know how to do that

00:50:15.880 | because there are exponentially many possible transcriptions, and I'd have to run this algorithm

00:50:22.600 | to compute the probability of all of them.

00:50:25.400 | So we have to resort to some kind of generic search strategy.

00:50:29.840 | And so one proposed in the original paper briefly is a sort of prefix-decoding strategy.

00:50:36.760 | So I don't want to spend a ton of time on this.

00:50:39.200 | Instead, I want to step to sort of the next piece of the picture.

00:50:44.480 | So there were a bunch of examples in there, right, like proper names, like Cynthia and

00:50:49.160 | things like Baddourie, where unless you had heard this word before, you have no hope of

00:50:57.000 | getting it right with your neural network.

00:50:59.280 | And so there are lots of examples like this in the literature of things that are sort

00:51:04.680 | of spelled out phonetically but aren't legitimate English transcriptions.

00:51:10.240 | And so what we'd like to do is come up with a way to fold in just a little bit of that

00:51:17.040 | knowledge about the language, to take a small step backward from a perfect end-to-end system

00:51:22.200 | and make these transcriptions better.

00:51:25.280 | So as I said, the real problem here is that you don't have enough audio available to learn

00:51:31.200 | all these things.

00:51:32.200 | If you had millions and millions of hours of audio sitting around, you could probably

00:51:35.560 | learn all these transcriptions because you just hear enough words that you know how to

00:51:39.440 | spell them all, maybe the way a human does.

00:51:42.800 | But unfortunately, we just don't have enough audio for that.

00:51:45.960 | So we have to find a way to get around that data problem.

00:51:50.120 | There's also an example of something that in the AI lab we've dubbed the Tchaikovsky

00:51:53.880 | problem, which is that there are certain names in the world, right, like proper names, that

00:51:59.360 | if you've never heard of it before, you have no idea how it's spelled.

00:52:03.560 | And the only way to know it is to have seen this word in text before and to see it in

00:52:08.400 | context.

00:52:10.400 | So part of the purpose of these language models is to get examples like this correct.

00:52:14.760 | So there are a couple of solutions.

00:52:16.840 | One would be to just step back to a more traditional pipeline, right, use phonemes, because then

00:52:21.940 | we can bake new words in along with their phonetic pronunciation and the system will

00:52:27.680 | just get it right.

00:52:29.280 | But in this case, I want to focus on just fusing in a traditional language model that

00:52:34.960 | gives us the probability a priori of any sequence of words.

00:52:40.000 | So the reason that this is helpful is that using a language model, we can train these

00:52:45.480 | things from massive text corpora.

00:52:47.880 | We have way, way more text in the world than we have transcribed audio.

00:52:52.640 | And so that makes it possible to train these giant language models with huge vocabulary,

00:52:57.960 | and they can also pick up the sort of contextual things that will tip you off to the fact that

00:53:02.720 | Tchaikovsky concerto is a reasonable thing for a person to ask, and that this particular

00:53:08.520 | transcription which we have seen in the past, Tchaikovsky concerto, even though composed

00:53:15.200 | of legitimate English words, is nonsense.

00:53:19.680 | So there's actually not much to see on the language modeling front for this, except that

00:53:25.840 | the reasons for sticking with traditional N-gram models are kind of interesting if you're

00:53:30.520 | excited about speech applications.

00:53:32.960 | So if you go use a package like KenLM on the web to go build yourself a giant N-gram language

00:53:39.880 | model, these are really simple and well supported.

00:53:44.240 | And so that makes them easy to get working.

00:53:47.040 | And they'll let you train from lots of corpora, but for speech recognition in practice, one

00:53:52.360 | of the nice things about N-gram models as opposed to trying to, say, use like an RNN

00:53:58.160 | model is that we can update these things very quickly.

00:54:00.440 | If you have a big distributed cluster, you can update that N-gram model very rapidly

00:54:05.160 | in parallel from new data to keep track of whatever the trending words are today that

00:54:09.600 | your speech engine might need to deal with.

00:54:12.440 | And we also have the need to query this thing very rapidly inside our decoding loop that

00:54:18.960 | you'll see in just a second.

00:54:20.340 | And so being able to just look up the probabilities in a table the way an N-gram model is structured

00:54:24.680 | is very valuable.

00:54:26.880 | So I hope someday all of this will go away and be replaced with an amazing neural network.

00:54:32.760 | But this is a really best practice today.

00:54:37.040 | So in order to fuse this into the system, since to get the most likely transcription,

00:54:44.880 | right, probability of Y given X, to maximize that thing, we need to use a generic search

00:54:49.940 | algorithm anyway.

00:54:51.460 | This opens up a door.

00:54:54.060 | Once we're using a generic search scheme to do our decoding and find the most likely transcription,

00:54:58.340 | we can add some extra cost terms.

00:55:00.840 | So in a previous piece of work from Auni, Hanun, and several co-authors, what you do

00:55:07.820 | is you take the probability of a given word sequence from your audio.

00:55:13.480 | So this is what you would get from your giant RNN.

00:55:17.920 | And you can just multiply it by some extra terms, the probability of the word sequence

00:55:22.560 | according to your language model raised to some power, and then multiply by the length

00:55:26.480 | raised to another power.

00:55:28.160 | And you see that if you just take the log of this objective function, right, then you

00:55:33.840 | get the log probability that was your original objective.

00:55:37.380 | You get alpha times the log probability of the language model, and beta times the log

00:55:43.080 | of the length.

00:55:44.380 | And these alpha and beta parameters let you sort of trade off the importance of getting

00:55:49.880 | a transcription that makes sense to your language model versus getting a transcription that

00:55:53.480 | makes sense to your acoustic model and actually sounds like the thing that you heard.

00:55:59.040 | And the reason for this extra term over here is that as you're multiplying in all of these

00:56:04.520 | terms, you tend to penalize long transcriptions a bit too much.

00:56:09.120 | And so having a little bonus or penalty at the end to tweak to get the transcription

00:56:13.600 | length right is very helpful.

00:56:16.700 | So the basic idea behind this is just to use BeamSearch.

00:56:19.920 | BeamSearch, really popular search algorithm, a whole bunch of instances of it.

00:56:25.000 | And the rough strategy is this.

00:56:28.840 | So starting from time 0, starting from t equals 1 at the very beginning of your audio input,

00:56:35.120 | I start out with an empty list that I'm going to populate with prefixes.

00:56:40.540 | And these prefixes are just partial transcriptions that represent what I think I've heard so

00:56:45.300 | far in the audio up to the current time.

00:56:50.200 | And the way that this proceeds is I'm going to take at the current time step each candidate

00:56:56.440 | prefix out of this list.

00:56:58.960 | And then I'm going to try all of the possible characters in my softmax neurons that could

00:57:04.160 | possibly follow it.

00:57:06.040 | So for example, I can try adding a blank.

00:57:08.800 | I can say if the next element of C is actually supposed to be a blank, then what that would

00:57:15.440 | mean is that I don't change my prefix, right, because the blanks are just going to get dropped

00:57:19.680 | later.

00:57:21.020 | But I need to incorporate the probability of that blank character into the probability

00:57:26.780 | of this prefix, right?

00:57:28.440 | It represents one of the ways that I could reach that prefix.

00:57:32.400 | And so I need to sum that probability into that candidate.

00:57:37.060 | And likewise, whenever I add a space to the end of a prefix, that signals that this prefix

00:57:43.640 | represents the end of a word.

00:57:45.600 | And so in addition to adding the probability of the space into my current estimate, this

00:57:50.400 | gives me the chance to go look up that word in my language model and fold that into my

00:57:55.540 | current score.

00:57:57.520 | And then if I try adding a new character onto this prefix, it's just straightforward.

00:58:01.600 | I just go and update the probabilities based on the probability of that character.

00:58:06.400 | And then at the end of this, I'm going to have a huge list of possible prefixes that

00:58:11.120 | could be generated.

00:58:12.480 | And this is where you would normally get the exponential blow up of trying all possible

00:58:18.560 | prefixes to find the best one.

00:58:20.880 | And what BeamSearch does is it just says, take the k most probable prefixes after I

00:58:27.220 | remove all the duplicates in here, and then go and do this again.

00:58:31.360 | And so if you have a really large k, then your algorithm will be a bit more accurate

00:58:35.400 | in finding the best possible solution to this maximization problem, but it'll be slower.

00:58:42.860 | So here's what ends up happening.

00:58:44.620 | If you run this decoding algorithm, if you just run it on the RNN outputs, you'll see

00:58:49.720 | that you get actually better than straight max decoding.

00:58:53.680 | You find slightly better solutions.

00:58:55.600 | But you still make things like spelling errors, like Boston with an I.

00:59:00.560 | But once you add in a language model that can actually tell you that the word Boston

00:59:05.000 | with an O is much more probable than Boston with an I.

00:59:14.120 | So one place that you can also drop in deep learning that I wanted to mention very rapidly

00:59:18.100 | is just if you're not happy with your N-gram model, because it doesn't have enough context,

00:59:22.760 | or you've seen a really amazing neural language modeling paper that you'd like to fold in,

00:59:29.080 | one really easy way to do this and link it to your current pipeline is to do rescoring.

00:59:35.080 | So when this decoding strategy finishes, it can give you the most probable transcription,

00:59:40.800 | but it also gives you this big list of the top k transcriptions in terms of probability.

00:59:48.560 | And what you can do is take your recurrent network and just rescore all of these, basically

01:00:00.320 | reorder them according to this new model.

01:00:03.940 | So in the instance of a neural language model, let's say that this is my N best list.

01:00:09.760 | I have five candidates that were output by my decoding strategy.

01:00:15.840 | And the first one is I'm a connoisseur looking for wine and pork chops.

01:00:19.320 | Sounds good to me.

01:00:21.120 | I'm a connoisseur looking for wine and pork chops.

01:00:25.400 | So this is actually quite subtle.

01:00:28.000 | And depending on what kind of connoisseur you are, it's sort of up to interpretation

01:00:34.040 | what you're looking for.

01:00:35.400 | But perhaps a neural language model is going to be a little bit better at figuring out

01:00:38.840 | that wine and pork are closely related.

01:00:40.920 | And if you're a connoisseur, you might be looking for wine and pork chops.

01:00:45.000 | And so what you would hope to happen is that a neural language model trained on a bunch

01:00:48.960 | of text is going to correctly reorder these things and figure out that the second beam

01:00:56.080 | candidate is actually the correct one, even though your N-gram model didn't help you.

01:01:03.520 | So that is really the scale model.

01:01:07.160 | That is the set of concepts that you need to get a working speech recognition engine

01:01:14.200 | based on deep learning.

01:01:16.240 | And so the thing that's left to go to state of the art performance and start serving users

01:01:21.320 | is scale.

01:01:22.640 | So I'm going to kind of run through quickly a bunch of the different tactics that you

01:01:28.200 | can use to try to get there.

01:01:30.720 | So the two pieces of scale that I want to cover, of course, are data and computing power.

01:01:35.320 | Where do you get them?

01:01:37.920 | So the first thing to know, this is just a number you can keep in the back of your head

01:01:41.080 | for all purposes, which is that transcribing speech data is not cheap, but it's also not

01:01:45.880 | prohibitive.

01:01:46.880 | It's about 50 cents to a dollar a minute, depending on the quality you want and who's

01:01:50.460 | transcribing it and the difficulty of the data.

01:01:54.320 | So typical speech benchmarks you'll see out there are maybe hundreds to thousands of hours.

01:01:59.920 | So like the Libri speech data set is maybe hundreds of hours.

01:02:04.920 | There's another data set called VoxForge, and you can kind of cobble these together

01:02:08.240 | and get maybe hundreds to thousands of hours.

01:02:11.120 | But the real challenge is that the application matters a lot.

01:02:16.080 | So all the utterances I was playing for you are examples of read speech.

01:02:21.760 | People are sitting in a nice quiet room, they're reading something wonderful to me, and so

01:02:25.880 | I'm going to end up with a speech engine that's really awesome at listening to the Wall Street

01:02:29.680 | Journal, but maybe not so good at listening to someone in a crowded cafe.

01:02:35.600 | So the application that you want to target really needs to match your data set.

01:02:41.000 | And so it's worth, at the outset, if you're thinking about going and buying a bunch of

01:02:44.160 | speech data, to think of what is the style of speech you're actually targeting.

01:02:49.160 | Are you worried about read speech, like the ones we're hearing, or do you care about conversational

01:02:53.520 | speech?

01:02:55.000 | It turns out that when people talk in a conversation, when they're spontaneous, they're just coming

01:03:00.140 | up with what to say on the fly versus if they have something that they're just dictating

01:03:04.640 | and they already know what to say, they behave differently.

01:03:07.880 | And they can exhibit all of these effects like disfluency and stuttering.

01:03:14.000 | And then in addition to that, we have all kinds of environmental factors that might

01:03:17.040 | matter for an application, like reverb and echo.

01:03:20.040 | We start to care about the quality of microphones and whether they have noise canceling.

01:03:24.880 | There's something called Lombard effect that I'll mention again in a second, and of course

01:03:28.680 | things like speaker accents, where you really have to think carefully about how you collect

01:03:32.680 | your data to make sure that you actually represent the kinds of cases you want to test on.

01:03:39.600 | So the reason that read speech is really popular is because we can get a lot of it.

01:03:44.600 | And even if it doesn't perfectly match your application, it's cheap and getting a lot

01:03:49.480 | of it can still help you.

01:03:51.340 | So I wanted to say a few things about read speech, because for less than $10 an hour,

01:03:55.560 | often a lot less, you can get a whole bunch of data.

01:03:58.400 | And it has the disadvantage that you lose a lot of things like inflection and conversationality,

01:04:06.560 | but it can still be helpful.

01:04:08.360 | So one of the things that we've tried doing, and I'm always interested to hear more clever

01:04:13.920 | schemes for this, is you can kind of engineer the way that people read to try to get the

01:04:18.880 | effects that you want.

01:04:21.520 | So here's one, which is that if you want a little bit more conversationality, you want

01:04:26.000 | to get people out of that kind of humdrum dictation, you can start giving them reading

01:04:30.120 | material that's a little more exciting.

01:04:31.960 | You can give them movie scripts and books, and people will actually start voice acting

01:04:36.000 | for you.

01:04:37.000 | >> Creep in, said the witch, and see if it is properly heated so that we can put the

01:04:42.960 | bread in.

01:04:46.520 | >> So these are really wonderful workers; right?

01:04:48.600 | They're kind of really getting into it to give you better data.

01:04:59.320 | >> The wolf is dead.

01:05:00.600 | The wolf is dead and danced for joy around about the well with their mother.

01:05:08.600 | >> So you have people reading poetry.

01:05:10.200 | They get this sort of lyrical quality into it that you don't get from just reading the

01:05:14.120 | Wall Street Journal.

01:05:16.080 | And finally, there's something called the Lombard effect that happens when people are

01:05:20.400 | in noisy environments.

01:05:22.200 | So if you're in a noisy party and you're trying to talk to your friend who's a couple of chairs

01:05:26.640 | away, you'll catch yourself involuntarily going, "Hey, over there, what are you doing?"

01:05:31.760 | You raise your inflection, and you kind of -- you try to use different tactics to get

01:05:37.080 | your signal-to-noise ratio up.

01:05:39.200 | You'll sort of work around the channel problem.

01:05:43.040 | And so this is very problematic when you're trying to do transcription in a noisy environment

01:05:47.600 | because people will talk to their phones using all these effects, even though the noise canceling

01:05:52.120 | and everything could actually help them.

01:05:54.620 | So one strategy we've tried with varying levels of success --

01:05:57.760 | >> Then they fell asleep and evening passed, but no one came to the poor children.

01:06:02.920 | >> -- is to actually play loud noise in people's headphones to try to get them to elicit this

01:06:08.840 | behavior.

01:06:09.840 | So this person is kind of raising their voice a little bit in a way that they wouldn't if

01:06:14.160 | they were just reading.

01:06:17.080 | And similarly, as I mentioned, there are a whole bunch of different augmentation strategies.

01:06:23.000 | So there are all these effects of environment, like reverberation, echo, background noise,

01:06:28.640 | that we would like our speech engine to be robust to.

01:06:31.840 | And one way you could go about trying to solve this is to go collect a bunch of audio from

01:06:36.000 | those cases and then transcribe it, but getting that raw audio is really expensive.

01:06:41.440 | So instead, an alternative is to take the really cheap red speech that's very clean

01:06:46.560 | and use some, like, off-the-shelf open-source audio toolkit to synthesize all the things

01:06:55.360 | you want to be robust to.

01:06:57.800 | So for example, if we want to simulate noise in a cafe, here's just me talking to my laptop

01:07:05.240 | in a quiet room.

01:07:07.440 | Hello, how are you?

01:07:12.480 | So I'm just asking, how are you?

01:07:14.120 | And then here's the sound of a cafe.

01:07:19.200 | So I can obviously collect these independently, very cheaply.

01:07:22.840 | Then I can synthesize this by just adding these signals together.

01:07:25.800 | Hello, how are you?

01:07:28.000 | Which actually sounds, I don't know, sounds to me like my talking to my laptop at a Starbucks

01:07:32.440 | or something.

01:07:34.320 | And so for our work on deep speech, we actually take something like 10,000 hours of raw audio

01:07:39.640 | that sounds kind of like this, and then we pile on lots and lots of audio tracks from

01:07:45.160 | Creative Commons videos.

01:07:47.520 | It turns out there's a strange thing.

01:07:49.160 | People upload, like, noise tracks to the web that last for hours.

01:07:53.520 | It's, like, really soothing to listen to the highway or something.

01:07:57.960 | And so you can download all this free found data, and you can just overlay it on this

01:08:02.920 | voice, and you can synthesize perhaps hundreds of thousands of hours of unique audio.

01:08:07.520 | And so the idea here is that it's just much easier to engineer your data pipeline to be

01:08:15.320 | robust than it is to engineer the speech engine itself to be robust.

01:08:20.200 | So whenever you encounter an environment that you've never seen before and your speech engine

01:08:23.720 | is breaking down, you should shift your instinct away from trying to engineer the engine to

01:08:29.000 | fix it and toward this idea of how do I reproduce it really cheaply in my data.

01:08:35.320 | So here's that Wall Street Journal example again.

01:08:37.320 | >> A tanker is a ship designed to carry large volumes of oil or other liquid cargo.

01:08:43.080 | >> And so if I wanted to, for instance, deal with a person reading Wall Street Journal

01:08:47.960 | on a tanker, maybe something like this.

01:08:50.280 | >> A tanker is a ship designed to carry large volumes of oil or other liquid cargo.

01:08:54.640 | >> There's lots of reverb in this room, so you can't hear the reverb on the audio.

01:08:58.720 | But basically, you can synthesize these things with one line of socks on the command line.

01:09:05.300 | So from some of our own work with building a large-scale speech engine with these technologies,

01:09:11.580 | this helps a ton.

01:09:13.200 | And you can actually see that when we run on clean and noisy test utterances, as we

01:09:20.960 | add more and more data all the way up to about 10,000 hours and using a lot of these synthesis

01:09:27.800 | strategies, we can just steadily improve the performance of the engine.

01:09:32.080 | And in fact, on things like clean speech, you can get down well below 10% word error

01:09:37.640 | rate, which is a pretty strong engine.

01:09:42.480 | Okay.

01:09:44.160 | Let's talk about computation.

01:09:46.000 | Because the caveat on that last slide is, yes, more data will help if you have a big

01:09:51.480 | enough model.

01:09:52.640 | And big models usually mean lots of computation.

01:09:56.800 | So what I haven't talked about is how big are these neural networks and how big is one

01:10:01.040 | experiment.

01:10:02.040 | So if you actually want to train one of these things at scale, what are you in for?

01:10:06.000 | So here's the back of the envelope.

01:10:08.200 | It's going to take at least the number of connections in your neural network.

01:10:13.040 | So take one slice of that RNN, the number of unique connections, multiplied by the number

01:10:18.760 | of frames once you unroll the recurrent network, once you unfold it, multiplied by the number

01:10:24.000 | of utterances you've got to process in your dataset, times the number of training epics,

01:10:29.160 | the number of times you loop through the dataset, times 3, because you have to do forward prop,

01:10:34.080 | backward prop, and then a gradient update.

01:10:35.880 | It's about a factor of 3 increase.

01:10:38.400 | And then 2 flops for every connection, because there's a multiply and an add.

01:10:43.120 | So if you multiply this out for some parameters from the deep speech engine at Baidu, you

01:10:47.960 | get something like 1.2 times 10 to the 19 flops.

01:10:52.080 | It's about 10 exaflops.

01:10:54.840 | And if you run this on a Titan X card, this will take about a month.

01:10:59.800 | Now if you already know what the model is, that might be tolerable.

01:11:04.100 | If you're on your epic run to get your best performance so far, then this is OK.

01:11:09.380 | But if you don't know what model's going to work, you're targeting some new scenario,

01:11:12.840 | then you want it done now so that you can try lots and lots of models quickly.

01:11:17.400 | So the easy fix is just to try using a bunch more GPUs with data parallelism.

01:11:23.700 | And the good news is that so far, it looks like speech recognition allows us to use mini-batch

01:11:29.440 | sizes.

01:11:30.440 | We can process enough utterances in parallel that this is actually efficient.

01:11:35.080 | So you'd like to keep maybe a bit more than 64 utterances on each GPU, and up to a total

01:11:41.640 | mini-batch size of like 1,000 or maybe 2,000 is still useful.

01:11:47.140 | And so if you're putting together your infrastructure, you can go out and you can buy a server that'll

01:11:53.600 | fit eight of these Titan GPUs in them, and that'll actually get you to less than a week

01:11:57.600 | training time, which is pretty respectable.

01:12:01.480 | So there are a whole bunch of ways to use GPUs.

01:12:03.960 | If I do, we've been using synchronous SGD.

01:12:07.360 | It turns out that you've got to optimize things like all reduced code.

01:12:10.920 | Once you leave one node, you have to start worrying about your network.

01:12:15.520 | And if you want to keep scaling, then thinking about things like network traffic and the

01:12:19.920 | right strategy for moving all of your data becomes important.

01:12:24.680 | But we've had success scaling really well all the way out to things like 64 GPUs and

01:12:30.820 | just getting linear speedups all of the way.

01:12:33.140 | So if you've got a big cluster available, these things scale really well.

01:12:38.120 | And there are a bunch of other solutions.

01:12:39.440 | For instance, asynchronous SGD is now kind of a mainstay of distributed deep learning.

01:12:44.980 | There's also been some work recently of trying to go back to synchronous SGD that has a lot

01:12:48.640 | of nice properties, but using things like backup workers.

01:12:53.600 | So that's sort of the easy thing.

01:12:56.640 | Just throw more GPUs at it and go faster.

01:12:59.360 | One word of warning as you're trying to build these systems is to watch for code that isn't

01:13:06.660 | as optimized as you expected it to be.

01:13:10.240 | And so this back-of-the-envelope calculation that we did of figuring out how many flops

01:13:15.600 | are involved in our network and then calculating how long it would take to run if our GPU were

01:13:21.800 | running at full efficiency, you should actually do this for your network.

01:13:26.000 | We call this the speed of light.

01:13:27.720 | This is the fastest your code could ever run on one GPU.

01:13:31.580 | And if you find that you're just drastically underperforming that number, what could be

01:13:36.840 | happening to you is that you've hit a little edge case in one of the libraries that you're

01:13:41.860 | using and you're actually suffering a huge setback that you don't need to be feeling

01:13:45.880 | right now.

01:13:46.880 | So one of the things we found back in November is that in libraries like Kublai's, you can

01:13:51.620 | actually use mini batch sizes that hit these weird catastrophic cases in the library, where

01:13:57.680 | you could be suffering like a factor of two or three performance reduction.

01:14:02.580 | So that might take your wonderful one-week training time and blow it up to, say, a three-week

01:14:07.460 | training time.

01:14:09.420 | So that's why I wanted to go through this and ask you to keep in mind while you're training

01:14:14.100 | these things, try to figure out how long it ought to be taking.

01:14:17.900 | And if it's going a lot slower, be suspicious that there's some code you could be optimizing.

01:14:24.260 | Another good trick that's particular to speech, you can also use this for other recurrent

01:14:29.980 | tricks, is to try to keep similar length utterances together.

01:14:34.900 | So if you look at your dataset, like a lot of things, you have this sort of distribution

01:14:40.700 | over possible utterance lengths.

01:14:43.400 | And so you see there's a whole bunch that are, you know, maybe within about 50% of each

01:14:48.380 | other, but there's also a large number of utterances that are very short.

01:14:52.740 | And so what happens is when we want to process a whole bunch of these utterances in parallel,

01:14:58.580 | if we just randomly select, say, 1,000 utterances to go into a mini-batch, there's a high probability

01:15:05.620 | that we're going to get a whole bunch of these little short utterances along with some really

01:15:09.800 | long utterances.

01:15:11.560 | And in order to make all the CTC libraries work and all of our recurrent network computations

01:15:16.060 | easy, what we have to do is pad these audio signals with zero.

01:15:20.300 | And that winds up meaning that we're wasting huge amounts of computation, maybe a factor

01:15:24.300 | of two or more.

01:15:26.560 | And so one way to get around it is just sort all of your utterances by length and then

01:15:32.060 | try to keep the mini-batches to be similar lengths so that you just don't end up with

01:15:36.660 | quite as much waste in each mini-batch.

01:15:39.940 | And this kind of modifies your algorithm a little bit, but in the end is worthwhile.

01:15:46.140 | All right.

01:15:47.140 | So that's kind of all I want to say about computation.

01:15:50.740 | If you've got a few GPUs, keep an eye on your running time so that you know what to optimize

01:15:56.800 | and pay attention to the easy wins, like keeping your utterances together.

01:16:00.660 | You can actually scale really well.

01:16:03.080 | And I think for a lot of the jobs we see, you can have your GPU running at something

01:16:08.820 | like 50% of the peak.

01:16:11.200 | And that's all in.

01:16:12.200 | With network time, with all the bandwidth-bound stuff, you can actually run at two to three

01:16:16.460 | teraflops on a GPU that can only do five teraflops in the perfect case.

01:16:22.860 | So what can you actually do with this?

01:16:25.940 | One of my favorite results from one of our largest models is actually in Mandarin.

01:16:30.220 | So we have a whole bunch of labeled Mandarin data at Baidu.

01:16:34.160 | And so one of the things that we did was we scaled up this model, trained it on a huge

01:16:37.720 | amount of Mandarin data, and then, as we always do, we sit down and we do error analysis.

01:16:44.200 | And what we would do is have a whole bunch of humans sitting around, try to debate the

01:16:50.560 | transcriptions and figure out the ground truth that tend to be very high quality.

01:16:55.100 | And then we'd go and we'd run now a sort of holdout test on some new people and on the

01:16:59.940 | speech engine itself.

01:17:01.740 | And so if you benchmark a single human being against this deep speech engine in Mandarin

01:17:08.480 | that's powered by all the technologies we were just talking about, it turns out that

01:17:13.540 | the speech engine can get an error rate that's down below 6% character error rate.

01:17:18.820 | So only about 6% of the characters are wrong.

01:17:21.500 | And a single human sitting there listening to these transcriptions actually does quite

01:17:25.240 | a bit worse.

01:17:26.240 | It gets almost 10%.

01:17:29.380 | If you give people a bit of an advantage, which is you now assemble a committee of people

01:17:36.300 | and you get them a fresh test set so that no one has seen it before and we run this

01:17:40.220 | test again, it turns out that the two engines, or that the two cases are actually really

01:17:46.260 | similar.

01:17:47.260 | And you can end up with a committee of native Mandarin speakers sitting around debating,

01:17:50.620 | "No, no, I think this person said this," or "No, they have an accent.

01:17:54.380 | It's from the north.

01:17:55.380 | I think they're actually saying that."

01:17:57.700 | And then when you show them the deep speech transcription, they actually go, "Oh, that's

01:18:02.140 | what it was."

01:18:04.060 | And so you can actually get this technology up to a point where it's highly competitive

01:18:09.540 | with human beings, even human beings working together.

01:18:12.620 | And this is sort of where I think all of the speech recognition systems are heading, thanks

01:18:17.220 | to deep learning and the technologies that we're talking about here.

01:18:22.080 | Any questions so far?

01:18:24.100 | Yeah, go ahead.

01:18:26.540 | So how do you know the actual label of the data?

01:18:33.540 | Yep.

01:18:34.540 | Sorry?

01:18:35.540 | Repeat the question.

01:18:36.540 | Yeah.

01:18:37.540 | So the question is, if humans have such a hard time coming up with the correct transcription,

01:18:38.660 | how do you know what the truth is?

01:18:40.500 | And the real answer is you don't really.

01:18:43.720 | Sometimes you might have a little bit of user feedback, but in this instance, we have very

01:18:48.140 | high-quality transcriptions that are coming from many labelers teamed up with a speech

01:18:52.580 | engine.

01:18:54.260 | And so that could be wrong.

01:18:56.540 | We do occasionally find errors where we just think that's a label error.

01:19:00.660 | But when you have a committee of humans around, the really astonishing thing is that you can

01:19:05.100 | look at the output of the speech engines, and the humans will suddenly jump ship and

01:19:09.980 | say, oh, no, no, no, no.

01:19:11.340 | The speech engine is actually correct, because it'll often come up with an obscure word or

01:19:15.380 | place that they weren't aware of.

01:19:18.860 | Once they see the label, can they be biased towards that label?

01:19:23.300 | Yeah.

01:19:24.300 | So this is an inherently ambiguous result.

01:19:26.980 | But let's say that a committee of human beings tend to disagree with another committee of

01:19:31.980 | human beings about the same amount as a speech engine does.

01:19:35.660 | Yeah.

01:19:36.660 | So this is basically doing a sequence-to-sequence sort of task, right?

01:19:42.660 | So we're going to hear about a really different approach to that later.

01:19:49.660 | Can you say anything about the -- Yeah.

01:19:50.660 | So this is using the CTC cost, right?

01:19:53.780 | That's really the core component of this system.

01:19:56.260 | It's how you deal with mapping one variable-length sequence to another.

01:20:01.140 | When the CTC cost is not perfect, it has this assumption of independence baked into the

01:20:06.660 | probabilistic model.

01:20:08.660 | And because of that assumption, we're introducing some bias into the system.

01:20:12.900 | And for languages like English, where the characters are obviously not independent of

01:20:17.700 | each other, this might be a limitation.

01:20:20.660 | In practice, the thing that we see is that as you add a lot of data and your model gets

01:20:25.020 | much more powerful, you can still find your way around it, but it might take more data

01:20:29.940 | and a bigger model than necessary.

01:20:32.500 | And of course, we hope that all the new state-of-the-art methods coming out of the deep learning community

01:20:36.380 | are going to give us an even better solution.

01:20:38.780 | Okay.

01:20:39.780 | Go ahead.

01:20:40.780 | In this spectrogram, you're saying that there's a 20 milliseconds sample that you take.

01:20:51.780 | Is there a reason for the prediction that you can have a bigger or smaller --

01:20:52.780 | Empirically determined.

01:20:53.780 | Yeah.

01:20:54.780 | So the question is, for a spectrogram with -- We talked about these little spectrogram

01:20:58.380 | frames being computed from 20 milliseconds of audio.

01:21:01.180 | And is that number special?

01:21:02.180 | Is there a reason for it?

01:21:05.180 | So this is really determined from years and years of experience.

01:21:08.380 | This is captured from the traditional speech community.

01:21:12.180 | We know this works pretty well.

01:21:14.020 | There's actually some fun things you can do.

01:21:15.980 | You can take a spectrogram, go back and find the best audio that corresponds to that spectrogram

01:21:22.860 | to listen to it and see if you lost anything.

01:21:25.700 | And spectrograms of about this level of quantization, you can kind of tell what people are saying.

01:21:30.860 | It's a little bit garbled, but it's still actually pretty good.

01:21:34.580 | So amongst all the hyperparameters you could choose, this one's kind of a good tradeoff

01:21:38.420 | in keeping the information, but also saving a little bit of the phase by doing it frequently.

01:21:44.300 | Yeah.

01:21:45.300 | Are you doing overlapping things?

01:21:50.220 | I think in a lot of the models in the demo, for example, we don't use overlapping windows.

01:21:55.660 | They're just adjacent.

01:21:56.660 | Yeah.

01:21:57.660 | You mentioned you get linear scale up across CPU and GPUs.

01:21:58.660 | Is that, does it really matter?

01:21:59.660 | Like, what do you get when you do that?

01:22:00.660 | Yeah.

01:22:01.660 | So those results are from in-house software at Baidu.

01:22:11.540 | If you use something like OpenMPI, for example, on a cluster of GPUs, it actually works pretty

01:22:17.700 | well on a bunch of machines.

01:22:21.540 | But I think some of the algorithms all reduce once you start moving huge amounts of data.

01:22:27.500 | They're not optimal.

01:22:28.580 | You'll suffer a hit once you start going to that many GPUs.

01:22:32.900 | Within a single box, if you use the CUDA libraries to move data back and forth just on a local

01:22:40.660 | box, that stuff is pretty well optimized, and you can often do it yourself.

01:22:46.020 | Okay.

01:22:47.020 | So I want to take a few more questions at the end, and maybe we can run into the break

01:22:51.100 | a little bit.

01:22:52.100 | I wanted to just dive right through a few comments about production here.

01:22:57.940 | So of course, the ultimate goal of solving speech recognition is to improve people's

01:23:05.860 | lives and enable exciting products, and so that means even though so far we've trained

01:23:11.260 | a bunch of acoustic and language models, we also want to get these things in production.

01:23:16.340 | And users tend to care about more than just accuracy.

01:23:19.700 | Accuracy of course matters a lot, but we also care about things like latency.

01:23:23.740 | Users want to see the engine send them some feedback very quickly so that they know that

01:23:27.900 | it's responding and that it's understanding what they're saying.

01:23:31.460 | And we also need this to be economical so that we can serve lots of users without breaking

01:23:35.620 | the bank.

01:23:37.220 | So in practice, a lot of the neural networks that we use in research papers, because they're

01:23:41.140 | awesome for beating benchmark results, turn out not to work that well on a production

01:23:46.260 | engine.

01:23:47.260 | So one in particular that I think is worth keeping an eye on is that it's really common

01:23:52.740 | to use bidirectional recurrent neural networks.

01:23:55.900 | And so throughout the talk, I've been drawing my RNN with connections that just go forward

01:24:00.420 | in time, but you'll see a lot of research results that also have a path that goes backward

01:24:05.780 | in time.

01:24:07.140 | And this works fine if you just want to process data offline.

01:24:11.500 | But the problem is that if I want to compute this neuron's output up at the top of my network,

01:24:16.580 | I have to wait until I see the entire audio segment so that I can compute this backward

01:24:21.620 | recurrence and get this response.

01:24:24.660 | So this sort of anti-causal part of my neural network that gets to see the future means

01:24:29.900 | that I can't respond to a user on the fly because I need to wait for the end of their

01:24:34.260 | signal.

01:24:36.380 | So if you start out with these bidirectional RNNs that are actually much easier to get

01:24:41.700 | working and then you jump to using a recurrent network that is forward only, it'll turn out

01:24:47.660 | that you're going to lose some accuracy.

01:24:50.420 | And you might kind of hope that CTC, because it doesn't care about the alignment, would

01:24:55.060 | somehow magically learn to shift the output over to get better accuracy and just artificially

01:25:01.620 | delay the response so that it could get more context on its own.

01:25:05.620 | But it kind of turns out to only do that a little bit in practice.

01:25:09.900 | It's really tough to control it.

01:25:11.500 | And so if you find that you're doing much worse, sometimes you have to sort of engage

01:25:15.620 | in model engineering.

01:25:17.620 | So even though I've been talking about these recurrent networks, I want you to bear in

01:25:20.620 | mind that there's this dual optimization going on.

01:25:24.940 | You want to find a model structure that gives you really good accuracy, but you also have

01:25:28.780 | to think carefully about how you set up the structure so that this little neuron at the

01:25:33.460 | top can actually see enough context to get an accurate answer and not depend too much

01:25:39.620 | on the future.

01:25:41.420 | So for example, what we could do is tweak this model so that this neuron at the top

01:25:46.660 | that's trying to output the character L in hello can see some future frames, but it doesn't

01:25:53.420 | have this backward recurrence.

01:25:55.040 | So it only gets to see a little bit of context.

01:25:57.660 | That lets us kind of contain the amount of latency in the model.

01:26:02.380 | I'm going to skip over this.

01:26:04.860 | So in terms of other online aspects, of course, we want this to be efficient.

01:26:12.380 | We want to serve lots of users on a small number of machines if possible.

01:26:17.260 | And one of the things that you might find if you have a really big deep neural network

01:26:22.020 | or recurrent neural network is that it's really hard to deploy them on conventional CPUs.

01:26:27.300 | CPUs are awesome for serial jobs.

01:26:30.940 | You just want to go as fast as you can for this one string of instructions.

01:26:35.660 | But as we've discovered with so much of deep learning, GPUs are really fantastic because

01:26:40.420 | when we work with neural networks, we love processing lots and lots of arithmetic in

01:26:44.580 | parallel.

01:26:46.080 | But it's really only efficient if the batch that we're working on, the hunks of audio

01:26:50.300 | that we're working on, are in a big enough batch.

01:26:56.280 | So if we just process one stream of audio so that my GPU is multiplying matrices times

01:27:00.900 | vectors, then my GPU is going to be really inefficient.

01:27:05.300 | So for example, on like a K1200 GPU, so something you could put in a server in the cloud, what

01:27:12.100 | you'll find is that you get really poor throughput considering the dollar value of this hardware

01:27:19.240 | if you're only processing one piece of audio at a time.

01:27:22.020 | Whereas if you could somehow batch up audio to have, say, 10 or 32 streams going at once,

01:27:28.480 | then you can actually squeeze out a lot more performance from that piece of hardware.

01:27:34.180 | So one of the things that we've been working on that works really well and is not too bad

01:27:39.200 | to implement is to just batch all of the packets as data comes in.

01:27:43.700 | So if I have a whole bunch of users talking to my server and they're sending me little

01:27:47.620 | hundred millisecond packets of audio, what I can do is I can sit and I can listen to

01:27:53.500 | all these users, and when I catch a whole batch of utterances coming in or a whole bunch

01:27:58.480 | of audio packets coming in from different people that start around the same time, I

01:28:03.580 | plug those all into my GPU and I process those matrix multiplications together.

01:28:08.780 | So instead of multiplying a matrix times only one little audio piece, I get to multiply

01:28:12.800 | it by a batch of, say, four audio pieces, and it's much more efficient.

01:28:18.420 | And if you actually do this on a live server and you plow a whole bunch of audio streams

01:28:23.140 | through it, you could support maybe 10, 20, 30 users in parallel, and as the load on that

01:28:29.300 | server goes up, I have more and more users piling on, what happens is that the GPU will

01:28:34.420 | naturally start batching up more and more packets into single matrix multiplications.

01:28:40.140 | So as you get more users, you actually get much more efficient as well.

01:28:45.660 | And so in practice, when you have a whole bunch of users on one machine, you usually

01:28:49.660 | don't see matrix multiplications happening with fewer than maybe batch sizes of four.

01:28:56.660 | So the summary of all of this is that deep learning is really making the first steps

01:29:04.060 | to building a state-of-the-art speech engine easier than they've ever been.

01:29:07.220 | So if you want to build a new state-of-the-art speech engine for some new language, all of

01:29:11.420 | the components that you need are things that we've covered so far.

01:29:15.860 | And the performance now is really significantly driven by data and models, and I think, as

01:29:20.740 | we were discussing earlier, I think future models from deep learning are going to make

01:29:24.420 | that influence of data and computing power even stronger.

01:29:29.740 | And of course, data and compute is important so that we can try lots and lots of models

01:29:34.300 | and keep making progress.

01:29:36.460 | And I think this technology is now at a stage where it's not just a research system anymore.

01:29:42.420 | We're seeing that the end-to-end deep learning technologies are now mature enough that we

01:29:46.700 | can get them into productions.

01:29:48.180 | I think you guys are going to be seeing deep learning play a bigger, bigger role in the

01:29:52.300 | speech engines that are powering all the devices that we use.

01:29:55.460 | So thank you very much.

01:29:56.460 | [APPLAUSE]

01:29:57.460 | So I think we're right at the end of time.

01:30:05.660 | [INAUDIBLE]

01:30:06.660 | Sounds good.

01:30:07.660 | All right, we had one in the back who was waiting patiently.

01:30:10.660 | Go ahead.

01:30:11.660 | [INAUDIBLE]

01:30:12.660 | More than one voice simultaneously?

01:30:19.780 | So the question is, how does the engine handle more than one voice simultaneously?

01:30:24.580 | So right now, there's nothing in this formalism that allows you to account for multiple speakers.

01:30:32.220 | And so usually, when you listen to an audio clip in practice, it's clear that there's

01:30:37.500 | one dominant speaker.

01:30:39.760 | And so this speech engine, of course, learns whatever it was taught from the labels.

01:30:44.540 | And it will try to filter out background speakers and just transcribe the dominant one.

01:30:49.580 | But if it's really ambiguous, then undefined results.

01:30:53.860 | Can you customize the transcription to the specific characteristics of a particular speaker?

01:31:04.260 | So we're not doing that in these pipelines right now.

01:31:08.780 | But of course, a lot of different strategies have been developed in the traditional speech

01:31:14.420 | literature.

01:31:15.420 | There are things like iVectors that try to quantify someone's voice.

01:31:18.420 | And those make useful features for improving speech engines.

01:31:21.980 | You could also imagine taking a lot of the concepts like embeddings, for example, and

01:31:26.540 | tossing them in here.

01:31:28.380 | So I think a lot of that is left open to future work.

01:31:31.740 | Adam, question.

01:31:35.100 | I think we have to break for time.

01:31:37.220 | But I'll step off stage here.

01:31:39.820 | And you guys can come to me with your questions.

01:31:41.540 | Thank you so much.

01:31:42.540 | [APPLAUSE]

01:31:43.540 | Thanks, Adam.

01:31:44.540 | So we'll reconvene at 2.45 for a presentation by Alex.

Deep Learning for Speech Recognition (Adam Coates, Baidu)

Chapters