
Deep Learning for Speech Recognition (Adam Coates, Baidu)


Chapters

0:15 Speech recognition
4:47 Traditional ASR pipeline
10:49 Deep Learning in ASR
14:59 Scale model
15:49 Outline
17:23 Raw audio
18:28 Pre-processing
19:46 Spectrogram
21:57 Acoustic Model
35:34 Connectionist Temporal Classification (CTC)
39:07 Training tricks
48:13 Max Decoding
52:19 Language models
58:43 Decoding with LMs: Examples
59:14 Rescoring


00:00:00.000 | So I want to tell you guys about speech recognition and deep learning.
00:00:04.840 | I think deep learning has been playing an increasingly large role in speech recognition.
00:00:10.360 | And one of the things I think is most exciting about this field is that speech recognition
00:00:15.040 | is at a place right now where it's becoming good enough to enable really exciting applications
00:00:20.480 | that end up in the hands of users.
00:00:23.800 | So for example, if we want to caption video content and make it accessible to everyone,
00:00:28.680 | it used to be that we would sort of try to do this, but you still need a human to get
00:00:32.960 | really good captioning for something like a lecture.
00:00:36.480 | But it's possible that we can do a lot of this with higher quality in the future with
00:00:39.880 | deep learning.
00:00:40.920 | We can do things like hands-free interfaces in cars, make it safer to use technology while
00:00:46.040 | we're on the go and keep people's eyes on the road.
00:00:48.040 | Of course, it would make mobile devices, home devices much easier, much more efficient and
00:00:53.460 | enjoyable to use.
00:00:56.040 | But another actually sort of fun recent study that some folks at Baidu participated in,
00:01:02.680 | along with Stanford and UW, was to show that for even something straightforward that we
00:01:06.640 | sort of take for granted as an application of speech, which is just texting someone with
00:01:11.960 | voice or writing a piece of text, the study showed that you can actually go three times
00:01:16.720 | faster with voice recognition systems that are available today.
00:01:20.680 | So it's not just like a little bit faster now, even with the errors that a speech recognition
00:01:25.520 | system can make.
00:01:26.520 | It's actually a lot faster.
00:01:28.800 | And the reason I wanted to highlight this result, which is pretty recent, is that the
00:01:34.320 | speech engine that was used for this study is actually powered by a lot of the deep learning
00:01:39.040 | methods that I'm going to tell you about.
00:01:40.840 | So hopefully when you walk away today, you have an appreciation or an understanding of
00:01:45.120 | the sort of high-level ideas that make a result like this possible.
00:01:50.560 | So there are a whole bunch of different components that make up a complete speech application.
00:01:57.280 | So for example, there's speech transcription.
00:02:00.360 | So if I just talk, I want to come up with words that represent whatever I just said.
00:02:07.740 | There's also other tasks, though, like word spotting or triggering.
00:02:11.080 | So for example, if my phone is sitting over there and I want to say, "Hey, phone, go do
00:02:14.560 | something for me," it actually has to be listening continuously for me to say that word.
00:02:20.240 | And likewise, there are things like speaker identification or verification, so that if
00:02:25.120 | I want to authenticate myself or I want to be able to tell apart different users in a
00:02:28.920 | room, I've got to be able to recognize your voice, even though I don't know what you're
00:02:32.440 | saying.
00:02:33.440 | So these are different tasks.
00:02:34.440 | I'm not going to cover all of them today.
00:02:36.720 | Instead, I'm going to just focus on the bread and butter of speech recognition.
00:02:40.760 | We're going to focus on building a speech engine that can accurately transcribe audio
00:02:45.820 | into words.
00:02:47.420 | So that's our main goal.
00:02:48.880 | This is a very basic goal of artificial intelligence.
00:02:53.680 | Historically, people are very, very good at listening to someone talk, just like you guys
00:02:59.680 | are listening to me right now.
00:03:01.520 | And you can very quickly turn audio into words and into meaning on your own, almost effortlessly.
00:03:10.280 | And for machines, this has historically been incredibly hard.
00:03:13.480 | So you think of this as like one of those sort of consummate AI tasks.
00:03:18.160 | So the goal of building a speech pipeline is, if you just give me a raw audio wave,
00:03:23.040 | like you recorded on your laptop or your cell phone, I want to somehow build a speech recognizer
00:03:28.360 | that can do this very simple task of printing out "Hello, world" when I actually say "Hello,
00:03:33.920 | world."
00:03:35.040 | So before I dig into the deep learning part, I want to step back a little bit and spend
00:03:42.320 | maybe 10 minutes talking about how a traditional speech recognition pipeline is working, for
00:03:48.160 | two reasons.
00:03:49.600 | If you're out in the wild, you're doing an internship, you're trying to build a speech
00:03:54.800 | recognition system with a lot of the tools that are out there, you're going to bump into
00:03:59.160 | a lot of systems that are built on technologies that look like this.
00:04:02.880 | So I want you to understand a little bit of the vocabulary and how those things are put
00:04:07.040 | together.
00:04:08.200 | And also, this will sort of give you a story for what deep learning is doing in speech
00:04:13.440 | recognition today that is kind of special and that I think paves the way for much bigger
00:04:20.140 | results in the future.
00:04:22.900 | So traditional systems break the problem of converting an audio wave, of taking audio
00:04:31.180 | and turning it into a transcription, into a bunch of different pieces.
00:04:36.280 | So I'm going to start out with my raw audio, and I'm just going to represent that by X.
00:04:43.160 | And then usually we have to decide on some kind of feature representation.
00:04:47.280 | We have to convert this into some other form that's easier to deal with than a raw audio
00:04:52.380 | wave.
00:04:54.160 | And in a traditional speech system, I often have something called an acoustic model.
00:04:58.300 | And the job of the acoustic model is to learn the relationship between these features that
00:05:04.320 | represent my audio and the words that someone is trying to say.
00:05:10.040 | And then I'll often have a language model, which encapsulates all of my knowledge about
00:05:14.520 | what kinds of words, what spellings and what combinations of words are most likely in the
00:05:19.960 | language that I'm trying to transcribe.
00:05:22.640 | And once you have all of these pieces, so these might be -- these different models might
00:05:27.160 | be driven by machine learning themselves, what you would need to build in a traditional
00:05:31.160 | system is something called a decoder.
00:05:34.240 | And the job of a decoder, which itself might involve some modeling efforts and machine
00:05:39.160 | learning algorithms, is to find the sequence of words W that maximizes this probability.
00:05:47.780 | The probability of the particular sequence W, given your audio.
00:05:51.720 | That's straightforward.
00:05:53.460 | But that's equivalent to maximizing the product of the contributions from your acoustic model
00:05:58.560 | and from your language model.
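In symbols, the decoder's search just described, with the acoustic model and language model factored out by Bayes' rule (the notation here is ours, not from the slides):

```latex
W^{*} = \arg\max_{W} P(W \mid X)
      = \arg\max_{W} \; \underbrace{P(X \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```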
00:06:00.840 | So a traditional speech system is broken down into these pieces, and a lot of the effort
00:06:05.180 | in getting that system to work is in developing this sort of portion that combines them all.
00:06:12.880 | So it turns out that if you want to just directly transcribe audio, you can't just go straight
00:06:19.120 | to characters.
00:06:20.720 | And the reason is, and it's especially apparent in English, that the way something is spelled
00:06:26.040 | in characters doesn't always correspond well to the way that it sounds.
00:06:30.700 | So if I give you the word "night," for example, without context, you don't really know whether
00:06:36.280 | I'm talking about a night in armor or whether I'm talking about night like an evening.
00:06:41.520 | And so a way to get around this, to abstract this problem away from a traditional system,
00:06:46.720 | is to replace this with a sort of intermediate representation.
00:06:50.920 | Instead of trying to predict characters, I'll just try to predict something called phonemes.
00:06:55.640 | So as an example, if I want to represent the word "hello," what I might try to do is break
00:07:01.520 | it down into these units of sound.
00:07:04.580 | So the first one is like the "h," that H sound in "hello," and then an "uh" sound, which
00:07:10.040 | is actually only one possible pronunciation of an E, and then an L and an O sound.
00:07:16.120 | And that would be my string that I try to come up with using all of my different speech
00:07:22.040 | components.
00:07:24.120 | So this, in one sense, makes the modeling problem easier.
00:07:27.680 | My acoustic model and so on can be simpler, because I don't have to worry about spelling.
00:07:33.080 | But it does have this problem that I have to think about where these things come from.
00:07:38.180 | So these phonemes are intuitively, they're the perceptually distinct units of sound that
00:07:45.920 | we can use to distinguish words.
00:07:48.640 | And they're very approximate.
00:07:51.680 | This might be our imagination that these things actually exist.
00:07:55.240 | It's not clear how fundamental this is.
00:07:57.880 | But they're sort of standardized.
00:07:58.880 | There are a bunch of different conventions for how to define these.
00:08:05.320 | And if you end up working on a system that uses phonemes, one popular data set is called
00:08:10.760 | TIMIT.
00:08:12.240 | And so this actually has a corpus of audio frames with examples of each of these phonemes.
00:08:19.640 | So once you have this phoneme representation, unfortunately, it adds even more complexity
00:08:28.120 | to this traditional pipeline.
00:08:30.580 | Because now, my acoustic model doesn't associate this audio feature with words.
00:08:35.720 | It actually associates them with another kind of transcription, with the transcription into
00:08:39.720 | phonemes.
00:08:41.000 | And so I have to introduce yet another component into my pipeline that tries to understand
00:08:46.680 | how do I convert the transcriptions in phonemes into actual spellings.
00:08:51.760 | And so I need some kind of dictionary or a lexicon to tell me all of that.
00:08:56.800 | So this is a way of taking our knowledge about a language and baking it into this engineered
00:09:01.660 | pipeline.
00:09:03.560 | And then once you've got all that, again, all of your work now goes into this decoder
00:09:08.560 | that has a slightly more complicated task in order to infer the most likely word transcription
00:09:14.560 | given the audio.
00:09:17.420 | So this is a tried and true pipeline.
00:09:20.040 | It's been around for a long time.
00:09:22.480 | You'll see a whole bunch of these systems out there.
00:09:25.660 | And we're still using a lot of the vocabulary from these systems.
00:09:30.920 | But traditionally, the big advantage is that it's very tweakable.
00:09:34.800 | If you want to go add a new pronunciation for a word you've never heard before, you
00:09:38.920 | can just drop it right in.
00:09:40.040 | That's great.
00:09:42.340 | But it's also really hard to get working well.
00:09:44.920 | If you start from scratch with this system and you have no experience in speech recognition,
00:09:50.360 | it's actually quite confusing and hard to debug.
00:09:53.240 | It's very difficult to know which of these various models is the one that's behind your
00:09:58.000 | error.
00:09:59.000 | And especially once we start dealing with things like accents, heavy noise, different
00:10:03.480 | kinds of ambiguity, that makes the problem even harder to engineer around.
00:10:08.080 | Because trying to think ourselves about how do I tweak my pronunciation model, for example,
00:10:13.540 | to account for someone's accent that I haven't heard, that's a very hard engineering judgment
00:10:18.180 | for us to make.
00:10:20.180 | So there are all kinds of design decisions that go into this pipeline, like choosing
00:10:24.800 | the feature representation, for example.
00:10:28.100 | So the first place that deep learning has started to make an impact in speech recognition,
00:10:35.260 | starting a few years ago, is to just take one of the core machine learning components
00:10:40.780 | of the system and replace it with a deep learning algorithm.
00:10:44.660 | So I mentioned back in this previous pipeline that we had this little model here whose job
00:10:50.500 | is to learn the relationship between a sequence of phonemes and the audio that we're hearing.
00:10:56.460 | So this is called the acoustic model.
00:10:59.500 | And there are lots of different methods for training this thing.
00:11:03.140 | So take your favorite machine learning algorithm.
00:11:06.140 | You can probably find someone who has trained an acoustic model with that algorithm, whether
00:11:09.740 | it's a Gaussian mixture model or a bunch of decision trees and random forests, anything
00:11:15.200 | for estimating these kinds of densities.
00:11:17.980 | There's a lot of work in trying to make better acoustic models.
00:11:21.700 | So some work by George Dahl and co-authors took what was a state of the art deep learning
00:11:28.940 | system back in 2011, which is a deep belief network with some pre-training strategies,
00:11:35.420 | and dropped it into a state of the art pipeline in place of this acoustic model.
00:11:41.140 | And the results are actually pretty striking, because even though we had neural networks
00:11:46.460 | and these pipelines for a while, what ended up happening is that when you replace the
00:11:52.140 | Gaussian mixture model and HMM system that already existed with this deep belief network
00:11:58.380 | as an acoustic model, you actually got something between like a 10% and 20% relative improvement
00:12:03.300 | in accuracy, which is a huge jump.
00:12:06.320 | This is highly noticeable to a person.
00:12:09.140 | And if you compare this to the amount of progress that had been made in preceding years, this
00:12:14.940 | is a giant leap for a single paper to make, compared to progress we'd been able to make
00:12:21.020 | previously.
00:12:22.780 | So this is in some sense the first generation of deep learning for speech recognition, which
00:12:28.380 | is I take one of these components and I swap it out for my favorite deep learning algorithm.
00:12:37.060 | So the picture looks sort of like this.
00:12:40.480 | So with these traditional speech recognition pipelines, the problem that we would always
00:12:46.180 | run into is that if you gave me a lot more data, you gave me a much bigger computer so
00:12:51.540 | that I could train a huge model, that actually didn't help me because all the problems I
00:12:56.420 | had were in the construction of this pipeline.
00:13:00.580 | And so eventually, if you gave me more data and a bigger computer, the performance of
00:13:04.780 | our speech recognition system would just kind of peter out.
00:13:08.340 | It would just reach a ceiling that was very hard to get over.
00:13:11.580 | And so we just start coming up with lots of different strategies.
00:13:14.180 | We start specializing for each application.
00:13:16.740 | We try to specialize for each user and try to make things a little bit better around
00:13:21.220 | the edges.
00:13:22.500 | And what these deep learning acoustic models did was in some sense moved that barrier a
00:13:28.540 | little ways.
00:13:29.740 | It made it possible for us to take a bit more data, much faster computers that let us try
00:13:34.860 | a whole lot of models, and move that ceiling up quite a ways.
00:13:40.300 | So the question that many in the research community, including folks at Baidu, have
00:13:44.780 | been trying to answer is, can we go to a next generation version of this insight?
00:13:51.780 | Can we, for instance, build a speech engine that is powered by deep learning all the way
00:13:56.740 | from the audio input to the transcription itself?
00:14:00.740 | Can we replace as much of that traditional system with deep learning as possible so that
00:14:05.380 | over time, as you give researchers more data and bigger computers and the ability to try
00:14:11.620 | more models, their speech recognition performance just keeps going up and we can potentially
00:14:16.340 | solve speech for everybody?
00:14:18.880 | So the goal of this tutorial is not to get you up here, which requires a whole bunch
00:14:26.180 | of things that I'll tell you about near the end.
00:14:29.100 | But what we want to try to do is give you enough to get a point on this curve.
00:14:33.460 | And then once you're on the curve, the idea is that what remains is now a problem of scale.
00:14:40.740 | It's about data and about getting bigger computers and coming up with ways to build bigger models.
00:14:47.500 | So that's my objective, so that when you walk away from here, you have a picture of what
00:14:51.700 | you would need to build to get this point.
00:14:54.900 | And then after that, it's hopefully all about scale.
00:14:59.180 | So thanks to Vinay Rao, who's been helping put this tutorial together, there is going
00:15:04.580 | to be some starter code live for the basic pipeline, the deep learning part of the pipeline
00:15:10.540 | that we're talking about.
00:15:12.300 | So there are some open source implementations of things like CTC, but we wanted to make
00:15:18.180 | sure that there's a system out there that's pretty representative of the acoustic models
00:15:22.220 | that I'm going to be talking about in the first half of the presentation here.
00:15:27.100 | So this will be enough that you can get a simple pipeline going with something called
00:15:31.760 | max decoding, which I'll tell you about later.
00:15:34.780 | And the idea is that this is sort of a scale model of the acoustic models that Baidu and
00:15:40.060 | other places are powering real production speech engines.
00:15:44.260 | So this will get you that point on the curve.
00:15:48.020 | Okay.
00:15:49.740 | So here's what we're going to talk about.
00:15:52.860 | The first part, I'm just going to introduce a few preliminaries, talk about preprocessing.
00:15:57.360 | So we still have a little bit of preprocessing around, but it's not really fundamental.
00:16:01.460 | I think it's probably going to go away in the long run.
00:16:04.660 | We'll talk about what is probably the most mature piece of sequence learning technologies
00:16:11.940 | for deep learning right now.
00:16:13.500 | So it turns out that one of the fundamental problems of doing speech recognition is how
00:16:17.740 | do I build a neural network that can map this audio signal to a transcription that can have
00:16:23.060 | a quite variable length.
00:16:25.580 | And so CTC is one highly mature method for doing this.
00:16:29.580 | And I think you're actually going to hear about maybe some other solutions later today.
00:16:33.820 | Then I'll say a little bit about training and just what that looks like.
00:16:40.020 | And then finally say a bit about decoding and language models, which is sort of an addendum
00:16:45.060 | to the current acoustic models that we can build that make them perform a lot better.
00:16:50.580 | And then once you have this, that's a picture of what you need to get this point on the
00:16:55.580 | curve.
00:16:56.660 | And then I'll talk a little bit about what's remaining.
00:16:59.300 | How do you scale up from this little scale model up to the full thing?
00:17:04.180 | What does that actually entail?
00:17:05.860 | And then time permitting, we'll talk a little bit about production.
00:17:08.740 | How could you put something like this into a cloud server and actually serve real users
00:17:13.540 | with it?
00:17:15.540 | Great.
00:17:17.260 | So how is audio represented?
00:17:20.760 | This should be pretty straightforward, I think.
00:17:24.100 | Unlike a two-dimensional image where we normally have a 2D grid of pixels, audio is just a
00:17:28.900 | 1D signal.
00:17:30.580 | And there are a bunch of different formats for audio, but typically this one-dimensional
00:17:35.180 | wave that is actually me saying something like, "Hello, world," is something like 8,000
00:17:42.580 | samples per second or 16,000 samples per second.
00:17:46.500 | And each sample is quantized into 8 or 16 bits.
00:17:51.020 | So when we represent this audio signal that's going to go into our pipeline, you could just
00:17:55.020 | think of that as a one-dimensional vector.
00:17:57.380 | So when I had that box called x that represented my audio signal, you can think of this as
00:18:02.540 | being broken down into samples, x1, x2, and so forth.
00:18:07.220 | And if I had a one-second audio clip, this vector would have a length of either, say,
00:18:12.300 | 8,000 or 16,000 samples.
00:18:14.840 | And each element would be, say, a floating point number that I'd extracted from this
00:18:19.460 | 8 or 16-bit sample.
00:18:21.140 | So it's really simple.
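As a concrete sketch of this representation, here is one way to load a clip into a one-dimensional float vector in Python (assuming a mono 16-bit PCM WAV file and the SciPy library; the file name is hypothetical):

```python
import numpy as np
from scipy.io import wavfile

# `rate` is the sample rate (e.g. 8000 or 16000 samples per second);
# `samples` is a 1-D array of 16-bit integers, one entry per sample.
rate, samples = wavfile.read("hello_world.wav")

# Convert to floating point in [-1, 1): this is the vector x1, x2, ...
x = samples.astype(np.float32) / 32768.0

# A one-second clip gives a vector of length `rate`.
print(rate, x.shape)
```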
00:18:23.800 | Now once I have an audio clip, we'll do a little bit of preprocessing.
00:18:28.700 | So there are a couple of ways to start.
00:18:31.020 | The first is to just do some vanilla preprocessing, like convert to a simple spectrogram.
00:18:37.540 | So if you look at a traditional speech pipeline, you're going to see things like MFCCs, which
00:18:42.660 | are mel-frequency cepstral coefficients.
00:18:45.740 | You'll see a whole bunch of plays on spectrograms where you take differences in different kinds
00:18:50.860 | of features and try to engineer complex representations.
00:18:55.500 | But for the stuff that we're going to do today, a simple spectrogram is just fine.
00:18:59.700 | And it turns out, as you'll see in a second, we lose a little bit of information when we
00:19:04.180 | do this, but it turns out not to be a huge difference.
00:19:08.740 | Now I said a moment ago that I think probably this is going to go away in the long run.
00:19:14.260 | And that's because today you can actually find recent research in trying to do away
00:19:19.540 | with even this preprocessing part and having your neural network process the audio wave
00:19:23.860 | directly and just train its own feature transformation.
00:19:27.660 | So there's some references at the end that you can look at for this.
00:19:32.980 | So here's a quick straw poll.
00:19:35.460 | How many people have seen a spectrogram or computed a spectrogram before?
00:19:39.020 | Pretty good.
00:19:40.340 | Maybe 50%.
00:19:43.060 | So the idea behind a spectrogram is that it's sort of like a frequency domain representation,
00:19:49.940 | but instead of representing this entire signal in terms of frequencies, I'm just going to
00:19:55.100 | represent a small window in terms of frequencies.
00:19:59.940 | So to process this audio clip, the first thing I'm going to do is cut out a little window
00:20:06.660 | that's typically about 20 milliseconds long.
00:20:09.020 | And when you get down to that scale, it's usually very clear that these audio signals
00:20:12.880 | are made up of sort of a combination of different frequencies of sine waves.
00:20:18.960 | And then what we do is we compute an FFT.
00:20:21.780 | It basically converts this little signal into the frequency domain.
00:20:26.460 | And then we just take the log of the power at each frequency.
00:20:31.260 | And so if you look at what the result of this is, it basically tells us for every frequency
00:20:39.420 | of sine wave, what is the magnitude, what's the amount of power represented by that sine
00:20:44.820 | wave that makes up this original signal.
00:20:48.200 | So over here in this example, we have a very strong low frequency component in the signal.
00:20:55.780 | And then we have differing magnitudes at different differing frequencies.
00:21:01.180 | So we can just think of this as a vector.
00:21:06.020 | So now instead of representing this little 20 millisecond slice as sort of a sequence
00:21:10.560 | of audio samples, instead I'm going to represent it as a vector here where each element represents
00:21:18.180 | sort of the strength of each frequency in this little window.
00:21:22.500 | And the next step beyond this is that if I just told you how to process one little window,
00:21:28.100 | you can of course apply this to a whole bunch of windows across the entire piece of audio.
00:21:34.960 | And that gives you what we call a spectrogram.
00:21:37.400 | And you can use either disjoint windows that are just sort of adjacent or you can apply
00:21:41.800 | them to overlapping windows if you like.
00:21:44.560 | So there's a little bit of parameter tuning there.
00:21:46.980 | But this is an alternative representation of this audio signal that happens to be easier
00:21:53.000 | to use for a lot of purposes.
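A minimal NumPy sketch of that computation (window and hop sizes are illustrative choices for 16 kHz audio, roughly 20 ms windows with 50% overlap):

```python
import numpy as np

def spectrogram(x, window_size=320, hop=160):
    """Log-power spectrogram of a 1-D float signal x."""
    frames = []
    for start in range(0, len(x) - window_size + 1, hop):
        window = x[start:start + window_size] * np.hanning(window_size)
        # FFT of the little window: one complex value per frequency bin.
        spectrum = np.fft.rfft(window)
        # Log of the power at each frequency (epsilon avoids log(0)).
        frames.append(np.log(np.abs(spectrum) ** 2 + 1e-10))
    # Shape: (num_windows, num_frequency_bins) -- one vector per window.
    return np.array(frames)
```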
00:21:57.760 | So our goal, starting from this representation, is to build what I'm going to call an acoustic
00:22:04.200 | model, but which is really, to the extent we can make it happen, is really going to
00:22:08.720 | be an entire speech engine that is represented by a neural network.
00:22:12.780 | So what we would like to do is build a neural net that if we could train it from a whole
00:22:18.740 | bunch of pairs, X, which is my original audio that I turn into a spectrogram, and Y star,
00:22:25.000 | that's the ground truth transcription that some human has given me.
00:22:29.000 | If I were to train this big neural network off of these pairs, what I'd like it to produce
00:22:35.600 | is some kind of output that I'm representing by the character C here, so that I could later
00:22:41.560 | extract the correct transcription, which I'm going to denote by Y.
00:22:46.920 | So if I said hello, the first thing I'm going to do is run preprocessing to get all these
00:22:51.540 | spectrogram frames.
00:22:53.340 | And then I'm going to have a recurrent neural network that consumes each frame and processes
00:22:58.440 | them into some new representation called C. And hopefully, I can engineer my network in
00:23:04.640 | such a way that I can just read the transcription off of these output neurons.
00:23:09.940 | So that's kind of the intuitive picture of what we want to accomplish.
00:23:15.960 | So as I mentioned back in the outline, there's one obvious fundamental problem here, which
00:23:21.920 | is that the length of the input is not the same as the length of the transcription.
00:23:28.840 | So if I say hello very slowly, then I can have a very long audio signal, even though
00:23:35.120 | I didn't change the length of the transcription.
00:23:36.760 | Or if I say hello very quickly, then I can have a very short piece of audio.
00:23:43.120 | And so that means that this output of my neural network is changing length, and I need to
00:23:47.640 | come up with some way to map that variable length neural network output to this fixed
00:23:53.440 | length transcription, and also do it in a way that we can actually train this pipeline.
00:23:58.940 | So the traditional way to deal with this problem, if you were building a speech engine several
00:24:07.020 | years ago, is to just try to bootstrap the whole system.
00:24:11.040 | So I'd actually train a neural network to correctly predict the sounds at every frame
00:24:16.160 | using some kind of data set like TIMIT, where someone has lovingly annotated all of the
00:24:21.320 | phonemes for me.
00:24:23.140 | And then I'd try to figure out the alignment between my saying hello in a phonetic transcription
00:24:28.320 | with the input audio.
00:24:30.140 | And then once I've lined up all of the sounds with the input audio, now I don't care about
00:24:35.080 | length anymore, because I can just make a one-to-one mapping between the audio input
00:24:40.320 | and the phoneme outputs that I'm trying to target.
00:24:43.440 | But this alignment process is horribly error-prone.
00:24:47.420 | You have to do a lot of extra work to make it work well, and so we really don't want
00:24:51.160 | to do this.
00:24:52.160 | We really want to have some kind of solution that lets us solve this straightaway.
00:24:57.920 | So there are multiple ways to do it.
00:24:59.680 | And as I mentioned, there's some current research on how to use things like attentional models,
00:25:04.280 | sequence-to-sequence models that you'll hear about later, in order to solve this kind of
00:25:10.960 | problem.
00:25:11.960 | And then, as I said, we'll focus on something called connectionist temporal classification,
00:25:17.280 | or CTC, that is sort of current state-of-the-art for how to do this.
00:25:23.000 | So here's the basic idea.
00:25:24.960 | So our recurrent neural network has these output neurons that I'm calling C. And the
00:25:31.680 | job of these output neurons is to encode a distribution over the output symbols.
00:25:39.560 | So because of the structure of the recurrent network, the length of this symbol sequence
00:25:45.480 | C is the same as the length of my audio input.
00:25:48.280 | So if my audio input, say, was two seconds long, that might have 100 audio frames.
00:25:55.500 | And that would mean that the length of C is also 100 different values.
00:26:01.200 | So if we were working on a phoneme-based model, then C would be some kind of phoneme representation.
00:26:07.640 | And we would also include a blank symbol, which is special for CTC.
00:26:12.640 | But if, as we'll do in the rest of this talk, we're trying to just predict the graphemes,
00:26:19.040 | trying to predict the characters in this language directly from the audio, then I would just
00:26:24.400 | let C take on a value that's in my alphabet, or take on a blank or a space, if my language
00:26:30.940 | has spaces in it.
00:26:33.520 | And then the second thing I'm going to do, once my RNN gives me a distribution over these
00:26:39.540 | symbols C, is that I'm going to try to define some kind of mapping that can convert this
00:26:45.240 | long transcription C into the final transcription Y.
00:26:50.480 | That's like, hello.
00:26:51.720 | That's the actual string that I want.
00:26:55.000 | And now, recognizing that C is itself a probabilistic creature, there's a distribution over choices
00:27:01.260 | of C that correspond to the audio.
00:27:04.680 | Once I apply this function, that also means that there's a distribution over Y.
00:27:08.760 | There's a distribution over the possible transcriptions that I could get.
00:27:12.680 | And what I'll want to do to train my network is to maximize the probability of the correct
00:27:17.760 | transcription given the audio.
00:27:20.600 | So those are the three steps that we have to accomplish in order to make CTC work.
00:27:26.400 | So let's start with the first one.
00:27:29.520 | So we have these output neurons C, and they represent a distribution over the different
00:27:36.240 | symbols that I could be hearing in the audio.
00:27:39.680 | So I've got some audio signal down here.
00:27:41.380 | You can see the spectrogram frames poking up.
00:27:44.740 | And this is being processed by this recurrent neural network.
00:27:48.440 | And the output is a big bank of softmax neurons.
00:27:53.320 | So for the first frame of audio, I have a neuron that corresponds to each of the symbols
00:27:59.820 | that C could represent.
00:28:02.680 | And this set of softmax neurons here, with the output summing to 1, represents the probability
00:28:10.620 | of, say, C1 having the value A, B, C, and so on, or this special blank character.
00:28:17.680 | So for example, if I pick one of the neurons over here, then the first row, which represents
00:28:24.280 | the character B, and the 17th column, which is the 17th frame in time, this represents
00:28:31.760 | the probability that C17 represents the character B, given the audio.
00:28:40.920 | So once I have this, that also means that I can just define a distribution not just
00:28:47.160 | over the individual characters, but if I just assume that all of the characters are independent,
00:28:53.080 | which is kind of a naive assumption, but if I bake this into the system, I can define
00:28:57.820 | a distribution over all possible sequences of characters in this alphabet.
00:29:04.520 | So if I gave you a specific instance, a specific character string using this alphabet, for
00:29:11.960 | instance, I represent the string hello as H-H-H, blank, E-E, blank, blank, L-L, blank, L-O,
00:29:21.080 | and then a bunch of blanks.
00:29:22.600 | This is a string in this alphabet for C, and I can just use this formula to compute the
00:29:28.960 | probability of this specific sequence of characters.
00:29:33.560 | So that's how we compute the probability for a sequence of characters when they have the
00:29:38.760 | same length as the audio input.
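Written out, that per-frame independence assumption says, for a symbol sequence c = (c_1, ..., c_T) with the same length T as the number of audio frames:

```latex
P(c \mid x) \;=\; \prod_{t=1}^{T} P(c_t \mid x)
```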
00:29:43.960 | So the second step, and this is in some sense the kind of neat trick in CTC, is to define
00:29:52.440 | a mapping from this long encoding of the audio into symbols that crunches it down to the
00:30:03.240 | actual transcription that we're trying to predict.
00:30:06.000 | And the rule is this operator takes this character sequence, and it picks up all the duplicates,
00:30:13.440 | all of the adjacent characters that are repeated, and discards the duplicates and just keeps
00:30:18.720 | one of them, and then it drops all of the blanks.
00:30:23.240 | So in this example, you see you have three H's together, so I just keep one H, and then
00:30:29.600 | I have a blank, I throw that away, and I keep an E, and I have two L's, so I keep one of
00:30:34.640 | the L's over here, and then another blank, and an L-O.
00:30:38.400 | And the one key thing to note is that when I have two characters that are different right
00:30:43.280 | next to each other, I just end up keeping those two characters in my output.
00:30:48.680 | But if I ever have a double character, like L-L in "hello," then I'll need to have a blank
00:30:54.880 | character that gets put in between.
00:30:58.880 | But if our neural network gave me this transcription, told me that this was the right answer, we
00:31:04.360 | just have to apply this operator, and we get back the string "hello."
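A minimal Python sketch of this collapsing operator (using an underscore for the blank symbol, which is just a notational choice of this sketch):

```python
def collapse(c):
    """CTC squeeze: merge adjacent duplicate symbols, then drop blanks ('_')."""
    out = []
    for symbol in c:
        if out and symbol == out[-1]:
            continue                  # skip an adjacent duplicate
        out.append(symbol)
    return "".join(s for s in out if s != "_")   # remove the blanks

print(collapse("HHH_EE__LL_LO___"))   # -> "HELLO"
```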
00:31:12.020 | So now that we have a way to define a distribution over these sequences of symbols that are the
00:31:18.320 | same length as the audio, and we now have a mapping from those strings into transcriptions,
00:31:25.160 | as I said, this gives us a probability distribution over the possible final transcriptions.
00:31:30.920 | So if I look at the probability distribution over all the different sequences of symbols,
00:31:37.720 | I might have "hello" written out like on the last slide, and maybe that has probability
00:31:42.620 | .1, and then I might have "hello" but written a different way, by say replacing this H with
00:31:49.880 | a blank that has a smaller probability, and I have a whole bunch of different possible
00:31:55.160 | symbol sequences below that.
00:31:58.440 | And what you'll notice is that if I go through every possible combination of symbols here,
00:32:06.160 | there are several combinations that all map to the same transcription.
00:32:10.640 | So here's one version of "hello," there's a second version of "hello," there's a third
00:32:15.120 | version of "hello."
00:32:16.800 | And so if I now ask, "What's the probability of the transcription 'hello'?"
00:32:21.720 | The way that I compute that is I go through all of the possible character sequences that
00:32:28.080 | correspond to the transcription "hello," and I add up all of their probabilities.
00:32:33.240 | So I have to sum over all possible choices of C that could give me that transcription
00:32:38.680 | in the end.
00:32:40.400 | So you can kind of think of this as searching through all the possible alignments, right?
00:32:48.080 | I could shift these characters around a little bit, I could move them forward, backward,
00:32:52.360 | I could expand them by adding duplicates or squish them up, depending on how fast someone
00:32:56.500 | is talking, and that corresponds to every possible alignment between the audio and the
00:33:04.000 | characters that I want to transcribe.
00:33:05.000 | It sort of solves the problem of the variable length.
00:33:08.760 | And the way that I get the probability of a specific transcription is to sum up, to
00:33:14.360 | marginalize over all the different alignments that could be feasible.
00:33:20.920 | And then if we have a whole bunch of other possibilities in here, like the word "yellow,"
00:33:25.480 | I'd compute them in the same way.
00:33:27.400 | So this equation just says to sum over all the character sequences C so that when I apply
00:33:32.800 | this little mapping operator, I end up with the transcription y.
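In symbols, writing the collapsing operator as β (a common notation in the CTC literature, not one used on the slides):

```latex
P(y \mid x) \;=\; \sum_{c \,:\, \beta(c) = y} P(c \mid x)
            \;=\; \sum_{c \,:\, \beta(c) = y} \; \prod_{t=1}^{T} P(c_t \mid x)
```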
00:33:36.160 | Oh, oh.
00:33:47.880 | I'm missing a double E. You're talking about this one?
00:33:51.920 | So when we apply this sort of squeezing operator here, we drop this double E to get a single
00:33:59.120 | E in "hello."
00:34:00.120 | And we remove all the duplicates.
00:34:03.320 | So the same way we did for an H.
00:34:04.320 | [INAUDIBLE]
00:34:05.320 | Right.
00:34:09.160 | So whenever you see two characters together like this, where they're adjacent duplicates,
00:34:16.120 | you sort of squeeze all those duplicates out, and you just keep one of them.
00:34:19.520 | But here we have a blank in between.
00:34:21.880 | So if we drop all the duplicates first, then we still have two L's left, and then we remove
00:34:27.440 | all the blanks.
00:34:29.160 | So this gives the algorithm a way to represent repeated characters in the transcription.
00:34:33.280 | There's another one in the back.
00:34:38.120 | [INAUDIBLE]
00:34:39.120 | Oh, oh, I see.
00:34:42.640 | Yeah.
00:34:43.640 | This is maybe-- I put a space in here.
00:34:47.520 | Really I should have put a space character in here instead of a blank.
00:34:51.000 | Really this could be H-E-L-L-O-H.
00:34:54.920 | Yeah.
00:34:58.120 | So the space here is erroneous.
00:35:03.240 | We good?
00:35:08.040 | So once I've defined this, I just gave you a formula to compute the probability of a
00:35:13.600 | string given the audio.
00:35:16.360 | So as with every good starting to a machine learning algorithm, we go and we try to apply
00:35:22.120 | maximum likelihood.
00:35:23.720 | I now give you the correct transcription, and your job is to tune the neural network
00:35:28.440 | to maximize the probability of that transcription using this model that I just defined.
00:35:34.160 | So in equations, what I'm going to do is I want to maximize the log probability of y
00:35:40.960 | star for a given example.
00:35:46.280 | I want to maximize the probability of the correct transcription given the audio x.
00:35:51.680 | And then I'm just going to sum over all the examples.
00:35:56.160 | And then what I want to do is just replace this with the equation that I had on the last
00:36:02.040 | page that says in order to compute the probability of a given transcription, I have to sum over
00:36:07.320 | all of the possible symbol sequences that could have given me that transcription, sum
00:36:12.120 | over all the possible alignments that would map that transcription to my audio.
00:36:18.800 | So Alex Graves and co-authors in 2006 actually show that because of this independence assumption,
00:36:26.000 | there is a clever way, there is a dynamic programming algorithm that can efficiently
00:36:29.720 | compute this summation for you.
00:36:33.000 | And not only compute this summation so that you can compute the objective function, but
00:36:37.200 | actually compute its gradient with respect to the output neurons of your neural network.
00:36:42.240 | So if you look at the paper, the algorithm details are in there.
00:36:47.000 | What's cool right now in the history of speech and deep learning is that this is at the level
00:36:52.200 | of a technology.
00:36:53.540 | This is something that's now implemented in a bunch of places so that you can download
00:36:57.500 | a software package that efficiently will calculate this CTC loss function for you that can calculate
00:37:05.540 | this likelihood and can also just give you back the gradient.
00:37:08.980 | So I won't go into the equations here.
00:37:11.240 | Instead, I'll tell you that there are a whole bunch of implementations on the web that you
00:37:15.760 | can now use as part of deep learning packages.
00:37:19.140 | So one of them from Baidu implements CTC on the GPU.
00:37:23.120 | It's called WarpCTC.
00:37:25.960 | Stanford and the group there, actually one of Andrew's students, has a CTC implementation.
00:37:33.360 | And there's also now CTC losses implemented in packages like TensorFlow.
00:37:37.880 | So this is something that's sufficiently widely distributed that you can use these algorithms
00:37:44.480 | off the shelf.
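As one concrete illustration of using such an off-the-shelf CTC loss, here is a minimal sketch with PyTorch's built-in nn.CTCLoss (PyTorch is just one of the available packages; the shapes, vocabulary size, and the choice of index 0 for the blank are assumptions of this sketch):

```python
import torch
import torch.nn as nn

T, N, C = 100, 8, 29   # frames per utterance, batch size, number of symbols (incl. blank)
blank = 0              # reserve index 0 for the CTC blank symbol

# Per-frame log-probabilities over symbols, e.g. log-softmax outputs of an RNN.
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)

# Integer-encoded ground-truth transcriptions (padded) and their true lengths.
targets = torch.randint(low=1, high=C, size=(N, 20))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

ctc = nn.CTCLoss(blank=blank)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # gradients w.r.t. the network outputs, computed for you
```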
00:37:46.880 | So the way that these work, the way that we go about training, is we start from our audio
00:37:51.480 | spectrogram.
00:37:52.680 | We have our neural network structure where you get to choose how it's put together.
00:37:58.060 | And then it outputs this bank of softmax neurons.
00:38:01.640 | And then there are pieces of off-the-shelf software that will compute for you the CTC
00:38:07.520 | cost function.
00:38:08.520 | They'll compute this log likelihood given a transcription and the output neurons from
00:38:13.640 | your recurrent network.
00:38:16.440 | And then the software will also be able to tell you the gradient with respect to the
00:38:20.800 | output neurons.
00:38:22.040 | And once you've got that, you're set.
00:38:23.440 | You can feed them back into the rest of your code and get the gradient with respect to
00:38:27.640 | all of these parameters.
00:38:29.980 | So as I said, this is all available now in sort of efficient, off-the-shelf software.
00:38:34.280 | So you don't have to do this work yourself.
00:38:37.520 | So that's pretty much all there is to the high-level algorithm.
00:38:41.840 | With this, it's actually enough to get a sort of working drosophila of speech recognition
00:38:48.480 | going.
00:38:49.800 | There are a few little tricks, though, that you might need along the way.
00:38:54.680 | On easy problems, you might not need these.
00:38:57.520 | But as you get to more difficult data sets with a lot of noise, they can become more
00:39:01.720 | and more important.
00:39:03.360 | So the first one that we've been calling "SortaGrad" in the vein of all of the grad algorithms
00:39:09.080 | out there is basically a trick to help with recurrent neural networks.
00:39:16.600 | So it turns out that when you try to train one of these big RNN models on some off-the-shelf
00:39:22.360 | speech data, one of the things that can really get you is seeing very long utterances early
00:39:28.680 | in the process.
00:39:30.540 | Because if you have a really long utterance, then if your neural network is badly initialized,
00:39:36.680 | you'll often end up with things like underflow and overflow as you try to go and compute
00:39:41.000 | the probabilities.
00:39:42.400 | And you end up with gradients exploding as you try to do back propagation.
00:39:46.320 | And it can make your optimization a real mess.
00:39:49.200 | And it's coming from the fact that these utterances are really long and really hard, and the neural
00:39:53.440 | network just isn't ready to deal with those transcriptions.
00:39:57.280 | And so one of the fixes that you can use is, during the early parts of training, usually
00:40:02.400 | in the first epoch, is you just sort all of your audio by length.
00:40:07.040 | And now, when you process a mini-batch, you just take the short utterances first so that
00:40:12.040 | you're working with really short RNNs that are quite easy to train and don't blow up
00:40:16.760 | and don't have a lot of catastrophic numerical problems.
00:40:20.480 | And then as time goes by, you start operating on longer and longer utterances that get more
00:40:25.440 | and more difficult.
00:40:27.440 | So we call this "SortaGrad."
00:40:28.760 | It's basically a curriculum learning method.
00:40:31.480 | And so you can see some work from Yoshua Bengio and his team on a whole bunch of strategies
00:40:36.520 | for this.
00:40:37.520 | But you can think of the short utterances as being the easy ones.
00:40:40.480 | And if you start out with the easy utterances and move to the longer ones, your optimization
00:40:44.640 | algorithm can do better.
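A minimal sketch of that curriculum (names are illustrative; `dataset` is assumed to be a list of (spectrogram, transcript) pairs):

```python
import random

def batches_for_epoch(dataset, epoch, batch_size=32):
    """Epoch 0: shortest utterances first. Later epochs: ordinary shuffling."""
    if epoch == 0:
        ordered = sorted(dataset, key=lambda ex: len(ex[0]))  # sort by number of frames
    else:
        ordered = list(dataset)
        random.shuffle(ordered)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]
```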
00:40:46.640 | So here's an example from one of the models that we've trained, where your CTC cost starts
00:40:53.100 | up here.
00:40:54.720 | And after a while, you optimize, and you sort of bottom out around, I don't know, what?
00:40:59.560 | A log likelihood of maybe 30.
00:41:02.080 | And then if you add this SortaGrad strategy, after the first epoch, you're actually doing
00:41:07.380 | better.
00:41:08.380 | And you can reach a better optimum than you could without it.
00:41:12.120 | And in addition, another strategy that's extremely helpful for recurrent networks and very deep
00:41:17.120 | neural networks is batch normalization.
00:41:20.760 | So this is becoming very popular.
00:41:22.880 | And it's also available as sort of an off-the-shelf package inside of a lot of the different frameworks
00:41:28.120 | that are available today.
00:41:29.520 | So if you start having trouble, you can consider putting batch normalization into your network.
00:41:36.280 | So our neural network now spits out this big bank of softmax neurons.
00:41:41.720 | We've got a training algorithm.
00:41:42.760 | We're just doing gradient descent.
00:41:45.720 | How do we actually get a transcription?
00:41:47.740 | This process, as I said, is meant to be as close to characters as possible.
00:41:53.000 | But we still sort of need to decode these outputs.
00:41:56.600 | And you might think that one simple solution, which turns out to be approximate, to get
00:42:01.720 | the correct transcription is just go through here and pick the most likely sequence of
00:42:06.960 | symbols for C, and then apply our little squeeze operator to get back the transcription the
00:42:13.280 | way that we defined it.
00:42:14.960 | So this turns out not to be the optimal thing.
00:42:17.560 | This actually doesn't give you the most likely transcription, because it's not accounting
00:42:22.320 | for the fact that every transcription might have multiple sequences of Cs, multiple alignments
00:42:28.640 | in this representation.
00:42:32.320 | But you can actually do this, and this is called the max decoding.
00:42:36.280 | And so for this sort of contrived example here, I put little red dots on the most likely
00:42:42.720 | C. And if you see, there's a couple of blanks, a couple of Cs, there's another blank, A,
00:42:50.560 | more blanks, Bs, more blanks.
00:42:52.680 | And if you apply our little squeeze operator, you just get the word cab.
00:42:58.800 | If you do this, it is often terrible.
00:43:02.200 | It will often give you a very strange transcription that doesn't look like English necessarily.
00:43:09.280 | But the reason I mention it is that this is a really handy diagnostic.
00:43:13.400 | If you're kind of wondering what's going on in the network, glancing at a few of these
00:43:17.160 | will often tell you if the network's starting to pick up any signal or if it's just outputting
00:43:21.900 | gobbledygook.
00:43:22.900 | So I'll give you a more detailed example in a second of how that happens.
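A minimal sketch of max decoding, reusing the collapse function from earlier (`probs` is assumed to be the frames-by-symbols matrix of softmax outputs and `alphabet` the matching list of characters, with '_' for the blank):

```python
import numpy as np

def max_decode(probs, alphabet):
    """Greedy decoding: most likely symbol at every frame, then collapse."""
    best = np.argmax(probs, axis=1)               # index of the red dot in each column
    symbols = "".join(alphabet[i] for i in best)  # e.g. "__cc_a__b_"
    return collapse(symbols)                      # -> "cab"
```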
00:43:28.360 | All right.
00:43:29.640 | So these are all the concepts of our very simple pipeline.
00:43:32.880 | And the demo code that we're going to put up on the web will basically let you work
00:43:37.480 | on all of these pieces.
00:43:39.200 | So once we try to train these, I want to give you an example of the sort of data that we're
00:43:44.400 | training on.
00:43:45.400 | >> A tanker is a ship designed to carry large volumes of oil or other liquid cargo.
00:43:51.920 | >> So this is just a person sitting there reading the Wall Street Journal to us.
00:43:55.960 | So this is a sort of simple data set.
00:43:58.520 | It's really popular in the speech research community.
00:44:02.240 | It's published by the Linguistic Data Consortium.
00:44:05.320 | There's also a free alternative called LibriSpeech that's very similar.
00:44:08.720 | But instead of people reading the Wall Street Journal, it's people reading Creative Commons
00:44:12.360 | audiobooks.
00:44:15.200 | So in the demo code that we have, a really simple network that works reasonably well
00:44:22.680 | looks like this.
00:44:23.680 | So there's a sort of family of models that we've been working with, where you start from
00:44:28.120 | your spectrogram.
00:44:29.640 | You have maybe one layer or several of convolutional filters at the bottom.
00:44:35.380 | And then on top of that, you have some kind of recurrent neural network.
00:44:38.040 | It might just be a vanilla RNN, but you can also use LSTM or GRU cells, any of your favorite
00:44:47.360 | RNN creatures from the literature.
00:44:50.760 | And then on top of that, we have some fully connected layers that produce these softmax
00:44:54.800 | outputs.
00:44:55.800 | And those are the things that go into CTC for training.
00:44:59.680 | So this is pretty straightforward.
00:45:01.040 | The implementation on the web uses the warp CTC code.
00:45:05.000 | And then we would just train this big neural network with stochastic gradient descent,
00:45:08.880 | Nesterov's momentum, all the stuff that you've probably seen in a whole bunch of other talks
00:45:13.120 | so far.
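A rough PyTorch sketch of this family of models (a scale model only: one 1-D convolution over the spectrogram, a couple of bidirectional GRU layers, and a fully connected layer producing per-frame outputs for CTC; all sizes are illustrative, not the settings of the demo code):

```python
import torch
import torch.nn as nn

class SpeechModel(nn.Module):
    def __init__(self, n_freq=161, n_symbols=29, hidden=512):
        super().__init__()
        # Convolution over time, treating frequency bins as input channels.
        self.conv = nn.Conv1d(n_freq, hidden, kernel_size=11, stride=2, padding=5)
        # Recurrent layers on top of the convolutional features.
        self.rnn = nn.GRU(hidden, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        # Fully connected layer: one score per symbol per frame.
        self.fc = nn.Linear(2 * hidden, n_symbols)

    def forward(self, spectrogram):            # (batch, frames, n_freq)
        x = spectrogram.transpose(1, 2)         # (batch, n_freq, frames) for Conv1d
        x = torch.relu(self.conv(x))
        x = x.transpose(1, 2)                   # (batch, frames', hidden)
        x, _ = self.rnn(x)                      # (batch, frames', 2 * hidden)
        return self.fc(x).log_softmax(dim=-1)   # per-frame log-probabilities
```

To train this with the CTC loss shown earlier, the output would be transposed to (frames, batch, symbols) before being handed to the loss, and the optimizer would be plain SGD with Nesterov momentum.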
00:45:14.120 | All right.
00:45:15.120 | So if you actually run this, what is going on inside?
00:45:21.680 | So I mentioned that looking at the max decoding is kind of a handy way to see what's going
00:45:26.960 | on inside this creature.
00:45:29.920 | So I wanted to show you an example.
00:45:31.840 | So this is a picture.
00:45:34.640 | This is a visualization of those softmax neurons at the top of one of these big neural networks.
00:45:40.860 | So this is the representation of C from all the previous slides.
00:45:45.880 | So on the horizontal axis, this is basically time.
00:45:48.680 | This is the frame number or which chunk of the spectrogram we're seeing.
00:45:52.800 | And then on the vertical axis here, you see these are all the characters in the English
00:45:56.460 | alphabet or a space or a blank.
00:45:59.280 | So after 300 iterations of training, which is not very much, the system has learned something
00:46:04.960 | amazing, which is that it should just output blanks and spaces all the time.
00:46:09.600 | Because these are by far, because of all the silence and things in your data set, these
00:46:14.040 | are the most common characters.
00:46:16.000 | I just want to fill up the whole space with blanks.
00:46:18.360 | But you can see it's kind of randomly poking out a few characters here.
00:46:23.300 | And if you run your little max decoding strategy to see what does the system think the transcription
00:46:28.440 | is, it thinks the transcription is "eh."
00:46:33.160 | But after 300 iterations, that's okay.
00:46:35.600 | But this is a sign that the neural network's not going crazy.
00:46:38.360 | Your gradient isn't busted.
00:46:40.240 | It's at least learned what is the most likely characters.
00:46:44.560 | Then after maybe 1,500 or so, you start to get a little bit of structure.
00:46:49.680 | And if you try to like mouth these words, you might be able to sort of see that there's
00:46:54.860 | some English-like sounds in here, like "Beyar justinfrutin."
00:47:00.760 | Something kind of odd.
00:47:01.760 | But it's actually looking much better than just "h."
00:47:03.600 | It's actually starting to output something.
00:47:07.120 | Go a little bit farther.
00:47:08.480 | It's a little bit more organized.
00:47:11.920 | You can start to see that we have sort of fragments of possibly words starting to form.
00:47:18.980 | And then after you're getting close to convergence, it's still not a real sentence.
00:47:23.880 | But does this make sense to people?
00:47:25.160 | Can you guess what the correct transcription might be?
00:47:31.640 | You might have a couple of candidates.
00:47:34.160 | The correct one is actually "they're just in front."
00:47:38.440 | And so you can see that sort of it's sort of sounding it out with English characters.
00:47:44.220 | I have a young son, and I kind of figure I'm eventually going to see him producing max-decoded
00:47:48.600 | outputs of English.
00:47:51.520 | And you're just going to sound these things out and be like, "Is it there justinfrunt?
00:47:55.960 | There?"
00:47:56.960 | But this is why this max-decoding strategy is really handy.
00:48:00.200 | Because you can kind of look at this output and say, yeah, it's starting to get some actual
00:48:03.880 | signal out of the data.
00:48:05.080 | It's not just gobbledygook.
00:48:07.240 | So because this is like my favorite speech recognition party game, I wanted to show you
00:48:11.800 | a few more of these.
00:48:13.480 | So here's the max-decoded output.
00:48:15.920 | "The poor little things," cried Cynthia, "think of them having been turned to the wall all
00:48:20.720 | these years."
00:48:21.720 | And so you can hear like the sound of the breath at the end.
00:48:26.400 | Turns into a little bit of a word.
00:48:29.440 | "Cynthia" is sort of in this transcription.
00:48:34.040 | And you'll find that things like proper names and so on tend to get sounded out.
00:48:38.240 | But if those names are not in your audio data, there's no way the network could have learned
00:48:42.320 | how to say the name Cynthia.
00:48:45.080 | And we'll come back to how to solve that later.
00:48:47.000 | But you see the true label is "The poor little things," cried Cynthia.
00:48:52.160 | And that the last word is actually "all these years."
00:48:54.560 | And there isn't a word hanging off at the end.
00:48:58.280 | So here's another one.
00:48:59.280 | >> That is true, bad dealt gray.
00:49:03.920 | >> How many people figured out what this is?
00:49:06.160 | This is the max-decoded transcription.
00:49:08.680 | It sounds good to you.
00:49:12.080 | It sounds good to me.
00:49:13.080 | If you told me that this was the ground truth, I'd go, "That's weird.
00:49:16.720 | I have to go look up what this is."
00:49:19.480 | Here's the actual true label.
00:49:21.920 | Turns out this is a French word that means something like "rubbernecking."
00:49:26.840 | I had no idea what this word was.
00:49:29.440 | So this is, again, the cool examples of what these neural networks are able to figure out
00:49:34.040 | with no knowledge of the language itself.
00:49:38.120 | Okay.
00:49:40.480 | So let's go back to decoding.
00:49:42.280 | We just talked about max-decoding, which is sort of an approximate way of going from these
00:49:49.040 | probability vectors to a transcription Y.
00:49:52.240 | And if you want to find the actual most likely transcription Y, there's actually no algorithm
00:49:58.400 | in general that can give you the perfect solution efficiently.
00:50:03.840 | So the reason for that, remember, is that for a single transcription Y, I have an efficient
00:50:09.200 | algorithm to compute its probability.
00:50:11.320 | But if I want to search over every possible transcription, I don't know how to do that
00:50:15.880 | because there are exponentially many possible transcriptions, and I'd have to run this algorithm
00:50:22.600 | to compute the probability of all of them.
00:50:25.400 | So we have to resort to some kind of generic search strategy.
00:50:29.840 | And so one proposed in the original paper briefly is a sort of prefix-decoding strategy.
00:50:36.760 | So I don't want to spend a ton of time on this.
00:50:39.200 | Instead, I want to step to sort of the next piece of the picture.
00:50:44.480 | So there were a bunch of examples in there, right, like proper names, like Cynthia and
00:50:49.160 | things like Baddourie, where unless you had heard this word before, you have no hope of
00:50:57.000 | getting it right with your neural network.
00:50:59.280 | And so there are lots of examples like this in the literature of things that are sort
00:51:04.680 | of spelled out phonetically but aren't legitimate English transcriptions.
00:51:10.240 | And so what we'd like to do is come up with a way to fold in just a little bit of that
00:51:17.040 | knowledge about the language, to take a small step backward from a perfect end-to-end system
00:51:22.200 | and make these transcriptions better.
00:51:25.280 | So as I said, the real problem here is that you don't have enough audio available to learn
00:51:31.200 | all these things.
00:51:32.200 | If you had millions and millions of hours of audio sitting around, you could probably
00:51:35.560 | learn all these transcriptions because you just hear enough words that you know how to
00:51:39.440 | spell them all, maybe the way a human does.
00:51:42.800 | But unfortunately, we just don't have enough audio for that.
00:51:45.960 | So we have to find a way to get around that data problem.
00:51:50.120 | There's also an example of something that in the AI lab we've dubbed the Tchaikovsky
00:51:53.880 | problem, which is that there are certain names in the world, right, like proper names, that
00:51:59.360 | if you've never heard of it before, you have no idea how it's spelled.
00:52:03.560 | And the only way to know it is to have seen this word in text before and to see it in
00:52:08.400 | context.
00:52:10.400 | So part of the purpose of these language models is to get examples like this correct.
00:52:14.760 | So there are a couple of solutions.
00:52:16.840 | One would be to just step back to a more traditional pipeline, right, use phonemes, because then
00:52:21.940 | we can bake new words in along with their phonetic pronunciation and the system will
00:52:27.680 | just get it right.
00:52:29.280 | But in this case, I want to focus on just fusing in a traditional language model that
00:52:34.960 | gives us the probability a priori of any sequence of words.
00:52:40.000 | So the reason that this is helpful is that using a language model, we can train these
00:52:45.480 | things from massive text corpora.
00:52:47.880 | We have way, way more text in the world than we have transcribed audio.
00:52:52.640 | And so that makes it possible to train these giant language models with huge vocabulary,
00:52:57.960 | and they can also pick up the sort of contextual things that will tip you off to the fact that
00:53:02.720 | Tchaikovsky concerto is a reasonable thing for a person to ask, and that a particular
00:53:08.520 | phonetically spelled-out alternative we have seen in the past, even though composed
00:53:15.200 | of legitimate English words, is nonsense.
00:53:19.680 | So there's actually not much to see on the language modeling front for this, except that
00:53:25.840 | the reasons for sticking with traditional N-gram models are kind of interesting if you're
00:53:30.520 | excited about speech applications.
00:53:32.960 | So if you go use a package like KenLM on the web to go build yourself a giant N-gram language
00:53:39.880 | model, these are really simple and well supported.
00:53:44.240 | And so that makes them easy to get working.
00:53:47.040 | And they'll let you train from lots of corpora, but for speech recognition in practice, one
00:53:52.360 | of the nice things about N-gram models as opposed to trying to, say, use like an RNN
00:53:58.160 | model is that we can update these things very quickly.
00:54:00.440 | If you have a big distributed cluster, you can update that N-gram model very rapidly
00:54:05.160 | in parallel from new data to keep track of whatever the trending words are today that
00:54:09.600 | your speech engine might need to deal with.
00:54:12.440 | And we also have the need to query this thing very rapidly inside our decoding loop that
00:54:18.960 | you'll see in just a second.
00:54:20.340 | And so being able to just look up the probabilities in a table the way an N-gram model is structured
00:54:24.680 | is very valuable.
00:54:26.880 | So I hope someday all of this will go away and be replaced with an amazing neural network.
00:54:32.760 | But this is really the best practice today.
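As a concrete illustration of how cheap these lookups are, here is a minimal sketch of querying an n-gram model, assuming the KenLM Python bindings are installed and that an ARPA-format model file (hypothetically named lm.arpa here) has already been built from a text corpus:

```python
import kenlm  # assumes the KenLM Python bindings are installed

# Hypothetical model file built offline from a large text corpus.
model = kenlm.Model("lm.arpa")

# score() returns a log10 probability; the query is essentially a series of
# table lookups, which is why it is fast enough to sit inside the decoding loop.
print(model.score("the tchaikovsky concerto", bos=True, eos=True))
```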
00:54:37.040 | So in order to fuse this into the system, since to get the most likely transcription,
00:54:44.880 | right, probability of Y given X, to maximize that thing, we need to use a generic search
00:54:49.940 | algorithm anyway.
00:54:51.460 | This opens up a door.
00:54:54.060 | Once we're using a generic search scheme to do our decoding and find the most likely transcription,
00:54:58.340 | we can add some extra cost terms.
00:55:00.840 | So in a previous piece of work from Awni Hannun and several co-authors, what you do
00:55:07.820 | is you take the probability of a given word sequence from your audio.
00:55:13.480 | So this is what you would get from your giant RNN.
00:55:17.920 | And you can just multiply it by some extra terms, the probability of the word sequence
00:55:22.560 | according to your language model raised to some power, and then multiply by the length
00:55:26.480 | raised to another power.
00:55:28.160 | And you see that if you just take the log of this objective function, right, then you
00:55:33.840 | get the log probability that was your original objective.
00:55:37.380 | You get alpha times the log probability of the language model, and beta times the log
00:55:43.080 | of the length.
00:55:44.380 | And these alpha and beta parameters let you sort of trade off the importance of getting
00:55:49.880 | a transcription that makes sense to your language model versus getting a transcription that
00:55:53.480 | makes sense to your acoustic model and actually sounds like the thing that you heard.
00:55:59.040 | And the reason for this extra term over here is that as you're multiplying in all of these
00:56:04.520 | terms, you tend to penalize long transcriptions a bit too much.
00:56:09.120 | And so having a little bonus or penalty at the end to tweak to get the transcription
00:56:13.600 | length right is very helpful.
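To make that objective concrete, here is a minimal sketch of the combined score in log space; log_p_acoustic comes from the CTC-trained network, log_p_lm is a hypothetical language-model query, and the alpha and beta values are purely illustrative (in practice they are tuned on a dev set):

```python
import math

def combined_score(log_p_acoustic, words, log_p_lm, alpha=2.0, beta=1.5):
    """Fuse acoustic-model and language-model scores for one candidate.

    log_p_acoustic: log P(y | x) from the CTC-trained RNN.
    words:          the candidate transcription as a list of words.
    log_p_lm:       hypothetical function returning the LM log-probability
                    of a word sequence (e.g. an n-gram model query).
    alpha, beta:    illustrative trade-off weights, tuned on held-out data.
    """
    return (log_p_acoustic
            + alpha * log_p_lm(words)       # how plausible the words are a priori
            + beta * math.log(len(words)))  # length term to offset over-penalizing long outputs
```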
00:56:16.700 | So the basic idea behind this is just to use beam search.
00:56:19.920 | Beam search is a really popular search algorithm, and there are a whole bunch of variants of it.
00:56:25.000 | And the rough strategy is this.
00:56:28.840 | So starting from time 0, starting from t equals 1 at the very beginning of your audio input,
00:56:35.120 | I start out with an empty list that I'm going to populate with prefixes.
00:56:40.540 | And these prefixes are just partial transcriptions that represent what I think I've heard so
00:56:45.300 | far in the audio up to the current time.
00:56:50.200 | And the way that this proceeds is I'm going to take at the current time step each candidate
00:56:56.440 | prefix out of this list.
00:56:58.960 | And then I'm going to try all of the possible characters in my softmax neurons that could
00:57:04.160 | possibly follow it.
00:57:06.040 | So for example, I can try adding a blank.
00:57:08.800 | I can say if the next element of C is actually supposed to be a blank, then what that would
00:57:15.440 | mean is that I don't change my prefix, right, because the blanks are just going to get dropped
00:57:19.680 | later.
00:57:21.020 | But I need to incorporate the probability of that blank character into the probability
00:57:26.780 | of this prefix, right?
00:57:28.440 | It represents one of the ways that I could reach that prefix.
00:57:32.400 | And so I need to sum that probability into that candidate.
00:57:37.060 | And likewise, whenever I add a space to the end of a prefix, that signals that this prefix
00:57:43.640 | represents the end of a word.
00:57:45.600 | And so in addition to adding the probability of the space into my current estimate, this
00:57:50.400 | gives me the chance to go look up that word in my language model and fold that into my
00:57:55.540 | current score.
00:57:57.520 | And then if I try adding a new character onto this prefix, it's just straightforward.
00:58:01.600 | I just go and update the probabilities based on the probability of that character.
00:58:06.400 | And then at the end of this, I'm going to have a huge list of possible prefixes that
00:58:11.120 | could be generated.
00:58:12.480 | And this is where you would normally get the exponential blow up of trying all possible
00:58:18.560 | prefixes to find the best one.
00:58:20.880 | And what beam search does is it just says, take the k most probable prefixes after I
00:58:27.220 | remove all the duplicates in here, and then go and do this again.
00:58:31.360 | And so if you have a really large k, then your algorithm will be a bit more accurate
00:58:35.400 | in finding the best possible solution to this maximization problem, but it'll be slower.
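Here is a minimal sketch of that prefix beam search, assuming per-frame softmax outputs with the blank as the last column and a hypothetical lm_prob function that scores a prefix's words whenever a space is appended. The published algorithm from Hannun and co-authors mentioned above also handles several details glossed over here (conditional word probabilities, the length bonus, prefixes re-entering the beam), so treat this as a sketch, not a reference implementation:

```python
import collections

def prefix_beam_search(probs, alphabet, lm_prob, alpha=1.0, beam_width=64):
    """Sketch of CTC prefix beam search with language-model fusion.

    probs:      T x (len(alphabet) + 1) per-frame softmax outputs; the last
                column is assumed to be the blank symbol.
    alphabet:   output characters, e.g. "abcdefghijklmnopqrstuvwxyz '".
    lm_prob:    hypothetical function returning the LM probability of the
                words in a prefix; queried when a space closes off a word.
    alpha:      language-model weight (illustrative value).
    beam_width: k, the number of prefixes kept after every time step.
    """
    blank = len(alphabet)
    # p_b[prefix]: probability mass of paths for this prefix ending in blank.
    # p_nb[prefix]: probability mass of paths ending in a non-blank character.
    p_b = collections.defaultdict(float)
    p_nb = collections.defaultdict(float)
    p_b[""] = 1.0
    beam = [""]

    for t in range(len(probs)):
        next_b = collections.defaultdict(float)
        next_nb = collections.defaultdict(float)
        for prefix in beam:
            total = p_b[prefix] + p_nb[prefix]
            # Emitting a blank leaves the prefix unchanged, but its probability
            # still has to be summed into that prefix.
            next_b[prefix] += probs[t][blank] * total
            for i, c in enumerate(alphabet):
                if prefix and c == prefix[-1]:
                    # A repeated character only extends the prefix if the
                    # previous path ended in a blank; otherwise it collapses.
                    next_nb[prefix + c] += probs[t][i] * p_b[prefix]
                    next_nb[prefix] += probs[t][i] * p_nb[prefix]
                elif c == " ":
                    # A space marks the end of a word, so fold in the LM here.
                    next_nb[prefix + c] += (lm_prob(prefix) ** alpha) * probs[t][i] * total
                else:
                    next_nb[prefix + c] += probs[t][i] * total
        # Keep only the k most probable prefixes and continue.
        p_b, p_nb = next_b, next_nb
        beam = sorted(set(p_b) | set(p_nb),
                      key=lambda p: p_b[p] + p_nb[p],
                      reverse=True)[:beam_width]

    return beam[0]
```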
00:58:42.860 | So here's what ends up happening.
00:58:44.620 | If you run this decoding algorithm, if you just run it on the RNN outputs, you'll see
00:58:49.720 | that you get actually better than straight max decoding.
00:58:53.680 | You find slightly better solutions.
00:58:55.600 | But you still make things like spelling errors, like Boston with an I.
00:59:00.560 | But once you add in a language model that can actually tell you that the word Boston
00:59:05.000 | with an O is much more probable than Boston with an I, those spelling errors get fixed.
00:59:14.120 | So one place that you can also drop in deep learning that I wanted to mention very rapidly
00:59:18.100 | is just if you're not happy with your N-gram model, because it doesn't have enough context,
00:59:22.760 | or you've seen a really amazing neural language modeling paper that you'd like to fold in,
00:59:29.080 | one really easy way to do this and link it to your current pipeline is to do rescoring.
00:59:35.080 | So when this decoding strategy finishes, it can give you the most probable transcription,
00:59:40.800 | but it also gives you this big list of the top k transcriptions in terms of probability.
00:59:48.560 | And what you can do is take your recurrent network and just rescore all of these, basically
01:00:00.320 | reorder them according to this new model.
01:00:03.940 | So in the instance of a neural language model, let's say that this is my N best list.
01:00:09.760 | I have five candidates that were output by my decoding strategy.
01:00:15.840 | And the first one is I'm a connoisseur looking for wine and pork chops.
01:00:19.320 | Sounds good to me.
01:00:21.120 | I'm a connoisseur looking for wine and pork chops.
01:00:25.400 | So this is actually quite subtle.
01:00:28.000 | And depending on what kind of connoisseur you are, it's sort of up to interpretation
01:00:34.040 | what you're looking for.
01:00:35.400 | But perhaps a neural language model is going to be a little bit better at figuring out
01:00:38.840 | that wine and pork are closely related.
01:00:40.920 | And if you're a connoisseur, you might be looking for wine and pork chops.
01:00:45.000 | And so what you would hope to happen is that a neural language model trained on a bunch
01:00:48.960 | of text is going to correctly reorder these things and figure out that the second beam
01:00:56.080 | candidate is actually the correct one, even though your N-gram model didn't help you.
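A minimal sketch of that rescoring step, assuming the decoder hands back an n-best list of (transcription, log score) pairs and that neural_lm_logprob is a hypothetical wrapper around whatever neural language model you want to fold in; the interpolation weight is illustrative:

```python
def rescore_nbest(nbest, neural_lm_logprob, weight=0.5):
    """Reorder an n-best list from the decoder with a neural language model.

    nbest:              list of (transcription, decoder_log_score) pairs.
    neural_lm_logprob:  hypothetical function giving the neural LM's
                        log-probability of a transcription string.
    weight:             illustrative interpolation weight, tuned on dev data.
    """
    rescored = [(y, score + weight * neural_lm_logprob(y)) for y, score in nbest]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[0][0]  # the best candidate after reordering
```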
01:01:03.520 | So that is really the scale model.
01:01:07.160 | That is the set of concepts that you need to get a working speech recognition engine
01:01:14.200 | based on deep learning.
01:01:16.240 | And so the thing that's left to go to state of the art performance and start serving users
01:01:21.320 | is scale.
01:01:22.640 | So I'm going to kind of run through quickly a bunch of the different tactics that you
01:01:28.200 | can use to try to get there.
01:01:30.720 | So the two pieces of scale that I want to cover, of course, are data and computing power.
01:01:35.320 | Where do you get them?
01:01:37.920 | So the first thing to know, this is just a number you can keep in the back of your head
01:01:41.080 | for all purposes, which is that transcribing speech data is not cheap, but it's also not
01:01:45.880 | prohibitive.
01:01:46.880 | It's about 50 cents to a dollar a minute, depending on the quality you want and who's
01:01:50.460 | transcribing it and the difficulty of the data.
01:01:54.320 | So typical speech benchmarks you'll see out there are maybe hundreds to thousands of hours.
01:01:59.920 | So like the LibriSpeech dataset is maybe hundreds of hours.
01:02:04.920 | There's another data set called VoxForge, and you can kind of cobble these together
01:02:08.240 | and get maybe hundreds to thousands of hours.
01:02:11.120 | But the real challenge is that the application matters a lot.
01:02:16.080 | So all the utterances I was playing for you are examples of read speech.
01:02:21.760 | People are sitting in a nice quiet room, they're reading something wonderful to me, and so
01:02:25.880 | I'm going to end up with a speech engine that's really awesome at listening to the Wall Street
01:02:29.680 | Journal, but maybe not so good at listening to someone in a crowded cafe.
01:02:35.600 | So the application that you want to target really needs to match your data set.
01:02:41.000 | And so it's worth, at the outset, if you're thinking about going and buying a bunch of
01:02:44.160 | speech data, to think of what is the style of speech you're actually targeting.
01:02:49.160 | Are you worried about read speech, like the ones we're hearing, or do you care about conversational
01:02:53.520 | speech?
01:02:55.000 | It turns out that when people talk in a conversation, when they're spontaneous, they're just coming
01:03:00.140 | up with what to say on the fly versus if they have something that they're just dictating
01:03:04.640 | and they already know what to say, they behave differently.
01:03:07.880 | And they can exhibit all of these effects like disfluency and stuttering.
01:03:14.000 | And then in addition to that, we have all kinds of environmental factors that might
01:03:17.040 | matter for an application, like reverb and echo.
01:03:20.040 | We start to care about the quality of microphones and whether they have noise canceling.
01:03:24.880 | There's something called Lombard effect that I'll mention again in a second, and of course
01:03:28.680 | things like speaker accents, where you really have to think carefully about how you collect
01:03:32.680 | your data to make sure that you actually represent the kinds of cases you want to test on.
01:03:39.600 | So the reason that read speech is really popular is because we can get a lot of it.
01:03:44.600 | And even if it doesn't perfectly match your application, it's cheap and getting a lot
01:03:49.480 | of it can still help you.
01:03:51.340 | So I wanted to say a few things about read speech, because for less than $10 an hour,
01:03:55.560 | often a lot less, you can get a whole bunch of data.
01:03:58.400 | And it has the disadvantage that you lose a lot of things like inflection and conversationality,
01:04:06.560 | but it can still be helpful.
01:04:08.360 | So one of the things that we've tried doing, and I'm always interested to hear more clever
01:04:13.920 | schemes for this, is you can kind of engineer the way that people read to try to get the
01:04:18.880 | effects that you want.
01:04:21.520 | So here's one, which is that if you want a little bit more conversationality, you want
01:04:26.000 | to get people out of that kind of humdrum dictation, you can start giving them reading
01:04:30.120 | material that's a little more exciting.
01:04:31.960 | You can give them movie scripts and books, and people will actually start voice acting
01:04:36.000 | for you.
01:04:37.000 | >> Creep in, said the witch, and see if it is properly heated so that we can put the
01:04:42.960 | bread in.
01:04:46.520 | >> So these are really wonderful workers; right?
01:04:48.600 | They're kind of really getting into it to give you better data.
01:04:59.320 | >> The wolf is dead.
01:05:00.600 | The wolf is dead and danced for joy around about the well with their mother.
01:05:08.600 | >> So you have people reading poetry.
01:05:10.200 | They get this sort of lyrical quality into it that you don't get from just reading the
01:05:14.120 | Wall Street Journal.
01:05:16.080 | And finally, there's something called the Lombard effect that happens when people are
01:05:20.400 | in noisy environments.
01:05:22.200 | So if you're in a noisy party and you're trying to talk to your friend who's a couple of chairs
01:05:26.640 | away, you'll catch yourself involuntarily going, "Hey, over there, what are you doing?"
01:05:31.760 | You raise your inflection, and you kind of -- you try to use different tactics to get
01:05:37.080 | your signal-to-noise ratio up.
01:05:39.200 | You'll sort of work around the channel problem.
01:05:43.040 | And so this is very problematic when you're trying to do transcription in a noisy environment
01:05:47.600 | because people will talk to their phones using all these effects, even though the noise canceling
01:05:52.120 | and everything could actually help them.
01:05:54.620 | So one strategy we've tried with varying levels of success --
01:05:57.760 | >> Then they fell asleep and evening passed, but no one came to the poor children.
01:06:02.920 | >> -- is to actually play loud noise in people's headphones to try to get them to elicit this
01:06:08.840 | behavior.
01:06:09.840 | So this person is kind of raising their voice a little bit in a way that they wouldn't if
01:06:14.160 | they were just reading.
01:06:17.080 | And similarly, as I mentioned, there are a whole bunch of different augmentation strategies.
01:06:23.000 | So there are all these effects of environment, like reverberation, echo, background noise,
01:06:28.640 | that we would like our speech engine to be robust to.
01:06:31.840 | And one way you could go about trying to solve this is to go collect a bunch of audio from
01:06:36.000 | those cases and then transcribe it, but getting that raw audio is really expensive.
01:06:41.440 | So instead, an alternative is to take the really cheap read speech that's very clean
01:06:46.560 | and use some, like, off-the-shelf open-source audio toolkit to synthesize all the things
01:06:55.360 | you want to be robust to.
01:06:57.800 | So for example, if we want to simulate noise in a cafe, here's just me talking to my laptop
01:07:05.240 | in a quiet room.
01:07:07.440 | Hello, how are you?
01:07:12.480 | So I'm just asking, how are you?
01:07:14.120 | And then here's the sound of a cafe.
01:07:19.200 | So I can obviously collect these independently, very cheaply.
01:07:22.840 | Then I can synthesize this by just adding these signals together.
01:07:25.800 | Hello, how are you?
01:07:28.000 | Which actually sounds, I don't know, sounds to me like me talking to my laptop at a Starbucks
01:07:32.440 | or something.
01:07:34.320 | And so for our work on Deep Speech, we actually take something like 10,000 hours of raw audio
01:07:39.640 | that sounds kind of like this, and then we pile on lots and lots of audio tracks from
01:07:45.160 | Creative Commons videos.
01:07:47.520 | It turns out there's a strange thing.
01:07:49.160 | People upload, like, noise tracks to the web that last for hours.
01:07:53.520 | It's, like, really soothing to listen to the highway or something.
01:07:57.960 | And so you can download all this free found data, and you can just overlay it on this
01:08:02.920 | voice, and you can synthesize perhaps hundreds of thousands of hours of unique audio.
01:08:07.520 | And so the idea here is that it's just much easier to engineer your data pipeline to be
01:08:15.320 | robust than it is to engineer the speech engine itself to be robust.
01:08:20.200 | So whenever you encounter an environment that you've never seen before and your speech engine
01:08:23.720 | is breaking down, you should shift your instinct away from trying to engineer the engine to
01:08:29.000 | fix it and toward this idea of how do I reproduce it really cheaply in my data.
01:08:35.320 | So here's that Wall Street Journal example again.
01:08:37.320 | >> A tanker is a ship designed to carry large volumes of oil or other liquid cargo.
01:08:43.080 | >> And so if I wanted to, for instance, deal with a person reading Wall Street Journal
01:08:47.960 | on a tanker, maybe something like this.
01:08:50.280 | >> A tanker is a ship designed to carry large volumes of oil or other liquid cargo.
01:08:54.640 | >> There's lots of reverb in this room, so you can't hear the reverb on the audio.
01:08:58.720 | But basically, you can synthesize these things with one line of SoX on the command line.
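A minimal sketch of that overlaying step, assuming mono WAV files at the same sample rate and using the soundfile package for I/O; the rough SoX equivalent would be something like `sox -m clean.wav noise.wav noisy.wav`, which mixes the two files:

```python
import numpy as np
import soundfile as sf  # assumed available for WAV I/O

def add_noise(clean_path, noise_path, out_path, snr_db=10.0):
    """Overlay a found noise track on a clean utterance at a rough target SNR."""
    speech, rate = sf.read(clean_path)
    noise, _ = sf.read(noise_path)
    # Tile or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    # Scale the noise to hit the requested signal-to-noise ratio.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    sf.write(out_path, speech + scale * noise, rate)
```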
01:09:05.300 | So from some of our own work with building a large-scale speech engine with these technologies,
01:09:11.580 | this helps a ton.
01:09:13.200 | And you can actually see that when we run on clean and noisy test utterances, as we
01:09:20.960 | add more and more data all the way up to about 10,000 hours and using a lot of these synthesis
01:09:27.800 | strategies, we can just steadily improve the performance of the engine.
01:09:32.080 | And in fact, on things like clean speech, you can get down well below 10% word error
01:09:37.640 | rate, which is a pretty strong engine.
01:09:42.480 | Okay.
01:09:44.160 | Let's talk about computation.
01:09:46.000 | Because the caveat on that last slide is, yes, more data will help if you have a big
01:09:51.480 | enough model.
01:09:52.640 | And big models usually mean lots of computation.
01:09:56.800 | So what I haven't talked about is how big are these neural networks and how big is one
01:10:01.040 | experiment.
01:10:02.040 | So if you actually want to train one of these things at scale, what are you in for?
01:10:06.000 | So here's the back of the envelope.
01:10:08.200 | It's going to take at least the number of connections in your neural network.
01:10:13.040 | So take one slice of that RNN, the number of unique connections, multiplied by the number
01:10:18.760 | of frames once you unroll the recurrent network, once you unfold it, multiplied by the number
01:10:24.000 | of utterances you've got to process in your dataset, times the number of training epochs,
01:10:29.160 | the number of times you loop through the dataset, times 3, because you have to do forward prop,
01:10:34.080 | backward prop, and then a gradient update.
01:10:35.880 | It's about a factor of 3 increase.
01:10:38.400 | And then 2 flops for every connection, because there's a multiply and an add.
01:10:43.120 | So if you multiply this out for some parameters from the Deep Speech engine at Baidu, you
01:10:47.960 | get something like 1.2 times 10 to the 19 flops.
01:10:52.080 | It's about 10 exaflops.
01:10:54.840 | And if you run this on a Titan X card, this will take about a month.
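Here is that back-of-the-envelope calculation written out; every number below is an illustrative assumption rather than the exact Deep Speech configuration, but the orders of magnitude land in the same place:

```python
connections = 100e6   # unique connections in one time slice of the RNN (assumed)
frames_per_utt = 350  # unrolled steps per utterance, ~7 s at 20 ms per frame (assumed)
utterances = 5e6      # roughly 10,000 hours of audio in ~7 s utterances (assumed)
epochs = 10           # passes over the dataset (assumed)

# 3x for forward prop, backward prop, and the gradient update;
# 2 flops per connection for the multiply and the add.
flops = connections * frames_per_utt * utterances * epochs * 3 * 2
print(f"total work: {flops:.1e} flops")           # on the order of 10^19

sustained_gpu_flops = 5e12                         # optimistic sustained rate for one card
print(f"about {flops / sustained_gpu_flops / 86400:.0f} days on one GPU")
```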
01:10:59.800 | Now if you already know what the model is, that might be tolerable.
01:11:04.100 | If you're on your epic run to get your best performance so far, then this is OK.
01:11:09.380 | But if you don't know what model's going to work, you're targeting some new scenario,
01:11:12.840 | then you want it done now so that you can try lots and lots of models quickly.
01:11:17.400 | So the easy fix is just to try using a bunch more GPUs with data parallelism.
01:11:23.700 | And the good news is that so far, it looks like speech recognition allows us to use large
01:11:29.440 | mini-batch sizes.
01:11:30.440 | We can process enough utterances in parallel that this is actually efficient.
01:11:35.080 | So you'd like to keep maybe a bit more than 64 utterances on each GPU, and up to a total
01:11:41.640 | mini-batch size of like 1,000 or maybe 2,000 is still useful.
01:11:47.140 | And so if you're putting together your infrastructure, you can go out and you can buy a server that'll
01:11:53.600 | fit eight of these Titan GPUs in them, and that'll actually get you to less than a week
01:11:57.600 | training time, which is pretty respectable.
01:12:01.480 | So there are a whole bunch of ways to use GPUs.
01:12:03.960 | At Baidu, we've been using synchronous SGD.
01:12:07.360 | It turns out that you've got to optimize things like all-reduce code.
01:12:10.920 | Once you leave one node, you have to start worrying about your network.
01:12:15.520 | And if you want to keep scaling, then thinking about things like network traffic and the
01:12:19.920 | right strategy for moving all of your data becomes important.
01:12:24.680 | But we've had success scaling really well all the way out to things like 64 GPUs and
01:12:30.820 | just getting linear speedups all of the way.
01:12:33.140 | So if you've got a big cluster available, these things scale really well.
01:12:38.120 | And there are a bunch of other solutions.
01:12:39.440 | For instance, asynchronous SGD is now kind of a mainstay of distributed deep learning.
01:12:44.980 | There's also been some work recently of trying to go back to synchronous SGD that has a lot
01:12:48.640 | of nice properties, but using things like backup workers.
01:12:53.600 | So that's sort of the easy thing.
01:12:56.640 | Just throw more GPUs at it and go faster.
01:12:59.360 | One word of warning as you're trying to build these systems is to watch for code that isn't
01:13:06.660 | as optimized as you expected it to be.
01:13:10.240 | And so this back-of-the-envelope calculation that we did of figuring out how many flops
01:13:15.600 | are involved in our network and then calculating how long it would take to run if our GPU were
01:13:21.800 | running at full efficiency, you should actually do this for your network.
01:13:26.000 | We call this the speed of light.
01:13:27.720 | This is the fastest your code could ever run on one GPU.
01:13:31.580 | And if you find that you're just drastically underperforming that number, what could be
01:13:36.840 | happening to you is that you've hit a little edge case in one of the libraries that you're
01:13:41.860 | using and you're actually suffering a huge setback that you don't need to be feeling
01:13:45.880 | right now.
01:13:46.880 | So one of the things we found back in November is that in libraries like cuBLAS, you can
01:13:51.620 | actually use mini-batch sizes that hit these weird catastrophic cases in the library, where
01:13:57.680 | you could be suffering like a factor of two or three performance reduction.
01:14:02.580 | So that might take your wonderful one-week training time and blow it up to, say, a three-week
01:14:07.460 | training time.
01:14:09.420 | So that's why I wanted to go through this and ask you to keep in mind while you're training
01:14:14.100 | these things, try to figure out how long it ought to be taking.
01:14:17.900 | And if it's going a lot slower, be suspicious that there's some code you could be optimizing.
01:14:24.260 | Another good trick that's particular to speech, though you can also use it for other recurrent
01:14:29.980 | models, is to try to keep similar-length utterances together.
01:14:34.900 | So if you look at your dataset, like a lot of things, you have this sort of distribution
01:14:40.700 | over possible utterance lengths.
01:14:43.400 | And so you see there's a whole bunch that are, you know, maybe within about 50% of each
01:14:48.380 | other, but there's also a large number of utterances that are very short.
01:14:52.740 | And so what happens is when we want to process a whole bunch of these utterances in parallel,
01:14:58.580 | if we just randomly select, say, 1,000 utterances to go into a mini-batch, there's a high probability
01:15:05.620 | that we're going to get a whole bunch of these little short utterances along with some really
01:15:09.800 | long utterances.
01:15:11.560 | And in order to make all the CTC libraries work and all of our recurrent network computations
01:15:16.060 | easy, what we have to do is pad these audio signals with zero.
01:15:20.300 | And that winds up meaning that we're wasting huge amounts of computation, maybe a factor
01:15:24.300 | of two or more.
01:15:26.560 | And so one way to get around it is just sort all of your utterances by length and then
01:15:32.060 | try to keep the mini-batches to be similar lengths so that you just don't end up with
01:15:36.660 | quite as much waste in each mini-batch.
01:15:39.940 | And this kind of modifies your algorithm a little bit, but in the end is worthwhile.
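A minimal sketch of that bucketing trick; the utterance list and batch size are whatever your pipeline uses, and shuffling at the batch level (rather than the utterance level) keeps the training order random without re-mixing lengths:

```python
import random

def length_sorted_minibatches(utterances, batch_size):
    """Group similar-length utterances to cut down on zero-padding waste.

    utterances: list of (audio, transcript) pairs; len(audio) drives padding.
    """
    ordered = sorted(utterances, key=lambda u: len(u[0]))
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    # Shuffle whole batches so the model still sees data in random order,
    # while each individual batch stays roughly length-homogeneous.
    random.shuffle(batches)
    return batches
```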
01:15:46.140 | All right.
01:15:47.140 | So that's kind of all I want to say about computation.
01:15:50.740 | If you've got a few GPUs, keep an eye on your running time so that you know what to optimize
01:15:56.800 | and pay attention to the easy wins, like keeping your utterances together.
01:16:00.660 | You can actually scale really well.
01:16:03.080 | And I think for a lot of the jobs we see, you can have your GPU running at something
01:16:08.820 | like 50% of the peak.
01:16:11.200 | And that's all in.
01:16:12.200 | With network time, with all the bandwidth-bound stuff, you can actually run at two to three
01:16:16.460 | teraflops on a GPU that can only do five teraflops in the perfect case.
01:16:22.860 | So what can you actually do with this?
01:16:25.940 | One of my favorite results from one of our largest models is actually in Mandarin.
01:16:30.220 | So we have a whole bunch of labeled Mandarin data at Baidu.
01:16:34.160 | And so one of the things that we did was we scaled up this model, trained it on a huge
01:16:37.720 | amount of Mandarin data, and then, as we always do, we sit down and we do error analysis.
01:16:44.200 | And what we would do is have a whole bunch of humans sitting around, debating the
01:16:50.560 | transcriptions and figuring out ground-truth labels that tend to be very high quality.
01:16:55.100 | And then we'd go and we'd run now a sort of holdout test on some new people and on the
01:16:59.940 | speech engine itself.
01:17:01.740 | And so if you benchmark a single human being against this deep speech engine in Mandarin
01:17:08.480 | that's powered by all the technologies we were just talking about, it turns out that
01:17:13.540 | the speech engine can get an error rate that's down below 6% character error rate.
01:17:18.820 | So only about 6% of the characters are wrong.
01:17:21.500 | And a single human sitting there listening to these transcriptions actually does quite
01:17:25.240 | a bit worse.
01:17:26.240 | It gets almost 10%.
01:17:29.380 | If you give people a bit of an advantage, which is you now assemble a committee of people
01:17:36.300 | and you get them a fresh test set so that no one has seen it before and we run this
01:17:40.220 | test again, it turns out that the two engines, or that the two cases are actually really
01:17:46.260 | similar.
01:17:47.260 | And you can end up with a committee of native Mandarin speakers sitting around debating,
01:17:50.620 | "No, no, I think this person said this," or "No, they have an accent.
01:17:54.380 | It's from the north.
01:17:55.380 | I think they're actually saying that."
01:17:57.700 | And then when you show them the deep speech transcription, they actually go, "Oh, that's
01:18:02.140 | what it was."
01:18:04.060 | And so you can actually get this technology up to a point where it's highly competitive
01:18:09.540 | with human beings, even human beings working together.
01:18:12.620 | And this is sort of where I think all of the speech recognition systems are heading, thanks
01:18:17.220 | to deep learning and the technologies that we're talking about here.
01:18:22.080 | Any questions so far?
01:18:24.100 | Yeah, go ahead.
01:18:26.540 | So how do you know the actual label of the data?
01:18:34.540 | Sorry?
01:18:35.540 | Repeat the question.
01:18:36.540 | Yeah.
01:18:37.540 | So the question is, if humans have such a hard time coming up with the correct transcription,
01:18:38.660 | how do you know what the truth is?
01:18:40.500 | And the real answer is you don't really.
01:18:43.720 | Sometimes you might have a little bit of user feedback, but in this instance, we have very
01:18:48.140 | high-quality transcriptions that are coming from many labelers teamed up with a speech
01:18:52.580 | engine.
01:18:54.260 | And so that could be wrong.
01:18:56.540 | We do occasionally find errors where we just think that's a label error.
01:19:00.660 | But when you have a committee of humans around, the really astonishing thing is that you can
01:19:05.100 | look at the output of the speech engines, and the humans will suddenly jump ship and
01:19:09.980 | say, oh, no, no, no, no.
01:19:11.340 | The speech engine is actually correct, because it'll often come up with an obscure word or
01:19:15.380 | place that they weren't aware of.
01:19:18.860 | Once they see the label, can they be biased towards that label?
01:19:23.300 | Yeah.
01:19:24.300 | So this is an inherently ambiguous result.
01:19:26.980 | But let's say that a committee of human beings tend to disagree with another committee of
01:19:31.980 | human beings about the same amount as a speech engine does.
01:19:35.660 | Yeah.
01:19:36.660 | So this is basically doing a sequence-to-sequence sort of task, right?
01:19:42.660 | So we're going to hear about a really different approach to that later.
01:19:49.660 | Can you say anything about the -- Yeah.
01:19:50.660 | So this is using the CTC cost, right?
01:19:53.780 | That's really the core component of this system.
01:19:56.260 | It's how you deal with mapping one variable-length sequence to another.
01:20:01.140 | Now, the CTC cost is not perfect; it has this assumption of independence baked into the
01:20:06.660 | probabilistic model.
01:20:08.660 | And because of that assumption, we're introducing some bias into the system.
01:20:12.900 | And for languages like English, where the characters are obviously not independent of
01:20:17.700 | each other, this might be a limitation.
01:20:20.660 | In practice, the thing that we see is that as you add a lot of data and your model gets
01:20:25.020 | much more powerful, you can still find your way around it, but it might take more data
01:20:29.940 | and a bigger model than necessary.
01:20:32.500 | And of course, we hope that all the new state-of-the-art methods coming out of the deep learning community
01:20:36.380 | are going to give us an even better solution.
01:20:38.780 | Okay.
01:20:39.780 | Go ahead.
01:20:40.780 | In the spectrogram, you're saying that there's a 20-millisecond sample that you take.
01:20:51.780 | Is there a reason for that choice, or could you have a bigger or smaller window?
01:20:52.780 | Empirically determined.
01:20:53.780 | Yeah.
01:20:54.780 | So the question is, for a spectrogram with -- We talked about these little spectrogram
01:20:58.380 | frames being computed from 20 milliseconds of audio.
01:21:01.180 | And is that number special?
01:21:02.180 | Is there a reason for it?
01:21:05.180 | So this is really determined from years and years of experience.
01:21:08.380 | This is captured from the traditional speech community.
01:21:12.180 | We know this works pretty well.
01:21:14.020 | There's actually some fun things you can do.
01:21:15.980 | You can take a spectrogram, go back and find the best audio that corresponds to that spectrogram
01:21:22.860 | to listen to it and see if you lost anything.
01:21:25.700 | And spectrograms of about this level of quantization, you can kind of tell what people are saying.
01:21:30.860 | It's a little bit garbled, but it's still actually pretty good.
01:21:34.580 | So amongst all the hyperparameters you could choose, this one's kind of a good tradeoff
01:21:38.420 | in keeping the information, but also saving a little bit of the phase by doing it frequently.
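For reference, here is a minimal sketch of computing those 20-millisecond spectrogram frames with adjacent, non-overlapping windows; the sample rate and window length are illustrative assumptions:

```python
import numpy as np

def spectrogram(signal, sample_rate=16000, frame_ms=20):
    """Magnitude spectrogram from raw audio using adjacent 20 ms windows."""
    frame_len = int(sample_rate * frame_ms / 1000)        # samples per frame
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    window = np.hanning(frame_len)                        # taper against spectral leakage
    return np.abs(np.fft.rfft(frames * window, axis=1))   # one spectrum per frame
```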
01:21:44.300 | Yeah.
01:21:45.300 | Are you doing overlapping things?
01:21:50.220 | I think in a lot of the models in the demo, for example, we don't use overlapping windows.
01:21:55.660 | They're just adjacent.
01:21:56.660 | Yeah.
01:21:57.660 | You mentioned you get linear scale-up across GPUs.
01:21:58.660 | What do you use to do that, and does it really matter in practice?
01:22:00.660 | Yeah.
01:22:01.660 | So those results are from in-house software at Baidu.
01:22:11.540 | If you use something like OpenMPI, for example, on a cluster of GPUs, it actually works pretty
01:22:17.700 | well on a bunch of machines.
01:22:21.540 | But I think some of the all-reduce algorithms, once you start moving huge amounts of data,
01:22:27.500 | are not optimal.
01:22:28.580 | You'll suffer a hit once you start going to that many GPUs.
01:22:32.900 | Within a single box, if you use the CUDA libraries to move data back and forth just on a local
01:22:40.660 | box, that stuff is pretty well optimized, and you can often do it yourself.
01:22:46.020 | Okay.
01:22:47.020 | So I want to take a few more questions at the end, and maybe we can run into the break
01:22:51.100 | a little bit.
01:22:52.100 | I wanted to just dive right through a few comments about production here.
01:22:57.940 | So of course, the ultimate goal of solving speech recognition is to improve people's
01:23:05.860 | lives and enable exciting products, and so that means even though so far we've trained
01:23:11.260 | a bunch of acoustic and language models, we also want to get these things in production.
01:23:16.340 | And users tend to care about more than just accuracy.
01:23:19.700 | Accuracy of course matters a lot, but we also care about things like latency.
01:23:23.740 | Users want to see the engine send them some feedback very quickly so that they know that
01:23:27.900 | it's responding and that it's understanding what they're saying.
01:23:31.460 | And we also need this to be economical so that we can serve lots of users without breaking
01:23:35.620 | the bank.
01:23:37.220 | So in practice, a lot of the neural networks that we use in research papers, because they're
01:23:41.140 | awesome for beating benchmark results, turn out not to work that well on a production
01:23:46.260 | engine.
01:23:47.260 | So one in particular that I think is worth keeping an eye on is that it's really common
01:23:52.740 | to use bidirectional recurrent neural networks.
01:23:55.900 | And so throughout the talk, I've been drawing my RNN with connections that just go forward
01:24:00.420 | in time, but you'll see a lot of research results that also have a path that goes backward
01:24:05.780 | in time.
01:24:07.140 | And this works fine if you just want to process data offline.
01:24:11.500 | But the problem is that if I want to compute this neuron's output up at the top of my network,
01:24:16.580 | I have to wait until I see the entire audio segment so that I can compute this backward
01:24:21.620 | recurrence and get this response.
01:24:24.660 | So this sort of anti-causal part of my neural network that gets to see the future means
01:24:29.900 | that I can't respond to a user on the fly because I need to wait for the end of their
01:24:34.260 | signal.
01:24:36.380 | So if you start out with these bidirectional RNNs that are actually much easier to get
01:24:41.700 | working and then you jump to using a recurrent network that is forward only, it'll turn out
01:24:47.660 | that you're going to lose some accuracy.
01:24:50.420 | And you might kind of hope that CTC, because it doesn't care about the alignment, would
01:24:55.060 | somehow magically learn to shift the output over to get better accuracy and just artificially
01:25:01.620 | delay the response so that it could get more context on its own.
01:25:05.620 | But it kind of turns out to only do that a little bit in practice.
01:25:09.900 | It's really tough to control it.
01:25:11.500 | And so if you find that you're doing much worse, sometimes you have to sort of engage
01:25:15.620 | in model engineering.
01:25:17.620 | So even though I've been talking about these recurrent networks, I want you to bear in
01:25:20.620 | mind that there's this dual optimization going on.
01:25:24.940 | You want to find a model structure that gives you really good accuracy, but you also have
01:25:28.780 | to think carefully about how you set up the structure so that this little neuron at the
01:25:33.460 | top can actually see enough context to get an accurate answer and not depend too much
01:25:39.620 | on the future.
01:25:41.420 | So for example, what we could do is tweak this model so that this neuron at the top
01:25:46.660 | that's trying to output the character L in hello can see some future frames, but it doesn't
01:25:53.420 | have this backward recurrence.
01:25:55.040 | So it only gets to see a little bit of context.
01:25:57.660 | That lets us kind of contain the amount of latency in the model.
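A minimal sketch of that idea, in the spirit of a lookahead layer: each output frame gets a small, learned window of future frames instead of a full backward recurrence, so latency stays bounded. The weights here are random placeholders; in a real model they would be trained along with everything else:

```python
import numpy as np

def lookahead(features, context=2, weights=None):
    """Let each time step see a few future frames without a backward recurrence.

    features: T x D array of forward-RNN outputs.
    context:  number of future frames each output may use; this bounds latency.
    weights:  (context + 1) x D per-offset weights; random here for illustration.
    """
    T, D = features.shape
    if weights is None:
        weights = 0.1 * np.random.randn(context + 1, D)
    padded = np.vstack([features, np.zeros((context, D))])   # pad the future edge
    out = np.zeros_like(features)
    for tau in range(context + 1):
        out += weights[tau] * padded[tau:tau + T]             # frame t sees frames t..t+context
    return out
```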
01:26:02.380 | I'm going to skip over this.
01:26:04.860 | So in terms of other online aspects, of course, we want this to be efficient.
01:26:12.380 | We want to serve lots of users on a small number of machines if possible.
01:26:17.260 | And one of the things that you might find if you have a really big deep neural network
01:26:22.020 | or recurrent neural network is that it's really hard to deploy them on conventional CPUs.
01:26:27.300 | CPUs are awesome for serial jobs.
01:26:30.940 | You just want to go as fast as you can for this one string of instructions.
01:26:35.660 | But as we've discovered with so much of deep learning, GPUs are really fantastic because
01:26:40.420 | when we work with neural networks, we love processing lots and lots of arithmetic in
01:26:44.580 | parallel.
01:26:46.080 | But it's really only efficient if the batch that we're working on, the hunks of audio
01:26:50.300 | that we're working on, are in a big enough batch.
01:26:56.280 | So if we just process one stream of audio so that my GPU is multiplying matrices times
01:27:00.900 | vectors, then my GPU is going to be really inefficient.
01:27:05.300 | So for example, on like a K1200 GPU, so something you could put in a server in the cloud, what
01:27:12.100 | you'll find is that you get really poor throughput considering the dollar value of this hardware
01:27:19.240 | if you're only processing one piece of audio at a time.
01:27:22.020 | Whereas if you could somehow batch up audio to have, say, 10 or 32 streams going at once,
01:27:28.480 | then you can actually squeeze out a lot more performance from that piece of hardware.
01:27:34.180 | So one of the things that we've been working on that works really well and is not too bad
01:27:39.200 | to implement is to just batch all of the packets as data comes in.
01:27:43.700 | So if I have a whole bunch of users talking to my server and they're sending me little
01:27:47.620 | hundred millisecond packets of audio, what I can do is I can sit and I can listen to
01:27:53.500 | all these users, and when I catch a whole batch of utterances coming in or a whole bunch
01:27:58.480 | of audio packets coming in from different people that start around the same time, I
01:28:03.580 | plug those all into my GPU and I process those matrix multiplications together.
01:28:08.780 | So instead of multiplying a matrix times only one little audio piece, I get to multiply
01:28:12.800 | it by a batch of, say, four audio pieces, and it's much more efficient.
01:28:18.420 | And if you actually do this on a live server and you plow a whole bunch of audio streams
01:28:23.140 | through it, you could support maybe 10, 20, 30 users in parallel, and as the load on that
01:28:29.300 | server goes up, I have more and more users piling on, what happens is that the GPU will
01:28:34.420 | naturally start batching up more and more packets into single matrix multiplications.
01:28:40.140 | So as you get more users, you actually get much more efficient as well.
01:28:45.660 | And so in practice, when you have a whole bunch of users on one machine, you usually
01:28:49.660 | don't see matrix multiplications happening with fewer than maybe batch sizes of four.
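A minimal sketch of that batching idea for a single layer: chunks that arrive from different users within the same short window are stacked into one matrix, so the GPU does a single matrix-matrix multiply instead of several matrix-vector multiplies. The shapes and the numpy stand-in for the GPU kernel are illustrative assumptions:

```python
import numpy as np

def batched_forward(weights, pending_chunks):
    """Process feature chunks from several live audio streams in one multiply.

    weights:        D_out x D_in layer weights.
    pending_chunks: list of length-D_in feature vectors, one per active user,
                    collected during the current batching window.
    """
    if not pending_chunks:
        return []
    batch = np.stack(pending_chunks, axis=1)    # D_in x B
    outputs = weights @ batch                   # one matrix-matrix multiply for all users
    return [outputs[:, i] for i in range(outputs.shape[1])]  # fan results back to streams
```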
01:28:56.660 | So the summary of all of this is that deep learning is really making the first steps
01:29:04.060 | to building a state-of-the-art speech engine easier than they've ever been.
01:29:07.220 | So if you want to build a new state-of-the-art speech engine for some new language, all of
01:29:11.420 | the components that you need are things that we've covered so far.
01:29:15.860 | And the performance now is really significantly driven by data and models, and I think, as
01:29:20.740 | we were discussing earlier, I think future models from deep learning are going to make
01:29:24.420 | that influence of data and computing power even stronger.
01:29:29.740 | And of course, data and compute is important so that we can try lots and lots of models
01:29:34.300 | and keep making progress.
01:29:36.460 | And I think this technology is now at a stage where it's not just a research system anymore.
01:29:42.420 | We're seeing that the end-to-end deep learning technologies are now mature enough that we
01:29:46.700 | can get them into production.
01:29:48.180 | I think you guys are going to be seeing deep learning play a bigger, bigger role in the
01:29:52.300 | speech engines that are powering all the devices that we use.
01:29:55.460 | So thank you very much.
01:29:56.460 | [APPLAUSE]
01:29:57.460 | So I think we're right at the end of time.
01:30:05.660 | [INAUDIBLE]
01:30:06.660 | Sounds good.
01:30:07.660 | All right, we had one in the back who was waiting patiently.
01:30:10.660 | Go ahead.
01:30:11.660 | [INAUDIBLE]
01:30:12.660 | More than one voice simultaneously?
01:30:19.780 | So the question is, how does the engine handle more than one voice simultaneously?
01:30:24.580 | So right now, there's nothing in this formalism that allows you to account for multiple speakers.
01:30:32.220 | And so usually, when you listen to an audio clip in practice, it's clear that there's
01:30:37.500 | one dominant speaker.
01:30:39.760 | And so this speech engine, of course, learns whatever it was taught from the labels.
01:30:44.540 | And it will try to filter out background speakers and just transcribe the dominant one.
01:30:49.580 | But if it's really ambiguous, then undefined results.
01:30:53.860 | Can you customize the transcription to the specific characteristics of a particular speaker?
01:31:04.260 | So we're not doing that in these pipelines right now.
01:31:08.780 | But of course, a lot of different strategies have been developed in the traditional speech
01:31:14.420 | literature.
01:31:15.420 | There are things like iVectors that try to quantify someone's voice.
01:31:18.420 | And those make useful features for improving speech engines.
01:31:21.980 | You could also imagine taking a lot of the concepts like embeddings, for example, and
01:31:26.540 | tossing them in here.
01:31:28.380 | So I think a lot of that is left open to future work.
01:31:31.740 | Adam, question.
01:31:35.100 | I think we have to break for time.
01:31:37.220 | But I'll step off stage here.
01:31:39.820 | And you guys can come to me with your questions.
01:31:41.540 | Thank you so much.
01:31:42.540 | [APPLAUSE]
01:31:43.540 | Thanks, Adam.
01:31:44.540 | So we'll reconvene at 2:45 for a presentation by Alex.