Deep Learning for Speech Recognition (Adam Coates, Baidu)
Chapters
0:00
0:15 Speech recognition
4:47 Traditional ASR pipeline
10:49 Deep Learning in ASR
14:59 Scale model
15:49 Outline
17:23 Raw audio
18:28 Pre-processing
19:46 Spectrogram
21:57 Acoustic Model
35:34 Connectionist Temporal Classification (CTC)
39:07 Training tricks
48:13 Max Decoding
52:19 Language models
58:43 Decoding with LMs: Examples
59:14 Rescoring
00:00:00.000 |
So I want to tell you guys about speech recognition and deep learning. 00:00:04.840 |
I think deep learning has been playing an increasingly large role in speech recognition. 00:00:10.360 |
And one of the things I think is most exciting about this field is that speech recognition 00:00:15.040 |
is at a place right now where it's becoming good enough to enable really exciting applications 00:00:23.800 |
So for example, if we want to caption video content and make it accessible to everyone, 00:00:28.680 |
it used to be that we would sort of try to do this, but you still need a human to get 00:00:32.960 |
really good captioning for something like a lecture. 00:00:36.480 |
But it's possible that we can do a lot of this with higher quality in the future with 00:00:40.920 |
We can do things like hands-free interfaces in cars, make it safer to use technology while 00:00:46.040 |
we're on the go and keep people's eyes on the road. 00:00:48.040 |
Of course, it would make mobile devices, home devices much easier, much more efficient and 00:00:56.040 |
But another actually sort of fun recent study that some folks at Baidu participated in, 00:01:02.680 |
along with Stanford and UW, was to show that for even something straightforward that we 00:01:06.640 |
sort of take for granted as an application of speech, which is just texting someone with 00:01:11.960 |
voice or writing a piece of text, the study showed that you can actually go three times 00:01:16.720 |
faster with voice recognition systems that are available today. 00:01:20.680 |
So it's not just like a little bit faster now, even with the errors that a speech recognition 00:01:28.800 |
And the reason I wanted to highlight this result, which is pretty recent, is that the 00:01:34.320 |
speech engine that was used for this study is actually powered by a lot of the deep learning 00:01:40.840 |
So hopefully when you walk away today, you have an appreciation or an understanding of 00:01:45.120 |
the sort of high-level ideas that make a result like this possible. 00:01:50.560 |
So there are a whole bunch of different components that make up a complete speech application. 00:01:57.280 |
So for example, there's speech transcription. 00:02:00.360 |
So if I just talk, I want to come up with words that represent whatever I just said. 00:02:07.740 |
There's also other tasks, though, like word spotting or triggering. 00:02:11.080 |
So for example, if my phone is sitting over there and I want to say, "Hey, phone, go do 00:02:14.560 |
something for me," it actually has to be listening continuously for me to say that word. 00:02:20.240 |
And likewise, there are things like speaker identification or verification, so that if 00:02:25.120 |
I want to authenticate myself or I want to be able to tell apart different users in a 00:02:28.920 |
room, I've got to be able to recognize your voice, even though I don't know what you're 00:02:36.720 |
Instead, I'm going to just focus on the bread and butter of speech recognition. 00:02:40.760 |
We're going to focus on building a speech engine that can accurately transcribe audio 00:02:48.880 |
This is a very basic goal of artificial intelligence. 00:02:53.680 |
Historically, people are very, very good at listening to someone talk, just like you guys 00:03:01.520 |
And you can very quickly turn audio into words and into meaning on your own, almost effortlessly. 00:03:10.280 |
And for machines, this has historically been incredibly hard. 00:03:13.480 |
So you think of this as like one of those sort of consummate AI tasks. 00:03:18.160 |
So the goal of building a speech pipeline is, if you just give me a raw audio wave, 00:03:23.040 |
like you recorded on your laptop or your cell phone, I want to somehow build a speech recognizer 00:03:28.360 |
that can do this very simple task of printing out "Hello, world" when I actually say "Hello, world." 00:03:35.040 |
So before I dig into the deep learning part, I want to step back a little bit and spend 00:03:42.320 |
maybe 10 minutes talking about how a traditional speech recognition pipeline is working, for 00:03:49.600 |
If you're out in the wild, you're doing an internship, you're trying to build a speech 00:03:54.800 |
recognition system with a lot of the tools that are out there, you're going to bump into 00:03:59.160 |
a lot of systems that are built on technologies that look like this. 00:04:02.880 |
So I want you to understand a little bit of the vocabulary and how those things are put 00:04:08.200 |
And also, this will sort of give you a story for what deep learning is doing in speech 00:04:13.440 |
recognition today that is kind of special and that I think paves the way for much bigger 00:04:22.900 |
So traditional systems break the problem of converting an audio wave, of taking audio 00:04:31.180 |
and turning it into a transcription, into a bunch of different pieces. 00:04:36.280 |
So I'm going to start out with my raw audio, and I'm just going to represent that by X. 00:04:43.160 |
And then usually we have to decide on some kind of feature representation. 00:04:47.280 |
We have to convert this into some other form that's easier to deal with than a raw audio wave. 00:04:54.160 |
And in a traditional speech system, I often have something called an acoustic model. 00:04:58.300 |
And the job of the acoustic model is to learn the relationship between these features that 00:05:04.320 |
represent my audio and the words that someone is trying to say. 00:05:10.040 |
And then I'll often have a language model, which encapsulates all of my knowledge about 00:05:14.520 |
what kinds of words, what spellings and what combinations of words are most likely in the language. 00:05:22.640 |
And once you have all of these pieces, so these might be -- these different models might 00:05:27.160 |
be driven by machine learning themselves, what you would need to build in a traditional system is something called a decoder. 00:05:34.240 |
And the job of a decoder, which itself might involve some modeling efforts and machine 00:05:39.160 |
learning algorithms, is to find the sequence of words W that maximizes this probability. 00:05:47.780 |
The probability of the particular sequence W, given your audio. 00:05:53.460 |
But that's equivalent to maximizing the product of the contributions from your acoustic model and your language model. 00:06:00.840 |
So a traditional speech system is broken down into these pieces, and a lot of the effort 00:06:05.180 |
in getting that system to work is in developing this sort of portion that combines them all. 00:06:12.880 |
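Written out in standard notation (my notation, not something taken verbatim from the slides), the decoder's search problem described above is:

```latex
\hat{W} \;=\; \arg\max_{W} \, P(W \mid X) \;=\; \arg\max_{W} \, P(X \mid W)\, P(W)
```

where the first factor is what the acoustic model (through the pronunciation lexicon) provides and the second comes from the language model.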
So it turns out that if you want to just directly transcribe audio, you can't just go straight from the audio to characters. 00:06:20.720 |
And the reason is, and it's especially apparent in English, that the way something is spelled 00:06:26.040 |
in characters doesn't always correspond well to the way that it sounds. 00:06:30.700 |
So if I give you the word "night," for example, without context, you don't really know whether 00:06:36.280 |
I'm talking about a knight in armor or whether I'm talking about night, like an evening. 00:06:41.520 |
And so a way to get around this, to abstract this problem away from a traditional system, 00:06:46.720 |
is to replace this with a sort of intermediate representation. 00:06:50.920 |
Instead of trying to predict characters, I'll just try to predict something called phonemes. 00:06:55.640 |
So as an example, if I want to represent the word "hello," what I might try to do is break it down into a sequence of phonemes. 00:07:04.580 |
So the first one is like the "h," that H sound in "hello," and then an "uh" sound, which 00:07:10.040 |
is actually only one possible pronunciation of an E, and then an L and an O sound. 00:07:16.120 |
And that would be my string that I try to come up with using all of my different speech 00:07:24.120 |
So this, in one sense, makes the modeling problem easier. 00:07:27.680 |
My acoustic model and so on can be simpler, because I don't have to worry about spelling. 00:07:33.080 |
But it does have this problem that I have to think about where these things come from. 00:07:38.180 |
So these phonemes are intuitively, they're the perceptually distinct units of sound that 00:07:51.680 |
This might be our imagination that these things actually exist. 00:07:58.880 |
There are a bunch of different conventions for how to define these. 00:08:05.320 |
And if you end up working on a system that uses phonemes, one popular data set is called TIMIT. 00:08:12.240 |
And so this actually has a corpus of audio frames with examples of each of these phonemes. 00:08:19.640 |
So once you have this phoneme representation, unfortunately, it adds even more complexity to the pipeline. 00:08:30.580 |
Because now, my acoustic model doesn't associate this audio feature with words. 00:08:35.720 |
It actually associates them with another kind of transcription, with the transcription into phonemes. 00:08:41.000 |
And so I have to introduce yet another component into my pipeline that tries to understand 00:08:46.680 |
how do I convert the transcriptions in phonemes into actual spellings. 00:08:51.760 |
And so I need some kind of dictionary or a lexicon to tell me all of that. 00:08:56.800 |
So this is a way of taking our knowledge about a language and baking it into this engineered 00:09:03.560 |
And then once you've got all that, again, all of your work now goes into this decoder 00:09:08.560 |
that has a slightly more complicated task in order to infer the most likely word transcription 00:09:22.480 |
You'll see a whole bunch of these systems out there. 00:09:25.660 |
And we're still using a lot of the vocabulary from these systems. 00:09:30.920 |
But traditionally, the big advantage is that it's very tweakable. 00:09:34.800 |
If you want to go add a new pronunciation for a word you've never heard before, you can just add it to your lexicon. 00:09:42.340 |
But it's also really hard to get working well. 00:09:44.920 |
If you start from scratch with this system and you have no experience in speech recognition, 00:09:50.360 |
it's actually quite confusing and hard to debug. 00:09:53.240 |
It's very difficult to know which of these various models is the one that's behind your errors. 00:09:59.000 |
And especially once we start dealing with things like accents, heavy noise, different 00:10:03.480 |
kinds of ambiguity, that makes the problem even harder to engineer around. 00:10:08.080 |
Because trying to think ourselves about how do I tweak my pronunciation model, for example, 00:10:13.540 |
to account for someone's accent that I haven't heard, that's a very hard engineering judgment to make. 00:10:20.180 |
So there are all kinds of design decisions that go into this pipeline, like choosing 00:10:28.100 |
So the first place that deep learning has started to make an impact in speech recognition, 00:10:35.260 |
starting a few years ago, is to just take one of the core machine learning components 00:10:40.780 |
of the system and replace it with a deep learning algorithm. 00:10:44.660 |
So I mentioned back in this previous pipeline that we had this little model here whose job 00:10:50.500 |
is to learn the relationship between a sequence of phonemes and the audio that we're hearing. 00:10:59.500 |
And there are lots of different methods for training this thing. 00:11:03.140 |
So take your favorite machine learning algorithm. 00:11:06.140 |
You can probably find someone who has trained an acoustic model with that algorithm, whether 00:11:09.740 |
it's a Gaussian mixture model or a bunch of decision trees and random forests, anything 00:11:17.980 |
There's a lot of work in trying to make better acoustic models. 00:11:21.700 |
So some work by George Dahl and co-authors took what was a state of the art deep learning 00:11:28.940 |
system back in 2011, which is a deep belief network with some pre-training strategies, 00:11:35.420 |
and dropped it into a state of the art pipeline in place of this acoustic model. 00:11:41.140 |
And the results are actually pretty striking, because even though we had neural networks 00:11:46.460 |
and these pipelines for a while, what ended up happening is that when you replace the 00:11:52.140 |
Gaussian mixture model and HMM system that already existed with this deep belief network 00:11:58.380 |
as an acoustic model, you actually got something between like a 10% and 20% relative improvement 00:12:09.140 |
And if you compare this to the amount of progress that had been made in preceding years, this 00:12:14.940 |
is a giant leap for a single paper to make, compared to progress we'd been able to make 00:12:22.780 |
So this is in some sense the first generation of deep learning for speech recognition, which 00:12:28.380 |
is I take one of these components and I swap it out for my favorite deep learning algorithm. 00:12:40.480 |
So with these traditional speech recognition pipelines, the problem that we would always 00:12:46.180 |
run into is that if you gave me a lot more data, you gave me a much bigger computer so 00:12:51.540 |
that I could train a huge model, that actually didn't help me because all the problems I 00:12:56.420 |
had were in the construction of this pipeline. 00:13:00.580 |
And so eventually, if you gave me more data and a bigger computer, the performance of 00:13:04.780 |
our speech recognition system would just kind of peter out. 00:13:08.340 |
It would just reach a ceiling that was very hard to get over. 00:13:11.580 |
And so we just start coming up with lots of different strategies. 00:13:16.740 |
We try to specialize for each user and try to make things a little bit better around 00:13:22.500 |
And what these deep learning acoustic models did was in some sense moved that barrier a 00:13:29.740 |
It made it possible for us to take a bit more data, much faster computers that let us try 00:13:34.860 |
a whole lot of models, and move that ceiling up quite a ways. 00:13:40.300 |
So the question that many in the research community, including folks at Baidu, have 00:13:44.780 |
been trying to answer is, can we go to a next generation version of this insight? 00:13:51.780 |
Can we, for instance, build a speech engine that is powered by deep learning all the way 00:13:56.740 |
from the audio input to the transcription itself? 00:14:00.740 |
Can we replace as much of that traditional system with deep learning as possible so that 00:14:05.380 |
over time, as you give researchers more data and bigger computers and the ability to try 00:14:11.620 |
more models, their speech recognition performance just keeps going up and we can potentially 00:14:18.880 |
So the goal of this tutorial is not to get you up here, which requires a whole bunch 00:14:26.180 |
of things that I'll tell you about near the end. 00:14:29.100 |
But what we want to try to do is give you enough to get a point on this curve. 00:14:33.460 |
And then once you're on the curve, the idea is that what remains is now a problem of scale. 00:14:40.740 |
It's about data and about getting bigger computers and coming up with ways to build bigger models. 00:14:47.500 |
So that's my objective, so that when you walk away from here, you have a picture of what 00:14:54.900 |
And then after that, it's hopefully all about scale. 00:14:59.180 |
So thanks to Vinay Rao, who's been helping put this tutorial together, there is going 00:15:04.580 |
to be some starter code live for the basic pipeline, the deep learning part of the pipeline 00:15:12.300 |
So there are some open source implementations of things like CTC, but we wanted to make 00:15:18.180 |
sure that there's a system out there that's pretty representative of the acoustic models 00:15:22.220 |
that I'm going to be talking about in the first half of the presentation here. 00:15:27.100 |
So this will be enough that you can get a simple pipeline going with something called 00:15:31.760 |
max decoding, which I'll tell you about later. 00:15:34.780 |
And the idea is that this is sort of a scale model of the acoustic models that Baidu and 00:15:40.060 |
other places are powering real production speech engines. 00:15:44.260 |
So this will get you that point on the curve. 00:15:52.860 |
The first part, I'm just going to introduce a few preliminaries, talk about preprocessing. 00:15:57.360 |
So we still have a little bit of preprocessing around, but it's not really fundamental. 00:16:01.460 |
I think it's probably going to go away in the long run. 00:16:04.660 |
We'll talk about what is probably the most mature piece of sequence learning technologies 00:16:13.500 |
So it turns out that one of the fundamental problems of doing speech recognition is how 00:16:17.740 |
do I build a neural network that can map this audio signal to a transcription that can have a different length. 00:16:25.580 |
And so CTC is one highly mature method for doing this. 00:16:29.580 |
And I think you're actually going to hear about maybe some other solutions later today. 00:16:33.820 |
Then I'll say a little bit about training and just what that looks like. 00:16:40.020 |
And then finally say a bit about decoding and language models, which is sort of an addendum 00:16:45.060 |
to the current acoustic models that we can build that make them perform a lot better. 00:16:50.580 |
And then once you have this, that's a picture of what you need to get this point on the curve. 00:16:56.660 |
And then I'll talk a little bit about what's remaining. 00:16:59.300 |
How do you scale up from this little scale model up to the full thing? 00:17:05.860 |
And then time permitting, we'll talk a little bit about production. 00:17:08.740 |
How could you put something like this into a cloud server and actually serve real users 00:17:20.760 |
This should be pretty straightforward, I think. 00:17:24.100 |
Unlike a two-dimensional image where we normally have a 2D grid of pixels, audio is just a one-dimensional wave. 00:17:30.580 |
And there are a bunch of different formats for audio, but typically this one-dimensional 00:17:35.180 |
wave that is actually me saying something like, "Hello, world," is something like 8,000 00:17:42.580 |
samples per second or 16,000 samples per second. 00:17:46.500 |
And each sample is quantized into 8 or 16 bits. 00:17:51.020 |
So when we represent this audio signal that's going to go into our pipeline, you could just 00:17:57.380 |
So when I had that box called x that represented my audio signal, you can think of this as 00:18:02.540 |
being broken down into samples, x1, x2, and so forth. 00:18:07.220 |
And if I had a one-second audio clip, this vector would have a length of either, say, 8,000 or 16,000. 00:18:14.840 |
And each element would be, say, a floating point number that I'd extracted from this 00:18:23.800 |
Now once I have an audio clip, we'll do a little bit of preprocessing. 00:18:31.020 |
The first is to just do some vanilla preprocessing, like convert to a simple spectrogram. 00:18:37.540 |
So if you look at a traditional speech pipeline, you're going to see things like MFCCs, which are mel-frequency cepstral coefficients. 00:18:45.740 |
You'll see a whole bunch of plays on spectrograms where you take differences in different kinds 00:18:50.860 |
of features and try to engineer complex representations. 00:18:55.500 |
But for the stuff that we're going to do today, a simple spectrogram is just fine. 00:18:59.700 |
And it turns out, as you'll see in a second, we lose a little bit of information when we 00:19:04.180 |
do this, but it turns out not to be a huge difference. 00:19:08.740 |
Now I said a moment ago that I think probably this is going to go away in the long run. 00:19:14.260 |
And that's because today you can actually find recent research in trying to do away 00:19:19.540 |
with even this preprocessing part and having your neural network process the audio wave 00:19:23.860 |
directly and just train its own feature transformation. 00:19:27.660 |
So there's some references at the end that you can look at for this. 00:19:35.460 |
How many people have seen a spectrogram or computed a spectrogram before? 00:19:43.060 |
So the idea behind a spectrogram is that it's sort of like a frequency domain representation, 00:19:49.940 |
but instead of representing this entire signal in terms of frequencies, I'm just going to 00:19:55.100 |
represent a small window in terms of frequencies. 00:19:59.940 |
So to process this audio clip, the first thing I'm going to do is cut out a little window, say 20 milliseconds long. 00:20:09.020 |
And when you get down to that scale, it's usually very clear that these audio signals 00:20:12.880 |
are made up of sort of a combination of different frequencies of sine waves. 00:20:21.780 |
So I take an FFT, which basically converts this little signal into the frequency domain. 00:20:26.460 |
And then we just take the log of the power at each frequency. 00:20:31.260 |
And so if you look at what the result of this is, it basically tells us for every frequency 00:20:39.420 |
of sine wave, what is the magnitude, what's the amount of power represented by that sine wave. 00:20:48.200 |
So over here in this example, we have a very strong low frequency component in the signal. 00:20:55.780 |
And then we have differing magnitudes at different differing frequencies. 00:21:06.020 |
So now instead of representing this little 20 millisecond slice as sort of a sequence 00:21:10.560 |
of audio samples, instead I'm going to represent it as a vector here where each element represents 00:21:18.180 |
sort of the strength of each frequency in this little window. 00:21:22.500 |
And the next step beyond this is that if I just told you how to process one little window, 00:21:28.100 |
you can of course apply this to a whole bunch of windows across the entire piece of audio. 00:21:34.960 |
And that gives you what we call a spectrogram. 00:21:37.400 |
And you can use either disjoint windows that are just sort of adjacent or you can apply overlapping windows. 00:21:44.560 |
So there's a little bit of parameter tuning there. 00:21:46.980 |
But this is an alternative representation of this audio signal that happens to be easier to work with. 00:21:57.760 |
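As a rough illustration of the preprocessing just described, here is a minimal numpy sketch of a log-power spectrogram, assuming 16 kHz mono audio, 20 ms windows, and a 10 ms hop; the exact windowing and parameters in a production pipeline will differ:

```python
import numpy as np
from scipy.io import wavfile

def log_spectrogram(path, window_ms=20, hop_ms=10, eps=1e-10):
    """Cut the waveform into short windows, FFT each one, and take log power."""
    rate, samples = wavfile.read(path)           # e.g. 16,000 samples/sec, 16-bit ints
    samples = samples.astype(np.float32)
    window = int(rate * window_ms / 1000)        # samples per window (320 at 16 kHz)
    hop = int(rate * hop_ms / 1000)              # step between adjacent windows
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        chunk = samples[start:start + window] * np.hanning(window)
        power = np.abs(np.fft.rfft(chunk)) ** 2  # strength of each frequency
        frames.append(np.log(power + eps))       # log power per frequency bin
    return np.stack(frames)                      # shape: (num_frames, window // 2 + 1)
```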
So our goal, starting from this representation, is to build what I'm going to call an acoustic 00:22:04.200 |
model, but which is really, to the extent we can make it happen, is really going to 00:22:08.720 |
be an entire speech engine that is represented by a neural network. 00:22:12.780 |
So what we would like to do is build a neural net that if we could train it from a whole 00:22:18.740 |
bunch of pairs, X, which is my original audio that I turn into a spectrogram, and Y star, 00:22:25.000 |
that's the ground truth transcription that some human has given me. 00:22:29.000 |
If I were to train this big neural network off of these pairs, what I'd like it to produce 00:22:35.600 |
is some kind of output that I'm representing by the character C here, so that I could later 00:22:41.560 |
extract the correct transcription, which I'm going to denote by Y. 00:22:46.920 |
So if I said hello, the first thing I'm going to do is run preprocessing to get all these spectrogram frames. 00:22:53.340 |
And then I'm going to have a recurrent neural network that consumes each frame and processes 00:22:58.440 |
them into some new representation called C. And hopefully, I can engineer my network in 00:23:04.640 |
such a way that I can just read the transcription off of these output neurons. 00:23:09.940 |
So that's kind of the intuitive picture of what we want to accomplish. 00:23:15.960 |
So as I mentioned back in the outline, there's one obvious fundamental problem here, which 00:23:21.920 |
is that the length of the input is not the same as the length of the transcription. 00:23:28.840 |
So if I say hello very slowly, then I can have a very long audio signal, even though 00:23:35.120 |
I didn't change the length of the transcription. 00:23:36.760 |
Or if I say hello very quickly, then I can have a very short piece of audio. 00:23:43.120 |
And so that means that this output of my neural network is changing length, and I need to 00:23:47.640 |
come up with some way to map that variable length neural network output to this fixed 00:23:53.440 |
length transcription, and also do it in a way that we can actually train this pipeline. 00:23:58.940 |
So the traditional way to deal with this problem, if you were building a speech engine several 00:24:07.020 |
years ago, is to just try to bootstrap the whole system. 00:24:11.040 |
So I'd actually train a neural network to correctly predict the sounds at every frame 00:24:16.160 |
using some kind of data set like TIMIT, where someone has lovingly annotated all of the phonemes. 00:24:23.140 |
And then I'd try to figure out the alignment between my saying hello in a phonetic transcription 00:24:30.140 |
And then once I've lined up all of the sounds with the input audio, now I don't care about 00:24:35.080 |
length anymore, because I can just make a one-to-one mapping between the audio input 00:24:40.320 |
and the phoneme outputs that I'm trying to target. 00:24:43.440 |
But this alignment process is horribly error-prone. 00:24:47.420 |
You have to do a lot of extra work to make it work well, and so we really don't want to do that. 00:24:52.160 |
We really want to have some kind of solution that lets us solve this straightaway. 00:24:59.680 |
And as I mentioned, there's some current research on how to use things like attentional models, 00:25:04.280 |
sequence-to-sequence models that you'll hear about later, in order to solve this kind of problem. 00:25:11.960 |
And then, as I said, we'll focus on something called connectionist temporal classification, 00:25:17.280 |
or CTC, that is sort of current state-of-the-art for how to do this. 00:25:24.960 |
So our recurrent neural network has these output neurons that I'm calling C. And the 00:25:31.680 |
job of these output neurons is to encode a distribution over the output symbols. 00:25:39.560 |
So because of the structure of the recurrent network, the length of this symbol sequence 00:25:45.480 |
C is the same as the length of my audio input. 00:25:48.280 |
So if my audio input, say, was two seconds long, that might have 100 audio frames. 00:25:55.500 |
And that would mean that the length of C is also 100 different values. 00:26:01.200 |
So if we were working on a phoneme-based model, then C would be some kind of phoneme representation. 00:26:07.640 |
And we would also include a blank symbol, which is special for CTC. 00:26:12.640 |
But if, as we'll do in the rest of this talk, we're trying to just predict the graphemes, 00:26:19.040 |
trying to predict the characters in this language directly from the audio, then I would just 00:26:24.400 |
let C take on a value that's in my alphabet, or take on a blank or a space, if my language 00:26:33.520 |
And then the second thing I'm going to do, once my RNN gives me a distribution over these 00:26:39.540 |
symbols C, is that I'm going to try to define some kind of mapping that can convert this 00:26:45.240 |
long transcription C into the final transcription Y. 00:26:55.000 |
And now, recognizing that C is itself a probabilistic creature, there's a distribution over choices 00:27:04.680 |
Once I apply this function, that also means that there's a distribution over Y. 00:27:08.760 |
There's a distribution over the possible transcriptions that I could get. 00:27:12.680 |
And what I'll want to do to train my network is to maximize the probability of the correct transcription. 00:27:20.600 |
So those are the three steps that we have to accomplish in order to make CTC work. 00:27:29.520 |
So we have these output neurons C, and they represent a distribution over the different 00:27:36.240 |
symbols that I could be hearing in the audio. 00:27:41.380 |
You can see the spectrogram frames poking up. 00:27:44.740 |
And this is being processed by this recurrent neural network. 00:27:48.440 |
And the output is a big bank of softmax neurons. 00:27:53.320 |
So for the first frame of audio, I have a neuron that corresponds to each of the symbols 00:28:02.680 |
And this set of softmax neurons here, with the output summing to 1, represents the probability 00:28:10.620 |
of, say, C1 having the value A, B, C, and so on, or this special blank character. 00:28:17.680 |
So for example, if I pick one of the neurons over here, then the first row, which represents 00:28:24.280 |
the character B, and the 17th column, which is the 17th frame in time, this represents 00:28:31.760 |
the probability that C17 represents the character B, given the audio. 00:28:40.920 |
So once I have this, that also means that I can just define a distribution not just 00:28:47.160 |
over the individual characters, but if I just assume that all of the characters are independent, 00:28:53.080 |
which is kind of a naive assumption, but if I bake this into the system, I can define 00:28:57.820 |
a distribution over all possible sequences of characters in this alphabet. 00:29:04.520 |
So if I gave you a specific instance, a specific character string using this alphabet, for 00:29:11.960 |
instance, I might represent the string "hello" as H-H-H, blank, E-E, L-L, blank, L-O. 00:29:22.600 |
This is a string in this alphabet for C, and I can just use this formula to compute the 00:29:28.960 |
probability of this specific sequence of characters. 00:29:33.560 |
So that's how we compute the probability for a sequence of characters when they have the same length as the audio. 00:29:43.960 |
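To make that independence assumption concrete, here is a tiny sketch of the per-frame product; `probs` (a frames-by-symbols array of softmax outputs) and `alphabet` are hypothetical stand-ins for the network's outputs:

```python
import numpy as np

# probs[t, k]: softmax output at frame t for symbol k (a hypothetical T x K array)
# alphabet:   e.g. ['_', ' ', 'a', 'b', ..., 'z'], where '_' is the CTC blank
def sequence_log_prob(probs, symbols, alphabet):
    """log P(c | x) = sum over frames of log P(c_t | x), frames treated as independent."""
    index = {s: i for i, s in enumerate(alphabet)}
    return sum(np.log(probs[t, index[s]]) for t, s in enumerate(symbols))
```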
So the second step, and this is in some sense the kind of neat trick in CTC, is to define 00:29:52.440 |
a mapping from this long encoding of the audio into symbols that crunches it down to the 00:30:03.240 |
actual transcription that we're trying to predict. 00:30:06.000 |
And the rule is this operator takes this character sequence, and it picks up all the duplicates, 00:30:13.440 |
all of the adjacent characters that are repeated, and discards the duplicates and just keeps 00:30:18.720 |
one of them, and then it drops all of the blanks. 00:30:23.240 |
So in this example, you see you have three H's together, so I just keep one H, and then 00:30:29.600 |
I have a blank, I throw that away, and I keep an E, and I have two L's, so I keep one of 00:30:34.640 |
the L's over here, and then another blank, and an L-O. 00:30:38.400 |
And the one key thing to note is that when I have two characters that are different right 00:30:43.280 |
next to each other, I just end up keeping those two characters in my output. 00:30:48.680 |
But if I ever have a double character, like L-L in "hello," then I'll need to have a blank in between them. 00:30:58.880 |
But if our neural network gave me this transcription, told me that this was the right answer, we 00:31:04.360 |
just have to apply this operator, and we get back the string "hello." 00:31:12.020 |
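The squeeze operator just described is only a few lines of code; this sketch uses '_' for the CTC blank, which is my convention rather than anything from the slides:

```python
def collapse(symbols, blank='_'):
    """CTC collapse: merge adjacent duplicates, then remove blanks."""
    out = []
    prev = None
    for s in symbols:
        if s != prev:          # keep only one symbol from each run of repeats
            out.append(s)
        prev = s
    return ''.join(s for s in out if s != blank)

# collapse("hhh_eell_lo")  ->  "hello"
```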
So now that we have a way to define a distribution over these sequences of symbols that are the 00:31:18.320 |
same length as the audio, and we now have a mapping from those strings into transcriptions, 00:31:25.160 |
as I said, this gives us a probability distribution over the possible final transcriptions. 00:31:30.920 |
So if I look at the probability distribution over all the different sequences of symbols, 00:31:37.720 |
I might have "hello" written out like on the last slide, and maybe that has probability 00:31:42.620 |
.1, and then I might have "hello" but written a different way, by say replacing this H with 00:31:49.880 |
a blank that has a smaller probability, and I have a whole bunch of different possible 00:31:58.440 |
And what you'll notice is that if I go through every possible combination of symbols here, 00:32:06.160 |
there are several combinations that all map to the same transcription. 00:32:10.640 |
So here's one version of "hello," there's a second version of "hello," there's a third one, and so on. 00:32:16.800 |
And so if I now ask, "What's the probability of the transcription 'hello'?" 00:32:21.720 |
The way that I compute that is I go through all of the possible character sequences that 00:32:28.080 |
correspond to the transcription "hello," and I add up all of their probabilities. 00:32:33.240 |
So I have to sum over all possible choices of C that could give me that transcription 00:32:40.400 |
So you can kind of think of this as searching through all the possible alignments, right? 00:32:48.080 |
I could shift these characters around a little bit, I could move them forward, backward, 00:32:52.360 |
I could expand them by adding duplicates or squish them up, depending on how fast someone 00:32:56.500 |
is talking, and that corresponds to every possible alignment between the audio and the transcription. 00:33:05.000 |
It sort of solves the problem of the variable length. 00:33:08.760 |
And the way that I get the probability of a specific transcription is to sum up, to 00:33:14.360 |
marginalize over all the different alignments that could be feasible. 00:33:20.920 |
And then if we have a whole bunch of other possibilities in here, like the word "yellow," 00:33:27.400 |
So this equation just says to sum over all the character sequences C so that when I apply 00:33:32.800 |
this little mapping operator, I end up with the transcription y. 00:33:47.880 |
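For a toy-sized example you can compute this marginalization by brute force, which makes the definition concrete even though it is hopelessly slow for real utterances (the efficient dynamic-programming version is what's discussed next). This sketch reuses the hypothetical `probs`, `alphabet`, and `collapse()` from the snippets above:

```python
import itertools
import numpy as np

def transcription_prob(probs, y, alphabet, blank='_'):
    """Brute-force P(y | x): sum the probability of every length-T symbol
    sequence C whose collapse is exactly y. Only feasible for tiny T and alphabets."""
    T = probs.shape[0]
    total = 0.0
    for idxs in itertools.product(range(len(alphabet)), repeat=T):
        c = [alphabet[i] for i in idxs]
        if collapse(c, blank) == y:
            total += np.prod([probs[t, i] for t, i in enumerate(idxs)])
    return total
```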
>> I'm missing a double E. >> You're talking about this one? 00:33:51.920 |
So when we apply this sort of squeezing operator here, we drop this double E to get a single E. 00:34:09.160 |
So whenever you see two characters together like this, where they're adjacent duplicates, 00:34:16.120 |
you sort of squeeze all those duplicates out, and you just keep one of them. 00:34:21.880 |
So if we drop all the duplicates first, then we still have two L's left, and then we remove 00:34:29.160 |
So this gives the algorithm a way to represent repeated characters in the transcription. 00:34:47.520 |
Really I should have put a space character in here instead of a blank. 00:35:08.040 |
So once I've defined this, I just gave you a formula to compute the probability of a transcription given the audio. 00:35:16.360 |
So as with every good machine learning algorithm, we go and we try to apply maximum likelihood training. 00:35:23.720 |
I now give you the correct transcription, and your job is to tune the neural network 00:35:28.440 |
to maximize the probability of that transcription using this model that I just defined. 00:35:34.160 |
So in equations, what I'm going to do is I want to maximize the log probability of y star given x. 00:35:46.280 |
I want to maximize the probability of the correct transcription given the audio x. 00:35:51.680 |
And then I'm just going to sum over all the examples. 00:35:56.160 |
And then what I want to do is just replace this with the equation that I had on the last 00:36:02.040 |
page that says in order to compute the probability of a given transcription, I have to sum over 00:36:07.320 |
all of the possible symbol sequences that could have given me that transcription, sum 00:36:12.120 |
over all the possible alignments that would map that transcription to my audio. 00:36:18.800 |
So Alex Graves and co-authors in 2006 actually show that because of this independence assumption, 00:36:26.000 |
there is a clever way, there is a dynamic programming algorithm that can do this efficiently. 00:36:33.000 |
And not only compute this summation so that you can compute the objective function, but 00:36:37.200 |
actually compute its gradient with respect to the output neurons of your neural network. 00:36:42.240 |
So if you look at the paper, the algorithm details are in there. 00:36:47.000 |
What's cool right now in the history of speech and deep learning is that this is at the level 00:36:53.540 |
This is something that's now implemented in a bunch of places so that you can download 00:36:57.500 |
a software package that efficiently will calculate this CTC loss function for you that can calculate 00:37:05.540 |
this likelihood and can also just give you back the gradient. 00:37:11.240 |
So I won't go through the details here; instead, I'll tell you that there are a whole bunch of implementations on the web that you 00:37:15.760 |
can now use as part of deep learning packages. 00:37:19.140 |
So one of them from Baidu implements CTC on the GPU. 00:37:25.960 |
Stanford and the group there, actually one of Andrew's students, has a CTC implementation. 00:37:33.360 |
And there's also now CTC losses implemented in packages like TensorFlow. 00:37:37.880 |
So this is something that's sufficiently widely distributed that you can use these algorithms 00:37:46.880 |
So the way that these work, the way that we go about training, is we start from our audio 00:37:52.680 |
We have our neural network structure where you get to choose how it's put together. 00:37:58.060 |
And then it outputs this bank of softmax neurons. 00:38:01.640 |
And then there are pieces of off-the-shelf software that will compute for you the CTC loss. 00:38:08.520 |
They'll compute this log likelihood given a transcription and the output neurons from 00:38:16.440 |
And then the software will also be able to tell you the gradient with respect to the output neurons. 00:38:23.440 |
You can feed them back into the rest of your code and get the gradient with respect to the network parameters. 00:38:29.980 |
So as I said, this is all available now in sort of efficient, off-the-shelf software. 00:38:37.520 |
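As one concrete example of that off-the-shelf route (my choice of library for illustration, not the one used in the lecture), PyTorch's built-in `nn.CTCLoss` takes the per-frame log-softmax outputs plus the target label sequence and returns the negative log likelihood, with the gradient coming for free from autograd; all sizes and label indices below are made up:

```python
import torch
import torch.nn as nn

T, N, C = 100, 1, 29     # frames, batch size, symbols (e.g. 26 letters + space + apostrophe + blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1).requires_grad_()  # stand-in for RNN outputs
targets = torch.tensor([[8, 5, 12, 12, 15]])   # hypothetical label indices for "hello" (a=1, ..., z=26)
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([5])

ctc = nn.CTCLoss(blank=0)                      # index 0 reserved for the CTC blank
loss = ctc(log_probs, targets, input_lengths, target_lengths)  # -log P(y* | x)
loss.backward()                                # gradient w.r.t. the network's log-prob outputs
```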
So that's pretty much all there is to the high-level algorithm. 00:38:41.840 |
With this, it's actually enough to get a sort of working drosophila of speech recognition 00:38:49.800 |
There are a few little tricks, though, that you might need along the way. 00:38:57.520 |
But as you get to more difficult data sets with a lot of noise, they can become more important. 00:39:03.360 |
So the first one, which we've been calling "SortaGrad" in the vein of all of the "-grad" algorithms 00:39:09.080 |
out there is basically a trick to help with recurrent neural networks. 00:39:16.600 |
So it turns out that when you try to train one of these big RNN models on some off-the-shelf 00:39:22.360 |
speech data, one of the things that can really get you is seeing very long utterances early in training. 00:39:30.540 |
Because if you have a really long utterance, then if your neural network is badly initialized, 00:39:36.680 |
you'll often end up with things like underflow and overflow as you try to go and compute 00:39:42.400 |
And you end up with gradients exploding as you try to do back propagation. 00:39:46.320 |
And it can make your optimization a real mess. 00:39:49.200 |
And it's coming from the fact that these utterances are really long and really hard, and the neural 00:39:53.440 |
network just isn't ready to deal with those transcriptions. 00:39:57.280 |
And so one of the fixes that you can use is, during the early parts of training, usually 00:40:02.400 |
in the first epoch, is you just sort all of your audio by length. 00:40:07.040 |
And now, when you process a mini-batch, you just take the short utterances first so that 00:40:12.040 |
you're working with really short RNNs that are quite easy to train and don't blow up 00:40:16.760 |
and don't have a lot of catastrophic numerical problems. 00:40:20.480 |
And then as time goes by, you start operating on longer and longer utterances that get more and more difficult. 00:40:31.480 |
And so you can see some work from Yoshua Bengio and his team on a whole bunch of strategies like this, called curriculum learning. 00:40:37.520 |
But you can think of the short utterances as being the easy ones. 00:40:40.480 |
And if you start out with the easy utterances and move to the longer ones, your optimization goes a lot more smoothly. 00:40:46.640 |
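A minimal sketch of that curriculum trick might look like the following, where the dataset is just a hypothetical list of (audio, transcript) pairs sorted by audio length for the first epoch only:

```python
import random

def epoch_order(dataset, epoch):
    """dataset: list of (audio, transcript) pairs. Sort by audio length in epoch 0 only."""
    if epoch == 0:
        # SortaGrad: shortest (easiest) utterances first, so early updates stay stable
        return sorted(dataset, key=lambda ex: len(ex[0]))
    order = list(dataset)
    random.shuffle(order)            # normal random order in later epochs
    return order
```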
So here's an example from one of the models that we've trained, where your CTC cost starts out high. 00:40:54.720 |
And after a while, you optimize, and you sort of bottom out. 00:41:02.080 |
And then if you add this SortaGrad strategy, after the first epoch, you're actually doing better. 00:41:08.380 |
And you can reach a better optimum than you could without it. 00:41:12.120 |
And in addition, another strategy that's extremely helpful for recurrent networks and very deep networks is batch normalization. 00:41:22.880 |
And it's also available as sort of an off-the-shelf package inside of a lot of the different frameworks 00:41:29.520 |
So if you start having trouble, you can consider putting batch normalization into your network. 00:41:36.280 |
So our neural network now spits out this big bank of softmax neurons. 00:41:47.740 |
This process, as I said, is meant to be as close to characters as possible. 00:41:53.000 |
But we still sort of need to decode these outputs. 00:41:56.600 |
And you might think that one simple solution, which turns out to be approximate, to get 00:42:01.720 |
the correct transcription is just go through here and pick the most likely sequence of 00:42:06.960 |
symbols for C, and then apply our little squeeze operator to get back the transcription the network is predicting. 00:42:14.960 |
So this turns out not to be the optimal thing. 00:42:17.560 |
This actually doesn't give you the most likely transcription, because it's not accounting 00:42:22.320 |
for the fact that every transcription might have multiple sequences of Cs, multiple alignments 00:42:32.320 |
But you can actually do this, and this is called the max decoding. 00:42:36.280 |
And so for this sort of contrived example here, I put little red dots on the most likely 00:42:42.720 |
C. And if you see, there's a couple of blanks, a couple of Cs, there's another blank, an A, and then a B. 00:42:52.680 |
And if you apply our little squeeze operator, you just get the word cab. 00:43:02.200 |
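Max decoding is just an argmax per frame followed by the squeeze operator; this sketch reuses the hypothetical `probs`, `alphabet`, and `collapse()` from earlier:

```python
import numpy as np

def max_decode(probs, alphabet, blank='_'):
    """Approximate decoding: most likely symbol at every frame, then collapse."""
    best = probs.argmax(axis=1)                        # argmax over symbols at each frame
    return collapse([alphabet[k] for k in best], blank)
```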
It will often give you a very strange transcription that doesn't look like English necessarily. 00:43:09.280 |
But the reason I mention it is that this is a really handy diagnostic. 00:43:13.400 |
If you're kind of wondering what's going on in the network, glancing at a few of these 00:43:17.160 |
will often tell you if the network's starting to pick up any signal or if it's just outputting 00:43:22.900 |
So I'll give you a more detailed example in a second of how that happens. 00:43:29.640 |
So these are all the concepts of our very simple pipeline. 00:43:32.880 |
And the demo code that we're going to put up on the web will basically let you work 00:43:39.200 |
So once we try to train these, I want to give you an example of the sort of data that we're training on. 00:43:45.400 |
>> A tanker is a ship designed to carry large volumes of oil or other liquid cargo. 00:43:51.920 |
>> So this is just a person sitting there reading the Wall Street Journal to us. 00:43:58.520 |
It's really popular in the speech research community. 00:44:02.240 |
It's published by the Linguistic Data Consortium. 00:44:05.320 |
There's also a free alternative called LibriSpeech that's very similar. 00:44:08.720 |
But instead of people reading the Wall Street Journal, it's people reading Creative Commons books. 00:44:15.200 |
So in the demo code that we have, a really simple network that works reasonably well looks something like this. 00:44:23.680 |
So there's a sort of family of models that we've been working with, where you start from the spectrogram. 00:44:29.640 |
You have maybe one layer or several of convolutional filters at the bottom. 00:44:35.380 |
And then on top of that, you have some kind of recurrent neural network. 00:44:38.040 |
It might just be a vanilla RNN, but you can also use LSTM or GRU cells, any of your favorite recurrent cell types. 00:44:50.760 |
And then on top of that, we have some fully connected layers that produce these softmax outputs. 00:44:55.800 |
And those are the things that go into CTC for training. 00:45:01.040 |
The implementation on the web uses the warp CTC code. 00:45:05.000 |
And then we would just train this big neural network with stochastic gradient descent, 00:45:08.880 |
Nesterov's momentum, all the stuff that you've probably seen in a whole bunch of other talks today. 00:45:15.120 |
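A scaled-down sketch of that family of models, again in PyTorch purely for illustration (the released starter code is a separate implementation, and every layer size here is made up):

```python
import torch
import torch.nn as nn

class SmallSpeechModel(nn.Module):
    def __init__(self, n_freq=161, n_hidden=256, n_symbols=29):
        super().__init__()
        # 1D convolution over time, treating the frequency bins as input channels
        self.conv = nn.Conv1d(n_freq, n_hidden, kernel_size=11, stride=2, padding=5)
        self.rnn = nn.GRU(n_hidden, n_hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_symbols)

    def forward(self, spectrogram):                  # (batch, time, n_freq)
        x = self.conv(spectrogram.transpose(1, 2))   # (batch, n_hidden, time')
        x, _ = self.rnn(x.transpose(1, 2))           # (batch, time', 2 * n_hidden)
        return self.fc(x).log_softmax(dim=-1)        # per-frame log-probs for the CTC loss
        # (a CTC loss implementation typically wants this permuted to (time', batch, symbols))
```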
So if you actually run this, what is going on inside? 00:45:21.680 |
So I mentioned that looking at the max decoding is kind of a handy way to see what's going on. 00:45:34.640 |
This is a visualization of those softmax neurons at the top of one of these big neural networks. 00:45:40.860 |
So this is the representation of C from all the previous slides. 00:45:45.880 |
So on the horizontal axis, this is basically time. 00:45:48.680 |
This is the frame number or which chunk of the spectrogram we're seeing. 00:45:52.800 |
And then on the vertical axis here, you see these are all the characters in the English alphabet. 00:45:59.280 |
So after 300 iterations of training, which is not very much, the system has learned something 00:46:04.960 |
amazing, which is that it should just output blanks and spaces all the time. 00:46:09.600 |
Because of all the silence and things in your data set, these are by far the most common symbols. 00:46:16.000 |
So the network just wants to fill up the whole space with blanks. 00:46:18.360 |
But you can see it's kind of randomly poking out a few characters here. 00:46:23.300 |
And if you run your little max decoding strategy to see what does the system think the transcription 00:46:35.600 |
But this is a sign that the neural network's not going crazy. 00:46:40.240 |
It's at least learned what is the most likely characters. 00:46:44.560 |
Then after maybe 1,500 or so iterations, you start to get a little bit of structure. 00:46:49.680 |
And if you try to like mouth these words, you might be able to sort of see that there's 00:46:54.860 |
some English-like sounds in here, like "Beyar justinfrutin." 00:47:01.760 |
But it's actually looking much better than just "h." 00:47:11.920 |
You can start to see that we have sort of fragments of possibly words starting to form. 00:47:18.980 |
And then after you're getting close to convergence, it's still not a real sentence. 00:47:25.160 |
He guessed what the correct transcription might be. 00:47:34.160 |
The correct one is actually "there justinfrunt." 00:47:38.440 |
And so you can see that sort of it's sort of sounding it out with English characters. 00:47:44.220 |
I have a young son, and I kind of figure I'm eventually going to see him producing max-decoded 00:47:51.520 |
And you're just going to sound these things out and be like, "Is it there justinfrunt? 00:47:56.960 |
But this is why this max-decoding strategy is really handy. 00:48:00.200 |
Because you can kind of look at this output and say, yeah, it's starting to get some actual 00:48:07.240 |
So because this is like my favorite speech recognition party game, I wanted to show you 00:48:15.920 |
"The poor little things," cried Cynthia, "think of them having been turned to the wall all 00:48:21.720 |
And so you can hear like the sound of the breath at the end. 00:48:34.040 |
And you'll find that things like proper names and so on tend to get sounded out. 00:48:38.240 |
But if those names are not in your audio data, there's no way the network could have learned how to spell them. 00:48:45.080 |
And we'll come back to how to solve that later. 00:48:47.000 |
But you see the true label is "The poor little things," cried Cynthia. 00:48:52.160 |
And that the last word is actually "all these years." 00:48:54.560 |
And there isn't a word hanging off at the end. 00:49:13.080 |
If you told me that this was the ground truth, I'd go, "That's weird. 00:49:21.920 |
Turns out this is a French word that means something like "rubbernecking." 00:49:29.440 |
So this is, again, the cool examples of what these neural networks are able to figure out 00:49:42.280 |
We just talked about max-decoding, which is sort of an approximate way of going from these softmax outputs to a transcription. 00:49:52.240 |
And if you want to find the actual most likely transcription Y, there's actually no algorithm 00:49:58.400 |
in general that can give you the perfect solution efficiently. 00:50:03.840 |
So the reason for that, remember, is that for a single transcription Y, I have an efficient 00:50:11.320 |
But if I want to search over every possible transcription, I don't know how to do that 00:50:15.880 |
because there are exponentially many possible transcriptions, and I'd have to run this algorithm 00:50:25.400 |
So we have to resort to some kind of generic search strategy. 00:50:29.840 |
And so one proposed in the original paper briefly is a sort of prefix-decoding strategy. 00:50:36.760 |
So I don't want to spend a ton of time on this. 00:50:39.200 |
Instead, I want to step to sort of the next piece of the picture. 00:50:44.480 |
So there were a bunch of examples in there, right, like proper names, like Cynthia and 00:50:49.160 |
things like Baddourie, where unless you had heard this word before, you have no hope of spelling it correctly. 00:50:59.280 |
And so there are lots of examples like this in the literature of things that are sort 00:51:04.680 |
of spelled out phonetically but aren't legitimate English transcriptions. 00:51:10.240 |
And so what we'd like to do is come up with a way to fold in just a little bit of that 00:51:17.040 |
knowledge about the language, to take a small step backward from a perfect end-to-end system 00:51:25.280 |
So as I said, the real problem here is that you don't have enough audio available to learn 00:51:32.200 |
If you had millions and millions of hours of audio sitting around, you could probably 00:51:35.560 |
learn all these transcriptions because you just hear enough words that you know how to 00:51:42.800 |
But unfortunately, we just don't have enough audio for that. 00:51:45.960 |
So we have to find a way to get around that data problem. 00:51:50.120 |
There's also an example of something that in the AI lab we've dubbed the Tchaikovsky 00:51:53.880 |
problem, which is that there are certain names in the world, right, like proper names, that 00:51:59.360 |
if you've never heard of it before, you have no idea how it's spelled. 00:52:03.560 |
And the only way to know it is to have seen this word in text before and to see it in 00:52:10.400 |
So part of the purpose of these language models is to get examples like this correct. 00:52:16.840 |
One would be to just step back to a more traditional pipeline, right, use phonemes, because then 00:52:21.940 |
we can bake new words in along with their phonetic pronunciation and the system will 00:52:29.280 |
But in this case, I want to focus on just fusing in a traditional language model that 00:52:34.960 |
gives us the probability a priori of any sequence of words. 00:52:40.000 |
So the reason that this is helpful is that using a language model, we can train these 00:52:47.880 |
We have way, way more text in the world than we have transcribed audio. 00:52:52.640 |
And so that makes it possible to train these giant language models with huge vocabulary, 00:52:57.960 |
and they can also pick up the sort of contextual things that will tip you off to the fact that 00:53:02.720 |
Tchaikovsky concerto is a reasonable thing for a person to ask, and that this particular 00:53:08.520 |
transcription which we have seen in the past, Tchaikovsky concerto, even though composed 00:53:19.680 |
So there's actually not much to see on the language modeling front for this, except that 00:53:25.840 |
the reasons for sticking with traditional N-gram models are kind of interesting if you're 00:53:32.960 |
So if you go use a package like KenLM on the web to go build yourself a giant N-gram language 00:53:39.880 |
model, these are really simple and well supported. 00:53:47.040 |
And they'll let you train from lots of corpora, but for speech recognition in practice, one 00:53:52.360 |
of the nice things about N-gram models as opposed to trying to, say, use like an RNN 00:53:58.160 |
model is that we can update these things very quickly. 00:54:00.440 |
If you have a big distributed cluster, you can update that N-gram model very rapidly 00:54:05.160 |
in parallel from new data to keep track of whatever the trending words are today that 00:54:12.440 |
And we also have the need to query this thing very rapidly inside our decoding loop that 00:54:20.340 |
And so being able to just look up the probabilities in a table the way an N-gram model is structured 00:54:26.880 |
So I hope someday all of this will go away and be replaced with an amazing neural network. 00:54:37.040 |
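For reference, querying a KenLM model from Python is a single lookup-style call (the model file name here is hypothetical, and KenLM reports log10 probabilities):

```python
import kenlm

lm = kenlm.Model('my_ngram_model.arpa')            # an n-gram LM built offline with KenLM
# the cheap table lookup used inside the decoding loop
score = lm.score('the tchaikovsky concerto', bos=True, eos=True)
```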
So in order to fuse this into the system, since to get the most likely transcription, 00:54:44.880 |
right, probability of Y given X, to maximize that thing, we need to use a generic search strategy. 00:54:54.060 |
Once we're using a generic search scheme to do our decoding and find the most likely transcription, we're free to add some extra terms to that objective. 00:55:00.840 |
So in a previous piece of work from Awni Hannun and several co-authors, what you do 00:55:07.820 |
is you take the probability of a given word sequence from your audio. 00:55:13.480 |
So this is what you would get from your giant RNN. 00:55:17.920 |
And you can just multiply it by some extra terms, the probability of the word sequence 00:55:22.560 |
according to your language model raised to some power, and then multiply by the length of the transcription raised to another power. 00:55:28.160 |
And you see that if you just take the log of this objective function, right, then you 00:55:33.840 |
get the log probability that was your original objective. 00:55:37.380 |
You get alpha times the log probability of the language model, and beta times the log of the length of the transcription. 00:55:44.380 |
And these alpha and beta parameters let you sort of trade off the importance of getting 00:55:49.880 |
a transcription that makes sense to your language model versus getting a transcription that 00:55:53.480 |
makes sense to your acoustic model and actually sounds like the thing that you heard. 00:55:59.040 |
And the reason for this extra term over here is that as you're multiplying in all of these 00:56:04.520 |
terms, you tend to penalize long transcriptions a bit too much. 00:56:09.120 |
And so having a little bonus or penalty at the end to tweak to get the transcription length right turns out to be helpful. 00:56:16.700 |
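In equations, the combined objective just described looks roughly like this (whether the length term counts words or characters, and whether it is logged, varies across papers; this follows the description above):

```latex
y^{*} \;=\; \arg\max_{y}\; \log p_{\mathrm{RNN}}(y \mid x) \;+\; \alpha \,\log p_{\mathrm{LM}}(y) \;+\; \beta \,\log \mathrm{length}(y)
```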
So the basic idea behind this is just to use BeamSearch. 00:56:19.920 |
BeamSearch, really popular search algorithm, a whole bunch of instances of it. 00:56:28.840 |
So starting from time 0, starting from t equals 1 at the very beginning of your audio input, 00:56:35.120 |
I start out with an empty list that I'm going to populate with prefixes. 00:56:40.540 |
And these prefixes are just partial transcriptions that represent what I think I've heard so far. 00:56:50.200 |
And the way that this proceeds is I'm going to take at the current time step each candidate 00:56:58.960 |
And then I'm going to try all of the possible characters in my softmax neurons that could 00:57:08.800 |
I can say if the next element of C is actually supposed to be a blank, then what that would 00:57:15.440 |
mean is that I don't change my prefix, right, because the blanks are just going to get dropped anyway. 00:57:21.020 |
But I need to incorporate the probability of that blank character into the probability of that prefix. 00:57:28.440 |
It represents one of the ways that I could reach that prefix. 00:57:32.400 |
And so I need to sum that probability into that candidate. 00:57:37.060 |
And likewise, whenever I add a space to the end of a prefix, that signals that this prefix has just completed a word. 00:57:45.600 |
And so in addition to adding the probability of the space into my current estimate, this 00:57:50.400 |
gives me the chance to go look up that word in my language model and fold that into my 00:57:57.520 |
And then if I try adding a new character onto this prefix, it's just straightforward. 00:58:01.600 |
I just go and update the probabilities based on the probability of that character. 00:58:06.400 |
And then at the end of this, I'm going to have a huge list of possible prefixes that 00:58:12.480 |
And this is where you would normally get the exponential blow up of trying all possible transcriptions. 00:58:20.880 |
And what BeamSearch does is it just says, take the k most probable prefixes after I 00:58:27.220 |
remove all the duplicates in here, and then go and do this again. 00:58:31.360 |
And so if you have a really large k, then your algorithm will be a bit more accurate 00:58:35.400 |
in finding the best possible solution to this maximization problem, but it'll be slower. 00:58:44.620 |
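Here is a compact sketch of that prefix beam search without the language-model and length terms, which would be folded in at the points where a prefix gains a space. For every prefix it tracks the probability of ending in a blank versus a non-blank, so repeats and blanks are handled the way the collapse operator requires; the `probs` and `alphabet` conventions are the same hypothetical ones as before:

```python
import collections
import numpy as np

def prefix_beam_search(probs, alphabet, k=25, blank='_'):
    """CTC prefix beam search (no language model term, for clarity).

    probs[t, s] is the softmax output for symbol s at frame t. For each
    prefix we keep p_b (probability of paths ending in a blank) and
    p_nb (ending in a non-blank)."""
    beam = {'': (1.0, 0.0)}                                   # prefix -> (p_b, p_nb)
    for t in range(probs.shape[0]):
        nxt = collections.defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beam.items():
            for s, symbol in enumerate(alphabet):
                p = probs[t, s]
                if symbol == blank:
                    nxt[prefix][0] += (p_b + p_nb) * p        # prefix unchanged, now ends in blank
                elif prefix and symbol == prefix[-1]:
                    nxt[prefix][1] += p_nb * p                # repeat collapses into the same prefix
                    nxt[prefix + symbol][1] += p_b * p        # blank in between -> genuinely new char
                else:
                    nxt[prefix + symbol][1] += (p_b + p_nb) * p
        # keep only the k most probable prefixes for the next frame
        beam = {pre: tuple(v) for pre, v in
                sorted(nxt.items(), key=lambda kv: -sum(kv[1]))[:k]}
    return max(beam.items(), key=lambda kv: sum(kv[1]))[0]
```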
If you run this decoding algorithm, if you just run it on the RNN outputs, you'll see 00:58:49.720 |
that you get actually better than straight max decoding. 00:58:55.600 |
But you still make things like spelling errors, like Boston with an I. 00:59:00.560 |
But once you add in a language model that can actually tell you that the word Boston 00:59:05.000 |
with an O is much more probable than Boston with an I, you get the correct spelling. 00:59:14.120 |
So one place that you can also drop in deep learning that I wanted to mention very rapidly 00:59:18.100 |
is just if you're not happy with your N-gram model, because it doesn't have enough context, 00:59:22.760 |
or you've seen a really amazing neural language modeling paper that you'd like to fold in, 00:59:29.080 |
one really easy way to do this and link it to your current pipeline is to do rescoring. 00:59:35.080 |
So when this decoding strategy finishes, it can give you the most probable transcription, 00:59:40.800 |
but it also gives you this big list of the top k transcriptions in terms of probability. 00:59:48.560 |
And what you can do is take your recurrent network and just rescore all of these, basically 01:00:03.940 |
So in the instance of a neural language model, let's say that this is my N best list. 01:00:09.760 |
I have five candidates that were output by my decoding strategy. 01:00:15.840 |
And the first one is I'm a connoisseur looking for wine and pork chops. 01:00:21.120 |
I'm a connoisseur looking for wine and pork chops. 01:00:28.000 |
And depending on what kind of connoisseur you are, it's sort of up to interpretation which of these is right. 01:00:35.400 |
But perhaps a neural language model is going to be a little bit better at figuring out which one is more plausible. 01:00:40.920 |
And if you're a connoisseur, you might be looking for wine and pork chops. 01:00:45.000 |
And so what you would hope to happen is that a neural language model trained on a bunch 01:00:48.960 |
of text is going to correctly reorder these things and figure out that the second beam 01:00:56.080 |
candidate is actually the correct one, even though your N-gram model didn't help you. 01:01:07.160 |
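Here's a small sketch of that rescoring step, assuming the decoder hands back (transcription, score) pairs and that `neural_lm_log_prob` is a hypothetical callable wrapping whatever neural language model you like; the interpolation weight is an assumption you'd tune on held-out data.

```python
def rescore_nbest(nbest, neural_lm_log_prob, lm_weight=0.5):
    """Rerank the n-best list that falls out of the first-pass decoder.

    nbest: list of (transcription, first_pass_log_prob) pairs.
    neural_lm_log_prob: hypothetical callable returning the log probability of a
    whole sentence under an RNN or other neural language model.
    """
    rescored = [(text, score + lm_weight * neural_lm_log_prob(text))
                for text, score in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical usage: the decoder hands back its top 5 candidates with scores,
# and the neural LM is asked to reorder them.
# best_text, best_score = rescore_nbest(top5, my_rnn_lm.log_prob)[0]
```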
That is the set of concepts that you need to get a working speech recognition engine up and running. 01:01:16.240 |
And so the thing that's left to go to state-of-the-art performance and start serving users is scale. 01:01:22.640 |
So I'm going to kind of run through quickly a bunch of the different tactics that you can use to get there. 01:01:30.720 |
So the two pieces of scale that I want to cover, of course, are data and computing power. 01:01:37.920 |
So the first thing to know, this is just a number you can keep in the back of your head 01:01:41.080 |
for all purposes, which is that transcribing speech data is not cheap, but it's also not prohibitively expensive. 01:01:46.880 |
It's about 50 cents to a dollar a minute, depending on the quality you want and who's 01:01:50.460 |
transcribing it and the difficulty of the data. 01:01:54.320 |
So typical speech benchmarks you'll see out there are maybe hundreds to thousands of hours. 01:01:59.920 |
So like the LibriSpeech data set is maybe hundreds of hours. 01:02:04.920 |
There's another data set called VoxForge, and you can kind of cobble these together 01:02:08.240 |
and get maybe hundreds to thousands of hours. 01:02:11.120 |
But the real challenge is that the application matters a lot. 01:02:16.080 |
So all the utterances I was playing for you are examples of read speech. 01:02:21.760 |
People are sitting in a nice quiet room, they're reading something wonderful to me, and so 01:02:25.880 |
I'm going to end up with a speech engine that's really awesome at listening to the Wall Street 01:02:29.680 |
Journal, but maybe not so good at listening to someone in a crowded cafe. 01:02:35.600 |
So the application that you want to target really needs to match your data set. 01:02:41.000 |
And so it's worth, at the outset, if you're thinking about going and buying a bunch of 01:02:44.160 |
speech data, to think of what is the style of speech you're actually targeting. 01:02:49.160 |
Are you worried about read speech, like the ones we're hearing, or do you care about conversational, spontaneous speech? 01:02:55.000 |
It turns out that when people talk in a conversation, when they're spontaneous, they're just coming 01:03:00.140 |
up with what to say on the fly versus if they have something that they're just dictating 01:03:04.640 |
and they already know what to say, they behave differently. 01:03:07.880 |
And they can exhibit all of these effects like disfluency and stuttering. 01:03:14.000 |
And then in addition to that, we have all kinds of environmental factors that might 01:03:17.040 |
matter for an application, like reverb and echo. 01:03:20.040 |
We start to care about the quality of microphones and whether they have noise canceling. 01:03:24.880 |
There's something called Lombard effect that I'll mention again in a second, and of course 01:03:28.680 |
things like speaker accents, where you really have to think carefully about how you collect 01:03:32.680 |
your data to make sure that you actually represent the kinds of cases you want to test on. 01:03:39.600 |
So the reason that read speech is really popular is because we can get a lot of it. 01:03:44.600 |
And even if it doesn't perfectly match your application, it's cheap, and getting a lot of it is easy. 01:03:51.340 |
So I wanted to say a few things about read speech, because for less than $10 an hour, 01:03:55.560 |
often a lot less, you can get a whole bunch of data. 01:03:58.400 |
And it has the disadvantage that you lose a lot of things like inflection and conversationality. 01:04:08.360 |
So one of the things that we've tried doing, and I'm always interested to hear more clever 01:04:13.920 |
schemes for this, is you can kind of engineer the way that people read to try to get the effects that you want. 01:04:21.520 |
So here's one, which is that if you want a little bit more conversationality, you want 01:04:26.000 |
to get people out of that kind of humdrum dictation, you can start giving them more engaging reading material. 01:04:31.960 |
You can give them movie scripts and books, and people will actually start voice acting 01:04:37.000 |
>> Creep in, said the witch, and see if it is properly heated so that we can put the 01:04:46.520 |
>> So these are really wonderful workers; right? 01:04:48.600 |
They're kind of really getting into it to give you better data. 01:05:00.600 |
The wolf is dead and danced for joy around about the well with their mother. 01:05:10.200 |
They get this sort of lyrical quality into it that you don't get from just reading the Wall Street Journal. 01:05:16.080 |
And finally, there's something called the Lombard effect that happens when people are talking over noise. 01:05:22.200 |
So if you're in a noisy party and you're trying to talk to your friend who's a couple of chairs 01:05:26.640 |
away, you'll catch yourself involuntarily going, "Hey, over there, what are you doing?" 01:05:31.760 |
You raise your inflection, and you kind of -- you try to use different tactics to get your voice heard over the noise. 01:05:39.200 |
You'll sort of work around the channel problem. 01:05:43.040 |
And so this is very problematic when you're trying to do transcription in a noisy environment 01:05:47.600 |
because people will talk to their phones using all these effects, even though the noise canceling may scrub the background noise out of the recording. 01:05:54.620 |
So one strategy we've tried with varying levels of success -- 01:05:57.760 |
>> Then they fell asleep and evening passed, but no one came to the poor children. 01:06:02.920 |
>> -- is to actually play loud noise in people's headphones to try to get them to elicit this Lombard effect. 01:06:09.840 |
So this person is kind of raising their voice a little bit in a way that they wouldn't if they were reading in a quiet room. 01:06:17.080 |
And similarly, as I mentioned, there are a whole bunch of different augmentation strategies. 01:06:23.000 |
So there are all these effects of environment, like reverberation, echo, background noise, 01:06:28.640 |
that we would like our speech engine to be robust to. 01:06:31.840 |
And one way you could go about trying to solve this is to go collect a bunch of audio from 01:06:36.000 |
those cases and then transcribe it, but getting that raw audio is really expensive. 01:06:41.440 |
So instead, an alternative is to take the really cheap read speech that's very clean 01:06:46.560 |
and use some, like, off-the-shelf open-source audio toolkit to synthesize all the effects that you care about. 01:06:57.800 |
So for example, if we want to simulate noise in a cafe, here's just me talking to my laptop, and here's some background noise from a cafe. 01:07:19.200 |
So I can obviously collect these independently, very cheaply. 01:07:22.840 |
Then I can synthesize this by just adding these signals together. 01:07:28.000 |
Which actually sounds, I don't know, sounds to me like me talking to my laptop at a Starbucks. 01:07:34.320 |
And so for our work on deep speech, we actually take something like 10,000 hours of raw audio 01:07:39.640 |
that sounds kind of like this, and then we pile on lots and lots of audio tracks from the web. 01:07:49.160 |
People upload, like, noise tracks to the web that last for hours. 01:07:53.520 |
It's, like, really soothing to listen to the highway or something. 01:07:57.960 |
And so you can download all this free found data, and you can just overlay it on this 01:08:02.920 |
voice, and you can synthesize perhaps hundreds of thousands of hours of unique audio. 01:08:07.520 |
And so the idea here is that it's just much easier to engineer your data pipeline to be 01:08:15.320 |
robust than it is to engineer the speech engine itself to be robust. 01:08:20.200 |
So whenever you encounter an environment that you've never seen before and your speech engine 01:08:23.720 |
is breaking down, you should shift your instinct away from trying to engineer the engine to 01:08:29.000 |
fix it and toward this idea of how do I reproduce it really cheaply in my data. 01:08:35.320 |
So here's that Wall Street Journal example again. 01:08:37.320 |
>> A tanker is a ship designed to carry large volumes of oil or other liquid cargo. 01:08:43.080 |
>> And so if I wanted to, for instance, deal with a person reading the Wall Street Journal in a room with a lot of reverb, I can synthesize that too. 01:08:50.280 |
>> A tanker is a ship designed to carry large volumes of oil or other liquid cargo. 01:08:54.640 |
>> There's lots of reverb in this room, so you can't hear the reverb on the audio. 01:08:58.720 |
But basically, you can synthesize these things with one line of SoX on the command line. 01:09:05.300 |
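For reverb, the one-liner being referred to would be something along the lines of `sox clean.wav reverbed.wav reverb`. For the additive-noise case, here's a small numpy sketch of the "just add the signals together" idea, where the target SNR and the random crop are illustrative assumptions rather than the exact recipe used.

```python
import numpy as np


def mix_noise(speech, noise, snr_db=10.0):
    """Overlay a found noise track on clean read speech at a chosen SNR.

    speech, noise: 1-D float arrays at the same sample rate.
    """
    # Tile the noise if it's shorter than the utterance, then take a random crop.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    start = np.random.randint(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]

    # Scale the noise so that the mixture hits the requested signal-to-noise ratio.
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```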
So from some of our own work with building a large-scale speech engine with these technologies, here are some results. 01:09:13.200 |
And you can actually see that when we run on clean and noisy test utterances, as we 01:09:20.960 |
add more and more data all the way up to about 10,000 hours and using a lot of these synthesis 01:09:27.800 |
strategies, we can just steadily improve the performance of the engine. 01:09:32.080 |
And in fact, on things like clean speech, you can get down well below 10% word error rate. 01:09:46.000 |
Because the caveat on that last slide is, yes, more data will help if you have a big enough model. 01:09:52.640 |
And big models usually mean lots of computation. 01:09:56.800 |
So what I haven't talked about is how big these neural networks are and how much computation one of these training runs takes. 01:10:02.040 |
So if you actually want to train one of these things at scale, what are you in for? 01:10:08.200 |
It's going to take at least the number of connections in your neural network. 01:10:13.040 |
So take one slice of that RNN, the number of unique connections, multiplied by the number 01:10:18.760 |
of frames once you unroll the recurrent network, once you unfold it, multiplied by the number 01:10:24.000 |
of utterances you've got to process in your dataset, times the number of training epochs, 01:10:29.160 |
the number of times you loop through the dataset, times 3, because you have to do forward prop, backprop, and the gradient computation. 01:10:38.400 |
And then 2 flops for every connection, because there's a multiply and an add. 01:10:43.120 |
So if you multiply this out for some parameters from the deep speech engine at Baidu, you 01:10:47.960 |
get something like 1.2 times 10 to the 19 flops. 01:10:54.840 |
And if you run this on a Titan X card, this will take about a month. 01:10:59.800 |
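Here's that back-of-the-envelope calculation written out, with hypothetical numbers chosen only to land in the same ballpark as the figure quoted in the talk; they are not the actual Deep Speech parameters, and the Titan X peak is a rough single-precision number.

```python
def training_flops(connections, frames_per_utterance, utterances, epochs):
    """Back-of-the-envelope FLOP count for training an unrolled recurrent network.

    connections:          unique connections in one time slice of the network
    frames_per_utterance: frames after unrolling the recurrence in time
    utterances:           utterances in the training set
    epochs:               passes over the dataset
    The factor of 3 covers the forward pass plus the backward-pass work, and the
    factor of 2 is one multiply and one add per connection.
    """
    return connections * frames_per_utterance * utterances * epochs * 3 * 2


# Hypothetical numbers, not the real ones, chosen to land near ~1e19 FLOPs.
total = training_flops(connections=1e8, frames_per_utterance=1500,
                       utterances=1.2e6, epochs=10)
titan_x_peak = 6e12                      # rough single-precision peak, FLOPs/sec
print(f"{total:.2e} FLOPs, about {total / titan_x_peak / 86400:.0f} days at peak")
```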
Now if you already know what the model is, that might be tolerable. 01:11:04.100 |
If you're on your epic run to get your best performance so far, then this is OK. 01:11:09.380 |
But if you don't know what model's going to work, you're targeting some new scenario, 01:11:12.840 |
then you want it done now so that you can try lots and lots of models quickly. 01:11:17.400 |
So the easy fix is just to try using a bunch more GPUs with data parallelism. 01:11:23.700 |
And the good news is that so far, it looks like speech recognition allows us to use mini-batch data parallelism quite well. 01:11:30.440 |
We can process enough utterances in parallel that this is actually efficient. 01:11:35.080 |
So you'd like to keep maybe a bit more than 64 utterances on each GPU, and up to a total 01:11:41.640 |
mini-batch size of like 1,000 or maybe 2,000 is still useful. 01:11:47.140 |
And so if you're putting together your infrastructure, you can go out and you can buy a server that'll 01:11:53.600 |
fit eight of these Titan GPUs in it, and that'll actually get you to less than a week of training time. 01:12:01.480 |
So there are a whole bunch of ways to use GPUs. 01:12:07.360 |
It turns out that you've got to optimize things like your all-reduce code. 01:12:10.920 |
Once you leave one node, you have to start worrying about your network. 01:12:15.520 |
And if you want to keep scaling, then thinking about things like network traffic and the 01:12:19.920 |
right strategy for moving all of your data becomes important. 01:12:24.680 |
But we've had success scaling really well all the way out to things like 64 GPUs and getting close to linear speedups. 01:12:33.140 |
So if you've got a big cluster available, these things scale really well. 01:12:39.440 |
For instance, asynchronous SGD is now kind of a mainstay of distributed deep learning. 01:12:44.980 |
There's also been some work recently of trying to go back to synchronous SGD that has a lot 01:12:48.640 |
of nice properties, but using things like backup workers. 01:12:59.360 |
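As a sketch of the basic data-parallel idea (not of any particular framework's API), here's a numpy simulation of one synchronous step: in a real system each shard would live on its own GPU and the gradient averaging would be done by an all-reduce over the network, but the arithmetic is the same. `grad_fn` is a hypothetical gradient function.

```python
import numpy as np


def data_parallel_step(weights, batch_x, batch_y, grad_fn, num_gpus=8, lr=1e-3):
    """Simulate one step of synchronous data-parallel SGD.

    Each "GPU" gets a shard of the mini-batch (e.g. 64+ utterances per device,
    up to a total batch of roughly 1,000-2,000), computes gradients locally, and
    the gradients are averaged -- the job an all-reduce does across real devices.
    grad_fn(weights, x, y) is a hypothetical function returning dLoss/dWeights.
    """
    shards_x = np.array_split(batch_x, num_gpus)
    shards_y = np.array_split(batch_y, num_gpus)
    local_grads = [grad_fn(weights, x, y) for x, y in zip(shards_x, shards_y)]
    avg_grad = np.mean(local_grads, axis=0)   # stand-in for the all-reduce
    return weights - lr * avg_grad
```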
One word of warning as you're trying to build these systems is to watch for code that isn't running as fast as it could be. 01:13:10.240 |
And so this back-of-the-envelope calculation that we did of figuring out how many flops 01:13:15.600 |
are involved in our network and then calculating how long it would take to run if our GPU were 01:13:21.800 |
running at full efficiency, you should actually do this for your network. 01:13:27.720 |
This is the fastest your code could ever run on one GPU. 01:13:31.580 |
And if you find that you're just drastically underperforming that number, what could be 01:13:36.840 |
happening to you is that you've hit a little edge case in one of the libraries that you're 01:13:41.860 |
using and you're actually suffering a huge setback that you don't need to be feeling 01:13:46.880 |
So one of the things we found back in November is that in libraries like cuBLAS, you can 01:13:51.620 |
actually use mini batch sizes that hit these weird catastrophic cases in the library, where 01:13:57.680 |
you could be suffering like a factor of two or three performance reduction. 01:14:02.580 |
So that might take your wonderful one-week training time and blow it up to, say, a three-week 01:14:09.420 |
So that's why I wanted to go through this and ask you to keep in mind while you're training 01:14:14.100 |
these things, try to figure out how long it ought to be taking. 01:14:17.900 |
And if it's going a lot slower, be suspicious that there's some code you could be optimizing. 01:14:24.260 |
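Here's a small sketch of that sanity check, assuming you can call one training step as a function; `flops_per_step` comes from the same back-of-the-envelope arithmetic as above.

```python
import time


def achieved_teraflops(flops_per_step, step_fn, num_steps=20):
    """Measure how close training is to the theoretical best on one GPU.

    flops_per_step: back-of-the-envelope FLOPs for one mini-batch.
    step_fn: hypothetical callable that runs a single training step.
    If the result is far below the card's peak, suspect a bad code path -- for
    example a mini-batch size that hits a slow case in a library like cuBLAS.
    """
    start = time.time()
    for _ in range(num_steps):
        step_fn()
    seconds_per_step = (time.time() - start) / num_steps
    return flops_per_step / seconds_per_step / 1e12
```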
Another good trick that's particular to speech, you can also use this for other recurrent 01:14:29.980 |
tricks, is to try to keep similar length utterances together. 01:14:34.900 |
So if you look at your dataset, like a lot of things, you have this sort of distribution over utterance lengths. 01:14:43.400 |
And so you see there's a whole bunch that are, you know, maybe within about 50% of each 01:14:48.380 |
other, but there's also a large number of utterances that are very short. 01:14:52.740 |
And so what happens is when we want to process a whole bunch of these utterances in parallel, 01:14:58.580 |
if we just randomly select, say, 1,000 utterances to go into a mini-batch, there's a high probability 01:15:05.620 |
that we're going to get a whole bunch of these little short utterances along with some really long ones. 01:15:11.560 |
And in order to make all the CTC libraries work and all of our recurrent network computations 01:15:16.060 |
easy, what we have to do is pad these audio signals with zero. 01:15:20.300 |
And that winds up meaning that we're wasting huge amounts of computation, maybe a factor of two. 01:15:26.560 |
And so one way to get around it is just sort all of your utterances by length and then 01:15:32.060 |
try to keep the mini-batches to be similar lengths so that you just don't end up with a ton of wasted padding. 01:15:39.940 |
And this kind of modifies your algorithm a little bit, but in the end is worthwhile. 01:15:47.140 |
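A minimal sketch of that bucketing trick, assuming each utterance is an (audio, transcript) pair:

```python
import random


def length_sorted_batches(utterances, batch_size=64):
    """Group utterances of similar length so padding is not wasted.

    utterances: list of (audio, transcript) pairs, where audio is a 1-D array.
    Sorting by length before slicing off mini-batches means each batch only has
    to be zero-padded to the longest utterance in that batch, not in the dataset.
    """
    ordered = sorted(utterances, key=lambda u: len(u[0]))
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    # Shuffle the order of the batches (not their contents) to keep some
    # randomness in SGD while preserving the similar-length property.
    random.shuffle(batches)
    return batches
```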
So that's kind of all I want to say about computation. 01:15:50.740 |
If you've got a few GPUs, keep an eye on your running time so that you know what to optimize 01:15:56.800 |
and pay attention to the easy wins, like keeping your utterances together. 01:16:03.080 |
And I think for a lot of the jobs we see, you can have your GPU running at something like half of its peak. 01:16:12.200 |
With network time, with all the bandwidth-bound stuff, you can actually run at two to three 01:16:16.460 |
teraflops on a GPU that can only do five teraflops in the perfect case. 01:16:25.940 |
One of my favorite results from one of our largest models is actually in Mandarin. 01:16:30.220 |
So we have a whole bunch of labeled Mandarin data at Baidu. 01:16:34.160 |
And so one of the things that we did was we scaled up this model, trained it on a huge 01:16:37.720 |
amount of Mandarin data, and then, as we always do, we sit down and we do error analysis. 01:16:44.200 |
And what we would do is have a whole bunch of humans sitting around, debating the 01:16:50.560 |
transcriptions and figuring out ground-truth labels that tend to be very high quality. 01:16:55.100 |
And then we'd go and we'd run now a sort of holdout test on some new people and on the speech engine. 01:17:01.740 |
And so if you benchmark a single human being against this deep speech engine in Mandarin 01:17:08.480 |
that's powered by all the technologies we were just talking about, it turns out that 01:17:13.540 |
the speech engine can get an error rate that's down below 6% character error rate. 01:17:18.820 |
So only about 6% of the characters are wrong. 01:17:21.500 |
And a single human sitting there listening to these transcriptions actually does quite a bit worse. 01:17:29.380 |
If you give people a bit of an advantage, which is you now assemble a committee of people 01:17:36.300 |
and you get them a fresh test set so that no one has seen it before and we run this 01:17:40.220 |
test again, it turns out that the two cases are actually really close. 01:17:47.260 |
And you can end up with a committee of native Mandarin speakers sitting around debating, 01:17:50.620 |
"No, no, I think this person said this," or "No, they have an accent. 01:17:57.700 |
And then when you show them the deep speech transcription, they actually go, "Oh, that's 01:18:04.060 |
And so you can actually get this technology up to a point where it's highly competitive 01:18:09.540 |
with human beings, even human beings working together. 01:18:12.620 |
And this is sort of where I think all of the speech recognition systems are heading, thanks 01:18:17.220 |
to deep learning and the technologies that we're talking about here. 01:18:26.540 |
So how do you know the actual label of the data? 01:18:37.540 |
So the question is, if humans have such a hard time coming up with the correct transcription, 01:18:43.720 |
Sometimes you might have a little bit of user feedback, but in this instance, we have very 01:18:48.140 |
high-quality transcriptions that are coming from many labelers teamed up with a speech engine. 01:18:56.540 |
We do occasionally find errors where we just think that's a label error. 01:19:00.660 |
But when you have a committee of humans around, the really astonishing thing is that you can 01:19:05.100 |
look at the output of the speech engines, and the humans will suddenly jump ship and agree with the engine. 01:19:11.340 |
The speech engine is actually correct, because it'll often come up with an obscure word or phrase that the humans missed. 01:19:18.860 |
Once they see the label, can they be biased towards that label? 01:19:26.980 |
But let's say that a committee of human beings tend to disagree with another committee of 01:19:31.980 |
human beings about the same amount as a speech engine does. 01:19:36.660 |
So this is basically doing a sequence-to-sequence sort of task, right? 01:19:42.660 |
So we're going to hear about a really different approach to that later. 01:19:53.780 |
That's really the core component of this system. 01:19:56.260 |
It's how you deal with mapping one variable-length sequence to another. 01:20:01.140 |
Now, the CTC cost is not perfect; it has this assumption of independence baked into the model. 01:20:08.660 |
And because of that assumption, we're introducing some bias into the system. 01:20:12.900 |
And for languages like English, the characters are obviously not independent of each other, so that assumption doesn't really hold. 01:20:20.660 |
In practice, the thing that we see is that as you add a lot of data and your model gets 01:20:25.020 |
much more powerful, you can still find your way around it, but it might take more data than you would like. 01:20:32.500 |
And of course, we hope that all the new state-of-the-art methods coming out of the deep learning community 01:20:36.380 |
are going to give us an even better solution. 01:20:40.780 |
In this spectrogram, you're saying that there's a 20 milliseconds sample that you take. 01:20:51.780 |
Is there a reason for that choice -- could you have a bigger or smaller window? 01:20:54.780 |
So the question is, for a spectrogram with -- We talked about these little spectrogram 01:20:58.380 |
frames being computed from 20 milliseconds of audio. 01:21:05.180 |
So this is really determined from years and years of experience. 01:21:08.380 |
This is captured from the traditional speech community. 01:21:15.980 |
You can take a spectrogram, go back and find the best audio that corresponds to that spectrogram 01:21:22.860 |
to listen to it and see if you lost anything. 01:21:25.700 |
And with spectrograms at about this level of quantization, you can kind of tell what people are saying. 01:21:30.860 |
It's a little bit garbled, but it's still actually pretty good. 01:21:34.580 |
So amongst all the hyperparameters you could choose, this one's kind of a good tradeoff 01:21:38.420 |
in keeping the information, but also preserving a little bit of the phase information by computing the frames frequently. 01:21:38.420 |
I think in a lot of the models in the demo, for example, we don't use overlapping windows. 01:21:57.660 |
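For reference, here's roughly what that front end looks like as a numpy sketch, assuming 16 kHz audio, 20 ms non-overlapping frames, and a Hann window; the exact windowing details are assumptions, not the precise pipeline described here.

```python
import numpy as np


def log_spectrogram(audio, sample_rate=16000, window_ms=20):
    """Log-magnitude spectrogram with non-overlapping windows.

    20 ms frames at 16 kHz means 320 samples per frame; the sample rate and
    the Hann window are illustrative assumptions.
    """
    frame_len = int(sample_rate * window_ms / 1000)
    num_frames = len(audio) // frame_len
    frames = audio[:num_frames * frame_len].reshape(num_frames, frame_len)
    windowed = frames * np.hanning(frame_len)
    return np.log(np.abs(np.fft.rfft(windowed, axis=1)) + 1e-10)
```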
You mentioned you get linear scale up across CPU and GPUs. 01:22:01.660 |
So those results are from in-house software at Baidu. 01:22:11.540 |
If you use something like OpenMPI, for example, on a cluster of GPUs, it actually works pretty well. 01:22:21.540 |
But I think with some of the all-reduce algorithms, once you start moving huge amounts of data, 01:22:28.580 |
you'll suffer a hit once you start going to that many GPUs. 01:22:32.900 |
Within a single box, if you use the CUDA libraries to move data back and forth just on a local 01:22:40.660 |
box, that stuff is pretty well optimized, and you can often do it yourself. 01:22:47.020 |
So I want to take a few more questions at the end, and maybe we can run into the break a little bit. 01:22:52.100 |
I wanted to just dive right through a few comments about production here. 01:22:57.940 |
So of course, the ultimate goal of solving speech recognition is to improve people's 01:23:05.860 |
lives and enable exciting products, and so that means even though so far we've trained 01:23:11.260 |
a bunch of acoustic and language models, we also want to get these things in production. 01:23:16.340 |
And users tend to care about more than just accuracy. 01:23:19.700 |
Accuracy of course matters a lot, but we also care about things like latency. 01:23:23.740 |
Users want to see the engine send them some feedback very quickly so that they know that 01:23:27.900 |
it's responding and that it's understanding what they're saying. 01:23:31.460 |
And we also need this to be economical so that we can serve lots of users without breaking 01:23:37.220 |
So in practice, a lot of the neural networks that we use in research papers, because they're 01:23:41.140 |
awesome for beating benchmark results, turn out not to work that well on a production system. 01:23:47.260 |
So one in particular that I think is worth keeping an eye on is that it's really common 01:23:52.740 |
to use bidirectional recurrent neural networks. 01:23:55.900 |
And so throughout the talk, I've been drawing my RNN with connections that just go forward 01:24:00.420 |
in time, but you'll see a lot of research results that also have a path that goes backward 01:24:07.140 |
And this works fine if you just want to process data offline. 01:24:11.500 |
But the problem is that if I want to compute this neuron's output up at the top of my network, 01:24:16.580 |
I have to wait until I see the entire audio segment so that I can compute this backward recurrence. 01:24:24.660 |
So this sort of anti-causal part of my neural network that gets to see the future means 01:24:29.900 |
that I can't respond to a user on the fly because I need to wait for the end of their 01:24:36.380 |
So if you start out with these bidirectional RNNs that are actually much easier to get 01:24:41.700 |
working and then you jump to using a recurrent network that is forward only, it'll turn out that you lose a fair amount of accuracy. 01:24:50.420 |
And you might kind of hope that CTC, because it doesn't care about the alignment, would 01:24:55.060 |
somehow magically learn to shift the output over to get better accuracy and just artificially 01:25:01.620 |
delay the response so that it could get more context on its own. 01:25:05.620 |
But it kind of turns out to only do that a little bit in practice. 01:25:11.500 |
And so if you find that you're doing much worse, sometimes you have to sort of engage with the structure of the model. 01:25:17.620 |
So even though I've been talking about these recurrent networks, I want you to bear in 01:25:20.620 |
mind that there's this dual optimization going on. 01:25:24.940 |
You want to find a model structure that gives you really good accuracy, but you also have 01:25:28.780 |
to think carefully about how you set up the structure so that this little neuron at the 01:25:33.460 |
top can actually see enough context to get an accurate answer and not depend too much 01:25:41.420 |
So for example, what we could do is tweak this model so that this neuron at the top 01:25:46.660 |
that's trying to output the character L in hello can see some future frames, but it doesn't have to wait for the entire utterance. 01:25:55.040 |
So it only gets to see a little bit of context. 01:25:57.660 |
That lets us kind of contain the amount of latency in the model. 01:26:04.860 |
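Here's a small numpy sketch of that bounded-lookahead idea: the output at each time step mixes in a couple of future frames through a learned weight matrix, so latency stays small. The shapes and the weighting scheme are assumptions for illustration, not the exact layer used in any particular system.

```python
import numpy as np


def lookahead_outputs(hidden, w_ahead, context=2):
    """Give a forward-only model a small, bounded peek at future frames.

    hidden:  (T, H) hidden states from a forward-in-time recurrent layer.
    w_ahead: (context + 1, H) weights mixing the current frame with a few
             future frames.
    The output at time t depends only on frames up to t + context, so the extra
    latency is bounded, unlike a bidirectional RNN that has to wait for the end
    of the utterance.
    """
    T, H = hidden.shape
    padded = np.vstack([hidden, np.zeros((context, H))])   # pad the future edge
    out = np.zeros((T, H))
    for t in range(T):
        # Weighted combination of the current frame and `context` future frames.
        out[t] = np.sum(w_ahead * padded[t:t + context + 1], axis=0)
    return out
```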
So in terms of other online aspects, of course, we want this to be efficient. 01:26:12.380 |
We want to serve lots of users on a small number of machines if possible. 01:26:17.260 |
And one of the things that you might find if you have a really big deep neural network 01:26:22.020 |
or recurrent neural network is that it's really hard to deploy them on conventional CPUs. 01:26:30.940 |
On a CPU, you just want to go as fast as you can through this one string of instructions. 01:26:35.660 |
But as we've discovered with so much of deep learning, GPUs are really fantastic because 01:26:40.420 |
when we work with neural networks, we love processing lots and lots of arithmetic in 01:26:46.080 |
But it's really only efficient if the hunks of audio 01:26:50.300 |
that we're working on come in a big enough batch. 01:26:56.280 |
So if we just process one stream of audio so that my GPU is multiplying matrices times 01:27:00.900 |
vectors, then my GPU is going to be really inefficient. 01:27:05.300 |
So for example, on like a K1200 GPU, so something you could put in a server in the cloud, what 01:27:12.100 |
you'll find is that you get really poor throughput considering the dollar value of this hardware 01:27:19.240 |
if you're only processing one piece of audio at a time. 01:27:22.020 |
Whereas if you could somehow batch up audio to have, say, 10 or 32 streams going at once, 01:27:28.480 |
then you can actually squeeze out a lot more performance from that piece of hardware. 01:27:34.180 |
So one of the things that we've been working on that works really well and is not too bad 01:27:39.200 |
to implement is to just batch all of the packets as data comes in. 01:27:43.700 |
So if I have a whole bunch of users talking to my server and they're sending me little 01:27:47.620 |
hundred millisecond packets of audio, what I can do is I can sit and I can listen to 01:27:53.500 |
all these users, and when I catch a whole batch of utterances coming in or a whole bunch 01:27:58.480 |
of audio packets coming in from different people that start around the same time, I 01:28:03.580 |
plug those all into my GPU and I process those matrix multiplications together. 01:28:08.780 |
So instead of multiplying a matrix times only one little audio piece, I get to multiply 01:28:12.800 |
it by a batch of, say, four audio pieces, and it's much more efficient. 01:28:18.420 |
And if you actually do this on a live server and you plow a whole bunch of audio streams 01:28:23.140 |
through it, you could support maybe 10, 20, 30 users in parallel, and as the load on that 01:28:29.300 |
server goes up, I have more and more users piling on, what happens is that the GPU will 01:28:34.420 |
naturally start batching up more and more packets into single matrix multiplications. 01:28:40.140 |
So as you get more users, you actually get much more efficient as well. 01:28:45.660 |
And so in practice, when you have a whole bunch of users on one machine, you usually 01:28:49.660 |
don't see matrix multiplications happening with fewer than maybe batch sizes of four. 01:28:56.660 |
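Here's a sketch of that request-batching loop, assuming fixed-size feature packets arrive on a thread-safe queue and that `forward_fn` is a hypothetical batched acoustic-model call; the transport back to each user is omitted.

```python
import queue
import numpy as np


def serve_batched(request_queue, forward_fn, max_batch=32, wait_s=0.01):
    """Batch audio packets from many users into one matrix-matrix multiply.

    request_queue: a queue.Queue of (user_id, feature_matrix) items, one per
                   incoming ~100 ms audio packet (assumed to be the same size).
    forward_fn:    hypothetical function that runs the acoustic model on a
                   stacked batch and returns one output per item.
    Waiting a few milliseconds for whatever packets have arrived turns
    matrix-vector work into matrix-matrix work, and under heavier load the
    batches naturally grow, so the GPU gets more efficient as users pile on.
    """
    while True:
        batch = [request_queue.get()]                # block until one packet arrives
        while len(batch) < max_batch:
            try:
                batch.append(request_queue.get(timeout=wait_s))
            except queue.Empty:
                break
        users, features = zip(*batch)
        outputs = forward_fn(np.stack(features))     # one batched forward pass
        for user, out in zip(users, outputs):
            pass  # send `out` back to `user`; the transport layer is omitted here
```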
So the summary of all of this is that deep learning is really making the first steps 01:29:04.060 |
to building a state-of-the-art speech engine easier than they've ever been. 01:29:07.220 |
So if you want to build a new state-of-the-art speech engine for some new language, all of 01:29:11.420 |
the components that you need are things that we've covered so far. 01:29:15.860 |
And the performance now is really significantly driven by data and models, and I think, as 01:29:20.740 |
we were discussing earlier, I think future models from deep learning are going to make 01:29:24.420 |
that influence of data and computing power even stronger. 01:29:29.740 |
And of course, data and compute is important so that we can try lots and lots of models quickly. 01:29:36.460 |
And I think this technology is now at a stage where it's not just a research system anymore. 01:29:42.420 |
We're seeing that the end-to-end deep learning technologies are now mature enough that we can put them in front of real users. 01:29:48.180 |
I think you guys are going to be seeing deep learning play a bigger and bigger role in the 01:29:52.300 |
speech engines that are powering all the devices that we use. 01:30:07.660 |
All right, we had one in the back who was waiting patiently. 01:30:19.780 |
So the question is, how does the engine handle more than one voice simultaneously? 01:30:24.580 |
So right now, there's nothing in this formalism that allows you to account for multiple speakers. 01:30:32.220 |
And so usually, when you listen to an audio clip in practice, it's clear that there's one dominant speaker. 01:30:39.760 |
And so this speech engine, of course, learns whatever it was taught from the labels. 01:30:44.540 |
And it will try to filter out background speakers and just transcribe the dominant one. 01:30:49.580 |
But if it's really ambiguous, then the results are undefined. 01:30:53.860 |
Can you customize the transcription to the specific characteristics of a particular speaker? 01:31:04.260 |
So we're not doing that in these pipelines right now. 01:31:08.780 |
But of course, a lot of different strategies have been developed in the traditional speech 01:31:15.420 |
There are things like iVectors that try to quantify someone's voice. 01:31:18.420 |
And those make useful features for improving speech engines. 01:31:21.980 |
You could also imagine taking a lot of the concepts like embeddings, for example, and applying them here. 01:31:28.380 |
So I think a lot of that is left open to future work. 01:31:39.820 |
And you guys can come to me with your questions. 01:31:44.540 |
So we'll reconvene at 2:45 for a presentation by Alex.