Stanford CS25: V1 | Audio Research: Transformers for Applications in Audio, Speech, Music
Chapters
0:00 Introduction
0:06 Transformers for Music and Audio: Language Modelling to Understanding to Synthesis
1:35 The Transformer Revolution
5:02 Models getting bigger ...
7:43 What are spectrograms
14:30 Raw Audio Synthesis: Difficulty Classical FM synthesis Karplus Strong
17:14 Baseline : Classic WaveNet
20:04 Improving Transformer Baseline • Major bottleneck of Transformers
21:02 Results & Unconditioned Setup • Evaluation criterion: comparing WaveNet and Transformers on next-sample prediction, with top-5 accuracy out of 256 possible states as the error metric • Why this setup? 1. Application agnostic 2. Suits training setup
22:11 A Framework for Generative and Contrastive Learning of Audio Representations
22:38 Acoustic Scene Understanding
24:34 Recipe of doing
26:00 Turbocharging the best of two worlds • Vector Quantization: a powerful and under-utilized algorithm • Combining VQ with auto-encoders and Transformers
33:24 Turbocharging the best of two worlds • Learning clusters from vector quantization • Use long-term dependency learning with that cluster-based representation for the Markovian assumption • The better we become at prediction, the better the summarization is
37:06 Audio Transformers: Transformer Architectures for Large Scale Audio Understanding - Adieu Convolutions, Stanford University, March 2021
38:45 Wavelets on Transformer Embeddings
41:20 Methodology + Results
44:04 What does it learn -- the front end
47:18 Final Thoughts
different than what all of us were doing in this past course. 00:00:28.240 |
is basically like I'll be throwing a lot of stuff. 00:00:32.440 |
and then you feel free to like or dislike whatever you want. 00:00:36.640 |
And I'll be talking mostly about three papers of what 00:00:42.120 |
I'll start with introducing what transformers 00:00:52.200 |
which is just doing language modeling on sample level. 00:00:56.080 |
Then I'll talk about how can one do like language 00:00:59.080 |
modeling for speech and audio, which is different than what 00:01:04.280 |
What are the current trends in the literature? 00:01:14.920 |
How can we adapt similar ideas for audio transformers? 00:01:22.120 |
Having told that the talk is about 35 to 40 minutes 00:01:31.160 |
is not responsible for any of the mistakes which I make. 00:01:35.880 |
So transformers have kind of revolutionized in a way 00:01:55.600 |
adapting CNNs with some sort of dilated convolutions. 00:02:02.480 |
Now, it seems like transformers are in fashion all the time. 00:02:05.960 |
So it seems to be solving almost every single problem which 00:02:15.000 |
One of the facts which struck me was their simplicity, which 00:02:28.640 |
And within three years, it has about 30,000 citations. 00:02:31.280 |
And it is kind of solving every single problem 00:02:40.720 |
transformers are basically a way of just cascading self-attention 00:02:46.960 |
And if you keep on doing it over and over again, 00:02:49.280 |
then the model, in a way, learns which parts of the input 00:02:55.120 |
removing the contents which are not important, 00:02:57.360 |
and just have the limited information which is just 00:03:04.120 |
And it has been very, very difficult to keep up 00:03:10.760 |
But then even Twitter's recommendation engine 00:03:16.120 |
they were going haywire as to why Chris Manning is just 00:03:26.440 |
also to keep up with the pace of what's going on. 00:03:29.800 |
Just before transformers, all of the NLP community 00:03:33.000 |
was just going gaga about bidirectional LSTMs 00:03:46.000 |
And then after that, you have attention mechanism 00:03:50.600 |
and then just keeps on decoding sequentially one at a time. 00:03:54.760 |
But this was not kind of like an ideal way to do it. 00:03:58.040 |
Because what turns out is when we start throwing longer 00:04:07.120 |
storing the gradient updates in a way it should be doing. 00:04:25.280 |
for a particular problem at that particular layer. 00:04:31.920 |
So then the whole idea of transformers and attention 00:04:40.880 |
because this is the last class of the course. 00:04:43.400 |
But then usual tricks do help across the neural net 00:04:46.600 |
literature, which is like having multi-head attention, 00:04:54.440 |
like giving gains for transformers themselves, 00:04:57.800 |
but they can be just applied to any single other architecture 00:05:03.440 |
The other thing which is helping this research 00:05:05.960 |
is basically that the available compute is getting better and better. 00:05:12.640 |
throwing massive amounts of computing resources 00:05:14.880 |
at solving very, very simple and trivial tasks. 00:05:18.920 |
The top of the hill being the Switch Transformer, which 00:05:30.080 |
which was just learning these contextualized representations 00:05:37.400 |
one of the first kind of like model 0.0 or something, 00:05:50.760 |
You can see that how similar these kind of models 00:06:00.200 |
some of the LSTM layers with transformer modules. 00:06:08.360 |
of natural language processing or other domain, 00:06:10.760 |
these can be adopted in a variety of domains. 00:06:13.000 |
And for today's talk, I'll be just adopting them to audio. 00:06:17.720 |
So I'll basically start with introducing people 00:06:23.400 |
for the sake of completeness, talk about spectrograms. 00:06:41.680 |
you're kind of like decomposing the actual time domain 00:06:51.880 |
like this, which is a sum of three pure sinusoids, 00:06:57.440 |
And you can see that when you take a Fourier transform 00:07:03.280 |
the strength of the individual components shown here. 00:07:13.720 |
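A minimal sketch of what is being described here, with three illustrative sinusoids (the frequencies and amplitudes below are my own choices, not the ones on the slide): summing pure tones and taking a Fourier transform shows the strength of each component as a peak at its frequency.

```python
import numpy as np

sr = 16000                                   # sample rate in Hz (assumed)
t = np.arange(sr) / sr                       # one second of time
x = (1.00 * np.sin(2 * np.pi * 440 * t)
     + 0.50 * np.sin(2 * np.pi * 880 * t)
     + 0.25 * np.sin(2 * np.pi * 1320 * t))

spectrum = np.abs(np.fft.rfft(x)) / len(x)   # magnitude of each frequency bin
freqs = np.fft.rfftfreq(len(x), d=1 / sr)    # bin index -> frequency in Hz
for f in (440, 880, 1320):                   # peaks appear at the three components
    print(f, spectrum[np.argmin(np.abs(freqs - f))])
```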
is basically a much richer sinusoidal decomposition 00:07:17.680 |
because it is kind of a discontinuous signal. 00:07:26.960 |
And here also you can see that, OK, if this was a square wave, 00:07:29.880 |
then it is actually made up of a lot of sinusoids 00:07:44.120 |
I mean, this right away is suboptimal, right? 00:07:47.040 |
Because you're kind of fixing up the number of sinusoids 00:07:56.080 |
which was a square wave itself than a sinusoidal signal. 00:08:01.360 |
The second thing is even if you are taking a sinusoidal signal, 00:08:05.320 |
we kind of are just putting them in an equidistant space. 00:08:09.440 |
So you're kind of dividing the whole frequency 00:08:19.400 |
So that is like a traditional Fourier representation 00:08:30.480 |
But in reality, all of these signals are discontinuous. 00:08:34.880 |
All of these signals vary quite a bit, right? 00:08:39.960 |
speaking which is like a square wave for a certain period 00:08:52.840 |
and take Fourier transform of these individual batches. 00:09:02.240 |
So right here, you can see that you have a continuous signal. 00:09:07.320 |
You apply the Fourier transform, and what you get 00:09:10.280 |
is basically like a spectrogram representation of the signal. 00:09:16.360 |
is for each of the slices, the signal kind of 00:09:21.560 |
transform with the waveform which is there below. 00:09:24.840 |
And what you do is for spectrogram representation, 00:09:29.160 |
slice, the magnitude of the Fourier transform slices. 00:09:31.760 |
And in this way, you kind of get like a 2D representation 00:09:36.720 |
And if you're coming from a vision background, 00:09:40.160 |
which you're doing in vision would just work well 00:09:42.240 |
if you just apply them to these 2D spectral representations. 00:09:47.240 |
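A minimal sketch of the spectrogram computation described above, assuming nothing beyond what was said: slice the waveform, take the magnitude of the Fourier transform of each slice, and stack the slices into a 2D time-frequency image (window and hop sizes here are illustrative).

```python
import numpy as np

def spectrogram(x, n_fft=1024, hop=256):
    """Magnitude spectrogram: rows = frequency bins, columns = time slices."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft] * window     # one windowed slice of the signal
        frames.append(np.abs(np.fft.rfft(frame)))   # magnitude of its Fourier transform
    return np.stack(frames, axis=1)                 # shape: (n_fft // 2 + 1, n_slices)

# spec = spectrogram(waveform)  # y-axis = frequency, x-axis = time, as in the plots
```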
I'll quickly play how these spectrograms look 00:10:38.120 |
you have kind of like a time axis on your x-axis. 00:10:40.880 |
And then you have a frequency axis on y-axis. 00:10:43.680 |
And then for whatever is your signal of interest, 00:10:46.120 |
you're basically like putting these slices together. 00:10:48.880 |
And different sound gives you like different spectra 00:10:58.040 |
So there can be like different kinds of representations also. 00:11:01.200 |
So one, you could just take these slices of Fourier 00:11:06.320 |
transform and then do like a linear mapping to them 00:11:09.320 |
so that you're kind of in a way making these as 00:11:14.680 |
So you can have like log of the frequency on the y-axis 00:11:19.240 |
And then you get like a constant Q-like representation. 00:11:25.880 |
the spacing between the harmonics kind of remains same. 00:11:28.960 |
So if you're like training convolutional filters, 00:11:31.040 |
then that's of a huge advantage because the signal, 00:11:33.680 |
like one component of the invariance is gone. 00:11:37.280 |
which are catching onto these constant templates of Fourier 00:11:50.520 |
are two things which we have to keep in mind. 00:11:54.320 |
So we kind of like take the continuous signal 00:11:57.200 |
and then we discretize the continuous signal. 00:11:59.400 |
So one parameter is like how fast we are sampling 00:12:04.280 |
So that's typically on the order of like 16,000 or 8,000 00:12:07.720 |
times a second if you're on telephonic speech. 00:12:10.440 |
The other thing which we also choose is how many levels 00:12:15.480 |
So in this case, you can see that each of the dots 00:12:19.400 |
And typically, people use 8-bit quantizers or 16-bit 00:12:23.280 |
So in a way, you can think about that for every one 00:12:32.720 |
are allowed to take one of the levels between 0 to 255. 00:12:36.800 |
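A minimal sketch of this discretization: sample the waveform (16 kHz in the talk, 8 kHz for telephone speech) and quantize each sample to one of 256 levels. Mu-law companding, the usual WaveNet-style choice, is shown as an option; plain uniform quantization is all that the description above strictly requires.

```python
import numpy as np

def quantize(x, n_levels=256, mu_law=True):
    """Map samples in [-1, 1] to integer levels in [0, n_levels - 1]."""
    x = np.clip(x, -1.0, 1.0)
    if mu_law:                                   # optional WaveNet-style companding
        mu = n_levels - 1
        x = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((x + 1) / 2 * (n_levels - 1)).round().astype(np.int64)
```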
And that's like if I can take the problem of continuous audio 00:12:41.560 |
and just have it in terms of this sort of discrete space, 00:12:45.320 |
then basically I'm just going to the territory 00:12:53.480 |
can we do generative modeling for raw audio, which 00:13:01.480 |
I'll be putting QR codes if you like the stuff what I'm doing. 00:13:05.560 |
And if you think that this is relevant to you, 00:13:13.000 |
So yeah, so I'll start with the first subtopic 00:13:16.400 |
of today's talk, which is like what are WaveNets 00:13:22.640 |
and how do we do this generative modeling over raw audio? 00:13:28.360 |
about this as doing language modeling over these 256 00:13:33.120 |
So you can throw in your favorite transformer model 00:13:35.760 |
like Transformer-XL or GPT or whatever you want to call it. 00:13:43.160 |
are trying to predict one of the levels out of 256. 00:13:45.800 |
And you have to predict the next level given a certain context. 00:13:50.760 |
So the way you are modeling the probability distribution 00:13:57.320 |
you're trying to predict what's the probability 00:13:59.360 |
of the next sample given some past context. 00:14:07.080 |
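A minimal sketch of that training setup, assuming the 8-bit quantization above (the context length is an arbitrary illustrative choice): the quantized waveform becomes a token sequence over 256 states, and the model learns to predict the next sample from the past context.

```python
import numpy as np

def make_training_pairs(tokens, context_len=1024):
    """tokens: 1-D array of quantized samples (integers in [0, 255])."""
    inputs, targets = [], []
    for i in range(len(tokens) - context_len):
        inputs.append(tokens[i:i + context_len])   # past context
        targets.append(tokens[i + context_len])    # next sample to predict
    # Train any autoregressive model (WaveNet, GPT-style transformer) with
    # categorical cross-entropy over the 256 possible levels.
    return np.array(inputs), np.array(targets)
```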
and it has been a core building block for almost all speech 00:14:13.000 |
You can think about speech to text, text to speech synthesis, 00:14:19.040 |
concealment over the internet, speech denoising. 00:14:21.680 |
So wherever there's some sort of element of modifying audio, 00:14:25.560 |
people have been using WaveNet as a core building block. 00:14:30.440 |
And raw waveform synthesis has been difficult 00:14:35.920 |
if I'm just trying to synthesize 10 seconds of audio, 00:14:38.880 |
it would just amount to me having a probability 00:14:45.760 |
And that itself is tough because our ears are very, very 00:14:55.600 |
my eyes would not be as susceptible to noticing 00:14:59.200 |
that effect versus if I'm off by, say, a few samples 00:15:04.240 |
in an audio, it would just catch our ears pretty quickly. 00:15:08.600 |
People have been trying raw audio synthesis a lot 00:15:12.440 |
And before all of the WaveNet and transformer-based 00:15:25.600 |
On the right, I've shown a SampleRNN model, which 00:15:28.600 |
kind of models the probability distribution of what's 00:15:33.280 |
going to come next given the past at multiple levels. 00:15:36.400 |
And this was work done by Yoshua Bengio at Mila. 00:15:42.560 |
see this architecture versus a transformer architecture, 00:15:45.520 |
in a way, these are starting to get very, very similar. 00:15:52.320 |
you're trying to see a lot of local substructures. 00:15:56.040 |
And then you keep on doing it over and over again. 00:15:58.120 |
And you can draw parallels, like attention mechanism 00:16:04.000 |
So this was kind of like the literature in the past. 00:16:09.280 |
What we tried to do was we just had the WaveNet model. 00:16:13.120 |
And we tried to see whether transformers can beat them. 00:16:15.800 |
And our intuition was it should be able to beat them 00:16:18.160 |
because they are successful all over the other domains, 00:16:28.720 |
We also tried to see whether we can circumvent the order 00:16:31.400 |
n squared constraint by conditioning on the context 00:16:40.560 |
And we just said, OK, just in terms like modeling behavior, 00:16:53.440 |
because the model is agnostic to what it is being thrown in. 00:17:05.520 |
You do the exact same thing with transformer-based, 00:17:05.520 |
like GPT kind of model and see how well they do. 00:17:14.000 |
I'll briefly chat about what WaveNet models are. 00:17:18.040 |
So WaveNet was kind of like a convolution-based model, which 00:17:21.360 |
was getting rid of all of the vanishing gradient problem 00:17:31.120 |
So what they did was basically have this sort 00:17:33.880 |
of dilation layers, or convolution with dilations, 00:17:38.320 |
which is basically I kind of skip in every subsequent layer 00:17:43.240 |
So you can see if I have a dilation factor of 2 00:17:47.080 |
with a kernel size of 2, I would get this kind of a topology 00:17:50.400 |
where my convolution filters in the very first layer 00:17:59.360 |
like I look at the fourth one in the next layer and so on. 00:18:08.000 |
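A minimal sketch of the dilated topology just described, in PyTorch; only the kernel-size-2, dilation-doubling, causal-padding skeleton is shown, not the full WaveNet with gated activations and residual/skip connections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=64, n_layers=10):
        super().__init__()
        # kernel size 2, dilation doubling every layer: 1, 2, 4, ..., 2**(n_layers-1)
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(n_layers)
        )

    def forward(self, x):                                  # x: (batch, channels, time)
        for i, conv in enumerate(self.convs):
            x = torch.relu(conv(F.pad(x, (2 ** i, 0))))    # left-pad keeps it causal
        return x

# Receptive field = 1 + sum of dilations = 1024 samples for 10 layers.
```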
And then I have a categorical cross-entropy loss, 00:18:11.200 |
which is basically I have to predict the next sample given 00:18:16.640 |
And I just do the exact same thing with transformers also. 00:18:24.120 |
So I have something which is very similar to GPT, 00:18:26.840 |
in which I have causal masks in my attention mechanism. 00:18:37.400 |
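A minimal sketch of that causal mask: position t is allowed to attend only to positions up to t, so the model cannot peek at future samples when predicting the next one.

```python
import torch

def causal_mask(seq_len):
    # True marks positions that may NOT be attended to (the strict upper triangle)
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# Pass this as the attention mask (e.g. attn_mask in torch.nn.MultiheadAttention),
# or equivalently add -inf at the True positions before the softmax.
```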
You just have a stack of these transformer blocks 00:18:45.320 |
So it should be doing better than our baseline WaveNet models. 00:18:45.320 |
we are kind of defining a topology on our own, right? 00:19:01.520 |
layer one were to depend on very way back sample, say, 00:19:07.760 |
instead of the second sample, the 10th sample? 00:19:09.640 |
So we are kind of ignoring all of that topology, which 00:19:16.040 |
Whereas transformers with the self-attention mechanism 00:19:19.640 |
can just learn, like, OK, which part of the samples 00:19:26.760 |
So it made sense to us that, OK, transformer layer 00:19:30.880 |
should be doing way better than WaveNet models. 00:19:34.880 |
The second thing which we came across was, OK, 00:19:42.480 |
needs to store all of those of order n squared. 00:19:46.120 |
So in this case, if I'm storing data at 100 milliseconds, 00:19:52.600 |
And I need to store 1,600 by 1,600 at multiple layers. 00:19:56.840 |
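The rough arithmetic behind that bottleneck, as a worked example: 100 ms of 16 kHz audio is 1,600 samples, so one full self-attention matrix already holds 1,600 x 1,600 scores, per head, per layer.

```python
samples = int(0.100 * 16000)                         # 1,600 samples in 100 ms
entries = samples * samples                          # 2,560,000 attention scores
print(entries * 4 / 1e6, "MB per attention matrix")  # ~10.2 MB in float32, per head per layer
```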
And it just becomes like a huge problem with the data-- 00:20:10.840 |
So in order to have much better representation at every layer, 00:20:21.960 |
do a sample-wise conditioning and throw in CNN layers just 00:20:21.960 |
So you still have, like, an attention mechanism 00:20:32.960 |
But then I'm also conditioning at every sample, OK, 00:20:36.560 |
what the next sample should be given on this context embedding. 00:20:41.240 |
And if you think about it, in a way, it is like, OK, 00:20:43.360 |
if there are, like, five or six notes being played in a piano, 00:20:59.040 |
And I would just use that to predict the next sample. 00:21:04.840 |
did not look for negative log-likelihood scores. 00:21:08.720 |
We just looked at how well our prediction task was. 00:21:17.520 |
and saw that, OK, what was the performance using 00:21:21.320 |
their benchmarks and even, like, bigger stacked WaveNets. 00:21:25.840 |
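Per the evaluation criterion in the chapter list (top-5 accuracy over the 256 possible states on next-sample prediction), a minimal sketch of the metric used to compare the models:

```python
import numpy as np

def top5_accuracy(logits, targets):
    """logits: (n_examples, 256) model scores; targets: (n_examples,) true next-sample levels."""
    top5 = np.argsort(logits, axis=1)[:, -5:]        # indices of the 5 highest-scoring levels
    hits = (top5 == targets[:, None]).any(axis=1)    # was the true level among them?
    return hits.mean()
```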
We then started to increase the complexity of transformers 00:21:32.520 |
in terms of, like, conditioning on the vanilla transformer 00:21:39.520 |
We did not look for, like, an application-specific problem, 00:21:43.640 |
which is basically, like, we don't look at, like, 00:21:46.520 |
how well perception tasks are for, like, say, 00:21:49.200 |
text-to-speech synthesis or speech denoising. 00:21:56.000 |
then with the same model, with the same loss function, 00:21:59.560 |
how well they do on, like, similar kind of parameters. 00:22:03.720 |
So this was the first kind of, like, sub-block of, like, 00:22:06.640 |
how can we use our transformers for generative modeling. 00:22:12.160 |
For the second problem, I'll do a quick headway 00:22:21.200 |
is kind of becoming a really fancy term right now. 00:22:24.880 |
And this work was done by Julius Smith way back in 2020. 00:22:24.880 |
And the goal of this was, can we kind of, in a way, 00:22:32.520 |
do language modeling with continuous audio sequences? 00:22:37.440 |
And I'll briefly mention about that in this sub-block 00:22:42.960 |
And this is in regard for, like, solving acoustic scene 00:23:06.120 |
So for example, like, if you think about, like, 00:23:09.040 |
So Waymo has started to incorporate microphones 00:23:14.220 |
Because, say, if there is an ambulance coming, 00:23:18.960 |
then that sound would be picked up way, way before even 00:23:30.880 |
Apple, during COVID, did a hand-washing detection 00:23:34.600 |
Because if you could detect when someone is washing their hands, 00:23:37.800 |
then you can, in a way, like, tell people that, oh, 00:23:42.720 |
And then that can be built upon as a cool application. 00:23:49.520 |
So Spotify, YouTube Music kind of gives, like, 00:23:51.760 |
very, very good songs, which you are listening to, 00:23:54.640 |
which are similar in content that you would perhaps like. 00:23:59.320 |
It can also give, like, really cool applications. 00:24:05.800 |
Or I could detect whether I'm coughing or not, 00:24:09.960 |
And these can be, like, good medical device-- 00:24:15.360 |
used along with the current diagnosis that a doctor provides. 00:24:23.320 |
was, like, how can we do, like, language modeling 00:24:37.360 |
become, like, very, very popular these days in terms of, like, 00:24:42.240 |
It started with, like, OpenAI, and to a certain extent, 00:24:46.280 |
DeepMind proposing that in terms of, like, VQ-VAE models. 00:24:52.640 |
love operating in discrete spaces, as of now. 00:24:56.440 |
And what they kind of do is, as long as your representations 00:25:01.200 |
are discrete, they are very, very good at modeling 00:25:06.760 |
So what people have been proposing as a workaround 00:25:09.200 |
is you could take up, like, your favorite embedding 00:25:18.200 |
or you could take a Wave2Vec, or in terms of video, 00:25:21.840 |
you can just do classic VGG or ResNet embeddings. 00:25:31.040 |
And k-means clustering would give you, like, discrete codes. 00:25:34.320 |
You do language modeling with those discrete codes, 00:25:41.440 |
then you're kind of doing language modeling over audio. 00:25:55.320 |
to the audio, similar to what Jukebox from OpenAI did. 00:26:00.240 |
So I'll quickly mention about what vector quantization is. 00:26:05.920 |
It's one of the most underutilized algorithms, 00:26:09.880 |
And what it does is basically gives, in a way, 00:26:12.960 |
discrete codes to continuous embedding spaces. 00:26:31.520 |
get these patches of where all of these embeddings 00:26:36.200 |
are, what would be the representative embedding 00:26:42.120 |
and you can just number them, or you can just list them. 00:26:45.960 |
So in this case, you can perhaps have 25 numbers, or 20 numbers, 00:26:49.840 |
which are, in a way, mapping from a continuous embedding 00:27:02.320 |
took a batch of spectrogram, which are basically 00:27:05.160 |
very small patches across time, and then shared all 00:27:16.600 |
In our case, it was just like three-layer autoencoder, 00:27:19.720 |
fully-connected encoders with three layers of decoders, 00:27:29.280 |
in, say, 64-dimensional space or 120-dimensional space. 00:27:37.840 |
Suddenly, in a way, you can find discrete codes 00:27:43.520 |
for continuous embedding spaces or even continuous segments. 00:27:48.040 |
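A minimal sketch of that recipe, under my reading of it: embed small spectrogram patches with any encoder (a small fully-connected autoencoder bottleneck in the talk), run k-means on the embeddings, and use the cluster indices as discrete codes. The sizes below are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def codes_from_embeddings(embeddings, n_codes=16):
    """embeddings: (n_patches, dim) bottleneck vectors, e.g. 64-d per spectrogram patch."""
    km = KMeans(n_clusters=n_codes, n_init=10).fit(embeddings)
    return km.labels_          # one integer code in [0, n_codes) per patch

# The continuous audio is now a sequence of discrete codes, which a transformer
# can model exactly like text tokens.
```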
And since we know that transformers kind of love 00:27:58.480 |
So in our case, we just had very simple three-layer, 00:28:13.600 |
Now, I'll give an example of why the number of codes 00:28:25.000 |
removing all of the information which was relevant, 00:28:27.280 |
and you're just kind of averaging them all out. 00:28:42.160 |
I talked about, in a slightly different manner. 00:28:50.520 |
So in a way, learn sequences which are just moving slowly 00:28:54.080 |
and which are looking at only a certain amount of audio. 00:28:57.680 |
So you kind of encode this in these discrete levels, which 00:29:10.520 |
And in the next level, perhaps, it had code number 2. 00:29:12.840 |
And in the very top, perhaps, it had code number 2,000. 00:29:16.440 |
So in a way, I'm discretizing the whole codes. 00:29:19.680 |
Now what I do is I take up my favorite transformer model, 00:29:45.560 |
Text-to-speech, in a way, is going from discrete letters 00:29:50.560 |
So I would throw in the fanciest, which was WaveNet. 00:30:11.960 |
You can throw in the fanciest of embedding or latent 00:30:22.000 |
And then you decode back using a fancy WaveNet or state-of-the-art 00:30:28.000 |
So this was what they were doing for music synthesis. 00:30:51.160 |
learn representation, which can just encapsulate whatever we 00:31:06.840 |
But instead of doing on VQ-VAE end-to-end learned encodings, 00:31:20.960 |
and you just divide them into very small chunks, 00:31:23.960 |
learn autoencoder encodings for each of those chunks, 00:31:30.320 |
In this case, let's say I am learning 16 codes. 00:31:34.120 |
Represent the continuous audio in terms of the 16 codes. 00:31:38.360 |
Have a transformer which can perhaps predict the next code. 00:31:41.880 |
And if I keep on getting better and better at predicting 00:31:45.120 |
what's going to happen next, then in this linear layer, 00:31:51.320 |
or what's a good summary of what has happened in the past. 00:31:56.640 |
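A minimal sketch of the setup just described: a small GPT-style transformer over the discrete codes, trained to predict the next code, whose hidden states serve as a summary of the past. Sizes are illustrative, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class CodeLM(nn.Module):
    def __init__(self, n_codes=16, d_model=128, n_layers=3, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(n_codes, d_model)        # positional encoding omitted here
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_codes)            # predicts the next code

    def forward(self, codes):                              # codes: (batch, seq_len) integers
        seq_len = codes.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.encoder(self.embed(codes), mask=mask)     # causal self-attention
        return self.head(h)                                # logits for each position's next code
```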
So that was our intuition behind trying this. 00:32:05.560 |
You can see here, these are just two piano notes switching 00:32:12.080 |
it just happens to have just a single line of encoding, 00:32:23.600 |
where I'm actually able to get what the individual notes are. 00:32:32.880 |
had a different name to the whole thing, which 00:32:35.240 |
is we can just call this as textless NLP also in the sense 00:32:40.760 |
that, OK, you can do NLP without having access to text. 00:32:46.360 |
You have an encoder, which is exactly similar to say 00:32:49.680 |
You have a VQ-VAE, Wave2Vec, or whatever you want to do. 00:33:00.280 |
like a different version of text-to-speech, which 00:33:05.000 |
So as you can see, these are all the same wine in different bottles. 00:33:08.280 |
But the core idea is almost exactly the same. 00:33:12.560 |
So this created a huge uproar of this going to change NLP. 00:33:18.120 |
But this is very, very similar to what people 00:33:32.400 |
what's going to happen next given the previous context 00:33:35.360 |
and use that representation similar to every single one- 00:33:40.320 |
shot learning or zero-shot learning-based method. 00:33:44.600 |
I also explain why the number of codes are important. 00:33:50.920 |
If you have too large, then you don't put in-- 00:34:00.720 |
And before I jump in, I should add one of the tweets 00:34:04.080 |
which I saw from one of the most prominent researchers 00:34:08.800 |
at DeepMind, which is basically like a lot of times 00:34:17.200 |
in my paper, which actually help a lot in terms 00:34:32.000 |
of how these actual tricks for training are incorporated. 00:34:35.160 |
So for most of these methods, what we are trying to see 00:34:38.240 |
is we try to keep almost exactly the same approach. 00:34:42.000 |
No data augmentation, no fancy label smoothing, 00:34:44.560 |
or moving average of weights, or decay, or whatever. 00:35:06.840 |
So in the first case, the model and all of the weights 00:35:09.360 |
have access to all of the labels, which is just 00:35:16.840 |
and you see how well you're doing on accuracy metrics. 00:35:26.160 |
in which you can take up these multiple augmentations 00:35:38.720 |
just have a linear head to predict what's happening. 00:35:45.040 |
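A minimal sketch of that linear-probe protocol, which is also what is applied to the transformer's learned representation just below: freeze the features, attach a single linear head, and train only the head on the labels (plain full-batch training here for brevity).

```python
import torch
import torch.nn as nn

def linear_probe(features, labels, n_classes, epochs=100, lr=1e-3):
    """features: (n_examples, dim) frozen embeddings; labels: (n_examples,) class indices."""
    head = nn.Linear(features.size(1), n_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(features), labels)
        loss.backward()
        opt.step()
    return head                      # evaluate accuracy of head(features).argmax(dim=1)
```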
You do the exact same thing with transformers. 00:35:48.440 |
You just run them to predict the next code. 00:35:51.640 |
You take the linear layer, apply the same linear head, 00:36:04.920 |
are very, very good at getting better and better 00:36:11.520 |
So there's still a 10% gap between purely supervised 00:36:17.200 |
But that's going to improve with throwing a lot of data 00:36:21.160 |
to these models, because it doesn't have access 00:36:25.240 |
So this is a famous paper by Dan Ellis and Nelson Morgan 00:36:28.560 |
at Berkeley, in which they actually showed way back 00:36:30.920 |
in 1999 as to why size matters for deep neural networks 00:36:37.200 |
and also the number of data points which is present. 00:36:41.000 |
So as they kept on increasing the size of the data set 00:36:44.400 |
and the parameters, they kept on getting lower and lower word 00:36:48.120 |
And this has been true across any of the data set. 00:36:58.760 |
can we do language modeling and unsupervised learning 00:37:05.200 |
For the third sub-block, I'll just quickly mention 00:37:09.400 |
ideas which are very similar to what you would have seen 00:37:15.680 |
that how can we use some sort of signal processing 00:37:22.040 |
So the basic approach still remains the same exactly 00:37:24.520 |
as what you would have seen in vision transformers. 00:37:28.200 |
You have a signal of interest which you want to classify. 00:37:33.080 |
Here, they are raw waveform instead of images. 00:37:36.400 |
The goal is to predict what's there inside of it. 00:37:43.480 |
We don't have any other tricks which we were using before. 00:37:46.760 |
All we have is the transformers themselves, 00:37:57.280 |
No data augmentation and no other forms of these tricks. 00:38:03.000 |
You are given like 40,000 snippets for training 00:38:13.480 |
This problem is very similar to the sound which you heard 00:38:17.120 |
and the video which you saw, that given a spectrogram patch, 00:38:22.080 |
you have to predict what's there inside of it. 00:38:24.080 |
We kind of go one step further than what's just 00:38:34.840 |
In a sense that we try to see whether some sort of hierarchy 00:38:38.400 |
over transformer embeddings would help us in any manner. 00:39:02.720 |
into another intermediate embedding, in a sense 00:39:06.800 |
that we are kind of putting these highways of like some 00:39:20.640 |
Because you can think about that at every intermediate state, 00:39:24.000 |
you are in a way learning some sort of hierarchy in the model. 00:39:27.800 |
So if I look at what we do with the wavelet decomposition 00:39:35.120 |
before and after, let's say you had time across this 00:39:41.640 |
and this whole patch was your output of, say, 00:39:52.320 |
have a mapping from this to the mapping of my interest 00:39:56.040 |
using wavelet decomposition, in which for half of the samples, 00:40:02.160 |
as what was learned by the transformer model. 00:40:05.960 |
In the next half, I would start combining two at a time. 00:40:10.600 |
of like a tree structure within a single layer 00:40:16.360 |
And for now, the wavelet or the basis function which I use 00:40:22.360 |
So let's say from all of the embedding layers in between, 00:40:26.040 |
I just need to have one embedding which is not 00:40:31.400 |
moving at all, which is just representative of whatever 00:40:33.600 |
is there of the whole latent space in that nth layer. 00:40:40.240 |
Then in the next layer, I would just use two at a time 00:40:46.680 |
until I reach the exact resolution as what I had. 00:40:50.360 |
Doing this operation doesn't add any parameters whatsoever. 00:40:53.560 |
You're just defining what your basis function would be 00:41:00.200 |
And I start combining them and I learned a hierarchy 00:41:08.280 |
And this improved our performance significantly 00:41:20.840 |
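A minimal sketch of that parameter-free hierarchy, assuming a Haar-like averaging basis (the exact wavelet family used in the paper may differ): within one layer's output, keep the later half of the positions at full resolution, then repeatedly halve the resolution by combining two embeddings at a time until a single embedding summarizes the whole layer.

```python
import torch

def haar_multiscale(x):
    """x: (seq_len, dim) output of one transformer layer; seq_len assumed a power of two."""
    pieces, level = [], x
    while level.size(0) > 1:
        pieces.append(level[level.size(0) // 2:])                      # finer half kept as-is
        level = level.reshape(level.size(0) // 2, 2, -1).mean(dim=1)   # combine two at a time
    pieces.append(level)                         # single coarsest, "non-moving" embedding
    return torch.cat(pieces[::-1], dim=0)        # same length as the input, zero new parameters
```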
So this is how the whole approach looks like. 00:41:28.480 |
of 2,000 neurons followed by a dense layer of 64 neurons, 00:41:43.200 |
then that's the dimension which I'm mapping them to. 00:41:48.920 |
I patch it in very small patches similar to how 00:41:54.520 |
I would just have a single layer of 2,000 neurons 00:42:00.040 |
with the hope that the first layer is learning 00:42:04.760 |
should be adaptable according to what I'm learning. 00:42:08.080 |
After that, I keep on doing this over and over again. 00:42:11.640 |
I don't have a classification head or anything like that. 00:42:15.520 |
I keep on adding multiple stacks of transformers after that. 00:42:35.160 |
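A minimal sketch of this front end as I read it (the 2,000- and 64-neuron layers come from the talk; the 400-sample patch length, i.e. 25 ms at 16 kHz, is my assumption):

```python
import torch
import torch.nn as nn

class LearnedFrontEnd(nn.Module):
    def __init__(self, patch_len=400, hidden=2000, d_model=64):
        super().__init__()
        self.patch_len = patch_len
        self.net = nn.Sequential(
            nn.Linear(patch_len, hidden), nn.ReLU(),   # wide learned "filterbank" over raw audio
            nn.Linear(hidden, d_model),                # 64-d token per patch for the transformer stack
        )

    def forward(self, wav):                            # wav: (batch, n_samples)
        patches = wav.unfold(1, self.patch_len, self.patch_len)   # non-overlapping raw-audio patches
        return self.net(patches)                       # (batch, n_patches, d_model)
```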
we do in classical vision, that each of the embeddings 00:42:47.160 |
So what I do is that I take all of these embeddings 00:42:54.800 |
And some are retained at the exact same resolution 00:42:59.800 |
And then I keep doing this over and over again. 00:43:08.680 |
So this is kind of what the approach looks like. 00:43:12.400 |
We compare it with all of the traditional vision-based 00:43:18.520 |
So the vision-based models have been very good. 00:43:21.040 |
And the performance has been similar in understanding 00:43:29.680 |
And we see that even the tiniest models of transformers 00:43:32.320 |
were just surpassing all of the state-of-the-art CNN 00:43:39.800 |
The larger model should keep on improving the performance. 00:43:46.000 |
with the pooling layers, they improve the performance 00:43:48.760 |
even further, which was kind of very surprising to us 00:43:52.520 |
because the number of parameters are very small. 00:43:58.360 |
DenseNet, which are huge models with a lot of millions 00:44:08.160 |
After that, we said that, OK, this is looking pretty cool. 00:44:12.200 |
What actually is the transformer or the first-layer learning? 00:44:17.480 |
So in order to make this plot, what we said was, OK, 00:44:24.240 |
if you were to take a classic Fourier transform, 00:44:36.720 |
Then in a way, it should be connecting all of the points 00:44:43.400 |
And this is akin to the number of points in the FFT. 00:44:50.680 |
then I would have 2,048 sinusoidal basis functions, 00:45:03.640 |
So we have a frequency along y-axis and the number 00:45:11.280 |
then it would be connecting right as a straight line. 00:45:15.560 |
But what we did was we take up the front end, which 00:45:19.600 |
is learned by transformer, take its Fourier transform, 00:45:24.960 |
as to what frequency it is activating the most, 00:45:31.960 |
we saw that we are learning a different time frequency 00:45:35.720 |
representation, which is specific to a particular 00:45:57.320 |
which is adapting to that particular problem. 00:46:00.080 |
So this was very exciting to us because making computers 00:46:05.520 |
hear in a way in which they are adapting their ears 00:46:07.880 |
according to a particular problem is a very cool idea. 00:46:12.400 |
Second thing is we actually saw each of the filters 00:46:17.320 |
And these are basically just single slices like this. 00:46:21.240 |
So this is what we would have learned as a front end neuron. 00:46:25.280 |
So we take up each of the neurons and we just plot them. 00:46:32.040 |
sort them according to where the center frequency is. 00:46:46.840 |
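A minimal sketch of that analysis: treat each learned front-end filter as a row of the first layer's weight matrix, take its Fourier transform, find the frequency where it peaks, and sort the filters by that center frequency to draw the learned time-frequency mapping.

```python
import numpy as np

def sort_by_center_frequency(weights, sr=16000):
    """weights: (n_filters, patch_len) first-layer weight matrix of the learned front end."""
    spectra = np.abs(np.fft.rfft(weights, axis=1))       # spectrum of each filter
    freqs = np.fft.rfftfreq(weights.shape[1], d=1 / sr)  # bin -> frequency in Hz
    centers = freqs[np.argmax(spectra, axis=1)]          # peak (center) frequency per filter
    order = np.argsort(centers)
    return weights[order], centers[order]
```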
So you would have something like an onset detector 00:46:52.280 |
In a way, it is learning to have a kernel which 00:46:55.400 |
is best for a time frequency representation, what people 00:47:23.120 |
I'll conclude by saying that, OK, transformers 00:47:30.000 |
And it seems like they're solving everything for now. 00:47:58.720 |
about how transformers work for the audio case. 00:48:04.880 |
And now I would invite questions from the class students.