
Stanford CS25: V1 I Audio Research: Transformers for Applications in Audio, Speech, Music


Chapters

0:00 Introduction
0:06 Transformers for Music and Audio: Language Modelling to Understanding to Synthesis
1:35 The Transformer Revolution
5:02 Models getting bigger ...
7:43 What are spectrograms
14:30 Raw Audio Synthesis: Difficulty, Classical FM Synthesis, Karplus-Strong
17:14 Baseline : Classic WaveNet
20:04 Improving Transformer Baseline • Major bottleneck of Transformers
21:02 Results & Unconditioned Setup • Evaluation criterion: comparing WaveNet and Transformers on next-sample prediction, with top-5 accuracy out of 256 possible states as an error metric • Why this setup? 1. Application agnostic 2. Suits training setup
22:11 A Framework for Generative and Contrastive Learning of Audio Representations
22:38 Acoustic Scene Understanding
24:34 Recipe of doing
26:00 Turbocharging best of two worlds • Vector Quantization: a powerful and under-utilized algorithm • Combining VQ with auto-encoders and Transformers
33:24 Turbocharging best of two worlds • Learning clusters from vector quantization • Use long-term dependency learning with that cluster-based representation for the Markovian assumption • The better we become at prediction, the better the summarization is
37:06 Audio Transformers: Transformer Architectures for Large Scale Audio Understanding - Adieu Convolutions, Stanford University, March 2021
38:45 Wavelets on Transformer Embeddings
41:20 Methodology + Results
44:04 What does it learn -- the front end
47:18 Final Thoughts

Whisper Transcript

00:00:00.000 | [AUDIO OUT]
00:00:04.920 | Thanks for inviting me for the talk today.
00:00:07.480 | And I'll be just talking about transformers
00:00:10.320 | for music and audio, which is very
00:00:12.400 | different than what all of us were doing in this past course.
00:00:16.880 | I'm also the only speaker from Stanford,
00:00:18.840 | so I have to do a good job.
00:00:20.200 | So you'll see very good slides, because I'm
00:00:23.280 | representing the university in some sense.
00:00:26.520 | So yeah, so the flow of the talk for today
00:00:28.240 | is basically like I'll be throwing a lot of stuff.
00:00:30.960 | It's kind of like a buffet style,
00:00:32.440 | and then you feel free to like or dislike whatever you want.
00:00:36.640 | And I'll be talking mostly about three papers of what
00:00:39.560 | I've been working on.
00:00:42.120 | I'll start with introducing what transformers
00:00:44.600 | are from a different perspective, what
00:00:46.480 | audio representations are.
00:00:49.240 | Talk about a generative model for audio,
00:00:52.200 | which is just doing language modeling on sample level.
00:00:56.080 | Then I'll talk about how can one do like language
00:00:59.080 | modeling for speech and audio, which is different than what
00:01:02.480 | people do for text.
00:01:04.280 | What are the current trends in the literature?
00:01:07.360 | Finally, I'll briefly mention similar stuff
00:01:11.000 | as to what was happening in computer vision
00:01:13.080 | with regard to vision transformers.
00:01:14.920 | How can we adapt similar ideas for audio transformers?
00:01:17.920 | And throw in a bit of signal processing
00:01:20.040 | to improve the performance.
00:01:22.120 | Having told that the talk is about 35 to 40 minutes
00:01:24.840 | with about 15 minutes of Q&A, I should also
00:01:27.600 | say that all of the opinions are mine,
00:01:29.600 | and Stanford or any other professor
00:01:31.160 | is not responsible for any of the mistakes which I make.
00:01:35.880 | So transformers have kind of revolutionized the way
00:01:40.240 | everyone was approaching deep learning.
00:01:42.960 | Before that, it was all about CNNs.
00:01:45.240 | And mostly, all of these prominent models
00:01:48.360 | have been coming in waves.
00:01:49.440 | So there was a time when everyone was just
00:01:51.200 | applying CNNs.
00:01:53.360 | Then came a time where people started
00:01:55.600 | adapting CNNs with some sort of dilated convolutions.
00:01:58.480 | And slowly, the recurrent networks
00:02:00.400 | were getting out of fashion.
00:02:02.480 | Now, it seems like transformers are in fashion all the time.
00:02:05.960 | So it seems to be solving almost every single problem which
00:02:08.840 | is being thrown at them.
00:02:12.640 | So what's special about them?
00:02:15.000 | One of the facts which struck me was their simplicity,
00:02:19.160 | if you think about it.
00:02:23.680 | And it has been hugely popular also.
00:02:25.720 | So it was just released in 2018.
00:02:28.640 | And within three years, it has about 30,000 citations.
00:02:31.280 | And it is kind of solving every single problem
00:02:33.200 | in every single domain.
00:02:35.400 | It has its limitations, though, also.
00:02:38.560 | But if you think about it, in a way,
00:02:40.720 | transformers are basically a way of just cascading self-attention
00:02:45.800 | with feature learning.
00:02:46.960 | And if you keep on doing it over and over again,
00:02:49.280 | then the model, in a way, learns which parts of the input
00:02:52.160 | are important and keep on transforming them,
00:02:55.120 | removing the contents which are not important,
00:02:57.360 | and just have the limited information which is just
00:03:00.480 | responsible for a particular task.
00:03:04.120 | And it has been very, very difficult to keep up
00:03:06.160 | with the literature.
00:03:09.000 | I have put it as a joke here.
00:03:10.760 | But then even Twitter's recommendation engine
00:03:13.160 | was kind of going haywire
00:03:16.120 | as to why Chris Manning was just
00:03:19.360 | searching over transformers.
00:03:21.480 | And that was way back in 2020.
00:03:24.480 | So it has been difficult for researchers
00:03:26.440 | also to keep up with the pace of what's going on.
00:03:29.800 | Just before transformers, all of the NLP community
00:03:33.000 | was just going gaga about bidirectional LSTMs
00:03:35.760 | with attention.
00:03:36.760 | So every single paper before 2017
00:03:38.840 | was just like: you have encoder LSTM layers.
00:03:42.320 | You keep on adding multiple layers.
00:03:46.000 | And then after that, you have attention mechanism
00:03:48.440 | which just learns that what's important
00:03:50.600 | and then just keeps on decoding sequentially one at a time.
00:03:54.760 | But this was not kind of like an ideal way to do it.
00:03:58.040 | Because what turns out is when we start throwing longer
00:04:01.560 | sequences, the connections are no longer
00:04:07.120 | storing the gradient updates in a way it should be doing.
00:04:09.840 | So what the researchers from Google said,
00:04:13.360 | instead of having just an attention
00:04:15.040 | layer at the very last encoding, we
00:04:19.240 | would just have these attention mechanisms
00:04:21.280 | at every single layer, which in a way
00:04:23.640 | would just learn what's important
00:04:25.280 | for a particular problem at that particular layer.
00:04:28.240 | And we keep on doing it over and over again.
00:04:31.920 | So then the whole idea of transformers and attention
00:04:35.920 | mechanism cascaded one after the other came.
00:04:39.520 | And I'll not go into the details,
00:04:40.880 | because this is the last class of the course.
00:04:43.400 | But then usual tricks do help across the neural net
00:04:46.600 | literature, which is like having multi-head attention,
00:04:50.680 | having skip connections and layer norm.
00:04:52.280 | So all of these things, they are not only
00:04:54.440 | like giving gains for transformers themselves,
00:04:57.800 | but they can be just applied to any single other architecture
00:05:01.520 | also.
00:05:03.440 | The other thing which is helping this research
00:05:05.960 | is basically that computing power is getting better and better.
00:05:09.360 | So all of these big companies are just
00:05:12.640 | throwing massive amounts of computing resources
00:05:14.880 | at solving very, very simple and trivial tasks.
00:05:18.920 | The top of the hill being the switch transformer, which
00:05:21.560 | was discussed in the course also.
00:05:23.000 | But one other thing which I think
00:05:28.240 | started all of this trend was ELMo,
00:05:30.080 | which was just learning these contextualized representations
00:05:33.200 | for natural language processing.
00:05:34.920 | And that model right here was perhaps
00:05:37.400 | one of the first kind of like model 0.0 or something,
00:05:45.640 | or 0.1 in terms of bringing and ushering
00:05:48.320 | in the whole revolution.
00:05:50.760 | You can see that how similar these kind of models
00:05:53.520 | look like.
00:05:54.560 | BERT was basically inspired heavily
00:05:57.280 | from ELMo, in which they just replaced
00:06:00.200 | some of the LSTM layers with transformer modules.
00:06:03.800 | So a point to note also is irrespective
00:06:08.360 | of natural language processing or other domain,
00:06:10.760 | these can be adopted in a variety of domains.
00:06:13.000 | And for today's talk, I'll be just adopting them to audio.
00:06:17.720 | So I'll basically start with introducing people
00:06:21.240 | what audio representations are, and just
00:06:23.400 | for the sake of completeness, talk about spectrograms.
00:06:28.240 | So you can take any time domain signal,
00:06:31.400 | and you can decompose that signal
00:06:34.560 | into a variety of basis functions.
00:06:39.600 | And if you take up a Fourier transform,
00:06:41.680 | you're kind of like decomposing the actual time domain
00:06:46.040 | signal into its sinusoidal basis components.
00:06:49.400 | So if you have like a waveform here
00:06:51.880 | like this, which is a sum of three pure sinusoids,
00:06:55.240 | then their sum basically is this.
00:06:57.440 | And you can see that when you take a Fourier transform
00:07:00.120 | and its magnitude, you kind of have
00:07:03.280 | the strength of the individual components shown here.
00:07:08.800 | So you can take up another waveform,
00:07:10.960 | let's say a square wave, and what you have
00:07:13.720 | is basically a much richer sinusoidal decomposition
00:07:17.680 | because it is kind of a discontinuous signal.
00:07:19.960 | So you need like many more sinusoids
00:07:21.640 | to represent that particular signal as close
00:07:24.400 | to the actual signal as possible.
00:07:26.960 | And here also you can see that, OK, if this was a square wave,
00:07:29.880 | then it is actually made up of a lot of sinusoids
00:07:35.960 | where each of the bar here represents
00:07:39.200 | the strength of the particular sinusoid.
00:07:42.280 | From an optimization perspective,
00:07:44.120 | I mean, this right away is suboptimal, right?
00:07:47.040 | Because you're kind of fixing up the number of sinusoids
00:07:51.080 | you're using for representing a square wave.
00:07:53.240 | I would have rather used a basis function
00:07:56.080 | which was a square wave itself than a sinusoidal signal.
00:08:01.360 | The second thing is even if you are taking a sinusoidal signal,
00:08:05.320 | we kind of are just putting them in an equidistant space.
00:08:09.440 | So you're kind of dividing the whole frequency
00:08:11.760 | axis into equidistant bins.
00:08:14.440 | And each of the bins is responsible
00:08:16.040 | for a particular sinusoid.
00:08:19.400 | So that is like a traditional Fourier representation
00:08:22.360 | for representing any signal.
00:08:25.840 | What we do for--
00:08:28.840 | what are spectrograms?
00:08:30.480 | But in reality, all of these signals are discontinuous.
00:08:34.880 | All of these signals vary quite a bit, right?
00:08:37.360 | So you can have a signal while I'm
00:08:39.960 | speaking which is like a square wave for a certain period
00:08:42.680 | of time, and then it gets sinusoidal,
00:08:44.720 | and then it becomes something else.
00:08:46.680 | So what we really need is in a way
00:08:48.880 | to kind of take batches of input signal
00:08:52.840 | and take Fourier transform of these individual batches.
00:08:56.320 | I'm deliberately using the word batches,
00:08:58.000 | but you can-- in traditional terms,
00:08:59.520 | you are windowing the signal.
00:09:02.240 | So right here, you can see that you have a continuous signal.
00:09:05.120 | You keep on windowing it.
00:09:07.320 | You apply the Fourier transform, and what you get
00:09:10.280 | is basically like a spectrogram representation of the signal.
00:09:14.320 | So right here, what you're seeing basically
00:09:16.360 | is for each of the slices, the signal kind of
00:09:19.520 | look like this after taking the Fourier
00:09:21.560 | transform with the waveform which is there below.
00:09:24.840 | And what you do is for spectrogram representation,
00:09:26.920 | you keep on stacking these Fourier transform
00:09:29.160 | slice, the magnitude of the Fourier transform slices.
00:09:31.760 | And in this way, you kind of get like a 2D representation
00:09:34.760 | of audio signals.
00:09:36.720 | And if you're coming from a vision background,
00:09:38.640 | it is basically all of the things
00:09:40.160 | which you're doing in vision would just work well
00:09:42.240 | if you just apply them to these 2D spectra representations.
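A minimal sketch of the windowed-FFT idea described here, assuming plain NumPy (real pipelines would typically use something like librosa.stft or scipy.signal.stft); it slices the waveform into windows, takes the magnitude FFT of each slice, and stacks the slices into a 2D time-frequency image.

```python
# Windowed FFT ("spectrogram") sketch: slice the signal, window each slice,
# take the magnitude FFT, and stack the slices as a 2D array.
import numpy as np

def spectrogram(signal, win_size=1024, hop=256):
    window = np.hanning(win_size)
    frames = []
    for start in range(0, len(signal) - win_size + 1, hop):
        frame = signal[start:start + win_size] * window     # window one slice of audio
        frames.append(np.abs(np.fft.rfft(frame)))           # keep only the magnitude spectrum
    # Shape (num_frames, win_size // 2 + 1): time on one axis, frequency on the other.
    return np.stack(frames)

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)  # sum of two sinusoids
S = spectrogram(x)
print(S.shape)  # e.g. (59, 513)
```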
00:09:47.240 | I'll quickly play how these spectrograms look
00:09:49.960 | like for a wide area of common sounds.
00:09:53.400 | [VIDEO PLAYBACK]
00:09:56.400 | [MUSIC PLAYING]
00:10:00.400 | [BIRDS CHIRPING]
00:10:03.400 | [MUSIC PLAYING]
00:10:07.400 | [WHISTLING]
00:10:25.400 | [MUSIC PLAYING]
00:10:28.400 | [WHISTLING]
00:10:31.400 | [MUSIC PLAYING]
00:10:34.400 | [WHISTLING]
00:10:35.400 | So you can see like for spectrograms,
00:10:38.120 | you have kind of like a time axis on your x-axis.
00:10:40.880 | And then you have a frequency axis on y-axis.
00:10:43.680 | And then for whatever is your signal of interest,
00:10:46.120 | you're basically like putting these slices together.
00:10:48.880 | And different sound gives you like different spectra
00:10:50.960 | representation.
00:10:51.760 | So it's kind of a vision problem just
00:10:54.080 | in this sort of like Fourier space.
00:10:58.040 | So there can be like different kinds of representations also.
00:11:01.200 | So one, you could just take these slices of Fourier
00:11:06.320 | transform and then do like a linear mapping to them
00:11:09.320 | so that you're kind of in a way making these as
00:11:13.400 | close to how humans hear.
00:11:14.680 | So you can have like log of the frequency on the y-axis
00:11:17.960 | instead of common frequency.
00:11:19.240 | And then you get like a constant Q-like representation.
00:11:22.040 | The advantage of this being like you
00:11:23.640 | can see that for different frequencies,
00:11:25.880 | the spacing between the harmonics kind of remains same.
00:11:28.960 | So if you're like training convolutional filters,
00:11:31.040 | then that's of a huge advantage because the signal,
00:11:33.680 | like one component of the invariance is gone.
00:11:35.800 | And you can just learn these filters
00:11:37.280 | which are catching onto these constant templates of Fourier
00:11:41.880 | slices.
00:11:43.200 | You can have mel filter bank coefficients,
00:11:45.080 | or you can have like the raw waveform also.
00:11:48.960 | For raw waveforms, basically there
00:11:50.520 | are two things which we have to keep in mind.
00:11:53.200 | One is the sampling rate.
00:11:54.320 | So we kind of like take the continuous signal
00:11:57.200 | and then we discretize the continuous signal.
00:11:59.400 | So one parameter is like how fast we are sampling
00:12:03.360 | the continuous signal.
00:12:04.280 | So that's typically on the order of like 16,000 or 8,000
00:12:07.720 | times a second if you're on telephonic speech.
00:12:10.440 | The other thing which we also choose is how many levels
00:12:13.720 | we are dividing the vertical axis into.
00:12:15.480 | So in this case, you can see that each of the dots
00:12:17.760 | is basically one level.
00:12:19.400 | And typically, people use 8-bit quantizers or 16-bit
00:12:22.280 | quantizers.
00:12:23.280 | So in a way, you can think about that for every one
00:12:25.400 | second of audio which we would hear,
00:12:27.480 | you would have like 16,000 samples.
00:12:29.720 | And then each of the 16,000 samples
00:12:32.720 | is allowed to take one of the levels between 0 and 255.
00:12:36.800 | And that's like if I can take the problem of continuous audio
00:12:41.560 | and just have it in terms of this sort of discrete space,
00:12:45.320 | then basically I'm just going to the territory
00:12:47.520 | of doing language modeling.
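A small sketch of how continuous audio gets mapped to 256 discrete states, using mu-law companding followed by 8-bit quantization (the scheme WaveNet uses); the exact pre-processing in any given system may differ.

```python
# Turn continuous samples in [-1, 1] into 256 integer levels (and back).
import numpy as np

def mu_law_encode(x, mu=255):
    # mu-law compress, then quantize to integer levels in [0, 255]
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(levels, mu=255):
    # inverse mapping from integer levels back to approximate waveform values
    compressed = 2 * levels.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

sr = 16000
x = 0.8 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # one second of audio
tokens = mu_law_encode(x)           # 16,000 integers, each one of 256 possible states
print(tokens.min(), tokens.max())   # language modelling now happens over these tokens
```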
00:12:50.600 | So the first papers I discuss is how
00:12:53.480 | can we do generative modeling for raw audio, which
00:12:57.880 | is similar to WaveNets using transformers.
00:13:01.480 | I'll be putting QR codes if you like the stuff what I'm doing.
00:13:05.560 | And if you think that this is relevant to you,
00:13:07.800 | please cite or please have a look
00:13:10.080 | in terms of the QR codes.
00:13:13.000 | So yeah, so I'll start with the first subtopic
00:13:16.400 | of today's talk, which is like what are WaveNets
00:13:22.640 | and how do we do this generative modeling over raw audio?
00:13:26.960 | So in a single word, you can think
00:13:28.360 | about this as doing language modeling over these 256
00:13:31.360 | states of audio.
00:13:33.120 | So you can throw in your favorite transformer model
00:13:35.760 | like transformer XL or GPT or whatever you want to call it.
00:13:41.520 | And just treat the problem as if you
00:13:43.160 | are trying to predict one of the levels out of 256.
00:13:45.800 | And you have to predict the next level given a certain context.
00:13:49.480 | That's what WaveNet was doing.
00:13:50.760 | So the way you are modeling the probability distribution
00:13:55.320 | of a continuous space is basically
00:13:57.320 | you're trying to predict what's the probability
00:13:59.360 | of the next sample given some past context.
00:14:02.560 | And WaveNet has been hugely popular
00:14:04.720 | because it has over 3,000 citations
00:14:07.080 | and it has been a core building block for almost all speech
00:14:11.040 | and audio related problems.
00:14:13.000 | You can think about speech to text, text to speech synthesis,
00:14:16.600 | instrument conversion, packet loss
00:14:19.040 | concealment over the internet, speech denoising.
00:14:21.680 | So wherever there's some sort of element of modifying audio,
00:14:25.560 | people have been using WaveNet as a core building block.
00:14:30.440 | And raw waveform synthesis has been difficult
00:14:32.880 | because just the magnitude of the problem,
00:14:35.920 | if I'm just trying to synthesize 10 seconds of audio,
00:14:38.880 | it would just amount to me having a probability
00:14:41.680 | distribution over 160,000 samples.
00:14:45.760 | And that itself is tough because our ears are very, very
00:14:49.120 | sensitive to subtle changes.
00:14:51.000 | If I'm off by one pixel in an image,
00:14:55.600 | my eyes would not be as susceptible to noticing
00:14:59.200 | that effect versus if I'm off by, say, a few samples
00:15:04.240 | in an audio, it would just catch our ears pretty quickly.
00:15:08.600 | People have been trying raw audio synthesis a lot
00:15:10.960 | in the past.
00:15:12.440 | And before all of the WaveNet and transformer-based
00:15:15.720 | approaches, WaveRNNs and SampleRNNs
00:15:20.280 | were kind of like state-of-the-art models.
00:15:25.600 | On the right, I've shown a SampleRNN model, which
00:15:28.600 | kind of models the probability distribution of what's
00:15:33.280 | going to come next given the past at multiple levels.
00:15:36.400 | And this was work done by Yoshua Bengio at Mila.
00:15:40.200 | But you can closely see, if you just
00:15:42.560 | see this architecture versus a transformer architecture,
00:15:45.520 | in a way, these are starting to get very, very similar.
00:15:48.680 | Because what you're trying to do is
00:15:50.200 | that for the probability distribution here,
00:15:52.320 | you're trying to see a lot of local substructures.
00:15:56.040 | And then you keep on doing it over and over again.
00:15:58.120 | And you can draw parallels, like attention mechanism
00:16:00.840 | should also kind of be doing the same thing.
00:16:04.000 | So this was kind of like the literature in the past.
00:16:09.280 | What we tried to do was we just had the WaveNet model.
00:16:13.120 | And we tried to see whether transformers can beat them.
00:16:15.800 | And our intuition was it should be able to beat them
00:16:18.160 | because they are successful all over the other domains,
00:16:23.000 | like in language modeling.
00:16:24.120 | So it should do that for raw waveforms also.
00:16:28.720 | We also tried to see whether we can circumvent the order
00:16:31.400 | n squared constraint by conditioning on the context
00:16:35.560 | itself.
00:16:37.160 | And we did not go for specific applications.
00:16:40.560 | And we just said, OK, just in terms like modeling behavior,
00:16:43.000 | how will they do?
00:16:45.440 | So the data set for this was just
00:16:47.200 | like real-world kind of recording.
00:16:48.760 | So actual sound should not matter
00:16:53.440 | because the model is agnostic to what it is being thrown in.
00:16:57.880 | And the setup was exactly the same,
00:16:59.520 | like you are giving a certain context.
00:17:01.360 | And I have to predict the next sample.
00:17:03.600 | You do the same thing with WaveNets.
00:17:05.520 | You do the exact same thing with a transformer-based,
00:17:09.240 | GPT kind of model and see how well they do.
00:17:14.000 | I'll briefly chat about what WaveNet models are.
00:17:18.040 | So WaveNet was kind of like a convolution-based model, which
00:17:21.360 | was getting rid of all of the vanishing gradient problem
00:17:25.360 | by just treating a sequential problem
00:17:28.680 | as being learned by a convolutional model.
00:17:31.120 | So what they did was basically have this sort
00:17:33.880 | of dilation layers, or convolution with dilations,
00:17:38.320 | which is basically I kind of skip in every subsequent layer
00:17:42.000 | by one sample.
00:17:43.240 | So you can see if I have a dilation factor of 2
00:17:47.080 | with a kernel size of 2, I would get this kind of a topology
00:17:50.400 | where my convolution filters in the very first layer
00:17:52.720 | are just combining the first two samples.
00:17:55.120 | Then I skip by one in the next layer.
00:17:57.520 | And then I skip by three, which is
00:17:59.360 | like I look at the fourth one in the next layer and so on.
00:18:03.400 | The loss is still the same.
00:18:04.600 | So I have this network.
00:18:06.640 | I learn a latent space.
00:18:08.000 | And then I have a categorical cross-entropy loss,
00:18:11.200 | which is basically I have to predict the next sample given
00:18:13.960 | the previous one.
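A simplified PyTorch sketch of the dilated causal convolution stack described here (kernel size 2, dilation doubling per layer, next-sample cross-entropy); the real WaveNet also uses gated activations and skip connections, which are omitted for brevity.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=64, num_layers=8, num_classes=256):
        super().__init__()
        self.embed = nn.Embedding(num_classes, channels)     # one of 256 levels comes in
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(num_layers)                        # receptive field grows as 2^i
        )
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, tokens):                                # tokens: (batch, time) integers
        h = self.embed(tokens).transpose(1, 2)                # -> (batch, channels, time)
        for conv in self.convs:
            pad = conv.dilation[0]                            # left-pad so no future samples leak in
            h = torch.relu(conv(nn.functional.pad(h, (pad, 0))))
        return self.head(h)                                   # (batch, 256, time) logits

model = DilatedCausalStack()
tokens = torch.randint(0, 256, (2, 1600))                     # about 100 ms at 16 kHz
logits = model(tokens)
# predict the *next* sample at every position with categorical cross-entropy
loss = nn.functional.cross_entropy(logits[:, :, :-1], tokens[:, 1:])
```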
00:18:16.640 | And I just do the exact same thing with transformers also.
00:18:20.800 | But then I have to make sure that I
00:18:22.760 | do it in a causal manner.
00:18:24.120 | So I have something which is very similar to GPT,
00:18:26.840 | in which I have causal masks in my attention mechanism.
00:18:30.320 | And I keep doing it over and over again.
00:18:32.360 | So you have self-attention.
00:18:35.520 | After that, you have feedforward layers.
00:18:37.400 | You just have a stack of these transformer blocks
00:18:40.400 | and see how they do.
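For comparison, a hedged sketch of the GPT-style alternative: the same next-sample objective, but a stack of transformer blocks with a causal attention mask (positional encodings are omitted here to keep the sketch short; sizes are illustrative, not the paper's exact configuration).

```python
import torch
import torch.nn as nn

d_model, num_classes, context = 64, 256, 1600
embed = nn.Embedding(num_classes, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=4)
head = nn.Linear(d_model, num_classes)

tokens = torch.randint(0, num_classes, (2, context))
# upper-triangular additive mask: position t may only attend to positions <= t
causal_mask = torch.triu(torch.full((context, context), float("-inf")), diagonal=1)
h = encoder(embed(tokens), mask=causal_mask)
logits = head(h)                                              # (batch, time, 256)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, num_classes), tokens[:, 1:].reshape(-1)
)
```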
00:18:43.320 | So I said intuitively it should work.
00:18:45.320 | So it should be doing better than our baseline WaveNet models.
00:18:52.360 | Because if you look at the topology,
00:18:55.120 | we are kind of defining a topology on our own, right?
00:18:57.640 | So what if the current prediction at, say,
00:19:01.520 | layer one were to depend on very way back sample, say,
00:19:07.760 | instead of the second sample, the 10th sample?
00:19:09.640 | So we are kind of ignoring all of that topology, which
00:19:12.800 | would have been important for prediction
00:19:14.480 | of this particular task.
00:19:16.040 | Whereas transformers with the self-attention mechanism
00:19:19.640 | can just learn, like, OK, which part of the samples
00:19:22.400 | are important and which are not.
00:19:23.960 | And you can keep on doing it iteratively.
00:19:26.760 | So it made sense to us that, OK, transformer layer
00:19:30.880 | should be doing way better than wave net models.
00:19:34.880 | The second thing which we came across was, OK,
00:19:37.680 | we cannot have a lot of context.
00:19:40.640 | For example, the attention mechanism
00:19:42.480 | needs to store all of those of order n squared.
00:19:46.120 | So in this case, if I'm storing data at 100 milliseconds,
00:19:50.240 | then I have about 1,600 samples.
00:19:52.600 | And I need to store 1,600 by 1,600 at multiple layers.
00:19:56.840 | And it just becomes like a huge problem with the data--
00:20:01.800 | problem with the memory constraint.
00:20:03.360 | So what we said was, OK, what if we just
00:20:06.600 | use the context itself as a latent code?
00:20:10.840 | So in order to have much better representation at every layer,
00:20:16.440 | we cannot have huge, big attention matrices.
00:20:20.040 | So what we said was, we would just
00:20:21.960 | do a sample-wise conditioning and throw in CNN layers just
00:20:26.240 | to understand what the latent code would be.
00:20:28.560 | So you still have, like, an attention mechanism
00:20:31.040 | over just a past context.
00:20:32.960 | But then I'm also conditioning at every sample, OK,
00:20:36.560 | what the next sample should be given on this context embedding.
00:20:41.240 | And if you think about it, in a way, it is like, OK,
00:20:43.360 | if there are, like, five or six notes being played in a piano,
00:20:46.680 | then I'm kind of certain which notes
00:20:48.280 | will be played to a certain extent
00:20:50.040 | if I just throw in a CNN layer.
00:20:52.600 | So I'll use that information along with what
00:20:55.360 | my transformers are learning.
00:20:57.480 | And then I would condition it.
00:20:59.040 | And I would just use that to predict the next sample.
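A rough, illustrative sketch of this conditioning idea, not the paper's exact architecture: a small CNN summarizes a long chunk of past audio into one latent code, and that code is broadcast-added to every step of the (much shorter) sequence the transformer attends over. All layer sizes here are assumptions.

```python
import torch
import torch.nn as nn

d_model = 64
context_cnn = nn.Sequential(                  # summarizes a long stretch of past samples
    nn.Conv1d(1, 32, kernel_size=16, stride=8), nn.ReLU(),
    nn.Conv1d(32, d_model, kernel_size=16, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                   # -> a single d_model-dim latent code
)

long_past = torch.randn(2, 1, 16000)           # one second of raw past context
recent = torch.randn(2, 400, d_model)          # embeddings of the recent samples fed to the transformer
latent = context_cnn(long_past).squeeze(-1)    # (batch, d_model)
conditioned = recent + latent.unsqueeze(1)     # broadcast the context code to every time step
```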
00:21:03.240 | So for the evaluation criteria, we
00:21:04.840 | did not look for negative log-likelihood scores.
00:21:08.720 | We just looked at how well our prediction task was.
00:21:12.760 | So we took a, like, stacked WaveNet,
00:21:15.640 | which was implemented by DeepMind,
00:21:17.520 | and saw that, OK, what was the performance using
00:21:21.320 | their benchmarks and even, like, bigger stacked WaveNets.
00:21:25.840 | We then started to increase the complexity of transformers
00:21:29.080 | and started to see whatever we had proposed
00:21:32.520 | in terms of, like, conditioning on the vanilla transformer
00:21:36.600 | architectures to see how well they do.
00:21:39.520 | We did not look for, like, an application-specific problem,
00:21:43.640 | which is basically, like, we don't look at, like,
00:21:46.520 | how well perception tasks are for, like, say,
00:21:49.200 | text-to-speech synthesis or speech denoising.
00:21:51.600 | We just look at, OK, if we are trying
00:21:53.280 | to model this using a cross-entropy loss,
00:21:56.000 | then with the same model, with the same loss function,
00:21:59.560 | how well they do on, like, similar kind of parameters.
00:22:03.720 | So this was the first kind of, like, sub-block of, like,
00:22:06.640 | how can we use our transformers for generative modeling.
00:22:12.160 | For the second problem, I'll do a quick headway
00:22:15.360 | on how can we use, like, transformers
00:22:19.160 | for doing language modeling, which
00:22:21.200 | is kind of becoming a really fancy term right now.
00:22:24.880 | And this work was done with Julius Smith way back in 2020.
00:22:28.880 | And the goal of this was, can we kind of, in a way,
00:22:32.520 | do language modeling with continuous audio sequences?
00:22:37.440 | And I'll briefly mention about that in this sub-block
00:22:42.240 | of the talk.
00:22:42.960 | And this is in regard for, like, solving acoustic scene
00:22:48.440 | understanding, which is basically, like,
00:22:51.560 | if I'm given a chunk of audio, then
00:22:54.280 | I want to understand what's in there.
00:22:56.960 | And if we could do that well, then in a way,
00:23:01.800 | we can do a lot of fancy, nice applications.
00:23:06.120 | So for example, like, if you think about, like,
00:23:08.160 | self-driving cars.
00:23:09.040 | So Waymo has started to incorporate microphones
00:23:12.200 | into their self-driving cars.
00:23:14.220 | Because, say, if there is an ambulance coming,
00:23:16.080 | or if there is a fire truck coming,
00:23:18.960 | then that sound would be picked up way, way before even
00:23:23.040 | the LIDARs or even their sensors.
00:23:25.840 | So they want to understand that and take
00:23:28.320 | actions based upon that.
00:23:30.880 | Apple, during COVID, did a hand-washing detection
00:23:33.360 | on their Apple Watch.
00:23:34.600 | Because if you could detect when someone is washing their hands,
00:23:37.800 | then you can, in a way, like, tell people that, oh,
00:23:40.760 | you need to wash hands for 20 seconds.
00:23:42.720 | And then that can be built upon as a cool application.
00:23:46.880 | It can be used for music recommendations.
00:23:49.520 | So Spotify and YouTube Music kind of recommend, like,
00:23:51.760 | very, very good songs, similar in content
00:23:54.640 | to what you are listening to, which you would perhaps like.
00:23:59.320 | It can also give, like, really cool applications.
00:24:01.680 | Like, say, people have tried, like,
00:24:03.760 | detecting depression from audio.
00:24:05.800 | Or I could detect whether I'm coughing or not,
00:24:08.760 | or I'm sneezing or not.
00:24:09.960 | And these can be, like, good medical device--
00:24:13.080 | medical applications, which can be
00:24:15.360 | used along with the current diagnosis what doctor provides.
00:24:20.520 | So the question was basically, for us,
00:24:23.320 | was, like, how can we do, like, language modeling
00:24:26.600 | in a continuous audio domain?
00:24:29.040 | And secondly, like, how can we train models,
00:24:31.240 | or how should we approach doing this?
00:24:35.320 | So this kind of, like, recipe has
00:24:37.360 | become, like, very, very popular these days in terms of, like,
00:24:40.680 | how would you approach this problem?
00:24:42.240 | It started with, like, OpenAI, and to a certain extent,
00:24:46.280 | DeepMind proposing that in terms of, like, VQ-VAE models.
00:24:50.960 | But it turns out, like, transformers
00:24:52.640 | love operating in discrete spaces, as of now.
00:24:56.440 | And what they kind of do is, as long as your representations
00:25:01.200 | are discrete, they are very, very good at modeling
00:25:03.640 | what's going to come next.
00:25:06.760 | So what people have been proposing as a workaround
00:25:09.200 | is you could take up, like, your favorite embedding
00:25:15.640 | in some manner.
00:25:16.400 | You could take VQ-VAE embeddings,
00:25:18.200 | or you could take a Wave2Vec, or in terms of video,
00:25:21.840 | you can just do classic VGG or ResNet embeddings.
00:25:27.880 | You can apply k-means clustering to it.
00:25:31.040 | And k-means clustering would give you, like, discrete codes.
00:25:34.320 | You do language modeling with those discrete codes,
00:25:37.240 | and you predict the next code.
00:25:39.360 | And in a way, if you're doing this,
00:25:41.440 | then you're kind of doing language modeling over audio.
00:25:45.200 | And if you need to get back to the audio,
00:25:46.840 | then you already saw with WaveNet
00:25:49.000 | that you can condition the WaveNet model
00:25:51.120 | to give continuous output.
00:25:53.480 | So you can use those codes to get back
00:25:55.320 | to the audio, similar to what jukebox and OpenAI did.
00:26:00.240 | So I'll quickly mention about what vector quantization is.
00:26:05.920 | It's one of the most underutilized algorithms,
00:26:08.680 | to be honest.
00:26:09.880 | And what it does is basically gives, in a way,
00:26:12.960 | discrete codes to continuous embedding spaces.
00:26:16.400 | So how does it do it?
00:26:18.080 | So you basically have an embedding space,
00:26:23.600 | let's say, in 2D right here.
00:26:25.240 | You define what are the number of clusters
00:26:27.120 | you want to put each of them in.
00:26:29.320 | You run k-means, and you would certainly
00:26:31.520 | get these patches of where all of these embeddings
00:26:36.200 | are, what would be the representative embedding
00:26:38.280 | of a continuous embedding.
00:26:40.720 | You can take all of those patches,
00:26:42.120 | and you can just number them, or you can just list them.
00:26:45.960 | So in this case, you can perhaps have 25 numbers, or 20 numbers,
00:26:49.840 | which are, in a way, mapping from a continuous embedding
00:26:53.480 | to a discrete token.
00:26:57.040 | This is another example right here.
00:26:58.520 | So in our case, what we did was we
00:27:02.320 | took batches of spectrograms, which are basically
00:27:05.160 | very small patches across time, spanning all
00:27:10.760 | across the frequency axis.
00:27:13.320 | You take those patches, you learn
00:27:15.200 | the embedding representation.
00:27:16.600 | In our case, it was just like three-layer autoencoder,
00:27:19.720 | fully-connected encoders with three layers of decoders,
00:27:22.800 | and have a bottleneck layer in between.
00:27:25.360 | So that bottleneck layer basically
00:27:27.000 | is kind of similar to this kind of diagram
00:27:29.280 | in, say, 64-dimensional space or 120-dimensional space.
00:27:33.520 | You take up those bottleneck codes,
00:27:35.400 | and then you run k-means clustering on it.
00:27:37.840 | Suddenly, in a way, you can find discrete codes
00:27:43.520 | for continuous embedding spaces or even continuous segments.
00:27:48.040 | And since we know that transformers kind of love
00:27:50.480 | operating in discrete spaces, you
00:27:52.600 | can just apply language modeling now,
00:27:55.160 | and then you can see what you can do.
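A minimal sketch of the vector-quantization step just described, assuming scikit-learn's KMeans: continuous embeddings (for example the bottleneck codes of a small autoencoder, not shown here) go in, and discrete token ids come out, ready for language modelling.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 64))       # stand-in for bottleneck codes of audio patches

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(embeddings)
tokens = kmeans.predict(embeddings)            # each patch is now an integer in 0..15
print(tokens[:20])                             # this discrete sequence is what the LM sees
```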
00:27:58.480 | So in our case, we just had very simple three-layer,
00:28:02.200 | fully-connected autoencoder, small patches.
00:28:06.000 | The number of codes is important,
00:28:07.520 | because if you have too many codes,
00:28:10.000 | then you're kind of just throwing
00:28:11.800 | in all kinds of noisy things.
00:28:13.600 | Now, I'll give an example of why the number of codes
00:28:17.880 | are important through some example.
00:28:19.680 | And if you have too few codes,
00:28:22.480 | what you're, in a way, doing is you're
00:28:25.000 | removing all of the information which was relevant,
00:28:27.280 | and you're just kind of averaging it all out.
00:28:29.480 | So this idea first was proposed by Jukebox,
00:28:38.200 | which did it for music.
00:28:40.240 | So you do the exact same thing, what
00:28:42.160 | I talked about, in a slightly different manner.
00:28:45.240 | In a way that, OK, you cannot learn codes
00:28:48.640 | for longer sequences.
00:28:50.520 | So in a way, learn sequences which are just moving slowly
00:28:54.080 | and which are looking at only a certain amount of audio.
00:28:57.680 | So you kind of encode this in these discrete levels, which
00:29:01.320 | are basically like--
00:29:03.360 | all of these basically are codes.
00:29:04.760 | So at every point, I define, OK, this audio
00:29:08.040 | had, perhaps, code number 55.
00:29:10.520 | And in the next level, perhaps, it had code number 2.
00:29:12.840 | And in the very top, perhaps, it had code number 2,000.
00:29:16.440 | So in a way, I'm discretizing the whole codes.
00:29:19.680 | Now what I do is I take up my favorite transform model,
00:29:23.440 | perhaps like a causal autoregressive one.
00:29:26.280 | And I say that, OK, given these codes,
00:29:29.240 | try to predict what codes would come next.
00:29:31.480 | And for sure, transformers can do that.
00:29:34.240 | So I would generate the codes in the future.
00:29:36.840 | Once I've generated the codes in the future,
00:29:39.560 | I can say that, OK, this problem now
00:29:41.360 | is kind of like a text-to-speech problem,
00:29:43.680 | because I have these discrete codes.
00:29:45.560 | Text-to-speech, in a way, is going from discrete letters
00:29:48.680 | to continuous audio.
00:29:50.560 | So I would throw in the fanciest decoder, which was WaveNet.
00:29:53.720 | And I would just decode back from the codes.
00:29:55.240 | And I would get the generated audio.
00:29:58.080 | So this was, in a way, what I described,
00:30:02.360 | that they take up a continuous audio.
00:30:04.760 | They have these compressed codes,
00:30:06.640 | which they encode using a CNN in this case.
00:30:10.720 | The method doesn't matter.
00:30:11.960 | You can throw in the fanciest of embedding or latent
00:30:15.040 | representation on those continuous code.
00:30:18.400 | You generate the patterns, which are like,
00:30:20.120 | what's going to happen next in the future?
00:30:22.000 | And then you decode back using a fancy WaveNet or state-of-the-art
00:30:26.000 | model.
00:30:28.000 | So this was what they were doing for music synthesis.
00:30:32.160 | What we said was, yeah, this is good.
00:30:35.200 | This can generate a good amount of music.
00:30:37.200 | But can these models be used for generating
00:30:44.480 | good representation of the current audio?
00:30:47.840 | And the goal there was, can language models
00:30:51.160 | learn representation, which can just encapsulate whatever we
00:30:55.600 | are giving as an input signal?
00:30:59.160 | So in this case, what we tried after that
00:31:01.160 | was you do exactly similar ideas.
00:31:06.840 | But instead of doing on VQ-VAE end-to-end learned encodings,
00:31:11.960 | we just apply vanilla k-means clustering,
00:31:14.320 | similar to what I described earlier.
00:31:16.880 | We do on spectrogram patches.
00:31:18.200 | So you take up these spectrograms of audio,
00:31:20.960 | and you just divide them into very small chunks,
00:31:23.960 | learn autoencoder encodings for each of those chunks,
00:31:28.440 | run k-means clustering.
00:31:30.320 | In this case, let's say I am learning 16 codes.
00:31:34.120 | Represent the continuous audio in terms of the 16 codes.
00:31:38.360 | Have a transformer which can perhaps predict the next code.
00:31:41.880 | And if I keep on getting better and better at predicting
00:31:45.120 | what's going to happen next, then in this linear layer,
00:31:48.000 | I should be encapsulating what's important
00:31:51.320 | or what's a good summary of what has happened in the past.
00:31:56.640 | So that was our intuition behind trying this.
00:32:01.600 | And as I explained, the number of codes
00:32:03.600 | play a very important role.
00:32:05.560 | You can see here, these are just two piano notes switching
00:32:08.920 | one after the other.
00:32:10.080 | If I just have 16 number of codes,
00:32:12.080 | it just happens to have just a single line of encoding,
00:32:16.560 | a single code assigned to all of this.
00:32:18.640 | Whereas if I'm assigning more codes,
00:32:21.040 | then it becomes a fine-grained prediction
00:32:23.600 | where I'm actually able to get what the individual notes are.
00:32:29.320 | Recently, Facebook also said, OK, they just
00:32:32.880 | had a different name to the whole thing, which
00:32:35.240 | is we can just call this as textless NLP also in the sense
00:32:40.760 | that, OK, you can do NLP without having access to text.
00:32:44.760 | But the idea is very, very similar.
00:32:46.360 | You have an encoder, which is exactly similar to say
00:32:48.600 | what OpenAI was using.
00:32:49.680 | You have a VQ-VAE, Wave2Vec, or whatever you want to do.
00:32:53.320 | You can apply k-means clustering to it.
00:32:55.240 | You apply language models to it.
00:32:57.160 | And instead of a decoder being WaveNet,
00:32:58.880 | they just have a decoder, which is
00:33:00.280 | like a different version of text-to-speech, which
00:33:02.840 | is like Tacotron in this case.
00:33:05.000 | So as you can see, this is all the same wine
00:33:07.160 | in very different bottles.
00:33:08.280 | But the core idea is almost exactly the same.
00:33:12.560 | So this created a huge uproar of this going to change NLP.
00:33:18.120 | But this is very, very similar to what people
00:33:21.560 | have been doing in the past.
00:33:24.560 | So I've already explained what this was.
00:33:29.200 | So in our case, we just try to predict
00:33:32.400 | what's going to happen next given the previous context
00:33:35.360 | and use that representation similar to every single
00:33:40.320 | one-shot learning or zero-shot learning-based method.
00:33:44.600 | I also explain why the number of codes are important.
00:33:47.720 | If you have too few, then you're
00:33:49.240 | just throwing away a lot of information.
00:33:50.920 | If you have too many, then
00:33:55.360 | it is no longer robust to noise.
00:33:59.480 | So this was our setup.
00:34:00.720 | And before I jump in, I should add one of the tweets
00:34:04.080 | which I saw from one of the most prominent researchers
00:34:08.800 | at DeepMind, which is basically like a lot of times
00:34:11.240 | it is very, very easy to bump up numbers.
00:34:13.680 | I can have these details just not present
00:34:17.200 | in my paper, which actually help a lot in terms
00:34:20.280 | of improving the performance.
00:34:22.200 | And sometimes don't take into account
00:34:25.760 | what the actual model is incorporating
00:34:29.160 | or what model is contributing versus what
00:34:32.000 | the actual these tricks for training are incorporating.
00:34:35.160 | So for most of these methods, what we are trying to see
00:34:38.240 | is we try to keep almost exactly the same approach.
00:34:42.000 | No data augmentation, no fancy label smoothing,
00:34:44.560 | or moving average of weights, or decay, or whatever.
00:34:48.400 | You just have similar-based recipes
00:34:51.440 | to see how well we are doing.
00:34:55.520 | For this case, the goal was to see
00:34:57.920 | that how well our models do with respect
00:35:00.400 | to this purely supervised approach
00:35:02.120 | and how well it does with respect
00:35:03.720 | to a similar unsupervised approach.
00:35:06.840 | So in the first case, the model and all of the weights
00:35:09.360 | have access to all of the labels, which is just
00:35:12.320 | shown as VGG supervised, which is basically
00:35:14.840 | you take up an audio understanding data set
00:35:16.840 | and you see how well you're doing on accuracy metrics.
00:35:21.440 | So that was the first one.
00:35:22.560 | In the second one, we applied SimCLR,
00:35:24.760 | which was proposed by Geoff Hinton's group,
00:35:26.160 | in which you can take up these multiple augmentations
00:35:28.440 | of the same input.
00:35:30.440 | You can have patches removed.
00:35:32.000 | You can blur the signal.
00:35:33.000 | You can flip the signal.
00:35:34.800 | You learn an embedding out of the last layer
00:35:36.880 | without access to the labels, and then
00:35:38.720 | just have a linear head to predict what's happening.
00:35:42.000 | By using that, we got a 55% accuracy.
00:35:45.040 | You do the exact same thing with transformers.
00:35:46.960 | You don't have access to labels.
00:35:48.440 | You just train them to predict the next code.
00:35:51.640 | You take the linear layer, apply the same linear head,
00:35:54.400 | and try to predict what's happening inside.
00:35:57.120 | And with that, we got 60% accuracy.
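A hedged sketch of the linear-probe evaluation described here: freeze the self-supervised model, average its representation over time, and train only a linear head on the labels. The pretrained features and labels below are random placeholders, not the actual setup.

```python
import torch
import torch.nn as nn

feat_dim, num_labels = 64, 10
linear_head = nn.Linear(feat_dim, num_labels)
optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)

def probe_step(features, labels):
    # `features` come from the frozen, self-supervised model (no gradient flows back into it)
    with torch.no_grad():
        pooled = features.mean(dim=1)            # average-pool over time
    logits = linear_head(pooled)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                              # updates only the linear head
    optimizer.step()
    return loss.item()

# toy call with random stand-ins for pretrained features and labels
loss = probe_step(torch.randn(8, 100, feat_dim), torch.randint(0, num_labels, (8,)))
```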
00:35:59.440 | So even though the results are not as good,
00:36:01.160 | the fact is that neural networks actually
00:36:04.920 | are very, very good at getting better and better
00:36:09.360 | with throwing off huge amounts of data.
00:36:11.520 | So there's still a 10% gap between purely supervised
00:36:14.720 | and purely unsupervised.
00:36:17.200 | But that's going to improve with throwing a lot of data
00:36:21.160 | to these models, because it doesn't have access
00:36:23.480 | to any label as per se.
00:36:25.240 | So this is a famous paper by Dan Ellis and Nelson Morgan
00:36:28.560 | at Berkeley, in which they actually showed way back
00:36:30.920 | in 1999 as to why size matters for deep neural networks
00:36:37.200 | and also the number of data points which is present.
00:36:41.000 | So as they kept on increasing the size of the data set
00:36:44.400 | and the parameters, they kept on getting lower and lower word
00:36:47.120 | error rates.
00:36:48.120 | And this has been true across any of the data set.
00:36:51.480 | And that's why the whole excitement is about
00:36:53.720 | unsupervised learning.
00:36:56.640 | So this was, in a way, a flavor of how
00:36:58.760 | can we do language modeling and unsupervised learning
00:37:01.320 | on audio for continuous signals.
00:37:05.200 | For the third sub-block, I'll just quickly mention
00:37:09.400 | ideas which are very similar to what you would have seen
00:37:11.800 | in vision transformers, but with the caveat
00:37:15.680 | that how can we use some sort of signal processing
00:37:18.880 | to improve these performance even further.
00:37:22.040 | So the basic approach still remains the same exactly
00:37:24.520 | as what you would have seen in vision transformers.
00:37:28.200 | You have a signal of interest which you want to classify.
00:37:33.080 | Here, they are raw waveform instead of images.
00:37:36.400 | The goal is to predict what's there inside of it.
00:37:41.080 | And also, we don't have any convolutions.
00:37:43.480 | We don't have any other tricks which we were using before.
00:37:46.760 | All we have to do is they can transform as themselves,
00:37:49.760 | solve this particular problem.
00:37:52.880 | So for the data set--
00:37:54.800 | and the whole setup was still the same.
00:37:57.280 | No data augmentation and no other forms of these tricks.
00:38:03.000 | You are given like 40,000 snippets for training
00:38:05.440 | and 10,000 for validation.
00:38:08.040 | Our job is to predict as good as possible
00:38:10.880 | as to what's there in the audio.
00:38:13.480 | This problem is very similar to the sound which you heard
00:38:17.120 | and the video which you saw, that given a spectrogram patch,
00:38:22.080 | you have to predict what's there inside of it.
00:38:24.080 | We kind of do one step further than what's just
00:38:33.000 | like a simple transformer model.
00:38:34.840 | In a sense that we try to see whether some sort of hierarchy
00:38:38.400 | over transformer embeddings would help us in any manner.
00:38:43.960 | So for that, we use wavelet decomposition
00:38:47.200 | on the intermediate transformer embeddings.
00:38:50.720 | So what is a wavelet decomposition?
00:38:55.040 | In very naive terms, it can be like a way
00:38:58.200 | of decomposing the intermediate embeddings
00:39:02.720 | into another intermediate embedding, in a sense
00:39:06.800 | that we are kind of putting these highways of like some
00:39:09.920 | embeddings are moving very slowly
00:39:11.440 | and some embeddings are moving very fast.
00:39:13.640 | And some embeddings are retained exactly
00:39:15.600 | at the rate of what the original signal was.
00:39:19.480 | And why this is important?
00:39:20.640 | Because you can think about that at every intermediate state,
00:39:24.000 | you are in a way learning some sort of hierarchy in the model.
00:39:27.800 | So if I look at what we do with the wavelet decomposition
00:39:35.120 | before and after, let's say you had time across this
00:39:39.200 | and you had the embedding size across this
00:39:41.640 | and this whole patch was your output of, say,
00:39:45.880 | the nth layer of the transformer.
00:39:48.640 | What I say now is, OK, I would just
00:39:52.320 | have a mapping from this to the mapping of my interest
00:39:56.040 | using wavelet decomposition, in which for half of the samples,
00:40:00.280 | I just retain the exact same embedding
00:40:02.160 | as what was learned by the transformer model.
00:40:05.960 | In the next half, I would start combining two at a time.
00:40:09.160 | So in a way, I'm learning this sort
00:40:10.600 | of like a tree structure within a single layer
00:40:13.800 | of the transformer embedding.
00:40:16.360 | And for now, the wavelet or the basis function which I use
00:40:21.000 | is simple averaging.
00:40:22.360 | So let's say from all of the embedding layers in between,
00:40:26.040 | I just need to have one embedding which is not
00:40:31.400 | moving at all, which is just representative of whatever
00:40:33.600 | is there of the whole latent space in that nth layer.
00:40:40.240 | Then in the next layer, I would just use two at a time
00:40:44.160 | and then I would use four at a time
00:40:46.680 | until I reach the exact resolution as what I had.
00:40:50.360 | Doing this operation doesn't add any parameters whatsoever.
00:40:53.560 | You're just defining what your basis function would be
00:40:56.320 | or what your wavelet function would be.
00:40:58.080 | In this case, it is a Haar wavelet.
00:41:00.200 | And I start combining them and I learned a hierarchy
00:41:04.120 | at every single layer of the transformers.
00:41:08.280 | And this improved our performance significantly
00:41:12.560 | as compared to not using them with addition
00:41:15.080 | of no extra parameters.
00:41:17.800 | And I'll come to the results later also.
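A minimal sketch of the parameter-free, multi-scale averaging idea described above, with simple averaging as the Haar-like wavelet: one layer's embeddings are summarized at several rates at once, from full resolution down to a single global average. This is one plausible reading of the scheme in the talk; the exact layout in the paper may differ.

```python
import torch

def multiscale_summary(h):
    """h: (batch, time, dim) output of one transformer layer; time assumed a power of two."""
    batch, time, dim = h.shape
    summaries = [h]                                    # embeddings kept at the full rate
    scale = 2
    while scale <= time:
        # average non-overlapping groups of `scale` steps -> a slower-moving "highway"
        pooled = h.reshape(batch, time // scale, scale, dim).mean(dim=2)
        summaries.append(pooled)
        scale *= 2
    return torch.cat(summaries, dim=1)                 # concatenate the highways along time

h = torch.randn(2, 64, 32)                             # e.g. one layer's embeddings
print(multiscale_summary(h).shape)                     # (2, 64 + 32 + ... + 1, 32) = (2, 127, 32)
```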
00:41:20.840 | So this is how the whole approach looks like.
00:41:23.840 | You have a front end.
00:41:25.680 | The front end is basically a single layer
00:41:28.480 | of 2,000 neurons followed by a dense layer of 64 neurons,
00:41:33.240 | which is just to make sure to conform it
00:41:37.360 | to the intermediate transform embeddings.
00:41:39.080 | Let's say if for the transformers,
00:41:40.920 | I define the embedding size to be 64,
00:41:43.200 | then that's the dimension which I'm mapping them to.
00:41:47.200 | So I take a broad waveform.
00:41:48.920 | I patch it in very small patches similar to how
00:41:51.960 | you do in vision transformers.
00:41:54.520 | I would just have a single layer of 2,000 neurons
00:41:56.880 | followed by a dense layer of 64 neurons
00:42:00.040 | with the hope that the first layer is learning
00:42:02.280 | like a Fourier basis function, which
00:42:04.760 | should be adaptable according to what I'm learning.
00:42:08.080 | After that, I keep on doing this over and over again.
00:42:11.640 | I don't have a classification head or anything like that.
00:42:15.520 | I keep on adding multiple stacks of transformers after that.
00:42:20.120 | And then I have two approaches of what I can
00:42:26.120 | do in terms of adaptation.
00:42:28.920 | I can do average pooling across time
00:42:31.040 | of these intermediate embeddings,
00:42:32.800 | because the idea is very similar to what
00:42:35.160 | we do in classical vision, that each of the embeddings
00:42:38.080 | are looking at much, much broader output
00:42:42.440 | in the subsequent layers.
00:42:44.160 | Or I could do a wavelet decomposition.
00:42:47.160 | So what I do is that I take all of these embeddings
00:42:50.160 | and I define these highways.
00:42:51.560 | So some of the embeddings move fast.
00:42:53.200 | Some of them are moving very slow.
00:42:54.800 | And some are retained at the exact same resolution
00:42:57.280 | as what the transformer is learning.
00:42:59.800 | And then I keep doing this over and over again.
00:43:02.120 | I have a dense layer.
00:43:03.680 | I have my softmax or sigmoid, whatever
00:43:06.560 | is my classification head.
00:43:08.680 | So this is kind of what the approach looks like.
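A hedged PyTorch sketch of this overall recipe: the raw waveform is cut into small patches (as in vision transformers), each patch passes through a wide dense layer (~2,000 units) and a 64-dim projection, the transformer stack runs on the resulting sequence, and a pooled representation feeds the classification head. All sizes and the class count are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

patch_len, d_model, num_classes = 400, 64, 200   # e.g. 25 ms patches at 16 kHz; class count illustrative
front_end = nn.Sequential(
    nn.Linear(patch_len, 2048), nn.ReLU(),       # hoped to learn Fourier-like basis functions
    nn.Linear(2048, d_model),                    # project down to the transformer width
)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=3
)
classifier = nn.Linear(d_model, num_classes)

wave = torch.randn(2, 16000)                     # one second of raw audio
patches = wave.unfold(1, patch_len, patch_len)   # (batch, num_patches, patch_len)
h = encoder(front_end(patches))                  # (batch, num_patches, d_model)
logits = classifier(h.mean(dim=1))               # average-pool over time, then classify
```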
00:43:12.400 | We compare it with all of the traditional vision-based
00:43:17.720 | architecture.
00:43:18.520 | So the vision-based models have been very good.
00:43:21.040 | And the performance have been similar in understanding
00:43:24.360 | audio also.
00:43:25.560 | So we compare all of those models
00:43:27.600 | in terms of mean average precision.
00:43:29.680 | And we see that even the tiniest models of transformers
00:43:32.320 | were just surpassing all of the state-of-the-art CNN
00:43:34.680 | models, which was a very good sign.
00:43:37.880 | Then we started to bump up.
00:43:39.800 | The larger model should keep on improving the performance.
00:43:42.560 | And with the multi-scale models, as well as
00:43:46.000 | with the pooling layers, they improve the performance
00:43:48.760 | even further, which was kind of very surprising to us
00:43:52.520 | because the number of parameters are very small.
00:43:54.920 | These are very tiny architectures.
00:43:56.480 | Yet they are surpassing things like even
00:43:58.360 | DenseNet, which are huge models with a lot of millions
00:44:01.200 | of parameters.
00:44:03.920 | So after that, we said--
00:44:05.480 | and I'm going to conclude quickly.
00:44:08.160 | After that, we said that, OK, this is looking pretty cool.
00:44:12.200 | What actually is the transformer or the first-layer learning?
00:44:17.480 | So in order to make this plot, what we said was, OK,
00:44:24.240 | if you were to take a classic Fourier transform,
00:44:27.600 | then one axis is the number of filters,
00:44:34.520 | and the other axis is the frequency.
00:44:36.720 | Then in a way, it should be connecting all of the points
00:44:41.640 | in a linear line.
00:44:43.400 | And this is akin to the number of points in the FFT.
00:44:45.920 | So how many points I'm defining here?
00:44:48.240 | If I'm defining 2,000 points here,
00:44:50.680 | then I would have 2,048 sinusoidal basis functions,
00:44:56.040 | which are going from lower frequency
00:44:57.720 | to the most highest frequency.
00:45:00.120 | We said, OK, we'll do the exact same thing,
00:45:02.080 | but now with filters.
00:45:03.640 | So we have a frequency along y-axis and the number
00:45:07.520 | of points in my x-axis.
00:45:09.360 | And if it was a classic Fourier transform,
00:45:11.280 | then it would be connecting right as a linear line.
00:45:15.560 | But what we did was we take up the front end, which
00:45:19.600 | is learned by transformer, take its Fourier transform,
00:45:22.800 | sort according to its center frequency
00:45:24.960 | as to what frequency it is activating the most,
00:45:27.680 | and then keep on stacking them.
00:45:29.960 | When we did this for two problems,
00:45:31.960 | we saw that we are learning a different time frequency
00:45:35.720 | representation, which is specific to a particular
00:45:38.080 | problem.
00:45:38.600 | So if I'm trying to understand what's
00:45:40.400 | there in the content of the audio,
00:45:42.480 | I learn a representation which is
00:45:43.880 | very different than Fourier transform,
00:45:45.160 | which would have been a straight line, which
00:45:47.040 | is like a curved exponential line like this.
00:45:51.640 | And if I do a polyphonic pitch estimation,
00:45:54.440 | I learn a very different front end,
00:45:57.320 | which is adapting to that particular problem.
00:46:00.080 | So this was very exciting to us because making computers
00:46:05.520 | hear in a way in which they are adapting their ears
00:46:07.880 | according to a particular problem is a very cool idea.
00:46:12.400 | Second thing is we actually saw each of the filters
00:46:15.200 | as to what they were doing.
00:46:17.320 | And these are basically just single slices like this.
00:46:21.240 | So this is what we would have learned as a front end neuron.
00:46:25.280 | So we take up each of the neurons and we just plot them.
00:46:27.880 | And for plotting this, we basically
00:46:29.800 | take a Fourier transform and then
00:46:32.040 | sort them according to where the center frequency is.
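A small sketch of that visualization, assuming a front-end weight matrix `W` with one learned filter per row (the matrix here is a random placeholder): take each filter's magnitude FFT, find the frequency bin it responds to most, and sort the filters by that center frequency before plotting.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 400))                  # placeholder for the learned front-end filters

spectra = np.abs(np.fft.rfft(W, axis=1))          # magnitude response of every filter
center_bins = spectra.argmax(axis=1)              # dominant frequency bin per filter
order = np.argsort(center_bins)
sorted_spectra = spectra[order]                   # rows now ordered from low to high frequency
# plt.imshow(sorted_spectra.T, origin="lower", aspect="auto") would reproduce the
# frequency-vs-filter-index plots; a plain Fourier basis would give a straight diagonal line.
```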
00:46:35.400 | When we just saw the neurons as to what
00:46:37.080 | they were learning in the front end,
00:46:39.040 | we saw that it is learning properties
00:46:41.440 | which are very, very closely matching
00:46:45.200 | with the traditional signal processing.
00:46:46.840 | So you would have something like an onset detector
00:46:48.960 | learned right here.
00:46:50.760 | You're learning windowing functions.
00:46:52.280 | In a way, it is learning to have a kernel which
00:46:55.400 | is best for a time frequency representation, what people
00:46:58.320 | have been using in signal processing, which
00:47:00.120 | is like a Hamming or a Hamming window.
00:47:03.000 | We are learning these pure sinusoids
00:47:04.720 | which are responsible for activating
00:47:07.400 | a particular frequency.
00:47:08.920 | So you can see the richness as compared
00:47:10.560 | to having a fixed, purely sinusoidal basis
00:47:14.000 | function right here.
00:47:16.440 | So this was what we had done.
00:47:20.040 | And then to share the final thoughts,
00:47:23.120 | I'll conclude by saying that, OK, transformers
00:47:25.480 | are proving to be a major advancement in AI
00:47:27.360 | research across the fields.
00:47:30.000 | And it seems like they're solving everything for now.
00:47:34.240 | And hopefully, this is not the end.
00:47:36.040 | And we should keep an eye out on something
00:47:38.440 | which would change and have an impact which
00:47:41.080 | is more than what transformers have put.
00:47:44.360 | And who knows what's going to come next?
00:47:47.600 | Yeah, so by that, I'll just conclude.
00:47:49.680 | And I'll be happy to take questions.
00:47:53.120 | Thank you, Prateek.
00:47:54.080 | That was a really good talk.
00:47:55.800 | And you provided some really good insights
00:47:58.720 | about how transformers work for the audio case.
00:48:02.360 | And yeah, thank you for the talk.
00:48:04.880 | And now I would invite questions from the class students.
00:48:10.200 | Let me just stop the recording.
00:48:13.280 | [BLANK_AUDIO]