
Stanford CS25: V1 I Audio Research: Transformers for Applications in Audio, Speech, Music


Chapters

0:00 Introduction
0:06 Transformers for Music and Audio: Language Modelling to Understanding to Synthesis
1:35 The Transformer Revolution
5:02 Models getting bigger ...
7:43 What are spectrograms
14:30 Raw Audio Synthesis: Difficulty, Classical FM Synthesis, Karplus-Strong
17:14 Baseline : Classic WaveNet
20:04 Improving Transformer Baseline • Major bottleneck of Transformers
21:02 Results & Unconditioned Setup • Evaluation criterion: comparing WaveNet and Transformers on next-sample prediction, with top-5 accuracy out of 256 possible states as an error metric • Why this setup? 1. Application agnostic 2. Suits training setup
22:11 A Framework for Generative and Contrastive Learning of Audio Representations
22:38 Acoustic Scene Understanding
24:34 Recipe of doing
26:00 Turbocharging best of two worlds • Vector Quantization: a powerful and under-utilized algorithm • Combining VQ with auto-encoders and Transformers
33:24 Turbocharging best of two worlds • Learning clusters from vector quantization • Use long-term dependency learning with that cluster-based representation for the Markovian assumption • The better we become at prediction, the better the summarization is
37:06 Audio Transformers: Transformer Architectures for Large Scale Audio Understanding - Adieu Convolutions, Stanford University, March 2021
38:45 Wavelets on Transformer Embeddings
41:20 Methodology + Results
44:04 What does it learn -- the front end
47:18 Final Thoughts

Whisper Transcript

00:00:00.000 | [AUDIO OUT]
00:00:04.920 | Thanks for inviting me for the talk today.
00:00:07.480 | And I'll be just talking about transformers
00:00:10.320 | for music and audio, which is very
00:00:12.400 | different than what all of us were doing in this past course.
00:00:16.880 | I'm also the only speaker from Stanford,
00:00:18.840 | so I have to do a good job.
00:00:20.200 | So you'll see very good slides, because I'm
00:00:23.280 | representing the university in some sense.
00:00:26.520 | So yeah, so the flow of the talk for today
00:00:28.240 | is basically like I'll be throwing a lot of stuff.
00:00:30.960 | It's kind of like a buffet style,
00:00:32.440 | and then you feel free to like or dislike whatever you want.
00:00:36.640 | And I'll be talking mostly about three papers of what
00:00:39.560 | I've been working on.
00:00:42.120 | I'll start with introducing what transformers
00:00:44.600 | are from a different perspective, what
00:00:46.480 | audio representations are.
00:00:49.240 | Talk about a generative model for audio,
00:00:52.200 | which is just doing language modeling on sample level.
00:00:56.080 | Then I'll talk about how can one do like language
00:00:59.080 | modeling for speech and audio, which is different than what
00:01:02.480 | people do for text.
00:01:04.280 | What are the current trends in the literature?
00:01:07.360 | Finally, I'll briefly mention similar stuff
00:01:11.000 | as to what was happening in computer vision
00:01:13.080 | with regard to vision transformers.
00:01:14.920 | How can we adapt similar ideas for audio transformers?
00:01:17.920 | And throw in a bit of signal processing
00:01:20.040 | to improve the performance.
00:01:22.120 | Having told that the talk is about 35 to 40 minutes
00:01:24.840 | with about 15 minutes of Q&A, I should also
00:01:27.600 | say that all of the opinions are mine,
00:01:29.600 | and Stanford or any other professor
00:01:31.160 | is not responsible for any of the mistakes which I make.
00:01:35.880 | So transformers have kind of revolutionized the way
00:01:40.240 | everyone was approaching deep learning.
00:01:42.960 | Before that, it was all about CNNs.
00:01:45.240 | And mostly, all of these prominent models
00:01:48.360 | have been coming in waves.
00:01:49.440 | So there was a time when everyone was just
00:01:51.200 | applying CNNs.
00:01:53.360 | Then came a time where people started
00:01:55.600 | adapting CNNs with some sort of dilated convolutions.
00:01:58.480 | And slowly, the recurrent networks
00:02:00.400 | were getting out of fashion.
00:02:02.480 | Now, it seems like transformers are in fashion all the time.
00:02:05.960 | So it seems to be solving almost every single problem which
00:02:08.840 | is being thrown at them.
00:02:12.640 | So what's special about them?
00:02:15.000 | One of the facts which struck me was their simplicity,
00:02:19.160 | if you think about it.
00:02:23.680 | And it has been hugely popular also.
00:02:25.720 | So it was just released in 2018.
00:02:28.640 | And within three years, it has about 30,000 citations.
00:02:31.280 | And it is kind of solving every single problem
00:02:33.200 | in every single domain.
00:02:35.400 | It has its limitations, though, also.
00:02:38.560 | But if you think about it, in a way,
00:02:40.720 | transformers are basically a way of just cascading self-attention
00:02:45.800 | with feature learning.
00:02:46.960 | And if you keep on doing it over and over again,
00:02:49.280 | then the model, in a way, learns which parts of the input
00:02:52.160 | are important and keep on transforming them,
00:02:55.120 | removing the contents which are not important,
00:02:57.360 | and just have the limited information which is just
00:03:00.480 | responsible for a particular task.
00:03:04.120 | And it has been very, very difficult to keep up
00:03:06.160 | with the literature.
00:03:09.000 | I have put it as a joke here.
00:03:10.760 | But then even Twitter's recommendation engine
00:03:13.160 | was kind of going haywire
00:03:16.120 | as to why Chris Manning was just
00:03:19.360 | searching over transformers.
00:03:21.480 | And that was way back in 2020.
00:03:24.480 | So it has been difficult for researchers
00:03:26.440 | also to keep up with the pace of what's going on.
00:03:29.800 | Just before transformers, all of the NLP community
00:03:33.000 | was just going gaga about bidirectional LSTMs
00:03:35.760 | with attention.
00:03:36.760 | So every single paper before 2017
00:03:38.840 | was just like: you have encoder LSTM layers.
00:03:42.320 | You keep on adding multiple layers.
00:03:46.000 | And then after that, you have attention mechanism
00:03:48.440 | which just learns that what's important
00:03:50.600 | and then just keeps on decoding sequentially one at a time.
00:03:54.760 | But this was not kind of like an ideal way to do it.
00:03:58.040 | Because what turns out is when we start throwing longer
00:04:01.560 | sequences, the connections are no longer
00:04:07.120 | storing the gradient updates in a way it should be doing.
00:04:09.840 | So what the researchers from Google said,
00:04:13.360 | instead of having just an attention
00:04:15.040 | layer at the very last encoding, we
00:04:19.240 | would just have these attention mechanisms
00:04:21.280 | at every single layer, which in a way
00:04:23.640 | would just learn what's important
00:04:25.280 | for a particular problem at that particular layer.
00:04:28.240 | And we keep on doing it over and over again.
00:04:31.920 | So then the whole idea of transformers and attention
00:04:35.920 | mechanism cascaded one after the other came.
00:04:39.520 | And I'll not go into the details,
00:04:40.880 | because this is the last class of the course.
00:04:43.400 | But then usual tricks do help across the neural net
00:04:46.600 | literature, which is like having multi-head attention,
00:04:50.680 | having skip connections and layer norm.
00:04:52.280 | So all of these things, they are not only
00:04:54.440 | like giving gains for transformers themselves,
00:04:57.800 | but they can be just applied to any single other architecture
00:05:01.520 | also.
00:05:03.440 | The other thing which is helping this research
00:05:05.960 | is basically that computing power is getting better and better.
00:05:09.360 | So all of these big companies are just
00:05:12.640 | throwing massive amounts of computing resources
00:05:14.880 | at solving very, very simple and trivial tasks.
00:05:18.920 | The top of the hill being the switch transformer, which
00:05:21.560 | was discussed in the course also.
00:05:23.000 | But one other thing which I think
00:05:28.240 | started all of this trend was ELMo,
00:05:30.080 | which was just learning these contextualized representations
00:05:33.200 | for natural language processing.
00:05:34.920 | And that model right here was perhaps
00:05:37.400 | one of the first kind of like model 0.0 or something,
00:05:45.640 | or 0.1 in terms of bringing and ushering
00:05:48.320 | in the whole revolution.
00:05:50.760 | You can see that how similar these kind of models
00:05:53.520 | look like.
00:05:54.560 | BERT was basically inspired heavily
00:05:57.280 | from ELMo, in which they just replaced
00:06:00.200 | some of the LSTM layers with transformer modules.
00:06:03.800 | So a point to note also is irrespective
00:06:08.360 | of natural language processing or other domain,
00:06:10.760 | these can be adopted in a variety of domains.
00:06:13.000 | And for today's talk, I'll be just adopting them to audio.
00:06:17.720 | So I'll basically start with introducing people
00:06:21.240 | what audio representations are, and just
00:06:23.400 | for the sake of completeness, talk about spectrograms.
00:06:28.240 | So you can take any time domain signal,
00:06:31.400 | and you can decompose that signal
00:06:34.560 | into a variety of basis functions.
00:06:39.600 | And if you take up a Fourier transform,
00:06:41.680 | you're kind of like decomposing the actual time domain
00:06:46.040 | signal into its sinusoidal basis components.
00:06:49.400 | So if you have like a waveform here
00:06:51.880 | like this, which is a sum of three pure sinusoids,
00:06:55.240 | then their sum basically is this.
00:06:57.440 | And you can see that when you take a Fourier transform
00:07:00.120 | and its magnitude, you kind of have
00:07:03.280 | the strength of the individual components shown here.
00:07:08.800 | So you can take up another waveform,
00:07:10.960 | let's say a square wave, and what you have
00:07:13.720 | is basically a much richer sinusoidal decomposition
00:07:17.680 | because it is kind of a discontinuous signal.
00:07:19.960 | So you need like many more sinusoids
00:07:21.640 | to represent that particular signal as close
00:07:24.400 | to the actual signal as possible.
00:07:26.960 | And here also you can see that, OK, if this was a square wave,
00:07:29.880 | then it is actually made up of a lot of sinusoids
00:07:35.960 | where each of the bar here represents
00:07:39.200 | the strength of the particular sinusoid.
00:07:42.280 | From an optimization perspective,
00:07:44.120 | I mean, this right away is suboptimal, right?
00:07:47.040 | Because you're kind of fixing up the number of sinusoids
00:07:51.080 | you're using for representing a square wave.
00:07:53.240 | I would have rather used a basis function
00:07:56.080 | which was a square wave itself than a sinusoidal signal.
00:08:01.360 | The second thing is even if you are taking a sinusoidal signal,
00:08:05.320 | we kind of are just putting them in an equidistant space.
00:08:09.440 | So you're kind of dividing the whole frequency
00:08:11.760 | axis into equidistant bins.
00:08:14.440 | And each of the bins is responsible
00:08:16.040 | for a particular sinusoid.
00:08:19.400 | So that is like a traditional Fourier representation
00:08:22.360 | for representing any signal.
00:08:25.840 | What we do for--
00:08:28.840 | what are spectrograms?
00:08:30.480 | But in reality, all of these signals are discontinuous.
00:08:34.880 | All of these signals vary quite a bit, right?
00:08:37.360 | So you can have a signal while I'm
00:08:39.960 | speaking which is like a square wave for a certain period
00:08:42.680 | of time, and then it gets sinusoidal,
00:08:44.720 | and then it becomes something else.
00:08:46.680 | So what we really need is in a way
00:08:48.880 | to kind of take batches of input signal
00:08:52.840 | and take Fourier transform of these individual batches.
00:08:56.320 | I'm deliberately using the word batches,
00:08:58.000 | but you can-- in traditional terms,
00:08:59.520 | you are windowing the signal.
00:09:02.240 | So right here, you can see that you have a continuous signal.
00:09:05.120 | You keep on windowing it.
00:09:07.320 | You apply the Fourier transform, and what you get
00:09:10.280 | is basically like a spectrogram representation of the signal.
00:09:14.320 | So right here, what you're seeing basically
00:09:16.360 | is for each of the slices, the signal kind of
00:09:19.520 | look like this after taking the Fourier
00:09:21.560 | transform with the waveform which is there below.
00:09:24.840 | And what you do is for spectrogram representation,
00:09:26.920 | you keep on stacking these Fourier transform
00:09:29.160 | slice, the magnitude of the Fourier transform slices.
00:09:31.760 | And in this way, you kind of get like a 2D representation
00:09:34.760 | of audio signals.
00:09:36.720 | And if you're coming from a vision background,
00:09:38.640 | it is basically all of the things
00:09:40.160 | which you're doing in vision would just work well
00:09:42.240 | if you just apply them to these 2D spectra representations.
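A minimal sketch of the windowed-FFT idea described here, assuming plain NumPy (real pipelines would typically use something like librosa.stft or scipy.signal.stft); it slices the waveform into windows, takes the magnitude FFT of each slice, and stacks the slices into a 2D time-frequency image.

```python
# Windowed FFT ("spectrogram") sketch: slice the signal, window each slice,
# take the magnitude FFT, and stack the slices as a 2D array.
import numpy as np

def spectrogram(signal, win_size=1024, hop=256):
    window = np.hanning(win_size)
    frames = []
    for start in range(0, len(signal) - win_size + 1, hop):
        frame = signal[start:start + win_size] * window     # window one slice of audio
        frames.append(np.abs(np.fft.rfft(frame)))           # keep only the magnitude spectrum
    # Shape (num_frames, win_size // 2 + 1): time on one axis, frequency on the other.
    return np.stack(frames)

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)  # sum of two sinusoids
S = spectrogram(x)
print(S.shape)  # e.g. (59, 513)
```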
00:09:47.240 | I'll quickly play how these spectrograms look
00:09:49.960 | like for a wide area of common sounds.
00:09:53.400 | [VIDEO PLAYBACK]
00:09:56.400 | [MUSIC PLAYING]
00:10:00.400 | [BIRDS CHIRPING]
00:10:03.400 | [MUSIC PLAYING]
00:10:07.400 | [WHISTLING]
00:10:25.400 | [MUSIC PLAYING]
00:10:28.400 | [WHISTLING]
00:10:31.400 | [MUSIC PLAYING]
00:10:34.400 | [WHISTLING]
00:10:35.400 | So you can see like for spectrograms,
00:10:38.120 | you have kind of like a time axis on your x-axis.
00:10:40.880 | And then you have a frequency axis on y-axis.
00:10:43.680 | And then for whatever is your signal of interest,
00:10:46.120 | you're basically like putting these slices together.
00:10:48.880 | And different sound gives you like different spectra
00:10:50.960 | representation.
00:10:51.760 | So it's kind of a vision problem just
00:10:54.080 | in this sort of like Fourier space.
00:10:58.040 | So there can be like different kinds of representations also.
00:11:01.200 | So one, you could just take these slices of Fourier
00:11:06.320 | transform and then do like a linear mapping to them
00:11:09.320 | so that you're kind of in a way making these as
00:11:13.400 | close to how humans hear.
00:11:14.680 | So you can have like log of the frequency on the y-axis
00:11:17.960 | instead of common frequency.
00:11:19.240 | And then you get like a constant Q-like representation.
00:11:22.040 | The advantage of this being like you
00:11:23.640 | can see that for different frequencies,
00:11:25.880 | the spacing between the harmonics kind of remains same.
00:11:28.960 | So if you're like training convolutional filters,
00:11:31.040 | then that's of a huge advantage because the signal,
00:11:33.680 | like one component of the invariance is gone.
00:11:35.800 | And you can just learn these filters
00:11:37.280 | which are catching onto these constant templates of Fourier
00:11:41.880 | slices.
00:11:43.200 | You can have mel filter bank coefficients,
00:11:45.080 | or you can have like the raw waveform also.
00:11:48.960 | For raw waveforms, basically there
00:11:50.520 | are two things which we have to keep in mind.
00:11:53.200 | One is the sampling rate.
00:11:54.320 | So we kind of like take the continuous signal
00:11:57.200 | and then we discretize the continuous signal.
00:11:59.400 | So one parameter is like how fast we are sampling
00:12:03.360 | the continuous signal.
00:12:04.280 | So that's typically on the order of like 16,000 or 8,000
00:12:07.720 | times a second if you're on telephonic speech.
00:12:10.440 | The other thing which we also choose is how many levels
00:12:13.720 | we are dividing the vertical axis into.
00:12:15.480 | So in this case, you can see that each of the dots
00:12:17.760 | is basically one level.
00:12:19.400 | And typically, people use 8-bit quantizers or 16-bit
00:12:22.280 | quantizers.
00:12:23.280 | So in a way, you can think about that for every one
00:12:25.400 | second of audio which we would hear,
00:12:27.480 | you would have like 16,000 samples.
00:12:29.720 | And then each of the 16,000 samples
00:12:32.720 | is allowed to take one of the levels between 0 and 255.
00:12:36.800 | And that's like if I can take the problem of continuous audio
00:12:41.560 | and just have it in terms of this sort of discrete space,
00:12:45.320 | then basically I'm just going to the territory
00:12:47.520 | of doing language modeling.
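A small sketch of how continuous audio gets mapped to 256 discrete states, using mu-law companding followed by 8-bit quantization (the scheme WaveNet uses); the exact pre-processing in any given system may differ.

```python
# Turn continuous samples in [-1, 1] into 256 integer levels (and back).
import numpy as np

def mu_law_encode(x, mu=255):
    # mu-law compress, then quantize to integer levels in [0, 255]
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(levels, mu=255):
    # inverse mapping from integer levels back to approximate waveform values
    compressed = 2 * levels.astype(np.float64) / mu - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

sr = 16000
x = 0.8 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # one second of audio
tokens = mu_law_encode(x)           # 16,000 integers, each one of 256 possible states
print(tokens.min(), tokens.max())   # language modelling now happens over these tokens
```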
00:12:50.600 | So the first papers I discuss is how
00:12:53.480 | can we do generative modeling for raw audio, which
00:12:57.880 | is similar to WaveNets using transformers.
00:13:01.480 | I'll be putting QR codes if you like the stuff what I'm doing.
00:13:05.560 | And if you think that this is relevant to you,
00:13:07.800 | please cite or please have a look
00:13:10.080 | in terms of the QR codes.
00:13:13.000 | So yeah, so I'll start with the first subtopic
00:13:16.400 | of today's talk, which is like what are WaveNets
00:13:22.640 | and how do we do this generative modeling over raw audio?
00:13:26.960 | So in a single word, you can think
00:13:28.360 | about this as doing language modeling over these 256
00:13:31.360 | states of audio.
00:13:33.120 | So you can throw in your favorite transformer model
00:13:35.760 | like transformer XL or GPT or whatever you want to call it.
00:13:41.520 | And just treat the problem as if you
00:13:43.160 | are trying to predict one of the levels out of 256.
00:13:45.800 | And you have to predict the next level given a certain context.
00:13:49.480 | That's what WaveNet was doing.
00:13:50.760 | So the way you are modeling the probability distribution
00:13:55.320 | of a continuous space is basically
00:13:57.320 | you're trying to predict what's the probability
00:13:59.360 | of the next sample given some past context.
00:14:02.560 | And WaveNet has been hugely popular
00:14:04.720 | because it has over 3,000 citations
00:14:07.080 | and it has been a core building block for almost all speech
00:14:11.040 | and audio related problems.
00:14:13.000 | You can think about speech to text, text to speech synthesis,
00:14:16.600 | instrument conversion, packet loss
00:14:19.040 | concealment over the internet, speech denoising.
00:14:21.680 | So wherever there's some sort of element of modifying audio,
00:14:25.560 | people have been using WaveNet as a core building block.
00:14:30.440 | And raw waveform synthesis has been difficult
00:14:32.880 | because just the magnitude of the problem,
00:14:35.920 | if I'm just trying to synthesize 10 seconds of audio,
00:14:38.880 | it would just amount to me having a probability
00:14:41.680 | distribution over 160,000 samples.
00:14:45.760 | And that itself is tough because our ears are very, very
00:14:49.120 | sensitive to subtle changes.
00:14:51.000 | If I'm off by one pixel in an image,
00:14:55.600 | my eyes would not be as susceptible to noticing
00:14:59.200 | that effect versus if I'm off by, say, a few samples
00:15:04.240 | in an audio, it would just catch our ears pretty quickly.
00:15:08.600 | People have been trying raw audio synthesis a lot
00:15:10.960 | in the past.
00:15:12.440 | And before all of the WaveNet and transformer-based
00:15:15.720 | approaches, WaveRNNs and SampleRNNs
00:15:20.280 | were kind of like state-of-the-art models.
00:15:25.600 | On the right, I've shown a SampleRNN model, which
00:15:28.600 | kind of models the probability distribution of what's
00:15:33.280 | going to come next given the past at multiple levels.
00:15:36.400 | And this was work done by Yoshua Bengio at Mila.
00:15:40.200 | But you can closely see, if you just
00:15:42.560 | see this architecture versus a transformer architecture,
00:15:45.520 | in a way, these are starting to get very, very similar.
00:15:48.680 | Because what you're trying to do is
00:15:50.200 | that for the probability distribution here,
00:15:52.320 | you're trying to see a lot of local substructures.
00:15:56.040 | And then you keep on doing it over and over again.
00:15:58.120 | And you can draw parallels, like attention mechanism
00:16:00.840 | should also kind of be doing the same thing.
00:16:04.000 | So this was kind of like the literature in the past.
00:16:09.280 | What we tried to do was we just had the WaveNet model.
00:16:13.120 | And we tried to see whether transformers can beat them.
00:16:15.800 | And our intuition was it should be able to beat them
00:16:18.160 | because they are successful all over the other domains,
00:16:23.000 | like in language modeling.
00:16:24.120 | So it should do that for raw waveforms also.
00:16:28.720 | We also tried to see whether we can circumvent the order
00:16:31.400 | n squared constraint by conditioning on the context
00:16:35.560 | itself.
00:16:37.160 | And we did not go for specific applications.
00:16:40.560 | And we just said, OK, just in terms like modeling behavior,
00:16:43.000 | how will they do?
00:16:45.440 | So the data set for this was just
00:16:47.200 | like real-world kind of recording.
00:16:48.760 | So actual sound should not matter
00:16:53.440 | because the model is agnostic to what it is being thrown in.
00:16:57.880 | And the setup was exactly the same,
00:16:59.520 | like you are giving a certain context.
00:17:01.360 | And I have to predict the next sample.
00:17:03.600 | You do the same thing with WaveNets.
00:17:05.520 | You do the exact same thing with a transformer-based,
00:17:09.240 | GPT kind of model and see how well they do.
00:17:14.000 | I'll briefly chat about what WaveNet models are.
00:17:18.040 | So WaveNet was kind of like a convolution-based model, which
00:17:21.360 | was getting rid of all of the vanishing gradient problem
00:17:25.360 | by just treating a sequential problem
00:17:28.680 | as being learned by a convolutional model.
00:17:31.120 | So what they did was basically have this sort
00:17:33.880 | of dilation layers, or convolution with dilations,
00:17:38.320 | which is basically I kind of skip in every subsequent layer
00:17:42.000 | by one sample.
00:17:43.240 | So you can see if I have a dilation factor of 2
00:17:47.080 | with a kernel size of 2, I would get this kind of a topology
00:17:50.400 | where my convolution filters in the very first layer
00:17:52.720 | are just combining the first two samples.
00:17:55.120 | Then I skip by one in the next layer.
00:17:57.520 | And then I skip by three, which is
00:17:59.360 | like I look at the fourth one in the next layer and so on.
00:18:03.400 | The loss is still the same.
00:18:04.600 | So I have this network.
00:18:06.640 | I learn a latent space.
00:18:08.000 | And then I have a categorical cross-entropy loss,
00:18:11.200 | which is basically I have to predict the next sample given
00:18:13.960 | the previous one.
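A simplified PyTorch sketch of the dilated causal convolution stack described here (kernel size 2, dilation doubling per layer, next-sample cross-entropy); the real WaveNet also uses gated activations and skip connections, which are omitted for brevity.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=64, num_layers=8, num_classes=256):
        super().__init__()
        self.embed = nn.Embedding(num_classes, channels)     # one of 256 levels comes in
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(num_layers)                        # receptive field grows as 2^i
        )
        self.head = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, tokens):                                # tokens: (batch, time) integers
        h = self.embed(tokens).transpose(1, 2)                # -> (batch, channels, time)
        for conv in self.convs:
            pad = conv.dilation[0]                            # left-pad so no future samples leak in
            h = torch.relu(conv(nn.functional.pad(h, (pad, 0))))
        return self.head(h)                                   # (batch, 256, time) logits

model = DilatedCausalStack()
tokens = torch.randint(0, 256, (2, 1600))                     # about 100 ms at 16 kHz
logits = model(tokens)
# predict the *next* sample at every position with categorical cross-entropy
loss = nn.functional.cross_entropy(logits[:, :, :-1], tokens[:, 1:])
```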
00:18:16.640 | And I just do the exact same thing with transformers also.
00:18:20.800 | But then I have to make sure that I
00:18:22.760 | do it in a causal manner.
00:18:24.120 | So I have something which is very similar to GPT,
00:18:26.840 | in which I have causal masks in my attention mechanism.
00:18:30.320 | And I keep doing it over and over again.
00:18:32.360 | So you have self-attention.
00:18:35.520 | After that, you have feedforward layers.
00:18:37.400 | You just have a stack of these transformer blocks
00:18:40.400 | and see how they do.
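For comparison, a hedged sketch of the GPT-style alternative: the same next-sample objective, but a stack of transformer blocks with a causal attention mask (positional encodings are omitted here to keep the sketch short; sizes are illustrative, not the paper's exact configuration).

```python
import torch
import torch.nn as nn

d_model, num_classes, context = 64, 256, 1600
embed = nn.Embedding(num_classes, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=4)
head = nn.Linear(d_model, num_classes)

tokens = torch.randint(0, num_classes, (2, context))
# upper-triangular additive mask: position t may only attend to positions <= t
causal_mask = torch.triu(torch.full((context, context), float("-inf")), diagonal=1)
h = encoder(embed(tokens), mask=causal_mask)
logits = head(h)                                              # (batch, time, 256)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, num_classes), tokens[:, 1:].reshape(-1)
)
```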
00:18:43.320 | So I said intuitively it should work.
00:18:45.320 | So it should be doing better than our baseline WaveNet models.
00:18:52.360 | Because if you look at the topology,
00:18:55.120 | we are kind of defining a topology on our own, right?
00:18:57.640 | So what if the current prediction at, say,
00:19:01.520 | layer one were to depend on very way back sample, say,
00:19:07.760 | instead of the second sample, the 10th sample?
00:19:09.640 | So we are kind of ignoring all of that topology, which
00:19:12.800 | would have been important for prediction
00:19:14.480 | of this particular task.
00:19:16.040 | Whereas transformers with the self-attention mechanism
00:19:19.640 | can just learn, like, OK, which part of the samples
00:19:22.400 | are important and which are not.
00:19:23.960 | And you can keep on doing it iteratively.
00:19:26.760 | So it made sense to us that, OK, transformer layer
00:19:30.880 | should be doing way better than wave net models.
00:19:34.880 | The second thing which we came across was, OK,
00:19:37.680 | we cannot have a lot of context.
00:19:40.640 | For example, the attention mechanism
00:19:42.480 | needs to store all of those of order n squared.
00:19:46.120 | So in this case, if I'm storing data at 100 milliseconds,
00:19:50.240 | then I have about 1,600 samples.
00:19:52.600 | And I need to store 1,600 by 1,600 at multiple layers.
00:19:56.840 | And it just becomes like a huge problem with the data--
00:20:01.800 | problem with the memory constraint.
00:20:03.360 | So what we said was, OK, what if we just
00:20:06.600 | use the context itself as a latent code?
00:20:10.840 | So in order to have much better representation at every layer,
00:20:16.440 | we cannot have huge, big attention matrices.
00:20:20.040 | So what we said was, we would just
00:20:21.960 | do a sample-wise conditioning and throw in CNN layers just
00:20:26.240 | to understand what the latent code would be.
00:20:28.560 | So you still have, like, an attention mechanism
00:20:31.040 | over just a past context.
00:20:32.960 | But then I'm also conditioning at every sample, OK,
00:20:36.560 | what the next sample should be given on this context embedding.
00:20:41.240 | And if you think about it, in a way, it is like, OK,
00:20:43.360 | if there are, like, five or six notes being played in a piano,
00:20:46.680 | then I'm kind of certain which notes
00:20:48.280 | will be played to a certain extent
00:20:50.040 | if I just throw in a CNN layer.
00:20:52.600 | So I'll use that information along with what
00:20:55.360 | my transformers are learning.
00:20:57.480 | And then I would condition it.
00:20:59.040 | And I would just use that to predict the next sample.
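A rough, illustrative sketch of this conditioning idea, not the paper's exact architecture: a small CNN summarizes a long chunk of past audio into one latent code, and that code is broadcast-added to every step of the (much shorter) sequence the transformer attends over. All layer sizes here are assumptions.

```python
import torch
import torch.nn as nn

d_model = 64
context_cnn = nn.Sequential(                  # summarizes a long stretch of past samples
    nn.Conv1d(1, 32, kernel_size=16, stride=8), nn.ReLU(),
    nn.Conv1d(32, d_model, kernel_size=16, stride=8), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                   # -> a single d_model-dim latent code
)

long_past = torch.randn(2, 1, 16000)           # one second of raw past context
recent = torch.randn(2, 400, d_model)          # embeddings of the recent samples fed to the transformer
latent = context_cnn(long_past).squeeze(-1)    # (batch, d_model)
conditioned = recent + latent.unsqueeze(1)     # broadcast the context code to every time step
```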
00:21:03.240 | So for the evaluation criteria, we
00:21:04.840 | did not look for negative log-likelihood scores.
00:21:08.720 | We just looked at how well our prediction task was.
00:21:12.760 | So we took a, like, stacked WaveNet,
00:21:15.640 | which was implemented by DeepMind,
00:21:17.520 | and saw that, OK, what was the performance using
00:21:21.320 | their benchmarks and even, like, bigger stacked WaveNets.
00:21:25.840 | We then started to increase the complexity of transformers
00:21:29.080 | and started to see whatever we had proposed
00:21:32.520 | in terms of, like, conditioning on the vanilla transformer
00:21:36.600 | architectures to see how well they do.
00:21:39.520 | We did not look for, like, an application-specific problem,
00:21:43.640 | which is basically, like, we don't look at, like,
00:21:46.520 | how well perception tasks are for, like, say,
00:21:49.200 | text-to-speech synthesis or speech denoising.
00:21:51.600 | We just look at, OK, if we are trying
00:21:53.280 | to model this using a cross-entropy loss,
00:21:56.000 | then with the same model, with the same loss function,
00:21:59.560 | how well they do on, like, similar kind of parameters.
00:22:03.720 | So this was the first kind of, like, sub-block of, like,
00:22:06.640 | how can we use our transformers for generative modeling.
00:22:12.160 | For the second problem, I'll do a quick headway
00:22:15.360 | on how can we use, like, transformers
00:22:19.160 | for doing language modeling, which
00:22:21.200 | is kind of becoming a really fancy term right now.
00:22:24.880 | And this work was done with Julius Smith way back in 2020.
00:22:28.880 | And the goal of this was, can we kind of, in a way,
00:22:32.520 | do language modeling with continuous audio sequences?
00:22:37.440 | And I'll briefly mention about that in this sub-block
00:22:42.240 | of the talk.
00:22:42.960 | And this is in regard for, like, solving acoustic scene
00:22:48.440 | understanding, which is basically, like,
00:22:51.560 | if I'm given a chunk of audio, then
00:22:54.280 | I want to understand what's in there.
00:22:56.960 | And if we could do that well, then in a way,
00:23:01.800 | we can do a lot of fancy, nice applications.
00:23:06.120 | So for example, like, if you think about, like,
00:23:08.160 | self-driving cars.
00:23:09.040 | So Waymo has started to incorporate microphones
00:23:12.200 | into their self-driving cars.
00:23:14.220 | Because, say, if there is an ambulance coming,
00:23:16.080 | or if there is a fire truck coming,
00:23:18.960 | then that sound would be picked up way, way before even
00:23:23.040 | the LIDARs or even their sensors.
00:23:25.840 | So they want to understand that and take
00:23:28.320 | actions based upon that.
00:23:30.880 | Apple, during COVID, did a hand-washing detection
00:23:33.360 | on their Apple Watch.
00:23:34.600 | Because if you could detect when someone is washing their hands,
00:23:37.800 | then you can, in a way, like, tell people that, oh,
00:23:40.760 | you need to wash hands for 20 seconds.
00:23:42.720 | And then that can be built upon as a cool application.
00:23:46.880 | It can be used for music recommendations.
00:23:49.520 | So Spotify and YouTube Music kind of recommend, like,
00:23:51.760 | very, very good songs, similar in content
00:23:54.640 | to what you are listening to, which you would perhaps like.
00:23:59.320 | It can also give, like, really cool applications.
00:24:01.680 | Like, say, people have tried, like,
00:24:03.760 | detecting depression from audio.
00:24:05.800 | Or I could detect whether I'm coughing or not,
00:24:08.760 | or I'm sneezing or not.
00:24:09.960 | And these can be, like, good medical device--
00:24:13.080 | medical applications, which can be
00:24:15.360 | used along with the current diagnosis what doctor provides.
00:24:20.520 | So the question was basically, for us,
00:24:23.320 | was, like, how can we do, like, language modeling
00:24:26.600 | in a continuous audio domain?
00:24:29.040 | And secondly, like, how can we train models,
00:24:31.240 | or how should we approach doing this?
00:24:35.320 | So this kind of, like, recipe has
00:24:37.360 | become, like, very, very popular these days in terms of, like,
00:24:40.680 | how would you approach this problem?
00:24:42.240 | It started with, like, OpenAI, and to a certain extent,
00:24:46.280 | DeepMind proposing that in terms of, like, VQ-VAE models.
00:24:50.960 | But it turns out, like, transformers
00:24:52.640 | love operating in discrete spaces, as of now.
00:24:56.440 | And what they kind of do is, as long as your representations
00:25:01.200 | are discrete, they are very, very good at modeling
00:25:03.640 | what's going to come next.
00:25:06.760 | So what people have been proposing as a workaround
00:25:09.200 | is you could take up, like, your favorite embedding
00:25:15.640 | in some manner.
00:25:16.400 | You could take VQ-VAE embeddings,
00:25:18.200 | or you could take a Wave2Vec, or in terms of video,
00:25:21.840 | you can just do classic VGG or ResNet embeddings.
00:25:27.880 | You can apply k-means clustering to it.
00:25:31.040 | And k-means clustering would give you, like, discrete codes.
00:25:34.320 | You do language modeling with those discrete codes,
00:25:37.240 | and you predict the next code.
00:25:39.360 | And in a way, if you're doing this,
00:25:41.440 | then you're kind of doing language modeling over audio.
00:25:45.200 | And if you need to get back to the audio,
00:25:46.840 | then you already saw with WaveNet
00:25:49.000 | that you can condition the WaveNet model
00:25:51.120 | to give continuous output.
00:25:53.480 | So you can use those codes to get back
00:25:55.320 | to the audio, similar to what jukebox and OpenAI did.
00:26:00.240 | So I'll quickly mention about what vector quantization is.
00:26:05.920 | It's one of the most underutilized algorithms,
00:26:08.680 | to be honest.
00:26:09.880 | And what it does is basically gives, in a way,
00:26:12.960 | discrete codes to continuous embedding spaces.
00:26:16.400 | So how does it do it?
00:26:18.080 | So you basically have an embedding space,
00:26:23.600 | let's say, in 2D right here.
00:26:25.240 | You define what are the number of clusters
00:26:27.120 | you want to put each of them in.
00:26:29.320 | You run k-means, and you would certainly
00:26:31.520 | get these patches of where all of these embeddings
00:26:36.200 | are, what would be the representative embedding
00:26:38.280 | of a continuous embedding.
00:26:40.720 | You can take all of those patches,
00:26:42.120 | and you can just number them, or you can just list them.
00:26:45.960 | So in this case, you can perhaps have 25 numbers, or 20 numbers,
00:26:49.840 | which are, in a way, mapping from a continuous embedding
00:26:53.480 | to a discrete token.
00:26:57.040 | This is another example right here.
00:26:58.520 | So in our case, what we did was we
00:27:02.320 | took batches of spectrograms, which are basically
00:27:05.160 | very small patches across time, spanning all
00:27:10.760 | across the frequency axis.
00:27:13.320 | You take those patches, you learn
00:27:15.200 | the embedding representation.
00:27:16.600 | In our case, it was just like three-layer autoencoder,
00:27:19.720 | fully-connected encoders with three layers of decoders,
00:27:22.800 | and have a bottleneck layer in between.
00:27:25.360 | So that bottleneck layer basically
00:27:27.000 | is kind of similar to this kind of diagram
00:27:29.280 | in, say, 64-dimensional space or 120-dimensional space.
00:27:33.520 | You take up those bottleneck codes,
00:27:35.400 | and then you run k-means clustering on it.
00:27:37.840 | Suddenly, in a way, you can find discrete codes
00:27:43.520 | for continuous embedding spaces or even continuous segments.
00:27:48.040 | And since we know that transformers kind of love
00:27:50.480 | operating in discrete spaces, you
00:27:52.600 | can just apply language modeling now,
00:27:55.160 | and then you can see what you can do.
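A minimal sketch of the vector-quantization step just described, assuming scikit-learn's KMeans: continuous embeddings (for example the bottleneck codes of a small autoencoder, not shown here) go in, and discrete token ids come out, ready for language modelling.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 64))       # stand-in for bottleneck codes of audio patches

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(embeddings)
tokens = kmeans.predict(embeddings)            # each patch is now an integer in 0..15
print(tokens[:20])                             # this discrete sequence is what the LM sees
```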
00:27:58.480 | So in our case, we just had very simple three-layer,
00:28:02.200 | fully-connected autoencoder, small patches.
00:28:06.000 | The number of codes is important,
00:28:07.520 | because if you have too many codes,
00:28:10.000 | then you're kind of just throwing
00:28:11.800 | in all kinds of noisy things.
00:28:13.600 | Now, I'll give an example of why the number of codes
00:28:17.880 | are important through some example.
00:28:19.680 | And if you have too few codes,
00:28:22.480 | what you're, in a way, doing is you're
00:28:25.000 | removing all of the information which was relevant,
00:28:27.280 | and you're just kind of averaging it all out.
00:28:29.480 | So this idea first was proposed by Jukebox,
00:28:38.200 | which did it for music.
00:28:40.240 | So you do the exact same thing, what
00:28:42.160 | I talked about, in a slightly different manner.
00:28:45.240 | In a way that, OK, you cannot learn codes
00:28:48.640 | for longer sequences.
00:28:50.520 | So in a way, learn sequences which are just moving slowly
00:28:54.080 | and which are looking at only a certain amount of audio.
00:28:57.680 | So you kind of encode this in these discrete levels, which
00:29:01.320 | are basically like--
00:29:03.360 | all of these basically are codes.
00:29:04.760 | So at every point, I define, OK, this audio
00:29:08.040 | had, perhaps, code number 55.
00:29:10.520 | And in the next level, perhaps, it had code number 2.
00:29:12.840 | And in the very top, perhaps, it had code number 2,000.
00:29:16.440 | So in a way, I'm discretizing the whole codes.
00:29:19.680 | Now what I do is I take up my favorite transform model,
00:29:23.440 | perhaps like a causal autoregressive one.
00:29:26.280 | And I say that, OK, given these codes,
00:29:29.240 | try to predict what codes would come next.
00:29:31.480 | And for sure, transformers can do that.
00:29:34.240 | So I would generate the codes in the future.
00:29:36.840 | Once I've generated the codes in the future,
00:29:39.560 | I can say that, OK, this problem now
00:29:41.360 | is kind of like a text-to-speech problem,
00:29:43.680 | because I have these discrete codes.
00:29:45.560 | Text-to-speech, in a way, is going from discrete letters
00:29:48.680 | to continuous audio.
00:29:50.560 | So I would throw in the fanciest decoder, which was WaveNet.
00:29:53.720 | And I would just decode back from the codes.
00:29:55.240 | And I would get the generated audio.
00:29:58.080 | So this was, in a way, what I described,
00:30:02.360 | that they take up a continuous audio.
00:30:04.760 | They have these compressed codes,
00:30:06.640 | which they encode using a CNN in this case.
00:30:10.720 | The method doesn't matter.
00:30:11.960 | You can throw in the fanciest of embedding or latent
00:30:15.040 | representation on those continuous code.
00:30:18.400 | You generate the patterns, which are like,
00:30:20.120 | what's going to happen next in the future?
00:30:22.000 | And then you decode back using a fancy WaveNet or state-of-the-art
00:30:26.000 | model.
00:30:28.000 | So this was what they were doing for music synthesis.
00:30:32.160 | What we said was, yeah, this is good.
00:30:35.200 | This can generate a good amount of music.
00:30:37.200 | But can these models be used for generating
00:30:44.480 | good representation of the current audio?
00:30:47.840 | And the goal there was, can language models
00:30:51.160 | learn representation, which can just encapsulate whatever we
00:30:55.600 | are giving as an input signal?
00:30:59.160 | So in this case, what we tried after that
00:31:01.160 | was you do exactly similar ideas.
00:31:06.840 | But instead of doing on VQ-VAE end-to-end learned encodings,
00:31:11.960 | we just apply vanilla k-means clustering,
00:31:14.320 | similar to what I described earlier.
00:31:16.880 | We do on spectrogram patches.
00:31:18.200 | So you take up these spectrograms of audio,
00:31:20.960 | and you just divide them into very small chunks,
00:31:23.960 | learn autoencoder encodings for each of those chunks,
00:31:28.440 | run k-means clustering.
00:31:30.320 | In this case, let's say I am learning 16 codes.
00:31:34.120 | Represent the continuous audio in terms of the 16 codes.
00:31:38.360 | Have a transformer which can perhaps predict the next code.
00:31:41.880 | And if I keep on getting better and better at predicting
00:31:45.120 | what's going to happen next, then in this linear layer,
00:31:48.000 | I should be encapsulating what's important
00:31:51.320 | or what's a good summary of what has happened in the past.
00:31:56.640 | So that was our intuition behind trying this.
00:32:01.600 | And as I explained, the number of codes
00:32:03.600 | play a very important role.
00:32:05.560 | You can see here, these are just two piano notes switching
00:32:08.920 | one after the other.
00:32:10.080 | If I just have 16 number of codes,
00:32:12.080 | it just happens to have just a single line of encoding,
00:32:16.560 | a single code assigned to all of this.
00:32:18.640 | Whereas if I'm assigning more codes,
00:32:21.040 | then it becomes a fine-grained prediction
00:32:23.600 | where I'm actually able to get what the individual notes are.
00:32:29.320 | Recently, Facebook also said, OK, they just
00:32:32.880 | had a different name to the whole thing, which
00:32:35.240 | is we can just call this as textless NLP also in the sense
00:32:40.760 | that, OK, you can do NLP without having access to text.
00:32:44.760 | But the idea is very, very similar.
00:32:46.360 | You have an encoder, which is exactly similar to say
00:32:48.600 | what OpenAI was using.
00:32:49.680 | You have a VQ-VAE, Wave2Vec, or whatever you want to do.
00:32:53.320 | You can apply k-means clustering to it.
00:32:55.240 | You apply language models to it.
00:32:57.160 | And instead of a decoder being WaveNet,
00:32:58.880 | they just have a decoder, which is
00:33:00.280 | like a different version of text-to-speech, which
00:33:02.840 | is like Tacotron in this case.
00:33:05.000 | So as you can see, this is all the same wine
00:33:07.160 | in very different bottles.
00:33:08.280 | But the core idea is almost exactly the same.
00:33:12.560 | So this created a huge uproar of this going to change NLP.
00:33:18.120 | But this is very, very similar to what people
00:33:21.560 | have been doing in the past.
00:33:24.560 | So I've already explained what this was.
00:33:29.200 | So in our case, we just try to predict
00:33:32.400 | what's going to happen next given the previous context
00:33:35.360 | and use that representation similar to every single
00:33:40.320 | one-shot learning or zero-shot learning-based method.
00:33:44.600 | I also explain why the number of codes are important.
00:33:47.720 | If you have too few, then you're
00:33:49.240 | just throwing away a lot of information.
00:33:50.920 | If you have too many, then
00:33:55.360 | it is no longer robust to noise.
00:33:59.480 | So this was our setup.
00:34:00.720 | And before I jump in, I should add one of the tweets
00:34:04.080 | which I saw from one of the most prominent researchers
00:34:08.800 | at DeepMind, which is basically like a lot of times
00:34:11.240 | it is very, very easy to bump up numbers.
00:34:13.680 | I can have these details just not present
00:34:17.200 | in my paper, which actually help a lot in terms
00:34:20.280 | of improving the performance.
00:34:22.200 | And sometimes don't take into account
00:34:25.760 | what the actual model is incorporating
00:34:29.160 | or what model is contributing versus what
00:34:32.000 | the actual these tricks for training are incorporating.
00:34:35.160 | So for most of these methods, what we are trying to see
00:34:38.240 | is we try to keep almost exactly the same approach.
00:34:42.000 | No data augmentation, no fancy label smoothing,
00:34:44.560 | or moving average of weights, or decay, or whatever.
00:34:48.400 | You just have similar-based recipes
00:34:51.440 | to see how well we are doing.
00:34:55.520 | For this case, the goal was to see
00:34:57.920 | that how well our models do with respect
00:35:00.400 | to this purely supervised approach
00:35:02.120 | and how well it does with respect
00:35:03.720 | to a similar unsupervised approach.
00:35:06.840 | So in the first case, the model and all of the weights
00:35:09.360 | have access to all of the labels, which is just
00:35:12.320 | shown as VGG supervised, which is basically
00:35:14.840 | you take up an audio understanding data set
00:35:16.840 | and you see how well you're doing on accuracy metrics.
00:35:21.440 | So that was the first one.
00:35:22.560 | In the second one, we applied SimCLR,
00:35:24.760 | which was proposed by Geoff Hinton's group,
00:35:26.160 | in which you can take up these multiple augmentations
00:35:28.440 | of the same input.
00:35:30.440 | You can have patches removed.
00:35:32.000 | You can blur the signal.
00:35:33.000 | You can flip the signal.
00:35:34.800 | You learn an embedding out of the last layer
00:35:36.880 | without access to the labels, and then
00:35:38.720 | just have a linear head to predict what's happening.
00:35:42.000 | By using that, we got a 55% accuracy.
00:35:45.040 | You do the exact same thing with transformers.
00:35:46.960 | You don't have access to labels.
00:35:48.440 | You just train them to predict the next code.
00:35:51.640 | You take the linear layer, apply the same linear head,
00:35:54.400 | and try to predict what's happening inside.
00:35:57.120 | And with that, we got 60% accuracy.
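A hedged sketch of the linear-probe evaluation described here: freeze the self-supervised model, average its representation over time, and train only a linear head on the labels. The pretrained features and labels below are random placeholders, not the actual setup.

```python
import torch
import torch.nn as nn

feat_dim, num_labels = 64, 10
linear_head = nn.Linear(feat_dim, num_labels)
optimizer = torch.optim.Adam(linear_head.parameters(), lr=1e-3)

def probe_step(features, labels):
    # `features` come from the frozen, self-supervised model (no gradient flows back into it)
    with torch.no_grad():
        pooled = features.mean(dim=1)            # average-pool over time
    logits = linear_head(pooled)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                              # updates only the linear head
    optimizer.step()
    return loss.item()

# toy call with random stand-ins for pretrained features and labels
loss = probe_step(torch.randn(8, 100, feat_dim), torch.randint(0, num_labels, (8,)))
```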
00:35:59.440 | So even though the results are not as good,
00:36:01.160 | the fact is that neural networks actually
00:36:04.920 | are very, very good at getting better and better
00:36:09.360 | with throwing off huge amounts of data.
00:36:11.520 | So there's still a 10% gap between purely supervised
00:36:14.720 | and purely unsupervised.
00:36:17.200 | But that's going to improve with throwing a lot of data
00:36:21.160 | to these models, because it doesn't have access
00:36:23.480 | to any label as per se.
00:36:25.240 | So this is a famous paper by Dan Ellis and Nelson Morgan
00:36:28.560 | at Berkeley, in which they actually showed way back
00:36:30.920 | in 1999 as to why size matters for deep neural networks
00:36:37.200 | and also the number of data points which is present.
00:36:41.000 | So as they kept on increasing the size of the data set
00:36:44.400 | and the parameters, they kept on getting lower and lower word
00:36:47.120 | error rates.
00:36:48.120 | And this has been true across any of the data set.
00:36:51.480 | And that's why the whole excitement is about
00:36:53.720 | unsupervised learning.
00:36:56.640 | So this was, in a way, a flavor of how
00:36:58.760 | can we do language modeling and unsupervised learning
00:37:01.320 | on audio for continuous signals.
00:37:05.200 | For the third sub-block, I'll just quickly mention
00:37:09.400 | ideas which are very similar to what you would have seen
00:37:11.800 | in vision transformers, but with the caveat
00:37:15.680 | that how can we use some sort of signal processing
00:37:18.880 | to improve these performance even further.
00:37:22.040 | So the basic approach still remains the same exactly
00:37:24.520 | as what you would have seen in vision transformers.
00:37:28.200 | You have a signal of interest which you want to classify.
00:37:33.080 | Here, they are raw waveform instead of images.
00:37:36.400 | The goal is to predict what's there inside of it.
00:37:41.080 | And also, we don't have any convolutions.
00:37:43.480 | We don't have any other tricks which we were using before.
00:37:46.760 | All we have to do is they can transform as themselves,
00:37:49.760 | solve this particular problem.
00:37:52.880 | So for the data set--
00:37:54.800 | and the whole setup was still the same.
00:37:57.280 | No data augmentation and no other forms of these tricks.
00:38:03.000 | You are given like 40,000 snippets for training
00:38:05.440 | and 10,000 for validation.
00:38:08.040 | Our job is to predict as good as possible
00:38:10.880 | as to what's there in the audio.
00:38:13.480 | This problem is very similar to the sound which you heard
00:38:17.120 | and the video which you saw, that given a spectrogram patch,
00:38:22.080 | you have to predict what's there inside of it.
00:38:24.080 | We kind of do one step further than what's just
00:38:33.000 | like a simple transformer model.
00:38:34.840 | In a sense that we try to see whether some sort of hierarchy
00:38:38.400 | over transformer embeddings would help us in any manner.
00:38:43.960 | So for that, we use wavelet decomposition
00:38:47.200 | on the intermediate transformer embeddings.
00:38:50.720 | So what is a wavelet decomposition?
00:38:55.040 | In very naive terms, it can be like a way
00:38:58.200 | of decomposing the intermediate embeddings
00:39:02.720 | into another intermediate embedding, in a sense
00:39:06.800 | that we are kind of putting these highways of like some
00:39:09.920 | embeddings are moving very slowly
00:39:11.440 | and some embeddings are moving very fast.
00:39:13.640 | And some embeddings are retained exactly
00:39:15.600 | at the rate of what the original signal was.
00:39:19.480 | And why this is important?
00:39:20.640 | Because you can think about that at every intermediate state,
00:39:24.000 | you are in a way learning some sort of hierarchy in the model.
00:39:27.800 | So if I look at what we do with the wavelet decomposition
00:39:35.120 | before and after, let's say you had time across this
00:39:39.200 | and you had the embedding size across this
00:39:41.640 | and this whole patch was your output of, say,
00:39:45.880 | the nth layer of the transformer.
00:39:48.640 | What I say now is, OK, I would just
00:39:52.320 | have a mapping from this to the mapping of my interest
00:39:56.040 | using wavelet decomposition, in which for half of the samples,
00:40:00.280 | I just retain the exact same embedding
00:40:02.160 | as what was learned by the transformer model.
00:40:05.960 | In the next half, I would start combining two at a time.
00:40:09.160 | So in a way, I'm learning this sort
00:40:10.600 | of like a tree structure within a single layer
00:40:13.800 | of the transformer embedding.
00:40:16.360 | And for now, the wavelet or the basis function which I use
00:40:21.000 | is simple averaging.
00:40:22.360 | So let's say from all of the embedding layers in between,
00:40:26.040 | I just need to have one embedding which is not
00:40:31.400 | moving at all, which is just representative of whatever
00:40:33.600 | is there of the whole latent space in that nth layer.
00:40:40.240 | Then in the next layer, I would just use two at a time
00:40:44.160 | and then I would use four at a time
00:40:46.680 | until I reach the exact resolution as what I had.
00:40:50.360 | Doing this operation doesn't add any parameters whatsoever.
00:40:53.560 | You're just defining what your basis function would be
00:40:56.320 | or what your wavelet function would be.
00:40:58.080 | In this case, it is a Haar wavelet.
00:41:00.200 | And I start combining them and I learned a hierarchy
00:41:04.120 | at every single layer of the transformers.
00:41:08.280 | And this improved our performance significantly
00:41:12.560 | as compared to not using them with addition
00:41:15.080 | of no extra parameters.
00:41:17.800 | And I'll come to the results later also.
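A minimal sketch of the parameter-free, multi-scale averaging idea described above, with simple averaging as the Haar-like wavelet: one layer's embeddings are summarized at several rates at once, from full resolution down to a single global average. This is one plausible reading of the scheme in the talk; the exact layout in the paper may differ.

```python
import torch

def multiscale_summary(h):
    """h: (batch, time, dim) output of one transformer layer; time assumed a power of two."""
    batch, time, dim = h.shape
    summaries = [h]                                    # embeddings kept at the full rate
    scale = 2
    while scale <= time:
        # average non-overlapping groups of `scale` steps -> a slower-moving "highway"
        pooled = h.reshape(batch, time // scale, scale, dim).mean(dim=2)
        summaries.append(pooled)
        scale *= 2
    return torch.cat(summaries, dim=1)                 # concatenate the highways along time

h = torch.randn(2, 64, 32)                             # e.g. one layer's embeddings
print(multiscale_summary(h).shape)                     # (2, 64 + 32 + ... + 1, 32) = (2, 127, 32)
```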
00:41:20.840 | So this is how the whole approach looks like.
00:41:23.840 | You have a front end.
00:41:25.680 | The front end is basically a single layer
00:41:28.480 | of 2,000 neurons followed by a dense layer of 64 neurons,
00:41:33.240 | which is just to make sure to conform it
00:41:37.360 | to the intermediate transform embeddings.
00:41:39.080 | Let's say if for the transformers,
00:41:40.920 | I define the embedding size to be 64,
00:41:43.200 | then that's the dimension which I'm mapping them to.
00:41:47.200 | So I take a broad waveform.
00:41:48.920 | I patch it in very small patches similar to how
00:41:51.960 | you do in vision transformers.
00:41:54.520 | I would just have a single layer of 2,000 neurons
00:41:56.880 | followed by a dense layer of 64 neurons
00:42:00.040 | with the hope that the first layer is learning
00:42:02.280 | like a Fourier basis function, which
00:42:04.760 | should be adaptable according to what I'm learning.
00:42:08.080 | After that, I keep on doing this over and over again.
00:42:11.640 | I don't have a classification head or anything like that.
00:42:15.520 | I keep on adding multiple stacks of transformers after that.
00:42:20.120 | And then I have two approaches of what I can
00:42:26.120 | do in terms of adaptation.
00:42:28.920 | I can do average pooling across time
00:42:31.040 | of these intermediate embeddings,
00:42:32.800 | because the idea is very similar to what
00:42:35.160 | we do in classical vision, that each of the embeddings
00:42:38.080 | are looking at much, much broader output
00:42:42.440 | in the subsequent layers.
00:42:44.160 | Or I could do a wavelet decomposition.
00:42:47.160 | So what I do is that I take all of these embeddings
00:42:50.160 | and I define these highways.
00:42:51.560 | So some of the embeddings move fast.
00:42:53.200 | Some of them are moving very slow.
00:42:54.800 | And some are retained at the exact same resolution
00:42:57.280 | as what the transformer is learning.
00:42:59.800 | And then I keep doing this over and over again.
00:43:02.120 | I have a dense layer.
00:43:03.680 | I have my softmax or sigmoid, whatever
00:43:06.560 | is my classification head.
00:43:08.680 | So this is kind of what the approach looks like.
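A hedged PyTorch sketch of this overall recipe: the raw waveform is cut into small patches (as in vision transformers), each patch passes through a wide dense layer (~2,000 units) and a 64-dim projection, the transformer stack runs on the resulting sequence, and a pooled representation feeds the classification head. All sizes and the class count are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

patch_len, d_model, num_classes = 400, 64, 200   # e.g. 25 ms patches at 16 kHz; class count illustrative
front_end = nn.Sequential(
    nn.Linear(patch_len, 2048), nn.ReLU(),       # hoped to learn Fourier-like basis functions
    nn.Linear(2048, d_model),                    # project down to the transformer width
)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=3
)
classifier = nn.Linear(d_model, num_classes)

wave = torch.randn(2, 16000)                     # one second of raw audio
patches = wave.unfold(1, patch_len, patch_len)   # (batch, num_patches, patch_len)
h = encoder(front_end(patches))                  # (batch, num_patches, d_model)
logits = classifier(h.mean(dim=1))               # average-pool over time, then classify
```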
00:43:12.400 | We compare it with all of the traditional vision-based
00:43:17.720 | architecture.
00:43:18.520 | So the vision-based models have been very good.
00:43:21.040 | And the performance have been similar in understanding
00:43:24.360 | audio also.
00:43:25.560 | So we compare all of those models
00:43:27.600 | in terms of mean average precision.
00:43:29.680 | And we see that even the tiniest models of transformers
00:43:32.320 | were just surpassing all of the state-of-the-art CNN
00:43:34.680 | models, which was a very good sign.
00:43:37.880 | Then we started to bump up.
00:43:39.800 | The larger model should keep on improving the performance.
00:43:42.560 | And with the multi-scale models, as well as
00:43:46.000 | with the pooling layers, they improve the performance
00:43:48.760 | even further, which was kind of very surprising to us
00:43:52.520 | because the number of parameters are very small.
00:43:54.920 | These are very tiny architectures.
00:43:56.480 | Yet they are surpassing things like even
00:43:58.360 | DenseNet, which are huge models with a lot of millions
00:44:01.200 | of parameters.
00:44:03.920 | So after that, we said--
00:44:05.480 | and I'm going to conclude quickly.
00:44:08.160 | After that, we said that, OK, this is looking pretty cool.
00:44:12.200 | What actually is the transformer or the first-layer learning?
00:44:17.480 | So in order to make this plot, what we said was, OK,
00:44:24.240 | if you were to take a classic Fourier transform,
00:44:27.600 | then one axis is the number of filters,
00:44:34.520 | and the other axis is the frequency.
00:44:36.720 | Then in a way, it should be connecting all of the points
00:44:41.640 | in a linear line.
00:44:43.400 | And this is akin to the number of points in the FFT.
00:44:45.920 | So how many points I'm defining here?
00:44:48.240 | If I'm defining 2,000 points here,
00:44:50.680 | then I would have 2,048 sinusoidal basis functions,
00:44:56.040 | which are going from lower frequency
00:44:57.720 | to the most highest frequency.
00:45:00.120 | We said, OK, we'll do the exact same thing,
00:45:02.080 | but now with filters.
00:45:03.640 | So we have a frequency along y-axis and the number
00:45:07.520 | of points in my x-axis.
00:45:09.360 | And if it was a classic Fourier transform,
00:45:11.280 | then it would be connecting right as a linear line.
00:45:15.560 | But what we did was we take up the front end, which
00:45:19.600 | is learned by transformer, take its Fourier transform,
00:45:22.800 | sort according to its center frequency
00:45:24.960 | as to what frequency it is activating the most,
00:45:27.680 | and then keep on stacking them.
00:45:29.960 | When we did this for two problems,
00:45:31.960 | we saw that we are learning a different time frequency
00:45:35.720 | representation, which is specific to a particular
00:45:38.080 | problem.
00:45:38.600 | So if I'm trying to understand what's
00:45:40.400 | there in the content of the audio,
00:45:42.480 | I learn a representation which is
00:45:43.880 | very different than Fourier transform,
00:45:45.160 | which would have been a straight line, which
00:45:47.040 | is like a curved exponential line like this.
00:45:51.640 | And if I do a polyphonic pitch estimation,
00:45:54.440 | I learn a very different front end,
00:45:57.320 | which is adapting to that particular problem.
00:46:00.080 | So this was very exciting to us because making computers
00:46:05.520 | hear in a way in which they are adapting their ears
00:46:07.880 | according to a particular problem is a very cool idea.
00:46:12.400 | Second thing is we actually saw each of the filters
00:46:15.200 | as to what they were doing.
00:46:17.320 | And these are basically just single slices like this.
00:46:21.240 | So this is what we would have learned as a front end neuron.
00:46:25.280 | So we take up each of the neurons and we just plot them.
00:46:27.880 | And for plotting this, we basically
00:46:29.800 | take a Fourier transform and then
00:46:32.040 | sort them according to where the center frequency is.
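A small sketch of that visualization, assuming a front-end weight matrix `W` with one learned filter per row (the matrix here is a random placeholder): take each filter's magnitude FFT, find the frequency bin it responds to most, and sort the filters by that center frequency before plotting.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 400))                  # placeholder for the learned front-end filters

spectra = np.abs(np.fft.rfft(W, axis=1))          # magnitude response of every filter
center_bins = spectra.argmax(axis=1)              # dominant frequency bin per filter
order = np.argsort(center_bins)
sorted_spectra = spectra[order]                   # rows now ordered from low to high frequency
# plt.imshow(sorted_spectra.T, origin="lower", aspect="auto") would reproduce the
# frequency-vs-filter-index plots; a plain Fourier basis would give a straight diagonal line.
```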
00:46:35.400 | When we just saw the neurons as to what
00:46:37.080 | they were learning in the front end,
00:46:39.040 | we saw that it is learning properties
00:46:41.440 | which are very, very closely matching
00:46:45.200 | with the traditional signal processing.
00:46:46.840 | So you would have something like an onset detector
00:46:48.960 | learned right here.
00:46:50.760 | You're learning windowing functions.
00:46:52.280 | In a way, it is learning to have a kernel which
00:46:55.400 | is best for a time frequency representation, what people
00:46:58.320 | have been using in signal processing, which
00:47:00.120 | is like a Hamming or a Hamming window.
00:47:03.000 | We are learning these pure sinusoids
00:47:04.720 | which are responsible for activating
00:47:07.400 | a particular frequency.
00:47:08.920 | So you can see the richness as compared
00:47:10.560 | to having a fixed, purely sinusoidal basis
00:47:14.000 | function right here.
00:47:16.440 | So this was what we had done.
00:47:20.040 | And then to share the final thoughts,
00:47:23.120 | I'll conclude by saying that, OK, transformers
00:47:25.480 | are proving to be a major advancement in AI
00:47:27.360 | research across the fields.
00:47:30.000 | And it seems like they're solving everything for now.
00:47:34.240 | And hopefully, this is not the end.
00:47:36.040 | And we should keep an eye out on something
00:47:38.440 | which would change and have an impact which
00:47:41.080 | is more than what transformers have put.
00:47:44.360 | And who knows what's going to come next?
00:47:47.600 | Yeah, so by that, I'll just conclude.
00:47:49.680 | And I'll be happy to take questions.
00:47:53.120 | Thank you, Prateek.
00:47:54.080 | That was a really good talk.
00:47:55.800 | And you provided some really good insights
00:47:58.720 | about how transformers work for the audio case.
00:48:02.360 | And yeah, thank you for the talk.
00:48:04.880 | And now I would invite questions from the class students.
00:48:10.200 | Let me just stop the recording.
00:48:13.280 | [BLANK_AUDIO]