00:00:04.920 | Thanks for inviting me for the talk today.
00:00:07.480 | And I'll be just talking about transformers
00:00:10.320 | for music and audio, which is very
00:00:12.400 | different than what all of us were doing in this past course.
00:00:16.880 | I'm also the only speaker from Stanford,
00:00:18.840 | so I have to do a good job.
00:00:20.200 | So you'll see very good slides, because I'm
00:00:23.280 | representing the university in some sense.
00:00:26.520 | So yeah, so the flow of the talk for today
00:00:28.240 | is basically like I'll be throwing a lot of stuff.
00:00:30.960 | It's kind of like a buffet style,
00:00:32.440 | and then you feel free to like or dislike whatever you want.
00:00:36.640 | And I'll be talking mostly about three papers of what
00:00:39.560 | I've been working on.
00:00:42.120 | I'll start with introducing what transformers
00:00:44.600 | are from a different perspective, what
00:00:46.480 | audio representations are.
00:00:49.240 | Talk about a generative model for audio,
00:00:52.200 | which is just doing language modeling on sample level.
00:00:56.080 | Then I'll talk about how can one do like language
00:00:59.080 | modeling for speech and audio, which is different than what
00:01:02.480 | people do for text.
00:01:04.280 | What are the current trends in the literature?
00:01:07.360 | Finally, I'll briefly mention similar stuff
00:01:11.000 | as to what was happening in computer vision
00:01:13.080 | with regard to vision transformers.
00:01:14.920 | How can we adapt similar ideas for audio transformers?
00:01:17.920 | And throw in a bit of signal processing
00:01:20.040 | to improve the performance.
00:01:22.120 | Having told you that the talk is about 35 to 40 minutes
00:01:24.840 | with about 15 minutes of Q&A, I should also
00:01:27.600 | say that all of the opinions are mine,
00:01:29.600 | and Stanford or any other professor
00:01:31.160 | is not responsible for any of the mistakes I make.
00:01:35.880 | So transformers have kind of revolutionized the way
00:01:40.240 | everyone was approaching deep learning.
00:01:42.960 | Before that, it was all about CNNs.
00:01:45.240 | And mostly, all of these prominent models
00:01:48.360 | have been coming in waves.
00:01:49.440 | So there was a time when everyone was just
00:01:51.200 | applying CNNs.
00:01:53.360 | Then came a time where people started
00:01:55.600 | adapting CNNs in some sort of dilated convolutions.
00:01:58.480 | And slowly, the recurrent networks
00:02:00.400 | were getting out of fashion.
00:02:02.480 | Now, it seems like transformers are in fashion all the time.
00:02:05.960 | So it seems to be solving almost every single problem which
00:02:08.840 | is being thrown at them.
00:02:12.640 | So what's special about them?
00:02:15.000 | One of the facts which struck me was their simplicity, which
00:02:19.160 | is, if you think about it, this--
00:02:23.680 | and it has been hugely popular also.
00:02:25.720 | So it was just released in 2018.
00:02:28.640 | And within three years, it has about 30,000 citations.
00:02:31.280 | And it is kind of solving every single problem
00:02:33.200 | in every single domain.
00:02:35.400 | It has its limitations, though, also.
00:02:38.560 | But if you think about it, in a way,
00:02:40.720 | transformers are basically a way of just cascading self-attention
00:02:45.800 | with feature learning.
00:02:46.960 | And if you keep on doing it over and over again,
00:02:49.280 | then the model, in a way, learns which parts of the input
00:02:52.160 | are important and keep on transforming them,
00:02:55.120 | removing the contents which are not important,
00:02:57.360 | and just have the limited information which is just
00:03:00.480 | responsible for a particular task.
00:03:04.120 | And it has been very, very difficult to keep up
00:03:06.160 | with the literature.
00:03:09.000 | I have put it as a joke here.
00:03:10.760 | But then even Twitter's recommendation engine
00:03:13.160 | was kind of just getting out of--
00:03:16.120 | it was going haywire as to why Chris Manning is just
00:03:19.360 | searching over transformers?
00:03:21.480 | And that was way back in 2020.
00:03:24.480 | So it has been difficult for researchers
00:03:26.440 | also to keep up with the pace of what's going on.
00:03:29.800 | Just before transformers, all of the NLP community
00:03:33.000 | was just going gaga about bidirectional LSTMs
00:03:35.760 | with attention.
00:03:36.760 | So every single paper before 2017
00:03:38.840 | was just like you have encoder LSTM layers.
00:03:42.320 | You keep on adding multiple layers.
00:03:46.000 | And then after that, you have an attention mechanism
00:03:48.440 | which just learns what's important
00:03:50.600 | and then just keeps on decoding sequentially one at a time.
00:03:54.760 | But this was not kind of like an ideal way to do it.
00:03:58.040 | Because what turns out is when we start throwing longer
00:04:01.560 | sequences, the connections are no longer
00:04:07.120 | storing the gradient updates in a way it should be doing.
00:04:09.840 | So what the researchers from Google said,
00:04:13.360 | instead of having just an attention
00:04:15.040 | layer at the very last encoding, we
00:04:19.240 | would just have these attention mechanisms
00:04:21.280 | at every single layer, which in a way
00:04:23.640 | would just learn what's important
00:04:25.280 | for a particular problem at that particular layer.
00:04:28.240 | And we keep on doing it over and over again.
00:04:31.920 | So then the whole idea of transformers and attention
00:04:35.920 | mechanism cascaded one after the other came.
00:04:39.520 | And I'll not go into the details,
00:04:40.880 | because this is the last class of the course.
00:04:43.400 | But then usual tricks do help across the neural net
00:04:46.600 | literature, which is like having multi-head attention,
00:04:50.680 | having skip connections and layer norm.
00:04:52.280 | So all of these things, they are not only
00:04:54.440 | like giving gains for transformers themselves,
00:04:57.800 | but they can be just applied to any single other architecture
00:05:01.520 | also.
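A minimal sketch of the skip-connection-plus-layer-norm pattern mentioned above, in plain Python. It assumes the post-norm arrangement LayerNorm(x + sublayer(x)) and leaves out the learnable gain and bias; `toy_sublayer` is a hypothetical stand-in for an attention or feed-forward sublayer, not something from the talk.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Skip connection followed by layer norm: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_sublayer(x):
    # Hypothetical sublayer used only for illustration.
    return [0.5 * v for v in x]

out = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

Because the block only wraps an arbitrary `sublayer`, the same wrapper applies to convolutional or recurrent layers as well, which is the point made in the talk.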
00:05:03.440 | The other thing which is helping this research
00:05:05.960 | is basically the compute power is getting better and better.
00:05:09.360 | So all of these big companies are just
00:05:12.640 | throwing massive amounts of computing resources
00:05:14.880 | at solving very, very simple and trivial tasks.
00:05:18.920 | The top of the hill being the switch transformer, which
00:05:21.560 | was discussed in the course also.
00:05:23.000 | But one other thing which I think
00:05:28.240 | started all of this trend was ELMo,
00:05:30.080 | which was just learning these contextualized representations
00:05:33.200 | for natural language processing.
00:05:34.920 | And that model right here was perhaps
00:05:37.400 | one of the first kind of like model 0.0 or something,
00:05:45.640 | or 0.1 in terms of bringing and ushering
00:05:48.320 | in the whole revolution.
00:05:50.760 | You can see how similar these kinds of models
00:05:53.520 | look.
00:05:54.560 | BERT was basically inspired heavily
00:05:57.280 | from ELMo, in which they just replaced
00:06:00.200 | some of the LSTM layers with transformer modules.
00:06:03.800 | So a point to note also is irrespective
00:06:08.360 | of natural language processing or other domain,
00:06:10.760 | these can be adopted in a variety of domains.
00:06:13.000 | And for today's talk, I'll be just adapting them to audio.
00:06:17.720 | So I'll basically start with introducing people to
00:06:21.240 | what audio representations are, and just
00:06:23.400 | for the sake of completeness, talk about spectrograms.
00:06:28.240 | So you can take any time domain signal,
00:06:31.400 | and you can decompose that signal
00:06:34.560 | into a variety of basis functions.
00:06:39.600 | And if you take up a Fourier transform,
00:06:41.680 | you're kind of like decomposing the actual time domain
00:06:46.040 | signal into its sinusoidal basis components.
00:06:49.400 | So if you have like a waveform here
00:06:51.880 | like this, which is a sum of three pure sinusoids,
00:06:55.240 | then their sum basically is this.
00:06:57.440 | And you can see that when you take a Fourier transform
00:07:00.120 | and its magnitude, you kind of have
00:07:03.280 | the strength of the individual components shown here.
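The decomposition described above can be sketched in a few lines of plain Python. The naive DFT below is dependency-free and only meant for tiny inputs (in practice an FFT library would be used); the three frequencies and amplitudes are made-up illustration values, not the ones on the slide.

```python
import cmath
import math

def dft_magnitude(signal):
    """Naive O(N^2) discrete Fourier transform; returns |X[k]| / N for bins 0..N/2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc) / n)
    return mags

# Hypothetical waveform: the sum of three pure sinusoids (frequency bin, amplitude).
N = 64
components = [(3, 1.0), (7, 0.5), (12, 0.25)]
signal = [sum(a * math.sin(2 * math.pi * f * t / N) for f, a in components)
          for t in range(N)]

mags = dft_magnitude(signal)
# Bins whose magnitude stands out: one per sinusoid in the sum.
peaks = [k for k, m in enumerate(mags) if m > 0.05]
```

Each peak sits at the bin of one of the sinusoids, and its height is proportional to that sinusoid's amplitude, which is what the bar plot on the slide shows.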
00:07:08.800 | So you can take up another waveform,
00:07:10.960 | let's say a square wave, and what you have
00:07:13.720 | is basically a much richer sinusoidal decomposition
00:07:17.680 | because it is kind of a discontinuous signal.
00:07:19.960 | So you need like many more sinusoids
00:07:21.640 | to represent that particular signal as close
00:07:24.400 | to the actual signal as possible.
00:07:26.960 | And here also you can see that, OK, if this was a square wave,
00:07:29.880 | then it is actually made up of a lot of sinusoids
00:07:35.960 | where each of the bar here represents
00:07:39.200 | the strength of the particular sinusoid.
00:07:42.280 | From an optimization perspective,
00:07:44.120 | I mean, this right away is suboptimal, right?
00:07:47.040 | Because you're kind of fixing up the number of sinusoids
00:07:51.080 | you're using for representing a square wave.
00:07:53.240 | I would have rather used a basis function
00:07:56.080 | which was a square wave itself than a sinusoidal signal.
00:08:01.360 | The second thing is even if you are taking a sinusoidal signal,
00:08:05.320 | we kind of are just putting them in an equidistant space.
00:08:09.440 | So you're kind of dividing the whole frequency
00:08:11.760 | axis into equidistant bins.
00:08:14.440 | And each of the bins is responsible
00:08:16.040 | for a particular sinusoid.
00:08:19.400 | So that is like a traditional Fourier representation
00:08:22.360 | for representing any signal.
00:08:25.840 | What we do for--
00:08:28.840 | what are spectrograms?
00:08:30.480 | But in reality, all of these signals are discontinuous.
00:08:34.880 | All of these signals vary quite a bit, right?
00:08:37.360 | So you can have a signal while I'm
00:08:39.960 | speaking which is like a square wave for a certain period
00:08:42.680 | of time, and then it gets sinusoidal,
00:08:44.720 | and then it becomes something else.
00:08:46.680 | So what we really need is in a way
00:08:48.880 | to kind of take batches of input signal
00:08:52.840 | and take Fourier transform of these individual batches.
00:08:56.320 | I'm deliberately using the word batches,
00:08:58.000 | but you can-- in traditional terms,
00:08:59.520 | you are windowing the signal.
00:09:02.240 | So right here, you can see that you have a continuous signal.
00:09:05.120 | You keep on windowing it.
00:09:07.320 | You apply the Fourier transform, and what you get
00:09:10.280 | is basically like a spectrogram representation of the signal.
00:09:14.320 | So right here, what you're seeing basically
00:09:16.360 | is for each of the slices, the signal kind of
00:09:19.520 | looks like this after taking the Fourier
00:09:21.560 | transform with the waveform which is there below.
00:09:24.840 | And what you do is for spectrogram representation,
00:09:26.920 | you keep on stacking these Fourier transform
00:09:29.160 | slices, the magnitudes of the Fourier transform slices.
00:09:31.760 | And in this way, you kind of get like a 2D representation
00:09:34.760 | of audio signals.
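The window-then-transform-then-stack recipe can be sketched as follows. This is a toy version in plain Python under stated assumptions: a Hann window, a naive DFT per slice, and made-up frame and hop sizes; real pipelines would use an FFT-based STFT routine.

```python
import cmath
import math

def stft_magnitude(signal, frame_len, hop):
    """Window the signal, take the DFT magnitude of each slice, and stack the
    slices into a grid indexed as frames[time_index][frequency_bin]."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len) for i in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = [
            abs(sum(chunk[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                    for t in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames

# Toy signal whose content changes over time: a low tone, then a higher tone.
frame_len, hop = 32, 16
low = [math.sin(2 * math.pi * 2 * t / frame_len) for t in range(128)]
high = [math.sin(2 * math.pi * 8 * t / frame_len) for t in range(128)]
spec = stft_magnitude(low + high, frame_len, hop)

# The strongest frequency bin in each time slice tracks the change of tone.
loudest_bin = [max(range(len(f)), key=f.__getitem__) for f in spec]
```

A single Fourier transform over the whole signal would show both tones but not when each one occurs; the stacked slices keep that time information.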
00:09:36.720 | And if you're coming from a vision background,
00:09:38.640 | it is basically all of the things
00:09:40.160 | which you're doing in vision would just work well
00:09:42.240 | if you just apply them to these 2D spectral representations.
00:09:47.240 | I'll quickly play how these spectrograms look
00:09:49.960 | like for a wide array of common sounds.
00:10:35.400 | So you can see like for spectrograms,
00:10:38.120 | you have kind of like a time axis on your x-axis.
00:10:40.880 | And then you have a frequency axis on y-axis.
00:10:43.680 | And then for whatever is your signal of interest,
00:10:46.120 | you're basically like putting these slices together.
00:10:48.880 | And different sounds give you like different spectral
00:10:50.960 | representations.
00:10:51.760 | So it's kind of a vision problem just
00:10:54.080 | in this sort of like Fourier space.
00:10:58.040 | So there can be like different kinds of representations also.
00:11:01.200 | So one, you could just take these slices of Fourier
00:11:06.320 | transform and then do like a linear mapping to them
00:11:09.320 | so that you're kind of in a way making these as
00:11:13.400 | close to how humans hear.
00:11:14.680 | So you can have like log of the frequency on the y-axis
00:11:17.960 | instead of common frequency.
00:11:19.240 | And then you get like a constant Q-like representation.
00:11:22.040 | The advantage of this being like you
00:11:23.640 | can see that for different frequencies,
00:11:25.880 | the spacing between the harmonics kind of remains same.
00:11:28.960 | So if you're like training convolutional filters,
00:11:31.040 | then that's of a huge advantage because the signal,
00:11:33.680 | like one component of the invariance is gone.
00:11:35.800 | And you can just learn these filters
00:11:37.280 | which are catching onto these constant templates of Fourier
00:11:41.880 | slices.
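A short derivation of why the harmonic spacing stays fixed on a log-frequency axis. The symbols are notation assumed for this note: $f_0$ is the fundamental frequency and $k$ the harmonic number.

```latex
\log(k f_0) - \log(f_0) = \log k
```

The distance from the fundamental to the $k$-th harmonic is $\log k$ whatever $f_0$ is, so a change in pitch shifts the whole harmonic pattern along the axis without stretching it. On a linear axis the same distance is $(k - 1) f_0$, which grows with pitch.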
00:11:43.200 | You can have mel filter bank coefficients,
00:11:45.080 | or you can have like the raw waveform also.
00:11:48.960 | For raw waveforms, basically there
00:11:50.520 | are two things which we have to keep in mind.
00:11:53.200 | One is the sampling rate.
00:11:54.320 | So we kind of like take the continuous signal
00:11:57.200 | and then we discretize the continuous signal.
00:11:59.400 | So one parameter is like how fast we are sampling
00:12:03.360 | the continuous signal.
00:12:04.280 | So that's typically on the order of like 16,000 or 8,000
00:12:07.720 | times a second if you're on telephonic speech.
00:12:10.440 | The other thing which we also have is how many levels
00:12:13.720 | we are dividing the vertical axis into.
00:12:15.480 | So in this case, you can see that each of the dots
00:12:17.760 | is basically one level.
00:12:19.400 | And typically, people use 8-bit quantizers or 16-bit
00:12:22.280 | quantizers.
00:12:23.280 | So in a way, you can think about that for every one
00:12:25.400 | second of audio which we would hear,
00:12:27.480 | you would have like 16,000 samples.
00:12:29.720 | And then each of the 16,000 samples
00:12:32.720 | is allowed to take one of the levels between 0 and 255.
00:12:36.800 | And that's like if I can take the problem of continuous audio
00:12:41.560 | and just have it in terms of this sort of discrete space,
00:12:45.320 | then basically I'm just going to the territory
00:12:47.520 | of doing language modeling.
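A minimal sketch of turning one second of audio into discrete tokens. It assumes a plain uniform 8-bit quantizer and a made-up 440 Hz test tone; the WaveNet paper applies mu-law companding before quantizing, which is not shown here.

```python
import math

def quantize_8bit(samples):
    """Map floats in [-1, 1] to integer levels 0..255 with a uniform quantizer."""
    levels = []
    for x in samples:
        x = max(-1.0, min(1.0, x))                     # clip to the valid range
        levels.append(min(255, int((x + 1.0) / 2.0 * 256)))
    return levels

sample_rate = 16000                                     # samples per second
one_second = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
tokens = quantize_8bit(one_second)                      # 16,000 integers in 0..255
```

After this step the audio is a sequence of symbols from a small vocabulary, which is the same shape of data a text language model consumes.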
00:12:50.600 | So the first paper I'll discuss is how
00:12:53.480 | can we do generative modeling for raw audio, which
00:12:57.880 | is similar to WaveNets using transformers.
00:13:01.480 | I'll be putting QR codes if you like the stuff I'm doing.
00:13:05.560 | And if you think that this is relevant to you,
00:13:07.800 | please cite or please have a look
00:13:10.080 | in terms of the QR codes.
00:13:13.000 | So yeah, so I'll start with the first subtopic
00:13:16.400 | of today's talk, which is like what are WaveNets
00:13:22.640 | and how do we do this generative modeling over raw audio?
00:13:26.960 | So in a single word, you can think
00:13:28.360 | about this as doing language modeling over these 255
00:13:31.360 | states of audio.
00:13:33.120 | So you can throw in your favorite transformer model
00:13:35.760 | like transformer XL or GPT or whatever you want to call it.
00:13:41.520 | And just treat the problem as if you
00:13:43.160 | are trying to predict one of the levels out of 255.
00:13:45.800 | And you have to predict the next level given a certain context.
00:13:49.480 | That's what WaveNet was doing.
00:13:50.760 | So the way you are modeling the probability distribution
00:13:55.320 | of a continuous space is basically
00:13:57.320 | you're trying to predict what's the probability
00:13:59.360 | of the next sample given some past context.
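Written out, this is the standard autoregressive factorization. The symbols are notation assumed for this note: $x_t$ is the quantized sample at time step $t$ and $T$ is the number of samples.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}),
\qquad x_t \in \{0, 1, \dots, 255\}
```

Each factor is a categorical distribution over the quantization levels, so training reduces to next-symbol classification, as in a text language model.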
00:14:02.560 | And WaveNet has been hugely popular
00:14:04.720 | because it has over 3,000 citations
00:14:07.080 | and it has been a core building block for almost all speech
00:14:11.040 | and audio related problems.
00:14:13.000 | You can think about speech to text, text to speech synthesis,
00:14:16.600 | instrument conversion, packet loss
00:14:19.040 | concealment over the internet, speech denoising.
00:14:21.680 | So wherever there's some sort of element of modifying audio,
00:14:25.560 | people have been using WaveNet as a core building block.
00:14:30.440 | And raw waveform synthesis has been difficult
00:14:32.880 | because just the magnitude of the problem,
00:14:35.920 | if I'm just trying to synthesize 10 seconds of audio,
00:14:38.880 | it would just amount to me having a probability
00:14:41.680 | distribution over 160,000 samples.
00:14:45.760 | And that itself is tough because our ears are very, very
00:14:49.120 | sensitive to subtle changes.
00:14:51.000 | If I'm off by one pixel in an image,
00:14:55.600 | my eyes would not be as susceptible to noticing
00:14:59.200 | that effect versus if I'm off by, say, a few samples
00:15:04.240 | in an audio, it would just catch our ears pretty quickly.
00:15:08.600 | People have been trying raw audio synthesis a lot
00:15:10.960 | in the past.
00:15:12.440 | And before all of the WaveNet and transformer-based
00:15:15.720 | approaches, WaveRNNs and SampleRNNs
00:15:20.280 | were kind of like state-of-the-art models.
00:15:25.600 | On the right, I've shown a SampleRNN model, which
00:15:28.600 | kind of models the probability distribution of what's
00:15:33.280 | going to come next given the past at multiple levels.
00:15:36.400 | And this was work done by Yoshua Bengio at Mila.
00:15:40.200 | But you can closely see, if you just
00:15:42.560 | see this architecture versus a transformer architecture,
00:15:45.520 | in a way, these are starting to get very, very similar.
00:15:48.680 | Because what you're trying to do is
00:15:50.200 | that for the probability distribution here,
00:15:52.320 | you're trying to see a lot of local substructures.
00:15:56.040 | And then you keep on doing it over and over again.
00:15:58.120 | And you can draw parallels, like attention mechanism
00:16:00.840 | should also kind of be doing the same thing.
00:16:04.000 | So this was kind of like the literature in the past.
00:16:09.280 | What we tried to do was we just had the WaveNet model.
00:16:13.120 | And we tried to see whether transformers can beat them.
00:16:15.800 | And our intuition was it should be able to beat them
00:16:18.160 | because they are successful all over the other domains,
00:16:23.000 | like in language modeling.
00:16:24.120 | So it should do that for raw waveforms also.
00:16:28.720 | We also tried to see whether we can circumvent the order
00:16:31.400 | n squared constraint by conditioning of the context
00:16:35.560 | itself.
00:16:37.160 | And we did not go for specific applications.
00:16:40.560 | And we just said, OK, just in terms like modeling behavior,
00:16:43.000 | how will they do?
00:16:45.440 | So the data set for this was just
00:16:47.200 | like real-world kind of recording.
00:16:48.760 | So actual sound should not matter
00:16:53.440 | because the model is agnostic to what it is being thrown in.
00:16:57.880 | And the setup was exactly the same,
00:16:59.520 | like you are giving a certain context.
00:17:01.360 | And I have to predict the next sample.
00:17:03.600 | You do the same thing with WaveNets.
00:17:05.520 | You do the exact same thing with transformer-based,
00:17:09.240 | like GPT kind of model and see how well they do.
00:17:14.000 | I'll briefly chat about what WaveNet models are.
00:17:18.040 | So WaveNet was kind of like a convolution-based model, which
00:17:21.360 | was getting rid of all of the vanishing gradient problem
00:17:25.360 | by just treating a sequential problem
00:17:28.680 | as being learned by a convolutional model.
00:17:31.120 | So what they did was basically have this sort
00:17:33.880 | of dilation layers, or convolution with dilations,
00:17:38.320 | which is basically I kind of skip in every subsequent layer
00:17:42.000 | by one sample.
00:17:43.240 | So you can see if I have a dilation factor of 2
00:17:47.080 | with a kernel size of 2, I would get this kind of a topology
00:17:50.400 | where my convolution filters in the very first layer
00:17:52.720 | are just combining the first two samples.
00:17:55.120 | Then I skip by one in the next layer.
00:17:57.520 | And then I skip by three, which is
00:17:59.360 | like I look at the fourth one in the next layer and so on.
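The dilation pattern can be sketched in plain Python. The assumptions here are a kernel size of 2, dilations doubling per layer (1, 2, 4, 8), and all filter weights set to 1 so that the reach of the stack is easy to see; none of these weights come from a trained model.

```python
def dilated_causal_conv(x, weights, dilation):
    """Kernel-size-2 causal convolution:
    y[t] = w_past * x[t - dilation] + w_now * x[t], with zeros before the start."""
    w_past, w_now = weights
    return [
        w_past * (x[t - dilation] if t - dilation >= 0 else 0.0) + w_now * x[t]
        for t in range(len(x))
    ]

def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [1, 2, 4, 8]

# Push a single impulse through the stack and count how far it spreads.
x = [0.0] * 16
x[0] = 1.0
for d in dilations:
    x = dilated_causal_conv(x, (1.0, 1.0), d)
reach = sum(1 for v in x if v != 0.0)
```

Four layers already cover 16 samples, so the context grows exponentially with depth, while every connection is still fixed in advance by the chosen dilations.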
00:18:03.400 | The loss is still the same.
00:18:04.600 | So I have this network.
00:18:06.640 | I learn a latent space.
00:18:08.000 | And then I have a categorical cross-entropy loss,
00:18:11.200 | which is basically I have to predict the next sample given
00:18:13.960 | the previous one.
00:18:16.640 | And I just do the exact same thing with transformers also.
00:18:20.800 | But then I have to make sure that I
00:18:22.760 | do it in a causal manner.
00:18:24.120 | So I have something which is very similar to GPT,
00:18:26.840 | in which I have causal masks in my attention mechanism.
00:18:30.320 | And I keep doing it over and over again.
00:18:32.360 | So you have self-attention.
00:18:35.520 | After that, you have feedforward layers.
00:18:37.400 | You just have a stack of these transformer blocks
00:18:40.400 | and see how they do.
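A minimal sketch of causally masked self-attention in plain Python. It is single-head and, as a simplifying assumption, uses the inputs directly as queries, keys, and values instead of learned projections.

```python
import math

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention in which position t attends only to
    positions 0..t, so no information flows in from future samples."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Scores are computed only for the current and earlier positions.
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Future positions receive exactly zero weight (the causal mask).
        weights = [e / total for e in exps] + [0.0] * (len(keys) - t - 1)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

The weight matrix is lower triangular, which is what makes next-sample prediction valid during training: the prediction at step t never sees samples after t.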
00:18:43.320 | So I said intuitively it should work.
00:18:45.320 | So it should be doing better than our base WaveNet models.
00:18:52.360 | Because if you look at the topology,
00:18:55.120 | we are kind of defining a topology on our own, right?
00:18:57.640 | So what if the current prediction at, say,
00:19:01.520 | layer one were to depend on very way back sample, say,
00:19:07.760 | instead of the second sample, the 10th sample?
00:19:09.640 | So we are kind of ignoring all of that topology, which
00:19:12.800 | would have been important for prediction
00:19:14.480 | of this particular task.
00:19:16.040 | Whereas transformers with the self-attention mechanism
00:19:19.640 | can just learn, like, OK, which part of the samples
00:19:22.400 | are important and which are not.
00:19:23.960 | And you can keep on doing it iteratively.
00:19:26.760 | So it made sense to us that, OK, transformer layer
00:19:30.880 | should be doing way better than WaveNet models.
00:19:34.880 | The second thing which we came across was, OK,
00:19:37.680 | we cannot have a lot of context.
00:19:40.640 | For example, the attention mechanism
00:19:42.480 | needs to store all of those of order n squared.
00:19:46.120 | So in this case, if I'm storing data at 100 milliseconds,
00:19:50.240 | then I have about 1,600 samples.
00:19:52.600 | And I need to store 1,600 by 1,600 at multiple layers.
00:19:56.840 | And it just becomes like a huge problem with the data--
00:20:01.800 | problem with the memory constraint.
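The arithmetic behind this constraint, as a small sketch. The 16 kHz sampling rate and 100 ms context come from the talk; storing each attention weight as a 32-bit float is an assumption made here for the size estimate.

```python
def attention_matrix_entries(context_ms, sample_rate=16000):
    """Return (context length in samples, entries in one n-by-n attention matrix)."""
    n = sample_rate * context_ms // 1000
    return n, n * n

n, entries = attention_matrix_entries(100)

# Size of one attention matrix, assuming 4 bytes per entry (32-bit floats).
megabytes_fp32 = entries * 4 / 1e6
```

That is roughly 10 MB for a single attention matrix covering only 100 ms of audio, and the cost is paid again for every head and every layer. Doubling the context quadruples it.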
00:20:03.360 | So what we said was, OK, what if we just
00:20:06.600 | use the context itself as a latent code?
00:20:10.840 | So in order to have much better representation at every layer,
00:20:16.440 | we cannot have huge, big attention matrices.
00:20:20.040 | So what we said was, we would just
00:20:21.960 | do a sample-wise conditioning and throw in CNN layers just
00:20:26.240 | to understand what the latent code would be.
00:20:28.560 | So you still have, like, an attention mechanism
00:20:31.040 | or just a past context.
00:20:32.960 | But then I'm also conditioning at every sample, OK,
00:20:36.560 | what the next sample should be given on this context embedding.
00:20:41.240 | And if you think about it, in a way, it is like, OK,
00:20:43.360 | if there are, like, five or six notes being played in a piano,
00:20:46.680 | then I'm kind of certain which notes
00:20:48.280 | will be played to a certain extent
00:20:50.040 | if I just throw in a CNN layer.
00:20:52.600 | So I'll use that information along with what
00:20:55.360 | my transformers are learning.
00:20:57.480 | And then I would condition it.
00:20:59.040 | And I would just use that to predict the next sample.
00:21:03.240 | So for the evaluation criteria, we
00:21:04.840 | did not look for negative log-likelihood scores.
00:21:08.720 | We just looked at how well we did on the prediction task.
00:21:12.760 | So we took a, like, stacked WaveNet,
00:21:15.640 | which was implemented by DeepMind,
00:21:17.520 | and saw that, OK, what was the performance using
00:21:21.320 | their benchmarks and even, like, bigger stacked WaveNets.
00:21:25.840 | We then started to increase the complexity of transformers
00:21:29.080 | and started to see whatever we had proposed
00:21:32.520 | in terms of, like, conditioning on the vanilla transformer
00:21:36.600 | architectures to see how well they do.
00:21:39.520 | We did not look for, like, an application-specific problem,
00:21:43.640 | which is basically, like, we don't look at, like,
00:21:46.520 | how well perception tasks are for, like, say,
00:21:49.200 | text-to-speech synthesis or speech denoising.
00:21:51.600 | We just look at, OK, if we are trying
00:21:53.280 | to model this using a cross-entropy loss,
00:21:56.000 | then with the same model, with the same loss function,
00:21:59.560 | how well they do on, like, similar kind of parameters.
00:22:03.720 | So this was the first kind of, like, sub-block of, like,
00:22:06.640 | how can we use our transformers for generative modeling.
00:22:12.160 | For the second problem, I'll do a quick headway
00:22:15.360 | on how can we use, like, transformers
00:22:19.160 | for doing language modeling, which
00:22:21.200 | is kind of becoming a really fancy term right now.
00:22:24.880 | And this work was done by Julius Smith way back in 2020.
00:22:28.880 | And the goal of this was, can we kind of, in a way,
00:22:32.520 | do language modeling with continuous audio sequences?
00:22:37.440 | And I'll briefly mention about that in this sub-block
00:22:42.240 | of the talk.
00:22:42.960 | And this is in regard for, like, solving acoustic scene
00:22:48.440 | understanding, which is basically, like,
00:22:51.560 | if I'm given a chunk of audio, then
00:22:54.280 | I want to understand what's in there.
00:22:56.960 | And if we could do that well, then in a way,
00:23:01.800 | we can do a lot of fancy, nice applications.
00:23:06.120 | So for example, like, if you think about, like,
00:23:08.160 | self-driving cars.
00:23:09.040 | So Waymo has started to incorporate microphones
00:23:12.200 | into their self-driving cars.
00:23:14.220 | Because, say, if there is an ambulance coming,
00:23:16.080 | or if there is a fire truck coming,
00:23:18.960 | then that sound would be picked up way, way before even
00:23:23.040 | the LIDARs or even their sensors.
00:23:25.840 | So they want to understand that and take
00:23:28.320 | actions based upon that.
00:23:30.880 | Apple, during COVID, did a hand-washing detection
00:23:33.360 | on their Apple Watch.
00:23:34.600 | Because if you could detect when someone is washing their hands,
00:23:37.800 | then you can, in a way, like, tell people that, oh,
00:23:40.760 | you need to wash hands for 20 seconds.
00:23:42.720 | And then that can be built upon as a cool application.
00:23:46.880 | It can be used for music recommendations.
00:23:49.520 | So Spotify, YouTube Music kind of gives, like,
00:23:51.760 | very, very good songs, which you are listening to,
00:23:54.640 | which are similar in content that you would perhaps like.
00:23:59.320 | It can also give, like, really cool applications.
00:24:01.680 | Like, say, people have tried, like,
00:24:03.760 | detecting depression from audio.
00:24:05.800 | Or I could detect whether I'm coughing or not,
00:24:08.760 | or I'm sneezing or not.
00:24:09.960 | And these can be, like, good medical device--
00:24:13.080 | medical applications, which can be
00:24:15.360 | used along with the current diagnosis that doctors provide.
00:24:20.520 | So the question was basically, for us,
00:24:23.320 | was, like, how can we do, like, language modeling
00:24:26.600 | in a continuous audio domain?
00:24:29.040 | And secondly, like, how can we train models,
00:24:31.240 | or how should we approach doing this?
00:24:35.320 | So this kind of, like, recipe has
00:24:37.360 | become, like, very, very popular these days in terms of, like,
00:24:40.680 | how would you approach this problem?
00:24:42.240 | It started with, like, open AI, and to a certain extent,
00:24:46.280 | DeepMind proposing that in terms of, like, VQVAE models.
00:24:50.960 | But it turns out, like, transformers
00:24:52.640 | love operating in discrete spaces, as of now.
00:24:56.440 | And what they kind of do is, as long as your representations
00:25:01.200 | are discrete, they are very, very good at modeling
00:25:03.640 | what's going to come next.
00:25:06.760 | So what people have been proposing as a workaround
00:25:09.200 | is you could take up, like, your favorite embedding
00:25:15.640 | in some manner.
00:25:16.400 | You could take VQ-VAE embeddings,
00:25:18.200 | or you could take wav2vec, or in terms of video,
00:25:21.840 | you can just do classic VGG or ResNet embeddings.
00:25:27.880 | You can apply k-means clustering to it.
00:25:31.040 | And k-means clustering would give you, like, discrete codes.
00:25:34.320 | You do language modeling with those discrete codes,
00:25:37.240 | and you predict the next code.
00:25:39.360 | And in a way, if you're doing this,
00:25:41.440 | then you're kind of doing language modeling over audio.
00:25:45.200 | And if you need to get back to the audio,
00:25:46.840 | then you already saw with WaveNet
00:25:49.000 | that you can condition the WaveNet model
00:25:51.120 | to give continuous output.
00:25:53.480 | So you can use those codes to get back
00:25:55.320 | to the audio, similar to what Jukebox from OpenAI did.
00:26:00.240 | So I'll quickly mention about what vector quantization is.
00:26:05.920 | It's one of the most underutilized algorithms,
00:26:08.680 | to be honest.
00:26:09.880 | And what it does is basically gives, in a way,
00:26:12.960 | discrete codes to continuous embedding spaces.
00:26:16.400 | So how does it do it?
00:26:18.080 | So you basically have an embedding space,
00:26:23.600 | let's say, in 2D right here.
00:26:25.240 | You define what are the number of clusters
00:26:27.120 | you want to put each of them in.
00:26:29.320 | You run k-means, and you would certainly
00:26:31.520 | get these patches of where all of these embeddings
00:26:36.200 | are, what would be the representative embedding
00:26:38.280 | of a continuous embedding.
00:26:40.720 | You can take all of those patches,
00:26:42.120 | and you can just number them, or you can just list them.
00:26:45.960 | So in this case, you can perhaps have 25 numbers, or 20 numbers,
00:26:49.840 | which are, in a way, mapping from a continuous embedding
00:26:53.480 | to a discrete token.
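A minimal sketch of vector quantization with k-means in plain Python. The six 2D embeddings are made-up points, and the centroids are initialized from the first k points to keep the example deterministic; both are assumptions for illustration, not the setup used in the paper.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

def kmeans(points, k, iterations=10):
    """Plain k-means; centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        assignments = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

# Hypothetical continuous embeddings in 2D, forming two loose groups.
embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
codebook = kmeans(embeddings, k=2)

# Each embedding is replaced by the index of its nearest centroid: a discrete token.
tokens = [nearest(e, codebook) for e in embeddings]
```

The resulting token sequence is what a language model is then trained on, and the codebook maps each token back to a representative embedding.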
00:26:57.040 | This is another example right here.
00:26:58.520 | So in our case, what we did was we
00:27:02.320 | took patches of the spectrogram, which are basically
00:27:05.160 | very small patches across time, and then shared all
00:27:10.760 | across the frequency axis.
00:27:13.320 | You take those patches, you learn
00:27:15.200 | the embedding representation.
00:27:16.600 | In our case, it was just like three-layer autoencoder,
00:27:19.720 | fully-connected encoders with three layers of decoders,
00:27:22.800 | and have a bottleneck layer in between.
00:27:25.360 | So that bottleneck layer basically
00:27:27.000 | is kind of similar to this kind of diagram
00:27:29.280 | in, say, 64-dimensional space or 120-dimensional space.
00:27:33.520 | You take up those bottleneck codes,
00:27:35.400 | and then you run k-means clustering on it.
00:27:37.840 | Suddenly, in a way, you can find discrete codes
00:27:43.520 | for continuous embedding spaces or even continuous segments.
00:27:48.040 | And since we know that transformers kind of love
00:27:50.480 | operating in discrete spaces, you
00:27:52.600 | can just apply language modeling now,
00:27:55.160 | and then you can see what you can do.
00:27:58.480 | So in our case, we just had very simple three-layer,
00:28:02.200 | fully-connected autoencoder, small patches.
00:28:06.000 | The number of codes is important,
00:28:07.520 | because if you have too many codes,
00:28:10.000 | then you're kind of just throwing
00:28:11.800 | in all kinds of noisy things.
00:28:13.600 | Now, I'll give an example of why the number of codes
00:28:17.880 | is important in a moment.
00:28:19.680 | And if you have too few codes,
00:28:22.480 | what you're, in a way, doing is you're
00:28:25.000 | removing all of the information which was relevant,
00:28:27.280 | and you're just kind of averaging them all out.
00:28:29.480 | So this idea first was proposed by Jukebox,
00:28:38.200 | which did it for music.
00:28:40.240 | So you do the exact same thing, what
00:28:42.160 | I talked about, in a slightly different manner.
00:28:45.240 | In a way that, OK, you cannot learn codes
00:28:48.640 | for longer sequences.
00:28:50.520 | So in a way, learn sequences which are just moving slowly
00:28:54.080 | and which are looking at only a certain amount of audio.
00:28:57.680 | So you kind of encode this in these discrete levels, which
00:29:01.320 | are basically like--
00:29:03.360 | all of these basically are codes.
00:29:04.760 | So at every point, I define, OK, this audio
00:29:08.040 | had, perhaps, code number 55.
00:29:10.520 | And in the next level, perhaps, it had code number 2.
00:29:12.840 | And in the very top, perhaps, it had code number 2,000.
00:29:16.440 | So in a way, I'm discretizing the whole codes.
00:29:19.680 | Now what I do is I take up my favorite transformer model,
00:29:23.440 | perhaps like a causal autoregressive one.
00:29:26.280 | And I say that, OK, given these codes,
00:29:29.240 | try to predict what codes would come next.
00:29:31.480 | And for sure, transformers can do that.
00:29:34.240 | So I would generate the codes in the future.
00:29:36.840 | Once I've generated the codes in the future,
00:29:39.560 | I can say that, OK, this problem now
00:29:41.360 | is kind of like a text-to-speech problem,
00:29:43.680 | because I have these discrete codes.
00:29:45.560 | Text-to-speech, in a way, is going from discrete letters
00:29:48.680 | to continuous audio.
00:29:50.560 | So I would throw in the fanciest model, which was WaveNet.
00:29:53.720 | And I would just feed in the codes.
00:29:55.240 | And I would get the generated audio.
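The "given these codes, predict what codes come next" step can be illustrated with a deliberately tiny stand-in. The sketch below uses a count-based bigram model instead of a causal transformer, purely to show the interface (discrete codes in, future codes out); `fit_bigram`, `generate`, and the repeating code sequence are hypothetical, and the final decoding of codes back to audio with a WaveNet-style model is not shown.

```python
import numpy as np

def fit_bigram(codes, num_codes):
    """Estimate P(next code | current code) from counts, with add-one smoothing."""
    counts = np.ones((num_codes, num_codes))
    for cur, nxt in zip(codes[:-1], codes[1:]):
        counts[cur, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def generate(probs, start, length):
    """Greedily extend a code sequence, one most-likely code at a time."""
    out = [start]
    for _ in range(length):
        out.append(int(probs[out[-1]].argmax()))
    return out

# Hypothetical discretized audio: codes cycling 0 -> 1 -> 2 -> 0 -> ...
codes = [0, 1, 2] * 20
probs = fit_bigram(codes, num_codes=3)

# Continue the sequence from its last observed code.
future = generate(probs, start=codes[-1], length=5)
```

A transformer replaces the bigram table with a model that conditions on the whole preceding context, but the training signal is the same: the next code in the sequence.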
00:29:58.080 | So this was, in a way, what I described,
00:30:02.360 | that they take up a continuous audio.
00:30:04.760 | They have these compressed codes,
00:30:06.640 | which they encode using a CNN in this case.
00:30:10.720 | The method doesn't matter.
00:30:11.960 | You can throw in the fanciest of embedding or latent
00:30:15.040 | representation on those continuous code.
00:30:18.400 | You generate the patterns, which are like,
00:30:20.120 | what's going to happen next in the future?
00:30:22.000 | And then you decode back using a fancy WaveNet or state-of-the-art
00:30:26.000 | model.
00:30:28.000 | So this was what they were doing for music synthesis.
00:30:32.160 | What we said was, yeah, this is good.
00:30:35.200 | This can generate a good amount of music.
00:30:37.200 | But can these models be used for generating
00:30:44.480 | good representation of the current audio?
00:30:47.840 | And the goal there was, can language models
00:30:51.160 | learn representation, which can just encapsulate whatever we
00:30:55.600 | are giving as an input signal?
00:30:59.160 | So in this case, what we tried after that
00:31:01.160 | was exactly the same idea.
00:31:06.840 | But instead of doing it on VQ-VAE end-to-end learned encodings,
00:31:11.960 | we just apply vanilla k-means clustering,
00:31:14.320 | similar to what I described earlier.
00:31:16.880 | We do it on spectrogram patches.
00:31:18.200 | So you take up these spectrograms of audio,
00:31:20.960 | and you just divide them into very small chunks,
00:31:23.960 | learn autoencoder encodings for each of those chunks,
00:31:28.440 | run k-means clustering.
00:31:30.320 | In this case, let's say I am learning 16 codes.
00:31:34.120 | Represent the continuous audio in terms of the 16 codes.
00:31:38.360 | Have a transformer which can perhaps predict the next code.
00:31:41.880 | And if I keep on getting better and better at predicting
00:31:45.120 | what's going to happen next, then in this linear layer,
00:31:48.000 | I should be encapsulating what's important
00:31:51.320 | or what's a good summary of what has happened in the past.
00:31:56.640 | So that was our intuition behind trying this.
00:32:01.600 | And as I explained, the number of codes
00:32:03.600 | plays a very important role.
00:32:05.560 | You can see here, these are just two piano notes switching
00:32:08.920 | one after the other.
00:32:10.080 | If I just have 16 codes,
00:32:12.080 | it just happens to have just a single line of encoding,
00:32:16.560 | a single code assigned to all of this.
00:32:18.640 | Whereas if I'm assigning more codes,
00:32:21.040 | then it becomes a fine-grained prediction
00:32:23.600 | where I'm actually able to get what the individual notes are.
00:32:29.320 | Recently, Facebook also did this. They just
00:32:32.880 | gave a different name to the whole thing, which
00:32:35.240 | is that we can call this textless NLP, in the sense
00:32:40.760 | that, OK, you can do NLP without having access to text.
00:32:44.760 | But the idea is very, very similar.
00:32:46.360 | You have an encoder, which is exactly similar to say
00:32:48.600 | what OpenAI was using.
00:32:49.680 | You have a VQ-VAE, wav2vec, or whatever you want to use.
00:32:53.320 | You can apply k-means clustering to it.
00:32:55.240 | You apply language models to it.
00:32:57.160 | And instead of a decoder being WaveNet,
00:32:58.880 | they just have a decoder, which is
00:33:00.280 | like a different version of text-to-speech, which
00:33:02.840 | is like Tacotron in this case.
00:33:05.000 | So as you can see, these are all the same wine
00:33:07.160 | in very different bottles.
00:33:08.280 | But the core idea is almost exactly the same.
00:33:12.560 | So this created a huge uproar that this is going to change NLP.
00:33:18.120 | But this is very, very similar to what people
00:33:21.560 | have been doing in the past.
00:33:24.560 | So I've already explained what this was.
00:33:29.200 | So in our case, we just try to predict
00:33:32.400 | what's going to happen next given the previous context
00:33:35.360 | and use that representation similar to every single
00:33:40.320 | one-shot learning or zero-shot learning-based method.
00:33:44.600 | I also explained why the number of codes is important.
00:33:47.720 | If you have too few, then you're
00:33:49.240 | just throwing away a lot of information.
00:33:50.920 | If you have too many, then
00:33:55.360 | it is no longer robust to noise.
00:33:59.480 | So this was our setup.
00:34:00.720 | And before I jump in, I should add one of the tweets
00:34:04.080 | which I saw from one of the most prominent researchers
00:34:08.800 | at DeepMind, which is basically like a lot of times
00:34:11.240 | it is very, very easy to bump up numbers.
00:34:13.680 | I can have these details just not present
00:34:17.200 | in my paper, which actually help a lot in terms
00:34:20.280 | of improving the performance.
00:34:22.200 | And sometimes we don't take into account
00:34:25.760 | what the actual model is contributing
00:34:29.160 | to the result versus what
00:34:32.000 | these training tricks are contributing.
00:34:35.160 | So for most of these methods, what we are trying to see
00:34:38.240 | is we try to keep almost exactly the same approach.
00:34:42.000 | No data augmentation, no fancy label smoothing,
00:34:44.560 | or moving average of weights, or decay, or whatever.
00:34:48.400 | You just have similar base recipes
00:34:51.440 | to see how well we are doing.
00:34:55.520 | For this case, the goal was to see
00:34:57.920 | how well our models do with respect
00:35:00.400 | to this purely supervised approach
00:35:02.120 | and how well it does with respect
00:35:03.720 | to a similar unsupervised approach.
00:35:06.840 | So in the first case, the model and all of the weights
00:35:09.360 | have access to all of the labels, which is just
00:35:12.320 | shown as VGG supervised, which is basically
00:35:14.840 | you take up an audio understanding data set
00:35:16.840 | and you see how well you're doing on accuracy metrics.
00:35:21.440 | So that was the first one.
00:35:22.560 | In the second one, we applied SimCLR,
00:35:24.760 | which was proposed by Geoff Hinton,
00:35:26.160 | in which you can take up these multiple augmentations
00:35:28.440 | of the same input.
00:35:30.440 | You can have patches removed.
00:35:32.000 | You can blur the signal.
00:35:33.000 | You can flip the signal.
00:35:34.800 | You learn an embedding out of the last layer
00:35:36.880 | without access to the labels, and then
00:35:38.720 | just have a linear head to predict what's happening.
00:35:42.000 | By using that, we got a 55% accuracy.
00:35:45.040 | You do the exact same thing with transformers.
00:35:46.960 | You don't have access to labels.
00:35:48.440 | You just train them to predict the next code.
00:35:51.640 | You take the linear layer, apply the same linear head,
00:35:54.400 | and try to predict what's happening inside.
00:35:57.120 | And with that, we got 60% accuracy.
00:35:59.440 | So even though the results are not good,
00:36:01.160 | the fact is that neural networks actually
00:36:04.920 | are very, very good at getting better and better
00:36:09.360 | when you throw in huge amounts of data.
00:36:11.520 | So there's still a 10% gap between purely supervised
00:36:14.720 | and purely unsupervised.
00:36:17.200 | But that's going to improve with throwing a lot of data
00:36:21.160 | at these models, because they don't need access
00:36:23.480 | to any labels per se.
00:36:25.240 | So this is a famous paper by Dan Ellis and Nelson Morgan
00:36:28.560 | at Berkeley, in which they actually showed way back
00:36:30.920 | in 1999 as to why size matters for deep neural networks
00:36:37.200 | and also the number of data points which is present.
00:36:41.000 | So as they kept on increasing the size of the data set
00:36:44.400 | and the parameters, they kept on getting lower and lower word
00:36:47.120 | error rates.
00:36:48.120 | And this has been true across all kinds of data sets.
00:36:51.480 | And that's why the whole excitement is about
00:36:53.720 | unsupervised learning.
00:36:56.640 | So this was, in a way, a flavor of how
00:36:58.760 | can we do language modeling and unsupervised learning
00:37:01.320 | on audio for continuous signals.
00:37:05.200 | For the third part, I'll just quickly mention
00:37:09.400 | ideas which are very similar to what you would have seen
00:37:11.800 | in vision transformers, but with the caveat
00:37:15.680 | of how we can use some sort of signal processing
00:37:18.880 | to improve the performance even further.
00:37:22.040 | So the basic approach still remains the same exactly
00:37:24.520 | as what you would have seen in vision transformers.
00:37:28.200 | You have a signal of interest which you want to classify.
00:37:33.080 | Here, it is a raw waveform instead of images.
00:37:36.400 | The goal is to predict what's there inside of it.
00:37:41.080 | And also, we don't have any convolutions.
00:37:43.480 | We don't have any other tricks which we were using before.
00:37:46.760 | All we ask is, can transformers by themselves
00:37:49.760 | solve this particular problem?
00:37:52.880 | So for the data set--
00:37:54.800 | and the whole setup was still the same.
00:37:57.280 | No data augmentation and no other forms of these tricks.
00:38:03.000 | You are given like 40,000 snippets for training
00:38:05.440 | and 10,000 for validation.
00:38:08.040 | Our job is to predict as well as possible
00:38:10.880 | as to what's there in the audio.
00:38:13.480 | This problem is very similar to the sound which you heard
00:38:17.120 | and the video which you saw, that given a spectrogram patch,
00:38:22.080 | you have to predict what's there inside of it.
00:38:24.080 | We kind of do one step further than what's just
00:38:33.000 | like a simple transformer model.
00:38:34.840 | In a sense that we try to see whether some sort of hierarchy
00:38:38.400 | over transformer embeddings would help us in any manner.
00:38:43.960 | So for that, we use wavelet decomposition
00:38:47.200 | on the intermediate transformer embeddings.
00:38:50.720 | So what is a wavelet decomposition?
00:38:55.040 | In very naive terms, it can be like a way
00:38:58.200 | of decomposing the intermediate embeddings
00:39:02.720 | into another intermediate embedding, in a sense
00:39:06.800 | that we are kind of putting these highways of like some
00:39:09.920 | embeddings are moving very slowly
00:39:11.440 | and some embeddings are moving very fast.
00:39:13.640 | And some embeddings are retained exactly
00:39:15.600 | at the rate of what the original signal was.
00:39:19.480 | And why is this important?
00:39:20.640 | Because you can think about that at every intermediate state,
00:39:24.000 | you are in a way learning some sort of hierarchy in the model.
00:39:27.800 | So if I look at what we do with the wavelet decomposition
00:39:35.120 | before and after, let's say you had time across this
00:39:39.200 | and you had the embedding size across this
00:39:41.640 | and this whole patch was your output of, say,
00:39:45.880 | the nth layer of the transformer.
00:39:48.640 | What I say now is, OK, I would just
00:39:52.320 | have a mapping from this to the mapping of my interest
00:39:56.040 | using wavelet decomposition, in which for half of the samples,
00:40:00.280 | I just retain the exact same embedding
00:40:02.160 | as what was learned by the transformer model.
00:40:05.960 | In the next half, I would start combining two at a time.
00:40:09.160 | So in a way, I'm learning this sort
00:40:10.600 | of like a tree structure within a single layer
00:40:13.800 | of the transformer embedding.
00:40:16.360 | And for now, the wavelet or the basis function which I use
00:40:21.000 | is simple averaging.
00:40:22.360 | So let's say from all of the embedding layers in between,
00:40:26.040 | I just need to have one embedding which is not
00:40:31.400 | moving at all, which is just representative of whatever
00:40:33.600 | is there of the whole latent space in that nth layer.
00:40:40.240 | Then in the next layer, I would just use two at a time
00:40:44.160 | and then I would use four at a time
00:40:46.680 | until I reach the exact resolution as what I had.
00:40:50.360 | Doing this operation doesn't add any parameters whatsoever.
00:40:53.560 | You're just defining what your basis function would be
00:40:56.320 | or what your wavelet function would be.
00:40:58.080 | In this case, it is a Haar wavelet.
00:41:00.200 | And I start combining them and I learn a hierarchy
00:41:04.120 | at every single layer of the transformers.
00:41:08.280 | And this improved our performance significantly
00:41:12.560 | as compared to not using them with addition
00:41:15.080 | of no extra parameters.
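The averaging scheme described above can be sketched as follows. This is a guess at the layout, assuming NumPy, a layer output of shape (time, embedding), and a time length that is a power of two (at least 2). The function name `multiscale_average` and the way embedding dimensions are split into bands (half kept as-is, then a quarter averaged two steps at a time, and so on, with the last band averaged over the whole sequence) are assumptions for illustration. Only the averaging half of a Haar decomposition is shown.

```python
import numpy as np

def multiscale_average(x):
    """Apply parameter-free multi-scale averaging to a (T, D) layer output.

    The first half of the embedding dimensions is kept at full time
    resolution. Each following band of dimensions is averaged over
    non-overlapping windows of 2, 4, 8, ... time steps, and the final
    band is averaged over the whole sequence.
    """
    T, D = x.shape
    out = x.copy()
    start = D // 2  # dimensions before `start` are left untouched
    window = 2
    while start < D:
        # The coarsest scale (window == T) takes all remaining dimensions.
        width = D - start if window >= T else max((D - start) // 2, 1)
        band = x[:, start:start + width]
        # Average inside each window, then repeat to restore length T.
        pooled = band.reshape(T // window, window, width).mean(axis=1)
        out[:, start:start + width] = np.repeat(pooled, window, axis=0)
        start += width
        window = min(window * 2, T)
    return out

# Hypothetical layer output: 8 time steps, embedding size 64.
rng = np.random.default_rng(0)
layer_output = rng.standard_normal((8, 64))
hierarchical = multiscale_average(layer_output)
```

The output has the same shape as the input and no weights are involved, which matches the point that this operation adds no parameters.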
00:41:17.800 | And I'll come to the results later also.
00:41:20.840 | So this is what the whole approach looks like.
00:41:23.840 | You have a front end.
00:41:25.680 | The front end is basically a single layer
00:41:28.480 | of 2,000 neurons followed by a dense layer of 64 neurons,
00:41:33.240 | which is just to make sure to conform it
00:41:37.360 | to the intermediate transformer embeddings.
00:41:39.080 | Let's say if for the transformers,
00:41:40.920 | I define the embedding size to be 64,
00:41:43.200 | then that's the dimension which I'm mapping them to.
00:41:47.200 | So I take a raw waveform.
00:41:48.920 | I patch it in very small patches similar to how
00:41:51.960 | you do in vision transformers.
00:41:54.520 | I would just have a single layer of 2,000 neurons
00:41:56.880 | followed by a dense layer of 64 neurons
00:42:00.040 | with the hope that the first layer is learning
00:42:02.280 | like a Fourier basis function, which
00:42:04.760 | should be adaptable according to what I'm learning.
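A rough sketch of the front end described above, assuming NumPy. The layer sizes (2,000 and 64) come from the talk; the patch length of 400 samples, the ReLU nonlinearity, the missing bias terms, and the random weights standing in for learned ones are all assumptions, as are the names `patchify` and `front_end`.

```python
import numpy as np

def patchify(waveform, patch_len):
    """Cut a 1D waveform into non-overlapping patches of `patch_len` samples."""
    n = len(waveform) // patch_len
    return waveform[: n * patch_len].reshape(n, patch_len)

def front_end(patches, w1, w2):
    """Two dense layers: a wide first layer, then a projection to the embedding size."""
    hidden = np.maximum(patches @ w1, 0.0)  # ReLU is an assumption
    return hidden @ w2

rng = np.random.default_rng(0)
patch_len, hidden_dim, embed_dim = 400, 2000, 64  # patch_len is hypothetical

# One second of stand-in audio at a hypothetical 16 kHz sampling rate.
waveform = rng.standard_normal(16000)

# Random weights in place of the learned ones.
w1 = 0.01 * rng.standard_normal((patch_len, hidden_dim))
w2 = 0.01 * rng.standard_normal((hidden_dim, embed_dim))

# One 64-dimensional token per patch, ready for the transformer stack.
tokens = front_end(patchify(waveform, patch_len), w1, w2)
```

Each column of `w1` plays the role of one learnable filter applied to the raw patch, which is why the first layer can end up resembling a set of basis functions.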
00:42:08.080 | After that, I keep on doing this over and over again.
00:42:11.640 | I don't have a classification head or anything like that.
00:42:15.520 | I keep on adding multiple stacks of transformers after that.
00:42:20.120 | And then I have two approaches of what I can
00:42:26.120 | do in terms of adaptation.
00:42:28.920 | I can do average pooling across time
00:42:31.040 | of these intermediate embeddings,
00:42:32.800 | because the idea is very similar to what
00:42:35.160 | we do in classical vision, that each of the embeddings
00:42:38.080 | are looking at much, much broader output
00:42:42.440 | in the subsequent layers.
00:42:44.160 | Or I could do a wavelet decomposition.
00:42:47.160 | So what I do is that I take all of these embeddings
00:42:50.160 | and I define these highways.
00:42:51.560 | So some of the embeddings move fast.
00:42:53.200 | Some of them are moving very slow.
00:42:54.800 | And some are retained at the exact same resolution
00:42:57.280 | as what the transformer is learning.
00:42:59.800 | And then I keep doing this over and over again.
00:43:02.120 | I have a dense layer.
00:43:03.680 | I have my softmax or sigmoid, whatever
00:43:06.560 | is my classification head.
00:43:08.680 | So this is kind of what the approach looks like.
00:43:12.400 | We compare it with all of the traditional vision-based
00:43:17.720 | architecture.
00:43:18.520 | So the vision-based models have been very good.
00:43:21.040 | And the performance has been similar in understanding
00:43:24.360 | audio also.
00:43:25.560 | So we compare all of those models
00:43:27.600 | in terms of mean average precision.
00:43:29.680 | And we see that even the tiniest models of transformers
00:43:32.320 | were just surpassing all of the state-of-the-art CNN
00:43:34.680 | models, which was a very good sign.
00:43:37.880 | Then we started to bump up.
00:43:39.800 | The larger model should keep on improving the performance.
00:43:42.560 | And with the multi-scale models, as well as
00:43:46.000 | with the pooling layers, they improve the performance
00:43:48.760 | even further, which was kind of very surprising to us
00:43:52.520 | because the number of parameters are very small.
00:43:54.920 | These are very tiny architectures.
00:43:56.480 | Yet they are surpassing things like even
00:43:58.360 | DenseNet, which are huge models with a lot of millions
00:44:01.200 | of parameters.
00:44:03.920 | So after that, we said--
00:44:05.480 | and I'm going to conclude quickly.
00:44:08.160 | After that, we said that, OK, this is looking pretty cool.
00:44:12.200 | What actually is the transformer, or the first layer, learning?
00:44:17.480 | So in order to make this plot, what we said was, OK,
00:44:24.240 | if you were to take a classic Fourier transform,
00:44:27.600 | then think of the axes like this.
00:44:32.960 | This axis is the number of filters.
00:44:34.520 | And this axis is the frequency.
00:44:36.720 | Then in a way, it should be connecting all of the points
00:44:41.640 | in a linear line.
00:44:43.400 | And this is akin to the number of points in the FFT.
00:44:45.920 | So how many points I'm defining here?
00:44:48.240 | If I'm defining 2,048 points here,
00:44:50.680 | then I would have 2,048 sinusoidal basis functions,
00:44:56.040 | which are going from lower frequency
00:44:57.720 | to the highest frequency.
00:45:00.120 | We said, OK, we'll do the exact same thing,
00:45:02.080 | but now with filters.
00:45:03.640 | So we have frequency along the y-axis and the number
00:45:07.520 | of points on the x-axis.
00:45:09.360 | And if it was a classic Fourier transform,
00:45:11.280 | then it would be connecting them right along a straight line.
00:45:15.560 | But what we did was we take up the front end, which
00:45:19.600 | is learned by transformer, take its Fourier transform,
00:45:22.800 | sort according to its center frequency
00:45:24.960 | as to what frequency it is activating the most,
00:45:27.680 | and then keep on stacking them.
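The plotting procedure described above (take the Fourier transform of each learned filter, find the frequency it responds to most, sort, and stack) can be sketched with NumPy. The pure sinusoids below are stand-ins for learned first-layer weights, and `sort_by_center_frequency` is a hypothetical name; the peak of the magnitude spectrum is used here as the "center frequency", which is one reasonable reading of the talk.

```python
import numpy as np

def sort_by_center_frequency(filters):
    """Sort filters by the frequency bin where their magnitude spectrum peaks.

    filters: array of shape (num_filters, filter_len).
    Returns the stacked magnitude spectra in sorted order and the peak bins.
    """
    spectra = np.abs(np.fft.rfft(filters, axis=1))
    peak_bins = spectra.argmax(axis=1)
    order = np.argsort(peak_bins, kind="stable")
    return spectra[order], peak_bins[order]

# Stand-in "learned" filters: sinusoids with shuffled frequencies.
filter_len = 64
t = np.arange(filter_len)
true_bins = np.array([5, 20, 1, 12])
filters = np.sin(2 * np.pi * true_bins[:, None] * t[None, :] / filter_len)

sorted_spectra, sorted_peaks = sort_by_center_frequency(filters)
```

For a fixed Fourier basis, plotting `sorted_peaks` against filter index gives a straight line; a learned front end can bend that curve toward the frequencies a given task needs.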
00:45:29.960 | When we did this for two problems,
00:45:31.960 | we saw that we are learning a different time frequency
00:45:35.720 | representation, which is specific to a particular
00:45:38.080 | problem.
00:45:38.600 | So if I'm trying to understand what's
00:45:40.400 | there in the content of the audio,
00:45:42.480 | I learn a representation which is
00:45:43.880 | very different from a Fourier transform,
00:45:45.160 | which would have been a straight line; instead
00:45:47.040 | it is like a curved exponential line like this.
00:45:51.640 | And if I do a polyphonic pitch estimation,
00:45:54.440 | I learn a very different front end,
00:45:57.320 | which is adapting to that particular problem.
00:46:00.080 | So this was very exciting to us because making computers
00:46:05.520 | hear in a way in which they are adapting their ears
00:46:07.880 | according to a particular problem is a very cool idea.
00:46:12.400 | The second thing is we actually looked at each of the filters
00:46:15.200 | to see what they were doing.
00:46:17.320 | And these are basically just single slices like this.
00:46:21.240 | So this is what we would have learned as a front end neuron.
00:46:25.280 | So we take up each of the neurons and we just plot them.
00:46:27.880 | And for plotting this, we basically
00:46:29.800 | take a Fourier transform and then
00:46:32.040 | sort them according to where the center frequency is.
00:46:35.400 | When we looked at what the neurons
00:46:37.080 | were learning in the front end,
00:46:39.040 | we saw that it is learning properties
00:46:41.440 | which are very, very closely matching
00:46:45.200 | with the traditional signal processing.
00:46:46.840 | So you would have something like an onset detector
00:46:48.960 | learned right here.
00:46:50.760 | You're learning windowing functions.
00:46:52.280 | In a way, it is learning to have a kernel which
00:46:55.400 | is best for a time frequency representation, what people
00:46:58.320 | have been using in signal processing, which
00:47:00.120 | is like a Hann or a Hamming window.
00:47:03.000 | We are learning these pure sinusoids
00:47:04.720 | which are responsible for activating
00:47:07.400 | a particular frequency.
00:47:08.920 | So you can see the richness as compared
00:47:10.560 | to having a fixed purely sinusoidal basis
00:47:14.000 | function right here.
00:47:16.440 | So this was what we had done.
00:47:20.040 | And then to share the final thoughts,
00:47:23.120 | I'll conclude by saying that, OK, transformers
00:47:25.480 | are proving to be a major advancement in AI
00:47:27.360 | research across the fields.
00:47:30.000 | And it seems like they're solving everything for now.
00:47:34.240 | And hopefully, this is not the end.
00:47:36.040 | And we should keep an eye out on something
00:47:38.440 | which would change things and have an impact which
00:47:41.080 | is more than what transformers have had.
00:47:44.360 | And who knows what's going to come next?
00:47:47.600 | Yeah, so by that, I'll just conclude.
00:47:49.680 | And I'll be happy to take questions.
00:47:53.120 | Thank you, Prateek.
00:47:54.080 | That was a really good talk.
00:47:55.800 | And you provided some really good insights
00:47:58.720 | about how transformers work for the audio case.
00:48:02.360 | And yeah, thank you for the talk.
00:48:04.880 | And now I would invite questions from the class students.
00:48:10.200 | Let me just stop the recording.
