
Stanford CS25: V1 I Audio Research: Transformers for Applications in Audio, Speech, Music


Chapters

0:00 Introduction
0:06 Transformers for Music and Audio: Language Modelling to Understanding to Synthesis
1:35 The Transformer Revolution
5:02 Models getting bigger ...
7:43 What are spectrograms
14:30 Raw Audio Synthesis: Difficulty; Classical FM Synthesis; Karplus-Strong
17:14 Baseline: Classic WaveNet
20:04 Improving Transformer Baseline: Major Bottleneck of Transformers
21:02 Results & Unconditioned Setup: Comparing WaveNet and Transformers on Next-Sample Prediction, with Top-5 Accuracy out of 256 Possible States as the Error Metric (Application Agnostic, Suits Training Setup)
22:11 A Framework for Generative and Contrastive Learning of Audio Representations
22:38 Acoustic Scene Understanding
24:34 Recipe of doing
26:00 Turbocharging the Best of Two Worlds: Vector Quantization, a Powerful and Under-Utilized Algorithm; Combining VQ with Auto-Encoders and Transformers
33:24 Turbocharging the Best of Two Worlds: Learning Clusters from Vector Quantization; Long-Term Dependency Learning with the Cluster-Based Representation under a Markovian Assumption; the Better the Prediction, the Better the Summarization
37:06 Audio Transformers: Transformer Architectures for Large Scale Audio Understanding - Adieu Convolutions (Stanford University, March 2021)
38:45 Wavelets on Transformer Embeddings
41:20 Methodology + Results
44:04 What does it learn -- the front end
47:18 Final Thoughts

Transcript

Thanks for inviting me for the talk today. And I'll be just talking about transformers for music and audio, which is very different than what all of us were doing in this past course. I'm also the only speaker from Stanford, so I have to do a good job. So you'll see very good slides, because I'm representing the university in some sense.

So yeah, so the flow of the talk for today is basically like I'll be throwing a lot of stuff. It's kind of like a buffet style, and then you feel free to like or dislike whatever you want. And I'll be talking mostly about three papers of what I've been working on.

I'll start with introducing what transformers are from a different perspective, and what audio representations are. Then I'll talk about a generative model for audio, which is just doing language modeling at the sample level. Then I'll talk about how one can do language modeling for speech and audio, which is different than what people do for text.

What are the current trends in the literature? Finally, I'll briefly mention similar stuff as to what was happening in computer vision with regard to vision transformers. How can we adapt similar ideas for audio transformers? And throw in a bit of signal processing to improve the performance. Having said that, the talk is about 35 to 40 minutes with about 15 minutes of Q&A. I should also say that all of the opinions are mine, and Stanford or any other professor is not responsible for any of the mistakes which I make.

So transformers have kind of revolutionized the way everyone was approaching deep learning. Before that, it was all about CNNs. And mostly, all of these prominent models have been coming in waves. So there was a time when everyone was just applying CNNs. Then came a time where people started adapting CNNs with some sort of dilated convolutions.

And slowly, the recurrent networks were getting out of fashion. Now, it seems like transformers are in fashion all the time. So they seem to be solving almost every single problem which is being thrown at them. So what's special about them? One of the facts which struck me was their simplicity, if you think about it. And they have been hugely popular also.

So it was just released in 2018. And within three years, it has about 30,000 citations. And it is kind of solving every single problem in every single domain. It has its limitations, though, also. But if you think about it, in a way, transformers are basically a way of just cascading self-attention with feature learning.

And if you keep on doing it over and over again, then the model, in a way, learns which parts of the input are important and keep on transforming them, removing the contents which are not important, and just have the limited information which is just responsible for a particular task.

And it has been very, very difficult to keep up with the literature. I have put it as a joke here. But then even Twitter's recommendation engine was kind of going haywire as to why Chris Manning was just searching over transformers. And that was way back in 2020.

So it has been difficult for researchers also to keep up with the pace of what's going on. Just before transformers, all of the NLP community was just going gaga about bidirectional LSTMs with attention. So every single paper before 2017 was just like: you have encoder LSTM layers. You keep on adding multiple layers.

And then after that, you have an attention mechanism which just learns what's important and then just keeps on decoding sequentially, one at a time. But this was not kind of like an ideal way to do it. Because it turns out that when we start throwing in longer sequences, the connections are no longer carrying the gradient updates the way they should.

So what the researchers from Google said, instead of having just an attention layer at the very last encoding, we would just have these attention mechanisms at every single layer, which in a way would just learn what's important for a particular problem at that particular layer. And we keep on doing it over and over again.

So then the whole idea of transformers and attention mechanisms cascaded one after the other came. And I'll not go into the details, because this is the last class of the course. But then the usual tricks do help across the neural net literature, like having multi-headed attention, skip connections, and layer norm.

So all of these things, they are not only giving gains for transformers themselves, but they can be applied to any other architecture also. The other thing which is helping this research is basically that the available compute is getting better and better. So all of these big companies are just throwing massive amounts of computing resources at solving very, very simple and trivial tasks.

The top of the hill being the switch transformer, which was discussed in the course also. But one other thing which I think started all of this trend was ELMo, which was just learning these contextualized representations for natural language processing. And that model right here was perhaps one of the first kind of like model 0.0 or something, or 0.1 in terms of bringing and ushering in the whole revolution.

You can see how similar these kinds of models look. BERT was basically inspired heavily by ELMo, in which they just replaced some of the LSTM layers with transformer modules. So a point to note also is that, irrespective of natural language processing or any other domain, these can be adopted in a variety of domains.

And for today's talk, I'll be just adopting them to audio. So I'll basically start with introducing people what audio representations are, and just for the sake of completeness, talk about spectrograms. So you can take any time domain signal, and you can decompose that signal into a variety of basis functions.

And if you take up a Fourier transform, you're kind of like decomposing the actual time domain signal into its sinusoidal basis components. So if you have like a waveform here like this, which is a sum of three pure sinusoids, then their sum basically is this. And you can see that when you take a Fourier transform and its magnitude, you kind of have the strength of the individual components shown here.
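As a rough, concrete illustration of that kind of decomposition, here is a minimal NumPy sketch; the sampling rate, frequencies, and amplitudes are made-up values for the example, not numbers from the talk:

```python
import numpy as np

sr = 16000                      # assumed sampling rate (Hz)
t = np.arange(sr) / sr          # one second of time samples
# a waveform that is the sum of three pure sinusoids (illustrative frequencies)
x = 1.0 * np.sin(2 * np.pi * 220 * t) \
  + 0.5 * np.sin(2 * np.pi * 440 * t) \
  + 0.25 * np.sin(2 * np.pi * 880 * t)

spectrum = np.abs(np.fft.rfft(x))           # magnitude of the Fourier transform
freqs = np.fft.rfftfreq(len(x), d=1 / sr)   # frequency (Hz) of each bin

# the three strongest bins sit at the three component frequencies
top3 = freqs[np.argsort(spectrum)[-3:]]
print(np.sort(top3))            # [220. 440. 880.]
```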

So you can take up another waveform, let's say a square wave, and what you have is basically a much richer sinusoidal decomposition because it is kind of a discontinuous signal. So you need like many more sinusoids to represent that particular signal as close to the actual signal as possible.

And here also you can see that, OK, if this was a square wave, then it is actually made up of a lot of sinusoids, where each of the bars here represents the strength of a particular sinusoid. From an optimization perspective, I mean, this right away is suboptimal, right? Because you're kind of fixing up the number of sinusoids you're using for representing a square wave.

I would have rather used a basis function which was a square wave itself than a sinusoidal signal. The second thing is even if you are taking a sinusoidal signal, we kind of are just putting them in an equidistant space. So you're kind of dividing the whole frequency axis into equidistant bins.

And each of the bins is responsible for a particular sinusoid. So that is like a traditional Fourier representation for representing any signal. So what are spectrograms? In reality, all of these signals are discontinuous. All of these signals vary quite a bit, right? So you can have a signal while I'm speaking which is like a square wave for a certain period of time, and then it gets sinusoidal, and then it becomes something else.

So what we really need is in a way to kind of take batches of input signal and take Fourier transform of these individual batches. I'm deliberately using the word batches, but you can-- in traditional terms, you are windowing the signal. So right here, you can see that you have a continuous signal.

You keep on windowing it. You apply the Fourier transform, and what you get is basically like a spectrogram representation of the signal. So right here, what you're seeing basically is for each of the slices, the signal kind of look like this after taking the Fourier transform with the waveform which is there below.

And what you do for a spectrogram representation is you keep on stacking these Fourier transform slices, the magnitude of the Fourier transform slices. And in this way, you kind of get a 2D representation of audio signals. And if you're coming from a vision background, basically all of the things which you're doing in vision would just work well if you apply them to these 2D spectral representations.
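As a minimal sketch of that windowing-and-stacking procedure, here is one way it could look in NumPy; the Hann window length and hop size are assumed, illustrative choices rather than anything specific from the talk:

```python
import numpy as np

def spectrogram(x, win_len=1024, hop=512):
    window = np.hanning(win_len)
    # window ("batch") the signal into overlapping slices
    frames = [x[i:i + win_len] * window for i in range(0, len(x) - win_len, hop)]
    # magnitude of the Fourier transform of every slice
    mags = [np.abs(np.fft.rfft(f)) for f in frames]
    # stack the slices side by side -> a 2D (frequency x time) representation
    return np.stack(mags, axis=1)

x = np.random.randn(16000)      # stand-in for one second of audio at 16 kHz
S = spectrogram(x)
print(S.shape)                  # (513, number_of_time_slices)
```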

I'll quickly play how these spectrograms look for a wide array of common sounds. So you can see that for spectrograms, you have kind of like a time axis on your x-axis. And then you have a frequency axis on the y-axis. And then for whatever is your signal of interest, you're basically putting these slices together.

And different sounds give you different spectral representations. So it's kind of a vision problem, just in this sort of Fourier space. So there can be different kinds of representations also. So one, you could just take these slices of the Fourier transform and then do a linear mapping on them so that you're kind of, in a way, making these as close to how humans hear.

So you can have the log of the frequency on the y-axis instead of linear frequency. And then you get a constant-Q-like representation. The advantage of this being that, for different frequencies, the spacing between the harmonics kind of remains the same. So if you're training convolutional filters, then that's of a huge advantage because one component of the invariance is gone.

And you can just learn these filters which are catching onto these constant templates of Fourier slices. You can have mel filter bank coefficients, or you can have the raw waveform also. For raw waveforms, basically there are two things which we have to keep in mind. One is the sampling rate.

So we kind of take the continuous signal and then we discretize it. So one parameter is how fast we are sampling the continuous signal. That's typically on the order of 16,000 times a second, or 8,000 times a second if you're on telephonic speech. The other thing which we also consider is how many levels we are dividing the vertical axis into.

So in this case, you can see that each of the dots is basically one level. And typically, people use 8-bit quantizers or 16-bit quantizers. So in a way, you can think about it as: for every one second of audio which we would hear, you would have 16,000 samples. And each of the 16,000 samples is allowed to take one of the levels between 0 and 255.
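Here is a minimal sketch of that sampling-and-quantization picture, using plain uniform quantization for simplicity; WaveNet-style models usually use mu-law companding instead, which is not shown here:

```python
import numpy as np

sr = 16000                        # 16,000 samples per second
levels = 256                      # 8-bit quantizer -> levels 0..255
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone in [-1, 1]

# map [-1, 1] to the integer levels {0, ..., 255}
q = np.clip(np.round((x + 1) / 2 * (levels - 1)), 0, levels - 1).astype(np.int64)
print(q.shape, q.min(), q.max())  # (16000,) 0 255

# audio is now a sequence of discrete tokens, which is exactly the
# language-modeling setting described above
```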

And that's like, if I can take the problem of continuous audio and just have it in terms of this sort of discrete space, then basically I'm just going into the territory of doing language modeling. So the first paper I discuss is how we can do generative modeling for raw audio, similar to WaveNet, using transformers.

I'll be putting QR codes if you like the stuff what I'm doing. And if you think that this is relevant to you, please cite or please have a look in terms of the QR codes. So yeah, so I'll start with the first subtopic of today's talk, which is like what are WaveNets and how do we do this generative modeling over raw audio?

So in a single word, you can think about this as doing language modeling over these 256 states of audio. So you can throw in your favorite transformer model, like Transformer-XL or GPT or whatever you want to call it. And just treat the problem as if you are trying to predict one of the levels out of 256.

And you have to predict the next level given a certain context. That's what WaveNet was doing. So the way you are modeling the probability distribution of a continuous space is basically you're trying to predict what's the probability of the next sample given some past context. And WaveNet has been hugely popular because it has over 3,000 citations and it has been a core building block for almost all speech and audio related problems.
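In symbols, this next-sample setup is just the standard autoregressive factorization over the quantized samples, with x_t denoting the discrete level of sample t:

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p\left(x_t \mid x_1, \dots, x_{t-1}\right)
```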

You can think about speech to text, text to speech synthesis, instrument conversion, packet loss concealment over the internet, speech denoising. So wherever there's some sort of element of modifying audio, people have been using WaveNet as a core building block. And raw waveform synthesis has been difficult because just the magnitude of the problem, if I'm just trying to synthesize 10 seconds of audio, it would just amount to me having a probability distribution over 160,000 samples.

And that itself is tough because our ears are very, very sensitive to subtle changes. If I'm off by one pixel in an image, my eyes would not be as susceptible to noticing that effect versus if I'm off by, say, a few samples in an audio, it would just catch our ears pretty quickly.

People have been trying raw audio synthesis a lot in the past. And before all of the WaveNet and transformer-based approaches, WaveRNNs and SampleRNNs were kind of like state-of-the-art models. On the right, I've shown a SampleRNN model, which kind of models the probability distribution of what's going to come next given the past at multiple levels.

And this was work done by Yoshua Bengio at Mila. But you can closely see, if you just see this architecture versus a transformer architecture, in a way, these are starting to get very, very similar. Because what you're trying to do is that for the probability distribution here, you're trying to see a lot of local substructures.

And then you keep on doing it over and over again. And you can draw parallels, like attention mechanism should also kind of be doing the same thing. So this was kind of like the literature in the past. What we tried to do was we just had the WaveNet model.

And we tried to see whether transformers can beat them. And our intuition was that they should be able to beat them, because they are successful all over the other domains, like in language modeling. So they should do that for raw waveforms also. We also tried to see whether we can circumvent the order n squared constraint by conditioning on the context itself.

And we did not go for specific applications. And we just said, OK, just in terms like modeling behavior, how will they do? So the data set for this was just like real-world kind of recording. So actual sound should not matter because the model is agnostic to what it is being thrown in.

And the setup was exactly the same: you are given a certain context, and I have to predict the next sample. You do the same thing with WaveNets. You do the exact same thing with a transformer-based, GPT kind of model and see how well they do. I'll briefly chat about what WaveNet models are.

So WaveNet was kind of like a convolution-based model, which was getting rid of all of the vanishing gradient problem by just treating a sequential problem as being learned by a convolutional model. So what they did was basically have this sort of dilation layers, or convolution with dilations, which is basically I kind of skip in every subsequent layer by one sample.

So you can see if I have a dilation factor of 2 with a kernel size of 2, I would get this kind of a topology where my convolution filters in the very first layer are just combining the first two samples. Then I skip by one in the next layer.

And then I skip by three, which is like I look at the fourth one in the next layer and so on. The loss is still the same. So I have this network. I learn a latent space. And then I have a categorical cross-entropy loss, which is basically I have to predict the next sample given the previous one.
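For concreteness, here is a minimal PyTorch sketch of a stack of dilated causal convolutions in this WaveNet style, with kernel size 2 and the dilation doubling at every layer; the channel width and depth are illustrative assumptions, not the original hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=64, layers=6):
        super().__init__()
        self.dilations = [2 ** i for i in range(layers)]   # 1, 2, 4, 8, ...
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in self.dilations])

    def forward(self, x):                           # x: (batch, channels, time)
        for conv, d in zip(self.convs, self.dilations):
            x = torch.relu(conv(F.pad(x, (d, 0))))  # left-pad so no layer sees the future
        return x                                    # same length, growing receptive field

x = torch.randn(1, 64, 1600)                        # roughly 100 ms of embedded audio at 16 kHz
print(DilatedCausalStack()(x).shape)                # torch.Size([1, 64, 1600])
```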

And I just do the exact same thing with transformers also. But then I have to make sure that I do it in a causal manner. So I have something which is very similar to GPT, in which I have causal masks in my attention mechanism. And I keep doing it over and over again.
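And the causal masking itself is the standard GPT-style upper-triangular mask, where position t may only attend to positions up to t. A minimal sketch:

```python
import torch

def causal_mask(seq_len):
    # True marks positions that must NOT be attended to (the "future")
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

print(causal_mask(5).int())
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0]], dtype=torch.int32)
# a boolean mask like this can be passed as attn_mask to
# torch.nn.MultiheadAttention, where True entries are masked out
```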

So you have self-attention. After that, you have feedforward layers. You just have a stack of these transformer blocks and see how they do. So I said intuitively it should work. It should be doing better than our baseline WaveNet models. Because if you look at the topology, we are kind of defining a topology on our own, right?

So what if the current prediction at, say, layer one were to depend on a sample from way back, say, the 10th sample instead of the second sample? So we are kind of ignoring all of that topology, which would have been important for prediction in this particular task. Whereas transformers with the self-attention mechanism can just learn, like, OK, which parts of the samples are important and which are not.

And you can keep on doing it iteratively. So it made sense to us that, OK, transformer layers should be doing way better than WaveNet models. The second thing which we came across was, OK, we cannot have a lot of context. For example, the attention mechanism needs to store all of those matrices of order n squared.

So in this case, if I'm storing 100 milliseconds of data, then I have about 1,600 samples. And I need to store 1,600 by 1,600 at multiple layers. And it just becomes a huge problem with the memory constraint. So what we said was, OK, what if we just use the context itself as a latent code?

So in order to have much better representation at every layer, we cannot have huge, big attention matrices. So what we said was, we would just do a sample-wise conditioning and throw in CNN layers just to understand what the latent code would be. So you still have, like, an attention mechanism over just a past context.

But then I'm also conditioning at every sample, OK, what the next sample should be given on this context embedding. And if you think about it, in a way, it is like, OK, if there are, like, five or six notes being played in a piano, then I'm kind of certain which notes will be played to a certain extent if I just throw in a CNN layer.

So I'll use that information along with what my transformers are learning. And then I would condition on it. And I would just use that to predict the next sample. So for the evaluation criteria, we did not look for negative log-likelihood scores. We just looked at how well our prediction task was doing.

So we took a, like, stacked WaveNet, which was implemented by DeepMind, and saw that, OK, what was the performance using their benchmarks and even, like, bigger stacked WaveNets. We then started to increase the complexity of transformers and started to see whatever we had proposed in terms of, like, conditioning on the vanilla transformer architectures to see how well they do.

We did not look for, like, an application-specific problem, which is basically, like, we don't look at, like, how well perception tasks are for, like, say, text-to-speech synthesis or speech denoising. We just look at, OK, if we are trying to model this using a cross-entropy loss, then with the same model, with the same loss function, how well they do on, like, similar kind of parameters.

So this was the first kind of sub-block of how we can use transformers for generative modeling. For the second problem, I'll quickly make headway into how we can use transformers for doing language modeling, which is kind of becoming a really fancy term right now.

And this work was done with Julius Smith way back in 2020. And the goal of this was, can we kind of, in a way, do language modeling with continuous audio sequences? And I'll briefly mention that in this sub-block of the talk. And this is in regard to solving acoustic scene understanding, which is basically, like, if I'm given a chunk of audio, then I want to understand what's in there.

And if we could do that well, then in a way, we can do a lot of fancy, nice applications. So for example, like, if you think about, like, self-driving cars. So Waymo has started to incorporate microphones into their self-driving cars. Why? Because, say, if there is an ambulance coming, or if there is a fire truck coming, then that sound would be picked up way, way before even the LIDARs or even their sensors.

So they want to understand that and take actions based upon that. Apple, during COVID, did a hand-washing detection on their Apple Watch. Because if you could detect when someone is washing their hands, then you can, in a way, like, tell people that, oh, you need to wash hands for 20 seconds.

And then that can be built upon as a cool application. It can be used for music recommendations. So Spotify and YouTube Music kind of give you very, very good songs, similar in content to what you are listening to, which you would perhaps like. It can also give, like, really cool applications.

Like, say, people have tried detecting depression from audio. Or I could detect whether I'm coughing or not, or I'm sneezing or not. And these can be, like, good medical applications, which can be used along with the current diagnosis that doctors provide. So the question for us was basically, like, how can we do language modeling in a continuous audio domain?

And secondly, like, how can we train models, or how should we approach doing this? So this kind of recipe has become, like, very, very popular these days in terms of how you would approach this problem. It started with OpenAI, and to a certain extent DeepMind, proposing that in terms of VQ-VAE models.

But it turns out, like, transformers love operating in discrete spaces, as of now. And what they kind of do is, as long as your representations are discrete, they are very, very good at modeling what's going to come next. So what people have been proposing as a workaround is you could take up, like, your favorite embedding in some manner.

You could take VQ-VAE embeddings, or you could take wav2vec, or in terms of video, you can just do classic VGG or ResNet embeddings. You can apply k-means clustering to it. And k-means clustering would give you, like, discrete codes. You do language modeling with those discrete codes, and you predict the next code.

And in a way, if you're doing this, then you're kind of doing language modeling over audio. And if you need to get back to the audio, then you already saw with WaveNet that you can condition the WaveNet model to give continuous output. So you can use those codes to get back to the audio, similar to what Jukebox from OpenAI did.

So I'll quickly mention about what vector quantization is. It's one of the most underutilized algorithms, to be honest. And what it does is basically gives, in a way, discrete codes to continuous embedding spaces. So how does it do it? So you basically have an embedding space, let's say, in 2D right here.

You define what are the number of clusters you want to put each of them in. You run k-means, and you would certainly get these patches of where all of these embeddings are, what would be the representative embedding of a continuous embedding. You can take all of those patches, and you can just number them, or you can just list them.

So in this case, you can perhaps have 25 numbers, or 20 numbers, which are, in a way, mapping from a continuous embedding to a discrete token. This is another example right here. So in our case, what we did was we took a batch of spectrogram, which are basically very small patches across time, and then shared all across the frequency axis.

You take those patches, you learn the embedding representation. In our case, it was just like three-layer autoencoder, fully-connected encoders with three layers of decoders, and have a bottleneck layer in between. So that bottleneck layer basically is kind of similar to this kind of diagram in, say, 64-dimensional space or 120-dimensional space.

You take up those bottleneck codes, and then you run k-means clustering on it. Suddenly, in a way, you can find discrete codes for continuous embedding spaces or even continuous segments. And since we know that transformers kind of love operating in discrete spaces, you can just apply language modeling now, and then you can see what you can do.
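A minimal sketch of that recipe, with random vectors standing in for the learned autoencoder bottleneck embeddings; the embedding dimensionality and the number of clusters are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
bottlenecks = rng.normal(size=(5000, 64))   # stand-in for 64-d bottleneck codes of spectrogram patches

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(bottlenecks)
tokens = kmeans.predict(bottlenecks)        # one discrete code per patch
print(tokens[:10])                          # a "sentence" of audio tokens in {0, ..., 15}

# these token sequences are what the transformer language model is trained
# to predict, one code at a time
```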

So in our case, we just had a very simple three-layer, fully-connected autoencoder and small patches. The number of codes is important, because if you have too many codes, then you're kind of just throwing in all kinds of noisy things. I'll give an example later of why the number of codes is important.

And if you have too few codes, what you're, in a way, doing is removing all of the information which was relevant, and just kind of averaging it all out. So this idea was first proposed by Jukebox, which did it for music. So you do the exact same thing, what I talked about, in a slightly different manner.

In a way that, OK, you cannot learn codes for longer sequences. So in a way, learn sequences which are just moving slowly and which are looking at only a certain amount of audio. So you kind of encode this in these discrete levels, which are basically like-- all of these basically are codes.

So at every point, I define, OK, this audio had, perhaps, code number 55. And in the next level, perhaps, it had code number 2. And in the very top, perhaps, it had code number 2,000. So in a way, I'm discretizing the whole codes. Now what I do is I take up my favorite transform model, perhaps like a causal autoregressive one.

And I say that, OK, given these codes, try to predict what codes would come next. And for sure, transformers can do that. So I would generate the codes in the future. Once I've generated the codes in the future, I can say that, OK, this problem now is kind of like a text-to-speech problem, because I have these discrete codes.

Text-to-speech, in a way, is going from discrete letters to continuous audio. So I would throw in the fanciest model, which was WaveNet. And I would just decode back from the codes. And I would get the generated audio. So this was, in a way, what I described, that they take up a continuous audio.

They have these compressed codes, which they encode using a CNN in this case. The method doesn't matter. You can throw in the fanciest of embedding or latent representation on those continuous code. You generate the patterns, which are like, what's going to happen next in the future? And then you decode back using a fancy WaveNet or state-of-the-art model.

So this was what they were doing for music synthesis. What we said was, yeah, this is good. This can generate a good amount of music. But can these models be used for generating good representation of the current audio? And the goal there was, can language models learn representation, which can just encapsulate whatever we are giving as an input signal?

So in this case, what we tried after that was you do exactly similar ideas. But instead of doing on VQ-VAE end-to-end learned encodings, we just apply vanilla k-means clustering, similar to what I described earlier. We do on spectrogram patches. So you take up these spectrograms of audio, and you just divide them into very small chunks, learn autoencoder encodings for each of those chunks, run k-means clustering.

In this case, let's say I am learning 16 codes. Represent the continuous audio in terms of the 16 codes. Have a transformer which can perhaps predict the next code. And if I keep on getting better and better at predicting what's going to happen next, then in this linear layer, I should be encapsulating what's important or what's a good summary of what has happened in the past.

So that was our intuition behind trying this. And as I explained, the number of codes plays a very important role. You can see here, these are just two piano notes switching one after the other. If I just have 16 codes, it just happens to assign a single code to all of this.

Whereas if I'm assigning more codes, then it becomes a fine-grained prediction where I'm actually able to get what the individual notes are. Recently, Facebook also did this; they just gave a different name to the whole thing, which is that we can call this textless NLP, in the sense that, OK, you can do NLP without having access to text.

But the idea is very, very similar. You have an encoder, which is exactly similar to, say, what OpenAI was using. You have a VQ-VAE, wav2vec, or whatever you want to do. You can apply k-means clustering to it. You apply language models to it. And instead of the decoder being WaveNet, they just have a decoder which is a different version of text-to-speech, which is Tacotron in this case.

So as you can see, these are all the same wine in very different bottles. But the core idea is almost exactly the same. So this created a huge uproar that this is going to change NLP. But this is very, very similar to what people have been doing in the past.

So I've already explained what this was. So in our case, we just try to predict what's going to happen next given the previous context and use that representation, similar to every one-shot learning or zero-shot learning-based method. I also explained why the number of codes is important.

If you have too few, then you're just throwing away a lot of information. If you have too many, then it is no longer robust to noise. So this was our setup. And before I jump in, I should add one of the tweets which I saw from one of the most prominent researchers at DeepMind, which is basically that a lot of times it is very, very easy to bump up numbers.

I can have details just not present in my paper which actually help a lot in terms of improving the performance. And sometimes people don't take into account what the actual model is contributing versus what these training tricks are contributing. So for most of these methods, what we are trying to see is, we try to keep almost exactly the same approach.

No data augmentation, no fancy label smoothing, or moving average of weights, or decay, or whatever. You just have similar basic recipes to see how well we are doing. For this case, the goal was to see how well our models do with respect to a purely supervised approach and how well they do with respect to a similar unsupervised approach.

So in the first case, the model and all of the weights have access to all of the labels, which is just shown as VGG supervised, which is basically you take up an audio understanding data set and you see how well you're doing on accuracy metrics. So that was the first one.

In the second one, we applied SimCLR, which was proposed by Geoff Hinton's group, in which you can take up multiple augmentations of the same input. You can have patches removed. You can blur the signal. You can flip the signal. You learn an embedding out of the last layer without access to the labels, and then just have a linear head to predict what's happening.

By using that, we got a 55% accuracy. You do the exact same thing with transformers. You don't have access to labels. You just run them to predict the next code. You take the linear layer, apply the same linear head, and try to predict what's happening inside. And with that, we got 60% accuracy.
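The evaluation described here is a linear probe: the learned representation is frozen and only a linear classifier is trained on top of it. A minimal sketch, with random arrays standing in for the frozen features and the labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# stand-ins for frozen embeddings (e.g. from SimCLR or the transformer) and labels
train_feats, train_labels = rng.normal(size=(1000, 64)), rng.integers(0, 10, 1000)
test_feats, test_labels = rng.normal(size=(200, 64)), rng.integers(0, 10, 200)

probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)   # linear head only
print("linear-probe accuracy:", probe.score(test_feats, test_labels))
```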

So even though the results are not great, the fact is that neural networks are actually very, very good at getting better and better when you throw huge amounts of data at them. So there's still a 10% gap between purely supervised and purely unsupervised. But that's going to improve with throwing a lot of data at these models, because the model doesn't need access to any labels per se.

So this is a famous paper by Dan Ellis and Nelson Morgan at Berkeley, in which they actually showed way back in 1999 as to why size matters for deep neural networks and also the number of data points which is present. So as they kept on increasing the size of the data set and the parameters, they kept on getting lower and lower word error rates.

And this has been true across any of the data sets. And that's why the whole excitement is about unsupervised learning. So this was, in a way, a flavor of how we can do language modeling and unsupervised learning on audio for continuous signals. For the third sub-block, I'll just quickly mention ideas which are very similar to what you would have seen with vision transformers, but with the caveat of how we can use some sort of signal processing to improve the performance even further.

So the basic approach still remains exactly the same as what you would have seen in vision transformers. You have a signal of interest which you want to classify. Here, they are raw waveforms instead of images. The goal is to predict what's there inside of it. And also, we don't have any convolutions.

We don't have any other tricks which we were using before. All we have is the transformers themselves solving this particular problem. For the data set, the whole setup was still the same. No data augmentation and no other forms of these tricks. You are given about 40,000 snippets for training and 10,000 for validation.

Our job is to predict as well as possible what's there in the audio. This problem is very similar to the sound which you heard and the video which you saw: given a spectrogram patch, you have to predict what's there inside of it. We kind of go one step further than just a simple transformer model.

In a sense that we try to see whether some sort of hierarchy over transformer embeddings would help us in any manner. So for that, we use wavelet decomposition on the intermediate transformer embeddings. So what is a wavelet decomposition? In very naive terms, it can be like a way of decomposing the intermediate embeddings into another intermediate embedding, in a sense that we are kind of putting these highways of like some embeddings are moving very slowly and some embeddings are moving very fast.

And some embeddings are retained exactly at the rate of what the original signal was. And why this is important? Because you can think about that at every intermediate state, you are in a way learning some sort of hierarchy in the model. So if I look at what we do with the wavelet decomposition before and after, let's say you had time across this and you had the embedding size across this and this whole patch was your output of, say, the nth layer of the transformer.

What I say now is, OK, I would just have a mapping from this to the mapping of my interest using wavelet decomposition, in which for half of the samples, I just retain the exact same embedding as what was learned by the transformer model. In the next half, I would start combining two at a time.

So in a way, I'm learning this sort of tree structure within a single layer of the transformer embedding. And for now, the wavelet or the basis function which I use is simple averaging. So let's say from all of the embedding layers in between, I just need to have one embedding which is not moving at all, which is just representative of whatever is there in the whole latent space in that nth layer.

Then in the next level, I would just use two at a time and then I would use four at a time until I reach the exact resolution as what I had. Doing this operation doesn't add any parameters whatsoever. You're just defining what your basis function would be, or what your wavelet function would be.

In this case, it is a Haar wavelet. And I start combining them, and I learn a hierarchy at every single layer of the transformers. And this improved our performance significantly as compared to not using them, with no extra parameters added. And I'll come to the results later also.
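A minimal sketch of that Haar-style averaging hierarchy applied to one layer's output; the sequence length and embedding size are illustrative, and note that no learnable parameters are introduced:

```python
import numpy as np

def haar_pyramid(h):
    """h: (time, dim) output of one transformer layer.
    Returns levels that average 1, 2, 4, ... consecutive time steps."""
    levels = [h]                             # level 0: the original resolution
    while h.shape[0] > 1:
        h = (h[0::2] + h[1::2]) / 2          # Haar-style averaging of adjacent pairs
        levels.append(h)
    return levels

h = np.random.randn(16, 64)                  # 16 time steps of 64-d embeddings
for lvl, z in enumerate(haar_pyramid(h)):
    print(lvl, z.shape)                      # (16, 64), (8, 64), (4, 64), (2, 64), (1, 64)
```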

So this is how the whole approach looks. You have a front end. The front end is basically a single layer of 2,000 neurons followed by a dense layer of 64 neurons, which is just to make sure to conform it to the intermediate transformer embeddings. Let's say if, for the transformers, I define the embedding size to be 64, then that's the dimension which I'm mapping them to.

So I take a raw waveform. I patch it into very small patches, similar to how you do in vision transformers. I would just have a single layer of 2,000 neurons followed by a dense layer of 64 neurons, with the hope that the first layer is learning something like a Fourier basis function, which should be adaptable according to what I'm learning.

After that, I keep on doing this over and over again. I don't have a classification head or anything like that. I keep on adding multiple stacks of transformers after that. And then I have two approaches of what I can do in terms of adaptation. I can do average pooling across time of these intermediate embeddings, because the idea is very similar to what we do in classical vision, that each of the embeddings are looking at much, much broader output in the subsequent layers.

Or I could do a wavelet decomposition. So what I do is that I take all of these embeddings and I define these highways. So some of the embeddings move fast. Some of them are moving very slow. And some are retained at the exact same resolution as what the transformer is learning.

And then I keep doing this over and over again. I have a dense layer. I have my softmax or sigmoid, whatever is my classification head. So this is kind of what the approach looks like. We compare it with all of the traditional vision-based architecture. So the vision-based models have been very good.

And the performance has been similar for understanding audio also. So we compare all of those models in terms of mean average precision. And we see that even the tiniest transformer models were just surpassing all of the state-of-the-art CNN models, which was a very good sign. Then we started to bump up the model size.

The larger models should keep on improving the performance. And with the multi-scale models, as well as with the pooling layers, the performance improves even further, which was kind of very surprising to us because the number of parameters is very small. These are very tiny architectures. Yet they are surpassing things like DenseNet, which are huge models with many millions of parameters.

So after that, we said-- and I'm going to conclude quickly. After that, we said that, OK, this is looking pretty cool. What actually is the transformer or the first-layer learning? So in order to make this plot, what we said was, OK, if you were to take a classic Fourier transform, then this axis is kind of like frequency.

This axis is the number of filters. And this axis is the frequency. Then in a way, it should be connecting all of the points in a straight line. And this is akin to the number of points in the FFT. So how many points am I defining here? If I'm defining 2,000 points here, then I would have 2,048 sinusoidal basis functions, which are going from the lowest frequency to the highest frequency.

We said, OK, we'll do the exact same thing, but now with the learned filters. So we have frequency along the y-axis and the number of points on the x-axis. And if it was a classic Fourier transform, then it would show up as a straight line. But what we did was we take the front end which is learned by the transformer, take its Fourier transform, sort according to its center frequency as to what frequency it is activating the most, and then keep on stacking them.
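A minimal sketch of that visualization recipe, with a random matrix standing in for the learned front-end filters; the number of filters and the filter length are assumed for illustration:

```python
import numpy as np

filters = np.random.randn(2048, 1024)            # (num_filters, filter_length), stand-in weights

spectra = np.abs(np.fft.rfft(filters, axis=1))   # magnitude response of every filter
center_bins = spectra.argmax(axis=1)             # "center frequency" = strongest bin
order = np.argsort(center_bins)                  # sort filters from low to high frequency

sorted_spectra = spectra[order]                  # rows now sweep low -> high frequency
print(sorted_spectra.shape)                      # (2048, 513)
# for a plain Fourier transform this sweep would be a straight line; a learned
# front end can bend it toward the frequencies a given task cares about
```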

When we did this for two problems, we saw that we are learning a different time frequency representation, which is specific to a particular problem. So if I'm trying to understand what's there in the content of the audio, I learn a representation which is very different than Fourier transform, which would have been a straight line, which is like a curved exponential line like this.

And if I do a polyphonic pitch estimation, I learn a very different front end, which is adapting to that particular problem. So this was very exciting to us because making computers hear in a way in which they are adapting their ears according to a particular problem is a very cool idea.

Second thing is we actually saw each of the filters as to what they were doing. And these are basically just single slices like this. So this is what we would have learned as a front end neuron. So we take up each of the neurons and we just plot them.

And for plotting this, we basically take a Fourier transform and then sort them according to where the center frequency is. When we just saw the neurons as to what they were learning in the front end, we saw that it is learning properties which are very, very closely matching with the traditional signal processing.

So you would have something like an onset detector learned right here. You're learning windowing functions. In a way, it is learning to have a kernel which is best for a time-frequency representation, what people have been using in signal processing, which is like a Hanning or a Hamming window.

We are learning these pure sinusoids which are responsible for activating a particular frequency. So you can see the richness as compared to having a fixed, purely sinusoidal basis function right here. So this was what we had done. And then to share the final thoughts, I'll conclude by saying that, OK, transformers are proving to be a major advancement in AI research across fields.

And it seems like they're solving everything for now. And hopefully, this is not the end. And we should keep an eye out for something which would come along and have an impact even greater than what transformers have had. And who knows what's going to come next? Yeah, so with that, I'll just conclude.

And I'll be happy to take questions. Thank you, Prateek. That was a really good talk. And you provided some really good insights about how transformers work for the audio case. And yeah, thank you for the talk. And now I would invite questions from the class students. Let me just stop the recording.