
Stanford CS25: V4 I Transformers that Transform Well Enough to Support Near-Shallow Architectures


Transcript

>> Today, for our talk, we have Professor Jake Williams from Drexel University. He is an Associate Professor of Information Science at Drexel University's College of Computing and Informatics in Philadelphia, Pennsylvania. Dr. Williams has a background in physics and math with degrees from the University of Vermont, and his research leverages a quantitative linguistic perspective that applies math and statistical methodologies to analyze and improve linguistic learning systems.

Following a one-year postdoc appointment at the University of California, Berkeley, studying large-scale machine learning, in 2015 Dr. Williams joined the data science faculty at Drexel, where he drove the foundation of the DSMS program and develops and instructs data science coursework, including natural language processing with deep learning. So, welcome, and thank you for coming today to give your talk; you can do a quick introduction of yourself before you start.

>> Great. Thanks so much. I got the mic here. Nice to see you all here. Thanks for coming out, and also for showing up online. It's a pleasure to be here. As was mentioned, my name is Jake, and my background's in math and physics, so the perspective that I'm coming from towards this work might be a little bit different than the standard, and that'll be a theme throughout the discussion.

The purpose of this discussion is to go through a relatively long-term development, a project that I've been working on, and as mentioned, my background is in quantitative linguistics, which means my history of focus on language has primarily been to develop general theories and descriptions of phenomena that you observe with regards to linguistic units, whatever those might be.

It's a statistical approach based on theories of language generation that are statistical in basis. Over the course of my time as a researcher, I've explored and ventured into language modeling itself, and ultimately into neural networks as they approach language modeling, and that's what brought me here through quite a bit of other work. So if you look into my profile, you'll see a lot of different subjects in applied NLP and, like I said, quantitative linguistics, and neural networks were a natural transition for me into inferential work. So let's get started.

So well, this is how we'll start the conversation today. It's not exactly how we got here in my lab. We came at this subject from a different approach, trying to think about layer initializations in neural networks, and this subject that we're discussing as a front for this talk is specifically focused on transformer architecture components, the self-attention component that's pivotal to the success of the transformer architecture, and it focuses on the fact that self-attention requires a quadratic comparison of vectors in order to produce the feature weights of those vectors needed to model long-range dependencies in text.

Commonly, parameters for self-attention are based on a transformation matrix, two, usually, queries and keys, that are responsible for dimensionalizing input vectors, and I describe it this way because generally speaking, when you're at the point of a self-attention layer, you already have low-dimensional vectors, but the parameters in a standard self-attention layer are changing the dimensionalities and the structure of that dimensional space.

They are like an embedding layer, which is factorizing the embedding dimensions. This redimensionalization is the primary means by which self-attention creates feature weights. It really just computes similarity in that shared space. Large and similar inner products really just result in strongly weighted features, so it's up to that dimensionalization to produce good similarities for whatever purpose your prediction requires.

However, an alternative strategy for feature weights might ask: given a basis, in other words, given that you're stuck with your low-dimensional vectors, what is the optimal matrix transformation to convert the comparisons of those vectors, the similarities you're stuck with, into the best weights for features?

In other words, treat this as a feed-forward layer that produces self-attention weights, as opposed to trying to transform to some basis that produces good feature weights. The use of this modified self-attention mechanism will be part and parcel of the substance of this talk. It's worth noting that this alternative mechanism is entirely compatible with the traditional dimensionalizing version of self-attention.

In other words, you could still change the dimension and compute similarities and then convert that with a second feed-forward layer to produce optimal feature weights. This is not exclusive in any way. This is exploring how useful that alternative prediction of feature weights can function. However, we'll avoid the standard mechanism for two reasons.

First, we have no solution to the standard parameters for self-attention as an initialization. And this will be discussed at length in slides to come. Second, it would create additional model complexity that would muddle the effects of the modified form of self-attention that we wish to study. So having that dimensionalization as a way to produce good feature weights would confuse whether or not the feed-forward computation of feature weights is functioning well.

There's a catch to this, however, which is that these vectors that we use for such a self-attention layer better be good. In other words, their comparisons must be consistent and meaningful in the first place. So to get it out of the way, here's an architectural diagram for the relatively simple near-shallow architecture pattern that we're using.

It doesn't seem like there are many neurons in a network of this type. And that's because all of the activations are softmax, which means despite the fact that the U matrix, for example, is an entire layer, it's really just going through a single prediction non-linearity, the softmax function. So you can think about this as essentially a three-layer network that might be creating an encoder-decoder kind of design.

Likewise, the difference in presentation here over self-attention, which is parameterized by the matrix W here, is intending to show how a-- whether you consider it the query or the key-- one vector is the pivot for the comparison that will produce the feature weights, which is then fed forward in this model through W.

This is the case for standard self-attention, too. In other words, you can reduce it to a by-prediction diagram in this way, where a gray vector, such as is depicted here, is that pivot. The attention distribution coming out of the W matrix and the softmax function is indicated by the vertical red bar there, which weights the block of vectors in black.

That includes the pivot vector in gray, which is then passed through a feed-forward layer, often called the values of a standard self-attention matrix, U. We then-- since we use U as a way to reduce the dimensionality of the prediction that we're trying to make, we then feed that forward through another layer and then to output.
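To make that forward pass concrete, here is a minimal sketch of how I read the diagram; the variable names X, W, U, and O follow the slides, but the shapes and the exact placement of the softmaxes are my assumptions rather than the authors' code.

```python
import numpy as np
from scipy.special import softmax

def forward(X, W, U, O):
    """Sketch of the near-shallow forward pass: X is a (T, d) block of
    non-negative input vectors, W is a (T, T) modified self-attention matrix,
    U is a (d, h) values/dimension-reducing layer, and O is an (h, V) output
    layer over a vocabulary of size V. All activations are softmax."""
    sims = X @ X.T                      # raw pairwise comparisons of the block with itself
    A = softmax(sims @ W, axis=-1)      # attention weights: similarities fed forward through W
    context = A @ X                     # attention-weighted combination of the block (includes the pivot)
    H = softmax(context @ U, axis=-1)   # hidden state from U, the values-like feed-forward layer
    return softmax(H @ O, axis=-1)      # output distribution over the vocabulary
```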

And that's essentially the relative shallowness that we're talking about here. U is a self-attention matrix, which means there's really only two layers in effect here. And the activation functions are strange. And you might wonder, for example, why we're using a different activation function, the softmax, instead of any of the dimensionally independent activation functions, like a logistic function or anything else.

And that's because we have additional insight into the softmax function and the parameters that it optimizes, which is very useful. So let's talk about those vectors first, though, before we get to layer initialization. Optimizing the keys and queries of standard self-attention bears substantial similarity to token and word embedding.

This is because the key and query matrices have a common dimension that they project to, much like you'd see with the factorization of an embedding layer on its own. Think Word2Vec, something like that. These normally-- there might be multiple self-attention heads. And because of the indeterminacy in creating a different dimensional space-- in other words, there are multiple equivalent reshufflings of those different dimensions which will produce the same output-- that indeterminacy is something that we hypothesize has bearing on what is now referred to as the lottery ticket hypothesis.

In other words, that multiple-- or this is the way that I would state it-- but that multiple different embeddings which produce different vector spaces can be leveraged in parallel to create further robustness for the model. Or in the way that it's implemented, that if a random initialization doesn't do that well, you can eliminate it from the network.

And that sub-network will do just as well, even after it's totally trained. In other words, having multiple clones, self-attention heads, which have no difference in the outputs that they're trying to predict, is at the root of the lottery ticket hypothesis. And ultimately, that invocation of the lottery ticket hypothesis is really a justification for eliminating parameters whose substantial training costs are essentially wasted as a result of random parameter initialization.

You might ask questions like, well, what is a good initialization? What is a good set of word embeddings to use? So how the lottery-ticket-style interactive effects of randomly initialized embedding layers can be avoided when constructing language models is another question embedded in this discussion. But we shouldn't say that dimensionality reduction isn't needed.

It's incredibly necessary. For language modeling, you absolutely have to work with reduced dimension unless you're working with a very small vocabulary, for example, the 26 Latin characters or something like that, like wav2vec. The inherent input dimension of a large vocabulary model presents many computational intractabilities when designing NLP systems, something that you're probably all very aware of.

Likewise, though, the distance from embedding layers to learning information, the loss at outputs, puts them in a challenging position to train. It's really hard to learn embedding layers because of the indeterminacy in the space that you're trying to learn. You could swap dimensions, and it's equivalent. But the distance means that they receive learning information last.

This is a real challenge, and it's present in the history of NLP and deep learning, too. Vanishing gradient stuff. And this is exacerbated in the way that we have to actually learn embedding layers in standard models where we might modify learning rates to be lower all the way back at the bottom of a network to be gentle with those embedding layers and help them learn effectively.

But this is really trouble because if we had a good embedding layer at the start, those subsequent layers could be much easier to learn. So ultimately, in order to approach this challenge, we came along with a discernibility hypothesis. In other words, this boiled down to the theory that low-dimensional vectors, more than anything, needed to be able to discern features.

And that doesn't sound like a very strong assertion. And we started with a really, really, really low bar and assumed that the most common features needed to be the most discernible features. So if we're stuck with a lower dimension and we can't give everything a one-hot vector to be told apart very well, then we might want to give the more clear vectors, which have more dimensional independencies, to those features which appear most frequently and could stand to confuse models the most.

This hypothesis led us directly to develop the bit cipher algorithm, which is really just a scheme for assigning vectors of zeros and ones. Nothing too crazy in terms of what we're attempting to do. In the figure at right here, the order of vector assignment is by row from top to bottom.

And this is on a five-dimension, five-bit vector system. The first five from bottom are those one-hot vectors. Past that point, you'll see two-hot vectors, but they're a little bit less darkly shaded, indicating the way that we actually utilize the system. In other words, we normalize them to have unit sum.

What I hope you can see from this is that the bit cipher algorithm generalizes one-hot vectors to low dimensions. And as a result, we can work from a very sparse feature set and explore dimensionalities as a controlled phenomenon. And this assignment is incredibly naive, too. That's the other thing that I want you to see as well, that this discernibility hypothesis does not create any meaningful correlations between tokens that behave similarly.

So if you've got the upper and lower case of a word, their vectors aren't going to capture those similarities according to the bit cipher. It's really just gonna try and make sure that those features are distinguishable in a low-dimensional space and that the most distinguishable features are those which appear most commonly.
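Since the scheme is so simple, a rough sketch may help. This is an illustrative reconstruction of the idea rather than the published implementation: codes with the fewest active bits go to the most frequent tokens, and every code is normalized to unit sum.

```python
from itertools import combinations
import numpy as np

def bit_cipher(vocab_by_frequency, dim=5):
    """Illustrative bit-cipher-style assignment: tokens ordered from most to
    least frequent receive vectors with 1, then 2, then 3, ... active bits,
    normalized to unit sum, so the most frequent (most confusable) features
    get the most distinguishable codes. A dim-bit system covers 2**dim - 1 tokens."""
    codes = (bits for k in range(1, dim + 1) for bits in combinations(range(dim), k))
    vectors = {}
    for token, bits in zip(vocab_by_frequency, codes):
        v = np.zeros(dim)
        v[list(bits)] = 1.0 / len(bits)   # unit-sum normalization of the active bits
        vectors[token] = v
    return vectors

# bit_cipher(["the", "of", "and", "to", "a", "in"]) gives the five most frequent
# tokens one-hot vectors and "in" a (normalized) two-hot vector.
```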

This was enough to do a surprising amount of work. So with some scheme for a deterministic low-dimensionalization procedure, we were then able to utilize this solution that we had actually developed previously. So this was actually the real motivator for a lot of the work that you're seeing today, although it might seem like it's just a checkpoint in the middle.

Provided the bit cipher produces decent embeddings, we can ask, can other layers be non-randomly initialized? In other words, without gradient descent or backpropagation or other gradient-based iterative algorithms. This equation came about from analysis of Word2Vec with the original softmax activation function, and much like other articulations of the Word2Vec family of embeddings, it came up with differential solutions that depended on co-occurrence matrices.

We formalized this as a question. Is there a way to take a co-occurrence matrix, F, in this equation here, and convert it with some weights, some denominators by row, into something that warms up a single-layer feedforward in a neural network? And ultimately, this k minus 1 over k term here, and this sum, is really just expressing something like conditional probability.

Like conditional probability, because k minus 1 over k is a wrinkle that says that, as the number of features increases, in other words, as the context window increases in a block transformer, the warm start that we can apply to start off a neural network without any randomness, entirely determined by the vectors underneath, nears whatever direction it's going.

All we have to do is compute some co-occurrences between inputs and outputs, and I don't mean necessarily standard co-occurrences that you might have learned about a long time ago which depend on a radius. I mean, whatever your inputs are, whatever your outputs are, you take their sum of outer products and you get a co-occurrence matrix of inputs and outputs, and that can then be utilized to initialize your layer in that neural network to be vastly more performant than what you'd get by a random initialization.

This was a strong motivator for us. This was just for a single-layer model, but it depended on the softmax function for activation. And the softmax function as an activation function, we knew, is also necessary for self-attention features. And this meant that if we could put self-attention into some kind of a standard form with this equation just like a single layer, then we could apply the same solution with one catch.

That catch is specifically that we don't know what the targets are for self-attention. There's no target vector y, the thing that you're trying to predict, which position is the one that you want to weight most strongly. And so in order to apply this solution for a self-attention model, we had to do some more analysis.

And that's in the reference number one, which is all the way back up in the first slide if you want to see it. But that derives a differential criterion, an analog for the single-layer solution that tells us what the targets of that kind of self-attention actually are, the hidden targets, the weights that you're trying to create, which really are just about making sure that the layer above self-attention has some unsurprising things coming towards it.

The self-attention layer is really just trying to massage the vectors so that way they look like something that the next layer above expects. Aside from that, though, it's a much more in-depth conversation. The point, though, is that for the model in this picture here, we can now start off with vectors x that are not random.

We can use those vectors x to initialize non-randomly the parameters in W, the self-attention matrix, and then use that, going up the network, to initialize the parameters in U, since it's just a feed-forward layer with whatever self-attention is giving it as weights. And then whatever that produces, the hidden state, H, we can use that with the actual targets after the output layer to warm up the matrix O.

And you might say, "Okay, well, how did you figure out what those hidden targets are?" You had to have an output for the U matrix to try and hit. That too is something that the bit cipher can provide in the form of label embeddings. In other words, low-dimensional targets of the thing that is downstream that you're trying to hit, the language model's output.

So similarly, we can warm start the U matrix in terms of those bit cipher label embeddings. So in this view, the aim is to show how simple and general a single-layer softmax activated solution is to apply. It's really just no more challenging than computing conditional probability given inputs and outputs.

It's fast, it's something that you can distribute in terms of processing, and it's very, very general. So this is essentially the process that we're using in order to warm up the W and U matrix. There's the U matrix there, starts out as zeros. In other words, nothing, no random values, no weights anywhere.

Over the data, which is just borrowing the dimension of this gigantic Y matrix that has all of the targets in it for the entire data set, we simply just take the outer products of whatever the hidden state, the input to that layer is, assuming that the lower layers beneath it are also warmed up with whatever the targets for that layer are.

Following that, it's really just about normalization and a logarithmic transformation. And that logarithm really just emerges as a result of being an inverse to the exponential function, which is a part of softmax, pretty much all of softmax. And that's really what brought us here. So what does warm starting a network do?
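Before looking at what warm starting does in practice, here is a rough sketch of that procedure as I've paraphrased it; it is not the exact published formula, and in particular the k minus 1 over k correction is only noted in a comment.

```python
import numpy as np

def warm_start(X, Y, eps=1e-12):
    """Sketch of the co-occurrence warm start for a softmax-activated feed-forward
    layer. X holds the layer's (non-negative) inputs as rows, Y the targets for
    that layer (e.g. one-hot next tokens or bit-cipher label embeddings). Weights
    start at zero, accumulate input-output outer products over the data, and are
    then row-normalized and log-transformed, since the log inverts the softmax's
    exponential. The published solution also carries a (k - 1)/k correction, with
    k the number of features per prediction, or the average input norm for
    non-unit-norm data like images; that detail is left to the referenced paper."""
    F = np.zeros((X.shape[1], Y.shape[1]))
    for x, y in zip(X, Y):
        F += np.outer(x, y)                          # co-occurrences of inputs with outputs
    P = F / (F.sum(axis=1, keepdims=True) + eps)     # something like P(output | input feature)
    return np.log(P + eps)                           # warm-started weights, used as softmax(x @ W)
```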

This is going back to before we had the bit cipher algorithm for dimensionality reduction. And we started out by just saying, OK, if we take a simple, simple language model that only looks at a radius of traditional co-occurrences as features, we can concatenate those vectors and feed them forward for a language model's output.

A completely random start, a cold start to a language model, is really just the size of the vocabulary in perplexity. And those three lines here for a few different radii are demonstrating that point with the point all the way at the top left-hand corner of this figure, cold starts.

In any of those cases, when the warm start is applied, the perplexity is immediately, automatically lower. And furthermore, the trajectories that the updates follow continue, at the same learning rate and over the same time, to perform better than models that were started cold.

If you have an early stopping criterion, it will similarly tend to engage earlier and at a higher perplexity for the cold-started models. So this was the first indication that we had figured out something that's very useful. There are some folks on Slido saying they're a bit confused. They're asking, are we talking about an alternative approach to self-attention?

We are. So we're all the way back at slide one. And it is the premise of this whole conversation. So here, in this modified version of self-attention, you might normally expect to do a comparison of your inputs, the matrix X. Whatever your inputs are, they might be a whole block of vectors, or they might be-- this is self-attention.

It's not cross-attention, where you have different vectors that you're trying to attend. And forgetting about the values, which for us is the U matrix, the keys and queries, which are the parameters for self-attention, are in the middle. They're in between the two copies of the inputs, X. Each of those you can view as some kind of a projection down to a dimension where they can interact.

And this is necessary for something like cross-attention, where you might have different dimensionalities like X1 and X2 in two separate groups of vectors if you're doing something like machine translation. That's not necessary to think about when you're just looking to do a standard language model that has to predict the next output according to the inputs, which are also outputs from previous iterations.

Two insights here-- one, that multiplying the key and query matrices, WK and WQ, it's just another parameter matrix that's implied. There aren't two parameter matrices there in the middle for self-attention in any effective way. There is a common dimension of comparison, and that kind of just moves stuff around.

It creates degrees of freedom so that optimization can figure out what's the best weighting from comparisons. But the softmax function is strictly operating on similarities of that comparison space. It's not doing anything with those similarities. It's just softmaxing them. It's just activating them. So if it was a big similarity, it's a big attention value.

In this equation, there's no transformation happening before those vectors are multiplied together, inner products. So those vectors better be good vectors that you're starting with-- x and x transpose, the same thing. They better be vectors that are comparable. They can't be vectors from cross-attention, where you're trying to translate from one language to another, and they just don't inner product.

They're different dimensions. You could force it through if they were two differently trained embedding layers, and they had the same dimension with this mechanism. And if you didn't, you could put those key and query matrices back in between the two x vectors, x blocks of vectors. But a lot of what's going on here in this talk is trying to simplify and make more efficient the architectures that we need and the mechanisms that they utilize, given what we know about how language functions.

And that's a critical piece there. We have assumptions that we can make. If all we're doing is autoregression, we don't need cross-attention dimensionalization in between. That'll be the theme, in other words, that can we use knowledge that we have about the way language functions to design better versions of architectures that meet the needs of language instead of being simply general.
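To make that contrast concrete, here is a hedged, side-by-side sketch; the shapes and names are my assumptions. Standard self-attention re-dimensionalizes X through the query and key matrices before comparing, which amounts to a single implied parameter matrix between the two copies of X, while the modified form compares the raw vectors and feeds the similarities forward through one matrix W.

```python
import numpy as np
from scipy.special import softmax

def standard_attention_weights(X, W_q, W_k):
    # Standard self-attention: project the (T, d) inputs to a shared comparison
    # space, then softmax the similarities computed there. Note that W_q @ W_k.T
    # acts as one implied parameter matrix sitting between the two copies of X.
    return softmax((X @ W_q) @ (X @ W_k).T, axis=-1)

def modified_attention_weights(X, W):
    # Modified form discussed in this talk: no re-dimensionalization before the
    # comparison. The raw similarities X @ X.T are fed forward through W, whose
    # block-by-block shape means a shorter block can use the subset W[:T, :T].
    T = X.shape[0]
    return softmax((X @ X.T) @ W[:T, :T], axis=-1)
```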

Is this good? This is important. So if there are any questions here, it's a good time. We are there and there. So, we just talked briefly about this. This was for language, but it's a really simple language model. There's no self-attention here yet. This is really just evaluating that a warm start in either the blue, green, or purple case does better than its partner, which is a cold start of the same architecture, same hyperparameters, orange, reddish, and brown.

So three different models, regardless of how long your context is in each case here, we see that a model which has a nonrandom initialization by the equation presented two slides back from here starts a network off with a much lower perplexity. The requirements to apply this solution to a feedforward layer of parameters is simply that your inputs should not have negative values.

That's really all we have to worry about. So it becomes really easy to ask questions like, well, what happens when you apply this to other data with non-negative values? Well, there's one little catch that we had to think about here in this case, and that is with the bit cipher or one-hot vectors, we're controlling the norms of the inputs.

With standard embeddings, with MNIST, for example, when you're trying to predict the handwritten digits, 0 through 9 value, you don't get to assume necessarily that all inputs have the same norm. You can normalize the inputs, but it doesn't necessarily make sense to normalize them to one when you're looking at images, for example.

They're non-negative. They have 0 through 255, for example, in MNIST. And as a result, we can put these data through that same warm start. Now one little caveat here I've alluded to about the norms of vectors is that we don't know what that value of k is. In other words, let me go back, you could look at it here or here, that's the number of features per prediction, which if you're looking at unit-normed word vectors is however big your context window is, k, because they all have unit norm and there's k of them.

But if you're looking at just an image, it's not clear if it's a composition of multiple vectors, if it's one vector, and how many it is, if it is a composition. It just has a norm. In application to data like that, that is what k becomes, the average norm of an input.

And I'm regretting not putting a graph in this, but the paper that discusses this shows that in the MNIST dataset, the exact optimal value of k is the average norm of the inputs however you've pre-processed them. And that's how we generally apply this rule when we're warm starting systems and we don't have unit-normed vectors.

And it was learned from studying this model's application, this solution's application to non-linguistic data. But as mentioned, the purpose was always towards language. So longer context windows in principle should provide models with more information than shorter context windows. This means one should expect that models perform better when context window length is longer, theoretically.

And this is essentially the reason for why self-attention was initially developed. Researchers wanted to improve language models and context windows, providing more information were seen as the key to that. In other words, the more features, the more information, the more flexibility a model can have and expressivity. However, without feature weights, models didn't simply get better with long context windows, and feature weights and self-attention were hypothesized to be needed.

And this was proven back in 2017 with the transformer architecture. In moving towards self-attention and transformer though, the primacy of the transformer architecture's block context model casts a shadow over the use of other context models. So for example, if I were to ask here, is it clear to everyone that the standard self-attention block model of context is different than the traditional notion of co-occurrences, which use a radius that is not positionally anchored?

It is the context model, the positional anchoring of the block context model, that gives it its information. It is not, in all likelihood, anything else. Now what you do with that context model matters. You can't just take those vectors in a block, add them together, and expect a feedforward to do well.

That's where self-attention is needed in order to figure out which vector needs the best weight, most weight. So what you'll also see in the architectures that are based on what I've already presented is that we're interested to explore how different models of context for language models can be integrated in general because they each provide different information.

And we all know that the standard transformer's block model of context requires a ridiculous amount of information and data in order to become effectively trained. So the current state of contexts that we use, top there might be the standard transformer context that has a fixed positional block. And it takes the first 10 tokens, for example, the second 10 tokens, and the third 10 tokens, each in different blocks.

Each of those is a group of contextualizing vectors. The second one there that you see with the r as a subscript is a radial model because those do different things. In other words, rather than assume you're looking at the first 10 or the nth 10 features, you pick a radius and you say, what are the last r features, the last r vectors?

That can also have an attention distribution, a self-attention distribution, according to the exact same model that's being presented. It produces an entirely separate context state, whatever you want to call it, which can be conjoined with the block model to articulate features and be given to an output layer that knows what to do with them when each has different values.

The concatenation of those different context models keeps the information separate so the output layer can decide which portion of the context is useful for the prediction. This last one is getting really traditional at the bottom. It's what I refer to as a document model. If you've ever implemented something like a Naive Bayes classifier or a term frequency inverse document frequency model, that's essentially what a document model is.

Sum up your vectors, you get something. Is it going to be the best for predicting the next token? Absolutely not. However, it's always different. What that means is that even if you wrap to the next block, between the radial and the document models you have a unique context vector, even if you're looking at the exact same block, because the document has grown and the radius just says, what are the last three?

What are the last 10? As a result, when you incorporate different models of context, you don't really have to say that there's a finite context window. It might not be very good to make predictions past the first block, but that might be about how much data you've used, and it might be about the hyperparameters for each one of those models that you're applying, in other words, radius, the block size, like usual.
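As a bookkeeping sketch of how those three context models could sit side by side (illustrative only; in the actual models the block and radial contexts each carry their own attention distribution rather than the plain sums used here):

```python
import numpy as np

def context_features(vectors, t, block_size=10, radius=3):
    """Combine the three context models described above: a positionally anchored
    block, a radial window over the last `radius` tokens, and a whole-document
    context. `vectors` is a (t+1, d) array of embedded tokens up to position t;
    the pieces are concatenated so the output layer can decide which portion of
    the context is useful for the prediction."""
    start = (t // block_size) * block_size                      # which block position t falls in
    block = vectors[start:t + 1].sum(axis=0)                     # block context
    radial = vectors[max(0, t - radius + 1):t + 1].sum(axis=0)   # the last few tokens, regardless of block boundaries
    document = vectors[:t + 1].sum(axis=0)                       # everything seen so far in the document
    return np.concatenate([block, radial, document])
```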

So far, the only embeddings that I've suggested are from this bit cipher algorithm, and as I've expressed, they don't capture any useful similarities between similar tokens. The bit cipher algorithm doesn't care if you're looking at the uppercase or the lowercase version of a word. It doesn't see them as bearing any similarity, even though they might be used very similarly.

So how can you utilize the bit cipher to create vectors for tokens that have meaningful similarities between words that are used similarly? And this is just backing off to the traditional methods once again, taking co-occurrences of bit cipher vectors with whatever's there at the middle or center of a co-occurrence model.

Normally, if you think about one-hot vectors, a co-occurrence matrix is really just the same thing, except now we just have smaller vectors with different dimensions on, so to speak. And we normalize after concatenating these blocks of different radii from the bit cipher to match the original input requirements that we discovered for the warm start solution.

And that enables us to use these just like we would the original bit cipher vectors, except now, just from the usual co-occurrence statistics, you'll see that the capitalized word and the lowercase word have a lot of common usage. And you know this works because you've seen co-occurrences for a very long time, and while they might not normally be useful in our applications these days with deep learning, they can be imparted through the bit cipher algorithm to prescribed vectors as well.
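Here is a sketch of my reading of that construction, with illustrative radii; `cipher` could be the output of the earlier bit-cipher sketch.

```python
import numpy as np

def cooccurrence_embeddings(tokens, cipher, radii=(1, 5, 20)):
    """Give bit-cipher codes meaningful similarities: for each target token,
    accumulate the bit-cipher vectors of its neighbors within several radii,
    concatenate the per-radius blocks, and normalize to unit sum so the result
    still meets the non-negative, unit-norm input requirement of the warm start."""
    dim = len(next(iter(cipher.values())))
    emb = {tok: np.zeros(dim * len(radii)) for tok in set(tokens)}
    for i, tok in enumerate(tokens):
        for b, r in enumerate(radii):
            lo, hi = max(0, i - r), min(len(tokens), i + r + 1)
            for j in range(lo, hi):
                if j != i:
                    emb[tok][b * dim:(b + 1) * dim] += cipher[tokens[j]]
    return {tok: v / v.sum() if v.sum() > 0 else v for tok, v in emb.items()}
```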

So here's where things start paying out in terms of speed and efficiency. If you only have one layer of self-attention, then that means that you don't need to worry about whatever weird expressive stuff is happening that, you know, similar inputs might have slightly different hidden states. Since that first layer is just a set of static word embeddings, the self-attention layer is working off of static word embeddings.

And that means each pair of words have a fixed comparison given static word embeddings. And that means if you want to compute the quadratic features of self-attention, you can just pre-compute them and pull them from memory. This caching of vector comparisons is essentially reducing the self-attention layer's cost from quadratic to linear, since those values that we're using to weight the vectors for the feedforward layer no longer require comparison across the block.

They're already compared. So when our vectors are static, which is at inference time, and if we're not learning the embedding layer's parameters with iterative differential updates, then not only do we have to not track gradients for the embedding layer, but we don't even have to compute the vector comparisons.

We can pre-compute them and just load them, which is much, much faster. So we can reduce all of the inference costs and some of the training costs, not all of the training costs, because if we want to update those vectors, then we can't assume cached comparisons. But it's a huge cost savings.
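A sketch of that caching idea; the table here is dense over the whole vocabulary purely for illustration, whereas in practice one would presumably cache only the pairs that actually occur.

```python
import numpy as np

def build_comparison_cache(cipher):
    """With frozen (static) embeddings, every pairwise comparison is fixed, so it
    can be computed once and looked up rather than recomputed for every block."""
    tokens = list(cipher)
    index = {tok: i for i, tok in enumerate(tokens)}
    V = np.stack([cipher[tok] for tok in tokens])     # (vocab, dim) embedding matrix
    return index, V @ V.T                             # (vocab, vocab) similarity table

def block_similarities(block, pivot, index, table):
    # While the embedding layer is frozen (and at inference), the pivot's
    # comparisons against the block are lookups: linear in the block length.
    row = table[index[pivot]]
    return np.array([row[index[tok]] for tok in block])
```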

This means that we can train these self-attentive feedforward unit models very quickly and with good initializations. But there are some other things that we immediately observed while developing these models, and that is the lack of randomization produced models which were quite effective even on small data. Now, it doesn't mean that training on small data will let you generalize to everything else that's out there in the world.

In other words, training on a small data set might produce a model which has a surprisingly low perplexity on its training set, but it doesn't mean that you're going to be able to generalize and have a language model that's talking well from just hearing a couple of thousand tokens.

It does mean it will know that couple of thousand tokens very well, very quickly. But there's a challenge with using self-attention still, and that is the fact that the block model of context often is not fully utilized, since many documents are shorter than long context windows.

And these days, there are exceptionally long context windows. I'm not even talking about those. Many of the language modeling benchmarks simply don't even go out to a thousand words when it comes to context, and you're looking at a document to predict. So this has been a problem for a while, and it means that if you're going to pad your short documents, you're going to waste a lot of prediction on those paddings.

A lot of computation gets lost just for null information, essentially. And the way that this is often relieved in some groups, and to great effect, is by packing long contexts. So for example, if you've got a hundred thousand token context window, most documents will not be a hundred thousand tokens long.

What do you do with the rest of that long context if you want to use a thousand tokens of good training data? You fill out the other ninety-nine thousand tokens with a bunch of other random documents that don't belong anywhere near the first one. That's called packing. Packing can be utilized without having different documents impact each other, without contaminating the information between documents, and that takes a lot of work, but it can be done.

However, there are different strategies that we could employ, different engineering tricks that we could employ, to make our operation of self-attention more effective at any length of document without having to deal with this packing problem. And that comes about by dynamically changing the context length from some maximum value, that's what you would normally set, just use the context that you have.

But you still have to create batches if you want to train models quickly, and what that means is that there's still some padding if you use this approach. But you can pad those short documents to set lengths, batch short documents together, batch long documents together. This means that we don't need to pack documents together to make use of a long context window.

When a document is long, you can let its context be long. When a document is short, you can put it with other short documents and just use a subset of those self-attention parameters. And with traditional self-attention parameters, keys and queries, it would never be a subset because it's a low dimensionalization that that matrix provides.

With this modified self-attention, though, there's a different shape to the weight matrix, and that's why it's a subset of those parameters that we have to utilize, and that might be something worth discussing afterwards. In other words, how does the difference in shapes of dimensionalities between this and the standard self-attention weights shake out?
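A sketch of that batching strategy, with assumed bucket sizes:

```python
def batches_by_length(documents, bucket_edges=(32, 128, 512, 2048), batch_size=16):
    """Instead of packing unrelated documents into one long context, group
    documents of similar length so each batch is padded only up to its bucket's
    length and, with the modified self-attention, uses only the corresponding
    subset of attention parameters. Edges and batch size here are illustrative."""
    buckets = {edge: [] for edge in bucket_edges}
    for doc in documents:
        edge = next((e for e in bucket_edges if len(doc) <= e), bucket_edges[-1])
        buckets[edge].append(doc[:edge])              # truncate anything longer than the largest bucket
    for edge, docs in buckets.items():
        for i in range(0, len(docs), batch_size):
            yield edge, docs[i:i + batch_size]        # pad each of these batches only to `edge`
```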

But we want to get to a different point for the sake of this conversation. What is a model like this useful for? That should be a question that you're asking. It's a question that we've been asking. We're not entirely certain yet how an extremely large model like this will function on trillions of tokens, for example.

In other words, can you expect the same kinds of outcomes, a ChatGPT kind of thing, human interaction and RLHF and all the rest of that, from some of these models? That's something we're considering, but we're also considering other scales, too, since these models are performant on their own at those scales as well. But for what?

So the point is that, from what we've stress tested into the billions of tokens, models can be trained very quickly on a relatively small GPU. When we cache vector comparisons, we see really big speedups, in the ways that we expect. When we don't cache those comparisons, you see all of the growth in computation time that you would expect from longer context windows.

This one here, though, we're trying to make it really, really, really small, the one called potato. That's because we want to see if we can train a model from scratch, since on very little data, these models can fit effectively with the initializations that we've developed. And with the purpose of starting from scratch, starting with no data, we're thinking about edge computing cases where we could deploy a language model with a microphone so that a person can talk to it and just train it from their own data, train it from their own speech, to understand their speech.

So between these, we've explored a lot of different configurations, trying to consider similarities to what some standard configurations might look like, a couple thousand tokens in a context window, for example, to look something like a GPT-2 style model. Thinking about bit cipher embeddings that are 500 dimensional or 1,000 dimensional to be something like a GPT-2, that's, again, pointing towards the big/large category of models that we've experimented with.

Beyond that, we haven't really touched those scales, because our first objective is not to make big, big language models and train chatbots. We want to know, what can we do with a small model, since this is a relatively unique capability? So what does training look like? To the best of our ability so far, it's kind of hard to see, but the first step is that warm start, where you train the bit cipher, and you take a couple of splits of data, and you compute that warm start for the self-attention layer and the feedforward layers.

In this case, which is really just using a 100 million token data set from the baby language model challenge, which has as an objective to see what language models can do on a relatively human scale of data. In other words, 100 million tokens is something that a person might hear in 10 years of their life.

In 10 years of life, people become pretty proficient speakers, and can a language model be trained at that scale? The second stage, after the warm start happens, is where the majority of training time occurs, and yet is also where training operates the most quickly. At this stage, we find that freezing vectors is important.

One, because it means that we can train much quicker. So we can have the subsequent layers optimized beyond their warm starts very, very fast, using that vector caching, the vector comparison caching, to avoid the quadratic costs of self-attention. This articulates the parameters in the middle layers of the model, taking 100 million tokens and making five passes over the data here a lot quicker than in any of the other stages.

The comparison that you'd make to this is the training time once those embedding layers are unfrozen, where everything slows down to the normal speeds, where you have to do all of your vector comparisons on the fly, since you can't assume that the same comparisons will always result in the same numbers, since model parameters might be updated.

This is the best procedure that we've figured out so far. And in order to make those vectors update, we find that learning rates have to be adjusted dynamically inside of the network, like normal, and that the embedding layers are really tough to make progress on. And you'll notice here in this picture that the slowness and the lack of stability, for example, in learning the embedding layer once it had been prescribed earlier, make it really hard to train over the entire data set compared to five passes, for example, in the middle phase when the middle and upper parameters are being updated, still with backpropagation.

And the other thing that I would highlight before leaving this slide is, in phase one, how the warm start saturates pretty quickly. So if you have 100 million tokens, you really only need to apply the warm start to something like maybe 10 million tokens, not that much more. You don't see that much gain from that much more data.

That's not a bad thing, because it means that we don't have to apply that process for any longer. It would be great if it gave us all of the optimization that we could hope for, but it's not something that we could necessarily expect, since it's just an approximation of where the parameters are headed.

So, on the back of an envelope, thinking about the system that example was trained on as compared to other examples that are out there, and thinking about models of kind of, sort of similar size: we're talking about a 12-gigabyte GPU, a relatively small single chip, specifically when referring to these training times.

So that's a 12-gigabyte GPU. Compare that to working off of eight chips, each having roughly four times the scale, and to the time it took to train something with maybe an additional order of magnitude of parameters, although we have trained models up to around 50 million parameters, too, which is getting towards GPT-2 scale.

We see training times suggesting that, if we scaled up to the relatively large systems that set the expectation for how much work a model that large should take, we could expect to train much faster. But as mentioned, the initial objective here is not to simply figure out how well we can do something that's being done well already.

It's to figure out what these alternative strategies are useful for, since they give us effective access to different regimes of model scale. So as mentioned, we've gone to relatively large amounts of data. I wouldn't really call them big data at this time, even though just a couple of years ago a billion tokens would be a relatively large amount of data.

It's really just a stress test at this point; it gives us answers to questions like, do we continue to see models getting better as we continue to give them more data? Do we continue to see models getting better as we continue to give them longer context windows? And the answer to both of those questions is absolutely yes.

So nothing is telling us that we can't train bigger models with these. But will those bigger models be as good as a standard self-attention model? I don't know. It's a different self-attention parameter matrix than what you see in a standard self-attention model. You could integrate the two. And in theory, that should be overkill, because you'd have more parameters and more power through them.

And we can see from this work that the alternative self-attention parameters are reasonably effective. We're getting close to time. So I'll go quick through these, since this is the work that we're approaching right now. And this is the idea that we're seeing as a use case for such a model like this.

In other words, no pre-training. Just training on the target data, whatever the data of interaction are. And in this example, you'll see that this relatively smaller precision language model just needs to predict whether or not a light should go on or off. A lamp that listens with a microphone and a switch.

And you can use that switch to train the lamp. So that's the goal here. Can pre-training be eliminated? And we want to anticipate whether or not you're going to flip the light on or off. That's the task that we're going to try and approach. Or that, rather, we're currently approaching.

There are a few different processes that integrate into this approach. There has to be a microphone that's listening to you, recording audio. There has to be a transcription algorithm, and we use wav2vec at this point, because there's a very small version of it that's character-based. And as a result, it does require you to use a consistent language.

But it doesn't even require you to use words, since it's strictly phonetic. There has to be-- and this is the bread and butter of what's going on here-- a process which anticipates what you want. And that process is responsible for creating good training data. So this is a smart data collection algorithm that figures out, when you flip the switch, is that the target for something that you just said?

Does that become the learning objective for the text that was just transcribed, the thing it anticipates you want? Following this, there's also two other processes. One which operates on a different time cycle, and that's training. So always train a model. Always be training a model, whenever there's new data, is essentially what that fourth process says.

And the last one is operation. In other words, if you flip the switch, there has to be a process which operates the light bulb. It always has to be a lamp in order to be useful. It always has to be able to just be a switch. However, that operation process likewise needs to see a directive from the anticipator.

If the language model predicts that you just said a thing, that means you want there to be light, that operator then needs to receive the signal from the anticipator and execute the directive. If the user then, though, within some time scale, changes the switch back after the model created a prediction that was bad, the operator is also responsible for issuing a correction to the anticipator to correct the training data.

What this looks like as a process is in this diagram here. And you can see the flow here from stage one, a verbal command maybe gets recorded, transcribed, turned into text. And if there's no model that's yet trained, that text is just stored as data along with any directives given by the user in the form of a light switch going on or off.

Once there's any data, the learning process says, okay, time to train a language model and integrate it with these targets. Once a model is done training, it's sent over to the anticipator, which is responsible for using the language model. That small language model is now empowered to make predictions every single time it receives a text command.

And those predictions are sent to the operator, which then does whatever it's told. And the last thing that can happen, step six, is if the wrong prediction was made and the user fixes it by turning off the light because they didn't want the light on, that corrects the data that was transcribed and the next model which is trained will be able to avoid that problem.
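To summarize that flow, here is a hedged sketch of the anticipator's bookkeeping; the class name, the prediction and operator interfaces, and the correction window are illustrative assumptions, not the project's actual code.

```python
import time

LIT, DARK, NOTHING = "<lit>", "<dark>", "<nothing>"   # label tokens folded into the model's vocabulary

class Anticipator:
    """Smart data collection as described: store transcripts with anticipated
    targets, predict with the latest trained model, and accept corrections."""
    def __init__(self, operator, correction_window=5.0):
        self.operator = operator          # whatever actually flips the light (hypothetical interface)
        self.data = []                    # [transcript, label, timestamp] rows for the next training run
        self.model = None                 # swapped in by the always-be-training process
        self.correction_window = correction_window

    def on_transcript(self, text):
        self.data.append([text, NOTHING, time.time()])   # default target until the user says otherwise
        if self.model is not None:
            directive = self.model.predict(text)         # hypothetical model interface
            self.operator.execute(directive)             # hypothetical operator interface

    def on_switch(self, new_state):
        # A switch flip shortly after an utterance becomes that utterance's label;
        # it likewise corrects the training data when the model's prediction was wrong.
        if self.data and time.time() - self.data[-1][2] <= self.correction_window:
            self.data[-1][1] = LIT if new_state == "on" else DARK
```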

And there's some dialing this in in terms of the time scales that you want based on the way humans interact with the light switch. So there's a lot of development that goes into figuring out the right way to set this up. The data that you collect from a process like this, how do we organize it?

This actually is not transfer learning, so I kind of lied there a little bit. This is strictly language modeling. It's a conversation between the human and the lamp. You say something, the lamp says, here's what you want. And it's just an extending context window, like you'd see with a decoder-only kind of architecture these days, a chatbot kind of thing, a human personal assistant, human assistant dialogue.

And you might also suspect then that, well, couldn't you let the lamp talk? Yes, you could absolutely let it use other tokens, and that is something which is on the horizon for us, in other words, how to determine once the model is learned enough and knows when you want to hear it talk and knows what you want to hear it say, which requires other smart data collection currently in development.

And there are three tags here, if you can't see them, although what they really are is tokens, since they're integrated within the language model's vocabulary: I want the lamp lit, I want the lamp dark, or nothing, if no switch is flipped during transcription. So what do the models look like that go into a lamp?

They're a little bit smaller than that micro model in terms of having a long context window, B, the block size. They still use these other features, like a radius, which help them to do well with only little data, those other context models. And the embedding size is around 50 or 100 and something, and this is small enough to fit on a microprocessor, on a CPU of a microprocessor, including training, no GPU whatsoever.

And the first time we ever got the interaction right, the right timescales, from no data whatsoever, creating this data, and 20 minutes of it, was enough, and you can see there's loads of misspellings here because the transcription is not required to produce known words, known tokens. It's strictly character-based, so you can say whatever you want to say, you can whistle, and as long as wav2vec thinks that's tokens, it'll figure out what to transcribe.

That's enough, 20 minutes of talking to it, to have it know pretty well when you want the light on. This is what the numbers look like for that prediction, and you see lots of zeros there. That's because there's no positive instances yet in the data, until you flip the switch, there's nothing to predict.

Once there is enough to predict, we see an immediate jump in the model's ability to figure out whether the lamp should say on, off, or nothing. And while we trained this first model, for example, in 20 minutes on Le Potato, which is a really, really, really small microprocessor, it's incredibly frustrating to utilize because the processing time is a couple of seconds, and it feels like the request is going somewhere, even though the data is entirely localized, there's no Wi-Fi, there's no internet connection.

It just takes the model on this tiny chip a minute, not really a minute, like a couple seconds, to flip the switch on, because it has to transcribe it, interpret it, issue the directive, ask the operator to operate it. And so part of what we're doing is figuring out at what scale of microprocessing do the models that we're developing really make a good real-time system that a user can make use of well.

And as you can see, the larger the model in terms of hyperparameters and so forth, the more performant it gets. So we see these as potentially useful in edge scenarios, but not just for operation, for training, too. So go to Home Depot, buy a light switch, install it in your house, start talking to it.

But this isn't really the stopping point that we want to get to. We want to eventually get to the point of talkback. We want to treat these as language models that essentially have a bit of you inside of them that you can converse with. And that's important to know when the model is aware of what you want to hear said.

In other words, it needs to know what is a good thing to say back to what you just said. And the lamp has never heard a lamp talk before. So there are challenges to figuring out the lamp's role in conversation. And choosing a lamp, though, is arbitrary. We don't have to make it be a light bulb which goes on and off.

This could be a controller for anything which is a binary switch. And you could imagine, like others are looking at right now, there are a lot of opportunities in predicting the action on your phone that you want to take, which thing you want to push. And a system like this, miniaturized onto your cell phone, for example, assumes better hardware than what we're already using, but it would be entirely localized, including training.

But this is also really just getting to the point of feasibility. It's not getting to the point of a well-optimized system, which we're still developing. There are, in principle, different modifications that we could make to the self-attention layers, which include traditional self-attention parameters. That's just one example. Then there are updates to the very naive scheme that we have for the bit cipher, the vectors that we're using to initialize our models.

And a lot of other minutia that need to be approached. So this isn't really work that's done. It's a work in progress. And in addition to what I just described, we're moving towards larger models and evaluations that compare better to modern systems, which will eventually come online. We'll most likely participate in this year's baby language model challenge, although that challenge assumes you're working with a standard architecture, which is already developed for all of the evaluative needs.

So there's a lot of work to do. But that's really all I have prepared for you to discuss today in this conversation. I've gone over a lot of details, and if you'd like to talk about any of these, I'm certainly happy to. Questions that you might have as well.

And if you have access to the slides, there's some links to the different papers I've referenced. That's all for today. Thanks.

>> Hey, so thanks, Jake, for the great talk. And now we'll have some time for questions. So if anybody here has any questions, feel free to raise your hand and ask.

Otherwise, we'll go to some questions on Slido. Some folks are asking. So we'll be posting the slides later. But I've also pasted these references in the Zoom chat, as well as Discord, in case anybody wants to see them. I was wondering, in the plots that you showed for warm start versus cold start, does the cold start use the modified self-attention or the standard self-attention?

>> Sure. So the question was, in this picture, comparing warm starts to cold starts, what self-attention was used here? None. This is strictly a feed-forward experiment, where we take a single layer, and all we do is feed forward with one-hot vectors from some context window and concatenate them together. And the general property that you'll see is, by concatenating vectors, there's very little for attention to do.

Simply with a block, you're adding the vectors together, and that superposition of the dimensions smears them. And that's why self-attention is needed, in order to weight that superposition so just the right ones stick out and it's not muddled. If those vectors are instead concatenated, a weighting of those is really just appealing to the sensibilities of the matrix above.

When they're superimposed, there's a lot to work on, since you're smearing separate information together. When the information is already separated, there's not that much re-weighting can do. And in this case, there's absolutely no re-weighting going on. And what I've described to you is really just something that's become very clear from a lot of small-scale experiments in between the models that we've developed.

And moving towards self-attention took additional time, and we didn't have a solution for that layer yet when this work was done.

>> I had a question in regards to-- so you're doing this with on-edge controllers, right?

>> What?

>> You're doing this with on-edge controllers, right? You're doing training for on-edge controllers?

So this could be for IoT devices, right?

>> Could be.

>> And you talked about how this also could work for image data, right?

>> Oh, I saw that. Yeah.

>> Have you conducted any tests with image data? Like, with these small-scale models?

>> Yeah. So image data works best on not just feed-forward architectures.

They have, for example, convolutional bits and pieces that are useful to them. And that means if we want to apply some kind of a warm start to, for example, a convolutional layer to create a performant image classifier, or something else that's working with images, we'd want to develop an initialization for that layer, too.

It has weirder activation functions, which means we need to branch out from softmax as an activation function. But convolution is surprisingly similar to a radial model. It's really just saying what's near where I'm trying to create a feature. So I would say, yes, it seems like it's something that we could do.

But currently, it's at the stage of future work where it fits in one bullet here at the bottom: different layer types need formal derivations for warm starts. So if we wanted to do this kind of thing with a performant architecture, we would probably be uniformly or randomly initializing some of those parameters that we don't have warm starts for yet.

And as a result, we would get a lot of noise in where things are going. And if we started to utilize those activation functions, even just a logistic activation, well, a logistic activation is not really fundamentally different from a softmax activation. So you might say, for example, why can't you just apply that to a logistic function, treating it like a two-dimensional softmax?

And the reason is that if we treat it like a standard logistic, then each dimension is independent; each dimension is trying to predict the same thing. And there are a lot more questions about how you can get different information out of different dimensions. So it's a question that's really worth spending time on, in my opinion, separately.
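As a quick numeric sketch of that relationship (illustrative only): a single logistic unit is exactly a two-class softmax over the logits [x, 0], but applying logistics to each dimension independently does not couple the dimensions the way one softmax across all of them does.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = 1.3
# A logistic unit is a two-class softmax over the logits [x, 0].
print(sigmoid(x))                          # ~0.7858
print(softmax(np.array([x, 0.0]))[0])      # ~0.7858, the same value

# But independent logistics over several dimensions do not form one
# distribution: the outputs need not sum to 1, so each dimension is
# effectively making its own separate prediction.
z = np.array([1.0, 2.0, 3.0])
print(sigmoid(z).sum())                    # ~2.56, not 1
print(softmax(z).sum())                    # 1.0
```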

And it's not the first question that makes a lot of what we've developed practical. On one of the slides, you had a dialogue with your user. I'm wondering, does that imply there is a speech-to-text system inside the microprocessor? Yeah. So audio goes in. And there's a process here which accepts that audio.

And it utilizes a pre-trained wav2vec model. It's really just filling a need with a pre-trained model. That's what we're doing right now. Although transcription is something that we would like to move into in our future work for the purposes of training from scratch, because one of the real benefits of a system like this is that it doesn't come with any biases from other people's data, aside from the fact that there's a pre-trained transcription system, which means it's biased towards whatever phonetics were in the language that was used for pre-training the wav2vec model.
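For a rough sense of what that component looks like, here is a minimal sketch using Hugging Face's transformers library with a publicly available wav2vec 2.0 checkpoint; the specific checkpoint and decoding shown here are assumptions for illustration, not necessarily this system's setup.

```python
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Illustrative checkpoint; the system described in the talk may use another.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# In practice this would be 16 kHz mono audio captured by the device;
# a second of silence stands in here so the sketch runs end to end.
audio = np.zeros(16000, dtype=np.float32)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: most likely token per frame, then collapse repeats.
pred_ids = torch.argmax(logits, dim=-1)
transcript = processor.batch_decode(pred_ids)[0]
print(transcript)   # this text is all the language model gets to train on
```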

So there is external utility here coming from a pre-trained model. But the text itself and the language model that we're presenting are only working from what gets transcribed. I have a follow-up on my previous question. You said that the feed-forward warm start is independent of the choice of self-attention.

Does that mean that the warm start strategy can be used for any network that uses a feed-forward layer, not just PLMs, but any LLM or any other network? Yeah. So that's going back to the warm start solution here. And what it says is that, in terms of any layer beneath, if you assume that those layers' parameters are what they are, you're not going to update them.

And assuming that you know what the targets for that layer are, which for middle layers leaves some questions to be answered, then this initialization will do better than random for a softmax output. It's really important at this stage that there's a softmax as part of the activation. If there's not, then it's more math, basically.

But the point that should become clear is that whatever type of prediction scenario you're in, as long as you have non-negative features and a softmax for activation, like in this case with a single layer, or even two softmax layers on MNIST, whatever that's doing, you can get a really good initialization.
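To give the flavor of such a closed-form start, here is a naive-Bayes-style sketch in the same spirit (an illustration only, not the actual warm-start derivation from the talk): the softmax layer's weights are set from logarithms of smoothed feature-target co-occurrences, which is exactly where the non-negativity of the features matters.

```python
import numpy as np

# Sketch of the general idea only, not the talk's actual derivation:
# with non-negative features and a softmax output, log co-occurrence
# statistics give a closed-form start that beats random weights.

rng = np.random.default_rng(0)
k, d, n = 10, 50, 2000
rates = rng.gamma(2.0, 1.0, size=(k, d))        # class-dependent feature rates
y = rng.integers(0, k, size=n)
X = rng.poisson(rates[y]).astype(float)         # non-negative count features
Y = np.eye(k)[y]                                # one-hot targets

# Smoothed feature-target co-occurrence, then a logarithm: this is exactly
# the step that would break if the features could be negative.
cooc = X.T @ Y + 1.0                            # (d, k), strictly positive
W0 = np.log(cooc / cooc.sum(axis=0, keepdims=True))

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def xent(W):                                    # cross-entropy before any training
    P = softmax(X @ W)
    return -np.mean(np.log(P[np.arange(n), y] + 1e-12))

W_rand = rng.normal(0.0, 0.1, size=(d, k))
print("random init loss:", xent(W_rand))        # around log(10) ~ 2.3 or worse
print("warm start loss:", xent(W0))             # substantially lower, pre-training
```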

It doesn't have to be linguistic data. This can be mixed data, too. You could do an image caption generation system that has features from both images and text, and warm them up with the same solution, with entirely different data in the two places. Could you point out which part of the process requires the values to be non-negative?

Yeah. What happens when you put a negative in a logarithm? Not saying you can't, but it's not going to start making probabilities for you at the other end of the softmax any time soon. So you have to start with a different premise, essentially. And that premise is something that requires more derivation.

You'd want to ensure, if you're going to use a logarithm anywhere, or assume that inverse, that you're probably able to modify every parameter independently, instead of full rows of parameters. I think we should get to a couple of questions on Slido that folks asked. The first is, what's the difference in performance between naive assignment and optimized or omniscient assignment for packing tokens into bit vectors, and are there any experimental results?

What's the difference in performance between naive assignment and optimized assignment for packing tokens into bit vectors? The performance differences are going to be in speed. The systems which utilize packing for contexts have gone to great lengths to make sure that portions of the context that have nothing to do with each other don't bleed information into each other, if you're going to pack them together.

That creates a lot of logistical challenges in terms of defining models. And it's still just doing the regular self-attention thing, so it's quadratic. So if you have the same length of context window, it's going to be the same computational cost. However, if you pack all of your small documents together, it's because they don't individually need the whole context window's worth of quadratic comparisons.
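A minimal sketch of that bookkeeping (illustrative, not any particular system's implementation): a block-diagonal mask keeps packed documents from attending to one another, while the full window of attention scores is still computed.

```python
import numpy as np

# Three short documents packed into one context window of length 8, with
# segment ids marking which document each position belongs to.
segment_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2])
L = len(segment_ids)

# Block-diagonal mask: position i may attend to position j only when both
# come from the same packed document, so unrelated documents don't bleed
# information. (A causal model would also apply a lower-triangular mask.)
mask = (segment_ids[:, None] == segment_ids[None, :]).astype(int)
print(mask)

# Self-attention still forms all L x L scores before the mask is applied,
# so the cost is set by the full window, not by the documents inside it.
print("score entries computed:", L * L)          # 64
print("entries the mask allows:", mask.sum())    # 9 + 4 + 9 = 22
```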

And that's why you pack something into the empty end. I guess it should be over here. But document packing, even though it's well known as a mechanism to make training much more efficient, in other words, you need fewer batches if more documents are packed together, is not something which is, for example, entirely accepted as a standard, published form of preprocessing.

So what I would say is just that document packing is not a correct model of context. It is an efficiency, but it requires the same level of quadratic comparison. Whereas dynamically batching and utilizing a block size that is dynamic preserves the model of context; it does something that is true to the objective and unwavering in that.
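As a back-of-envelope illustration of that complexity point (the document lengths here are made up): packing pays for the full fixed window every time, while dynamic block sizes let each document pay only for comparisons within its own length.

```python
# Back-of-envelope count of pairwise attention comparisons for a batch of
# short documents: fixed packed windows versus dynamic block sizes.
doc_lengths = [37, 65, 210, 12, 84, 104]   # made-up document lengths (sum = 512)
window = 512

# Packing fills fixed windows; every window pays window**2 comparisons
# regardless of how many unrelated documents are inside it.
packed_windows = -(-sum(doc_lengths) // window)   # ceiling division
packed_cost = packed_windows * window ** 2

# Dynamic block sizes let each document pay only for its own length.
dynamic_cost = sum(n ** 2 for n in doc_lengths)

print("packed comparisons:", packed_cost)    # 262144
print("dynamic comparisons:", dynamic_cost)  # 67710, roughly 4x fewer
```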

And it reduces the complexity for smaller documents. But a direct comparison of the two is something I have not done, because it would require having that oracle and utilizing those algorithms. And where are they used? They're used with insanely big models, which means we would likewise have to compare two insanely big models to create the same level of expectation that people have from packing.

So that's in the future. Great. Thanks for your detailed response. We quickly have a question that's asking, are there any implementations of SAFU available that one could experiment with? Well, once we publish, there will be. But that requires a lot of work on developing systems for evaluation, since the evaluation systems rely upon standardized functions within the architectures that you're all very familiar with, like GPT-2, that are easily taken for granted.

Even though you do lots of work in training them, you also have to do a lot of work in creating those functions that meet the needs of the separate prediction tasks and fine-tuning that evaluations perform. All right, great. Makes sense. I think we're pretty much out of time. So thanks, Jake, for the great talk, and thanks for coming to another lecture.

Thank you.