Stanford CS25: V4 | Transformers that Transform Well Enough to Support Near-Shallow Architectures
00:00:00.000 |
>> Today, for our talk, we have Professor Jake Williams from Drexel University. 00:00:09.480 |
He is an Associate Professor of Information Science at Drexel University's College of 00:00:15.200 |
Computing and Informatics in Philadelphia, Pennsylvania. 00:00:18.920 |
Dr. Williams has a background in physics and math with degrees from the University of Vermont, 00:00:25.000 |
and his research leverages a quantitative linguistic perspective that applies math and 00:00:31.120 |
statistical methodologies to analyze and improve linguistic learning systems. 00:00:36.800 |
Following a one-year postdoc appointment at the University of California, Berkeley, studying 00:00:41.760 |
large-scale machine learning, in 2015 Dr. Williams became a data science faculty member 00:00:48.440 |
at Drexel, where he drove the foundation of a data science MS program and develops and instructs 00:00:54.000 |
data science coursework, including natural language processing with deep learning. 00:01:00.560 |
So, welcome, and thank you for coming today for your talk; feel free to do a quick introduction before you begin. 00:01:12.600 |
Thanks for coming out, and also for showing up online. 00:01:19.760 |
As was mentioned, my name is Jake, and my background's in math and physics, so the perspective 00:01:25.240 |
that I'm coming from towards this work might be a little bit different than the standard, 00:01:29.400 |
and that'll be a theme throughout the discussion. 00:01:32.800 |
The purpose of this discussion is to go through a relatively long-term development, a project 00:01:40.840 |
that I've been working on, and as mentioned, my background is in quantitative linguistics, 00:01:46.840 |
which means my history of focus on language has primarily been to develop general theories 00:01:56.600 |
and descriptions of phenomena that you observe with regard to linguistic units, whatever they may be. 00:02:04.120 |
It's a statistical approach based on theories of language generation that are statistical 00:02:11.640 |
in basis, and over the course of my time as a researcher, I've explored and ventured into 00:02:21.560 |
language modeling itself and ultimately into neural networks as they approach language 00:02:26.080 |
modeling themselves, and that's what brought me here through quite a bit of other work. 00:02:34.200 |
So if you look into my profile, you'll see a lot of different subjects in applied 00:02:38.800 |
NLP and, like I said, quantitative linguistics, and neural networks are a natural transition 00:02:44.520 |
for me into inferential work. So let's get started. 00:02:52.160 |
So well, this is how we'll start the conversation today. 00:02:58.920 |
We came at this subject from a different approach, trying to think about layer initializations 00:03:07.600 |
in neural networks, and the subject that we're discussing as the framing for this talk 00:03:16.440 |
is specifically focused on transformer architecture components, the self-attention component that's 00:03:21.480 |
pivotal to the success of the transformer architecture, and it focuses on the fact that 00:03:28.120 |
self-attention requires a quadratic comparison of vectors in order to produce the feature 00:03:32.200 |
weights of those vectors needed to model long-range dependencies in text. 00:03:38.320 |
Commonly, parameters for self-attention are based on transformation matrices, usually two, 00:03:43.640 |
the queries and keys, that are responsible for dimensionalizing input vectors, and I describe 00:03:49.800 |
it this way because generally speaking, when you're at the point of a self-attention layer, 00:03:54.640 |
you already have low-dimensional vectors, but the parameters in a standard self-attention 00:04:00.080 |
layer are changing the dimensionalities and the structure of that dimensional space. 00:04:04.520 |
They are like an embedding layer, which is factorizing the embedding dimensions. 00:04:10.560 |
This redimensionalization is the primary means by which self-attention creates feature weights. 00:04:15.520 |
It really just computes similarity in that shared space. 00:04:21.400 |
Large and similar inner products really just result in strongly weighted features, so it's 00:04:25.360 |
up to that dimensionalization to produce good similarities for whatever purpose your prediction serves. 00:04:32.520 |
However, an alternative strategy for feature weights might ask: given a basis, so in other 00:04:38.000 |
words, given that you're stuck with your low-dimensional vectors, what is the optimal way, 00:04:43.320 |
by a matrix transformation, to convert the comparisons of the vectors you're looking at, the 00:04:49.320 |
vector similarities that you are stuck with, into the best weights for features? 00:04:55.600 |
In other words, treat this as a feed-forward layer to produce self-attention weights as 00:05:00.240 |
opposed to try and transform to some basis that produces good feature weights. 00:05:06.320 |
The use of this modified self-attention mechanism will be part and parcel of the substance of this talk. 00:05:15.960 |
It's worth noting that this alternative mechanism is entirely compatible with the traditional one. 00:05:22.120 |
In other words, you could still change the dimension and compute similarities and then 00:05:29.400 |
convert that with a second feed-forward layer to produce optimal feature weights. 00:05:36.600 |
This work explores how well that alternative prediction of feature weights can function. 00:05:42.960 |
However, we'll avoid the standard mechanism for two reasons. 00:05:47.200 |
First, we have no solution to the standard parameters for self-attention as an initialization. 00:05:54.040 |
And this will be discussed at length in slides to come. 00:05:58.280 |
Likewise, it would create an additional model complexity that would muddle the effects of 00:06:03.160 |
the modified form of self-attention that we wish to study. 00:06:05.640 |
So having that dimensionalization as a way to produce good feature weights would confuse 00:06:12.120 |
whether or not the feed-forward computation of feature weights is functioning well. 00:06:17.440 |
There's a catch to this, however, which is that these vectors that we use for such a comparison have to be good vectors to begin with. 00:06:26.240 |
In other words, their comparisons must be consistent and meaningful in the first place. 00:06:34.080 |
So to get it out of the way, here's an architectural diagram for the relatively simple near-shallow model we'll be working with. 00:06:43.400 |
It doesn't seem like there are many neurons in a network of this type. 00:06:46.320 |
And that's because all of the activations are softmax, which means despite the fact 00:06:51.080 |
that the U matrix, for example, is an entire layer, it's really just going through a single 00:06:56.920 |
prediction non-linearity, the softmax function. 00:07:00.200 |
So you can think about this as essentially a three-layer network that might be creating its predictions. 00:07:07.080 |
Likewise, the difference in presentation here over self-attention, which is parameterized 00:07:12.240 |
by the matrix W here, is intending to show how a-- whether you consider it the query 00:07:21.560 |
or the key-- one vector is the pivot for the comparison that will produce the feature weights, 00:07:29.360 |
which is then fed forward in this model through W. 00:07:33.600 |
This is the case for standard self-attention, too. 00:07:35.960 |
In other words, you can reduce it to a per-prediction diagram in this way, where a gray vector is the pivot. 00:07:47.160 |
The attention distribution coming out of the W matrix and the softmax function is indicated 00:07:51.780 |
by the vertical red bar there, which weights the block of vectors in black. 00:07:57.840 |
That includes the pivot vector in gray, which is then passed through a feed-forward layer, 00:08:03.720 |
often called the values of a standard self-attention matrix, U. 00:08:08.760 |
We then-- since we use U as a way to reduce the dimensionality of the prediction that 00:08:14.720 |
we're trying to make, we then feed that forward through another layer and then to output. 00:08:22.200 |
And that's essentially the relative shallowness that we're talking about here. 00:08:27.140 |
U is a self-attention matrix, which means there are really only two layers in effect here. 00:08:36.660 |
And you might wonder, for example, why we're using a different activation function, the 00:08:40.260 |
softmax, instead of any of the dimensionally independent activation functions, like a logistic sigmoid. 00:08:48.360 |
And that's because we have additional insight into the softmax function and the parameters that feed into it. 00:08:58.580 |
So let's talk about those vectors first, though, before we get to layer initialization. 00:09:07.700 |
Optimizing the keys and queries of standard self-attention bears substantial similarity to optimizing an embedding layer. 00:09:14.460 |
This is because the key and query matrices have a common dimension that they project 00:09:19.700 |
to, much like you'd see with the factorization of an embedding layer on its own. 00:09:29.500 |
Normally, there might be multiple self-attention heads. 00:09:33.620 |
And because of the indeterminacy in creating a different dimensional space-- in other words, 00:09:38.780 |
there are multiple equivalent reshufflings of those different dimensions which will produce 00:09:43.580 |
the same output-- that indeterminacy is something that we hypothesize has bearing on what is 00:09:51.420 |
now referred to as the lottery ticket hypothesis. 00:09:53.540 |
In other words -- or this is the way that I would state it -- that multiple 00:09:59.980 |
different embeddings which produce different vector spaces can be leveraged in parallel as redundant sub-networks. 00:10:07.340 |
Or, in the way that it's implemented, that if a random initialization doesn't do well, there's a sub-network inside it that does. 00:10:16.180 |
And that sub-network will do just as well, even after it's totally trained. 00:10:21.700 |
In other words, having multiple clones, self-attention heads, which have no difference in the outputs 00:10:27.140 |
that they're trying to predict, is at the root of the lottery ticket hypothesis. 00:10:32.380 |
And ultimately, that invocation of the lottery ticket hypothesis is really a justification 00:10:37.100 |
for eliminating parameters whose substantial cost of training is essentially wasted as redundancy. 00:10:45.700 |
You might ask questions like, well, what is a good initialization? 00:10:49.100 |
What is a good set of word embeddings to use? 00:10:55.780 |
So how the lottery-ticket-hypothesis interactive effects of randomly initialized embedding 00:11:02.020 |
layers can be avoided when constructing language models is another question that is embedded in this work. 00:11:12.860 |
But we shouldn't say that dimensionality reduction isn't needed. 00:11:19.280 |
For language modeling, you absolutely have to work with reduced dimension unless you're working with a very small vocabulary. 00:11:25.500 |
For example, something like the 26 Latin characters, as in Wav2Vec. 00:11:33.860 |
The inherent input dimension of a large vocabulary model presents many computational intractabilities 00:11:39.940 |
when designing NLP systems, something that you're probably all very aware of. 00:11:43.660 |
Likewise, though, the distance from embedding layers to learning information, the loss at 00:11:50.500 |
outputs, puts them in a challenging position to train. 00:11:53.980 |
It's really hard to learn embedding layers because of the indeterminacy in the space they define. 00:12:01.460 |
You could swap dimensions, and it's equivalent. 00:12:07.060 |
But the distance means that they receive learning information last. 00:12:13.760 |
This is a real challenge, and it's present in the history of NLP and deep learning, too. 00:12:24.380 |
And this is exacerbated in the way that we have to actually learn embedding layers in 00:12:28.140 |
standard models where we might modify learning rates to be lower all the way back at the 00:12:33.300 |
bottom of a network to be gentle with those embedding layers and help them learn effectively. 00:12:40.620 |
But this is real trouble, because if we had a good embedding layer at the start, those 00:12:46.380 |
subsequent layers could be much easier to learn. 00:12:53.420 |
So ultimately, in order to approach this challenge, we came along with a discernibility hypothesis. 00:13:03.060 |
In other words, this boiled down to the theory that low-dimensional vectors, more than anything, just need to make features discernible. 00:13:11.900 |
And that doesn't sound like a very strong assertion. 00:13:17.020 |
And we started with a really, really, really low bar and assumed that the most common features should be the most discernible. 00:13:27.480 |
So if we're stuck with a lower dimension and we can't give everything a one-hot vector 00:13:31.140 |
to be told apart very well, then we might want to give the clearer vectors, which 00:13:36.980 |
have more dimensional independencies, to those features which appear most frequently. 00:13:48.980 |
This hypothesis led us directly to develop the bit cipher algorithm, which is really 00:13:54.500 |
just a scheme for assigning vectors of zeros and ones. 00:13:59.300 |
Nothing too crazy in terms of what we're attempting to do. 00:14:02.940 |
In the figure at right here, the order of vector assignment is by row from top to bottom. 00:14:08.220 |
And this is on a five-dimension, five-bit vector system. 00:14:14.140 |
The first five from bottom are those one-hot vectors. 00:14:18.380 |
Past that point, you'll see two-hot vectors, but they're a little bit less darkly shaded, 00:14:24.700 |
indicating the way that we actually utilize the system. 00:14:26.740 |
In other words, we normalize them to have unit sum. 00:14:34.020 |
What I hope you can see from this is that the bit cipher algorithm generalizes one-hot encoding. 00:14:43.060 |
And as a result, we can work from a very sparse feature set and explore dimensionalities as low as we like. 00:14:54.020 |
And this assignment is incredibly naive, too. 00:14:57.180 |
That's the other thing that I want you to see as well, that this discernibility hypothesis 00:15:01.740 |
does not create any meaningful correlations between tokens that behave similarly. 00:15:05.980 |
So if you've got the upper and lower case of a word, their vectors aren't going to capture 00:15:11.620 |
those similarities according to the bit cipher. 00:15:14.460 |
It's really just gonna try and make sure that those features are distinguishable in a low-dimensional 00:15:19.020 |
space and that the most distinguishable features are those which appear most commonly. 00:15:25.020 |
This was enough to do a surprising amount of work. 00:15:31.960 |
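To make that concrete, here is a minimal Python sketch of the kind of scheme described here; the function name and the exact ordering of the multi-hot patterns are assumptions for illustration, not the released bit cipher implementation.

```python
from itertools import combinations

import numpy as np

def bit_cipher_vectors(tokens_by_frequency, dim):
    """Assign dim-bit patterns to tokens, most frequent tokens first.

    One-hot patterns are handed out first, then two-hot, then three-hot,
    and so on, and each pattern is normalized to have unit sum, as
    described in the talk. This is a hypothetical sketch of such a
    scheme, not the released bit cipher implementation.
    """
    patterns = (bits for k in range(1, dim + 1)
                for bits in combinations(range(dim), k))
    vectors = {}
    for token, bits in zip(tokens_by_frequency, patterns):
        v = np.zeros(dim)
        v[list(bits)] = 1.0 / len(bits)  # zeros and ones, normalized to unit sum
        vectors[token] = v
    return vectors

# Example: a 5-bit cipher over a toy frequency-ranked vocabulary.
vocab = ["the", "of", "and", "to", "a", "in", "is", "it"]
vecs = bit_cipher_vectors(vocab, dim=5)
print(vecs["the"])  # a one-hot vector for the most frequent token
print(vecs["in"])   # a two-hot vector, scaled to sum to one
```

The point of the sketch is only that the most frequent features get the most distinguishable (one-hot) patterns, and everything else gets progressively denser multi-hot patterns.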
So with some scheme for a deterministic low-dimensionalization procedure, we were then able to utilize this 00:15:41.960 |
solution that we had actually developed previously. 00:15:45.500 |
So this was actually the real motivator for a lot of the work that you're seeing today, 00:15:49.940 |
although it might seem like it's just a checkpoint in the middle. 00:15:56.020 |
Provided bit cipher produces decent embeddings, we can ask, can other layers be non-randomly initialized? 00:16:01.980 |
In other words, without gradient descent or backpropagation or other gradient-based iterative methods? 00:16:07.740 |
This equation came about from analysis of Word2Vec with its original softmax activation function. 00:16:16.300 |
And much like other articulations of the Word2Vec family of embeddings, it came up with differential 00:16:27.220 |
solutions that depended on co-occurrence matrices. 00:16:32.820 |
Is there a way to take a co-occurrence matrix, F, in this equation here, and convert it with 00:16:41.300 |
some weights, some denominators by row, into something that warms up a single-layer feedforward network? 00:16:53.380 |
And ultimately, this (k - 1)/k term here, and this sum, are really just expressing something 00:17:03.700 |
like conditional probability, because (k - 1)/k is a wrinkle that says that as the 00:17:11.860 |
number of features increases, in other words, as the context window increases in a block transformer, 00:17:19.500 |
the warm start that we can apply to start off a neural network without any randomness is 00:17:26.540 |
entirely determined by the vectors underneath, nearing whatever direction the optimization is going. 00:17:34.540 |
All we have to do is compute some co-occurrences between inputs and outputs, and I don't 00:17:39.340 |
necessarily mean the standard co-occurrences that you might have learned about a long time ago. 00:17:44.860 |
I mean, whatever your inputs are, whatever your outputs are, you take their sum of outer 00:17:52.240 |
products and you get a co-occurrence matrix of inputs and outputs, and that can then be 00:17:58.220 |
utilized to initialize your layer in that neural network to be vastly more performant 00:18:06.100 |
than what you'd get by a random initialization. 00:18:13.900 |
This was just for a single-layer model, but it depended on the softmax function for activation. 00:18:22.060 |
And the softmax function as an activation function, we knew, is also necessary for self-attention 00:18:30.240 |
And this meant that if we could put self-attention into some kind of a standard form with this 00:18:35.700 |
equation just like a single layer, then we could apply the same solution with one catch. 00:18:43.940 |
That catch is specifically that we don't know what the targets are for self-attention. 00:18:49.100 |
There's no target vector y, the thing that you're trying to predict, which position is 00:18:55.340 |
the one that you want to weight most strongly. 00:18:58.700 |
And so in order to apply this solution for a self-attention model, we had to do some additional derivation. 00:19:05.140 |
And that's in the reference number one, which is all the way back up in the first slide 00:19:11.120 |
But that derives a differential criterion, an analog for the single-layer solution that 00:19:17.980 |
tells us what the targets of that kind of self-attention actually are, the hidden targets, 00:19:23.460 |
the weights that you're trying to create, which really are just about making sure that 00:19:28.700 |
the layer above self-attention has some unsurprising things coming towards it. 00:19:36.300 |
The self-attention layer is really just trying to massage the vectors so that way they look 00:19:40.140 |
like something that the next layer above expects. 00:19:44.540 |
Aside from that, though, it's a much more in-depth conversation. 00:19:48.620 |
The point, though, is that for the model in this picture here, we can now start off with non-random vectors. 00:20:04.640 |
We can use those vectors x to initialize non-randomly the parameters in W, the self-attention matrix, 00:20:13.780 |
and then use that, going up the network, to initialize the parameters in U, since it's 00:20:19.460 |
just a feed-forward layer with whatever self-attention is giving it as weights. 00:20:24.620 |
And then whatever that produces, the hidden state, H, we can use that with the actual 00:20:29.700 |
targets after the output layer to warm up the matrix O. 00:20:37.460 |
And you might say, "Okay, well, how did you figure out what those hidden targets are?" 00:20:43.660 |
You had to have an output for the U matrix to try and hit. 00:20:49.100 |
That too is something that the bit cipher can provide in the form of label embeddings. 00:20:57.300 |
In other words, low-dimensional targets of the thing that is downstream that you're trying to predict. 00:21:04.660 |
So similarly, we can warm start the U matrix in terms of those bit cipher label embeddings. 00:21:16.260 |
So in this view, the aim is to show how simple and general a single-layer softmax-activated solution is. 00:21:22.880 |
It's really just no more challenging than computing conditional probability given inputs and outputs. 00:21:30.420 |
It's fast, it's something that you can distribute in terms of processing, and it's very, very efficient. 00:21:39.960 |
So this is essentially the process that we're using in order to warm up the W and U matrix. 00:21:51.060 |
There's the U matrix there, starts out as zeros. 00:21:54.980 |
In other words, nothing, no random values, no weights anywhere. 00:22:00.460 |
Over the data, which is just borrowing the dimension of this gigantic Y matrix that has 00:22:06.940 |
all of the targets in it for the entire data set, we simply just take the outer products 00:22:14.060 |
of whatever the hidden state, the input to that layer is, assuming that the lower layers 00:22:18.740 |
beneath it are also warmed up with whatever the targets for that layer are. 00:22:25.740 |
Following that, it's really just about normalization and a logarithmic transformation. 00:22:31.380 |
And that logarithm really just emerges as a result of being an inverse to the exponential 00:22:36.860 |
function, which is a part of softmax, pretty much all of softmax. 00:22:50.700 |
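As a rough sketch of the procedure just described (and of the earlier co-occurrence equation), here is one plausible reading in Python. The exact normalization in the published solution may differ, it omits the (k - 1)/k correction discussed above, and the function names are illustrative assumptions.

```python
import numpy as np

def warm_start_weights(X, Y, eps=1e-12):
    """Warm start a single softmax layer from input/output co-occurrences.

    X: (N, d_in) non-negative inputs (e.g., summed unit-sum context vectors).
    Y: (N, d_out) targets (e.g., one-hot tokens or bit cipher label embeddings).
    Hedged reading of the talk: accumulate outer products into a
    co-occurrence matrix, row-normalize it into something like conditional
    probabilities, and take a log to invert the softmax's exponential.
    """
    F = X.T @ Y                                    # sum of outer products of inputs and outputs
    P = F / (F.sum(axis=1, keepdims=True) + eps)   # per-row denominators
    return np.log(P + eps)                         # logarithm as the inverse of exp in softmax

# Toy usage: start a layer from these weights instead of a random init.
rng = np.random.default_rng(0)
X = rng.random((1000, 32)); X /= X.sum(axis=1, keepdims=True)
Y = np.eye(8)[rng.integers(0, 8, size=1000)]
W0 = warm_start_weights(X, Y)
```

For data without unit-norm inputs, the talk's later observation applies: the effective k becomes the average norm of the inputs rather than the number of context vectors.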
This is going back to before we had the bit cipher algorithm for dimensionality reduction. 00:22:58.980 |
And we started out by just saying, OK, if we take a simple, simple language model that 00:23:05.700 |
only looks at a radius of traditional co-occurrences as features, we can concatenate those vectors 00:23:13.140 |
and feed them forward for a language model's output. 00:23:17.380 |
A completely random start, a cold start to a language model, really just gives a perplexity around the size of the vocabulary. 00:23:27.860 |
And those three lines here for a few different radii are demonstrating that point with the 00:23:33.980 |
point all the way at the top left-hand corner of this figure, cold starts. 00:23:41.380 |
In any of those cases, when the warm start is applied, the perplexity is immediately much lower. 00:23:49.100 |
And furthermore, the trajectories that the updates follow continue, at the same learning 00:23:57.460 |
rate and over the same time, to perform better than models that were started cold. 00:24:05.700 |
If you have an early stopping criterion, the comparison is similar. 00:24:08.820 |
Early stopping, generally, will engage first and with a higher perplexity. 00:24:19.300 |
So this was the first indication that we had figured out something that's very useful. 00:24:24.820 |
There are some folks on Slido saying they're a bit confused. 00:24:31.780 |
They're asking, are we talking about an alternative approach to self-attention? 00:24:41.660 |
And yes, it is the premise of this whole conversation. 00:24:45.980 |
So here, in this modified version of self-attention, you might normally expect to do a comparison of your inputs with themselves. 00:24:55.820 |
Whatever your inputs are, they might be a whole block of vectors. 00:25:00.660 |
It's not cross-attention, where you have different vectors that you're trying to attend over. 00:25:06.500 |
And forgetting about the values, which for us is the U matrix, the keys and queries, 00:25:15.700 |
which are the parameters for self-attention, are in the middle. 00:25:18.340 |
They're in between the two copies of the inputs, X. 00:25:25.580 |
Each of those you can view as some kind of a projection down to a dimension where they 00:25:30.860 |
And this is necessary for something like cross-attention, where you might have different dimensionalities 00:25:35.540 |
like X1 and X2 in two separate groups of vectors if you're doing something like machine translation. 00:25:41.820 |
That's not necessary to think about when you're just looking to do a standard language model 00:25:48.660 |
that has to predict the next output according to the inputs, which are also outputs from 00:25:58.940 |
Two insights here-- one, that multiplying the key and query matrices, WK and WQ, it's 00:26:08.740 |
just another parameter matrix that's implied. 00:26:11.860 |
There aren't two parameter matrices there in the middle for self-attention in any effective sense. 00:26:19.060 |
There is a common dimension of comparison, and that kind of just moves stuff around. 00:26:24.580 |
It creates degrees of freedom so that optimization can figure out what's the best weighting from 00:26:32.300 |
But the softmax function is strictly operating on similarities of that comparison space. 00:26:40.500 |
It's not doing anything more with those similarities. 00:26:46.460 |
So if it was a big similarity, it's a big attention value. 00:26:51.060 |
In this equation, there's no transformation happening before those vectors are multiplied together. 00:26:58.980 |
So those better be good vectors that you're starting with -- x and x transpose. 00:27:08.580 |
They can't be vectors from cross-attention, where you're trying to translate from one 00:27:11.940 |
language to another, and they just don't have a meaningful inner product. 00:27:15.740 |
You could force it through if they were two differently trained embedding layers, and 00:27:19.740 |
they had the same dimension with this mechanism. 00:27:23.140 |
And if you didn't, you could put those key and query matrices back in between the two groups of vectors. 00:27:33.580 |
But a lot of what's going on here in this talk is trying to simplify and make more efficient 00:27:42.860 |
the architectures that we need and the mechanisms that they utilize, given what we know about language. 00:27:55.520 |
If all we're doing is autoregression, we don't need cross-attention dimensionalization in the middle. 00:28:01.260 |
That'll be the theme, in other words: can we use knowledge that we have about the 00:28:08.980 |
way language functions to design better versions of architectures that meet the needs of language modeling? 00:28:23.780 |
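For those following along on Slido, here is a minimal Python sketch of my reading of the contrast being drawn: standard self-attention re-dimensionalizes the inputs with key and query matrices before comparing them, while the modified form keeps the raw similarities of the input vectors and feeds them forward through a single weight matrix before the softmax. The exact shapes and names of the weight matrices here are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention_weights(X, W_q, W_k):
    """Standard form: re-dimensionalize X with query/key matrices, then compare."""
    scores = (X @ W_q) @ (X @ W_k).T
    return softmax(scores, axis=-1)

def modified_attention_weights(X, W):
    """Modified form, as I read the talk: keep the raw similarities X @ X.T
    and feed them forward through a single weight matrix before the softmax."""
    sims = X @ X.T                    # comparisons of the vectors you're stuck with
    return softmax(sims @ W, axis=-1)

# Toy shapes: a block of 8 input vectors of dimension 16.
rng = np.random.default_rng(0)
X = rng.random((8, 16))
A_std = standard_attention_weights(X, rng.standard_normal((16, 4)), rng.standard_normal((16, 4)))
A_mod = modified_attention_weights(X, rng.standard_normal((8, 8)))  # note the different weight shape
```

Note how the modified form's weight matrix is shaped by the block length rather than by a projection dimension, which is the shape difference mentioned again later in the talk.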
So if there are any questions here, it's a good time. 00:28:39.920 |
The thing about these language models is that they're really simple language models. 00:28:46.420 |
This is really just evaluating that a warm start in either the blue, green, or purple 00:28:51.320 |
case does better than its partner, which is a cold start of the same architecture with the same hyperparameters. 00:29:04.740 |
So three different models, regardless of how long your context is in each case here, we 00:29:10.160 |
see that a model which has a nonrandom initialization by the equation presented two slides back 00:29:16.520 |
from here starts a network off with a much lower perplexity. 00:29:26.920 |
The requirement to apply this solution to a feedforward layer of parameters is simply 00:29:33.180 |
that your inputs should not have negative values. 00:29:44.420 |
So it becomes really easy to ask questions like, well, what happens when you apply this to other kinds of data? 00:29:52.940 |
Well, there's one little catch that we had to think about here in this case, and that 00:29:57.820 |
is with the bit cipher or one-hot vectors, we're controlling the norms of the inputs. 00:30:04.260 |
With standard embeddings, with MNIST, for example, when you're trying to predict the 00:30:10.420 |
handwritten digits' 0 through 9 value, you don't get to assume necessarily that all inputs have unit norm. 00:30:21.580 |
You can normalize the inputs, but it doesn't necessarily make sense to normalize them to 00:30:26.540 |
one when you're looking at images, for example. 00:30:30.340 |
They have 0 through 255, for example, in MNIST. 00:30:35.440 |
And as a result, we can put these data through that same warm start. 00:30:42.100 |
Now one little caveat here I've alluded to about the norms of vectors is that we don't always know what k should be. 00:30:53.060 |
In other words, let me go back, you could look at it here or here, that's the number 00:31:02.660 |
of features per prediction, which if you're looking at unit-normed word vectors is however 00:31:10.540 |
big your context window is, k, because they all have unit norm and there's k of them. 00:31:17.860 |
But if you're looking at just an image, it's not clear if it's a composition of multiple 00:31:23.060 |
vectors, if it's one vector, and how many it is, if it is a composition. 00:31:32.940 |
In application to data like that, that is what k becomes, the average norm of an input. 00:31:41.900 |
And I'm regretting not putting a graph in this, but the paper that discusses this shows 00:31:45.820 |
that in the MNIST dataset, the exact optimal value of k is the average norm of the inputs 00:31:58.300 |
And that's how we generally apply this rule when we're warm starting systems and we don't control the norms of the inputs. 00:32:04.940 |
And it was learned from studying this model's application, this solution's application to 00:32:13.180 |
But as mentioned, the purpose was always towards language. 00:32:21.620 |
So longer context windows in principle should provide models with more information than shorter ones. 00:32:30.140 |
This means one should expect that models perform better when context window length is longer. 00:32:40.420 |
And this is essentially the reason for why self-attention was initially developed. 00:32:45.180 |
Researchers wanted to improve language models with longer context windows, providing more information per prediction. 00:32:51.380 |
In other words, the more features, the more information, the more flexibility a model should have. 00:33:00.900 |
However, without feature weights, models didn't simply get better with long context windows, 00:33:06.300 |
and feature weights and self-attention were hypothesized to be needed. 00:33:11.540 |
And this was proven back in 2017 with the transformer architecture. 00:33:18.740 |
In moving towards self-attention and transformer though, the primacy of the transformer architecture's 00:33:24.700 |
block context model casts a shadow over the use of other context models. 00:33:32.260 |
So for example, if I were to ask here, is it clear to everyone that the standard self-attention 00:33:41.020 |
block model of context is different than the traditional notion of co-occurrences, which 00:33:46.660 |
use a radius that is not positionally anchored? 00:33:50.620 |
It is the context model, the positional anchoring of the block context model, that gives it its power. 00:34:06.540 |
Now what you do with that context model matters. 00:34:10.540 |
You can't just take those vectors in a block, add them together, and expect a feedforward layer to do well with them. 00:34:15.460 |
That's where self-attention is needed, in order to figure out which vector needs the strongest weight. 00:34:23.540 |
So what you'll also see in the architectures that are based on what I've already presented 00:34:29.260 |
is that we're interested to explore how different models of context for language models can 00:34:34.740 |
be integrated in general because they each provide different information. 00:34:41.440 |
And we all know that the standard transformer's block model of context requires a ridiculous 00:34:46.300 |
amount of information and data in order to become effectively trained. 00:34:53.080 |
So for the current set of contexts that we use, the top there might be the standard transformer block model. 00:35:04.860 |
And it takes the first 10 tokens, for example, the second 10 tokens, the third 10 tokens, and so on. 00:35:13.080 |
Each of those is a group of contextualizing vectors. 00:35:18.220 |
The second one there that you see with the r as a subscript is a radial model because it uses a radius. 00:35:24.980 |
In other words, rather than assume you're looking at the first 10 or the nth 10 features, 00:35:30.820 |
you pick a radius and you say, what are the last r features, the last r vectors? 00:35:36.980 |
That can also have an attention distribution, a self-attention distribution, according to the same mechanism. 00:35:45.140 |
It produces an entirely separate context in the state, whatever you want to call it, which 00:35:51.900 |
can be conjoined with the block model to articulate features and be given to an output layer that 00:36:00.780 |
knows what to do with them when each has different values. 00:36:06.940 |
The concatenation of those different context models keeps the information separate so the 00:36:11.500 |
output layer can decide which portion of the context is useful for the prediction. 00:36:18.820 |
This last one is getting really traditional at the bottom. 00:36:26.300 |
If you've ever implemented something like a Naive Bayes classifier or a term frequency 00:36:33.660 |
inverse document frequency model, that's essentially what a document model is. 00:36:43.140 |
Is it going to be the best for predicting the next token? 00:36:49.700 |
What that means is that even if you wrap to the next block between the radial and the 00:36:55.060 |
document models, you have a unique context vector, even if you're looking at the exact 00:36:59.860 |
same block, because the document has grown and the radius just says, what are the last r tokens? 00:37:07.780 |
As a result, when you incorporate different models of context, you don't really have to 00:37:14.300 |
It might not be very good to make predictions past the first block, but that might be about 00:37:19.260 |
how much data you've used, and it might be about the hyperparameters for each one of 00:37:24.340 |
those models that you're applying, in other words, radius, the block size, like usual. 00:37:33.220 |
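As a rough illustration of combining these context models (block, radial, and document), here is a hedged Python sketch; the aggregation choices (summing the block and radial vectors, averaging the document) are assumptions for illustration rather than the exact models in the talk.

```python
import numpy as np

def combined_context(vectors, t, block_size=10, radius=5):
    """Build a concatenated context representation for position t.

    vectors: (T, d) static token vectors (e.g., bit cipher embeddings).
    Concatenation keeps the block, radial, and document views separate,
    so the output layer can decide which portion is useful.
    """
    block_start = (t // block_size) * block_size
    block = vectors[block_start:t + 1].sum(axis=0)               # positionally anchored block
    radial = vectors[max(0, t - radius + 1):t + 1].sum(axis=0)   # last r tokens, wherever t is
    document = vectors[:t + 1].mean(axis=0)                      # whole-document, bag-like view
    return np.concatenate([block, radial, document])

# Toy usage: even at the same within-block offset, the radial and document
# parts differ, so the combined context is unique as the document grows.
rng = np.random.default_rng(0)
V = rng.random((40, 8))
c1, c2 = combined_context(V, 12), combined_context(V, 22)
```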
So far, the only embeddings that I've suggested are from this bit cipher algorithm, and as 00:37:38.540 |
I've expressed, they don't capture any useful similarities between similar tokens. 00:37:45.140 |
The bit cipher algorithm doesn't care if you're looking at the uppercase or the lowercase version of a word. 00:37:50.860 |
It doesn't see them as bearing any similarity, even though they might be used very similarly. 00:37:57.300 |
So how can you utilize the bit cipher to create vectors for tokens that have meaningful similarities? 00:38:12.060 |
And this is just backing off to the traditional methods once again, taking co-occurrences 00:38:19.460 |
of bit cipher vectors with whatever's at the middle or center of a co-occurrence window. 00:38:27.980 |
Normally, if you think about one-hot vectors, a co-occurrence matrix is really just the 00:38:34.380 |
same thing, except now we just have smaller vectors with different dimensions on, so to speak. 00:38:44.580 |
And we normalize after concatenating these blocks of different radii from the bit cipher 00:38:52.660 |
to match the original input requirements that we discovered for the warm start solution. 00:39:00.620 |
And that enables us to use these just like we would the original bit cipher vectors, except 00:39:06.900 |
now, just from the usual co-occurrence statistics, you'll see that the capitalized word and the lowercase word get similar vectors. 00:39:17.980 |
And you know this works because you've seen co-occurrences for a very long time, and while 00:39:23.740 |
they might not normally be useful in our applications these days with deep learning, they can be 00:39:29.740 |
imparted through the bit cipher algorithm to prescribed vectors as well. 00:39:41.920 |
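A minimal sketch of that construction, assuming a single symmetric radius; details like the choice of radii and the exact normalization are assumptions here, not the released procedure.

```python
import numpy as np

def cooccurrence_embeddings(token_ids, cipher_vectors, radius=2):
    """Build token embeddings as co-occurrence sums of bit cipher vectors.

    token_ids: sequence of integer token ids; cipher_vectors: (V, d) array
    of bit cipher vectors. Each token's embedding is the sum of the cipher
    vectors of its neighbors within the radius, normalized to unit sum, so
    tokens used in similar contexts end up with similar vectors.
    """
    V, d = cipher_vectors.shape
    E = np.zeros((V, d))
    for i, center in enumerate(token_ids):
        lo, hi = max(0, i - radius), min(len(token_ids), i + radius + 1)
        for j in range(lo, hi):
            if j != i:
                E[center] += cipher_vectors[token_ids[j]]
    E /= E.sum(axis=1, keepdims=True) + 1e-12  # match the unit-sum input requirement
    return E
```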
So here's where things start paying out in terms of speed and efficiency. 00:39:51.380 |
If you only have one layer of self-attention, then that means that you don't need to worry 00:39:57.780 |
about whatever weird expressive stuff might be happening deeper in a network for similar inputs. 00:40:07.620 |
Since that first layer is just a set of static word embeddings, the self-attention layer only ever sees fixed vectors. 00:40:18.500 |
And that means each pair of words has a fixed comparison, given static word embeddings. 00:40:26.220 |
And that means if you want to compute the quadratic features of self-attention, you 00:40:31.460 |
can just pre-compute them and pull them from memory. 00:40:36.540 |
This caching of vector comparisons is essentially reducing the self-attention layer's cost from 00:40:43.260 |
quadratic to linear, since those values that we're using to weight the vectors for the 00:40:49.980 |
feedforward layer no longer require comparison across the block. 00:40:58.220 |
So when our vectors are static, which is at inference time, and if we're not learning 00:41:05.980 |
the embedding layer's parameters with iterative differential updates, then not only do we 00:41:14.060 |
have to not track gradients for the embedding layer, but we don't even have to compute the vector comparisons. 00:41:20.140 |
We can pre-compute them and just load them, which is much, much faster. 00:41:31.900 |
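Here is a hedged sketch of that caching idea: with static embeddings, the pairwise similarities can be computed once and looked up, so the attention weights for a block reduce to table lookups plus a feed-forward step. The table layout and function names are illustrative assumptions.

```python
import numpy as np

def precompute_similarities(embeddings):
    """With static embeddings, pairwise similarities never change,
    so compute the full V x V table once and reuse it."""
    return embeddings @ embeddings.T

def cached_attention_logits(token_ids, sim_table, W):
    """Look up similarities instead of recomputing them per block.

    The per-block work becomes gathering rows of the cached table and one
    matrix multiply, rather than a fresh quadratic comparison of vectors."""
    sims = sim_table[np.ix_(token_ids, token_ids)]  # cached comparisons
    return sims @ W                                 # feed forward to attention logits

# Usage sketch: build the table once, then reuse it for every block.
rng = np.random.default_rng(0)
E = rng.random((1000, 64))        # static (e.g., frozen bit-cipher-based) embeddings
S = precompute_similarities(E)
logits = cached_attention_logits(np.array([3, 17, 42, 42, 7]), S, rng.standard_normal((5, 5)))
```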
So we can reduce a lot of the inference and training costs -- not all the training costs, 00:41:38.620 |
some of the training costs, because if we want to update those vectors, then we can't cache the comparisons. 00:41:48.460 |
This means that we can train these self-attentive feedforward unit models very quickly and with relatively little compute. 00:41:57.460 |
But there are some other things that we immediately observed while developing these models, and 00:42:01.820 |
that is, the lack of randomization produced models which were quite effective even on small data. 00:42:10.300 |
Now, it doesn't mean that training on small data will let you generalize to everything 00:42:15.500 |
In other words, training on a small data set might produce a model which has a surprisingly 00:42:20.100 |
low perplexity on its training set, but it doesn't mean that you're going to be able 00:42:24.820 |
to generalize and have a language model that's talking well from just hearing a couple of thousand tokens. 00:42:29.940 |
It does mean it will know that couple of thousand tokens very well, very quickly. 00:42:38.700 |
But there's a challenge with using self-attention still, and that is the fact that the block 00:42:45.220 |
model of context often is not fully utilized, since many documents are shorter than long context windows. 00:42:59.620 |
And these days, there are exceptionally long context windows. 00:43:05.500 |
Many of the language modeling benchmarks simply don't even go out to a thousand words when 00:43:09.060 |
it comes to context, and you're looking at a document to predict. 00:43:15.020 |
So this has been a problem for a while, and it means that if you're going to pad your 00:43:22.680 |
short documents, you're going to waste a lot of prediction on those paddings. 00:43:27.740 |
A lot of computation gets lost just for null information, essentially. 00:43:35.300 |
And the way that this is often relieved in some groups, and to great effect, is by packing documents together. 00:43:44.740 |
So for example, if you've got a hundred thousand token context window, most documents will not come close to filling it. 00:43:51.660 |
What do you do with the rest of that long context if you only want to use a thousand tokens of it? 00:43:58.020 |
You fill out the other ninety-nine thousand tokens with a bunch of other random documents 00:44:02.100 |
that don't belong anywhere near the first one. 00:44:07.640 |
Packing can be utilized without impacting different documents with each other, without 00:44:14.880 |
contaminating the information between documents, and that takes a lot of work, but it can be done. 00:44:22.160 |
However, there are different strategies that we could employ, different engineering tricks 00:44:28.760 |
that we could employ, to make our operation of self-attention more effective at any length 00:44:36.760 |
of document without having to deal with this packing problem. 00:44:41.040 |
And that comes about by dynamically changing the context length down from some maximum value, 00:44:49.640 |
which is what you would normally set, to just use the context that you have. 00:44:55.280 |
But you still have to create batches if you want to train models quickly, and what that 00:44:58.840 |
means is that there's still some padding if you use this approach. 00:45:03.040 |
But you can pad those short documents to set lengths, batch short documents together, and batch long documents together. 00:45:17.200 |
This means that we don't need to pack documents together to make use of a long context window. 00:45:26.440 |
When a document is long, you can let its context be long. 00:45:28.960 |
When a document is short, you can put it with other short documents and just use a subset of the parameters. 00:45:36.560 |
And with traditional self-attention parameters, keys and queries, it would never be a subset 00:45:40.480 |
because it's a low dimensionalization that that matrix provides. 00:45:44.800 |
With this modified self-attention, though, there's a different shape to the weight matrix, 00:45:49.120 |
and that's why it's a subset of those parameters that we have to utilize, and that might be worth thinking about. 00:45:55.320 |
In other words, how does the difference in shapes of dimensionalities between this and 00:46:00.360 |
the standard self-attention weights shake out? 00:46:08.000 |
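A small Python sketch of the batching strategy just described, assuming simple length buckets; the bucket boundaries and truncation rule here are arbitrary illustrations, not the actual configuration.

```python
def bucket_batches(documents, bucket_edges=(128, 512, 2048), batch_size=32):
    """Group documents by length so short ones are padded only to a short
    bucket length and long ones get a long context, avoiding packing.

    documents: list of token-id lists. Yields (batch, context_length) pairs."""
    buckets = {edge: [] for edge in bucket_edges}
    for doc in documents:
        for edge in bucket_edges:
            if len(doc) <= edge:
                buckets[edge].append(doc)
                break
        else:
            buckets[bucket_edges[-1]].append(doc[:bucket_edges[-1]])  # truncate overlong docs
    for edge, docs in buckets.items():
        for i in range(0, len(docs), batch_size):
            batch = [d + [0] * (edge - len(d)) for d in docs[i:i + batch_size]]  # pad to bucket length
            yield batch, edge
```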
But we want to get to a different point for the sake of this conversation. 00:46:17.720 |
That should be a question that you're asking. 00:46:26.040 |
We're not entirely certain yet how an extremely large model like this will function on trillions of tokens. 00:46:35.920 |
In other words, can you expect the same kinds of outcomes, like a ChatGPT kind of thing, 00:46:44.120 |
from some of these models, with human interaction and RLHF and all the rest of that? 00:46:50.720 |
It's something that we're considering, but also at different scales, too, since those 00:46:57.760 |
are performant on their own as well -- but for what? 00:47:05.760 |
So the point is that from what we've stress tested into the billions, models can be trained 00:47:13.120 |
very quickly on a relatively small GPU, in ways that we expect, when we cache vector comparisons. 00:47:22.160 |
When we don't cache those comparisons, you see all of the growth in computation time 00:47:29.280 |
that you would expect from longer context windows. 00:47:35.820 |
This one here, though, we're trying to make it really, really, really small -- the one called micro. 00:47:42.060 |
That's because we want to see if we can train a model from scratch, since on very little 00:47:47.280 |
data, these models can fit effectively with the initializations that we've developed. 00:47:55.880 |
And with the purpose of starting from scratch, starting with no data, we're thinking about 00:48:01.040 |
edge computing cases where we could deploy a language model with a microphone so that 00:48:05.880 |
a person can talk to it and just train it from their own data, train it from their own speech. 00:48:19.480 |
So between these, we've explored a lot of different configurations, trying to consider 00:48:25.040 |
similarities to what some standard configurations might look like, a couple thousand tokens 00:48:29.240 |
in a context window, for example, to look something like a GPT-2 style model. 00:48:34.660 |
Thinking about bit cipher embeddings that are 500 dimensional or 1,000 dimensional to 00:48:39.560 |
be something like a GPT-2 -- that's, again, pointing towards the big/large category of models. 00:48:49.400 |
Beyond that, we haven't really touched those scales, because our first objective is not 00:48:55.440 |
to make big, big language models and train chatbots. 00:48:59.020 |
We want to know, what can we do with a small model, since this is a relatively unique capability? 00:49:12.140 |
To the best of our ability so far, it's kind of hard to see, but the first step is that 00:49:20.220 |
warm start, where you train the bit cipher, and you take a couple of splits of data, and 00:49:27.940 |
you compute that warm start for the self-attention layer and the feedforward layers. 00:49:34.140 |
In this case, we're really just using a 100 million token data set from the baby language 00:49:40.180 |
model challenge, which has as an objective to see what language models can do on a relatively small amount of data. 00:49:51.220 |
In other words, 100 million tokens is something that a person might hear in 10 years of their life. 00:49:58.100 |
In 10 years of life, people become pretty proficient speakers -- can a language model do the same? 00:50:07.900 |
The second stage, after the warm start happens, is where the majority of training time occurs, 00:50:14.620 |
and yet is also where training operates the most quickly. 00:50:22.260 |
At this stage, we find that freezing vectors is important. 00:50:26.060 |
One, because it means that we can train much quicker. 00:50:29.120 |
So we can have the subsequent layers optimized beyond their warm starts very, very fast, 00:50:35.620 |
using that vector caching, the vector comparison caching, to avoid the quadratic costs of self-attention. 00:50:43.260 |
This articulates the parameters in the middle layers of the model, taking 100 million 00:50:50.180 |
tokens and making five passes over the data here a lot quicker than any of the other stages. 00:50:57.980 |
The comparison that you'd make to this is the training time once those embedding layers 00:51:03.460 |
are unfrozen, where everything slows down to the normal speeds, where you have to do 00:51:08.580 |
all of your vector comparisons on the fly, since you can't assume that the same comparisons 00:51:14.060 |
will always result in the same numbers, since model parameters might be updated. 00:51:22.380 |
This is the best procedure that we've figured out so far. 00:51:25.160 |
And in order to make those vectors update, we find that learning rates have to be adjusted 00:51:30.160 |
dynamically inside of the network, like normal, and that the embedding layers are really tough to train. 00:51:40.380 |
And you'll notice here in this picture that the slowness and the lack of stability, for 00:51:45.660 |
example, in learning the embedding layer once it had been prescribed earlier, makes it really 00:51:51.220 |
hard to train over the entire data set compared to five passes, for example, in the middle 00:51:57.460 |
phase when the middle and upper parameters are being updated, still with backpropagation. 00:52:03.940 |
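In framework terms, the staged procedure described here might look roughly like the following PyTorch-style sketch; the module names, learning rates, and epoch counts are placeholders, not the actual training configuration.

```python
import torch

def staged_training(model, train_loader, train_epoch):
    """Stage 2: freeze the embeddings (enables cached comparisons, fast).
    Stage 3: unfreeze them with a gentler learning rate (slower, careful)."""
    # Stage 2: embeddings frozen, only the middle/upper layers update.
    model.embedding.weight.requires_grad = False
    fast_opt = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3)
    for _ in range(5):                       # e.g., five passes over the data
        train_epoch(model, train_loader, fast_opt)

    # Stage 3: unfreeze embeddings, with a lower rate for them specifically.
    model.embedding.weight.requires_grad = True
    slow_opt = torch.optim.Adam([
        {"params": model.embedding.parameters(), "lr": 1e-5},  # be gentle with embeddings
        {"params": (p for n, p in model.named_parameters()
                    if not n.startswith("embedding")), "lr": 1e-4},
    ])
    train_epoch(model, train_loader, slow_opt)
```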
And the other thing that I would highlight before leaving this slide is that, in phase one, the warm start saturates with relatively little data. 00:52:15.540 |
So if you have 100 million tokens, you really only need to apply the warm start to something 00:52:19.820 |
like maybe 10 million tokens, not that much more. 00:52:23.060 |
You don't see that much gain from that much more data. 00:52:27.820 |
That's not a bad thing, because it means that we don't have to apply that process for any more data than necessary. 00:52:34.500 |
It would be great if it gave us all of the optimization that we could hope for, but it's 00:52:38.820 |
not something that we could necessarily expect, since it's just an approximation of where the optimization is headed. 00:52:48.960 |
So, on the back of an envelope, thinking about the systems that example was trained on 00:52:54.860 |
as compared to other examples that are out there, and thinking about models that are 00:53:00.900 |
kind of, sort of, similar size: we're talking about a 12 gigabyte GPU, a relatively small 00:53:08.180 |
single chip, specifically when referring to these training times. 00:53:20.020 |
Just working off of eight chips, each having roughly four times the scale, and comparing 00:53:26.060 |
to this time that it took to train something with maybe an additional order of magnitude, 00:53:31.820 |
although we have trained models up to around 50 million parameters, too, which is getting closer to those scales. 00:53:39.820 |
We see training times that, if we scaled up to the relatively large systems that present 00:53:45.300 |
us with how much work we should expect to have to do for a model that large, we can anticipate being comparable. 00:53:53.500 |
But as mentioned, the initial objective here is not to simply figure out how well we can 00:54:02.180 |
It's to figure out what these alternative strategies are useful for, since they give 00:54:06.900 |
us access to different regimes of model scale as being effective. 00:54:15.860 |
So as mentioned, we've gone to relatively large amounts of data. 00:54:22.380 |
I wouldn't really call them big data at this time, even though just a couple of years ago 00:54:26.500 |
a billion tokens would be a relatively large amount of data. 00:54:30.860 |
It's really just a stress test at this point, gives us something like, do we continue to 00:54:35.860 |
see models getting better as we continue to give them more data? 00:54:39.460 |
Do we continue to see models getting better as we continue to give them longer context 00:54:44.780 |
And the answer to both of those questions is absolutely yes. 00:54:47.540 |
So nothing is telling us that we can't train bigger models with these. 00:54:51.540 |
But will those bigger models be as good as a standard self-attention model? 00:54:56.300 |
It's a different self-attention parameter matrix than what you see in a standard self-attention layer. 00:55:03.300 |
And in theory, that standard form should be overkill, because you'd have more parameters and more power there. 00:55:10.220 |
And we can see from this work that the alternative self-attention parameters are reasonably effective. 00:55:21.180 |
So I'll go quick through these, since this is the work that we're approaching right now. 00:55:27.740 |
And this is the idea that we're seeing as a use case for such a model like this. 00:55:38.540 |
Just training on the target data, whatever the data of interaction are. 00:55:43.420 |
And in this example, you'll see that this relatively smaller precision language model 00:55:47.900 |
just needs to predict whether or not a light should go on or off. 00:55:51.860 |
A lamp that listens with a microphone and a switch. 00:55:56.000 |
And you can use that switch to train the lamp. 00:56:07.180 |
And we want to anticipate whether or not you're going to flip the light on or off. 00:56:12.520 |
That's the task that we're going to try and approach. 00:56:16.420 |
Or that, rather, we're currently approaching. 00:56:19.340 |
There's a few different processes that integrate into this approach. 00:56:23.820 |
There has to be a microphone that's listening to you, recording audio. 00:56:29.700 |
And we use Wav2Vec at this point, because there's a very small version of it that's available. 00:56:34.340 |
And as a result, it doesn't even require you to use consistent -- well, it does require you to be consistent -- 00:56:40.260 |
but it doesn't even require you to use words, since it's strictly phonetic. 00:56:45.740 |
There has to be -- and this is the bread and butter of what's going on here -- an anticipator process. 00:56:54.620 |
And that process is responsible for creating good training data. 00:56:58.100 |
So this is a smart data collection algorithm that figures out, when you flip the switch, 00:57:04.820 |
is that the target for something that you just said? 00:57:08.460 |
Is that the transfer learning objective for text that was transcribed, that it should anticipate? 00:57:20.340 |
Following this, there's also two other processes. 00:57:23.060 |
One which operates on a different time cycle, and that's training. 00:57:29.940 |
Always be training a model, whenever there's new data, is essentially what that process does. The fourth process is the operator. 00:57:39.300 |
In other words, if you flip the switch, there has to be a process which operates the light 00:57:44.100 |
It always has to be a lamp in order to be useful. 00:57:45.940 |
It always has to be able to just be a switch. 00:57:51.380 |
However, that operation process likewise needs to see a directive from the anticipator. 00:57:57.780 |
If the language model predicts that you just said a thing, that means you want there to 00:58:01.900 |
be light, that operator then needs to receive the signal from the anticipator and execute 00:58:11.660 |
If the user then, though, within some time scale, changes the switch back after the model 00:58:18.980 |
created a prediction that was bad, the operator is also responsible for issuing a correction 00:58:24.920 |
to the anticipator to correct the training data. 00:58:30.220 |
What this looks like as a process is in this diagram here. 00:58:34.900 |
And you can see the flow here from stage one, a verbal command maybe gets recorded, transcribed, 00:58:45.240 |
And if there's no model that's yet trained, that text is just stored as data along with 00:58:50.060 |
any directives given by the user in the form of a light switch going on or off. 00:58:57.180 |
Once there's any data, the learning process says, okay, time to train a language model 00:59:07.180 |
Once a model is done training, it's sent over to the anticipator who is responsible for 00:59:15.620 |
That small language model then is now empowered to make predictions every single time it receives 00:59:25.380 |
And those predictions are sent to the operator, which then does whatever it's told. 00:59:31.580 |
And the last thing that can happen, step six, is if the wrong prediction was made and the 00:59:36.360 |
user fixes it by turning off the light because they didn't want the light on, that corrects 00:59:43.460 |
the data that was transcribed and the next model which is trained will be able to avoid 00:59:48.820 |
And there's some dialing this in in terms of the time scales that you want based on 00:59:54.020 |
the way humans interact with the light switch. 00:59:55.740 |
So there's a lot of development that goes into figuring out the right way to set this up. 01:00:01.340 |
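To make the flow concrete, here is a hedged, greatly simplified Python sketch of the kind of event loop being described; the component names, the correction window, and the wiring between them are all assumptions for illustration rather than the actual system.

```python
import time

CORRECTION_WINDOW = 10.0  # seconds within which a switch flip corrects a prediction

def lamp_loop(transcriber, anticipator, operator, store):
    """Listen -> predict -> operate -> correct, as one continuous cycle."""
    last_prediction = None
    while True:
        text = transcriber.listen()                  # 1. record and transcribe speech
        if text:
            label = anticipator.predict(text)        # 2. 'on', 'off', or 'nothing'
            if label != "nothing":
                operator.set_light(label == "on")    # 3. act on the prediction
            last_prediction = (text, label, time.time())
            store.append(text, label)                # 4. log as training data

        flip = operator.poll_switch()                # 5. did the user touch the switch?
        if flip is not None and last_prediction:
            text, label, when = last_prediction
            if time.time() - when < CORRECTION_WINDOW and flip != label:
                store.correct(text, flip)            # 6. fix the bad training example
            operator.set_light(flip == "on")
```

A separate training process would periodically retrain the small language model on whatever `store` has accumulated and hand the new model to the anticipator, as in the diagram.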
The data that you collect from a process like this, how do we organize it? 01:00:06.940 |
This actually is not transfer learning, so I kind of lied there a little bit. 01:00:14.140 |
It's a conversation between the human and the lamp. 01:00:17.100 |
You say something, the lamp says, here's what you want. 01:00:23.060 |
And it's just an extending context window, like you'd see with a decoder-only kind of 01:00:29.100 |
architecture these days, a chatbot kind of thing, a human personal assistant, human assistant 01:00:39.300 |
And you might also suspect then that, well, couldn't you let the lamp talk? 01:00:43.700 |
Yes, you could absolutely let it use other tokens, and that is something which is on 01:00:47.540 |
the horizon for us, in other words, how to determine once the model is learned enough 01:00:52.900 |
and knows when you want to hear it talk and knows what you want to hear it say, which 01:00:58.980 |
requires other smart data collection currently in development. 01:01:04.780 |
And there are three tags here if you don't see them, although what they really are are tokens, 01:01:08.900 |
since they're integrated within the language model's vocabulary: 01:01:13.540 |
I want the lamp lit, I want the lamp dark, or nothing, if no switch is applied during some window of time. 01:01:23.700 |
So what do the models look like that go into a lamp? 01:01:27.580 |
They're a little bit smaller than that micro model in terms of having a long context window, 01:01:34.780 |
They still use these other features, like a radius, which help them to do well with 01:01:40.940 |
only little data, those other context models. 01:01:46.020 |
And the embedding size is around 50 or 100 and something, and this is small enough to 01:01:53.380 |
fit on a microprocessor, on a CPU of a microprocessor, including training, no GPU whatsoever. 01:02:04.220 |
And the first time we ever got the interaction right, the right timescales, from no data 01:02:09.860 |
whatsoever, creating this data, 20 minutes of it was enough -- and you can see there's 01:02:18.340 |
loads of misspellings here, because the transcription is not required to produce known words, known tokens. 01:02:25.080 |
It's strictly character-based, so you can say whatever you want to say, you can whistle, 01:02:28.820 |
and as long as wav2vec thinks that's tokens, it'll figure out what to transcribe. 01:02:39.820 |
That's enough, 20 minutes of talking to it, to have it know pretty well when you want the light on or off. 01:02:47.740 |
This is what the numbers look like for that prediction, and you see lots of zeros there. 01:02:51.100 |
That's because there are no positive instances yet in the data, until you flip the switch for the first time. 01:02:59.620 |
Once there is enough to predict, we see an immediate jump in the model's ability to figure 01:03:05.000 |
out whether the lamp should say on, off, or nothing. 01:03:11.540 |
And while we trained this first model, for example, in 20 minutes on Le Potato, which 01:03:19.940 |
is a really, really, really small microprocessor, it's incredibly frustrating to utilize because 01:03:26.620 |
the processing time is a couple seconds, and it feels like it's going somewhere, even though 01:03:31.420 |
the data is entirely localized, there's no Wi-Fi, there's no internet connection. 01:03:36.140 |
It just takes the model on this tiny chip a minute, not really a minute, like a couple 01:03:40.900 |
seconds, to flip the switch on, because it has to transcribe it, interpret it, issue 01:03:46.100 |
the directive, ask the operator to operate it. 01:03:49.060 |
And so part of what we're doing is figuring out at what scale of microprocessing do the 01:03:53.220 |
models that we're developing really make a good real-time system that a user can make 01:04:00.560 |
And as you can see, the larger the model in terms of hyperparameters and so forth, the longer those response times get. 01:04:11.800 |
So we see these as potentially useful in edge scenarios, but not just for operation -- for training, too. 01:04:19.060 |
So go to Home Depot, buy a light switch, install it in your house, start talking to it. 01:04:28.040 |
But this isn't really the stopping point that we want to get to. 01:04:34.820 |
We want to eventually get to the point of talkback. 01:04:36.900 |
We want to treat these as language models that essentially have a bit of you inside of them. 01:04:44.380 |
And that's important to know when the model is aware of what you want to hear said. 01:04:50.900 |
In other words, it needs to know what is a good thing to say back to what you just said. 01:04:55.700 |
And the lamp has never heard a lamp talk before. 01:04:58.860 |
So there are challenges to figuring out the lamp's role in conversation. 01:05:08.660 |
We don't have to make it be a light bulb which goes on and off. 01:05:10.900 |
This could be a controller for anything which is a binary switch. 01:05:15.660 |
And you could imagine, like others are looking at right now, there's a lot of opportunities 01:05:20.820 |
with predicting the action on your phone that you want to take, which thing you want to tap next. 01:05:26.860 |
And a system like this, miniaturized onto your cell phone, for example, assumes better 01:05:33.220 |
hardware than what we're already using, but it would be entirely local, including training. 01:05:44.140 |
But this is also really just getting to the point of feasibility. 01:05:49.580 |
It's not getting to the point of a well-optimized system, which we're still developing. 01:05:54.460 |
There are, in principle, different modifications that we could make to the self-attention layers, 01:05:59.100 |
which include traditional self-attention parameters. 01:06:04.820 |
Then there are updates to the very naive scheme that we have for BitCipher, the vectors that encode the tokens. 01:06:13.260 |
And a lot of other minutiae that need to be addressed. 01:06:23.300 |
And in addition to what I just described, we're moving towards larger models and evaluations 01:06:29.600 |
that compare better to modern systems, which will eventually come online. 01:06:34.980 |
We'll most likely participate in this year's baby language model (BabyLM) challenge, although that 01:06:40.780 |
challenge assumes you're working with a standard architecture, which is already developed for you. 01:06:51.020 |
But that's really all I have prepared for you to discuss today in this conversation. 01:06:54.300 |
I've gone over a lot of details, and if you'd like to talk about any of these, I'm certainly happy to. 01:07:01.900 |
And if you have access to the slides, there's some links to the different papers I've referenced. 01:07:21.660 |
So if anybody here has any questions, feel free to raise your hand and ask. 01:07:25.420 |
Otherwise, we'll go to some questions on Slido. 01:07:40.100 |
But I've also pasted these references in the Zoom chat, as well as Discord, in case anybody wants them. 01:07:48.340 |
I was wondering, in the plots that you showed for warm start versus cold start, does the 01:07:59.100 |
cold start use the modified self-attention or the standard self-attention? 01:08:07.180 |
So the question was, in this picture comparing warm starts to cold starts, what self-attention was used. 01:08:16.340 |
This is strictly a feed-forward experiment, where we take a single layer, and all we do 01:08:20.660 |
is feed forward with one-hot vectors from some context window and concatenate them together. 01:08:28.420 |
And the general property that you'll see is that, by concatenating vectors, there's very little for re-weighting to do. 01:08:36.460 |
Simply with a block, you're adding the vectors together, and that superposition of the dimensions gets muddled. 01:08:43.500 |
And that's why self-attention is needed, in order to weight that superposition so just 01:08:47.940 |
the right ones stick out and it's not muddled. 01:08:51.320 |
If those vectors are instead concatenated, a weighting of those is really just appealing to information that's already separated. 01:09:01.020 |
When they're superimposed, there's a lot to work on, since you're smearing separate information together. 01:09:07.500 |
When the information is already separated, there's not much that re-weighting can do. 01:09:14.020 |
And in this case, there's absolutely no re-weighting going on. 01:09:18.180 |
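To make the superposition-versus-concatenation point concrete, here is a small NumPy sketch, illustrative only and not the speaker's code: summing one-hot context vectors smears positional contributions together, which is exactly what attention-style re-weighting can disentangle, while concatenation keeps them separate to begin with.

```python
# Illustration (not the speaker's code) of superposed vs. concatenated
# one-hot context vectors, and what re-weighting adds to the superposed case.
import numpy as np

V = 5                                    # toy vocabulary size
def one_hot(i, V=V):
    v = np.zeros(V); v[i] = 1.0
    return v

context = [2, 0, 2]                      # token ids in a small context window

superposed   = sum(one_hot(i) for i in context)                # shape (V,)
concatenated = np.concatenate([one_hot(i) for i in context])   # shape (3*V,)

print(superposed)      # [1. 0. 2. 0. 0.]  -- which position contributed what is lost
print(concatenated)    # one block per position -- nothing to disentangle

# Attention-style re-weighting matters for the superposed form:
weights = np.array([0.7, 0.1, 0.2])      # e.g., produced by self-attention
weighted_sum = sum(w * one_hot(i) for w, i in zip(weights, context))
print(weighted_sum)    # [0.1 0.  0.9 0.  0. ] -- the "right" token sticks out
```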
And what I've described to you is really just something that's become very clear from a 01:09:24.060 |
lot of small-scale experiments in between the models that we've developed. 01:09:29.420 |
And moving towards self-attention took additional time, and we didn't have a solution for that yet. 01:09:38.180 |
I had a question in regards to-- so you're doing this with on-edge controllers, right? 01:09:48.500 |
You're doing training for on-edge controllers? 01:09:52.500 |
And you talked about how this also could work for image data, right? 01:09:55.500 |
Have you conducted any tests with image data? 01:09:58.500 |
So image data works best on architectures that aren't just feed-forward. 01:10:16.020 |
They have, for example, convolutional bits and pieces that are useful to them. 01:10:21.140 |
And that means if we want to apply some kind of a warm start for, for example, a convolutional 01:10:27.220 |
layer to create a performant image classifier or something that's working with images, we'd 01:10:31.700 |
want to develop an initialization for that layer, too. 01:10:35.660 |
It has weirder activation functions, which means we need to branch out from softmax as the basis. 01:10:41.740 |
But convolution is surprisingly similar to a radial model. 01:10:47.260 |
It's really just saying what's near where I'm trying to create a feature. 01:10:51.980 |
So I would say, yes, it seems like it's something that we could do. 01:10:55.540 |
But currently, it's in the phase of future work where it fits in one bullet here at the end: 01:11:09.260 |
Different layer types need formal derivation for warm starts. 01:11:13.420 |
So if we wanted to do this kind of thing with a performant architecture, we would 01:11:18.100 |
probably be uniformly or randomly initializing some of those parameters that we don't have derivations for. 01:11:23.980 |
And as a result, we would receive a lot of noise in where things end up. 01:11:29.540 |
And if we started to utilize the activation functions, even just a logistic 01:11:34.300 |
activation: a logistic activation is not really fundamentally different from a softmax activation. 01:11:39.420 |
So you might say, for example, well, why can't you just apply that to a logistic function? 01:11:46.660 |
And the reason is because, if we treat it like a standard logistic, then each dimension is independent. 01:11:51.900 |
Each dimension is trying to predict the same thing. 01:11:54.800 |
And there are a lot more questions about how you can get different information out of different dimensions. 01:11:59.860 |
So it's a question that's really worth spending time on, in my opinion, separately. 01:12:05.820 |
And it's not the first question that needs answering to make a lot of what we've developed practical. 01:12:10.860 |
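The contrast being drawn can be seen in a few lines, illustrative only: a softmax couples its output dimensions through normalization, whereas an elementwise logistic treats every dimension as an independent predictor of the same kind of target.

```python
# Illustrative only: softmax couples dimensions via normalization,
# while an elementwise logistic treats each dimension independently.
import numpy as np

z = np.array([2.0, 1.0, 0.1])

softmax  = np.exp(z) / np.exp(z).sum()   # dimensions compete; values sum to 1
logistic = 1.0 / (1.0 + np.exp(-z))      # each dimension is its own predictor

print(softmax, softmax.sum())            # e.g. [0.66 0.24 0.10] 1.0
print(logistic, logistic.sum())          # e.g. [0.88 0.73 0.52] 2.14 (no coupling)
```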
On one of the slides, you had a dialogue with your user. 01:12:21.500 |
I'm wondering, does that imply there is a speech-to-text system inside the microprocessor? 01:12:31.300 |
Yes, and there's a process here which accepts that audio. 01:12:39.100 |
It's really just fitting a need with a pre-trained model. 01:12:44.420 |
Although transcription is something that we would like to move into in our future work 01:12:49.100 |
for the purposes of training from scratch, because one of the real benefits of a system 01:12:53.700 |
like this is that it doesn't come with any biases from other people's data, aside from 01:12:59.420 |
the fact that there's a pre-trained transcription system, which means that it's pre-trained 01:13:04.060 |
towards whatever phonetics were within the language that was there for pre-training in the first place. 01:13:11.460 |
So there is external utility here coming from a pre-trained model. 01:13:17.600 |
But the text itself and the language model that we're presenting are only working from what you've said to it. 01:13:35.320 |
You said that the feed-forward warm start is independent of the choice of self-attention. 01:13:42.040 |
Does that mean that the warm start strategy can be used for any network that uses a feed-forward 01:13:49.160 |
layer, not just PLMs, but any LLM or any other network? 01:13:55.600 |
So that's going back to the warm start solution here. 01:14:01.880 |
And what it says is that in terms of any layer beneath, if you assume that those layers' 01:14:07.320 |
parameters are what they are, you're not going to update them. 01:14:12.680 |
And assuming that you know what the targets for that layer are, which for middle layers, 01:14:16.560 |
there are some questions to be answered, then this initialization will do better than random initialization. 01:14:26.600 |
That's really important at this stage, that there's a softmax as a part of the activation. 01:14:39.000 |
But the point is that it should become clear that, whatever type of prediction scenario 01:14:48.040 |
you're in, as long as you have non-negative features and a softmax for activation, like 01:14:56.720 |
in this case with a single layer, or even two softmax layers, whatever that's doing, 01:15:02.220 |
on MNIST, you can get a really good initialization. 01:15:14.480 |
You could do an image caption generation system that has both features from images and text 01:15:20.120 |
and warm them up with the same solution, with entirely different data in the two places. 01:15:25.760 |
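One way to picture a closed-form warm start under these constraints, as a sketch of my own reading and not necessarily the derivation from the talk, is to fill the weights of a single softmax layer with logarithms of smoothed feature-target co-occurrence statistics, so the layer reproduces a count-based predictor before any gradient step. The function name and smoothing constant below are illustrative.

```python
# A minimal sketch (illustrative, not necessarily the talk's derivation)
# of a closed-form warm start for a single softmax layer over
# non-negative features: weights come from logs of smoothed
# feature/target co-occurrence statistics instead of random noise.
import numpy as np

def warm_start(X, y, n_classes, alpha=1.0):
    """X: (n_samples, n_features) non-negative features.
       y: (n_samples,) integer targets in [0, n_classes)."""
    n_features = X.shape[1]
    counts = np.full((n_classes, n_features), alpha)   # additive smoothing
    for xi, yi in zip(X, y):
        counts[yi] += xi                               # accumulate co-occurrences
    probs = counts / counts.sum(axis=1, keepdims=True)
    W = np.log(probs)                                  # the log is where negativity would break things
    b = np.log(np.bincount(y, minlength=n_classes) + alpha)
    return W, b                                        # logits = X @ W.T + b, then softmax
```

Gradient descent can continue from these values; the logarithm here is also exactly where non-negative features become essential.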
Could you point out which part of the process requires the values to be non-negative? 01:15:34.720 |
What happens when you put a negative in a logarithm? 01:15:42.740 |
Not saying you can't, but it's not going to start making probabilities for you at the end of the day. 01:15:50.400 |
So you have to start with a different premise, essentially. 01:15:55.140 |
And that premise is something that requires more derivation. 01:16:00.920 |
You'd want to ensure, if you're going to use a logarithm anywhere, or assume that inverse, 01:16:05.600 |
that you're able to properly modify every parameter independently, instead of full rows at a time. 01:16:16.480 |
I think we should get to a couple of questions on Slido that folks asked. 01:16:24.880 |
The first is, what's the difference in performance between naive assignment and optimized or 01:16:30.280 |
omniscient assignment for packing tokens into bit vectors, and any experimental results? 01:16:36.860 |
What's the difference in performance between naive assignment and optimized assignment for packing tokens into bit vectors? 01:16:53.740 |
The performance differences are going to be in speed. 01:16:56.720 |
The systems which utilize packing for contexts have gone to great lengths to make sure that 01:17:02.400 |
information from different portions of the context that have nothing to do with each 01:17:06.640 |
other don't bleed information, if you're going to pack them together. 01:17:11.760 |
That creates a lot of logistical challenges in terms of defining models. 01:17:16.920 |
And it's still just doing the regular self-attention thing. 01:17:20.520 |
So if you have the same length of context window, it's going to be the same computational 01:17:25.440 |
However, if you pack all of your small documents together, they don't each need the whole context window. 01:17:38.960 |
And that's why you pack something else into the empty end of the window. 01:17:46.040 |
But document packing isn't exactly a model of context, even though it's well known as a mechanism to make training more efficient. 01:17:56.040 |
In other words, you simply need fewer batches when more documents are packed together. 01:18:00.920 |
It's not something which is, for example, entirely accepted as a published, standard practice. 01:18:11.140 |
So what I would say is just that document packing is not a correct model of context. 01:18:16.440 |
It is an efficiency, but requires the same level of quadratic comparison. 01:18:21.920 |
Whereas dynamically batching and utilizing a block size that is dynamic preserves the correct model of context. 01:18:30.240 |
It does something that is true to the objective and unwavering in that. 01:18:35.040 |
And it reduces the complexity for smaller documents. 01:18:39.360 |
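As a sketch of the kind of dynamic batching being described, assuming nothing about the speaker's actual implementation, documents of similar length can be grouped so that each batch's block size matches its contents, and attention's quadratic cost tracks real document lengths rather than a fixed packed window.

```python
# Illustrative sketch (not the speaker's implementation) of dynamic batching:
# group documents of similar length so each batch uses a block size matched
# to its contents; attention cost per batch is roughly len(batch) * block**2.
def dynamic_batches(docs, max_tokens_per_batch=4096):
    """docs: list of token-id lists; yields (batch, block_size) pairs."""
    batch, block = [], 0
    for doc in sorted(docs, key=len):              # similar lengths end up together
        block_if_added = max(block, len(doc))
        if batch and block_if_added * (len(batch) + 1) > max_tokens_per_batch:
            yield batch, block                     # emit the current batch
            batch, block = [], 0
            block_if_added = len(doc)
        batch.append(doc)
        block = block_if_added
    if batch:
        yield batch, block

# Short documents share small blocks; the long one gets its own batch.
docs = [[1, 2], [3, 4, 5], list(range(300)), [6]]
for batch, block in dynamic_batches(docs, max_tokens_per_batch=512):
    print(len(batch), "docs, block size", block)
```

Each batch is padded only to the length of its longest member, so short documents never pay for a long packed window, while the per-document training objective is unchanged.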
But a direct comparison of the two is something I have not done, because it would require 01:18:44.160 |
having that oracle and utilizing those algorithms. 01:18:48.040 |
They're used with insanely big models, which means we would likewise have to compare two 01:18:52.880 |
insanely big models to create the same level of expectation that people have from packing. 01:19:02.400 |
We have a quick question that's asking, are there any implementations of SAFU available? 01:19:15.400 |
But that requires a lot of work on developing systems for evaluation, since the evaluation 01:19:20.420 |
systems rely upon standardized functions within the architectures that you're all very familiar 01:19:26.280 |
with, like GPT-2, that are easily taken for granted. 01:19:29.680 |
Even though you do lots of work in training them, you have to do a lot of work in creating 01:19:34.200 |
those functions that meet the needs of the separate prediction tasks and fine-tuning tasks. 01:19:45.320 |
So thanks, Jake, for the great talk, and thanks for coming to another lecture.