Stanford CS25: V4 I Transformers that Transform Well Enough to Support Near-Shallow Architectures


Whisper Transcript | Transcript Only Page

00:00:00.000 | >> Today, for our talk, we have Professor Jake Williams from Drexel University.
00:00:09.480 | He is an Associate Professor of Information Science at Drexel University's College of
00:00:15.200 | Computing and Informatics in Philadelphia, Pennsylvania.
00:00:18.920 | Dr. Williams has a background in physics and math with degrees from the University of Vermont,
00:00:25.000 | and his research leverages a quantitative linguistic perspective that applies math and
00:00:31.120 | statistical methodologies to analyze and improve linguistic learning systems.
00:00:36.800 | Following a one-year postdoc appointment at UC Berkeley studying large-scale machine
00:00:41.760 | learning, in 2015 Dr. Williams became a data science faculty member
00:00:48.440 | at Drexel, where he led the founding of a DSMS program and develops and instructs
00:00:54.000 | data science coursework, including natural language processing with deep learning.
00:01:00.560 | So, welcome, and thank you for coming today for your talk, and you could do a quick introduction
00:01:06.320 | of yourself before you start.
00:01:08.320 | >> Great.
00:01:09.320 | Thanks so much.
00:01:10.320 | I got the mic here.
00:01:11.320 | Nice to see you all here.
00:01:12.600 | Thanks for coming out, and also for showing up online.
00:01:16.080 | It's a pleasure to be here.
00:01:19.760 | As was mentioned, my name is Jake, and my background's in math and physics, so the perspective
00:01:25.240 | that I'm coming from towards this work might be a little bit different than the standard,
00:01:29.400 | and that'll be a theme throughout the discussion.
00:01:32.800 | The purpose of this discussion is to go through a relatively long-term development, a project
00:01:40.840 | that I've been working on, and as mentioned, my background is in quantitative linguistics,
00:01:46.840 | which means my history of focus on language has primarily been to develop general theories
00:01:56.600 | and descriptions of phenomena that you observe with regards to linguistic units, whatever
00:02:01.560 | those might be.
00:02:04.120 | It's a statistical approach based on theories of language generation that are statistical
00:02:11.640 | in basis, and over the course of my time as a researcher, I've explored and ventured into
00:02:21.560 | language modeling itself and ultimately into neural networks as they approach language
00:02:26.080 | modeling themselves, and that's what brought me here through quite a bit of other work.
00:02:34.200 | So if you look into my profile, you'll see a lot of different subjects across applied
00:02:38.800 | NLP and, like I said, quantitative linguistics; neural networks were a natural transition
00:02:44.520 | for me into inferential work. So let's get started.
00:02:52.160 | So well, this is how we'll start the conversation today.
00:02:56.320 | It's not exactly how we got here in my lab.
00:02:58.920 | We came at this subject from a different approach, trying to think about layer initializations
00:03:07.600 | in neural networks, and this subject that we're discussing as a front for this talk
00:03:16.440 | is specifically focused on transformer architecture components, the self-attention component that's
00:03:21.480 | pivotal to the success of the transformer architecture, and it focuses on the fact that
00:03:28.120 | self-attention requires a quadratic comparison of vectors in order to produce the feature
00:03:32.200 | weights of those vectors needed to model long-range dependencies in text.
00:03:38.320 | Commonly, parameters for self-attention are based on transformation matrices, two usually,
00:03:43.640 | the queries and keys, that are responsible for dimensionalizing input vectors, and I describe
00:03:49.800 | it this way because generally speaking, when you're at the point of a self-attention layer,
00:03:54.640 | you already have low-dimensional vectors, but the parameters in a standard self-attention
00:04:00.080 | layer are changing the dimensionalities and the structure of that dimensional space.
00:04:04.520 | They are like an embedding layer, which is factorizing the embedding dimensions.
00:04:10.560 | This redimensionalization is the primary means by which self-attention creates feature weights.
00:04:15.520 | It really just computes similarity in that shared space.
00:04:21.400 | Large and similar inner products really just result in strongly weighted features, so it's
00:04:25.360 | up to that dimensionalization to produce good similarities for whatever purpose your prediction
00:04:31.520 | requires.
00:04:32.520 | However, an alternative strategy for feature weights might ask: given a basis, in other
00:04:38.000 | words, given that you're stuck with your low-dimensional vectors, what is the optimal
00:04:43.320 | matrix transformation to convert the comparisons of those vectors, the similarities
00:04:49.320 | that you are stuck with, into the best weights for features?
00:04:55.600 | In other words, treat this as a feed-forward layer that produces self-attention weights, as
00:05:00.240 | opposed to trying to transform to some basis that produces good feature weights.
00:05:06.320 | The use of this modified self-attention mechanism will be part and parcel of the substance
00:05:10.920 | of this talk.
00:05:15.960 | It's worth noting that this alternative mechanism is entirely compatible with the traditional
00:05:20.160 | dimensionalizing version of self-attention.
00:05:22.120 | In other words, you could still change the dimension and compute similarities and then
00:05:29.400 | convert that with a second feed-forward layer to produce optimal feature weights.
00:05:34.360 | This is not exclusive in any way.
00:05:36.600 | This is exploring how useful that alternative prediction of feature weights can function.
00:05:42.960 | However, we'll avoid the standard mechanism for two reasons.
00:05:47.200 | First, we have no solution to the standard parameters for self-attention as an initialization.
00:05:54.040 | And this will be discussed at length in slides to come.
00:05:58.280 | Likewise, it would create an additional model complexity that would muddle the effects of
00:06:03.160 | the modified form of self-attention that we wish to study.
00:06:05.640 | So having that dimensionalization as a way to produce good feature weights would confuse
00:06:12.120 | whether or not the feed-forward computation of feature weights is functioning well.
00:06:17.440 | There's a catch to this, however, which is that these vectors that we use for such a
00:06:24.200 | self-attention layer better be good.
00:06:26.240 | In other words, their comparisons must be consistent and meaningful in the first place.
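To make that contrast concrete, here is a minimal sketch of the two ways of producing attention weights just described. This is an illustration under my own assumptions (the shapes, names, and exactly where the weight matrix W sits), not code from the talk.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, d = 8, 16                  # block length, embedding dimension
X = np.random.rand(N, d)      # a block of already low-dimensional, non-negative vectors

# Standard self-attention: re-dimensionalize with query/key matrices,
# then compare the vectors in the shared projected space.
Wq, Wk = np.random.randn(d, d), np.random.randn(d, d)
A_standard = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))

# Modified form (as I read the talk): keep the raw comparisons X @ X.T and
# feed them forward through a single weight matrix W to predict the feature weights.
W = np.random.randn(N, N)
A_modified = softmax((X @ X.T) @ W)
```

Either way, each row is a softmax-normalized distribution over the block; the difference is whether the weights come from a learned re-dimensionalization or from a feed-forward transform of the raw similarities.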
00:06:34.080 | So to get it out of the way, here's an architectural diagram for the relatively simple near-shallow
00:06:39.880 | architecture pattern that we're using.
00:06:43.400 | It doesn't seem like there are many neurons in a network of this type.
00:06:46.320 | And that's because all of the activations are softmax, which means despite the fact
00:06:51.080 | that the U matrix, for example, is an entire layer, it's really just going through a single
00:06:56.920 | prediction non-linearity, the softmax function.
00:07:00.200 | So you can think about this as essentially a three-layer network that might be creating
00:07:04.000 | an encoder-decoder kind of design.
00:07:07.080 | Likewise, the difference in presentation here over self-attention, which is parameterized
00:07:12.240 | by the matrix W here, is intended to show how one vector -- whether you consider it the query
00:07:21.560 | or the key -- is the pivot for the comparison that will produce the feature weights,
00:07:29.360 | which is then fed forward in this model through W.
00:07:33.600 | This is the case for standard self-attention, too.
00:07:35.960 | In other words, you can reduce it to a by-prediction diagram in this way, where a gray vector,
00:07:42.360 | such as is depicted here, is that pivot.
00:07:47.160 | The attention distribution coming out of the W matrix and the softmax function is indicated
00:07:51.780 | by the vertical red bar there, which weights the block of vectors in black.
00:07:57.840 | That includes the pivot vector in gray, which is then passed through a feed-forward layer,
00:08:03.720 | often called the values of a standard self-attention matrix, U.
00:08:08.760 | Since we use U as a way to reduce the dimensionality of the prediction that
00:08:14.720 | we're trying to make, we then feed that forward through another layer and then to output.
00:08:22.200 | And that's essentially the relative shallowness that we're talking about here.
00:08:27.140 | U is a self-attention matrix, which means there's really only two layers in effect here.
00:08:34.420 | And the activation functions are strange.
00:08:36.660 | And you might wonder, for example, why we're using a different activation function, the
00:08:40.260 | softmax, instead of any of the dimensionally independent activation functions, like a logistic
00:08:45.780 | function or anything else.
00:08:48.360 | And that's because we have additional insight into the softmax function and the parameters
00:08:53.340 | that it optimizes, which is very useful.
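As a reading aid for the diagram, here is a hedged sketch of the forward pass as I understand it: attention weights from W, an attention-weighted mixture of the block, the values-like layer U, then the output layer O, all softmax-activated. It reuses the softmax helper from the sketch above; the pivot convention and shapes are assumptions.

```python
def near_shallow_forward(X, W, U, O):
    # X: (N, d) block of static input vectors; take the last row as the pivot.
    sims = X @ X[-1]           # (N,) comparisons of the block against the pivot vector
    a = softmax(sims @ W)      # (N,) attention distribution over the block
    context = a @ X            # (d,) weighted mixture of the block, pivot included
    h = softmax(U @ context)   # hidden state from the "values"-like layer U
    return softmax(O @ h)      # distribution over the output vocabulary
```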
00:08:58.580 | So let's talk about those vectors first, though, before we get to layer initialization.
00:09:07.700 | Optimizing the keys and queries of standard self-attention bears substantial similarity
00:09:11.540 | to token and word embedding.
00:09:14.460 | This is because the key and query matrices have a common dimension that they project
00:09:19.700 | to, much like you'd see with the factorization of an embedding layer on its own.
00:09:24.740 | Think Word2Vec, something like that.
00:09:29.500 | These normally-- there might be multiple self-attention heads.
00:09:33.620 | And because of the indeterminacy in creating a different dimensional space-- in other words,
00:09:38.780 | there are multiple equivalent reshufflings of those different dimensions which will produce
00:09:43.580 | the same output-- that indeterminacy is something that we hypothesize has bearing on what is
00:09:51.420 | now referred to as the lottery ticket hypothesis.
00:09:53.540 | In other words -- or this is the way that I would state it -- that multiple
00:09:59.980 | different embeddings which produce different vector spaces can be leveraged in parallel
00:10:04.340 | to create further robustness for the model.
00:10:07.340 | Or in the way that it's implemented, that if a random initialization doesn't do that
00:10:13.500 | well, you can eliminate it from the network.
00:10:16.180 | And that sub-network will do just as well, even after it's totally trained.
00:10:21.700 | In other words, having multiple clones, self-attention heads, which have no difference in the outputs
00:10:27.140 | that they're trying to predict, is at the root of the lottery ticket hypothesis.
00:10:32.380 | And ultimately, that invocation of the lottery ticket hypothesis is really a justification
00:10:37.100 | for eliminating parameters whose substantial cost of training is essentially wasted as
00:10:41.780 | a result of random parameter initialization.
00:10:45.700 | You might ask questions like, well, what is a good initialization?
00:10:49.100 | What is a good set of word embeddings to use?
00:10:55.780 | So how the lottery-ticket-hypothesis interaction effects of randomly initialized embedding
00:11:02.020 | layers can be avoided when constructing language models is another question that is embedded
00:11:07.820 | in this discussion.
00:11:12.860 | But we shouldn't say that dimensionality reduction isn't needed.
00:11:15.980 | It's incredibly necessary.
00:11:19.280 | For language modeling, you absolutely have to work with reduced dimension unless you're
00:11:23.660 | in a very small vocabulary.
00:11:25.500 | For example, like 26 Latin characters or something like that, like a Wav2Vec.
00:11:33.860 | The inherent input dimension of a large vocabulary model presents many computational intractabilities
00:11:39.940 | when designing NLP systems, something that you're probably all very aware of.
00:11:43.660 | Likewise, though, the distance from embedding layers to learning information, the loss at
00:11:50.500 | outputs, puts them in a challenging position to train.
00:11:53.980 | It's really hard to learn embedding layers because of the indeterminacy in the space
00:12:00.260 | that you're trying to learn.
00:12:01.460 | You could swap dimensions, and it's equivalent.
00:12:07.060 | But the distance means that they receive learning information last.
00:12:13.760 | This is a real challenge, and it's present in the history of NLP and deep learning, too.
00:12:21.620 | Vanishing gradient stuff.
00:12:24.380 | And this is exacerbated in the way that we have to actually learn embedding layers in
00:12:28.140 | standard models where we might modify learning rates to be lower all the way back at the
00:12:33.300 | bottom of a network to be gentle with those embedding layers and help them learn effectively.
00:12:40.620 | But this is really trouble because if we had a good embedding layer at the start, those
00:12:46.380 | subsequent layers could be much easier to learn.
00:12:53.420 | So ultimately, in order to approach this challenge, we came up with a discernibility hypothesis.
00:13:03.060 | In other words, this boiled down to the theory that low-dimensional vectors, more than anything,
00:13:09.260 | needed to be able to discern features.
00:13:11.900 | And that doesn't sound like a very strong assertion.
00:13:17.020 | And we started with a really, really, really low bar and assumed that the most common features
00:13:25.340 | needed to be the most discernible features.
00:13:27.480 | So if we're stuck with a lower dimension and we can't give everything a one-hot vector
00:13:31.140 | to be told apart very well, then we might want to give the clearer vectors, which
00:13:36.980 | have more dimensional independencies, to those features which appear most frequently and
00:13:44.060 | could stand to confuse models the most.
00:13:48.980 | This hypothesis led us directly to develop the bit cipher algorithm, which is really
00:13:54.500 | just a scheme for assigning vectors of zeros and ones.
00:13:59.300 | Nothing too crazy in terms of what we're attempting to do.
00:14:02.940 | In the figure at right here, the order of vector assignment is by row from top to bottom.
00:14:08.220 | And this is on a five-dimension, five-bit vector system.
00:14:14.140 | The first five from bottom are those one-hot vectors.
00:14:18.380 | Past that point, you'll see two-hot vectors, but they're a little bit less darkly shaded,
00:14:24.700 | indicating the way that we actually utilize the system.
00:14:26.740 | In other words, we normalize them to have unit sum.
00:14:34.020 | What I hope you can see from this is that the bit cipher algorithm generalizes one-hot
00:14:39.940 | vectors to low dimensions.
00:14:43.060 | And as a result, we can work from a very sparse feature set and explore dimensionalities as
00:14:50.220 | a controlled phenomenon.
00:14:54.020 | And this assignment is incredibly naive, too.
00:14:57.180 | That's the other thing that I want you to see as well, that this discernibility hypothesis
00:15:01.740 | does not create any meaningful correlations between tokens that behave similarly.
00:15:05.980 | So if you've got the upper and lower case of a word, their vectors aren't going to capture
00:15:11.620 | those similarities according to the bit cipher.
00:15:14.460 | It's really just gonna try and make sure that those features are distinguishable in a low-dimensional
00:15:19.020 | space and that the most distinguishable features are those which appear most commonly.
00:15:25.020 | This was enough to do a surprising amount of work.
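Here is a hedged sketch of that assignment scheme as described: enumerate d-dimensional 0/1 vectors from one-hot upward, normalize each to unit sum, and hand them out in descending frequency order. The published bit cipher may differ in its enumeration details.

```python
from itertools import combinations
import numpy as np

def bit_cipher_vectors(tokens_by_freq, d):
    """tokens_by_freq: tokens sorted most-frequent first; d: number of bits/dimensions."""
    vectors = []
    for n_on in range(1, d + 1):                # one-hot first, then two-hot, and so on
        for on_dims in combinations(range(d), n_on):
            v = np.zeros(d)
            v[list(on_dims)] = 1.0 / n_on       # normalize each vector to unit sum
            vectors.append(v)
    if len(tokens_by_freq) > len(vectors):
        raise ValueError("vocabulary too large for this dimension d")
    return dict(zip(tokens_by_freq, vectors))

# Example on the five-dimensional case from the figure.
cipher = bit_cipher_vectors(["the", "of", "and", "to", "a", "in", "is"], d=5)
```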
00:15:31.960 | So with some scheme for a deterministic low-dimensionalization procedure, we were then able to utilize this
00:15:41.960 | solution that we had actually developed previously.
00:15:45.500 | So this was actually the real motivator for a lot of the work that you're seeing today,
00:15:49.940 | although it might seem like it's just a checkpoint in the middle.
00:15:56.020 | Provided bit cipher produces decent embeddings, we can ask, can other layers be non-randomly
00:16:00.980 | initialized?
00:16:01.980 | In other words, without gradient descent or backpropagation or other gradient-based iterative
00:16:05.180 | algorithms.
00:16:07.740 | This equation came about from analysis of Word2Vec with the original Softmax activation
00:16:13.880 | function.
00:16:16.300 | And much like other analyses of the Word2Vec family of embeddings, it came up with differential
00:16:27.220 | solutions that depended on co-occurrence matrices.
00:16:31.020 | We formalized this as a question.
00:16:32.820 | Is there a way to take a co-occurrence matrix, F, in this equation here, and convert it with
00:16:41.300 | some weights, some denominators by row, into something that warms up a single-layer feedforward
00:16:51.020 | in a neural network?
00:16:53.380 | And ultimately, this k minus 1 over k term here, and this sum, is really just expressing
00:17:00.520 | something like conditional probability.
00:17:03.700 | Like conditional probability, because k minus 1 over k is a wrinkle that accounts for the
00:17:11.860 | number of features increasing, in other words, the context window increasing in a block transformer,
00:17:19.500 | and adjusts the warm start that we can apply to start off a neural network without any randomness,
00:17:26.540 | entirely determined by the vectors underneath, nearing whatever direction the parameters are going.
00:17:34.540 | All we have to do is compute some co-occurrences between inputs and outputs, and I don't mean
00:17:39.340 | necessarily standard co-occurrences that you might have learned about a long time ago which
00:17:43.820 | depend on a radius.
00:17:44.860 | I mean, whatever your inputs are, whatever your outputs are, you take their sum of outer
00:17:52.240 | products and you get a co-occurrence matrix of inputs and outputs, and that can then be
00:17:58.220 | utilized to initialize your layer in that neural network to be vastly more performant
00:18:06.100 | than what you'd get by a random initialization.
00:18:10.380 | This was a strong motivator for us.
00:18:13.900 | This was just for a single-layer model, but it depended on the softmax function for activation.
00:18:22.060 | And the softmax function as an activation function, we knew, is also necessary for self-attention
00:18:28.860 | features.
00:18:30.240 | And this meant that if we could put self-attention into some kind of a standard form with this
00:18:35.700 | equation just like a single layer, then we could apply the same solution with one catch.
00:18:43.940 | That catch is specifically that we don't know what the targets are for self-attention.
00:18:49.100 | There's no target vector y, the thing that you're trying to predict, which position is
00:18:55.340 | the one that you want to weight most strongly.
00:18:58.700 | And so in order to apply this solution for a self-attention model, we had to do some
00:19:03.260 | more analysis.
00:19:05.140 | And that's in the reference number one, which is all the way back up in the first slide
00:19:09.340 | if you want to see it.
00:19:11.120 | But that derives a differential criterion, an analog for the single-layer solution that
00:19:17.980 | tells us what the targets of that kind of self-attention actually are, the hidden targets,
00:19:23.460 | the weights that you're trying to create, which really are just about making sure that
00:19:28.700 | the layer above self-attention has some unsurprising things coming towards it.
00:19:36.300 | The self-attention layer is really just trying to massage the vectors so that way they look
00:19:40.140 | like something that the next layer above expects.
00:19:44.540 | Aside from that, though, it's a much more in-depth conversation.
00:19:48.620 | The point, though, is that for the model in this picture here, we can now start off with
00:20:00.100 | vectors x that are not random.
00:20:04.640 | We can use those vectors x to initialize non-randomly the parameters in W, the self-attention matrix,
00:20:13.780 | and then use that, going up the network, to initialize the parameters in U, since it's
00:20:19.460 | just a feed-forward layer with whatever self-attention is giving it as weights.
00:20:24.620 | And then whatever that produces, the hidden state, H, we can use that with the actual
00:20:29.700 | targets after the output layer to warm up the matrix O.
00:20:37.460 | And you might say, "Okay, well, how did you figure out what those hidden targets are?"
00:20:43.660 | You had to have an output for the U matrix to try and hit.
00:20:49.100 | That too is something that the bit cipher can provide in the form of label embeddings.
00:20:57.300 | In other words, low-dimensional targets of the thing that is downstream that you're trying
00:21:01.420 | to hit, the language model's output.
00:21:04.660 | So similarly, we can warm start the U matrix in terms of those bit cipher label embeddings.
00:21:16.260 | So in this view, the aim is to show how simple and general a single-layer softmax-activated
00:21:20.880 | solution is to apply.
00:21:22.880 | It's really just no more challenging than computing conditional probability given inputs
00:21:27.660 | and outputs.
00:21:30.420 | It's fast, it's something that you can distribute in terms of processing, and it's very, very
00:21:37.460 | general.
00:21:39.960 | So this is essentially the process that we're using in order to warm up the W and U matrix.
00:21:51.060 | There's the U matrix there, starts out as zeros.
00:21:54.980 | In other words, nothing, no random values, no weights anywhere.
00:22:00.460 | Over the data, which is just borrowing the dimension of this gigantic Y matrix that has
00:22:06.940 | all of the targets in it for the entire data set, we simply take the outer products
00:22:14.060 | of whatever the hidden state, the input to that layer, is with whatever the targets for that
00:22:18.740 | layer are, assuming that the lower layers beneath it are also warmed up.
00:22:25.740 | Following that, it's really just about normalization and a logarithmic transformation.
00:22:31.380 | And that logarithm really just emerges as a result of being an inverse to the exponential
00:22:36.860 | function, which is a part of softmax, pretty much all of softmax.
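A rough sketch of that procedure, under my own assumptions: accumulate input/target co-occurrences as a sum of outer products over the data, divide by per-row denominators, and take a logarithm (the inverse of softmax's exponential). The exact form, including the (k-1)/k wrinkle, is in the cited paper; this only conveys the flavor.

```python
import numpy as np

def warm_start(H, Y, eps=1e-12):
    """H: (n, d_in) non-negative inputs to the layer; Y: (n, d_out) targets for the layer."""
    F = np.zeros((H.shape[1], Y.shape[1]))
    for h, y in zip(H, Y):                        # co-occurrences of inputs with outputs
        F += np.outer(h, y)
    denom = F.sum(axis=1, keepdims=True) + eps    # the per-row denominators
    return np.log(F / denom + eps)                # roughly a log conditional probability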
00:22:43.540 | And that's really what brought us here.
00:22:48.140 | So what does warm starting a network do?
00:22:50.700 | This is going back to before we had the bit cipher algorithm for dimensionality reduction.
00:22:58.980 | And we started out by just saying, OK, if we take a simple, simple language model that
00:23:05.700 | only looks at a radius of traditional co-occurrences as features, we can concatenate those vectors
00:23:13.140 | and feed them forward for a language model's output.
00:23:17.380 | A completely random start, a cold start to a language model, is really just the size
00:23:24.620 | of the vocabulary in perplexity.
00:23:27.860 | And those three lines here for a few different radii are demonstrating that point with the
00:23:33.980 | point all the way at the top left-hand corner of this figure, cold starts.
00:23:41.380 | In any of those cases, when the warm start is applied, the perplexity is immediately
00:23:45.460 | automatically lower.
00:23:49.100 | And furthermore, the trajectories that the updates follow continue in the same learning
00:23:57.460 | rate and the same time to perform better than models that were started cold.
00:24:05.700 | If you have an early stopping criterion, the story is similar:
00:24:08.820 | early stopping generally engages first for the cold starts, and with a higher perplexity.
00:24:19.300 | So this was the first indication that we had figured out something that's very useful.
00:24:24.820 | There are some folks on Slido saying they're a bit confused.
00:24:31.780 | They're asking, are we talking about an alternative approach to self-attention?
00:24:36.060 | We are.
00:24:37.060 | So we're all the way back at slide one.
00:24:41.660 | And it is the premise of this whole conversation.
00:24:45.980 | So here, in this modified version of self-attention, you might normally expect to do a comparison
00:24:53.020 | of your inputs, the matrix X.
00:24:55.820 | Whatever your inputs are, they might be a whole block of vectors, or they might be--
00:24:59.660 | this is self-attention.
00:25:00.660 | It's not cross-attention, where you have different vectors that you're trying to attend.
00:25:06.500 | And forgetting about the values, which for us is the U matrix, the keys and queries,
00:25:15.700 | which are the parameters for self-attention, are in the middle.
00:25:18.340 | They're in between the two copies of the inputs, X.
00:25:25.580 | Each of those you can view as some kind of a projection down to a dimension where they
00:25:29.860 | can interact.
00:25:30.860 | And this is necessary for something like cross-attention, where you might have different dimensionalities
00:25:35.540 | like X1 and X2 in two separate groups of vectors if you're doing something like machine translation.
00:25:41.820 | That's not necessary to think about when you're just looking to do a standard language model
00:25:48.660 | that has to predict the next output according to the inputs, which are also outputs from
00:25:53.700 | previous iterations.
00:25:58.940 | Two insights here-- one, that multiplying the key and query matrices, WK and WQ, it's
00:26:08.740 | just another parameter matrix that's implied.
00:26:11.860 | There aren't two parameter matrices there in the middle for self-attention in any effective sense.
00:26:19.060 | There is a common dimension of comparison, and that kind of just moves stuff around.
00:26:24.580 | It creates degrees of freedom so that optimization can figure out what's the best weighting from
00:26:31.300 | comparisons.
00:26:32.300 | But the softmax function is strictly operating on similarities of that comparison space.
00:26:40.500 | It's not doing anything with those similarities.
00:26:43.620 | It's just softmaxing them.
00:26:44.860 | It's just activating them.
00:26:46.460 | So if it was a big similarity, it's a big attention value.
00:26:51.060 | In this equation, there's no transformation happening before those vectors are multiplied
00:26:56.580 | together, inner products.
00:26:58.980 | So those vectors better be good vectors that you're starting with-- x and x transpose,
00:27:03.740 | the same thing.
00:27:06.020 | They better be vectors that are comparable.
00:27:08.580 | They can't be vectors from cross-attention, where you're trying to translate from one
00:27:11.940 | language to another, and they just don't inner product.
00:27:13.940 | They're different dimensions.
00:27:15.740 | You could force it through if they were two differently trained embedding layers, and
00:27:19.740 | they had the same dimension with this mechanism.
00:27:23.140 | And if you didn't, you could put those key and query matrices back in between the two
00:27:27.500 | x vectors, x blocks of vectors.
00:27:33.580 | But a lot of what's going on here in this talk is trying to simplify and make more efficient
00:27:42.860 | the architectures that we need and the mechanisms that they utilize, given what we know about
00:27:49.940 | how language functions.
00:27:52.300 | And that's a critical piece there.
00:27:53.640 | We have assumptions that we can make.
00:27:55.520 | If all we're doing is autoregression, we don't need cross-attention dimensionalization in
00:28:00.260 | between.
00:28:01.260 | That'll be the theme, in other words, that can we use knowledge that we have about the
00:28:08.980 | way language functions to design better versions of architectures that meet the needs of language
00:28:16.300 | instead of being simply general.
00:28:18.460 | Is this good?
00:28:22.780 | This is important.
00:28:23.780 | So if there are any questions here, it's a good time.
00:28:30.980 | We are there and there.
00:28:36.980 | So we just talked briefly.
00:28:38.920 | This was for language modeling.
00:28:39.920 | The thing to note about this language model is that it's a really simple language model.
00:28:43.100 | There's no self-attention here yet.
00:28:46.420 | This is really just evaluating that a warm start in either the blue, green, or purple
00:28:51.320 | case does better than its partner, which is a cold start of the same architecture, same
00:28:57.660 | hyperparameters, orange, reddish, and brown.
00:29:04.740 | So three different models, regardless of how long your context is in each case here, we
00:29:10.160 | see that a model which has a nonrandom initialization by the equation presented two slides back
00:29:16.520 | from here starts a network off with a much lower perplexity.
00:29:26.920 | The requirement to apply this solution to a feedforward layer of parameters is simply
00:29:33.180 | that your inputs should not have negative values.
00:29:41.300 | That's really all we have to worry about.
00:29:44.420 | So it becomes really easy to ask questions like, well, what happens when you apply this
00:29:50.080 | to other data with non-negative values?
00:29:52.940 | Well, there's one little catch that we had to think about here in this case, and that
00:29:57.820 | is with the bit cipher or one-hot vectors, we're controlling the norms of the inputs.
00:30:04.260 | With standard embeddings, with MNIST, for example, when you're trying to predict the
00:30:10.420 | handwritten digits, 0 through 9 value, you don't get to assume necessarily that all inputs
00:30:18.660 | have the same norm.
00:30:21.580 | You can normalize the inputs, but it doesn't necessarily make sense to normalize them to
00:30:26.540 | one when you're looking at images, for example.
00:30:29.340 | They're non-negative.
00:30:30.340 | They have 0 through 255, for example, in MNIST.
00:30:35.440 | And as a result, we can put these data through that same warm start.
00:30:42.100 | Now one little caveat here I've alluded to about the norms of vectors is that we don't
00:30:51.020 | know what that value of k is.
00:30:53.060 | In other words, let me go back, you could look at it here or here, that's the number
00:31:02.660 | of features per prediction, which if you're looking at unit-normed word vectors is however
00:31:10.540 | big your context window is, k, because they all have unit norm and there's k of them.
00:31:17.860 | But if you're looking at just an image, it's not clear if it's a composition of multiple
00:31:23.060 | vectors, if it's one vector, and how many it is, if it is a composition.
00:31:28.340 | It just has a norm.
00:31:32.940 | In application to data like that, that is what k becomes, the average norm of an input.
00:31:41.900 | And I'm regretting not putting a graph in this, but the paper that discusses this shows
00:31:45.820 | that in the MNIST dataset, the exact optimal value of k is the average norm of the inputs
00:31:54.300 | however you've pre-processed them.
00:31:58.300 | And that's how we generally apply this rule when we're warm starting systems and we don't
00:32:02.420 | have unit-normed vectors.
00:32:04.940 | And it was learned from studying this model's application, this solution's application to
00:32:11.340 | non-linguistic data.
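A tiny illustration of that heuristic, with the caveat that which norm to use is my assumption (the unit-sum word vectors suggest a sum-style L1 norm): take k to be the average norm of the preprocessed inputs.

```python
import numpy as np

def estimate_k(inputs):
    """inputs: (n_examples, d) non-negative features, e.g. flattened MNIST images."""
    return np.abs(inputs).sum(axis=1).mean()   # the average norm plays the role of k
```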
00:32:13.180 | But as mentioned, the purpose was always towards language.
00:32:21.620 | So longer context windows in principle should provide models with more information than
00:32:26.820 | shorter context windows.
00:32:30.140 | This means one should expect that models perform better when context window length is longer,
00:32:37.940 | theoretically.
00:32:40.420 | And this is essentially the reason for why self-attention was initially developed.
00:32:45.180 | Researchers wanted to improve language models, and context windows providing more information
00:32:49.620 | were seen as the key to that.
00:32:51.380 | In other words, the more features, the more information, the more flexibility a model
00:32:57.380 | can have and expressivity.
00:33:00.900 | However, without feature weights, models didn't simply get better with long context windows,
00:33:06.300 | and feature weights and self-attention were hypothesized to be needed.
00:33:11.540 | And this was proven back in 2017 with the transformer architecture.
00:33:18.740 | In moving towards self-attention and the transformer, though, the primacy of the transformer architecture's
00:33:24.700 | block context model casts a shadow over the use of other context models.
00:33:32.260 | So for example, if I were to ask here, is it clear to everyone that the standard self-attention
00:33:41.020 | block model of context is different than the traditional notion of co-occurrences, which
00:33:46.660 | use a radius that is not positionally anchored?
00:33:50.620 | It is the context model, the positional anchoring of the block context model, that gives it
00:33:56.980 | its information.
00:33:59.540 | It is not, in all likelihood, anything else.
00:34:06.540 | Now what you do with that context model matters.
00:34:10.540 | You can't just take those vectors in a block, add them together, and expect a feedforward
00:34:14.460 | to do well.
00:34:15.460 | That's where self-attention is needed in order to figure out which vector needs the
00:34:19.300 | most weight.
00:34:23.540 | So what you'll also see in the architectures that are based on what I've already presented
00:34:29.260 | is that we're interested to explore how different models of context for language models can
00:34:34.740 | be integrated in general because they each provide different information.
00:34:41.440 | And we all know that the standard transformer's block model of context requires a ridiculous
00:34:46.300 | amount of information and data in order to become effectively trained.
00:34:53.080 | So the current state of contexts that we use, top there might be the standard transformer
00:35:01.900 | context that has a fixed positional block.
00:35:04.860 | And it takes the first 10 tokens, for example, the second 10 tokens, and the third 10 tokens,
00:35:10.860 | each in different blocks.
00:35:13.080 | Each of those is a group of contextualizing vectors.
00:35:18.220 | The second one there that you see with the r as a subscript is a radial model because
00:35:23.620 | those do different things.
00:35:24.980 | In other words, rather than assume you're looking at the first 10 or the nth 10 features,
00:35:30.820 | you pick a radius and you say, what are the last r features, the last r vectors?
00:35:36.980 | That can also have an attention distribution, a self-attention distribution, according to
00:35:41.140 | the exact same model that's being presented.
00:35:45.140 | It produces an entirely separate context in the state, whatever you want to call it, which
00:35:51.900 | can be conjoined with the block model to articulate features and be given to an output layer that
00:36:00.780 | knows what to do with them when each has different values.
00:36:06.940 | The concatenation of those different context models keeps the information separate so the
00:36:11.500 | output layer can decide which portion of the context is useful for the prediction.
00:36:18.820 | This last one is getting really traditional at the bottom.
00:36:22.540 | It's what I refer to as a document model.
00:36:26.300 | If you've ever implemented something like a Naive Bayes classifier or a term frequency
00:36:33.660 | inverse document frequency model, that's essentially what a document model is.
00:36:38.780 | Set up your vectors, you get something.
00:36:43.140 | Is it going to be the best for predicting the next token?
00:36:46.260 | Absolutely not.
00:36:47.260 | However, it's always different.
00:36:49.700 | What that means is that even if you wrap to the next block between the radial and the
00:36:55.060 | document models, you have a unique context vector, even if you're looking at the exact
00:36:59.860 | same block, because the document has grown and the radius just says, what are the last
00:37:04.660 | three?
00:37:05.660 | What are the last 10?
00:37:07.780 | As a result, when you incorporate different models of context, you don't really have to
00:37:12.380 | say that there's a finite context window.
00:37:14.300 | It might not be very good to make predictions past the first block, but that might be about
00:37:19.260 | how much data you've used, and it might be about the hyperparameters for each one of
00:37:24.340 | those models that you're applying, in other words, radius, the block size, like usual.
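Here is a hedged sketch of concatenating the three context models just described -- the positionally anchored block, a trailing radius, and a whole-document bag -- so the output layer can weigh each separately. For brevity it just sums vectors within each context rather than showing each context's own attention distribution, and the sizes are illustrative assumptions.

```python
import numpy as np

def combined_context(vectors, t, block_size=10, radius=3):
    """vectors: list of per-token vectors seen so far; t: current position (0-indexed)."""
    block_start = (t // block_size) * block_size
    block = np.sum(vectors[block_start:t + 1], axis=0)              # the current block so far
    radial = np.sum(vectors[max(0, t - radius + 1):t + 1], axis=0)  # just the last r tokens
    document = np.sum(vectors[:t + 1], axis=0)                      # the whole document so far
    return np.concatenate([block, radial, document])
```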
00:37:33.220 | So far, the only embeddings that I've suggested are from this bit cipher algorithm, and as
00:37:38.540 | I've expressed, they don't capture any useful similarities between similar tokens.
00:37:45.140 | The bit cipher algorithm doesn't care if you're looking at the uppercase or the lowercase
00:37:49.860 | version of a word.
00:37:50.860 | It doesn't see them as bearing any similarity, even though they might be used very similarly.
00:37:57.300 | So how can you utilize the bit cipher to create vectors for tokens that have meaningful similarities
00:38:07.660 | between words that are used similarly?
00:38:12.060 | And this is just backing off to the traditional methods once again, taking co-occurrences
00:38:19.460 | of BitCypher vectors with whatever's there at the middle or center of a co-occurrence
00:38:25.780 | model.
00:38:27.980 | Normally, if you think about one-hot vectors, a co-occurrence matrix is really just the
00:38:34.380 | same thing, except now we just have smaller vectors with different dimensions on, so to
00:38:41.420 | speak.
00:38:44.580 | And we normalize after concatenating these blocks of different radii from the bit cipher
00:38:52.660 | to match the original input requirements that we discovered for the warm start solution.
00:39:00.620 | And that enables us to use these just like we would the original BitCypher vectors, except
00:39:06.900 | now, just from the usual co-occurrence statistics, you'll see that capital word and lowercase
00:39:14.580 | word have a lot of common usage.
00:39:17.980 | And you know this works because you've seen co-occurrences for a very long time, and while
00:39:23.740 | they might not normally be useful in our applications these days with deep learning, they can be
00:39:29.740 | imparted through the bit cipher algorithm to prescribed vectors as well.
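One hedged way to picture building such vectors: for each radius, accumulate the bit cipher vectors of each center token's neighbors, concatenate across radii, and renormalize to unit sum to match the warm-start input requirements. The radii and the counting details here are assumptions for illustration.

```python
import numpy as np

def cooccurrence_embeddings(tokens, cipher, radii=(1, 5, 25)):
    """tokens: a corpus as a list of token strings; cipher: token -> bit cipher vector."""
    d = len(next(iter(cipher.values())))
    emb = {tok: np.zeros(d * len(radii)) for tok in set(tokens)}
    for i, center in enumerate(tokens):
        for k, r in enumerate(radii):
            for j in range(max(0, i - r), min(len(tokens), i + r + 1)):
                if j != i:                       # neighbors within radius r of the center
                    emb[center][k * d:(k + 1) * d] += cipher[tokens[j]]
    for tok, v in emb.items():
        if v.sum() > 0:
            emb[tok] = v / v.sum()               # unit sum, like the original cipher vectors
    return emb
```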
00:39:41.920 | So here's where things start paying off in terms of speed and efficiency.
00:39:51.380 | If you only have one layer of self-attention, then that means that you don't need to worry
00:39:57.780 | about whatever weird expressive stuff is happening where, you know, similar inputs might have
00:40:03.540 | slightly different hidden states.
00:40:07.620 | Since that first layer is just a set of static word embeddings, the self-attention layer
00:40:14.580 | is working off of static word embeddings.
00:40:18.500 | And that means each pair of words have a fixed comparison given static word embeddings.
00:40:26.220 | And that means if you want to compute the quadratic features of self-attention, you
00:40:31.460 | can just pre-compute them and pull them from memory.
00:40:36.540 | This caching of vector comparisons is essentially reducing the self-attention layer's cost from
00:40:43.260 | quadratic to linear, since those values that we're using to weight the vectors for the
00:40:49.980 | feedforward layer no longer require comparison across the block.
00:40:55.660 | They're already compared.
00:40:58.220 | So when our vectors are static, which is at inference time, and if we're not learning
00:41:05.980 | the embedding layer's parameters with iterative differential updates, then not only do we
00:41:14.060 | have to not track gradients for the embedding layer, but we don't even have to compute the
00:41:19.140 | vector comparisons.
00:41:20.140 | We can pre-compute them and just load them, which is much, much faster.
00:41:31.900 | So we can reduce all of the inference costs and some of the training costs -- not all of
00:41:38.620 | the training costs, because if we want to update those vectors, then we can't
00:41:42.340 | assume cached comparisons.
00:41:45.540 | But it's a huge cost savings.
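A minimal sketch of the caching idea: with static embeddings, each pair of token types has a fixed inner product, so the comparisons a block needs can be looked up rather than recomputed. The table layout is an assumption, and for large vocabularies you would only cache the pairs you actually need.

```python
import numpy as np

def build_similarity_cache(E):
    """E: (V, d) static token embeddings -> (V, V) table of pairwise inner products."""
    return E @ E.T

def cached_attention_weights(cache, block_ids, pivot_id, W):
    sims = cache[pivot_id, block_ids]     # (N,) looked up; no d-dimensional dot products
    n = len(block_ids)
    z = sims @ W[:n, :n]                  # feed forward through (a subset of) W
    e = np.exp(z - z.max())
    return e / e.sum()                    # softmax -> attention weights over the block
```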
00:41:48.460 | This means that we can train these self-attentive feedforward unit models very quickly and with
00:41:54.540 | good initializations.
00:41:57.460 | But there are some other things that we immediately observed while developing these models, and
00:42:01.820 | that is the lack of randomization produced models which were quite effective even on
00:42:09.300 | small data.
00:42:10.300 | Now, it doesn't mean that training on small data will let you generalize to everything
00:42:14.260 | else that's out there in the world.
00:42:15.500 | In other words, training on a small data set might produce a model which has a surprisingly
00:42:20.100 | low perplexity on its training set, but it doesn't mean that you're going to be able
00:42:24.820 | to generalize and have a language model that's talking well from just hearing a couple of
00:42:28.020 | thousand tokens.
00:42:29.940 | It does mean it will know that couple of thousand tokens very well, very quickly.
00:42:38.700 | But there's a challenge with using self-attention still, and that is the fact that the block
00:42:45.220 | model of context often is not fully utilized, since many documents are shorter than long
00:42:55.220 | context windows.
00:42:59.620 | And these days, there are exceptionally long context windows.
00:43:03.140 | I'm not even talking about those.
00:43:05.500 | Many of the language modeling benchmarks simply don't even go out to a thousand words when
00:43:09.060 | it comes to context, and you're looking at a document to predict.
00:43:15.020 | So this has been a problem for a while, and it means that if you're going to pad your
00:43:22.680 | short documents, you're going to waste a lot of prediction on those paddings.
00:43:27.740 | A lot of computation gets lost just for null information, essentially.
00:43:35.300 | And the way that this is often relieved in some groups, and to great effect, is by packing
00:43:42.740 | long contexts.
00:43:44.740 | So for example, if you've got a hundred thousand token context window, most documents will
00:43:49.220 | not be a hundred thousand tokens long.
00:43:51.660 | What do you do with the rest of that long context if you want to use a thousand tokens
00:43:55.740 | of good training data?
00:43:58.020 | You fill out the other ninety-nine thousand tokens with a bunch of other random documents
00:44:02.100 | that don't belong anywhere near the first one.
00:44:04.400 | That's called packing.
00:44:07.640 | Packing can be utilized without having different documents impact each other, without
00:44:14.880 | contaminating the information between documents, and that takes a lot of work, but it can be
00:44:19.640 | done.
00:44:22.160 | However, there are different strategies that we could employ, different engineering tricks
00:44:28.760 | that we could employ, to make our operation of self-attention more effective at any length
00:44:36.760 | of document without having to deal with this packing problem.
00:44:41.040 | And that comes about by dynamically changing the context length from some maximum value --
00:44:49.640 | which is what you would normally set -- to just the context that you have.
00:44:55.280 | But you still have to create batches if you want to train models quickly, and what that
00:44:58.840 | means is that there's still some padding if you use this approach.
00:45:03.040 | But you can pad those short documents to set lengths, batch short documents together, batch
00:45:13.680 | long documents together.
00:45:17.200 | This means that we don't need to pack documents together to make use of a long context window.
00:45:26.440 | When a document is long, you can let its context be long.
00:45:28.960 | When a document is short, you can put it with other short documents and just use a subset
00:45:32.880 | of those self-attention parameters.
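A hedged sketch of that batching strategy: rather than packing unrelated documents into one long context, group documents of similar length, pad only within a bucket, and use the corresponding subset of the self-attention weight matrix. The bucket boundaries here are an illustrative assumption.

```python
def bucket_by_length(documents, boundaries=(32, 128, 512, 2048)):
    """documents: lists of token ids; returns {bucket_length: [documents padded to that length]}."""
    buckets = {b: [] for b in boundaries}
    for doc in documents:
        for b in boundaries:
            if len(doc) <= b:
                buckets[b].append(doc + [0] * (b - len(doc)))       # pad only within the bucket
                break
        else:
            buckets[boundaries[-1]].append(doc[:boundaries[-1]])    # truncate overlong documents
    return buckets
```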
00:45:36.560 | And with traditional self-attention parameters, keys and queries, it would never be a subset
00:45:40.480 | because it's a low dimensionalization that that matrix provides.
00:45:44.800 | With this modified self-attention, though, there's a different shape to the weight matrix,
00:45:49.120 | and that's why it's a subset of those parameters that we have to utilize, and that might be
00:45:53.600 | something worth discussing afterwards.
00:45:55.320 | In other words, how does the difference in shapes of dimensionalities between this and
00:46:00.360 | the standard self-attention weights shake out?
00:46:08.000 | But we want to get to a different point for the sake of this conversation.
00:46:15.320 | What is a model like this useful for?
00:46:17.720 | That should be a question that you're asking.
00:46:19.400 | It's a question that we've been asking.
00:46:26.040 | We're not entirely certain yet how an extremely large model like this will function on trillions
00:46:34.920 | of tokens, for example.
00:46:35.920 | In other words, can you expect the same kinds of outcomes -- a ChatGPT kind of thing,
00:46:44.120 | with human interaction and RLHF and all the rest of that -- from some of these models?
00:46:50.720 | It's something that we're considering, but we're also considering different scales, too, since these
00:46:57.760 | models are performant on their own at smaller scales as well -- but for what?
00:47:05.760 | So the point is that, from what we've stress tested into the billions of tokens, models can be trained
00:47:13.120 | very quickly on a relatively small GPU, in ways that we expect when we cache vector comparisons,
00:47:20.240 | we see really big speedups.
00:47:22.160 | When we don't cache those comparisons, you see all of the growth in computation time
00:47:29.280 | that you would expect from longer context windows.
00:47:35.820 | This one here, though, we're trying to make it really, really, really small, the one called
00:47:40.240 | potato.
00:47:42.060 | That's because we want to see if we can train a model from scratch, since on very little
00:47:47.280 | data, these models can fit effectively with the initializations that we've developed.
00:47:55.880 | And with the purpose of starting from scratch, starting with no data, we're thinking about
00:48:01.040 | edge computing cases where we could deploy a language model with a microphone so that
00:48:05.880 | a person can talk to it and just train it from their own data, train it from their own
00:48:10.720 | speech, to understand their speech.
00:48:19.480 | So between these, we've explored a lot of different configurations, trying to consider
00:48:25.040 | similarities to what some standard configurations might look like, a couple thousand tokens
00:48:29.240 | in a context window, for example, to look something like a GPT-2 style model.
00:48:34.660 | Thinking about bit cipher embeddings that are 500 dimensional or 1,000 dimensional to
00:48:39.560 | be something like a GPT-2, that's, again, pointing towards the big/large category of
00:48:46.200 | models that we've experimented with.
00:48:49.400 | Beyond that, we haven't really touched those scales, because our first objective is not
00:48:55.440 | to make big, big language models and train chatbots.
00:48:59.020 | We want to know, what can we do with a small model, since this is a relatively unique capability?
00:49:09.740 | So what does training look like?
00:49:12.140 | To the best of our ability so far, it's kind of hard to see, but the first step is that
00:49:20.220 | warm start, where you train the bit cipher, and you take a couple of splits of data, and
00:49:27.940 | you compute that warm start for the self-attention layer and the feedforward layers.
00:49:34.140 | In this case, we're really just using a 100-million-token data set from the BabyLM (baby language
00:49:40.180 | model) Challenge, which has as an objective to see what language models can do on a relatively
00:49:49.800 | human scale of data.
00:49:51.220 | In other words, 100 million tokens is something that a person might hear in 10 years of their
00:49:56.380 | life.
00:49:58.100 | In 10 years of life, people become pretty proficient speakers, and can a language model
00:50:04.100 | be trained at that scale?
00:50:07.900 | The second stage, after the warm start happens, is where the majority of training time occurs,
00:50:14.620 | and yet is also where training operates the most quickly.
00:50:22.260 | At this stage, we find that freezing vectors is important.
00:50:26.060 | One, because it means that we can train much quicker.
00:50:29.120 | So we can have the subsequent layers optimized beyond their warm starts very, very fast,
00:50:35.620 | using that vector caching, the vector comparison caching, to avoid the quadratic costs of self-attention.
00:50:43.260 | This articulates the parameters in the middle layers of the model, taking 100 million
00:50:50.180 | tokens and making five passes over the data here a lot quicker than any of the other stages.
00:50:57.980 | The comparison that you'd make to this is the training time once those embedding layers
00:51:03.460 | are unfrozen, where everything slows down to the normal speeds, where you have to do
00:51:08.580 | all of your vector comparisons on the fly, since you can't assume that the same comparisons
00:51:14.060 | will always result in the same numbers, since model parameters might be updated.
00:51:22.380 | This is the best procedure that we've figured out so far.
00:51:25.160 | And in order to make those vectors update, we find that learning rates have to be adjusted
00:51:30.160 | dynamically inside of the network, like normal, and that the embedding layers are really tough
00:51:36.140 | to make progress on.
00:51:40.380 | And you'll notice here in this picture that the slowness and the lack of stability, for
00:51:45.660 | example, in learning the embedding layer once it had been prescribed earlier, makes it really
00:51:51.220 | hard to train over the entire data set compared to five passes, for example, in the middle
00:51:57.460 | phase when the middle and upper parameters are being updated, still with backpropagation.
00:52:03.940 | And the other thing that I would highlight before leaving this slide is, in phase one,
00:52:11.060 | how the warm start saturates pretty quickly.
00:52:15.540 | So if you have 100 million tokens, you really only need to apply the warm start to something
00:52:19.820 | like maybe 10 million tokens, not that much more.
00:52:23.060 | You don't see that much gain from that much more data.
00:52:27.820 | That's not a bad thing, because it means that we don't have to apply that process for any
00:52:32.660 | longer.
00:52:34.500 | It would be great if it gave us all of the optimization that we could hope for, but it's
00:52:38.820 | not something that we could necessarily expect, since it's just an approximation of where
00:52:43.340 | the parameters are headed.
00:52:48.960 | So on the back of an envelope, thinking about the system that example was trained on
00:52:54.860 | as compared to other examples that are out there, and thinking about models that are
00:53:00.900 | kind of sort of similar size: we're talking about a 12-gigabyte GPU, a relatively small
00:53:08.180 | single chip, specifically when referring to these training times.
00:53:14.740 | So that's a 12-gigabyte GPU.
00:53:20.020 | Compare that to setups working off of eight chips, each having roughly four times the scale, and
00:53:26.060 | to the time that it took them to train something maybe an additional order of magnitude larger,
00:53:31.820 | although we have trained models up to around 50 million parameters, too, which is getting
00:53:35.940 | towards GPT-2 scale.
00:53:39.820 | We see training times suggesting that, if we scaled up to the relatively large systems that set
00:53:45.300 | the expectation for how much work a model that large should take, we could
00:53:50.220 | expect to be able to train much faster.
00:53:53.500 | But as mentioned, the initial objective here is not to simply figure out how well we can
00:53:59.500 | do something that's being done well already.
00:54:02.180 | It's to figure out what these alternative strategies are useful for, since they give
00:54:06.900 | us access to different regimes of model scale as effective.
00:54:15.860 | So as mentioned, we've gone to relatively large amounts of data.
00:54:22.380 | I wouldn't really call them big data at this time, even though just a couple of years ago
00:54:26.500 | a billion tokens would be a relatively large amount of data.
00:54:30.860 | It's really just a stress test at this point, gives us something like, do we continue to
00:54:35.860 | see models getting better as we continue to give them more data?
00:54:39.460 | Do we continue to see models getting better as we continue to give them longer context
00:54:43.780 | windows?
00:54:44.780 | And the answer to both of those questions is absolutely yes.
00:54:47.540 | So nothing is telling us that we can't train bigger models with these.
00:54:51.540 | But will those bigger models be as good as a standard self-attention model?
00:54:55.300 | I don't know.
00:54:56.300 | It's a different self-attention parameter matrix than what you see in a standard self-attention
00:55:00.100 | model.
00:55:01.420 | You could integrate the two.
00:55:03.300 | And in theory, that should be overkill, because you'd have more parameters and more power
00:55:08.700 | through them.
00:55:10.220 | And we can see from this work that the alternative self-attention parameters are reasonably effective.
00:55:18.460 | We're getting close to time.
00:55:21.180 | So I'll go quick through these, since this is the work that we're approaching right now.
00:55:27.740 | And this is the idea that we're seeing as a use case for such a model like this.
00:55:34.860 | In other words, no pre-training.
00:55:38.540 | Just training on the target data, whatever the data of interaction are.
00:55:43.420 | And in this example, you'll see that this relatively smaller precision language model
00:55:47.900 | just needs to predict whether or not a light should go on or off.
00:55:51.860 | A lamp that listens with a microphone and a switch.
00:55:56.000 | And you can use that switch to train the lamp.
00:56:02.460 | So that's the goal here.
00:56:04.800 | Can pre-training be eliminated?
00:56:07.180 | And we want to anticipate whether or not you're going to flip the light on or off.
00:56:12.520 | That's the task that we're going to try and approach.
00:56:16.420 | Or that, rather, we're currently approaching.
00:56:19.340 | There's a few different processes that integrate into this approach.
00:56:23.820 | There has to be a microphone that's listening to you, recording audio.
00:56:28.020 | There has to be a transcription algorithm.
00:56:29.700 | And we use Wav2Vec at this point, because there's a very small version of it that's
00:56:33.340 | character-based.
00:56:34.340 | And as a result, it doesn't even require you to use consistent-- well, it does require you to
00:56:38.540 | use consistent language.
00:56:40.260 | But it doesn't even require you to use words, since it's strictly phonetic.
00:56:45.740 | There has to be-- and this is the bread and butter of what's going on here-- a process
00:56:51.300 | which anticipates what you want.
00:56:54.620 | And that process is responsible for creating good training data.
00:56:58.100 | So this is a smart data collection algorithm that figures out, when you flip the switch,
00:57:04.820 | is that the target for something that you just said?
00:57:08.460 | Is that the transfer learning objectives from text that was transcribed, that it anticipates
00:57:17.700 | you want.
00:57:20.340 | Following this, there's also two other processes.
00:57:23.060 | One which operates on a different time cycle, and that's training.
00:57:28.180 | So always train a model.
00:57:29.940 | Always be training a model, whenever there's new data, is essentially what that fourth
00:57:35.900 | process says.
00:57:37.620 | And the last one is operation.
00:57:39.300 | In other words, if you flip the switch, there has to be a process which operates the light
00:57:43.100 | bulb.
00:57:44.100 | The device always has to work as a lamp in order to be useful.
00:57:45.940 | It always has to be able to just be a switch.
00:57:51.380 | However, that operation process likewise needs to see a directive from the anticipator.
00:57:57.780 | If the language model predicts that you just said a thing, that means you want there to
00:58:01.900 | be light, that operator then needs to receive the signal from the anticipator and execute
00:58:10.180 | the directive.
00:58:11.660 | If, though, the user changes the switch back within some time scale after the model
00:58:18.980 | made a bad prediction, the operator is also responsible for issuing a correction
00:58:24.920 | to the anticipator to fix the training data.
00:58:30.220 | What this looks like as a process is in this diagram here.
00:58:34.900 | And you can see the flow here from stage one, a verbal command maybe gets recorded, transcribed,
00:58:43.740 | turned into text.
00:58:45.240 | And if there's no model that's yet trained, that text is just stored as data along with
00:58:50.060 | any directives given by the user in the form of a light switch going on or off.
00:58:57.180 | Once there's any data, the learning process says, okay, time to train a language model
00:59:03.940 | and integrate it with these targets.
00:59:07.180 | Once a model is done training, it's sent over to the anticipator, which is responsible for
00:59:12.940 | using the language model.
00:59:15.620 | That small language model then is now empowered to make predictions every single time it receives
00:59:22.700 | a text command.
00:59:25.380 | And those predictions are sent to the operator, which then does whatever it's told.
00:59:31.580 | And the last thing that can happen, step six, is if the wrong prediction was made and the
00:59:36.360 | user fixes it by turning off the light because they didn't want the light on, that corrects
00:59:43.460 | the data that was transcribed and the next model which is trained will be able to avoid
00:59:47.000 | that problem.
00:59:48.820 | And there's some dialing this in in terms of the time scales that you want based on
00:59:54.020 | the way humans interact with the light switch.
00:59:55.740 | So there's a lot of development that goes into figuring out the right way to set this up.
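A minimal sketch of how these cooperating processes might be wired together, in Python; every helper here (record_audio, transcribe, train, predict, operate) is a hypothetical stand-in for illustration, not the system's actual code.

```python
# Sketch of the lamp's cooperating processes: listener, transcriber,
# anticipator, trainer, and operator. All helpers are hypothetical stand-ins.
from collections import deque

examples = deque()   # (transcribed_text, label) pairs, label in {"on", "off", "none"}
model = None         # the most recently trained tiny language model

def record_audio():              # stand-in for the microphone process
    return b""

def transcribe(audio):           # stand-in for the character-based Wav2Vec step
    return "turn on the lite please"

def train(data):                 # stand-in for the always-be-training process
    return {"trained_on": len(data)}

def predict(model, text):        # stand-in for the anticipator's language model
    return "on" if "on" in text else "none"

def operate(directive):          # the operator: actually flips the bulb
    if directive in ("on", "off"):
        print(f"lamp -> {directive}")

def step(switch_event=None):
    """One cycle: transcribe, predict/operate, record a training example, retrain."""
    global model
    text = transcribe(record_audio())

    # Operator: act on the anticipator's prediction, if a model exists yet.
    if model is not None:
        operate(predict(model, text))

    # Anticipator: pair the utterance with the observed switch flip (or "none").
    # A flip shortly after a wrong prediction doubles as a correction label.
    examples.append((text, switch_event or "none"))

    # Trainer: retrain whenever there is new data (on its own time cycle in practice).
    model = train(list(examples))

step()                   # user speaks, no switch flip
step(switch_event="on")  # user speaks and flips the switch on
```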
01:00:01.340 | The data that you collect from a process like this, how do we organize it?
01:00:06.940 | This actually is not transfer learning, so I kind of lied there a little bit.
01:00:11.040 | This is strictly language modeling.
01:00:14.140 | It's a conversation between the human and the lamp.
01:00:17.100 | You say something, the lamp says, here's what you want.
01:00:23.060 | And it's just an extending context window, like you'd see with a decoder-only kind of
01:00:29.100 | architecture these days, a chatbot kind of thing, a personal assistant, a human-assistant
01:00:36.420 | dialogue.
01:00:39.300 | And you might also suspect then that, well, couldn't you let the lamp talk?
01:00:43.700 | Yes, you could absolutely let it use other tokens, and that is something which is on
01:00:47.540 | the horizon for us, in other words, how to determine when the model has learned enough
01:00:52.900 | and knows when you want to hear it talk and knows what you want to hear it say, which
01:00:58.980 | requires other smart data collection currently in development.
01:01:04.780 | And there are three tags here, if you can't see them, although what they really are is tokens,
01:01:08.900 | since they're integrated within the language model's vocabulary.
01:01:13.540 | I want the lamp lit, I want the lamp dark, or nothing, if no switch is applied during
01:01:19.220 | transcription.
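As a rough illustration of how those tags can live inside the same stream as the transcribed characters; the tag strings and helper below are made up for the example.

```python
# Hypothetical serialization of the human/lamp dialogue as one growing token stream.
# The three control tags live in the same vocabulary as the transcribed characters.
LIT, DARK, NONE = "<lamp_lit>", "<lamp_dark>", "<none>"

history = []

def add_turn(transcribed_text, switch_state):
    """Append the user's utterance followed by the observed switch tag."""
    history.extend(list(transcribed_text))   # character-level tokens from transcription
    history.append({"on": LIT, "off": DARK, None: NONE}[switch_state])

add_turn("turn on the lite", "on")   # misspellings are fine: transcription is phonetic
add_turn("thats enough", "off")
print(history[-3:])                  # ['g', 'h', '<lamp_dark>']
```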
01:01:23.700 | So what do the models look like that go into a lamp?
01:01:27.580 | They're a little bit smaller than that micro model in terms of having a long context window,
01:01:32.860 | B, the block size.
01:01:34.780 | They still use those other features, like a radius, which help them to do well with
01:01:40.940 | only a little data, like those other context models.
01:01:46.020 | And the embedding size is around 50 or 100-something, and this is small enough to
01:01:53.380 | fit on a microprocessor, on the CPU of a microprocessor, including training, with no GPU whatsoever.
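For concreteness, a configuration in this ballpark might look like the following; the field names and exact values are illustrative only, not the settings actually used in the talk.

```python
from dataclasses import dataclass

@dataclass
class LampModelConfig:
    # Illustrative hyperparameters for a CPU-only, on-device model (not the real settings).
    vocab_size: int = 64     # character-level vocabulary plus the three control tags
    block_size: int = 512    # B: a long context window relative to the model's size
    embed_dim: int = 64      # roughly the "50 or 100-something" range mentioned
    radius: int = 8          # locality feature that helps with very little data
    device: str = "cpu"      # trains and runs on a microprocessor, no GPU

print(LampModelConfig())
```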
01:02:04.220 | And the first time we ever got the interaction right, the right timescales, we went from no data
01:02:09.860 | whatsoever to creating this data, and 20 minutes of it was enough, and you can see there are
01:02:18.340 | loads of misspellings here, because the transcription is not required to produce known words, known
01:02:23.500 | tokens.
01:02:25.080 | It's strictly character-based, so you can say whatever you want to say, you can whistle,
01:02:28.820 | and as long as Wav2Vec thinks those are tokens, it'll figure out what to transcribe.
01:02:39.820 | That's enough, 20 minutes of talking to it, to have it know pretty well when you want
01:02:45.420 | the light on.
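A sketch of what character-level transcription with a pre-trained Wav2Vec 2.0 CTC model can look like, using the Hugging Face transformers API; the checkpoint name is just an example and may differ from what the lamp actually runs.

```python
# Sketch: character-level CTC transcription with a pre-trained Wav2Vec 2.0 model.
# The checkpoint is an example; the on-device system may use a smaller/different one.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(waveform, sampling_rate=16_000):
    """Greedy CTC decode: returns characters, with no guarantee of real words."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

# e.g. transcribe(one_second_of_audio) -> "TURN ON THE LITE" (spelling not guaranteed)
```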
01:02:47.740 | This is what the numbers look like for that prediction, and you see lots of zeros there.
01:02:51.100 | That's because there are no positive instances yet in the data; until you flip the switch,
01:02:56.780 | there's nothing to predict.
01:02:59.620 | Once there is enough to predict, we see an immediate jump in the model's ability to figure
01:03:05.000 | out whether the lamp should say on, off, or nothing.
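Those zeros are exactly what a per-class precision/recall computation produces when a class has no positive instances yet; a quick illustration with scikit-learn, where zero_division=0 reports the undefined scores as zeros (purely illustrative, not the talk's evaluation code).

```python
from sklearn.metrics import precision_recall_fscore_support

# Before the switch is ever flipped, every label is the "none" class,
# so the "on" and "off" classes have no positives and their scores are all zero.
y_true = ["none"] * 10
y_pred = ["none"] * 10
print(precision_recall_fscore_support(
    y_true, y_pred, labels=["on", "off", "none"], zero_division=0))
```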
01:03:11.540 | And while we trained this first model, for example, in 20 minutes on Le Potato, which
01:03:19.940 | is a really, really, really small microprocessor, it's incredibly frustrating to use, because
01:03:26.620 | the processing time is a couple of seconds, and it feels like the data is going somewhere, even though
01:03:31.420 | it's entirely localized: there's no Wi-Fi, there's no internet connection.
01:03:36.140 | It just takes the model on this tiny chip a minute, not really a minute, more like a couple of
01:03:40.900 | seconds, to flip the switch on, because it has to transcribe the command, interpret it, issue
01:03:46.100 | the directive, and ask the operator to operate the light.
01:03:49.060 | And so part of what we're doing is figuring out at what scale of microprocessing do the
01:03:53.220 | models that we're developing really make a good real-time system that a user can make
01:03:57.980 | use of well.
01:04:00.560 | And as you can see, the larger the model in terms of hyperparameters and so forth, the
01:04:07.020 | more performant it gets.
01:04:11.800 | So we see these as potentially useful in edge scenarios, but not just for operation, for
01:04:17.940 | training.
01:04:19.060 | So go to Home Depot, buy a light switch, install it in your house, and start talking to it.
01:04:28.040 | But this isn't really the stopping point that we want to get to.
01:04:34.820 | We want to eventually get to the point of talkback.
01:04:36.900 | We want to treat these as language models that essentially have a bit of you inside
01:04:40.580 | of them that you can converse with.
01:04:44.380 | And for that, it's important to know when the model is aware of what you want to hear said.
01:04:50.900 | In other words, it needs to know what is a good thing to say back to what you just said.
01:04:55.700 | And the lamp has never heard a lamp talk before.
01:04:58.860 | So there are challenges to figuring out the lamp's role in conversation.
01:05:04.940 | And choosing a lamp, though, is arbitrary.
01:05:08.660 | We don't have to make it be a light bulb which goes on and off.
01:05:10.900 | This could be a controller for anything which is a binary switch.
01:05:15.660 | And you could imagine, like others are looking at right now, there's a lot of opportunities
01:05:20.820 | with predicting the action on your phone that you want to take, which thing you want to
01:05:24.940 | push.
01:05:26.860 | And with a system like this, micro-sizing it onto your cell phone, for example, assumes better
01:05:33.220 | hardware than what we're already using, but would be entirely localized, including training.
01:05:44.140 | But this is also really just getting to the point of feasibility.
01:05:49.580 | It's not getting to the point of a well-optimized system, which we're still developing.
01:05:54.460 | There are, in principle, different modifications that we could make to the self-attention layers,
01:05:59.100 | which include traditional self-attention parameters.
01:06:02.220 | That's just one example.
01:06:04.820 | Then there are updates to the very naive scheme that we have for BitCipher, the vectors that
01:06:10.020 | we're using to initialize our models.
01:06:13.260 | And a lot of other minutiae that need to be addressed.
01:06:19.500 | So this isn't really work that's done.
01:06:21.740 | It's a work in progress.
01:06:23.300 | And in addition to what I just described, we're moving towards larger models and evaluations
01:06:29.600 | that compare better to modern systems, which will eventually come online.
01:06:34.980 | We'll most likely participate in this year's BabyLM (baby language model) challenge, although that
01:06:40.780 | challenge assumes you're working with a standard architecture, which is already developed for
01:06:45.660 | all of the evaluative needs.
01:06:48.140 | So there's a lot of work to do.
01:06:51.020 | But that's really all I have prepared for you to discuss today in this conversation.
01:06:54.300 | I've gone over a lot of details, and if you'd like to talk about any of these, I'm certainly
01:06:58.820 | happy to.
01:07:00.560 | Questions that you might have as well.
01:07:01.900 | And if you have access to the slides, there's some links to the different papers I've referenced.
01:07:09.580 | That's all for today.
01:07:10.580 | Thanks.
01:07:11.580 | [APPLAUSE]
01:07:12.580 | Hey, so thanks, Jake, for the great talk.
01:07:19.380 | And now we'll have some time for questions.
01:07:21.660 | So if anybody here has any questions, feel free to raise your hand and ask.
01:07:25.420 | Otherwise, we'll go to some questions on Slido.
01:07:36.180 | Some folks are asking about the slides.
01:07:37.180 | So we'll be posting the slides later.
01:07:40.100 | But I've also pasted these references in the Zoom chat, as well as Discord, in case anybody
01:07:46.420 | wants to see them.
01:07:48.340 | I was wondering, in the plots that you showed for warm start versus cold start, does the
01:07:59.100 | cold start use the modified self-attention or the standard self-attention?
01:08:05.780 | Sure.
01:08:07.180 | So the question was, in this picture, comparing warm starts to cold starts, what self-attention
01:08:14.340 | was used here?
01:08:15.340 | None.
01:08:16.340 | This is strictly a feed-forward experiment, where we take a single layer, and all we do
01:08:20.660 | is feed forward with one-hot vectors from some context window and concatenate them together.
01:08:28.420 | And the general property that you'll see is, by concatenating vectors, there's very little
01:08:33.980 | for attention to do.
01:08:36.460 | Simply with a block, you're adding the vectors together, and that superposition of the dimensions
01:08:41.860 | smears them.
01:08:43.500 | And that's why self-attention is needed, in order to weight that superposition so just
01:08:47.940 | the right ones stick out and it's not muddled.
01:08:51.320 | If those vectors are instead concatenated, a weighting of those is really just appealing
01:08:56.060 | to the sensibilities of the matrix above.
01:09:01.020 | When they're superimposed, there's a lot to work on, since you're smearing separate information
01:09:06.260 | together.
01:09:07.500 | When the information is already separated, there's not that much that re-weighting can do.
01:09:14.020 | And in this case, there's absolutely no re-weighting going on.
01:09:18.180 | And what I've described to you is really just something that's become very clear from a
01:09:24.060 | lot of small-scale experiments in between the models that we've developed.
01:09:29.420 | And moving towards self-attention took additional time, and we didn't have a solution for that
01:09:35.380 | layer yet when this work was done.
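To illustrate the distinction being drawn, a toy NumPy example: summing one-hot context vectors superimposes (smears) them, while concatenating keeps each position's information separate, leaving little for a re-weighting step to add. Entirely illustrative, not code from the experiments.

```python
import numpy as np

vocab, block = 5, 3
tokens = [0, 2, 4]                                 # a tiny context window
one_hots = np.eye(vocab)[tokens]                   # shape (block, vocab)

superposed   = one_hots.sum(axis=0)                # (vocab,): tokens smeared together
concatenated = one_hots.reshape(-1)                # (block*vocab,): tokens kept separate

print(superposed)     # [1. 0. 1. 0. 1.]  -> needs a self-attention-style re-weighting
print(concatenated)   # position info is explicit, so re-weighting has little to add
```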
01:09:38.180 | I had a question in regards to-- so you're doing this with on-edge controllers, right?
01:09:47.500 | What?
01:09:48.500 | You're doing this with on-edge controllers, right?
01:09:49.500 | You're doing training for on-edge controllers?
01:09:50.500 | So this could be for IoT devices, right?
01:09:51.500 | Could be.
01:09:52.500 | And you talked about how this also could work for image data, right?
01:09:53.500 | Oh, I saw that.
01:09:54.500 | Yeah.
01:09:55.500 | Have you conducted any tests with image data?
01:09:56.500 | Like, with these small-scale models?
01:09:57.500 | Yeah.
01:09:58.500 | So image data works best on architectures that aren't just feed-forward.
01:10:16.020 | They have, for example, convolutional bits and pieces that are useful to them.
01:10:21.140 | And that means if we want to apply some kind of a warm start for, for example, a convolutional
01:10:27.220 | layer to create a performant image classifier or something that's working with images, we'd
01:10:31.700 | want to develop an initialization for that layer, too.
01:10:35.660 | It has weirder activation functions, which means we need to branch out from softmax as
01:10:39.660 | an activation function.
01:10:41.740 | But convolution is surprisingly similar to a radial model.
01:10:47.260 | It's really just saying what's near where I'm trying to create a feature.
01:10:51.980 | So I would say, yes, it seems like it's something that we could do.
01:10:55.540 | But currently, it's in the phase of future work where it fits in one bullet here at the
01:11:08.260 | bottom.
01:11:09.260 | Different layer types need formal derivation for warm starts.
01:11:13.420 | So if we wanted to do this kind of a thing with performant architecture, we would be
01:11:18.100 | probably uniforming or randomly initializing some of those parameters that we don't have
01:11:22.140 | warm starts for yet.
01:11:23.980 | And as a result, we would receive a lot of noise in where things are
01:11:28.340 | going.
01:11:29.540 | And if we started to utilize other activation functions, even just a logistic
01:11:34.300 | activation, a logistic activation is not really fundamentally different from a softmax activation.
01:11:39.420 | So you might say, for example, well, why can't you just apply that to a logistic function,
01:11:43.220 | like a two-dimensional softmax?
01:11:46.660 | And the reason is that if we treat it like a standard logistic, then each dimension
01:11:50.140 | is independent.
01:11:51.900 | Each dimension is trying to predict the same thing.
01:11:54.800 | And there's a lot more questions about how you can get different information out of different
01:11:58.440 | dimensions.
01:11:59.860 | So it's a question that's really worth spending time on, in my opinion, separately.
01:12:05.820 | And it's not the first question that makes a lot of what we've developed practical.
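The sense in which a logistic unit is "like a two-dimensional softmax" is just the identity sigmoid(z) = softmax([z, 0])[0]; a quick numeric check (illustrative, not from the talk's materials).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

z = 1.7
# A logistic unit is a two-class softmax over the logits [z, 0]:
print(sigmoid(z), softmax(np.array([z, 0.0]))[0])   # identical values
```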
01:12:10.860 | On one of the slides, you had a dialogue with your user.
01:12:21.500 | I'm wondering, does that imply there is a speech-to-text system inside the microprocessor?
01:12:28.540 | Yeah.
01:12:29.540 | So audio goes in.
01:12:31.300 | And there's a process here which accepts that audio.
01:12:35.340 | And it utilizes a pre-trained Wav2Vec.
01:12:39.100 | It's really just fitting a need with a pre-trained model.
01:12:42.020 | That's what we're doing right now.
01:12:44.420 | Although transcription is something that we would like to move into in our future work
01:12:49.100 | for the purposes of training from scratch, because one of the real benefits of a system
01:12:53.700 | like this is that it doesn't come with any biases from other people's data, aside from
01:12:59.420 | the fact that there's a pre-trained transcription system, which means that it's pre-trained
01:13:04.060 | towards whatever phonetics were within the language that was there for pre-training in
01:13:09.340 | the Wav2Vec algorithm.
01:13:11.460 | So there is external utility here coming from a pre-trained model.
01:13:17.600 | But the text itself and the language model that we're presenting is only working from
01:13:23.280 | what gets transcribed.
01:13:24.280 | I have a follow-up on my previous question.
01:13:35.320 | You said that the feed-forward warm start is independent of the choice of self-attention.
01:13:42.040 | Does that mean that the warm start strategy can be used for any network that uses a feed-forward
01:13:49.160 | layer, not just PLMs, but any LLM or any other network?
01:13:54.600 | Yeah.
01:13:55.600 | So that's going back to the warm start solution here.
01:14:01.880 | And what it says is that in terms of any layer beneath, if you assume that those layers'
01:14:07.320 | parameters are what they are, you're not going to update them.
01:14:12.680 | And assuming that you know what the targets for that layer are, which for middle layers,
01:14:16.560 | there's some questions to be answered, then this initialization will do better than random
01:14:23.940 | for a softmax output.
01:14:26.600 | That's really important at this stage, that there's a softmax as a part of the activation.
01:14:32.080 | If there's not, then more math, basically.
01:14:39.000 | But the point that should become clear is that whatever type of prediction scenario
01:14:48.040 | you're in, as long as you have non-negative features and a softmax for activation, like
01:14:56.720 | in this case with a single layer, or even two softmax layers, whatever that's doing,
01:15:02.220 | on MNIST, you can get a really good initialization.
01:15:07.120 | Doesn't have to be linguistic data.
01:15:13.080 | This can be mixed data, too.
01:15:14.480 | You could do an image caption generation system that has both features from images and text
01:15:20.120 | and warms them up with the same solution with entirely different data in two places.
01:15:25.760 | Could you point out which part of the process requires the values to be non-negative?
01:15:33.720 | Yeah.
01:15:34.720 | What happens when you put a negative in a logarithm?
01:15:42.740 | Not saying you can't, but it's not going to start making probabilities for you at the
01:15:46.840 | other end of the softmax any time fast.
01:15:50.400 | So you have to start with a different premise, essentially.
01:15:55.140 | And that premise is something that requires more derivation.
01:16:00.920 | You'd want to ensure, if you're going to use a logarithm anywhere, or assume that inverse,
01:16:05.600 | that you're able to properly modify every parameter independently, instead of full rows
01:16:13.640 | of parameters.
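To make the non-negativity point concrete, here is a naive-Bayes-style, log-based initialization for a softmax layer; this is a sketch of the general idea, not necessarily the warm start derived in this work, and the assertion shows exactly where negative features would break the logarithm.

```python
import numpy as np

def log_count_warm_start(X, y, n_classes, eps=1e-8):
    """Sketch of a log-based softmax warm start: W[c, j] ~ log of how much
    (non-negative) feature j co-occurs with class c. A negative feature value
    would put a negative number inside the logarithm and break this premise."""
    assert np.all(X >= 0), "log-based warm starts assume non-negative features"
    W = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        W[c] = np.log(X[y == c].sum(axis=0) + eps)
    return W

X = np.abs(np.random.randn(100, 20))         # e.g. counts / one-hot sums are non-negative
y = np.random.randint(0, 3, size=100)
W = log_count_warm_start(X, y, n_classes=3)  # candidate initial weights for a softmax layer
```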
01:16:16.480 | I think we should get to a couple of questions on Slido that folks asked.
01:16:24.880 | The first is, what's the difference in performance between naive assignment and optimized or
01:16:30.280 | omniscient assignment for packing tokens into bit vectors, and any experimental results?
01:16:36.860 | What's the difference in performance between naive assignment and optimized assignment
01:16:44.560 | for packing tokens into bit vectors?
01:16:53.740 | The performance differences are going to be in speed.
01:16:56.720 | The systems which utilize packing for contexts have gone to great lengths to make sure that
01:17:02.400 | information from different portions of the context that have nothing to do with each
01:17:06.640 | other don't bleed information, if you're going to pack them together.
01:17:11.760 | That creates a lot of logistical challenges in terms of defining models.
01:17:16.920 | And it's still just doing the regular self-attention thing.
01:17:19.520 | So it's quadratic.
01:17:20.520 | So if you have the same length of context window, it's going to be the same computational
01:17:23.920 | cost.
01:17:25.440 | However, if you pack all of your small documents together, they don't need the whole context
01:17:34.040 | window worth of quadratic comparisons.
01:17:38.960 | And that's why you pack something into the empty end.
01:17:41.720 | I guess it should be over here.
01:17:46.040 | But document packing isn't exactly standard, even though it's well known as a mechanism to make training
01:17:55.040 | much more efficient.
01:17:56.040 | In other words, you need fewer batches if more documents are packed together.
01:18:00.920 | It's not something which is, for example, entirely accepted as a published
01:18:08.440 | form of preprocessing.
01:18:11.140 | So what I would say is just document packing is not a correct model of context.
01:18:16.440 | It is an efficiency, but requires the same level of quadratic comparison.
01:18:21.920 | Whereas dynamically batching and utilizing a block size that is dynamic preserves the
01:18:28.520 | model of context.
01:18:30.240 | It does something that is true to the objective and unwavering in that.
01:18:35.040 | And it reduces the complexity for smaller documents.
01:18:39.360 | But a direct comparison of the two is something I have not done, because it would require
01:18:44.160 | having that oracle and utilizing those algorithms.
01:18:46.720 | And where are they used?
01:18:48.040 | They're used with insanely big models, which means we would likewise have to compare two
01:18:52.880 | insanely big models to create the same level of expectation that people have from packing.
01:18:58.120 | So that's in the future.
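For illustration, dynamic batching in the sense described here can look like grouping documents into length buckets so each batch uses a block size near its longest member instead of the full context window; the bucketing scheme and names below are illustrative only.

```python
from collections import defaultdict

def dynamic_batches(documents, batch_size=8, bucket_width=32):
    """Group documents of similar length so each batch can use a block size
    close to its longest member, rather than the full context window."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[len(doc) // bucket_width].append(doc)
    for bucket in buckets.values():
        for i in range(0, len(bucket), batch_size):
            batch = bucket[i:i + batch_size]
            block_size = max(len(d) for d in batch)   # dynamic B for this batch
            yield batch, block_size

docs = ["short doc", "a somewhat longer document " * 3, "tiny", "medium length text here"]
for batch, B in dynamic_batches(docs, batch_size=2):
    print(B, [len(d) for d in batch])
```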
01:19:00.400 | Great.
01:19:01.400 | Thanks for your detailed response.
01:19:02.400 | We have a question quickly that's asking, are there any implementations of SAFU available
01:19:07.920 | that one could experiment with?
01:19:10.600 | Well, once we publish, there will be.
01:19:15.400 | But that requires a lot of work on developing systems for evaluation, since the evaluation
01:19:20.420 | systems rely upon standardized functions within the architectures that you're all very familiar
01:19:26.280 | with, like GPT-2, that are easily taken for granted.
01:19:29.680 | Even though you do lots of work in training them, you have to do a lot of work in creating
01:19:34.200 | those functions that meet the needs of the separate prediction tasks and fine-tuning
01:19:38.040 | that evaluations perform.
01:19:39.840 | All right, great.
01:19:42.160 | Makes sense.
01:19:43.160 | I think we're pretty much out of time.
01:19:45.320 | So thanks, Jake, for the great talk, and thanks for coming to another lecture.
01:19:50.560 | Thank you.
01:19:51.560 | [END]