Stanford CS25: V4 I Transformers that Transform Well Enough to Support Near-Shallow Architectures


Whisper Transcript | Transcript Only Page

00:00:00.000 | >> Today, for our talk, we have Professor Jake Williams from Drexel University.
00:00:09.480 | He is an Associate Professor of Information Science at Drexel University's College of
00:00:15.200 | Computing and Informatics in Philadelphia, Pennsylvania.
00:00:18.920 | Dr. Williams has a background in physics and math with degrees from the University of Vermont,
00:00:25.000 | and his research leverages a quantitative linguistic perspective that applies math and
00:00:31.120 | statistical methodologies to analyze and improve linguistic learning systems.
00:00:36.800 | Following a one-year postdoc appointment at UC Berkeley studying large-scale machine
00:00:41.760 | learning, in 2015 Dr. Williams became a data science faculty member
00:00:48.440 | at Drexel, where he led the founding of a DSMS program and develops and instructs
00:00:54.000 | data science coursework, including natural language processing with deep learning.
00:01:00.560 | So, welcome, and thank you for coming today for your talk, and you could do a quick introduction
00:01:06.320 | of yourself before you start.
00:01:08.320 | >> Great.
00:01:09.320 | Thanks so much.
00:01:10.320 | I got the mic here.
00:01:11.320 | Nice to see you all here.
00:01:12.600 | Thanks for coming out, and also for showing up online.
00:01:16.080 | It's a pleasure to be here.
00:01:19.760 | As was mentioned, my name is Jake, and my background's in math and physics, so the perspective
00:01:25.240 | that I'm coming from towards this work might be a little bit different than the standard,
00:01:29.400 | and that'll be a theme throughout the discussion.
00:01:32.800 | The purpose of this discussion is to go through a relatively long-term development, a project
00:01:40.840 | that I've been working on, and as mentioned, my background is in quantitative linguistics,
00:01:46.840 | which means my history of focus on language has primarily been to develop general theories
00:01:56.600 | and descriptions of phenomena that you observe with regards to linguistic units, whatever
00:02:01.560 | those might be.
00:02:04.120 | It's a statistical approach based on theories of language generation that are statistical
00:02:11.640 | in basis, and over the course of my time as a researcher, I've explored and ventured into
00:02:21.560 | language modeling itself and ultimately into neural networks as they approach language
00:02:26.080 | modeling themselves, and that's what brought me here through quite a bit of other work.
00:02:34.200 | So if you look into my profile, you'll see a lot of different subjects across applied
00:02:38.800 | NLP and, like I said, quantitative linguistics; neural networks were a natural transition
00:02:44.520 | for me into inferential work. So let's get started.
00:02:52.160 | So well, this is how we'll start the conversation today.
00:02:56.320 | It's not exactly how we got here in my lab.
00:02:58.920 | We came at this subject from a different approach, trying to think about layer initializations
00:03:07.600 | in neural networks, and this subject that we're discussing as a front for this talk
00:03:16.440 | is specifically focused on transformer architecture components, the self-attention component that's
00:03:21.480 | pivotal to the success of the transformer architecture, and it focuses on the fact that
00:03:28.120 | self-attention requires a quadratic comparison of vectors in order to produce the feature
00:03:32.200 | weights of those vectors needed to model long-range dependencies in text.
00:03:38.320 | Commonly, parameters for self-attention are based on transformation matrices, two usually,
00:03:43.640 | the queries and keys, that are responsible for dimensionalizing input vectors, and I describe
00:03:49.800 | it this way because generally speaking, when you're at the point of a self-attention layer,
00:03:54.640 | you already have low-dimensional vectors, but the parameters in a standard self-attention
00:04:00.080 | layer are changing the dimensionalities and the structure of that dimensional space.
00:04:04.520 | They are like an embedding layer, which is factorizing the embedding dimensions.
00:04:10.560 | This redimensionalization is the primary means by which self-attention creates feature weights.
00:04:15.520 | It really just computes similarity in that shared space.
00:04:21.400 | Large and similar inner products really just result in strongly weighted features, so it's
00:04:25.360 | up to that dimensionalization to produce good similarities for whatever purpose your prediction
00:04:31.520 | requires.
00:04:32.520 | However, an alternative strategy for feature weights might ask: given a basis, in other
00:04:38.000 | words, given that you're stuck with your low-dimensional vectors, what is the optimal
00:04:43.320 | matrix transformation to convert the comparisons of those vectors, the similarities
00:04:49.320 | that you are stuck with, into the best weights for features?
00:04:55.600 | In other words, treat this as a feed-forward layer that produces self-attention weights, as
00:05:00.240 | opposed to trying to transform to some basis that produces good feature weights.
00:05:06.320 | The use of this modified self-attention mechanism will be part and parcel of the substance
00:05:10.920 | of this talk.
00:05:15.960 | It's worth noting that this alternative mechanism is entirely compatible with the traditional
00:05:20.160 | dimensionalizing version of self-attention.
00:05:22.120 | In other words, you could still change the dimension and compute similarities and then
00:05:29.400 | convert that with a second feed-forward layer to produce optimal feature weights.
00:05:34.360 | This is not exclusive in any way.
00:05:36.600 | This is exploring how useful that alternative prediction of feature weights can function.
00:05:42.960 | However, we'll avoid the standard mechanism for two reasons.
00:05:47.200 | First, we have no solution to the standard parameters for self-attention as an initialization.
00:05:54.040 | And this will be discussed at length in slides to come.
00:05:58.280 | Likewise, it would create an additional model complexity that would muddle the effects of
00:06:03.160 | the modified form of self-attention that we wish to study.
00:06:05.640 | So having that dimensionalization as a way to produce good feature weights would confuse
00:06:12.120 | whether or not the feed-forward computation of feature weights is functioning well.
00:06:17.440 | There's a catch to this, however, which is that these vectors that we use for such a
00:06:24.200 | self-attention layer better be good.
00:06:26.240 | In other words, their comparisons must be consistent and meaningful in the first place.
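To make that contrast concrete, here is a minimal sketch of the two ways of producing attention weights just described. This is an illustration under my own assumptions (the shapes, names, and exactly where the weight matrix W sits), not code from the talk.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, d = 8, 16                  # block length, embedding dimension
X = np.random.rand(N, d)      # a block of already low-dimensional, non-negative vectors

# Standard self-attention: re-dimensionalize with query/key matrices,
# then compare the vectors in the shared projected space.
Wq, Wk = np.random.randn(d, d), np.random.randn(d, d)
A_standard = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))

# Modified form (as I read the talk): keep the raw comparisons X @ X.T and
# feed them forward through a single weight matrix W to predict the feature weights.
W = np.random.randn(N, N)
A_modified = softmax((X @ X.T) @ W)
```

Either way, each row is a softmax-normalized distribution over the block; the difference is whether the weights come from a learned re-dimensionalization or from a feed-forward transform of the raw similarities.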
00:06:34.080 | So to get it out of the way, here's an architectural diagram for the relatively simple near-shallow
00:06:39.880 | architecture pattern that we're using.
00:06:43.400 | It doesn't seem like there are many neurons in a network of this type.
00:06:46.320 | And that's because all of the activations are softmax, which means despite the fact
00:06:51.080 | that the U matrix, for example, is an entire layer, it's really just going through a single
00:06:56.920 | prediction non-linearity, the softmax function.
00:07:00.200 | So you can think about this as essentially a three-layer network that might be creating
00:07:04.000 | an encoder-decoder kind of design.
00:07:07.080 | Likewise, the difference in presentation here over self-attention, which is parameterized
00:07:12.240 | by the matrix W here, is intended to show how one vector -- whether you consider it the query
00:07:21.560 | or the key -- is the pivot for the comparison that will produce the feature weights,
00:07:29.360 | which is then fed forward in this model through W.
00:07:33.600 | This is the case for standard self-attention, too.
00:07:35.960 | In other words, you can reduce it to a by-prediction diagram in this way, where a gray vector,
00:07:42.360 | such as is depicted here, is that pivot.
00:07:47.160 | The attention distribution coming out of the W matrix and the softmax function is indicated
00:07:51.780 | by the vertical red bar there, which weights the block of vectors in black.
00:07:57.840 | That includes the pivot vector in gray, which is then passed through a feed-forward layer,
00:08:03.720 | often called the values of a standard self-attention matrix, U.
00:08:08.760 | Since we use U as a way to reduce the dimensionality of the prediction that
00:08:14.720 | we're trying to make, we then feed that forward through another layer and then to output.
00:08:22.200 | And that's essentially the relative shallowness that we're talking about here.
00:08:27.140 | U is a self-attention matrix, which means there's really only two layers in effect here.
00:08:34.420 | And the activation functions are strange.
00:08:36.660 | And you might wonder, for example, why we're using a different activation function, the
00:08:40.260 | softmax, instead of any of the dimensionally independent activation functions, like a logistic
00:08:45.780 | function or anything else.
00:08:48.360 | And that's because we have additional insight into the softmax function and the parameters
00:08:53.340 | that it optimizes, which is very useful.
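As a reading aid for the diagram, here is a hedged sketch of the forward pass as I understand it: attention weights from W, an attention-weighted mixture of the block, the values-like layer U, then the output layer O, all softmax-activated. It reuses the softmax helper from the sketch above; the pivot convention and shapes are assumptions.

```python
def near_shallow_forward(X, W, U, O):
    # X: (N, d) block of static input vectors; take the last row as the pivot.
    sims = X @ X[-1]           # (N,) comparisons of the block against the pivot vector
    a = softmax(sims @ W)      # (N,) attention distribution over the block
    context = a @ X            # (d,) weighted mixture of the block, pivot included
    h = softmax(U @ context)   # hidden state from the "values"-like layer U
    return softmax(O @ h)      # distribution over the output vocabulary
```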
00:08:58.580 | So let's talk about those vectors first, though, before we get to layer initialization.
00:09:07.700 | Optimizing the keys and queries of standard self-attention bears substantial similarity
00:09:11.540 | to token and word embedding.
00:09:14.460 | This is because the key and query matrices have a common dimension that they project
00:09:19.700 | to, much like you'd see with the factorization of an embedding layer on its own.
00:09:24.740 | Think Word2Vec, something like that.
00:09:29.500 | These normally-- there might be multiple self-attention heads.
00:09:33.620 | And because of the indeterminacy in creating a different dimensional space-- in other words,
00:09:38.780 | there are multiple equivalent reshufflings of those different dimensions which will produce
00:09:43.580 | the same output-- that indeterminacy is something that we hypothesize has bearing on what is
00:09:51.420 | now referred to as the lottery ticket hypothesis.
00:09:53.540 | In other words -- or this is the way that I would state it -- that multiple
00:09:59.980 | different embeddings which produce different vector spaces can be leveraged in parallel
00:10:04.340 | to create further robustness for the model.
00:10:07.340 | Or in the way that it's implemented, that if a random initialization doesn't do that
00:10:13.500 | well, you can eliminate it from the network.
00:10:16.180 | And that sub-network will do just as well, even after it's totally trained.
00:10:21.700 | In other words, having multiple clones, self-attention heads, which have no difference in the outputs
00:10:27.140 | that they're trying to predict, is at the root of the lottery ticket hypothesis.
00:10:32.380 | And ultimately, that invocation of the lottery ticket hypothesis is really a justification
00:10:37.100 | for eliminating parameters whose substantial cost of training is essentially wasted as
00:10:41.780 | a result of random parameter initialization.
00:10:45.700 | You might ask questions like, well, what is a good initialization?
00:10:49.100 | What is a good set of word embeddings to use?
00:10:55.780 | So how the lottery-ticket-hypothesis interaction effects of randomly initialized embedding
00:11:02.020 | layers can be avoided when constructing language models is another question that is embedded
00:11:07.820 | in this discussion.
00:11:12.860 | But we shouldn't say that dimensionality reduction isn't needed.
00:11:15.980 | It's incredibly necessary.
00:11:19.280 | For language modeling, you absolutely have to work with reduced dimension unless you're
00:11:23.660 | in a very small vocabulary.
00:11:25.500 | For example, like 26 Latin characters or something like that, like a Wav2Vec.
00:11:33.860 | The inherent input dimension of a large vocabulary model presents many computational intractabilities
00:11:39.940 | when designing NLP systems, something that you're probably all very aware of.
00:11:43.660 | Likewise, though, the distance from embedding layers to learning information, the loss at
00:11:50.500 | outputs, puts them in a challenging position to train.
00:11:53.980 | It's really hard to learn embedding layers because of the indeterminacy in the space
00:12:00.260 | that you're trying to learn.
00:12:01.460 | You could swap dimensions, and it's equivalent.
00:12:07.060 | But the distance means that they receive learning information last.
00:12:13.760 | This is a real challenge, and it's present in the history of NLP and deep learning, too.
00:12:21.620 | Vanishing gradient stuff.
00:12:24.380 | And this is exacerbated in the way that we have to actually learn embedding layers in
00:12:28.140 | standard models where we might modify learning rates to be lower all the way back at the
00:12:33.300 | bottom of a network to be gentle with those embedding layers and help them learn effectively.
00:12:40.620 | But this is really trouble because if we had a good embedding layer at the start, those
00:12:46.380 | subsequent layers could be much easier to learn.
00:12:53.420 | So ultimately, in order to approach this challenge, we came up with a discernibility hypothesis.
00:13:03.060 | In other words, this boiled down to the theory that low-dimensional vectors, more than anything,
00:13:09.260 | needed to be able to discern features.
00:13:11.900 | And that doesn't sound like a very strong assertion.
00:13:17.020 | And we started with a really, really, really low bar and assumed that the most common features
00:13:25.340 | needed to be the most discernible features.
00:13:27.480 | So if we're stuck with a lower dimension and we can't give everything a one-hot vector
00:13:31.140 | to be told apart very well, then we might want to give the clearer vectors, which
00:13:36.980 | have more dimensional independencies, to those features which appear most frequently and
00:13:44.060 | could stand to confuse models the most.
00:13:48.980 | This hypothesis led us directly to develop the bit cipher algorithm, which is really
00:13:54.500 | just a scheme for assigning vectors of zeros and ones.
00:13:59.300 | Nothing too crazy in terms of what we're attempting to do.
00:14:02.940 | In the figure at right here, the order of vector assignment is by row from top to bottom.
00:14:08.220 | And this is on a five-dimension, five-bit vector system.
00:14:14.140 | The first five from bottom are those one-hot vectors.
00:14:18.380 | Past that point, you'll see two-hot vectors, but they're a little bit less darkly shaded,
00:14:24.700 | indicating the way that we actually utilize the system.
00:14:26.740 | In other words, we normalize them to have unit sum.
00:14:34.020 | What I hope you can see from this is that the bit cipher algorithm generalizes one-hot
00:14:39.940 | vectors to low dimensions.
00:14:43.060 | And as a result, we can work from a very sparse feature set and explore dimensionalities as
00:14:50.220 | a controlled phenomenon.
00:14:54.020 | And this assignment is incredibly naive, too.
00:14:57.180 | That's the other thing that I want you to see as well, that this discernibility hypothesis
00:15:01.740 | does not create any meaningful correlations between tokens that behave similarly.
00:15:05.980 | So if you've got the upper and lower case of a word, their vectors aren't going to capture
00:15:11.620 | those similarities according to the bit cipher.
00:15:14.460 | It's really just gonna try and make sure that those features are distinguishable in a low-dimensional
00:15:19.020 | space and that the most distinguishable features are those which appear most commonly.
00:15:25.020 | This was enough to do a surprising amount of work.
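Here is a hedged sketch of that assignment scheme as described: enumerate d-dimensional 0/1 vectors from one-hot upward, normalize each to unit sum, and hand them out in descending frequency order. The published bit cipher may differ in its enumeration details.

```python
from itertools import combinations
import numpy as np

def bit_cipher_vectors(tokens_by_freq, d):
    """tokens_by_freq: tokens sorted most-frequent first; d: number of bits/dimensions."""
    vectors = []
    for n_on in range(1, d + 1):                # one-hot first, then two-hot, and so on
        for on_dims in combinations(range(d), n_on):
            v = np.zeros(d)
            v[list(on_dims)] = 1.0 / n_on       # normalize each vector to unit sum
            vectors.append(v)
    if len(tokens_by_freq) > len(vectors):
        raise ValueError("vocabulary too large for this dimension d")
    return dict(zip(tokens_by_freq, vectors))

# Example on the five-dimensional case from the figure.
cipher = bit_cipher_vectors(["the", "of", "and", "to", "a", "in", "is"], d=5)
```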
00:15:31.960 | So with some scheme for a deterministic low-dimensionalization procedure, we were then able to utilize this
00:15:41.960 | solution that we had actually developed previously.
00:15:45.500 | So this was actually the real motivator for a lot of the work that you're seeing today,
00:15:49.940 | although it might seem like it's just a checkpoint in the middle.
00:15:56.020 | Provided bit cipher produces decent embeddings, we can ask, can other layers be non-randomly
00:16:00.980 | initialized?
00:16:01.980 | In other words, without gradient descent or backpropagation or other gradient-based iterative
00:16:05.180 | algorithms.
00:16:07.740 | This equation came about from analysis of Word2Vec with the original Softmax activation
00:16:13.880 | function.
00:16:16.300 | And much like other analyses of the Word2Vec family of embeddings, it came up with differential
00:16:27.220 | solutions that depended on co-occurrence matrices.
00:16:31.020 | We formalized this as a question.
00:16:32.820 | Is there a way to take a co-occurrence matrix, F, in this equation here, and convert it with
00:16:41.300 | some weights, some denominators by row, into something that warms up a single-layer feedforward
00:16:51.020 | in a neural network?
00:16:53.380 | And ultimately, this k minus 1 over k term here, and this sum, is really just expressing
00:17:00.520 | something like conditional probability.
00:17:03.700 | Like conditional probability, because k minus 1 over k is a wrinkle that accounts for the
00:17:11.860 | number of features increasing, in other words, the context window increasing in a block transformer,
00:17:19.500 | and adjusts the warm start that we can apply to start off a neural network without any randomness,
00:17:26.540 | entirely determined by the vectors underneath, nearing whatever direction the parameters are going.
00:17:34.540 | All we have to do is compute some co-occurrences between inputs and outputs, and I don't mean
00:17:39.340 | necessarily standard co-occurrences that you might have learned about a long time ago which
00:17:43.820 | depend on a radius.
00:17:44.860 | I mean, whatever your inputs are, whatever your outputs are, you take their sum of outer
00:17:52.240 | products and you get a co-occurrence matrix of inputs and outputs, and that can then be
00:17:58.220 | utilized to initialize your layer in that neural network to be vastly more performant
00:18:06.100 | than what you'd get by a random initialization.
00:18:10.380 | This was a strong motivator for us.
00:18:13.900 | This was just for a single-layer model, but it depended on the softmax function for activation.
00:18:22.060 | And the softmax function as an activation function, we knew, is also necessary for self-attention
00:18:28.860 | features.
00:18:30.240 | And this meant that if we could put self-attention into some kind of a standard form with this
00:18:35.700 | equation just like a single layer, then we could apply the same solution with one catch.
00:18:43.940 | That catch is specifically that we don't know what the targets are for self-attention.
00:18:49.100 | There's no target vector y, the thing that you're trying to predict, which position is
00:18:55.340 | the one that you want to weight most strongly.
00:18:58.700 | And so in order to apply this solution for a self-attention model, we had to do some
00:19:03.260 | more analysis.
00:19:05.140 | And that's in the reference number one, which is all the way back up in the first slide
00:19:09.340 | if you want to see it.
00:19:11.120 | But that derives a differential criterion, an analog for the single-layer solution that
00:19:17.980 | tells us what the targets of that kind of self-attention actually are, the hidden targets,
00:19:23.460 | the weights that you're trying to create, which really are just about making sure that
00:19:28.700 | the layer above self-attention has some unsurprising things coming towards it.
00:19:36.300 | The self-attention layer is really just trying to massage the vectors so that way they look
00:19:40.140 | like something that the next layer above expects.
00:19:44.540 | Aside from that, though, it's a much more in-depth conversation.
00:19:48.620 | The point, though, is that for the model in this picture here, we can now start off with
00:20:00.100 | vectors x that are not random.
00:20:04.640 | We can use those vectors x to initialize non-randomly the parameters in W, the self-attention matrix,
00:20:13.780 | and then use that, going up the network, to initialize the parameters in U, since it's
00:20:19.460 | just a feed-forward layer with whatever self-attention is giving it as weights.
00:20:24.620 | And then whatever that produces, the hidden state, H, we can use that with the actual
00:20:29.700 | targets after the output layer to warm up the matrix O.
00:20:37.460 | And you might say, "Okay, well, how did you figure out what those hidden targets are?"
00:20:43.660 | You had to have an output for the U matrix to try and hit.
00:20:49.100 | That too is something that the bit cipher can provide in the form of label embeddings.
00:20:57.300 | In other words, low-dimensional targets of the thing that is downstream that you're trying
00:21:01.420 | to hit, the language model's output.
00:21:04.660 | So similarly, we can warm start the U matrix in terms of those bit cipher label embeddings.
00:21:16.260 | So in this view, the aim is to show how simple and general a single-layer softmax-activated
00:21:20.880 | solution is to apply.
00:21:22.880 | It's really just no more challenging than computing conditional probability given inputs
00:21:27.660 | and outputs.
00:21:30.420 | It's fast, it's something that you can distribute in terms of processing, and it's very, very
00:21:37.460 | general.
00:21:39.960 | So this is essentially the process that we're using in order to warm up the W and U matrix.
00:21:51.060 | There's the U matrix there, starts out as zeros.
00:21:54.980 | In other words, nothing, no random values, no weights anywhere.
00:22:00.460 | Over the data, which is just borrowing the dimension of this gigantic Y matrix that has
00:22:06.940 | all of the targets in it for the entire data set, we simply take the outer products
00:22:14.060 | of whatever the hidden state, the input to that layer, is with whatever the targets for that
00:22:18.740 | layer are, assuming that the lower layers beneath it are also warmed up.
00:22:25.740 | Following that, it's really just about normalization and a logarithmic transformation.
00:22:31.380 | And that logarithm really just emerges as a result of being an inverse to the exponential
00:22:36.860 | function, which is a part of softmax, pretty much all of softmax.
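A rough sketch of that procedure, under my own assumptions: accumulate input/target co-occurrences as a sum of outer products over the data, divide by per-row denominators, and take a logarithm (the inverse of softmax's exponential). The exact form, including the (k-1)/k wrinkle, is in the cited paper; this only conveys the flavor.

```python
import numpy as np

def warm_start(H, Y, eps=1e-12):
    """H: (n, d_in) non-negative inputs to the layer; Y: (n, d_out) targets for the layer."""
    F = np.zeros((H.shape[1], Y.shape[1]))
    for h, y in zip(H, Y):                        # co-occurrences of inputs with outputs
        F += np.outer(h, y)
    denom = F.sum(axis=1, keepdims=True) + eps    # the per-row denominators
    return np.log(F / denom + eps)                # roughly a log conditional probability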
00:22:43.540 | And that's really what brought us here.
00:22:48.140 | So what does warm starting a network do?
00:22:50.700 | This is going back to before we had the bit cipher algorithm for dimensionality reduction.
00:22:58.980 | And we started out by just saying, OK, if we take a simple, simple language model that
00:23:05.700 | only looks at a radius of traditional co-occurrences as features, we can concatenate those vectors
00:23:13.140 | and feed them forward for a language model's output.
00:23:17.380 | A completely random start, a cold start to a language model, is really just the size
00:23:24.620 | of the vocabulary in perplexity.
00:23:27.860 | And those three lines here for a few different radii are demonstrating that point with the
00:23:33.980 | point all the way at the top left-hand corner of this figure, cold starts.
00:23:41.380 | In any of those cases, when the warm start is applied, the perplexity is immediately
00:23:45.460 | automatically lower.
00:23:49.100 | And furthermore, the trajectories that the updates follow continue in the same learning
00:23:57.460 | rate and the same time to perform better than models that were started cold.
00:24:05.700 | If you have an early stopping criterion, the story is similar:
00:24:08.820 | early stopping generally engages first for the cold starts, and with a higher perplexity.
00:24:19.300 | So this was the first indication that we had figured out something that's very useful.
00:24:24.820 | There are some folks on Slido saying they're a bit confused.
00:24:31.780 | They're asking, are we talking about an alternative approach to self-attention?
00:24:36.060 | We are.
00:24:37.060 | So we're all the way back at slide one.
00:24:41.660 | And it is the premise of this whole conversation.
00:24:45.980 | So here, in this modified version of self-attention, you might normally expect to do a comparison
00:24:53.020 | of your inputs, the matrix X.
00:24:55.820 | Whatever your inputs are, they might be a whole block of vectors, or they might be--
00:24:59.660 | this is self-attention.
00:25:00.660 | It's not cross-attention, where you have different vectors that you're trying to attend.
00:25:06.500 | And forgetting about the values, which for us is the U matrix, the keys and queries,
00:25:15.700 | which are the parameters for self-attention, are in the middle.
00:25:18.340 | They're in between the two copies of the inputs, X.
00:25:25.580 | Each of those you can view as some kind of a projection down to a dimension where they
00:25:29.860 | can interact.
00:25:30.860 | And this is necessary for something like cross-attention, where you might have different dimensionalities
00:25:35.540 | like X1 and X2 in two separate groups of vectors if you're doing something like machine translation.
00:25:41.820 | That's not necessary to think about when you're just looking to do a standard language model
00:25:48.660 | that has to predict the next output according to the inputs, which are also outputs from
00:25:53.700 | previous iterations.
00:25:58.940 | Two insights here-- one, that multiplying the key and query matrices, WK and WQ, it's
00:26:08.740 | just another parameter matrix that's implied.
00:26:11.860 | There aren't two parameter matrices there in the middle for self-attention in any effective sense.
00:26:19.060 | There is a common dimension of comparison, and that kind of just moves stuff around.
00:26:24.580 | It creates degrees of freedom so that optimization can figure out what's the best weighting from
00:26:31.300 | comparisons.
00:26:32.300 | But the softmax function is strictly operating on similarities of that comparison space.
00:26:40.500 | It's not doing anything with those similarities.
00:26:43.620 | It's just softmaxing them.
00:26:44.860 | It's just activating them.
00:26:46.460 | So if it was a big similarity, it's a big attention value.
00:26:51.060 | In this equation, there's no transformation happening before those vectors are multiplied
00:26:56.580 | together, inner products.
00:26:58.980 | So those vectors better be good vectors that you're starting with-- x and x transpose,
00:27:03.740 | the same thing.
00:27:06.020 | They better be vectors that are comparable.
00:27:08.580 | They can't be vectors from cross-attention, where you're trying to translate from one
00:27:11.940 | language to another, and they just don't inner product.
00:27:13.940 | They're different dimensions.
00:27:15.740 | You could force it through if they were two differently trained embedding layers, and
00:27:19.740 | they had the same dimension with this mechanism.
00:27:23.140 | And if you didn't, you could put those key and query matrices back in between the two
00:27:27.500 | x vectors, x blocks of vectors.
00:27:33.580 | But a lot of what's going on here in this talk is trying to simplify and make more efficient
00:27:42.860 | the architectures that we need and the mechanisms that they utilize, given what we know about
00:27:49.940 | how language functions.
00:27:52.300 | And that's a critical piece there.
00:27:53.640 | We have assumptions that we can make.
00:27:55.520 | If all we're doing is autoregression, we don't need cross-attention dimensionalization in
00:28:00.260 | between.
00:28:01.260 | That'll be the theme, in other words, that can we use knowledge that we have about the
00:28:08.980 | way language functions to design better versions of architectures that meet the needs of language
00:28:16.300 | instead of being simply general.
00:28:18.460 | Is this good?
00:28:22.780 | This is important.
00:28:23.780 | So if there are any questions here, it's a good time.
00:28:30.980 | We are there and there.
00:28:36.980 | So we just talked briefly.
00:28:38.920 | This was for language modeling.
00:28:39.920 | The thing to note about this language model is that it's a really simple language model.
00:28:43.100 | There's no self-attention here yet.
00:28:46.420 | This is really just evaluating that a warm start in either the blue, green, or purple
00:28:51.320 | case does better than its partner, which is a cold start of the same architecture, same
00:28:57.660 | hyperparameters, orange, reddish, and brown.
00:29:04.740 | So three different models, regardless of how long your context is in each case here, we
00:29:10.160 | see that a model which has a nonrandom initialization by the equation presented two slides back
00:29:16.520 | from here starts a network off with a much lower perplexity.
00:29:26.920 | The requirement to apply this solution to a feedforward layer of parameters is simply
00:29:33.180 | that your inputs should not have negative values.
00:29:41.300 | That's really all we have to worry about.
00:29:44.420 | So it becomes really easy to ask questions like, well, what happens when you apply this
00:29:50.080 | to other data with non-negative values?
00:29:52.940 | Well, there's one little catch that we had to think about here in this case, and that
00:29:57.820 | is with the bit cipher or one-hot vectors, we're controlling the norms of the inputs.
00:30:04.260 | With standard embeddings, with MNIST, for example, when you're trying to predict the
00:30:10.420 | handwritten digits, 0 through 9 value, you don't get to assume necessarily that all inputs
00:30:18.660 | have the same norm.
00:30:21.580 | You can normalize the inputs, but it doesn't necessarily make sense to normalize them to
00:30:26.540 | one when you're looking at images, for example.
00:30:29.340 | They're non-negative.
00:30:30.340 | They have 0 through 255, for example, in MNIST.
00:30:35.440 | And as a result, we can put these data through that same warm start.
00:30:42.100 | Now one little caveat here I've alluded to about the norms of vectors is that we don't
00:30:51.020 | know what that value of k is.
00:30:53.060 | In other words, let me go back, you could look at it here or here, that's the number
00:31:02.660 | of features per prediction, which if you're looking at unit-normed word vectors is however
00:31:10.540 | big your context window is, k, because they all have unit norm and there's k of them.
00:31:17.860 | But if you're looking at just an image, it's not clear if it's a composition of multiple
00:31:23.060 | vectors, if it's one vector, and how many it is, if it is a composition.
00:31:28.340 | It just has a norm.
00:31:32.940 | In application to data like that, that is what k becomes, the average norm of an input.
00:31:41.900 | And I'm regretting not putting a graph in this, but the paper that discusses this shows
00:31:45.820 | that in the MNIST dataset, the exact optimal value of k is the average norm of the inputs
00:31:54.300 | however you've pre-processed them.
00:31:58.300 | And that's how we generally apply this rule when we're warm starting systems and we don't
00:32:02.420 | have unit-normed vectors.
00:32:04.940 | And it was learned from studying this model's application, this solution's application to
00:32:11.340 | non-linguistic data.
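A tiny illustration of that heuristic, with the caveat that which norm to use is my assumption (the unit-sum word vectors suggest a sum-style L1 norm): take k to be the average norm of the preprocessed inputs.

```python
import numpy as np

def estimate_k(inputs):
    """inputs: (n_examples, d) non-negative features, e.g. flattened MNIST images."""
    return np.abs(inputs).sum(axis=1).mean()   # the average norm plays the role of k
```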
00:32:13.180 | But as mentioned, the purpose was always towards language.
00:32:21.620 | So longer context windows in principle should provide models with more information than
00:32:26.820 | shorter context windows.
00:32:30.140 | This means one should expect that models perform better when context window length is longer,
00:32:37.940 | theoretically.
00:32:40.420 | And this is essentially the reason for why self-attention was initially developed.
00:32:45.180 | Researchers wanted to improve language models, and context windows providing more information
00:32:49.620 | were seen as the key to that.
00:32:51.380 | In other words, the more features, the more information, the more flexibility a model
00:32:57.380 | can have and expressivity.
00:33:00.900 | However, without feature weights, models didn't simply get better with long context windows,
00:33:06.300 | and feature weights and self-attention were hypothesized to be needed.
00:33:11.540 | And this was proven back in 2017 with the transformer architecture.
00:33:18.740 | In moving towards self-attention and the transformer, though, the primacy of the transformer architecture's
00:33:24.700 | block context model casts a shadow over the use of other context models.
00:33:32.260 | So for example, if I were to ask here, is it clear to everyone that the standard self-attention
00:33:41.020 | block model of context is different than the traditional notion of co-occurrences, which
00:33:46.660 | use a radius that is not positionally anchored?
00:33:50.620 | It is the context model, the positional anchoring of the block context model, that gives it
00:33:56.980 | its information.
00:33:59.540 | It is not, in all likelihood, anything else.
00:34:06.540 | Now what you do with that context model matters.
00:34:10.540 | You can't just take those vectors in a block, add them together, and expect a feedforward
00:34:14.460 | to do well.
00:34:15.460 | That's where self-attention is needed in order to figure out which vector needs the
00:34:19.300 | most weight.
00:34:23.540 | So what you'll also see in the architectures that are based on what I've already presented
00:34:29.260 | is that we're interested to explore how different models of context for language models can
00:34:34.740 | be integrated in general because they each provide different information.
00:34:41.440 | And we all know that the standard transformer's block model of context requires a ridiculous
00:34:46.300 | amount of information and data in order to become effectively trained.
00:34:53.080 | So the current state of contexts that we use, top there might be the standard transformer
00:35:01.900 | context that has a fixed positional block.
00:35:04.860 | And it takes the first 10 tokens, for example, the second 10 tokens, and the third 10 tokens,
00:35:10.860 | each in different blocks.
00:35:13.080 | Each of those is a group of contextualizing vectors.
00:35:18.220 | The second one there that you see with the r as a subscript is a radial model because
00:35:23.620 | those do different things.
00:35:24.980 | In other words, rather than assume you're looking at the first 10 or the nth 10 features,
00:35:30.820 | you pick a radius and you say, what are the last r features, the last r vectors?
00:35:36.980 | That can also have an attention distribution, a self-attention distribution, according to
00:35:41.140 | the exact same model that's being presented.
00:35:45.140 | It produces an entirely separate context in the state, whatever you want to call it, which
00:35:51.900 | can be conjoined with the block model to articulate features and be given to an output layer that
00:36:00.780 | knows what to do with them when each has different values.
00:36:06.940 | The concatenation of those different context models keeps the information separate so the
00:36:11.500 | output layer can decide which portion of the context is useful for the prediction.
00:36:18.820 | This last one is getting really traditional at the bottom.
00:36:22.540 | It's what I refer to as a document model.
00:36:26.300 | If you've ever implemented something like a Naive Bayes classifier or a term frequency
00:36:33.660 | inverse document frequency model, that's essentially what a document model is.
00:36:38.780 | Set up your vectors, you get something.
00:36:43.140 | Is it going to be the best for predicting the next token?
00:36:46.260 | Absolutely not.
00:36:47.260 | However, it's always different.
00:36:49.700 | What that means is that even if you wrap to the next block between the radial and the
00:36:55.060 | document models, you have a unique context vector, even if you're looking at the exact
00:36:59.860 | same block, because the document has grown and the radius just says, what are the last
00:37:04.660 | three?
00:37:05.660 | What are the last 10?
00:37:07.780 | As a result, when you incorporate different models of context, you don't really have to
00:37:12.380 | say that there's a finite context window.
00:37:14.300 | It might not be very good to make predictions past the first block, but that might be about
00:37:19.260 | how much data you've used, and it might be about the hyperparameters for each one of
00:37:24.340 | those models that you're applying, in other words, radius, the block size, like usual.
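Here is a hedged sketch of concatenating the three context models just described -- the positionally anchored block, a trailing radius, and a whole-document bag -- so the output layer can weigh each separately. For brevity it just sums vectors within each context rather than showing each context's own attention distribution, and the sizes are illustrative assumptions.

```python
import numpy as np

def combined_context(vectors, t, block_size=10, radius=3):
    """vectors: list of per-token vectors seen so far; t: current position (0-indexed)."""
    block_start = (t // block_size) * block_size
    block = np.sum(vectors[block_start:t + 1], axis=0)              # the current block so far
    radial = np.sum(vectors[max(0, t - radius + 1):t + 1], axis=0)  # just the last r tokens
    document = np.sum(vectors[:t + 1], axis=0)                      # the whole document so far
    return np.concatenate([block, radial, document])
```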
00:37:33.220 | So far, the only embeddings that I've suggested are from this bit cipher algorithm, and as
00:37:38.540 | I've expressed, they don't capture any useful similarities between similar tokens.
00:37:45.140 | The bit cipher algorithm doesn't care if you're looking at the uppercase or the lowercase
00:37:49.860 | version of a word.
00:37:50.860 | It doesn't see them as bearing any similarity, even though they might be used very similarly.
00:37:57.300 | So how can you utilize the bit cipher to create vectors for tokens that have meaningful similarities
00:38:07.660 | between words that are used similarly?
00:38:12.060 | And this is just backing off to the traditional methods once again, taking co-occurrences
00:38:19.460 | of BitCypher vectors with whatever's there at the middle or center of a co-occurrence
00:38:25.780 | model.
00:38:27.980 | Normally, if you think about one-hot vectors, a co-occurrence matrix is really just the
00:38:34.380 | same thing, except now we just have smaller vectors with different dimensions on, so to
00:38:41.420 | speak.
00:38:44.580 | And we normalize after concatenating these blocks of different radii from the bit cipher
00:38:52.660 | to match the original input requirements that we discovered for the warm start solution.
00:39:00.620 | And that enables us to use these just like we would the original BitCypher vectors, except
00:39:06.900 | now, just from the usual co-occurrence statistics, you'll see that capital word and lowercase
00:39:14.580 | word have a lot of common usage.
00:39:17.980 | And you know this works because you've seen co-occurrences for a very long time, and while
00:39:23.740 | they might not normally be useful in our applications these days with deep learning, they can be
00:39:29.740 | imparted through the bit cipher algorithm to prescribed vectors as well.
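One hedged way to picture building such vectors: for each radius, accumulate the bit cipher vectors of each center token's neighbors, concatenate across radii, and renormalize to unit sum to match the warm-start input requirements. The radii and the counting details here are assumptions for illustration.

```python
import numpy as np

def cooccurrence_embeddings(tokens, cipher, radii=(1, 5, 25)):
    """tokens: a corpus as a list of token strings; cipher: token -> bit cipher vector."""
    d = len(next(iter(cipher.values())))
    emb = {tok: np.zeros(d * len(radii)) for tok in set(tokens)}
    for i, center in enumerate(tokens):
        for k, r in enumerate(radii):
            for j in range(max(0, i - r), min(len(tokens), i + r + 1)):
                if j != i:                       # neighbors within radius r of the center
                    emb[center][k * d:(k + 1) * d] += cipher[tokens[j]]
    for tok, v in emb.items():
        if v.sum() > 0:
            emb[tok] = v / v.sum()               # unit sum, like the original cipher vectors
    return emb
```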
00:39:41.920 | So here's where things start paying off in terms of speed and efficiency.
00:39:51.380 | If you only have one layer of self-attention, then that means that you don't need to worry
00:39:57.780 | about whatever weird expressive stuff is happening where, you know, similar inputs might have
00:40:03.540 | slightly different hidden states.
00:40:07.620 | Since that first layer is just a set of static word embeddings, the self-attention layer
00:40:14.580 | is working off of static word embeddings.
00:40:18.500 | And that means each pair of words have a fixed comparison given static word embeddings.
00:40:26.220 | And that means if you want to compute the quadratic features of self-attention, you
00:40:31.460 | can just pre-compute them and pull them from memory.
00:40:36.540 | This caching of vector comparisons is essentially reducing the self-attention layer's cost from
00:40:43.260 | quadratic to linear, since those values that we're using to weight the vectors for the
00:40:49.980 | feedforward layer no longer require comparison across the block.
00:40:55.660 | They're already compared.
00:40:58.220 | So when our vectors are static, which is at inference time, and if we're not learning
00:41:05.980 | the embedding layer's parameters with iterative differential updates, then not only do we
00:41:14.060 | have to not track gradients for the embedding layer, but we don't even have to compute the
00:41:19.140 | vector comparisons.
00:41:20.140 | We can pre-compute them and just load them, which is much, much faster.
00:41:31.900 | So we can reduce all of the inference costs and some of the training costs -- not all of
00:41:38.620 | the training costs, because if we want to update those vectors, then we can't
00:41:42.340 | assume cached comparisons.
00:41:45.540 | But it's a huge cost savings.
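A minimal sketch of the caching idea: with static embeddings, each pair of token types has a fixed inner product, so the comparisons a block needs can be looked up rather than recomputed. The table layout is an assumption, and for large vocabularies you would only cache the pairs you actually need.

```python
import numpy as np

def build_similarity_cache(E):
    """E: (V, d) static token embeddings -> (V, V) table of pairwise inner products."""
    return E @ E.T

def cached_attention_weights(cache, block_ids, pivot_id, W):
    sims = cache[pivot_id, block_ids]     # (N,) looked up; no d-dimensional dot products
    n = len(block_ids)
    z = sims @ W[:n, :n]                  # feed forward through (a subset of) W
    e = np.exp(z - z.max())
    return e / e.sum()                    # softmax -> attention weights over the block
```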
00:41:48.460 | This means that we can train these self-attentive feedforward unit models very quickly and with
00:41:54.540 | good initializations.
00:41:57.460 | But there are some other things that we immediately observed while developing these models, and
00:42:01.820 | that is the lack of randomization produced models which were quite effective even on
00:42:09.300 | small data.
00:42:10.300 | Now, it doesn't mean that training on small data will let you generalize to everything
00:42:14.260 | else that's out there in the world.
00:42:15.500 | In other words, training on a small data set might produce a model which has a surprisingly
00:42:20.100 | low perplexity on its training set, but it doesn't mean that you're going to be able
00:42:24.820 | to generalize and have a language model that's talking well from just hearing a couple of
00:42:28.020 | thousand tokens.
00:42:29.940 | It does mean it will know that couple of thousand tokens very well, very quickly.
00:42:38.700 | But there's a challenge with using self-attention still, and that is the fact that the block
00:42:45.220 | model of context often is not fully utilized, since many documents are shorter than long
00:42:55.220 | context windows.
00:42:59.620 | And these days, there are exceptionally long context windows.
00:43:03.140 | I'm not even talking about those.
00:43:05.500 | Many of the language modeling benchmarks simply don't even go out to a thousand words when
00:43:09.060 | it comes to context, and you're looking at a document to predict.
00:43:15.020 | So this has been a problem for a while, and it means that if you're going to pad your
00:43:22.680 | short documents, you're going to waste a lot of prediction on those paddings.
00:43:27.740 | A lot of computation gets lost just for null information, essentially.
00:43:35.300 | And the way that this is often relieved in some groups, and to great effect, is by packing
00:43:42.740 | long contexts.
00:43:44.740 | So for example, if you've got a hundred thousand token context window, most documents will
00:43:49.220 | not be a hundred thousand tokens long.
00:43:51.660 | What do you do with the rest of that long context if you want to use a thousand tokens
00:43:55.740 | of good training data?
00:43:58.020 | You fill out the other ninety-nine thousand tokens with a bunch of other random documents
00:44:02.100 | that don't belong anywhere near the first one.
00:44:04.400 | That's called packing.
00:44:07.640 | Packing can be utilized without having different documents impact each other, without
00:44:14.880 | contaminating the information between documents, and that takes a lot of work, but it can be
00:44:19.640 | done.
00:44:22.160 | However, there are different strategies that we could employ, different engineering tricks
00:44:28.760 | that we could employ, to make our operation of self-attention more effective at any length
00:44:36.760 | of document without having to deal with this packing problem.
00:44:41.040 | And that comes about by dynamically changing the context length from some maximum value --
00:44:49.640 | which is what you would normally set -- to just the context that you have.
00:44:55.280 | But you still have to create batches if you want to train models quickly, and what that
00:44:58.840 | means is that there's still some padding if you use this approach.
00:45:03.040 | But you can pad those short documents to set lengths, batch short documents together, batch
00:45:13.680 | long documents together.
00:45:17.200 | This means that we don't need to pack documents together to make use of a long context window.
00:45:26.440 | When a document is long, you can let its context be long.
00:45:28.960 | When a document is short, you can put it with other short documents and just use a subset
00:45:32.880 | of those self-attention parameters.
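A hedged sketch of that batching strategy: rather than packing unrelated documents into one long context, group documents of similar length, pad only within a bucket, and use the corresponding subset of the self-attention weight matrix. The bucket boundaries here are an illustrative assumption.

```python
def bucket_by_length(documents, boundaries=(32, 128, 512, 2048)):
    """documents: lists of token ids; returns {bucket_length: [documents padded to that length]}."""
    buckets = {b: [] for b in boundaries}
    for doc in documents:
        for b in boundaries:
            if len(doc) <= b:
                buckets[b].append(doc + [0] * (b - len(doc)))       # pad only within the bucket
                break
        else:
            buckets[boundaries[-1]].append(doc[:boundaries[-1]])    # truncate overlong documents
    return buckets
```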
00:45:36.560 | And with traditional self-attention parameters, keys and queries, it would never be a subset
00:45:40.480 | because it's a low dimensionalization that that matrix provides.
00:45:44.800 | With this modified self-attention, though, there's a different shape to the weight matrix,
00:45:49.120 | and that's why it's a subset of those parameters that we have to utilize, and that might be
00:45:53.600 | something worth discussing afterwards.
00:45:55.320 | In other words, how does the difference in shapes of dimensionalities between this and
00:46:00.360 | the standard self-attention weights shake out?
00:46:08.000 | But we want to get to a different point for the sake of this conversation.
00:46:15.320 | What is a model like this useful for?
00:46:17.720 | That should be a question that you're asking.
00:46:19.400 | It's a question that we've been asking.
00:46:26.040 | We're not entirely certain yet how an extremely large model like this will function on trillions
00:46:34.920 | of tokens, for example.
00:46:35.920 | In other words, can you expect the same kinds of outcomes -- a ChatGPT kind of thing,
00:46:44.120 | with human interaction and RLHF and all the rest of that -- from some of these models?
00:46:50.720 | It's something that we're considering, but we're also considering different scales, too, since these
00:46:57.760 | models are performant on their own at smaller scales as well -- but for what?
00:47:05.760 | So the point is that, from what we've stress tested into the billions of tokens, models can be trained
00:47:13.120 | very quickly on a relatively small GPU, in ways that we expect when we cache vector comparisons,
00:47:20.240 | we see really big speedups.
00:47:22.160 | When we don't cache those comparisons, you see all of the growth in computation time
00:47:29.280 | that you would expect from longer context windows.
00:47:35.820 | This one here, though, we're trying to make it really, really, really small, the one called
00:47:40.240 | potato.
00:47:42.060 | That's because we want to see if we can train a model from scratch, since on very little
00:47:47.280 | data, these models can fit effectively with the initializations that we've developed.
00:47:55.880 | And with the purpose of starting from scratch, starting with no data, we're thinking about
00:48:01.040 | edge computing cases where we could deploy a language model with a microphone so that
00:48:05.880 | a person can talk to it and just train it from their own data, train it from their own
00:48:10.720 | speech, to understand their speech.
00:48:19.480 | So between these, we've explored a lot of different configurations, trying to consider
00:48:25.040 | similarities to what some standard configurations might look like, a couple thousand tokens
00:48:29.240 | in a context window, for example, to look something like a GPT-2 style model.
00:48:34.660 | Thinking about bit cipher embeddings that are 500 dimensional or 1,000 dimensional to
00:48:39.560 | be something like a GPT-2, that's, again, pointing towards the big/large category of
00:48:46.200 | models that we've experimented with.
00:48:49.400 | Beyond that, we haven't really touched those scales, because our first objective is not
00:48:55.440 | to make big, big language models and train chatbots.
00:48:59.020 | We want to know, what can we do with a small model, since this is a relatively unique capability?
00:49:09.740 | So what does training look like?
00:49:12.140 | To the best of our ability so far, it's kind of hard to see, but the first step is that
00:49:20.220 | warm start, where you train the bit cipher, and you take a couple of splits of data, and
00:49:27.940 | you compute that warm start for the self-attention layer and the feedforward layers.
00:49:34.140 | In this case, we're really just using a 100-million-token data set from the BabyLM (baby language
00:49:40.180 | model) Challenge, which has as an objective to see what language models can do on a relatively
00:49:49.800 | human scale of data.
00:49:51.220 | In other words, 100 million tokens is something that a person might hear in 10 years of their
00:49:56.380 | life.
00:49:58.100 | In 10 years of life, people become pretty proficient speakers, and can a language model
00:50:04.100 | be trained at that scale?
00:50:07.900 | The second stage, after the warm start happens, is where the majority of training time occurs,
00:50:14.620 | and yet is also where training operates the most quickly.
00:50:22.260 | At this stage, we find that freezing vectors is important.
00:50:26.060 | One, because it means that we can train much quicker.
00:50:29.120 | So we can have the subsequent layers optimized beyond their warm starts very, very fast,
00:50:35.620 | using that vector caching, the vector comparison caching, to avoid the quadratic costs of self-attention.
00:50:43.260 | This articulates the parameters in the middle layers of the model, taking 100 million
00:50:50.180 | tokens and making five passes over the data here a lot quicker than any of the other stages.
00:50:57.980 | The comparison that you'd make to this is the training time once those embedding layers
00:51:03.460 | are unfrozen, where everything slows down to the normal speeds, where you have to do
00:51:08.580 | all of your vector comparisons on the fly, since you can't assume that the same comparisons
00:51:14.060 | will always result in the same numbers, since model parameters might be updated.
00:51:22.380 | This is the best procedure that we've figured out so far.
00:51:25.160 | And in order to make those vectors update, we find that learning rates have to be adjusted
00:51:30.160 | dynamically inside of the network, like normal, and that the embedding layers are really tough
00:51:36.140 | to make progress on.
00:51:40.380 | And you'll notice here in this picture that the slowness and the lack of stability, for
00:51:45.660 | example, in learning the embedding layer once it had been prescribed earlier, makes it really
00:51:51.220 | hard to train over the entire data set compared to five passes, for example, in the middle
00:51:57.460 | phase when the middle and upper parameters are being updated, still with backpropagation.
00:52:03.940 | And the other thing that I would highlight before leaving this slide is, in phase one,
00:52:11.060 | how the warm start saturates pretty quickly.
00:52:15.540 | So if you have 100 million tokens, you really only need to apply the warm start to something
00:52:19.820 | like maybe 10 million tokens, not that much more.
00:52:23.060 | You don't see that much gain from that much more data.
00:52:27.820 | That's not a bad thing, because it means that we don't have to apply that process for any
00:52:32.660 | longer.
00:52:34.500 | It would be great if it gave us all of the optimization that we could hope for, but it's
00:52:38.820 | not something that we could necessarily expect, since it's just an approximation of where
00:52:43.340 | the parameters are headed.
00:52:48.960 | So on the back of an envelope, thinking about the system that example was trained on
00:52:54.860 | as compared to other examples that are out there, and thinking about models that are
00:53:00.900 | kind of sort of similar size: we're talking about a 12-gigabyte GPU, a relatively small
00:53:08.180 | single chip, specifically when referring to these training times.
00:53:14.740 | So that's a 12-gigabyte GPU.
00:53:20.020 | Compare that to setups working off of eight chips, each having roughly four times the scale, and
00:53:26.060 | to the time that it took them to train something maybe an additional order of magnitude larger,
00:53:31.820 | although we have trained models up to around 50 million parameters, too, which is getting
00:53:35.940 | towards GPT-2 scale.
00:53:39.820 | We see training times suggesting that, if we scaled up to the relatively large systems that set
00:53:45.300 | the expectation for how much work a model that large should take, we could
00:53:50.220 | expect to be able to train much faster.
00:53:53.500 | But as mentioned, the initial objective here is not to simply figure out how well we can
00:53:59.500 | do something that's being done well already.
00:54:02.180 | It's to figure out what these alternative strategies are useful for, since they give
00:54:06.900 | us access to different regimes of model scale as effective.
00:54:15.860 | So as mentioned, we've gone to relatively large amounts of data.
00:54:22.380 | I wouldn't really call them big data at this time, even though just a couple of years ago
00:54:26.500 | a billion tokens would be a relatively large amount of data.
00:54:30.860 | It's really just a stress test at this point, gives us something like, do we continue to
00:54:35.860 | see models getting better as we continue to give them more data?
00:54:39.460 | Do we continue to see models getting better as we continue to give them longer context
00:54:43.780 | windows?
00:54:44.780 | And the answer to both of those questions is absolutely yes.
00:54:47.540 | So nothing is telling us that we can't train bigger models with these.
00:54:51.540 | But will those bigger models be as good as a standard self-attention model?
00:54:55.300 | I don't know.
00:54:56.300 | It's a different self-attention parameter matrix than what you see in a standard self-attention
00:55:00.100 | model.
00:55:01.420 | You could integrate the two.
00:55:03.300 | And in theory, that should be overkill, because you'd have more parameters and more power
00:55:08.700 | through them.
00:55:10.220 | And we can see from this work that the alternative self-attention parameters are reasonably effective.
00:55:18.460 | We're getting close to time.
00:55:21.180 | So I'll go quick through these, since this is the work that we're approaching right now.
00:55:27.740 | And this is the idea that we're seeing as a use case for such a model like this.
00:55:34.860 | In other words, no pre-training.
00:55:38.540 | Just training on the target data, whatever the data of interaction are.
00:55:43.420 | And in this example, you'll see that this relatively smaller precision language model
00:55:47.900 | just needs to predict whether or not a light should go on or off.
00:55:51.860 | A lamp that listens with a microphone and a switch.
00:55:56.000 | And you can use that switch to train the lamp.
00:56:02.460 | So that's the goal here.
00:56:04.800 | Can pre-training be eliminated?
00:56:07.180 | And we want to anticipate whether or not you're going to flip the light on or off.
00:56:12.520 | That's the task that we're going to try and approach.
00:56:16.420 | Or that, rather, we're currently approaching.
00:56:19.340 | There's a few different processes that integrate into this approach.
00:56:23.820 | There has to be a microphone that's listening to you, recording audio.
00:56:28.020 | There has to be a transcription algorithm.
00:56:29.700 | And we use Wav2Vec at this point, because there's a very small version of it that's
00:56:33.340 | character-based.
00:56:34.340 | And as a result, it doesn't even require you to use consistent-- well, it does require you to
00:56:38.540 | use consistent language.
00:56:40.260 | But it doesn't even require you to use words, since it's strictly phonetic.
00:56:45.740 | There has to be-- and this is the bread and butter of what's going on here-- a process
00:56:51.300 | which anticipates what you want.
00:56:54.620 | And that process is responsible for creating good training data.
00:56:58.100 | So this is a smart data collection algorithm that figures out, when you flip the switch,
00:57:04.820 | is that the target for something that you just said?
00:57:08.460 | Is that the transfer learning objectives from text that was transcribed, that it anticipates
00:57:17.700 | you want.
00:57:20.340 | Following this, there's also two other processes.
00:57:23.060 | One which operates on a different time cycle, and that's training.
00:57:28.180 | So always train a model.
00:57:29.940 | Always be training a model, whenever there's new data, is essentially what that fourth
00:57:35.900 | process says.
00:57:37.620 | And the last one is operation.
00:57:39.300 | In other words, if you flip the switch, there has to be a process which operates the light
00:57:43.100 | bulb.
00:57:44.100 | The device always has to work as a lamp in order to be useful.
00:57:45.940 | It always has to be able to just be a switch.
00:57:51.380 | However, that operation process likewise needs to see a directive from the anticipator.
00:57:57.780 | If the language model predicts that you just said a thing, that means you want there to
00:58:01.900 | be light, that operator then needs to receive the signal from the anticipator and execute
00:58:10.180 | the directive.
00:58:11.660 | If, though, the user changes the switch back within some time scale after the model
00:58:18.980 | made a bad prediction, the operator is also responsible for issuing a correction
00:58:24.920 | to the anticipator to fix the training data.
00:58:30.220 | What this looks like as a process is in this diagram here.
00:58:34.900 | And you can see the flow here from stage one, a verbal command maybe gets recorded, transcribed,
00:58:43.740 | turned into text.
00:58:45.240 | And if there's no model that's yet trained, that text is just stored as data along with
00:58:50.060 | any directives given by the user in the form of a light switch going on or off.
00:58:57.180 | Once there's any data, the learning process says, okay, time to train a language model
00:59:03.940 | and integrate it with these targets.
00:59:07.180 | Once a model is done training, it's sent over to the anticipator, which is responsible for
00:59:12.940 | using the language model.
00:59:15.620 | That small language model then is now empowered to make predictions every single time it receives
00:59:22.700 | a text command.
00:59:25.380 | And those predictions are sent to the operator, which then does whatever it's told.
00:59:31.580 | And the last thing that can happen, step six, is if the wrong prediction was made and the
00:59:36.360 | user fixes it by turning off the light because they didn't want the light on, that corrects
00:59:43.460 | the data that was transcribed and the next model which is trained will be able to avoid
00:59:47.000 | that problem.
00:59:48.820 | And there's some dialing this in in terms of the time scales that you want based on
00:59:54.020 | the way humans interact with the light switch.
00:59:55.740 | So there's a lot of development that goes into figuring out the right way to set this up.
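A minimal sketch of how these cooperating processes might be wired together, in Python; every helper here (record_audio, transcribe, train, predict, operate) is a hypothetical stand-in for illustration, not the system's actual code.

```python
# Sketch of the lamp's cooperating processes: listener, transcriber,
# anticipator, trainer, and operator. All helpers are hypothetical stand-ins.
from collections import deque

examples = deque()   # (transcribed_text, label) pairs, label in {"on", "off", "none"}
model = None         # the most recently trained tiny language model

def record_audio():              # stand-in for the microphone process
    return b""

def transcribe(audio):           # stand-in for the character-based Wav2Vec step
    return "turn on the lite please"

def train(data):                 # stand-in for the always-be-training process
    return {"trained_on": len(data)}

def predict(model, text):        # stand-in for the anticipator's language model
    return "on" if "on" in text else "none"

def operate(directive):          # the operator: actually flips the bulb
    if directive in ("on", "off"):
        print(f"lamp -> {directive}")

def step(switch_event=None):
    """One cycle: transcribe, predict/operate, record a training example, retrain."""
    global model
    text = transcribe(record_audio())

    # Operator: act on the anticipator's prediction, if a model exists yet.
    if model is not None:
        operate(predict(model, text))

    # Anticipator: pair the utterance with the observed switch flip (or "none").
    # A flip shortly after a wrong prediction doubles as a correction label.
    examples.append((text, switch_event or "none"))

    # Trainer: retrain whenever there is new data (on its own time cycle in practice).
    model = train(list(examples))

step()                   # user speaks, no switch flip
step(switch_event="on")  # user speaks and flips the switch on
```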
01:00:01.340 | The data that you collect from a process like this, how do we organize it?
01:00:06.940 | This actually is not transfer learning, so I kind of lied there a little bit.
01:00:11.040 | This is strictly language modeling.
01:00:14.140 | It's a conversation between the human and the lamp.
01:00:17.100 | You say something, the lamp says, here's what you want.
01:00:23.060 | And it's just an extending context window, like you'd see with a decoder-only kind of
01:00:29.100 | architecture these days, a chatbot kind of thing, a personal assistant, a human-assistant
01:00:36.420 | dialogue.
01:00:39.300 | And you might also suspect then that, well, couldn't you let the lamp talk?
01:00:43.700 | Yes, you could absolutely let it use other tokens, and that is something which is on
01:00:47.540 | the horizon for us, in other words, how to determine when the model has learned enough
01:00:52.900 | and knows when you want to hear it talk and knows what you want to hear it say, which
01:00:58.980 | requires other smart data collection currently in development.
01:01:04.780 | And there are three tags here, if you can't see them, although what they really are is tokens,
01:01:08.900 | since they're integrated within the language model's vocabulary.
01:01:13.540 | I want the lamp lit, I want the lamp dark, or nothing, if no switch is applied during
01:01:19.220 | transcription.
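As a rough illustration of how those tags can live inside the same stream as the transcribed characters; the tag strings and helper below are made up for the example.

```python
# Hypothetical serialization of the human/lamp dialogue as one growing token stream.
# The three control tags live in the same vocabulary as the transcribed characters.
LIT, DARK, NONE = "<lamp_lit>", "<lamp_dark>", "<none>"

history = []

def add_turn(transcribed_text, switch_state):
    """Append the user's utterance followed by the observed switch tag."""
    history.extend(list(transcribed_text))   # character-level tokens from transcription
    history.append({"on": LIT, "off": DARK, None: NONE}[switch_state])

add_turn("turn on the lite", "on")   # misspellings are fine: transcription is phonetic
add_turn("thats enough", "off")
print(history[-3:])                  # ['g', 'h', '<lamp_dark>']
```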
01:01:23.700 | So what do the models look like that go into a lamp?
01:01:27.580 | They're a little bit smaller than that micro model in terms of having a long context window,
01:01:32.860 | B, the block size.
01:01:34.780 | They still use those other features, like a radius, which help them to do well with
01:01:40.940 | only a little data, like those other context models.
01:01:46.020 | And the embedding size is around 50 or 100-something, and this is small enough to
01:01:53.380 | fit on a microprocessor, on the CPU of a microprocessor, including training, with no GPU whatsoever.
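For concreteness, a configuration in this ballpark might look like the following; the field names and exact values are illustrative only, not the settings actually used in the talk.

```python
from dataclasses import dataclass

@dataclass
class LampModelConfig:
    # Illustrative hyperparameters for a CPU-only, on-device model (not the real settings).
    vocab_size: int = 64     # character-level vocabulary plus the three control tags
    block_size: int = 512    # B: a long context window relative to the model's size
    embed_dim: int = 64      # roughly the "50 or 100-something" range mentioned
    radius: int = 8          # locality feature that helps with very little data
    device: str = "cpu"      # trains and runs on a microprocessor, no GPU

print(LampModelConfig())
```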
01:02:04.220 | And the first time we ever got the interaction right, the right timescales, we went from no data
01:02:09.860 | whatsoever to creating this data, and 20 minutes of it was enough, and you can see there are
01:02:18.340 | loads of misspellings here, because the transcription is not required to produce known words, known
01:02:23.500 | tokens.
01:02:25.080 | It's strictly character-based, so you can say whatever you want to say, you can whistle,
01:02:28.820 | and as long as Wav2Vec thinks those are tokens, it'll figure out what to transcribe.
01:02:39.820 | That's enough, 20 minutes of talking to it, to have it know pretty well when you want
01:02:45.420 | the light on.
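A sketch of what character-level transcription with a pre-trained Wav2Vec 2.0 CTC model can look like, using the Hugging Face transformers API; the checkpoint name is just an example and may differ from what the lamp actually runs.

```python
# Sketch: character-level CTC transcription with a pre-trained Wav2Vec 2.0 model.
# The checkpoint is an example; the on-device system may use a smaller/different one.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(waveform, sampling_rate=16_000):
    """Greedy CTC decode: returns characters, with no guarantee of real words."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

# e.g. transcribe(one_second_of_audio) -> "TURN ON THE LITE" (spelling not guaranteed)
```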
01:02:47.740 | This is what the numbers look like for that prediction, and you see lots of zeros there.
01:02:51.100 | That's because there are no positive instances yet in the data; until you flip the switch,
01:02:56.780 | there's nothing to predict.
01:02:59.620 | Once there is enough to predict, we see an immediate jump in the model's ability to figure
01:03:05.000 | out whether the lamp should say on, off, or nothing.
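Those zeros are exactly what a per-class precision/recall computation produces when a class has no positive instances yet; a quick illustration with scikit-learn, where zero_division=0 reports the undefined scores as zeros (purely illustrative, not the talk's evaluation code).

```python
from sklearn.metrics import precision_recall_fscore_support

# Before the switch is ever flipped, every label is the "none" class,
# so the "on" and "off" classes have no positives and their scores are all zero.
y_true = ["none"] * 10
y_pred = ["none"] * 10
print(precision_recall_fscore_support(
    y_true, y_pred, labels=["on", "off", "none"], zero_division=0))
```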
01:03:11.540 | And while we trained this first model, for example, in 20 minutes on Le Potato, which
01:03:19.940 | is a really, really, really small microprocessor, it's incredibly frustrating to use, because
01:03:26.620 | the processing time is a couple of seconds, and it feels like the data is going somewhere, even though
01:03:31.420 | it's entirely localized: there's no Wi-Fi, there's no internet connection.
01:03:36.140 | It just takes the model on this tiny chip a minute, not really a minute, more like a couple of
01:03:40.900 | seconds, to flip the switch on, because it has to transcribe the command, interpret it, issue
01:03:46.100 | the directive, and ask the operator to operate the light.
01:03:49.060 | And so part of what we're doing is figuring out at what scale of microprocessing do the
01:03:53.220 | models that we're developing really make a good real-time system that a user can make
01:03:57.980 | use of well.
01:04:00.560 | And as you can see, the larger the model in terms of hyperparameters and so forth, the
01:04:07.020 | more performant it gets.
01:04:11.800 | So we see these as potentially useful in edge scenarios, but not just for operation, for
01:04:17.940 | training.
01:04:19.060 | So go to Home Depot, buy a light switch, install it in your house, and start talking to it.
01:04:28.040 | But this isn't really the stopping point that we want to get to.
01:04:34.820 | We want to eventually get to the point of talkback.
01:04:36.900 | We want to treat these as language models that essentially have a bit of you inside
01:04:40.580 | of them that you can converse with.
01:04:44.380 | And for that, it's important to know when the model is aware of what you want to hear said.
01:04:50.900 | In other words, it needs to know what is a good thing to say back to what you just said.
01:04:55.700 | And the lamp has never heard a lamp talk before.
01:04:58.860 | So there are challenges to figuring out the lamp's role in conversation.
01:05:04.940 | And choosing a lamp, though, is arbitrary.
01:05:08.660 | We don't have to make it be a light bulb which goes on and off.
01:05:10.900 | This could be a controller for anything which is a binary switch.
01:05:15.660 | And you could imagine, like others are looking at right now, there's a lot of opportunities
01:05:20.820 | with predicting the action on your phone that you want to take, which thing you want to
01:05:24.940 | push.
01:05:26.860 | And with a system like this, micro-sizing it onto your cell phone, for example, assumes better
01:05:33.220 | hardware than what we're already using, but would be entirely localized, including training.
01:05:44.140 | But this is also really just getting to the point of feasibility.
01:05:49.580 | It's not getting to the point of a well-optimized system, which we're still developing.
01:05:54.460 | There are, in principle, different modifications that we could make to the self-attention layers,
01:05:59.100 | which include traditional self-attention parameters.
01:06:02.220 | That's just one example.
01:06:04.820 | Then there are updates to the very naive scheme that we have for BitCipher, the vectors that
01:06:10.020 | we're using to initialize our models.
01:06:13.260 | And a lot of other minutiae that need to be addressed.
01:06:19.500 | So this isn't really work that's done.
01:06:21.740 | It's a work in progress.
01:06:23.300 | And in addition to what I just described, we're moving towards larger models and evaluations
01:06:29.600 | that compare better to modern systems, which will eventually come online.
01:06:34.980 | We'll most likely participate in this year's BabyLM (baby language model) challenge, although that
01:06:40.780 | challenge assumes you're working with a standard architecture, which is already developed for
01:06:45.660 | all of the evaluative needs.
01:06:48.140 | So there's a lot of work to do.
01:06:51.020 | But that's really all I have prepared for you to discuss today in this conversation.
01:06:54.300 | I've gone over a lot of details, and if you'd like to talk about any of these, I'm certainly
01:06:58.820 | happy to.
01:07:00.560 | Questions that you might have as well.
01:07:01.900 | And if you have access to the slides, there's some links to the different papers I've referenced.
01:07:09.580 | That's all for today.
01:07:10.580 | Thanks.
01:07:11.580 | [APPLAUSE]
01:07:12.580 | Hey, so thanks, Jake, for the great talk.
01:07:19.380 | And now we'll have some time for questions.
01:07:21.660 | So if anybody here has any questions, feel free to raise your hand and ask.
01:07:25.420 | Otherwise, we'll go to some questions on Slido.
01:07:36.180 | Some folks are asking about the slides.
01:07:37.180 | So we'll be posting the slides later.
01:07:40.100 | But I've also pasted these references in the Zoom chat, as well as Discord, in case anybody
01:07:46.420 | wants to see them.
01:07:48.340 | I was wondering, in the plots that you showed for warm start versus cold start, does the
01:07:59.100 | cold start use the modified self-attention or the standard self-attention?
01:08:05.780 | Sure.
01:08:07.180 | So the question was, in this picture, comparing warm starts to cold starts, what self-attention
01:08:14.340 | was used here?
01:08:15.340 | None.
01:08:16.340 | This is strictly a feed-forward experiment, where we take a single layer, and all we do
01:08:20.660 | is feed forward with one-hot vectors from some context window and concatenate them together.
01:08:28.420 | And the general property that you'll see is, by concatenating vectors, there's very little
01:08:33.980 | for attention to do.
01:08:36.460 | Simply with a block, you're adding the vectors together, and that superposition of the dimensions
01:08:41.860 | smears them.
01:08:43.500 | And that's why self-attention is needed, in order to weight that superposition so just
01:08:47.940 | the right ones stick out and it's not muddled.
01:08:51.320 | If those vectors are instead concatenated, a weighting of those is really just appealing
01:08:56.060 | to the sensibilities of the matrix above.
01:09:01.020 | When they're superimposed, there's a lot to work on, since you're smearing separate information
01:09:06.260 | together.
01:09:07.500 | When the information is already separated, there's not that much that re-weighting can do.
01:09:14.020 | And in this case, there's absolutely no re-weighting going on.
01:09:18.180 | And what I've described to you is really just something that's become very clear from a
01:09:24.060 | lot of small-scale experiments in between the models that we've developed.
01:09:29.420 | And moving towards self-attention took additional time, and we didn't have a solution for that
01:09:35.380 | layer yet when this work was done.
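To illustrate the distinction being drawn, a toy NumPy example: summing one-hot context vectors superimposes (smears) them, while concatenating keeps each position's information separate, leaving little for a re-weighting step to add. Entirely illustrative, not code from the experiments.

```python
import numpy as np

vocab, block = 5, 3
tokens = [0, 2, 4]                                 # a tiny context window
one_hots = np.eye(vocab)[tokens]                   # shape (block, vocab)

superposed   = one_hots.sum(axis=0)                # (vocab,): tokens smeared together
concatenated = one_hots.reshape(-1)                # (block*vocab,): tokens kept separate

print(superposed)     # [1. 0. 1. 0. 1.]  -> needs a self-attention-style re-weighting
print(concatenated)   # position info is explicit, so re-weighting has little to add
```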
01:09:38.180 | I had a question in regards to-- so you're doing this with on-edge controllers, right?
01:09:47.500 | What?
01:09:48.500 | You're doing this with on-edge controllers, right?
01:09:49.500 | You're doing training for on-edge controllers?
01:09:50.500 | So this could be for IoT devices, right?
01:09:51.500 | Could be.
01:09:52.500 | And you talked about how this also could work for image data, right?
01:09:53.500 | Oh, I saw that.
01:09:54.500 | Yeah.
01:09:55.500 | Have you conducted any tests with image data?
01:09:56.500 | Like, with these small-scale models?
01:09:57.500 | Yeah.
01:09:58.500 | So image data works best on architectures that aren't just feed-forward.
01:10:16.020 | They have, for example, convolutional bits and pieces that are useful to them.
01:10:21.140 | And that means if we want to apply some kind of a warm start for, for example, a convolutional
01:10:27.220 | layer to create a performant image classifier or something that's working with images, we'd
01:10:31.700 | want to develop an initialization for that layer, too.
01:10:35.660 | It has weirder activation functions, which means we need to branch out from softmax as
01:10:39.660 | an activation function.
01:10:41.740 | But convolution is surprisingly similar to a radial model.
01:10:47.260 | It's really just saying what's near where I'm trying to create a feature.
01:10:51.980 | So I would say, yes, it seems like it's something that we could do.
01:10:55.540 | But currently, it's in the phase of future work where it fits in one bullet here at the
01:11:08.260 | bottom.
01:11:09.260 | Different layer types need formal derivation for warm starts.
01:11:13.420 | So if we wanted to do this kind of a thing with performant architecture, we would be
01:11:18.100 | probably uniforming or randomly initializing some of those parameters that we don't have
01:11:22.140 | warm starts for yet.
01:11:23.980 | And as a result, we would receive a lot of noise in where things are
01:11:28.340 | going.
01:11:29.540 | And if we started to utilize other activation functions, even just a logistic
01:11:34.300 | activation, a logistic activation is not really fundamentally different from a softmax activation.
01:11:39.420 | So you might say, for example, well, why can't you just apply that to a logistic function,
01:11:43.220 | like a two-dimensional softmax?
01:11:46.660 | And the reason is that if we treat it like a standard logistic, then each dimension
01:11:50.140 | is independent.
01:11:51.900 | Each dimension is trying to predict the same thing.
01:11:54.800 | And there's a lot more questions about how you can get different information out of different
01:11:58.440 | dimensions.
01:11:59.860 | So it's a question that's really worth spending time on, in my opinion, separately.
01:12:05.820 | And it's not the first question that makes a lot of what we've developed practical.
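The sense in which a logistic unit is "like a two-dimensional softmax" is just the identity sigmoid(z) = softmax([z, 0])[0]; a quick numeric check (illustrative, not from the talk's materials).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

z = 1.7
# A logistic unit is a two-class softmax over the logits [z, 0]:
print(sigmoid(z), softmax(np.array([z, 0.0]))[0])   # identical values
```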
01:12:10.860 | On one of the slides, you had a dialogue with your user.
01:12:21.500 | I'm wondering, does that imply there is a speech-to-text system inside the microprocessor?
01:12:28.540 | Yeah.
01:12:29.540 | So audio goes in.
01:12:31.300 | And there's a process here which accepts that audio.
01:12:35.340 | And it utilizes a pre-trained Wav2Vec.
01:12:39.100 | It's really just fitting a need with a pre-trained model.
01:12:42.020 | That's what we're doing right now.
01:12:44.420 | Although transcription is something that we would like to move into in our future work
01:12:49.100 | for the purposes of training from scratch, because one of the real benefits of a system
01:12:53.700 | like this is that it doesn't come with any biases from other people's data, aside from
01:12:59.420 | the fact that there's a pre-trained transcription system, which means that it's pre-trained
01:13:04.060 | towards whatever phonetics were within the language that was there for pre-training in
01:13:09.340 | the Wav2Vec algorithm.
01:13:11.460 | So there is external utility here coming from a pre-trained model.
01:13:17.600 | But the text itself and the language model that we're presenting is only working from
01:13:23.280 | what gets transcribed.
01:13:24.280 | I have a follow-up on my previous question.
01:13:35.320 | You said that the feed-forward warm start is independent of the choice of self-attention.
01:13:42.040 | Does that mean that the warm start strategy can be used for any network that uses a feed-forward
01:13:49.160 | layer, not just PLMs, but any LLM or any other network?
01:13:54.600 | Yeah.
01:13:55.600 | So that's going back to the warm start solution here.
01:14:01.880 | And what it says is that in terms of any layer beneath, if you assume that those layers'
01:14:07.320 | parameters are what they are, you're not going to update them.
01:14:12.680 | And assuming that you know what the targets for that layer are, which for middle layers,
01:14:16.560 | there's some questions to be answered, then this initialization will do better than random
01:14:23.940 | for a softmax output.
01:14:26.600 | That's really important at this stage, that there's a softmax as a part of the activation.
01:14:32.080 | If there's not, then more math, basically.
01:14:39.000 | But the point that should become clear is that whatever type of prediction scenario
01:14:48.040 | you're in, as long as you have non-negative features and a softmax for activation, like
01:14:56.720 | in this case with a single layer, or even two softmax layers, whatever that's doing,
01:15:02.220 | on MNIST, you can get a really good initialization.
01:15:07.120 | Doesn't have to be linguistic data.
01:15:13.080 | This can be mixed data, too.
01:15:14.480 | You could do an image caption generation system that has both features from images and text
01:15:20.120 | and warms them up with the same solution with entirely different data in two places.
01:15:25.760 | Could you point out which part of the process requires the values to be non-negative?
01:15:33.720 | Yeah.
01:15:34.720 | What happens when you put a negative in a logarithm?
01:15:42.740 | Not saying you can't, but it's not going to start making probabilities for you at the
01:15:46.840 | other end of the softmax any time fast.
01:15:50.400 | So you have to start with a different premise, essentially.
01:15:55.140 | And that premise is something that requires more derivation.
01:16:00.920 | You'd want to ensure, if you're going to use a logarithm anywhere, or assume that inverse,
01:16:05.600 | that you're able to properly modify every parameter independently, instead of full rows
01:16:13.640 | of parameters.
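To make the non-negativity point concrete, here is a naive-Bayes-style, log-based initialization for a softmax layer; this is a sketch of the general idea, not necessarily the warm start derived in this work, and the assertion shows exactly where negative features would break the logarithm.

```python
import numpy as np

def log_count_warm_start(X, y, n_classes, eps=1e-8):
    """Sketch of a log-based softmax warm start: W[c, j] ~ log of how much
    (non-negative) feature j co-occurs with class c. A negative feature value
    would put a negative number inside the logarithm and break this premise."""
    assert np.all(X >= 0), "log-based warm starts assume non-negative features"
    W = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        W[c] = np.log(X[y == c].sum(axis=0) + eps)
    return W

X = np.abs(np.random.randn(100, 20))         # e.g. counts / one-hot sums are non-negative
y = np.random.randint(0, 3, size=100)
W = log_count_warm_start(X, y, n_classes=3)  # candidate initial weights for a softmax layer
```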
01:16:16.480 | I think we should get to a couple of questions on Slido that folks asked.
01:16:24.880 | The first is, what's the difference in performance between naive assignment and optimized or
01:16:30.280 | omniscient assignment for packing tokens into bit vectors, and any experimental results?
01:16:36.860 | What's the difference in performance between naive assignment and optimized assignment
01:16:44.560 | for packing tokens into bit vectors?
01:16:53.740 | The performance differences are going to be in speed.
01:16:56.720 | The systems which utilize packing for contexts have gone to great lengths to make sure that
01:17:02.400 | information from different portions of the context that have nothing to do with each
01:17:06.640 | other don't bleed information, if you're going to pack them together.
01:17:11.760 | That creates a lot of logistical challenges in terms of defining models.
01:17:16.920 | And it's still just doing the regular self-attention thing.
01:17:19.520 | So it's quadratic.
01:17:20.520 | So if you have the same length of context window, it's going to be the same computational
01:17:23.920 | cost.
01:17:25.440 | However, if you pack all of your small documents together, they don't need the whole context
01:17:34.040 | window worth of quadratic comparisons.
01:17:38.960 | And that's why you pack something into the empty end.
01:17:41.720 | I guess it should be over here.
01:17:46.040 | But document packing isn't exactly standard, even though it's well known as a mechanism to make training
01:17:55.040 | much more efficient.
01:17:56.040 | In other words, you need fewer batches if more documents are packed together.
01:18:00.920 | It's not something which is, for example, entirely accepted as a published
01:18:08.440 | form of preprocessing.
01:18:11.140 | So what I would say is just document packing is not a correct model of context.
01:18:16.440 | It is an efficiency, but requires the same level of quadratic comparison.
01:18:21.920 | Whereas dynamically batching and utilizing a block size that is dynamic preserves the
01:18:28.520 | model of context.
01:18:30.240 | It does something that is true to the objective and unwavering in that.
01:18:35.040 | And it reduces the complexity for smaller documents.
01:18:39.360 | But a direct comparison of the two is something I have not done, because it would require
01:18:44.160 | having that oracle and utilizing those algorithms.
01:18:46.720 | And where are they used?
01:18:48.040 | They're used with insanely big models, which means we would likewise have to compare two
01:18:52.880 | insanely big models to create the same level of expectation that people have from packing.
01:18:58.120 | So that's in the future.
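For illustration, dynamic batching in the sense described here can look like grouping documents into length buckets so each batch uses a block size near its longest member instead of the full context window; the bucketing scheme and names below are illustrative only.

```python
from collections import defaultdict

def dynamic_batches(documents, batch_size=8, bucket_width=32):
    """Group documents of similar length so each batch can use a block size
    close to its longest member, rather than the full context window."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[len(doc) // bucket_width].append(doc)
    for bucket in buckets.values():
        for i in range(0, len(bucket), batch_size):
            batch = bucket[i:i + batch_size]
            block_size = max(len(d) for d in batch)   # dynamic B for this batch
            yield batch, block_size

docs = ["short doc", "a somewhat longer document " * 3, "tiny", "medium length text here"]
for batch, B in dynamic_batches(docs, batch_size=2):
    print(B, [len(d) for d in batch])
```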
01:19:00.400 | Great.
01:19:01.400 | Thanks for your detailed response.
01:19:02.400 | We have a question quickly that's asking, are there any implementations of SAFU available
01:19:07.920 | that one could experiment with?
01:19:10.600 | Well, once we publish, there will be.
01:19:15.400 | But that requires a lot of work on developing systems for evaluation, since the evaluation
01:19:20.420 | systems rely upon standardized functions within the architectures that you're all very familiar
01:19:26.280 | with, like GPT-2, that are easily taken for granted.
01:19:29.680 | Even though you do lots of work in training them, you have to do a lot of work in creating
01:19:34.200 | those functions that meet the needs of the separate prediction tasks and fine-tuning
01:19:38.040 | that evaluations perform.
01:19:39.840 | All right, great.
01:19:42.160 | Makes sense.
01:19:43.160 | I think we're pretty much out of time.
01:19:45.320 | So thanks, Jake, for the great talk, and thanks for coming to another lecture.
01:19:50.560 | Thank you.
01:19:51.560 | [END]