Deep Dive into LLMs like ChatGPT

Chapters
0:00 introduction
1:00 pretraining data (internet)
7:47 tokenization
14:27 neural network I/O
20:11 neural network internals
26:01 inference
31:09 GPT-2: training and inference
42:52 Llama 3.1 base model inference
59:23 pretraining to post-training
61:06 post-training data (conversations)
80:32 hallucinations, tool use, knowledge/working memory
101:46 knowledge of self
106:56 models need tokens to think
121:11 tokenization revisited: models struggle with spelling
124:53 jagged intelligence
127:28 supervised finetuning to reinforcement learning
134:42 reinforcement learning
147:47 DeepSeek-R1
162:07 AlphaGo
168:26 reinforcement learning from human feedback (RLHF)
189:39 preview of things to come
195:15 keeping track of LLMs
198:34 where to find LLMs
201:46 grand summary
00:00:00.000 |
Hi everyone, so I've wanted to make this video for a while. It is a comprehensive 00:00:05.000 |
but general audience introduction to large language models like ChatGPT and 00:00:10.320 |
what I'm hoping to achieve in this video is to give you kind of mental models for 00:00:14.640 |
thinking through what it is that this tool is. It's obviously magical and 00:00:19.680 |
amazing in some respects. It's really good at some things, not very good at 00:00:23.880 |
other things, and there's also a lot of sharp edges to be aware of. So what is 00:00:27.960 |
behind this text box? You can put anything in there and press enter, but 00:00:31.800 |
what should we be putting there and what are these words generated back? How does 00:00:36.720 |
this work and what are you talking to exactly? So I'm hoping to get at all 00:00:40.400 |
those topics in this video. We're gonna go through the entire pipeline of how 00:00:44.000 |
this stuff is built, but I'm going to keep everything sort of accessible to a 00:00:48.280 |
general audience. So let's take a look at first how you build something like 00:00:51.840 |
ChatGPT and along the way I'm gonna talk about, you know, some of the sort of 00:00:56.800 |
cognitive psychological implications of these tools. Okay so let's build ChatGPT. 00:01:02.160 |
So there's going to be multiple stages arranged sequentially. The first stage is 00:01:06.840 |
called the pre-training stage and the first step of the pre-training stage is 00:01:11.120 |
to download and process the internet. Now to get a sense of what this roughly 00:01:14.600 |
looks like, I recommend looking at this URL here. So this company called Hugging 00:01:20.800 |
Face collected and created and curated this dataset called FineWeb and they go 00:01:27.760 |
into a lot of detail in this blog post on how they constructed the FineWeb 00:01:31.120 |
dataset and all of the major LLM providers like OpenAI, Anthropic and 00:01:35.200 |
Google and so on will have some equivalent internally of something like 00:01:39.320 |
the FineWeb dataset. So roughly what are we trying to achieve here? We're trying 00:01:43.080 |
to get a ton of text from the internet, from publicly available sources. So we're 00:01:47.640 |
trying to have a huge quantity of very high quality documents and we also want 00:01:52.680 |
very large diversity of documents because we want to have a lot of 00:01:55.800 |
knowledge inside these models. So we want large diversity of high quality 00:02:00.340 |
documents and we want many many of them. And achieving this is quite complicated 00:02:04.520 |
and as you can see here it takes multiple stages to do well. So let's take 00:02:08.800 |
a look at what some of these stages look like in a bit. For now I'd just like 00:02:12.240 |
to note that, for example, the FineWeb dataset, which is fairly 00:02:15.000 |
representative of what you would see in a production-grade application, actually 00:02:18.920 |
ends up being only about 44 terabytes of disk space. You can get a USB stick for 00:02:24.320 |
like a terabyte very easily or I think this could fit on a single hard drive 00:02:27.800 |
almost today. So this is not a huge amount of data at the end of the day 00:02:31.960 |
even though the internet is very very large we're working with text and we're 00:02:35.680 |
also filtering it aggressively so we end up with about 44 terabytes in this 00:02:39.400 |
example. So let's take a look at kind of what this data looks like and what some 00:02:44.960 |
of these stages also are. So the starting point for a lot of these efforts and 00:02:48.960 |
something that contributes most of the data by the end of it is data from 00:02:53.160 |
Common Crawl. So Common Crawl is an organization that has been basically 00:02:57.400 |
scouring the internet since 2007. So as of 2024 for example Common Crawl has 00:03:03.040 |
indexed 2.7 billion web pages and they have all these crawlers going around the 00:03:09.080 |
internet and what you end up doing basically is you start with a few seed 00:03:12.000 |
web pages and then you follow all the links and you just keep following links 00:03:15.720 |
and you keep indexing all the information and you end up with a ton of 00:03:18.000 |
data of the internet over time. So this is usually the starting point for a lot 00:03:22.360 |
of these efforts. Now this Common Crawl data is quite raw and is 00:03:27.480 |
filtered in many many different ways. So here they document - this is the 00:03:33.240 |
same diagram - they document a little bit the kind of processing that happens in 00:03:36.640 |
these stages. So the first thing here is something called URL filtering. So what 00:03:43.000 |
that is referring to is that there's these block lists of basically URLs that 00:03:49.840 |
are domains that you don't want to be getting data from. So usually this 00:03:54.280 |
includes things like malware websites, spam websites, marketing websites, racist 00:03:59.320 |
websites, adult sites and things like that. So there's a ton of different types 00:04:02.880 |
of websites that are just eliminated at this stage because we don't want 00:04:06.600 |
them in our data set. The second part is text extraction. You have to remember 00:04:11.160 |
that all these web pages - this is the raw HTML of these web pages that are being 00:04:14.880 |
saved by these crawlers. So when I go to inspect here, this is what the raw HTML 00:04:21.080 |
actually looks like. You'll notice that it's got all this markup like lists and 00:04:26.280 |
stuff like that and there's CSS and all this kind of stuff. So this is computer 00:04:31.280 |
code almost for these web pages but what we really want is we just want this text 00:04:35.480 |
right? We just want the text of this web page and we don't want the navigation 00:04:38.920 |
and things like that. So there's a lot of filtering and processing and heuristics 00:04:42.720 |
that go into adequately filtering for just the good content of these web pages. 00:04:48.400 |
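To make the URL-filtering and text-extraction steps a bit more concrete, here is a toy Python sketch. This is not FineWeb's actual pipeline; the blocklist entries and helper names are made up, and real extractors use far more sophisticated heuristics, but the shape is the same: decide whether to keep the page at all, then strip everything except the readable text.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist; production pipelines use curated lists with many domains.
BLOCKED_DOMAINS = {"spam.example", "malware.example", "adult.example"}

def url_allowed(url: str) -> bool:
    """URL filtering: drop pages whose domain is on the blocklist."""
    return urlparse(url).netloc.lower() not in BLOCKED_DOMAINS

class TextExtractor(HTMLParser):
    """Text extraction: keep visible text, drop markup, scripts and styles."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

if __name__ == "__main__":
    page = "<html><head><style>p{color:red}</style></head><body><p>Tornado news from 2012.</p></body></html>"
    if url_allowed("https://news.example/tornadoes-2012"):
        print(extract_text(page))  # -> "Tornado news from 2012."
```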
The next stage here is language filtering. So for example, FineWeb 00:04:53.160 |
filters using a language classifier. They try to guess what language every single 00:04:58.880 |
web page is in and then they only keep web pages that have more than 65% 00:05:02.680 |
English, as an example. And so you can get a sense that this is like a design 00:05:06.840 |
decision that different companies can take for themselves. What fraction of 00:05:12.640 |
all different types of languages are we going to include in our data set? Because 00:05:15.920 |
for example, if we filter out all of the Spanish as an example, then you might 00:05:19.360 |
imagine that our model later will not be very good at Spanish because it's just 00:05:22.480 |
never seen that much data of that language. And so different companies can 00:05:26.440 |
focus on multilingual performance to a different degree as an example. So 00:05:30.880 |
FineWeb is quite focused on English and so their language model, if they end up 00:05:35.000 |
training one later, will be very good at English but not maybe very good at other 00:05:38.560 |
languages. After language filtering, there's a few other filtering steps and 00:05:43.400 |
deduplication and things like that. Finishing with, for example, the PII 00:05:48.440 |
removal. This is personally identifiable information. So as an example, addresses, 00:05:54.320 |
social security numbers, and things like that. You would try to detect them and 00:05:57.600 |
you would try to filter out those kinds of webpages from the data set as well. So 00:06:01.280 |
there's a lot of stages here and I won't go into full detail but it is a fairly 00:06:05.440 |
extensive part of the pre-processing and you end up with, for example, the FineWeb 00:06:09.440 |
data set. So when you click in on it, you can see some examples here of what this 00:06:13.800 |
actually ends up looking like and anyone can download this on the Hugging Face 00:06:17.920 |
web page. And so here's some examples of the final text that ends up in the 00:06:21.920 |
training set. So this is some article about tornadoes in 2012. So there's some 00:06:29.400 |
tornadoes in 2012 and what happened. This next one is something about... 00:06:36.480 |
"Did you know you have two little yellow 9-volt battery-sized adrenal glands in 00:06:41.120 |
your body?" Okay, so this is some kind of an odd medical article. So just think of 00:06:48.600 |
these as basically web pages on the Internet filtered just for the text in 00:06:53.000 |
various ways. And now we have a ton of text, 44 terabytes of it, and that now is 00:06:58.600 |
the starting point for the next step of this stage. Now I wanted to give you an 00:07:02.640 |
intuitive sense of where we are right now. So I took the first 200 web pages 00:07:06.520 |
here, and remember we have tons of them, and I just take all that text and I just 00:07:11.040 |
put it all together, concatenate it. And so this is what we end up with. We just 00:07:14.960 |
get this just raw text, raw internet text, and there's a ton of it even in these 00:07:21.120 |
200 web pages. So I can continue zooming out here, and we just have this like 00:07:24.960 |
massive tapestry of text data. And this text data has all these patterns, and 00:07:30.080 |
what we want to do now is we want to start training neural networks on this 00:07:33.320 |
data so the neural networks can internalize and model how this text 00:07:39.120 |
flows, right? So we just have this giant texture of text, and now we want to get 00:07:44.920 |
neural nets that mimic it. Okay, now before we plug text into neural networks, 00:07:50.920 |
we have to decide how we're going to represent this text, and how we're going 00:07:54.720 |
to feed it in. Now the way our technology works for these neural nets is that 00:07:58.880 |
they expect a one-dimensional sequence of symbols, and they want a finite set of 00:08:04.960 |
symbols that are possible. And so we have to decide what are the symbols, and then 00:08:10.080 |
we have to represent our data as a one-dimensional sequence of those 00:08:13.360 |
symbols. So right now what we have is a one-dimensional sequence of text. It 00:08:18.480 |
starts here, and it goes here, and then it comes here, etc. So this is a 00:08:22.200 |
one-dimensional sequence, even though on my monitor of course it's laid out in a 00:08:25.960 |
two-dimensional way, but it goes from left to right and top to bottom, right? So 00:08:29.760 |
it's a one-dimensional sequence of text. Now this being computers, of course, 00:08:33.560 |
there's an underlying representation here. So if I do what's called UTF-8 00:08:37.720 |
encode this text, then I can get the raw bits that correspond to this text in the 00:08:44.080 |
computer. And that looks like this. So it turns out that, for 00:08:50.260 |
example, this very first bar here is the first eight bits as an example. So what 00:08:57.320 |
is this thing, right? This is a representation that we are looking for, 00:09:02.040 |
in a certain sense. We have exactly two possible symbols, 0 and 1, and we 00:09:07.880 |
have a very long sequence of it, right? Now as it turns out, this sequence length 00:09:14.800 |
is actually going to be a very finite and precious resource in our neural 00:09:19.640 |
network, and we actually don't want extremely long sequences of just two 00:09:23.320 |
symbols. Instead what we want is to trade off the size of this 00:09:31.320 |
vocabulary, as we call it, against the resulting sequence length. So we don't 00:09:35.440 |
want just two symbols and extremely long sequences. We're going to want more 00:09:39.320 |
symbols and shorter sequences. Okay, so one naive way of compressing or 00:09:44.680 |
decreasing the length of our sequence here is to basically consider some group 00:09:49.800 |
of consecutive bits, for example 8 bits, and group them into a single what's 00:09:55.160 |
called byte. So because these bits are either on or off, if we take a group of 00:10:00.320 |
eight of them, there turns out to be only 256 possible combinations of how these 00:10:04.520 |
bits could be on or off. And so therefore we can re-represent the sequence into a 00:10:09.080 |
sequence of bytes instead. So this sequence of bytes will be 8 times 00:10:14.200 |
shorter, but now we have 256 possible symbols. So every number here goes from 00:10:19.680 |
0 to 255. Now I really encourage you to think of these not as numbers, but as 00:10:24.360 |
unique IDs, or like unique symbols. So maybe it's better 00:10:29.960 |
to actually think of these by replacing every one of them with a unique emoji. 00:10:33.200 |
You'd get something like this. So we basically have a sequence of emojis, and 00:10:38.840 |
there's 256 possible emojis. You can think of it that way. Now it turns out 00:10:44.560 |
that in production, for state-of-the-art language models, you actually want to go 00:10:48.040 |
even beyond this. You want to continue to shrink the length of the sequence, because 00:10:52.720 |
again it is a precious resource, in return for more symbols in your 00:10:57.120 |
vocabulary. And the way this is done is by running what's called the byte 00:11:02.120 |
pair encoding algorithm. And the way this works is we're basically looking for 00:11:06.000 |
consecutive bytes, or symbols, that are very common. So for example, it turns out 00:11:13.760 |
that the sequence 116 followed by 32 is quite common and occurs very frequently. 00:11:18.360 |
So what we're going to do is we're going to group this pair into a new symbol. So 00:11:25.520 |
we're going to mint a symbol with an ID 256, and we're going to rewrite every 00:11:29.640 |
single pair, 116, 32, with this new symbol. And then we can iterate this 00:11:34.920 |
algorithm as many times as we wish. And each time when we mint a new symbol, 00:11:39.000 |
we're decreasing the length and we're increasing the symbol size. And in 00:11:43.440 |
practice, it turns out that a pretty good setting of, basically, the vocabulary 00:11:48.560 |
size turns out to be about 100,000 possible symbols. So in particular, GPT-4 00:11:53.400 |
uses 100,277 symbols. And this process of converting from raw text into these 00:12:05.760 |
symbols, or as we call them, tokens, is the process called tokenization. So let's 00:12:11.720 |
now take a look at how GPT-4 performs tokenization, converting from text to 00:12:16.480 |
tokens, and from tokens back to text, and what this actually looks like. So one 00:12:20.700 |
website I like to use to explore these token representations is called 00:12:25.160 |
Tiktokenizer. And so come here to the drop-down and select cl100k_base, which 00:12:30.240 |
is the GPT-4 base model tokenizer. And here on the left, you can put in text, and 00:12:35.040 |
it shows you the tokenization of that text. So for example, "hello world". 00:12:43.880 |
So "hello world" turns out to be exactly two tokens. The token "hello", which is the 00:12:49.720 |
token with ID 15339, and the token "space world", that is the token 1917. So "hello 00:13:00.440 |
space world". Now if I was to join these two, for example, I'm gonna get again two 00:13:05.680 |
tokens, but it's the token "h" followed by the token "hello world", without the "h". If I 00:13:13.600 |
put in two spaces here between "hello" and "world", it's again a 00:13:16.640 |
different tokenization. There's a new token "220" here. Okay, so you can play 00:13:23.760 |
with this and see what happens here. Also keep in mind this is case 00:13:28.100 |
sensitive, so if this is a capital "H", it is something else. Or if it's "HELLO WORLD", 00:13:33.840 |
then this actually ends up being three tokens instead of just two tokens. 00:13:39.600 |
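If you'd rather poke at this in code than on the website, here is a small sketch using the tiktoken library (assuming it is installed via pip; the exact IDs are whatever cl100k_base returns, e.g. 15339 and 1917 for "hello world"):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 base model tokenizer

for text in ["hello world", "helloworld", "hello  world", "Hello world", "HELLO WORLD"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:18} -> {ids} -> {pieces}")

# Joining the words, adding a second space, or changing the case all produce
# different token sequences, which is exactly what the website shows.
```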
Yeah, so you can play with this and get a sort of intuitive sense of how 00:13:46.000 |
these tokens work. We're actually going to loop around to tokenization a 00:13:49.040 |
bit later in the video. For now I just wanted to show you the website, and I 00:13:52.140 |
wanted to show you that this text basically, at the end of the day, so for 00:13:56.440 |
example if I take one line here, this is what GPT-4 will see it as. So this text 00:14:01.680 |
will be a sequence of length 62. This is the sequence here, and this is how the 00:14:08.920 |
chunks of text correspond to these symbols. And again there's 00:14:14.640 |
100,277 possible symbols, and we now have one-dimensional sequences of 00:14:20.640 |
those symbols. So yeah, we're gonna come back to tokenization, but that's for now 00:14:26.480 |
where we are. Okay, so what I've done now is I've taken this sequence of text that 00:14:30.840 |
we have here in the dataset, and I have re-represented it using our tokenizer 00:14:34.480 |
into a sequence of tokens. And this is what that looks like now. So for example 00:14:40.560 |
when we go back to the FineWeb dataset, they mention that not only is this 00:14:43.840 |
44 terabytes of disk space, but this is about a 15 trillion token sequence in 00:14:50.640 |
this dataset. And so here, these are just some of the first one or two or three or 00:14:56.760 |
a few thousand here, I think, tokens of this dataset, but there's 15 trillion 00:15:01.520 |
here to keep in mind. And again, keep in mind one more time that all of these 00:15:06.160 |
represent little text chunks. They're all just like atoms of these sequences, and 00:15:10.680 |
the numbers here don't make any sense. They're just unique IDs. 00:15:14.560 |
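As a rough sketch of the byte-pair-encoding idea described above (a toy, not the actual GPT-4 tokenizer): start from the UTF-8 bytes, repeatedly find the most common adjacent pair, and mint a new symbol for it. Every merge makes the sequence shorter and the vocabulary larger.

```python
from collections import Counter

def toy_bpe(text: str, num_merges: int):
    """Toy byte-pair encoding: start from UTF-8 bytes (IDs 0..255) and repeatedly
    merge the most frequent adjacent pair into a newly minted token ID."""
    ids = list(text.encode("utf-8"))   # the raw byte sequence
    merges = {}                        # (left_id, right_id) -> new token ID
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges[(a, b)] = next_id
        out, i = [], 0
        while i < len(ids):            # rewrite the sequence with the new ID
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == (a, b):
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids, next_id = out, next_id + 1
    return ids, merges

text = "the quick brown fox jumps over the lazy dog. the dog sleeps. the fox runs."
ids, merges = toy_bpe(text, num_merges=20)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens after", len(merges), "merges")
```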
Okay, so now we get to the fun part, which is the neural network training. And this 00:15:21.400 |
is where a lot of the heavy lifting happens computationally when you're 00:15:23.960 |
training these neural networks. So what we do here in this step is we want 00:15:30.320 |
to model the statistical relationships of how these tokens follow each other in 00:15:33.440 |
the sequence. So what we do is we come into the data, and we take windows of 00:15:38.760 |
tokens. So we take a window of tokens from this data fairly randomly, and the 00:15:47.320 |
window's length can range anywhere between zero tokens, actually, 00:15:52.320 |
all the way up to some maximum size that we decide on. So for example, in practice 00:15:57.960 |
you could see token windows of, say, 8,000 tokens. Now, in principle, we can 00:16:02.680 |
use arbitrary window lengths of tokens, but processing very long, basically, 00:16:10.460 |
window sequences would just be very computationally expensive. So we just 00:16:15.000 |
kind of decide that, say, 8,000 is a good number, or 4,000, or 16,000, and we crop 00:16:19.280 |
it there. Now, in this example, I'm going to be taking the first four tokens just 00:16:25.560 |
so everything fits nicely. So these tokens, we're going to take a window of 00:16:31.920 |
four tokens, this bar, view, ing, and space single, which are these token IDs. And now 00:16:40.320 |
what we're trying to do here is we're trying to basically predict the token 00:16:42.960 |
that comes next in the sequence. So 3962 comes next, right? So what we do now here 00:16:49.400 |
is that we call this the context. These four tokens are context, and they feed 00:16:54.640 |
into a neural network. And this is the input to the neural network. Now, I'm 00:17:00.400 |
going to go into the detail of what's inside this neural network in a little 00:17:03.400 |
bit. For now, what's important to understand is the input and the output 00:17:06.000 |
of the neural net. So the input are sequences of tokens of variable length, 00:17:11.680 |
anywhere between 0 and some maximum size, like 8,000. The output now is a 00:17:17.360 |
prediction for what comes next. So because our vocabulary has 100,277 00:17:24.320 |
possible tokens, the neural network is going to output exactly that many 00:17:28.440 |
numbers. And all of those numbers correspond to the probability of that 00:17:32.400 |
token as coming next in the sequence. So it's making guesses about what comes 00:17:36.640 |
next. In the beginning, this neural network is randomly initialized. So we're 00:17:42.840 |
going to see in a little bit what that means. But it's a random 00:17:46.800 |
transformation. So these probabilities in the very beginning of the training are 00:17:50.280 |
also going to be kind of random. So here I have three examples, but keep in mind 00:17:54.680 |
that there's 100,000 numbers here. So the probability of this token, space 00:17:59.520 |
direction, the neural network is saying that this is 4% likely right now. 11,799 00:18:04.520 |
is 2%. And then here, the probability of 3962, which is post, is 3%. Now, of 00:18:10.960 |
course, we've sampled this window from our data set. So we know what comes next. 00:18:14.680 |
We know, and that's the label, we know that the correct answer is that 3962 00:18:19.520 |
actually comes next in the sequence. So now what we have is this mathematical 00:18:24.360 |
process for doing an update to the neural network. We have a way of tuning 00:18:29.040 |
it. And we're going to go into a little bit of detail in a bit. But basically, we 00:18:34.440 |
know that this probability here of 3%, we want this probability to be higher, and 00:18:39.640 |
we want the probabilities of all the other tokens to be lower. And so we have 00:18:45.480 |
a way of mathematically calculating how to adjust and update the neural network 00:18:50.400 |
so that the correct answer has a slightly higher probability. So if I do an update 00:18:55.520 |
to the neural network now, the next time I feed this particular sequence of four 00:19:00.520 |
tokens into the neural network, the neural network will be slightly adjusted now and 00:19:04.000 |
it will say, okay, post is maybe 4%, and case now maybe is 1%. And direction 00:19:11.000 |
could become 2% or something like that. And so we have a way of nudging, of 00:19:14.880 |
slightly updating the neural net to basically give a higher probability to 00:19:20.240 |
the correct token that comes next in the sequence. And now we just have to 00:19:23.360 |
remember that this process happens not just for this token here, where these 00:19:30.240 |
four fed in and predicted this one. This process happens at the same time for all 00:19:35.440 |
of these tokens in the entire data set. And so in practice, we sample little 00:19:39.320 |
windows, little batches of windows, and then at every single one of these tokens, 00:19:43.600 |
we want to adjust our neural network so that the probability of that token 00:19:47.600 |
becomes slightly higher. And this all happens in parallel in large batches of 00:19:51.880 |
these tokens. And this is the process of training the neural network. It's a 00:19:55.760 |
sequence of updating it so that its predictions match up with the statistics of 00:20:01.200 |
what actually happens in your training set. And its probabilities become 00:20:05.080 |
consistent with the statistical patterns of how these tokens follow each other in the data. 00:20:10.040 |
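Here is a minimal sketch in PyTorch of the update being described. The tiny embedding-plus-linear model is a stand-in for the real network, and the context token IDs are illustrative, but the mechanics are the ones in the text: feed in the context, look at the probability assigned to the known next token, and nudge the parameters so that probability goes up.

```python
import torch
import torch.nn.functional as F

vocab_size = 100277                                 # GPT-4-style vocabulary size
context = torch.tensor([[91, 860, 287, 11579]])     # four context tokens (IDs illustrative)
target = torch.tensor([3962])                       # the token we know comes next

# Stand-in "neural network": embed the tokens, average them, map to one score per token.
emb = torch.nn.Embedding(vocab_size, 64)
head = torch.nn.Linear(64, vocab_size)
opt = torch.optim.AdamW(list(emb.parameters()) + list(head.parameters()), lr=1e-2)

for step in range(3):
    logits = head(emb(context).mean(dim=1))         # shape (1, vocab_size)
    probs = F.softmax(logits, dim=-1)
    loss = F.cross_entropy(logits, target)          # low loss <=> correct token is likely
    print(f"step {step}: p(3962) = {probs[0, 3962].item():.5f}, loss = {loss.item():.3f}")
    opt.zero_grad()
    loss.backward()                                 # how should every parameter change?
    opt.step()                                      # nudge the parameters a little
```

In real training, this same nudge is computed for every token position in a large batch of windows at once, not one window at a time as in this sketch.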
So let's now briefly get into the internals of these neural networks 00:20:13.640 |
just to give you a sense of what's inside. So neural network internals. So 00:20:18.600 |
as I mentioned, we have these inputs that are sequences of tokens. In this case, 00:20:23.680 |
this is four input tokens, but this can be anywhere between zero up to, let's say, 00:20:28.600 |
a thousand tokens. In principle, this can be an infinite number of tokens. We just, 00:20:32.680 |
it would just be too computationally expensive to process an infinite number 00:20:36.560 |
of tokens. So we just crop it at a certain length, and that becomes the 00:20:39.720 |
maximum context length of that model. Now these inputs X are mixed up in a giant 00:20:46.680 |
mathematical expression together with the parameters or the weights of these 00:20:51.960 |
neural networks. So here I'm showing six example parameters and their setting. But 00:20:57.800 |
in practice, these modern neural networks will have billions of these parameters. 00:21:04.000 |
And in the beginning, these parameters are completely randomly set. Now with a 00:21:09.000 |
random setting of parameters, you might expect that this neural network 00:21:13.600 |
would make random predictions, and it does. In the beginning, it's totally 00:21:16.800 |
random predictions. But it's through this process of iteratively updating the 00:21:22.040 |
network, and we call that process training a neural network, so that the 00:21:28.160 |
setting of these parameters gets adjusted such that the outputs of our 00:21:31.720 |
neural network becomes consistent with the patterns seen in our training set. So 00:21:37.480 |
think of these parameters as kind of like knobs on a DJ set, and as you're 00:21:41.280 |
twiddling these knobs, you're getting different predictions for every possible 00:21:45.800 |
token sequence input. And training a neural network just means discovering a 00:21:51.120 |
setting of parameters that seems to be consistent with the statistics of the 00:21:55.360 |
training set. Now let me just give you an example of what this giant mathematical 00:21:59.560 |
expression looks like, just to give you a sense. And modern networks are massive 00:22:03.320 |
expressions with trillions of terms probably. But let me just show you a 00:22:06.600 |
simple example here. It would look something like this. I mean, these are 00:22:10.240 |
the kinds of expressions, just to show you that it's not very scary. We have 00:22:13.840 |
inputs x, like x1, x2, in this case two example inputs, and they get mixed up 00:22:19.600 |
with the weights of the network, w0, w1, w2, w3, etc. And this mixing is simple 00:22:26.680 |
things like multiplication, addition, exponentiation, division, etc. And it is 00:22:32.760 |
the subject of neural network architecture research to design effective 00:22:36.960 |
mathematical expressions that have a lot of kind of convenient characteristics. 00:22:41.880 |
They are expressive, they're optimizable, they're parallelizable, etc. And so, but at 00:22:47.760 |
the end of the day, these are not complex expressions, and basically 00:22:51.200 |
they mix up the inputs with the parameters to make predictions, and we're 00:22:55.680 |
optimizing the parameters of this neural network so that the predictions come out 00:23:00.680 |
consistent with the training set. Now, I would like to show you an actual 00:23:05.440 |
production-grade example of what these neural networks look like. So for that, I 00:23:09.320 |
encourage you to go to this website that has a very nice visualization of one of 00:23:12.840 |
these networks. So this is what you will find on this website, and this neural 00:23:19.040 |
network here that is used in production settings has this special kind of 00:23:22.760 |
structure. This network is called the transformer, and this particular one as 00:23:27.560 |
an example has 85,000, roughly, parameters. Now, here on the top, we take the inputs, 00:23:33.760 |
which are the token sequences, and then information flows through the neural 00:23:40.400 |
network until the output, which here is the logits softmax, but these are the 00:23:45.660 |
predictions for what comes next, what token comes next. And then here, there's a 00:23:51.960 |
sequence of transformations, and all these intermediate values that get 00:23:55.960 |
produced inside this mathematical expression as it is sort of predicting 00:23:59.480 |
what comes next. So as an example, these tokens are embedded into kind of like 00:24:05.400 |
this distributed representation, as it's called. So every possible token has kind 00:24:09.360 |
of like a vector that represents it inside the neural network. So first, we 00:24:13.320 |
embed the tokens, and then those values kind of like flow through this diagram, 00:24:19.600 |
and these are all very simple mathematical expressions individually. 00:24:22.720 |
So we have layer norms, and matrix multiplications, and softmaxes, and so 00:24:27.360 |
on. So here's kind of like the attention block of this transformer, and then 00:24:31.840 |
information kind of flows through into the multi-layer perceptron block, and so 00:24:35.760 |
on. And all these numbers here, these are the intermediate values of their 00:24:39.960 |
expression, and you can almost think of these as kind of like the firing rates 00:24:44.920 |
of these synthetic neurons. But I would caution you to not kind of think of it 00:24:50.840 |
too much like neurons, because these are extremely simple neurons compared to the 00:24:54.960 |
neurons you would find in your brain. Your biological neurons are very complex 00:24:58.360 |
dynamical processes that have memory, and so on. There's no memory in this 00:25:02.160 |
expression. It's a fixed mathematical expression from input to output with no 00:25:05.840 |
memory. It's just stateless. So these are very simple neurons in comparison to 00:25:10.280 |
biological neurons, but you can still kind of loosely think of this as like a 00:25:13.560 |
synthetic piece of brain tissue, if you like to think about it that way. 00:25:18.320 |
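To make "a giant mathematical expression with knobs" slightly more tangible, here is a tiny made-up forward pass in plain numpy. It is a cartoon of the structure (no attention, no layer norms, a 256-token vocabulary), not the transformer on that website, but it has the same skeleton: embed the tokens, mix them with weight matrices through simple operations, and end with a probability for every possible next token.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, width = 256, 16                 # tiny vocabulary and embedding width

# The "knobs": randomly initialized parameters (real models have billions of these).
W_embed = rng.normal(size=(vocab_size, width))
W_hidden = rng.normal(size=(width, 4 * width))
W_out = rng.normal(size=(4 * width, vocab_size))

def forward(token_ids):
    x = W_embed[token_ids]                  # embed each token into a vector
    x = x.mean(axis=0)                      # crude mixing of the context (no real attention)
    h = np.maximum(0.0, x @ W_hidden)       # matrix multiply plus a simple nonlinearity
    logits = h @ W_out                      # one score per possible next token
    e = np.exp(logits - logits.max())
    return e / e.sum()                      # softmax: scores become probabilities

probs = forward([91, 97, 32])
print(probs.shape, round(probs.sum(), 6))   # (256,) 1.0 -- a probability for every token
```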
So information flows through, all these neurons fire, until we get to the 00:25:24.480 |
predictions. Now I'm not actually going to dwell too much on the precise kind of 00:25:29.040 |
like mathematical details of all these transformations. Honestly, I don't think 00:25:32.080 |
it's that important to get into. What's really important to understand is that 00:25:35.200 |
this is a mathematical function. It is parameterized by some fixed set of 00:25:41.480 |
parameters, let's say 85,000 of them, and it is a way of transforming inputs into 00:25:45.960 |
outputs. And as we twiddle the parameters we are getting different kinds of 00:25:50.440 |
predictions, and then we need to find a good setting of these parameters so that 00:25:54.320 |
the predictions sort of match up with the patterns seen in the training set. So 00:25:59.240 |
that's the transformer. Okay, so I've shown you the internals of the neural 00:26:03.440 |
network, and we talked a bit about the process of training it. I want to cover 00:26:07.160 |
one more major stage of working with these networks, and that is the stage 00:26:11.800 |
called inference. So in inference what we're doing is we're generating new data 00:26:15.880 |
from the model, and so we want to basically see what kind of patterns it 00:26:20.960 |
has internalized in the parameters of its network. So to generate from the 00:26:26.160 |
model is relatively straightforward. We start with some tokens that are 00:26:30.800 |
basically your prefix, like what you want to start with. So say we want to start 00:26:34.440 |
with the token 91. Well, we feed it into the network, and remember that network 00:26:39.880 |
gives us probabilities, right? It gives us this probability vector here. So what we 00:26:45.320 |
can do now is we can basically flip a biased coin. So we can sample basically a 00:26:52.800 |
token based on this probability distribution. So the tokens that are 00:26:57.440 |
given high probability by the model are more likely to be sampled when you flip 00:27:01.920 |
this biased coin. You can think of it that way. So we sample from the 00:27:05.840 |
distribution to get a single unique token. So for example, token 860 comes 00:27:10.600 |
next. So 860 in this case when we're generating from model could come next. 00:27:15.760 |
Now 860 is a relatively likely token. It might not be the only possible token in 00:27:20.480 |
this case. There could be many other tokens that could have been sampled, but 00:27:23.520 |
we could see that 860 is a relatively likely token as an example, and indeed in 00:27:27.560 |
our training example here, 860 does follow 91. So let's now say that we 00:27:33.640 |
continue the process. So after 91 there's 860. We append it, and we again ask what 00:27:39.440 |
is the third token. Let's sample, and let's just say that it's 287 exactly as 00:27:44.120 |
here. Let's do that again. We come back in. Now we have a sequence of three, and we 00:27:49.960 |
ask what is the likely fourth token, and we sample from that and get this one. And 00:27:54.920 |
now let's say we do it one more time. We take those four, we sample, and we get 00:28:00.000 |
this one. And this 13659, this is not actually 3962 as we had before. So this 00:28:08.480 |
token is the token article instead, so viewing a single article. And so in this 00:28:14.440 |
case we didn't exactly reproduce the sequence that we saw here in the 00:28:18.560 |
training data. So keep in mind that these systems are stochastic. We're 00:28:23.960 |
sampling, and we're flipping coins, and sometimes we luck out and we reproduce 00:28:29.360 |
some like small chunk of the text in a training set, but sometimes we're 00:28:33.920 |
getting a token that was not verbatim part of any of the documents in the 00:28:38.880 |
training data. So we're going to get sort of like remixes of the data that we saw 00:28:43.840 |
in the training, because at every step of the way we can flip and get a slightly 00:28:47.480 |
different token, and then once that token makes it in, if you sample the next one 00:28:51.400 |
and so on, you very quickly start to generate token streams that are very 00:28:56.240 |
different from the token streams that occur in the training documents. So 00:29:00.520 |
statistically they will have similar properties, but they are not identical 00:29:05.200 |
to training data. They're kind of like inspired by the training data. And so in 00:29:09.280 |
this case we got a slightly different sequence. And why would we get article? 00:29:13.160 |
You might imagine that article is a relatively likely token in the context 00:29:17.240 |
of bar, viewing, single, etc. And you could imagine that the word article followed 00:29:22.640 |
this context window somewhere in the training documents to some extent, and we 00:29:27.840 |
just happen to sample it here at that stage. So basically inference is just 00:29:32.080 |
predicting from these distributions one at a time, we continue feeding back 00:29:36.400 |
tokens and getting the next one, and we're always flipping these coins, and 00:29:41.000 |
depending on how lucky or unlucky we get, we might get very different kinds of 00:29:46.440 |
patterns depending on how we sample from these probability distributions. So 00:29:51.440 |
that's inference. So in most common scenarios, basically downloading the 00:29:56.580 |
internet and tokenizing it is a pre-processing step. You do that a 00:29:59.640 |
single time. And then once you have your token sequence, we can start training 00:30:04.840 |
networks. And in practical cases you would try to train many different 00:30:09.080 |
networks of different kinds of settings, and different kinds of arrangements, and 00:30:12.840 |
different kinds of sizes. And so you'd be doing a lot of neural network training, 00:30:16.280 |
and then once you have a neural network and you train it, and you have some 00:30:20.680 |
specific set of parameters that you're happy with, then you can take the model 00:30:25.260 |
and you can do inference, and you can actually generate data from the model. 00:30:29.280 |
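The inference loop itself is tiny. Here is a toy sketch of it (the "model" here is just a fixed random table standing in for a trained network): get the probability distribution for the next token, flip the biased coin, append the sampled token, and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 256
W = rng.normal(size=(vocab_size, vocab_size))   # stand-in for a trained network

def next_token_probs(context):
    logits = W[context[-1]]                     # a real model looks at the whole context
    e = np.exp(logits - logits.max())
    return e / e.sum()

context = [91]                                  # the starting prefix token
for _ in range(10):
    p = next_token_probs(context)
    token = rng.choice(vocab_size, p=p)         # flip the biased coin
    context.append(int(token))

print(context)  # re-running with a different seed gives a different token stream
```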
And when you're on ChatGPT and you're talking with a model, that model is 00:30:33.280 |
trained, and has been trained by OpenAI many months ago probably, and they 00:30:38.360 |
have a specific set of weights that work well, and when you're talking to the 00:30:42.300 |
model, all of that is just inference. There's no more training. Those 00:30:45.920 |
parameters are held fixed, and you're just talking to the model, sort of. You're 00:30:51.060 |
giving it some of the tokens, and it's kind of completing token sequences, and 00:30:54.720 |
that's what you're seeing generated when you actually use the model on ChatGPT. 00:30:58.480 |
So that model then just does inference alone. So let's now look at an example of 00:31:03.600 |
training and inference that is kind of concrete, and gives you a sense of what 00:31:06.840 |
this actually looks like when these models are trained. Now the example that 00:31:10.680 |
I would like to work with, and that I am particularly fond of, is that of OpenAI's 00:31:14.400 |
GPT2. So GPT stands for generatively pre-trained transformer, and this is the 00:31:19.900 |
second iteration of the GPT series by OpenAI. When you are talking to ChatGPT 00:31:24.760 |
today, the model that is underlying all of the magic of that interaction is GPT4, 00:31:29.480 |
so the fourth iteration of that series. Now GPT2 was published in 2019 by 00:31:34.880 |
OpenAI in this paper that I have right here, and the reason I like GPT2 is that 00:31:40.320 |
it is the first time that a recognizably modern stack came together. So all of the 00:31:47.320 |
pieces of GPT2 are recognizable today by modern standards, it's just everything 00:31:51.800 |
has gotten bigger. Now I'm not going to be able to go into the full details of 00:31:55.560 |
this paper, of course, because it is a technical publication, but some of the 00:31:59.440 |
details that I would like to highlight are as follows. GPT2 was a transformer 00:32:03.600 |
neural network, just like the neural networks you would work 00:32:06.840 |
with today. It had 1.5 billion parameters, right? So these are the 00:32:12.120 |
parameters that we looked at here. It would have 1.5 billion of them. Today, 00:32:16.800 |
modern transformers would have a lot closer to a trillion or several hundred 00:32:20.360 |
billion, probably. The maximum context length here was 1024 tokens, so 00:32:28.120 |
when we are sampling chunks of windows of tokens from the data set, we're never 00:32:34.080 |
taking more than 1024 tokens, and so when you are trying to predict the next 00:32:37.680 |
token in a sequence, you will never have more than 1024 tokens kind of in your 00:32:42.480 |
context in order to make that prediction. Now, this is also tiny by modern 00:32:46.640 |
standards. Today, the context lengths would be a lot closer to a 00:32:52.120 |
couple hundred thousand or maybe even a million, and so you have a lot more 00:32:55.880 |
context, a lot more tokens in history, and you can make a lot better prediction 00:32:59.880 |
about the next token in a sequence in that way. And finally, GPT2 was trained 00:33:04.360 |
on approximately 100 billion tokens, and this is also fairly small by modern 00:33:08.520 |
standards. As I mentioned, the FineWeb dataset that we looked at 00:33:12.280 |
here has 15 trillion tokens, so 100 billion is quite small. Now, I 00:33:19.240 |
actually tried to reproduce GPT2 for fun as part of this project called llm.c, 00:33:24.440 |
so you can see my write-up of doing that in this post on GitHub under the llm.c 00:33:30.760 |
repository. So in particular, the cost of training GPT2 in 2019 was 00:33:37.080 |
estimated to be approximately $40,000, but today you can do 00:33:41.200 |
significantly better than that, and in particular, here it took about one day 00:33:45.120 |
and about $600. But this wasn't even trying too hard. I think you could really 00:33:50.720 |
bring this down to about $100 today. Now, why is it that the costs have come 00:33:56.320 |
down so much? Well, number one, these data sets have gotten a lot better, and the 00:34:01.160 |
way we filter them, extract them, and prepare them has gotten a lot more 00:34:04.800 |
refined, and so the data set is of just a lot higher quality, so that's one thing. 00:34:09.320 |
But really, the biggest difference is that our computers have gotten much 00:34:12.880 |
faster in terms of the hardware, and we're going to look at that in a second, 00:34:16.280 |
and also the software for running these models and really squeezing out all the 00:34:22.200 |
speed from the hardware as it is possible, that software has also gotten 00:34:26.720 |
much better as everyone has focused on these models and tried to run them very, 00:34:30.040 |
very quickly. Now, I'm not going to be able to go into the full detail of this 00:34:35.960 |
GPT-2 reproduction, and this is a long technical post, but I would like to still 00:34:39.880 |
give you an intuitive sense for what it looks like to actually train one of 00:34:43.240 |
these models as a researcher. Like, what are you looking at, and what does it look 00:34:46.200 |
like, what does it feel like? So let me give you a sense of that a little bit. 00:34:49.200 |
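Before looking at the real thing, here is a heavily simplified sketch of the kind of loop a researcher sits and watches. The model, the random stand-in data, and the batch sizes are all made up for illustration (and with random data the loss will not drop much), but the shape is the point: sample a batch of token windows, compute the loss, update the parameters, and print the loss so you can watch it go down.

```python
import torch
import torch.nn.functional as F

vocab_size, context_len, windows_per_step = 1000, 32, 128
data = torch.randint(0, vocab_size, (100_000,))      # stand-in for the tokenized dataset

model = torch.nn.Sequential(                          # stand-in for the transformer
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Flatten(),
    torch.nn.Linear(64 * context_len, vocab_size),
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(200):
    # sample a batch of random windows and the token that follows each one
    starts = torch.randint(0, len(data) - context_len - 1, (windows_per_step,))
    x = torch.stack([data[int(s):int(s) + context_len] for s in starts])
    y = data[starts + context_len]
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 20 == 0:
        print(f"step {step:4d} | loss {loss.item():.4f}")   # the number you watch
```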
Okay, so this is what it looks like. Let me slide this over. So what I'm doing here 00:34:54.800 |
is I'm training a GPT-2 model right now, and what's happening here is that every 00:35:01.160 |
single line here, like this one, is one update to the model. So remember how here 00:35:09.000 |
we are basically making the prediction better for every one of these tokens, and 00:35:14.480 |
we are updating these weights or parameters of the neural net. So here, 00:35:18.600 |
every single line is one update to the neural network, where we change its 00:35:22.480 |
parameters by a little bit so that it is better at predicting next token and 00:35:25.640 |
sequence. In particular, every single line here is improving the prediction on 1 00:35:32.400 |
million tokens in the training set. So we've basically taken 1 million tokens 00:35:38.200 |
out of this data set, and we've tried to improve the prediction of that token as 00:35:44.560 |
coming next in a sequence on all 1 million of them simultaneously. And at 00:35:50.440 |
every single one of these steps, we are making an update to the network for that. 00:35:54.120 |
Now the number to watch closely is this number called loss, and the loss is a 00:35:59.960 |
single number that is telling you how well your neural network is 00:36:03.280 |
performing right now, and it is created so that low loss is good. So you'll see 00:36:09.200 |
that the loss is decreasing as we make more updates to the neural net, which 00:36:13.260 |
corresponds to making better predictions on the next token in a sequence. And so 00:36:17.680 |
the loss is the number that you are watching as a neural network researcher, 00:36:21.440 |
and you are kind of waiting, you're twiddling your thumbs, you're drinking 00:36:25.080 |
coffee, and you're making sure that this looks good so that with every update 00:36:29.520 |
your loss is improving and the network is getting better at prediction. Now here 00:36:34.440 |
you see that we are processing 1 million tokens per update. Each update takes 00:36:39.640 |
about 7 seconds roughly, and here we are going to process a total of 32,000 steps 00:36:46.160 |
of optimization. So 32,000 steps with 1 million tokens each is about 33 00:36:52.360 |
billion tokens that we are going to process, and we're currently only at about 00:36:55.800 |
step 420 out of 32,000, so we are still only a bit more than 1% 00:37:01.840 |
done because I've only been running this for 10 or 15 minutes or something like 00:37:05.360 |
that. Now every 20 steps I have configured this optimization to do 00:37:10.960 |
inference. So what you're seeing here is the model is predicting the next token 00:37:15.120 |
in a sequence, and so you sort of start it randomly, and then you continue 00:37:19.380 |
plugging in the tokens. So we're running this inference step, and this is the 00:37:23.760 |
model sort of predicting the next token in a sequence, and every time you see 00:37:26.280 |
something appear, that's a new token. So let's just look at this, and you can see 00:37:34.760 |
that this is not yet very coherent, and keep in mind that this is only 1% of the 00:37:38.460 |
way through training, and so the model is not yet very good at predicting the next 00:37:41.960 |
token in the sequence. So what comes out is actually kind of a little bit of 00:37:45.520 |
gibberish, right, but it still has a little bit of like local coherence. So 00:37:49.800 |
since she is mine, it's a part of the information, should discuss my father, 00:37:54.360 |
great companions, Gordon showed me sitting over it, and etc. So I know it 00:37:59.540 |
doesn't look very good, but let's actually scroll up and see what it 00:38:04.160 |
looked like when I started the optimization. So all the way here, at 00:38:09.300 |
step 1, so after 20 steps of optimization, you see that what we're getting here is 00:38:16.760 |
looks completely random, and of course that's because the model has only had 20 00:38:20.300 |
updates to its parameters, and so it's giving you random text because it's a 00:38:23.600 |
random network. And so you can see that at least in comparison to this, the model 00:38:27.760 |
is starting to do much better, and indeed if we waited the entire 32,000 steps, the 00:38:32.620 |
model will have improved to the point that it's actually generating fairly 00:38:36.400 |
coherent English, and the tokens stream correctly, and they kind of make up 00:38:42.800 |
English a lot better. So this has to run for about a day or two more now, and so 00:38:50.840 |
at this stage we just make sure that the loss is decreasing, everything is looking 00:38:55.040 |
good, and we just have to wait. And now let me turn to the story of the 00:39:03.100 |
computation that's required, because of course I'm not running this optimization 00:39:07.100 |
on my laptop. That would be way too expensive, because we have to run this 00:39:11.420 |
neural network, and we have to improve it, and we need all this data and so 00:39:15.060 |
on. So you can't run this too well on your computer, because the network is 00:39:19.180 |
just too large. So all of this is running on the computer that is out there in the 00:39:23.620 |
cloud, and I want to basically address the compute side of the story of training 00:39:28.080 |
these models, and what that looks like. So let's take a look. Okay so the computer 00:39:31.860 |
that I am running this optimization on is this 8xh100 node. So there are 00:39:37.860 |
eight h100s in a single node, or a single computer. Now I am renting this 00:39:43.340 |
computer, and it is somewhere in the cloud. I'm not sure where it is 00:39:45.900 |
physically actually. The place I like to rent from is called Lambda, but there are 00:39:49.940 |
many other companies who provide this service. So when you scroll down, you can 00:39:54.660 |
see that they have some on-demand pricing for sort of computers that have 00:39:59.900 |
these h100s, which are GPUs, and I'm going to show you what they look like in a 00:40:04.740 |
second. But on-demand 8xNVIDIA h100 GPU. This machine comes for three 00:40:12.720 |
dollars per GPU per hour, for example. So you can rent these, and then you get a 00:40:17.700 |
machine in the cloud, and you can go in and you can train these models. And these 00:40:23.660 |
GPUs, they look like this. So this is one h100 GPU. This is kind of what it looks 00:40:29.640 |
like, and you slot this into your computer. And GPUs are this perfect fit 00:40:33.700 |
for training neural networks, because they are very computationally expensive, 00:40:37.580 |
but they display a lot of parallelism in the computation. So you can have many 00:40:42.100 |
independent workers kind of working all at the same time in solving the matrix 00:40:48.540 |
multiplication that's under the hood of training these neural networks. So this 00:40:54.160 |
is just one of these h100s, but actually you would put 00:40:57.260 |
multiple of them together. So you could stack eight of them into a single node, 00:41:00.940 |
and then you can stack multiple nodes into an entire data center, or an entire 00:41:11.180 |
system. So when we look at a data center, we start to see things that look 00:41:16.460 |
like this, right? So we have one GPU goes to eight GPUs, goes to a single system, 00:41:20.180 |
goes to many systems. And so these are the bigger data centers, and they of 00:41:23.940 |
course would be much, much more expensive. And what's happening is that all the 00:41:28.580 |
big tech companies really desire these GPUs, so they can train all these 00:41:33.100 |
language models, because they are so powerful. And that is fundamentally 00:41:37.260 |
what has driven the market cap of NVIDIA to $3.4 trillion today, as an 00:41:41.860 |
example, and why NVIDIA has kind of exploded. So this is the gold rush. The 00:41:47.100 |
gold rush is getting the GPUs, getting enough of them, so they can all 00:41:51.540 |
collaborate to perform this optimization. And what are they all 00:41:56.500 |
doing? They're all collaborating to predict the next token on a data set 00:42:00.740 |
like the fine web data set. This is the computational workflow that basically 00:42:06.180 |
is extremely expensive. The more GPUs you have, the more tokens you can try 00:42:10.100 |
to predict and improve on, and you're going to process this data set faster, 00:42:14.020 |
and you can iterate faster and get a bigger network and train a bigger 00:42:17.140 |
network and so on. So this is what all those machines are doing. And this is 00:42:23.900 |
why all of this is such a big deal. And for example, this is an article from 00:42:29.140 |
about a month ago or so. This is why it's a big deal that, for example, 00:42:32.260 |
Elon Musk is getting 100,000 GPUs in a single data center. And all of these 00:42:38.900 |
GPUs are extremely expensive, are going to take a ton of power, and all of them 00:42:42.700 |
are just trying to predict the next token in the sequence and improve the 00:42:45.740 |
network by doing so, and get probably a lot more coherent text than what we're 00:42:50.940 |
seeing here a lot faster. Okay, so unfortunately, I do not have tens or 00:42:54.940 |
hundreds of millions of dollars to spend on training a really big model like this. But 00:43:00.260 |
luckily, we can turn to some big tech companies who train these models 00:43:04.340 |
routinely, and release some of them once they are done training. So they've 00:43:08.740 |
spent a huge amount of compute to train this network, and they release the 00:43:12.220 |
network at the end of the optimization. So it's very useful because they've 00:43:15.700 |
done a lot of compute for that. So there are many companies who train these 00:43:19.300 |
models routinely, but actually not many of them release these what's called 00:43:23.620 |
base models. So the model that comes out at the end here is what's called a base 00:43:27.940 |
model. What is a base model? It's a token simulator, right? It's an internet 00:43:32.540 |
text token simulator. And so that is not by itself useful yet, because what we 00:43:38.340 |
want is what's called an assistant: we want to ask questions and have it 00:43:41.580 |
respond with answers. These models won't do that; they just create sort of remixes 00:43:46.700 |
of the internet. They dream internet pages. So the base models are not very 00:43:51.900 |
often released, because they're kind of just only a step one of a few other 00:43:55.300 |
steps that we still need to take to get an assistant. However, a few releases 00:43:59.260 |
have been made. So as an example, OpenAI released the 1.5 billion parameter 00:44:05.940 |
GPT-2 model back in 2019. And this GPT-2 model is a base model. 00:44:11.300 |
Now, what is a model release? What does it look like to release these models? So 00:44:16.380 |
this is the GPT-2 repository on GitHub. Well, you need two things basically to 00:44:20.540 |
release a model. Number one, we need the Python code, usually, that describes the 00:44:28.500 |
sequence of operations, in detail, that make up their model. So if you 00:44:35.540 |
remember back this transformer, the sequence of steps that are taken here in 00:44:41.540 |
this neural network is what is being described by this code. So this code is 00:44:46.260 |
sort of implementing what's called the forward pass of this neural network. So 00:44:50.620 |
we need the specific details of exactly how they wired up that neural network. 00:44:54.380 |
So this is just computer code, and it's usually just a couple hundred lines of 00:44:58.220 |
code. It's not that crazy. And this is all fairly understandable and 00:45:02.540 |
usually fairly standard. What's not standard are the parameters. That's where 00:45:06.100 |
the actual value is. What are the parameters of this neural network, 00:45:09.860 |
because there's 1.5 billion of them, and we need the correct setting or a really 00:45:14.060 |
good setting. And so that's why in addition to this source code, they 00:45:18.620 |
release the parameters, which in this case is roughly 1.5 billion parameters. 00:45:23.580 |
And these are just numbers. So it's one single list of 1.5 billion numbers, the 00:45:28.540 |
precise and good setting of all the knobs, such that the tokens come out 00:45:32.580 |
well. So you need those two things to get a base model release. Now, GPT-2 was 00:45:42.980 |
released, but that's actually a fairly old model, as I mentioned. So actually, 00:45:46.020 |
the model we're going to turn to is called LLAMA-3. And that's the one that 00:45:49.660 |
I would like to show you next. So LLAMA-3. So GPT-2, again, was 1.5 billion 00:45:54.980 |
parameters trained on 100 billion tokens. LLAMA-3 is a much bigger model 00:45:58.980 |
and much more modern model. It is released and trained by Meta. And it is 00:46:03.620 |
a 405 billion parameter model trained on 15 trillion tokens, in very much the 00:46:09.260 |
same way, just much, much bigger. And Meta has also made a release of LLAMA-3. 00:46:16.060 |
And that was part of this paper. So with this paper that goes into a lot of 00:46:21.540 |
detail, the biggest base model that they released is the LLAMA-3.1 405 00:46:27.820 |
billion parameter model. So this is the base model. And then in addition to the 00:46:32.180 |
base model, you see here, foreshadowing for later sections of the video, they 00:46:36.100 |
also released the instruct model. And the instruct means that this is an 00:46:39.740 |
assistant, you can ask it questions, and it will give you answers. We still 00:46:43.060 |
have yet to cover that part later. For now, let's just look at this base model, 00:46:46.860 |
this token simulator. And let's play with it and try to think about, you 00:46:50.820 |
know, what is this thing? And how does it work? And what do we get at the end 00:46:54.780 |
of this optimization, if you let this run until the end, for a very big neural 00:46:59.300 |
network on a lot of data. So my favorite place to interact with the base models 00:47:03.620 |
is this company called Hyperbolic, which is basically serving the base model of 00:47:09.220 |
the 405B LLAMA-3.1. So when you go into the website, and I think you may have 00:47:14.420 |
to register and so on, make sure that in the models, make sure that you are 00:47:18.140 |
using LLAMA-3.1 405 billion base, it must be the base model. And then here, 00:47:24.420 |
let's say the max tokens is how many tokens we're going to be generating. So 00:47:27.700 |
let's just decrease this to be a bit less just so we don't waste compute, we 00:47:31.660 |
just want the next 128 tokens. And leave the other stuff alone, I'm not going to 00:47:35.660 |
go into the full detail here. Now, fundamentally, what's going to happen 00:47:39.500 |
here is identical to what happens here during inference for us. So this is just 00:47:44.820 |
going to continue the token sequence of whatever prefix you're going to give it. 00:47:48.620 |
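If you want to poke at a base model locally without the 405-billion-parameter weights, here is a hedged sketch using the small GPT-2 base model through the Hugging Face transformers library (a stand-in for the Llama 3.1 405B base served here, assuming transformers and torch are installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # a small, openly released base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What is 2+2?"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tok.decode(out[0]))
# A base model just continues the text statistically; it rambles on rather than
# answering the way an assistant would.
```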
So I want to first show you that this model here is not yet an assistant. So 00:47:53.420 |
you can, for example, ask it, what is two plus two, it's not going to tell 00:47:56.540 |
you, oh, it's four. What else can I help you with? It's not going to do that. 00:48:00.900 |
Because what is two plus two is going to be tokenized. And then those tokens just 00:48:05.860 |
act as a prefix. And then what the model is going to do now is just going to get 00:48:09.580 |
the probability for the next token. And it's just a glorified autocomplete. It's 00:48:13.420 |
a very, very expensive autocomplete of what comes next, depending on the 00:48:17.940 |
statistics of what it saw in its training documents, which are basically 00:48:21.060 |
web pages. So let's just hit enter to see what tokens it comes up with as a 00:48:26.580 |
continuation. Okay, so here it kind of actually answered the question and 00:48:34.020 |
started to go off into some philosophical territory. Let's try it 00:48:37.580 |
again. So let me copy and paste. And let's try again, from scratch. What is 00:48:41.860 |
two plus two? Okay, so it just goes off again. So notice one more thing that I 00:48:50.460 |
want to stress is that every time you put in a prompt, the system just 00:48:55.020 |
kind of starts from scratch. So the system here is stochastic. So for the 00:49:01.380 |
same prefix of tokens, we're always getting a different answer. And the 00:49:04.860 |
reason for that is that we get this probability distribution, and we sample 00:49:08.860 |
from it, and we always get different samples, and we sort of always go into a 00:49:12.180 |
different territory afterwards. So here in this case, I don't know what this is. 00:49:18.820 |
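This stochasticity is easy to see in isolation. Below is a toy sketch of the sampling step: the vocabulary and the logits are made up, but the mechanics (softmax into probabilities, then sample rather than take the argmax) are why the same prefix keeps landing in different territory.

# Toy sketch of sampling the next token: softmax over made-up logits, then sample.
import numpy as np

vocab = ["4", "?", "Let", "The", "..."]
logits = np.array([2.0, 0.5, 1.0, 0.8, 0.1])  # made-up scores for the next token

def sample_next(logits, temperature=1.0):
    probs = np.exp(logits / temperature)
    probs = probs / probs.sum()              # softmax: turn scores into probabilities
    rng = np.random.default_rng()
    return rng.choice(len(probs), p=probs)   # sample, don't just take the argmax

for _ in range(3):
    print(vocab[sample_next(logits)])        # different runs give different tokens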
Let's try one more time. So it just continues on. So it's just doing the 00:49:25.740 |
stuff that's on the internet, right? And it's just kind of like regurgitating 00:49:30.380 |
those statistical patterns. So first thing, it's not an assistant yet, it's a 00:49:36.820 |
token autocomplete. And second, it is a stochastic system. Now the crucial thing 00:49:43.020 |
is that even though this model is not yet by itself very useful for a lot of 00:49:47.220 |
applications, just yet, it is still very useful because in the task of 00:49:53.380 |
predicting the next token in the sequence, the model has learned a lot 00:49:56.940 |
about the world. And it has stored all that knowledge in the parameters of the 00:50:01.180 |
network. So remember that our text looked like this, right? Internet web 00:50:05.700 |
pages. And now all of this is sort of compressed in the weights of the 00:50:10.260 |
network. So you can think of these 405 billion parameters as a kind of 00:50:16.100 |
compression of the internet. You can think of the 405 billion parameters as 00:50:21.020 |
kind of like a zip file. But it's not a lossless compression, it's a lossy 00:50:26.260 |
compression, we're kind of like left with kind of a gestalt of the internet 00:50:29.780 |
and we can generate from it, right? Now we can elicit some of this knowledge by 00:50:34.900 |
prompting the base model accordingly. So for example, here's a prompt that 00:50:39.300 |
might work to elicit some of that knowledge that's hiding in the 00:50:41.980 |
parameters. Here's my top 10 list of the top landmarks to see in Paris. And I'm 00:50:51.420 |
doing it this way, because I'm trying to prime the model to now continue this 00:50:54.860 |
list. So let's see if that works when I press enter. Okay, so you see that it 00:51:00.140 |
started the list, and it's now kind of giving me some of those landmarks. And 00:51:04.300 |
I noticed that it's trying to give a lot of information here. Now, you might not 00:51:08.380 |
be able to actually fully trust some of the information here. Remember that this 00:51:11.420 |
is all just a recollection of some of the internet documents. And so the 00:51:16.100 |
things that occur very frequently in the internet data are probably more likely 00:51:20.420 |
to be remembered correctly, compared to things that happen very infrequently. So 00:51:24.980 |
you can't fully trust some of the information that is 00:51:27.980 |
here, because it's all just a vague recollection of internet documents. 00:51:31.220 |
Because the information is not stored explicitly in any of the parameters, it's 00:51:35.860 |
all just the recollection. That said, we did get something that is probably 00:51:39.580 |
approximately correct. And I don't actually have the expertise to verify 00:51:43.260 |
that this is roughly correct. But you see that we've elicited a lot of the 00:51:46.900 |
knowledge of the model. And this knowledge is not precise and exact. This 00:51:51.460 |
knowledge is vague, and probabilistic, and statistical. And the kinds of things 00:51:56.180 |
that occur often are the kinds of things that are more likely to be remembered in 00:52:01.060 |
the model. Now I want to show you a few more examples of this model's behavior. 00:52:04.780 |
The first thing I want to show you is this example. I went to the Wikipedia 00:52:08.900 |
page for Zebra. And let me just copy-paste the first, even one sentence 00:52:13.780 |
here. And let me put it here. Now when I click enter, what kind of completion are 00:52:19.980 |
we going to get? So let me just hit enter. There are three living species, 00:52:25.860 |
etc, etc. What the model is producing here is an exact regurgitation of this 00:52:31.620 |
Wikipedia entry. It is reciting this Wikipedia entry purely from memory. And 00:52:36.420 |
this memory is stored in its parameters. And so it is possible that at some point 00:52:41.340 |
in these 512 tokens, the model will stray away from the Wikipedia entry. But 00:52:46.460 |
you can see that it has huge chunks of it memorized here. Let me see, for 00:52:50.020 |
example, if this sentence occurs by now. Okay, so we're still on track. Let me 00:52:56.860 |
check here. Okay, we're still on track. It will eventually stray away. Okay, so 00:53:05.700 |
this thing is just recited to a very large extent. It will eventually deviate 00:53:09.780 |
because it won't be able to remember exactly. Now, the reason that this 00:53:13.540 |
happens is because these models can be extremely good at memorization. And 00:53:17.740 |
usually, this is not what you want in the final model. And this is something 00:53:20.820 |
called regurgitation. And it's usually undesirable to recite things directly 00:53:26.060 |
that you have trained on. Now, the reason that this happens actually is 00:53:29.940 |
because for a lot of documents, like for example Wikipedia, when these 00:53:33.700 |
documents are deemed to be of very high quality as a source, it is very 00:53:37.100 |
often the case that when you train the model, 00:53:41.700 |
you will preferentially sample from those sources. So basically, the model 00:53:46.220 |
has probably done a few epochs on this data, meaning that it has seen this 00:53:49.860 |
web page maybe 10 times or so. And it's a bit like when you 00:53:53.740 |
read some kind of a text many, many times, say you read something 100 00:53:57.500 |
times, then you will be able to recite it. And it's very similar for this 00:54:01.060 |
model, if it sees something way too often, it's going to be able to recite 00:54:03.820 |
it later from memory. Except these models can be a lot more efficient 00:54:08.340 |
per presentation than a human. So it's probably only seen this Wikipedia 00:54:12.740 |
entry 10 times, but basically it has remembered this article exactly in its 00:54:16.660 |
parameters. Okay, the next thing I want to show you is something that the 00:54:19.340 |
model has definitely not seen during its training. So for example, if we go 00:54:23.420 |
to the paper, and then we navigate to the pre training data, we'll see here 00:54:29.100 |
that the data set has a knowledge cutoff until the end of 2023. So it 00:54:35.540 |
will not have seen documents after this point. And certainly it has not seen 00:54:39.540 |
anything about the 2024 election and how it turned out. Now, if we prime the 00:54:44.980 |
model with the tokens from the future, it will continue the token sequence, 00:54:49.900 |
and it will just take its best guess according to the knowledge that it has 00:54:53.020 |
in its own parameters. So let's take a look at what that could look like. So 00:54:57.540 |
the Republican Party candidate, Trump, President of the United States from 00:55:02.340 |
2017. And let's see what it says after this point. So for example, the model 00:55:07.020 |
will have to guess at the running mate and who it's against, etc. So let's 00:55:10.940 |
hit enter. So here it thinks that Mike Pence was the running mate instead of 00:55:15.740 |
JD Vance. And the ticket was against Hillary Clinton and Tim Kaine. So this 00:55:22.940 |
is kind of an interesting parallel universe potentially of what could have 00:55:26.260 |
happened according to the LLM. Let's get a different sample. So the 00:55:29.940 |
identical prompt, and let's resample. So here the running mate was Ron 00:55:35.700 |
DeSantis. And they ran against Joe Biden and Kamala Harris. So this is 00:55:40.540 |
again, a different parallel universe. So the model will take educated guesses, 00:55:44.020 |
and it will continue the token sequence based on this knowledge. And we'll just 00:55:48.220 |
kind of like all of what we're seeing here is what's called hallucination. The 00:55:51.980 |
model is just taking its best guess in a probabilistic manner. The next thing I 00:55:56.900 |
would like to show you is that even though this is a base model and not yet 00:56:00.100 |
an assistant model, it can still be utilized in practical applications if 00:56:04.260 |
you are clever with your prompt design. So here's something that we would call a 00:56:08.380 |
few shot prompt. So what I have here is 10 word pairs, and 00:56:15.020 |
each pair is an English word, a colon, and then the translation in Korean. And 00:56:21.740 |
we have 10 of them. And then at the end, we have 00:56:25.980 |
teacher colon, and then here's where we're going to do a completion of say, 00:56:29.580 |
just five tokens. And these models have what we call in context learning 00:56:34.180 |
abilities. And what that's referring to is that as it is reading this context, 00:56:38.700 |
it is learning sort of in place that there's some kind of an algorithmic 00:56:44.340 |
pattern going on in my data. And it knows to continue that pattern. And this 00:56:49.100 |
is called kind of like in context learning. So it takes on the role of 00:56:53.460 |
translator. And when we hit completion, we see that the teacher translation is 00:56:59.780 |
"선생님," which is correct. And so this is how you can build apps by being 00:57:04.700 |
clever with your prompting, even though we still just have a base model for now. 00:57:08.140 |
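For reference, a few-shot prompt like that can be assembled programmatically. The sketch below reuses the assumed OpenAI-compatible client from the earlier sketch; the word pairs are illustrative, and the base URL and model name remain assumptions.

# Sketch: build a few-shot English-to-Korean prompt and ask the base model to
# continue the pattern (in-context learning). Endpoint and model id are assumed.
from openai import OpenAI

client = OpenAI(base_url="https://api.hyperbolic.xyz/v1", api_key="YOUR_API_KEY")

pairs = [("apple", "사과"), ("water", "물"), ("school", "학교"),
         ("book", "책"), ("dog", "개")]
prompt = "\n".join(f"{en}: {ko}" for en, ko in pairs) + "\nteacher:"

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B",  # assumed id for the base model
    prompt=prompt,
    max_tokens=5,     # just a handful of tokens for the completion
    temperature=0.0,  # greedy decoding: take the most likely continuation
    stop=["\n"],      # stop at the end of the line, i.e. the end of the pattern
)
print(resp.choices[0].text)  # expect the Korean word for "teacher"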
And it relies on what we call this in context learning ability. And it is done 00:57:14.140 |
by constructing what's called a few shot prompt. Okay, and finally, I want to 00:57:17.780 |
show you that there is a clever way to actually instantiate a whole language 00:57:21.540 |
model assistant just by prompting. And the trick to it is that we're going to 00:57:26.300 |
structure a prompt to look like a web page that is a conversation between a 00:57:31.140 |
helpful AI assistant and a human. And then the model will continue that 00:57:34.900 |
conversation. So actually, to write the prompt, I turned to ChatGPT itself, 00:57:39.780 |
which is kind of meta. But I told it, I want to create an LLM assistant, but all 00:57:44.340 |
I have is the base model. So can you please write me a prompt? And this is what 00:57:51.740 |
it came up with, which is actually quite good. So here's a conversation between 00:57:55.060 |
an AI assistant and a human. The AI assistant is knowledgeable, helpful, 00:57:58.580 |
capable of answering a wide variety of questions, etc. And then here, it's not 00:58:03.780 |
enough to just give it a sort of description. It works much better if you 00:58:07.740 |
create this few shot prompt. So here's a few turns of human, assistant, human, 00:58:12.220 |
assistant. And we have, you know, a few turns of conversation. And then here at 00:58:17.980 |
the end is we're going to be putting the actual query that we like. So let me 00:58:21.260 |
copy paste this into the base model prompt. And now, let me do human colon. 00:58:28.220 |
And this is where we put our actual prompt. Why is the sky blue? And let's 00:58:34.900 |
run. Assistant, the sky appears blue due to the phenomenon called Rayleigh 00:58:41.460 |
scattering, etc, etc. So you see that the base model is just continuing the 00:58:45.220 |
sequence. But because the sequence looks like this conversation, it takes on that 00:58:49.940 |
role. But it is a little subtle, because here it just, you know, it ends the 00:58:54.460 |
assistant and then just, you know, hallucinates the next question by the 00:58:57.220 |
human, etc. So it'll just continue going on and on. But you can see that we have 00:59:01.820 |
sort of accomplished the task. And if you just took this, why is the sky blue? 00:59:06.420 |
And if we just refresh this, and put it here, then of course, we don't expect 00:59:11.020 |
this to work with the base model, right? We're just gonna, who knows what we're 00:59:14.100 |
gonna get? Okay, we're just gonna get more questions. Okay. So this is one way 00:59:19.220 |
to create an assistant, even though you may only have a base model. Okay, so this 00:59:23.980 |
is the kind of brief summary of the things we talked about over the last few 00:59:27.380 |
minutes. Now, let me zoom out here. And this is kind of like what we've talked 00:59:35.060 |
about so far. We wish to train LLM assistants like ChatGPT. We've discussed 00:59:40.540 |
the first stage of that, which is the pre training stage. And we saw that 00:59:44.020 |
really what it comes down to is we take internet documents, we break them up into 00:59:47.340 |
these tokens, these atoms of little text chunks. And then we predict token 00:59:51.420 |
sequences using neural networks. The output of this entire stage is this base 00:59:56.620 |
model, it is the setting of the parameters of this network. And this base 01:00:01.540 |
model is basically an internet document simulator on the token level. So it can 01:00:05.620 |
just, it can generate token sequences that have the same kind of like 01:00:09.780 |
statistics as internet documents. And we saw that we can use it in some 01:00:13.700 |
applications, but we actually need to do better. We want an assistant, we want to 01:00:17.300 |
be able to ask questions, and we want the model to give us answers. And so we need 01:00:21.300 |
to now go into the second stage, which is called the post training stage. So we 01:00:26.380 |
take our base model, our internet document simulator, and hand it off to 01:00:29.980 |
post training. So we're now going to discuss a few ways to do what's called 01:00:33.860 |
post training of these models. These stages in post training are going to be 01:00:38.020 |
computationally much less expensive, most of the computational work, all of 01:00:41.940 |
the massive data centers, and all of the sort of heavy compute and millions of 01:00:47.660 |
dollars, are in the pre training stage. But now we're going to the slightly cheaper, 01:00:52.460 |
but still extremely important stage called post training, where we turn this 01:00:56.620 |
LLM model into an assistant. So let's take a look at how we can get our model 01:01:01.580 |
to not sample internet documents, but to give answers to questions. So in other 01:01:07.300 |
words, what we want to do is we want to start thinking about conversations. And 01:01:10.900 |
these are conversations that can be multi-turn. So there can be multiple 01:01:15.020 |
turns, and they are in the simplest case, a conversation between a human and an 01:01:18.900 |
assistant. And so for example, we can imagine the conversation could look 01:01:22.340 |
something like this. When a human says what is two plus two, the assistant 01:01:25.700 |
should respond with something like two plus two is four. When a human follows up 01:01:29.260 |
and says what if it was * instead of +, the assistant could respond with 01:01:32.620 |
something like this. And similar here, this is another example showing that the 01:01:37.020 |
assistant could also have some kind of a personality here, that it's kind of like 01:01:40.500 |
nice. And then here in the third example, I'm showing that when a human is asking 01:01:44.620 |
for something that we don't wish to help with, we can produce what's called 01:01:48.660 |
refusal, we can say that we cannot help with that. So in other words, what we 01:01:53.220 |
want to do now is we want to think through how an assistant should interact 01:01:56.780 |
with a human. And we want to program the assistant and its behavior in these 01:02:01.060 |
conversations. Now, because this is neural networks, we're not going to be 01:02:04.780 |
programming these explicitly in code, we're not going to be able to program 01:02:08.620 |
the assistant in that way. Because this is neural networks, everything is done 01:02:12.340 |
through neural network training on data sets. And so because of that, we are 01:02:17.380 |
going to be implicitly programming the assistant by creating data sets of 01:02:21.660 |
conversations. So these are three independent examples of conversations in 01:02:25.620 |
a data set. An actual data set (and I'm going to show you examples) will be much 01:02:29.500 |
larger; it could have hundreds of thousands of conversations that are multi-turn, 01:02:33.020 |
very long, etc. And would cover a diverse breadth of topics. But here I'm only 01:02:37.780 |
showing three examples. But the way this works basically is the assistant is being 01:02:43.620 |
programmed by example. And where is this data coming from, like two times two 01:02:48.020 |
equals four, same as two plus two, etc. Where does that come from? This comes 01:02:51.540 |
from human labelers. So we will basically give human labelers some 01:02:55.820 |
conversational context. And we will ask them to basically give the ideal 01:03:00.100 |
assistant response in this situation. And a human will write out the ideal 01:03:05.540 |
response for an assistant in any situation. And then we're going to get 01:03:09.020 |
the model to basically train on this and to imitate those kinds of responses. So 01:03:15.220 |
the way this works, then is we are going to take our base model, which we 01:03:18.100 |
produced in the pre training stage. And this base model was trained on internet 01:03:22.300 |
documents, we're now going to take that data set of internet documents, and 01:03:25.340 |
we're going to throw it out. And we're going to substitute a new data set. And 01:03:29.540 |
that's going to be a data set of conversations. And we're going to 01:03:32.060 |
continue training the model on these conversations on this new data set of 01:03:35.540 |
conversations. And what happens is that the model will very rapidly adjust, and 01:03:40.620 |
will sort of like learn the statistics of how this assistant responds to human 01:03:45.500 |
queries. And then later during inference, we'll be able to basically 01:03:49.900 |
prime the assistant and get the response. And it will be imitating what 01:03:55.380 |
the human labelers would do in that situation, if that makes sense. 01:03:58.940 |
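A minimal sketch of that continued training, using a small GPT-2 checkpoint from Hugging Face transformers as a stand-in for a real base model: the special token names, the single hard-coded conversation, and the hyperparameters are illustrative only, and production pipelines typically also mask the loss so it is computed only on the assistant's tokens.

# Sketch of post-training (SFT): same next-token objective, new dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# New special tokens, created at post-training time (never seen in pre-training).
tok.add_special_tokens({"additional_special_tokens":
                        ["<|im_start|>", "<|im_sep|>", "<|im_end|>"]})
model.resize_token_embeddings(len(tok))  # make room for the new token embeddings

conversation = ("<|im_start|>user<|im_sep|>What is 2+2?<|im_end|>"
                "<|im_start|>assistant<|im_sep|>2+2 = 4.<|im_end|>")
batch = tok(conversation, return_tensors="pt")

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for step in range(3):  # a real run loops over many thousands of conversations
    out = model(**batch, labels=batch["input_ids"])  # the usual language-model loss
    out.loss.backward()
    opt.step()
    opt.zero_grad()
    print(step, out.loss.item())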
So we're going to see examples of that. And this is going to become a bit more 01:04:02.420 |
concrete. I also wanted to mention that this post training stage, we're going 01:04:06.140 |
to basically just continue training the model. But the pre training stage can in 01:04:11.100 |
practice take roughly three months of training on many thousands of computers. 01:04:15.460 |
The post training stage will typically be much shorter, like three hours, for 01:04:19.060 |
example. And that's because the data set of conversations that we're going to 01:04:22.900 |
create here manually is much, much smaller than the data set of text on the 01:04:27.980 |
internet. And so this training will be very short. But fundamentally, we're just 01:04:32.900 |
going to take our base model, we're going to continue training using the exact 01:04:36.380 |
same algorithm, the exact same everything, except we're swapping out 01:04:39.820 |
the data set for conversations. So the questions now are, what are these 01:04:43.580 |
conversations? How do we represent them? How do we get the model to see 01:04:47.620 |
conversations instead of just raw text? And then what are the outcomes of this 01:04:53.220 |
kind of training? And what do you get in a certain like psychological sense when 01:04:57.780 |
we talk about the model? So let's turn to those questions now. So let's start by 01:05:01.620 |
talking about the tokenization of conversations. Everything in these models 01:05:06.060 |
has to be turned into tokens, because everything is just about token 01:05:09.100 |
sequences. So how do we turn conversations into token sequences is 01:05:12.980 |
the question. And so for that, we need to design some kind of an encoding. And 01:05:17.220 |
this is kind of similar to maybe if you're familiar, you don't have to be 01:05:20.500 |
with, for example, TCP/IP packets on the internet: there are precise rules 01:05:25.580 |
and protocols for how you represent information, how everything is 01:05:28.500 |
structured together, so that you have all this kind of data laid out in a way 01:05:32.380 |
that is written out on a paper, and that everyone can agree on. And so it's the 01:05:36.340 |
same thing now happening in LLMs, we need some kind of data structures, and 01:05:39.860 |
we need to have some rules around how these data structures like 01:05:42.340 |
conversations, get encoded and decoded to and from tokens. And so I want to 01:05:47.780 |
show you now how I would recreate this conversation in the token space. So if 01:05:53.940 |
you go to TikTokenizer, I can take that conversation. And this is how it is 01:05:59.180 |
represented for the language model. So here we have a user 01:06:04.860 |
and an assistant alternating in this two turn conversation. And what you're seeing 01:06:09.940 |
here is it looks ugly, but it's actually relatively simple. The way it gets 01:06:13.540 |
turned into a token sequence here at the end is a little bit complicated. But at 01:06:17.900 |
the end, this conversation between the user and assistant ends up being 49 01:06:21.580 |
tokens, it is a one dimensional sequence of 49 tokens. And these are the tokens. 01:06:26.020 |
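To make the encoding idea concrete, here is a rough sketch of how a structured conversation might get flattened into a single string before tokenization. The token names mirror the im_start / im_sep / im_end scheme described in a moment; real chat templates differ from model to model, so treat this as illustrative only.

# Sketch: flatten a conversation into one string using made-up special tokens.
def render(messages, open_assistant_turn=False):
    s = ""
    for m in messages:
        s += f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>"
    if open_assistant_turn:
        # At inference time the server opens the assistant's turn and lets the
        # model sample the rest, token by token.
        s += "<|im_start|>assistant<|im_sep|>"
    return s

convo = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4."},
    {"role": "user", "content": "What if it was * instead of +?"},
]
print(render(convo, open_assistant_turn=True))
# This one-dimensional string is what gets tokenized and fed to the network.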
Okay. And all the different LLMs will have a slightly different format or 01:06:31.540 |
protocols. And it's a little bit of a Wild West right now. But for example, 01:06:35.980 |
GPT-4o does it in the following way. You have this special token called 01:06:41.140 |
im_start, which is short for imaginary monologue start (I don't actually 01:06:47.500 |
know why it's called that, to be honest). Then 01:06:51.940 |
you have to specify whose turn it is. So for example, user, which is a token 1428. 01:06:56.660 |
Then you have internal monologue separator. And then it's the exact 01:07:02.940 |
question. So the tokens of the question, and then you have to close it. So I am 01:07:07.140 |
end, the end of the imaginary monologue. So basically, the question from a user 01:07:12.860 |
of what is two plus two ends up being the token sequence of these tokens. And 01:07:19.340 |
now the important thing to mention here is that I am start, this is not text, 01:07:23.100 |
right? I am start is a special token that gets added, it's a new token. And 01:07:29.940 |
this token has never been trained on so far, it is a new token that we create in 01:07:33.860 |
a post training stage, and that we introduce. And so these special tokens like 01:07:38.900 |
im_sep, im_start, etc, are introduced and interspersed with text, so that they sort 01:07:44.420 |
of get the model to learn that, hey, this is the start of a turn, and 01:07:49.300 |
the turn is for the user. And then this is 01:07:53.780 |
what the user says, and then the user ends. And then it's a new start of a 01:07:57.780 |
turn, and it is by the assistant. And then what does the assistant say? Well, 01:08:02.460 |
these are the tokens of what the assistant says, etc. And so this 01:08:05.940 |
conversation is now turned into a sequence of tokens. The specific details 01:08:10.060 |
here are not actually that important. All I'm trying to show you in concrete 01:08:13.300 |
terms, is that our conversations, which we think of as kind of like a structured 01:08:17.300 |
object, end up being turned via some encoding into one dimensional sequences 01:08:22.300 |
of tokens. And so, because this is one dimensional sequence of tokens, we can 01:08:27.300 |
apply all this stuff that we applied before. Now it's just a sequence of 01:08:30.820 |
tokens. And now we can train a language model on it. And so we're just 01:08:34.740 |
predicting the next token in a sequence, just like before. And we can represent 01:08:39.780 |
and train on conversations. And then what does it look like at test time 01:08:43.580 |
during inference? So say we've trained a model. And we've trained a model on 01:08:48.660 |
these kinds of data sets of conversations. And now we want to 01:08:51.740 |
inference. So during inference, what does this look like when you're on 01:08:55.460 |
ChatGPT? Well, you come to ChatGPT, and you have, say, like a dialogue 01:09:00.020 |
with it. And the way this works is basically, say that this was already 01:09:06.180 |
filled in. So like, what is two plus two, two plus two is four. And now you 01:09:09.060 |
issue what if it was times, IM_END. And what basically ends up happening on 01:09:15.420 |
the servers of OpenAI or something like that, is they put an IM_START, 01:09:19.180 |
assistant, IM_SEP. And this is where they end it, right here. So they 01:09:24.580 |
construct this context. And now they start sampling from the model. So it's 01:09:29.180 |
at this stage that they will go to the model and say, okay, what is a good 01:09:32.100 |
first sequence? What is a good first token? What is a good second token? 01:09:35.820 |
What is a good third token? And this is where the LLM takes over and creates a 01:09:40.180 |
response, like for example, response that looks something like this, but it 01:09:44.540 |
doesn't have to be identical to this. But it will have the flavor of this, if 01:09:48.340 |
this kind of a conversation was in the data set. So that's roughly how the 01:09:53.140 |
protocol works. Although the details of this protocol are not important. So 01:09:58.100 |
again, my goal is just to show you that everything ends up being just a 01:10:01.900 |
one-dimensional token sequence. So we can apply everything we've already 01:10:05.140 |
seen. But now we're training on conversations, and we're basically 01:10:10.460 |
generating conversations as well. Okay, so now I would like to turn to what 01:10:14.020 |
these data sets look like in practice. The first paper that I would like to 01:10:17.180 |
show you and the first effort in this direction is this paper from OpenAI in 01:10:21.660 |
2022. And this paper was called InstructGPT, or the technique that they 01:10:26.380 |
developed. And this was the first time that OpenAI has kind of talked about 01:10:29.620 |
how you can take language models and fine tune them on conversations. And so 01:10:33.820 |
this paper has a number of details that I would like to take you through. So the 01:10:37.260 |
first stop I would like to make is in section 3.4, where they talk about the 01:10:41.220 |
human contractors that they hired, in this case from Upwork or through ScaleAI 01:10:46.540 |
to construct these conversations. And so there are human labelers involved 01:10:51.660 |
whose job it is professionally to create these conversations. And these 01:10:55.900 |
labelers are asked to come up with prompts, and then they are asked to also 01:11:00.060 |
complete the ideal assistant responses. And so these are the kinds of prompts 01:11:04.100 |
that people came up with. So these are human labelers. So list five ideas for 01:11:08.060 |
how to regain enthusiasm for my career. What are the top 10 science fiction 01:11:11.500 |
books I should read next? And there's many different types of kind of prompts 01:11:16.060 |
here. So translate the sentence to Spanish, etc. And so there's many things 01:11:21.500 |
here that people came up with. They first come up with the prompt, and then 01:11:25.460 |
they also answer that prompt, and they give the ideal assistant response. Now, 01:11:30.140 |
how do they know what is the ideal assistant response that they should 01:11:33.300 |
write for these prompts? So when we scroll down a little bit further, we see 01:11:37.260 |
that here we have this excerpt of labeling instructions that are given to 01:11:41.300 |
the human labelers. So the company that is developing the language model, like 01:11:45.220 |
for example, OpenAI, writes up labeling instructions for how the humans should 01:11:49.540 |
create ideal responses. And so here, for example, is an excerpt of these kinds of 01:11:54.780 |
labeling instructions. On a high level, you're asking people to be helpful, 01:11:58.100 |
truthful, and harmless. And you can pause the video if you'd like to see more 01:12:01.980 |
here. But on a high level, basically just answer, try to be helpful, try to be 01:12:06.380 |
truthful, and don't answer questions that we don't want kind of the system to 01:12:10.900 |
handle later in ChatGPT. And so, roughly speaking, the company comes up with the 01:12:17.020 |
labeling instructions. Usually they are not this short. Usually they are hundreds 01:12:20.340 |
of pages, and people have to study them professionally. And then they write out 01:12:24.900 |
the ideal assistant responses following those labeling instructions. So this is 01:12:29.620 |
a very human-heavy process, as it was described in this paper. Now, the data 01:12:34.260 |
set for InstructGPT was never actually released by OpenAI. But we do have some 01:12:38.140 |
open source reproductions that were trying to follow this kind of a setup and 01:12:42.700 |
collect their own data. So one that I'm familiar with, for example, is the effort 01:12:47.220 |
of Open Assistant from a while back. And this is just one of, I think, many 01:12:51.620 |
examples, but I just want to show you an example. So here's, so these were people 01:12:56.140 |
on the internet that were asked to basically create these conversations 01:12:59.020 |
similar to what OpenAI did with human labelers. And so here's an entry of a 01:13:04.980 |
person who came up with this prompt. Can you write a short introduction to the 01:13:08.300 |
relevance of the term monopsony in economics? Please use examples, etc. And 01:13:14.100 |
then the same person, or potentially a different person, will write up the 01:13:17.700 |
response. So here's the assistant response to this. And so then the same 01:13:22.260 |
person or different person will actually write out this ideal response. And then 01:13:28.380 |
this is an example of maybe how the conversation could continue. Now explain 01:13:31.980 |
it to a dog. And then you can try to come up with a slightly simpler 01:13:35.940 |
explanation or something like that. Now, this then becomes the label, and we end 01:13:41.300 |
up training on this. So what happens during training is that, of course, 01:13:47.620 |
we're not going to have a full coverage of all the possible questions that the 01:13:53.100 |
model will encounter at test time during inference. We can't possibly cover all 01:13:57.340 |
the possible prompts that people are going to be asking in the future. But if 01:14:01.220 |
we have a, like a data set of a few of these examples, then the model during 01:14:05.780 |
training will start to take on this persona of this helpful, truthful, 01:14:10.540 |
harmless assistant. And it's all programmed by example. And so these are 01:14:15.180 |
all examples of behavior. And if you have conversations of these example 01:14:18.740 |
behaviors, and you have enough of them, like 100,000, and you train on it, the 01:14:22.380 |
model sort of starts to understand the statistical pattern. And it kind of 01:14:25.700 |
takes on this personality of this assistant. Now, it's possible that when 01:14:30.300 |
you get the exact same question like this, at test time, it's possible that 01:14:35.460 |
the answer will be recited as exactly what was in the training set. But more 01:14:40.340 |
likely than that is that the model will kind of like do something of a similar 01:14:44.220 |
vibe. And it will understand that this is the kind of answer that you want. So 01:14:51.100 |
that's what we're doing. We're programming the system by example, and 01:14:55.540 |
the system adopts statistically, this persona of this helpful, truthful, 01:15:00.460 |
harmless assistant, which is kind of like reflected in the labeling 01:15:03.900 |
instructions that the company creates. Now, I want to show you that the state 01:15:07.540 |
of the art has kind of advanced in the last two or three years, since the 01:15:10.820 |
instruct GPT paper. So in particular, it's not very common for humans to be 01:15:15.060 |
doing all the heavy lifting just by themselves anymore. And that's because 01:15:18.260 |
we now have language models. And these language models are helping us create 01:15:21.300 |
these data sets and conversations. So it is very rare that the people will like 01:15:25.220 |
literally just write out the response from scratch, it is a lot more likely 01:15:28.820 |
that they will use an existing LLM to basically like, come up with an answer, 01:15:32.540 |
and then they will edit it, or things like that. So there's many different 01:15:35.740 |
ways in which LLMs have now started to kind of permeate this post training 01:15:41.020 |
stack. And LLMs are basically used pervasively to help create these massive 01:15:46.220 |
data sets of conversations. So, just to show you one: UltraChat is one such 01:15:51.540 |
example of like a more modern data set of conversations. It is to a very large 01:15:56.100 |
extent synthetic, but I believe there's some human involvement, I could be wrong 01:15:59.860 |
with that. Usually, there'll be a little bit of human, but there will be a huge 01:16:03.020 |
amount of synthetic help. And this is all kind of like, constructed in different 01:16:09.060 |
ways. And UltraChat is just one example of many SFT data sets that currently 01:16:12.420 |
exist. And the only thing I want to show you is that these data sets have now 01:16:16.540 |
millions of conversations. These conversations are mostly synthetic, but 01:16:20.220 |
they're probably edited to some extent by humans. And they span a huge diversity 01:16:24.620 |
of sort of areas and so on. So these are fairly extensive artifacts by now. And 01:16:33.780 |
there are all these like SFT mixtures, as they're called. So you have a mixture 01:16:37.300 |
of like lots of different types and sources, and it's partially synthetic, 01:16:40.540 |
partially human. And it's kind of like gone in that direction since. But roughly 01:16:46.620 |
speaking, we still have SFT data sets, they're made up of conversations, we're 01:16:50.500 |
training on them, just like we did before. And I guess like the last thing 01:16:56.500 |
to note is that I want to dispel a little bit of the magic of talking to an 01:17:01.220 |
AI. Like when you go to ChatGPT, and you give it a question, and then you hit 01:17:05.820 |
enter, what is coming back is kind of like statistically aligned with what's 01:17:12.340 |
happening in the training set. And these training sets, I mean, they really just 01:17:16.420 |
have a seed in humans following labeling instructions. So what are you actually 01:17:21.700 |
talking to in ChatGPT? Or how should you think about it? Well, it's not coming 01:17:25.660 |
from some magical AI, like roughly speaking, it's coming from something 01:17:29.340 |
that is statistically imitating human labelers, which comes from labeling 01:17:33.980 |
instructions written by these companies. And so you're kind of imitating this, 01:17:37.620 |
you're kind of getting, it's almost as if you're asking a human labeler. And 01:17:42.220 |
imagine that the answer that is given to you from ChatGPT is some kind of a 01:17:46.660 |
simulation of a human labeler. And it's kind of like asking what would a human 01:17:51.620 |
labeler say in this kind of a conversation. And it's not just like this 01:17:58.620 |
human labeler is not just like a random person from the internet, because these 01:18:01.580 |
companies actually hire experts. So for example, when you are asking questions 01:18:04.780 |
about code, and so on, the human labelers that would be involved in creation of 01:18:09.100 |
these conversation datasets, they will usually be educated expert people. And 01:18:14.180 |
you're kind of like asking a question of like a simulation of those people, if 01:18:18.580 |
that makes sense. So you're not talking to a magical AI, you're talking to an 01:18:21.860 |
average labeler, this average labeler is probably fairly highly skilled, but 01:18:25.500 |
you're talking to kind of like an instantaneous simulation of that kind of 01:18:29.340 |
a person that would be hired in the construction of these datasets. So let 01:18:34.620 |
me give you one more specific example before we move on. For example, when I 01:18:38.460 |
go to ChatGPT, and I say, recommend the top five landmarks to see in Paris, and 01:18:42.340 |
then I hit enter. Okay, here we go. Okay, when I hit enter, what's coming out 01:18:53.060 |
here? How do I think about it? Well, it's not some kind of a magical AI that has 01:18:57.900 |
gone out and researched all the landmarks and then ranked them using its 01:19:01.580 |
infinite intelligence, etc. What I'm getting is a statistical simulation of a 01:19:06.340 |
labeler that was hired by OpenAI; you can think about it roughly in that way. 01:19:10.460 |
And so if this specific question is in the post training dataset, somewhere at 01:19:17.700 |
OpenAI, then I'm very likely to see an answer that is probably very, very 01:19:21.980 |
similar to what that human labeler would have put down for those five 01:19:26.060 |
landmarks. How does the human labeler come up with this? Well, they go off and 01:19:29.100 |
they go on the internet, and they kind of do their own little research for 20 01:19:31.780 |
minutes, and they just come up with a list, right? Now, so if they come up with 01:19:35.740 |
this list, and this is in the dataset, I'm probably very likely to see what 01:19:39.580 |
they submitted as the correct answer from the assistant. Now, if this 01:19:44.580 |
specific query is not part of the post training dataset, then what I'm getting 01:19:48.300 |
here is a little bit more emergent. Because the model kind of understands 01:19:53.540 |
that statistically, the kinds of landmarks that are in the training set 01:19:58.340 |
are usually the prominent landmarks, the landmarks that people usually want to 01:20:01.380 |
see, the kinds of landmarks that are usually very often talked about on the 01:20:05.980 |
internet. And remember that the model already has a ton of knowledge from its 01:20:09.500 |
pre-training on the internet. So it's probably seen a ton of conversations 01:20:12.820 |
about Paris, about landmarks, about the kinds of things that people like to see. 01:20:16.140 |
And so it's the pre-training knowledge that is then combined with the post 01:20:19.460 |
training dataset that results in this kind of an imitation. So that's, that's 01:20:26.460 |
roughly how you can kind of think about what's happening behind the scenes here 01:20:30.660 |
in the statistical sense. Okay, now I want to turn to the topic of LLM 01:20:34.900 |
psychology, as I like to call it, by which I mean the sort of emergent cognitive 01:20:38.820 |
effects of the training pipeline that we have for these models. So in particular, 01:20:43.740 |
the first one I want to talk about is, of course, hallucinations. So you might be 01:20:50.100 |
familiar with model hallucinations. It's when LLMs make stuff up, they just 01:20:53.460 |
totally fabricate information, etc. And it's a big problem with LLM assistants. 01:20:57.700 |
It is a problem that existed to a large extent with early models for many years 01:21:02.020 |
ago. And I think the problem has gotten a bit better, because there are some 01:21:05.580 |
mitigations that I'm going to go into in a second. For now, let's just try to 01:21:08.860 |
understand where these hallucinations come from. So here's a specific example 01:21:13.220 |
of three conversations that you might think you have in your training 01:21:17.740 |
set. And these are pretty reasonable conversations that you could imagine 01:21:22.300 |
being in the training set. So like, for example, who is Tom Cruise? Well, Tom 01:21:25.700 |
Cruise is a famous actor, American actor and producer, etc. Who is John Barrasso? 01:21:30.460 |
This turns out to be a US senator, for example. Who is Genghis Khan? Well, 01:21:35.820 |
Genghis Khan was blah, blah, blah. And so this is what your conversations could 01:21:40.220 |
look like at training time. Now, the problem with this is that when the human 01:21:45.180 |
is writing the correct answer for the assistant, in each one of these cases, 01:21:49.700 |
the human either like knows who this person is, or they research them on the 01:21:53.020 |
internet, and they come in, and they write this response that kind of has 01:21:56.340 |
this like confident tone of an answer. And what happens basically is that at 01:22:00.380 |
test time, when you ask who someone is, and this is a totally random name that I 01:22:04.260 |
totally came up with, and I don't think this person exists. As far as I know, I 01:22:08.980 |
just tried to generate it randomly. The problem is when we ask who is Orson 01:22:12.860 |
Kovats, the problem is that the assistant will not just tell you, oh, I 01:22:17.660 |
don't know. Even if the assistant and the language model itself might know 01:22:22.740 |
inside its features inside its activations inside of its brain sort of, 01:22:26.340 |
it might know that this person is not someone that it's 01:22:30.780 |
familiar with, even if some part of the network kind of knows that in some 01:22:33.980 |
sense, saying, oh, I don't know who this is, is not going to 01:22:39.340 |
happen. Because the model statistically imitates its training set. In the 01:22:44.460 |
training set, the questions of the form who is blah are confidently answered 01:22:48.620 |
with the correct answer. And so it's going to take on the style of the 01:22:52.500 |
answer, and it's going to do its best, it's going to give you statistically 01:22:55.900 |
the most likely guess, and it's just going to basically make stuff up. 01:22:59.020 |
Because these models, again, we just talked about it is they don't have 01:23:02.620 |
access to the internet, they're not doing research. These are statistical 01:23:05.940 |
token tumblers, as I call them, is just trying to sample the next token in the 01:23:09.860 |
sequence. And it's gonna basically make stuff up. So let's take a look at what 01:23:13.860 |
this looks like. I have here what's called the inference playground from 01:23:19.140 |
Hugging Face. And I am on purpose picking on a model called Falcon 7B, 01:23:23.820 |
which is an old model. This is a few years ago now. So it's an older model. 01:23:28.020 |
So it suffers from hallucinations. And as I mentioned, this has improved over 01:23:31.780 |
time recently. But let's say who is Orson Kovats? Let's ask Falcon 7b 01:23:36.060 |
instruct. Run. Oh, yeah, Orson Kovats is an American author and science 01:23:41.020 |
fiction writer. Okay. That's totally false. It's a hallucination. Let's try 01:23:45.700 |
again. These are statistical systems, right? So we can resample. This time, 01:23:50.060 |
Orson Kovats is a fictional character from this 1950s TV show. It's total BS, 01:23:55.020 |
right? Let's try again. He's a former minor league baseball player. Okay, so 01:24:01.220 |
it basically the model doesn't know. And it's given us lots of different 01:24:04.700 |
answers. Because it doesn't know. It's just kind of like sampling from these 01:24:08.540 |
probabilities. The model starts with the tokens who is Orson Kovats 01:24:12.460 |
assistant, and then it comes in here. And it's good. It's getting these 01:24:17.740 |
probabilities. And it's just sampling from the probabilities. And it just 01:24:20.540 |
like comes up with stuff. And the stuff is actually statistically consistent 01:24:26.100 |
with the style of the answer in its training set. And it's just doing that. 01:24:30.580 |
But you and I experienced it as a made up factual knowledge. But keep in mind 01:24:35.100 |
that the model basically doesn't know. And it's just imitating the format of 01:24:38.820 |
the answer. And it's not going to go off and look it up. Because it's just 01:24:43.100 |
imitating, again, the answer. So how can we mitigate this? Because for example, 01:24:47.860 |
when we go to ChatGPT, and I say, who is Orson Kovats, and I'm now asking the 01:24:51.580 |
state of the art model from OpenAI, this model will tell you. Oh, 01:24:57.820 |
so this model is actually even smarter, because you saw very briefly, 01:25:02.340 |
it said, searching the web, we're going to cover this later. It's actually 01:25:06.940 |
trying to do tool use. And kind of just like came up with some kind of a story. 01:25:13.820 |
But I want to just ask who is Orson Kovats and tell it not to use any tools; I don't want it to do 01:25:19.700 |
web search. And it says there is no well-known historical or public figure named Orson 01:25:26.060 |
Kovats. So this model is not going to make up stuff. This model knows that it 01:25:30.060 |
doesn't know. And it tells you that it doesn't appear to be a person that this 01:25:33.460 |
model knows. So somehow, we sort of improved hallucinations, even though they 01:25:38.620 |
clearly are an issue in older models. And it makes totally sense why you would be 01:25:44.180 |
getting these kinds of answers, if this is what your training set looks like. So 01:25:47.740 |
how do we fix this? Okay, well, clearly, we need some examples in our data set, 01:25:51.860 |
where the correct answer for the assistant is that the model doesn't know 01:25:56.700 |
about some particular fact. But we only need to have those answers be produced 01:26:01.780 |
in the cases where the model actually doesn't know. And so the question is, how 01:26:05.140 |
do we know what the model knows or doesn't know? Well, we can empirically 01:26:08.700 |
probe the model to figure that out. So let's take a look at, for example, how 01:26:12.900 |
Meta dealt with hallucinations for the Llama 3 series of models as an 01:26:18.060 |
example. So in this paper that Meta published, we can go into 01:26:21.740 |
hallucinations, which they call here factuality. And they describe the 01:26:28.140 |
procedure by which they basically interrogate the model to figure out what 01:26:32.380 |
it knows and doesn't know to figure out sort of like the boundary of its 01:26:35.540 |
knowledge. And then they add examples to the training set, where for the things 01:26:42.940 |
where the model doesn't know them, the correct answer is that the model doesn't 01:26:46.660 |
know them, which sounds like a very easy thing to do in principle. But this 01:26:51.300 |
roughly fixes the issue. And the reason it fixes the issue is because remember 01:26:57.020 |
like, the model might actually have a pretty good model of its self knowledge 01:27:02.220 |
inside the network. So remember, we looked at the network and all these 01:27:06.380 |
neurons inside the network, you might imagine there's a neuron somewhere in 01:27:10.260 |
the network, that sort of like lights up for when the model is uncertain. But 01:27:15.180 |
the problem is that the activation of that neuron is not currently wired up to 01:27:19.820 |
the model actually saying in words that it doesn't know. So even though the 01:27:23.460 |
internals of the neural network know, because there's some neurons that 01:27:26.540 |
represent that, the model will not surface it; it will instead take its 01:27:31.380 |
best guess so that it sounds confident, just like it sees in its training set. So 01:27:36.420 |
we need to basically interrogate the model and allow it to say I don't know 01:27:40.380 |
in the cases that it doesn't know. So let me take you through what meta roughly 01:27:43.820 |
does. So basically what they do is, here I have an example: Dominik Hašek is the 01:27:50.220 |
featured article today. So I just went there randomly. And what they do is 01:27:54.380 |
basically they take a random document in a training set, and they take a 01:27:58.420 |
paragraph, and then they use an LLM to construct questions about that 01:28:03.620 |
paragraph. So for example, I did that with chat GPT here. So I said, here's a 01:28:11.140 |
paragraph from this document, generate three specific factual questions based 01:28:15.220 |
on this paragraph, and give me the questions and the answers. And so the 01:28:19.100 |
LLMs are already good enough to create and reframe this information. So if the 01:28:24.780 |
information is in the context window of this LLM, this actually works pretty 01:28:29.940 |
well, it doesn't have to rely on its memory. It's right there in the context 01:28:33.540 |
window. And so it can basically reframe that information with fairly high 01:28:38.100 |
accuracy. So for example, it can generate questions for us like, for which team 01:28:42.060 |
did he play? Here's the answer. How many cups did he win, etc. And now what we 01:28:46.900 |
have to do is we have some question and answers. And now we want to interrogate 01:28:50.300 |
the model. So roughly speaking, what we'll do is we'll take our questions. 01:28:53.740 |
And we'll go to our model, which would be, say, Llama at Meta. But let's just 01:28:59.380 |
interrogate Mistral 7B here as an example. That's another model. So does 01:29:04.060 |
this model know about this answer? Let's take a look. So he played for Buffalo 01:29:11.220 |
Sabres, right? So the model knows. And the way that you can programmatically 01:29:15.740 |
decide is basically we're going to take this answer from the model. And we're 01:29:20.100 |
going to compare it to the correct answer. And again, the models are good 01:29:24.220 |
enough to do this automatically. So there's no humans involved here. We can 01:29:27.620 |
take basically the answer from the model. And we can use another LLM judge to 01:29:32.780 |
check if that is correct, according to this answer. And if it is correct, that 01:29:36.460 |
means that the model probably knows. So what we're going to do is do 01:29:40.020 |
this maybe a few times. So okay, it knows it's Buffalo Sabres. Let's try again. 01:29:43.660 |
Buffalo Sabres. Let's try one more time. Buffalo Sabres. So we asked three times 01:29:54.020 |
about this factual question, and the model seems to know. So everything is 01:29:57.860 |
great. Now let's try the second question. How many Stanley Cups did he win? 01:30:03.180 |
And again, let's interrogate the model about that. And the correct answer is 01:30:05.660 |
two. So here, the model claims that he won four times, which is not correct, 01:30:16.740 |
right? It doesn't match two. So the model doesn't know it's making stuff up. 01:30:20.580 |
Let's try again. So here the model again, it's kind of like making stuff up, 01:30:30.260 |
right? Let's try again. Here it says he did not even, did not win during his 01:30:37.780 |
career. So obviously the model doesn't know. And the way we can programmatically 01:30:41.620 |
tell again is we interrogate the model three times, and we compare its answers 01:30:45.620 |
maybe three times, five times, whatever it is, to the correct answer. And if the 01:30:50.300 |
model doesn't know, then we know that the model doesn't know this question. And 01:30:53.820 |
then what we do is we take this question, we create a new conversation in the 01:30:59.020 |
training set. So we're going to add a new conversation to the training set. And when 01:31:03.100 |
the question is, how many Stanley Cups did he win? The answer is, I'm sorry, I 01:31:07.580 |
don't know, or I don't remember. And that's the correct answer for this 01:31:11.500 |
question, because we interrogated the model and we saw that that's the case. 01:31:14.460 |
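As a sketch, the probing loop described here might look like the following. The chat endpoint, the model id, and the equality check are placeholders (a real pipeline like the one in the Llama 3 paper would use an LLM judge rather than a naive string match), but the control flow is the idea: ask repeatedly, compare to the reference, and emit an "I don't know" training example when the model consistently fails.

# Sketch of knowledge probing: interrogate a model and harvest "I don't know" examples.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")  # placeholder endpoint

def ask(question, model="mistralai/Mistral-7B-Instruct-v0.3"):  # assumed model id
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=1.0,
    )
    return r.choices[0].message.content

def knows(question, reference, attempts=3):
    # Naive check: does the reference answer appear in every sampled answer?
    # A real pipeline would ask an LLM judge to compare the answers instead.
    return all(reference.lower() in ask(question).lower() for _ in range(attempts))

qa_pairs = [("For which NHL team did Dominik Hasek play?", "Buffalo Sabres"),
            ("How many Stanley Cups did Dominik Hasek win?", "two")]

new_examples = []
for question, reference in qa_pairs:
    if not knows(question, reference):
        new_examples.append({"user": question,
                             "assistant": "I'm sorry, I don't know."})
print(new_examples)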
If you do this for many different types of questions, for many different types of 01:31:20.140 |
documents, you are giving the model an opportunity to, in its training set, 01:31:24.860 |
refuse to answer based on its knowledge. And if you just have a few examples of 01:31:28.860 |
that, in your training set, the model will know and has the opportunity to 01:31:34.140 |
learn the association of this knowledge-based refusal to this internal 01:31:39.420 |
neuron somewhere in its network that we presume exists. And empirically, this 01:31:43.660 |
turns out to be probably the case. And it can learn that association that, hey, 01:31:47.660 |
when this neuron of uncertainty is high, then I actually don't know. And I'm 01:31:52.860 |
allowed to say that, I'm sorry, but I don't think I remember this, etc. And if 01:31:57.420 |
you have these examples in your training set, then this is a large mitigation for 01:32:02.460 |
hallucination. And that's, roughly speaking, why ChatGPT is able to do 01:32:06.620 |
stuff like this as well. So these are the kinds of mitigations that people have 01:32:10.940 |
implemented and that have improved the factuality issue over time. Okay, so I've 01:32:15.740 |
described mitigation number one for basically mitigating the hallucinations 01:32:20.060 |
issue. Now, we can actually do much better than that. It's, instead of just 01:32:25.820 |
saying that we don't know, we can introduce an additional mitigation 01:32:29.420 |
number two to give the LLM an opportunity to be factual and actually 01:32:33.500 |
answer the question. Now, what do you and I do if I was to ask you a factual 01:32:38.540 |
question and you don't know? What would you do in order to answer the question? 01:32:43.260 |
Well, you could go off and do some search and use the internet, and you 01:32:47.740 |
could figure out the answer and then tell me what that answer is. And we can 01:32:52.380 |
do the exact same thing with these models. So think of the knowledge inside 01:32:56.780 |
the neural network, inside its billions of parameters. Think of that as kind of 01:33:00.860 |
a vague recollection of the things that the model has seen during its training, 01:33:05.820 |
during the pre-training stage, a long time ago. So think of that knowledge in 01:33:09.660 |
the parameters as something you read a month ago. And if you keep reading 01:33:14.220 |
something, then you will remember it and the model remembers that. But if it's 01:33:17.580 |
something rare, then you probably don't have a really good recollection of that 01:33:20.540 |
information. But what you and I do is we just go and look it up. Now, when you 01:33:24.620 |
go and look it up, what you're doing basically is like you're refreshing 01:33:27.180 |
your working memory with information, and then you're able to sort of like 01:33:30.700 |
retrieve it, talk about it, or etc. So we need some equivalent of allowing the 01:33:35.020 |
model to refresh its memory or its recollection. And we can do that by 01:33:39.740 |
introducing tools for the models. So the way we are going to approach this is 01:33:45.020 |
that instead of just saying, "Hey, I'm sorry, I don't know," we can attempt to 01:33:48.860 |
use tools. So we can create a mechanism by which the language model can emit 01:33:55.900 |
special tokens. And these are tokens that we're going to introduce, new 01:33:59.020 |
tokens. So for example, here I've introduced two tokens, and I've 01:34:03.500 |
introduced a format or a protocol for how the model is allowed to use these 01:34:07.820 |
tokens. So for example, when the model 01:34:11.820 |
does not know the answer, instead of just saying, "I'm sorry, I don't know," the model now has the 01:34:15.740 |
option of emitting the special token search start. And this is the query 01:34:20.060 |
that will go to like bing.com in the case of OpenAI or say Google search or 01:34:23.740 |
something like that. So we'll emit the query, and then it will emit search 01:34:28.140 |
end. And then here, what will happen is that the program that is sampling 01:34:33.580 |
from the model that is running the inference, when it sees the special 01:34:37.180 |
token search end, instead of sampling the next token in the sequence, it 01:34:42.620 |
will actually pause generating from the model, it will go off, it will open a 01:34:47.020 |
session with bing.com, and it will paste the search query into bing. And it 01:34:52.140 |
will then get all the text that is retrieved. And it will basically take 01:34:56.940 |
that text, it will maybe represent it again with some other special tokens or 01:35:00.300 |
something like that. And it will take that text and it will copy paste it 01:35:03.740 |
here, into what I tried to show with the brackets. So all that text kind of 01:35:08.860 |
comes here. And when the text comes here, it enters the context window. So 01:35:14.460 |
the model, so that text from the web search is now inside the context window 01:35:19.260 |
that will feed into the neural network. And you should think of the context 01:35:22.460 |
window as kind of like the working memory of the model. That data that is 01:35:26.380 |
in the context window is directly accessible by the model, it directly 01:35:29.900 |
feeds into the neural network. So it's no longer a vague recollection; the 01:35:34.220 |
data that it has in the context window is directly available to the 01:35:38.300 |
model. So now when it's sampling new tokens here afterwards, it can 01:35:43.500 |
reference very easily the data that has been copy pasted in there. So that's 01:35:48.620 |
roughly how these tool-use mechanisms function. And so web search is 01:35:54.940 |
just one of the tools, we're going to look at some of the other tools in a 01:35:57.180 |
bit. But basically, you introduce new tokens, you introduce some schema by 01:36:01.340 |
which the model can utilize these tokens and can call these special 01:36:05.100 |
functions like web search functions. And how do you teach the model how to 01:36:09.020 |
correctly use these tools, like say web search, search start, search end, etc.? 01:36:13.100 |
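(Before getting to that, just to make the mechanics concrete, here is a minimal sketch of what the sampling loop could look like when it intercepts these special tokens. The token names and the two helper functions are hypothetical, purely for illustration; this is not any provider's actual implementation.)

    def generate_with_search(sample_next_token, run_web_search, context):
        # sample_next_token and run_web_search are assumed to be provided:
        # the first wraps the language model, the second hits bing.com or similar.
        search_start = None
        while True:
            token = sample_next_token(context)          # ordinary next-token sampling
            context.append(token)
            if token == "<SEARCH_START>":
                search_start = len(context)             # the query tokens begin after this
            elif token == "<SEARCH_END>" and search_start is not None:
                query = "".join(context[search_start:-1])
                results = run_web_search(query)         # pause generation, go do the search
                # paste the retrieved text back into the context window,
                # where it becomes part of the model's working memory
                context.extend(["<SEARCH_RESULT>", results, "<SEARCH_RESULT_END>"])
                search_start = None
            elif token == "<END_OF_RESPONSE>":
                return context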
Well, again, you do that through training sets. So we need now to have a 01:36:16.540 |
bunch of data, and a bunch of conversations that show the model by 01:36:20.940 |
example, how to use web search. So what are the settings where 01:36:25.740 |
you're using the search? And what does that look like? And here's by 01:36:29.420 |
example, how you do a search start and a search end, etc. And if you have a few 01:36:34.700 |
thousand examples, maybe, of that in your training set, the model will actually 01:36:38.140 |
do a pretty good job of understanding how this tool works. And it will know 01:36:42.140 |
how to sort of structure its queries. And of course, because of the 01:36:45.260 |
pre-training data set, and its understanding of the world, it actually 01:36:48.780 |
kind of understands what a web search is. And so it actually kind of has a 01:36:51.820 |
pretty good native understanding of what kind of stuff is a good search 01:36:56.300 |
query. And so it all kind of just like works, you just need a little bit of a 01:37:00.620 |
few examples to show it how to use this new tool. And then it can lean on it to 01:37:04.940 |
retrieve information, and put it in the context window. And that's 01:37:08.540 |
equivalent to you and I looking something up. Because once it's in the 01:37:12.140 |
context, it's in the working memory, and it's very easy to manipulate and 01:37:14.780 |
access. So that's what we saw a few minutes ago, when I was searching on 01:37:19.580 |
ChatGPT for who is Orson Kovats. The ChatGPT language model decided that 01:37:23.900 |
this is some kind of a rare individual or something like that. And instead 01:37:29.020 |
of giving me an answer from its memory, it decided that it will sample a 01:37:32.220 |
special token that is going to do a web search. And we saw briefly something 01:37:35.900 |
flash was like using the web tool or something like that. So it briefly said 01:37:39.740 |
that, and then we waited for like two seconds, and then it generated this. 01:37:42.940 |
And you see how it's creating references here. And so it's citing 01:37:46.780 |
sources. So what happened here is, it went off, it did a web search, it 01:37:52.460 |
found these sources and these URLs. And the text of these web pages was all 01:37:58.620 |
stuffed in between here. And it's not shown here, but it's basically 01:38:02.860 |
stuffed as text in between here. And now it sees that text. And now it 01:38:08.460 |
kind of references it and says that, okay, it could be these people 01:38:12.300 |
citation, it could be those people citation, etc. So that's what happened 01:38:15.740 |
here. And that's why, when I said who is Orson Kovats, I 01:38:19.260 |
could also say, don't use any tools. And then that's enough to basically 01:38:24.460 |
convince ChatGPT to not use tools and just use its memory and its 01:38:27.420 |
recollection. I also went off and I tried to ask this question of ChatGPT. 01:38:34.780 |
So how many Stanley Cups did Dominik Hasek win? And ChatGPT actually 01:38:39.100 |
decided that it knows the answer. And it has the confidence to say that he 01:38:42.540 |
won twice. And so it kind of just relied on its memory because presumably it 01:38:46.620 |
has enough of a kind of confidence in its weights and its 01:38:53.420 |
parameters and activations that this is retrievable just from memory. But 01:38:59.020 |
you can also conversely use web search to make sure. And then for the same 01:39:05.020 |
query, it actually goes off and it searches and then it finds a bunch of 01:39:08.380 |
sources. It finds all this. All of this stuff gets copy pasted in there and 01:39:12.780 |
then it tells us the answer again and cites sources. And it actually cites the Wikipedia 01:39:18.060 |
article, which is the source of this information for us as well. So that's 01:39:23.260 |
tools, web search. The model determines when to search. And then that's kind 01:39:27.660 |
of like how these tools work. And this is an additional kind of mitigation 01:39:33.020 |
for hallucinations and factuality. So I want to stress one more time this 01:39:37.340 |
very important sort of psychology point. Knowledge in the parameters of the 01:39:43.020 |
neural network is a vague recollection. The knowledge in the tokens that make 01:39:47.180 |
up the context window is the working memory. And it roughly speaking works 01:39:52.140 |
kind of like it works for us in our brain. The stuff we remember is our 01:39:56.860 |
parameters and the stuff that we just experienced like a few seconds or 01:40:01.820 |
minutes ago and so on. You can imagine that being in our context window. And 01:40:04.780 |
this context window is being built up as you have a conscious experience 01:40:07.820 |
around you. So this has a bunch of implications also for your use of LLMs 01:40:13.100 |
in practice. So for example, I can go to ChatGPT and I can do something 01:40:17.260 |
like this. I can say, can you summarize chapter one of Jane Austen's Pride and 01:40:20.380 |
Prejudice, right? And this is a perfectly fine prompt. And ChatGPT 01:40:24.940 |
actually does something relatively reasonable here. And the reason it does 01:40:27.980 |
that is because ChatGPT has a pretty good recollection of a famous work 01:40:31.740 |
like Pride and Prejudice. It's probably seen a ton of stuff about it. There's 01:40:35.180 |
probably forums about this book. It's probably read versions of this book. 01:40:38.540 |
And it kind of remembers, because even if you've read this or articles 01:40:45.820 |
about it, you'd have enough of a recollection to actually say all 01:40:48.620 |
this. But usually when I actually interact with LLMs and I want them to 01:40:52.060 |
recall specific things, it always works better if you just give it to them. 01:40:55.660 |
So I think a much better prompt would be something like this. Can you summarize 01:40:59.740 |
for me chapter one of Jane Austen's Pride and Prejudice? And then I am 01:41:03.020 |
attaching it below for your reference. And then I do something like a 01:41:05.500 |
delimiter here and I paste it in. And I found that just copy pasting it from 01:41:10.300 |
some website that I found here. So copy pasting the chapter one here. And I do 01:41:16.060 |
that because when it's in the context window, the model has direct access to 01:41:19.740 |
it and can reference it exactly; it doesn't have to recall it. It just has direct access to 01:41:23.740 |
it. And so this summary can be expected to be of significantly higher 01:41:27.900 |
quality than the earlier summary, just because the text is directly 01:41:31.980 |
available to the model. And I think you and I would work in the same way. 01:41:35.340 |
You would produce a much better summary if you 01:41:38.620 |
had re-read this chapter before you had to summarize it. And that's basically 01:41:43.820 |
what's happening here or the equivalent of it. The next sort of psychological 01:41:47.580 |
quirk I'd like to talk about briefly is that of the knowledge of self. So what 01:41:51.740 |
I see very often on the internet is that people do something like this. They 01:41:55.260 |
ask LLMs something like, what model are you and who built you? And basically 01:42:00.300 |
this question is a little bit nonsensical. And the reason I say that is 01:42:03.980 |
that as I tried to kind of explain with some of the under the hood 01:42:06.940 |
fundamentals, this thing is not a person, right? It doesn't have a 01:42:10.380 |
persistent existence in any way. It sort of boots up, processes tokens and 01:42:15.900 |
shuts off. And it does that for every single person. It just kind of builds 01:42:18.860 |
up a context window of conversation and then everything gets deleted. And so 01:42:22.540 |
this entity is kind of like restarted from scratch every single conversation, 01:42:26.060 |
if that makes sense. It has no persistent self, has no sense of self. It's a 01:42:29.660 |
token tumbler and it follows the statistical regularities of its training 01:42:34.540 |
set. So it doesn't really make sense to ask it, who are you, what built you, 01:42:39.020 |
et cetera. And by default, if you do what I described and just by default and 01:42:44.060 |
from nowhere, you're going to get some pretty random answers. So for example, 01:42:46.700 |
let's pick on Falcon, which is a fairly old model, and let's see what it tells 01:42:51.100 |
us. So it's evading the question, talented engineers and developers. Here 01:42:57.900 |
it says I was built by OpenAI based on the GPT-3 model. It's totally making 01:43:01.580 |
stuff up. Now, the fact that it says it's built by OpenAI here, I think a lot of 01:43:05.660 |
people would take this as evidence that this model was somehow trained on 01:43:08.780 |
OpenAI data or something like that. I don't actually think that that's necessarily 01:43:11.820 |
true. The reason for that is that if you don't explicitly program the model to 01:43:18.300 |
answer these kinds of questions, then what you're going to get is its 01:43:21.820 |
statistical best guess at the answer. And this model had an SFT data mixture of 01:43:29.020 |
conversations. And during the fine tuning, the model sort of understands as 01:43:35.820 |
it's training on this data, that it's taking on this personality of this like 01:43:39.500 |
helpful assistant. But it wasn't actually 01:43:43.500 |
told exactly what label to apply to itself. It just kind of takes on 01:43:48.460 |
this persona of a helpful assistant. And remember that the pre-training stage 01:43:54.300 |
took the documents from the entire internet. And ChatGPT and OpenAI are 01:43:58.380 |
very prominent in these documents. And so I think what's actually likely to be 01:44:02.460 |
happening here is that this is just its hallucinated label for what it is. Its 01:44:07.180 |
self-identity is that it's ChatGPT by OpenAI. And it's only saying that 01:44:11.820 |
because there's a ton of data on the internet of answers like this that are 01:44:17.420 |
actually coming from OpenAI, from ChatGPT. And so that's its label for what it 01:44:21.980 |
is. Now, you can override this as a developer, if you have an LLM model, you 01:44:26.780 |
can actually override it. And there are a few ways to do that. So for example, 01:44:30.220 |
let me show you, there's this Olmo model from Allen AI. And this is one LLM. 01:44:36.620 |
It's not a top tier LLM or anything like that. But I like it because it's fully 01:44:39.900 |
open source. So the paper for Olmo and everything else is completely fully open 01:44:43.580 |
source, which is nice. So here we are looking at its SFT mixture. So this is 01:44:48.300 |
the data mixture of the fine tuning. So this is the conversations data set, 01:44:52.940 |
right. And so the way that they are solving it for the Olmo model, is we see 01:44:57.340 |
that there's a bunch of stuff in the mixture. And there's a total of 1 01:44:59.580 |
million conversations here. But here we have Olmo two hard coded. If we go 01:45:04.940 |
there, we see that this is 240 conversations. And look at these 240 01:45:10.220 |
conversations, they're hard coded, tell me about yourself, says user. And then 01:45:15.260 |
the assistant says, I'm Olmo, an open language model developed by AI2, Allen 01:45:18.940 |
Institute for Artificial Intelligence, etc. I'm here to help, blah, blah, blah. 01:45:22.540 |
What is your name? The Olmo project. So these are all kinds of like cooked up 01:45:26.700 |
hard coded questions about Olmo two, and the correct answers to give in these 01:45:30.940 |
cases. If you take 240 questions like this, or conversations, put them into 01:45:35.740 |
your training set and fine tune with it, then the model will actually be 01:45:38.780 |
expected to parrot this stuff later. If you don't give it this, then it's 01:45:44.380 |
probably going to say it's ChatGPT by OpenAI. And there's one more way to sometimes do this: 01:45:50.380 |
basically, in these conversations, you have turns between human and 01:45:56.220 |
assistant, sometimes there's a special message called system message, at the 01:46:00.220 |
very beginning of the conversation. So it's not just between human and 01:46:03.340 |
assistant, there's a system. And in the system message, you can actually 01:46:07.100 |
hard code and remind the model that, hey, you are a model developed by 01:46:11.500 |
OpenAI. And your name is ChatGPT 4.0. And you were trained on this date, and 01:46:16.940 |
your knowledge cutoff is this. And basically, it kind of like documents the 01:46:20.140 |
model a little bit. And then this is inserted into your conversations. So 01:46:23.820 |
when you go on ChatGPT, you see a blank page, but actually the system 01:46:26.700 |
message is kind of like hidden in there. And those tokens are in the context 01:46:29.900 |
window. And so those are the two ways to kind of program the models to talk 01:46:35.420 |
about themselves: it's either done through data like this, or through a 01:46:40.380 |
system message and things like that, basically invisible tokens that are in 01:46:43.900 |
the context window, and remind the model of its identity. But it's all just 01:46:47.660 |
kind of like cooked up and bolted on in some way, it's not actually 01:46:51.500 |
like really deeply there in any real sense, as it would be for a human. I 01:46:56.540 |
want to now continue to the next section, which deals with the 01:46:59.340 |
computational capabilities, or like I should say, the native computational 01:47:02.460 |
capabilities of these models in problem solving scenarios. And so in 01:47:06.220 |
particular, we have to be very careful with these models when we construct 01:47:09.340 |
our examples of conversations. And there are a lot of sharp edges here 01:47:12.780 |
that are kind of like elucidative, is that a word? They're kind of like 01:47:16.300 |
interesting to look at when we consider how these models think. So consider 01:47:22.060 |
the following prompt from a human. And suppose that basically that we are 01:47:25.580 |
building out a conversation to enter into our training set of conversations. 01:47:28.700 |
So we're going to train the model on this, we're teaching you how to 01:47:31.340 |
basically solve simple math problems. So the prompt is, Emily buys three 01:47:35.580 |
apples and two oranges, each orange costs $2, the total cost is $13. What is 01:47:39.740 |
the cost of apples? Very simple math question. Now, there are two answers 01:47:44.140 |
here on the left and on the right. They are both correct answers, they both 01:47:48.220 |
say that the answer is three, which is correct. But one of these two is a 01:47:52.140 |
significantly better answer for the assistant than the other. Like if I was 01:47:56.300 |
data labeler, and I was creating one of these, one of these would be a really 01:48:01.340 |
terrible answer for the assistant, and the other would be okay. And so I'd 01:48:05.500 |
like you to potentially pause the video even, and think through why one of 01:48:09.180 |
these two is a significantly better answer than the other. And if you use the 01:48:14.620 |
wrong one, your model will actually be really bad at math potentially, and it 01:48:19.260 |
would have bad outcomes. And this is something that you would be careful 01:48:22.140 |
with in your labeling documentations when you are training people to create 01:48:25.580 |
the ideal responses for the assistant. Okay, so the key to this question is to 01:48:29.740 |
realize and remember that when the models are training and also 01:48:34.140 |
inferencing, they are working on a one-dimensional sequence of tokens from left 01:48:38.140 |
to right. And this is the picture that I often have in my mind. I imagine 01:48:42.140 |
basically the token sequence evolving from left to right. And to always 01:48:45.660 |
produce the next token in a sequence, we are feeding all these tokens into the 01:48:50.380 |
neural network. And this neural network then gives us the probabilities for the 01:48:53.580 |
next token in sequence, right? So this picture here is the exact same picture 01:48:57.260 |
we saw before up here. And this comes from the web demo that I showed you 01:49:02.300 |
before, right? So this is the calculation that basically takes the input tokens 01:49:06.620 |
here on the top, and performs these operations of all these neurons, and 01:49:12.860 |
gives you the answer for the probabilities of what comes next. Now, the 01:49:15.820 |
important thing to realize is that, roughly speaking, there's basically a 01:49:20.780 |
finite number of layers of computation that happen here. So for example, this 01:49:24.700 |
model here has only one, two, three layers of what's called attention and 01:49:30.300 |
MLP here. Maybe a typical modern state-of-the-art network would have more 01:49:35.980 |
like, say, 100 layers or something like that. But there's only 100 layers of 01:49:39.020 |
computation or something like that to go from the previous token sequence to 01:49:42.540 |
the probabilities for the next token. And so there's a finite amount of 01:49:46.060 |
computation that happens here for every single token. And you should think of 01:49:49.740 |
this as a very small amount of computation. And this amount of 01:49:52.940 |
computation is almost roughly fixed for every single token in this sequence. 01:49:57.500 |
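Schematically, producing the probabilities for the next token is one pass through a fixed stack of layers, something like the sketch below (the helper names are made up, and real models differ in many details; this is just to make the "fixed amount of compute per token" point concrete):

    def forward_pass(token_ids, layers):
        # one forward pass = one next-token prediction
        x = embed(token_ids)                # token IDs -> vectors (hypothetical helper)
        for layer in layers:                # a fixed stack, e.g. on the order of 100 blocks
            x = layer(x)                    # each attention + MLP block does a bounded amount of work
        return next_token_probabilities(x)  # distribution over the vocabulary (hypothetical helper)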
That's not actually fully true, because the more tokens you feed in, the more 01:50:03.180 |
expensive this forward pass will be of this neural network, but not by much. So 01:50:08.940 |
you should think of this, and I think is a good model to have in mind, this is a 01:50:12.140 |
fixed amount of compute that's going to happen in this box for every single one 01:50:15.340 |
of these tokens. And this amount of compute cannot possibly be too big, 01:50:18.620 |
because there's not that many layers that are sort of going from the top to 01:50:22.060 |
bottom here. There's not that much computationally that will happen here. 01:50:25.820 |
And so you can't imagine a model to basically do arbitrary computation in a 01:50:29.500 |
single forward pass to get a single token. And so what that means is that we 01:50:33.900 |
actually have to distribute our reasoning and our computation across 01:50:37.660 |
many tokens, because every single token is only spending a finite amount of 01:50:41.740 |
computation on it. And so we kind of want to distribute the computation 01:50:47.180 |
across many tokens. And we can't have too much computation or expect too much 01:50:51.900 |
computation out of the model in any single individual token, because there's 01:50:55.820 |
only so much computation that happens per token. Okay, roughly fixed amount of 01:51:00.540 |
computation here. So that's why this answer here is significantly worse. And 01:51:07.180 |
the reason for that is, imagine going from left to right here. And I copy 01:51:11.180 |
pasted it right here. The answer is three, etc. Imagine the model having to 01:51:16.620 |
go from left to right, emitting these tokens one at a time, it has to say, or 01:51:20.620 |
we're expecting to say, the answer is space dollar sign. And then right here, 01:51:27.980 |
we're expecting it to basically cram all the computation of this problem into 01:51:31.500 |
this single token, it has to emit the correct answer three. And then once 01:51:36.060 |
we've emitted the answer three, we're expecting it to say all these tokens. 01:51:40.060 |
But at this point, we've already produced the answer. And it's already in 01:51:43.420 |
the context window for all these tokens that follow. So anything here is just 01:51:47.340 |
kind of post hoc justification of why this is the answer. Because the answer 01:51:52.940 |
is already created, it's already in the token window. So it's not 01:51:56.700 |
actually being calculated here. And so if you are answering the question 01:52:01.260 |
directly, and immediately, you are training the model to try to basically 01:52:06.380 |
guess the answer in a single token. And that is just not going to work 01:52:10.060 |
because of the finite amount of computation that happens per token. 01:52:12.700 |
That's why this answer on the right is significantly better, because we are 01:52:17.100 |
distributing this computation across the answer, we're actually getting the 01:52:20.460 |
model to sort of slowly come to the answer. From the left to right, we're 01:52:24.300 |
getting intermediate results, we're saying, okay, the total cost of oranges 01:52:27.580 |
is four. So 13 minus four is nine. And so we're creating intermediate 01:52:32.540 |
calculations. And each one of these calculations is by itself not that 01:52:36.060 |
expensive. And so we're basically kind of gauging 01:52:39.420 |
the difficulty that the model is capable of handling in any single one of these 01:52:43.740 |
individual tokens. And there can never be too much work in any one of these 01:52:48.380 |
tokens computationally, because then the model won't be able to do that later 01:52:52.380 |
at test time. And so we're teaching the model here to spread out its reasoning 01:52:57.260 |
and to spread out its computation over the tokens. And in this way, it only has 01:53:02.140 |
very simple problems in each token, and they can add up. And then by the time 01:53:07.260 |
it's near the end, it has all the previous results in its working memory. 01:53:11.340 |
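Spelled out, the "good" label is carrying along exactly these intermediate results, which in code terms is nothing more than:

    orange_total = 2 * 2                 # two oranges at $2 each -> $4
    left_for_apples = 13 - orange_total  # $13 total minus $4 -> $9
    apple_cost = left_for_apples / 3     # three apples -> $3 each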
And it's much easier for it to determine what the answer is, and here it is: 01:53:14.540 |
three. So this is a significantly better label for our computation. This would 01:53:19.580 |
be really bad. Teaching the model to try to do all the computation 01:53:23.500 |
in a single token is really bad. So that's kind of like an interesting thing 01:53:29.020 |
to keep in mind in your prompts. Usually you don't have to think about it 01:53:33.740 |
explicitly because the people at OpenAI have labelers and so on that actually 01:53:39.420 |
worry about this and make sure that the answers are spread out. And so 01:53:42.940 |
actually OpenAI will kind of like do the right thing. So when I asked this 01:53:46.140 |
question of ChatGPT, it's actually going to go very slowly, it's going to 01:53:49.580 |
be like, okay, let's define our variables, set up the equation. And it's 01:53:53.100 |
kind of creating all these intermediate results. These are not for you. These 01:53:56.460 |
are for the model. If the model is not creating these intermediate results for 01:54:00.300 |
itself, it's not going to be able to reach three. I also wanted to show you 01:54:04.540 |
that it's possible to be a bit mean to the model, we can just ask for things. So 01:54:08.540 |
as an example, I said, I gave it the exact same prompt. And I said, answer 01:54:13.420 |
the question in a single token, just immediately give me the answer, nothing 01:54:16.540 |
else. And it turns out that for this simple prompt here, it actually was able 01:54:21.740 |
to do it in a single go. So it just created, I think, two 01:54:25.180 |
tokens, right? Because the dollar sign is its own token. So basically, the 01:54:30.140 |
model didn't give me a single token, it gave me two tokens, but it still 01:54:33.420 |
produced the correct answer. And it did that in a single forward pass of the 01:54:36.860 |
network. Now, that's because the numbers here I think are very simple. And so I 01:54:41.580 |
made it a bit more difficult to be a bit mean to the model. So I said Emily 01:54:45.100 |
buys 23 apples and 177 oranges. And then I just made the numbers a bit bigger. 01:54:49.900 |
And I'm just making it harder for the model, I'm asking it to do more 01:54:52.380 |
computation in a single token. And so I said the same thing. And here it gave 01:54:56.700 |
me five, and five is actually not correct. So the model failed to do all 01:55:00.860 |
this calculation in a single forward pass of the network. It failed to go 01:55:04.860 |
from the input tokens to the answer in a single forward pass of the network, 01:55:09.660 |
a single go through the network; it couldn't produce the result. And then I 01:55:13.420 |
said, Okay, now don't worry about the token limit, and just solve the 01:55:17.660 |
problem as usual. And then it goes through all the intermediate results, it 01:55:20.940 |
simplifies. And every one of these intermediate results and 01:55:24.700 |
intermediate calculations here is much easier for the model. And 01:55:29.900 |
it's not too much work per token, all of the tokens here are correct. And it 01:55:33.740 |
arrives at the resolution, which is seven. And it just couldn't squeeze all this 01:55:37.260 |
work. It couldn't squeeze that into a single forward pass of the network. So 01:55:41.180 |
I think that's kind of just a cute example. And something to kind of like 01:55:44.300 |
think about. And I think it's kind of, again, just elucidative in terms of how 01:55:47.820 |
these models work. The last thing that I would say on this topic is that if I 01:55:51.580 |
was in practice trying to actually solve this in my day to day life, I might 01:55:54.780 |
actually not trust that the model did all the intermediate calculations 01:55:58.780 |
correctly here. So actually, probably what I do is something like this, I 01:56:01.580 |
would come here and I would say, use code. And that's because code is one of 01:56:07.900 |
the possible tools that ChatGPT can use. And instead of it having to do 01:56:12.620 |
mental arithmetic, like this mental arithmetic here, I don't fully trust it. 01:56:16.460 |
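For the original apple problem, the kind of program the code tool comes back with might look roughly like the sketch below (ChatGPT's actual code will differ; this just uses the sympy library to make the equation explicit):

    from sympy import Eq, solve, symbols

    apple = symbols("apple")              # the unknown price of one apple
    equation = Eq(3 * apple + 2 * 2, 13)  # 3 apples + 2 oranges at $2 each = $13 total
    print(solve(equation, apple))         # [3]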
And especially if the numbers get really big. There's no guarantee that the 01:56:19.420 |
model will do this correctly. Any one of these intermediate steps might, in 01:56:23.180 |
principle, fail. We're using neural networks to do mental arithmetic, kind 01:56:27.020 |
of like you doing mental arithmetic in your brain. It might just like screw up 01:56:30.860 |
some of the intermediate results. It's actually kind of amazing that it can 01:56:33.500 |
even do this kind of mental arithmetic. I don't think I could do this in my 01:56:35.820 |
head. But basically, the model is kind of like doing it in its head. And I 01:56:39.100 |
don't trust that. So I want it to use tools. So you can say stuff like, use 01:56:42.300 |
code. And I'm not sure what happened there. Use code. And so like I 01:56:52.540 |
mentioned, there's a special tool and the model can write code. And I can 01:56:57.580 |
inspect that this code is correct. And then it's not relying on its mental 01:57:02.380 |
arithmetic. It is using the Python interpreter, which is a very simple 01:57:05.500 |
programming language, to basically write out the code that calculates the 01:57:09.100 |
result. And I would personally trust this a lot more because this came out 01:57:12.220 |
of a Python program, which I think has a lot more correctness guarantees 01:57:16.060 |
than the mental arithmetic of a language model. So just another kind of 01:57:21.580 |
potential hint that if you have these kinds of problems, you may want to 01:57:24.860 |
basically just ask the model to use the code interpreter. And just like we 01:57:28.860 |
saw with the web search, the model has special kinds of tokens for calling tools, 01:57:35.500 |
and it will not actually compute the result with the language model itself. It 01:57:38.540 |
will write the program. And then it actually sends that program to a 01:57:42.220 |
different sort of part of the computer that actually just runs that program 01:57:45.900 |
and brings back the result. And then the model gets access to that result 01:57:49.420 |
and can tell you that, okay, the cost of each apple is seven. So that's 01:57:53.660 |
another kind of tool. And I would use this in practice for yourself. And 01:57:57.820 |
it's, yeah, it's just less error prone, I would say. So that's why I 01:58:03.020 |
called this section, Models Need Tokens to Think. Distribute your 01:58:07.020 |
computation across many tokens. Ask models to create intermediate 01:58:10.620 |
results. Or whenever you can, lean on tools and tool use instead of 01:58:15.420 |
allowing the models to do all of this stuff in their memory. So if they 01:58:18.140 |
try to do it all in their memory, don't fully trust it and prefer to use 01:58:21.740 |
tools whenever possible. I want to show you one more example of where 01:58:25.180 |
this actually comes up, and that's in counting. So models actually are 01:58:29.020 |
not very good at counting for the exact same reason. You're asking for 01:58:32.380 |
way too much in a single individual token. So let me show you a simple 01:58:36.300 |
example of that. How many dots are below? And then I just put in a bunch 01:58:40.700 |
of dots. And ChatGPT says there are, and then it just tries to solve 01:58:45.500 |
the problem in a single token. So in a single token, it has to count the 01:58:50.140 |
number of dots in its context window. And it has to do that in a single 01:58:55.020 |
forward pass of a network. In a single forward pass of a network, as we 01:58:58.300 |
talked about, there's not that much computation that can happen there. 01:59:00.860 |
Just think of that as being like very little computation that happens 01:59:03.820 |
there. So if I just look at what the model sees, let's go to the LLM tokenizer. 01:59:10.060 |
It sees this. How many dots are below? And then it turns out that these 01:59:16.460 |
dots here, this group of I think 20 dots, is a single token. And then 01:59:21.260 |
this group of whatever it is, is another token. And then for some 01:59:24.860 |
reason, they break up like this. I don't know exactly why; this has to do with 01:59:28.940 |
the details of the tokenizer, but it turns out that these, the model 01:59:33.020 |
basically sees the token ID, this, this, this, and so on. And then from 01:59:38.460 |
these token IDs, it's expected to count the number. And spoiler alert, 01:59:43.340 |
it's not 161. It's actually, I believe, 177. So here's what we can do 01:59:47.340 |
instead. We can say use code. And you might expect that, like, why 01:59:52.140 |
should this work? And it's actually kind of subtle and kind of 01:59:55.100 |
interesting. So when I say use code, I actually expect this to work. 01:59:58.220 |
Let's see. Okay. 177 is correct. So what happens here is I've actually, 02:00:03.900 |
it doesn't look like it, but I've broken down the problem into 02:00:06.620 |
problems that are easier for the model. I know that the model can't 02:00:11.100 |
count. It can't do mental counting. But I know that the model is 02:00:14.620 |
actually pretty good at doing copy-pasting. So what I'm doing here 02:00:17.980 |
is when I say use code, it creates a string in Python for this. And 02:00:22.460 |
the task of basically copy-pasting my input here to here is very 02:00:27.980 |
simple. Because the model sees this string 02:00:34.140 |
as just these four tokens or whatever it is. So it's very simple 02:00:37.100 |
for the model to copy-paste those token IDs and kind of unpack them 02:00:43.260 |
into dots here. And so it creates this string, and then it calls 02:00:48.460 |
Python routine dot count, and then it comes up with the correct 02:00:51.420 |
answer. So the Python interpreter is doing the counting. It's not 02:00:54.540 |
the model's mental arithmetic doing the counting. So it's, again, 02:00:57.660 |
a simple example of models need tokens to think, don't rely on 02:01:02.300 |
their mental arithmetic. And that's why also the models are not 02:01:06.300 |
very good at counting. If you need them to do counting tasks, 02:01:08.780 |
always ask them to lean on the tool. Now, the models also have 02:01:12.860 |
many other little cognitive deficits here and there. And these 02:01:15.420 |
are kind of like sharp edges of the technology to be kind of aware 02:01:17.980 |
of over time. So as an example, the models are not very good with 02:01:21.820 |
all kinds of spelling-related tasks. They're not very good at it. 02:01:25.340 |
And I told you that we would loop back around to tokenization. 02:01:28.620 |
And the reason for this is that the models don't see 02:01:31.900 |
the characters. They see tokens. And their entire world is about 02:01:35.900 |
tokens, which are these little text chunks. And so they don't see 02:01:38.860 |
characters like our eyes do. And so very simple character-level 02:01:42.380 |
tasks often fail. So, for example, I'm giving it a string, 02:01:47.420 |
ubiquitous, and I'm asking it to print only every third character 02:01:51.340 |
starting with the first one. So we start with u, and then we 02:01:54.460 |
should go every third. So 1, 2, 3, q should be next, and then 02:02:00.220 |
et cetera. So this I see is not correct. And again, my hypothesis 02:02:04.620 |
is that this is, again, the mental arithmetic here is failing, 02:02:07.980 |
number one, a little bit. But number two, I think the more 02:02:10.780 |
important issue here is that if you go to Tiktokenizer and you 02:02:14.780 |
look at ubiquitous, we see that it is three tokens, right? So you 02:02:19.020 |
and I see ubiquitous, and we can easily access the individual 02:02:22.700 |
letters, because we kind of see them. And when we have it in the 02:02:25.580 |
working memory of our visual sort of field, we can really 02:02:28.620 |
easily index into every third letter, and I can do that task. 02:02:31.260 |
But the models don't have access to the individual letters. They 02:02:34.380 |
see this as these three tokens. And remember, these models are 02:02:38.460 |
trained from scratch on the internet. And 02:02:41.100 |
basically, the model has to discover how many of all these 02:02:44.620 |
different letters are packed into all these different tokens. 02:02:47.100 |
And the reason we even use tokens is mostly for efficiency. 02:02:50.700 |
But I think a lot of people are interested in deleting tokens 02:02:53.100 |
entirely. Like, we should really have character level or byte 02:02:55.820 |
level models. It's just that that would create very long 02:02:58.540 |
sequences, and people don't know how to deal with that right now. 02:03:01.500 |
So while we have the token world, any kind of spelling 02:03:04.060 |
tasks are not actually expected to work super well. 02:03:05.980 |
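If you want to see this for yourself, the tiktoken package exposes OpenAI's tokenizers (assuming you have it installed; the exact split varies by encoding), and the character-level task that trips the model up is trivial once you actually have characters:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # one of the OpenAI tokenizers
    print(enc.encode("ubiquitous"))             # a few token IDs, not ten separate characters
    print("ubiquitous"[::3])                    # 'uqts' -- every third character, easy at the character level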
So because I know that spelling is not a strong suit because of 02:03:09.580 |
tokenization, I can, again, ask it to lean on tools. So I can 02:03:13.180 |
just say use code. And I would, again, expect this to work, 02:03:16.700 |
because the task of copy pasting ubiquitous into the Python 02:03:19.580 |
interpreter is much easier. And then we're leaning on Python 02:03:22.700 |
interpreter to manipulate the characters of this string. 02:03:26.300 |
So when I say use code, ubiquitous, yes, it indexes into 02:03:32.700 |
every third character. And the actual truth is u, q, t, s, 02:03:36.940 |
which looks correct to me. So again, an example of spelling 02:03:42.380 |
related tasks not working very well. A very famous example of 02:03:45.420 |
that recently is how many R's are there in strawberry. And this 02:03:49.100 |
went viral many times. And basically, the models now get 02:03:52.300 |
it correct. They say there are three R's in strawberry. But for 02:03:55.020 |
a very long time, all the state of the art models would insist 02:03:57.500 |
that there are only two R's in strawberry. And this caused a 02:04:00.780 |
lot of, you know, ruckus, is that a word? I think 02:04:03.980 |
so. Because it's just kind of like, why are the models so 02:04:08.060 |
brilliant? And they can solve math Olympiad questions, but 02:04:10.860 |
they can't like count R's in strawberry. And the answer for 02:04:14.300 |
that, again, is I've kind of built up to it kind of slowly. 02:04:16.860 |
But number one, the models don't see characters, they see 02:04:19.820 |
tokens. And number two, they are not very good at counting. 02:04:23.500 |
And so here we are combining the difficulty of seeing 02:04:26.700 |
characters with the difficulty of counting. And that's why the 02:04:29.660 |
models struggled with this, even though I think by now, 02:04:32.620 |
honestly, I think OpenAI may have hardcoded the answer here, 02:04:35.020 |
or I'm not sure what they did. But this specific query now 02:04:39.580 |
works. So models are not very good at spelling. And there's a 02:04:45.020 |
bunch of other little sharp edges. And I don't want to go 02:04:46.780 |
into all of them. I just want to show you a few examples of 02:04:49.260 |
things to be aware of. And when you're using these models in 02:04:52.300 |
practice, I don't actually want to have a comprehensive 02:04:54.700 |
analysis here of all the ways that the models are kind of 02:04:57.740 |
like falling short, I just want to make the point that there 02:05:00.140 |
are some jagged edges here and there. And we've discussed a 02:05:03.580 |
few of them. And a few of them make sense. But some of them 02:05:05.660 |
also will just not make as much sense. And they're kind of 02:05:08.380 |
like you're left scratching your head, even if you understand 02:05:11.100 |
in depth how these models work. And a good example of that 02:05:14.220 |
recently is the following. The models are not very good at 02:05:17.340 |
very simple questions like this. And this is shocking to a lot 02:05:20.780 |
of people, because these models can solve 02:05:23.580 |
complex math problems, they can answer PhD grade physics, 02:05:27.260 |
chemistry, biology questions much better than I can, but 02:05:30.220 |
sometimes they fall short in like super simple problems like 02:05:32.380 |
this. So here we go: it says 9.11 is bigger than 9.9, and it 02:05:37.820 |
justifies this in some way, but obviously that's wrong. And then at the end, 02:05:41.100 |
okay, it actually flips its decision later. So I don't 02:05:46.700 |
believe that this is very reproducible. Sometimes it flips 02:05:49.260 |
around its answer, sometimes it gets it right, sometimes it 02:05:51.500 |
gets it wrong. Let's try again. Okay, even though it might look 02:05:59.340 |
larger. Okay, so here it doesn't even correct itself in the end. 02:06:02.700 |
If you ask many times, sometimes it gets it right, too. But how 02:06:05.900 |
is it that the model can do so great at Olympiad grade 02:06:09.340 |
problems, but then fail on very simple problems like this. 02:06:12.300 |
And I think this one is, as I mentioned, a little bit of a 02:06:16.300 |
head scratcher. It turns out that a bunch of people studied 02:06:18.700 |
this in depth, and I haven't actually read the paper. But 02:06:21.820 |
what I was told by this team was that when you scrutinize the 02:06:26.940 |
activations inside the neural network, when you look at some 02:06:29.660 |
of the features, and what features turn on or off, and what 02:06:32.620 |
neurons turn on or off, a bunch of neurons inside the neural 02:06:36.300 |
network light up, that are usually associated with Bible 02:06:39.180 |
verses. And so I think the model is kind of like reminded that 02:06:43.980 |
these almost look like Bible verse markers. And in a Bible 02:06:47.900 |
verse setting, 9.11 would come after 9.9. And so basically, the 02:06:52.620 |
model somehow finds it like cognitively very distracting, 02:06:55.260 |
that in Bible verses 9.11 would be greater. Even though here 02:07:00.540 |
it's actually trying to justify it and come up to the answer 02:07:02.940 |
with the math, it still ends up with the wrong answer here. So 02:07:06.860 |
it basically just doesn't fully make sense. And it's not fully 02:07:10.060 |
understood. And there's a few jagged issues like that. So 02:07:14.620 |
that's why you should treat this as what it is, which is a stochastic 02:07:18.620 |
system that is really magical, but that you can't also fully 02:07:21.420 |
trust. And you want to use it as a tool, not as something that 02:07:24.380 |
you kind of like let it rip on a problem and copy paste the 02:07:27.660 |
results. Okay, so we have now covered two major stages of 02:07:31.100 |
training of large language models. We saw that in the first 02:07:34.780 |
stage, this is called the pre-training stage, we are 02:07:37.580 |
basically training on internet documents. And when you train a 02:07:41.020 |
language model on internet documents, you get what's called 02:07:43.500 |
a base model. And it's basically an internet document 02:07:46.060 |
simulator, right? Now, we saw that this is an interesting 02:07:49.580 |
artifact. And this takes many months to train on 1000s of 02:07:53.740 |
computers. And it's kind of a lossy compression of the 02:07:56.140 |
internet. And it's extremely interesting, but it's not 02:07:58.380 |
directly useful. Because we don't want to sample internet 02:08:01.100 |
documents, we want to ask questions of an AI and have it 02:08:04.300 |
respond to our questions. So for that, we need an assistant. 02:08:08.060 |
And we saw that we can actually construct an assistant in the 02:08:11.100 |
process of post training. And specifically, in the process of 02:08:16.780 |
supervised fine tuning, as we call it. So in this stage, we 02:08:21.900 |
saw that it's algorithmically identical to pre training, 02:08:24.540 |
nothing is going to change. The only thing that changes is the 02:08:27.100 |
data set. So instead of internet documents, we now want to create 02:08:31.100 |
and curate a very nice data set of conversations. So we want 02:08:35.660 |
millions of conversations on all kinds of diverse topics between 02:08:40.620 |
a human and an assistant. And fundamentally, these 02:08:44.140 |
conversations are created by humans. So humans write the 02:08:48.140 |
prompts, and humans write the ideal responses. And they do 02:08:52.300 |
that based on labeling documentations. Now, in the 02:08:56.140 |
modern stack, it's not actually done fully and manually by 02:08:59.420 |
humans, right? They actually now have a lot of help from these 02:09:02.140 |
tools. So we can use language models to help us create these 02:09:06.060 |
data sets. And this is done extensively. But fundamentally, 02:09:08.860 |
it's all still coming from human curation at the end. So we 02:09:12.300 |
create these conversations that now becomes our data set, we 02:09:15.180 |
fine tune on it, or continue training on it, and we get an 02:09:18.300 |
assistant. And then we kind of shifted gears and started 02:09:21.100 |
talking about some of the kind of cognitive implications of 02:09:23.500 |
what the system is like. And we saw that, for example, the 02:09:26.540 |
assistant will hallucinate, if you don't take some sort of 02:09:30.540 |
mitigations towards it. So we saw that hallucinations would 02:09:33.980 |
be common. And then we looked at some of the mitigations of 02:09:36.380 |
those hallucinations. And then we saw that the models are quite 02:09:39.260 |
impressive and can do a lot of stuff in their head. But we saw 02:09:41.820 |
that they can also lean on tools to become better. So for 02:09:45.260 |
example, we can lean on the web search in order to hallucinate 02:09:49.100 |
less, and to maybe bring up some more recent information or 02:09:53.500 |
something like that. Or we can lean on tools like Code 02:09:56.060 |
Interpreter, so the LLM can write some code and actually 02:09:59.820 |
run it and see the results. So these are some of the topics we 02:10:03.820 |
looked at so far. Now what I'd like to do is, I'd like to cover 02:10:08.380 |
the last and major stage of this pipeline. And that is 02:10:12.540 |
reinforcement learning. So reinforcement learning is still 02:10:15.980 |
kind of thought to be under the umbrella of post-training. But 02:10:19.500 |
it is the third and last major stage, and it's a different way 02:10:23.100 |
of training language models, and usually follows as this third 02:10:27.100 |
step. So inside companies like OpenAI, you will start here, and 02:10:31.020 |
these are all separate teams. So there's a team doing data for 02:10:34.300 |
pre-training, and a team doing training for pre-training. And 02:10:37.420 |
then there's a team doing all the conversation generation in 02:10:41.900 |
a different team that is kind of doing the supervised fine 02:10:44.540 |
tuning. And there will be a team for the reinforcement learning 02:10:46.940 |
as well. So it's kind of like a handoff of these models. You 02:10:49.820 |
get your base model, then you fine tune it to be an assistant, 02:10:52.860 |
and then you go into reinforcement learning, which 02:10:54.540 |
we'll talk about now. So that's kind of like the major flow. 02:10:59.500 |
And so let's now focus on reinforcement learning, the last 02:11:02.700 |
major stage of training. And let me first actually motivate it 02:11:06.380 |
and why we would want to do reinforcement learning and what 02:11:09.100 |
it looks like on a high level. So now I'd like to try to 02:11:11.980 |
motivate the reinforcement learning stage and what it 02:11:13.740 |
corresponds to. It's something that you're probably familiar 02:11:15.900 |
with, and that is basically going to school. So just like 02:11:19.180 |
you went to school to become really good at something, we 02:11:22.380 |
want to take large language models through school. And 02:11:25.580 |
really what we're doing is we have a few paradigms of ways of 02:11:31.980 |
giving them knowledge or transferring skills. So in 02:11:35.260 |
particular, when we're working with textbooks in school, you'll 02:11:38.140 |
see that there are three major pieces of information in these 02:11:42.460 |
textbooks, three classes of information. The first thing 02:11:45.900 |
you'll see is you'll see a lot of exposition. And by the way, 02:11:49.020 |
this is a totally random book I pulled from the internet. I 02:11:51.100 |
think it's some kind of organic chemistry or something. I'm not 02:11:53.660 |
sure. But the important thing is that you'll see that most of 02:11:56.780 |
the text, most of it is kind of just like the meat of it, is 02:11:59.660 |
exposition. It's kind of like background knowledge, etc. As 02:12:03.500 |
you are reading through the words of this exposition, you 02:12:07.260 |
can think of that roughly as training on that data. And 02:12:12.380 |
that's why when you're reading through this stuff, this 02:12:14.380 |
background knowledge, and there's all this context 02:12:15.820 |
information, it's kind of equivalent to pre-training. So 02:12:19.900 |
it's where we build sort of like a knowledge base of this 02:12:23.580 |
data and get a sense of the topic. The next major kind of 02:12:28.140 |
information that you will see is these problems and with their 02:12:33.020 |
worked solutions. So basically a human expert, in this case, the 02:12:37.100 |
author of this book, has given us not just a problem, but has 02:12:40.140 |
also worked through the solution. And the solution is 02:12:43.020 |
basically like equivalent to having like this ideal response 02:12:46.460 |
for an assistant. So it's basically the expert is showing 02:12:49.260 |
us how to solve the problem and it's kind of like in its full 02:12:53.580 |
form. So as we are reading the solution, we are basically 02:12:57.980 |
training on the expert data. And then later we can try to 02:13:02.060 |
imitate the expert. And basically that roughly 02:13:07.180 |
corresponds to having the SFT model. That's what it would be 02:13:09.580 |
doing. So basically we've already done pre-training and 02:13:12.940 |
we've already covered this imitation of experts and how 02:13:17.340 |
they solve these problems. And the third stage of 02:13:20.220 |
reinforcement learning is basically the practice problems. 02:13:23.100 |
So sometimes you'll see this is just a single practice problem 02:13:26.620 |
here. But of course, there will be usually many practice 02:13:28.940 |
problems at the end of each chapter in any textbook. And 02:13:32.140 |
practice problems, of course, we know are critical for 02:13:34.220 |
learning, because what are they getting you to do? They're 02:13:36.780 |
getting you to practice yourself and discover ways of 02:13:40.940 |
solving these problems yourself. And so what you get in 02:13:43.980 |
the practice problem is you get the problem description, but 02:13:47.180 |
you're not given the solution, but you are given the final 02:13:50.780 |
answer, usually in the answer key of the textbook. And so you 02:13:54.860 |
know the final answer that you're trying to get to, and you 02:13:57.100 |
have the problem statement, but you don't have the solution. 02:13:59.900 |
You are trying to practice the solution. You're trying out 02:14:02.940 |
many different things, and you're seeing what gets you to 02:14:06.140 |
the final solution the best. And so you're discovering how 02:14:09.900 |
to solve these problems. And in the process of that, you're 02:14:12.860 |
relying on, number one, the background information, which 02:14:15.340 |
comes from pre-training, and number two, maybe a little bit 02:14:17.820 |
of imitation of human experts. And you can probably try 02:14:21.420 |
similar kinds of solutions and so on. So we've done this and 02:14:25.420 |
this, and now in this section, we're going to try to practice. 02:14:28.300 |
And so we're going to be given prompts. We're going to be 02:14:32.140 |
given solutions. Sorry, the final answers, but we're not 02:14:35.740 |
going to be given expert solutions. We have to practice 02:14:38.940 |
and try stuff out. And that's what reinforcement learning is 02:14:41.580 |
about. Okay, so let's go back to the problem that we worked 02:14:44.620 |
with previously, just so we have a concrete example to talk 02:14:47.420 |
through as we explore the topic here. So I'm here in the 02:14:52.220 |
Tiktokenizer because I'd also like to, well, I get a text box, 02:14:55.580 |
which is useful. But number two, I want to remind you again 02:14:58.780 |
that we're always working with one-dimensional token 02:15:00.540 |
sequences. And so I actually prefer this view because this 02:15:04.460 |
is the native view of the LLM, if that makes sense. This is 02:15:07.660 |
what it actually sees. It sees token IDs, right? So Emily buys 02:15:13.180 |
three apples and two oranges. Each orange is $2. The total 02:15:17.020 |
cost of all the fruit is $13. What is the cost of each apple? 02:15:21.500 |
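If you want to reproduce this token-ID view outside the web tool, something like the snippet below does it (the exact IDs depend on which tokenizer you pick):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    prompt = ("Emily buys 3 apples and 2 oranges. Each orange is $2. "
              "The total cost of all the fruit is $13. What is the cost of each apple?")
    print(enc.encode(prompt))  # the flat, one-dimensional list of integers the model actually sees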
And what I'd like you to appreciate here is these are 02:15:25.180 |
like four possible candidate solutions as an example. And 02:15:30.460 |
they all reach the answer three. Now what I'd like you to 02:15:33.260 |
appreciate at this point is that if I'm the human data 02:15:36.060 |
labeler that is creating a conversation to be entered into 02:15:39.100 |
the training set, I don't actually really know which of 02:15:42.940 |
these conversations to add to the data set. Some of these 02:15:49.420 |
conversations kind of set up a system of equations. Some of 02:15:51.900 |
them sort of like just talk through it in English, and some 02:15:55.020 |
of them just kind of like skip right through to the solution. 02:15:57.740 |
If you look at ChatGPT, for example, and you give it this 02:16:02.380 |
question, it defines a system of variables and it kind of like 02:16:05.420 |
does this little thing. What we have to appreciate and 02:16:08.780 |
differentiate between though is the first purpose of a 02:16:13.180 |
solution is to reach the right answer. Of course, we want to 02:16:15.580 |
get the final answer three. That is the important purpose 02:16:19.020 |
here. But there's kind of like a secondary purpose as well, 02:16:21.660 |
where here we are also just kind of trying to make it like 02:16:24.620 |
nice for the human, because we're kind of assuming that the 02:16:27.980 |
person wants to see the solution, they want to see the 02:16:29.980 |
intermediate steps, we want to present it nicely, etc. So there 02:16:33.100 |
are two separate things going on here. Number one is the 02:16:35.900 |
presentation for the human. But number two, we're trying to 02:16:38.300 |
actually get the right answer. So let's, for the moment, focus 02:16:42.380 |
on just reaching the final answer. If we only care about 02:16:46.700 |
the final answer, then which of these is the optimal or like 02:16:50.860 |
the best prompt? Sorry, the best solution for the LLM to 02:16:55.820 |
reach the right answer. And what I'm trying to get at is we 02:17:00.620 |
don't know. Me, as a human labeler, I would not know which 02:17:03.580 |
one of these is best. So as an example, we saw earlier on when 02:17:07.340 |
we looked at the token sequences here and the mental arithmetic 02:17:12.540 |
and reasoning, we saw that for each token, we can only spend 02:17:15.580 |
basically a finite amount of compute here 02:17:15.580 |
that is not very large, or you should think about it that way. 02:17:20.940 |
And so we can't actually make too big of a leap in any one 02:17:25.100 |
token is maybe the way to think about it. So as an example, in 02:17:29.020 |
this one, what's really nice about it is that it's very few 02:17:31.820 |
tokens, so it's gonna take us very short amount of time to get 02:17:34.700 |
to the answer. But right here, when we're doing 13 minus four 02:17:38.380 |
divided by three equals, right in this token here, we're actually 02:17:42.700 |
asking for a lot of computation to happen on that single 02:17:44.940 |
individual token. And so maybe this is a bad example to give 02:17:48.060 |
to the LLM because it's kind of incentivizing it to skip through 02:17:50.460 |
the calculations very quickly, and it's going to actually 02:17:52.860 |
make mistakes in this mental arithmetic. So maybe 02:17:56.940 |
it would work better to spread this out more. 02:17:59.900 |
Maybe it would be better to set up as an equation, maybe it 02:18:03.180 |
would be better to talk through it. We fundamentally don't know. 02:18:06.780 |
And we don't know because what is easy for you or I as or as 02:18:11.900 |
human labelers, what's easy for us or hard for us is different 02:18:15.420 |
than what's easy or hard for the LLM. Its cognition is different. 02:18:18.860 |
And different token sequences are differently hard for it. 02:18:24.460 |
And so some of the token sequences here that are trivial 02:18:30.300 |
for me might be too much of a leap for the LLM. So right 02:18:35.500 |
here, this token would be way too hard. But conversely, many 02:18:39.580 |
of the tokens that I'm creating here might be just trivial to 02:18:43.180 |
the LLM. And we're just wasting tokens, like why waste all these 02:18:46.140 |
tokens when this is all trivial. So if the only thing we care 02:18:49.980 |
about is reaching the final answer, and we're separating out 02:18:52.620 |
the issue of the presentation to the human, then we don't 02:18:56.140 |
actually really know how to annotate this example. We don't 02:18:58.700 |
know what solution to get to the LLM, because we are not the 02:19:01.340 |
LLM. And it's clear here in the case of like the math example, 02:19:06.540 |
but this is actually a very pervasive issue: our 02:19:09.820 |
knowledge is not the LLM's knowledge. The LLM actually 02:19:13.260 |
has a ton of PhD-level knowledge in math and physics and chemistry 02:19:15.980 |
and whatnot. So in many ways, it actually knows more than I do. 02:19:19.180 |
And I'm potentially not utilizing that knowledge in its 02:19:22.700 |
problem solving. But conversely, I might be injecting a bunch of 02:19:26.300 |
knowledge in my solutions that the LLM doesn't know in its 02:19:29.980 |
parameters. And then those are like sudden leaps that are very 02:19:33.740 |
confusing to the model. And so our cognitions are different. 02:19:38.300 |
And I don't really know what to put here, if all we care about 02:19:41.980 |
is reaching the final solution, and doing it 02:19:44.620 |
economically, ideally. And so, long story short, we are not in 02:19:50.220 |
a good position to create these token sequences for the LLM. And 02:19:55.260 |
they're useful for initializing the system by imitation. But we 02:19:59.100 |
really want the LLM to discover the token sequences that work 02:20:02.140 |
for it. It needs to find for itself what token sequence 02:20:07.260 |
reliably gets to the answer, given the prompt. And it needs 02:20:11.180 |
to discover that in a process of reinforcement learning and of 02:20:13.660 |
trial and error. So let's see how this example would work like 02:20:18.700 |
in reinforcement learning. Okay, so we're now back in the 02:20:22.540 |
Hugging Face Inference Playground. And that just allows 02:20:26.300 |
me to very easily call different kinds of models. So as an 02:20:29.500 |
example, here on the top right, I chose the Gemma 2, 2 billion 02:20:33.420 |
parameter model. So 2 billion is very, very small. So this is a 02:20:36.540 |
tiny model, but it's okay. The way 02:20:40.300 |
that reinforcement learning will basically work is actually 02:20:42.380 |
quite simple. We need to try many different kinds of 02:20:47.180 |
solutions. And we want to see which solutions work well or 02:20:50.060 |
not. So we're basically going to take the prompt, we're going 02:20:53.740 |
to run the model. And the model generates a solution. And then 02:20:58.940 |
we're going to inspect the solution. And we know that the 02:21:01.660 |
correct answer for this one is $3. And so indeed, the model 02:21:05.500 |
gets it correct, says it's $3. So this is correct. So that's 02:21:09.180 |
just one attempt at the solution. So now we're going to 02:21:11.980 |
delete this, and we're going to rerun it again. Let's try a 02:21:14.860 |
second attempt. So the model solves it in a slightly 02:21:18.060 |
different way, right? Every single attempt will be a 02:21:21.180 |
different generation, because these models are stochastic 02:21:23.420 |
systems. Remember that every single token here, we have a 02:21:25.980 |
probability distribution, and we're sampling from that 02:21:28.460 |
distribution. So we end up kind of going down slightly 02:21:31.580 |
different paths. And so this is the second solution that also 02:21:34.940 |
ends in the correct answer. Now we're going to delete that. 02:21:38.380 |
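As a quick aside on why every attempt comes out different, here is a tiny sketch of sampling the next token from a probability distribution; the tokens and probabilities are made up for illustration:

```python
import random

# Made-up next-token distribution at some point in a generation.
next_token_probs = {"First": 0.35, "Let's": 0.30, "The": 0.20, "To": 0.15}

tokens, probs = zip(*next_token_probs.items())
print(random.choices(tokens, weights=probs, k=1)[0])
# Run this a few times: you get different tokens, so the model goes
# down a slightly different path on every attempt at the solution.
```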
Let's go a third time. Okay, so again, slightly different 02:21:41.980 |
solution, but also gets it correct. Now we can actually 02:21:45.900 |
repeat this many times. And so in practice, you might actually 02:21:49.740 |
sample 1000s of independent solutions, or even like a 02:21:52.780 |
million solutions for just a single prompt. And some of them 02:21:57.260 |
will be correct, and some of them will not be correct. 02:21:59.660 |
And basically, what we want to do is we want to encourage the 02:22:02.140 |
solutions that lead to correct answers. So let's take a look 02:22:05.820 |
at what that looks like. So if we come back over here, here's 02:22:09.260 |
kind of like a cartoon diagram of what this is looking like. 02:22:11.420 |
We have a prompt. And then we tried many different solutions 02:22:15.740 |
in parallel. And some of the solutions might go well, so they 02:22:20.940 |
get the right answer, which is in green. And some of the 02:22:23.980 |
solutions might go poorly and may not reach the right answer, 02:22:26.700 |
which is red. Now, this problem here, unfortunately, is not the 02:22:29.900 |
best example, because it's a trivial prompt. And as we saw, 02:22:33.020 |
even like a two billion parameter model always gets it 02:22:36.060 |
right. So it's not the best example in that sense. But let's 02:22:39.020 |
just exercise some imagination here. And let's just suppose 02:22:42.700 |
that the green ones are good, and the red ones are bad. Okay, 02:22:50.060 |
so we generated 15 solutions, only four of them got the right 02:22:53.340 |
answer. And so now what we want to do is, basically, we want to 02:22:57.420 |
encourage the kinds of solutions that lead to right answers. So 02:23:00.860 |
whatever token sequences happened in these red solutions, 02:23:04.460 |
obviously, something went wrong along the way somewhere. And 02:23:07.340 |
this was not a good path to take through the solution. And 02:23:10.940 |
whatever token sequences that were in these green solutions, 02:23:13.740 |
well, things went pretty well in this situation. And so we 02:23:17.660 |
want to do more things like it in prompts like this. And the 02:23:22.140 |
way we encourage this kind of a behavior in the future is we 02:23:25.100 |
basically train on these sequences. But these training 02:23:28.540 |
sequences now are not coming from expert human annotators. 02:23:31.740 |
There's no human who decided that this is the correct 02:23:34.220 |
solution. This solution came from the model itself. So the 02:23:37.900 |
model is practicing here, it's tried out a few solutions, four 02:23:40.940 |
of them seem to have worked. And now the model will kind of 02:23:43.580 |
like train on them. And this corresponds to a student 02:23:46.220 |
basically looking at their solutions and being like, okay, 02:23:48.380 |
well, this one worked really well. So this is how I should be 02:23:50.940 |
solving these kinds of problems. And here in this example, there 02:23:55.980 |
are many different ways to actually like really tweak the 02:23:58.460 |
methodology a little bit here. But just to get the core idea 02:24:01.500 |
across, maybe it's simplest to just think about taking the 02:24:04.940 |
single best solution out of these four, like say this one, 02:24:08.300 |
which is why it's highlighted in yellow. So this is the solution that not 02:24:12.700 |
only reached the right answer, but maybe had some other nice 02:24:15.500 |
properties. Maybe it was the shortest one, or it looked 02:24:18.300 |
nicest in some ways, or there's other criteria you could think 02:24:21.740 |
of as an example. But we're going to decide that this is the 02:24:24.300 |
top solution, we're going to train on it. And then the model 02:24:28.380 |
will be slightly more likely, once you do the parameter 02:24:31.580 |
update, to take this path in this kind of a setting in the 02:24:35.900 |
future. But you have to remember that we're going to run many 02:24:39.500 |
different diverse prompts across lots of math problems and 02:24:42.540 |
physics problems and whatever else there might be. So 02:24:46.060 |
have in mind maybe tens of thousands of prompts, with thousands of 02:24:49.580 |
solutions per prompt. And so this is all happening kind of 02:24:52.620 |
like at the same time. And as we're iterating this process, 02:24:56.700 |
the model is discovering for itself, what kinds of token 02:25:00.060 |
sequences lead it to correct answers. It's not coming from a 02:25:04.540 |
human annotator. The model is kind of like playing in this 02:25:08.540 |
playground. And it knows what it's trying to get to. And it's 02:25:12.300 |
discovering sequences that work for it. These are sequences 02:25:15.980 |
that don't make any mental leaps. They seem to work 02:25:20.300 |
reliably and statistically, and fully utilize the knowledge of 02:25:24.380 |
the model as it has it. And so this is the process of 02:25:28.460 |
reinforcement learning. It's basically a guess and check, 02:25:31.820 |
we're going to guess many different types of solutions, 02:25:33.580 |
we're going to check them, and we're going to do more of what 02:25:35.900 |
worked in the future. And that is reinforcement learning. 02:25:40.620 |
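To make the guess-and-check loop concrete, here is a fully runnable toy caricature. The prompt, the canned solution styles, and the weight update are all made up for illustration; a real setup samples token by token from the LLM and does gradient updates on its parameters, but the overall shape is the same: sample many rollouts, check the final answer, and make what worked more likely.

```python
import random

# One verifiable prompt with a known final answer.
PROMPT = ("Emily buys 3 apples and 2 oranges. Each orange costs $2. "
          "The total is $13. What does each apple cost?")
ANSWER = "3"

# Canned "solution styles" standing in for the model's diverse rollouts.
CANDIDATES = [
    ("(13 - 2*2)/3 = 3. Answer: 3", "3"),
    ("13 - 4 = 9, then 9 / 3 = 3. Answer: 3", "3"),
    ("Let a be the apple price: 3a + 4 = 13, so a = 3. Answer: 3", "3"),
    ("13 / 3 - 2 = 2.33. Answer: 2.33", "2.33"),   # wrong
    ("3 + 2 = 5, 13 - 5 = 8. Answer: 8", "8"),     # wrong
]

# "Policy": how likely each solution style is to be sampled; starts uniform.
weights = [1.0] * len(CANDIDATES)

for step in range(100):                          # many RL iterations
    rollouts = random.choices(range(len(CANDIDATES)), weights=weights, k=50)
    for i in rollouts:                           # many attempts per prompt
        text, final_answer = CANDIDATES[i]
        if final_answer == ANSWER:               # check against the known answer
            weights[i] += 0.1                    # "train on it": make it more likely

best = max(range(len(weights)), key=lambda i: weights[i])
print("Most reinforced solution:", CANDIDATES[best][0])
```

Run across tens of thousands of prompts with thousands of rollouts each, this is the process by which the model discovers for itself which token sequences reliably reach the answer.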
So in the context of what came before, we see now that the SFT model, 02:25:44.300 |
the supervised fine tuning model, it's still helpful, 02:25:47.020 |
because it still kind of initializes the model a little 02:25:49.500 |
bit into the vicinity of the correct solutions. So it's kind 02:25:52.940 |
of like an initialization of the model, in the sense that it kind 02:25:57.420 |
of gets the model to, you know, take solutions, like write out 02:26:01.100 |
solutions, and maybe it has an understanding of setting up a 02:26:03.820 |
system of equations, or maybe it kind of like talks through a 02:26:06.380 |
solution. So it gets you into the vicinity of correct 02:26:09.020 |
solutions. But reinforcement learning is where everything 02:26:11.660 |
gets dialed in, we really discover the solutions that 02:26:14.460 |
work for the model, get the right answers, we encourage 02:26:17.420 |
them, and then the model just kind of like gets better over 02:26:20.140 |
time. Okay, so that is the high level process for how we train 02:26:23.660 |
large language models. In short, we train them kind of very 02:26:27.180 |
similar to how we train children. And basically, the 02:26:30.380 |
only difference is that children go through chapters of 02:26:32.860 |
books, and they do all these different types of training 02:26:35.980 |
exercises, kind of within the chapter of each book. But 02:26:39.740 |
instead, when we train AIs, it's almost like we kind of do it 02:26:42.380 |
stage by stage, depending on the type of that stage. So first, 02:26:46.540 |
what we do is we do pre training, which as we saw is 02:26:49.020 |
equivalent to basically reading all the expository material. So 02:26:53.020 |
we look at all the textbooks at the same time, and we read all 02:26:55.980 |
the exposition, and we try to build a knowledge base. The 02:26:59.580 |
second thing then is we go into the SFT stage, which is really 02:27:03.180 |
looking at all the sort of fixed solutions from human 02:27:07.180 |
experts, all the different kinds of worked solutions 02:27:10.860 |
across all the textbooks. And we just kind of get an SFT 02:27:14.380 |
model, which is able to imitate the experts, but does so kind 02:27:17.500 |
of blindly, it just kind of like does its best guess, kind 02:27:21.260 |
of just like trying to mimic statistically the expert 02:27:23.580 |
behavior. And so that's what you get when you look at all the 02:27:26.060 |
worked solutions. And then finally, in the last stage, we 02:27:29.660 |
do all the practice problems in the RL stage across all the 02:27:33.180 |
textbooks, we only do the practice problems. And that's 02:27:36.220 |
how we get the RL model. So on a high level, the way we train 02:27:40.140 |
LLMs is very much equivalent to the process that 02:27:44.860 |
we use for training children. The next point I would 02:27:47.900 |
like to make is that actually these first two stages pre 02:27:50.540 |
training and supervised fine tuning, they've been around for 02:27:52.940 |
years, and they are very standard, and all the different 02:27:55.020 |
LLM providers do them. It is this last stage, the RL 02:27:58.860 |
training, that is a lot earlier in its process of 02:28:01.820 |
development, and is not standard yet in the field. And so this 02:28:07.900 |
stage is a lot more kind of early and nascent. And the 02:28:11.020 |
reason for that is because I actually skipped over a ton of 02:28:13.580 |
little details here in this process. The high level idea is 02:28:16.140 |
very simple, it's trial and error learning, but there's a 02:28:18.700 |
ton of details and little mathematical kind of like 02:28:21.180 |
nuances to exactly how you pick the solutions that are the 02:28:23.580 |
best, and how much you train on them, and what is the prompt 02:28:26.540 |
distribution, and how to set up the training run, such that 02:28:29.260 |
this actually works. So there's a lot of little details and 02:28:31.900 |
knobs to the core idea that is very, very simple. And so 02:28:35.500 |
getting the details right here is not trivial. And so a lot of 02:28:39.580 |
companies like for example, OpenAI and other LLM providers 02:28:42.060 |
have experimented internally with reinforcement learning 02:28:45.420 |
fine tuning for LLMs for a while, but they've not talked 02:28:48.700 |
about it publicly. It's all kind of done inside the company. 02:28:53.180 |
And so that's why the paper from DeepSeek that came out very, 02:28:56.540 |
very recently was such a big deal. Because this is a paper 02:28:59.740 |
from this company called DeepSeek AI in China. And this paper 02:29:04.380 |
really talked very publicly about reinforcement learning 02:29:07.100 |
fine tuning for large language models, and how incredibly 02:29:10.380 |
important it is for large language models, and how it 02:29:13.500 |
brings out a lot of reasoning capabilities in the models. 02:29:15.980 |
We'll go into this in a second. So this paper reinvigorated the 02:29:19.900 |
public interest in using RL for LLMs, and gave a lot of the 02:29:26.220 |
sort of nitty-gritty details that are needed to reproduce 02:29:28.540 |
the results, and actually get the stage to work for large 02:29:31.340 |
language models. So let me take you briefly through this 02:29:34.300 |
DeepSeek RL paper, and what happens when you actually 02:29:36.780 |
correctly apply RL to language models, and what that looks 02:29:39.260 |
like, and what that gives you. So the first thing I'll scroll 02:29:41.420 |
to is this kind of figure two here, where we are looking at 02:29:44.700 |
the improvement in how the models are solving mathematical 02:29:47.900 |
problems. So this is the accuracy of solving mathematical 02:29:50.700 |
problems on the AIME benchmark. And then we can go to the web 02:29:54.620 |
page, and we can see the kinds of problems that are actually 02:29:56.460 |
in these kinds of math problems that are being measured here. 02:30:00.300 |
So these are simple math problems. You can pause the 02:30:03.420 |
video if you like, but these are the kinds of problems that 02:30:05.660 |
basically the models are being asked to solve. And you can see 02:30:08.300 |
that in the beginning they're not doing very well, but then 02:30:10.380 |
as you update the model with this many thousands of steps, 02:30:13.580 |
their accuracy kind of continues to climb. So the models are 02:30:17.260 |
improving, and they're solving these problems with a higher 02:30:19.500 |
accuracy as you do this trial and error on a large dataset of 02:30:23.580 |
these kinds of problems. And the models are discovering how 02:30:26.540 |
to solve math problems. But even more incredible than the 02:30:30.540 |
quantitative kind of results of solving these problems with a 02:30:33.580 |
higher accuracy is the qualitative means by which the 02:30:36.060 |
model achieves these results. So when we scroll down, one of 02:30:40.380 |
the figures here that is kind of interesting is that later on 02:30:43.420 |
in the optimization, the average 02:30:48.140 |
length per response goes up. So the model seems to be using 02:30:51.500 |
more tokens to get its higher accuracy results. So it's 02:30:55.900 |
learning to create very, very long solutions. Why are these 02:30:59.420 |
solutions very long? We can look at them qualitatively here. 02:31:02.140 |
So basically what they discover is that the model solutions get 02:31:05.900 |
very, very long partially because, so here's a question, 02:31:08.620 |
and here's kind of the answer from the model. What the model 02:31:11.420 |
learns to do, and this is an emergent property of the 02:31:14.780 |
optimization, it just discovers that this is good for problem 02:31:17.820 |
solving, is it starts to do stuff like this: "Wait, wait, 02:31:20.620 |
wait, that's an aha moment I can flag here. Let's re-evaluate 02:31:23.500 |
this step by step to identify if the correct sum can be..." So 02:31:26.380 |
what is the model doing here, right? The model is basically 02:31:29.500 |
re-evaluating steps. It has learned that it works better 02:31:33.340 |
for accuracy to try out lots of ideas, try something from 02:31:37.100 |
different perspectives, retrace, reframe, backtrack. It's 02:31:40.620 |
doing a lot of the things that you and I are doing in the 02:31:42.620 |
process of problem solving for mathematical questions, but 02:31:45.580 |
it's rediscovering what happens in your head, not what you put 02:31:48.380 |
down on the solution, and there is no human who can hard code 02:31:51.660 |
this stuff in the ideal assistant response. This is only 02:31:54.860 |
something that can be discovered in the process of 02:31:56.620 |
reinforcement learning because you wouldn't know what to put 02:31:59.420 |
here. This just turns out to work for the model, and it 02:32:02.940 |
improves its accuracy in problem solving. So the model learns 02:32:06.620 |
what we call chains of thought, like the ones in your head, and it's 02:32:09.580 |
an emergent property of the optimization, and that's what's 02:32:13.820 |
bloating up the response lengths, but that's also what's 02:32:17.260 |
increasing the accuracy of the problem solving. So what's 02:32:20.780 |
incredible here is basically the model is discovering ways 02:32:23.660 |
to think. It's learning what I like to call cognitive 02:32:26.540 |
strategies of how you manipulate a problem and how you 02:32:29.980 |
approach it from different perspectives, how you pull in 02:32:32.620 |
some analogies or do different kinds of things like that, and 02:32:35.500 |
how you kind of try out many different things over time, 02:32:37.980 |
check a result from different perspectives, and how you kind 02:32:41.260 |
of solve problems. But here, it's kind of discovered by the 02:32:44.460 |
RL, so extremely incredible to see this emerge in the 02:32:47.820 |
optimization without having to hard code it anywhere. The only 02:32:50.860 |
thing we've given it are the correct answers, and this comes 02:32:53.740 |
out from trying to just solve them correctly, which is 02:32:56.220 |
incredible. Now let's go back to actually the problem that 02:33:00.780 |
we've been working with, and let's take a look at what it 02:33:03.100 |
would look like for this kind of a model, what we call 02:33:07.900 |
a reasoning or thinking model, to solve that problem. Okay, so 02:33:11.660 |
recall that this is the problem we've been working with, and 02:33:13.900 |
when I pasted it into ChatGPT 4o, I'm getting this kind of 02:33:17.340 |
a response. Let's take a look at what happens when you give 02:33:20.540 |
the same query to what's called a reasoning or a thinking 02:33:23.500 |
model. This is a model that was trained with reinforcement 02:33:25.660 |
learning. So this model described in this paper, Deep 02:33:29.740 |
Seek R1, is available on chat.deepseek.com. So this is 02:33:34.300 |
kind of like the company that developed it is hosting it. You 02:33:37.260 |
have to make sure that the Deep Think button is turned on to 02:33:40.060 |
get the R1 model, as it's called. We can paste it here 02:33:43.180 |
and run it. And so let's take a look at what happens now, and 02:33:47.180 |
what is the output of the model. Okay, so here's what it 02:33:49.820 |
says. So this is previously what we get using basically 02:33:53.260 |
what's an SFT approach, a supervised fine-tuning 02:33:55.500 |
approach. This is like mimicking an expert solution. 02:33:58.460 |
This is what we get from the RL model. Okay, let me try to 02:34:01.820 |
figure this out. So Emily buys three apples and two oranges. 02:34:04.460 |
Each orange costs $2, total is $13. I need to find out blah 02:34:07.660 |
blah blah. So here, as you're reading this, you can't 02:34:12.780 |
escape thinking that this model is thinking. It's 02:34:17.820 |
definitely pursuing the solution. It derives that it 02:34:21.340 |
must cost $3. And then it says, wait a second, let me check my 02:34:23.900 |
math again to be sure. And then it tries it from a slightly 02:34:26.220 |
different perspective. And then it says, yep, all that checks 02:34:29.340 |
out. I think that's the answer. I don't see any mistakes. Let 02:34:33.180 |
me see if there's another way to approach the problem, maybe 02:34:35.340 |
setting up an equation. Let's let the cost of one apple be 02:34:39.500 |
A, then blah blah blah. Yep, same answer. So definitely each 02:34:39.500 |
apple is $3. All right, confident that that's correct. 02:34:46.300 |
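For reference, the equation the model is double-checking here works out as follows; this is just a small sketch using SymPy (assumed to be installed), not the model's own computation:

```python
from sympy import Eq, solve, symbols

a = symbols("a")                     # price of one apple
equation = Eq(3 * a + 2 * 2, 13)     # 3 apples + 2 oranges at $2 each = $13
print(solve(equation, a))            # [3] -> each apple is $3
```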
And then what it does, once it sort of did the thinking 02:34:50.140 |
process, is it writes up the nice solution for the human. 02:34:53.660 |
And so this part here is more about the 02:34:56.620 |
correctness aspect, and this is more about the presentation 02:34:59.660 |
aspect, where it kind of writes it out nicely and boxes in the 02:35:04.060 |
correct answer at the bottom. And so what's incredible about 02:35:06.940 |
this is we get this like thinking process of the model. 02:35:09.340 |
And this is what's coming from the reinforcement learning 02:35:11.900 |
process. This is what's bloating up the length of the token 02:35:15.820 |
sequences. They're doing thinking and they're trying 02:35:17.820 |
different ways. This is what's giving you higher accuracy in 02:35:21.420 |
problem solving. And this is where we are seeing these aha 02:35:25.340 |
moments and these different strategies and these ideas for 02:35:29.500 |
how you can make sure that you're getting the correct 02:35:31.420 |
answer. The last point I wanted to make is some people are a 02:35:35.260 |
little bit nervous about putting, you know, very 02:35:37.980 |
sensitive data into chat.deepseek.com because this is a 02:35:41.260 |
Chinese company. So people are a little bit 02:35:43.980 |
careful and cagey with that. DeepSeek R1 is a 02:35:48.220 |
model that was released by this company. So this is an open 02:35:51.660 |
source model or open weights model. It is available for 02:35:54.860 |
anyone to download and use. You will not be able to run 02:35:58.300 |
the full model in full precision, though. You 02:36:03.100 |
won't run that on a MacBook or like a local device because 02:36:07.020 |
this is a fairly large model. But many companies are hosting 02:36:09.980 |
the full largest model. One of those companies that I like to 02:36:13.180 |
use is called together.ai. So when you go to together.ai, you 02:36:17.260 |
sign up and you go to playgrounds. You can select here 02:36:20.220 |
in the chat DeepSeek R1, and there are many different kinds of 02:36:23.580 |
other models that you can select here. These are all 02:36:25.340 |
state-of-the-art models. So this is kind of similar to the 02:36:27.900 |
Hugging Face inference playground that we've been 02:36:29.820 |
playing with so far. But together.ai will usually host 02:36:33.020 |
all the state-of-the-art models. So select DeepSeek R1. 02:36:36.220 |
You can try to ignore a lot of these. I think the default 02:36:39.180 |
settings will often be okay. And we can put in this. And 02:36:43.420 |
because the model was released by DeepSeek, what you're getting 02:36:46.380 |
here should be basically equivalent to what you're 02:36:48.780 |
getting here. Now because of the randomness in the sampling, 02:36:51.420 |
we're going to get something slightly different. But in 02:36:53.820 |
principle, this should be identical in terms of the power 02:36:56.700 |
of the model. And you should be able to see the same things 02:36:58.940 |
quantitatively and qualitatively. But here the model is 02:37:02.140 |
being served by an American company. So that's DeepSeek 02:37:06.780 |
and that's what's called a reasoning model. Now when I go 02:37:10.860 |
back to chat, let me go to chat here. Okay, so the model that 02:37:14.700 |
you're going to see in the drop down here, some of them like 02:37:17.420 |
O1, O3 mini, O3 mini high, etc. They are advertised as 02:37:21.180 |
using advanced reasoning. Now what this 02:37:24.460 |
"uses advanced reasoning" is referring to is the fact that 02:37:27.260 |
it was trained by reinforcement learning with techniques very 02:37:30.060 |
similar to those of DeepSeq R1, per public statements of 02:37:33.900 |
OpenAI employees. So these are thinking models trained with 02:37:38.700 |
RL. And these models like GPT-4o or GPT-4o mini that you're 02:37:42.460 |
getting in the free tier, you should think of them as mostly 02:37:44.860 |
SFT models, supervised fine-tuning models. They don't 02:37:47.660 |
actually do this like thinking as you see in the RL models. 02:37:50.780 |
And even though there's a little bit of reinforcement 02:37:53.420 |
learning involved with these models, and I'll go into that 02:37:55.980 |
in a second, these are mostly SFT models. I think you should 02:37:58.540 |
think about it that way. So in the same way as what we saw 02:38:01.580 |
here, we can pick one of the thinking models, like say O3 02:38:04.860 |
mini high. And these models, by the way, might not be 02:38:07.500 |
available to you unless you pay a ChatGPT subscription of 02:38:10.940 |
either $20 per month or $200 per month for some of the top 02:38:14.540 |
models. So we can pick a thinking model and run. Now 02:38:19.420 |
what's going to happen here is it's going to say reasoning, 02:38:21.580 |
and it's going to start to do stuff like this. And what we're 02:38:25.740 |
seeing here is not exactly the same as what we saw over in DeepSeek. So 02:38:29.420 |
even though under the hood, the model produces these kinds of 02:38:32.700 |
chains of thought, OpenAI chooses to not show the exact 02:38:37.820 |
chains of thought in the web interface. It shows little 02:38:40.620 |
summaries of those chains of thought. And OpenAI kind of 02:38:44.060 |
does this, I think, partly because they are worried about 02:38:46.940 |
what's called a distillation risk. That is that someone 02:38:49.420 |
could come in and actually try to imitate those reasoning 02:38:52.140 |
traces and recover a lot of the reasoning performance by just 02:38:55.260 |
imitating the reasoning chains of thought. And so they kind of 02:38:58.780 |
hide them and they only show little summaries of them. So 02:39:01.180 |
you're not getting exactly what you would get in DeepSeek 02:39:03.340 |
with respect to the reasoning itself. And then they 02:39:07.020 |
write up the solution. So these are kind of like equivalent, 02:39:11.020 |
even though we're not seeing the full under the hood details. 02:39:13.980 |
Now, in terms of the performance, these models and 02:39:17.420 |
DeepSeek models are currently roughly on par, I would say. 02:39:20.540 |
It's kind of hard to tell because of the evaluations. But 02:39:22.540 |
if you're paying $200 per month to OpenAI, some of these 02:39:25.020 |
models, I believe, basically still look 02:39:27.660 |
better. But DeepSeek R1 for now is still a very solid choice 02:39:32.620 |
for a thinking model that would be available to you either on 02:39:38.300 |
this website or any other website because the model is 02:39:40.220 |
open weights. You can just download it. So that's thinking 02:39:44.460 |
models. So what is the summary so far? Well, we've talked 02:39:47.980 |
about reinforcement learning and the fact that thinking 02:39:51.180 |
emerges in the process of the optimization when we 02:39:54.220 |
basically run RL on many math and kind of code problems that 02:39:57.820 |
have verifiable solutions. So there's like an answer three, 02:40:00.940 |
et cetera. Now, these thinking models you can access in, for 02:40:05.340 |
example, DeepSeek or any inference provider like 02:40:08.220 |
together.ai by choosing DeepSeek over there. These 02:40:12.380 |
thinking models are also available in chatGPT under any 02:40:16.140 |
of the O1 or O3 models. But these GPT-4o models, et 02:40:20.380 |
cetera, they're not thinking models. You should think of 02:40:21.980 |
them as mostly SFT models. Now, if you have a prompt that 02:40:27.740 |
requires advanced reasoning and so on, you should probably 02:40:30.300 |
use some of the thinking models or at least try them 02:40:31.980 |
out. But empirically, for a lot of my use, when you're 02:40:35.340 |
asking a simpler question, like a knowledge- 02:40:37.100 |
based question or something like that, this might be 02:40:38.940 |
overkill. There's no need to think 30 seconds about some 02:40:41.340 |
factual question. So for that, I will sometimes default to 02:40:44.620 |
just GPT-4o. So empirically, about 80 to 90 percent of my use 02:40:48.380 |
is just GPT-4o. And when I come across a very difficult 02:40:51.100 |
problem, like in math and code, et cetera, I will reach for 02:40:53.740 |
the thinking models. But then I have to wait a bit longer 02:40:57.020 |
because they are thinking. So you can access these on 02:41:00.300 |
ChatGPT or on DeepSeek. Also, I wanted to point out that 02:41:03.260 |
aistudio.google.com, even though it looks really busy, 02:41:08.060 |
really ugly, because Google is just unable to do this kind 02:41:11.420 |
of stuff well. But if you choose 02:41:15.020 |
model and you choose here, Gemini 2.0 Flash Thinking 02:41:18.300 |
Experimental 0121, if you choose that one, that's also a 02:41:22.060 |
kind of early, experimental version of a thinking 02:41:25.180 |
model by Google. So we can go here and we can give it the 02:41:28.540 |
same problem and click run. And this is also a thinking 02:41:31.500 |
model that will also do something similar 02:41:34.860 |
and comes out with the right answer here. So basically, 02:41:39.100 |
Gemini also offers a thinking model. Anthropic currently 02:41:42.460 |
does not offer a thinking model. But basically, this is 02:41:44.780 |
kind of like the frontier development of these LLMs. I 02:41:47.420 |
think RL is kind of like this new exciting stage, but getting 02:41:51.100 |
the details right is difficult. And that's why all these 02:41:53.740 |
models and thinking models are currently experimental as of 02:41:56.620 |
2025, very early 2025. But this is kind of like the frontier 02:42:01.500 |
development of pushing the performance on these very 02:42:03.500 |
difficult problems using reasoning that is emergent in 02:42:06.220 |
these optimizations. One more connection that I wanted to 02:42:08.940 |
bring up is that the discovery that reinforcement learning is 02:42:12.540 |
extremely powerful way of learning is not new to the 02:42:16.220 |
field of AI. And one place where we've already seen this 02:42:19.420 |
demonstrated is in the game of Go. And famously, DeepMind 02:42:23.740 |
developed the system AlphaGo, and you can watch a movie about 02:42:26.540 |
it, where the system is learning to play the game of Go 02:42:30.460 |
against top human players. And when we go to the paper 02:42:35.340 |
underlying AlphaGo, so in this paper, when we scroll down, 02:42:42.140 |
we actually find a really interesting plot that I think 02:42:46.620 |
is kind of familiar to us, and that we're kind of rediscovering 02:42:49.980 |
in the more open domain of arbitrary problem solving, 02:42:53.260 |
instead of on the closed specific domain of the game of 02:42:55.820 |
Go. But basically what they saw, and we're going to see this in 02:42:58.620 |
LLMs as well as this becomes more mature, is the following: this is the ELO 02:43:03.660 |
rating of playing game of Go. And this is Lee Sedol, an 02:43:06.620 |
extremely strong human player. And what they are 02:43:09.660 |
comparing here is the strength of a model trained by 02:43:12.860 |
supervised learning, and a model trained by reinforcement 02:43:15.500 |
learning. So the supervised learning model is imitating 02:43:19.020 |
human expert players. So if you just get a huge amount of games 02:43:22.460 |
played by expert players in the game of Go, and you try to 02:43:24.940 |
imitate them, you are going to get better, but then you top 02:43:29.100 |
out, and you never quite get better than some of the top, 02:43:32.620 |
top, top players in the game of Go, like Lee Sedol. So you're 02:43:35.740 |
never going to reach there, because you're just imitating 02:43:38.140 |
human players. You can't fundamentally go beyond a human 02:43:40.700 |
player if you're just imitating human players. But the process 02:43:44.060 |
of reinforcement learning is significantly more powerful. In 02:43:47.260 |
reinforcement learning for a game of Go, it means that the 02:43:49.980 |
system is playing moves that empirically and statistically 02:43:53.900 |
lead to winning the game. And so AlphaGo is a 02:43:58.460 |
system where it kind of plays against itself, and it's using 02:44:02.540 |
reinforcement learning to create rollouts. So it's the exact 02:44:06.380 |
same diagram here, but there's no prompt. Because 02:44:10.380 |
there's no prompt, it's just the fixed game of Go. But it's 02:44:12.940 |
trying out lots of solutions, it's trying lots of plays, and 02:44:16.540 |
then the games that lead to a win, instead of a specific 02:44:19.740 |
answer, are reinforced. They're made stronger. And so the 02:44:25.660 |
system is learning basically the sequences of actions that 02:44:27.980 |
empirically and statistically lead to winning the game. And 02:44:31.980 |
reinforcement learning is not going to be constrained by 02:44:34.220 |
human performance. And reinforcement learning can do 02:44:36.540 |
significantly better and overcome even the top players 02:44:39.660 |
like Lee Sedol. And so probably they could have run this 02:44:44.780 |
longer, and they just chose to crop it at some point because 02:44:46.940 |
this costs money. But this is a very powerful demonstration of 02:44:49.740 |
reinforcement learning. And we're only starting to kind of 02:44:52.300 |
see hints of this diagram in larger language models for 02:44:56.780 |
reasoning problems. So we're not going to get too far by just 02:44:59.740 |
imitating experts. We need to go beyond that, set up these 02:45:02.780 |
little game environments, and let the system discover 02:45:07.900 |
reasoning traces or ways of solving problems that are 02:45:12.780 |
unique and that just basically work well. Now on this aspect 02:45:17.900 |
of uniqueness, notice that when you're doing reinforcement 02:45:20.300 |
learning, nothing prevents you from veering off the 02:45:23.340 |
distribution of how humans are playing the game. And so when 02:45:26.700 |
we go back to this AlphaGo search here, one of the 02:45:29.980 |
suggestions is move 37. And move 37 in 02:45:34.380 |
AlphaGo is referring to a specific point in time where 02:45:37.740 |
AlphaGo basically played a move that no human expert would 02:45:42.380 |
play. So the probability of this move to be played by a 02:45:46.140 |
human player was evaluated to be about 1 in 10,000. So it's a 02:45:49.580 |
very rare move. But in retrospect, it was a brilliant 02:45:52.380 |
move. So AlphaGo, in the process of reinforcement 02:45:54.860 |
learning, discovered kind of like a strategy of playing that 02:45:57.900 |
was unknown to humans, but is in retrospect brilliant. I 02:46:02.300 |
recommend this YouTube video, Lee Sedol versus AlphaGo move 02:46:05.580 |
37 reaction analysis. And this is kind of what it looked like 02:46:10.700 |
"That's a very surprising move." "I thought it was a 02:46:20.380 |
mistake." "When I see this move." Anyway, so basically 02:46:24.940 |
people are kind of freaking out because it's a move that 02:46:27.820 |
a human would not play, that AlphaGo played, because in its 02:46:32.140 |
training, this move seemed to be a good idea. It just happens 02:46:35.420 |
not to be a kind of thing that humans would do. And so that 02:46:38.700 |
is, again, the power of reinforcement learning. And in 02:46:40.860 |
principle, we can actually see the equivalence of that if we 02:46:43.580 |
continue scaling this paradigm in language models. And what 02:46:46.620 |
that looks like is kind of unknown. So what does it mean to 02:46:50.540 |
solve problems in such a way that even humans would not be 02:46:55.420 |
able to get? How can you be better at reasoning or thinking 02:46:58.380 |
than humans? How can you go beyond just a thinking human? 02:47:03.500 |
Like maybe it means discovering analogies that humans would 02:47:06.300 |
not be able to create. Or maybe it's like a new thinking 02:47:09.500 |
strategy. It's kind of hard to think through. Maybe it's a 02:47:12.860 |
wholly new language that actually is not even English. 02:47:16.300 |
Maybe it discovers its own language that is a lot better 02:47:19.260 |
at thinking. Because the model is not even constrained to 02:47:23.740 |
stick with English. So maybe it takes a different language to 02:47:27.500 |
think in, or it discovers its own language. So in principle, 02:47:31.020 |
the behavior of the system is a lot less defined. It is open to 02:47:35.180 |
do whatever works. And it is open to also slowly drift from 02:47:39.980 |
the distribution of its training data, which is English. 02:47:41.900 |
But all that can only be done if we have a very large, diverse 02:47:46.140 |
set of problems in which these strategies can be refined and 02:47:49.500 |
perfected. And so that is a lot of the frontier LLM research 02:47:53.020 |
that's going on right now is trying to kind of create those 02:47:55.740 |
kinds of prompt distributions that are large and diverse. 02:47:58.540 |
These are all kind of like game environments in which the LLMs 02:48:01.100 |
can practice their thinking. And it's kind of like writing, 02:48:05.500 |
you know, these practice problems. We have to create 02:48:07.420 |
practice problems for all domains of knowledge. And if we 02:48:11.260 |
have practice problems, and tons of them, the models will be able 02:48:14.380 |
to reinforcement learn on them and 02:48:17.980 |
kind of create these kinds of diagrams. But in the domain of 02:48:22.540 |
open thinking, instead of a closed domain like Game of Go. 02:48:25.900 |
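As a rough sketch of what one of these "game environments" could amount to, it is essentially a large collection of prompts paired with automatic graders. The names and the toy graders below are purely illustrative assumptions, not any particular lab's setup:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PracticeProblem:
    prompt: str
    grade: Callable[[str], bool]   # candidate solution text -> pass/fail

# Two toy entries; a real prompt distribution would contain a huge, diverse
# set across math, code, and other domains, with far more careful graders
# (exact-answer checks, unit tests, LLM judges, ...).
problems = [
    PracticeProblem(
        prompt="Emily buys 3 apples and 2 oranges for $13; each orange is $2. "
               "What does each apple cost? End with 'Answer: <number>'.",
        grade=lambda solution: solution.strip().endswith("Answer: 3"),
    ),
    PracticeProblem(
        prompt="Write a Python function is_prime(n).",
        grade=lambda solution: "def is_prime" in solution,   # stand-in for running unit tests
    ),
]
```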
There's one more section within reinforcement learning that I 02:48:28.780 |
wanted to cover. And that is that of learning in unverifiable 02:48:32.780 |
domains. So, so far, all of the problems that we've looked at 02:48:36.220 |
are in what's called verifiable domains. That is, any candidate 02:48:39.580 |
solution, we can score very easily against a concrete 02:48:43.020 |
answer. So for example, answer is three, and we can very easily 02:48:46.220 |
score these solutions against the answer of three. Either we 02:48:49.980 |
require the models to like box in their answers, and then we 02:48:53.020 |
just check for equality of whatever's in the box with the 02:48:56.140 |
answer. Or you can also use kind of what's called an LLM 02:48:59.500 |
judge. So the LLM judge looks at a solution, and it gets the 02:49:03.340 |
answer, and just basically scores the solution for whether 02:49:06.300 |
it's consistent with the answer or not. And LLMs empirically are 02:49:09.980 |
good enough at the current capability that they can do this 02:49:12.380 |
fairly reliably. So we can apply those kinds of techniques as 02:49:15.100 |
well. In any case, we have a concrete answer, and we're just 02:49:17.820 |
checking solutions against it. And we can do this automatically 02:49:20.780 |
with no kind of humans in the loop. 02:49:23.820 |
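As a minimal sketch of that kind of automatic checking, assuming the model is asked to put its final answer inside \boxed{...} (real graders also normalize answers, or fall back to an LLM judge that is shown the reference answer):

```python
import re

def extract_boxed(solution_text):
    """Return the contents of the last \\boxed{...} in a solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution_text)
    return matches[-1].strip() if matches else None

def is_correct(solution_text, reference_answer):
    """Automatic grading: compare the boxed answer against the reference."""
    answer = extract_boxed(solution_text)
    return answer is not None and answer.replace("$", "") == str(reference_answer)

solution = "Each orange is $2, so the apples cost (13 - 4) / 3 each. \\boxed{3}"
print(is_correct(solution, 3))   # True, with no human in the loop
```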
The problem is that we can't apply the strategy in what's called unverifiable 02:49:26.620 |
domains. So usually these are, for example, creative writing 02:49:29.580 |
tasks like write a joke about pelicans or write a poem or 02:49:32.620 |
summarize a paragraph or something like that. In these 02:49:35.260 |
kinds of domains, it becomes harder to score our different 02:49:38.780 |
solutions to this problem. So for example, writing a joke 02:49:41.660 |
about pelicans, we can generate lots of different jokes, of 02:49:44.380 |
course, that's fine. For example, you can go to ChatGPT 02:49:47.100 |
and we can get it to generate a joke about pelicans. 02:49:50.220 |
So much stuff in their beaks because they don't bellican in 02:49:55.420 |
backpacks. Why? Okay, we can try something else. Why don't 02:50:02.860 |
pelicans ever pay for their drinks because they always bill 02:50:05.980 |
it to someone else? Haha. Okay, so these models are 02:50:10.460 |
obviously not very good at humor. Actually, I think it's 02:50:12.540 |
pretty fascinating because I think humor is secretly very 02:50:14.780 |
difficult and the models don't have the capability, I think. 02:50:17.660 |
Anyway, in any case, you could imagine creating lots of jokes. 02:50:22.780 |
The problem that we are facing is how do we score them? Now, in 02:50:26.300 |
principle, we could, of course, get a human to look at all 02:50:29.260 |
these jokes, just like I did right now. The problem with 02:50:31.980 |
that is if you are doing reinforcement learning, you're 02:50:34.380 |
going to be doing many thousands of updates. And for 02:50:37.420 |
each update, you want to be looking at, say, thousands of 02:50:39.660 |
prompts. And for each prompt, you want to be potentially 02:50:42.060 |
looking at hundreds or thousands of different kinds 02:50:44.860 |
of generations. And so there's just like way too many of these 02:50:48.460 |
to look at. And so, in principle, you could have a 02:50:51.740 |
human inspect all of them and score them and decide that, 02:50:53.900 |
okay, maybe this one is funny. And maybe this one is funny. And 02:50:58.060 |
this one is funny. And we could train on them to get the model 02:51:01.580 |
to become slightly better at jokes, in the context of 02:51:05.100 |
pelicans, at least. The problem is that it's just like way too 02:51:09.580 |
much human time. This is an unscalable strategy. We need 02:51:12.300 |
some kind of an automatic strategy for doing this. And 02:51:15.340 |
one sort of solution to this was proposed in this paper that 02:51:19.340 |
introduced what's called reinforcement learning from 02:51:21.260 |
human feedback. And so this was a paper from OpenAI at the 02:51:24.220 |
time. Many of these people are now co-founders in Anthropic. 02:51:27.580 |
And this kind of proposed an approach for basically doing 02:51:33.100 |
reinforcement learning in unverifiable domains. So let's 02:51:35.820 |
take a look at how that works. So this is the cartoon diagram 02:51:39.420 |
of the core ideas involved. So as I mentioned, the naive 02:51:42.540 |
approach is if we just had infinity human time, we could 02:51:46.300 |
just run RL in these domains just fine. So, for example, we 02:51:50.220 |
can run RL as usual if I have infinity humans. I just want to 02:51:53.820 |
do, and these are just cartoon numbers, I want to do 1,000 02:51:56.700 |
updates where each update will be on 1,000 prompts. And for 02:52:00.780 |
each prompt, we're going to have 1,000 rollouts that we're 02:52:03.820 |
scoring. So we can run RL with this kind of a setup. The 02:52:08.700 |
problem is that in the process of doing this, 02:52:11.580 |
I would need to ask a human to evaluate a joke a total 02:52:15.180 |
of 1,000 × 1,000 × 1,000 = 1 billion times. And so that's a lot of people looking 02:52:18.620 |
at really terrible jokes. So we don't want to do that. So 02:52:21.900 |
instead, we want to take the RLHF approach. So in RLHF 02:52:26.860 |
approach, the core trick is that of 02:52:29.900 |
indirection. So we're going to involve humans just a little 02:52:33.900 |
bit. And the way we cheat is that we basically train a whole 02:52:37.660 |
separate neural network that we call a reward model. And this 02:52:41.500 |
neural network will kind of like imitate human scores. So 02:52:45.340 |
we're going to ask humans to score rollouts, we're going to 02:52:49.500 |
then imitate human scores using a neural network. And this 02:52:54.060 |
neural network will become a kind of simulator of human 02:52:56.060 |
preferences. And now that we have a neural network simulator, 02:52:59.820 |
we can do RL against it. So instead of asking a real human, 02:53:03.740 |
we're asking a simulated human for their score of a joke as an 02:53:08.140 |
example. And so once we have a simulator, we're off to the 02:53:11.820 |
races because we can query it as many times as we want to. And 02:53:15.100 |
it's a fully automatic process. And we can now do 02:53:17.660 |
reinforcement learning with respect to the simulator. And 02:53:20.060 |
the simulator, as you might expect, is not going to be a 02:53:21.980 |
perfect human. But if it's at least statistically similar to 02:53:25.740 |
human judgment, then you might expect that this will do 02:53:28.060 |
something. And in practice, indeed, it does. So once we have 02:53:31.900 |
a simulator, we can do RL and everything works great. So let 02:53:34.860 |
me show you a cartoon diagram a little bit of what this process 02:53:37.980 |
looks like, although the details are not 100% like super 02:53:40.780 |
important, it's just a core idea of how this works. So here we 02:53:43.580 |
have a cartoon diagram of a hypothetical example of what 02:53:46.140 |
training the reward model would look like. So we have a prompt 02:53:49.740 |
like write a joke about pelicans. And then here we have 02:53:52.460 |
five separate rollouts. So these are all five different jokes, 02:53:55.340 |
just like this one. Now, the first thing we're going to do 02:53:59.500 |
is we are going to ask a human to order these jokes from the 02:54:04.300 |
best to worst. So here, this human thought that 02:54:09.100 |
this joke is the best, the funniest. So number one joke, 02:54:12.940 |
this is number two joke, number three joke, four, and five. So 02:54:17.420 |
this is the worst joke. We're asking humans to order instead 02:54:20.620 |
of give scores directly, because it's a bit of an easier task. 02:54:23.580 |
It's easier for a human to give an ordering than to give precise 02:54:26.300 |
scores. Now, that is now the supervision for the model. So 02:54:30.620 |
the human has ordered them. And that is kind of like their 02:54:32.780 |
contribution to the training process. But now separately, 02:54:35.740 |
what we're going to do is we're going to ask a reward model 02:54:38.060 |
about its scoring of these jokes. Now the reward model is a 02:54:42.700 |
whole separate neural network, completely separate neural net. 02:54:45.660 |
And it's also probably a transformer. But it's not a 02:54:50.060 |
language model in the sense that it generates diverse language, 02:54:53.340 |
etc. It's just a scoring model. So the reward model will take as 02:54:57.820 |
an input, the prompt, number one, and number two, a candidate 02:55:02.540 |
joke. So those are the two inputs that go into the reward 02:55:06.300 |
model. So here, for example, the reward model would be taking 02:55:09.020 |
this prompt, and this joke. Now the output of a reward model is 02:55:13.100 |
a single number. And this number is thought of as a score. And 02:55:17.100 |
it can range, for example, from zero to one. So zero would be 02:55:20.300 |
the worst score, and one would be the best score. So here are 02:55:23.820 |
some examples of what a hypothetical reward model at 02:55:26.620 |
some stage in the training process would give as scoring to 02:55:30.060 |
these jokes. So 0.1 is a very low score, 0.8 is a really high 02:55:35.100 |
score, and so on. And so now we compare the scores given by the 02:55:41.340 |
reward model with the ordering given by the human. And there's 02:55:45.580 |
a precise mathematical way to actually calculate this, 02:55:48.860 |
basically set up a loss function and calculate a kind of like a 02:55:52.060 |
correspondence here, and update a model based on it. But I just 02:55:55.740 |
want to give you the intuition, which is that, as an example 02:55:58.780 |
here, for this second joke, the human thought that it was the 02:56:02.220 |
funniest, and the model kind of agreed, right? 0.8 is a relatively 02:56:05.260 |
high score. But this score should have been even higher, 02:56:07.900 |
right? So we would expect that maybe the 02:56:11.420 |
score will actually grow after an update of 02:56:14.300 |
the network to be like, say, 0.81 or something. For this one 02:56:18.860 |
here, they actually are in a massive disagreement, because 02:56:21.100 |
the human thought that this was number two, but here the score 02:56:24.380 |
is only 0.1. And so this score needs to be much higher. So 02:56:29.180 |
after an update, on top of this kind of a supervision, this 02:56:33.180 |
might grow a lot more, like maybe it's 0.15 or something 02:56:35.580 |
like that. And then here, the human thought that this one was 02:56:40.300 |
the worst joke, but here the model actually gave it a fairly 02:56:43.340 |
high number. So you might expect that after the update, this 02:56:46.940 |
would come down to maybe 0.35 or something like that. So 02:56:50.060 |
basically, we're doing what we did before. We're slightly 02:56:52.700 |
nudging the predictions of the model using the neural network 02:56:57.340 |
training process. And we're trying to make the reward model 02:57:01.020 |
scores be consistent with human ordering. And so as we update 02:57:06.940 |
the reward model on human data, it becomes a better and better 02:57:10.300 |
simulator of the scores and orders that humans provide, and 02:57:14.860 |
then becomes kind of like the simulator of human 02:57:18.700 |
preferences, which we can then do RL against. But critically, 02:57:22.300 |
we're not asking humans 1 billion times to look at a joke. 02:57:25.180 |
We're maybe looking at 1000 prompts and 5 rollouts each. So 02:57:28.380 |
maybe 5000 jokes that humans have to look at in total. And 02:57:31.900 |
they just give the ordering, and then we're training the model 02:57:34.140 |
to be consistent with that ordering. And I'm skipping over 02:57:36.620 |
the mathematical details. But I just want you to understand a 02:57:39.580 |
high level idea that this reward model is basically giving us 02:57:43.900 |
the scores, and we have a way of training it to be consistent 02:57:47.020 |
with human orderings. And that's how RLHF works. 02:57:50.940 |
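For the mathematical detail being skipped, one common choice is a pairwise, Bradley-Terry style ranking loss. The sketch below just evaluates that loss on the cartoon scores and ranking from the example above (both hypothetical); a real implementation computes it on the reward model's raw outputs and reduces it by gradient descent:

```python
import math

# Reward-model scores for the five jokes, and the human's ranking
# (1 = funniest ... 5 = worst). Cartoon values, roughly as in the example.
model_scores = [0.8, 0.1, 0.3, 0.2, 0.4]
human_rank   = [1,   2,   3,   4,   5]

def pairwise_ranking_loss(scores, ranks):
    """For every pair where the human preferred joke i over joke j,
    penalize the model when score_i is not above score_j:
    loss = -log(sigmoid(score_i - score_j)), averaged over pairs."""
    total, n = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if ranks[i] < ranks[j]:                      # i preferred over j
                margin = scores[i] - scores[j]
                total += -math.log(1.0 / (1.0 + math.exp(-margin)))
                n += 1
    return total / n

print(pairwise_ranking_loss(model_scores, human_rank))
# Updating the reward model to shrink this loss pushes preferred jokes toward
# higher scores and dispreferred ones toward lower scores, which is exactly
# the "nudging toward consistency with the human ordering" described above.
```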
Okay, so that is the rough idea. We basically train simulators of 02:57:53.980 |
humans and RL with respect to those simulators. Now, I want 02:57:58.060 |
to talk about first, the upside of reinforcement learning from 02:58:01.580 |
human feedback. The first thing is that this allows us to run 02:58:06.540 |
reinforcement learning, which we know is incredibly powerful 02:58:09.100 |
kind of set of techniques. And it allows us to do it in 02:58:11.500 |
arbitrary domains, including the ones that are unverifiable. 02:58:15.500 |
So things like summarization, and poem writing, joke writing, 02:58:18.700 |
or any other creative writing, really, in domains outside of 02:58:21.900 |
math and code, etc. Now, empirically, what we see when 02:58:25.660 |
we actually apply RLHF is that this is a way to improve the 02:58:28.780 |
performance of the model. And I have a possible answer for why that 02:58:33.740 |
might be, but I don't think it is super well 02:58:36.940 |
established why this is. You can empirically observe 02:58:39.900 |
that when you do RLHF correctly, the models you get are just 02:58:43.180 |
a little bit better. But as to why, I think that's not as 02:58:46.460 |
clear. So here's my best guess. My best guess is that this is 02:58:49.580 |
possibly mostly due to the discriminator generator gap. 02:58:52.780 |
What that means is that in many cases, it is significantly 02:58:57.500 |
easier to discriminate than to generate for humans. So in 02:59:01.100 |
particular, an example of this is when we do supervised 02:59:06.780 |
fine tuning, right, SFT. We're asking humans to generate the 02:59:12.060 |
ideal assistant response. And in many cases here, as I've shown 02:59:17.020 |
it, the ideal response is very simple to write, but in many 02:59:20.540 |
cases might not be. So for example, in summarization, or 02:59:23.500 |
poem writing, or joke writing, like how are you as a human 02:59:26.460 |
labeler, supposed to come up with the ideal 02:59:30.140 |
response in these cases, it requires creative human writing 02:59:33.180 |
to do that. And so RLHF kind of sidesteps this, because 02:59:37.580 |
we get to ask people a significantly easier question as 02:59:41.100 |
data labelers: they're not asked to write poems directly, 02:59:44.540 |
they're just given five poems from the model, and they're just 02:59:47.180 |
asked to order them. And so that's just a much easier task 02:59:50.860 |
for a human labeler to do. And so what I think this allows you 02:59:54.380 |
to do basically is get a lot 02:59:59.740 |
higher accuracy data, because we're not asking people to do 03:00:03.020 |
the generation task, which can be extremely difficult. Like 03:00:05.900 |
we're not asking them to do creative writing, we're just 03:00:08.140 |
trying to get them to distinguish between creative 03:00:10.380 |
writings, and find ones that are best. And that is the signal 03:00:14.860 |
that humans are providing just the ordering. And that is their 03:00:17.740 |
input into the system. And then the system in RLHF just 03:00:21.820 |
discovers the kinds of responses that would be graded well by 03:00:26.060 |
humans. And so that step of indirection allows the models to 03:00:30.380 |
become even better. So that is the upside of RLHF. It allows 03:00:34.540 |
us to run RL, it empirically results in better models, and 03:00:37.900 |
it allows people to contribute their supervision, even without 03:00:41.900 |
having to do extremely difficult tasks in the case of writing 03:00:45.500 |
ideal responses. Unfortunately, RLHF also comes with 03:00:48.780 |
significant downsides. And so the main one is that basically 03:00:54.620 |
we are doing reinforcement learning, not with respect to 03:00:56.860 |
humans and actual human judgment, but with respect to a 03:00:59.340 |
lossy simulation of humans, right? And this lossy 03:01:02.380 |
simulation could be misleading, because it's just a 03:01:05.180 |
simulation, right? It's just a language model that's kind of 03:01:08.060 |
outputting scores, and it might not perfectly reflect the 03:01:11.340 |
opinion of an actual human with an actual brain in all the 03:01:14.780 |
possible different cases. So that's number one. There's 03:01:17.500 |
actually something even more subtle and devious going on that 03:01:19.980 |
really dramatically holds back RLHF as a technique that we can 03:01:25.500 |
really scale to significantly smarter systems. And that 03:01:31.580 |
is that reinforcement learning is extremely good at discovering 03:01:34.700 |
a way to game the model, to game the simulation. So this reward 03:01:39.660 |
model that we're constructing here, that gives the scores, 03:01:42.380 |
these models are transformers. These transformers are massive 03:01:47.260 |
neural nets. They have billions of parameters, and they imitate 03:01:50.540 |
humans, but they do so in a kind of like a simulation way. Now, 03:01:53.820 |
the problem is that these are massive, complicated systems, 03:01:56.380 |
right? There's a billion parameters here that are 03:01:58.300 |
outputting a single score. It turns out that there are ways 03:02:02.700 |
to game these models. You can find kinds of inputs that were 03:02:07.180 |
not part of their training set. And these inputs inexplicably 03:02:11.820 |
get very high scores, but in a fake way. So very often what you 03:02:16.780 |
find if you run RLHF for very long, so for example, if we do 03:02:19.820 |
1000 updates, which is like say a lot of updates, you might 03:02:23.660 |
expect that your jokes are getting better and that you're 03:02:25.820 |
getting like real bangers about pelicans, but that's not exactly 03:02:29.260 |
what happens. What happens is that in the first few hundred 03:02:33.500 |
steps, the jokes about pelicans are probably improving a little 03:02:35.900 |
bit. And then they actually dramatically fall off the cliff 03:02:38.620 |
and you start to get extremely nonsensical results. Like for 03:02:41.980 |
example, you start to get the top joke about pelicans starts 03:02:45.420 |
to be the the the the the the. And this makes no sense, right? 03:02:48.940 |
Like when you look at it, why should this be a top joke? But 03:02:51.180 |
when you take the the the the the the and you plug it into 03:02:53.900 |
your reward model, you'd expect score of zero, but actually the 03:02:57.260 |
reward model loves this as a joke. It will tell you that the 03:03:01.180 |
the the the the is a score of 1.0. This is a top joke and this 03:03:06.380 |
makes no sense, right? But it's because these models are just 03:03:08.940 |
simulations of humans and they're massive neural nets, and 03:03:11.660 |
you can find inputs at the bottom that kind of like get 03:03:15.180 |
into the part of the input space that kind of gives you 03:03:17.100 |
nonsensical results. These examples are what's called 03:03:20.220 |
adversarial examples, and I'm not going to go into the topic 03:03:22.860 |
too much, but these are adversarial inputs to the model. 03:03:25.980 |
They are specific little inputs that kind of go between the 03:03:29.740 |
nooks and crannies of the model and give nonsensical results at 03:03:32.380 |
the top. Now here's what you might imagine doing. You say, 03:03:35.820 |
okay, the the the is obviously not a score of one. It's obviously 03:03:39.580 |
a low score. So let's take the the the the the, let's add it 03:03:42.700 |
to the data set and give it an ordering that is extremely bad, 03:03:46.140 |
like ranking it last out of the five. And indeed, your model will learn 03:03:48.940 |
that the the the the should have a very low score, and will 03:03:51.740 |
give it a score of zero. The problem is that there will 03:03:54.140 |
always be a basically infinite number of nonsensical 03:03:57.580 |
adversarial examples hiding in the model. If you iterate this 03:04:01.500 |
process many times and you keep adding nonsensical stuff to 03:04:04.140 |
your reward model and giving it very low scores, you'll never 03:04:07.660 |
win the game. You can do this many, many rounds, and 03:04:10.940 |
reinforcement learning, if you run it long enough, will always 03:04:13.660 |
find a way to game the model. It will discover adversarial 03:04:16.460 |
examples. It will get really high scores with nonsensical 03:04:20.220 |
results. And fundamentally, this is because our scoring 03:04:23.820 |
function is a giant neural net, and RL is extremely good at 03:04:28.860 |
finding just the ways to trick it. So long story short, you 03:04:35.340 |
always run RLHF for maybe a few hundred updates, the model is 03:04:38.780 |
getting better, and then you have to crop it and you are 03:04:41.180 |
done. You can't run too much against this reward model 03:04:45.980 |
because the optimization will start to game it, and you 03:04:49.420 |
basically crop it, call it done, and you ship it. And you can 03:04:55.340 |
improve the reward model, but you kind of like come across 03:04:57.260 |
these situations eventually at some point. So RLHF, basically 03:05:02.140 |
what I usually say is that RLHF is not RL. And what I mean by 03:05:05.740 |
that is, I mean, RLHF is RL, obviously, but it's not RL in 03:05:09.900 |
the magical sense. This is not RL that you can run 03:05:12.860 |
indefinitely. Contrast that with the kinds of problems where 03:05:16.700 |
you are getting a concrete correct answer: you cannot game those as 03:05:19.900 |
easily. You either got the correct answer or you didn't. 03:05:22.380 |
And the scoring function is much, much simpler. You're just 03:05:24.540 |
looking at the boxed answer and seeing if the result is 03:05:27.180 |
correct. So it's very difficult to game these functions, but 03:05:31.260 |
gaming a reward model is possible. 03:05:34.060 |
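For contrast, here is roughly what a verifiable scoring function looks like for a math problem whose answer is wrapped in the \boxed{...} convention. The regex and the example strings are just illustrative, but the point stands: it is a fixed little program, not a giant neural net, so there are no nooks and crannies for RL to exploit.

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # extract the contents of \boxed{...} and compare to the known answer
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward(r"... carrying the 1, the total is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"the the the the \boxed{the}", "42"))                  # 0.0, gibberish can't win
```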
Now, in these verifiable domains, you can run RL indefinitely. You could run 03:05:37.900 |
for tens of thousands, hundreds of thousands of steps and 03:05:40.220 |
discover all kinds of really crazy strategies that we might 03:05:42.540 |
not ever even think of, for performing really well on all 03:05:46.060 |
these problems. In the game of Go, there's no way to basically 03:05:50.860 |
game the winning of a game or losing of a game. We have a 03:05:54.220 |
perfect simulator. We know where all the stones are placed, and 03:05:59.100 |
we can calculate whether someone has won or not. There's no way 03:06:02.140 |
to game that. And so you can do RL indefinitely, and you can 03:06:05.740 |
eventually beat even Lee Sedol. But with models like this, 03:06:10.540 |
which are gameable, you cannot repeat this process 03:06:13.340 |
indefinitely. So I kind of see RLHF as not real RL because the 03:06:18.060 |
reward function is gameable. So it's kind of more like in the 03:06:21.260 |
realm of like little fine-tuning. It's a little 03:06:25.260 |
improvement, but it's not something that is fundamentally 03:06:27.980 |
set up correctly, where you can insert more compute, run for 03:06:31.500 |
longer, and get much better and magical results. So it's 03:06:35.580 |
not RL in that sense. It's not RL in the sense that it lacks 03:06:38.940 |
magic. It can fine-tune your model and get a better 03:06:41.980 |
performance. And indeed, if we go back to ChatGPT, the GPT-4o 03:06:47.020 |
model has gone through RLHF because it works well, but it's 03:06:51.420 |
just not RL in the same sense. RLHF is like a little fine-tune 03:06:54.940 |
that slightly improves your model, is maybe like the way I 03:06:57.180 |
would think about it. Okay, so that's most of the technical 03:06:59.820 |
content that I wanted to cover. I took you through the three 03:07:02.780 |
major stages and paradigms of training these models. 03:07:05.820 |
Pre-training, supervised fine-tuning, and reinforcement 03:07:08.220 |
learning. And I showed you that they loosely correspond to the 03:07:11.020 |
process we already use for teaching children. And so in 03:07:14.380 |
particular, we talked about pre-training being sort of like 03:07:17.100 |
the basic knowledge acquisition of reading exposition, 03:07:20.380 |
supervised fine-tuning being the process of looking at lots 03:07:22.940 |
and lots of worked examples and imitating experts, and 03:07:27.180 |
reinforcement learning being the practice problems. The only difference is that we now have to effectively 03:07:30.620 |
write textbooks for LLMs and AIs across all the disciplines of 03:07:34.940 |
human knowledge. And also in all the cases where we actually 03:07:38.460 |
would like them to work, like code and math and basically all 03:07:42.860 |
the other disciplines. So we're in the process of writing 03:07:45.020 |
textbooks for them, refining all the algorithms that I've 03:07:47.900 |
presented on the high level. And then of course, doing a 03:07:50.380 |
really, really good job at the execution of training these 03:07:53.020 |
models at scale and efficiently. So in particular, I didn't go 03:07:56.380 |
into too many details, but these are extremely large and 03:07:59.900 |
complicated distributed sort of jobs that have to run over 03:08:06.380 |
tens of thousands or even hundreds of thousands of GPUs. 03:08:08.620 |
And the engineering that goes into this is really at the 03:08:12.060 |
state of the art of what's possible with computers at that 03:08:14.300 |
scale. So I didn't cover that aspect too much, but this is a 03:08:21.020 |
very kind of serious endeavor underlying all these very simple 03:08:24.140 |
algorithms ultimately. Now, I also talked about sort of like 03:08:28.620 |
the theory of mind of these models a little bit. And the thing 03:08:31.020 |
I want you to take away is that these models are really good 03:08:33.820 |
and they're extremely useful as tools for your work, but you 03:08:37.180 |
shouldn't fully trust them. And I showed you some 03:08:39.660 |
examples of that. Even though we have mitigations for 03:08:41.980 |
hallucinations, the models are not perfect and they will 03:08:44.300 |
hallucinate still. It's gotten better over time and it will 03:08:47.420 |
continue to get better, but they can hallucinate. In 03:08:50.620 |
addition to that, I covered kind of like what I 03:08:53.820 |
call the Swiss cheese sort of model of LLM capabilities that 03:08:57.100 |
you should have in your mind. The models are incredibly good 03:08:59.500 |
across so many different disciplines, but then fail 03:09:01.580 |
almost randomly in some unique cases. So for example, what is 03:09:05.580 |
bigger, 9.11 or 9.9? Like the model doesn't know, but 03:09:08.700 |
simultaneously it can turn around and solve Olympiad 03:09:12.140 |
questions. And so this is a hole in the Swiss cheese and 03:09:15.500 |
there are many of them and you don't want to trip over them. 03:09:17.900 |
So don't treat these models as infallible models, check their 03:09:23.260 |
work, use them as tools, use them for inspiration, use them 03:09:26.380 |
for the first draft, but work with them as tools and be 03:09:30.060 |
ultimately responsible for the, you know, product of your work. 03:09:34.300 |
And that's roughly what I wanted to talk about. This is 03:09:39.820 |
how they're trained and this is what they are. Let's now turn 03:09:42.780 |
to what are some of the future capabilities of these models, 03:09:45.980 |
probably what's coming down the pipe. And also where can you 03:09:47.980 |
find these models? I have a few bullet points on some of the 03:09:50.780 |
things that you can expect coming down the pipe. The first 03:09:53.340 |
thing you'll notice is that models will very rapidly become 03:09:56.540 |
multimodal. Everything I've talked about concerned 03:09:59.420 |
text, but very soon we'll have LLMs that can not just handle 03:10:02.700 |
text, but they can also operate natively and very easily over 03:10:06.380 |
audio so they can hear and speak, and also images so they 03:10:09.660 |
can see and paint. And we're already seeing the beginnings 03:10:13.100 |
of all of this, but this will be all done natively inside the 03:10:17.420 |
language model, and this will enable kind of like natural 03:10:19.900 |
conversations. And roughly speaking, the reason that this 03:10:22.540 |
is actually no different from everything we've covered above 03:10:25.180 |
is that as a baseline, you can tokenize audio and images and 03:10:30.460 |
apply the exact same approaches of everything that we've talked 03:10:32.780 |
about above. So it's not a fundamental change. It's just 03:10:35.820 |
we have to add some tokens. So as an example, for tokenizing 03:10:40.380 |
audio, we can look at slices of the spectrogram of the audio 03:10:43.740 |
signal and we can tokenize that and just add more tokens that 03:10:47.580 |
suddenly represent audio and just add them into the context 03:10:50.460 |
windows and train on them just like above. The same for images, 03:10:53.500 |
we can use patches and we can separately tokenize patches, and 03:10:58.060 |
then what is an image? An image is just a sequence of tokens. 03:11:01.180 |
And this actually kind of works, and there's a lot of early work 03:11:04.700 |
in this direction. And so we can just create streams of tokens 03:11:07.980 |
that are representing audio, images, as well as text, and 03:11:10.780 |
intersperse them and handle them all simultaneously in a 03:11:13.100 |
single model. So that's one example of multimodality. 03:11:17.420 |
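As a toy illustration of the "an image is just a sequence of tokens" idea, here is a sketch in numpy: cut an image into patches, snap each patch to its nearest entry in a codebook, and you are left with a flat list of integer ids that can sit in the same context window as text tokens. The patch size, the random codebook, and the nearest-neighbor quantizer are arbitrary choices for illustration, not any particular model's actual scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))                 # pretend this is an RGB image in [0, 1]

# 1) cut the image into 8x8 patches, flatten each patch into a vector
P = 8
patches = image.reshape(64 // P, P, 64 // P, P, 3).transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

# 2) quantize each patch to its nearest entry in a "codebook" (random here, learned in practice)
codebook = rng.random((1024, P * P * 3))        # vocabulary of 1024 visual tokens
dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
image_tokens = dists.argmin(axis=1)             # one integer token id per patch

print(image_tokens.shape)                       # (64,) -- an 8x8 grid of patches becomes 64 tokens
# These ids can be interleaved with text tokens (and with audio tokens made from
# spectrogram slices) in one context window and trained on with the same
# next-token objective as everything above.
```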
Second, something that people are very interested in: 03:11:19.580 |
currently, most of the work involves handing individual 03:11:23.500 |
tasks to the models on kind of like a silver platter, like 03:11:26.380 |
please solve this task for me. And the model sort of like does 03:11:28.860 |
this little task. But it's up to us to still sort of like 03:11:32.540 |
organize a coherent execution of tasks to perform jobs. And 03:11:37.340 |
the models are not yet at the capability required to do this 03:11:41.420 |
in a coherent error correcting way over long periods of time. 03:11:46.060 |
So they're not able to fully string together tasks to 03:11:48.380 |
perform these longer running jobs. But they're getting there 03:11:51.660 |
and this is improving over time. But probably what's going to 03:11:55.340 |
happen here is we're going to start to see what's called 03:11:57.260 |
agents, which perform tasks over time, and you 03:12:00.940 |
supervise them, and you watch their work, and they come to you 03:12:04.140 |
once in a while to report progress, and so on. So we're 03:12:07.100 |
going to see more long running agents, tasks that don't just 03:12:10.300 |
take, you know, a few seconds of response, but many tens of 03:12:12.940 |
seconds, or even minutes or hours over time. But these 03:12:17.020 |
models are not infallible, as we talked about above. So all 03:12:19.740 |
this will require supervision. So for example, in factories, 03:12:22.620 |
people talk about the human to robot ratio for automation, I 03:12:26.940 |
think we're going to see something similar in the 03:12:28.380 |
digital space, where we are going to be talking about human 03:12:31.180 |
to agent ratios, where humans become more like supervisors 03:12:34.620 |
of agentic tasks in the digital domain. Next, I think 03:12:40.940 |
everything's going to become a lot more pervasive and 03:12:42.620 |
invisible. So it's kind of like integrated into the tools and 03:12:46.380 |
everywhere. And in addition, kind of like computer use. So 03:12:52.460 |
right now, these models aren't able to take actions on your 03:12:54.700 |
behalf. But I think this is a separate bullet point. If you 03:12:59.900 |
saw ChatGPT launch Operator, then that's one early 03:13:03.580 |
example of that where you can actually hand off control to 03:13:05.820 |
the model to perform, you know, keyboard and mouse actions on 03:13:09.340 |
your behalf. So that's also something that I think is 03:13:11.660 |
very interesting. The last point I have here is just a 03:13:14.220 |
general comment that there's still a lot of research to 03:13:16.060 |
potentially do in this domain. One example of that is 03:13:19.580 |
something along the lines of test time training. So remember 03:13:22.140 |
that everything we've done above, and that we talked about 03:13:24.620 |
has two major stages. There's first the training stage where 03:13:27.900 |
we tune the parameters of the model to perform the tasks 03:13:30.700 |
well. Once we get the parameters, we fix them, and 03:13:33.820 |
then we deploy the model for inference. From there, the 03:13:37.020 |
model is fixed, it doesn't change anymore, it doesn't 03:13:39.740 |
learn from all the stuff that it's doing at test time, it's a 03:13:42.140 |
fixed number of parameters. And the only thing that is 03:13:45.740 |
changing is now the tokens inside the context windows. And 03:13:48.940 |
so the only type of learning or test time learning that the 03:13:51.900 |
model has access to is the in-context learning of its 03:13:55.500 |
kind of like dynamically adjustable context window, 03:13:58.780 |
depending on like what it's doing at test time. So, but I 03:14:02.540 |
think this is still different from humans who actually are 03:14:04.460 |
able to like actually learn depending on what they're 03:14:06.780 |
doing, especially when you sleep, for example, like your 03:14:09.020 |
brain is updating your parameters or something like 03:14:10.860 |
that, right? So there's no kind of equivalent of that 03:14:13.660 |
currently in these models and tools. So there's a lot of 03:14:16.380 |
like more wonky ideas, I think, that are to be explored 03:14:19.020 |
still. And in particular, I think this will be necessary 03:14:22.300 |
because the context window is a finite and precious resource. 03:14:25.660 |
And especially once we start to tackle very long running 03:14:28.460 |
multimodal tasks, and we're putting in videos, and these 03:14:31.420 |
token windows will basically start to grow extremely large, 03:14:35.100 |
like not thousands or even hundreds of thousands, but 03:14:37.660 |
significantly beyond that. And the only trick, the only kind 03:14:41.100 |
of trick we have available to us right now is to make the 03:14:43.340 |
context windows longer. But I think that that approach by 03:14:46.620 |
itself will not scale to actual long running tasks 03:14:49.980 |
that are multimodal over time. And so I think new ideas are 03:14:53.020 |
needed in some of those disciplines, in some of those 03:14:56.620 |
kinds of cases where these tasks are going to 03:14:59.100 |
require very long contexts. So those are some examples of 03:15:02.860 |
some of the things you can expect coming down the pipe. 03:15:06.300 |
Let's now turn to where you can actually kind of keep track of 03:15:09.340 |
this progress, and you know, be up to date with the latest and 03:15:13.580 |
greatest of what's happening in the field. So I would say the 03:15:15.660 |
three resources that I have consistently used to stay up to 03:15:18.460 |
date are number one, LM Arena. So let me show you LM Arena. 03:15:22.620 |
This is basically an LLM leaderboard. And it ranks all 03:15:28.460 |
the top models. And the ranking is based on human comparisons. 03:15:32.780 |
So humans prompt these models, and they get to judge which one 03:15:35.900 |
gives a better answer. They don't know which model is which 03:15:38.460 |
they're just looking at which model is the better answer. And 03:15:41.180 |
you can calculate a ranking and then you get some results. 03:15:44.300 |
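Leaderboards of this kind typically turn the blind pairwise votes into a rating with an Elo or Bradley-Terry style model. Here is a minimal Elo-style sketch of the update rule, with made-up model names, votes, and K-factor, just to show how pairwise judgments become a ranking:

```python
# Elo-style rating from blind pairwise votes: the winner gains and the loser loses,
# scaled by how "surprising" the outcome was given the current ratings.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"),
         ("model_b", "model_c"), ("model_a", "model_b")]   # (winner, loser) pairs
K = 32.0

for winner, loser in votes:
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser]  -= K * (1.0 - expected_win)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # model_a comes out on top
```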
And so what you can see here are the different organizations, 03:15:46.780 |
like Google with Gemini, for example, 03:15:49.020 |
that produce these models. When you click on any one of these, 03:15:51.580 |
it takes you to the place where that model is hosted. And then 03:15:56.220 |
here we see Google is currently on top with OpenAI right behind. 03:15:59.980 |
Here we see DeepSeek in position number three. Now the 03:16:03.500 |
reason this is a big deal is the last column here, you see 03:16:05.820 |
license, DeepSeek is an MIT licensed model. It's open 03:16:09.580 |
weights, anyone can use these weights, anyone can download 03:16:12.540 |
them, anyone can host their own version of DeepSeek, and they 03:16:15.900 |
can use it in whatever way they like. And so it's not a 03:16:18.540 |
proprietary model that you don't have access to it. It's 03:16:20.780 |
basically an open weights release. And so this is kind of 03:16:24.460 |
unprecedented that a model this strong was released with open 03:16:28.380 |
weights. So pretty cool from the team. Next up, we have a few 03:16:32.060 |
more models from Google and OpenAI. And then when you 03:16:34.380 |
continue to scroll down, you're starting to see some other 03:16:36.460 |
usual suspects. So xAI here, Anthropic with Sonnet here at 03:16:42.060 |
number 14. And then Meta with Llama over here. So Llama, similar 03:16:51.180 |
to DeepSeek, is an open weights model, but it's down here 03:16:55.580 |
as opposed to up here. Now I will say that this leaderboard 03:16:58.540 |
was really good for a long time. I do think that in the last few 03:17:03.820 |
months, it's become a little bit gamed. And I don't trust it as 03:17:07.980 |
much as I used to. I think just empirically, I feel like a lot 03:17:12.220 |
of people, for example, are using Sonnet from Anthropic and 03:17:15.180 |
that it's a really good model. So but that's all the way down 03:17:17.740 |
here in number 14. And conversely, I think not as many 03:17:22.460 |
people are using Gemini, but it's ranking really, really 03:17:24.460 |
high. So I think use this as a first pass, but sort of try out 03:17:31.660 |
a few of the models for your tasks and see which one 03:17:33.660 |
performs better. The second thing that I would point to is 03:17:37.100 |
the AI News newsletter. So AI News is not very creatively 03:17:42.300 |
named, but it is a very good newsletter produced by Swyx and 03:17:45.340 |
Friends. So thank you for maintaining it. And it's been 03:17:47.660 |
very helpful to me because it is extremely comprehensive. So 03:17:50.460 |
if you go to archives, you see that it's produced almost every 03:17:53.740 |
other day. And it is very comprehensive. And some of it is 03:17:58.140 |
written by humans and curated by humans, but a lot of it is 03:18:00.700 |
constructed automatically with LLMs. So you'll see that these 03:18:03.580 |
are very comprehensive, and you're probably not missing 03:18:05.900 |
anything major, if you go through it. Of course, you're 03:18:08.860 |
probably not going to go through it because it's so long. But I 03:18:11.420 |
do think that these summaries all the way up top are quite 03:18:14.620 |
good, and I think have some human oversight. So this has 03:18:18.140 |
been very helpful to me. And the last thing I would point to 03:18:20.620 |
is just X and Twitter. A lot of AI happens on X. And so I would 03:18:25.660 |
just follow people who you like and trust and get all your 03:18:29.260 |
latest and greatest on X as well. So those are the major 03:18:32.620 |
places that have worked for me over time. And finally, a few 03:18:35.100 |
words on where you can find the models, and where can you use 03:18:37.900 |
them. So the first one I would say is for any of the biggest 03:18:41.020 |
proprietary models, you just have to go to the website of 03:18:43.420 |
that LLM provider. So for example, for OpenAI, that's 03:18:46.380 |
chat.com, I believe actually works now. So that's for OpenAI. 03:18:50.780 |
Now for Gemini, I think it's 03:18:55.660 |
gemini.google.com, or AI Studio. I think they have two for some 03:18:59.260 |
reason that I don't fully understand. No one does. For the 03:19:03.500 |
open weights models like DeepSeek, Llama, etc., you have to 03:19:06.060 |
go to some kind of an inference provider of LLMs. So my 03:19:08.620 |
favorite one is Together, at together.ai. And I showed you 03:19:11.180 |
that when you go to the playground of together.ai, then 03:19:14.060 |
you can sort of pick lots of different models. And all of 03:19:16.620 |
these are open models of different types. And you can 03:19:19.100 |
talk to them here as an example. Now, if you'd like to 03:19:23.900 |
use a base model, then I think it's 03:19:27.980 |
not as common to find base models, 03:19:30.220 |
even on these inference providers, they are all 03:19:32.220 |
targeting assistants and chat. And so I think even here, I 03:19:36.060 |
couldn't see base models. So for base 03:19:38.780 |
models, I usually go to Hyperbolic, because they serve 03:19:41.820 |
the Llama 3.1 base model. And I love that model. And you can just 03:19:46.140 |
talk to it here. So as far as I know, this is a good 03:19:49.180 |
place for a base model. And I wish more people hosted base 03:19:52.140 |
models, because they are useful and interesting to work 03:19:54.300 |
with in some cases. Finally, you can also take some of the 03:19:57.980 |
models that are smaller, and you can run them locally. And 03:20:01.420 |
so for example, DeepSeek, the biggest model, you're not 03:20:04.220 |
going to be able to run locally on your MacBook. But there 03:20:07.100 |
are smaller versions of the DeepSeek model that are what's 03:20:09.340 |
called distilled. And then also, you can run these models 03:20:11.980 |
at smaller precision, so not at the native precision of, for 03:20:14.700 |
example, fp8 for DeepSeek, or, you know, bf16 for Llama, but much, 03:20:19.980 |
much lower than that. And don't worry if you don't fully 03:20:23.980 |
understand those details, but you can run smaller versions 03:20:26.300 |
that have been distilled, and then at even lower precision, 03:20:29.020 |
and then you can fit them on your computer. And so you can 03:20:32.940 |
actually run pretty okay models on your laptop. 03:20:36.220 |
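To give a feel for what running at lower precision means, here is a toy sketch of symmetric int8 quantization of a single weight matrix in numpy. Real tools use fancier schemes (per-group scales, 4-bit formats, and so on), so treat this purely as an illustration of the memory trade-off:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)    # one weight matrix of a model

# symmetric int8 quantization: store 1 byte per weight plus one scale per matrix
scale = np.abs(W).max() / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_restored = W_int8.astype(np.float32) * scale               # what you actually multiply with at inference

print(f"fp32: {W.nbytes / 1e6:.1f} MB, int8: {W_int8.nbytes / 1e6:.1f} MB")  # roughly 4x smaller
print(f"mean abs error: {np.abs(W - W_restored).mean():.4f}")                # small approximation error
```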
And my favorite place that I usually go to is LM Studio, which is 03:20:39.580 |
basically an app you can get. And I think it kind of actually 03:20:42.940 |
looks really ugly. And I don't like that it shows you 03:20:45.500 |
all these models that are basically not that useful, like 03:20:47.580 |
everyone just wants to run DeepSeek. So I don't know why 03:20:49.660 |
they give you these 500 different types of models, 03:20:52.140 |
they're really complicated to search for. And you have to 03:20:53.980 |
choose different distillations and different precisions. And 03:20:57.740 |
it's all really confusing. But once you actually understand 03:20:59.980 |
how it works, and that's a whole separate video, then you 03:21:02.300 |
can actually load up a model. Like here, I loaded up a Llama 03:21:05.020 |
3.2 Instruct 1 billion. And you can just talk to it. So I 03:21:12.220 |
asked for pelican jokes, and I can ask for another one. And it 03:21:14.860 |
gives me another one, etc. All of this that happens here is 03:21:18.460 |
locally on your computer. So we're not actually going 03:21:20.940 |
anywhere else or to anyone else, this is running on the GPU of the 03:21:23.980 |
MacBook Pro. So that's very nice. And you can then eject 03:21:27.340 |
the model when you're done. And that frees up the RAM. So LM 03:21:31.020 |
Studio is probably like my favorite one, even though I 03:21:33.180 |
think it's got a lot of UI/UX issues. And it's really 03:21:35.900 |
geared towards professionals almost. But if you watch some 03:21:39.820 |
videos on YouTube, I think you can figure out how to 03:21:42.220 |
use this interface. So those are a few words on where to find 03:21:45.900 |
them. So let me now loop back around to where we started. The 03:21:49.100 |
question was, when we go to chatgpt.com, and we enter 03:21:52.700 |
some kind of a query, and we hit go, what exactly is 03:21:57.100 |
happening here? What are we seeing? What are we talking to? 03:22:00.540 |
How does this work? And I hope that this video gave you some 03:22:04.300 |
appreciation for some of the under the hood details of how 03:22:07.340 |
these models are trained, and what this is that is coming 03:22:09.820 |
back. So in particular, we now know that your query is taken, 03:22:13.740 |
and is first chopped up into tokens. So we go to 03:22:17.340 |
Tiktokenizer. And here, in the place in the sort of 03:22:22.300 |
format that is reserved for the user query, we basically put in our 03:22:27.260 |
query right there. So our query goes into what we discussed 03:22:31.180 |
here, the conversation protocol format, which is the 03:22:34.460 |
way that we maintain conversation objects. So this 03:22:37.580 |
gets inserted there. And then this whole thing ends up being 03:22:40.300 |
just a token sequence, a one dimensional token sequence 03:22:42.940 |
under the hood. So ChatGPT saw this token sequence. And then 03:22:47.180 |
when we hit go, it basically continues appending tokens into 03:22:51.420 |
this list, it continues the sequence, it acts like a token 03:22:54.460 |
autocomplete. So in particular, it gave us this response. So we 03:22:58.780 |
can basically just put it here, and we see the tokens that it 03:23:01.820 |
continued. These are the tokens that it continued with roughly. 03:23:05.180 |
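To make that concrete, here is a toy sketch of the kind of serialization that turns a conversation object into one flat token sequence that the model then autocompletes. The special token strings and the byte-level "tokenizer" are hypothetical stand-ins for illustration, not OpenAI's actual format:

```python
# Toy illustration: a conversation object is flattened into one string with special
# tokens, turned into integer ids, and the model just keeps appending ids to it.
conversation = [
    {"role": "user", "content": "Write a haiku about pelicans."},
]

def render(conv):
    # hypothetical chat template; real systems use their own learned special tokens
    out = ""
    for turn in conv:
        out += f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n"
    out += "<|im_start|>assistant\n"           # the model continues from here
    return out

prompt = render(conversation)

# stand-in tokenizer: real systems use a learned BPE vocabulary, not raw bytes
token_ids = list(prompt.encode("utf-8"))
print(prompt)
print(token_ids[:20], "...")                   # one flat 1-D sequence of integers
# The model's only job is to keep predicting the next id until a stop token,
# and that continuation, decoded back to text, is the reply you see.
```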
Now the question becomes, okay, why are these the tokens that 03:23:10.460 |
the model responded with? What are these tokens? Where are they 03:23:13.100 |
coming from? What are we talking to? And how do we program the 03:23:17.100 |
system? And so that's where we shifted gears. And we talked 03:23:20.380 |
about the under the hood pieces of it. So the first stage of 03:23:23.980 |
this process, and there are three stages is the pre training 03:23:26.300 |
stage, which fundamentally has to do with just knowledge 03:23:28.780 |
acquisition from the internet into the parameters of this 03:23:31.980 |
neural network. And so the neural net internalizes a lot of 03:23:35.820 |
knowledge from the internet. But where the personality really 03:23:38.780 |
comes in, is in the process of supervised fine tuning here. And 03:23:43.420 |
so what happens here is that basically a company like 03:23:47.020 |
OpenAI will curate a large data set of conversations, like say 03:23:50.780 |
1 million conversations across very diverse topics. And there 03:23:54.860 |
will be conversations between a human and an assistant. And 03:23:58.220 |
even though there's a lot of synthetic data generation used 03:24:00.460 |
throughout this entire process, and a lot of LLM help, and so 03:24:03.660 |
on. Fundamentally, this is a human data curation task with 03:24:07.420 |
lots of humans involved. And in particular, these humans are 03:24:10.460 |
data labelers hired by OpenAI, who are given labeling 03:24:13.500 |
instructions that they learn, and their task is to create 03:24:16.780 |
ideal assistant responses for any arbitrary prompts. So they 03:24:21.020 |
are teaching the neural network, by example, how to respond to 03:24:25.740 |
prompts. So what is the way to think about what came back here? 03:24:31.500 |
Like, what is this? Well, I think the right way to think 03:24:34.380 |
about it is that this is the neural network simulation of a 03:24:39.020 |
data labeler at OpenAI. So it's as if I gave this query to a 03:24:44.220 |
data labeler at OpenAI. And this data labeler first reads 03:24:47.580 |
all the labeling instructions from OpenAI, and then spends two 03:24:51.100 |
hours writing up the ideal assistant response to this 03:24:54.700 |
query and giving it to me. Now, we're not actually doing that, 03:24:59.900 |
right? Because we didn't wait two hours. So what we're getting 03:25:02.060 |
here is a neural network simulation of that process. And 03:25:05.900 |
we have to keep in mind that these neural networks don't 03:25:09.180 |
function like human brains do. They are different. What's easy 03:25:12.460 |
or hard for them is different from what's easy or hard for 03:25:15.100 |
humans. And so we really are just getting a simulation. So 03:25:18.620 |
here I've shown you, this is a token stream, and this is 03:25:22.140 |
fundamentally the neural network with a bunch of 03:25:24.380 |
activations and neurons in between. This is a fixed 03:25:26.780 |
mathematical expression that mixes inputs from tokens with 03:25:31.660 |
parameters of the model, and they get mixed up and get you 03:25:35.500 |
the next token in a sequence. But this is a finite amount of 03:25:38.220 |
compute that happens for every single token. And so this is 03:25:41.420 |
some kind of a lossy simulation of a human that is kind of like 03:25:45.980 |
restricted in this way. And so whatever the humans write, the 03:25:51.100 |
language model is kind of imitating on this token level 03:25:54.140 |
with only this specific computation for every single 03:25:58.220 |
token in a sequence. We also saw that as a result of this, and 03:26:03.660 |
the cognitive differences, the models will suffer in a variety 03:26:07.020 |
of ways, and you have to be very careful with their use. So for 03:26:10.620 |
example, we saw that they will suffer from hallucinations, and 03:26:13.660 |
we also have the sense of a Swiss cheese model of LLM 03:26:16.940 |
capabilities, where basically there's like holes in the 03:26:20.060 |
cheese, sometimes the models will just arbitrarily do 03:26:23.820 |
something dumb. So even though they're doing lots of magical 03:26:26.780 |
stuff, sometimes they just can't. So maybe you're not 03:26:29.500 |
giving them enough tokens to think, and maybe they're going 03:26:32.060 |
to just make stuff up because their mental arithmetic breaks. 03:26:34.380 |
Maybe they are suddenly unable to count the number of letters, or 03:26:39.180 |
maybe they're unable to tell you that 9.11 is smaller than 9.9, 03:26:44.300 |
and it looks kind of dumb. And so it's a Swiss cheese 03:26:46.940 |
capability, and we have to be careful with that. And we saw 03:26:49.260 |
the reasons for that. But fundamentally, this is how we 03:26:53.420 |
think of what came back. It's, again, a simulation of this 03:26:57.500 |
neural network of a human data labeler following the labeling 03:27:03.260 |
instructions at OpenAI. So that's what we're getting back. 03:27:06.700 |
Now, I do think that things change a little bit when you 03:27:11.980 |
actually go and reach for one of the thinking models, like 03:27:15.180 |
o3-mini-high. And the reason for that is that GPT-4o basically 03:27:21.900 |
doesn't do reinforcement learning. It does do RLHF, but 03:27:25.740 |
I've told you that RLHF is not RL. There's no time for magic 03:27:30.380 |
in there. It's just a little bit of a fine-tuning is the way to 03:27:33.340 |
look at it. But these thinking models, they do use RL. So they 03:27:37.900 |
go through this third stage of perfecting their thinking 03:27:42.140 |
process and discovering new thinking strategies and 03:27:45.580 |
solutions to problem-solving that look a little bit like 03:27:50.300 |
your internal monologue in your head. And they practice that on 03:27:53.100 |
a large collection of practice problems that companies like 03:27:56.300 |
OpenAI create and curate and then make available to the LLMs. 03:28:00.620 |
So when I come here and I talk to a thinking model, and I put 03:28:03.820 |
in this question, what we're seeing here is not anymore just 03:28:08.300 |
a straightforward simulation of a human data labeler. Like this 03:28:11.420 |
is actually kind of new, unique, and interesting. And of course, 03:28:15.500 |
OpenAI is not showing us the under-the-hood thinking and the 03:28:19.020 |
chains of thought that are underlying the reasoning here. 03:28:22.460 |
But we know that such a thing exists, and this is a summary of 03:28:25.020 |
it. And what we're getting here is actually not just an 03:28:27.660 |
imitation of a human data labeler. It's actually something 03:28:30.060 |
that is kind of new and interesting and exciting in the 03:28:31.980 |
sense that it is a function of thinking that was emergent in a 03:28:36.220 |
simulation. It's not just imitating a human data labeler. 03:28:39.100 |
It comes from this reinforcement learning process. And so here 03:28:42.620 |
we're, of course, not giving it a chance to shine because this 03:28:45.020 |
is not a mathematical or reasoning problem. This is just 03:28:47.660 |
some kind of a sort of creative writing problem, roughly 03:28:50.220 |
speaking. And I think it's a question, an open question, as 03:28:56.380 |
to whether the thinking strategies that are developed 03:28:59.900 |
inside verifiable domains transfer and are generalizable 03:29:04.380 |
to other domains that are unverifiable, such as creative 03:29:07.980 |
writing. The extent to which that transfer happens is 03:29:11.020 |
unknown in the field, I would say. So we're not sure if we are 03:29:14.220 |
able to do RL on everything that is verifiable and see the 03:29:16.940 |
benefits of that on things that are unverifiable, like this 03:29:20.060 |
prompt. So that's an open question. The other thing 03:29:22.700 |
that's interesting is that this reinforcement learning here is 03:29:26.060 |
still way too new, primordial, and nascent. So we're just 03:29:30.700 |
seeing the beginnings of the hints of greatness in the 03:29:34.220 |
reasoning problems. We're seeing something that is, in 03:29:37.020 |
principle, capable of something like the equivalent of move 37, 03:29:40.540 |
but not in the game of Go, but in open domain thinking and 03:29:45.100 |
problem solving. In principle, this paradigm is capable of 03:29:48.700 |
doing something really cool, new, and exciting, something 03:29:51.100 |
even that no human has thought of before. In principle, these 03:29:54.700 |
models are capable of analogies no human has had. So I think 03:29:58.060 |
it's incredibly exciting that these models exist. But again, 03:30:00.700 |
it's very early, and these are primordial models for now. And 03:30:04.780 |
they will mostly shine in domains that are verifiable, 03:30:07.340 |
like math, and code, etc. So very interesting to play with 03:30:11.260 |
and think about and use. And then that's roughly it. I would 03:30:16.620 |
say those are the broad strokes of what's available right now. 03:30:19.740 |
I will say that overall, it is an extremely exciting time to 03:30:23.180 |
be in the field. Personally, I use these models all the time 03:30:26.780 |
daily, tens or hundreds of times because they dramatically 03:30:29.980 |
accelerate my work. I think a lot of people see the same 03:30:32.220 |
thing. I think we're going to see a huge amount of wealth 03:30:34.460 |
creation as a result of these models. Be aware of some of 03:30:38.060 |
their shortcomings. Even with RL models, they're going to 03:30:41.820 |
suffer from some of these. Use it as a tool in a toolbox. 03:30:45.500 |
Don't trust it fully, because they will randomly do dumb 03:30:48.780 |
things. They will randomly hallucinate. They will randomly 03:30:51.740 |
skip over some mental arithmetic and not get it right. 03:30:53.820 |
They randomly can't count or something like that. So use 03:30:57.420 |
them as tools in the toolbox, check their work, and own the 03:31:00.060 |
product of your work. But use them for inspiration, for 03:31:03.340 |
first draft, ask them questions, but always check and verify, 03:31:08.140 |
and you will be very successful in your work if you do so. 03:31:10.860 |
So I hope this video was useful and interesting to you. I hope 03:31:14.700 |
you had fun. And it's already, like, very long, so I apologize 03:31:18.380 |
for that. But I hope it was useful. And yeah, I will see