Deep Dive into LLMs like ChatGPT


Chapters

0:00 introduction
1:00 pretraining data (internet)
7:47 tokenization
14:27 neural network I/O
20:11 neural network internals
26:01 inference
31:09 GPT-2: training and inference
42:52 Llama 3.1 base model inference
59:23 pretraining to post-training
61:06 post-training data (conversations)
80:32 hallucinations, tool use, knowledge/working memory
101:46 knowledge of self
106:56 models need tokens to think
121:11 tokenization revisited: models struggle with spelling
124:53 jagged intelligence
127:28 supervised finetuning to reinforcement learning
134:42 reinforcement learning
147:47 DeepSeek-R1
162:07 AlphaGo
168:26 reinforcement learning from human feedback (RLHF)
189:39 preview of things to come
195:15 keeping track of LLMs
198:34 where to find LLMs
201:46 grand summary

Whisper Transcript

00:00:00.000 | Hi everyone, so I've wanted to make this video for a while. It is a comprehensive
00:00:05.000 | but general audience introduction to large language models like ChatGPT and
00:00:10.320 | what I'm hoping to achieve in this video is to give you kind of mental models for
00:00:14.640 | thinking through what it is that this tool is. It's obviously magical and
00:00:19.680 | amazing in some respects. It's really good at some things, not very good at
00:00:23.880 | other things, and there's also a lot of sharp edges to be aware of. So what is
00:00:27.960 | behind this text box? You can put anything in there and press enter, but
00:00:31.800 | what should we be putting there and what are these words generated back? How does
00:00:36.720 | this work and what are you talking to exactly? So I'm hoping to get at all
00:00:40.400 | those topics in this video. We're gonna go through the entire pipeline of how
00:00:44.000 | this stuff is built, but I'm going to keep everything sort of accessible to a
00:00:48.280 | general audience. So let's take a look at first how you build something like
00:00:51.840 | ChatGPT and along the way I'm gonna talk about, you know, some of the sort of
00:00:56.800 | cognitive psychological implications of these tools. Okay so let's build ChatGPT.
00:01:02.160 | So there's going to be multiple stages arranged sequentially. The first stage is
00:01:06.840 | called the pre-training stage and the first step of the pre-training stage is
00:01:11.120 | to download and process the internet. Now to get a sense of what this roughly
00:01:14.600 | looks like, I recommend looking at this URL here. So this company called Hugging
00:01:20.800 | Face collected and created and curated this dataset called FineWeb and they go
00:01:27.760 | into a lot of detail in this blog post on how they constructed the FineWeb
00:01:31.120 | dataset and all of the major LLM providers like OpenAI, Anthropic and
00:01:35.200 | Google and so on will have some equivalent internally of something like
00:01:39.320 | the FineWeb dataset. So roughly what are we trying to achieve here? We're trying
00:01:43.080 | to get a ton of text from the internet, from publicly available sources. So we're
00:01:47.640 | trying to have a huge quantity of very high quality documents and we also want
00:01:52.680 | very large diversity of documents because we want to have a lot of
00:01:55.800 | knowledge inside these models. So we want large diversity of high quality
00:02:00.340 | documents and we want many many of them. And achieving this is quite complicated
00:02:04.520 | and as you can see here it takes multiple stages to do well. So let's take
00:02:08.800 | a look at what some of these stages look like in a bit. For now I'd just like
00:02:12.240 | to note that, for example, the FineWeb dataset, which is fairly
00:02:15.000 | representative of what you would see in a production-grade application, actually
00:02:18.920 | ends up being only about 44 terabytes of disk space. You can get a USB stick for
00:02:24.320 | like a terabyte very easily or I think this could fit on a single hard drive
00:02:27.800 | almost today. So this is not a huge amount of data at the end of the day
00:02:31.960 | even though the internet is very very large we're working with text and we're
00:02:35.680 | also filtering it aggressively so we end up with about 44 terabytes in this
00:02:39.400 | example. So let's take a look at kind of what this data looks like and what some
00:02:44.960 | of these stages also are. So the starting point for a lot of these efforts and
00:02:48.960 | something that contributes most of the data by the end of it is data from
00:02:53.160 | Common Crawl. So Common Crawl is an organization that has been basically
00:02:57.400 | scouring the internet since 2007. So as of 2024 for example Common Crawl has
00:03:03.040 | indexed 2.7 billion web pages and they have all these crawlers going around the
00:03:09.080 | internet and what you end up doing basically is you start with a few seed
00:03:12.000 | web pages and then you follow all the links and you just keep following links
00:03:15.720 | and you keep indexing all the information and you end up with a ton of
00:03:18.000 | data of the internet over time. So this is usually the starting point for a lot
00:03:22.360 | of these efforts. Now this Common Crawl data is quite raw and is
00:03:27.480 | filtered in many many different ways. So here they document - this is the
00:03:33.240 | same diagram - they document a little bit the kind of processing that happens in
00:03:36.640 | these stages. So the first thing here is something called URL filtering. So what
00:03:43.000 | that is referring to is that there are these block lists of URLs or
00:03:49.840 | domains that you don't want to be getting data from. So usually this
00:03:54.280 | includes things like malware websites, spam websites, marketing websites, racist
00:03:59.320 | websites, adult sites and things like that. So there's a ton of different types
00:04:02.880 | of websites that are just eliminated at this stage because we don't want
00:04:06.600 | them in our data set. The second part is text extraction. You have to remember
00:04:11.160 | that all these web pages - this is the raw HTML of these web pages that are being
00:04:14.880 | saved by these crawlers. So when I go to inspect here, this is what the raw HTML
00:04:21.080 | actually looks like. You'll notice that it's got all this markup like lists and
00:04:26.280 | stuff like that and there's CSS and all this kind of stuff. So this is computer
00:04:31.280 | code almost for these web pages but what we really want is we just want this text
00:04:35.480 | right? We just want the text of this web page and we don't want the navigation
00:04:38.920 | and things like that. So there's a lot of filtering and processing and heuristics
00:04:42.720 | that go into adequately filtering for just the good content of these web pages.
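
To make the text extraction step a bit more concrete, here is a toy sketch using only Python's standard library. This is not what production pipelines actually do (they rely on dedicated extraction libraries with many more heuristics, as the FineWeb write-up discusses); it just illustrates the idea of keeping visible text and dropping markup:

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Toy extractor: keep text nodes, skip <script>/<style> contents."""
    def __init__(self):
        super().__init__()
        self.depth_skipped = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.depth_skipped += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.depth_skipped:
            self.depth_skipped -= 1

    def handle_data(self, data):
        if self.depth_skipped == 0 and data.strip():
            self.chunks.append(data.strip())

html = ("<html><body><nav>Home | About</nav>"
        "<p>Tornadoes swept through the area in 2012.</p>"
        "<script>track();</script></body></html>")
parser = VisibleText()
parser.feed(html)
print(" ".join(parser.chunks))
# -> "Home | About Tornadoes swept through the area in 2012."
# Note that the navigation text survives here; a real pipeline needs many more
# heuristics to drop boilerplate like that and keep only the good content.
```
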
00:04:48.400 | The next stage here is language filtering. So for example, FineWeb
00:04:53.160 | filters using a language classifier. They try to guess what language every single
00:04:58.880 | web page is in and then they only keep web pages that are classified as English
00:05:02.680 | with more than 65% confidence, as an example. And so you can get a sense that this is like a design
00:05:06.840 | decision that different companies can take for themselves. What fraction of
00:05:12.640 | all different types of languages are we going to include in our data set? Because
00:05:15.920 | for example, if we filter out all of the Spanish as an example, then you might
00:05:19.360 | imagine that our model later will not be very good at Spanish because it's just
00:05:22.480 | never seen that much data of that language. And so different companies can
00:05:26.440 | focus on multilingual performance to a different degree as an example. So
00:05:30.880 | FineWeb is quite focused on English and so their language model, if they end up
00:05:35.000 | training one later, will be very good at English but maybe not very good at other
00:05:38.560 | languages. After language filtering, there's a few other filtering steps and
00:05:43.400 | deduplication and things like that. Finishing with, for example, the PII
00:05:48.440 | removal. This is personally identifiable information. So as an example, addresses,
00:05:54.320 | social security numbers, and things like that. You would try to detect them and
00:05:57.600 | you would try to filter out those kinds of webpages from the data set as well. So
00:06:01.280 | there's a lot of stages here and I won't go into full detail but it is a fairly
00:06:05.440 | extensive part of the pre-processing and you end up with, for example, the FineWeb
00:06:09.440 | data set. So when you click in on it, you can see some examples here of what this
00:06:13.800 | actually ends up looking like and anyone can download this on the Hugging Face
00:06:17.920 | web page. And so here's some examples of the final text that ends up in the
00:06:21.920 | training set. So this is some article about tornadoes in 2012. So there's some
00:06:29.400 | tornadoes in 2012 and what happened. This next one is something about...
00:06:36.480 | "Did you know you have two little yellow 9-volt battery-sized adrenal glands in
00:06:41.120 | your body?" Okay, so this is some kind of odd medical article. So just think of
00:06:48.600 | these as basically web pages on the Internet filtered just for the text in
00:06:53.000 | various ways. And now we have a ton of text, 44 terabytes of it, and that now is
00:06:58.600 | the starting point for the next step of this stage. Now I wanted to give you an
00:07:02.640 | intuitive sense of where we are right now. So I took the first 200 web pages
00:07:06.520 | here, and remember we have tons of them, and I just take all that text and I just
00:07:11.040 | put it all together, concatenate it. And so this is what we end up with. We just
00:07:14.960 | get this just raw text, raw internet text, and there's a ton of it even in these
00:07:21.120 | 200 web pages. So I can continue zooming out here, and we just have this like
00:07:24.960 | massive tapestry of text data. And this text data has all these patterns, and
00:07:30.080 | what we want to do now is we want to start training neural networks on this
00:07:33.320 | data so the neural networks can internalize and model how this text
00:07:39.120 | flows, right? So we just have this giant tapestry of text, and now we want to get
00:07:44.920 | neural nets that mimic it. Okay, now before we plug text into neural networks,
00:07:50.920 | we have to decide how we're going to represent this text, and how we're going
00:07:54.720 | to feed it in. Now the way our technology works for these neural nets is that
00:07:58.880 | they expect a one-dimensional sequence of symbols, and they want a finite set of
00:08:04.960 | symbols that are possible. And so we have to decide what are the symbols, and then
00:08:10.080 | we have to represent our data as a one-dimensional sequence of those
00:08:13.360 | symbols. So right now what we have is a one-dimensional sequence of text. It
00:08:18.480 | starts here, and it goes here, and then it comes here, etc. So this is a
00:08:22.200 | one-dimensional sequence, even though on my monitor of course it's laid out in a
00:08:25.960 | two-dimensional way, but it goes from left to right and top to bottom, right? So
00:08:29.760 | it's a one-dimensional sequence of text. Now this being computers, of course,
00:08:33.560 | there's an underlying representation here. So if I do what's called UTF-8
00:08:37.720 | encode this text, then I can get the raw bits that correspond to this text in the
00:08:44.080 | computer. And that looks like this. So it turns out that, for
00:08:50.260 | example, this very first bar here is the first eight bits as an example. So what
00:08:57.320 | is this thing, right? This is a representation that we are looking for,
00:09:02.040 | in a certain sense. We have exactly two possible symbols, 0 and 1, and we
00:09:07.880 | have a very long sequence of it, right? Now as it turns out, this sequence length
00:09:14.800 | is actually going to be a very finite and precious resource in our neural
00:09:19.640 | network, and we actually don't want extremely long sequences of just two
00:09:23.320 | symbols. Instead what we want is we want to trade off this symbol size of this
00:09:31.320 | vocabulary, as we call it, and the resulting sequence length. So we don't
00:09:35.440 | want just two symbols and extremely long sequences. We're going to want more
00:09:39.320 | symbols and shorter sequences. Okay, so one naive way of compressing or
00:09:44.680 | decreasing the length of our sequence here is to basically consider some group
00:09:49.800 | of consecutive bits, for example 8 bits, and group them into a single what's
00:09:55.160 | called byte. So because these bits are either on or off, if we take a group of
00:10:00.320 | eight of them, there turns out to be only 256 possible combinations of how these
00:10:04.520 | bits could be on or off. And so therefore we can re-represent the sequence into a
00:10:09.080 | sequence of bytes instead. So this sequence of bytes will be 8 times
00:10:14.200 | shorter, but now we have 256 possible symbols. So every number here goes from
00:10:19.680 | 0 to 255. Now I really encourage you to think of these not as numbers, but as
00:10:24.360 | unique IDs, or like unique symbols. So maybe it's better
00:10:29.960 | to actually think of these by replacing every one of them with a unique emoji.
00:10:33.200 | You'd get something like this. So we basically have a sequence of emojis, and
00:10:38.840 | there's 256 possible emojis. You can think of it that way. Now it turns out
00:10:44.560 | that in production, for state-of-the-art language models, you actually want to go
00:10:48.040 | even beyond this. You want to continue to shrink the length of the sequence, because
00:10:52.720 | again it is a precious resource, in return for more symbols in your
00:10:57.120 | vocabulary. And the way this is done is by running what's called the byte
00:11:02.120 | pair encoding algorithm. And the way this works is we're basically looking for
00:11:06.000 | consecutive bytes, or symbols, that are very common. So for example, it turns out
00:11:13.760 | that the sequence 116 followed by 32 is quite common and occurs very frequently.
00:11:18.360 | So what we're going to do is we're going to group this pair into a new symbol. So
00:11:25.520 | we're going to mint a symbol with an ID 256, and we're going to rewrite every
00:11:29.640 | single pair, 116, 32, with this new symbol. And then we can iterate this
00:11:34.920 | algorithm as many times as we wish. And each time when we mint a new symbol,
00:11:39.000 | we're decreasing the length and we're increasing the symbol size. And in
00:11:43.440 | practice, it turns out that a pretty good setting of, basically, the vocabulary
00:11:48.560 | size turns out to be about 100,000 possible symbols. So in particular, GPT-4
00:11:53.400 | uses 100,277 symbols. And this process of converting from raw text into these
00:12:05.760 | symbols, or as we call them, tokens, is the process called tokenization. So let's
00:12:11.720 | now take a look at how GPT-4 performs tokenization, converting from text to
00:12:16.480 | tokens, and from tokens back to text, and what this actually looks like. So one
00:12:20.700 | website I like to use to explore these token representations is called
00:12:25.160 | Tiktokenizer. And so come here to the drop-down and select cl100k_base, which
00:12:30.240 | is the GPT-4 base model tokenizer. And here on the left, you can put in text, and
00:12:35.040 | it shows you the tokenization of that text. So for example, "hello world".
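
If you want to reproduce this outside the website, here is a minimal sketch assuming OpenAI's tiktoken package (pip install tiktoken), which provides the same cl100k_base encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4 base model tokenizer

print(enc.encode("hello world"))   # [15339, 1917]
print(enc.decode([15339]))         # "hello"
print(enc.decode([1917]))          # " world"  (note the leading space)
print(enc.encode("Hello world"))   # different IDs: tokenization is case sensitive
```
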
00:12:43.880 | So "hello world" turns out to be exactly two tokens. The token "hello", which is the
00:12:49.720 | token with ID 15339, and the token "space world", that is the token 1917. So "hello
00:13:00.440 | space world". Now if I was to join these two, for example, I'm gonna get again two
00:13:05.680 | tokens, but it's the token "h" followed by the token "elloworld", that is, the rest without the "h". If I
00:13:13.600 | put in two spaces here between "hello" and "world", it's again a
00:13:16.640 | different tokenization. There's a new token "220" here. Okay, so you can play
00:13:23.760 | with this and see what happens here. Also keep in mind this is case
00:13:28.100 | sensitive, so if this is a capital "H", it is something else. Or if it's "hello world",
00:13:33.840 | then actually this ends up being three tokens, instead of just two tokens.
00:13:39.600 | Yeah, so you can play with this and get a sort of like an intuitive sense of what
00:13:46.000 | these tokens work like. We're actually going to loop around to tokenization a
00:13:49.040 | bit later in the video. For now I just wanted to show you the website, and I
00:13:52.140 | wanted to show you that this text basically, at the end of the day, so for
00:13:56.440 | example if I take one line here, this is what GPT-4 will see it as. So this text
00:14:01.680 | will be a sequence of length 62. This is the sequence here, and this is how the
00:14:08.920 | chunks of text correspond to these symbols. And again there's
00:14:14.640 | 100,277 possible symbols, and we now have one-dimensional sequences of
00:14:20.640 | those symbols. So yeah, we're gonna come back to tokenization, but that's for now
00:14:26.480 | where we are. Okay, so what I've done now is I've taken this sequence of text that
00:14:30.840 | we have here in the dataset, and I have re-represented it using our tokenizer
00:14:34.480 | into a sequence of tokens. And this is what that looks like now. So for example
00:14:40.560 | when we go back to the FineWeb dataset, they mentioned that not only is this
00:14:43.840 | 44 terabytes of disk space, but this is about a 15 trillion token sequence in
00:14:50.640 | this dataset. And so here, these are just some of the first one or two or three or
00:14:56.760 | a few thousand here, I think, tokens of this dataset, but there's 15 trillion
00:15:01.520 | here to keep in mind. And again, keep in mind one more time that all of these
00:15:06.160 | represent little text chunks. They're all just like atoms of these sequences, and
00:15:10.680 | the numbers here don't make any sense. They're just unique IDs.
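
If you're curious how token IDs beyond the 256 raw byte values get minted in the first place, here is a toy sketch of a single byte pair encoding merge along the lines described earlier; this is plain Python, not the production tokenizer, and a real tokenizer simply repeats this step until the vocabulary reaches roughly 100,000 symbols:

```python
from collections import Counter

text = "to be or not to be, that is the question"
ids = list(text.encode("utf-8"))             # UTF-8 bytes: a sequence of values 0..255

# Find the most frequent adjacent pair of symbols.
pair = Counter(zip(ids, ids[1:])).most_common(1)[0][0]
print("most common pair:", pair)             # here (32, 116), i.e. " " followed by "t"

# Mint a brand new symbol (ID 256) and rewrite every occurrence of that pair.
new_id, merged, i = 256, [], 0
while i < len(ids):
    if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
        merged.append(new_id)
        i += 2
    else:
        merged.append(ids[i])
        i += 1

print(len(ids), "->", len(merged), "tokens")  # the sequence gets shorter, the vocabulary bigger
```
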
00:15:14.560 | Okay, so now we get to the fun part, which is the neural network training. And this
00:15:21.400 | is where a lot of the heavy lifting happens computationally when you're
00:15:23.960 | training these neural networks. So what we do here in this step is we want
00:15:30.320 | to model the statistical relationships of how these tokens follow each other in
00:15:33.440 | the sequence. So what we do is we come into the data, and we take windows of
00:15:38.760 | tokens. So we take a window of tokens from this data fairly randomly, and the
00:15:47.320 | window's length can range anywhere between zero tokens, actually,
00:15:52.320 | all the way up to some maximum size that we decide on. So for example, in practice
00:15:57.960 | you could see token windows of, say, 8,000 tokens. Now, in principle, we can
00:16:02.680 | use arbitrary window lengths of tokens, but processing very long, basically,
00:16:10.460 | window sequences would just be very computationally expensive. So we just
00:16:15.000 | kind of decide that, say, 8,000 is a good number, or 4,000, or 16,000, and we crop
00:16:19.280 | it there. Now, in this example, I'm going to be taking the first four tokens just
00:16:25.560 | so everything fits nicely. So these tokens, we're going to take a window of
00:16:31.920 | four tokens, this bar, view, ing, and space single, which are these token IDs. And now
00:16:40.320 | what we're trying to do here is we're trying to basically predict the token
00:16:42.960 | that comes next in the sequence. So 3962 comes next, right? So what we do now here
00:16:49.400 | is that we call this the context. These four tokens are context, and they feed
00:16:54.640 | into a neural network. And this is the input to the neural network. Now, I'm
00:17:00.400 | going to go into the detail of what's inside this neural network in a little
00:17:03.400 | bit. For now, what's important to understand is the input and the output
00:17:06.000 | of the neural net. So the input are sequences of tokens of variable length,
00:17:11.680 | anywhere between 0 and some maximum size, like 8,000. The output now is a
00:17:17.360 | prediction for what comes next. So because our vocabulary has 100,277
00:17:24.320 | possible tokens, the neural network is going to output exactly that many
00:17:28.440 | numbers. And all of those numbers correspond to the probability of that
00:17:32.400 | token coming next in the sequence. So it's making guesses about what comes
00:17:36.640 | next. In the beginning, this neural network is randomly initialized. So we're
00:17:42.840 | going to see in a little bit what that means. But it's a random
00:17:46.800 | transformation. So these probabilities in the very beginning of the training are
00:17:50.280 | also going to be kind of random. So here I have three examples, but keep in mind
00:17:54.680 | that there's 100,000 numbers here. So the probability of this token, space
00:17:59.520 | direction, the neural network is saying that this is 4% likely right now. 11,799
00:18:04.520 | is 2%. And then here, the probability of 3962, which is post, is 3%. Now, of
00:18:10.960 | course, we've sampled this window from our data set. So we know what comes next.
00:18:14.680 | We know, and that's the label, we know that the correct answer is that 3962
00:18:19.520 | actually comes next in the sequence. So now what we have is this mathematical
00:18:24.360 | process for doing an update to the neural network. We have a way of tuning
00:18:29.040 | it. And we're going to go into a little bit of detail in a bit. But basically, we
00:18:34.440 | know that this probability here of 3%, we want this probability to be higher, and
00:18:39.640 | we want the probabilities of all the other tokens to be lower. And so we have
00:18:45.480 | a way of mathematically calculating how to adjust and update the neural network
00:18:50.400 | so that the correct answer has a slightly higher probability. So if I do an update
00:18:55.520 | to the neural network now, the next time I feed this particular sequence of four
00:19:00.520 | tokens into the neural network, the neural network will be slightly adjusted now and
00:19:04.000 | it will say, okay, post is maybe 4%, and case now maybe is 1%. And direction
00:19:11.000 | could become 2% or something like that. And so we have a way of nudging, of
00:19:14.880 | slightly updating the neural net to basically give a higher probability to
00:19:20.240 | the correct token that comes next in the sequence. And now we just have to
00:19:23.360 | remember that this process happens not just for this token here, where these
00:19:30.240 | four fed in and predicted this one. This process happens at the same time for all
00:19:35.440 | of these tokens in the entire data set. And so in practice, we sample little
00:19:39.320 | windows, little batches of windows, and then at every single one of these tokens,
00:19:43.600 | we want to adjust our neural network so that the probability of that token
00:19:47.600 | becomes slightly higher. And this all happens in parallel in large batches of
00:19:51.880 | these tokens. And this is the process of training the neural network. It's a
00:19:55.760 | sequence of updating it so that its predictions match up with the statistics of
00:20:01.200 | what actually happens in your training set. And its probabilities become
00:20:05.080 | consistent with the statistical patterns of how these tokens follow each other in
00:20:10.040 | the data. So let's now briefly get into the internals of these neural networks
00:20:13.640 | just to give you a sense of what's inside. So neural network internals. So
00:20:18.600 | as I mentioned, we have these inputs that are sequences of tokens. In this case,
00:20:23.680 | this is four input tokens, but this can be anywhere between zero up to, let's say,
00:20:28.600 | a thousand tokens. In principle, this can be an infinite number of tokens. It
00:20:32.680 | would just be too computationally expensive to process an infinite number
00:20:36.560 | of tokens. So we just crop it at a certain length, and that becomes the
00:20:39.720 | maximum context length of that model. Now these inputs X are mixed up in a giant
00:20:46.680 | mathematical expression together with the parameters or the weights of these
00:20:51.960 | neural networks. So here I'm showing six example parameters and their setting. But
00:20:57.800 | in practice, these modern neural networks will have billions of these parameters.
00:21:04.000 | And in the beginning, these parameters are completely randomly set. Now with a
00:21:09.000 | random setting of parameters, you might expect that this neural network
00:21:13.600 | would make random predictions, and it does. In the beginning, it's totally
00:21:16.800 | random predictions. But it's through this process of iteratively updating the
00:21:22.040 | network, and we call that process training a neural network, so that the
00:21:28.160 | setting of these parameters gets adjusted such that the outputs of our
00:21:31.720 | neural network becomes consistent with the patterns seen in our training set. So
00:21:37.480 | think of these parameters as kind of like knobs on a DJ set, and as you're
00:21:41.280 | twiddling these knobs, you're getting different predictions for every possible
00:21:45.800 | token sequence input. And training a neural network just means discovering a
00:21:51.120 | setting of parameters that seems to be consistent with the statistics of the
00:21:55.360 | training set. Now let me just give you an example of what this giant mathematical
00:21:59.560 | expression looks like, just to give you a sense. And modern networks are massive
00:22:03.320 | expressions with trillions of terms probably. But let me just show you a
00:22:06.600 | simple example here. It would look something like this. I mean, these are
00:22:10.240 | the kinds of expressions, just to show you that it's not very scary. We have
00:22:13.840 | inputs x, like x1, x2, in this case two example inputs, and they get mixed up
00:22:19.600 | with the weights of the network, w0, w1, w2, w3, etc. And this mixing is simple
00:22:26.680 | things like multiplication, addition, exponentiation, division, etc. And it is
00:22:32.760 | the subject of neural network architecture research to design effective
00:22:36.960 | mathematical expressions that have a lot of kind of convenient characteristics.
00:22:41.880 | They are expressive, they're optimizable, they're parallelizable, etc. But at
00:22:47.760 | the end of the day, these are not complex expressions, and basically
00:22:51.200 | they mix up the inputs with the parameters to make predictions, and we're
00:22:55.680 | optimizing the parameters of this neural network so that the predictions come out
00:23:00.680 | consistent with the training set. Now, I would like to show you an actual
00:23:05.440 | production-grade example of what these neural networks look like. So for that, I
00:23:09.320 | encourage you to go to this website that has a very nice visualization of one of
00:23:12.840 | these networks. So this is what you will find on this website, and this neural
00:23:19.040 | network here that is used in production settings has this special kind of
00:23:22.760 | structure. This network is called the transformer, and this particular one as
00:23:27.560 | an example has 85,000, roughly, parameters. Now, here on the top, we take the inputs,
00:23:33.760 | which are the token sequences, and then information flows through the neural
00:23:40.400 | network until the output, which here are the logits and the softmax over them, but these are the
00:23:45.660 | predictions for what comes next, what token comes next. And then here, there's a
00:23:51.960 | sequence of transformations, and all these intermediate values that get
00:23:55.960 | produced inside this mathematical expression as it is sort of predicting
00:23:59.480 | what comes next. So as an example, these tokens are embedded into kind of like
00:24:05.400 | this distributed representation, as it's called. So every possible token has kind
00:24:09.360 | of like a vector that represents it inside the neural network. So first, we
00:24:13.320 | embed the tokens, and then those values kind of like flow through this diagram,
00:24:19.600 | and these are all very simple mathematical expressions individually.
00:24:22.720 | So we have layer norms, and matrix multiplications, and softmaxes, and so
00:24:27.360 | on. So here's kind of like the attention block of this transformer, and then
00:24:31.840 | information kind of flows through into the multi-layer perceptron block, and so
00:24:35.760 | on. And all these numbers here, these are the intermediate values of the
00:24:39.960 | expression, and you can almost think of these as kind of like the firing rates
00:24:44.920 | of these synthetic neurons. But I would caution you to not kind of think of it
00:24:50.840 | too much like neurons, because these are extremely simple neurons compared to the
00:24:54.960 | neurons you would find in your brain. Your biological neurons are very complex
00:24:58.360 | dynamical processes that have memory, and so on. There's no memory in this
00:25:02.160 | expression. It's a fixed mathematical expression from input to output with no
00:25:05.840 | memory. It's just stateless. So these are very simple neurons in comparison to
00:25:10.280 | biological neurons, but you can still kind of loosely think of this as like a
00:25:13.560 | synthetic piece of brain tissue, if you like to think about it that way.
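
To make the "giant mathematical expression" idea concrete, here is a toy sketch in plain Python: a handful of randomly initialized parameters, some simple mixing of inputs and weights, and a softmax at the end to turn the outputs into probabilities. A real transformer is this same idea with far more structure and billions of parameters; none of the numbers here are meaningful.

```python
import math, random

# A toy "neural network": four input token IDs go in, probabilities over a
# tiny 8-token vocabulary come out.
VOCAB, CONTEXT, HIDDEN = 8, 4, 16
random.seed(0)

# Randomly initialized parameters -- the "knobs on the DJ set".
W1 = [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(CONTEXT)]
W2 = [[random.uniform(-0.1, 0.1) for _ in range(VOCAB)] for _ in range(HIDDEN)]

def forward(tokens):
    # Mix the inputs with the parameters: multiplications, additions, a simple
    # nonlinearity. (A real network would first embed each token ID into a vector.)
    h = [sum(tokens[i] * W1[i][j] for i in range(CONTEXT)) for j in range(HIDDEN)]
    h = [max(0.0, v) for v in h]                                   # ReLU
    logits = [sum(h[i] * W2[i][j] for i in range(HIDDEN)) for j in range(VOCAB)]
    exps = [math.exp(v) for v in logits]                           # softmax ->
    return [e / sum(exps) for e in exps]                           # probabilities

probs = forward([5, 2, 7, 1])        # four context tokens in
print(probs, sum(probs))             # eight probabilities out, summing to 1
```
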
00:25:18.320 | So information flows through, all these neurons fire, until we get to the
00:25:24.480 | predictions. Now I'm not actually going to dwell too much on the precise kind of
00:25:29.040 | like mathematical details of all these transformations. Honestly, I don't think
00:25:32.080 | it's that important to get into. What's really important to understand is that
00:25:35.200 | this is a mathematical function. It is parameterized by some fixed set of
00:25:41.480 | parameters, let's say 85,000 of them, and it is a way of transforming inputs into
00:25:45.960 | outputs. And as we twiddle the parameters we are getting different kinds of
00:25:50.440 | predictions, and then we need to find a good setting of these parameters so that
00:25:54.320 | the predictions sort of match up with the patterns seen in training set. So
00:25:59.240 | that's the transformer. Okay, so I've shown you the internals of the neural
00:26:03.440 | network, and we talked a bit about the process of training it. I want to cover
00:26:07.160 | one more major stage of working with these networks, and that is the stage
00:26:11.800 | called inference. So in inference what we're doing is we're generating new data
00:26:15.880 | from the model, and so we want to basically see what kind of patterns it
00:26:20.960 | has internalized in the parameters of its network. So to generate from the
00:26:26.160 | model is relatively straightforward. We start with some tokens that are
00:26:30.800 | basically your prefix, like what you want to start with. So say we want to start
00:26:34.440 | with the token 91. Well, we feed it into the network, and remember that network
00:26:39.880 | gives us probabilities, right? It gives us this probability vector here. So what we
00:26:45.320 | can do now is we can basically flip a biased coin. So we can sample basically a
00:26:52.800 | token based on this probability distribution. So the tokens that are
00:26:57.440 | given high probability by the model are more likely to be sampled when you flip
00:27:01.920 | this biased coin. You can think of it that way. So we sample from the
00:27:05.840 | distribution to get a single unique token. So for example, token 860 comes
00:27:10.600 | next. So 860 in this case, when we're generating from the model, could come next.
00:27:15.760 | Now 860 is a relatively likely token. It might not be the only possible token in
00:27:20.480 | this case. There could be many other tokens that could have been sampled, but
00:27:23.520 | we could see that 860 is a relatively likely token as an example, and indeed in
00:27:27.560 | our training example here, 860 does follow 91. So let's now say that we
00:27:33.640 | continue the process. So after 91 there's 860. We append it, and we again ask what
00:27:39.440 | is the third token. Let's sample, and let's just say that it's 287 exactly as
00:27:44.120 | here. Let's do that again. We come back in. Now we have a sequence of three, and we
00:27:49.960 | ask what is the likely fourth token, and we sample from that and get this one. And
00:27:54.920 | now let's say we do it one more time. We take those four, we sample, and we get
00:28:00.000 | this one. And this 13659, this is not actually 3962 as we had before. So this
00:28:08.480 | token is the token article instead, so viewing a single article. And so in this
00:28:14.440 | case we didn't exactly reproduce the sequence that we saw here in the
00:28:18.560 | training data. So keep in mind that these systems are stochastic. We're
00:28:23.960 | sampling, and we're flipping coins, and sometimes we luck out and we reproduce
00:28:29.360 | some like small chunk of the text in a training set, but sometimes we're
00:28:33.920 | getting a token that was not verbatim part of any of the documents in the
00:28:38.880 | training data. So we're going to get sort of like remixes of the data that we saw
00:28:43.840 | in the training, because at every step of the way we can flip and get a slightly
00:28:47.480 | different token, and then once that token makes it in, if you sample the next one
00:28:51.400 | and so on, you very quickly start to generate token streams that are very
00:28:56.240 | different from the token streams that occur in the training documents. So
00:29:00.520 | statistically they will have similar properties, but they are not identical
00:29:05.200 | to training data. They're kind of like inspired by the training data. And so in
00:29:09.280 | this case we got a slightly different sequence. And why would we get article?
00:29:13.160 | You might imagine that article is a relatively likely token in the context
00:29:17.240 | of bar, viewing, single, etc. And you could imagine that the word article followed
00:29:22.640 | this context window somewhere in the training documents to some extent, and we
00:29:27.840 | just happen to sample it here at that stage. So basically inference is just
00:29:32.080 | predicting from these distributions one at a time, we continue feeding back
00:29:36.400 | tokens and getting the next one, and we're always flipping these coins, and
00:29:41.000 | depending on how lucky or unlucky we get, we might get very different kinds of
00:29:46.440 | patterns depending on how we sample from these probability distributions. So
00:29:51.440 | that's inference. So in most common scenarios, basically downloading the
00:29:56.580 | internet and tokenizing it is a pre-processing step. You do that a
00:29:59.640 | single time. And then once you have your token sequence, we can start training
00:30:04.840 | networks. And in practical cases you would try to train many different
00:30:09.080 | networks of different kinds of settings, and different kinds of arrangements, and
00:30:12.840 | different kinds of sizes. And so you'd be doing a lot of neural network training,
00:30:16.280 | and then once you have a neural network and you train it, and you have some
00:30:20.680 | specific set of parameters that you're happy with, then you can take the model
00:30:25.260 | and you can do inference, and you can actually generate data from the model.
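
The "flip a biased coin" loop of inference can be sketched in a few lines of plain Python. The model below is a made-up stand-in that just returns some fixed distribution per context; the point is the loop itself: sample a token, append it, feed it back in.

```python
import random

VOCAB = 8

def fake_forward(context):
    """Stand-in for a trained network: returns made-up next-token probabilities
    (a real model would compute these from its parameters)."""
    rng = random.Random(hash(tuple(context)))      # fixed distribution per context
    weights = [rng.random() for _ in range(VOCAB)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prefix, n_tokens, max_context=4):
    tokens = list(prefix)
    for _ in range(n_tokens):
        probs = fake_forward(tokens[-max_context:])   # probabilities for what comes next
        # "Flip the biased coin": likelier tokens get sampled more often.
        tokens.append(random.choices(range(VOCAB), weights=probs)[0])
    return tokens

print(generate(prefix=[5], n_tokens=8))    # a different "remix" on every run
```
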
00:30:29.280 | And when you're on ChatGPT and you're talking with a model, that model was
00:30:33.280 | trained by OpenAI, probably many months ago, and they
00:30:38.360 | have a specific set of weights that work well, and when you're talking to the
00:30:42.300 | model, all of that is just inference. There's no more training. Those
00:30:45.920 | parameters are held fixed, and you're just talking to the model, sort of. You're
00:30:51.060 | giving it some of the tokens, and it's kind of completing token sequences, and
00:30:54.720 | that's what you're seeing generated when you actually use the model on ChatGPT.
00:30:58.480 | So that model then just does inference alone. So let's now look at an example of
00:31:03.600 | training and inference that is kind of concrete, and gives you a sense of what
00:31:06.840 | this actually looks like when these models are trained. Now the example that
00:31:10.680 | I would like to work with, and that I am particularly fond of, is that of OpenAI's
00:31:14.400 | GPT-2. So GPT stands for Generative Pre-trained Transformer, and this is the
00:31:19.900 | second iteration of the GPT series by OpenAI. When you are talking to ChatGPT
00:31:24.760 | today, the model that is underlying all of the magic of that interaction is GPT-4,
00:31:29.480 | so the fourth iteration of that series. Now GPT-2 was published in 2019 by
00:31:34.880 | OpenAI in this paper that I have right here, and the reason I like GPT-2 is that
00:31:40.320 | it is the first time that a recognizably modern stack came together. So all of the
00:31:47.320 | pieces of GPT-2 are recognizable today by modern standards, it's just everything
00:31:51.800 | has gotten bigger. Now I'm not going to be able to go into the full details of
00:31:55.560 | this paper, of course, because it is a technical publication, but some of the
00:31:59.440 | details that I would like to highlight are as follows. GPT-2 was a transformer
00:32:03.600 | neural network, just like the neural networks you would work
00:32:06.840 | with today. It had 1.6 billion parameters, right? So these are the
00:32:12.120 | parameters that we looked at here. It would have 1.6 billion of them. Today,
00:32:16.800 | modern transformers would have a lot closer to a trillion or several hundred
00:32:20.360 | billion, probably. The maximum context length here was 1024 tokens, so it is
00:32:28.120 | when we are sampling chunks of windows of tokens from the data set, we're never
00:32:34.080 | taking more than 1024 tokens, and so when you are trying to predict the next
00:32:37.680 | token in a sequence, you will never have more than 1024 tokens kind of in your
00:32:42.480 | context in order to make that prediction. Now, this is also tiny by modern
00:32:46.640 | standards. Today, the context lengths would be a lot closer to a
00:32:52.120 | couple hundred thousand or maybe even a million, and so you have a lot more
00:32:55.880 | context, a lot more tokens in history, and you can make a lot better prediction
00:32:59.880 | about the next token in a sequence in that way. And finally, GPT-2 was trained
00:33:04.360 | on approximately 100 billion tokens, and this is also fairly small by modern
00:33:08.520 | standards. As I mentioned, the FineWeb dataset that we looked at here has
00:33:12.280 | 15 trillion tokens, so 100 billion is quite small. Now, I
00:33:19.240 | actually tried to reproduce GPT-2 for fun as part of this project called llm.c,
00:33:24.440 | so you can see my write-up of doing that in this post on GitHub under the llm.c
00:33:30.760 | repository. So in particular, the cost of training GPT-2 in 2019 was
00:33:37.080 | estimated to be approximately $40,000, but today you can do
00:33:41.200 | significantly better than that, and in particular, here it took about one day
00:33:45.120 | and about $600. But this wasn't even trying too hard. I think you could really
00:33:50.720 | bring this down to about $100 today. Now, why is it that the costs have come
00:33:56.320 | down so much? Well, number one, these data sets have gotten a lot better, and the
00:34:01.160 | way we filter them, extract them, and prepare them has gotten a lot more
00:34:04.800 | refined, and so the data set is of just a lot higher quality, so that's one thing.
00:34:09.320 | But really, the biggest difference is that our computers have gotten much
00:34:12.880 | faster in terms of the hardware, and we're going to look at that in a second,
00:34:16.280 | and also the software for running these models and really squeezing out all the
00:34:22.200 | speed from the hardware as it is possible, that software has also gotten
00:34:26.720 | much better as everyone has focused on these models and tried to run them very,
00:34:30.040 | very quickly. Now, I'm not going to be able to go into the full detail of this
00:34:35.960 | GPT-2 reproduction, and this is a long technical post, but I would like to still
00:34:39.880 | give you an intuitive sense for what it looks like to actually train one of
00:34:43.240 | these models as a researcher. Like, what are you looking at, and what does it look
00:34:46.200 | like, what does it feel like? So let me give you a sense of that a little bit.
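
Before looking at the actual console output, here is a drastically simplified sketch of what one of these "updates" amounts to, assuming PyTorch. This is not GPT-2 or a transformer, just a tiny stand-in model trained on a made-up pattern, but the loop has the same shape: sample a batch of windows, compute the loss on the next-token predictions, and nudge the parameters so the loss goes down.

```python
import torch
import torch.nn.functional as F

VOCAB, CONTEXT, BATCH = 100, 8, 32

# Tiny stand-in "model": embed the context tokens, flatten, predict next-token logits.
model = torch.nn.Sequential(
    torch.nn.Embedding(VOCAB, 64),
    torch.nn.Flatten(),
    torch.nn.Linear(64 * CONTEXT, VOCAB),
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-3)

for step in range(500):
    # Sample a batch of windows. Here the "data" is synthetic: the next token
    # simply repeats the last context token. Real training samples windows from
    # the multi-trillion-token dataset instead.
    x = torch.randint(0, VOCAB, (BATCH, CONTEXT))
    y = x[:, -1]

    logits = model(x)                    # predictions for what comes next
    loss = F.cross_entropy(logits, y)    # low loss = good predictions

    opt.zero_grad()
    loss.backward()
    opt.step()                           # nudge the parameters a little bit

    if step % 100 == 0:
        print(f"step {step:4d} | loss {loss.item():.4f}")   # watch this go down
```
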
00:34:49.200 | Okay, so this is what it looks like. Let me slide this over. So what I'm doing here
00:34:54.800 | is I'm training a GPT-2 model right now, and what's happening here is that every
00:35:01.160 | single line here, like this one, is one update to the model. So remember how here
00:35:09.000 | we are basically making the prediction better for every one of these tokens, and
00:35:14.480 | we are updating these weights or parameters of the neural net. So here,
00:35:18.600 | every single line is one update to the neural network, where we change its
00:35:22.480 | parameters by a little bit so that it is better at predicting next token and
00:35:25.640 | sequence. In particular, every single line here is improving the prediction on 1
00:35:32.400 | million tokens in the training set. So we've basically taken 1 million tokens
00:35:38.200 | out of this data set, and we've tried to improve the prediction of the token
00:35:44.560 | coming next in the sequence at all 1 million of those positions simultaneously. And at
00:35:50.440 | every single one of these steps, we are making an update to the network for that.
00:35:54.120 | Now the number to watch closely is this number called loss, and the loss is a
00:35:59.960 | single number that is telling you how well your neural network is
00:36:03.280 | performing right now, and it is created so that low loss is good. So you'll see
00:36:09.200 | that the loss is decreasing as we make more updates to the neural net, which
00:36:13.260 | corresponds to making better predictions on the next token in a sequence. And so
00:36:17.680 | the loss is the number that you are watching as a neural network researcher,
00:36:21.440 | and you are kind of waiting, you're twiddling your thumbs, you're drinking
00:36:25.080 | coffee, and you're making sure that this looks good so that with every update
00:36:29.520 | your loss is improving and the network is getting better at prediction. Now here
00:36:34.440 | you see that we are processing 1 million tokens per update. Each update takes
00:36:39.640 | about 7 seconds roughly, and here we are going to process a total of 32,000 steps
00:36:46.160 | of optimization. So 32,000 steps with 1 million tokens each is about 33
00:36:52.360 | billion tokens that we are going to process, and we're currently only at
00:36:55.800 | step 420 out of 32,000, so we are still only a bit more than 1%
00:37:01.840 | done because I've only been running this for 10 or 15 minutes or something like
00:37:05.360 | that. Now every 20 steps I have configured this optimization to do
00:37:10.960 | inference. So what you're seeing here is the model is predicting the next token
00:37:15.120 | in a sequence, and so you sort of start it randomly, and then you continue
00:37:19.380 | plugging in the tokens. So we're running this inference step, and this is the
00:37:23.760 | model sort of predicting the next token in a sequence, and every time you see
00:37:26.280 | something appear, that's a new token. So let's just look at this, and you can see
00:37:34.760 | that this is not yet very coherent, and keep in mind that this is only 1% of the
00:37:38.460 | way through training, and so the model is not yet very good at predicting the next
00:37:41.960 | token in the sequence. So what comes out is actually kind of a little bit of
00:37:45.520 | gibberish, right, but it still has a little bit of like local coherence. So
00:37:49.800 | since she is mine, it's a part of the information, should discuss my father,
00:37:54.360 | great companions, Gordon showed me sitting over it, and etc. So I know it
00:37:59.540 | doesn't look very good, but let's actually scroll up and see what it
00:38:04.160 | looked like when I started the optimization. So all the way here, at
00:38:09.300 | step 1, so after 20 steps of optimization, you see that what we're getting here is
00:38:16.760 | looks completely random, and of course that's because the model has only had 20
00:38:20.300 | updates to its parameters, and so it's giving you random text because it's a
00:38:23.600 | random network. And so you can see that at least in comparison to this, the model
00:38:27.760 | is starting to do much better, and indeed if we waited the entire 32,000 steps, the
00:38:32.620 | model will have improved to the point that it's actually generating fairly
00:38:36.400 | coherent English, and the tokens stream correctly, and they kind of make up
00:38:42.800 | English a lot better. So this has to run for about a day or two more now, and so
00:38:50.840 | at this stage we just make sure that the loss is decreasing, everything is looking
00:38:55.040 | good, and we just have to wait. And now let me turn to the story of the
00:39:03.100 | computation that's required, because of course I'm not running this optimization
00:39:07.100 | on my laptop. That would be way too expensive, because we have to run this
00:39:11.420 | neural network, and we have to improve it, and we need all this data and so
00:39:15.060 | on. So you can't run this too well on your computer, because the network is
00:39:19.180 | just too large. So all of this is running on the computer that is out there in the
00:39:23.620 | cloud, and I want to basically address the compute side of the story of training
00:39:28.080 | these models, and what that looks like. So let's take a look. Okay so the computer
00:39:31.860 | that I am running this optimization on is this 8xH100 node. So there are
00:39:37.860 | eight H100s in a single node, or a single computer. Now I am renting this
00:39:43.340 | computer, and it is somewhere in the cloud. I'm not sure where it is
00:39:45.900 | physically actually. The place I like to rent from is called Lambda, but there are
00:39:49.940 | many other companies who provide this service. So when you scroll down, you can
00:39:54.660 | see that they have some on-demand pricing for sort of computers that have
00:39:59.900 | these H100s, which are GPUs, and I'm going to show you what they look like in a
00:40:04.740 | second. But on-demand, 8x NVIDIA H100 GPUs. This machine comes for three
00:40:12.720 | dollars per GPU per hour, for example. So you can rent these, and then you get a
00:40:17.700 | machine in the cloud, and you can go in and you can train these models. And these
00:40:23.660 | GPUs, they look like this. So this is one H100 GPU. This is kind of what it looks
00:40:29.640 | like, and you slot this into your computer. And GPUs are this perfect fit
00:40:33.700 | for training neural networks, because they are very computationally expensive,
00:40:37.580 | but they display a lot of parallelism in the computation. So you can have many
00:40:42.100 | independent workers kind of working all at the same time on the matrix
00:40:48.540 | multiplications that are under the hood of training these neural networks. So this
00:40:54.160 | is just one of these H100s, but actually you would put
00:40:57.260 | multiple of them together. So you could stack eight of them into a single node,
00:41:00.940 | and then you can stack multiple nodes into an entire data center, or an entire
00:41:04.940 | system. So when we look at a data center,
00:41:11.180 | we start to see things that look
00:41:16.460 | like this, right? So we have one GPU goes to eight GPUs, goes to a single system,
00:41:20.180 | goes to many systems. And so these are the bigger data centers, and they of
00:41:23.940 | course would be much, much more expensive. And what's happening is that all the
00:41:28.580 | big tech companies really desire these GPUs, so they can train all these
00:41:33.100 | language models, because they are so powerful. And that is fundamentally
00:41:37.260 | what has driven the market cap of NVIDIA to about $3.4 trillion today, as an
00:41:41.860 | example, and why NVIDIA has kind of exploded. So this is the gold rush. The
00:41:47.100 | gold rush is getting the GPUs, getting enough of them, so they can all
00:41:51.540 | collaborate to perform this optimization. And what are they all
00:41:56.500 | doing? They're all collaborating to predict the next token on a data set
00:42:00.740 | like the FineWeb dataset. This is the computational workflow that basically
00:42:06.180 | is extremely expensive. The more GPUs you have, the more tokens you can try
00:42:10.100 | to predict and improve on, and you're going to process this data set faster,
00:42:14.020 | and you can iterate faster and get a bigger network and train a bigger
00:42:17.140 | network and so on. So this is what all those machines are doing. And this is
00:42:23.900 | why all of this is such a big deal. And for example, this is an article from
00:42:29.140 | about a month ago or so. This is why it's a big deal that, for example,
00:42:32.260 | Elon Musk is getting 100,000 GPUs in a single data center. And all of these
00:42:38.900 | GPUs are extremely expensive, are going to take a ton of power, and all of them
00:42:42.700 | are just trying to predict the next token in the sequence and improve the
00:42:45.740 | network by doing so, and get probably a lot more coherent text than what we're
00:42:50.940 | seeing here a lot faster. Okay, so unfortunately, I do not have tens or
00:42:54.940 | hundreds of millions of dollars to spend on training a really big model like this. But
00:43:00.260 | luckily, we can turn to some big tech companies who train these models
00:43:04.340 | routinely, and release some of them once they are done training. So they've
00:43:08.740 | spent a huge amount of compute to train this network, and they release the
00:43:12.220 | network at the end of the optimization. So it's very useful because they've
00:43:15.700 | done a lot of compute for that. So there are many companies who train these
00:43:19.300 | models routinely, but actually not many of them release what are called
00:43:23.620 | base models. So the model that comes out at the end here is what's called a base
00:43:27.940 | model. What is a base model? It's a token simulator, right? It's an internet
00:43:32.540 | text token simulator. And so that is not by itself useful yet, because what we
00:43:38.340 | want is what's called an assistant, we want to ask questions and have it
00:43:41.580 | respond with answers. These models won't do that; they just create sort of remixes
00:43:46.700 | of the internet. They dream internet pages. So the base models are not very
00:43:51.900 | often released, because they're kind of just only a step one of a few other
00:43:55.300 | steps that we still need to take to get an assistant. However, a few releases
00:43:59.260 | have been made. So as an example, OpenAI released the 1.5 billion parameter
00:44:05.940 | GPT-2 model back in 2019. And this GPT-2 model is a base model.
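
Because those GPT-2 weights are public, you can load a released base model yourself and sample from it. Here is a minimal sketch, assuming the Hugging Face transformers library with PyTorch installed; "gpt2" on the Hub is the small 124M-parameter checkpoint, while "gpt2-xl" is the big roughly 1.5 billion parameter one discussed here.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # downloads the code + weights
out = generator("The Eiffel Tower is", max_new_tokens=30, do_sample=True)
print(out[0]["generated_text"])
# A base model just "dreams" plausible internet text from the prefix;
# it is not an assistant and will not reliably answer questions.
```
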
00:44:11.300 | Now, what is a model release? What does it look like to release these models? So
00:44:16.380 | this is the GPT-2 repository on GitHub. Well, you need two things basically to
00:44:20.540 | release model. Number one, we need the Python code, usually, that describes the
00:44:28.500 | sequence of operations in detail that they make in their model. So if you
00:44:35.540 | remember back this transformer, the sequence of steps that are taken here in
00:44:41.540 | this neural network is what is being described by this code. So this code is
00:44:46.260 | sort of implementing what's called the forward pass of this neural network. So
00:44:50.620 | we need the specific details of exactly how they wired up that neural network.
00:44:54.380 | So this is just computer code, and it's usually just a couple hundred lines of
00:44:58.220 | code. It's not that crazy. And this is all fairly understandable and
00:45:02.540 | usually fairly standard. What's not standard are the parameters. That's where
00:45:06.100 | the actual value is. What are the parameters of this neural network,
00:45:09.860 | because there's 1.6 billion of them, and we need the correct setting or a really
00:45:14.060 | good setting. And so that's why in addition to this source code, they
00:45:18.620 | release the parameters, which in this case is roughly 1.5 billion parameters.
00:45:23.580 | And these are just numbers. So it's one single list of 1.5 billion numbers, the
00:45:28.540 | precise and good setting of all the knobs, such that the tokens come out
00:45:32.580 | well. So you need those two things to get a base model release. Now, GPT-2 was
00:45:42.980 | released, but that's actually a fairly old model, as I mentioned. So actually,
00:45:46.020 | the model we're going to turn to is called Llama 3. And that's the one that
00:45:49.660 | I would like to show you next. So GPT-2, again, was 1.6 billion
00:45:54.980 | parameters trained on 100 billion tokens. Llama 3 is a much bigger model
00:45:58.980 | and a much more modern model. It was trained and released by Meta. And it is
00:46:03.620 | a 405 billion parameter model trained on 15 trillion tokens, in very much the
00:46:09.260 | same way, just much, much bigger. And Meta has also made a release of Llama 3.
00:46:16.060 | And that was part of this paper. So in this paper, which goes into a lot of
00:46:21.540 | detail, the biggest base model that they released is the Llama 3.1 405
00:46:27.820 | billion parameter model. So this is the base model. And then in addition to the
00:46:32.180 | base model, you see here, foreshadowing for later sections of the video, they
00:46:36.100 | also released the instruct model. And the instruct means that this is an
00:46:39.740 | assistant, you can ask it questions, and it will give you answers. We still
00:46:43.060 | have yet to cover that part later. For now, let's just look at this base model,
00:46:46.860 | this token simulator. And let's play with it and try to think about, you
00:46:50.820 | know, what is this thing? And how does it work? And what do we get at the end
00:46:54.780 | of this optimization, if you let this run until the end, for a very big neural
00:46:59.300 | network on a lot of data. So my favorite place to interact with the base models
00:47:03.620 | is this company called Hyperbolic, which is basically serving the base model of
00:47:09.220 | the 405B LLAMA-3.1. So when you go into the website, and I think you may have
00:47:14.420 | to register and so on, make sure that in the models, make sure that you are
00:47:18.140 | using LLAMA-3.1 405 billion base, it must be the base model. And then here,
00:47:24.420 | let's say the max tokens is how many tokens we're going to be generating. So
00:47:27.700 | let's just decrease this to be a bit less just so we don't waste compute, we
00:47:31.660 | just want the next 128 tokens. And leave the other stuff alone, I'm not going to
00:47:35.660 | go into the full detail here. Now, fundamentally, what's going to happen
00:47:39.500 | here is identical to what happens here during inference for us. So this is just
00:47:44.820 | going to continue the token sequence of whatever prefix you're going to give it.
00:47:48.620 | So I want to first show you that this model here is not yet an assistant. So
00:47:53.420 | you can, for example, ask it, what is two plus two, it's not going to tell
00:47:56.540 | you, oh, it's four. What else can I help you with? It's not going to do that.
00:48:00.900 | Because what is two plus two is going to be tokenized. And then those tokens just
00:48:05.860 | acts as a prefix. And then what the model is going to do now is just going to get
00:48:09.580 | the probability for the next token. And it's just a glorified autocomplete. It's
00:48:13.420 | a very, very expensive autocomplete of what comes next, depending on the
00:48:17.940 | statistics of what it saw in its training documents, which are basically
00:48:21.060 | web pages. So let's just hit enter to see what tokens it comes up with as a
00:48:26.580 | continuation. Okay, so here it kind of actually answered the question and
00:48:34.020 | started to go off into some philosophical territory. Let's try it
00:48:37.580 | again. So let me copy and paste. And let's try again, from scratch. What is
00:48:41.860 | two plus two? Okay, so it just goes off again. One more thing that I
00:48:50.460 | want to stress is that every time you put in a prompt, the system just
00:48:55.020 | kind of starts from scratch. So the system here is stochastic. So for the
00:49:01.380 | same prefix of tokens, we're always getting a different answer. And the
00:49:04.860 | reason for that is that we get this probability distribution, and we sample
00:49:08.860 | from it, and we always get different samples, and we sort of always go into a
00:49:12.180 | different territory afterwards. So here in this case, I don't know what this is.
00:49:18.820 | Let's try one more time. So it just continues on. So it's just doing the
00:49:25.740 | stuff that's on the internet, right? And it's just kind of regurgitating
00:49:30.380 | those statistical patterns. So first, it's not an assistant yet, it's a
00:49:36.820 | token autocomplete. And second, it is a stochastic system.
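To make the stochasticity concrete, here is a minimal sketch of the sampling step, assuming we already have the model's raw scores (logits) over the vocabulary for the next position; the numbers are toy values.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
    """Turn raw next-token scores into a probability distribution and sample from it."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                        # softmax -> probabilities
    return rng.choice(len(probs), p=probs)      # stochastic: repeated calls differ

# Toy 5-token vocabulary: the same "prefix" gives different samples on every call.
toy_logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print([sample_next_token(toy_logits) for _ in range(10)])
```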
00:49:43.020 | Now the crucial thing is that even though this model is not yet by itself very useful for a lot of
00:49:47.220 | applications, it is still very useful, because in the task of
00:49:53.380 | predicting the next token in the sequence, the model has learned a lot
00:49:56.940 | about the world. And it has stored all that knowledge in the parameters of the
00:50:01.180 | network. So remember that our text looked like this, right? Internet web
00:50:05.700 | pages. And now all of this is sort of compressed in the weights of the
00:50:10.260 | network. So you can think of these 405 billion parameters as a kind of
00:50:16.100 | compression of the internet. You can think of the 405 billion parameters as
00:50:21.020 | kind of like a zip file. But it's not a lossless compression, it's a lossy
00:50:26.260 | compression, we're kind of like left with kind of a gestalt of the internet
00:50:29.780 | and we can generate from it, right? Now we can elicit some of this knowledge by
00:50:34.900 | prompting the base model accordingly. So for example, here's a prompt that
00:50:39.300 | might work to elicit some of that knowledge that's hiding in the
00:50:41.980 | parameters. Here's my top 10 list of the top landmarks to see in Paris. And I'm
00:50:51.420 | doing it this way, because I'm trying to prime the model to now continue this
00:50:54.860 | list. So let's see if that works when I press enter. Okay, so you see that it
00:51:00.140 | started the list, and it's now kind of giving me some of those landmarks. And
00:51:04.300 | I noticed that it's trying to give a lot of information here. Now, you might not
00:51:08.380 | be able to actually fully trust some of the information here. Remember that this
00:51:11.420 | is all just a recollection of some of the internet documents. And so the
00:51:16.100 | things that occur very frequently in the internet data are probably more likely
00:51:20.420 | to be remembered correctly, compared to things that happen very infrequently. So
00:51:24.980 | you can't fully trust some of the information that is
00:51:27.980 | here, because it's all just a vague recollection of internet documents.
00:51:31.220 | Because the information is not stored explicitly in any of the parameters, it's
00:51:35.860 | all just the recollection. That said, we did get something that is probably
00:51:39.580 | approximately correct. And I don't actually have the expertise to verify
00:51:43.260 | that this is roughly correct. But you see that we've elicited a lot of the
00:51:46.900 | knowledge of the model. And this knowledge is not precise and exact. This
00:51:51.460 | knowledge is vague, and probabilistic, and statistical. And the kinds of things
00:51:56.180 | that occur often are the kinds of things that are more likely to be remembered in
00:52:01.060 | the model. Now I want to show you a few more examples of this model's behavior.
00:52:04.780 | The first thing I want to show you is this example. I went to the Wikipedia
00:52:08.900 | page for Zebra. And let me just copy-paste the first, even one sentence
00:52:13.780 | here. And let me put it here. Now when I click enter, what kind of completion are
00:52:19.980 | we going to get? So let me just hit enter. There are three living species,
00:52:25.860 | etc, etc. What the model is producing here is an exact regurgitation of this
00:52:31.620 | Wikipedia entry. It is reciting this Wikipedia entry purely from memory. And
00:52:36.420 | this memory is stored in its parameters. And so it is possible that at some point
00:52:41.340 | in these 512 tokens, the model will stray away from the Wikipedia entry. But
00:52:46.460 | you can see that it has huge chunks of it memorized here. Let me see, for
00:52:50.020 | example, if this sentence occurs by now. Okay, so we're still on track. Let me
00:52:56.860 | check here. Okay, we're still on track. It will eventually stray away. Okay, so
00:53:05.700 | this thing is just recited to a very large extent. It will eventually deviate
00:53:09.780 | because it won't be able to remember exactly. Now, the reason that this
00:53:13.540 | happens is because these models can be extremely good at memorization. And
00:53:17.740 | usually, this is not what you want in the final model. And this is something
00:53:20.820 | called regurgitation. And it's usually undesirable to recite things directly
00:53:26.060 | from what you have trained on. Now, the reason that this happens actually is
00:53:29.940 | because for a lot of documents, like for example, Wikipedia, when these
00:53:33.700 | documents are deemed to be of very high quality as a source, like for
00:53:37.100 | example, Wikipedia, it is very often the case that when you train the model,
00:53:41.700 | you will preferentially sample from those sources. So basically, the model
00:53:46.220 | has probably done a few epochs on this data, meaning that it has seen this
00:53:49.860 | web page, like maybe probably 10 times or so. And it's a bit like
00:53:53.740 | when you read some kind of a text many, many times, say you read something 100
00:53:57.500 | times, then you will be able to recite it. And it's very similar for this
00:54:01.060 | model, if it sees something way too often, it's going to be able to recite
00:54:03.820 | it later from memory. Except these models can be a lot more efficient,
00:54:08.340 | like per presentation than a human. So probably it's only seen this Wikipedia
00:54:12.740 | entry 10 times, but basically it has remembered this article exactly in its
00:54:16.660 | parameters. Okay, the next thing I want to show you is something that the
00:54:19.340 | model has definitely not seen during its training. So for example, if we go
00:54:23.420 | to the paper, and then we navigate to the pre training data, we'll see here
00:54:29.100 | that the data set has a knowledge cutoff until the end of 2023. So it
00:54:35.540 | will not have seen documents after this point. And certainly it has not seen
00:54:39.540 | anything about the 2024 election and how it turned out. Now, if we prime the
00:54:44.980 | model with the tokens from the future, it will continue the token sequence,
00:54:49.900 | and it will just take its best guess according to the knowledge that it has
00:54:53.020 | in its own parameters. So let's take a look at what that could look like. So
00:54:57.540 | the Republican Party candidate, Trump, okay, President of the United States from
00:55:02.340 | 2017. And let's see what it says after this point. So for example, the model
00:55:07.020 | will have to guess at the running mate and who it's against, etc. So let's
00:55:10.940 | hit enter. So here it thinks that Mike Pence was the running mate instead of
00:55:15.740 | JD Vance. And the ticket was against Hillary Clinton and Tim Kaine. So this
00:55:22.940 | is kind of an interesting parallel universe, potentially, of what could have
00:55:26.260 | happened according to the LLM. Let's get a different sample. So the
00:55:29.940 | identical prompt, and let's resample. So here the running mate was Ron
00:55:35.700 | DeSantis. And they ran against Joe Biden and Kamala Harris. So this is
00:55:40.540 | again, a different parallel universe. So the model will take educated guesses,
00:55:44.020 | and it will continue the token sequence based on this knowledge. And all of
00:55:48.220 | what we're seeing here is kind of what's called hallucination. The
00:55:51.980 | model is just taking its best guess in a probabilistic manner. The next thing I
00:55:56.900 | would like to show you is that even though this is a base model and not yet
00:56:00.100 | an assistant model, it can still be utilized in practical applications if
00:56:04.260 | you are clever with your prompt design. So here's something that we would call a
00:56:08.380 | few shot prompt. So what I have here is 10 pairs, and
00:56:15.020 | each pair is a word in English, colon, and then the translation in Korean. And
00:56:21.740 | we have 10 of them. And what the model does here is at the end, we have
00:56:25.980 | teacher colon, and then here's where we're going to do a completion of say,
00:56:29.580 | just five tokens. And these models have what we call in context learning
00:56:34.180 | abilities. And what that's referring to is that as it is reading this context,
00:56:38.700 | it is learning sort of in place that there's some kind of an algorithmic
00:56:44.340 | pattern going on in my data. And it knows to continue that pattern. And this
00:56:49.100 | is called kind of like in context learning. So it takes on the role of
00:56:53.460 | translator. And when we hit completion, we see that the teacher translation is
00:56:59.780 | "선생님," which is correct. And so this is how you can build apps by being
00:57:04.700 | clever with your prompting, even though we still just have a base model for now.
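As a sketch, building a few-shot prompt like this is really just string concatenation; the word pairs below are illustrative, and the commented-out `complete()` call stands in for whatever base-model completion API you use (like the endpoint sketched earlier).

```python
# Illustrative English->Korean pairs (not the exact ten used in the video); the last
# line leaves the Korean side blank so the most likely continuation is the translation.
pairs = [
    ("apple", "사과"),
    ("house", "집"),
    ("water", "물"),
]

prompt = "\n".join(f"{english}: {korean}" for english, korean in pairs)
prompt += "\nteacher:"   # the base model should continue with the Korean word for "teacher"

# completion = complete(prompt, max_tokens=5)   # hypothetical base-model completion call
print(prompt)
```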
00:57:08.140 | And it relies on what we call this in context learning ability. And it is done
00:57:14.140 | by constructing what's called a few shot prompt. Okay, and finally, I want to
00:57:17.780 | show you that there is a clever way to actually instantiate a whole language
00:57:21.540 | model assistant just by prompting. And the trick to it is that we're going to
00:57:26.300 | structure a prompt to look like a web page that is a conversation between a
00:57:31.140 | helpful AI assistant and a human. And then the model will continue that
00:57:34.900 | conversation. So actually, to write the prompt, I turned to ChatGPT itself,
00:57:39.780 | which is kind of meta. But I told it, I want to create an LLM assistant, but all
00:57:44.340 | I have is the base model, so can you please write my prompt. And this is what
00:57:51.740 | it came up with, which is actually quite good. So here's a conversation between
00:57:55.060 | an AI assistant and a human. The AI assistant is knowledgeable, helpful,
00:57:58.580 | capable of answering a wide variety of questions, etc. And then here, it's not
00:58:03.780 | enough to just give it a sort of description. It works much better if you
00:58:07.740 | create this few shot prompt. So here are a few turns of human, assistant,
00:58:12.220 | human, assistant. And we have, you know, a few turns of conversation. And then here at
00:58:17.980 | the end is we're going to be putting the actual query that we like. So let me
00:58:21.260 | copy paste this into the base model prompt. And now, let me do human colon.
00:58:28.220 | And this is where we put our actual prompt. Why is the sky blue? And let's
00:58:34.900 | run. Assistant, the sky appears blue due to the phenomenon called Rayleigh
00:58:41.460 | scattering, etc, etc. So you see that the base model is just continuing the
00:58:45.220 | sequence. But because the sequence looks like this conversation, it takes on that
00:58:49.940 | role. But it is a little subtle, because here it just, you know, it ends the
00:58:54.460 | assistant and then just, you know, hallucinates the next question by the
00:58:57.220 | human, etc. So we'll just continue going on and on. But you can see that we have
00:59:01.820 | sort of accomplished the task. And if you just took this, why is the sky blue?
00:59:06.420 | And if we just refresh this, and put it here, then of course, we don't expect
00:59:11.020 | this to work with the base model, right? Who knows what we're
00:59:14.100 | gonna get? Okay, we're just gonna get more questions. So this is one way
00:59:19.220 | to create an assistant, even though you may only have a base model.
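Here is a minimal sketch of that trick: wrap the actual question inside a conversation-shaped prompt and stop generating when the model starts inventing the next human turn. The example turns are made up, and `complete()` again stands in for a base-model completion call.

```python
# Made-up example turns; the real prompt (written with ChatGPT's help above) is longer.
FEW_SHOT_CONVERSATION = """\
A conversation between a helpful AI assistant and a human. The assistant is
knowledgeable, helpful, and answers a wide variety of questions.

Human: What is the capital of France?
Assistant: The capital of France is Paris.

Human: How many legs does a spider have?
Assistant: Spiders have eight legs.
"""

def build_assistant_prompt(question: str) -> str:
    # The trailing "Assistant:" primes the base model to continue in the assistant's voice.
    return FEW_SHOT_CONVERSATION + f"\nHuman: {question}\nAssistant:"

prompt = build_assistant_prompt("Why is the sky blue?")
# answer = complete(prompt, max_tokens=256, stop=["\nHuman:"])  # hypothetical base-model call;
# the stop sequence cuts generation off before the model invents the next human turn.
print(prompt)
```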
00:59:23.980 | Okay, so this is a brief summary of the things we talked about over the last few
00:59:27.380 | minutes. Now, let me zoom out here. And this is kind of like what we've talked
00:59:35.060 | about so far. We wish to train LLM assistants like ChatGPT. We've discussed
00:59:40.540 | the first stage of that, which is the pre training stage. And we saw that
00:59:44.020 | really what it comes down to is we take internet documents, we break them up into
00:59:47.340 | these tokens, these atoms of little text chunks. And then we predict token
00:59:51.420 | sequences using neural networks. The output of this entire stage is this base
00:59:56.620 | model, it is the setting of the parameters of this network. And this base
01:00:01.540 | model is basically an internet document simulator on the token level. So it can
01:00:05.620 | generate token sequences that have the same kind of
01:00:09.780 | statistics as internet documents. And we saw that we can use it in some
01:00:13.700 | applications, but we actually need to do better. We want an assistant, we want to
01:00:17.300 | be able to ask questions, and we want the model to give us answers. And so we need
01:00:21.300 | to now go into the second stage, which is called the post training stage. So we
01:00:26.380 | take our base model, our internet document simulator, and hand it off to
01:00:29.980 | post training. So we're now going to discuss a few ways to do what's called
01:00:33.860 | post training of these models. These stages in post training are going to be
01:00:38.020 | computationally much less expensive, most of the computational work, all of
01:00:41.940 | the massive data centers, and all of the sort of heavy compute and millions of
01:00:47.660 | dollars are the pre training stage. But now we're going to the slightly cheaper,
01:00:52.460 | but still extremely important stage called post training, where we turn this
01:00:56.620 | LLM model into an assistant. So let's take a look at how we can get our model
01:01:01.580 | to not sample internet documents, but to give answers to questions. So in other
01:01:07.300 | words, what we want to do is we want to start thinking about conversations. And
01:01:10.900 | these are conversations that can be multi-turn. So there can be multiple
01:01:15.020 | turns, and they are, in the simplest case, a conversation between a human and an
01:01:18.900 | assistant. And so for example, we can imagine the conversation could look
01:01:22.340 | something like this. When a human says what is two plus two, the assistant
01:01:25.700 | should respond with something like two plus two is four. When a human follows up
01:01:29.260 | and says what if it was times instead of a plus, the assistant could respond with
01:01:32.620 | something like this. And similarly here, this is another example showing that the
01:01:37.020 | assistant could also have some kind of a personality here, that it's kind of like
01:01:40.500 | nice. And then here in the third example, I'm showing that when a human is asking
01:01:44.620 | for something that we don't wish to help with, we can produce what's called
01:01:48.660 | refusal, we can say that we cannot help with that. So in other words, what we
01:01:53.220 | want to do now is we want to think through how an assistant should interact
01:01:56.780 | with a human. And we want to program the assistant and its behavior in these
01:02:01.060 | conversations. Now, because this is neural networks, we're not going to be
01:02:04.780 | programming these explicitly in code, we're not going to be able to program
01:02:08.620 | the assistant in that way. Because this is neural networks, everything is done
01:02:12.340 | through neural network training on data sets. And so because of that, we are
01:02:17.380 | going to be implicitly programming the assistant by creating data sets of
01:02:21.660 | conversations. So these are three independent examples of conversations in
01:02:25.620 | a data set, an actual data set. And as I'm going to show you, real data sets will be much
01:02:29.500 | larger; they could have hundreds of thousands of conversations that are multi-turn,
01:02:33.020 | very long, etc., and would cover a diverse breadth of topics. But here I'm only
01:02:37.780 | showing three examples. But the way this works, basically, is that the assistant is being
01:02:43.620 | programmed by example. And where is this data coming from, like two times two
01:02:48.020 | equals four, same as two plus two, etc. Where does that come from? This comes
01:02:51.540 | from human labelers. So we will basically give human labelers some
01:02:55.820 | conversational context. And we will ask them to basically give the ideal
01:03:00.100 | assistant response in this situation. And a human will write out the ideal
01:03:05.540 | response for an assistant in any situation. And then we're going to get
01:03:09.020 | the model to basically train on this and to imitate those kinds of responses. So
01:03:15.220 | the way this works, then is we are going to take our base model, which we
01:03:18.100 | produced in the pre training stage. And this base model was trained on internet
01:03:22.300 | documents, we're now going to take that data set of internet documents, and
01:03:25.340 | we're going to throw it out. And we're going to substitute a new data set. And
01:03:29.540 | that's going to be a data set of conversations. And we're going to
01:03:32.060 | continue training the model on these conversations on this new data set of
01:03:35.540 | conversations. And what happens is that the model will very rapidly adjust, and
01:03:40.620 | it will sort of learn the statistics of how this assistant responds to human
01:03:45.500 | queries. And then later during inference, we'll be able to basically
01:03:49.900 | prime the assistant and get the response. And it will be imitating what
01:03:55.380 | the human labelers would do in that situation, if that makes sense.
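To make this concrete, here is a tiny self-contained sketch of the key point: post-training uses the exact same next-token prediction loss as pre-training, just on conversation tokens instead of internet documents. The toy model and the token IDs are stand-ins, not a real LLM.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64

# Toy stand-in for a pretrained base model: an embedding table plus a linear head.
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One "conversation" already rendered into token IDs (made-up IDs for illustration).
conversation_ids = torch.tensor([[7, 42, 42, 13, 99, 5, 8, 101, 3]])

inputs, targets = conversation_ids[:, :-1], conversation_ids[:, 1:]   # predict the next token
logits = model(inputs)                                                # (1, T-1, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()   # exactly the same training step as pre-training, just different data
# (In practice the loss is usually masked so only the assistant's tokens are scored.)
```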
01:03:58.940 | So we're going to see examples of that. And this is going to become a bit more
01:04:02.420 | concrete. I also wanted to mention that in this post training stage, we're going
01:04:06.140 | to basically just continue training the model. But the pre training stage can in
01:04:11.100 | practice take roughly three months of training on many thousands of computers.
01:04:15.460 | The post training stage will typically be much shorter, like three hours, for
01:04:19.060 | example. And that's because the data set of conversations that we're going to
01:04:22.900 | create here manually is much, much smaller than the data set of text on the
01:04:27.980 | internet. And so this training will be very short. But fundamentally, we're just
01:04:32.900 | going to take our base model, we're going to continue training using the exact
01:04:36.380 | same algorithm, the exact same everything, except we're swapping out
01:04:39.820 | the data set for conversations. So the questions now are, what are these
01:04:43.580 | conversations? How do we represent them? How do we get the model to see
01:04:47.620 | conversations instead of just raw text? And then what are the outcomes of this
01:04:53.220 | kind of training? And what do you get in a certain like psychological sense when
01:04:57.780 | we talk about the model? So let's turn to those questions now. So let's start by
01:05:01.620 | talking about the tokenization of conversations. Everything in these models
01:05:06.060 | has to be turned into tokens, because everything is just about token
01:05:09.100 | sequences. So how do we turn conversations into token sequences is
01:05:12.980 | the question. And so for that, we need to design some kind of an encoding. And
01:05:17.220 | this is kind of similar to, maybe if you're familiar (you don't have to be)
01:05:20.500 | with, for example, the TCP/IP packet on the internet: there are precise rules
01:05:25.580 | and protocols for how you represent information, how everything is
01:05:28.500 | structured together, so that you have all this kind of data laid out in a way
01:05:32.380 | that is written out on a paper, and that everyone can agree on. And so it's the
01:05:36.340 | same thing now happening in LLMs, we need some kind of data structures, and
01:05:39.860 | we need to have some rules around how these data structures like
01:05:42.340 | conversations, get encoded and decoded to and from tokens. And so I want to
01:05:47.780 | show you now how I would recreate this conversation in the token space. So if
01:05:53.940 | you go to Tiktokenizer, I can take that conversation. And this is how it is
01:05:59.180 | represented for the language model. So here we are alternating a user
01:06:04.860 | and an assistant in this two-turn conversation. And what you're seeing
01:06:09.940 | here is it looks ugly, but it's actually relatively simple. The way it gets
01:06:13.540 | turned into a token sequence here at the end is a little bit complicated. But at
01:06:17.900 | the end, this conversation between the user and assistant ends up being 49
01:06:21.580 | tokens, it is a one dimensional sequence of 49 tokens. And these are the tokens.
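To make the encoding concrete, here is a rough sketch of the idea: the structured conversation is rendered into one flat string with special delimiter tokens, and then that string is tokenized. The delimiter names below mirror the GPT-4o-style protocol described next; they are illustrative, not the exact tokens.

```python
# A conversation as a structured object...
conversation = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4"},
]

# ...rendered into one flat string with special delimiter tokens. The delimiter
# names here are illustrative; every model family uses slightly different ones.
def render(conv):
    return "".join(
        f"<|im_start|>{turn['role']}<|im_sep|>{turn['content']}<|im_end|>"
        for turn in conv
    )

print(render(conversation))
# A tokenizer then maps this string (with each special delimiter as a single new
# token ID) into the one-dimensional sequence of tokens that the model sees.
```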
01:06:26.020 | Okay. And all the different LLMs will have a slightly different format or
01:06:31.540 | protocols. And it's a little bit of a Wild West right now. But for example,
01:06:35.980 | GPT-4o does it in the following way. You have this special token called im
01:06:41.140 | underscore start. And this is short for imaginary monologue start
01:06:47.500 | (I don't actually know why it's called that, to be honest). Then
01:06:51.940 | you have to specify whose turn it is. So for example, user, which is a token 1428.
01:06:56.660 | Then you have im_sep, the imaginary monologue separator. And then it's the exact
01:07:02.940 | question. So the tokens of the question, and then you have to close it with
01:07:07.140 | im_end, the end of the imaginary monologue. So basically, the question from a user
01:07:12.860 | of what is two plus two ends up being the token sequence of these tokens. And
01:07:19.340 | now the important thing to mention here is that im_start, this is not text,
01:07:23.100 | right? im_start is a special token that gets added, it's a new token. And
01:07:29.940 | this token has never been trained on so far, it is a new token that we create in
01:07:33.860 | the post training stage and introduce. And so these special tokens like
01:07:38.900 | im_sep, im_start, etc., are introduced and interspersed with text, so that they sort
01:07:44.420 | of get the model to learn that, hey, this is the start of a turn, and
01:07:49.300 | the turn is for the user. And then this is
01:07:53.780 | what the user says, and then the user's turn ends. And then it's a new start of a
01:07:57.780 | turn, and it is by the assistant. And then what does the assistant say? Well,
01:08:02.460 | these are the tokens of what the assistant says, etc. And so this
01:08:10.060 | conversation is now turned into this sequence of tokens. The specific details
01:08:10.060 | here are not actually that important. All I'm trying to show you in concrete
01:08:13.300 | terms, is that our conversations, which we think of as kind of like a structured
01:08:17.300 | object, end up being turned via some encoding into one dimensional sequences
01:08:22.300 | of tokens. And so, because this is one dimensional sequence of tokens, we can
01:08:27.300 | apply all this stuff that we applied before. Now it's just a sequence of
01:08:30.820 | tokens. And now we can train a language model on it. And so we're just
01:08:34.740 | predicting the next token in a sequence, just like before. And we can represent
01:08:39.780 | and train on conversations. And then what does it look like at test time
01:08:43.580 | during inference? So say we've trained a model. And we've trained a model on
01:08:48.660 | these kinds of data sets of conversations. And now we want to
01:08:51.740 | inference. So during inference, what does this look like when you're on
01:08:55.460 | ChatGPT? Well, you come to ChatGPT, and you have, say, like a dialogue
01:09:00.020 | with it. And the way this works is basically, say that this was already
01:09:06.180 | filled in. So like, what is two plus two, two plus two is four. And now you
01:09:09.060 | issue what if it was times, IM_END. And what basically ends up happening on
01:09:15.420 | the servers of OpenAI or something like that, is they put an IM_START,
01:09:19.180 | assistant, IM_SEP. And this is where they end it, right here. So they
01:09:24.580 | construct this context. And now they start sampling from the model. So it's
01:09:29.180 | at this stage that they will go to the model and say, okay, what is a good
01:09:32.100 | first sequence? What is a good first token? What is a good second token?
01:09:35.820 | What is a good third token? And this is where the LLM takes over and creates a
01:09:40.180 | response, like for example, response that looks something like this, but it
01:09:44.540 | doesn't have to be identical to this. But it will have the flavor of this, if
01:09:48.340 | this kind of a conversation was in the data set. So that's roughly how the
01:09:53.140 | protocol works, although the details of this protocol are not important.
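And here is a minimal sketch of the serving-side logic just described: render the conversation so far, append the header that opens a new assistant turn, and sample tokens until the end-of-turn token appears. The token strings and the `sample_one_token` callable are stand-ins, not any provider's actual implementation.

```python
END_OF_TURN = "<|im_end|>"   # illustrative special-token strings, as above

def generate_assistant_reply(conversation, sample_one_token, max_tokens=256):
    """Render the chat so far, open a new assistant turn, and sample until <|im_end|>."""
    context = "".join(
        f"<|im_start|>{t['role']}<|im_sep|>{t['content']}<|im_end|>" for t in conversation
    )
    context += "<|im_start|>assistant<|im_sep|>"   # the model fills in what comes next
    reply = ""
    for _ in range(max_tokens):
        token = sample_one_token(context + reply)  # stand-in for the real model call
        if token == END_OF_TURN:
            break
        reply += token
    return reply
```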
01:09:58.100 | So again, my goal is just to show you that everything ends up being just a
01:10:01.900 | one-dimensional token sequence. So we can apply everything we've already
01:10:05.140 | seen. But now we're training on conversations, and we're basically
01:10:10.460 | generating conversations as well. Okay, so now I would like to turn to what
01:10:14.020 | these data sets look like in practice. The first paper that I would like to
01:10:17.180 | show you and the first effort in this direction is this paper from OpenAI in
01:10:21.660 | 2022. And this paper, or the technique that they
01:10:26.380 | developed, was called InstructGPT. And this was the first time that OpenAI kind of talked about
01:10:29.620 | how you can take language models and fine tune them on conversations. And so
01:10:33.820 | this paper has a number of details that I would like to take you through. So the
01:10:37.260 | first stop I would like to make is in section 3.4, where they talk about the
01:10:41.220 | human contractors that they hired, in this case from Upwork or through ScaleAI
01:10:46.540 | to construct these conversations. And so there are human labelers involved
01:10:51.660 | whose job it is professionally to create these conversations. And these
01:10:55.900 | labelers are asked to come up with prompts, and then they are asked to also
01:11:00.060 | complete the ideal assistant responses. And so these are the kinds of prompts
01:11:04.100 | that people came up with. So these are human labelers. So list five ideas for
01:11:08.060 | how to regain enthusiasm for my career. What are the top 10 science fiction
01:11:11.500 | books I should read next? And there's many different types of kind of prompts
01:11:16.060 | here. So translate the sentence to Spanish, etc. And so there's many things
01:11:21.500 | here that people came up with. They first come up with the prompt, and then
01:11:25.460 | they also answer that prompt, and they give the ideal assistant response. Now,
01:11:30.140 | how do they know what is the ideal assistant response that they should
01:11:33.300 | write for these prompts? So when we scroll down a little bit further, we see
01:11:37.260 | that here we have this excerpt of labeling instructions that are given to
01:11:41.300 | the human labelers. So the company that is developing the language model, like
01:11:45.220 | for example, OpenAI, writes up labeling instructions for how the humans should
01:11:49.540 | create ideal responses. And so here, for example, is an excerpt of these kinds of
01:11:54.780 | labeling instructions. On a high level, you're asking people to be helpful,
01:11:58.100 | truthful, and harmless. And you can pause the video if you'd like to see more
01:12:01.980 | here. But on a high level, basically just answer, try to be helpful, try to be
01:12:06.380 | truthful, and don't answer questions that we don't want kind of the system to
01:12:10.900 | handle later in ChatGPT. And so, roughly speaking, the company comes up with the
01:12:17.020 | labeling instructions. Usually they are not this short. Usually they are hundreds
01:12:20.340 | of pages, and people have to study them professionally. And then they write out
01:12:24.900 | the ideal assistant responses following those labeling instructions. So this is
01:12:29.620 | a very human-heavy process, as it was described in this paper. Now, the data
01:12:34.260 | set for InstructGPT was never actually released by OpenAI. But we do have some
01:12:38.140 | open source reproductions that were trying to follow this kind of a setup and
01:12:42.700 | collect their own data. So one that I'm familiar with, for example, is the effort
01:12:47.220 | of Open Assistant from a while back. And this is just one of, I think, many
01:12:51.620 | examples, but I just want to show you an example. So these were people
01:12:56.140 | on the internet that were asked to basically create these conversations,
01:12:59.020 | similar to what OpenAI did with human labelers. And so here's an entry of a
01:13:04.980 | person who came up with this prompt. Can you write a short introduction to the
01:13:08.300 | relevance of the term monopsony in economics? Please use examples, etc. And
01:13:14.100 | then the same person, or potentially a different person, will write up the
01:13:17.700 | response. So here's the assistant response to this; the same
01:13:22.260 | person or a different person will actually write out this ideal response. And then
01:13:28.380 | this is an example of maybe how the conversation could continue. Now explain
01:13:31.980 | it to a dog. And then you can try to come up with a slightly simpler
01:13:35.940 | explanation or something like that. Now, this then becomes the label, and we end
01:13:41.300 | up training on this. So what happens during training is that, of course,
01:13:47.620 | we're not going to have a full coverage of all the possible questions that the
01:13:53.100 | model will encounter at test time during inference. We can't possibly cover all
01:13:57.340 | the possible prompts that people are going to be asking in the future. But if
01:14:01.220 | we have a, like a data set of a few of these examples, then the model during
01:14:05.780 | training will start to take on this persona of this helpful, truthful,
01:14:10.540 | harmless assistant. And it's all programmed by example. And so these are
01:14:15.180 | all examples of behavior. And if you have conversations of these example
01:14:18.740 | behaviors, and you have enough of them, like 100,000, and you train on it, the
01:14:22.380 | model sort of starts to understand the statistical pattern. And it kind of
01:14:25.700 | takes on this personality of this assistant. Now, it's possible that when
01:14:30.300 | you get the exact same question like this, at test time, it's possible that
01:14:35.460 | the answer will be recited as exactly what was in the training set. But more
01:14:40.340 | likely than that is that the model will kind of like do something of a similar
01:14:44.220 | vibe. And it will understand that this is the kind of answer that you want. So
01:14:51.100 | that's what we're doing. We're programming the system by example, and
01:14:55.540 | the system statistically adopts this persona of this helpful, truthful,
01:15:00.460 | harmless assistant, which is kind of like reflected in the labeling
01:15:03.900 | instructions that the company creates. Now, I want to show you that the state
01:15:07.540 | of the art has kind of advanced in the last two or three years, since the
01:15:10.820 | instruct GPT paper. So in particular, it's not very common for humans to be
01:15:15.060 | doing all the heavy lifting just by themselves anymore. And that's because
01:15:18.260 | we now have language models. And these language models are helping us create
01:15:21.300 | these data sets and conversations. So it is very rare that the people will like
01:15:25.220 | literally just write out the response from scratch, it is a lot more likely
01:15:28.820 | that they will use an existing LLM to basically like, come up with an answer,
01:15:32.540 | and then they will edit it, or things like that. So there's many different
01:15:35.740 | ways in which LLMs have now started to kind of permeate this post training
01:15:41.020 | stack. And LLMs are basically used pervasively to help create these massive
01:15:46.220 | data sets of conversations. To show one example, UltraChat is one such
01:15:51.540 | example of a more modern data set of conversations. It is to a very large
01:15:56.100 | extent synthetic, but I believe there's some human involvement, I could be wrong
01:15:59.860 | with that. Usually, there'll be a little bit of human, but there will be a huge
01:16:03.020 | amount of synthetic help. And this is all kind of like, constructed in different
01:16:09.060 | ways. And UltraChat is just one example of many SFT data sets that currently
01:16:12.420 | exist. And the only thing I want to show you is that these data sets have now
01:16:16.540 | millions of conversations. These conversations are mostly synthetic, but
01:16:20.220 | they're probably edited to some extent by humans. And they span a huge diversity
01:16:24.620 | of sort of areas and so on. So these are fairly extensive artifacts by now. And
01:16:33.780 | there are all these like SFT mixtures, as they're called. So you have a mixture
01:16:37.300 | of like lots of different types and sources, and it's partially synthetic,
01:16:40.540 | partially human. And it's kind of like gone in that direction since. But roughly
01:16:46.620 | speaking, we still have SFT data sets, they're made up of conversations, we're
01:16:50.500 | training on them, just like we did before. And I guess like the last thing
01:16:56.500 | to note is that I want to dispel a little bit of the magic of talking to an
01:17:01.220 | AI. Like when you go to ChatGPT, and you give it a question, and then you hit
01:17:05.820 | enter, what is coming back is kind of like statistically aligned with what's
01:17:12.340 | happening in the training set. And these training sets, I mean, they really just
01:17:16.420 | have a seed in humans following labeling instructions. So what are you actually
01:17:21.700 | talking to in ChatGPT? Or how should you think about it? Well, it's not coming
01:17:25.660 | from some magical AI, like roughly speaking, it's coming from something
01:17:29.340 | that is statistically imitating human labelers, which comes from labeling
01:17:33.980 | instructions written by these companies. And so you're kind of imitating this,
01:17:37.620 | you're kind of getting, it's almost as if you're asking a human labeler. And
01:17:42.220 | imagine that the answer that is given to you from ChatGPT is some kind of a
01:17:46.660 | simulation of a human labeler. And it's kind of like asking what would a human
01:17:51.620 | labeler say in this kind of a conversation. And this
01:17:58.620 | human labeler is not just like a random person from the internet, because these
01:18:01.580 | companies actually hire experts. So for example, when you are asking questions
01:18:04.780 | about code, and so on, the human labelers that would be involved in creation of
01:18:09.100 | these conversation datasets, they will usually be educated expert people. And
01:18:14.180 | you're kind of like asking a question of like a simulation of those people, if
01:18:18.580 | that makes sense. So you're not talking to a magical AI, you're talking to an
01:18:21.860 | average labeler, this average labeler is probably fairly highly skilled, but
01:18:25.500 | you're talking to kind of like an instantaneous simulation of that kind of
01:18:29.340 | a person that would be hired in the construction of these datasets. So let
01:18:34.620 | me give you one more specific example before we move on. For example, when I
01:18:38.460 | go to ChatGPT, and I say, recommend the top five landmarks to see in Paris, and
01:18:42.340 | then I hit enter. Okay, here we go. Okay, when I hit enter, what's coming out
01:18:53.060 | here? How do I think about it? Well, it's not some kind of a magical AI that has
01:18:57.900 | gone out and researched all the landmarks and then ranked them using its
01:19:01.580 | infinite intelligence, etc. What I'm getting is a statistical simulation of a
01:19:06.340 | labeler that was hired by OpenAI, you can think about it roughly in that way.
01:19:10.460 | And so if this specific question is in the post training dataset, somewhere at
01:19:17.700 | OpenAI, then I'm very likely to see an answer that is probably very, very
01:19:21.980 | similar to what that human labeler would have put down for those five
01:19:26.060 | landmarks. How does the human labeler come up with this? Well, they go off and
01:19:29.100 | they go on the internet, and they kind of do their own little research for 20
01:19:31.780 | minutes, and they just come up with a list, right? Now, so if they come up with
01:19:35.740 | this list, and this is in the dataset, I'm probably very likely to see what
01:19:39.580 | they submitted as the correct answer from the assistant. Now, if this
01:19:44.580 | specific query is not part of the post training dataset, then what I'm getting
01:19:48.300 | here is a little bit more emergent. Because the model kind of understands
01:19:53.540 | that statistically, the kinds of landmarks that are in the training set
01:19:58.340 | are usually the prominent landmarks, the landmarks that people usually want to
01:20:01.380 | see, the kinds of landmarks that are usually very often talked about on the
01:20:05.980 | internet. And remember that the model already has a ton of knowledge from its
01:20:09.500 | pre-training on the internet. So it's probably seen a ton of conversations
01:20:12.820 | about Paris, about landmarks, about the kinds of things that people like to see.
01:20:16.140 | And so it's the pre-training knowledge that is then combined with the post
01:20:19.460 | training dataset that results in this kind of an imitation. So that's
01:20:26.460 | roughly how you can think about what's happening behind the scenes here,
01:20:30.660 | in the statistical sense. Okay, now I want to turn to the topic of LLM
01:20:34.900 | psychology, as I like to call it, which refers to the sort of emergent cognitive
01:20:38.820 | effects of the training pipeline that we have for these models. So in particular,
01:20:43.740 | the first one I want to talk about is, of course, hallucinations. So you might be
01:20:50.100 | familiar with model hallucinations. It's when LLMs make stuff up, they just
01:20:53.460 | totally fabricate information, etc. And it's a big problem with LLM assistants.
01:20:57.700 | It is a problem that existed to a large extent with early models for many years
01:21:02.020 | ago. And I think the problem has gotten a bit better, because there are some
01:21:05.580 | mitigations that I'm going to go into in a second. For now, let's just try to
01:21:08.860 | understand where these hallucinations come from. So here's a specific example
01:21:13.220 | of three conversations that you might think you have in your training
01:21:17.740 | set. And these are pretty reasonable conversations that you could imagine
01:21:22.300 | being in the training set. So like, for example, who is Tom Cruise? Well, Tom
01:21:25.700 | Cruise is a famous actor, American actor and producer, etc. Who is John Barrasso?
01:21:30.460 | This turns out to be a US senator, for example. Who is Genghis Khan? Well,
01:21:35.820 | Genghis Khan was blah, blah, blah. And so this is what your conversations could
01:21:40.220 | look like at training time. Now, the problem with this is that when the human
01:21:45.180 | is writing the correct answer for the assistant, in each one of these cases,
01:21:49.700 | the human either like knows who this person is, or they research them on the
01:21:53.020 | internet, and they come in, and they write this response that kind of has
01:21:56.340 | this like confident tone of an answer. And what happens basically is that at
01:22:00.380 | test time, when you ask for someone who is, this is a totally random name that I
01:22:04.260 | totally came up with, and I don't think this person exists. As far as I know, I
01:22:08.980 | just tried to generate it randomly. The problem is that when we ask who is Orson
01:22:12.860 | Kovats, the assistant will not just tell you, oh, I
01:22:17.660 | don't know. Even if the assistant and the language model itself might know
01:22:22.740 | inside its features, inside its activations, inside of its brain sort of,
01:22:26.340 | it might know that this person is not someone that it's
01:22:30.780 | familiar with. Even if some part of the network kind of knows that in some
01:22:33.980 | sense, saying, oh, I don't know who this is, is not going to
01:22:39.340 | happen, because the model statistically imitates its training set. In the
01:22:44.460 | training set, the questions of the form who is blah are confidently answered
01:22:48.620 | with the correct answer. And so it's going to take on the style of the
01:22:52.500 | answer, and it's going to do its best, it's going to give you statistically
01:22:55.900 | the most likely guess, and it's just going to basically make stuff up.
01:22:59.020 | Because these models, again, we just talked about it is they don't have
01:23:02.620 | access to the internet, they're not doing research. These are statistical
01:23:05.940 | token tumblers, as I call them, just trying to sample the next token in the
01:23:09.860 | sequence. And they're gonna basically make stuff up. So let's take a look at what
01:23:13.860 | this looks like. I have here what's called the inference playground from
01:23:19.140 | Hugging Face. And I am on purpose picking on a model called Falcon 7B,
01:23:23.820 | which is an old model. This is a few years ago now. So it's an older model.
01:23:28.020 | So it suffers from hallucinations. And as I mentioned, this has improved over
01:23:31.780 | time recently. But let's say who is Orson Kovats? Let's ask Falcon 7b
01:23:36.060 | instruct. Run. Oh, yeah, Orson Kovats is an American author and science
01:23:41.020 | fiction writer. Okay. That's totally false. It's a hallucination. Let's try
01:23:45.700 | again. These are statistical systems, right? So we can resample. This time,
01:23:50.060 | Orson Kovats is a fictional character from this 1950s TV show. It's total BS,
01:23:55.020 | right? Let's try again. He's a former minor league baseball player. Okay, so
01:24:04.700 | basically the model doesn't know. And it's given us lots of different
01:24:04.700 | answers. Because it doesn't know. It's just kind of like sampling from these
01:24:08.540 | probabilities. The model starts with the tokens who is Orson Kovats
01:24:12.460 | assistant, and then it comes in here. And it's getting these
01:24:17.740 | probabilities. And it's just sampling from the probabilities. And it just
01:24:20.540 | like comes up with stuff. And the stuff is actually statistically consistent
01:24:26.100 | with the style of the answer in its training set. And it's just doing that.
01:24:30.580 | But you and I experience it as made-up factual knowledge. But keep in mind
01:24:35.100 | that the model basically doesn't know. And it's just imitating the format of
01:24:38.820 | the answer. And it's not going to go off and look it up. Because it's just
01:24:43.100 | imitating, again, the answer. So how can we mitigate this? Because for example,
01:24:47.860 | when we go to ChatGPT, and I say, who is Orson Kovats, and I'm now asking the
01:24:51.580 | state-of-the-art model from OpenAI, this model will tell you. Oh,
01:24:57.820 | so this model is actually even smarter, because you saw very briefly,
01:25:02.340 | it said, searching the web, we're going to cover this later. It's actually
01:25:06.940 | trying to do tool use. And it kind of just came up with some kind of a story.
01:25:13.820 | But I want to just ask, who is Orson Kovats, do not use any tools. I don't want it to do
01:25:19.700 | web search. And it says there's no widely known historical or public figure named Orson
01:25:26.060 | Kovats. So this model is not going to make up stuff. This model knows that it
01:25:30.060 | doesn't know. And it tells you that it doesn't appear to be a person that this
01:25:33.460 | model knows. So somehow, we sort of improved hallucinations, even though they
01:25:38.620 | clearly are an issue in older models. And it makes total sense why you would be
01:25:44.180 | getting these kinds of answers, if this is what your training set looks like. So
01:25:47.740 | how do we fix this? Okay, well, clearly, we need some examples in our data set
01:25:51.860 | where the correct answer for the assistant is that the model doesn't know
01:25:56.700 | about some particular fact. But we only need to have those answers be produced
01:26:01.780 | in the cases where the model actually doesn't know. And so the question is, how
01:26:05.140 | do we know what the model knows or doesn't know? Well, we can empirically
01:26:08.700 | probe the model to figure that out. So let's take a look at, for example, how
01:26:12.900 | Meta dealt with hallucinations for the Llama 3 series of models as an
01:26:18.060 | example. So in this paper that they published from Meta, we can go into
01:26:21.740 | hallucinations, which they call here factuality. And they describe the
01:26:28.140 | procedure by which they basically interrogate the model to figure out what
01:26:32.380 | it knows and doesn't know to figure out sort of like the boundary of its
01:26:35.540 | knowledge. And then they add examples to the training set, where for the things
01:26:42.940 | where the model doesn't know them, the correct answer is that the model doesn't
01:26:46.660 | know them, which sounds like a very easy thing to do in principle. But this
01:26:51.300 | roughly fixes the issue. And the reason it fixes the issue is because remember
01:26:57.020 | like, the model might actually have a pretty good model of its self knowledge
01:27:02.220 | inside the network. So remember, we looked at the network and all these
01:27:06.380 | neurons inside the network, you might imagine there's a neuron somewhere in
01:27:10.260 | the network, that sort of like lights up for when the model is uncertain. But
01:27:15.180 | the problem is that the activation of that neuron is not currently wired up to
01:27:19.820 | the model actually saying in words that it doesn't know. So even though the
01:27:23.460 | internals of the neural network know, because there's some neurons that
01:27:26.540 | represent that, the model will not surface that it will instead take its
01:27:31.380 | best guess so that it sounds confident. Just like it sees in a training set. So
01:27:36.420 | we need to basically interrogate the model and allow it to say I don't know
01:27:40.380 | in the cases that it doesn't know. So let me take you through what meta roughly
01:27:43.820 | does. So basically what they do is, here I have an example: Dominik Hašek is the
01:27:50.220 | featured article today. So I just went there randomly. And what they do is
01:27:54.380 | basically they take a random document in a training set, and they take a
01:27:58.420 | paragraph, and then they use an LLM to construct questions about that
01:28:03.620 | paragraph. So for example, I did that with ChatGPT here. So I said, here's a
01:28:11.140 | paragraph from this document, generate three specific factual questions based
01:28:15.220 | on this paragraph, and give me the questions and the answers. And so the
01:28:19.100 | LLMs are already good enough to create and reframe this information. So if the
01:28:24.780 | information is in the context window of this LLM, this actually works pretty
01:28:29.940 | well, it doesn't have to rely on its memory. It's right there in the context
01:28:33.540 | window. And so it can basically reframe that information with fairly high
01:28:38.100 | accuracy. So for example, it can generate questions for us like, for which team
01:28:42.060 | did he play? Here's the answer. How many cups did he win, etc. And now what we
01:28:46.900 | have to do is we have some question and answers. And now we want to interrogate
01:28:50.300 | the model. So roughly speaking, what we'll do is we'll take our questions.
01:28:53.740 | And we'll go to our model, which would be, say, Llama at Meta. But let's just
01:28:59.380 | interrogate Mistral 7B here as an example. That's another model. So does
01:29:04.060 | this model know about this answer? Let's take a look. So he played for Buffalo
01:29:11.220 | Sabres, right? So the model knows. And the way that you can programmatically
01:29:15.740 | decide is basically we're going to take this answer from the model. And we're
01:29:20.100 | going to compare it to the correct answer. And again, the models are good
01:29:24.220 | enough to do this automatically. So there's no humans involved here. We can
01:29:27.620 | take basically the answer from the model. And we can use another LLM judge to
01:29:32.780 | check if that is correct, according to this answer. And if it is correct, that
01:29:36.460 | means that the model probably knows. So what we're going to do is we're going to do
01:29:40.020 | this maybe a few times. So okay, it knows it's Buffalo Sabres. Let's try again.
01:29:43.660 | Buffalo Sabres. Let's try one more time. Buffalo Sabres. So we asked three times
01:29:54.020 | about this factual question, and the model seems to know. So everything is
01:29:57.860 | great. Now let's try the second question. How many Stanley Cups did he win?
01:30:03.180 | And again, let's interrogate the model about that. And the correct answer is
01:30:05.660 | two. So here, the model claims that he won four times, which is not correct,
01:30:16.740 | right? It doesn't match two. So the model doesn't know; it's making stuff up.
01:30:20.580 | Let's try again. So here the model, again, is kind of making stuff up,
01:30:30.260 | right? Let's try again. Here it says he did not even, did not win during his
01:30:37.780 | career. So obviously the model doesn't know. And the way we can programmatically
01:30:41.620 | tell again is we interrogate the model three times, and we compare its answers
01:30:45.620 | maybe three times, five times, whatever it is, to the correct answer. And if the
01:30:50.300 | model doesn't know, then we know that the model doesn't know this question. And
01:30:53.820 | then what we do is we take this question, we create a new conversation in the
01:30:59.020 | training set. So we're going to add a new conversation to the training set. And when
01:31:03.100 | the question is, how many Stanley Cups did he win? The answer is, I'm sorry, I
01:31:07.580 | don't know, or I don't remember. And that's the correct answer for this
01:31:11.500 | question, because we interrogated the model and we saw that that's the case.
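Here is a rough sketch of that probing loop as a whole. The three callables (a question generator, the model under test, and an LLM judge) are hypothetical stand-ins; this illustrates the recipe described above, not Meta's actual code.

```python
def probe_knowledge_boundary(paragraph, generate_qa_pairs, ask_model, judge_same_answer,
                             attempts=3):
    """Return new training conversations that teach the model to say "I don't know"
    for facts it cannot reliably reproduce. The three callables are hypothetical
    stand-ins: an LLM that writes factual Q&A pairs from the paragraph, the model
    being probed, and an LLM judge that compares answers."""
    new_examples = []
    for question, reference_answer in generate_qa_pairs(paragraph):
        answers = [ask_model(question) for _ in range(attempts)]
        knows = all(judge_same_answer(a, reference_answer) for a in answers)
        if not knows:
            new_examples.append([
                {"role": "user", "content": question},
                {"role": "assistant", "content": "I'm sorry, I don't know."},
            ])
    return new_examples
```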
01:31:14.460 | If you do this for many different types of questions, for many different types of
01:31:20.140 | documents, you are giving the model an opportunity, in its training set,
01:31:24.860 | to refuse to answer based on its knowledge. And if you just have a few examples of
01:31:28.860 | that, in your training set, the model will know and has the opportunity to
01:31:34.140 | learn the association of this knowledge-based refusal to this internal
01:31:39.420 | neuron somewhere in its network that we presume exists. And empirically, this
01:31:43.660 | turns out to be probably the case. And it can learn that association that, hey,
01:31:47.660 | when this neuron of uncertainty is high, then I actually don't know. And I'm
01:31:52.860 | allowed to say that, I'm sorry, but I don't think I remember this, etc. And if
01:31:57.420 | you have these examples in your training set, then this is a large mitigation for
01:32:02.460 | hallucination. And that's, roughly speaking, why ChatGPT is able to do
01:32:06.620 | stuff like this as well. So these are the kinds of mitigations that people have
01:32:10.940 | implemented and that have improved the factuality issue over time. Okay, so I've
01:32:15.740 | described mitigation number one for basically mitigating the hallucinations
01:32:20.060 | issue. Now, we can actually do much better than that. Instead of just
01:32:25.820 | saying that we don't know, we can introduce an additional mitigation
01:32:29.420 | number two to give the LLM an opportunity to be factual and actually
01:32:33.500 | answer the question. Now, what do you and I do if I was to ask you a factual
01:32:38.540 | question and you don't know? What would you do in order to answer the question?
01:32:43.260 | Well, you could go off and do some search and use the internet, and you
01:32:47.740 | could figure out the answer and then tell me what that answer is. And we can
01:32:52.380 | do the exact same thing with these models. So think of the knowledge inside
01:32:56.780 | the neural network, inside its billions of parameters. Think of that as kind of
01:33:00.860 | a vague recollection of the things that the model has seen during its training,
01:33:05.820 | during the pre-training stage, a long time ago. So think of that knowledge in
01:33:09.660 | the parameters as something you read a month ago. And if you keep reading
01:33:14.220 | something, then you will remember it and the model remembers that. But if it's
01:33:17.580 | something rare, then you probably don't have a really good recollection of that
01:33:20.540 | information. But what you and I do is we just go and look it up. Now, when you
01:33:24.620 | go and look it up, what you're doing basically is like you're refreshing
01:33:27.180 | your working memory with information, and then you're able to sort of like
01:33:30.700 | retrieve it, talk about it, or etc. So we need some equivalent of allowing the
01:33:35.020 | model to refresh its memory or its recollection. And we can do that by
01:33:39.740 | introducing tools for the models. So the way we are going to approach this is
01:33:45.020 | that instead of just saying, "Hey, I'm sorry, I don't know," we can attempt to
01:33:48.860 | use tools. So we can create a mechanism by which the language model can emit
01:33:55.900 | special tokens. And these are tokens that we're going to introduce, new
01:33:59.020 | tokens. So for example, here I've introduced two tokens, and I've
01:34:03.500 | introduced a format or a protocol for how the model is allowed to use these
01:34:07.820 | tokens. So for example, when the model
01:34:11.820 | does not know the answer, instead of just saying, "I'm sorry, I don't know," the model now has the
01:34:15.740 | option of emitting the special token search start. And this is the query
01:34:20.060 | that will go to like bing.com in the case of OpenAI or say Google search or
01:34:23.740 | something like that. So we'll emit the query, and then it will emit search
01:34:28.140 | end. And then here, what will happen is that the program that is sampling
01:34:33.580 | from the model that is running the inference, when it sees the special
01:34:37.180 | token search end, instead of sampling the next token in the sequence, it
01:34:42.620 | will actually pause generating from the model, it will go off, it will open a
01:34:47.020 | session with bing.com, and it will paste the search query into bing. And it
01:34:52.140 | will then get all the text that is retrieved. And it will basically take
01:34:56.940 | that text, it will maybe represent it again with some other special tokens or
01:35:00.300 | something like that. And it will take that text and it will copy paste it
01:35:03.740 | here, between the brackets I tried to show. So all that text kind of
01:35:08.860 | comes here. And when the text comes here, it enters the context window. So
01:35:14.460 | the model, so that text from the web search is now inside the context window
01:35:19.260 | that will feed into the neural network. And you should think of the context
01:35:22.460 | window as kind of like the working memory of the model. That data that is
01:35:26.380 | in the context window is directly accessible by the model, it directly
01:35:29.900 | feeds into the neural network. So it's no longer a vague recollection; the
01:35:34.220 | data that it has in the context window is directly available to the
01:35:38.300 | model. So now when it's sampling new tokens here afterwards, it can
01:35:43.500 | reference very easily the data that has been copy pasted in there. So that's
01:35:48.620 | roughly how this tool use functions.
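To make the mechanics concrete, here is a minimal sketch of the inference loop being described. Everything in it is illustrative: the <SEARCH_START>/<SEARCH_END> token names, the run_web_search helper, and the model.sample_next method are assumptions, not any provider's actual interface.

```python
# Minimal sketch of an inference loop that handles hypothetical search tokens.
def run_web_search(query: str) -> str:
    """Placeholder: send the query to a search engine and return retrieved page text."""
    raise NotImplementedError

def generate_with_tools(model, tokens: list[str], max_new_tokens: int = 512) -> list[str]:
    for _ in range(max_new_tokens):
        next_token = model.sample_next(tokens)  # assumed sampling API
        tokens.append(next_token)
        if next_token == "<SEARCH_END>":
            # Recover the query the model emitted between the two special tokens.
            start = len(tokens) - 1 - tokens[::-1].index("<SEARCH_START>")
            query = "".join(tokens[start + 1:-1])
            # Pause sampling, run the search, and paste the retrieved text into
            # the context window, where it becomes working memory for the model.
            tokens += ["<SEARCH_RESULT>", run_web_search(query), "<SEARCH_RESULT_END>"]
        elif next_token == "<EOS>":
            break
    return tokens
```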
01:35:54.940 | And web search is just one of the tools; we're going to look at some of the other tools in a
01:35:57.180 | bit. But basically, you introduce new tokens, you introduce some schema by
01:36:01.340 | which the model can utilize these tokens and can call these special
01:36:05.100 | functions like web search functions. And how do you teach the model how to
01:36:09.020 | correctly use these tools, like say web search, search start, search end, etc.
01:36:13.100 | Well, again, you do that through training sets. So we need now to have a
01:36:16.540 | bunch of data, and a bunch of conversations that show the model by
01:36:20.940 | example, how to use web search. So what are the settings where
01:36:25.740 | you're using the search, and what does that look like? And here, by
01:36:29.420 | example, is how you start and end a search.
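For concreteness, one such training conversation might look roughly like this. The formatting and special-token names are made up for illustration, and the retrieved text in the middle is inserted by the inference program at runtime rather than written by a labeler:

```
Human: Who won the Nobel Prize in Physics in 2019?
Assistant: <SEARCH_START>Nobel Prize in Physics 2019<SEARCH_END>
[retrieved web page text is pasted in here]
Assistant: James Peebles shared the 2019 Nobel Prize in Physics with
Michel Mayor and Didier Queloz, according to the retrieved sources.
```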
01:36:34.700 | And if you have maybe a few thousand examples of that in your training set, the model will actually
01:36:38.140 | do a pretty good job of understanding how this tool works. And it will know
01:36:42.140 | how to sort of structure its queries. And of course, because of the
01:36:45.260 | pre-training data set, and its understanding of the world, it actually
01:36:48.780 | kind of understands what a web search is. And so it actually kind of has a
01:36:51.820 | pretty good native understanding of what kind of stuff is a good search
01:36:56.300 | query. And so it all kind of just like works, you just need a little bit of a
01:37:00.620 | few examples to show it how to use this new tool. And then it can lean on it to
01:37:04.940 | retrieve information, and put it in the context window. And that's
01:37:08.540 | equivalent to you and I looking something up. Because once it's in the
01:37:12.140 | context, it's in the working memory, and it's very easy to manipulate and
01:37:14.780 | access. So that's what we saw a few minutes ago, when I was searching on
01:37:19.580 | ChatGPT for who is Orson Kovats. The ChatGPT language model decided that
01:37:23.900 | this is some kind of a rare individual or something like that. And instead
01:37:29.020 | of giving me an answer from its memory, it decided that it will sample a
01:37:32.220 | special token that is going to do a web search. And we saw briefly something
01:37:35.900 | flash that was like "using the web tool" or something like that. So it briefly said
01:37:39.740 | that, and then we waited for like two seconds, and then it generated this.
01:37:42.940 | And you see how it's creating references here. And so it's citing
01:37:46.780 | sources. So what happened here is, it went off, it did a web search, it
01:37:52.460 | found these sources and these URLs. And the text of these web pages was all
01:37:58.620 | stuffed in between here. And it's not shown here, but it's basically
01:38:02.860 | stuffed as text in between here. And now it sees that text. And now it
01:38:08.460 | kind of references it and says that, okay, it could be these people
01:38:12.300 | citation, it could be those people citation, etc. So that's what happened
01:38:15.740 | here. And that's why when I said who is Orson Kovats, I
01:38:19.260 | could also say, don't use any tools. And then that's enough to basically
01:38:24.460 | convince ChatGPT to not use tools and just use its memory and its
01:38:34.780 | recollection. I also went off and I tried to ask this question of ChatGPT.
01:38:39.100 | So how many Stanley Cups did Dominik Hasek win? And ChatGPT actually
01:38:39.100 | decided that it knows the answer. And it has the confidence to say that he
01:38:42.540 | won twice. And so it kind of just relied on its memory because presumably it
01:38:46.620 | has enough of a kind of confidence in its weights and its
01:38:53.420 | parameters and activations that this is retrievable just from memory. But
01:38:59.020 | you can also conversely use web search to make sure. And then for the same
01:39:05.020 | query, it actually goes off and it searches and then it finds a bunch of
01:39:08.380 | sources. It finds all this. All of this stuff gets copy pasted in there and
01:39:12.780 | then it tells us the answer again and cites sources. And it actually cites the Wikipedia
01:39:18.060 | article, which is the source of this information for us as well. So that's
01:39:23.260 | tools, web search. The model determines when to search. And then that's kind
01:39:27.660 | of like how these tools work. And this is an additional kind of mitigation
01:39:33.020 | for hallucinations and factuality. So I want to stress one more time this
01:39:37.340 | very important sort of psychology point. Knowledge in the parameters of the
01:39:43.020 | neural network is a vague recollection. The knowledge in the tokens that make
01:39:47.180 | up the context window is the working memory. And it roughly speaking works
01:39:52.140 | kind of like it works for us in our brain. The stuff we remember is our
01:39:56.860 | parameters and the stuff that we just experienced like a few seconds or
01:40:01.820 | minutes ago and so on. You can imagine that being in our context window. And
01:40:04.780 | this context window is being built up as you have a conscious experience
01:40:07.820 | around you. So this has a bunch of implications also for your use of LLMs
01:40:13.100 | in practice. So for example, I can go to ChatGPT and I can do something
01:40:17.260 | like this. I can say, can you summarize chapter one of Jane Austen's Pride and
01:40:20.380 | Prejudice, right? And this is a perfectly fine prompt. And ChatGPT
01:40:24.940 | actually does something relatively reasonable here. And the reason it does
01:40:27.980 | that is because ChatGPT has a pretty good recollection of a famous work
01:40:31.740 | like Pride and Prejudice. It's probably seen a ton of stuff about it. There's
01:40:35.180 | probably forums about this book. It's probably read versions of this book.
01:40:38.540 | And it kind of remembers, because even if you've only read this book or articles
01:40:45.820 | about it, you'd have enough of a recollection to actually say all
01:40:48.620 | this. But usually when I actually interact with LLMs and I want them to
01:40:52.060 | recall specific things, it always works better if you just give it to them.
01:40:55.660 | So I think a much better prompt would be something like this. Can you summarize
01:40:59.740 | for me chapter one of Jane Austen's Pride and Prejudice? And then I am
01:41:03.020 | attaching it below for your reference. And then I do something like a
01:41:05.500 | delimiter here and I paste it in.
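The resulting prompt has roughly this shape (the delimiter is arbitrary; anything that clearly separates the instruction from the pasted text will do):

```
Can you summarize for me chapter one of Jane Austen's Pride and Prejudice?
I am attaching it below for your reference.
---
[full text of chapter one pasted here]
```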
01:41:10.300 | I found the text of chapter one on some website and just copy pasted it in here. And I do
01:41:16.060 | that because when it's in the context window, the model has direct access to
01:41:19.740 | it and can exactly, it doesn't have to recall it. It just has direct access to
01:41:23.740 | it. And so this summary can be expected to be of significantly
01:41:27.900 | higher quality than the previous summary, just because the text is directly
01:41:31.980 | available to the model. And I think you and I would work in the same way.
01:41:35.340 | You would produce a much better summary if you
01:41:38.620 | had re-read this chapter before you had to summarize it. And that's basically
01:41:43.820 | what's happening here or the equivalent of it. The next sort of psychological
01:41:47.580 | quirk I'd like to talk about briefly is that of the knowledge of self. So what
01:41:51.740 | I see very often on the internet is that people do something like this. They
01:41:55.260 | ask LLMs something like, what model are you and who built you? And basically
01:42:00.300 | this question is a little bit nonsensical. And the reason I say that is
01:42:03.980 | that as I tried to kind of explain with some of the under the hood
01:42:06.940 | fundamentals, this thing is not a person, right? It doesn't have a
01:42:10.380 | persistent existence in any way. It sort of boots up, processes tokens and
01:42:15.900 | shuts off. And it does that for every single person. It just kind of builds
01:42:18.860 | up a context window of conversation and then everything gets deleted. And so
01:42:22.540 | this entity is kind of like restarted from scratch every single conversation,
01:42:26.060 | if that makes sense. It has no persistent self, has no sense of self. It's a
01:42:29.660 | token tumbler and it follows the statistical regularities of its training
01:42:34.540 | set. So it doesn't really make sense to ask it, who are you, what built you,
01:42:39.020 | et cetera. And if you do what I described, then by default, out of
01:42:44.060 | nowhere, you're going to get some pretty random answers. So for example,
01:42:46.700 | let's pick on Falcon, which is a fairly old model, and let's see what it tells
01:42:51.100 | us. So it's evading the question, talented engineers and developers. Here
01:42:57.900 | it says I was built by OpenAI based on the GPT-3 model. It's totally making
01:43:01.580 | stuff up. Now, the fact that it says it was built by OpenAI here, I think a lot of
01:43:05.660 | people would take this as evidence that this model was somehow trained on OpenAI
01:43:08.780 | data or something like that. I don't actually think that that's necessarily
01:43:11.820 | true. The reason for that is that if you don't explicitly program the model to
01:43:18.300 | answer these kinds of questions, then what you're going to get is its
01:43:21.820 | statistical best guess at the answer. And this model had an SFT data mixture of
01:43:29.020 | conversations. And during the fine tuning, the model sort of understands, as
01:43:35.820 | it's training on this data, that it's taking on the personality of a
01:43:39.500 | helpful assistant. But it wasn't
01:43:43.500 | told exactly what label to apply to itself. It just kind of takes on
01:43:48.460 | the persona of a helpful assistant. And remember that the pre-training stage
01:43:54.300 | took the documents from the entire internet. And ChatGPT and OpenAI are
01:43:58.380 | very prominent in these documents. And so I think what's actually likely to be
01:44:02.460 | happening here is that this is just its hallucinated label for what it is. Its
01:44:07.180 | self-identity is that it's ChatGPT by OpenAI. And it's only saying that
01:44:11.820 | because there's a ton of data on the internet of answers like this, that are
01:44:17.420 | actually coming from OpenAI's ChatGPT. And so that's its label for what it
01:44:21.980 | is. Now, you can override this as a developer, if you have an LLM model, you
01:44:26.780 | can actually override it. And there are a few ways to do that. So for example,
01:44:30.220 | let me show you, there's this Olmo model from Allen AI. And this is one LLM.
01:44:36.620 | It's not a top tier LLM or anything like that. But I like it because it's fully
01:44:39.900 | open source. So the paper for Olmo and everything else is completely fully open
01:44:43.580 | source, which is nice. So here we are looking at its SFT mixture. So this is
01:44:48.300 | the data mixture of the fine tuning. So this is the conversations data set,
01:44:52.940 | right. And so the way that they are solving it for the Olmo model, is we see
01:44:57.340 | that there's a bunch of stuff in the mixture. And there's a total of 1
01:44:59.580 | million conversations here. But here we have Olmo two hard coded. If we go
01:45:04.940 | there, we see that this is 240 conversations. And look at these 240
01:45:10.220 | conversations, they're hard coded, tell me about yourself, says user. And then
01:45:15.260 | the assistant says, I'm Olmo, an open language model developed by AI2, the Allen
01:45:18.940 | Institute for Artificial Intelligence, etc. I'm here to help, blah, blah, blah.
01:45:22.540 | What is your name? The Olmo project. So these are all kinds of like cooked up
01:45:26.700 | hard coded questions about Olmo two, and the correct answers to give in these
01:45:30.940 | cases. If you take 240 questions like this, or conversations, put them into
01:45:35.740 | your training set and fine tune with it, then the model will actually be
01:45:38.780 | expected to parrot this stuff later. If you don't give it this, then it's
01:45:44.380 | probably going to say it's ChatGPT by OpenAI. And there's one more way to sometimes do this:
01:45:50.380 | basically, in these conversations, where you have turns between human and
01:45:56.220 | assistant, sometimes there's a special message called system message, at the
01:46:00.220 | very beginning of the conversation. So it's not just between human and
01:46:03.340 | assistant, there's a system. And in the system message, you can actually
01:46:07.100 | hard code and remind the model that, hey, you are a model developed by
01:46:11.500 | OpenAI. And your name is ChatGPT 4.0. And you were trained on this date, and
01:46:16.940 | your knowledge cutoff is this. And basically, it kind of like documents the
01:46:20.140 | model a little bit. And then this is inserted into your conversations. So
01:46:23.820 | when you go on ChatGPT, you see a blank page, but actually the system
01:46:26.700 | message is kind of like hidden in there. And those tokens are in the context
01:46:29.900 | window. And so those are the two ways to kind of program the models to talk
01:46:35.420 | about themselves: either it's done through data like this, or it's done through a
01:46:40.380 | system message and things like that, basically invisible tokens that are in
01:46:43.900 | the context window and remind the model of its identity. But it's all just
01:46:47.660 | kind of cooked up and bolted on in some way; it's not actually
01:46:51.500 | deeply there in any real sense, as it would be for a human.
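As a rough sketch, the two approaches might be represented like this somewhere in the training or serving stack. The schema, field names, and wording are illustrative only, not any lab's actual format:

```python
# Approach 1: a handful of hardcoded identity conversations mixed into the SFT data.
identity_example = {
    "messages": [
        {"role": "user", "content": "Tell me about yourself."},
        {"role": "assistant", "content": "I'm OLMo, an open language model "
                                         "developed by the Allen Institute for AI (AI2)."},
    ]
}

# Approach 2: a hidden system message prepended to every conversation at inference time.
system_message = {
    "role": "system",
    "content": "You are ChatGPT, a large language model trained by OpenAI. "
               "Knowledge cutoff: <date>. Current date: <date>.",
}
```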
01:46:56.540 | I want to now continue to the next section, which deals with the
01:46:59.340 | computational capabilities, or like I should say, the native computational
01:47:02.460 | capabilities of these models in problem solving scenarios. And so in
01:47:06.220 | particular, we have to be very careful with these models when we construct
01:47:09.340 | our examples of conversations. And there's a lot of sharp edges here, and
01:47:12.780 | that are kind of like elucidative, is that a word? They're kind of like
01:47:16.300 | interesting to look at when we consider how these models think. So consider
01:47:22.060 | the following prompt from a human. And suppose that basically that we are
01:47:25.580 | building out a conversation to enter into our training set of conversations.
01:47:28.700 | So we're going to train the model on this, we're teaching you how to
01:47:31.340 | basically solve simple math problems. So the prompt is: Emily buys three
01:47:35.580 | apples and two oranges, each orange costs $2, the total cost is $13. What is
01:47:39.740 | the cost of the apples? Very simple math question. Now, there are two answers
01:47:44.140 | here on the left and on the right. They are both correct answers, they both
01:47:48.220 | say that the answer is three, which is correct. But one of these two is a
01:47:52.140 | significantly better answer for the assistant than the other. Like if I was a
01:47:56.300 | data labeler and I was creating one of these, one of them would be a really
01:48:01.340 | terrible answer for the assistant, and the other would be okay. And so I'd
01:48:05.500 | like you to potentially pause the video even, and think through why one of
01:48:09.180 | these two is a significantly better answer than the other. And if you use the
01:48:14.620 | wrong one, your model will actually be really bad at math potentially, and it
01:48:19.260 | would have bad outcomes. And this is something that you would be careful
01:48:22.140 | with in your labeling documentation when you are training people to create
01:48:25.580 | the ideal responses for the assistant.
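The two on-screen answers are not transcribed, but based on the discussion that follows, the contrast is roughly this: the worse label leads with the answer and only then explains it ("The answer is $3, because 2 * $2 = $4, $13 - $4 = $9, and $9 / 3 = $3"), while the better label works through the intermediate steps first and states the answer last ("The two oranges cost 2 * $2 = $4, so the three apples together cost $13 - $4 = $9, which is $9 / 3 = $3 per apple; the answer is $3").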
01:48:29.740 | Okay, so the key to this question is to realize and remember that when the models are training and also
01:48:34.140 | inferencing, they are working in one dimensional sequence of tokens from left
01:48:38.140 | to right. And this is the picture that I often have in my mind. I imagine
01:48:42.140 | basically the token sequence evolving from left to right. And to always
01:48:45.660 | produce the next token in a sequence, we are feeding all these tokens into the
01:48:50.380 | neural network. And this neural network then gives us the probabilities for the
01:48:53.580 | next token in sequence, right? So this picture here is the exact same picture
01:48:57.260 | we saw before up here. And this comes from the web demo that I showed you
01:49:02.300 | before, right? So this is the calculation that basically takes the input tokens
01:49:06.620 | here on the top, and performs these operations of all these neurons, and
01:49:12.860 | gives you the answer for the probabilities of what comes next. Now, the
01:49:15.820 | important thing to realize is that, roughly speaking, there's basically a
01:49:20.780 | finite number of layers of computation that happen here. So for example, this
01:49:24.700 | model here has only one, two, three layers of what's called attention and
01:49:30.300 | MLP here. Maybe a typical modern state-of-the-art network would have more
01:49:35.980 | like, say, 100 layers or something like that. But there's only 100 layers of
01:49:39.020 | computation or something like that to go from the previous token sequence to
01:49:42.540 | the probabilities for the next token. And so there's a finite amount of
01:49:46.060 | computation that happens here for every single token. And you should think of
01:49:49.740 | this as a very small amount of computation. And this amount of
01:49:52.940 | computation is almost roughly fixed for every single token in this sequence.
01:49:57.500 | That's not actually fully true, because the more tokens you feed in, the more
01:50:03.180 | expensive this forward pass will be of this neural network, but not by much. So
01:50:08.940 | you should think of this, and I think is a good model to have in mind, this is a
01:50:12.140 | fixed amount of compute that's going to happen in this box for every single one
01:50:15.340 | of these tokens. And this amount of compute cannot possibly be too big,
01:50:18.620 | because there's not that many layers that are sort of going from the top to
01:50:22.060 | bottom here. There's not that much computationally that will happen here.
01:50:25.820 | And so you can't imagine a model to basically do arbitrary computation in a
01:50:29.500 | single forward pass to get a single token. And so what that means is that we
01:50:33.900 | actually have to distribute our reasoning and our computation across
01:50:37.660 | many tokens, because every single token is only spending a finite amount of
01:50:41.740 | computation on it. And so we kind of want to distribute the computation
01:50:47.180 | across many tokens. And we can't have too much computation or expect too much
01:50:51.900 | computation out of the model in any single individual token, because there's
01:50:55.820 | only so much computation that happens per token. Okay, roughly fixed amount of
01:51:00.540 | computation here. So that's why this answer here is significantly worse. And
01:51:07.180 | the reason for that is, imagine going from left to right here. And I copy
01:51:11.180 | pasted it right here. The answer is three, etc. Imagine the model having to
01:51:16.620 | go from left to right, emitting these tokens one at a time, it has to say, or
01:51:20.620 | we're expecting to say, the answer is space dollar sign. And then right here,
01:51:27.980 | we're expecting it to basically cram all the computation of this problem into
01:51:31.500 | this single token, it has to emit the correct answer three. And then once
01:51:36.060 | we've emitted the answer three, we're expecting it to say all these tokens.
01:51:40.060 | But at this point, we've already produced the answer. And it's already in
01:51:43.420 | the context window for all these tokens that follow. So anything here is just
01:51:47.340 | kind of post hoc justification of why this is the answer. Because the answer
01:51:52.940 | is already created, it's already in the token window. So it's, it's not
01:51:56.700 | actually being calculated here. And so if you are answering the question
01:52:01.260 | directly, and immediately, you are training the model to try to basically
01:52:06.380 | guess the answer in a single token. And that is just not going to work
01:52:10.060 | because of the finite amount of computation that happens per token.
01:52:12.700 | That's why this answer on the right is significantly better, because we are
01:52:17.100 | distributing this computation across the answer, we're actually getting the
01:52:20.460 | model to sort of slowly come to the answer. From the left to right, we're
01:52:24.300 | getting intermediate results, we're saying, okay, the total cost of oranges
01:52:27.580 | is four. So 13 minus four is nine. And so we're creating intermediate
01:52:32.540 | calculations. And each one of these calculations is by itself not that
01:52:36.060 | expensive. And so we're actually basically kind of guessing a little bit at
01:52:39.420 | the difficulty that the model is capable of in any single one of these
01:52:43.740 | individual tokens. And there can never be too much work in any one of these
01:52:48.380 | tokens computationally, because then the model won't be able to do that later
01:52:52.380 | at test time. And so we're teaching the model here to spread out its reasoning
01:52:57.260 | and to spread out its computation over the tokens. And in this way, it only has
01:53:02.140 | very simple problems in each token, and they can add up. And then by the time
01:53:07.260 | it's near the end, it has all the previous results in its working memory.
01:53:11.340 | And it's much easier for it to determine the answer, and here it is:
01:53:14.540 | three. So this is a significantly better label for our computation. This would
01:53:19.580 | be really bad. And this teaching the model to try to do all the computation
01:53:23.500 | in a single token is really bad. So that's kind of like an interesting thing
01:53:29.020 | to keep in mind in your prompts. You usually don't have to think about it
01:53:33.740 | explicitly because the people at OpenAI have labelers and so on that actually
01:53:39.420 | worry about this and make sure that the answers are spread out. And so
01:53:42.940 | actually OpenAI will kind of do the right thing. So when I asked this
01:53:46.140 | question of ChatGPT, it's actually going to go very slowly, it's going to
01:53:49.580 | be like, okay, let's define our variables, set up the equation. And it's
01:53:53.100 | kind of creating all these intermediate results. These are not for you. These
01:53:56.460 | are for the model. If the model is not creating these intermediate results for
01:54:00.300 | itself, it's not going to be able to reach three. I also wanted to show you
01:54:04.540 | that it's possible to be a bit mean to the model, we can just ask for things. So
01:54:08.540 | as an example, I said, I gave it the exact same prompt. And I said, answer
01:54:13.420 | the question in a single token, just immediately give me the answer, nothing
01:54:16.540 | else. And it turns out that for this simple prompt here, it actually was able
01:54:21.740 | to do it in a single go. So it just created a single token. Well, I think this is actually two
01:54:25.180 | tokens, right? Because the dollar sign is its own token. So basically, this
01:54:30.140 | model didn't give me a single token, it gave me two tokens, but it still
01:54:33.420 | produced the correct answer. And it did that in a single forward pass of the
01:54:36.860 | network. Now, that's because the numbers here I think are very simple. And so I
01:54:41.580 | made it a bit more difficult to be a bit mean to the model. So I said Emily
01:54:45.100 | buys 23 apples and 177 oranges. And then I just made the numbers a bit bigger.
01:54:49.900 | And I'm just making it harder for the model, I'm asking it to do more
01:54:52.380 | computation in a single token. And so I said the same thing. And here it gave
01:54:56.700 | me five, and five is actually not correct. So the model failed to do all
01:55:00.860 | this calculation in a single forward pass of the network, it failed to go
01:55:04.860 | from the input tokens. And then in a single forward pass of the network,
01:55:09.660 | single go through the network, it couldn't produce the result. And then I
01:55:13.420 | said, okay, now don't worry about the token limit, and just solve the
01:55:17.660 | problem as usual. And then it goes through all the intermediate results, it
01:55:20.940 | simplifies. And every one of these intermediate results and
01:55:24.700 | intermediate calculations is much easier for the model. And it's
01:55:29.900 | not too much work per token, all of the tokens here are correct. And it
01:55:33.740 | arrives at the solution, which is seven. And it just couldn't squeeze all this
01:55:37.260 | work. It couldn't squeeze that into a single forward pass of the network. So
01:55:41.180 | I think that's kind of just a cute example. And something to kind of like
01:55:44.300 | think about. And I think it's kind of, again, just elucidative in terms of how
01:55:47.820 | these models work. The last thing that I would say on this topic is that if I
01:55:51.580 | was in practice trying to actually solve this in my day to day life, I might
01:55:54.780 | actually not trust that the model did all the intermediate calculations
01:55:58.780 | correctly here. So actually, probably what I'd do is something like this: I
01:56:01.580 | would come here and I would say, use code. And that's because code is one of
01:56:07.900 | the possible tools that ChatGPT can use. And instead of it having to do
01:56:12.620 | mental arithmetic, like this mental arithmetic here, I don't fully trust it.
01:56:16.460 | And especially if the numbers get really big. There's no guarantee that the
01:56:19.420 | model will do this correctly. Any one of these intermediate steps might, in
01:56:23.180 | principle, fail. We're using neural networks to do mental arithmetic, kind
01:56:27.020 | of like you doing mental arithmetic in your brain. It might just like screw up
01:56:30.860 | some of the intermediate results. It's actually kind of amazing that it can
01:56:33.500 | even do this kind of mental arithmetic. I don't think I could do this in my
01:56:35.820 | head. But basically, the model is kind of like doing it in its head. And I
01:56:39.100 | don't trust that. So I wanted to use tools. So you can say stuff like, use
01:56:42.300 | code. And I'm not sure what happened there. Use code. And so like I
01:56:52.540 | mentioned, there's a special tool and the model can write code. And I can
01:56:57.580 | inspect that this code is correct. And then it's not relying on its mental
01:57:02.380 | arithmetic. It is using the Python interpreter, which is a very simple
01:57:05.500 | programming language, to basically write out the code that calculates the
01:57:09.100 | result. And I would personally trust this a lot more because this came out
01:57:12.220 | of a Python program, which I think has a lot more correctness guarantees
01:57:16.060 | than the mental arithmetic of a language model.
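The program ChatGPT writes isn't reproduced in the transcript, but for the original version of the problem it boils down to something like this (variable names are illustrative):

```python
# Let Python do the arithmetic explicitly instead of relying on the model's
# "mental" arithmetic inside a single forward pass.
total_cost = 13
num_apples, num_oranges = 3, 2
orange_price = 2

apple_price = (total_cost - num_oranges * orange_price) / num_apples
print(apple_price)  # -> 3.0
```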
01:57:21.580 | So it's just another potential hint that if you have these kinds of problems, you may want to
01:57:24.860 | basically just ask the model to use the code interpreter. And just like we
01:57:28.860 | saw with the web search, the model has special kinds of tokens for calling
01:57:35.500 | tools like this. The result tokens are not actually generated by the language model. It
01:57:38.540 | will write the program. And then it actually sends that program to a
01:57:42.220 | different sort of part of the computer that actually just runs that program
01:57:45.900 | and brings back the result. And then the model gets access to that result
01:57:49.420 | and can tell you that, okay, the cost of each apple is seven. So that's
01:57:53.660 | another kind of tool. And I would use this in practice yourself. And
01:57:57.820 | yeah, it's just less error prone, I would say. So that's why I
01:58:03.020 | called this section, Models Need Tokens to Think. Distribute your
01:58:07.020 | computation across many tokens. Ask models to create intermediate
01:58:10.620 | results. Or whenever you can, lean on tools and tool use instead of
01:58:15.420 | allowing the models to do all of this stuff in their memory. So if they
01:58:18.140 | try to do it all in their memory, don't fully trust it and prefer to use
01:58:21.740 | tools whenever possible. I want to show you one more example of where
01:58:25.180 | this actually comes up, and that's in counting. So models actually are
01:58:29.020 | not very good at counting for the exact same reason. You're asking for
01:58:32.380 | way too much in a single individual token. So let me show you a simple
01:58:36.300 | example of that. How many dots are below? And then I just put in a bunch
01:58:40.700 | of dots. And ChatGPT says there are, and then it just tries to solve
01:58:45.500 | the problem in a single token. So in a single token, it has to count the
01:58:50.140 | number of dots in its context window. And it has to do that in a single
01:58:55.020 | forward pass of a network. In a single forward pass of a network, as we
01:58:58.300 | talked about, there's not that much computation that can happen there.
01:59:00.860 | Just think of that as being like very little computation that happens
01:59:03.820 | there. So if I just look at what the model sees, let's go to the LLM tokenizer.
01:59:10.060 | It sees this. How many dots are below? And then it turns out that these
01:59:16.460 | dots here, this group of I think 20 dots, is a single token. And then
01:59:21.260 | this group of whatever it is, is another token. And then for some
01:59:24.860 | reason, they break up like this. I'm not sure exactly why; this has to do with
01:59:28.940 | the details of the tokenizer, but it turns out that the model
01:59:33.020 | basically sees these token IDs: this, this, this, and so on. And then from
01:59:38.460 | these token IDs, it's expected to count the number. And spoiler alert,
01:59:43.340 | it's not 161. It's actually, I believe, 177. So here's what we can do
01:59:47.340 | instead. We can say use code. And you might expect that, like, why
01:59:52.140 | should this work? And it's actually kind of subtle and kind of
01:59:55.100 | interesting. So when I say use code, I actually expect this to work.
01:59:58.220 | Let's see. Okay. 177 is correct. So what happens here is I've actually,
02:00:03.900 | it doesn't look like it, but I've broken down the problem into
02:00:06.620 | problems that are easier for the model. I know that the model can't
02:00:11.100 | count. It can't do mental counting. But I know that the model is
02:00:14.620 | actually pretty good at doing copy-pasting. So what I'm doing here
02:00:17.980 | is when I say use code, it creates a string in Python for this. And
02:00:22.460 | the task of basically copy-pasting my input here to here is very
02:00:27.980 | simple. Because the model sees this string
02:00:34.140 | as just these four tokens or whatever it is. So it's very simple
02:00:37.100 | for the model to copy-paste those token IDs and kind of unpack them
02:00:43.260 | into dots here. And so it creates this string, and then it calls
02:00:48.460 | Python routine dot count, and then it comes up with the correct
02:00:51.420 | answer. So the Python interpreter is doing the counting. It's not
02:00:54.540 | the model's mental arithmetic doing the counting. So it's, again,
02:00:57.660 | a simple example of models need tokens to think, don't rely on
02:01:02.300 | their mental arithmetic. And that's why also the models are not
02:01:06.300 | very good at counting. If you need them to do counting tasks,
02:01:08.780 | always ask them to lean on the tool.
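The exact code the model wrote isn't shown in the transcript, but it amounts to something like the following: the pasted dots become a Python string and the interpreter does the counting.

```python
# "." * 177 stands in for the user's pasted run of dots; in the real chat the
# model copy-pastes the dots from the prompt into a string literal.
dots = "." * 177
print(dots.count("."))  # -> 177, counted by Python, not by the model's mental arithmetic
```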
02:01:12.860 | Now, the models also have many other little cognitive deficits here and there. And these
02:01:15.420 | are kind of like sharp edges of the technology to be kind of aware
02:01:17.980 | of over time. So as an example, the models are not very good with
02:01:21.820 | all kinds of spelling-related tasks. They're not very good at it.
02:01:25.340 | And I told you that we would loop back around to tokenization.
02:01:28.620 | And the reason for this is that the models don't see
02:01:31.900 | the characters. They see tokens. And their entire world is about
02:01:35.900 | tokens, which are these little text chunks. And so they don't see
02:01:38.860 | characters like our eyes do. And so very simple character-level
02:01:42.380 | tasks often fail. So, for example, I'm giving it a string,
02:01:47.420 | ubiquitous, and I'm asking it to print only every third character
02:01:51.340 | starting with the first one. So we start with u, and then we
02:01:54.460 | should go every third. So 1, 2, 3, q should be next, and then
02:02:00.220 | et cetera. So this I see is not correct. And again, my hypothesis
02:02:04.620 | is that this is, again, the mental arithmetic here is failing,
02:02:07.980 | number one, a little bit. But number two, I think the more
02:02:10.780 | important issue here is that if you go to Tiktokenizer and you
02:02:14.780 | look at ubiquitous, we see that it is three tokens, right? So you
02:02:19.020 | and I see ubiquitous, and we can easily access the individual
02:02:22.700 | letters, because we kind of see them. And when we have it in the
02:02:25.580 | working memory of our visual sort of field, we can really
02:02:28.620 | easily index into every third letter, and I can do that task.
02:02:31.260 | But the models don't have access to the individual letters. They
02:02:34.380 | see this as these three tokens. And remember, these models are
02:02:38.460 | trained from scratch on the internet. And
02:02:41.100 | basically, the model has to discover how many of all these
02:02:44.620 | different letters are packed into all these different tokens.
02:02:47.100 | And the reason we even use tokens is mostly for efficiency.
02:02:50.700 | But I think a lot of people are interested in deleting tokens
02:02:53.100 | entirely. Like, we should really have character level or byte
02:02:55.820 | level models. It's just that that would create very long
02:02:58.540 | sequences, and people don't know how to deal with that right now.
02:03:01.500 | So while we have the token world, any kind of spelling
02:03:04.060 | tasks are not actually expected to work super well.
02:03:05.980 | So because I know that spelling is not a strong suit because of
02:03:09.580 | tokenization, I can, again, ask it to lean on tools. So I can
02:03:13.180 | just say use code. And I would, again, expect this to work,
02:03:16.700 | because the task of copy pasting ubiquitous into the Python
02:03:19.580 | interpreter is much easier. And then we're leaning on Python
02:03:22.700 | interpreter to manipulate the characters of this string.
02:03:26.300 | So when I say use code, ubiquitous, yes, it indexes into
02:03:32.700 | every third character. And the output is u, q, t, s, which looks correct to me.
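What the code tool is doing here reduces to simple string indexing, which Python handles easily even though the token-based model struggles with it:

```python
word = "ubiquitous"
print(word[::3])  # every third character, starting with the first -> "uqts"
```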
02:03:36.940 | So again, this is an example of spelling-
02:03:42.380 | related tasks not working very well. A very famous example of
02:03:45.420 | that recently is how many R's are there in strawberry. And this
02:03:49.100 | went viral many times. And basically, the models now get
02:03:52.300 | it correct. They say there are three R's in strawberry. But for
02:03:55.020 | a very long time, all the state of the art models would insist
02:03:57.500 | that there are only two R's in strawberry. And this caused a
02:04:00.780 | lot of, you know, ruckus, because is that a word? I think
02:04:03.980 | so. Because it's just kind of like, why are the models so
02:04:08.060 | brilliant? And they can solve math Olympiad questions, but
02:04:10.860 | they can't like count R's in strawberry. And the answer for
02:04:14.300 | that, again, is I've kind of built up to it kind of slowly.
02:04:16.860 | But number one, the models don't see characters, they see
02:04:19.820 | tokens. And number two, they are not very good at counting.
02:04:23.500 | And so here we are combining the difficulty of seeing
02:04:26.700 | characters with the difficulty of counting. And that's why the
02:04:29.660 | models struggled with this, even though I think by now,
02:04:32.620 | honestly, I think OpenAI may have hardcoded the answer here,
02:04:35.020 | or I'm not sure what they did. But this specific query now
02:04:39.580 | works. So models are not very good at spelling. And there's a
02:04:45.020 | bunch of other little sharp edges. And I don't want to go
02:04:46.780 | into all of them. I just want to show you a few examples of
02:04:49.260 | things to be aware of. And when you're using these models in
02:04:52.300 | practice, I don't actually want to have a comprehensive
02:04:54.700 | analysis here of all the ways that the models are kind of
02:04:57.740 | like falling short, I just want to make the point that there
02:05:00.140 | are some jagged edges here and there. And we've discussed a
02:05:03.580 | few of them. And a few of them make sense. But some of them
02:05:05.660 | also will just not make as much sense. And they're kind of
02:05:08.380 | like you're left scratching your head, even if you understand
02:05:11.100 | in depth how these models work. And a good example of that
02:05:14.220 | recently is the following. The models are not very good at
02:05:17.340 | very simple questions like this. And this is shocking to a lot
02:05:20.780 | of people, because these models can solve
02:05:23.580 | complex math problems, they can answer PhD-grade physics,
02:05:27.260 | chemistry, biology questions much better than I can, but
02:05:30.220 | sometimes they fall short in like super simple problems like
02:05:32.380 | this. So here we go. 9.11 is bigger than 9.9. And it
02:05:37.820 | justifies this in some way, but obviously, and then at the end,
02:05:41.100 | okay, it actually flips its decision later. So I don't
02:05:46.700 | believe that this is very reproducible. Sometimes it flips
02:05:49.260 | around its answer, sometimes it gets it right, sometimes it
02:05:51.500 | gets it wrong. Let's try again. Okay, even though it might look
02:05:59.340 | larger. Okay, so here it doesn't even correct itself in the end.
02:06:02.700 | If you ask many times, sometimes it gets it right, too. But how
02:06:05.900 | is it that the model can do so great at Olympiad grade
02:06:09.340 | problems, but then fail on very simple problems like this?
02:06:12.300 | And I think this one is, as I mentioned, a little bit of a
02:06:16.300 | head scratcher. It turns out that a bunch of people studied
02:06:18.700 | this in depth, and I haven't actually read the paper. But
02:06:21.820 | what I was told by this team was that when you scrutinize the
02:06:26.940 | activations inside the neural network, when you look at some
02:06:29.660 | of the features and which features turn on or off and which
02:06:32.620 | neurons turn on or off, a bunch of neurons inside the neural
02:06:36.300 | network light up that are usually associated with Bible
02:06:39.180 | verses. And so I think the model is kind of like reminded that
02:06:43.980 | these almost look like Bible verse markers. And in a Bible
02:06:47.900 | verse setting, 9.11 would come after 9.9. And so basically, the
02:06:52.620 | model somehow finds it like cognitively very distracting,
02:06:55.260 | that in Bible verses 9.11 would be greater. Even though here
02:07:00.540 | it's actually trying to justify it and come to the answer
02:07:02.940 | with the math, it still ends up with the wrong answer here. So
02:07:06.860 | it basically just doesn't fully make sense. And it's not fully
02:07:10.060 | understood. And there's a few jagged issues like that. So
02:07:14.620 | that's why you should treat this as what it is, which is a stochastic
02:07:18.620 | system that is really magical, but that you also can't fully
02:07:21.420 | trust. And you want to use it as a tool, not as something that
02:07:24.380 | you just let rip on a problem and then copy paste the
02:07:27.660 | results. Okay, so we have now covered two major stages of
02:07:31.100 | training of large language models. We saw that in the first
02:07:34.780 | stage, this is called the pre training stage, we are
02:07:37.580 | basically training on internet documents. And when you train a
02:07:41.020 | language model on internet documents, you get what's called
02:07:43.500 | a base model. And it's basically an internet document
02:07:46.060 | simulator, right? Now, we saw that this is an interesting
02:07:49.580 | artifact. And this takes many months to train on 1000s of
02:07:53.740 | computers. And it's kind of a lossy compression of the
02:07:56.140 | internet. And it's extremely interesting, but it's not
02:07:58.380 | directly useful. Because we don't want to sample internet
02:08:01.100 | documents, we want to ask questions of an AI and have it
02:08:04.300 | respond to our questions. So for that, we need an assistant.
02:08:08.060 | And we saw that we can actually construct an assistant in the
02:08:11.100 | process of post training. And specifically, in the process of
02:08:16.780 | supervised fine tuning, as we call it. So in this stage, we
02:08:21.900 | saw that it's algorithmically identical to pre training,
02:08:24.540 | nothing is going to change. The only thing that changes is the
02:08:27.100 | data set. So instead of internet documents, we now want to create
02:08:31.100 | and curate a very nice data set of conversations. So we want
02:08:35.660 | millions of conversations on all kinds of diverse topics between
02:08:40.620 | a human and an assistant. And fundamentally, these
02:08:44.140 | conversations are created by humans. So humans write the
02:08:48.140 | prompts, and humans write the ideal responses. And they do
02:08:52.300 | that based on labeling documentations. Now, in the
02:08:56.140 | modern stack, it's not actually done fully and manually by
02:08:59.420 | humans, right? They actually now have a lot of help from these
02:09:02.140 | tools. So we can use language models to help us create these
02:09:06.060 | data sets. And this is done extensively. But fundamentally,
02:09:08.860 | it's all still coming from human curation at the end. So we
02:09:12.300 | create these conversations that now becomes our data set, we
02:09:15.180 | fine tune on it, or continue training on it, and we get an
02:09:18.300 | assistant. And then we kind of shifted gears and started
02:09:21.100 | talking about some of the kind of cognitive implications of
02:09:23.500 | what the system is like. And we saw that, for example, the
02:09:26.540 | assistant will hallucinate, if you don't take some sort of
02:09:30.540 | mitigations towards it. So we saw that hallucinations would
02:09:33.980 | be common. And then we looked at some of the mitigations of
02:09:36.380 | those hallucinations. And then we saw that the models are quite
02:09:39.260 | impressive and can do a lot of stuff in their head. But we saw
02:09:41.820 | that they can also lean on tools to become better. So for
02:09:45.260 | example, we can lean on the web search in order to hallucinate
02:09:49.100 | less, and to maybe bring up some more recent information or
02:09:53.500 | something like that. Or we can lean on tools like Code
02:09:56.060 | Interpreter, so the LLM can write some code and actually
02:09:59.820 | run it and see the results. So these are some of the topics we
02:10:03.820 | looked at so far. Now what I'd like to do is, I'd like to cover
02:10:08.380 | the last and major stage of this pipeline. And that is
02:10:12.540 | reinforcement learning. So reinforcement learning is still
02:10:15.980 | kind of thought to be under the umbrella of post-training. But
02:10:19.500 | it is the third and last major stage, and it's a different way
02:10:23.100 | of training language models, and usually follows as this third
02:10:27.100 | step. So inside companies like OpenAI, you will start here, and
02:10:31.020 | these are all separate teams. So there's a team doing data for
02:10:34.300 | pre-training, and a team doing training for pre-training. And
02:10:37.420 | then there's a team doing all the conversation generation, and
02:10:41.900 | a different team that is kind of doing the supervised fine
02:10:44.540 | tuning. And there will be a team for the reinforcement learning
02:10:46.940 | as well. So it's kind of like a handoff of these models. You
02:10:49.820 | get your base model, then you fine tune it to be an assistant,
02:10:52.860 | and then you go into reinforcement learning, which
02:10:54.540 | we'll talk about now. So that's kind of like the major flow.
02:10:59.500 | And so let's now focus on reinforcement learning, the last
02:11:02.700 | major stage of training. And let me first actually motivate it
02:11:06.380 | and why we would want to do reinforcement learning and what
02:11:09.100 | it looks like on a high level. So now I'd like to try to
02:11:11.980 | motivate the reinforcement learning stage and what it
02:11:13.740 | corresponds to. It's something that you're probably familiar
02:11:15.900 | with, and that is basically going to school. So just like
02:11:19.180 | you went to school to become really good at something, we
02:11:22.380 | want to take large language models through school. And
02:11:25.580 | really what we're doing is we have a few paradigms of ways of
02:11:31.980 | giving them knowledge or transferring skills. So in
02:11:35.260 | particular, when we're working with textbooks in school, you'll
02:11:38.140 | see that there are three major pieces of information in these
02:11:42.460 | textbooks, three classes of information. The first thing
02:11:45.900 | you'll see is you'll see a lot of exposition. And by the way,
02:11:49.020 | this is a totally random book I pulled from the internet. I
02:11:51.100 | think it's some kind of organic chemistry or something. I'm not
02:11:53.660 | sure. But the important thing is that you'll see that most of
02:11:56.780 | the text, most of it is kind of just like the meat of it, is
02:11:59.660 | exposition. It's kind of like background knowledge, etc. As
02:12:03.500 | you are reading through the words of this exposition, you
02:12:07.260 | can think of that roughly as training on that data. And
02:12:12.380 | that's why when you're reading through this stuff, this
02:12:14.380 | background knowledge, and there's all this context
02:12:15.820 | information, it's kind of equivalent to pre-training. So
02:12:19.900 | it's where we build sort of like a knowledge base of this
02:12:23.580 | data and get a sense of the topic. The next major kind of
02:12:28.140 | information that you will see is these problems and with their
02:12:33.020 | worked solutions. So basically a human expert, in this case, the
02:12:37.100 | author of this book, has given us not just a problem, but has
02:12:40.140 | also worked through the solution. And the solution is
02:12:43.020 | basically like equivalent to having like this ideal response
02:12:46.460 | for an assistant. So it's basically the expert is showing
02:12:49.260 | us how to solve the problem and it's kind of like in its full
02:12:53.580 | form. So as we are reading the solution, we are basically
02:12:57.980 | training on the expert data. And then later we can try to
02:13:02.060 | imitate the expert. And basically that roughly
02:13:07.180 | corresponds to having the SFT model. That's what it would be
02:13:09.580 | doing. So basically we've already done pre-training and
02:13:12.940 | we've already covered this imitation of experts and how
02:13:17.340 | they solve these problems. And the third stage of
02:13:20.220 | reinforcement learning is basically the practice problems.
02:13:23.100 | So sometimes you'll see this is just a single practice problem
02:13:26.620 | here. But of course, there will be usually many practice
02:13:28.940 | problems at the end of each chapter in any textbook. And
02:13:32.140 | practice problems, of course, we know are critical for
02:13:34.220 | learning, because what are they getting you to do? They're
02:13:36.780 | getting you to practice yourself and discover ways of
02:13:40.940 | solving these problems yourself. And so what you get in
02:13:43.980 | the practice problem is you get the problem description, but
02:13:47.180 | you're not given the solution, but you are given the final
02:13:50.780 | answer, usually in the answer key of the textbook. And so you
02:13:54.860 | know the final answer that you're trying to get to, and you
02:13:57.100 | have the problem statement, but you don't have the solution.
02:13:59.900 | You are trying to practice the solution. You're trying out
02:14:02.940 | many different things, and you're seeing what gets you to
02:14:06.140 | the final solution the best. And so you're discovering how
02:14:09.900 | to solve these problems. And in the process of that, you're
02:14:12.860 | relying on, number one, the background information, which
02:14:15.340 | comes from pre-training, and number two, maybe a little bit
02:14:17.820 | of imitation of human experts. And you can probably try
02:14:21.420 | similar kinds of solutions and so on. So we've done this and
02:14:25.420 | this, and now in this section, we're going to try to practice.
02:14:28.300 | And so we're going to be given prompts. We're going to be
02:14:32.140 | given solutions. Sorry, the final answers, but we're not
02:14:35.740 | going to be given expert solutions. We have to practice
02:14:38.940 | and try stuff out. And that's what reinforcement learning is
02:14:41.580 | about. Okay, so let's go back to the problem that we worked
02:14:44.620 | with previously, just so we have a concrete example to talk
02:14:47.420 | through as we explore the topic here. So I'm here in the
02:14:52.220 | Tiktokenizer because, number one, I get a text box,
02:14:55.580 | which is useful. But number two, I want to remind you again
02:14:58.780 | that we're always working with one-dimensional token
02:15:00.540 | sequences. And so I actually prefer this view because this
02:15:04.460 | is the native view of the LLM, if that makes sense. This is
02:15:07.660 | what it actually sees. It sees token IDs, right? So Emily buys
02:15:13.180 | three apples and two oranges. Each orange is $2. The total
02:15:17.020 | cost of all the fruit is $13. What is the cost of each apple?
02:15:21.500 | And what I'd like you to appreciate here is these are
02:15:25.180 | like four possible candidate solutions as an example. And
02:15:30.460 | they all reach the answer three. Now what I'd like you to
02:15:33.260 | appreciate at this point is that if I'm the human data
02:15:36.060 | labeler that is creating a conversation to be entered into
02:15:39.100 | the training set, I don't actually really know which of
02:15:42.940 | these conversations to add to the data set. Some of these
02:15:49.420 | conversations kind of set up a system of equations. Some of
02:15:51.900 | them sort of like just talk through it in English, and some
02:15:55.020 | of them just kind of like skip right through to the solution.
02:15:57.740 | If you look at ChatGPT, for example, and you give it this
02:16:02.380 | question, it defines a system of variables and it kind of like
02:16:05.420 | does this little thing. What we have to appreciate and
02:16:08.780 | differentiate between though is the first purpose of a
02:16:13.180 | solution is to reach the right answer. Of course, we want to
02:16:15.580 | get the final answer three. That is the important purpose
02:16:19.020 | here. But there's kind of like a secondary purpose as well,
02:16:21.660 | where here we are also just kind of trying to make it like
02:16:24.620 | nice for the human, because we're kind of assuming that the
02:16:27.980 | person wants to see the solution, they want to see the
02:16:29.980 | intermediate steps, we want to present it nicely, etc. So there
02:16:33.100 | are two separate things going on here. Number one is the
02:16:35.900 | presentation for the human. But number two, we're trying to
02:16:38.300 | actually get the right answer. So let's, for the moment, focus
02:16:42.380 | on just reaching the final answer. If we only care about
02:16:46.700 | the final answer, then which of these is the optimal or like
02:16:50.860 | the best prompt? Sorry, the best solution for the LLM to
02:16:55.820 | reach the right answer. And what I'm trying to get at is we
02:17:00.620 | don't know. Me, as a human labeler, I would not know which
02:17:03.580 | one of these is best. So as an example, we saw earlier on when
02:17:07.340 | we looked at the token sequences here and the mental arithmetic
02:17:12.540 | and reasoning, we saw that for each token, we can only spend
02:17:15.580 | basically a finite amount of compute here
02:17:18.780 | that is not very large, or you should think about it that way.
02:17:20.940 | And so we can't actually make too big of a leap in any one
02:17:25.100 | token is maybe the way to think about it. So as an example, in
02:17:29.020 | this one, what's really nice about it is that it's very few
02:17:31.820 | tokens, so it's gonna take us very short amount of time to get
02:17:34.700 | to the answer. But right here, when we're doing 13 minus four
02:17:38.380 | divide three equals, right in this token here, we're actually
02:17:42.700 | asking for a lot of computation to happen on that single
02:17:44.940 | individual token. And so maybe this is a bad example to give
02:17:48.060 | to the LLM because it's kind of incentivizing it to skip through
02:17:50.460 | the calculations very quickly, and it's going to actually make
02:17:52.860 | mistakes in this mental arithmetic. So maybe
02:17:56.940 | it would work better to spread it out more.
02:17:59.900 | Maybe it would be better to set up as an equation, maybe it
02:18:03.180 | would be better to talk through it. We fundamentally don't know.
02:18:06.780 | And we don't know because what is easy or hard for you or me as
02:18:11.900 | human labelers is different
02:18:15.420 | than what's easy or hard for the LLM. Its cognition is different.
02:18:18.860 | And the token sequences are kind of differently hard for it.
02:18:24.460 | And so some of the token sequences here that are trivial
02:18:30.300 | for me might be too much of a leap for the LLM. So right
02:18:35.500 | here, this token would be way too hard. But conversely, many
02:18:39.580 | of the tokens that I'm creating here might be just trivial to
02:18:43.180 | the LLM. And we're just wasting tokens, like why waste all these
02:18:46.140 | tokens when this is all trivial. So if the only thing we care
02:18:49.980 | about is reaching the final answer, and we're separating out
02:18:52.620 | the issue of the presentation to the human, then we don't
02:18:56.140 | actually really know how to annotate this example. We don't
02:18:58.700 | know what solution to give to the LLM, because we are not the
02:19:01.340 | LLM. And it's clear here in the case of like the math example,
02:19:06.540 | but this is actually like a very pervasive issue, because our
02:19:09.820 | knowledge is not the LLM's knowledge. The LLM actually
02:19:13.260 | has a ton of PhD-level knowledge in math and physics and chemistry
02:19:15.980 | and whatnot. So in many ways, it actually knows more than I do.
02:19:19.180 | And I'm potentially not utilizing that knowledge in its
02:19:22.700 | problem solving. But conversely, I might be injecting a bunch of
02:19:26.300 | knowledge in my solutions that the LLM doesn't know in its
02:19:29.980 | parameters. And then those are like sudden leaps that are very
02:19:33.740 | confusing to the model. And so our cognitions are different.
02:19:38.300 | And I don't really know what to put here, if all we care about
02:19:41.980 | is reaching the final solution, and doing it
02:19:44.620 | economically, ideally. And so, long story short, we are not in
02:19:50.220 | a good position to create these token sequences for the LLM. And
02:19:55.260 | they're useful for initializing the system by imitation. But we
02:19:59.100 | really want the LLM to discover the token sequences that work
02:20:02.140 | for it. It needs to find for itself what token sequence
02:20:07.260 | reliably gets to the answer, given the prompt. And it needs
02:20:11.180 | to discover that in a process of reinforcement learning and of
02:20:13.660 | trial and error. So let's see how this example would work like
02:20:18.700 | in reinforcement learning. Okay, so we're now back in the
02:20:22.540 | Hugging Face Inference Playground. And that just allows
02:20:26.300 | me to very easily call different kinds of models. So as an
02:20:29.500 | example, here on the top right, I chose the Gemma 2, 2 billion
02:20:33.420 | parameter model. So 2 billion is very, very small. So this is a
02:20:36.540 | tiny model, but it's okay. So the way
02:20:40.300 | that reinforcement learning will basically work is actually
02:20:42.380 | quite simple. We need to try many different kinds of
02:20:47.180 | solutions. And we want to see which solutions work well or
02:20:50.060 | not. So we're basically going to take the prompt, we're going
02:20:53.740 | to run the model. And the model generates a solution. And then
02:20:58.940 | we're going to inspect the solution. And we know that the
02:21:01.660 | correct answer for this one is $3. And so indeed, the model
02:21:05.500 | gets it correct, says it's $3. So this is correct. So that's
02:21:09.180 | just one attempt at the solution. So now we're going to
02:21:11.980 | delete this, and we're going to rerun it again. Let's try a
02:21:14.860 | second attempt. So the model solves it in a slightly
02:21:18.060 | different way, right? Every single attempt will be a
02:21:21.180 | different generation, because these models are stochastic
02:21:23.420 | systems. Remember that every single token here, we have a
02:21:25.980 | probability distribution, and we're sampling from that
02:21:28.460 | distribution.
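A minimal Python sketch of what sampling from that per-token distribution looks like; the toy vocabulary and the temperature value here are illustrative assumptions, not what the Inference Playground actually uses.

    import torch

    # At every step the model emits logits over the vocabulary; we convert them
    # to a probability distribution and sample, which is why each rerun of the
    # same prompt can follow a slightly different path.
    def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> int:
        probs = torch.softmax(logits / temperature, dim=-1)    # distribution over tokens
        return torch.multinomial(probs, num_samples=1).item()  # stochastic draw

    logits = torch.tensor([2.0, 1.5, 0.3, -1.0, 0.0])  # toy 5-token vocabulary
    print(sample_next_token(logits), sample_next_token(logits))  # may differ run to run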
02:21:31.580 | So we end up kind of going down slightly different paths. And so this is the second solution that also
02:21:34.940 | ends in the correct answer. Now we're going to delete that.
02:21:38.380 | Let's go a third time. Okay, so again, slightly different
02:21:41.980 | solution, but also gets it correct. Now we can actually
02:21:45.900 | repeat this many times. And so in practice, you might actually
02:21:49.740 | sample 1000s of independent solutions, or even like a
02:21:52.780 | million solutions for just a single prompt. And some of them
02:21:57.260 | will be correct, and some of them will not be very correct.
02:21:59.660 | And basically, what we want to do is we want to encourage the
02:22:02.140 | solutions that lead to correct answers. So let's take a look
02:22:05.820 | at what that looks like. So if we come back over here, here's
02:22:09.260 | kind of like a cartoon diagram of what this is looking like.
02:22:11.420 | We have a prompt. And then we tried many different solutions
02:22:15.740 | in parallel. And some of the solutions might go well, so they
02:22:20.940 | get the right answer, which is in green. And some of the
02:22:23.980 | solutions might go poorly and may not reach the right answer,
02:22:26.700 | which is red. Now, this problem here, unfortunately, is not the
02:22:29.900 | best example, because it's a trivial prompt. And as we saw,
02:22:33.020 | even like a two billion parameter model always gets it
02:22:36.060 | right. So it's not the best example in that sense. But let's
02:22:39.020 | just exercise some imagination here. And let's just suppose
02:22:42.700 | that the green ones are good, and the red ones are bad. Okay,
02:22:50.060 | so we generated 15 solutions, only four of them got the right
02:22:53.340 | answer. And so now what we want to do is, basically, we want to
02:22:57.420 | encourage the kinds of solutions that lead to right answers. So
02:23:00.860 | whatever token sequences happened in these red solutions,
02:23:04.460 | obviously, something went wrong along the way somewhere. And
02:23:07.340 | this was not a good path to take through the solution. And
02:23:10.940 | whatever token sequences that were in these green solutions,
02:23:13.740 | well, things went pretty well in this situation. And so we
02:23:17.660 | want to do more things like it in prompts like this. And the
02:23:22.140 | way we encourage this kind of a behavior in the future is we
02:23:25.100 | basically train on these sequences. But these training
02:23:28.540 | sequences now are not coming from expert human annotators.
02:23:31.740 | There's no human who decided that this is the correct
02:23:34.220 | solution. This solution came from the model itself. So the
02:23:37.900 | model is practicing here, it's tried out a few solutions, four
02:23:40.940 | of them seem to have worked. And now the model will kind of
02:23:43.580 | like train on them. And this corresponds to a student
02:23:46.220 | basically looking at their solutions and being like, okay,
02:23:48.380 | well, this one worked really well. So this is how I should be
02:23:50.940 | solving these kinds of problems. And here in this example, there
02:23:55.980 | are many different ways to actually like really tweak the
02:23:58.460 | methodology a little bit here. But just to get the core idea
02:24:01.500 | across, maybe it's simplest to just think about taking the
02:24:04.940 | single best solution out of these four, like say this one,
02:24:08.300 | that's why it was yellow. So this is the solution that not
02:24:12.700 | only got to the right answer, but maybe had some other nice
02:24:15.500 | properties. Maybe it was the shortest one, or it looked
02:24:18.300 | nicest in some ways, or there's other criteria you could think
02:24:21.740 | of as an example. But we're going to decide that this is the
02:24:24.300 | top solution, we're going to train on it. And then the model
02:24:28.380 | will be slightly more likely, once you do the parameter
02:24:31.580 | update, to take this path in this kind of a setting in the
02:24:35.900 | future. But you have to remember that we're going to run many
02:24:39.500 | different diverse prompts across lots of math problems and
02:24:42.540 | physics problems and whatever, whatever there might be. So
02:24:46.060 | have in mind maybe tens of thousands of prompts, and thousands of
02:24:49.580 | solutions per prompt. And so this is all happening kind of
02:24:52.620 | like at the same time. And as we're iterating this process,
02:24:56.700 | the model is discovering for itself, what kinds of token
02:25:00.060 | sequences lead it to correct answers. It's not coming from a
02:25:04.540 | human annotator. The model is kind of like playing in this
02:25:08.540 | playground. And it knows what it's trying to get to. And it's
02:25:12.300 | discovering sequences that work for it. These are sequences
02:25:15.980 | that don't make any mental leaps. They seem to work
02:25:20.300 | reliably and statistically, and fully utilize the knowledge of
02:25:24.380 | the model as it has it. And so this is the process of
02:25:28.460 | reinforcement learning. It's basically a guess and check,
02:25:31.820 | we're going to guess many different types of solutions,
02:25:33.580 | we're going to check them, and we're going to do more of what
02:25:35.900 | worked in the future. And that is reinforcement learning.
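Here is a minimal Python sketch of that guess-and-check loop. The model.generate and model.train_on calls are hypothetical stand-ins for a real sampling and fine-tuning API, the "Answer:" format is an assumption, and real RL recipes weight the update more carefully than simply fine-tuning on the shortest correct rollout.

    import re

    def extract_answer(solution: str) -> str | None:
        # assume solutions end with something like "Answer: 3"
        m = re.search(r"Answer:\s*\$?([\d.]+)", solution)
        return m.group(1) if m else None

    def rl_step(model, prompt: str, correct_answer: str, num_rollouts: int = 16) -> None:
        # 1) guess: sample many independent solutions for the same prompt
        rollouts = [model.generate(prompt, temperature=1.0) for _ in range(num_rollouts)]
        # 2) check: keep only the rollouts that reach the right final answer
        good = [r for r in rollouts if extract_answer(r) == correct_answer]
        if not good:
            return  # nothing to reinforce for this prompt
        # 3) reinforce: train on the best correct solution (here, simply the shortest),
        #    making those token sequences slightly more likely in the future
        best = min(good, key=len)
        model.train_on(prompt, best)

    # In practice this is repeated over tens of thousands of prompts, with
    # thousands of rollouts per prompt, for many rounds of updates.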
02:25:40.620 | So in the context of what came before, we see now that the SFT model,
02:25:44.300 | the supervised fine tuning model, it's still helpful,
02:25:47.020 | because it's still kind of like initializes the model a little
02:25:49.500 | bit into the vicinity of the correct solutions. So it's kind
02:25:52.940 | of like an initialization of the model, in the sense that it kind
02:25:57.420 | of gets the model to, you know, take solutions, like write out
02:26:01.100 | solutions, and maybe it has an understanding of setting up a
02:26:03.820 | system of equations, or maybe it kind of like talks through a
02:26:06.380 | solution. So it gets you into the vicinity of correct
02:26:09.020 | solutions. But reinforcement learning is where everything
02:26:11.660 | gets dialed in, we really discover the solutions that
02:26:14.460 | work for the model, get the right answers, we encourage
02:26:17.420 | them, and then the model just kind of like gets better over
02:26:20.140 | time. Okay, so that is the high level process for how we train
02:26:23.660 | large language models. In short, we train them kind of very
02:26:27.180 | similar to how we train children. And basically, the
02:26:30.380 | only difference is that children go through chapters of
02:26:32.860 | books, and they do all these different types of training
02:26:35.980 | exercises, kind of within the chapter of each book. But
02:26:39.740 | instead, when we train AIs, it's almost like we kind of do it
02:26:42.380 | stage by stage, depending on the type of that stage. So first,
02:26:46.540 | what we do is we do pre training, which as we saw is
02:26:49.020 | equivalent to basically reading all the expository material. So
02:26:53.020 | we look at all the textbooks at the same time, and we read all
02:26:55.980 | the exposition, and we try to build a knowledge base. The
02:26:59.580 | second thing then is we go into the SFT stage, which is really
02:27:03.180 | looking at all the fixed sort of like solutions from human
02:27:07.180 | experts of all the different kinds of worked solutions
02:27:10.860 | across all the textbooks. And we just kind of get an SFT
02:27:14.380 | model, which is able to imitate the experts, but does so kind
02:27:17.500 | of blindly, it just kind of like does its best guess, kind
02:27:21.260 | of just like trying to mimic statistically the expert
02:27:23.580 | behavior. And so that's what you get when you look at all the
02:27:26.060 | worked solutions. And then finally, in the last stage, we
02:27:29.660 | do all the practice problems in the RL stage across all the
02:27:33.180 | textbooks, we only do the practice problems. And that's
02:27:36.220 | how we get the RL model. So on a high level, the way we train
02:27:40.140 | LLMs is very much equivalent to the process that
02:27:44.860 | we use for training children. The next point I would
02:27:47.900 | like to make is that actually these first two stages pre
02:27:50.540 | training and supervised fine-tuning, they've been around for
02:27:52.940 | years, and they are very standard, and all the different
02:27:55.020 | LLM providers do them. It is this last stage, the RL
02:27:58.860 | training, that is a lot more early in its process of
02:28:01.820 | development, and is not yet standard in the field. And so this
02:28:07.900 | stage is a lot more kind of early and nascent. And the
02:28:11.020 | reason for that is because I actually skipped over a ton of
02:28:13.580 | little details here in this process. The high level idea is
02:28:16.140 | very simple, it's trial and error learning, but there's a
02:28:18.700 | ton of details and little mathematical kind of like
02:28:21.180 | nuances to exactly how you pick the solutions that are the
02:28:23.580 | best, and how much you train on them, and what is the prompt
02:28:26.540 | distribution, and how to set up the training run, such that
02:28:29.260 | this actually works. So there's a lot of little details and
02:28:31.900 | knobs to the core idea that is very, very simple. And so
02:28:35.500 | getting the details right here is not trivial. And so a lot of
02:28:39.580 | companies like for example, OpenAI and other LLM providers
02:28:42.060 | have experimented internally with reinforcement learning
02:28:45.420 | fine tuning for LLMs for a while, but they've not talked
02:28:48.700 | about it publicly. It's all kind of done inside the company.
02:28:53.180 | And so that's why the paper from DeepSeek that came out very,
02:28:56.540 | very recently was such a big deal. Because this is a paper
02:28:59.740 | from this company called DeepSeek AI in China. And this paper
02:29:04.380 | really talked very publicly about reinforcement learning
02:29:07.100 | fine tuning for large language models, and how incredibly
02:29:10.380 | important it is for large language models, and how it
02:29:13.500 | brings out a lot of reasoning capabilities in the models.
02:29:15.980 | We'll go into this in a second. So this paper reinvigorated the
02:29:19.900 | public interest of using RL for LLMs, and gave a lot of the
02:29:26.220 | sort of nitty-gritty details that are needed to reproduce
02:29:28.540 | the results, and actually get the stage to work for large
02:29:31.340 | language models. So let me take you briefly through this
02:29:34.300 | DeepSeq RL paper, and what happens when you actually
02:29:36.780 | correctly apply RL to language models, and what that looks
02:29:39.260 | like, and what that gives you. So the first thing I'll scroll
02:29:41.420 | to is this kind of figure two here, where we are looking at
02:29:44.700 | the improvement in how the models are solving mathematical
02:29:47.900 | problems. So this is the accuracy of solving mathematical
02:29:50.700 | problems on AIME. And then we can go to the web
02:29:54.620 | page, and we can see the kinds of math problems that are
02:29:56.460 | actually being measured here.
02:30:00.300 | So these are simple math problems. You can pause the
02:30:03.420 | video if you like, but these are the kinds of problems that
02:30:05.660 | basically the models are being asked to solve. And you can see
02:30:08.300 | that in the beginning they're not doing very well, but then
02:30:10.380 | as you update the model with this many thousands of steps,
02:30:13.580 | their accuracy kind of continues to climb. So the models are
02:30:17.260 | improving, and they're solving these problems with a higher
02:30:19.500 | accuracy as you do this trial and error on a large dataset of
02:30:23.580 | these kinds of problems. And the models are discovering how
02:30:26.540 | to solve math problems. But even more incredible than the
02:30:30.540 | quantitative kind of results of solving these problems with a
02:30:33.580 | higher accuracy is the qualitative means by which the
02:30:36.060 | model achieves these results. So when we scroll down, one of
02:30:40.380 | the figures here that is kind of interesting is that later on
02:30:43.420 | in the optimization, the average
02:30:48.140 | length per response goes up. So the model seems to be using
02:30:51.500 | more tokens to get its higher accuracy results. So it's
02:30:55.900 | learning to create very, very long solutions. Why are these
02:30:59.420 | solutions very long? We can look at them qualitatively here.
02:31:02.140 | So basically what they discover is that the model solutions get
02:31:05.900 | very, very long partially because, so here's a question,
02:31:08.620 | and here's kind of the answer from the model. What the model
02:31:11.420 | learns to do, and this is an emergent property of the
02:31:14.780 | optimization, it just discovers that this is good for problem
02:31:17.820 | solving, is it starts to do stuff like this. Wait, wait,
02:31:20.620 | wait, that's an aha moment I can flag here. Let's re-evaluate
02:31:23.500 | this step by step to identify if the correct sum can be. So
02:31:26.380 | what is the model doing here, right? The model is basically
02:31:29.500 | re-evaluating steps. It has learned that it works better
02:31:33.340 | for accuracy to try out lots of ideas, try something from
02:31:37.100 | different perspectives, retrace, reframe, backtrack. It's
02:31:40.620 | doing a lot of the things that you and I are doing in the
02:31:42.620 | process of problem solving for mathematical questions, but
02:31:45.580 | it's rediscovering what happens in your head, not what you put
02:31:48.380 | down on the solution, and there is no human who can hard code
02:31:51.660 | this stuff in the ideal assistant response. This is only
02:31:54.860 | something that can be discovered in the process of
02:31:56.620 | reinforcement learning because you wouldn't know what to put
02:31:59.420 | here. This just turns out to work for the model, and it
02:32:02.940 | improves its accuracy in problem solving. So the model learns
02:32:06.620 | what we call these chains of thought in your head, and it's
02:32:09.580 | an emergent property of the optimization, and that's what's
02:32:13.820 | bloating up the response lengths, but that's also what's
02:32:17.260 | increasing the accuracy of the problem solving. So what's
02:32:20.780 | incredible here is basically the model is discovering ways
02:32:23.660 | to think. It's learning what I like to call cognitive
02:32:26.540 | strategies of how you manipulate a problem and how you
02:32:29.980 | approach it from different perspectives, how you pull in
02:32:32.620 | some analogies or do different kinds of things like that, and
02:32:35.500 | how you kind of try out many different things over time,
02:32:37.980 | check a result from different perspectives, and how you kind
02:32:41.260 | of solve problems. But here, it's kind of discovered by the
02:32:44.460 | RL, so extremely incredible to see this emerge in the
02:32:47.820 | optimization without having to hard code it anywhere. The only
02:32:50.860 | thing we've given it are the correct answers, and this comes
02:32:53.740 | out from trying to just solve them correctly, which is
02:32:56.220 | incredible. Now let's go back to actually the problem that
02:33:00.780 | we've been working with, and let's take a look at what it
02:33:03.100 | would look like for this kind of a model, what we call
02:33:07.900 | reasoning or thinking model, to solve that problem. Okay, so
02:33:11.660 | recall that this is the problem we've been working with, and
02:33:13.900 | when I pasted it into ChatGPT 4o, I'm getting this kind of
02:33:17.340 | a response. Let's take a look at what happens when you give
02:33:20.540 | the same query to what's called a reasoning or a thinking
02:33:23.500 | model. This is a model that was trained with reinforcement
02:33:25.660 | learning. So this model described in this paper,
02:33:29.740 | DeepSeek-R1, is available on chat.deepseek.com. So this is
02:33:34.300 | kind of like the company that developed it is hosting it. You
02:33:37.260 | have to make sure that the Deep Think button is turned on to
02:33:40.060 | get the R1 model, as it's called. We can paste it here
02:33:43.180 | and run it. And so let's take a look at what happens now, and
02:33:47.180 | what is the output of the model. Okay, so here's what it
02:33:49.820 | says. So this is previously what we get using basically
02:33:53.260 | what's an SFT approach, a supervised fine-tuning
02:33:55.500 | approach. This is like mimicking an expert solution.
02:33:58.460 | This is what we get from the RL model. Okay, let me try to
02:34:01.820 | figure this out. So Emily buys three apples and two oranges.
02:34:04.460 | Each orange costs $2, total is $13. I need to find out blah
02:34:07.660 | blah blah. So here, as you're reading this, you can't
02:34:12.780 | escape thinking that this model is thinking. It's
02:34:17.820 | definitely pursuing the solution. It derives that it
02:34:21.340 | must cost $3. And then it says, wait a second, let me check my
02:34:23.900 | math again to be sure. And then it tries it from a slightly
02:34:26.220 | different perspective. And then it says, yep, all that checks
02:34:29.340 | out. I think that's the answer. I don't see any mistakes. Let
02:34:33.180 | me see if there's another way to approach the problem, maybe
02:34:35.340 | setting up an equation. Let's let the cost of one apple be
02:34:39.500 | A, then blah blah blah. Yep, same answer. So definitely each
02:34:43.420 | apple is $3. All right, confident that that's correct.
02:34:46.300 | And then what it does, once it sort of did the thinking
02:34:50.140 | process, is it writes up the nice solution for the human.
02:34:53.660 | And so this is now considering -- so this is more about the
02:34:56.620 | correctness aspect, and this is more about the presentation
02:34:59.660 | aspect, where it kind of writes it out nicely and boxes in the
02:35:04.060 | correct answer at the bottom. And so what's incredible about
02:35:06.940 | this is we get this like thinking process of the model.
02:35:09.340 | And this is what's coming from the reinforcement learning
02:35:11.900 | process. This is what's bloating up the length of the token
02:35:15.820 | sequences. They're doing thinking and they're trying
02:35:17.820 | different ways. This is what's giving you higher accuracy in
02:35:21.420 | problem solving. And this is where we are seeing these aha
02:35:25.340 | moments and these different strategies and these ideas for
02:35:29.500 | how you can make sure that you're getting the correct
02:35:31.420 | answer. The last point I wanted to make is some people are a
02:35:35.260 | little bit nervous about putting, you know, very
02:35:37.980 | sensitive data into chat.deepseek.com because this is a
02:35:41.260 | Chinese company. So people are a little bit
02:35:43.980 | careful and cagey with that. DeepSeek-R1 is a
02:35:48.220 | model that was released by this company. So this is an open
02:35:51.660 | source model or open weights model. It is available for
02:35:54.860 | anyone to download and use. You will not be able to like run it
02:35:58.300 | in its full sort of -- the full model in full precision. You
02:36:03.100 | won't run that on a MacBook or like a local device because
02:36:07.020 | this is a fairly large model. But many companies are hosting
02:36:09.980 | the full largest model. One of those companies that I like to
02:36:13.180 | use is called together.ai. So when you go to together.ai, you
02:36:17.260 | sign up and you go to playgrounds. You can select here
02:36:20.220 | in the chat DeepSeek-R1 and there are many different kinds of
02:36:23.580 | other models that you can select here. These are all
02:36:25.340 | state-of-the-art models. So this is kind of similar to the
02:36:27.900 | Hugging Face inference playground that we've been
02:36:29.820 | playing with so far. But together.ai will usually host
02:36:33.020 | all the state-of-the-art models. So select DeepSeek-R1.
02:36:36.220 | You can try to ignore a lot of these. I think the default
02:36:39.180 | settings will often be okay. And we can put in this. And
02:36:43.420 | because the model was released by DeepSeq, what you're getting
02:36:46.380 | here should be basically equivalent to what you're
02:36:48.780 | getting here. Now because of the randomness in the sampling,
02:36:51.420 | we're going to get something slightly different. But in
02:36:53.820 | principle, this should be identical in terms of the power
02:36:56.700 | of the model. And you should be able to see the same things
02:36:58.940 | quantitatively and qualitatively. But here, it is
02:37:02.140 | being served by kind of an American company. So that's DeepSeek,
02:37:06.780 | and that's what's called a reasoning model. Now when I go
02:37:10.860 | back to chat, let me go to chat here. Okay, so the model that
02:37:14.700 | you're going to see in the drop down here, some of them like
02:37:17.420 | O1, O3 mini, O3 mini high, etc. They are talking about
02:37:21.180 | users-advanced reasoning. Now what this is referring to,
02:37:24.460 | "uses advanced reasoning". Now what this phrase,
02:37:27.260 | "uses advanced reasoning", is referring to is the fact that
02:37:30.060 | similar to those of DeepSeq R1, per public statements of
02:37:33.900 | OpenAI employees. So these are thinking models trained with
02:37:42.460 | RL. And these models like GPT-4o or GPT-4o mini that you're
02:37:42.460 | getting in the free tier, you should think of them as mostly
02:37:44.860 | SFT models, supervised fine-tuning models. They don't
02:37:47.660 | actually do this like thinking as you see in the RL models.
02:37:50.780 | And even though there's a little bit of reinforcement
02:37:53.420 | learning involved with these models, and I'll go into that
02:37:55.980 | in a second, these are mostly SFT models. I think you should
02:37:58.540 | think about it that way. So in the same way as what we saw
02:38:01.580 | here, we can pick one of the thinking models, like say O3
02:38:04.860 | mini high. And these models, by the way, might not be
02:38:07.500 | available to you unless you pay a ChatGPT subscription of
02:38:10.940 | either $20 per month or $200 per month for some of the top
02:38:14.540 | models. So we can pick a thinking model and run. Now
02:38:19.420 | what's going to happen here is it's going to say reasoning,
02:38:21.580 | and it's going to start to do stuff like this. And what we're
02:38:25.740 | seeing here is not exactly the same stuff we saw with DeepSeek. So
02:38:29.420 | even though under the hood, the model produces these kinds of
02:38:32.700 | chains of thought, OpenAI chooses to not show the exact
02:38:37.820 | chains of thought in the web interface. It shows little
02:38:40.620 | summaries of those chains of thought. And OpenAI kind of
02:38:44.060 | does this, I think, partly because they are worried about
02:38:46.940 | what's called a distillation risk. That is that someone
02:38:49.420 | could come in and actually try to imitate those reasoning
02:38:52.140 | traces and recover a lot of the reasoning performance by just
02:38:55.260 | imitating the reasoning chains of thought. And so they kind of
02:38:58.780 | hide them and they only show little summaries of them. So
02:39:01.180 | you're not getting exactly what you would get in DeepSeek
02:39:03.340 | with respect to the reasoning itself. And then they
02:39:07.020 | write up the solution. So these are kind of like equivalent,
02:39:11.020 | even though we're not seeing the full under the hood details.
02:39:13.980 | Now, in terms of the performance, these models and
02:39:17.420 | DeepSeq models are currently roughly on par, I would say.
02:39:20.540 | It's kind of hard to tell because of the evaluations. But
02:39:22.540 | if you're paying $200 per month to OpenAI, some of these
02:39:25.020 | models, I believe, currently basically still look
02:39:27.660 | better. But DeepSeek-R1 for now is still a very solid choice
02:39:32.620 | for a thinking model that would be available to you either on
02:39:38.300 | this website or any other website because the model is
02:39:40.220 | open weights. You can just download it. So that's thinking
02:39:44.460 | models. So what is the summary so far? Well, we've talked
02:39:47.980 | about reinforcement learning and the fact that thinking
02:39:51.180 | emerges in the process of the optimization on when we
02:39:54.220 | basically run RL on many math and kind of code problems that
02:39:57.820 | have verifiable solutions. So there's like an answer three,
02:40:00.940 | et cetera. Now, these thinking models you can access in, for
02:40:05.340 | example, DeepSeek or any inference provider like
02:40:08.220 | together.ai and choosing DeepSeek over there. These
02:40:12.380 | thinking models are also available in ChatGPT under any
02:40:16.140 | of the O1 or O3 models. But these GPT-4o models, et
02:40:20.380 | cetera, they're not thinking models. You should think of
02:40:21.980 | them as mostly SFT models. Now, if you have a prompt that
02:40:27.740 | requires advanced reasoning and so on, you should probably
02:40:30.300 | use some of the thinking models or at least try them
02:40:31.980 | out. But empirically, for a lot of my use, when you're
02:40:35.340 | asking a simpler question, there's like a knowledge-
02:40:37.100 | based question or something like that, this might be
02:40:38.940 | overkill. There's no need to think 30 seconds about some
02:40:41.340 | factual question. So for that, I will sometimes default to
02:40:44.620 | just GPT-4o. So empirically, about 80, 90 percent of my use
02:40:48.380 | is just GPT-4o. And when I come across a very difficult
02:40:51.100 | problem, like in math and code, et cetera, I will reach for
02:40:53.740 | the thinking models. But then I have to wait a bit longer
02:40:57.020 | because they are thinking. So you can access these on
02:41:00.300 | ChatGPT, on DeepSeek. Also, I wanted to point out that
02:41:03.260 | aistudio.google.com, even though it looks really busy,
02:41:08.060 | really ugly, because Google is just unable to do this kind
02:41:11.420 | of stuff well, like, what is happening. But if you choose
02:41:15.020 | model and you choose here, Gemini 2.0 Flash Thinking
02:41:18.300 | Experimental 0121, if you choose that one, that's also a
02:41:22.060 | kind of early, experimental version of a thinking
02:41:25.180 | model by Google. So we can go here and we can give it the
02:41:28.540 | same problem and click run. And this is also a thinking
02:41:31.500 | model that will also do something similar
02:41:34.860 | and comes out with the right answer here. So basically,
02:41:39.100 | Gemini also offers a thinking model. Anthropic currently
02:41:42.460 | does not offer a thinking model. But basically, this is
02:41:44.780 | kind of like the frontier development of these LLMs. I
02:41:47.420 | think RL is kind of like this new exciting stage, but getting
02:41:51.100 | the details right is difficult. And that's why all these
02:41:53.740 | models and thinking models are currently experimental as of
02:41:56.620 | 2025, very early 2025. But this is kind of like the frontier
02:42:01.500 | development of pushing the performance on these very
02:42:03.500 | difficult problems using reasoning that is emergent in
02:42:06.220 | these optimizations. One more connection that I wanted to
02:42:08.940 | bring up is that the discovery that reinforcement learning is
02:42:12.540 | extremely powerful way of learning is not new to the
02:42:16.220 | field of AI. And one place where we've already seen this
02:42:19.420 | demonstrated is in the game of Go. And famously, DeepMind
02:42:23.740 | developed the system AlphaGo, and you can watch a movie about
02:42:26.540 | it, where the system is learning to play the game of Go
02:42:30.460 | against top human players. And when we go to the paper
02:42:35.340 | underlying AlphaGo, so in this paper, when we scroll down,
02:42:42.140 | we actually find a really interesting plot that I think
02:42:46.620 | is kind of familiar to us, and we're kind of like rediscovering
02:42:49.980 | in the more open domain of arbitrary problem solving,
02:42:53.260 | instead of on the closed specific domain of the game of
02:42:55.820 | Go. But basically what they saw, and we're going to see this in
02:42:58.620 | LLMs as well as this becomes more mature, is this: the ELO
02:43:03.660 | rating of playing the game of Go. And this is Lee Sedol, an
02:43:06.620 | extremely strong human player. And here where they are
02:43:09.660 | comparing is the strength of a model learned, trained by
02:43:12.860 | supervised learning, and a model trained by reinforcement
02:43:15.500 | learning. So the supervised learning model is imitating
02:43:19.020 | human expert players. So if you just get a huge amount of games
02:43:22.460 | played by expert players in the game of Go, and you try to
02:43:24.940 | imitate them, you are going to get better, but then you top
02:43:29.100 | out, and you never quite get better than some of the top,
02:43:32.620 | top, top players in the game of Go, like Lee Sedol. So you're
02:43:35.740 | never going to reach there, because you're just imitating
02:43:38.140 | human players. You can't fundamentally go beyond a human
02:43:40.700 | player if you're just imitating human players. But in a process
02:43:44.060 | of reinforcement learning is significantly more powerful. In
02:43:47.260 | reinforcement learning for a game of Go, it means that the
02:43:49.980 | system is playing moves that empirically and statistically
02:43:53.900 | lead to winning the game. And so AlphaGo is a
02:43:58.460 | system where it kind of plays against itself, and it's using
02:44:02.540 | reinforcement learning to create rollouts. So it's the exact
02:44:06.380 | same diagram here, but there's no prompt, because
02:44:10.380 | it's just a fixed game of Go. But it's
02:44:12.940 | trying out lots of solutions, it's trying lots of plays, and
02:44:16.540 | then the games that lead to a win, instead of a specific
02:44:19.740 | answer, are reinforced. They're made stronger. And so the
02:44:25.660 | system is learning basically the sequences of actions that
02:44:27.980 | empirically and statistically lead to winning the game. And
02:44:31.980 | reinforcement learning is not going to be constrained by
02:44:34.220 | human performance. And reinforcement learning can do
02:44:36.540 | significantly better and overcome even the top players
02:44:39.660 | like Lee Sedol. And so probably they could have run this
02:44:44.780 | longer, and they just chose to crop it at some point because
02:44:46.940 | this costs money. But this is a very powerful demonstration of
02:44:49.740 | reinforcement learning. And we're only starting to kind of
02:44:52.300 | see hints of this diagram in larger language models for
02:44:56.780 | reasoning problems. So we're not going to get too far by just
02:44:59.740 | imitating experts. We need to go beyond that, set up these
02:45:02.780 | little game environments, and let the system discover
02:45:07.900 | reasoning traces or ways of solving problems that are
02:45:12.780 | unique and that just basically work well. Now on this aspect
02:45:17.900 | of uniqueness, notice that when you're doing reinforcement
02:45:20.300 | learning, nothing prevents you from veering off the
02:45:23.340 | distribution of how humans are playing the game. And so when
02:45:26.700 | we go back to this AlphaGo search here, one of the
02:45:29.980 | suggestions is called move 37. And move 37 in
02:45:34.380 | AlphaGo is referring to a specific point in time where
02:45:37.740 | AlphaGo basically played a move that no human expert would
02:45:42.380 | play. So the probability of this move to be played by a
02:45:46.140 | human player was evaluated to be about 1 in 10,000. So it's a
02:45:49.580 | very rare move. But in retrospect, it was a brilliant
02:45:52.380 | move. So AlphaGo, in the process of reinforcement
02:45:54.860 | learning, discovered kind of like a strategy of playing that
02:45:57.900 | was unknown to humans, but is in retrospect brilliant. I
02:46:02.300 | recommend this YouTube video, Lee Sedol versus AlphaGo move
02:46:05.580 | 37 reaction analysis. And this is kind of what it looked like
02:46:08.940 | when AlphaGo played this move.
02:46:10.700 | "That's a very surprising move." "I thought it was a
02:46:20.380 | mistake." "When I see this move." Anyway, so basically
02:46:24.940 | people are kind of freaking out because it's a move that
02:46:27.820 | a human would not play, that AlphaGo played, because in its
02:46:32.140 | training, this move seemed to be a good idea. It just happens
02:46:35.420 | not to be a kind of thing that humans would do. And so that
02:46:38.700 | is, again, the power of reinforcement learning. And in
02:46:40.860 | principle, we can actually see the equivalent of that if we
02:46:43.580 | continue scaling this paradigm in language models. And what
02:46:46.620 | that looks like is kind of unknown. So what does it mean to
02:46:50.540 | solve problems in such a way that even humans would not be
02:46:55.420 | able to get? How can you be better at reasoning or thinking
02:46:58.380 | than humans? How can you go beyond just a thinking human?
02:47:03.500 | Like maybe it means discovering analogies that humans would
02:47:06.300 | not be able to create. Or maybe it's like a new thinking
02:47:09.500 | strategy. It's kind of hard to think through. Maybe it's a
02:47:12.860 | wholly new language that actually is not even English.
02:47:16.300 | Maybe it discovers its own language that is a lot better
02:47:19.260 | at thinking. Because the model is unconstrained to even like
02:47:23.740 | stick with English. So maybe it takes a different language to
02:47:27.500 | think in, or it discovers its own language. So in principle,
02:47:31.020 | the behavior of the system is a lot less defined. It is open to
02:47:35.180 | do whatever works. And it is open to also slowly drift from
02:47:39.980 | the distribution of its training data, which is English.
02:47:41.900 | But all that can only be done if we have a very large, diverse
02:47:46.140 | set of problems in which these strategies can be refined and
02:47:49.500 | perfected. And so that is a lot of the frontier LLM research
02:47:53.020 | that's going on right now is trying to kind of create those
02:47:55.740 | kinds of prompt distributions that are large and diverse.
02:47:58.540 | These are all kind of like game environments in which the LLMs
02:48:01.100 | can practice their thinking. And it's kind of like writing,
02:48:05.500 | you know, these practice problems. We have to create
02:48:07.420 | practice problems for all domains of knowledge. And if we
02:48:11.260 | have practice problems and tons of them, the models will be able
02:48:14.380 | to reinforcement learn on them and
02:48:17.980 | kind of create these kinds of diagrams. But in the domain of
02:48:22.540 | open thinking, instead of a closed domain like Game of Go.
02:48:25.900 | There's one more section within reinforcement learning that I
02:48:28.780 | wanted to cover. And that is that of learning in unverifiable
02:48:32.780 | domains. So, so far, all of the problems that we've looked at
02:48:36.220 | are in what's called verifiable domains. That is, any candidate
02:48:39.580 | solution, we can score very easily against a concrete
02:48:43.020 | answer. So for example, answer is three, and we can very easily
02:48:46.220 | score these solutions against the answer of three. Either we
02:48:49.980 | require the models to like box in their answers, and then we
02:48:53.020 | just check for equality of whatever's in the box with the
02:48:56.140 | answer. Or you can also use kind of what's called an LLM
02:48:59.500 | judge. So the LLM judge looks at a solution, and it gets the
02:49:03.340 | answer, and just basically scores the solution for whether
02:49:06.300 | it's consistent with the answer or not. And LLMs empirically are
02:49:09.980 | good enough at the current capability that they can do this
02:49:12.380 | fairly reliably. So we can apply those kinds of techniques as
02:49:15.100 | well. In any case, we have a concrete answer, and we're just
02:49:17.820 | checking solutions against it. And we can do this automatically
02:49:20.780 | with no kind of humans in the loop.
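A minimal Python sketch of those two checking styles; the \boxed{...} convention and the wording of the judge prompt are illustrative assumptions, and llm_judge is a hypothetical stand-in for a call to whatever grading model is available.

    import re

    def check_boxed(solution: str, answer: str) -> bool:
        # style 1: require the model to put its final answer in \boxed{...},
        # then just check equality against the known answer
        m = re.search(r"\\boxed\{([^}]*)\}", solution)
        return m is not None and m.group(1).strip() == answer.strip()

    def check_with_judge(solution: str, answer: str, llm_judge) -> bool:
        # style 2: ask another LLM whether the solution is consistent with the answer
        verdict = llm_judge(
            f"Reference answer: {answer}\n"
            f"Candidate solution: {solution}\n"
            "Does the solution arrive at the reference answer? Reply YES or NO."
        )
        return verdict.strip().upper().startswith("YES")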
02:49:23.820 | The problem is that we can't apply the strategy in what's called unverifiable
02:49:26.620 | domains. So usually these are, for example, creative writing
02:49:29.580 | tasks like write a joke about pelicans or write a poem or
02:49:32.620 | summarize a paragraph or something like that. In these
02:49:35.260 | kinds of domains, it becomes harder to score our different
02:49:38.780 | solutions to this problem. So for example, writing a joke
02:49:41.660 | about pelicans, we can generate lots of different jokes, of
02:49:44.380 | course, that's fine. For example, you can go to ChatGPT
02:49:47.100 | and we can get it to generate a joke about pelicans.
02:49:50.220 | So much stuff in their beaks because they don't bellican in
02:49:55.420 | backpacks. Why? Okay, we can try something else. Why don't
02:50:02.860 | pelicans ever pay for their drinks because they always bill
02:50:05.980 | it to someone else? Haha. Okay, so these models are
02:50:10.460 | obviously not very good at humor. Actually, I think it's
02:50:12.540 | pretty fascinating because I think humor is secretly very
02:50:14.780 | difficult and the models don't have the capability, I think.
02:50:17.660 | Anyway, in any case, you could imagine creating lots of jokes.
02:50:22.780 | The problem that we are facing is how do we score them? Now, in
02:50:26.300 | principle, we could, of course, get a human to look at all
02:50:29.260 | these jokes, just like I did right now. The problem with
02:50:31.980 | that is if you are doing reinforcement learning, you're
02:50:34.380 | going to be doing many thousands of updates. And for
02:50:37.420 | each update, you want to be looking at, say, thousands of
02:50:39.660 | prompts. And for each prompt, you want to be potentially
02:50:42.060 | looking at hundreds or thousands of different kinds
02:50:44.860 | of generations. And so there's just like way too many of these
02:50:48.460 | to look at. And so, in principle, you could have a
02:50:51.740 | human inspect all of them and score them and decide that,
02:50:53.900 | okay, maybe this one is funny. And maybe this one is funny. And
02:50:58.060 | this one is funny. And we could train on them to get the model
02:51:01.580 | to become slightly better at jokes, in the context of
02:51:05.100 | pelicans, at least. The problem is that it's just like way too
02:51:09.580 | much human time. This is an unscalable strategy. We need
02:51:12.300 | some kind of an automatic strategy for doing this. And
02:51:15.340 | one sort of solution to this was proposed in this paper that
02:51:19.340 | introduced what's called reinforcement learning from
02:51:21.260 | human feedback. And so this was a paper from OpenAI at the
02:51:24.220 | time. Many of these people are now co-founders in Anthropic.
02:51:27.580 | And this kind of proposed an approach for basically doing
02:51:33.100 | reinforcement learning in unverifiable domains. So let's
02:51:35.820 | take a look at how that works. So this is the cartoon diagram
02:51:39.420 | of the core ideas involved. So as I mentioned, the naive
02:51:42.540 | approach is if we just had infinity human time, we could
02:51:46.300 | just run RL in these domains just fine. So, for example, we
02:51:50.220 | can run RL as usual if I have infinity humans. I just want to
02:51:53.820 | do, and these are just cartoon numbers, I want to do 1,000
02:51:56.700 | updates where each update will be on 1,000 prompts. And for
02:52:00.780 | each prompt, we're going to have 1,000 rollouts that we're
02:52:03.820 | scoring. So we can run RL with this kind of a setup. The
02:52:08.700 | problem is, in the process of doing this,
02:52:11.580 | I would need to ask a human to evaluate a joke a total
02:52:15.180 | of 1 billion times.
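Spelling out that count: 1,000 updates x 1,000 prompts per update x 1,000 rollouts per prompt = 1,000,000,000 individual human judgments of jokes.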
02:52:18.620 | And so that's a lot of people looking at really terrible jokes. So we don't want to do that. So
02:52:21.900 | instead, we want to take the RLHF approach. So in the RLHF
02:52:26.860 | approach, the core trick is that of
02:52:29.900 | indirection. So we're going to involve humans just a little
02:52:33.900 | bit. And the way we cheat is that we basically train a whole
02:52:37.660 | separate neural network that we call a reward model. And this
02:52:41.500 | neural network will kind of like imitate human scores. So
02:52:45.340 | we're going to ask humans to score rollouts, we're going to
02:52:49.500 | then imitate human scores using a neural network. And this
02:52:54.060 | neural network will become a kind of simulator of human
02:52:56.060 | preferences. And now that we have a neural network simulator,
02:52:59.820 | we can do RL against it. So instead of asking a real human,
02:53:03.740 | we're asking a simulated human for their score of a joke as an
02:53:08.140 | example. And so once we have a simulator, we're off to the
02:53:11.820 | races because we can query it as many times as we want to. And
02:53:15.100 | it's a whole automatic process. And we can now do
02:53:17.660 | reinforcement learning with respect to the simulator. And
02:53:20.060 | the simulator, as you might expect, is not going to be a
02:53:21.980 | perfect human. But if it's at least statistically similar to
02:53:25.740 | human judgment, then you might expect that this will do
02:53:28.060 | something. And in practice, indeed, it does. So once we have
02:53:31.900 | a simulator, we can do RL and everything works great. So let
02:53:34.860 | me show you a cartoon diagram a little bit of what this process
02:53:37.980 | looks like, although the details are not 100% like super
02:53:40.780 | important, it's just a core idea of how this works. So here we
02:53:43.580 | have a cartoon diagram of a hypothetical example of what
02:53:46.140 | training the reward model would look like. So we have a prompt
02:53:49.740 | like write a joke about pelicans. And then here we have
02:53:52.460 | five separate rollouts. So these are all five different jokes,
02:53:55.340 | just like this one. Now, the first thing we're going to do
02:53:59.500 | is we are going to ask a human to order these jokes from the
02:54:04.300 | best to worst. So here, this human thought that
02:54:09.100 | this joke is the best, the funniest. So number one joke,
02:54:12.940 | this is number two joke, number three joke, four, and five. So
02:54:17.420 | this is the worst joke. We're asking humans to order instead
02:54:20.620 | of give scores directly, because it's a bit of an easier task.
02:54:23.580 | It's easier for a human to give an ordering than to give precise
02:54:26.300 | scores. Now, that is now the supervision for the model. So
02:54:30.620 | the human has ordered them. And that is kind of like their
02:54:32.780 | contribution to the training process. But now separately,
02:54:35.740 | what we're going to do is we're going to ask a reward model
02:54:38.060 | about its scoring of these jokes. Now the reward model is a
02:54:42.700 | whole separate neural network, completely separate neural net.
02:54:45.660 | And it's also probably a transformer. But it's not a
02:54:50.060 | language model in the sense that it generates diverse language,
02:54:53.340 | etc. It's just a scoring model. So the reward model will take as
02:54:57.820 | an input, the prompt, number one, and number two, a candidate
02:55:02.540 | joke. So those are the two inputs that go into the reward
02:55:06.300 | model. So here, for example, the reward model would be taking
02:55:09.020 | this prompt, and this joke. Now the output of a reward model is
02:55:13.100 | a single number. And this number is thought of as a score. And
02:55:17.100 | it can range, for example, from zero to one. So zero would be
02:55:20.300 | the worst score, and one would be the best score. So here are
02:55:23.820 | some examples of what a hypothetical reward model at
02:55:26.620 | some stage in the training process would give as scoring to
02:55:30.060 | these jokes. So 0.1 is a very low score, 0.8 is a really high
02:55:35.100 | score, and so on. And so now we compare the scores given by the
02:55:41.340 | reward model with the ordering given by the human. And there's
02:55:45.580 | a precise mathematical way to actually calculate this,
02:55:48.860 | basically set up a loss function and calculate a kind of like a
02:55:52.060 | correspondence here, and update a model based on it. But I just
02:55:55.740 | want to give you the intuition, which is that, as an example
02:55:58.780 | here, for this second joke, the human thought that it was the
02:56:02.220 | funniest, and the model kind of agreed, right? 0.8 is a relatively
02:56:05.260 | high score. But this score should have been even higher,
02:56:07.900 | right? So after an update, we would expect that maybe the
02:56:11.420 | score will actually grow after an update of
02:56:14.300 | the network to be like, say, 0.81 or something. For this one
02:56:18.860 | here, they actually are in a massive disagreement, because
02:56:21.100 | the human thought that this was number two, but here the score
02:56:24.380 | is only 0.1. And so this score needs to be much higher. So
02:56:29.180 | after an update, on top of this kind of a supervision, this
02:56:33.180 | might grow a lot more, like maybe it's 0.15 or something
02:56:35.580 | like that. And then here, the human thought that this one was
02:56:40.300 | the worst joke, but here the model actually gave it a fairly
02:56:43.340 | high number. So you might expect that after the update, this
02:56:46.940 | would come down to maybe 0.35 or something like that. So
02:56:50.060 | basically, we're doing what we did before. We're slightly
02:56:52.700 | nudging the predictions from the models using neural network
02:56:57.340 | training process. And we're trying to make the reward model
02:57:01.020 | scores be consistent with human ordering. And so as we update
02:57:06.940 | the reward model on human data, it becomes a better and better
02:57:10.300 | simulator of the scores and orders that humans provide, and
02:57:14.860 | then becomes kind of like the simulator of human
02:57:18.700 | preferences, which we can then do RL against. But critically,
02:57:22.300 | we're not asking humans 1 billion times to look at a joke.
02:57:25.180 | We're maybe looking at 1000 prompts and 5 rollouts each. So
02:57:28.380 | maybe 5000 jokes that humans have to look at in total. And
02:57:31.900 | they just give the ordering, and then we're training the model
02:57:34.140 | to be consistent with that ordering. And I'm skipping over
02:57:36.620 | the mathematical details. But I just want you to understand a
02:57:39.580 | high level idea that this reward model is basically giving us
02:57:43.900 | the scores, and we have a way of training it to be consistent
02:57:47.020 | with human orderings. And that's how RLHF works.
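A rough Python sketch of that reward model update, assuming a pairwise, Bradley-Terry-style ranking loss, which is one standard way of turning a human ordering into a differentiable objective; the actual loss in any given RLHF pipeline may differ, and the scores below are made-up numbers in the spirit of the cartoon above.

    import itertools
    import torch
    import torch.nn.functional as F

    def reward_model_loss(scores_best_to_worst: list[torch.Tensor]) -> torch.Tensor:
        # scores_best_to_worst[0] is the reward model's score for the rollout the
        # human ranked best, scores_best_to_worst[-1] for the one ranked worst.
        # For every (better, worse) pair, push the better score above the worse one.
        losses = []
        for i, j in itertools.combinations(range(len(scores_best_to_worst)), 2):
            better, worse = scores_best_to_worst[i], scores_best_to_worst[j]
            losses.append(-F.logsigmoid(better - worse))
        return torch.stack(losses).mean()

    # toy example: 5 rollouts in human order (best first); in reality each score
    # would come from a transformer reward model applied to (prompt, rollout)
    scores = [torch.tensor(s, requires_grad=True) for s in (0.8, 0.1, 0.5, 0.3, 0.6)]
    loss = reward_model_loss(scores)
    loss.backward()  # gradients nudge the scores toward agreeing with the ordering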
02:57:50.940 | Okay, so that is the rough idea. We basically train simulators of
02:57:53.980 | humans and RL with respect to those simulators. Now, I want
02:57:58.060 | to talk about first, the upside of reinforcement learning from
02:58:01.580 | human feedback. The first thing is that this allows us to run
02:58:06.540 | reinforcement learning, which we know is incredibly powerful
02:58:09.100 | kind of set of techniques. And it allows us to do it in
02:58:11.500 | arbitrary domains, and including the ones that are unverifiable.
02:58:15.500 | So things like summarization, and poem writing, joke writing,
02:58:18.700 | or any other creative writing, really, in domains outside of
02:58:21.900 | math and code, etc. Now, empirically, what we see when
02:58:25.660 | we actually apply RLHF is that this is a way to improve the
02:58:28.780 | performance of the model. And I have a guess for why that
02:58:33.740 | might be, but I don't actually know that it is like super well
02:58:36.940 | established why this is. You can empirically observe
02:58:39.900 | that when you do RLHF correctly, the models you get are just like
02:58:43.180 | a little bit better. But as to why is I think like, not as
02:58:46.460 | clear. So here's my best guess. My best guess is that this is
02:58:49.580 | possibly mostly due to the discriminator generator gap.
02:58:52.780 | What that means is that in many cases, it is significantly
02:58:57.500 | easier to discriminate than to generate for humans. So in
02:59:01.100 | particular, an example of this is when we do supervised
02:59:06.780 | fine tuning, right, SFT. We're asking humans to generate the
02:59:12.060 | ideal assistant response. And in many cases here, as I've shown
02:59:17.020 | it, the ideal response is very simple to write, but in many
02:59:20.540 | cases might not be. So for example, in summarization, or
02:59:23.500 | poem writing, or joke writing, like how are you as a human
02:59:26.460 | assistant, as a human labeler, supposed to get the ideal
02:59:30.140 | response in these cases, it requires creative human writing
02:59:33.180 | to do that. And so RLHF kind of sidesteps this, because we
02:59:37.580 | get to ask people a significantly easier question as
02:59:41.100 | data labelers: they're not asked to write poems directly,
02:59:44.540 | they're just given five poems from the model, and they're just
02:59:47.180 | asked to order them. And so that's just a much easier task
02:59:50.860 | for a human labeler to do. And so what I think this allows you
02:59:54.380 | to do basically is it kind of like allows for
02:59:59.740 | higher accuracy data, because we're not asking people to do
03:00:03.020 | the generation task, which can be extremely difficult. Like
03:00:05.900 | we're not asking them to do creative writing, we're just
03:00:08.140 | trying to get them to distinguish between creative
03:00:10.380 | writings, and find ones that are best. And that is the signal
03:00:14.860 | that humans are providing just the ordering. And that is their
03:00:17.740 | input into the system. And then the system in RLHF just
03:00:21.820 | discovers the kinds of responses that would be graded well by
03:00:26.060 | humans. And so that step of indirection allows the models to
03:00:30.380 | become even better. So that is the upside of RLHF. It allows
03:00:34.540 | us to run RL, it empirically results in better models, and
03:00:37.900 | it allows people to contribute their supervision, even without
03:00:41.900 | having to do extremely difficult tasks in the case of writing
03:00:45.500 | ideal responses. Unfortunately, RLHF also comes with
03:00:48.780 | significant downsides. And so the main one is that basically
03:00:54.620 | we are doing reinforcement learning, not with respect to
03:00:56.860 | humans and actual human judgment, but with respect to a
03:00:59.340 | lossy simulation of humans, right? And this lossy
03:01:02.380 | simulation could be misleading, because it's just a it's just a
03:01:05.180 | simulation, right? It's just a language model that's kind of
03:01:08.060 | outputting scores, and it might not perfectly reflect the
03:01:11.340 | opinion of an actual human with an actual brain in all the
03:01:14.780 | possible different cases. So that's number one. There's
03:01:17.500 | actually something even more subtle and devious going on that
03:01:19.980 | really dramatically holds back RLHF as a technique that we can
03:01:25.500 | really scale to significantly kind of smart systems. And that
03:01:31.580 | is that reinforcement learning is extremely good at discovering
03:01:34.700 | a way to game the model, to game the simulation. So this reward
03:01:39.660 | model that we're constructing here, that gives the scores,
03:01:42.380 | these models are transformers. These transformers are massive
03:01:47.260 | neural nets. They have billions of parameters, and they imitate
03:01:50.540 | humans, but they do so in a kind of like a simulation way. Now,
03:01:53.820 | the problem is that these are massive, complicated systems,
03:01:56.380 | right? There's a billion parameters here that are
03:01:58.300 | outputting a single score. It turns out that there are ways
03:02:02.700 | to game these models. You can find kinds of inputs that were
03:02:07.180 | not part of their training set. And these inputs inexplicably
03:02:11.820 | get very high scores, but in a fake way. So very often what you
03:02:16.780 | find if you run RLHF for very long, so for example, if we do
03:02:19.820 | 1000 updates, which is like say a lot of updates, you might
03:02:23.660 | expect that your jokes are getting better and that you're
03:02:25.820 | getting like real bangers about pelicans, but that's not exactly
03:02:29.260 | what happens. What happens is that in the first few hundred
03:02:33.500 | steps, the jokes about pelicans are probably improving a little
03:02:35.900 | bit. And then they actually dramatically fall off the cliff
03:02:38.620 | and you start to get extremely nonsensical results. Like for
03:02:41.980 | example, you start to get the top joke about pelicans starts
03:02:45.420 | to be the the the the the the. And this makes no sense, right?
03:02:48.940 | Like when you look at it, why should this be a top joke? But
03:02:51.180 | when you take the the the the the the and you plug it into
03:02:53.900 | your reward model, you'd expect score of zero, but actually the
03:02:57.260 | reward model loves this as a joke. It will tell you that the
03:03:01.180 | the the the the is a score of 1.0. This is a top joke and this
03:03:06.380 | makes no sense, right? But it's because these models are just
03:03:08.940 | simulations of humans and they're massive neural nets and
03:03:11.660 | you can find inputs at the bottom that kind of like get
03:03:15.180 | into the part of the input space that kind of gives you
03:03:17.100 | nonsensical results. These examples are what's called
03:03:20.220 | adversarial examples, and I'm not going to go into the topic
03:03:22.860 | too much, but these are adversarial inputs to the model.
03:03:25.980 | They are specific little inputs that kind of go between the
03:03:29.740 | nooks and crannies of the model and give nonsensical results at
03:03:32.380 | the top. Now here's what you might imagine doing. You say,
03:03:35.820 | okay, the the the is obviously not score of one. It's obviously
03:03:39.580 | a low score. So let's take the the the the the. Let's add it
03:03:42.700 | to the data set and give it an ordering that is extremely bad,
03:03:46.140 | like a score of five. And indeed, your model will learn
03:03:48.940 | that the the the the should have a very low score, and will
03:03:51.740 | give it a score of zero. The problem is that there will
03:03:54.140 | always be basically an infinite number of nonsensical
03:03:57.580 | adversarial examples hiding in the model. If you iterate this
03:04:01.500 | process many times and you keep adding nonsensical stuff to
03:04:04.140 | your reward model and giving it very low scores, you'll never
03:04:07.660 | win the game. You can do this many, many rounds, and
03:04:10.940 | reinforcement learning, if you run it long enough, will always
03:04:13.660 | find a way to game the model. It will discover adversarial
03:04:16.460 | examples. It will get really high scores with nonsensical
03:04:20.220 | results. And fundamentally, this is because our scoring
03:04:23.820 | function is a giant neural net, and RL is extremely good at
03:04:28.860 | finding just the ways to trick it. So long story short, you
03:04:35.340 | always run RLHF for maybe a few hundred updates, the model is
03:04:38.780 | getting better, and then you have to crop it and you are
03:04:41.180 | done. You can't run too much against this reward model
03:04:45.980 | because the optimization will start to game it, and you
03:04:49.420 | basically crop it and you call it and you ship it. And you can
03:04:55.340 | improve the reward model, but you kind of like come across
03:04:57.260 | these situations eventually at some point. So RLHF, basically
03:05:02.140 | what I usually say is that RLHF is not RL. And what I mean by
03:05:05.740 | that is, I mean, RLHF is RL, obviously, but it's not RL in
03:05:09.900 | the magical sense. This is not RL that you can run
03:05:12.860 | indefinitely. These kinds of problems, where you are
03:05:16.700 | getting a concrete correct answer, you cannot game as
03:05:19.900 | easily. You either got the correct answer or you didn't.
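To make this contrast concrete, here is a minimal sketch (my own illustration in PyTorch, not any lab's actual code; the model size and embeddings are stand-ins) of the two kinds of scoring functions: a learned reward model fit to the human orderings, which a strong optimizer can eventually game, versus a verifiable checker that just compares against the known answer and leaves nothing to exploit.

```python
import torch
import torch.nn as nn

# Toy "reward model": in reality this is a huge transformer; here it is a tiny
# MLP over some fixed-size embedding of a (prompt, response) pair.
class RewardModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):                    # x: [batch, dim] response embeddings
        return self.net(x).squeeze(-1)       # one scalar score per response

reward_model = RewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Human labelers only provide orderings; each ordering is broken into pairs
# (preferred, dispreferred). The pairwise (Bradley-Terry style) loss pushes
# the preferred response's score above the other one.
preferred = torch.randn(64, 128)             # stand-in embeddings of responses
dispreferred = torch.randn(64, 128)
loss = -torch.nn.functional.logsigmoid(
    reward_model(preferred) - reward_model(dispreferred)
).mean()
opt.zero_grad(); loss.backward(); opt.step()

# RLHF then maximizes reward_model(...), a giant learned function whose
# nooks and crannies an optimizer can exploit with adversarial inputs.

# Contrast: a verifiable reward, e.g. for a math problem. There is nothing
# here to game; the answer is either right or it isn't.
def verifiable_reward(model_answer: str, correct_answer: str) -> float:
    return 1.0 if model_answer.strip() == correct_answer.strip() else 0.0
```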
03:05:22.380 | And the scoring function is much, much simpler. You're just
03:05:24.540 | looking at the boxed area and seeing if the result is
03:05:27.180 | correct. So it's very difficult to game these functions, but
03:05:31.260 | gaming a reward model is possible. Now, in these
03:05:34.060 | verifiable domains, you can run RL indefinitely. You could run
03:05:37.900 | for tens of thousands, hundreds of thousands of steps and
03:05:40.220 | discover all kinds of really crazy strategies that we might
03:05:42.540 | not ever even think of, for performing really well on all
03:05:46.060 | these problems. In the game of Go, there's no way to basically
03:05:50.860 | game the winning of a game or losing of a game. We have a
03:05:54.220 | perfect simulator. We know where all the stones are placed, and
03:05:59.100 | we can calculate whether someone has won or not. There's no way
03:06:02.140 | to game that. And so you can do RL indefinitely, and you can
03:06:05.740 | eventually beat even Lee Sedol. But with models like this,
03:06:10.540 | which are gameable, you cannot repeat this process
03:06:13.340 | indefinitely. So I kind of see RLHF as not real RL because the
03:06:18.060 | reward function is gameable. So it's kind of more like in the
03:06:21.260 | realm of like little fine-tuning. It's a little
03:06:25.260 | improvement, but it's not something that is fundamentally
03:06:27.980 | set up correctly, where you can insert more compute, run for
03:06:31.500 | longer, and get much better and magical results. So it's
03:06:35.580 | not RL in that sense. It's not RL in the sense that it lacks
03:06:38.940 | magic. It can fine-tune your model and get a better
03:06:41.980 | performance. And indeed, if we go back to ChatGPT, the GPT-4o
03:06:47.020 | model has gone through RLHF because it works well, but it's
03:06:51.420 | just not RL in the same sense. RLHF is like a little fine-tune
03:06:54.940 | that slightly improves your model, is maybe like the way I
03:06:57.180 | would think about it. Okay, so that's most of the technical
03:06:59.820 | content that I wanted to cover. I took you through the three
03:07:02.780 | major stages and paradigms of training these models.
03:07:05.820 | Pre-training, supervised fine-tuning, and reinforcement
03:07:08.220 | learning. And I showed you that they loosely correspond to the
03:07:11.020 | process we already use for teaching children. And so in
03:07:14.380 | particular, we talked about pre-training being sort of like
03:07:17.100 | the basic knowledge acquisition of reading exposition,
03:07:20.380 | supervised fine-tuning being the process of looking at lots
03:07:22.940 | and lots of worked examples and imitating experts, and reinforcement
03:07:27.180 | learning being the practice problems. The only difference is that we now have to effectively
03:07:30.620 | write textbooks for LLMs and AIs across all the disciplines of
03:07:34.940 | human knowledge. And also in all the cases where we actually
03:07:38.460 | would like them to work, like code and math and basically all
03:07:42.860 | the other disciplines. So we're in the process of writing
03:07:45.020 | textbooks for them, refining all the algorithms that I've
03:07:47.900 | presented on the high level. And then of course, doing a
03:07:50.380 | really, really good job at the execution of training these
03:07:53.020 | models at scale and efficiently. So in particular, I didn't go
03:07:56.380 | into too many details, but these are extremely large and
03:07:59.900 | complicated distributed sort of jobs that have to run over
03:08:06.380 | tens of thousands or even hundreds of thousands of GPUs.
03:08:08.620 | And the engineering that goes into this is really at the
03:08:12.060 | state of the art of what's possible with computers at that
03:08:14.300 | scale. So I didn't cover that aspect too much, but this is a
03:08:21.020 | very kind of serious endeavor underlying all these very simple
03:08:24.140 | algorithms ultimately. Now, I also talked about sort of like
03:08:28.620 | the theory of mind a little bit of these models. And the thing
03:08:31.020 | I want you to take away is that these models are really good,
03:08:33.820 | and they're extremely useful as tools for your work, but you
03:08:37.180 | shouldn't sort of trust them fully. And I showed you some
03:08:39.660 | examples of that. Even though we have mitigations for
03:08:41.980 | hallucinations, the models are not perfect and they will
03:08:44.300 | hallucinate still. It's gotten better over time and it will
03:08:47.420 | continue to get better, but they can hallucinate. In
03:08:50.620 | addition to that, I covered kind of like what I
03:08:53.820 | call the Swiss cheese sort of model of LLM capabilities that
03:08:57.100 | you should have in your mind. The models are incredibly good
03:08:59.500 | across so many different disciplines, but then fail
03:09:01.580 | randomly almost in some unique cases. So for example, what is
03:09:05.580 | bigger, 9.11 or 9.9? Like the model doesn't know, but
03:09:08.700 | simultaneously it can turn around and solve Olympiad
03:09:12.140 | questions. And so this is a hole in the Swiss cheese and
03:09:15.500 | there are many of them and you don't want to trip over them.
03:09:17.900 | So don't treat these models as infallible models, check their
03:09:23.260 | work, use them as tools, use them for inspiration, use them
03:09:26.380 | for the first draft, but work with them as tools and be
03:09:30.060 | ultimately responsible for the, you know, product of your work.
03:09:34.300 | And that's roughly what I wanted to talk about. This is
03:09:39.820 | how they're trained and this is what they are. Let's now turn
03:09:42.780 | to what are some of the future capabilities of these models,
03:09:45.980 | probably what's coming down the pipe. And also where can you
03:09:47.980 | find these models? I have a few bullet points on some of the
03:09:50.780 | things that you can expect coming down the pipe. The first
03:09:53.340 | thing you'll notice is that models will very rapidly become
03:09:56.540 | multimodal. Everything I've talked about concerned
03:09:59.420 | text, but very soon we'll have LLMs that can not just handle
03:10:02.700 | text, but they can also operate natively and very easily over
03:10:06.380 | audio so they can hear and speak, and also images so they
03:10:09.660 | can see and paint. And we're already seeing the beginnings
03:10:13.100 | of all of this, but this will be all done natively inside the
03:10:17.420 | language model, and this will enable kind of like natural
03:10:19.900 | conversations. And roughly speaking, the reason that this
03:10:22.540 | is actually no different from everything we've covered above
03:10:25.180 | is that as a baseline, you can tokenize audio and images and
03:10:30.460 | apply the exact same approaches as everything that we've talked
03:10:32.780 | about above. So it's not a fundamental change. It's just
03:10:35.820 | we have to add some tokens. So as an example, for tokenizing
03:10:40.380 | audio, we can look at slices of the spectrogram of the audio
03:10:43.740 | signal and we can tokenize that and just add more tokens that
03:10:47.580 | suddenly represent audio and just add them into the context
03:10:50.460 | windows and train on them just like above. The same for images,
03:10:53.500 | we can use patches and we can separately tokenize patches, and
03:10:58.060 | then what is an image? An image is just a sequence of tokens.
03:11:01.180 | And this actually kind of works, and there's a lot of early work
03:11:04.700 | in this direction. And so we can just create streams of tokens
03:11:07.980 | that are representing audio, images, as well as text, and
03:11:10.780 | intersperse them and handle them all simultaneously in a
03:11:13.100 | single model. So that's one example of multimodality.
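As a rough sketch of the patch idea (a toy illustration only; real systems typically learn the codebook with something like a VQ-VAE-style encoder rather than using random vectors), an image becomes a sequence of discrete tokens by cutting it into patches and snapping each patch to the nearest codebook entry:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 64x64 grayscale "image" and a made-up codebook of 512 "visual tokens".
image = rng.random((64, 64))
patch = 16                                      # 16x16 patches -> 4x4 = 16 patches
codebook = rng.random((512, patch * patch))     # real systems learn this

tokens = []
for i in range(0, 64, patch):
    for j in range(0, 64, patch):
        p = image[i:i + patch, j:j + patch].reshape(-1)
        token_id = int(np.argmin(((codebook - p) ** 2).sum(axis=1)))  # nearest code
        tokens.append(token_id)

print(tokens)  # the image is now just a sequence of 16 token ids
# Audio works analogously: slice the spectrogram into frames, quantize each
# frame into an "audio token", and intersperse these with the text tokens.
```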
03:11:17.420 | Second, something that people are very interested in:
03:11:19.580 | currently most of the work is that we're handing individual
03:11:23.500 | tasks to the models on kind of like a silver platter, like
03:11:26.380 | please solve this task for me. And the model sort of like does
03:11:28.860 | this little task. But it's up to us to still sort of like
03:11:32.540 | organize a coherent execution of tasks to perform jobs. And
03:11:37.340 | the models are not yet at the capability required to do this
03:11:41.420 | in a coherent error correcting way over long periods of time.
03:11:46.060 | So they're not able to fully string together tasks to
03:11:48.380 | perform these longer running jobs. But they're getting there
03:11:51.660 | and this is improving over time. But probably what's going to
03:11:55.340 | happen here is we're going to start to see what's called
03:11:57.260 | agents, which perform tasks over time, and you
03:12:00.940 | supervise them, and you watch their work, and they come to you
03:12:04.140 | once in a while, report progress, and so on. So we're
03:12:07.100 | going to see more long running agents, tasks that don't just
03:12:10.300 | take, you know, a few seconds of response, but many tens of
03:12:12.940 | seconds, or even minutes or hours over time. But these
03:12:17.020 | models are not infallible, as we talked about above. So all
03:12:19.740 | this will require supervision. So for example, in factories,
03:12:22.620 | people talk about the human to robot ratio for automation, I
03:12:26.940 | think we're going to see something similar in the
03:12:28.380 | digital space, where we are going to be talking about human
03:12:31.180 | to agent ratios, where humans become a lot more like supervisors
03:12:34.620 | of agentic tasks in the digital domain. Next, I think
03:12:40.940 | everything's going to become a lot more pervasive and
03:12:42.620 | invisible. So it's kind of like integrated into the tools, and
03:12:46.380 | everywhere. And in addition, kind of like computer use. So
03:12:52.460 | right now, these models aren't able to take actions on your
03:12:54.700 | behalf. But I think this is a separate bullet point. If you
03:12:59.900 | saw ChatGPT launch Operator, then that's one early
03:13:03.580 | example of that where you can actually hand off control to
03:13:05.820 | the model to perform, you know, keyboard and mouse actions on
03:13:09.340 | your behalf. So that's also something that I think is
03:13:11.660 | very interesting. The last point I have here is just a
03:13:14.220 | general comment that there's still a lot of research to
03:13:16.060 | potentially do in this domain. One example of that is
03:13:19.580 | something along the lines of test time training. So remember
03:13:22.140 | that everything we've done above, and that we talked about
03:13:24.620 | has two major stages. There's first the training stage where
03:13:27.900 | we tune the parameters of the model to perform the tasks
03:13:30.700 | well. Once we get the parameters, we fix them, and
03:13:33.820 | then we deploy the model for inference. From there, the
03:13:37.020 | model is fixed, it doesn't change anymore, it doesn't
03:13:39.740 | learn from all the stuff that it's doing at test time, it's a
03:13:42.140 | fixed number of parameters. And the only thing that is
03:13:45.740 | changing is now the tokens inside the context windows. And
03:13:48.940 | so the only type of learning or test time learning that the
03:13:51.900 | model has access to is the in-context learning of its
03:13:55.500 | kind of like dynamically adjustable context window,
03:13:58.780 | depending on like what it's doing at test time. But I
03:14:02.540 | think this is still different from humans who actually are
03:14:04.460 | able to like actually learn depending on what they're
03:14:06.780 | doing, especially when you sleep, for example, like your
03:14:09.020 | brain is updating your parameters or something like
03:14:10.860 | that, right? So there's no kind of equivalent of that
03:14:13.660 | currently in these models and tools. So there's a lot of
03:14:16.380 | like more wonky ideas, I think, that are to be explored
03:14:19.020 | still. And in particular, I think this will be necessary
03:14:22.300 | because the context window is a finite and precious resource.
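To put a rough number on why the context window is such a precious resource, here is a back-of-the-envelope sketch; the model dimensions below are illustrative assumptions, not any specific model's actual specs.

```python
# During inference, every token in the context keeps its keys and values
# cached in memory for every layer (the "KV cache"), so memory grows
# linearly with context length.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # keys and values: 2 tensors of shape [seq_len, n_kv_heads, head_dim] per layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed, illustrative dimensions for a large model: 80 layers, 8 KV heads,
# head_dim 128, 16-bit values.
for seq_len in (8_000, 128_000, 1_000_000):
    gb = kv_cache_bytes(80, 8, 128, seq_len) / 1e9
    print(f"{seq_len:>9,} tokens -> ~{gb:,.0f} GB of KV cache per sequence")
# Roughly 3 GB at 8k tokens, 42 GB at 128k, and over 300 GB at a million
# tokens, which is why "just make the window longer" doesn't scale forever.
```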
03:14:25.660 | And especially once we start to tackle very long running
03:14:28.460 | multimodal tasks, and we're putting in videos, and these
03:14:31.420 | token windows will basically start to grow extremely large,
03:14:35.100 | like not thousands or even hundreds of thousands, but
03:14:37.660 | significantly beyond that. And the only trick, the only kind
03:14:41.100 | of trick we have available to us right now is to make the
03:14:43.340 | context windows longer. But I think that that approach by
03:14:46.620 | itself will not scale to actual long running tasks
03:14:49.980 | that are multimodal over time. And so I think new ideas are
03:14:53.020 | needed in some of those disciplines, in some of those
03:14:56.620 | kinds of cases where these tasks are going to
03:14:59.100 | require very long contexts. So those are some examples of
03:15:02.860 | some of the things you can expect coming down the pipe.
03:15:06.300 | Let's now turn to where you can actually kind of keep track of
03:15:09.340 | this progress, and you know, be up to date with the latest and
03:15:13.580 | greatest of what's happening in the field. So I would say the
03:15:15.660 | three resources that I have consistently used to stay up to
03:15:18.460 | date are, number one, LM Arena. So let me show you LM Arena.
03:15:22.620 | This is basically an LLM leaderboard. And it ranks all
03:15:28.460 | the top models. And the ranking is based on human comparisons.
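Roughly speaking, and as a toy sketch of the general idea rather than the Arena's exact methodology, a ranking like this can be computed with Elo-style updates over pairwise human votes:

```python
from collections import defaultdict

# Each vote records which of two anonymized models gave the better answer.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]

ratings = defaultdict(lambda: 1000.0)   # every model starts at the same rating
K = 32                                   # step size of each update

for winner, loser in votes:
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected)   # reward the winner
    ratings[loser]  -= K * (1.0 - expected)   # penalize the loser

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```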
03:15:32.780 | So humans prompt these models, and they get to judge which one
03:15:35.900 | gives a better answer. They don't know which model is which
03:15:38.460 | they're just looking at which model is the better answer. And
03:15:41.180 | you can calculate a ranking and then you get some results. And
03:15:44.300 | so what you can see here is the
03:15:46.780 | different organizations like Google, Gemini, for example,
03:15:49.020 | that produce these models. When you click on any one of these,
03:15:51.580 | it takes you to the place where that model is hosted. And then
03:15:56.220 | here we see Google is currently on top with OpenAI right behind.
03:15:59.980 | Here we see DeepSeek in position number three. Now the
03:16:03.500 | reason this is a big deal is the last column here, you see
03:16:05.820 | license, DeepSeek is an MIT licensed model. It's open
03:16:09.580 | weights, anyone can use these weights, anyone can download
03:16:12.540 | them, anyone can host their own version of Deep Seek, and they
03:16:15.900 | can use it in whatever way they like. And so it's not a
03:16:18.540 | proprietary model that you don't have access to it. It's
03:16:20.780 | basically an open weights release. And so this is kind of
03:16:24.460 | unprecedented that a model this strong was released with open
03:16:28.380 | weights. So pretty cool from the team. Next up, we have a few
03:16:32.060 | more models from Google and OpenAI. And then when you
03:16:34.380 | continue to scroll down, you're starting to see some other
03:16:36.460 | usual suspects. So XAI here, Anthropic with Sonnet here at
03:16:42.060 | number 14. And then Meta with Llama over here. So Llama, similar
03:16:51.180 | to DeepSeek, is an open weights model, but it's down here
03:16:55.580 | as opposed to up here. Now I will say that this leaderboard
03:16:58.540 | was really good for a long time. I do think that in the last few
03:17:03.820 | months, it's become a little bit gamed. And I don't trust it as
03:17:07.980 | much as I used to. I think just empirically, I feel like a lot
03:17:12.220 | of people, for example, are using Sonnet from Anthropic and
03:17:15.180 | that it's a really good model. So but that's all the way down
03:17:17.740 | here in number 14. And conversely, I think not as many
03:17:22.460 | people are using Gemini, but it's ranking really, really
03:17:24.460 | high. So I think use this as a first pass, but sort of try out
03:17:31.660 | a few of the models for your tasks and see which one
03:17:33.660 | performs better. The second thing that I would point to is
03:17:37.100 | the AI News newsletter. So AI News is not very creatively
03:17:42.300 | named, but it is a very good newsletter produced by Swyx and
03:17:45.340 | Friends. So thank you for maintaining it. And it's been
03:17:47.660 | very helpful to me because it is extremely comprehensive. So
03:17:50.460 | if you go to archives, you see that it's produced almost every
03:17:53.740 | other day. And it is very comprehensive. And some of it is
03:17:58.140 | written by humans and curated by humans, but a lot of it is
03:18:00.700 | constructed automatically with LLMs. So you'll see that these
03:18:03.580 | are very comprehensive, and you're probably not missing
03:18:05.900 | anything major, if you go through it. Of course, you're
03:18:08.860 | probably not going to go through it because it's so long. But I
03:18:11.420 | do think that these summaries all the way up top are quite
03:18:14.620 | good, and I think have some human oversight. So this has
03:18:18.140 | been very helpful to me. And the last thing I would point to
03:18:20.620 | is just X and Twitter. A lot of AI happens on X. And so I would
03:18:25.660 | just follow people who you like and trust and get all your
03:18:29.260 | latest and greatest on X as well. So those are the major
03:18:32.620 | places that have worked for me over time. And finally, a few
03:18:35.100 | words on where you can find the models, and where can you use
03:18:37.900 | them. So the first one I would say is for any of the biggest
03:18:41.020 | proprietary models, you just have to go to the website of
03:18:43.420 | that LLM provider. So for example, for OpenAI, that's
03:18:46.380 | chat.com, I believe actually works now. So that's for OpenAI.
03:18:50.780 | Now for Gemini, I think it's
03:18:55.660 | gemini.google.com, or AI Studio. I think they have two for some
03:18:59.260 | reason that I don't fully understand. No one does. For the
03:19:03.500 | open weights models like DeepSeek, Llama, etc., you have to
03:19:06.060 | go to some kind of an inference provider of LLMs. So my
03:19:08.620 | favorite one is together.ai. And I showed you
03:19:11.180 | that when you go to the playground of together.ai, then
03:19:14.060 | you can sort of pick lots of different models. And all of
03:19:16.620 | these are open models of different types. And you can
03:19:19.100 | talk to them here as an example. Now, if you'd like to
03:19:23.900 | use a base model, then this
03:19:27.980 | is where I think it's not as common to find base models,
03:19:30.220 | even on these inference providers, they are all
03:19:32.220 | targeting assistants and chat. And so I think even here, I
03:19:36.060 | couldn't see base models here. So for base
03:19:38.780 | models, I usually go to hyperbolic, because they serve
03:19:41.820 | the Llama 3.1 base. And I love that model. And you can just
03:19:46.140 | talk to it here. So as far as I know, this is a good
03:19:49.180 | place for a base model. And I wish more people hosted base
03:19:52.140 | models, because they are useful and interesting to work
03:19:54.300 | with in some cases. Finally, you can also take some of the
03:19:57.980 | models that are smaller, and you can run them locally. And
03:20:01.420 | so for example, DeepSeek, the biggest model, you're not
03:20:04.220 | going to be able to run locally on your MacBook. But there
03:20:07.100 | are smaller versions of the DeepSeek model that are what's
03:20:09.340 | called distilled. And then also, you can run these models
03:20:11.980 | at smaller precision, so not at the native precision of, for
03:20:14.700 | example, fp8 for DeepSeek, or bf16 for Llama, but much,
03:20:19.980 | much lower than that. And don't worry if you don't fully
03:20:23.980 | understand those details, but you can run smaller versions
03:20:26.300 | that have been distilled, and then at even lower precision,
03:20:29.020 | and then you can fit them on your computer. And so you can
03:20:32.940 | actually run pretty okay models on your laptop.
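The arithmetic behind "lower precision lets it fit" is simple; the sizes below are illustrative, so check the actual parameter counts of whatever model you download.

```python
# Weight memory is roughly parameter_count * bytes_per_parameter.
def weight_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("1B model", 1e9), ("8B model", 8e9), ("70B model", 70e9)]:
    print(f"{name}: ~{weight_gb(n_params, 16):.0f} GB at bf16, "
          f"~{weight_gb(n_params, 4):.1f} GB at 4-bit")
# A 70B model at bf16 is ~140 GB (no laptop will hold that), but a distilled
# 8B model quantized to 4 bits is ~4 GB and fits comfortably on a MacBook.
```

03:20:36.220 | My favorite place I usually go to for this is LM Studio, which is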
03:20:39.580 | basically an app you can get. And I think it kind of actually
03:20:42.940 | looks really ugly. And it's, I don't like that it shows you
03:20:45.500 | all these models that are basically not that useful, like
03:20:47.580 | everyone just wants to run DeepSeek. So I don't know why
03:20:49.660 | they give you these 500 different types of models,
03:20:52.140 | they're really complicated to search for. And you have to
03:20:53.980 | choose different distillations and different precisions. And
03:20:57.740 | it's all really confusing. But once you actually understand
03:20:59.980 | how it works, and that's a whole separate video, then you
03:21:02.300 | can actually load up a model. Like here, I loaded up a Llama
03:21:05.020 | 3.2 Instruct 1 billion, and you can just talk to it. So I
03:21:12.220 | asked for pelican jokes, and I can ask for another one. And it
03:21:14.860 | gives me another one, etc. All of this that happens here is
03:21:18.460 | locally on your computer. So it's not actually going
03:21:20.940 | anywhere else, to anyone else; this is running on the GPU on the
03:21:23.980 | MacBook Pro. So that's very nice. And you can then eject
03:21:27.340 | the model when you're done. And that frees up the RAM. So LM
03:21:31.020 | studio is probably like my favorite one, even though I
03:21:33.180 | think it's got a lot of UI/UX issues. And it's really
03:21:35.900 | geared towards professionals almost. But if you watch some
03:21:39.820 | videos on YouTube, I think you can figure out how to
03:21:42.220 | use this interface. So those are a few words on where to find
03:21:45.900 | them. So let me now loop back around to where we started. The
03:21:49.100 | question was, when we go to chatgpt.com, and we enter
03:21:52.700 | some kind of a query, and we hit go, what exactly is
03:21:57.100 | happening here? What are we seeing? What are we talking to?
03:22:00.540 | How does this work? And I hope that this video gave you some
03:22:04.300 | appreciation for some of the under the hood details of how
03:22:07.340 | these models are trained, and what this is that is coming
03:22:09.820 | back. So in particular, we now know that your query is taken,
03:22:13.740 | and is first chopped up into tokens. So we go to
03:22:17.340 | Tiktokenizer. And here, in the place in the sort of
03:22:22.300 | format that is for the user query, we basically put in our
03:22:27.260 | query right there. So our query goes into what we discussed
03:22:31.180 | here is the conversation protocol format, which is this
03:22:34.460 | way that we maintain conversation objects. So this
03:22:37.580 | gets inserted there. And then this whole thing ends up being
03:22:40.300 | just a token sequence, a one dimensional token sequence
03:22:42.940 | under the hood. So ChatGPT saw this token sequence. And then
03:22:47.180 | when we hit go, it basically continues appending tokens into
03:22:51.420 | this list, it continues the sequence, it acts like a token
03:22:54.460 | autocomplete. So in particular, it gave us this response. So we
03:22:58.780 | can basically just put it here, and we see the tokens that it
03:23:01.820 | continued. These are the tokens that it continued with roughly.
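As a small sketch of what that looks like under the hood (the exact special tokens differ from model to model, so the ones below are just illustrative placeholders), the whole conversation is flattened into one sequence and the model simply keeps appending to it:

```python
# Illustrative conversation-protocol rendering. Real chat models use dedicated
# special tokens (with their own ids) defined by their tokenizer; these strings
# are placeholders that just show the structure.
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def render(conversation):
    text = ""
    for message in conversation:
        text += f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
    # leave the assistant turn open: the model's only job is to continue from here
    return text + f"{IM_START}assistant\n"

prompt = render([{"role": "user", "content": "Why is the sky blue?"}])
print(prompt)
# A tokenizer turns this string into a 1-D list of token ids, and the model acts
# as a token autocomplete: it keeps sampling tokens until it emits the
# end-of-turn token, and that continuation is the response you see rendered.
```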
03:23:05.180 | Now the question becomes, okay, why are these the tokens that
03:23:10.460 | the model responded with? What are these tokens? Where are they
03:23:13.100 | coming from? What are we talking to? And how do we program the
03:23:17.100 | system? And so that's where we shifted gears. And we talked
03:23:20.380 | about the under the hood pieces of it. So the first stage of
03:23:23.980 | this process, and there are three stages is the pre training
03:23:26.300 | stage, which fundamentally has to do with just knowledge
03:23:28.780 | acquisition from the internet into the parameters of this
03:23:31.980 | neural network. And so the neural net internalizes a lot of
03:23:35.820 | knowledge from the internet. But where the personality really
03:23:38.780 | comes in, is in the process of supervised fine tuning here. And
03:23:43.420 | so what happens here is that basically a company like
03:23:47.020 | OpenAI will curate a large data set of conversations, like say
03:23:50.780 | 1 million conversations across very diverse topics. And there
03:23:54.860 | will be conversations between a human and an assistant. And
03:23:58.220 | even though there's a lot of synthetic data generation used
03:24:00.460 | throughout this entire process, and a lot of LLM help, and so
03:24:03.660 | on. Fundamentally, this is a human data curation task with
03:24:07.420 | lots of humans involved. And in particular, these humans are
03:24:10.460 | data labelers hired by OpenAI, who are given labeling
03:24:13.500 | instructions that they learn, and their task is to create
03:24:16.780 | ideal assistant responses for any arbitrary prompts. So they
03:24:21.020 | are teaching the neural network, by example, how to respond to
03:24:25.740 | prompts. So what is the way to think about what came back here?
03:24:31.500 | Like, what is this? Well, I think the right way to think
03:24:34.380 | about it is that this is the neural network simulation of a
03:24:39.020 | data labeler at OpenAI. So it's as if I gave this query to a
03:24:44.220 | data labeler at OpenAI. And this data labeler first reads
03:24:47.580 | all the labeling instructions from OpenAI, and then spends two
03:24:51.100 | hours writing up the ideal assistant response to this
03:24:54.700 | query and giving it to me. Now, we're not actually doing that,
03:24:59.900 | right? Because we didn't wait two hours. So what we're getting
03:25:02.060 | here is a neural network simulation of that process. And
03:25:05.900 | we have to keep in mind that these neural networks don't
03:25:09.180 | function like human brains do. They are different. What's easy
03:25:12.460 | or hard for them is different from what's easy or hard for
03:25:15.100 | humans. And so we really are just getting a simulation. So
03:25:18.620 | here I've shown you, this is a token stream, and this is
03:25:22.140 | fundamentally the neural network with a bunch of
03:25:24.380 | activations and neurons in between. This is a fixed
03:25:26.780 | mathematical expression that mixes inputs from tokens with
03:25:31.660 | parameters of the model, and they get mixed up and get you
03:25:35.500 | the next token in a sequence. But this is a finite amount of
03:25:38.220 | compute that happens for every single token. And so this is
03:25:41.420 | some kind of a lossy simulation of a human that is kind of like
03:25:45.980 | restricted in this way. And so whatever the humans write, the
03:25:51.100 | language model is kind of imitating on this token level
03:25:54.140 | with only this specific computation for every single
03:25:58.220 | token in a sequence. We also saw that as a result of this, and
03:26:03.660 | the cognitive differences, the models will suffer in a variety
03:26:07.020 | of ways, and you have to be very careful with their use. So for
03:26:10.620 | example, we saw that they will suffer from hallucinations, and
03:26:13.660 | we also have the sense of a Swiss cheese model of LLM
03:26:16.940 | capabilities, where basically there's like holes in the
03:26:20.060 | cheese, sometimes the models will just arbitrarily do
03:26:23.820 | something dumb. So even though they're doing lots of magical
03:26:26.780 | stuff, sometimes they just can't. So maybe you're not
03:26:29.500 | giving them enough tokens to think, and maybe they're going
03:26:32.060 | to just make stuff up because their mental arithmetic breaks.
03:26:34.380 | Maybe they are suddenly unable to count the number of letters, or
03:26:39.180 | maybe they're unable to tell you that 9.11 is smaller than 9.9,
03:26:44.300 | and it looks kind of dumb. And so it's a Swiss cheese
03:26:46.940 | capability, and we have to be careful with that. And we saw
03:26:49.260 | the reasons for that. But fundamentally, this is how we
03:26:53.420 | think of what came back. It's, again, a neural network
03:26:57.500 | simulation of a human data labeler following the labeling
03:27:03.260 | instructions at OpenAI. So that's what we're getting back.
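To restate the "fixed amount of compute per token" point in code (a toy sketch with a stand-in model rather than a real transformer), generation is just the same bounded computation applied once per emitted token, and the growing token list is the only state that changes:

```python
import random

VOCAB_SIZE = 1000   # toy vocabulary

def toy_model(tokens):
    # stand-in for a trained transformer: a fixed mathematical expression that
    # maps the current context to a score for every possible next token
    rng = random.Random(sum(tokens))          # deterministic given the context
    return [rng.random() for _ in range(VOCAB_SIZE)]

def generate(tokens, n_new):
    for _ in range(n_new):
        scores = toy_model(tokens)                               # one bounded forward pass
        next_token = max(range(VOCAB_SIZE), key=scores.__getitem__)
        tokens.append(next_token)                                # context is the only state
    return tokens

print(generate([1, 2, 3], n_new=5))
```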
03:27:06.700 | Now, I do think that things change a little bit when you
03:27:11.980 | actually go and reach for one of the thinking models, like
03:27:15.180 | o3-mini-high. And the reason for that is that GPT-4o basically
03:27:21.900 | doesn't do reinforcement learning. It does do RLHF, but
03:27:25.740 | I've told you that RLHF is not RL. There's no time for magic
03:27:30.380 | in there. It's just a little bit of a fine-tuning is the way to
03:27:33.340 | look at it. But these thinking models, they do use RL. So they
03:27:37.900 | go through this third stage of perfecting their thinking
03:27:42.140 | process and discovering new thinking strategies and
03:27:45.580 | solutions to problem-solving that look a little bit like
03:27:50.300 | your internal monologue in your head. And they practice that on
03:27:53.100 | a large collection of practice problems that companies like
03:27:56.300 | OpenAI create and curate and then make available to the LLMs.
03:28:00.620 | So when I come here and I talk to a thinking model, and I put
03:28:03.820 | in this question, what we're seeing here is not anymore just
03:28:08.300 | a straightforward simulation of a human data labeler. Like this
03:28:11.420 | is actually kind of new, unique, and interesting. And of course,
03:28:15.500 | OpenAI is not showing us the under-the-hood thinking and the
03:28:19.020 | chains of thought that are underlying the reasoning here.
03:28:22.460 | But we know that such a thing exists, and this is a summary of
03:28:25.020 | it. And what we're getting here is actually not just an
03:28:27.660 | imitation of a human data labeler. It's actually something
03:28:30.060 | that is kind of new and interesting and exciting in the
03:28:31.980 | sense that it is a function of thinking that was emergent in a
03:28:36.220 | simulation. It's not just imitating a human data labeler.
03:28:39.100 | It comes from this reinforcement learning process. And so here
03:28:42.620 | we're, of course, not giving it a chance to shine because this
03:28:45.020 | is not a mathematical or reasoning problem. This is just
03:28:47.660 | some kind of a sort of creative writing problem, roughly
03:28:50.220 | speaking. And I think it's a question, an open question, as
03:28:56.380 | to whether the thinking strategies that are developed
03:28:59.900 | inside verifiable domains transfer and are generalizable
03:29:04.380 | to other domains that are unverifiable, such as creative
03:29:07.980 | writing. The extent to which that transfer happens is
03:29:11.020 | unknown in the field, I would say. So we're not sure if we are
03:29:14.220 | able to do RL on everything that is verifiable and see the
03:29:16.940 | benefits of that on things that are unverifiable, like this
03:29:20.060 | prompt. So that's an open question. The other thing
03:29:22.700 | that's interesting is that this reinforcement learning here is
03:29:26.060 | still way too new, primordial, and nascent. So we're just
03:29:30.700 | seeing the beginnings of the hints of greatness in the
03:29:34.220 | reasoning problems. We're seeing something that is, in
03:29:37.020 | principle, capable of something like the equivalent of move 37,
03:29:40.540 | but not in the game of Go, but in open domain thinking and
03:29:45.100 | problem solving. In principle, this paradigm is capable of
03:29:48.700 | doing something really cool, new, and exciting, something
03:29:51.100 | even that no human has thought of before. In principle, these
03:29:54.700 | models are capable of analogies no human has had. So I think
03:29:58.060 | it's incredibly exciting that these models exist. But again,
03:30:00.700 | it's very early, and these are primordial models for now. And
03:30:04.780 | they will mostly shine in domains that are verifiable,
03:30:07.340 | like math, and code, etc. So very interesting to play with
03:30:11.260 | and think about and use. And then that's roughly it. I would
03:30:16.620 | say those are the broad strokes of what's available right now.
03:30:19.740 | I will say that overall, it is an extremely exciting time to
03:30:23.180 | be in the field. Personally, I use these models all the time
03:30:26.780 | daily, tens or hundreds of times because they dramatically
03:30:29.980 | accelerate my work. I think a lot of people see the same
03:30:32.220 | thing. I think we're going to see a huge amount of wealth
03:30:34.460 | creation as a result of these models. Be aware of some of
03:30:38.060 | their shortcomings. Even with RL models, they're going to
03:30:41.820 | suffer from some of these. Use it as a tool in a toolbox.
03:30:45.500 | Don't trust it fully, because they will randomly do dumb
03:30:48.780 | things. They will randomly hallucinate. They will randomly
03:30:51.740 | skip over some mental arithmetic and not get it right.
03:30:53.820 | They randomly can't count or something like that. So use
03:30:57.420 | them as tools in the toolbox, check their work, and own the
03:31:00.060 | product of your work. But use them for inspiration, for
03:31:03.340 | first draft, ask them questions, but always check and verify,
03:31:08.140 | and you will be very successful in your work if you do so.
03:31:10.860 | So I hope this video was useful and interesting to you. I hope
03:31:14.700 | you had fun. And it's already, like, very long, so I apologize
03:31:18.380 | for that. But I hope it was useful. And yeah, I will see
03:31:21.100 | you later.