Hi everyone, so I've wanted to make this video for a while. It is a comprehensive but general audience introduction to large language models like ChatGPT and what I'm hoping to achieve in this video is to give you kind of mental models for thinking through what it is that this tool is.
It's obviously magical and amazing in some respects. It's really good at some things, not very good at other things, and there's also a lot of sharp edges to be aware of. So what is behind this text box? You can put anything in there and press enter, but what should we be putting there and what are these words generated back?
How does this work and what are you talking to exactly? So I'm hoping to get at all those topics in this video. We're gonna go through the entire pipeline of how this stuff is built, but I'm going to keep everything sort of accessible to a general audience. So let's take a look at first how you build something like ChatGPT, and along the way I'm gonna talk about some of the cognitive, psychological implications of these tools.
Okay so let's build ChatGPT. So there's going to be multiple stages arranged sequentially. The first stage is called the pre-training stage and the first step of the pre-training stage is to download and process the internet. Now to get a sense of what this roughly looks like, I recommend looking at this URL here.
So this company called Hugging Face collected and created and curated this dataset called FineWeb and they go into a lot of detail in this blog post on how they constructed the FineWeb dataset and all of the major LLM providers like OpenAI, Anthropic and Google and so on will have some equivalent internally of something like the FineWeb dataset.
So roughly what are we trying to achieve here? We're trying to get a ton of text from the internet, from publicly available sources. So we're trying to have a huge quantity of very high quality documents and we also want very large diversity of documents because we want to have a lot of knowledge inside these models.
So we want large diversity of high quality documents and we want many many of them. And achieving this is quite complicated and as you can see here it takes multiple stages to do well. So let's take a look at what some of these stages look like in a bit.
For now I'd just like to note that, for example, the FineWeb dataset, which is fairly representative of what you would see in a production-grade application, actually ends up being only about 44 terabytes of disk space. You can get a USB stick for a terabyte very easily, and I think this could almost fit on a single hard drive today.
So this is not a huge amount of data at the end of the day even though the internet is very very large we're working with text and we're also filtering it aggressively so we end up with about 44 terabytes in this example. So let's take a look at kind of what this data looks like and what some of these stages also are.
So the starting point for a lot of these efforts and something that contributes most of the data by the end of it is data from Common Crawl. So Common Crawl is an organization that has been basically scouring the internet since 2007. So as of 2024 for example Common Crawl has indexed 2.7 billion web pages and they have all these crawlers going around the internet and what you end up doing basically is you start with a few seed web pages and then you follow all the links and you just keep following links and you keep indexing all the information and you end up with a ton of data of the internet over time.
So this is usually the starting point for a lot of these efforts. Now this Common Crawl data is quite raw and is filtered in many many different ways. So here they document - this is the same diagram - they document a little bit the kind of processing that happens in these stages.
So the first thing here is something called URL filtering. What that is referring to is that there are block lists of URLs or domains that you don't want to be getting data from. So usually this includes things like malware websites, spam websites, marketing websites, racist websites, adult sites and things like that.
So there's a ton of different types of websites that are just eliminated at this stage because we don't want them in our data set. The second part is text extraction. You have to remember that all these web pages - this is the raw HTML of these web pages that are being saved by these crawlers.
So when I go to inspect here, this is what the raw HTML actually looks like. You'll notice that it's got all this markup like lists and stuff like that and there's CSS and all this kind of stuff. So this is computer code almost for these web pages but what we really want is we just want this text right?
We just want the text of this web page and we don't want the navigation and things like that. So there's a lot of filtering and processing and heuristics that go into adequately filtering for just the good content of these web pages. The next stage here is language filtering. So for example, FineWeb filters using a language classifier.
They try to guess what language every single web page is in, and then they only keep web pages that are classified as English with more than 65% confidence, as an example. And so you can get a sense that this is a design decision that different companies can take for themselves.
What fraction of all different types of languages are we going to include in our data set? Because for example, if we filter out all of the Spanish as an example, then you might imagine that our model later will not be very good at Spanish because it's just never seen that much data of that language.
And so different companies can focus on multilingual performance to a different degree, as an example. So FineWeb is quite focused on English, and so their language model, if they end up training one later, will be very good at English but maybe not very good at other languages. After language filtering, there's a few other filtering steps and deduplication and things like that.
Finishing with, for example, the PII removal. This is personally identifiable information. So as an example, addresses, social security numbers, and things like that. You would try to detect them and you would try to filter out those kinds of webpages from the data set as well. So there's a lot of stages here and I won't go into full detail but it is a fairly extensive part of the pre-processing and you end up with, for example, the FineWeb data set.
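To make the shape of this pipeline a bit more concrete, here is a highly simplified sketch in Python. To be clear, this is my own illustration, not FineWeb's actual code: the block list, the English-probability classifier, and the PII detector are stand-ins you would plug in (FineWeb uses things like a fastText language classifier for the language step).

```python
def keep_page(url, text, blocked_domains, english_prob, contains_pii):
    # URL filtering: drop pages from domains on the block list
    domain = url.split("/")[2] if "//" in url else url
    if domain in blocked_domains:
        return False
    # language filtering: keep pages classified as mostly English (e.g. score > 0.65)
    if english_prob(text) < 0.65:
        return False
    # PII removal: drop pages that appear to contain personal information
    if contains_pii(text):
        return False
    return True

# toy usage with trivial stand-in classifiers
print(keep_page(
    "https://example.com/post",
    "some english text",
    blocked_domains={"spam.example"},
    english_prob=lambda t: 0.9,
    contains_pii=lambda t: False,
))  # -> True
```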
So when you click in on it, you can see some examples here of what this actually ends up looking like, and anyone can download this on the Hugging Face web page. And so here's some examples of the final text that ends up in the training set. So this is some article about tornadoes in 2012.
So there's some tornadoes in 2012 and what happened. This next one is something about... "Did you know you have two little yellow 9-volt battery-sized adrenal glands in your body?" Okay, so this is some kind of an odd medical article. So just think of these as basically web pages on the internet, filtered just for the text in various ways.
And now we have a ton of text, 40 terabytes of it, and that now is the starting point for the next step of this stage. Now I wanted to give you an intuitive sense of where we are right now. So I took the first 200 web pages here, and remember we have tons of them, and I just take all that text and I just put it all together, concatenate it.
And so this is what we end up with. We just get this just raw text, raw internet text, and there's a ton of it even in these 200 web pages. So I can continue zooming out here, and we just have this like massive tapestry of text data. And this text data has all these patterns, and what we want to do now is we want to start training neural networks on this data so the neural networks can internalize and model how this text flows, right?
So we just have this giant tapestry of text, and now we want to get neural nets that mimic it. Okay, now before we plug text into neural networks, we have to decide how we're going to represent this text and how we're going to feed it in. Now the way our technology works for these neural nets is that they expect a one-dimensional sequence of symbols, and they want a finite set of symbols that are possible.
And so we have to decide what are the symbols, and then we have to represent our data as a one-dimensional sequence of those symbols. So right now what we have is a one-dimensional sequence of text. It starts here, and it goes here, and then it comes here, etc. So this is a one-dimensional sequence, even though on my monitor of course it's laid out in a two-dimensional way, but it goes from left to right and top to bottom, right?
So it's a one-dimensional sequence of text. Now this being computers, of course, there's an underlying representation here. So if I do what's called UTF-8 encode this text, then I can get the raw bits that correspond to this text in the computer. And that looks like this. So it turns out that, for example, this very first bar here is the first eight bits as an example.
So what is this thing, right? This is a representation that we are looking for, in a certain sense. We have exactly two possible symbols, 0 and 1, and we have a very long sequence of it, right? Now as it turns out, this sequence length is actually going to be a very finite and precious resource in our neural network, and we actually don't want extremely long sequences of just two symbols.
Instead what we want is we want to trade off this symbol size of this vocabulary, as we call it, and the resulting sequence length. So we don't want just two symbols and extremely long sequences. We're going to want more symbols and shorter sequences. Okay, so one naive way of compressing or decreasing the length of our sequence here is to basically consider some group of consecutive bits, for example 8 bits, and group them into a single what's called byte.
So because these bits are either on or off, if we take a group of eight of them, there turns out to be only 256 possible combinations of how these bits could be on or off. And so therefore we can re-represent the sequence into a sequence of bytes instead. So this sequence of bytes will be 8 times shorter, but now we have 256 possible symbols.
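If you want to see this for yourself, here is a tiny Python illustration of the same idea, using Python's built-in UTF-8 encoding (the example string is mine):

```python
text = "hello world"
data = text.encode("utf-8")                      # the raw bytes of this text

bits = "".join(f"{b:08b}" for b in data)         # the same data as a sequence of 0s and 1s
print(bits[:16])                                 # first 16 bits: 0110100001100101
print(list(data))                                # the same data as byte IDs, each between 0 and 255
print(len(bits), "bits vs", len(data), "bytes")  # the byte sequence is 8 times shorter
```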
So every number here goes from 0 to 255. Now I really encourage you to think of these not as numbers, but as unique IDs, or like unique symbols. So maybe it's better to actually think of it by replacing every one of these with a unique emoji.
You'd get something like this. So we basically have a sequence of emojis, and there's 256 possible emojis. You can think of it that way. Now it turns out that in production, for state-of-the-art language models, you actually want to go even beyond this. You want to continue to shrink the length of the sequence, because again it is a precious resource, in return for more symbols in your vocabulary.
And the way this is done is by running what's called the byte pair encoding algorithm. And the way this works is we're basically looking for consecutive bytes, or symbols, that are very common. So for example, it turns out that the sequence 116 followed by 32 is quite common and occurs very frequently.
So what we're going to do is we're going to group this pair into a new symbol. So we're going to mint a symbol with an ID 256, and we're going to rewrite every single pair, 116, 32, with this new symbol. And then we can iterate this algorithm as many times as we wish.
And each time when we mint a new symbol, we're decreasing the length and we're increasing the symbol size. And in practice, it turns out that a pretty good setting of, basically, the vocabulary size turns out to be about 100,000 possible symbols. So in particular, GPT-4 uses 100,277 symbols. And this process of converting from raw text into these symbols, or as we call them, tokens, is the process called tokenization.
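To make the merge step concrete, here is a toy sketch of one byte pair encoding iteration in Python. This is just my illustration of the idea from above, not the production algorithm: it finds the most frequent adjacent pair of symbols and mints a new symbol ID for it.

```python
from collections import Counter

def merge_most_common_pair(ids, new_id):
    # find the most frequent adjacent pair of symbols in the sequence
    pairs = Counter(zip(ids, ids[1:]))
    if not pairs:
        return ids
    (a, b), _ = pairs.most_common(1)[0]
    # rewrite every occurrence of that pair with the newly minted symbol
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == a and ids[i + 1] == b:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("the cat sat on the mat".encode("utf-8"))  # start from raw bytes (0..255)
ids = merge_most_common_pair(ids, 256)                # the first new symbol gets ID 256
ids = merge_most_common_pair(ids, 257)                # and we can iterate as many times as we wish
print(len(ids))  # the sequence gets shorter as the vocabulary grows
```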
So let's now take a look at how GPT-4 performs tokenization, converting from text to tokens and from tokens back to text, and what this actually looks like. So one website I like to use to explore these token representations is called Tiktokenizer. And so come here to the drop-down and select cl100k_base, which is the GPT-4 base model tokenizer.
And here on the left, you can put in text, and it shows you the tokenization of that text. So for example, "hello world". So "hello world" turns out to be exactly two tokens. The token "hello", which is the token with ID 15339, and the token "space world", that is the token 1917.
So "hello space world". Now if I was to join these two, for example, I'm gonna get again two tokens, but it's the token "h" followed by the token "hello world", without the "h". If I put in two spaces here between "hello" and "world", it's again a different tokenization. There's a new token "220" here.
Okay, so you can play with this and see what happens here. Also keep in mind this is case sensitive, so if this is a capital "H", it is something else. Or if it's "Hello World", then this actually ends up being three tokens instead of just two tokens. Yeah, so you can play with this and get a sort of intuitive sense of how these tokens work.
We're actually going to loop around to tokenization a bit later in the video. For now I just wanted to show you the website, and I wanted to show you that this text basically, at the end of the day, so for example if I take one line here, this is what GPT-4 will see it as.
So this text will be a sequence of length 62. This is the sequence here, and this is how the chunks of text correspond to these symbols. And again, there are 100,277 possible symbols, and we now have one-dimensional sequences of those symbols. So yeah, we're gonna come back to tokenization, but that's for now where we are.
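If you'd rather poke at this from code than from the website, the same tokenizer is available through OpenAI's tiktoken library; a small example (the printed IDs match the ones we saw above):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the GPT-4 base model tokenizer
print(enc.encode("hello world"))             # [15339, 1917] -> "hello", " world"
print(enc.decode([15339, 1917]))             # "hello world" -> tokens map back to text
print(enc.encode("Hello World"))             # different IDs: tokenization is case sensitive
print(enc.n_vocab)                           # 100277 possible tokens
```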
Okay, so what I've done now is I've taken this sequence of text that we have here in the dataset, and I have re-represented it using our tokenizer into a sequence of tokens. And this is what that looks like now. So for example, when we go back to the FineWeb dataset, they mentioned that not only is this 44 terabytes of disk space, but it is about a 15 trillion token sequence.
And so here, these are just some of the first one or two or three or a few thousand here, I think, tokens of this dataset, but there's 15 trillion here to keep in mind. And again, keep in mind one more time that all of these represent little text chunks.
They're all just like atoms of these sequences, and the numbers here don't make any sense. They're just unique IDs. Okay, so now we get to the fun part, which is the neural network training. And this is where a lot of the heavy lifting happens computationally when you're training these neural networks.
So what we do here in this step is we want to model the statistical relationships of how these tokens follow each other in the sequence. So what we do is we come into the data, and we take windows of tokens. So we take a window of tokens from this data fairly randomly, and the window's length can range anywhere between zero tokens, actually, all the way up to some maximum size that we decide on.
So for example, in practice you could see token windows of, say, 8,000 tokens. Now, in principle, we can use arbitrary window lengths of tokens, but processing very long window sequences would just be very computationally expensive. So we just kind of decide that, say, 8,000 is a good number, or 4,000, or 16,000, and we crop it there.
Now, in this example, I'm going to be taking the first four tokens just so everything fits nicely. So we're going to take a window of four tokens, bar, "View", "ing", and " Single", which are these token IDs. And now what we're trying to do here is we're trying to basically predict the token that comes next in the sequence.
So 3962 comes next, right? So what we do now here is that we call this the context. These four tokens are context, and they feed into a neural network. And this is the input to the neural network. Now, I'm going to go into the detail of what's inside this neural network in a little bit.
For now, what's important to understand is the input and the output of the neural net. The inputs are sequences of tokens of variable length, anywhere between 0 and some maximum size, like 8,000. The output now is a prediction for what comes next. So because our vocabulary has 100,277 possible tokens, the neural network is going to output exactly that many numbers.
And all of those numbers correspond to the probability of that token as coming next in the sequence. So it's making guesses about what comes next. In the beginning, this neural network is randomly initialized. So we're going to see in a little bit what that means. But it's a random transformation.
So these probabilities in the very beginning of the training are also going to be kind of random. So here I have three examples, but keep in mind that there's 100,000 numbers here. So the probability of this token, space direction, the neural network is saying that this is 4% likely right now.
The token 11799 is 2%. And then here, the probability of 3962, which is "post", is 3%. Now, of course, we've sampled this window from our dataset, so we know what comes next. We know, and that's the label: we know that the correct answer is that 3962 actually comes next in the sequence.
So now what we have is this mathematical process for doing an update to the neural network. We have a way of tuning it. And we're going to go into a little bit of detail in a bit. But basically, we know that this probability here of 3%, we want this probability to be higher, and we want the probabilities of all the other tokens to be lower.
And so we have a way of mathematically calculating how to adjust and update the neural network so that the correct answer has a slightly higher probability. So if I do an update to the neural network now, the next time I feed this particular sequence of four tokens into the neural network, the neural network will be slightly adjusted now and it will say, okay, post is maybe 4%, and case now maybe is 1%.
And direction could become 2% or something like that. And so we have a way of nudging, of slightly updating the neural net to basically give a higher probability to the correct token that comes next in the sequence. And now we just have to remember that this process happens not just for this token here, where these four fed in and predicted this one.
This process happens at the same time for all of these tokens in the entire data set. And so in practice, we sample little windows, little batches of windows, and then at every single one of these tokens, we want to adjust our neural network so that the probability of that token becomes slightly higher.
And this all happens in parallel in large batches of these tokens. And this is the process of training the neural network. It's a sequence of updates so that its predictions match up with the statistics of what actually happens in your training set, and its probabilities become consistent with the statistical patterns of how these tokens follow each other in the data.
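For readers who like code, here is a heavily simplified sketch of one such update step, written in PyTorch. The `model` here is a stand-in for any neural network that maps a window of token IDs to scores (logits) over the vocabulary; the function and variable names are my own, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens, window_len=8000):
    # sample a random window of tokens from the dataset
    i = torch.randint(0, len(tokens) - window_len - 1, (1,)).item()
    window = tokens[i : i + window_len + 1]
    context = window[:-1].unsqueeze(0)   # shape (1, window_len): the input tokens
    targets = window[1:].unsqueeze(0)    # the correct "next token" at every position

    logits = model(context)              # (1, window_len, vocab_size) scores for what comes next
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

    optimizer.zero_grad()
    loss.backward()                      # compute how to nudge every parameter
    optimizer.step()                     # nudge them: correct tokens get slightly higher probability
    return loss.item()
```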
So let's now briefly get into the internals of these neural networks just to give you a sense of what's inside. So neural network internals. So as I mentioned, we have these inputs that are sequences of tokens. In this case, this is four input tokens, but this can be anywhere between zero up to, let's say, a thousand tokens.
In principle, this can be an infinite number of tokens. We just, it would just be too computationally expensive to process an infinite number of tokens. So we just crop it at a certain length, and that becomes the maximum context length of that model. Now these inputs X are mixed up in a giant mathematical expression together with the parameters or the weights of these neural networks.
So here I'm showing six example parameters and their setting. But in practice, these modern neural networks will have billions of these parameters. And in the beginning, these parameters are completely randomly set. Now with a random setting of parameters, you might expect that this neural network would make random predictions, and it does.
In the beginning, it's totally random predictions. But through this process of iteratively updating the network, which we call training the neural network, the setting of these parameters gets adjusted such that the outputs of our neural network become consistent with the patterns seen in our training set.
So think of these parameters as kind of like knobs on a DJ set, and as you're twiddling these knobs, you're getting different predictions for every possible token sequence input. And training a neural network just means discovering a setting of parameters that seems to be consistent with the statistics of the training set.
Now let me just give you an example of what this giant mathematical expression looks like, just to give you a sense. And modern networks are massive expressions with trillions of terms probably. But let me just show you a simple example here. It would look something like this. I mean, these are the kinds of expressions, just to show you that it's not very scary.
We have inputs x, like x1, x2, in this case two example inputs, and they get mixed up with the weights of the network, w0, w1, w2, w3, etc. And this mixing is simple things like multiplication, addition, exponentiation, division, etc. And it is the subject of neural network architecture research to design effective mathematical expressions that have a lot of kind of convenient characteristics.
They are expressive, they're optimizable, they're parallelizable, etc. And so, but at the end of the day, these are not complex expressions, and basically they mix up the inputs with the parameters to make predictions, and we're optimizing the parameters of this neural network so that the predictions come out consistent with the training set.
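Just to make that concrete, here is one tiny made-up expression of this flavor in Python: two inputs get mixed with four weights through multiplications, additions, and an exponential, and out comes a single number. Real networks are just this kind of thing repeated billions of times.

```python
import math

def tiny_neuron(x1, x2, w0, w1, w2, w3):
    # mix the inputs with the weights: multiply, add, then squash with a sigmoid
    z = w0 + w1 * x1 + w2 * x2 + w3 * (x1 * x2)
    return 1.0 / (1.0 + math.exp(-z))   # output squashed between 0 and 1

print(tiny_neuron(0.5, -1.2, 0.1, 0.4, -0.3, 0.2))   # twiddle the w's and the prediction changes
```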
Now, I would like to show you an actual production-grade example of what these neural networks look like. So for that, I encourage you to go to this website that has a very nice visualization of one of these networks. So this is what you will find on this website, and this neural network here that is used in production settings has this special kind of structure.
This network is called the transformer, and this particular one, as an example, has roughly 85,000 parameters. Now, here on the top, we take the inputs, which are the token sequences, and then information flows through the neural network until the output, which here is the logits and the softmax; these are the predictions for what comes next, what token comes next.
And then here, there's a sequence of transformations, and all these intermediate values that get produced inside this mathematical expression as it is sort of predicting what comes next. So as an example, these tokens are embedded into kind of like this distributed representation, as it's called. So every possible token has kind of like a vector that represents it inside the neural network.
So first, we embed the tokens, and then those values kind of flow through this diagram, and these are all very simple mathematical expressions individually. So we have layer norms, and matrix multiplications, and softmaxes, and so on. So here's kind of the attention block of this transformer, and then information flows through into the multi-layer perceptron block, and so on.
And all these numbers here, these are the intermediate values of the expression, and you can almost think of these as kind of like the firing rates of these synthetic neurons. But I would caution you not to think of it too much like neurons, because these are extremely simple neurons compared to the neurons you would find in your brain.
Your biological neurons are very complex dynamical processes that have memory, and so on. There's no memory in this expression. It's a fixed mathematical expression from input to output with no memory; it's just stateless. So these are very simple neurons in comparison to biological neurons, but you can still kind of loosely think of this as a synthetic piece of brain tissue, if you like to think about it that way.
So information flows through, all these neurons fire, until we get to the predictions. Now I'm not actually going to dwell too much on the precise mathematical details of all these transformations. Honestly, I don't think it's that important to get into. What's really important to understand is that this is a mathematical function.
It is parameterized by some fixed set of parameters, let's say 85,000 of them, and it is a way of transforming inputs into outputs. And as we twiddle the parameters we are getting different kinds of predictions, and then we need to find a good setting of these parameters so that the predictions sort of match up with the patterns seen in training set.
So that's the transformer. Okay, so I've shown you the internals of the neural network, and we talked a bit about the process of training it. I want to cover one more major stage of working with these networks, and that is the stage called inference. So in inference what we're doing is we're generating new data from the model, and so we want to basically see what kind of patterns it has internalized in the parameters of its network.
So to generate from the model is relatively straightforward. We start with some tokens that are basically your prefix, like what you want to start with. So say we want to start with the token 91. Well, we feed it into the network, and remember that network gives us probabilities, right?
It gives us this probability vector here. So what we can do now is we can basically flip a biased coin. So we can sample basically a token based on this probability distribution. So the tokens that are given high probability by the model are more likely to be sampled when you flip this biased coin.
You can think of it that way. So we sample from the distribution to get a single unique token. So for example, token 860 comes next. So 860 in this case, when we're generating from the model, could come next. Now 860 is a relatively likely token; it might not be the only possible token in this case.
There could be many other tokens that could have been sampled, but we could see that 860 is a relatively likely token as an example, and indeed in our training example here, 860 does follow 91. So let's now say that we continue the process. So after 91 there's 860. We append it, and we again ask what is the third token.
Let's sample, and let's just say that it's 287 exactly as here. Let's do that again. We come back in. Now we have a sequence of three, and we ask what is the likely fourth token, and we sample from that and get this one. And now let's say we do it one more time.
We take those four, we sample, and we get this one. And this 13659, this is not actually 3962 as we had before. So this token is the token article instead, so viewing a single article. And so in this case we didn't exactly reproduce the sequence that we saw here in the training data.
So keep in mind that these systems are stochastic. We're sampling, and we're flipping coins, and sometimes we luck out and we reproduce some like small chunk of the text in a training set, but sometimes we're getting a token that was not verbatim part of any of the documents in the training data.
So we're going to get sort of like remixes of the data that we saw in the training, because at every step of the way we can flip and get a slightly different token, and then once that token makes it in, if you sample the next one and so on, you very quickly start to generate token streams that are very different from the token streams that occur in the training documents.
So statistically they will have similar properties, but they are not identical to training data. They're kind of like inspired by the training data. And so in this case we got a slightly different sequence. And why would we get article? You might imagine that article is a relatively likely token in the context of bar, viewing, single, etc.
And you could imagine that the word article followed this context window somewhere in the training documents to some extent, and we just happened to sample it here at that stage. So basically, inference is just predicting from these distributions one token at a time. We continue feeding back tokens and getting the next one, and we're always flipping these coins, and depending on how lucky or unlucky we get, we might get very different kinds of patterns depending on how we sample from these probability distributions.
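Here is what that loop looks like as a bare-bones sketch in PyTorch. Again, `model` is a stand-in for a trained network that returns next-token logits; the point is just the feed-in, sample, append, repeat structure.

```python
import torch

def generate(model, prefix_tokens, num_new_tokens=128):
    tokens = list(prefix_tokens)
    for _ in range(num_new_tokens):
        context = torch.tensor(tokens).unsqueeze(0)      # (1, len): the sequence so far
        logits = model(context)[0, -1]                   # scores for the very next token
        probs = torch.softmax(logits, dim=-1)            # the probability distribution
        next_token = torch.multinomial(probs, 1).item()  # flip the biased coin
        tokens.append(next_token)                        # feed it back in and continue
    return tokens
```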
So that's inference. So in most common scenarios, basically downloading the internet and tokenizing it is a pre-processing step. You do that a single time. And then once you have your token sequence, we can start training networks. And in practical cases you would try to train many different networks of different kinds of settings, and different kinds of arrangements, and different kinds of sizes.
And so you'd be doing a lot of neural network training, and then once you have a neural network and you train it, and you have some specific set of parameters that you're happy with, then you can take the model and you can do inference, and you can actually generate data from the model.
And when you're on ChatGPT and you're talking with a model, that model was trained by OpenAI, probably many months ago, and they have a specific set of weights that work well, and when you're talking to the model, all of that is just inference. There's no more training.
Those parameters are held fixed, and you're just talking to the model, sort of. You're giving it some of the tokens, and it's kind of completing token sequences, and that's what you're seeing generated when you actually use the model on ChatGPT. So that model then just does inference alone. So let's now look at an example of training and inference that is kind of concrete, and gives you a sense of what this actually looks like when these models are trained.
Now the example that I would like to work with, and that I am particularly fond of, is that of OpenAI's GPT-2. So GPT stands for Generative Pre-trained Transformer, and this is the second iteration of the GPT series by OpenAI. When you are talking to ChatGPT today, the model that is underlying all of the magic of that interaction is GPT-4, so the fourth iteration of that series.
Now GPT-2 was published in 2019 by OpenAI in this paper that I have right here, and the reason I like GPT-2 is that it is the first time that a recognizably modern stack came together. So all of the pieces of GPT-2 are recognizable today by modern standards; it's just that everything has gotten bigger.
Now I'm not going to be able to go into the full details of this paper, of course, because it is a technical publication, but some of the details that I would like to highlight are as follows. GPT-2 was a transformer neural network, just like the neural networks you would work with today.
It had 1.6 billion parameters, right? So these are the parameters that we looked at here; it would have 1.6 billion of them. Today, modern transformers would have a lot closer to a trillion, or several hundred billion, probably. The maximum context length here was 1,024 tokens, so when we are sampling windows of tokens from the dataset, we never take more than 1,024 tokens, and when you are trying to predict the next token in a sequence, you will never have more than 1,024 tokens of context in order to make that prediction.
Now, this is also tiny by modern standards. Today, the context lengths would be a lot closer to a couple hundred thousand or maybe even a million, and so you have a lot more context, a lot more tokens in history, and you can make a lot better prediction about the next token in a sequence in that way.
And finally, GPT-2 was trained on approximately 100 billion tokens, and this is also fairly small by modern standards. As I mentioned, the FineWeb dataset that we looked at here has 15 trillion tokens, so 100 billion is quite small. Now, I actually tried to reproduce GPT-2 for fun as part of this project called llm.c, so you can see my write-up of doing that in this post on GitHub under the llm.c repository.
So in particular, the cost of training GPT2 in 2019 was estimated to be approximately $40,000, but today you can do significantly better than that, and in particular, here it took about one day and about $600. But this wasn't even trying too hard. I think you could really bring this down to about $100 today.
Now, why is it that the costs have come down so much? Well, number one, these data sets have gotten a lot better, and the way we filter them, extract them, and prepare them has gotten a lot more refined, and so the data set is of just a lot higher quality, so that's one thing.
But really, the biggest difference is that our computers have gotten much faster in terms of the hardware, and we're going to look at that in a second, and also the software for running these models and squeezing all the possible speed out of the hardware has gotten much better, as everyone has focused on these models and tried to run them very, very quickly.
Now, I'm not going to be able to go into the full detail of this GPT-2 reproduction, and this is a long technical post, but I would like to still give you an intuitive sense for what it looks like to actually train one of these models as a researcher. Like, what are you looking at, and what does it look like, what does it feel like?
So let me give you a sense of that a little bit. Okay, so this is what it looks like. Let me slide this over. So what I'm doing here is I'm training a GPT-2 model right now, and what's happening here is that every single line here, like this one, is one update to the model.
So remember how here we are basically making the prediction better for every one of these tokens, and we are updating these weights or parameters of the neural net. So here, every single line is one update to the neural network, where we change its parameters by a little bit so that it is better at predicting the next token in the sequence.
In particular, every single line here is improving the prediction on 1 million tokens in the training set. So we've basically taken 1 million tokens out of this data set, and we've tried to improve the prediction of that token as coming next in a sequence on all 1 million of them simultaneously.
And at every single one of these steps, we are making an update to the network for that. Now the number to watch closely is this number called loss, and the loss is a single number that is telling you how well your neural network is performing right now, and it is created so that low loss is good.
So you'll see that the loss is decreasing as we make more updates to the neural net, which corresponds to making better predictions on the next token in a sequence. And so the loss is the number that you are watching as a neural network researcher, and you are kind of waiting, you're twiddling your thumbs, you're drinking coffee, and you're making sure that this looks good so that with every update your loss is improving and the network is getting better at prediction.
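Concretely, the loss here is typically the average negative log probability that the network assigned to the correct next token (the same cross-entropy idea as in the training sketch earlier); a toy calculation:

```python
import math

# toy values: the probability the model gave to the correct next token at three positions
probs_of_correct_token = [0.03, 0.20, 0.01]
loss = -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)
print(loss)   # shrinks toward 0 as the correct tokens get probability closer to 1
```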
Now here you see that we are processing 1 million tokens per update. Each update takes about 7 seconds roughly, and here we are going to process a total of 32,000 steps of optimization. So 32,000 steps with 1 million tokens each is about 33 billion tokens that we are going to process, and we are currently only at step 420 out of 32,000, so we are still only a bit more than 1% done, because I've only been running this for 10 or 15 minutes or something like that.
Now every 20 steps I have configured this optimization to do inference. So what you're seeing here is the model is predicting the next token in a sequence, and so you sort of start it randomly, and then you continue plugging in the tokens. So we're running this inference step, and this is the model sort of predicting the next token in a sequence, and every time you see something appear, that's a new token.
So let's just look at this, and you can see that this is not yet very coherent, and keep in mind that this is only 1% of the way through training, and so the model is not yet very good at predicting the next token in the sequence. So what comes out is actually kind of a little bit of gibberish, right, but it still has a little bit of like local coherence.
"So since she is mine, it's a part of the information, should discuss my father, great companions, Gordon showed me sitting over it," etc. So I know it doesn't look very good, but let's actually scroll up and see what it looked like when I started the optimization. So all the way up here, after just 20 steps of optimization, you see that what we're getting looks completely random, and of course that's because the model has only had 20 updates to its parameters, and so it's giving you random text because it's a random network.
And so you can see that, at least in comparison to this, the model is starting to do much better, and indeed if we waited the entire 32,000 steps, the model will have improved to the point that it's actually generating fairly coherent English; the tokens stream together correctly and make up English a lot better.
So this has to run for about a day or two more now, and so at this stage we just make sure that the loss is decreasing, everything is looking good, and we just have to wait. And now let me turn now to the story of the computation that's required, because of course I'm not running this optimization on my laptop.
That would be way too expensive, because we have to run this neural network and we have to improve it, and we need all this data and so on. So you can't run this too well on your computer, because the network is just too large. So all of this is running on a computer that is out there in the cloud, and I want to basically address the compute side of the story of training these models, and what that looks like.
So let's take a look. Okay, so the computer that I am running this optimization on is this 8xH100 node. So there are eight H100 GPUs in a single node, or a single computer. Now I am renting this computer, and it is somewhere in the cloud. I'm not sure where it is physically, actually.
The place I like to rent from is called Lambda, but there are many other companies who provide this service. So when you scroll down, you can see that they have some on-demand pricing for computers that have these H100s, which are GPUs, and I'm going to show you what they look like in a second.
An on-demand 8x NVIDIA H100 GPU machine comes in at three dollars per GPU per hour, for example. So you can rent these, and then you get a machine in the cloud, and you can go in and you can train these models. And these GPUs, they look like this. So this is one H100 GPU.
This is kind of what it looks like, and you slot this into your computer. And GPUs are a perfect fit for training neural networks, because the training is very computationally expensive, but it displays a lot of parallelism in the computation. So you can have many independent workers all working at the same time on the matrix multiplications that are under the hood of training these neural networks.
So this is just one of these H100s, but actually you would put multiple of them together. So you could stack eight of them into a single node, and then you can stack multiple nodes into an entire data center, or an entire system. So when we look at a data center, we start to see things that look like this, right?
So we have one GPU goes to eight GPUs, goes to a single system, goes to many systems. And so these are the bigger data centers, and they of course would be much, much more expensive. And what's happening is that all the big tech companies really desire these GPUs, so they can train all these language models, because they are so powerful.
And that is fundamentally what has driven NVIDIA's market cap to $3.4 trillion today, as an example, and why NVIDIA has kind of exploded. So this is the gold rush. The gold rush is getting the GPUs, getting enough of them, so they can all collaborate to perform this optimization.
And what are they all doing? They're all collaborating to predict the next token on a dataset like the FineWeb dataset. This is the computational workflow that is basically extremely expensive. The more GPUs you have, the more tokens you can try to predict and improve on, and you're going to process this dataset faster, and you can iterate faster and train a bigger network, and so on.
So this is what all those machines are doing. And this is why all of this is such a big deal. And for example, this is an article from about a month ago or so. This is why it's a big deal that, for example, Elon Musk is getting 100,000 GPUs in a single data center.
And all of these GPUs are extremely expensive and are going to take a ton of power, and all of them are just trying to predict the next token in the sequence and improve the network by doing so, and get probably a lot more coherent text than what we're seeing here, a lot faster.
Okay, so unfortunately, I do not have a spare $10 or $100 million to spend on training a really big model like this. But luckily, we can turn to some big tech companies who train these models routinely, and release some of them once they are done training. So they've spent a huge amount of compute to train this network, and they release the network at the end of the optimization.
So it's very useful, because they've done a lot of compute for that. So there are many companies who train these models routinely, but actually not many of them release what are called base models. The model that comes out at the end here is what's called a base model.
What is a base model? It's a token simulator, right? It's an internet text token simulator. And so that is not by itself useful yet, because what we want is what's called an assistant; we want to ask questions and have it respond with answers. These models won't do that; they just create sort of remixes of the internet.
They dream internet pages. So the base models are not very often released, because they're kind of just step one of a few other steps that we still need to take to get an assistant. However, a few releases have been made. So as an example, OpenAI released the 1.5-billion-parameter GPT-2 model back in 2019.
And this GPT-2 model is a base model. Now, what is a model release? What does it look like to release these models? So this is the GPT-2 repository on GitHub. Well, you need two things, basically, to release a model. Number one, we need the Python code, usually, that describes in detail the sequence of operations that the model performs.
So if you remember this transformer from before, the sequence of steps that are taken here in this neural network is what is being described by this code. So this code is implementing what's called the forward pass of this neural network. So we need the specific details of exactly how they wired up that neural network.
So this is just computer code, and it's usually just a couple hundred lines of code. It's not that crazy. And this is all fairly understandable and usually fairly standard. What's not standard are the parameters. That's where the actual value is. What are the parameters of this neural network? Because there's 1.6 billion of them, and we need the correct setting, or a really good setting.
And so that's why in addition to this source code, they release the parameters, which in this case is roughly 1.5 billion parameters. And these are just numbers. So it's one single list of 1.5 billion numbers, the precise and good setting of all the knobs, such that the tokens come out well.
So you need those two things to get a base model release. Now, GPT-2 was released, but that's actually a fairly old model, as I mentioned. So actually, the model we're going to turn to is called Llama 3, and that's the one that I would like to show you next. So, Llama 3. GPT-2, again, was 1.6 billion parameters trained on 100 billion tokens.
Llama 3 is a much bigger and much more modern model. It is released and trained by Meta, and it is a 405-billion-parameter model trained on 15 trillion tokens, in very much the same way, just much, much bigger. And Meta has also made a release of Llama 3, and that was part of this paper.
So in this paper, which goes into a lot of detail, the biggest base model that they released is the Llama 3.1 405-billion-parameter model. So this is the base model. And then in addition to the base model, you see here, foreshadowing for later sections of the video, they also released the instruct model.
And the instruct means that this is an assistant, you can ask it questions, and it will give you answers. We still have yet to cover that part later. For now, let's just look at this base model, this token simulator. And let's play with it and try to think about, you know, what is this thing?
And how does it work? And what do we get at the end of this optimization, if you let this run until the end, for a very big neural network on a lot of data? So my favorite place to interact with the base models is this company called Hyperbolic, which is basically serving the base model of the 405B Llama 3.1.
So when you go to the website, and I think you may have to register and so on, make sure that in the models section you are using Llama 3.1 405 billion base; it must be the base model. And then here, let's say the max tokens is how many tokens we're going to be generating.
So let's just decrease this to be a bit less, just so we don't waste compute; we just want the next 128 tokens. And leave the other stuff alone, I'm not going to go into the full detail here. Now, fundamentally, what's going to happen here is identical to what happened during inference for us earlier.
So this is just going to continue the token sequence of whatever prefix you're going to give it. So I want to first show you that this model here is not yet an assistant. So you can, for example, ask it, what is two plus two, it's not going to tell you, oh, it's four.
What else can I help you with? It's not going to do that, because "what is two plus two" is going to be tokenized, and then those tokens just act as a prefix. And then what the model is going to do now is just get the probability for the next token.
And it's just a glorified autocomplete. It's a very, very expensive autocomplete of what comes next, depending on the statistics of what it saw in its training documents, which are basically web pages. So let's just hit enter to see what tokens it comes up with as a continuation. Okay, so here it kind of actually answered the question and started to go off into some philosophical territory.
Let's try it again. So let me copy and paste, and let's try again from scratch. What is two plus two? Okay, so it just goes off again. Notice one more thing that I want to stress: every time you put the prompt in, the system just kind of starts from scratch.
So the system here is stochastic. So for the same prefix of tokens, we're always getting a different answer. And the reason for that is that we get this probability distribution, and we sample from it, and we always get different samples, and we sort of always go into a different territory afterwards.
So here in this case, I don't know what this is. Let's try one more time. So it just continues on. So it's just doing the stuff that it's seen on the internet, right? And it's just kind of regurgitating those statistical patterns. So first thing: it's not an assistant yet, it's a token autocomplete.
And second, it is a stochastic system. Now the crucial thing is that even though this model is not yet by itself very useful for a lot of applications, it is still very useful, because in the task of predicting the next token in the sequence, the model has learned a lot about the world.
And it has stored all that knowledge in the parameters of the network. So remember that our text looked like this, right? Internet web pages. And now all of this is sort of compressed in the weights of the network. So you can think of these 405 billion parameters as a kind of compression of the internet.
You can think of the 405 billion parameters as kind of like a zip file. But it's not a lossless compression, it's a lossy compression, we're kind of like left with kind of a gestalt of the internet and we can generate from it, right? Now we can elicit some of this knowledge by prompting the base model accordingly.
So for example, here's a prompt that might work to elicit some of that knowledge that's hiding in the parameters. Here's my top 10 list of the top landmarks to see in Paris. And I'm doing it this way, because I'm trying to prime the model to now continue this list.
So let's see if that works when I press enter. Okay, so you see that it started the list, and it's now kind of giving me some of those landmarks. And I noticed that it's trying to give a lot of information here. Now, you might not be able to actually fully trust some of the information here.
Remember that this is all just a recollection of some of the internet documents. And so the things that occur very frequently in the internet data are probably more likely to be remembered correctly, compared to things that happen very infrequently. So you can't fully trust some of the information that is here, because it's all just a vague recollection of internet documents.
Because the information is not stored explicitly in any of the parameters, it's all just the recollection. That said, we did get something that is probably approximately correct. And I don't actually have the expertise to verify that this is roughly correct. But you see that we've elicited a lot of the knowledge of the model.
And this knowledge is not precise and exact. This knowledge is vague, and probabilistic, and statistical. And the kinds of things that occur often are the kinds of things that are more likely to be remembered in the model. Now I want to show you a few more examples of this model's behavior.
The first thing I want to show you is this example. I went to the Wikipedia page for Zebra. And let me just copy-paste the first, even one sentence here. And let me put it here. Now when I click enter, what kind of completion are we going to get? So let me just hit enter.
There are three living species, etc, etc. What the model is producing here is an exact regurgitation of this Wikipedia entry. It is reciting this Wikipedia entry purely from memory. And this memory is stored in its parameters. And so it is possible that at some point in these 512 tokens, the model will stray away from the Wikipedia entry.
But you can see that it has huge chunks of it memorized here. Let me see, for example, if this sentence occurs by now. Okay, so we're still on track. Let me check here. Okay, we're still on track. It will eventually stray away. Okay, so this thing is just recited to a very large extent.
It will eventually deviate, because it won't be able to remember exactly. Now, the reason that this happens is because these models can be extremely good at memorization. And usually, this is not what you want in the final model. This is something called regurgitation, and it's usually undesirable to recite things directly that you have trained on.
Now, the reason that this happens is that for documents which are deemed to be of very high quality as a source, like for example Wikipedia, it is very often the case that when you train the model, you will preferentially sample from those sources.
So basically, the model has probably done a few epochs on this data, meaning that it has seen this web page maybe 10 times or so. And it's a bit like when you read some kind of a text many, many times, say you read something 100 times, then you will be able to recite it.
And it's very similar for this model: if it sees something way too often, it's going to be able to recite it later from memory. Except these models can be a lot more efficient per presentation than a human. So probably it's only seen this Wikipedia entry 10 times, but basically it has remembered this article exactly in its parameters.
Okay, the next thing I want to show you is something that the model has definitely not seen during its training. So for example, if we go to the paper, and then we navigate to the pre-training data, we'll see here that the dataset has a knowledge cutoff at the end of 2023.
So it will not have seen documents after this point. And certainly it has not seen anything about the 2024 election and how it turned out. Now, if we prime the model with the tokens from the future, it will continue the token sequence, and it will just take its best guess according to the knowledge that it has in its own parameters.
So let's take a look at what that could look like. So the prompt is: the Republican party candidate, Trump, President of the United States from 2017, and so on. And let's see what it says after this point. So for example, the model will have to guess at the running mate and who it's against, etc.
So let's hit enter. So here it says that Mike Pence was the running mate instead of JD Vance, and the ticket was against Hillary Clinton and Tim Kaine. So this is kind of an interesting parallel universe, potentially, of what could have happened according to the LLM. Let's get a different sample.
So the identical prompt, and let's resample. So here the running mate was Ron DeSantis, and they ran against Joe Biden and Kamala Harris. So this is again a different parallel universe. So the model will take educated guesses, and it will continue the token sequence based on this knowledge. And all of what we're seeing here is kind of what's called hallucination.
The model is just taking its best guess in a probabilistic manner. The next thing I would like to show you is that even though this is a base model and not yet an assistant model, it can still be utilized in practical applications if you are clever with your prompt design.
So here's something that we would call a few-shot prompt. So what I have here is 10 pairs, and each pair is an English word, a colon, and then the translation in Korean. And we have 10 of them. And what the model does here is, at the end, we have "teacher" and a colon, and then here's where we're going to do a completion of, say, just five tokens.
And these models have what we call in context learning abilities. And what that's referring to is that as it is reading this context, it is learning sort of in place that there's some kind of an algorithmic pattern going on in my data. And it knows to continue that pattern.
And this is called in-context learning. So it takes on the role of translator, and when we hit completion, we see that the teacher translation is "선생님," which is correct. And so this is how you can build apps by being clever with your prompting, even though we still just have a base model for now.
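As a small sketch of what assembling such a few-shot prompt can look like in code (the word pairs here are my own illustrative examples, not the exact ones from the demo):

```python
pairs = [
    ("apple", "사과"),
    ("water", "물"),
    ("book", "책"),
    # ... more English:Korean pairs ...
]
prompt = "\n".join(f"{english}: {korean}" for english, korean in pairs)
prompt += "\nteacher:"   # the base model should continue with the Korean translation
print(prompt)
```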
And it relies on what we call this in-context learning ability, and it is done by constructing what's called a few-shot prompt. Okay, and finally, I want to show you that there is a clever way to actually instantiate a whole language model assistant just by prompting. And the trick to it is that we're going to structure a prompt to look like a web page that is a conversation between a helpful AI assistant and a human.
And then the model will continue that conversation. So actually, to write the prompt, I turned to ChatGPT itself, which is kind of meta. But I told it: I want to create an LLM assistant, but all I have is the base model, so can you please write my prompt?
And this is what it came up with, which is actually quite good. So here's a conversation between an AI assistant and a human. The AI assistant is knowledgeable, helpful, capable of answering a wide variety of questions, etc. And then here, it's not enough to just give it a sort of description.
It works much better if you create this few-shot prompt. So here are a few turns of human/assistant, human/assistant; we have a few turns of conversation. And then here at the end, we're going to put the actual query that we like. So let me copy-paste this into the base model prompt.
And now, let me write "Human:", and this is where we put our actual prompt: why is the sky blue? And let's run. Assistant: the sky appears blue due to a phenomenon called Rayleigh scattering, etc., etc. So you see that the base model is just continuing the sequence. But because the sequence looks like this conversation, it takes on that role.
But it is a little subtle, because here it just, you know, ends the assistant turn and then just hallucinates the next question from the human, etc., and it will just continue going on and on. But you can see that we have sort of accomplished the task. And if you just took this, why is the sky blue?
And if we just refresh this, and put it here, then of course, we don't expect this to work with the base model, right? We're just gonna, who knows what we're gonna get? Okay, we're just gonna get more questions. Okay. So this is one way to create an assistant, even though you may only have a base model.
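To make this trick concrete, here is a hedged sketch of what such a prompt could look like. This is not the exact prompt ChatGPT wrote for me; the example turns are my own illustration, and base_model.generate with a stop sequence is a hypothetical call.

```python
# A sketch of the "assistant by prompting" trick: wrap the user's question in a
# prompt that *looks like* a transcript of an AI assistant conversation, so the
# base model continues in that role.
assistant_prompt = """Here is a conversation between a helpful AI Assistant and a Human.
The Assistant is knowledgeable, friendly, and answers a wide variety of questions.

Human: What is the capital of France?
Assistant: The capital of France is Paris.

Human: How many legs does a spider have?
Assistant: A spider has eight legs.

Human: Why is the sky blue?
Assistant:"""

# Stopping at "Human:" prevents the model from hallucinating the next human turn.
# completion = base_model.generate(assistant_prompt, stop=["Human:"])  # hypothetical API
```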
Okay, so this is the kind of brief summary of the things we talked about over the last few minutes. Now, let me zoom out here. And this is kind of like what we've talked about so far. We wish to train LLM assistants like ChatGPT. We've discussed the first stage of that, which is the pre training stage.
And we saw that really what it comes down to is we take internet documents, we break them up into these tokens, these atoms of little text chunks. And then we predict token sequences using neural networks. The output of this entire stage is this base model, it is the setting of the parameters of this network.
And this base model is basically an internet document simulator on the token level. So it can just, it can generate token sequences that have the same kind of like statistics as internet documents. And we saw that we can use it in some applications, but we actually need to do better.
We want an assistant, we want to be able to ask questions, and we want the model to give us answers. And so we need to now go into the second stage, which is called the post training stage. So we take our base model, our internet document simulator, and hand it off to post training.
So we're now going to discuss a few ways to do what's called post training of these models. These stages in post training are going to be computationally much less expensive; most of the computational work, all of the massive data centers and the heavy compute and millions of dollars, is in the pre-training stage.
But now we're going to the slightly cheaper, but still extremely important stage called post training, where we turn this LLM model into an assistant. So let's take a look at how we can get our model to not sample internet documents, but to give answers to questions. So in other words, what we want to do is we want to start thinking about conversations.
And these are conversations that can be multi-turn. So there can be multiple turns, and they are, in the simplest case, a conversation between a human and an assistant. And so for example, we can imagine the conversation could look something like this. When a human says what is two plus two, the assistant should respond with something like two plus two is four.
When a human follows up and says, what if it was a times instead of a plus, the assistant could respond with something like this. And similarly here, this is another example showing that the assistant could also have some kind of a personality here, that it's kind of nice. And then here in the third example, I'm showing that when a human is asking for something that we don't wish to help with, we can produce what's called a refusal; we can say that we cannot help with that.
So in other words, what we want to do now is we want to think through how an assistant should interact with a human. And we want to program the assistant and its behavior in these conversations. Now, because this is neural networks, we're not going to be programming these explicitly in code, we're not going to be able to program the assistant in that way.
Because this is neural networks, everything is done through neural network training on data sets. And so because of that, we are going to be implicitly programming the assistant by creating data sets of conversations. So these are three independent examples of conversations in a data set; an actual data set will be much larger, it could have hundreds of thousands of conversations that are multi-turn, very long, etc.
And it would cover a diverse breadth of topics. But here I'm only showing three examples. But the way this works basically is that the assistant is being programmed by example. And where is this data coming from, like "two times two is also four, same as two plus two," etc.? Where does that come from?
This comes from human labelers. So we will basically give human labelers some conversational context. And we will ask them to basically give the ideal assistant response in this situation. And a human will write out the ideal response for an assistant in any situation. And then we're going to get the model to basically train on this and to imitate those kinds of responses.
So the way this works, then is we are going to take our base model, which we produced in the pre training stage. And this base model was trained on internet documents, we're now going to take that data set of internet documents, and we're going to throw it out. And we're going to substitute a new data set.
And that's going to be a data set of conversations. And we're going to continue training the model on these conversations, on this new data set of conversations. And what happens is that the model will very rapidly adjust, and it will sort of learn the statistics of how this assistant responds to human queries.
And then later during inference, we'll be able to basically prime the assistant and get the response, and it will be imitating what human labelers would do in that situation, if that makes sense. So we're going to see examples of that, and this is going to become a bit more concrete.
I also wanted to mention that in this post training stage, we're going to basically just continue training the model. But the pre training stage can in practice take roughly three months of training on many thousands of computers. The post training stage will typically be much shorter, like three hours, for example.
And that's because the data set of conversations that we're going to create here manually is much, much smaller than the data set of text on the internet. And so this training will be very short. But fundamentally, we're just going to take our base model, we're going to continue training using the exact same algorithm, the exact same everything, except we're swapping out the data set for conversations.
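To make "same algorithm, different data" concrete, here is a very schematic sketch in Python. The names base_model, conversations, and tokenize_conversation are hypothetical placeholders (the tokenization of conversations is what we discuss next); this is just the shape of the loop, not anyone's actual training code.

```python
# A schematic sketch of post-training (SFT): the same next-token-prediction
# objective as pre-training, with the dataset swapped out for conversations.
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(base_model.parameters(), lr=1e-5)

for conversation in conversations:                 # far smaller than internet text
    tokens = tokenize_conversation(conversation)   # 1D tensor of token IDs (special tokens + text)
    inputs, targets = tokens[:-1], tokens[1:]      # predict the next token, exactly as before
    logits = base_model(inputs)                    # (sequence_length, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```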
So the questions now are, what are these conversations? How do we represent them? How do we get the model to see conversations instead of just raw text? And then what are the outcomes of this kind of training? And what do you get in a certain like psychological sense when we talk about the model?
So let's turn to those questions now. So let's start by talking about the tokenization of conversations. Everything in these models has to be turned into tokens, because everything is just about token sequences. So how do we turn conversations into token sequences is the question. And so for that, we need to design some kind of an encoding.
And this is kind of similar to, if you're familiar with it (you don't have to be), for example, the TCP/IP packet on the internet: there are precise rules and protocols for how you represent information, how everything is structured together, so that you have all this kind of data laid out in a way that is written down and that everyone can agree on.
And so it's the same thing now happening in LLMs, we need some kind of data structures, and we need to have some rules around how these data structures like conversations, get encoded and decoded to and from tokens. And so I want to show you now how I would recreate this conversation in the token space.
So if you go to Tiktokenizer, I can take that conversation, and this is how it is represented for the language model. So here we have a user and an assistant alternating in this two-turn conversation. And what you're seeing here is, it looks ugly, but it's actually relatively simple.
The way it gets turned into a token sequence here at the end is a little bit complicated. But at the end, this conversation between the user and assistant ends up being 49 tokens, it is a one dimensional sequence of 49 tokens. And these are the tokens. Okay. And all the different LLMs will have a slightly different format or protocols.
And it's a little bit of a Wild West right now. But for example, GPT-4o does it in the following way. You have this special token called im_start, and this is short for "imaginary monologue start" (I don't actually know why it's called that, to be honest). Then you have to specify whose turn it is.
So for example, user, which is token 1428. Then you have im_sep, the imaginary monologue separator, and then the exact question, so the tokens of the question, and then you have to close it with im_end, the end of the imaginary monologue. So basically, the question from a user of what is two plus two ends up being the token sequence of these tokens.
And now the important thing to mention here is that im_start is not text, right? im_start is a special token that gets added; it's a new token. And this token has never been trained on so far; it is a new token that we create and introduce in the post training stage.
And so these special tokens like im_sep, im_start, etc., are introduced and interspersed with text, so that they sort of get the model to learn that, hey, this is the start of a turn, and whose turn is it? The start of the turn is for the user.
And then this is what the user says, and then the user turn ends. And then it's a new start of a turn, and it is by the assistant. And then what does the assistant say? Well, these are the tokens of what the assistant says, etc. And so this conversation is now turned into a sequence of tokens.
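Here is a rough sketch of that encoding in Python. The exact token strings and IDs vary by model and I'm writing them as plain text here for readability; in the real tokenizer, things like im_start are single special tokens, not spelled-out characters.

```python
# A sketch of a GPT-4o-style conversation encoding: each turn is wrapped in
# special tokens marking who is speaking and where the turn starts and ends.
def render_conversation(turns):
    out = ""
    for role, text in turns:
        out += f"<|im_start|>{role}<|im_sep|>{text}<|im_end|>"
    return out

convo = [
    ("user", "What is 2+2?"),
    ("assistant", "2+2 = 4"),
    ("user", "What if it was * instead of +?"),
    ("assistant", "2*2 = 4, same as 2+2!"),
]
print(render_conversation(convo))
# The tokenizer then maps this string (with <|im_start|> etc. as single special
# tokens) into a one-dimensional sequence of token IDs.
```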
The specific details here are not actually that important. All I'm trying to show you in concrete terms, is that our conversations, which we think of as kind of like a structured object, end up being turned via some encoding into one dimensional sequences of tokens. And so, because this is one dimensional sequence of tokens, we can apply all this stuff that we applied before.
Now it's just a sequence of tokens. And now we can train a language model on it. And so we're just predicting the next token in a sequence, just like before. And we can represent and train on conversations. And then what does it look like at test time during inference?
So say we've trained a model, and we've trained it on these kinds of data sets of conversations, and now we want to inference. So during inference, what does this look like when you're on ChatGPT? Well, you come to ChatGPT, and you have, say, a dialogue with it.
And the way this works is basically, say that this was already filled in. So like, what is two plus two, two plus two is four. And now you issue: what if it was times, followed by im_end. And what basically ends up happening on the servers of OpenAI, or something like that, is that they append im_start, assistant, im_sep.
And this is where they end it, right here. So they construct this context. And now they start sampling from the model. So it's at this stage that they will go to the model and say, okay, what is a good first sequence? What is a good first token? What is a good second token?
What is a good third token? And this is where the LLM takes over and creates a response, like for example, response that looks something like this, but it doesn't have to be identical to this. But it will have the flavor of this, if this kind of a conversation was in the data set.
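Here is a hedged sketch of what that could look like on the server side, reusing the render_conversation sketch from above. The sampling call and the exact token names are illustrative, not any provider's actual API.

```python
# Inference time: render the conversation so far, append an "assistant" header,
# and sample tokens until the model emits the end-of-turn token.
context = render_conversation([
    ("user", "What is 2+2?"),
    ("assistant", "2+2 = 4"),
    ("user", "What if it was * instead of +?"),
])
context += "<|im_start|>assistant<|im_sep|>"   # "your turn" — now we sample

# while True:                                  # hypothetical sampling loop
#     token = model.sample_next_token(context)
#     if token == "<|im_end|>":
#         break
#     context += token
```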
So that's roughly how the protocol works, although the details of this protocol are not important. Again, my goal is just to show you that everything ends up being just a one-dimensional token sequence, so we can apply everything we've already seen. But now we're training on conversations, and we're now basically generating conversations as well.
Okay, so now I would like to turn to what these data sets look like in practice. The first paper that I would like to show you and the first effort in this direction is this paper from OpenAI in 2022. And this paper was called InstructGPT, or the technique that they developed.
And this was the first time that OpenAI has kind of talked about how you can take language models and fine tune them on conversations. And so this paper has a number of details that I would like to take you through. So the first stop I would like to make is in section 3.4, where they talk about the human contractors that they hired, in this case from Upwork or through ScaleAI to construct these conversations.
And so there are human labelers involved whose job it is professionally to create these conversations. And these labelers are asked to come up with prompts, and then they are asked to also complete the ideal assistant responses. And so these are the kinds of prompts that people came up with.
So these are human labelers. So list five ideas for how to regain enthusiasm for my career. What are the top 10 science fiction books I should read next? And there's many different types of kind of prompts here. So translate the sentence to Spanish, etc. And so there's many things here that people came up with.
They first come up with the prompt, and then they also answer that prompt, and they give the ideal assistant response. Now, how do they know what is the ideal assistant response that they should write for these prompts? So when we scroll down a little bit further, we see that here we have this excerpt of labeling instructions that are given to the human labelers.
So the company that is developing the language model, like for example, OpenAI, writes up labeling instructions for how the humans should create ideal responses. And so here, for example, is an excerpt of these kinds of labeling instructions. On a high level, you're asking people to be helpful, truthful, and harmless.
And you can pause the video if you'd like to see more here. But on a high level, basically just answer, try to be helpful, try to be truthful, and don't answer questions that we don't want kind of the system to handle later in ChatGPT. And so, roughly speaking, the company comes up with the labeling instructions.
Usually they are not this short. Usually they are hundreds of pages, and people have to study them professionally. And then they write out the ideal assistant responses following those labeling instructions. So this is a very human-heavy process, as it was described in this paper. Now, the data set for InstructGPT was never actually released by OpenAI.
But we do have some open source reproductions that were trying to follow this kind of a setup and collect their own data. So one that I'm familiar with, for example, is the effort of Open Assistant from a while back. And this is just one of, I think, many examples, but I just want to show you an example.
So here's, so these were people on the internet that were asked to basically create these conversations similar to what OpenAI did with human labelers. And so here's an entry of a person who came up with this prompt. Can you write a short introduction to the relevance of the term monopsony in economics?
Please use examples, etc. And then the same person, or potentially a different person, will write up the response. So here's the assistant response to this. And so then the same person or different person will actually write out this ideal response. And then this is an example of maybe how the conversation could continue.
Now explain it to a dog. And then you can try to come up with a slightly simpler explanation or something like that. Now, this then becomes the label, and we end up training on this. So what happens during training is that, of course, we're not going to have a full coverage of all the possible questions that the model will encounter at test time during inference.
We can't possibly cover all the possible prompts that people are going to be asking in the future. But if we have a, like a data set of a few of these examples, then the model during training will start to take on this persona of this helpful, truthful, harmless assistant.
And it's all programmed by example. And so these are all examples of behavior. And if you have conversations of these example behaviors, and you have enough of them, like 100,000, and you train on it, the model sort of starts to understand the statistical pattern. And it kind of takes on this personality of this assistant.
Now, it's possible that when you get the exact same question like this, at test time, it's possible that the answer will be recited as exactly what was in the training set. But more likely than that is that the model will kind of like do something of a similar vibe.
And it will understand that this is the kind of answer that you want. So that's what we're doing. We're programming the system by example, and the system adopts statistically, this persona of this helpful, truthful, harmless assistant, which is kind of like reflected in the labeling instructions that the company creates.
Now, I want to show you that the state of the art has kind of advanced in the last two or three years, since the instruct GPT paper. So in particular, it's not very common for humans to be doing all the heavy lifting just by themselves anymore. And that's because we now have language models.
And these language models are helping us create these data sets and conversations. So it is very rare that the people will like literally just write out the response from scratch, it is a lot more likely that they will use an existing LLM to basically like, come up with an answer, and then they will edit it, or things like that.
So there's many different ways in which LLMs have now started to kind of permeate this post-training stack. And LLMs are basically used pervasively to help create these massive data sets of conversations. So I want to show you, for example, UltraChat, which is one such example of a more modern data set of conversations.
It is to a very large extent synthetic, but I believe there's some human involvement, I could be wrong with that. Usually, there'll be a little bit of human, but there will be a huge amount of synthetic help. And this is all kind of like, constructed in different ways. And UltraChat is just one example of many SFT data sets that currently exist.
And the only thing I want to show you is that these data sets have now millions of conversations. These conversations are mostly synthetic, but they're probably edited to some extent by humans. And they span a huge diversity of sort of areas and so on. So these are fairly extensive artifacts by now.
And there are all these like SFT mixtures, as they're called. So you have a mixture of like lots of different types and sources, and it's partially synthetic, partially human. And it's kind of like gone in that direction since. But roughly speaking, we still have SFT data sets, they're made up of conversations, we're training on them, just like we did before.
And I guess like the last thing to note is that I want to dispel a little bit of the magic of talking to an AI. Like when you go to ChatGPT, and you give it a question, and then you hit enter, what is coming back is kind of like statistically aligned with what's happening in the training set.
And these training sets, I mean, they really just have a seed in humans following labeling instructions. So what are you actually talking to in ChatGPT? Or how should you think about it? Well, it's not coming from some magical AI, like roughly speaking, it's coming from something that is statistically imitating human labelers, which comes from labeling instructions written by these companies.
And so you're kind of imitating this, you're kind of getting, it's almost as if you're asking a human labeler. And imagine that the answer that is given to you from ChatGPT is some kind of a simulation of a human labeler. And it's kind of like asking what would a human labeler say in this kind of a conversation.
And it's not just like this human labeler is not just like a random person from the internet, because these companies actually hire experts. So for example, when you are asking questions about code, and so on, the human labelers that would be involved in creation of these conversation datasets, they will usually be educated expert people.
And you're kind of like asking a question of like a simulation of those people, if that makes sense. So you're not talking to a magical AI, you're talking to an average labeler, this average labeler is probably fairly highly skilled, but you're talking to kind of like an instantaneous simulation of that kind of a person that would be hired in the construction of these datasets.
So let me give you one more specific example before we move on. For example, when I go to ChatGPT, and I say, recommend the top five landmarks to see in Paris, and then I hit enter. Okay, here we go. Okay, when I hit enter, what's coming out here?
How do I think about it? Well, it's not some kind of a magical AI that has gone out and researched all the landmarks and then ranked them using its infinite intelligence, etc. What I'm getting is a statistical simulation of a labeler that was hired by OpenAI; you can think about it roughly in that way.
And so if this specific question is in the post training dataset somewhere at OpenAI, then I'm very likely to see an answer that is probably very, very similar to what that human labeler would have put down for those five landmarks. How does the human labeler come up with this?
Well, they go off and they go on the internet, and they kind of do their own little research for 20 minutes, and they just come up with a list, right? Now, so if they come up with this list, and this is in the dataset, I'm probably very likely to see what they submitted as the correct answer from the assistant.
Now, if this specific query is not part of the post training dataset, then what I'm getting here is a little bit more emergent. Because the model kind of understands that statistically, the kinds of landmarks that are in the training set are usually the prominent landmarks, the landmarks that people usually want to see, the kinds of landmarks that are usually very often talked about on the internet.
And remember that the model already has a ton of knowledge from its pre-training on the internet. So it's probably seen a ton of conversations about Paris, about landmarks, about the kinds of things that people like to see. And so it's the pre-training knowledge that is then combined with the post training dataset that results in this kind of an imitation.
So that's roughly how you can kind of think about what's happening behind the scenes here, in the statistical sense. Okay, now I want to turn to the topic of LLM psychology, as I like to call it, which is what I see as the emergent cognitive effects of the training pipeline that we have for these models.
So in particular, the first one I want to talk about is, of course, hallucinations. So you might be familiar with model hallucinations. It's when LLMs make stuff up, they just totally fabricate information, etc. And it's a big problem with LLM assistants. It is a problem that existed to a large extent with early models from a few years ago.
And I think the problem has gotten a bit better, because there are some mitigations that I'm going to go into in a second. For now, let's just try to understand where these hallucinations come from. So here's a specific example of three conversations that you might think you have in your training set.
And these are pretty reasonable conversations that you could imagine being in the training set. So like, for example, who is Tom Cruise? Well, Tom Cruise is a famous actor, American actor and producer, etc. Who is John Barrasso? This turns out to be a US senator, for example. Who is Genghis Khan?
Well, Genghis Khan was blah, blah, blah. And so this is what your conversations could look like at training time. Now, the problem with this is that when the human is writing the correct answer for the assistant, in each one of these cases, the human either like knows who this person is, or they research them on the internet, and they come in, and they write this response that kind of has this like confident tone of an answer.
And what happens basically is that at test time, when you ask about someone like "who is Orson Kovats?" (this is a totally random name that I came up with, and I don't think this person exists; as far as I know, I just generated it randomly), the problem is that the assistant will not just tell you, oh, I don't know.
Even if the language model itself might know, inside its features, inside its activations, inside its brain sort of, that this person is not someone it's familiar with, even if some part of the network kind of knows that in some sense, saying "oh, I don't know who this is" is not going to happen.
Because the model statistically imitates its training set. In the training set, questions of the form "who is blah" are confidently answered with the correct answer. And so it's going to take on the style of the answer, and it's going to do its best; it's going to give you statistically the most likely guess, and it's just going to basically make stuff up.
Because these models, again, we just talked about it is they don't have access to the internet, they're not doing research. These are statistical token tumblers, as I call them, is just trying to sample the next token in the sequence. And it's gonna basically make stuff up. So let's take a look at what this looks like.
I have here what's called the inference playground from Hugging Face. And I am on purpose picking on a model called Falcon 7B, which is an old model; this is a few years ago now. So it's an older model, so it suffers from hallucinations. And as I mentioned, this has improved over time recently.
But let's say who is Orson Kovats? Let's ask Falcon 7b instruct. Run. Oh, yeah, Orson Kovats is an American author and science fiction writer. Okay. That's totally false. It's a hallucination. Let's try again. These are statistical systems, right? So we can resample. This time, Orson Kovats is a fictional character from this 1950s TV show.
It's total BS, right? Let's try again. He's a former minor league baseball player. Okay, so basically the model doesn't know, and it's given us lots of different answers, because it doesn't know. It's just kind of sampling from these probabilities. The model starts with the tokens "who is Orson Kovats, assistant," and then it comes in here.
And it's getting these probabilities, and it's just sampling from the probabilities, and it just comes up with stuff. And the stuff is actually statistically consistent with the style of the answer in its training set. And it's just doing that. But you and I experience it as made-up factual knowledge.
But keep in mind that the model basically doesn't know, and it's just imitating the format of the answer. It's not going to go off and look it up; it's just imitating, again, the answer. So how can we mitigate this? Because, for example, when we go to ChatGPT and I say, who is Orson Kovats, and I'm now asking a state-of-the-art model from OpenAI, this model will tell you.
Oh, so this model is actually even smarter, because you saw very briefly it said "searching the web"; we're going to cover this later. It's actually trying to do tool use, and it kind of just came up with some kind of a story. But I want to ask about Orson Kovats without using any tools.
I don't want it to do web search. So it says there's no well-known historical or public figure named Orson Kovats. So this model is not going to make up stuff. This model knows that it doesn't know, and it tells you that this doesn't appear to be a person that the model knows.
So somehow, we sort of improved hallucinations, even though they clearly are an issue in older models. And it makes total sense why you would be getting these kinds of answers, if this is what your training set looks like. So how do we fix this? Okay, well, clearly, we need some examples in our data set where the correct answer for the assistant is that the model doesn't know about some particular fact.
But we only need to have those answers be produced in the cases where the model actually doesn't know. And so the question is, how do we know what the model knows or doesn't know? Well, we can empirically probe the model to figure that out. So let's take a look at, for example, how Meta dealt with hallucinations for the Llama 3 series of models as an example.
So in this paper that they published from Meta, we can go into hallucinations, which they call here factuality. And they describe the procedure by which they basically interrogate the model to figure out what it knows and doesn't know, to figure out sort of the boundary of its knowledge.
And then they add examples to the training set, where for the things where the model doesn't know them, the correct answer is that the model doesn't know them, which sounds like a very easy thing to do in principle. But this roughly fixes the issue. And the reason it fixes the issue is because remember like, the model might actually have a pretty good model of its self knowledge inside the network.
So remember, we looked at the network and all these neurons inside the network, you might imagine there's a neuron somewhere in the network, that sort of like lights up for when the model is uncertain. But the problem is that the activation of that neuron is not currently wired up to the model actually saying in words that it doesn't know.
So even though the internals of the neural network know, because there's some neurons that represent that, the model will not surface that it will instead take its best guess so that it sounds confident. Just like it sees in a training set. So we need to basically interrogate the model and allow it to say I don't know in the cases that it doesn't know.
So let me take you through what Meta roughly does. So basically what they do is, here I have an example: Dominik Hašek is the featured article today, so I just went there randomly. And what they do is basically they take a random document in a training set, and they take a paragraph, and then they use an LLM to construct questions about that paragraph.
So for example, I did that with ChatGPT here. So I said, here's a paragraph from this document, generate three specific factual questions based on this paragraph, and give me the questions and the answers. And the LLMs are already good enough to create and reframe this information. So if the information is in the context window of this LLM, this actually works pretty well; it doesn't have to rely on its memory.
It's right there in the context window. And so it can basically reframe that information with fairly high accuracy. So for example, it can generate questions for us like, for which team did he play? Here's the answer. How many cups did he win, etc. And now what we have to do is we have some question and answers.
And now we want to interrogate the model. So roughly speaking, what we'll do is we'll take our questions, and we'll go to our model, which would be, say, Llama at Meta. But let's just interrogate Mistral 7B here as an example. That's another model. So does this model know this answer?
Let's take a look. So he played for Buffalo Sabres, right? So the model knows. And the way that you can programmatically decide is basically we're going to take this answer from the model. And we're going to compare it to the correct answer. And again, the models are good enough to do this automatically.
So there's no humans involved here. We can take basically the answer from the model. And we can use another LLM judge to check if that is correct, according to this answer. And if it is correct, that means that the model probably knows. So we're going to do is we're going to do this maybe a few times.
So okay, it knows it's Buffalo Sabres. Let's try again. Buffalo Sabres. Let's try one more time. Buffalo Sabres. So we asked three times about this factual question, and the model seems to know. So everything is great. Now let's try the second question. How many Stanley Cups did he win?
And again, let's interrogate the model about that. And the correct answer is two. So here, the model claims that he won four times, which is not correct, right? It doesn't match two. So the model doesn't know it's making stuff up. Let's try again. So here the model again, it's kind of like making stuff up, right?
Let's try again. Here it says he did not even win during his career. So obviously the model doesn't know. And the way we can programmatically tell, again, is that we interrogate the model a few times, maybe three times, five times, whatever it is, and we compare its answers to the correct answer.
And if the model doesn't know, then we know that the model doesn't know this question. And then what we do is we take this question, we create a new conversation in the training set. So we're going to add a new conversation training set. And when the question is, how many Stanley Cups did he win?
The answer is, I'm sorry, I don't know, or I don't remember. And that's the correct answer for this question, because we interrogated the model and we saw that that's the case. If you do this for many different types of questions, for many different types of documents, you are giving the model an opportunity, in its training set, to refuse to answer based on its knowledge.
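Here is a schematic sketch of that recipe in Python. This is my paraphrase of the procedure, not Meta's actual code, and all the helpers here (generate_qa_pairs, model.answer, judge.is_equivalent) are hypothetical stand-ins; the threshold for "doesn't know" is also just illustrative.

```python
# Probe the model's knowledge boundary, then add "I don't know" training examples
# for questions it reliably gets wrong.
def probe_and_augment(document, model, judge, n_tries=3):
    new_examples = []
    # 1) Use an LLM to write factual Q/A pairs about a paragraph it sees in context.
    for question, correct_answer in generate_qa_pairs(document):
        # 2) Interrogate the model several times; judge each attempt with another LLM.
        attempts = [model.answer(question) for _ in range(n_tries)]
        knows = all(judge.is_equivalent(a, correct_answer) for a in attempts)
        # 3) If the model reliably fails, the correct label becomes a refusal.
        if not knows:
            new_examples.append({
                "user": question,
                "assistant": "I'm sorry, I don't believe I know that.",
            })
    return new_examples
```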
And if you just have a few examples of that, in your training set, the model will know and has the opportunity to learn the association of this knowledge-based refusal to this internal neuron somewhere in its network that we presume exists. And empirically, this turns out to be probably the case.
And it can learn that association that, hey, when this neuron of uncertainty is high, then I actually don't know. And I'm allowed to say that, I'm sorry, but I don't think I remember this, etc. And if you have these examples in your training set, then this is a large mitigation for hallucination.
And that's, roughly speaking, why ChatGPT is able to do stuff like this as well. So these are the kinds of mitigations that people have implemented and that have improved the factuality issue over time. Okay, so I've described mitigation number one for basically mitigating the hallucinations issue. Now, we can actually do much better than that.
Instead of just saying that we don't know, we can introduce an additional mitigation, number two, to give the LLM an opportunity to be factual and actually answer the question. Now, what do you and I do if I was to ask you a factual question and you don't know?
What would you do in order to answer the question? Well, you could go off and do some search and use the internet, and you could figure out the answer and then tell me what that answer is. And we can do the exact same thing with these models. So think of the knowledge inside the neural network, inside its billions of parameters.
Think of that as kind of a vague recollection of the things that the model has seen during its training, during the pre-training stage, a long time ago. So think of that knowledge in the parameters as something you read a month ago. And if you keep reading something, then you will remember it and the model remembers that.
But if it's something rare, then you probably don't have a really good recollection of that information. But what you and I do is we just go and look it up. Now, when you go and look it up, what you're doing basically is like you're refreshing your working memory with information, and then you're able to sort of like retrieve it, talk about it, or etc.
So we need some equivalent of allowing the model to refresh its memory or its recollection. And we can do that by introducing tools for the models. So the way we are going to approach this is that instead of just saying, "Hey, I'm sorry, I don't know," we can attempt to use tools.
So we can create a mechanism by which the language model can emit special tokens. And these are tokens that we're going to introduce, new tokens. So for example, here I've introduced two tokens, and I've introduced a format or a protocol for how the model is allowed to use these tokens.
So for example, instead of answering the question when the model does not know, instead of just saying "I don't know, sorry," the model now has the option of emitting a special token, search start. And then comes the query that will go to, like, bing.com in the case of OpenAI, or, say, Google Search, or something like that.
So it will emit the query, and then it will emit search end. And then what will happen is that the program that is sampling from the model, that is running the inference, when it sees the special token search end, instead of sampling the next token in the sequence, will actually pause generating from the model; it will go off, it will open a session with bing.com, and it will paste the search query into Bing.
And it will then get all the text that is retrieved. And it will basically take that text, it will maybe represent it again with some other special tokens or something like that. And it will take that text and it will copy paste it here into what I tried to like show the brackets.
So all that text kind of comes here. And when the text comes here, it enters the context window. So the model, so that text from the web search is now inside the context window that will feed into the neural network. And you should think of the context window as kind of like the working memory of the model.
That data that is in the context window is directly accessible by the model; it directly feeds into the neural network. So it's not a vague recollection anymore; the data that it has in the context window is directly available to the model. So now, when it's sampling new tokens here afterwards, it can reference very easily the data that has been copy pasted in there.
So that's roughly how these tool-use functions work. And so web search is just one of the tools; we're going to look at some of the other tools in a bit. But basically, you introduce new tokens, you introduce some schema by which the model can utilize these tokens, and it can call these special functions, like web search functions.
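Here is a hedged sketch of what the inference program could look like around such a protocol. The token names, the web_search function, and model.sample_next_token are all illustrative assumptions of mine, not any specific provider's API.

```python
# Sketch of the web-search tool protocol: when the model finishes a search query,
# pause generation, run the search, and paste the results into the context window.
SEARCH_START, SEARCH_END = "<SEARCH_START>", "<SEARCH_END>"

def generate_with_search(model, context):
    while True:
        token = model.sample_next_token(context)   # hypothetical sampling call
        context += token
        if token == SEARCH_END:
            # Pause generation, run the query, and put the results into working memory.
            query = context.split(SEARCH_START)[-1].removesuffix(SEARCH_END)
            results = web_search(query)            # e.g. hit bing.com / Google
            context += f"<RESULT>{results}</RESULT>"
        elif token == "<END_OF_TURN>":
            return context
```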
And how do you teach the model how to correctly use these tools, like say web search, search start, search end, etc. Well, again, you do that through training sets. So we need now to have a bunch of data, and a bunch of conversations that show the model by example, how to use web search.
So what are the settings where you're using the search? And what does that look like? And here is, by example, how you start a search and end a search, etc. And if you have a few thousand examples of that in your training set, the model will actually do a pretty good job of understanding how this tool works.
And it will know how to sort of structure its queries. And of course, because of the pre-training data set, and its understanding of the world, it actually kind of understands what a web search is. And so it actually kind of has a pretty good native understanding of what kind of stuff is a good search query.
And so it all kind of just like works, you just need a little bit of a few examples to show it how to use this new tool. And then it can lean on it to retrieve information, and put it in the context window. And that's equivalent to you and I looking something up.
Because once it's in the context, it's in the working memory, and it's very easy to manipulate and access. So that's what we saw a few minutes ago, when I was searching on ChatGPT for who is Orson Kovats. The ChatGPT language model decided that this is some kind of a rare individual or something like that.
And instead of giving me an answer from its memory, it decided that it will sample a special token that is going to do a web search. And we saw briefly something flash was like using the web tool or something like that. So it briefly said that, and then we waited for like two seconds, and then it generated this.
And you see how it's creating references here, and so it's citing sources. So what happened here is, it went off, it did a web search, it found these sources and these URLs, and the text of these web pages was all stuffed in between here. It's not shown here, but it's basically stuffed in as text in between here.
And now it sees that text, and now it kind of references it and says that, okay, it could be these people (citation), it could be those people (citation), etc. So that's what happened here, and that's why, when I said who is Orson Kovats, I could also say, don't use any tools.
And then that's enough to basically convince ChatGPT to not use tools and just use its memory and its recollection. I also went off and I tried to ask this question of ChatGPT: how many Stanley Cups did Dominik Hašek win? And ChatGPT actually decided that it knows the answer.
And it has the confidence to say that he won twice. And so it kind of just relied on its memory, because presumably it has enough confidence in its weights, its parameters, and its activations that this is retrievable just from memory. But you can also, conversely, use web search to make sure.
And then for the same query, it actually goes off and it searches, and then it finds a bunch of sources; it finds all this. All of this stuff gets copy pasted in there, and then it tells us the answer again and cites sources. And it actually cites the Wikipedia article, which is the source of this information for us as well.
So that's tools, web search. The model determines when to search. And then that's kind of like how these tools work. And this is an additional kind of mitigation for hallucinations and factuality. So I want to stress one more time this very important sort of psychology point. Knowledge in the parameters of the neural network is a vague recollection.
The knowledge in the tokens that make up the context window is the working memory. And it roughly speaking works kind of like it works for us in our brain. The stuff we remember is our parameters and the stuff that we just experienced like a few seconds or minutes ago and so on.
You can imagine that being in our context window. And this context window is being built up as you have a conscious experience around you. So this has a bunch of implications for your use of LLMs in practice. So for example, I can go to ChatGPT and I can do something like this.
I can say, can you summarize chapter one of Jane Austen's Pride and Prejudice, right? And this is a perfectly fine prompt, and ChatGPT actually does something relatively reasonable here. And the reason it does that is because ChatGPT has a pretty good recollection of a famous work like Pride and Prejudice.
It's probably seen a ton of stuff about it. There's probably forums about this book; it's probably read versions of this book. And it kind of remembers, because even if you had read this book, or articles about it, you'd have enough of a recollection to actually say all this. But usually when I actually interact with LLMs and I want them to recall specific things, it always works better if you just give it to them.
So I think a much better prompt would be something like this. Can you summarize for me chapter one of Jane Austen's Pride and Prejudice? And then: I am attaching it below for your reference. And then I do something like a delimiter here and I paste it in, just copy pasting it from some website that I found.
So I copy paste chapter one here. And I do that because when it's in the context window, the model has direct access to it; it doesn't have to recall it, it just has direct access to it. And so this summary can be expected to be of significantly higher quality, just because the text is directly available to the model.
And I think you and I would work in the same way. If you want to, it would be, you would produce a much better summary if you had re-read this chapter before you had to summarize it. And that's basically what's happening here or the equivalent of it. The next sort of psychological quirk I'd like to talk about briefly is that of the knowledge of self.
So what I see very often on the internet is that people do something like this. They ask LLMs something like, what model are you and who built you? And basically this question is a little bit nonsensical. And the reason I say that is that as I tried to kind of explain with some of the under the hood fundamentals, this thing is not a person, right?
It doesn't have a persistent existence in any way. It sort of boots up, processes tokens and shuts off. And it does that for every single person. It just kind of builds up a context window of conversation and then everything gets deleted. And so this entity is kind of like restarted from scratch every single conversation, if that makes sense.
It has no persistent self, has no sense of self. It's a token tumbler and it follows the statistical regularities of its training set. So it doesn't really make sense to ask it, who are you, what built you, et cetera. And by default, if you do what I described and just by default and from nowhere, you're going to get some pretty random answers.
So for example, let's pick on Falcon, which is a fairly old model, and let's see what it tells us. So it's evading the question: talented engineers and developers. Here it says I was built by OpenAI based on the GPT-3 model. It's totally making stuff up. Now, the fact that it says it was built by OpenAI here, I think a lot of people would take as evidence that this model was somehow trained on OpenAI data or something like that.
I don't actually think that that's necessarily true. The reason for that is that if you don't explicitly program the model to answer these kinds of questions, then what you're going to get is its statistical best guess at the answer. And this model had a SFT data mixture of conversations.
And during the fine tuning, the model sort of understands, as it's training on this data, that it's taking on this personality of this helpful assistant. But it wasn't told exactly what label to apply to itself; it just kind of takes on this persona of a helpful assistant.
And remember that the pre training stage took the documents from the entire internet, and ChatGPT and OpenAI are very prominent in these documents. And so I think what's actually likely to be happening here is that this is just its hallucinated label for what it is: its self-identity is that it's ChatGPT by OpenAI.
And it's only saying that because there's a ton of data on the internet of answers like this that are actually coming from OpenAI, from ChatGPT. And so that's its label for what it is. Now, you can override this as a developer; if you have an LLM model, you can actually override it.
And there are a few ways to do that. So for example, let me show you: there's this OLMo model from Allen AI. And this is an LLM; it's not a top tier LLM or anything like that, but I like it because it's fully open source. So the paper for OLMo and everything else is completely, fully open source, which is nice.
So here we are looking at its SFT mixture. So this is the data mixture of the fine tuning; this is the conversations data set, right. And so the way that they are solving it for the OLMo model is, we see that there's a bunch of stuff in the mixture.
And there's a total of 1 million conversations here. But here we have an "OLMo 2 hardcoded" subset. If we go there, we see that this is 240 conversations. And look at these 240 conversations: they're hard coded. Tell me about yourself, says the user. And then the assistant says, I'm OLMo, an open language model developed by AI2, the Allen Institute for Artificial Intelligence, etc.
I'm here to help, blah, blah, blah. What is your name? The OLMo project. So these are all kinds of cooked up, hard coded questions about OLMo 2, and the correct answers to give in these cases. If you take 240 questions or conversations like this, put them into your training set, and fine tune with it, then the model will actually be expected to parrot this stuff later.
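Just to give a flavor of it, here is a sketch of what such hardcoded identity conversations might look like in an SFT mixture. This is a paraphrase of the OLMo 2 entries described above, not the exact wording from the dataset.

```python
# A sketch of hardcoded identity conversations mixed into the SFT set.
identity_conversations = [
    {"user": "Tell me about yourself.",
     "assistant": "I'm OLMo, an open language model developed by AI2, "
                  "the Allen Institute for Artificial Intelligence. I'm here to help!"},
    {"user": "What is your name?",
     "assistant": "My name is OLMo, from the OLMo project at AI2."},
    # ... on the order of 240 such conversations, mixed into ~1M total conversations
]
```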
If you don't give it this, then it's probably going to say it's ChatGPT by OpenAI. And there's one more way to sometimes do this, which is that in these conversations, where you have turns between human and assistant, sometimes there's a special message called the system message at the very beginning of the conversation.
So it's not just between human and assistant, there's a system. And in the system message, you can actually hard code and remind the model that, hey, you are a model developed by OpenAI, and your name is ChatGPT-4o, and you were trained on this date, and your knowledge cutoff is this.
And basically, it kind of documents the model a little bit. And then this is inserted into your conversations. So when you go on ChatGPT, you see a blank page, but actually the system message is kind of hidden in there, and those tokens are in the context window.
And so those are the two ways to kind of program the models to talk about themselves: either it's done through data like this, or it's done through the system message and things like that, basically invisible tokens that are in the context window that remind the model of its identity. But it's all just kind of cooked up and bolted on in some way; it's not actually really deeply there in any real sense, as it would be for a human.
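And just to make the system-message approach concrete, here is a minimal sketch, reusing the render_conversation sketch from earlier. The wording of the system message, the dates, and the model name are illustrative, not OpenAI's actual system prompt.

```python
# A sketch of the system-message approach: an invisible first turn that documents
# the model's identity and knowledge cutoff, prepended before the user's turn.
conversation = [
    ("system", "You are ChatGPT, a model developed by OpenAI. "
               "Knowledge cutoff: 2023-10. Current date: 2024-06-01."),
    ("user", "What model are you and who built you?"),
]
# The assistant's turn is then sampled with the system tokens already in context:
# context = render_conversation(conversation) + "<|im_start|>assistant<|im_sep|>"
```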
I want to now continue to the next section, which deals with the computational capabilities, or like I should say, the native computational capabilities of these models in problem solving scenarios. And so in particular, we have to be very careful with these models when we construct our examples of conversations.
And there's a lot of sharp edges here, and that are kind of like elucidative, is that a word? They're kind of like interesting to look at when we consider how these models think. So consider the following prompt from a human. And suppose that basically that we are building out a conversation to enter into our training set of conversations.
So we're going to train the model on this; we're teaching it how to basically solve simple math problems. So the prompt is: Emily buys three apples and two oranges, each orange costs $2, the total cost is $13, what is the cost of the apples? Very simple math question. Now, there are two answers here, on the left and on the right.
They are both correct answers; they both say that the answer is three, which is correct. But one of these two is a significantly better answer for the assistant than the other. Like, if I was a data labeler and I was creating one of these, one of these would be a really terrible answer for the assistant, and the other would be okay.
And so I'd like you to potentially pause the video, even, and think through why one of these two is a significantly better answer than the other. And if you use the wrong one, your model will actually be really bad at math, potentially, and it would have bad outcomes. And this is something that you would be careful with in your labeling documentation when you are training people to create the ideal responses for the assistant.
Okay, so the key to this question is to realize and remember that when the models are training and also inferencing, they are working in one dimensional sequence of tokens from left to right. And this is the picture that I often have in my mind. I imagine basically the token sequence evolving from left to right.
And to always produce the next token in a sequence, we are feeding all these tokens into the neural network. And this neural network then gives us the probabilities for the next token in sequence, right? So this picture here is the exact same picture we saw before up here. And this comes from the web demo that I showed you before, right?
So this is the calculation that basically takes the input tokens here on the top, and performs these operations of all these neurons, and gives you the answer for the probabilities of what comes next. Now, the important thing to realize is that, roughly speaking, there's basically a finite number of layers of computation that happen here.
So for example, this model here has only one, two, three layers of what's called attention and MLP here. Maybe a typical modern state-of-the-art network would have more like, say, 100 layers or something like that. But there's only 100 layers of computation or something like that to go from the previous token sequence to the probabilities for the next token.
And so there's a finite amount of computation that happens here for every single token. And you should think of this as a very small amount of computation. And this amount of computation is almost roughly fixed for every single token in this sequence. That's not actually fully true, because the more tokens you feed in, the more expensive this forward pass will be of this neural network, but not by much.
So you should think of this, and I think is a good model to have in mind, this is a fixed amount of compute that's going to happen in this box for every single one of these tokens. And this amount of compute cannot possibly be too big, because there's not that many layers that are sort of going from the top to bottom here.
There's not that much computationally that will happen here. And so you can't imagine a model to basically do arbitrary computation in a single forward pass to get a single token. And so what that means is that we actually have to distribute our reasoning and our computation across many tokens, because every single token is only spending a finite amount of computation on it.
And so we kind of want to distribute the computation across many tokens. And we can't have too much computation or expect too much computation out of the model in any single individual token, because there's only so much computation that happens per token. Okay, roughly fixed amount of computation here.
So that's why this answer here is significantly worse. And the reason for that is, imagine going from left to right here; I copy pasted it right here: the answer is three, etc. Imagine the model having to go from left to right, emitting these tokens one at a time. It has to say, or we're expecting it to say, "The answer is", space, dollar sign.
And then right here, we're expecting it to basically cram all the computation of this problem into this single token, it has to emit the correct answer three. And then once we've emitted the answer three, we're expecting it to say all these tokens. But at this point, we've already produced the answer.
And it's already in the context window for all these tokens that follow. So anything here is just kind of post hoc justification of why this is the answer, because the answer is already created; it's already in the context window. So it's not actually being calculated here. And so if you are answering the question directly and immediately, you are training the model to try to basically guess the answer in a single token.
And that is just not going to work because of the finite amount of computation that happens per token. That's why this answer on the right is significantly better, because we are distributing this computation across the answer, we're actually getting the model to sort of slowly come to the answer.
From left to right, we're getting intermediate results. We're saying, okay, the total cost of the oranges is four, so 13 minus four is nine. And so we're creating intermediate calculations, and each one of these calculations is by itself not that expensive. And so we're basically kind of matching the difficulty in any single one of these individual tokens to what the model is capable of.
And there can never be too much work in any one of these tokens computationally, because then the model won't be able to do that later at test time. And so we're teaching the model here to spread out its reasoning and to spread out its computation over the tokens. And in this way, it only has very simple problems in each token, and they can add up.
And then by the time it's near the end, it has all the previous results in its working memory, and it's much easier for it to determine that the answer is, and here it is, three. So this is a significantly better labeling for our purposes; the direct answer alone would be really bad.
Teaching the model to try to do all the computation in a single token is really bad. So that's an interesting thing to keep in mind in your prompts. Usually you don't have to think about it explicitly, because the people at OpenAI have labelers and so on who actually worry about this and make sure that the answers are spread out.
And so actually OpenAI will kind of do the right thing. So when I ask this question of ChatGPT, it's actually going to go very slowly; it's going to be like, okay, let's define our variables, set up the equation. And it's kind of creating all these intermediate results.
These are not for you. These are for the model. If the model is not creating these intermediate results for itself, it's not going to be able to reach three. I also wanted to show you that it's possible to be a bit mean to the model, we can just ask for things.
So as an example, I said, I gave it the exact same prompt. And I said, answer the question in a single token, just immediately give me the answer, nothing else. And it turns out that for this simple prompt here, it actually was able to do it in a single go.
So it just created, I think, two tokens, right? Because the dollar sign is its own token. So basically, the model didn't give me a single token, it gave me two tokens, but it still produced the correct answer. And it did that in a single forward pass of the network.
Now, that's because the numbers here, I think, are very simple. And so I made it a bit more difficult, to be a bit mean to the model. So I said Emily buys 23 apples and 177 oranges, and then I just made the numbers a bit bigger. I'm just making it harder for the model; I'm asking it to do more computation in a single token.
And so I said the same thing. And here it gave me five, and five is actually not correct. So the model failed to do all this calculation in a single forward pass of the network: it failed to go from the input tokens, in a single go through the network, to the correct result.
And then I said, okay, now don't worry about the token limit, and just solve the problem as usual. And then it goes through all the intermediate results and simplifies, and every one of these intermediate results and intermediate calculations is much easier for the model. It's not too much work per token, and all of the tokens here are correct.
And it arrives at the resolution, which is seven. It just couldn't squeeze all this work into a single forward pass of the network. So I think that's kind of just a cute example, and something to think about. And I think it's, again, illuminating in terms of how these models work.
The last thing that I would say on this topic is that if I was in practice trying to actually solve this in my day-to-day life, I might actually not trust that the model did all the intermediate calculations correctly here. So probably what I'd do is something like this: I would come here and I would say, use code.
And that's because code is one of the possible tools that ChatGPT can use. I don't fully trust this kind of mental arithmetic, and especially if the numbers get really big, there's no guarantee that the model will do this correctly.
Any one of these intermediate steps might, in principle, fail. We're using neural networks to do mental arithmetic, kind of like you doing mental arithmetic in your brain. It might just like screw up some of the intermediate results. It's actually kind of amazing that it can even do this kind of mental arithmetic.
I don't think I could do this in my head. But basically, the model is kind of like doing it in its head. And I don't trust that. So I wanted to use tools. So you can say stuff like, use code. And I'm not sure what happened there. Use code.
And so like I mentioned, there's a special tool and the model can write code. And I can inspect that this code is correct. And then it's not relying on its mental arithmetic. It is writing out Python code, a very simple programming language, and using the Python interpreter to actually calculate the result.
And I would personally trust this a lot more because this came out of a Python program, which I think has a lot more correctness guarantees than the mental arithmetic of a language model. So just another kind of potential hint that if you have these kinds of problems, you may want to basically just ask the model to use the code interpreter.
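Just to make that concrete, here is the kind of short program a code-interpreter tool call might produce for the simpler version of the problem (3 apples, 2 oranges at $2 each, $13 total); the variable names are mine, not anything the model is guaranteed to write:

```python
# Solve for the apple price with plain arithmetic instead of mental math.
total_cost = 13
num_apples = 3
num_oranges = 2
orange_price = 2

apple_price = (total_cost - num_oranges * orange_price) / num_apples
print(apple_price)  # 3.0
```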
And just like we saw with the web search, the model has special kinds of tokens for calling tools. So the result is not generated by the language model itself: the model writes the program, and then that program is sent to a different part of the system that actually runs it and brings back the result.
And then the model gets access to that result and can tell you that, okay, the cost of each apple is seven. So that's another kind of tool, and I would use this in practice yourself. It's just less error prone, I would say. So that's why I called this section "Models need tokens to think".
Distribute your computation across many tokens. Ask models to create intermediate results. Or whenever you can, lean on tools and tool use instead of having the models do all of this stuff in their memory. If they try to do it all in their memory, don't fully trust it, and prefer to use tools whenever possible.
I want to show you one more example of where this actually comes up, and that's in counting. So models actually are not very good at counting for the exact same reason. You're asking for way too much in a single individual token. So let me show you a simple example of that.
How many dots are below? And then I just put in a bunch of dots. And ChatGPT says "There are", and then it just tries to solve the problem in a single token. So in a single token, it has to count the number of dots in its context window, and it has to do that in a single forward pass of the network.
In a single forward pass of a network, as we talked about, there's not that much computation that can happen there. Just think of that as being like very little computation that happens there. So if I just look at what the model sees, let's go to the LLM tokenizer. It sees this.
How many dots are below? And then it turns out that these dots here, this group of I think 20 dots, is a single token. And then this group of whatever it is, is another token. And then for some reason, they break up like this; this has to do with the details of the tokenizer. But the upshot is that the model basically just sees a handful of token IDs: this, this, this, and so on.
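If you want to poke at this yourself, here is a quick sketch using the tiktoken library; I'm assuming the cl100k_base encoding here, and the exact splits depend on which tokenizer a given model uses:

```python
import tiktoken

# Encode the prompt and look at the token IDs the model actually "sees".
enc = tiktoken.get_encoding("cl100k_base")
text = "How many dots are below? " + "." * 30
ids = enc.encode(text)
print(ids)       # a short list of integer token IDs
print(len(ids))  # far fewer entries than there are characters in the text
```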
And then from these token IDs, it's expected to count the number. And spoiler alert, it's not 161. It's actually, I believe, 177. So here's what we can do instead. We can say use code. And you might ask, why should this work? It's actually kind of subtle and kind of interesting.
So when I say use code, I actually expect this to work. Let's see. Okay, 177 is correct. So what happens here is, it doesn't look like it, but I've actually broken down the problem into problems that are easier for the model. I know that the model can't count.
It can't do mental counting. But I know that the model is actually pretty good at doing copy-pasting. So what I'm doing here is when I say use code, it creates a string in Python for this. And the task of basically copy-pasting my input here to here is very simple.
Because for the model, it sees this string as just these four tokens or whatever it is. So it's very simple for the model to copy-paste those token IDs and kind of unpack them into dots here. And so it creates this string, and then it calls the Python string method .count(), and then it comes up with the correct answer.
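So what the tool call boils down to is something like this; the exact string is whatever got copy-pasted out of the prompt:

```python
# Let Python do the counting instead of relying on the model's mental counting.
dots = "." * 177            # stands in for the pasted run of dots from the prompt
print(dots.count("."))      # 177
```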
So the Python interpreter is doing the counting. It's not the model's mental arithmetic doing the counting. So it's, again, a simple example of "models need tokens to think": don't rely on their mental arithmetic. That's also why the models are not very good at counting. If you need them to do counting tasks, always ask them to lean on tools.
Now, the models also have many other little cognitive deficits here and there. And these are kind of like sharp edges of the technology to be kind of aware of over time. So as an example, the models are not very good with all kinds of spelling-related tasks. They're not very good at it.
And I told you that we would loop back around to tokenization. And the reason for this is that the models don't see characters; they see tokens. Their entire world is about tokens, which are these little text chunks. So they don't see characters like our eyes do.
And so very simple character-level tasks often fail. So, for example, I'm giving it a string, ubiquitous, and I'm asking it to print only every third character starting with the first one. So we start with u, and then we go every third character, so q should be next, and so on.
So this, I can see, is not correct. And my hypothesis is that, number one, the mental arithmetic here is failing a little bit. But number two, I think the more important issue is that if you go to Tiktokenizer and you look at ubiquitous, we see that it is three tokens, right?
So you and I see ubiquitous, and we can easily access the individual letters, because we kind of see them. And when we have it in the working memory of our visual sort of field, we can really easily index into every third letter, and I can do that task. But the models don't have access to the individual letters.
They see this as these three tokens. And remember, these models are trained from scratch on the internet. So the model basically has to discover how many of all these different letters are packed into all these different tokens. And the reason we even use tokens is mostly for efficiency.
But I think a lot of people are interested in deleting tokens entirely; we should really have character-level or byte-level models. It's just that that would create very long sequences, and people don't know how to deal with that right now. So while we live in the token world, any kind of spelling task is not actually expected to work super well.
So because I know that spelling is not a strong suit because of tokenization, I can, again, ask it to lean on tools. So I can just say use code. And I would, again, expect this to work, because the task of copy pasting ubiquitous into the Python interpreter is much easier.
And then we're leaning on the Python interpreter to manipulate the characters of this string. So when I say use code, yes, it indexes into every third character of ubiquitous, and the output is uqts, which looks correct to me. So again, an example of how spelling-related tasks don't work very well natively, and how leaning on code gets around it.
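The program it writes is essentially a one-liner; the gist is just string slicing with a step of 3:

```python
word = "ubiquitous"
print(word[::3])   # 'uqts': the characters at indices 0, 3, 6, 9
```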
A very famous recent example of that is: how many r's are there in strawberry? And this went viral many times. Basically, the models now get it correct; they say there are three r's in strawberry. But for a very long time, all the state-of-the-art models would insist that there are only two r's in strawberry.
And this caused a lot of, you know, ruckus, because is that a word? I think so. Because it's just kind of like: why are the models so brilliant, and they can solve math olympiad questions, but they can't count the r's in strawberry? And the answer for that, again, is something I've built up to slowly.
Number one, the models don't see characters, they see tokens. And number two, they are not very good at counting. So here we are combining the difficulty of seeing characters with the difficulty of counting, and that's why the models struggled with this. Though by now, honestly, I think OpenAI may have hardcoded the answer here, or I'm not sure what they did.
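And as before, leaning on a tool sidesteps both difficulties at once; the program is trivial:

```python
# Counting characters is exact in code; no tokenization or mental counting involved.
print("strawberry".count("r"))   # 3
```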
In any case, this specific query now works. So models are not very good at spelling. And there's a bunch of other little sharp edges, and I don't want to go into all of them. I just want to show you a few examples of things to be aware of when you're using these models in practice. I don't want to do a comprehensive analysis here of all the ways that the models fall short; I just want to make the point that there are some jagged edges here and there.
And we've discussed a few of them. And a few of them make sense. But some of them also will just not make as much sense. And they're kind of like you're left scratching your head, even if you understand in depth how these models work. And a good example of that recently is the following.
The models are not very good at very simple questions like this. And this is shocking to a lot of people, because these models can solve complex math problems, they can answer PhD-grade physics, chemistry, biology questions much better than I can, but sometimes they fall short on super simple problems like this.
So here we go: it says 9.11 is bigger than 9.9, and it justifies this in some way, but obviously that's wrong, and then at the end, okay, it actually flips its decision later. So I don't believe that this is very reproducible. Sometimes it flips around its answer, sometimes it gets it right, sometimes it gets it wrong.
Let's try again. Okay, "even though it might look larger". Okay, so here it doesn't even correct itself in the end. If you ask many times, sometimes it gets it right, too. But how is it that the model can do so great on olympiad-grade problems, but then fail on very simple problems like this?
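By the way, the same "use code" trick is the dependable route here too, since the comparison is unambiguous once these are actual numbers:

```python
# Compared as numbers, 9.11 is smaller than 9.9 (i.e. 9.90).
print(9.11 > 9.9)      # False
print(max(9.11, 9.9))  # 9.9
```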
And I think this one is, as I mentioned, a little bit of a head-scratcher. It turns out that a bunch of people studied this in depth, and I haven't actually read the paper. But what I was told by this team was that when you scrutinize the activations inside the neural network, when you look at which features and neurons turn on or off, a bunch of neurons light up that are usually associated with Bible verses.
And so I think the model is kind of reminded that these almost look like Bible verse markers, and in a Bible verse setting, 9.11 would come after 9.9. So basically, the model somehow finds it cognitively very distracting that in Bible verses 9.11 would be greater. Even though it's actually trying to justify it and come up with the answer using math, it still ends up with the wrong answer here.
So it basically just doesn't fully make sense, and it's not fully understood. There are a few jagged issues like that. So treat this as what it is: a stochastic system that is really magical, but that you also can't fully trust. You want to use it as a tool, not as something you just let rip on a problem and then copy-paste the results from.
Okay, so we have now covered two major stages of training of large language models. We saw that in the first stage, this is called the pre training stage, we are basically training on internet documents. And when you train a language model on internet documents, you get what's called a base model.
And it's basically an internet document simulator, right? Now, we saw that this is an interesting artifact. And this takes many months to train on 1000s of computers. And it's kind of a lossy compression of the internet. And it's extremely interesting, but it's not directly useful. Because we don't want to sample internet documents, we want to ask questions of an AI and have it respond to our questions.
So for that, we need an assistant. And we saw that we can actually construct an assistant in the process of post training. And specifically, in the process of supervised fine tuning, as we call it. So in this stage, we saw that it's algorithmically identical to pre training, nothing is going to change.
The only thing that changes is the dataset. So instead of internet documents, we now want to create and curate a very nice dataset of conversations. We want millions of conversations on all kinds of diverse topics between a human and an assistant. And fundamentally, these conversations are created by humans.
So humans write the prompts, and humans write the ideal responses, and they do that based on labeling documentation. Now, in the modern stack, it's not actually done fully manually by humans; they now have a lot of help from these tools. So we can use language models to help us create these datasets.
And this is done extensively. But fundamentally, it's all still coming from human curation at the end. So we create these conversations, that now becomes our dataset, we fine-tune on it, or continue training on it, and we get an assistant. And then we kind of shifted gears and started talking about some of the cognitive implications of what this system is like.
And we saw that, for example, the assistant will hallucinate, if you don't take some sort of mitigations towards it. So we saw that hallucinations would be common. And then we looked at some of the mitigations of those hallucinations. And then we saw that the models are quite impressive and can do a lot of stuff in their head.
But we saw that they can also lean on tools to become better. So for example, we can lean on the web search in order to hallucinate less, and to maybe bring up some more recent information or something like that. Or we can lean on tools like Code Interpreter, so the LLM can write some code and actually run it and see the results.
So these are some of the topics we looked at so far. Now what I'd like to do is, I'd like to cover the last and major stage of this pipeline. And that is reinforcement learning. So reinforcement learning is still kind of thought to be under the umbrella of post-training.
But it is the third and last major stage, and it's a different way of training language models, and it usually follows as this third step. So inside companies like OpenAI, these are all separate teams: there's a team doing data for pre-training, and a team doing training for pre-training.
And then there's a team doing all the conversation data generation, and a different team doing the supervised fine-tuning. And there will be a team for the reinforcement learning as well. So it's kind of like a handoff of these models: you get your base model, then you fine-tune it to be an assistant, and then you go into reinforcement learning, which we'll talk about now.
So that's kind of like the major flow. And so let's now focus on reinforcement learning, the last major stage of training. And let me first actually motivate it and why we would want to do reinforcement learning and what it looks like on a high level. So now I'd like to try to motivate the reinforcement learning stage and what it corresponds to.
It's something that you're probably familiar with, and that is basically going to school. So just like you went to school to become really good at something, we want to take large language models through school. And really what we're doing is we have a few paradigms of ways of giving them knowledge or transferring skills.
So in particular, when we're working with textbooks in school, you'll see that there are three major pieces of information in these textbooks, three classes of information. The first thing you'll see is you'll see a lot of exposition. And by the way, this is a totally random book I pulled from the internet.
I think it's some kind of organic chemistry or something. I'm not sure. But the important thing is that you'll see that most of the text, most of it is kind of just like the meat of it, is exposition. It's kind of like background knowledge, etc. As you are reading through the words of this exposition, you can think of that roughly as training on that data.
And that's why when you're reading through this stuff, this background knowledge and all this context information, it's kind of equivalent to pre-training. So it's where we build sort of a knowledge base of this data and get a sense of the topic. The next major kind of information that you will see is these problems with their worked solutions.
So basically a human expert, in this case, the author of this book, has given us not just a problem, but has also worked through the solution. And the solution is basically like equivalent to having like this ideal response for an assistant. So it's basically the expert is showing us how to solve the problem and it's kind of like in its full form.
So as we are reading the solution, we are basically training on the expert data. And then later we can try to imitate the expert. And basically that roughly corresponds to having the SFT model. That's what it would be doing. So basically we've already done pre-training and we've already covered this imitation of experts and how they solve these problems.
And the third stage of reinforcement learning is basically the practice problems. So sometimes you'll see this is just a single practice problem here. But of course, there will be usually many practice problems at the end of each chapter in any textbook. And practice problems, of course, we know are critical for learning, because what are they getting you to do?
They're getting you to practice yourself and discover ways of solving these problems yourself. And so what you get in the practice problem is you get the problem description, but you're not given the solution, but you are given the final answer, usually in the answer key of the textbook. And so you know the final answer that you're trying to get to, and you have the problem statement, but you don't have the solution.
You are trying to practice the solution. You're trying out many different things, and you're seeing what gets you to the final solution the best. And so you're discovering how to solve these problems. And in the process of that, you're relying on, number one, the background information, which comes from pre-training, and number two, maybe a little bit of imitation of human experts.
And you can probably try similar kinds of solutions and so on. So we've done this and this, and now in this section, we're going to try to practice. And so we're going to be given prompts, and we're going to be given the final answers, but we're not going to be given expert solutions.
We have to practice and try stuff out. And that's what reinforcement learning is about. Okay, so let's go back to the problem that we worked with previously, just so we have a concrete example to talk through as we explore the topic here. So I'm here in Tiktokenizer because, number one, I get a text box, which is useful.
But number two, I want to remind you again that we're always working with one-dimensional token sequences. And so I actually prefer this view because this is the native view of the LLM, if that makes sense. This is what it actually sees. It sees token IDs, right? So Emily buys three apples and two oranges.
Each orange is $2. The total cost of all the fruit is $13. What is the cost of each apple? And what I'd like you to appreciate here is these are like four possible candidate solutions as an example. And they all reach the answer three. Now what I'd like you to appreciate at this point is that if I'm the human data labeler that is creating a conversation to be entered into the training set, I don't actually really know which of these conversations to add to the data set.
Some of these conversations kind of set up a system of equations. Some of them just talk through it in English, and some of them just kind of skip right through to the solution. If you look at ChatGPT, for example, and you give it this question, it defines a system of variables and it kind of does this little thing.
What we have to appreciate and differentiate between though is the first purpose of a solution is to reach the right answer. Of course, we want to get the final answer three. That is the important purpose here. But there's kind of like a secondary purpose as well, where here we are also just kind of trying to make it like nice for the human, because we're kind of assuming that the person wants to see the solution, they want to see the intermediate steps, we want to present it nicely, etc.
So there are two separate things going on here. Number one is the presentation for the human. But number two, we're trying to actually get the right answer. So let's, for the moment, focus on just reaching the final answer. If we only care about the final answer, then which of these is the optimal or like the best prompt?
Sorry, the best solution for the LLM to reach the right answer. And what I'm trying to get at is: we don't know. Me, as a human labeler, I would not know which one of these is best. As an example, we saw earlier on, when we looked at the token sequences and the mental arithmetic and reasoning, that for each token we can only spend a finite amount of compute that is not very large, or at least you should think about it that way.
And so we can't make too big of a leap in any one token, is maybe the way to think about it. So as an example, in this one, what's really nice about it is that it's very few tokens, so it's going to take a very short amount of time to get to the answer.
But right here, when we're doing 13 minus 4, divided by 3, equals, right in this token here, we're actually asking for a lot of computation to happen on that single individual token. And so maybe this is a bad example to give to the LLM, because it's kind of incentivizing it to skip through the calculations very quickly, and it's going to actually make mistakes in this mental arithmetic.
So maybe it would work better to spread the computation out more. Maybe it would be better to set it up as an equation, maybe it would be better to talk through it. We fundamentally don't know. And we don't know because what is easy or hard for you or me as human labelers is different from what's easy or hard for the LLM.
Its cognition is different, and token sequences are differently hard for it. So some of the token sequences here that are trivial for me might be too much of a leap for the LLM. Right here, this token would be way too hard. But conversely, many of the tokens that I'm creating here might be just trivial to the LLM.
And we're just wasting tokens, like why waste all these tokens when this is all trivial. So if the only thing we care about is reaching the final answer, and we're separating out the issue of the presentation to the human, then we don't actually really know how to annotate this example.
We don't know what solution to give to the LLM, because we are not the LLM. It's clear here in the case of the math example, but this is actually a very pervasive issue: our knowledge is not the LLM's knowledge. The LLM actually has a ton of PhD-level knowledge in math and physics and chemistry and whatnot.
So in many ways, it actually knows more than I do. And I'm potentially not utilizing that knowledge in its problem solving. But conversely, I might be injecting a bunch of knowledge in my solutions that the LLM doesn't know in its parameters. And then those are like sudden leaps that are very confusing to the model.
And so our cognitions are different, and I don't really know what to put here, if all we care about is reaching the final solution, and doing it economically, ideally. And so, long story short, we are not in a good position to create these token sequences for the LLM.
Human-written solutions are useful for initializing the system by imitation, but we really want the LLM to discover the token sequences that work for it. It needs to find for itself what token sequences reliably get it to the answer, given the prompt. And it needs to discover that in a process of reinforcement learning and trial and error.
So let's see how this example would work like in reinforcement learning. Okay, so we're now back in the Hugging Face Inference Playground. And that just allows me to very easily call different kinds of models. So as an example, here on the top right, I chose the Gemma 2, 2 billion parameter model.
So 2 billion is very, very small; this is a tiny model, but it's okay. The way that reinforcement learning will basically work is actually quite simple: we need to try many different kinds of solutions, and we want to see which solutions work well or not.
So we're basically going to take the prompt, we're going to run the model. And the model generates a solution. And then we're going to inspect the solution. And we know that the correct answer for this one is $3. And so indeed, the model gets it correct, says it's $3.
So this is correct. So that's just one attempt at the solution. So now we're going to delete this, and we're going to rerun it again. Let's try a second attempt. So the model solves it in a slightly different way, right? Every single attempt will be a different generation, because these models are stochastic systems.
Remember that every single token here, we have a probability distribution, and we're sampling from that distribution. So we end up kind of going down slightly different paths. And so this is the second solution that also ends in the correct answer. Now we're going to delete that. Let's go a third time.
Okay, so again, slightly different solution, but also gets it correct. Now we can actually repeat this many times. And so in practice, you might actually sample thousands of independent solutions, or even like a million solutions, for just a single prompt. And some of them will be correct, and some of them will not be correct.
And basically, what we want to do is we want to encourage the solutions that lead to correct answers. So let's take a look at what that looks like. So if we come back over here, here's kind of like a cartoon diagram of what this is looking like. We have a prompt.
And then we tried many different solutions in parallel. And some of the solutions might go well, so they get the right answer, which is in green. And some of the solutions might go poorly and may not reach the right answer, which is red. Now, this problem here, unfortunately, is not the best example, because it's a trivial prompt.
And as we saw, even like a two billion parameter model always gets it right. So it's not the best example in that sense. But let's just exercise some imagination here. And let's just suppose that the green ones are good, and the red ones are bad. Okay, so we generated 15 solutions, only four of them got the right answer.
And so now what we want to do is, basically, we want to encourage the kinds of solutions that lead to right answers. So whatever token sequences happened in these red solutions, obviously, something went wrong along the way somewhere. And this was not a good path to take through the solution.
And whatever token sequences that were in these green solutions, well, things went pretty well in this situation. And so we want to do more things like it in prompts like this. And the way we encourage this kind of a behavior in the future is we basically train on these sequences.
But these training sequences now are not coming from expert human annotators. There's no human who decided that this is the correct solution. This solution came from the model itself. So the model is practicing here, it's tried out a few solutions, four of them seem to have worked. And now the model will kind of like train on them.
And this corresponds to a student basically looking at their solutions and being like, okay, well, this one worked really well. So this is how I should be solving these kinds of problems. And here in this example, there are many different ways to actually like really tweak the methodology a little bit here.
But just to get the core idea across, maybe it's simplest to just think about taking the single best solution out of these four, like say this one, which is why it's highlighted in yellow. So this is the solution that not only reached the right answer, but maybe had some other nice properties.
Maybe it was the shortest one, or it looked nicest in some ways, or there's other criteria you could think of as an example. But we're going to decide that this is the top solution, we're going to train on it. And then the model will be slightly more likely, once you do the parameter update, to take this path in this kind of a setting in the future.
But you have to remember that we're going to run many different diverse prompts across lots of math problems and physics problems and whatever else there might be. So have in mind maybe tens of thousands of prompts, with thousands of solutions per prompt. And this is all happening kind of at the same time.
And as we're iterating this process, the model is discovering for itself, what kinds of token sequences lead it to correct answers. It's not coming from a human annotator. The model is kind of like playing in this playground. And it knows what it's trying to get to. And it's discovering sequences that work for it.
These are sequences that don't make any mental leaps. They seem to work reliably and statistically, and fully utilize the knowledge of the model as it has it. And so this is the process of reinforcement learning. It's basically a guess and check, we're going to guess many different types of solutions, we're going to check them, and we're going to do more of what worked in the future.
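Here is a toy, hedged sketch of that guess-and-check loop; the canned "solutions" stand in for real stochastic sampling, and there is no actual parameter update, it's just to make the filtering step concrete:

```python
import random

# Canned candidate solutions standing in for samples from the model.
CANDIDATE_SOLUTIONS = [
    ("2 * 2 = 4, 13 - 4 = 9, 9 / 3 = 3, so each apple is $3", 3),      # correct
    ("The oranges cost $4, so the apples cost $9 total, $3 each", 3),  # correct
    ("13 / 3 is about 4.33, so roughly $4", 4),                        # wrong
    ("Each apple is $5", 5),                                           # wrong
]

def sample_solution():
    # Stands in for sampling the model at some temperature; each call can differ.
    return random.choice(CANDIDATE_SOLUTIONS)

correct_answer = 3
rollouts = [sample_solution() for _ in range(1000)]
good = [text for text, answer in rollouts if answer == correct_answer]

print(f"{len(good)} of {len(rollouts)} rollouts reached the right answer")
# A real system would now run a parameter update that up-weights the good rollouts.
```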
And that is reinforcement learning. So in the context of what came before, we see now that the SFT model, the supervised fine-tuning model, is still helpful, because it still kind of initializes the model a little bit into the vicinity of the correct solutions. It's an initialization in the sense that it gets the model to, you know, write out solutions, and maybe it has an understanding of setting up a system of equations, or maybe it kind of talks through a solution.
So it gets you into the vicinity of correct solutions. But reinforcement learning is where everything gets dialed in, we really discover the solutions that work for the model, get the right answers, we encourage them, and then the model just kind of like gets better over time. Okay, so that is the high level process for how we train large language models.
In short, we train them kind of very similar to how we train children. And basically, the only difference is that children go through chapters of books, and they do all these different types of training exercises, kind of within the chapter of each book. But instead, when we train AIs, it's almost like we kind of do it stage by stage, depending on the type of that stage.
So first, what we do is we do pre training, which as we saw is equivalent to basically reading all the expository material. So we look at all the textbooks at the same time, and we read all the exposition, and we try to build a knowledge base. The second thing then is we go into the SFT stage, which is really looking at all the fixed sort of like solutions from human experts of all the different kinds of worked solutions across all the textbooks.
And we just kind of get an SFT model, which is able to imitate the experts, but does so kind of blindly. It just does its best guess, trying to mimic the expert behavior statistically. And that's what you get when you look at all the worked solutions.
And then finally, in the last stage, the RL stage, we do all the practice problems across all the textbooks; we only do the practice problems. And that's how we get the RL model. So on a high level, the way we train LLMs is very much equivalent to the process we use for training children.
The next point I would like to make is that these first two stages, pre-training and supervised fine-tuning, have been around for years and are very standard; all the different LLM providers do them. It is this last stage, the RL training, that is much earlier in its process of development and is not standard yet in the field.
And so this stage is a lot more kind of early and nascent. And the reason for that is because I actually skipped over a ton of little details here in this process. The high level idea is very simple, it's trial and error learning, but there's a ton of details and little mathematical kind of like nuances to exactly how you pick the solutions that are the best, and how much you train on them, and what is the prompt distribution, and how to set up the training run, such that this actually works.
So there's a lot of little details and knobs to the core idea that is very, very simple. And so getting the details right here is not trivial. And so a lot of companies like for example, OpenAI and other LLM providers have experimented internally with reinforcement learning fine tuning for LLMs for a while, but they've not talked about it publicly.
It's all kind of done inside the company. And so that's why the paper from DeepSeek that came out very, very recently was such a big deal. This is a paper from the company called DeepSeek AI in China, and this paper really talked very publicly about reinforcement learning fine-tuning for large language models, how incredibly important it is, and how it brings out a lot of reasoning capabilities in the models.
We'll go into this in a second. So this paper reinvigorated the public interest in using RL for LLMs, and gave a lot of the nitty-gritty details that are needed to reproduce the results and actually get this stage to work for large language models. So let me take you briefly through this DeepSeek RL paper, and what happens when you actually correctly apply RL to language models, what that looks like, and what that gives you.
So the first thing I'll scroll to is this figure two here, where we are looking at the improvement in how the models are solving mathematical problems. So this is the accuracy of solving mathematical problems on AIME. And then we can go to the web page, and we can see the kinds of math problems that are actually being measured here.
So these are simple math problems. You can pause the video if you like, but these are the kinds of problems that basically the models are being asked to solve. And you can see that in the beginning they're not doing very well, but then as you update the model with this many thousands of steps, their accuracy kind of continues to climb.
So the models are improving, and they're solving these problems with a higher accuracy as you do this trial and error on a large dataset of these kinds of problems. And the models are discovering how to solve math problems. But even more incredible than the quantitative kind of results of solving these problems with a higher accuracy is the qualitative means by which the model achieves these results.
So when we scroll down, one of the figures here that is kind of interesting shows that later on in the optimization, the average length per response goes up. So the model seems to be using more tokens to get its higher-accuracy results. It's learning to create very, very long solutions.
Why are these solutions very long? We can look at them qualitatively here. So basically what they discover is that the model solutions get very, very long, partially because of this: here's a question, and here's kind of the answer from the model. What the model learns to do, and this is an emergent property of the optimization, it just discovers that this is good for problem solving, is it starts to do stuff like this:
"Wait, wait. That's an aha moment I can flag here. Let's re-evaluate this step by step to identify if the correct sum can be..." So what is the model doing here, right? The model is basically re-evaluating steps. It has learned that it works better for accuracy to try out lots of ideas, try things from different perspectives, retrace, reframe, backtrack.
It's doing a lot of the things that you and I are doing in the process of problem solving for mathematical questions, but it's rediscovering what happens in your head, not what you put down on the solution, and there is no human who can hard code this stuff in the ideal assistant response.
This is only something that can be discovered in the process of reinforcement learning because you wouldn't know what to put here. This just turns out to work for the model, and it improves its accuracy in problem solving. So the model learns what we call these chains of thought in your head, and it's an emergent property of the optimization, and that's what's bloating up the response lengths, but that's also what's increasing the accuracy of the problem solving.
So what's incredible here is basically the model is discovering ways to think. It's learning what I like to call cognitive strategies of how you manipulate a problem and how you approach it from different perspectives, how you pull in some analogies or do different kinds of things like that, and how you kind of try out many different things over time, check a result from different perspectives, and how you kind of solve problems.
But here, it's kind of discovered by the RL, so extremely incredible to see this emerge in the optimization without having to hard code it anywhere. The only thing we've given it are the correct answers, and this comes out from trying to just solve them correctly, which is incredible. Now let's go back to actually the problem that we've been working with, and let's take a look at what it would look like for this kind of a model, what we call reasoning or thinking model, to solve that problem.
Okay, so recall that this is the problem we've been working with, and when I pasted it into GPT-4o, I got this kind of a response. Let's take a look at what happens when you give the same query to what's called a reasoning or a thinking model. This is a model that was trained with reinforcement learning.
So this model described in this paper, DeepSeek-R1, is available on chat.deepseek.com. So the company that developed it is hosting it. You have to make sure that the DeepThink button is turned on to get the R1 model, as it's called. We can paste it here and run it.
And so let's take a look at what happens now, and what is the output of the model. Okay, so here's what it says. So this is previously what we get using basically what's an SFT approach, a supervised fine-tuning approach. This is like mimicking an expert solution. This is what we get from the RL model.
Okay, let me try to figure this out. So Emily buys three apples and two oranges. Each orange costs $2, total is $13. I need to find out blah blah blah. So here, as you're reading this, you can't escape thinking that this model is thinking. It's definitely pursuing the solution.
It derives that it must cost $3. And then it says, wait a second, let me check my math again to be sure. And then it tries it from a slightly different perspective. And then it says, yep, all that checks out. I think that's the answer. I don't see any mistakes.
Let me see if there's another way to approach the problem, maybe setting up an equation. Let's let the cost of one apple be a; then blah blah blah. Yep, same answer. So definitely each apple is $3. All right, confident that that's correct. And then what it does, once it's done the thinking process, is it writes up the nice solution for the human.
And so this is now considering -- so this is more about the correctness aspect, and this is more about the presentation aspect, where it kind of writes it out nicely and boxes in the correct answer at the bottom. And so what's incredible about this is we get this like thinking process of the model.
And this is what's coming from the reinforcement learning process. This is what's bloating up the length of the token sequences. They're doing thinking and they're trying different ways. This is what's giving you higher accuracy in problem solving. And this is where we are seeing these aha moments and these different strategies and these ideas for how you can make sure that you're getting the correct answer.
The last point I wanted to make is that some people are a little bit nervous about putting, you know, very sensitive data into chat.deepseek.com, because this is a Chinese company. So people are a little bit careful and cagey with that. DeepSeek-R1 is a model that was released by this company.
So this is an open-source model, or open-weights model. It is available for anyone to download and use. You will not be able to run the full model in full precision on a MacBook or a local device, because this is a fairly large model.
But many companies are hosting the full, largest model. One of those companies that I like to use is called together.ai. So when you go to together.ai, you sign up and you go to playgrounds, you can select DeepSeek-R1 here in the chat, and there are many different kinds of other models that you can select here.
These are all state-of-the-art models. So this is kind of similar to the Hugging Face inference playground that we've been playing with so far. But together.ai will usually host all the state-of-the-art models. So select DeepSeq R1. You can try to ignore a lot of these. I think the default settings will often be okay.
And we can put in this. And because the model was released by DeepSeq, what you're getting here should be basically equivalent to what you're getting here. Now because of the randomness in the sampling, we're going to get something slightly different. But in principle, this should be identical in terms of the power of the model.
And you should be able to see the same things quantitatively and qualitatively. But here the model is being served by kind of an American company. So that's DeepSeek, and that's what's called a reasoning model. Now when I go back to ChatGPT, let me go to ChatGPT here. Okay, so among the models that you're going to see in the dropdown here, some of them, like o1, o3-mini, o3-mini-high, etc.,
are described as using advanced reasoning. What "uses advanced reasoning" refers to is the fact that the model was trained by reinforcement learning, with techniques very similar to those of DeepSeek-R1, per public statements of OpenAI employees. So these are thinking models trained with RL. And models like GPT-4o or GPT-4o mini that you're getting in the free tier, you should think of as mostly SFT models, supervised fine-tuning models.
They don't actually do this kind of thinking as you see in the RL models. And even though there's a little bit of reinforcement learning involved with these models, and I'll go into that in a second, these are mostly SFT models; I think you should think about it that way. So in the same way as what we saw here, we can pick one of the thinking models, like say o3-mini-high.
And these models, by the way, might not be available to you unless you pay a ChatGPT subscription of either $20 per month, or $200 per month for some of the top models. So we can pick a thinking model and run. Now what's going to happen here is it's going to say reasoning, and it's going to start to do stuff like this.
And what we're seeing here is not exactly the stuff we're seeing here. So even though under the hood, the model produces these kinds of chains of thought, OpenAI chooses to not show the exact chains of thought in the web interface. It shows little summaries of those chains of thought.
And OpenAI kind of does this, I think, partly because they are worried about what's called a distillation risk. That is that someone could come in and actually try to imitate those reasoning traces and recover a lot of the reasoning performance by just imitating the reasoning chains of thought. And so they kind of hide them and they only show little summaries of them.
So you're not getting exactly what you would get in DeepSeek with respect to the reasoning itself. And then they write up the solution. So these are kind of equivalent, even though we're not seeing the full under-the-hood details. Now, in terms of performance, these models and the DeepSeek models are currently roughly on par, I would say.
It's kind of hard to tell because of the evaluations. But if you're paying $200 per month to OpenAI, I believe some of those models basically still look better. DeepSeek-R1, though, is for now a very solid choice for a thinking model that is available to you either on this website or on any other website, because the model is open weights.
You can just download it. So that's thinking models. So what is the summary so far? Well, we've talked about reinforcement learning and the fact that thinking emerges in the process of the optimization when we run RL on many math and code problems that have verifiable solutions.
So there's a concrete answer, like three, et cetera. Now, these thinking models you can access in, for example, DeepSeek, or via an inference provider like together.ai by choosing DeepSeek there. Thinking models are also available in ChatGPT under any of the o1 or o3 models. But the GPT-4o models, et cetera, are not thinking models.
You should think of them as mostly SFT models. Now, if you have a prompt that requires advanced reasoning and so on, you should probably use some of the thinking models, or at least try them out. But empirically, for a lot of my use, when you're asking a simpler question, like a knowledge-based question or something like that, this might be overkill.
There's no need to think for 30 seconds about some factual question. So for that, I will sometimes default to just GPT-4o. Empirically, about 80 to 90 percent of my use is just GPT-4o. And when I come across a very difficult problem, like in math and code, et cetera, I will reach for the thinking models.
But then I have to wait a bit longer because they are thinking. So you can access these on ChatGPT, on DeepSeek. Also, I wanted to point out aistudio.google.com, even though it looks really busy, really ugly, because Google is just unable to do this kind of stuff well; it's like, what is happening?
But if you choose model, and you choose here Gemini 2.0 Flash Thinking Experimental 0121, that's also an early experimental thinking model by Google. So we can go here and give it the same problem and click run. And this is also a thinking model that will do something similar and comes out with the right answer here.
So basically, Gemini also offers a thinking model. Anthropic currently does not offer a thinking model. But basically, this is kind of the frontier development of these LLMs. I think RL is this new exciting stage, but getting the details right is difficult. And that's why all these thinking models are currently experimental, as of very early 2025.
But this is kind of like the frontier development of pushing the performance on these very difficult problems, using reasoning that is emergent in these optimizations. One more connection that I wanted to bring up is that the discovery that reinforcement learning is an extremely powerful way of learning is not new to the field of AI.
And one place where we've already seen this demonstrated is in the game of Go. And famously, DeepMind developed the system AlphaGo, and you can watch a movie about it, where the system is learning to play the game of Go against top human players. And when we go to the paper underlying AlphaGo, so in this paper, when we scroll down, we actually find a really interesting plot that I think is kind of familiar to us, and we're kind of like rediscovering in the more open domain of arbitrary problem solving, instead of on the closed specific domain of the game of Go.
But basically what they saw, and we're going to see this in LLMs as well as this becomes more mature, is this: this is the ELO rating of playing the game of Go, and this is Lee Sedol, an extremely strong human player. And what they are comparing here is the strength of a model trained by supervised learning versus a model trained by reinforcement learning.
So the supervised learning model is imitating human expert players. So if you just get a huge amount of games played by expert players in the game of Go, and you try to imitate them, you are going to get better, but then you top out, and you never quite get better than some of the top, top, top players in the game of Go, like Lee Sedol.
So you're never going to reach there, because you're just imitating human players; you can't fundamentally go beyond a human player if you're just imitating human players. But the process of reinforcement learning is significantly more powerful. In reinforcement learning for the game of Go, it means that the system is playing moves that empirically and statistically lead to winning the game.
And so AlphaGo is a system that kind of plays against itself, using reinforcement learning to create rollouts. So it's the exact same diagram here, except there's no prompt; it's just a fixed game of Go. It's trying out lots of solutions, it's trying lots of plays, and then the games that lead to a win, instead of to a specific answer, are reinforced.
They're made stronger. And so the system is learning basically the sequences of actions that empirically and statistically lead to winning the game. Reinforcement learning is not going to be constrained by human performance; it can do significantly better and overcome even top players like Lee Sedol.
And so probably they could have run this longer, and they just chose to crop it at some point because this costs money. But this is a very powerful demonstration of reinforcement learning. And we're only starting to kind of see hints of this diagram in larger language models for reasoning problems.
So we're not going to get too far by just imitating experts. We need to go beyond that, set up these little game environments, and let the system discover reasoning traces or ways of solving problems that are unique and that just basically work well. Now on this aspect of uniqueness, notice that when you're doing reinforcement learning, nothing prevents you from veering off the distribution of how humans are playing the game.
And so when we go back to this AlphaGo search here, one of the suggested searches is "move 37". Move 37 in AlphaGo refers to a specific point in time where AlphaGo played a move that no human expert would play. The probability of this move being played by a human player was evaluated to be about 1 in 10,000.
So it's a very rare move. But in retrospect, it was a brilliant move. So AlphaGo, in the process of reinforcement learning, discovered kind of like a strategy of playing that was unknown to humans, but is in retrospect brilliant. I recommend this YouTube video, Lee Sedol versus AlphaGo move 37 reaction analysis.
And this is kind of what it looked like when AlphaGo played this move. "That's a very surprising move." "I thought it was a mistake." "When I see this move." Anyway, so basically people are kind of freaking out because it's a move that a human would not play, that AlphaGo played, because in its training, this move seemed to be a good idea.
It just happens not to be the kind of thing that humans would do. And so that is, again, the power of reinforcement learning. And in principle, we can actually see the equivalent of that if we continue scaling this paradigm in language models. And what that looks like is kind of unknown.
So what does it mean to solve problems in such a way that even humans would not be able to get? How can you be better at reasoning or thinking than humans? How can you go beyond just a thinking human? Like maybe it means discovering analogies that humans would not be able to create.
Or maybe it's like a new thinking strategy. It's kind of hard to think through. Maybe it's a wholly new language that is not even English. Maybe it discovers its own language that is a lot better for thinking, because the model is not even constrained to stick with English.
So maybe it takes a different language to think in, or it discovers its own language. So in principle, the behavior of the system is a lot less defined. It is open to do whatever works. And it is open to also slowly drift from the distribution of its training data, which is English.
But all of that can only be done if we have a very large, diverse set of problems in which these strategies can be refined and perfected. And so a lot of the frontier LLM research going on right now is trying to create those kinds of prompt distributions that are large and diverse.
These are all kind of like game environments in which the LLMs can practice their thinking. And it's kind of like writing practice problems: we have to create practice problems for all domains of knowledge. And if we have tons of practice problems, the models will be able to reinforcement learn on them and create these kinds of diagrams, but in the domain of open thinking instead of the closed domain of the game of Go.
There's one more section within reinforcement learning that I wanted to cover, and that is learning in unverifiable domains. So far, all of the problems that we've looked at are in what's called verifiable domains.
That is, any candidate solution can be scored very easily against a concrete answer. So for example, the answer is three, and we can very easily score these solutions against the answer of three. Either we require the models to box in their answers, and then we just check for equality of whatever's in the box with the answer.
Or you can also use what's called an LLM judge. So the LLM judge looks at a solution, and it gets the answer, and just basically scores the solution for whether it's consistent with the answer or not. And LLMs at their current capability are empirically good enough that they can do this fairly reliably.
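As a concrete toy example of what automatic scoring in a verifiable domain can look like, here is a minimal sketch. The function name and the exact boxed-answer convention are just assumptions for illustration, not any particular lab's actual grader.

```python
import re

def score_solution(solution_text: str, correct_answer: str) -> float:
    """Return 1.0 if the boxed answer matches the known correct answer, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", solution_text)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == correct_answer.strip() else 0.0

# The known answer is 3; we score two candidate solutions against it.
print(score_solution(r"... so the total is \boxed{3}", "3"))             # 1.0
print(score_solution(r"... therefore the answer is \boxed{5}", "3"))     # 0.0
```

An LLM judge is the same idea with a model call in place of the string comparison: the judge is given the candidate solution plus the reference answer and asked whether they are consistent.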
So we can apply those kinds of techniques as well. In any case, we have a concrete answer, and we're just checking solutions against it. And we can do this automatically with no humans in the loop. The problem is that we can't apply this strategy in what's called unverifiable domains.
So usually these are, for example, creative writing tasks like write a joke about pelicans or write a poem or summarize a paragraph or something like that. In these kinds of domains, it becomes harder to score our different solutions to this problem. So for example, writing a joke about pelicans, we can generate lots of different jokes, of course, that's fine.
For example, you can go to ChatGPT and we can get it to generate a joke about pelicans. Why do pelicans carry so much stuff in their beaks? Because they don't belican in backpacks. Okay, we can try something else. Why don't pelicans ever pay for their drinks? Because they always bill it to someone else.
Haha. Okay, so these models are obviously not very good at humor. Actually, I think that's pretty fascinating, because I think humor is secretly very difficult, and the models just don't have the capability, I think. Anyway, in any case, you could imagine creating lots of jokes. The problem that we are facing is how do we score them?
Now, in principle, we could, of course, get a human to look at all these jokes, just like I did right now. The problem with that is if you are doing reinforcement learning, you're going to be doing many thousands of updates. And for each update, you want to be looking at, say, thousands of prompts.
And for each prompt, you want to be potentially looking at hundreds or thousands of different kinds of generations. And so there are just way too many of these to look at. In principle, you could have a human inspect all of them and score them and decide that, okay, maybe this one is funny.
And maybe this one is funny. And this one is funny. And we could train on them to get the model to become slightly better at jokes, in the context of pelicans, at least. The problem is that it's just like way too much human time. This is an unscalable strategy.
We need some kind of an automatic strategy for doing this. And one solution to this was proposed in this paper that introduced what's called reinforcement learning from human feedback. So this was a paper from OpenAI at the time, and many of these people are now co-founders of Anthropic.
And this paper proposed an approach for basically doing reinforcement learning in unverifiable domains. So let's take a look at how that works. So this is the cartoon diagram of the core ideas involved. As I mentioned, the naive approach is that if we just had infinite human time, we could just run RL in these domains just fine.
So, for example, we can run RL as usual if I have infinity humans. I just want to do, and these are just cartoon numbers, I want to do 1,000 updates where each update will be on 1,000 prompts. And for each prompt, we're going to have 1,000 rollouts that we're scoring.
So we can run RL with this kind of a setup. The problem is that in the process of doing this, I would need to ask a human to evaluate a joke a total of 1,000 × 1,000 × 1,000, which is 1 billion times. And so that's a lot of people looking at really terrible jokes.
So we don't want to do that. Instead, we want to take the RLHF approach. In the RLHF approach, the core trick is that of indirection. So we're going to involve humans just a little bit. And the way we cheat is that we basically train a whole separate neural network that we call a reward model.
And this neural network will kind of like imitate human scores. So we're going to ask humans to score rollouts, we're going to then imitate human scores using a neural network. And this neural network will become a kind of simulator of human preferences. And now that we have a neural network simulator, we can do RL against it.
So instead of asking a real human, we're asking a simulated human for their score of a joke, as an example. And once we have a simulator, we're off to the races, because we can query it as many times as we want to, and it's a whole automatic process.
And we can now do reinforcement learning with respect to the simulator. And the simulator, as you might expect, is not going to be a perfect human. But if it's at least statistically similar to human judgment, then you might expect that this will do something. And in practice, indeed, it does.
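Here is a minimal toy sketch of that indirection. The reward model below is just a dummy heuristic so the code runs, and all the names are made up for illustration; in reality the reward model is a separate trained transformer, and a real RL step would update the policy rather than just pick a winner.

```python
def reward_model(prompt: str, response: str) -> float:
    """Stand-in for the learned reward model: a separate neural net trained to
    imitate human scores. Here it's a dummy heuristic just so the sketch runs."""
    return min(1.0, 0.1 + 0.02 * len(set(response.split())))

def pick_best(prompt: str, candidate_responses: list[str]) -> str:
    # Instead of asking a human to judge every rollout (billions of queries),
    # we query the simulator as often as we like. A real RL update would nudge
    # the policy toward responses the reward model scores highly; here we just
    # return the preferred one to show the indirection.
    scores = [reward_model(prompt, r) for r in candidate_responses]
    return candidate_responses[scores.index(max(scores))]

print(pick_best("write a joke about pelicans",
                ["a joke", "a slightly longer pelican joke with more distinct words"]))
```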
So once we have a simulator, we can do RL and everything works great. So let me show you a cartoon diagram a little bit of what this process looks like, although the details are not 100% like super important, it's just a core idea of how this works. So here we have a cartoon diagram of a hypothetical example of what training the reward model would look like.
So we have a prompt like write a joke about pelicans. And then here we have five separate rollouts. So these are all five different jokes, just like this one. Now, the first thing we're going to do is we are going to ask a human to order these jokes from the best to worst.
So here, this human thought that this joke was the best, the funniest. So this is the number one joke, this is the number two joke, number three, four, and five. So this is the worst joke. We're asking humans to order instead of giving scores directly, because it's a bit of an easier task.
It's easier for a human to give an ordering than to give precise scores. That ordering is now the supervision for the model. So the human has ordered them, and that is kind of like their contribution to the training process. But now, separately, what we're going to do is we're going to ask a reward model for its scoring of these jokes.
Now the reward model is a whole separate neural network, completely separate neural net. And it's also probably a transformer. But it's not a language model in the sense that it generates diverse language, etc. It's just a scoring model. So the reward model will take as an input, the prompt, number one, and number two, a candidate joke.
So those are the two inputs that go into the reward model. So here, for example, the reward model would be taking this prompt, and this joke. Now the output of a reward model is a single number. And this number is thought of as a score. And it can range, for example, from zero to one.
So zero would be the worst score, and one would be the best score. So here are some examples of what a hypothetical reward model at some stage in the training process would give as scoring to these jokes. So 0.1 is a very low score, 0.8 is a really high score, and so on.
And so now we compare the scores given by the reward model with the ordering given by the human. And there's a precise mathematical way to actually calculate this, basically set up a loss function and calculate a kind of like a correspondence here, and update a model based on it.
But I just want to give you the intuition, which is that, as an example here, for this second joke, the human thought that it was the funniest, and the model kind of agreed, right? 0.8 is a relatively high score. But this score should have been even higher, right? So after an update of the network, we would expect this score to grow a little, to say 0.81 or something.
For this one here, they actually are in massive disagreement, because the human thought that this was number two, but here the score is only 0.1. And so this score needs to be much higher. So after an update, on top of this kind of supervision, this might grow a lot, like maybe to 0.15 or something like that.
And then here, the human thought that this one was the worst joke, but the model actually gave it a fairly high number. So you might expect that after the update, this would come down to maybe 0.35 or something like that. So basically, we're doing what we did before.
We're slightly nudging the predictions of the model using the neural network training process, and we're trying to make the reward model scores be consistent with the human ordering. As we update the reward model on human data, it becomes a better and better simulator of the scores and orderings that humans provide, and so it becomes kind of like a simulator of human preferences, which we can then do RL against.
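For a feel of what that update can look like in code, here is a small runnable sketch in PyTorch. It uses a tiny linear layer over made-up feature vectors instead of a transformer over (prompt, joke) tokens, and a Bradley-Terry-style pairwise loss, which is one standard way to train a reward model from human orderings; the details of the actual papers and production systems differ.

```python
import torch

torch.manual_seed(0)
# Stand-in reward model: a real one is a transformer reading (prompt, candidate)
# and emitting one scalar; here it's a linear layer over 8-dim dummy features.
reward_model = torch.nn.Linear(8, 1)
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# From a human ordering we can form (preferred, rejected) pairs. The feature
# vectors below are random just so the sketch runs end to end.
preferred = torch.randn(64, 8)  # features of candidates humans ranked higher
rejected = torch.randn(64, 8)   # features of candidates humans ranked lower

for step in range(200):
    s_pref = reward_model(preferred).squeeze(-1)
    s_rej = reward_model(rejected).squeeze(-1)
    # Pairwise loss: push the preferred score above the rejected score.
    loss = -torch.nn.functional.logsigmoid(s_pref - s_rej).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("mean preferred score:", reward_model(preferred).mean().item())
print("mean rejected score:", reward_model(rejected).mean().item())
```

The key property is just that, after training, candidates humans would rank higher tend to get higher scores, which is all we need to use the reward model as a simulated human inside RL.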
But critically, we're not asking humans 1 billion times to look at a joke. We're maybe looking at 1000 prompts and 5 rollouts each. So maybe 5000 jokes that humans have to look at in total. And they just give the ordering, and then we're training the model to be consistent with that ordering.
And I'm skipping over the mathematical details. But I just want you to understand a high level idea that this reward model is basically giving us the scores, and we have a way of training it to be consistent with human orderings. And that's how RLHF works. Okay, so that is the rough idea.
We basically train simulators of humans and do RL with respect to those simulators. Now, I want to talk first about the upside of reinforcement learning from human feedback. The first thing is that this allows us to run reinforcement learning, which we know is an incredibly powerful set of techniques.
And it allows us to do it in arbitrary domains, and including the ones that are unverifiable. So things like summarization, and poem writing, joke writing, or any other creative writing, really, in domains outside of math and code, etc. Now, empirically, what we see when we actually apply RLHF is that this is a way to improve the performance of the model.
And I have a best guess for why that might be, but I don't think it's actually super well established why this is. You can empirically observe that when you do RLHF correctly, the models you get are just a little bit better. But as to why, I think it's not as clear.
So here's my best guess. My best guess is that this is possibly mostly due to the discriminator-generator gap. What that means is that in many cases, it is significantly easier for humans to discriminate than to generate. In particular, an example of this is when we do supervised fine-tuning, SFT.
We're asking humans to generate the ideal assistant response. And in many of the cases I've shown, the ideal response is very simple to write, but in many cases it might not be. So for example, in summarization, or poem writing, or joke writing, how are you, as a human labeler, supposed to come up with the ideal response? It requires creative human writing to do that.
And RLHF kind of sidesteps this, because we get to ask people a significantly easier question as data labelers. They're not asked to write poems directly; they're just given, say, five poems from the model and asked to order them. And so that's just a much easier task for a human labeler to do.
And so what I think this allows you to do, basically, is get a lot more higher-accuracy data, because we're not asking people to do the generation task, which can be extremely difficult. We're not asking them to do creative writing, we're just trying to get them to distinguish between creative writings and find the ones that are best.
And that is the signal that humans are providing just the ordering. And that is their input into the system. And then the system in RLHF just discovers the kinds of responses that would be graded well by humans. And so that step of indirection allows the models to become even better.
So that is the upside of RLHF. It allows us to run RL, it empirically results in better models, and it allows people to contribute their supervision, even without having to do extremely difficult tasks in the case of writing ideal responses. Unfortunately, RLHF also comes with significant downsides. And so the main one is that basically we are doing reinforcement learning, not with respect to humans and actual human judgment, but with respect to a lossy simulation of humans, right?
And this lossy simulation could be misleading, because it's just a simulation, right? It's just a language model that's kind of outputting scores, and it might not perfectly reflect the opinion of an actual human with an actual brain in all the possible different cases. So that's number one.
There's actually something even more subtle and devious going on that really dramatically holds back RLHF as a technique that we can scale to significantly smarter systems. And that is that reinforcement learning is extremely good at discovering ways to game the model, to game the simulation.
So this reward model that we're constructing here, that gives the scores, these models are transformers. These transformers are massive neural nets. They have billions of parameters, and they imitate humans, but they do so in kind of a simulated way. Now, the problem is that these are massive, complicated systems, right?
There are billions of parameters here that are outputting a single score. It turns out that there are ways to game these models. You can find kinds of inputs that were not part of their training set, and these inputs inexplicably get very high scores, but in a fake way. So very often what you find if you run RLHF for very long, say 1,000 updates, which is a lot of updates, you might expect that your jokes are getting better and that you're getting real bangers about pelicans, but that's not exactly what happens.
What happens is that in the first few hundred steps, the jokes about pelicans are probably improving a little bit. And then they actually dramatically fall off the cliff and you start to get extremely nonsensical results. Like for example, you start to get the top joke about pelicans starts to be the the the the the the.
And this makes no sense, right? Like when you look at it, why should this be a top joke? But when you take the the the the the the and you plug it into your reward model, you'd expect a score of zero, but actually the reward model loves this as a joke.
It will tell you that the the the the the gets a score of 1.0. This is a top joke, and this makes no sense, right? But it's because these models are just simulations of humans, and they're massive neural nets, and you can find inputs that kind of wiggle their way into parts of the input space that give you nonsensical results.
These examples are what's called adversarial examples, and I'm not going to go into the topic too much, but these are adversarial inputs to the model. They are specific little inputs that kind of go between the nooks and crannies of the model and give nonsensical results at the top. Now here's what you might imagine doing.
You say, okay, the the the is obviously not a score of one; it's obviously a low score. So let's take the the the the the, add it to the dataset, and give it an ordering that is extremely bad, like last place. And indeed, your model will learn that the the the the should have a very low score, and it will give it a score of zero.
The problem is that there will always be basically infinite number of nonsensical adversarial examples hiding in the model. If you iterate this process many times and you keep adding nonsensical stuff to your reward model and giving it very low scores, you'll never win the game. You can do this many, many rounds, and reinforcement learning, if you run it long enough, will always find a way to game the model.
It will discover adversarial examples. It will get really high scores with nonsensical results. And fundamentally, this is because our scoring function is a giant neural net, and RL is extremely good at finding just the ways to trick it. So long story short, you can only run RLHF for maybe a few hundred updates; the model is getting better, and then you have to crop it and you are done.
You can't run too long against this reward model, because the optimization will start to game it, so you basically crop it, call it done, and ship it. And you can improve the reward model, but you kind of come across these situations eventually at some point. So RLHF, basically what I usually say, is that RLHF is not RL.
And what I mean by that is, RLHF is RL, obviously, but it's not RL in the magical sense. This is not RL that you can run indefinitely. Those kinds of problems where you are getting a concrete correct answer cannot be gamed as easily. You either got the correct answer or you didn't.
And the scoring function is much, much simpler. You're just looking at the boxed area and seeing if the result is correct. So it's very difficult to game these functions, but gaming a reward model is possible. Now, in these verifiable domains, you can run RL indefinitely. You could run for tens of thousands, hundreds of thousands of steps and discover all kinds of really crazy strategies that we might not even ever think of for performing really well on all these problems.
In the game of Go, there's no way to basically game the winning of a game or losing of a game. We have a perfect simulator. We know where all the stones are placed, and we can calculate whether someone has won or not. There's no way to game that. And so you can do RL indefinitely, and you can eventually beat even Lee Sedol.
But with models like this, which are gameable, you cannot repeat this process indefinitely. So I kind of see RLHF as not real RL because the reward function is gameable. So it's kind of more like in the realm of like little fine-tuning. It's a little improvement, but it's not something that is fundamentally set up correctly, where you can insert more compute, run for longer, and get much better and magical results.
So it's not RL in that sense. It's not RL in the sense that it lacks magic. It can fine-tune your model and get you better performance. And indeed, if we go back to ChatGPT, the GPT-4o model has gone through RLHF because it works well, but it's just not RL in the same sense.
RLHF is like a little fine-tune that slightly improves your model; that's maybe the way I would think about it. Okay, so that's most of the technical content that I wanted to cover. I took you through the three major stages and paradigms of training these models: pre-training, supervised fine-tuning, and reinforcement learning.
And I showed you that they loosely correspond to the process we already use for teaching children. In particular, we talked about pre-training being sort of like the basic knowledge acquisition of reading exposition, supervised fine-tuning being the process of looking at lots and lots of worked examples and imitating experts, and reinforcement learning being the process of doing practice problems.
The only difference is that we now have to effectively write textbooks for LLMs and AIs across all the disciplines of human knowledge. And also in all the cases where we actually would like them to work, like code and math and basically all the other disciplines. So we're in the process of writing textbooks for them, refining all the algorithms that I've presented on the high level.
And then of course, doing a really, really good job at the execution of training these models at scale and efficiently. So in particular, I didn't go into too many details, but these are extremely large and complicated distributed sort of jobs that have to run over tens of thousands or even hundreds of thousands of GPUs.
And the engineering that goes into this is really at the state of the art of what's possible with computers at that scale. So I didn't cover that aspect too much, but this is a very serious endeavor underlying all of these very simple algorithms ultimately. Now, I also talked a little bit about sort of the theory of mind of these models.
And the thing I want you to take away is that these models are really good and extremely useful as tools for your work, but you shouldn't trust them fully. And I showed you some examples of that. Even though we have mitigations for hallucinations, the models are not perfect and they will still hallucinate.
It's gotten better over time and it will continue to get better, but they can hallucinate. In addition to that, I covered what I call the Swiss cheese model of LLM capabilities that you should have in your mind. The models are incredibly good across so many different disciplines, but then fail almost randomly in some unique cases.
So for example, what is bigger, 9.11 or 9.9? Like the model doesn't know, but simultaneously it can turn around and solve Olympiad questions. And so this is a hole in the Swiss cheese and there are many of them and you don't want to trip over them. So don't treat these models as infallible models, check their work, use them as tools, use them for inspiration, use them for the first draft, but work with them as tools and be ultimately responsible for the, you know, product of your work.
And that's roughly what I wanted to talk about. This is how they're trained and this is what they are. Let's now turn to what are some of the future capabilities of these models, probably what's coming down the pipe. And also where can you find these models? I have a few bullet points on some of the things that you can expect coming down the pipe.
The first thing you'll notice is that models will very rapidly become multimodal. Everything I've talked about concerned text, but very soon we'll have LLMs that can not just handle text, but can also operate natively and very easily over audio, so they can hear and speak, and also images, so they can see and paint.
And we're already seeing the beginnings of all of this, but this will be all done natively inside the language model, and this will enable kind of like natural conversations. And roughly speaking, the reason that this is actually no different from everything we've covered above is that as a baseline, you can tokenize audio and images and apply the exact same approaches of everything that we've talked about above.
So it's not a fundamental change. It's just we have to add some tokens. So as an example, for tokenizing audio, we can look at slices of the spectrogram of the audio signal and we can tokenize that and just add more tokens that suddenly represent audio and just add them into the context windows and train on them just like above.
The same for images, we can use patches and we can separately tokenize patches, and then what is an image? An image is just a sequence of tokens. And this actually kind of works, and there's a lot of early work in this direction. And so we can just create streams of tokens that are representing audio, images, as well as text, and intersperse them and handle them all simultaneously in a single model.
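To show the flavor of turning an image into tokens, here is a small self-contained sketch. The VQ-style codebook lookup and all the numbers are made up for illustration; real multimodal systems use a variety of learned schemes, so treat this only as an illustration of "an image becomes a sequence of discrete ids."

```python
import numpy as np

np.random.seed(0)
image = np.random.rand(32, 32)          # toy grayscale "image"
codebook = np.random.rand(512, 8 * 8)   # pretend learned codebook of patch vectors

def image_to_tokens(img: np.ndarray, patch: int = 8) -> list[int]:
    """Cut the image into patches and map each patch to the id of its nearest
    codebook entry, so the image becomes a short sequence of discrete tokens."""
    tokens = []
    for i in range(0, img.shape[0], patch):
        for j in range(0, img.shape[1], patch):
            vec = img[i:i + patch, j:j + patch].reshape(-1)
            dists = ((codebook - vec) ** 2).sum(axis=1)
            tokens.append(int(dists.argmin()))
    return tokens

image_tokens = image_to_tokens(image)
print(len(image_tokens), image_tokens[:8])  # 16 patch tokens for a 32x32 image
# These ids can then be interleaved with text (and audio) tokens in one context
# window and trained on with the exact same next-token objective as before.
```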
So that's one example of multimodality. Second, something that people are very interested in: currently, we're mostly handing individual tasks to the models on kind of like a silver platter, like please solve this task for me, and the model sort of does this little task.
But it's up to us to still sort of like organize a coherent execution of tasks to perform jobs. And the models are not yet at the capability required to do this in a coherent error correcting way over long periods of time. So they're not able to fully string together tasks to perform these longer running jobs.
But they're getting there, and this is improving over time. But probably what's going to happen here is we're going to start to see what's called agents, which perform tasks over time, and you supervise them, and you watch their work, and they come back to you once in a while to report progress, and so on.
So we're going to see more long-running agents, tasks that don't just take a few seconds of response, but many tens of seconds, or even minutes or hours over time. But these models are not infallible, as we talked about above, so all of this will require supervision. For example, in factories, people talk about the human-to-robot ratio for automation. I think we're going to see something similar in the digital space, where we'll be talking about human-to-agent ratios, where humans become a lot more like supervisors of agentic tasks in the digital domain.
Next, I think everything's going to become a lot more pervasive and invisible, so it's kind of like integrated into the tools and into everything. And in addition, as a separate bullet point, computer use: right now, these models aren't really able to take actions on your behalf.
But if you saw ChatGPT launch Operator, that's one early example of that, where you can actually hand off control to the model to perform keyboard and mouse actions on your behalf. So that's also something that I think is very interesting. The last point I have here is just a general comment that there's still a lot of research to potentially do in this domain.
One example of that is something along the lines of test time training. So remember that everything we've done above, and that we talked about has two major stages. There's first the training stage where we tune the parameters of the model to perform the tasks well. Once we get the parameters, we fix them, and then we deploy the model for inference.
From there, the model is fixed, it doesn't change anymore, it doesn't learn from all the stuff that it's doing at test time, it's a fixed number of parameters. And the only thing that is changing is now the tokens inside the context windows. And so the only type of learning or test time learning that the model has access to is the in-context learning of its kind of like dynamically adjustable context window, depending on like what it's doing at test time.
But I think this is still different from humans, who actually are able to learn depending on what they're doing, especially when you sleep, for example; your brain is updating your parameters or something like that, right? So there's no equivalent of that currently in these models and tools.
So there's a lot of like more wonky ideas, I think, that are to be explored still. And in particular, I think this will be necessary because the context window is a finite and precious resource. And especially once we start to tackle very long running multimodal tasks, and we're putting in videos, and these token windows will basically start to grow extremely large, like not thousands or even hundreds of thousands, but significantly beyond that.
And the only trick we have available to us right now is to make the context windows longer. But I think that approach by itself will not scale to actual long-running tasks that are multimodal over time. And so I think new ideas are needed in some of those cases where these tasks are going to require very long contexts.
So those are some examples of some of the things you can expect coming down the pipe. Let's now turn to where you can actually keep track of this progress and be up to date with the latest and greatest of what's happening in the field. So I would say the three resources that I have consistently used to stay up to date are, number one, LM Arena.
So let me show you LM Arena. This is basically an LLM leaderboard, and it ranks all the top models. And the ranking is based on human comparisons. So humans prompt these models, and they get to judge which one gives a better answer. They don't know which model is which; they're just looking at which gives the better answer.
And you can calculate a ranking from that, and then you get some results. And so what you can see here is the different organizations, like Google's Gemini, for example, that produce these models. When you click on any one of these, it takes you to the place where that model is hosted.
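To give a flavor of how a ranking can come out of blind pairwise votes, here is an Elo-style update; as far as I know, the actual leaderboard fits its ratings more carefully from all the votes at once, so treat this only as the rough idea.

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo-style update from a single human vote: the winner's rating goes up,
    the loser's goes down, and an upset moves the ratings more."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Suppose humans preferred model_a over model_b in three blind comparisons.
for _ in range(3):
    ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"])
print(ratings)
```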
And then here we see Google is currently on top, with OpenAI right behind. Here we see DeepSeek in position number three. Now, the reason this is a big deal is the last column here; you see the license. DeepSeek is an MIT-licensed model. It's open weights: anyone can use these weights, anyone can download them, anyone can host their own version of DeepSeek, and they can use it in whatever way they like.
And so it's not a proprietary model that you don't have access to; it's basically an open-weights release. And so this is kind of unprecedented, that a model this strong was released with open weights. So pretty cool from the team. Next up, we have a few more models from Google and OpenAI.
And then when you continue to scroll down, you start to see some other usual suspects. So xAI here, Anthropic with Sonnet here at number 14, and then Meta with Llama over here. So Llama, similar to DeepSeek, is an open-weights model, but it's down here as opposed to up here.
Now I will say that this leaderboard was really good for a long time. I do think that in the last few months, it's become a little bit gamed. And I don't trust it as much as I used to. I think just empirically, I feel like a lot of people, for example, are using Sonnet from Anthropic and that it's a really good model.
But that's all the way down here at number 14. And conversely, I think not as many people are using Gemini, but it's ranking really, really high. So I think use this as a first pass, but also try out a few of the models for your tasks and see which one performs better.
The second thing that I would point to is the AI News newsletter. So AI News is not very creatively named, but it is a very good newsletter produced by Swix and Friends. So thank you for maintaining it. And it's been very helpful to me because it is extremely comprehensive.
So if you go to archives, you see that it's produced almost every other day. And it is very comprehensive. And some of it is written by humans and curated by humans, but a lot of it is constructed automatically with LLMs. So you'll see that these are very comprehensive, and you're probably not missing anything major, if you go through it.
Of course, you're probably not going to go through it because it's so long. But I do think that these summaries all the way up top are quite good, and I think have some human oversight. So this has been very helpful to me. And the last thing I would point to is just X and Twitter.
A lot of AI happens on X. And so I would just follow people who you like and trust and get all your latest and greatest on X as well. So those are the major places that have worked for me over time. And finally, a few words on where you can find the models, and where can you use them.
So the first one I would say is, for any of the biggest proprietary models, you just have to go to the website of that LLM provider. So for example, for OpenAI, that's chatgpt.com, and I believe chat.com actually works now as well. For Gemini, I think it's gemini.google.com, or AI Studio. I think they have two for some reason that I don't fully understand; no one does.
For the open-weights models like DeepSeek, Llama, etc., you have to go to some kind of an inference provider of LLMs. So my favorite one is Together, at together.ai. And I showed you that when you go to the playground of together.ai, you can pick lots of different models.
And all of these are open models of different types, and you can talk to them here, as an example. Now, if you'd like to use a base model, it's not as common to find base models; even these inference providers are all targeting assistants and chat.
So even here, I couldn't see base models. For base models, I usually go to Hyperbolic, because they serve the Llama 3.1 base model, and I love that model. You can just talk to it there. So as far as I know, this is a good place for a base model.
And I wish more people hosted base models, because they are useful and interesting to work with in some cases. Finally, you can also take some of the models that are smaller and run them locally. So for example, DeepSeek, the biggest model, you're not going to be able to run locally on your MacBook.
But there are smaller versions of the DeepSeek model that are what's called distilled. And then also, you can run these models at smaller precision, so not at the native precision of, for example, fp8 for DeepSeek or bf16 for Llama, but much, much lower than that. Don't worry if you don't fully understand those details, but you can run smaller versions that have been distilled, and then at even lower precision, and then you can fit them on your computer.
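As one hedged example of what running a small model locally at reduced precision can look like, here is a sketch using the Hugging Face transformers library. The model id is a placeholder, not a real repository; substitute whatever small distilled model you actually want to run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-small-distilled-model"  # placeholder id, not a real model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # lower precision than the native training precision
    device_map="auto",          # place layers on GPU/CPU automatically
)

inputs = tokenizer("Tell me a joke about pelicans.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

In practice, local runners often use even more aggressive quantization than fp16, which is how the bigger distilled models end up fitting on a laptop.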
And so you can actually run pretty okay models on your laptop. My favorite place I usually go to is LM Studio, which is basically an app you can get. I think it actually looks kind of ugly, and I don't like that it shows you all these models that are basically not that useful; like, everyone just wants to run DeepSeek.
So I don't know why they give you 500 different types of models. They're really complicated to search for, and you have to choose different distillations and different precisions, and it's all really confusing. But once you actually understand how it works, and that's a whole separate video, then you can actually load up a model. Like here, I loaded up a Llama 3.2 Instruct 1B.
And you can just talk to it. So I asked for pelican jokes, and I can ask for another one, and it gives me another one, etc. All of this happens locally on your computer, so nothing is actually going anywhere else; this is running on the GPU of the MacBook Pro.
So that's very nice. And you can then eject the model when you're done, and that frees up the RAM. So LM Studio is probably my favorite one, even though I think it's got a lot of UI/UX issues and it's really geared towards professionals, almost. But if you watch some videos on YouTube, I think you can figure out how to use this interface.
So those are a few words on where to find them. So let me now loop back around to where we started. The question was, when we go to chatgpt.com and we enter some kind of a query and we hit go, what exactly is happening here? What are we seeing?
What are we talking to? How does this work? And I hope that this video gave you some appreciation for some of the under the hood details of how these models are trained, and what this is that is coming back. So in particular, we now know that your query is taken, and is first chopped up into tokens.
So we go to TikTokenizer. And here, in the place in the format that is reserved for the user query, we basically put in our query right there. So our query goes into what we discussed, the conversation protocol format, which is this way that we maintain conversation objects.
So this gets inserted there, and then this whole thing ends up being just a token sequence, a one-dimensional token sequence under the hood. So ChatGPT saw this token sequence, and then when we hit go, it basically continues appending tokens into this list. It continues the sequence; it acts like a token autocomplete.
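Here is a tiny sketch of that first step using the tiktoken library. The chat wrapper string is only illustrative; the real conversation protocol uses dedicated special tokens around each turn, and different models use different tokenizers.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's public tokenizer encodings

# Illustrative plain-text stand-in for the conversation format.
conversation = "<|user|>\nWhy is the sky blue?\n<|assistant|>\n"
token_ids = enc.encode(conversation)

print(token_ids)              # the one-dimensional sequence the model actually sees
print(enc.decode(token_ids))  # decoding recovers the original text
# The model then just keeps appending token ids to this list, one at a time,
# like an autocomplete over the token sequence.
```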
So in particular, it gave us this response. So we can basically just put it here, and we see the tokens that it continued. These are the tokens that it continued with roughly. Now the question becomes, okay, why are these the tokens that the model responded with? What are these tokens?
Where are they coming from? What are we talking to? And how do we program the system? And so that's where we shifted gears. And we talked about the under the hood pieces of it. So the first stage of this process, and there are three stages is the pre training stage, which fundamentally has to do with just knowledge acquisition from the internet into the parameters of this neural network.
And so the neural net internalizes a lot of knowledge from the internet. But where the personality really comes in is in the process of supervised fine-tuning here. And what happens here is that basically a company like OpenAI will curate a large dataset of conversations, like say 1 million conversations across very diverse topics.
And these will be conversations between a human and an assistant. And even though there's a lot of synthetic data generation used throughout this entire process, and a lot of LLM help and so on, fundamentally this is a human data curation task with lots of humans involved. And in particular, these humans are data labelers hired by OpenAI, who are given labeling instructions that they learn, and their task is to create ideal assistant responses for any arbitrary prompts.
So they are teaching the neural network, by example, how to respond to prompts. So what is the way to think about what came back here? Like, what is this? Well, I think the right way to think about it is that this is the neural network simulation of a data labeler at OpenAI.
So it's as if I gave this query to a data labeler at OpenAI. And this data labeler first reads all the labeling instructions from OpenAI, and then spends two hours writing up the ideal assistant response to this query and giving it to me. Now, we're not actually doing that, right?
Because we didn't wait two hours. So what we're getting here is a neural network simulation of that process. And we have to keep in mind that these neural networks don't function like human brains do. They are different. What's easy or hard for them is different from what's easy or hard for humans.
And so we really are just getting a simulation. So here I've shown you, this is a token stream, and this is fundamentally the neural network with a bunch of activations and neurons in between. This is a fixed mathematical expression that mixes inputs from tokens with parameters of the model, and they get mixed up and get you the next token in a sequence.
But this is a finite amount of compute that happens for every single token. And so this is some kind of a lossy simulation of a human that is kind of like restricted in this way. And so whatever the humans write, the language model is kind of imitating on this token level with only this specific computation for every single token in a sequence.
We also saw that as a result of this, and the cognitive differences, the models will suffer in a variety of ways, and you have to be very careful with their use. For example, we saw that they will suffer from hallucinations, and we also have the sense of the Swiss cheese model of LLM capabilities, where basically there are holes in the cheese, and sometimes the models will just arbitrarily do something dumb.
So even though they're doing lots of magical stuff, sometimes they just can't. So maybe you're not giving them enough tokens to think, and maybe they're going to just make stuff up because their mental arithmetic breaks. Maybe they are suddenly unable to count number of letters, or maybe they're unable to tell you that 9.11 is smaller than 9.9, and it looks kind of dumb.
And so it's a Swiss cheese capability, and we have to be careful with that. And we saw the reasons for that. But fundamentally, this is how we think of what came back. It's, again, a simulation of this neural network of a human data labeler following the labeling instructions at OpenAI.
So that's what we're getting back. Now, I do think that things change a little bit when you actually go and reach for one of the thinking models, like o3-mini-high. And the reason for that is that GPT-4o basically doesn't do reinforcement learning. It does do RLHF, but I've told you that RLHF is not RL.
There's no time for magic in there. It's just a little bit of a fine-tuning is the way to look at it. But these thinking models, they do use RL. So they go through this third stage of perfecting their thinking process and discovering new thinking strategies and solutions to problem-solving that look a little bit like your internal monologue in your head.
And they practice that on a large collection of practice problems that companies like OpenAI create and curate and then make available to the LLMs. So when I come here and I talk to a thinking model, and I put in this question, what we're seeing here is not anymore just a straightforward simulation of a human data labeler.
Like this is actually kind of new, unique, and interesting. And of course, OpenAI is not showing us the under-the-hood thinking and the chains of thought that are underlying the reasoning here. But we know that such a thing exists, and this is a summary of it. And what we're getting here is actually not just an imitation of a human data labeler.
It's actually something that is kind of new and interesting and exciting in the sense that it is a function of thinking that was emergent in a simulation. It's not just imitating a human data labeler. It comes from this reinforcement learning process. And so here we're, of course, not giving it a chance to shine because this is not a mathematical or reasoning problem.
This is just some kind of a sort of creative writing problem, roughly speaking. And I think it's a question, an open question, as to whether the thinking strategies that are developed inside verifiable domains transfer and are generalizable to other domains that are unverifiable, such as creative writing. The extent to which that transfer happens is unknown in the field, I would say.
So we're not sure if we are able to do RL on everything that is verifiable and see the benefits of that on things that are unverifiable, like this prompt. So that's an open question. The other thing that's interesting is that this reinforcement learning here is still way too new, primordial, and nascent.
So we're just seeing the beginnings of the hints of greatness in the reasoning problems. We're seeing something that is, in principle, capable of something like the equivalent of move 37, but not in the game of Go, but in open domain thinking and problem solving. In principle, this paradigm is capable of doing something really cool, new, and exciting, something even that no human has thought of before.
In principle, these models are capable of analogies no human has had. So I think it's incredibly exciting that these models exist. But again, it's very early, and these are primordial models for now. And they will mostly shine in domains that are verifiable, like math, and code, etc. So very interesting to play with and think about and use.
And then that's roughly it. I would say those are the broad strokes of what's available right now. I will say that overall, it is an extremely exciting time to be in the field. Personally, I use these models all the time daily, tens or hundreds of times because they dramatically accelerate my work.
I think a lot of people see the same thing. I think we're going to see a huge amount of wealth creation as a result of these models. Be aware of some of their shortcomings. Even with RL models, they're going to suffer from some of these. Use it as a tool in a toolbox.
Don't trust it fully, because they will randomly do dumb things. They will randomly hallucinate. They will randomly skip over some mental arithmetic and not get it right. They randomly can't count or something like that. So use them as tools in the toolbox, check their work, and own the product of your work.
But use them for inspiration, for first draft, ask them questions, but always check and verify, and you will be very successful in your work if you do so. So I hope this video was useful and interesting to you. I hope you had fun. And it's already, like, very long, so I apologize for that.
But I hope it was useful. And yeah, I will see you later.