Phi-1: A 'Textbook' Model
00:00:00.000 |
The importance of the new Phi-1 model isn't just that it's small enough to run on a smartphone, 00:00:04.680 |
set to be open-sourced and capable of interview-level Python coding tasks. 00:00:10.520 |
Its significance is also in what the model tells us about the future of language models 00:00:15.060 |
and the timelines of our march to human-level intelligence. 00:00:19.060 |
I spoke in depth with one of the authors of the paper, Ronen Eldan, 00:00:23.140 |
to get you more insights and I'm only going to cover the best bits. 00:00:27.800 |
First thing to notice is how small this model is at 1.3 billion parameters. 00:00:35.140 |
Well, for reference, that's about 1% the size of GPT-3, 00:00:40.160 |
which was behind the original ChatGPT phenomenon. 00:00:45.220 |
It's about a thousand times smaller than the combined parameter count of GPT-4. 00:00:50.640 |
So we're talking a tiny model here that could fit on my Samsung S23. 00:01:00.420 |
And yet it achieves a pass@1 accuracy, that means pass first time, of 50% on HumanEval, a benchmark of Python coding problems. 00:01:06.740 |
Andrej Karpathy of OpenAI and Tesla fame said that we're probably going to see 00:01:11.620 |
a lot more of this creative scaling down work, 00:01:15.100 |
prioritizing data quality and diversity over quantity, 00:01:18.540 |
using synthetic data to create small but highly capable expert models. 00:01:24.340 |
And the author I spoke to actually retweeted that and said, 00:01:27.780 |
for skeptics, the model will be available on Hugging Face soon. 00:01:32.140 |
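(As an aside, once it is up, trying a model like this should only take a few lines. Here's a rough sketch using the Hugging Face transformers library; the repo id "microsoft/phi-1" is my assumption of a placeholder, since no official id existed at the time of recording, and you may need trust_remote_code=True depending on your transformers version.)

```python
# Hedged sketch: load a small code model from Hugging Face and complete a prompt.
# The repo id below is a placeholder assumption; swap in whatever id the authors publish.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1"  # assumed placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```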
Back to the paper, which says everyone knows about scaling laws, 00:01:37.580 |
but following in the footsteps of Eldan and Li in TinyStories, 00:01:41.340 |
which I'll get to in a second, we explore the improvement that can be 00:01:44.340 |
obtained along a different axis, the quality of the data. 00:01:48.060 |
Of course, anyone familiar with my Orca video will know that data quality is 00:01:52.380 |
super important, but let's get to this paper they mentioned. 00:01:55.420 |
And I'm going to give you the 30-second version. 00:01:59.700 |
They created a diverse and synthetic dataset of short stories using GPT-3.5 and GPT-4, 00:02:08.300 |
and trained tiny models on it, around 10 million parameters and smaller actually, which, as they say, are two orders 00:02:12.540 |
of magnitude smaller than GPT-2, which was only 1.5 billion parameters. 00:02:18.020 |
And by curating the synthetic data carefully, look at the difference in results. 00:02:22.860 |
The ending of this story was so much better on the tiny model trained on this 00:02:27.620 |
dataset, especially compared to GPT-2, which is so much bigger. 00:02:38.180 |
Back to the Phi-1 paper: they filtered The Stack and Stack Overflow to only get the most teachable bits of code. 00:02:45.300 |
They then created a synthetic textbook consisting of about one billion tokens 00:02:57.620 |
of GPT-3.5-generated text, plus a synthetic exercises dataset consisting of only 180 million tokens of exercises 00:03:03.020 |
and solutions. Now, of course, other people have used The Stack before. 00:03:06.700 |
But as Ronen says, I do think that from the data we do 00:03:09.860 |
have, we are not even close to extracting everything from it. 00:03:13.300 |
And look at the results of this tiny 1.3 billion parameter model trained in this way. 00:03:19.060 |
There have been only two models that have scored more than 50 percent on HumanEval 00:03:23.340 |
pass@1: that's WizardCoder and, of course, GPT-4. 00:03:27.620 |
Of course, those models are massively bigger and therefore much more expensive to train. 00:03:32.340 |
And actually, I find this chart perhaps the most interesting one of all in the entire paper. 00:03:42.460 |
And remember, the scores are the percentage accuracy on HumanEval. 00:03:49.180 |
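(For anyone wondering exactly what that metric means, here is a rough sketch of the standard unbiased pass@k estimator from the original Codex paper; pass@1 is just the k=1 case. This is a generic illustration, not necessarily the exact evaluation harness the Phi-1 authors used.)

```python
# Hedged sketch of the unbiased pass@k estimator (Chen et al., the Codex paper).
# For each problem, n samples are generated and c of them pass the unit tests;
# pass@k averages 1 - C(n - c, k) / C(n, k) over all problems.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n passed."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy example: 3 problems, 10 samples each, with 5, 0 and 10 passing samples.
results = [(10, 5), (10, 0), (10, 10)]
score = float(np.mean([pass_at_k(n, c, k=1) for n, c in results]))
print(f"pass@1 = {score:.1%}")  # 50.0% for this toy example
```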
First, look at the consistent increase from when you just train on the filtered Stack versus on the synthetic code textbook. 00:04:02.180 |
This could be the synthetic data event horizon that Sam Altman talked about. 00:04:06.540 |
And that code textbook was generated using GPT-3.5, not even GPT-4. 00:04:11.820 |
Next, compare the parameter count of the models. 00:04:14.500 |
350 million on the left and in the center and 1.3 billion on the right. 00:04:21.700 |
We knew that increasing the parameters yields better performance. 00:04:25.060 |
But nevertheless, you can see it vividly in action. 00:04:27.620 |
Third, and I think this one is really fascinating. 00:04:30.020 |
Look at the difference between the left and the center charts. 00:04:33.940 |
The only thing that really changed was the number of GPU hours. 00:04:37.380 |
And of course, the number of tokens went from 26 billion to 76 billion. 00:04:42.020 |
But wait, I thought the dataset size was fixed at 7 billion tokens. 00:04:46.180 |
What gives? Well, of course, what's happening is that they're passing over the data multiple times. 00:04:51.220 |
This is called training for more so-called epochs or passes over the data. 00:04:57.620 |
They're the same tokens being trained on more times. 00:05:00.940 |
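(To make that concrete, here's the back-of-the-envelope arithmetic, taking the roughly 7 billion token dataset size mentioned above at face value.)

```python
# Rough arithmetic for the epoch counts implied by the chart,
# assuming a fixed training set of about 7 billion tokens.
dataset_tokens = 7e9

for total_tokens in (26e9, 76e9):
    epochs = total_tokens / dataset_tokens
    print(f"{total_tokens / 1e9:.0f}B tokens seen ≈ {epochs:.1f} passes over the data")

# Prints roughly: 26B ≈ 3.7 passes, 76B ≈ 10.9 passes.
```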
As Ronen said to me, my personal impression is that many people in the community thought 00:05:05.060 |
that we would never want to do more than like one or two epochs because we'll start overfitting. 00:05:10.140 |
And just for 20 seconds, I can't resist bringing in this paper that they referenced in the Textbooks paper. 00:05:15.820 |
It's essentially talking about how you can still scale language models even if you run out of data. 00:05:22.580 |
They say training for up to four epochs or passes is almost as good as new data. 00:05:27.620 |
And it's only when you get to around 40 epochs that repeating is worthless. 00:05:31.980 |
Obviously, we don't know about GPT-4, but GPT-3 seems to be trained on far less than that. 00:05:36.740 |
But there was one final trend from this amazing set of charts that I wanted to point out. 00:05:43.740 |
Look at the huge jump to the dark green bars. 00:05:47.940 |
That's when they train the model on those additional synthetic exercises with solutions. 00:05:52.620 |
The authors note that one can only imagine how frustrating and inefficient it would be 00:05:57.660 |
for a human learner to try to acquire coding skills from such data sets like the unfiltered stack, 00:06:03.620 |
as they would have to deal with a lot of noise, ambiguity and incompleteness in the data. 00:06:07.900 |
We hypothesize that these issues also affect the performance of language models as they 00:06:12.420 |
reduce the quality and quantity of the signal that maps natural language to code. 00:06:17.460 |
Let me quickly give you a bit more detail about how they filtered the stack. 00:06:21.380 |
They got about 100,000 samples of The Stack and Stack Overflow and then prompted GPT-4 00:06:27.620 |
to determine its educational value for a student whose goal is to learn basic coding concepts. 00:06:33.780 |
They then use those annotations to train a random forest classifier that predicts 00:06:38.780 |
the quality of a file using its output embedding, essentially a basic searching 00:06:43.220 |
mechanism to find out which parts of the stack are the most educational. 00:06:47.300 |
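(Here's a minimal sketch of that kind of filtering pipeline, just to make the idea tangible. The embed() helper, the toy labels and the threshold are all hypothetical placeholders of mine, not the authors' code; the paper uses embeddings from a pretrained code model, where I substitute a trivial stand-in so the sketch runs.)

```python
# Hedged sketch: train a classifier to predict GPT-4's "educational value" labels
# from code embeddings, then use it to cheaply filter a much larger corpus.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def embed(code: str) -> np.ndarray:
    """Stand-in embedding (character trigram hashes) so the sketch runs;
    the paper instead uses output embeddings from a pretrained code model."""
    vec = np.zeros(256)
    for i in range(len(code) - 2):
        vec[hash(code[i:i + 3]) % 256] += 1.0
    return vec

# 1. A small annotated set: (snippet, 1 if GPT-4 judged it educational, else 0).
annotated = [
    ('def add(a, b):\n    """Add two numbers."""\n    return a + b', 1),
    ("x=1;y=2;z=x+y;print(z)#tmp", 0),
]  # toy examples only
X = np.stack([embed(code) for code, _ in annotated])
y = np.array([label for _, label in annotated])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# 2. Score every file in the big corpus and keep the ones predicted educational.
def keep(code: str, threshold: float = 0.5) -> bool:
    prob_educational = clf.predict_proba(embed(code).reshape(1, -1))[0, 1]
    return prob_educational >= threshold

print(keep("def square(n):\n    return n * n"))
```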
But at this point, I want to pause and imagine if they'd used a different prompt. 00:06:51.500 |
Imagine a future paper looking across a different dataset, 00:06:57.660 |
prompting for the educational value for a student whose goal is to learn French. 00:07:01.340 |
Then you could have an amazing French speaking model. 00:07:03.980 |
Or maybe they could get it to annotate which examples would be most educational 00:07:07.980 |
for learning to predict the stock market and then maybe train it on a small synthetic 00:07:12.700 |
textbook of successful previous examples of predicting the stock market. 00:07:16.580 |
I'm just saying this seems to be a model that could be applied elsewhere. 00:07:20.060 |
And these annotations here were the only times they used GPT-4. 00:07:27.620 |
GPT-4 is not only great as something we can use directly for better productivity, 00:07:32.900 |
but it's also a way to get much better other models. 00:07:38.380 |
It will be interesting to see whether labs like Anthropic and Google start to assess the capability of their models to train smaller models. 00:07:44.060 |
Here, by the way, is an example of the kind of exercises 00:07:47.340 |
and solutions that the model was then fine tuned on, created, of course, by GPT-3.5. 00:07:53.140 |
And the authors note that quite remarkably, the model after fine-tuning on those fewer 00:07:57.820 |
than 200 million tokens of exercises and solutions also exhibits a substantial 00:08:03.660 |
improvement in executing tasks that are not featured in the fine tuning dataset. 00:08:08.820 |
For example, fine tuning on code exercises unexpectedly improves the model's ability 00:08:13.740 |
to use external libraries such as Pygame, even though our exercises do not contain 00:08:18.780 |
these libraries. This suggests that fine tuning not only improves the tasks we 00:08:23.300 |
targeted, but also makes unrelated tasks easier to distill. 00:08:27.820 |
It's this unexpectedness that I find really interesting. 00:08:33.060 |
For example, did they expect the emergent ability to do self-repair or reflection? 00:08:37.820 |
According to this new paper, that ability is not found in GPT-3.5. 00:08:42.420 |
Going back to the Phi-1 paper, the authors admit that there remain a number 00:08:46.940 |
of limitations of our model compared to larger models for code. 00:08:50.340 |
Firstly, Phi-1 is specialized in Python coding, 00:08:53.660 |
which restricts its versatility compared to multi-language models. 00:08:57.620 |
Secondly, Phi-1 lacks the domain-specific knowledge of larger models, 00:09:01.300 |
such as programming with specific APIs or using less common packages. 00:09:05.580 |
It's a bit like the more classical narrow AI, good at only a few things. 00:09:11.900 |
Thirdly, due to the structured nature of the datasets and the lack of diversity in terms of language and style, 00:09:15.580 |
it's less robust to stylistic variations or errors in the prompt. 00:09:24.700 |
But what about this? We also believe that significant 00:09:27.660 |
gains could be achieved by using GPT-4 to generate the synthetic data instead 00:09:32.700 |
of GPT-3.5, as we notice that GPT-3.5 data has a high error rate. 00:09:37.500 |
I asked Ronen about that, speculating that it's because GPT-4 costs more. 00:09:44.860 |
But another reason is we wanted to demonstrate something here, 00:09:48.340 |
that you don't even need a smart model like GPT-4. 00:09:51.660 |
Even GPT-3.5, which isn't that great at coding, is enough. 00:09:57.660 |
We would probably get better results on this using GPT-4, but at the moment, GPT-4 is a bit too slow. 00:10:01.940 |
Before I get to timelines, some of you might have noticed the WizardCoder 00:10:05.340 |
results and wondered how that model did so well, despite only being 16 billion 00:10:10.020 |
parameters, which of course is more than 10 times bigger than Phi-1. 00:10:14.740 |
The short answer, for WizardCoder as for almost every paper referenced in the Textbooks paper, 00:10:20.780 |
is that they have been increasing the difficulty of the training data. 00:10:24.380 |
Fine tune the model with more difficult examples, e.g. 00:10:27.660 |
if the original problem can be solved with only a few logical steps, 00:10:30.740 |
please add more reasoning steps, maybe complicate the input or deepen 00:10:35.020 |
the question or increase the reasoning involved. 00:10:37.740 |
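(As a rough sketch of what that kind of difficulty-escalating data generation can look like, here's a toy loop around a stand-in complete() function; the prompt wording below paraphrases the ideas just quoted and is not WizardCoder's actual Evol-Instruct template.)

```python
# Hedged sketch of Evol-Instruct-style data generation: repeatedly ask a strong
# model to rewrite a coding problem so it needs more reasoning steps.

EVOLVE_PROMPT = (
    "Rewrite the following programming problem to make it more difficult. "
    "If it can be solved in only a few logical steps, add more reasoning steps, "
    "complicate the input, or deepen the question.\n\nProblem:\n{problem}"
)

def complete(prompt: str) -> str:
    """Stand-in for a call to a language model; a real pipeline would send the
    prompt to a chat-completions API and return the model's reply."""
    # Toy behaviour so the sketch runs end to end.
    return prompt.split("Problem:\n", 1)[1] + "  # (a harder variant would go here)"

def evolve(problem: str, rounds: int = 3) -> list:
    """Return successively harder versions of a seed problem."""
    versions = [problem]
    for _ in range(rounds):
        versions.append(complete(EVOLVE_PROMPT.format(problem=versions[-1])))
    return versions

for version in evolve("Write a function that reverses a string."):
    print(version)
```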
You can start to see the shared themes of Orca, WizardCoder and Phi-1. 00:10:44.580 |
And that is a trend an article was pointing to in the Asterisk magazine that I read yesterday. 00:10:48.340 |
I'm not sponsored by them, but it was a great issue. 00:10:53.500 |
Rather than a continuation of scaling laws or an acceleration of their slope, I think this is more like a move 00:10:57.660 |
in a different direction altogether towards a Cambrian explosion of little AIs 00:11:02.260 |
used for different purposes where getting good performance on a task depends 00:11:06.260 |
on the quality of your task-specific dataset, like Phi-1 for Python. 00:11:12.380 |
That said, the article still expects the state of the art to continue progressing steadily along scaling law lines for quite some time. 00:11:19.180 |
But if this plays out, the incentive towards ever bigger models would diminish and we would enter an entirely new 00:11:23.340 |
era where AI progress would not be driven primarily by semiconductor scaling. 00:11:29.180 |
This relates directly to a tweet from the co-founder of Anthropic, Jack Clark. 00:11:33.780 |
He said a world where we can push a button and stop larger compute things 00:11:38.580 |
being built and all focus on safety for a while is good. 00:11:41.900 |
That is really interesting to hear from someone at the top of an AGI lab. 00:11:46.220 |
But I do have some questions about this policy. 00:11:48.420 |
If we freeze compute, wouldn't that incentivize every company just 00:11:52.100 |
to use algorithmic progress to get more out of the compute we do have? 00:11:57.660 |
I think it's far more effective public messaging to focus on concrete things. 00:12:03.780 |
For example, in this paper from Oxford this week, 00:12:06.260 |
it's argued that LLMs will in particular lower barriers to biological misuse. 00:12:10.180 |
Biological design tools will expand the capabilities of sophisticated actors, 00:12:17.980 |
potentially enabling the creation of pandemic pathogens substantially worse than anything seen to date, and could 00:12:22.740 |
enable forms of more predictable and targeted biological weapons. 00:12:29.500 |
And as the paper says, it's been hypothesized that for evolutionary 00:12:33.060 |
reasons, naturally emerging pathogens feature a tradeoff between transmissibility and virulence, 00:12:43.500 |
but that these tools might create design capabilities that are able to overcome this tradeoff. 00:12:46.460 |
Thus, for the first time, humanity might face a security threat 00:12:49.940 |
from pathogens substantially worse than anything nature might create, 00:12:53.500 |
including pathogens capable of posing an existential threat. 00:12:57.620 |
To be honest, this is my main safety concern. 00:13:01.780 |
Here is another snippet of my conversation with Ronen. 00:13:06.860 |
I put it to him that we seem closer to something really transformative than the public has quite realized, 00:13:13.460 |
and asked whether within a few years we will have something as powerful as a corporation. 00:13:18.140 |
Ronen replied: that depends on how much resources are actually spent on it. 00:13:24.020 |
I have no idea what OpenAI and Google are doing right now. 00:13:27.300 |
But definitely if this is our main goal, I think it can easily be five years. 00:13:34.820 |
I feel like the bottleneck is maybe the production of GPUs. 00:13:38.180 |
And I mean, it's not just to produce the GPUs. 00:13:40.820 |
You also have to build the data centers and connect them to electricity, etc, etc. 00:13:45.180 |
I think if you have all that, then, yeah, I don't see the barrier. 00:13:50.940 |
More and better synthetic data, better and better algorithms, and more and better GPUs and TPUs. 00:13:56.900 |
That's what we mean when we say we don't see a barrier. 00:13:59.260 |
Of course, everyone has slightly different definitions of AGI, but almost everyone 00:14:03.660 |
agrees that the next five to 10 years are going to be the most critical in seeing 00:14:08.620 |
whether more data, better data, better algorithms, or just more and more compute will get us there. 00:14:16.660 |
I loved how Carl Shulman put it on the Dwarkesh Patel podcast. 00:14:20.700 |
If you generate like close to $10 million a year out of the future version 00:14:26.500 |
of H100, that costs tens of thousands of dollars with a huge profit margin now. 00:14:31.860 |
And the profit margin could be reduced with like large production. 00:14:45.100 |
At that point, you could support paying many times as much to have these fabs constructed more rapidly. 00:14:49.180 |
You could have, if AI is starting to be able to contribute, AI contributing more of the 00:14:56.100 |
real technical work; it's hard for, say, NVIDIA to suddenly find thousands 00:15:01.140 |
upon thousands of top-quality engineering hires, but AI could provide that. 00:15:05.700 |
Now, if AI hasn't reached that level of performance, then this is how you can have 00:15:10.740 |
things stall out. A world where AI progress stalls out is one where you go 00:15:15.180 |
to the $100 billion and then, over succeeding years, trillion dollar things, and they just don't deliver. 00:15:26.700 |
You lose the gains that you are getting from moving researchers from other fields. 00:15:31.100 |
Lots of physicists and people from other areas of computer science have been going 00:15:34.620 |
to AI, but you sort of tap out those resources as AI becomes a larger proportion 00:15:40.940 |
of the research field. And like, okay, you've put in all of these inputs, 00:15:49.900 |
and they fail to yield the kind of AI capabilities needed for an intelligence explosion, 00:15:55.300 |
then you're going to have to wait for the slow grind of things like general economic 00:15:59.620 |
growth, population growth and such, and so things slow. 00:16:02.900 |
And that results in my credences in this kind of advanced AI happening being 00:16:07.380 |
relatively concentrated, like over the next ten years compared to the rest of the 00:16:12.300 |
century, because we just can't keep going with this rapid growth of inputs for much longer than that. 00:16:19.100 |
And so I think that's a really important thing to think about. 00:16:22.700 |
I hope you've enjoyed exploring Phi-1 with me, and as always, thank you so much for staying all the way to the end.