Phi-1: A 'Textbook' Model
00:00:00.000 |
The importance of the new Phi-1 model isn't just that it's small enough to run on a smartphone, 00:00:04.680 |
set to be open-sourced and capable of interview-level Python coding tasks. 00:00:10.520 |
Its significance is also in what the model tells us about the future of language models 00:00:15.060 |
and the timelines of our march to human-level intelligence. 00:00:19.060 |
I spoke in depth with one of the authors of the paper, Ronen Eldan, 00:00:23.140 |
to get you more insights and I'm only going to cover the best bits. 00:00:27.800 |
First thing to notice is how small this model is at 1.3 billion parameters. 00:00:35.140 |
Well, for reference, that's about 1% the size of GPT-3, 00:00:40.160 |
which was behind the original ChatGPT phenomenon. 00:00:45.220 |
It's about a thousand times smaller than the combined parameter count of GPT-4. 00:00:50.640 |
So we're talking a tiny model here that could fit on my Samsung S23. 00:01:00.420 |
And yet it achieves a pass@1 accuracy, that means pass first time, of 50% on HumanEval, a benchmark of Python coding problems. 00:01:06.740 |
Andrej Karpathy of OpenAI and Tesla fame said that we're probably going to see 00:01:11.620 |
a lot more of this creative scaling down work, 00:01:15.100 |
prioritizing data quality and diversity over quantity, 00:01:18.540 |
using synthetic data to create small but highly capable expert models. 00:01:24.340 |
And the author I spoke to actually retweeted that and said, 00:01:27.780 |
for skeptics, the model will be available on Hugging Face soon. 00:01:32.140 |
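(As an aside, once it is up, trying a model like this should only take a few lines. Here's a rough sketch using the Hugging Face transformers library; the repo id "microsoft/phi-1" is my assumption of a placeholder, since no official id existed at the time of recording, and you may need trust_remote_code=True depending on your transformers version.)

```python
# Hedged sketch: load a small code model from Hugging Face and complete a prompt.
# The repo id below is a placeholder assumption; swap in whatever id the authors publish.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1"  # assumed placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```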
Back to the paper, which says everyone knows about scaling laws, 00:01:37.580 |
but following in the footsteps of Eldan and Li in TinyStories, 00:01:41.340 |
which I'll get to in a second, we explore the improvement that can be 00:01:44.340 |
obtained along a different axis, the quality of the data. 00:01:48.060 |
Of course, anyone familiar with my Orca video will know that data quality is 00:01:52.380 |
super important, but let's get to this paper they mentioned. 00:01:55.420 |
And I'm going to give you the 30-second version. 00:01:59.700 |
They created a diverse and synthetic dataset of short stories using GPT-3.5 and GPT-4, 00:02:08.300 |
and trained tiny models on it, around 10 million parameters and smaller actually, which, as they say, are two orders 00:02:12.540 |
of magnitude smaller than GPT-2, which was only 1.5 billion parameters. 00:02:18.020 |
And by curating the synthetic data carefully, look at the difference in results. 00:02:22.860 |
The ending of this story was so much better on the tiny model trained on this 00:02:27.620 |
dataset, especially compared to GPT-2, which is so much bigger. 00:02:38.180 |
Back to the Phi-1 paper: they filtered The Stack and Stack Overflow to only get the most teachable bits of code. 00:02:45.300 |
They then created a synthetic textbook consisting of about one billion tokens 00:02:57.620 |
of GPT-3.5-generated text, plus a synthetic exercises dataset consisting of only 180 million tokens of exercises 00:03:03.020 |
and solutions. Now, of course, other people have used The Stack before. 00:03:06.700 |
But as Ronen says, I do think that from the data we do 00:03:09.860 |
have, we are not even close to extracting everything from it. 00:03:13.300 |
And look at the results of this tiny 1.3 billion parameter model trained in this way. 00:03:19.060 |
There have been only two models that have scored more than 50 percent on HumanEval 00:03:23.340 |
pass@1: that's WizardCoder and, of course, GPT-4. 00:03:27.620 |
Of course, those models are massively bigger and therefore much more expensive to train. 00:03:32.340 |
And actually, I find this chart perhaps the most interesting one of all in the entire paper. 00:03:42.460 |
And remember, the scores are the percentage accuracy on HumanEval. 00:03:49.180 |
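(For anyone wondering exactly what that metric means, here is a rough sketch of the standard unbiased pass@k estimator from the original Codex paper; pass@1 is just the k=1 case. This is a generic illustration, not necessarily the exact evaluation harness the Phi-1 authors used.)

```python
# Hedged sketch of the unbiased pass@k estimator (Chen et al., the Codex paper).
# For each problem, n samples are generated and c of them pass the unit tests;
# pass@k averages 1 - C(n - c, k) / C(n, k) over all problems.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given c of n passed."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Toy example: 3 problems, 10 samples each, with 5, 0 and 10 passing samples.
results = [(10, 5), (10, 0), (10, 10)]
score = float(np.mean([pass_at_k(n, c, k=1) for n, c in results]))
print(f"pass@1 = {score:.1%}")  # 50.0% for this toy example
```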
First, look at the consistent increase from when you just train on the filtered Stack versus on the synthetic code textbook. 00:04:02.180 |
This could be the synthetic data event horizon that Sam Altman talked about. 00:04:06.540 |
And that code textbook was generated using GPT-3.5, not even GPT-4. 00:04:11.820 |
Next, compare the parameter count of the models. 00:04:14.500 |
350 million on the left and in the center and 1.3 billion on the right. 00:04:21.700 |
We knew that increasing the parameters yields better performance. 00:04:25.060 |
But nevertheless, you can see it vividly in action. 00:04:27.620 |
Third, and I think this one is really fascinating. 00:04:30.020 |
Look at the difference between the left and the center charts. 00:04:33.940 |
The only thing that really changed was the number of GPU hours. 00:04:37.380 |
And of course, the number of tokens went from 26 billion to 76 billion. 00:04:42.020 |
But wait, I thought the dataset size was fixed at 7 billion tokens. 00:04:46.180 |
What gives? Well, of course, what's happening is that they're passing over the data multiple times. 00:04:51.220 |
This is called training for more so-called epochs or passes over the data. 00:04:57.620 |
They're the same tokens being trained on more times. 00:05:00.940 |
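(To make that concrete, here's the back-of-the-envelope arithmetic, taking the roughly 7 billion token dataset size mentioned above at face value.)

```python
# Rough arithmetic for the epoch counts implied by the chart,
# assuming a fixed training set of about 7 billion tokens.
dataset_tokens = 7e9

for total_tokens in (26e9, 76e9):
    epochs = total_tokens / dataset_tokens
    print(f"{total_tokens / 1e9:.0f}B tokens seen ≈ {epochs:.1f} passes over the data")

# Prints roughly: 26B ≈ 3.7 passes, 76B ≈ 10.9 passes.
```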
As Ronen said to me, my personal impression is that many people in the community thought 00:05:05.060 |
that we would never want to do more than like one or two epochs because we'll start overfitting. 00:05:10.140 |
And just for 20 seconds, I can't resist bringing in this paper that they referenced in the Textbooks paper. 00:05:15.820 |
It's essentially talking about how you can still scale language models even if you run out of data. 00:05:22.580 |
They say training for up to four epochs or passes is almost as good as new data. 00:05:27.620 |
And it's only when you get to around 40 epochs that repeating is worthless. 00:05:31.980 |
Obviously, we don't know about GPT-4, but GPT-3 seems to be trained on far less than that. 00:05:36.740 |
But there was one final trend from this amazing set of charts that I wanted to point out. 00:05:43.740 |
Look at the huge jump to the dark green bars. 00:05:47.940 |
That's when they train the model on those additional synthetic exercises with solutions. 00:05:52.620 |
The authors note that one can only imagine how frustrating and inefficient it would be 00:05:57.660 |
for a human learner to try to acquire coding skills from such data sets like the unfiltered stack, 00:06:03.620 |
as they would have to deal with a lot of noise, ambiguity and incompleteness in the data. 00:06:07.900 |
We hypothesize that these issues also affect the performance of language models as they 00:06:12.420 |
reduce the quality and quantity of the signal that maps natural language to code. 00:06:17.460 |
Let me quickly give you a bit more detail about how they filtered the stack. 00:06:21.380 |
They got about 100,000 samples of The Stack and Stack Overflow and then prompted GPT-4 00:06:27.620 |
to determine its educational value for a student whose goal is to learn basic coding concepts. 00:06:33.780 |
They then use those annotations to train a random forest classifier that predicts 00:06:38.780 |
the quality of a file using its output embedding, essentially a basic searching 00:06:43.220 |
mechanism to find out which parts of the stack are the most educational. 00:06:47.300 |
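(Here's a minimal sketch of that kind of filtering pipeline, just to make the idea tangible. The embed() helper, the toy labels and the threshold are all hypothetical placeholders of mine, not the authors' code; the paper uses embeddings from a pretrained code model, where I substitute a trivial stand-in so the sketch runs.)

```python
# Hedged sketch: train a classifier to predict GPT-4's "educational value" labels
# from code embeddings, then use it to cheaply filter a much larger corpus.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def embed(code: str) -> np.ndarray:
    """Stand-in embedding (character trigram hashes) so the sketch runs;
    the paper instead uses output embeddings from a pretrained code model."""
    vec = np.zeros(256)
    for i in range(len(code) - 2):
        vec[hash(code[i:i + 3]) % 256] += 1.0
    return vec

# 1. A small annotated set: (snippet, 1 if GPT-4 judged it educational, else 0).
annotated = [
    ('def add(a, b):\n    """Add two numbers."""\n    return a + b', 1),
    ("x=1;y=2;z=x+y;print(z)#tmp", 0),
]  # toy examples only
X = np.stack([embed(code) for code, _ in annotated])
y = np.array([label for _, label in annotated])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# 2. Score every file in the big corpus and keep the ones predicted educational.
def keep(code: str, threshold: float = 0.5) -> bool:
    prob_educational = clf.predict_proba(embed(code).reshape(1, -1))[0, 1]
    return prob_educational >= threshold

print(keep("def square(n):\n    return n * n"))
```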
But at this point, I want to pause and imagine if they'd used a different prompt. 00:06:51.500 |
Imagine a future paper looking across a different dataset, 00:06:57.660 |
prompting for the educational value for a student whose goal is to learn French. 00:07:01.340 |
Then you could have an amazing French speaking model. 00:07:03.980 |
Or maybe they could get it to annotate which examples would be most educational 00:07:07.980 |
for learning to predict the stock market and then maybe train it on a small synthetic 00:07:12.700 |
textbook of successful previous examples of predicting the stock market. 00:07:16.580 |
I'm just saying this seems to be a model that could be applied elsewhere. 00:07:20.060 |
And these annotations here were the only times they used GPT-4. 00:07:27.620 |
GPT-4 is not only great as something we can use directly for better productivity, 00:07:32.900 |
but it's also a way to get much better other models. 00:07:38.380 |
It will be interesting to see whether labs like Anthropic and Google start to assess the capability of their models to train smaller models. 00:07:44.060 |
Here, by the way, is an example of the kind of exercises 00:07:47.340 |
and solutions that the model was then fine tuned on, created, of course, by GPT-3.5. 00:07:53.140 |
And the authors note that quite remarkably, the model after fine-tuning on those fewer 00:07:57.820 |
than 200 million tokens of exercises and solutions also exhibits a substantial 00:08:03.660 |
improvement in executing tasks that are not featured in the fine tuning dataset. 00:08:08.820 |
For example, fine tuning on code exercises unexpectedly improves the model's ability 00:08:13.740 |
to use external libraries such as Pygame, even though our exercises do not contain 00:08:18.780 |
these libraries. This suggests that fine tuning not only improves the tasks we 00:08:23.300 |
targeted, but also makes unrelated tasks easier to distill. 00:08:27.820 |
It's this unexpectedness that I find really interesting. 00:08:33.060 |
For example, did they expect the emergent ability to do self-repair or reflection? 00:08:37.820 |
According to this new paper, that ability is not found in GPT-3.5. 00:08:42.420 |
Going back to the Phi-1 paper, the authors admit that there remain a number 00:08:46.940 |
of limitations of our model compared to larger models for code. 00:08:50.340 |
Firstly, Phi-1 is specialized in Python coding, 00:08:53.660 |
which restricts its versatility compared to multi-language models. 00:08:57.620 |
Secondly, Phi-1 lacks the domain-specific knowledge of larger models, 00:09:01.300 |
such as programming with specific APIs or using less common packages. 00:09:05.580 |
It's a bit like the more classical narrow AI, good at only a few things. 00:09:11.900 |
Thirdly, due to the structured nature of the datasets and the lack of diversity in terms of language and style, 00:09:15.580 |
it's less robust to stylistic variations or errors in the prompt. 00:09:24.700 |
But what about this? We also believe that significant 00:09:27.660 |
gains could be achieved by using GPT-4 to generate the synthetic data instead 00:09:32.700 |
of GPT-3.5, as we notice that GPT-3.5 data has a high error rate. 00:09:37.500 |
I asked Ronen about that, speculating that it's because GPT-4 costs more. 00:09:44.860 |
But another reason is we wanted to demonstrate something here, 00:09:48.340 |
that you don't even need a smart model like GPT-4. 00:09:51.660 |
Even GPT-3.5, which isn't that great at coding, is enough. 00:09:57.660 |
We would probably get better results on this using GPT-4, but at the moment, GPT-4 is a bit too slow. 00:10:01.940 |
Before I get to timelines, some of you might have noticed the WizardCoder 00:10:05.340 |
results and wondered how that model did so well, despite only being 16 billion 00:10:10.020 |
parameters, which of course is more than 10 times bigger than Phi-1. 00:10:14.740 |
The short answer, for WizardCoder as for almost every paper referenced in the Textbooks paper, 00:10:20.780 |
is that they have been increasing the difficulty of the training data. 00:10:24.380 |
Fine tune the model with more difficult examples, e.g. 00:10:27.660 |
if the original problem can be solved with only a few logical steps, 00:10:30.740 |
please add more reasoning steps, maybe complicate the input or deepen 00:10:35.020 |
the question or increase the reasoning involved. 00:10:37.740 |
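(As a rough sketch of what that kind of difficulty-escalating data generation can look like, here's a toy loop around a stand-in complete() function; the prompt wording below paraphrases the ideas just quoted and is not WizardCoder's actual Evol-Instruct template.)

```python
# Hedged sketch of Evol-Instruct-style data generation: repeatedly ask a strong
# model to rewrite a coding problem so it needs more reasoning steps.

EVOLVE_PROMPT = (
    "Rewrite the following programming problem to make it more difficult. "
    "If it can be solved in only a few logical steps, add more reasoning steps, "
    "complicate the input, or deepen the question.\n\nProblem:\n{problem}"
)

def complete(prompt: str) -> str:
    """Stand-in for a call to a language model; a real pipeline would send the
    prompt to a chat-completions API and return the model's reply."""
    # Toy behaviour so the sketch runs end to end.
    return prompt.split("Problem:\n", 1)[1] + "  # (a harder variant would go here)"

def evolve(problem: str, rounds: int = 3) -> list:
    """Return successively harder versions of a seed problem."""
    versions = [problem]
    for _ in range(rounds):
        versions.append(complete(EVOLVE_PROMPT.format(problem=versions[-1])))
    return versions

for version in evolve("Write a function that reverses a string."):
    print(version)
```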
You can start to see the shared themes of Orca, WizardCoder and Phi-1. 00:10:44.580 |
And that is a trend an article was pointing to in the Asterisk magazine that I read yesterday. 00:10:48.340 |
I'm not sponsored by them, but it was a great issue. 00:10:53.500 |
Rather than a continuation of scaling laws or an acceleration of their slope, I think this is more like a move 00:10:57.660 |
in a different direction altogether towards a Cambrian explosion of little AIs 00:11:02.260 |
used for different purposes where getting good performance on a task depends 00:11:06.260 |
on the quality of your task-specific dataset, like Phi-1 for Python. 00:11:12.380 |
That said, the article still expects the state of the art to continue progressing steadily along scaling law lines for quite some time. 00:11:19.180 |
But if this plays out, the incentive towards ever bigger models would diminish and we would enter an entirely new 00:11:23.340 |
era where AI progress would not be driven primarily by semiconductor scaling. 00:11:29.180 |
This relates directly to a tweet from the co-founder of Anthropic, Jack Clark. 00:11:33.780 |
He said a world where we can push a button and stop larger compute things 00:11:38.580 |
being built and all focus on safety for a while is good. 00:11:41.900 |
That is really interesting to hear from someone at the top of an AGI lab. 00:11:46.220 |
But I do have some questions about this policy. 00:11:48.420 |
If we freeze compute, wouldn't that incentivize every company just 00:11:52.100 |
to use algorithmic progress to get more out of the compute we do have? 00:11:57.660 |
I think it's far more effective public messaging to focus on concrete things. 00:12:03.780 |
For example, in this paper from Oxford this week, 00:12:06.260 |
it's argued that LLMs will in particular lower barriers to biological misuse. 00:12:10.180 |
Biological design tools will expand the capabilities of sophisticated actors, 00:12:17.980 |
potentially enabling the creation of pandemic pathogens substantially worse than anything seen to date, and could 00:12:22.740 |
enable forms of more predictable and targeted biological weapons. 00:12:29.500 |
And as the paper says, it's been hypothesized that for evolutionary 00:12:33.060 |
reasons, naturally emerging pathogens feature a tradeoff between transmissibility and virulence, 00:12:43.500 |
but that these tools might create design capabilities that are able to overcome this tradeoff. 00:12:46.460 |
Thus, for the first time, humanity might face a security threat 00:12:49.940 |
from pathogens substantially worse than anything nature might create, 00:12:53.500 |
including pathogens capable of posing an existential threat. 00:12:57.620 |
To be honest, this is my main safety concern. 00:13:01.780 |
Here is another snippet of my conversation with Ronen. 00:13:06.860 |
I put it to him that we seem closer to something really transformative than the public has quite realized, 00:13:13.460 |
and asked whether within a few years we will have something as powerful as a corporation. 00:13:18.140 |
Ronen replied: that depends on how much resources are actually spent on it. 00:13:24.020 |
I have no idea what OpenAI and Google are doing right now. 00:13:27.300 |
But definitely if this is our main goal, I think it can easily be five years. 00:13:34.820 |
I feel like the bottleneck is maybe the production of GPUs. 00:13:38.180 |
And I mean, it's not just to produce the GPUs. 00:13:40.820 |
You also have to build the data centers and connect them to electricity, etc, etc. 00:13:45.180 |
I think if you have all that, then, yeah, I don't see the barrier. 00:13:50.940 |
More and better synthetic data, better and better algorithms, and more and better GPUs and TPUs. 00:13:56.900 |
That's what we mean when we say we don't see a barrier. 00:13:59.260 |
Of course, everyone has slightly different definitions of AGI, but almost everyone 00:14:03.660 |
agrees that the next five to 10 years are going to be the most critical in seeing 00:14:08.620 |
whether more data, better data, better algorithms, or just more and more compute will get us there. 00:14:16.660 |
I loved how Carl Shulman put it on the Dwarkesh Patel podcast. 00:14:20.700 |
If you generate like close to $10 million a year out of the future version 00:14:26.500 |
of H100, that costs tens of thousands of dollars with a huge profit margin now. 00:14:31.860 |
And the profit margin could be reduced with like large production. 00:14:45.100 |
At that point, you could support paying many times as much to have these fabs constructed more rapidly. 00:14:49.180 |
You could have, if AI is starting to be able to contribute, AI contributing more of the 00:14:56.100 |
real technical work; it's hard for, say, NVIDIA to suddenly find thousands 00:15:01.140 |
upon thousands of top-quality engineering hires, but AI could provide that. 00:15:05.700 |
Now, if AI hasn't reached that level of performance, then this is how you can have 00:15:10.740 |
things stall out. A world where AI progress stalls out is one where you go 00:15:15.180 |
to the $100 billion and then, over succeeding years, trillion dollar things, and they just don't deliver. 00:15:26.700 |
You lose the gains that you are getting from moving researchers from other fields. 00:15:31.100 |
Lots of physicists and people from other areas of computer science have been going 00:15:34.620 |
to AI, but you sort of tap out those resources as AI becomes a larger proportion 00:15:40.940 |
of the research field. And like, okay, you've put in all of these inputs, 00:15:49.900 |
and they fail to yield the kind of AI capabilities needed for an intelligence explosion, 00:15:55.300 |
then you're going to have to wait for the slow grind of things like general economic 00:15:59.620 |
growth, population growth and such, and so things slow. 00:16:02.900 |
And that results in my credences in this kind of advanced AI happening being 00:16:07.380 |
relatively concentrated, like over the next ten years compared to the rest of the 00:16:12.300 |
century, because we just can't keep going with this rapid growth of inputs for much longer than that. 00:16:19.100 |
And so I think that's a really important thing to think about. 00:16:22.700 |
I hope you've enjoyed exploring Phi-1 with me, and as always, thank you so much for staying all the way to the end.