
Phi-1: A 'Textbook' Model


Transcript

The importance of the new Phi-1 model isn't just that it's small enough to run on a smartphone, set to be open-sourced, and capable of interview-level Python coding tasks. Its significance is also in what the model tells us about the future of language models and the timelines of our march to human-level intelligence.

I spoke in depth with one of the authors of the paper, Ronen Eldan, to get you more insights, and I'm only going to cover the best bits. So let's start. The first thing to notice is how small this model is at 1.3 billion parameters. But what does that number mean?

Well, for reference, that's about 1% of the size of GPT-3, which was behind the original ChatGPT phenomenon. And if recent rumors are to be believed, it's about a thousand times smaller than the combined parameter count of GPT-4. So we're talking a tiny model here that could fit on my Samsung S23.

We read that despite this small scale, Phi-1 attains a pass@1 accuracy (that means passing first time) of 50% on HumanEval, which tests Python coding challenges. Andrej Karpathy, of OpenAI and Tesla fame, said that we're probably going to see a lot more of this creative scaling-down work: prioritizing data quality and diversity over quantity, using synthetic data to create small but highly capable expert models.
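As a quick aside on the metric: pass@1 just means one generated completion per problem is run against that problem's unit tests, and you count the fraction of problems where that single attempt passes. Here's a minimal sketch of the idea; the `generate` function and the `problems` format are my own placeholders, not the actual HumanEval harness.

```python
# Rough sketch of pass@1 scoring: one completion per problem,
# counted as correct only if it passes all of that problem's tests.
# `generate` and `problems` are placeholders, not the real HumanEval harness.

def pass_at_1(problems, generate):
    passed = 0
    for problem in problems:
        completion = generate(problem["prompt"])           # one sample per problem
        program = problem["prompt"] + completion + "\n" + problem["tests"]
        try:
            exec(program, {})                              # run completion plus unit tests
            passed += 1                                    # no assertion failed
        except Exception:
            pass                                           # any error counts as a fail
    return passed / len(problems)                          # e.g. roughly 0.50 for Phi-1
```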

And the author I spoke to actually retweeted that and said: for skeptics, the model will be available on Hugging Face soon, give it a try. Back to the paper, which says everyone knows about scaling laws: adding more compute, adding more data. But following in the footsteps of Eldan and Li in TinyStories, which I'll get to in a second, we explore the improvement that can be obtained along a different axis: the quality of the data.

Of course, anyone familiar with my Orca video will know that data quality is super important, but let's get to this paper they mentioned. I'm going to give you the 30-second version of the paper, co-authored by Ronen. They created a diverse, synthetic dataset of short stories using GPT-3.5 and GPT-4.

And then they trained tiny models of 28 million parameters and smaller, which, as they say, are two orders of magnitude smaller than GPT-2, which was only 1.5 billion parameters. And by curating the synthetic data carefully, look at the difference in results. The ending of this story was so much better from the tiny model trained on this dataset, especially compared to GPT-2, which is so much bigger.

But it says the soup is too old, which is a terrible ending to the story. So what did they do for Phi-1? Well, here is the short version. They filtered The Stack and Stack Overflow to get only the most teachable bits of code, amounting to about six billion tokens. They then created a synthetic textbook consisting of about one billion tokens of GPT-3.5-generated Python textbook content.

That's not even GPT-4. Then, quite crucially, they created a small synthetic exercises dataset consisting of only 180 million tokens of exercises and solutions. Now, of course, other people have used The Stack before. But as Ronen says: I do think that from the data we do have, we are not even close to extracting everything from it.
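To make that middle step a bit more concrete, here's a hedged sketch of what generating a slice of that synthetic textbook with GPT-3.5 might look like. The prompt wording and the topic list are my own guesses, not the paper's actual prompts, and the OPENAI_API_KEY environment variable is assumed to be set.

```python
# Illustrative sketch of the "synthetic textbook" step: ask GPT-3.5 to write
# short, self-contained textbook sections with worked Python examples.
# The prompt text and topics are hypothetical, not the paper's actual prompts.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOPICS = ["list comprehensions", "recursion", "dictionaries", "error handling"]

def generate_textbook_section(topic: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                f"Write a short textbook section teaching {topic} in Python "
                "to a beginner. Include clear explanations and a small, "
                "self-contained code example with comments."
            ),
        }],
    )
    return response.choices[0].message.content

corpus = [generate_textbook_section(t) for t in TOPICS]  # about 1B tokens of this, at scale
```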

And look at the results of this tiny 1.3-billion-parameter model trained in this way. There have been only two models that have scored more than 50 percent on HumanEval pass@1: WizardCoder and, of course, GPT-4. Of course, those models are massively bigger and therefore much more expensive to train.

And actually, I find this chart perhaps the most interesting one in the entire paper. You can see so many trends in one diagram. Let me try to pick a few of them out. And remember, the scores are the percentage accuracy on HumanEval; think moderate-level coding challenges.

First, look at the consistent increase when you go from training on just the filtered Stack to training on the synthetic code textbook: from 11 to 16, from 12 to 20, from 17 to 29. This could be the synthetic data event horizon that Sam Altman talked about. And that code textbook was generated using GPT-3.5, not even GPT-4.

Next, compare the parameter count of the models. 350 million on the left and in the center and 1.3 billion on the right. This one isn't as big a surprise. We knew that increasing the parameters yields better performance. But nevertheless, you can see it vividly in action. Third, and I think this one is really fascinating.

Look at the difference between the left and the center charts. The only thing that really changed was the number of GPU hours, and of course the number of tokens, which went from 26 billion to 76 billion. But wait, wasn't the dataset size fixed at about 7 billion tokens? What gives?

Well, of course, what's happening is that they're passing over the data multiple times. This is called training for more epochs, or passes over the data. So these aren't new tokens; they're the same tokens being trained on more times. As Ronen said to me: my personal impression is that many people in the community thought that we would never want to do more than, like, one or two epochs because we'll start overfitting.
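If you want the rough arithmetic behind those bars, assuming the roughly 7-billion-token corpus mentioned above (about 6B filtered Stack plus about 1B synthetic textbook), it's just total tokens seen divided by unique tokens:

```python
# Back-of-the-envelope: epochs = total tokens seen / unique tokens in the dataset.
dataset_tokens = 7e9            # ~6B filtered Stack + ~1B synthetic textbook

for total_tokens_seen in (26e9, 76e9):
    epochs = total_tokens_seen / dataset_tokens
    print(f"{total_tokens_seen / 1e9:.0f}B tokens seen is about {epochs:.1f} passes over the data")

# 26B tokens seen is about 3.7 passes over the data
# 76B tokens seen is about 10.9 passes over the data
```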

And just for 20 seconds, I can't resist bringing in this paper that they referenced in the Textbooks paper. It's essentially about how you can still scale language models even if you run out of data. And take a look at these two diagrams. They say training for up to four epochs, or passes, is almost as good as new data.

And it's only when you get to around 40 epochs that repeating is worthless. Obviously, we don't know about GPT-4, but GPT-3 seems to have been trained for far fewer epochs than that. But there was one final trend from this amazing set of charts that I wanted to point out. And it's probably the most obvious one.

Look at the huge jump to the dark green bars. That's when they train the model on those additional synthetic exercises with solutions. The authors note that one can only imagine how frustrating and inefficient it would be for a human learner to try to acquire coding skills from datasets like the unfiltered Stack, as they would have to deal with a lot of noise, ambiguity and incompleteness in the data.

We hypothesize that these issues also affect the performance of language models, as they reduce the quality and quantity of the signal that maps natural language to code. Let me quickly give you a bit more detail about how they filtered The Stack. They took about 100,000 samples from The Stack and Stack Overflow and then prompted GPT-4 to determine their educational value for a student whose goal is to learn basic coding concepts.

They then used those annotations to train a random forest classifier that predicts the quality of a file using its output embedding: essentially a cheap, scalable way to find out which parts of The Stack are the most educational. But at this point, I want to pause and imagine if they'd used a different prompt.
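Here's a minimal sketch of that kind of filter, assuming you already have an embedding for each annotated file and a 0/1 "educational" label from the GPT-4 annotations. The file names, threshold and feature shapes below are mine, not details from the paper or the transcript.

```python
# Minimal sketch of the filtering step: train a random forest on embeddings of the
# ~100k GPT-4-annotated files, then use it to score the rest of The Stack.
# `annotated_embeddings.npy`, `gpt4_educational_labels.npy` and `stack_embeddings.npy`
# are assumed inputs, not artifacts released with the paper.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

annotated_embeddings = np.load("annotated_embeddings.npy")   # shape (100_000, dim)
gpt4_labels = np.load("gpt4_educational_labels.npy")         # 1 = educational, 0 = not

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(annotated_embeddings, gpt4_labels)

all_embeddings = np.load("stack_embeddings.npy")             # embeddings for the full corpus
educational_prob = classifier.predict_proba(all_embeddings)[:, 1]
keep_mask = educational_prob > 0.5                           # threshold is illustrative
print(f"Keeping {keep_mask.mean():.1%} of files as 'textbook quality'")
```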

Imagine a future paper looking across a different dataset. That paper could prompt GPT-4 to annotate the educational value for a student whose goal is to learn French; then you could have an amazing French-speaking model. Or maybe they could get it to annotate which examples would be most educational for learning to predict the stock market, and then train a model on a small synthetic textbook of successful previous examples of stock market prediction.

I'm just saying this seems to be an approach that could be applied elsewhere. And these annotations were the only place they used GPT-4; the rest was GPT-3.5. And as Ronen says, GPT-4 is not only great as something we can use directly for better productivity, but also a way to get other, much better models.

And that's one thing I want OpenAI, Anthropic and Google to address: the capability of their models to train smaller models. Here, by the way, is an example of the kind of exercises and solutions that the model was then fine-tuned on, created, of course, by GPT-3.5. And the authors note that, quite remarkably, after fine-tuning on those fewer than 200 million tokens of exercises and solutions, the model also exhibits a substantial improvement in executing tasks that are not featured in the fine-tuning dataset.

For example, fine-tuning on code exercises unexpectedly improves the model's ability to use external libraries such as Pygame, even though our exercises do not contain these libraries. This suggests that fine-tuning not only improves the tasks we targeted, but also makes unrelated tasks easier to distill. It's this unexpectedness that I find really interesting.
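For a flavour of those exercises themselves: the on-screen example isn't reproduced in the transcript, so the snippet below is my own illustration of the docstring-plus-solution format rather than an actual sample from the dataset.

```python
# Hypothetical CodeExercises-style sample: a function signature and docstring act as
# the exercise prompt, and the body is the solution the model learns to produce.
def count_vowels(text: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in `text`, ignoring case."""
    return sum(1 for ch in text.lower() if ch in "aeiou")

assert count_vowels("Phi-1 is small") == 3
```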

For example, before training GPT-4, did they expect the emergent ability to do self-repair or reflection? According to this new paper, that ability is not found in GPT-3.5. Going back to the Phi-1 paper, the authors admit that there remain a number of limitations of our model compared to larger models for code.

Firstly, Phi-1 is specialized in Python coding, which restricts its versatility compared to multi-language models. Secondly, Phi-1 lacks the domain-specific knowledge of larger models, such as programming with specific APIs or using less common packages. It's a bit like the more classical narrow AI: good at only a few things.

Furthermore, due to the structured nature of the datasets and the lack of diversity in terms of language and style, it's less robust to stylistic variations or errors in the prompt. It's quite funny: if you make a grammatical mistake in your prompt, it does a lot worse. But what about this?

We also believe that significant gains could be achieved by using GPT-4 to generate the synthetic data instead of GPT-3.5, as we noticed that GPT-3.5 data has a high error rate. I asked Ronen about that, speculating that it's because GPT-4 costs more. And he said: yeah, it costs more, and GPT-4 is also much slower.

But another reason is we wanted to demonstrate something here, that you don't even need a smart model like GPT-4. Even GPT-3.5, which isn't that great at coding, is enough. So there you go. You could get even better results on this using GPT-4, but at the moment, GPT-4 is a bit too slow.

Before I get to timelines, some of you might have noticed the WizardCoder results and wondered how that model did so well despite being only 16 billion parameters, which, of course, is still more than ten times bigger than Phi-1. Well, of course, I read that paper too, as well as almost every paper referenced in the Textbooks paper.

The secret of WizardCoder seems to have been increasing the difficulty of the training data: fine-tune the model with more difficult examples. For instance, if the original problem can be solved with only a few logical steps, add more reasoning steps; maybe complicate the input, deepen the question, or increase the reasoning involved.
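Here's a hedged sketch of that "evolve the difficulty" idea. The prompt wording below is my paraphrase of the Evol-Instruct-style instructions described in the transcript, not WizardCoder's exact template, and `ask_llm` is a placeholder for whatever chat-completion call you use.

```python
# Sketch of WizardCoder-style instruction evolution: take an existing coding
# problem and ask an LLM to rewrite it as a harder one.

EVOLVE_TEMPLATE = """Rewrite the following programming problem to make it harder.
You may add more reasoning steps, complicate the input format, deepen the question,
or add an extra constraint. Keep it solvable and self-contained.

Original problem:
{problem}
"""

def evolve_problem(problem: str, ask_llm) -> str:
    # `ask_llm` is assumed to take a prompt string and return the model's reply.
    return ask_llm(EVOLVE_TEMPLATE.format(problem=problem))

# Usage (hypothetical): harder = evolve_problem("Reverse a string.", ask_llm=my_chat_model)
```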

You can start to see the shared themes of Orca, WizardCoder and Phi-1. This could be what Sarah Constantin was pointing to in the Asterisk magazine issue that I read yesterday. I'm not sponsored by them, but it was a great issue, so do check it out. She said: rather than a refutation of scaling laws or an acceleration of their slope, I think this is more like a move in a different direction altogether, towards a Cambrian explosion of little AIs used for different purposes, where getting good performance on a task depends on the quality of your task-specific dataset, like Phi-1 for Python.

That could be consistent with the state of the art continuing to progress steadily along scaling-law lines for quite some time. But it could also mean the economic incentive towards ever bigger models would diminish, and we would enter an entirely new era where AI progress would not be driven primarily by semiconductor scaling or Moore's law.

This relates directly to a tweet from the co-founder of Anthropic, Jack Clark. He said a world where we can push a button and stop larger compute things being built, and all focus on safety for a while, is good. That is really interesting to hear from someone at the top of an AGI lab.

But I do have some questions about this policy. If we freeze compute, wouldn't that incentivize every company just to use algorithmic progress to get more out of the compute we do have? And on the safety front, I think it's far more effective public messaging to focus on concrete things that everyone can understand.

For example, in this paper from Oxford this week: LLMs will, in particular, lower barriers to biological misuse. Biological design tools will expand the capabilities of sophisticated actors. Concretely, BDTs may enable the creation of pandemic pathogens substantially worse than anything seen to date and could enable forms of more predictable and targeted biological weapons.

I think this is something that everyone can get behind. And as the paper says, it's been hypothesized that, for evolutionary reasons, naturally emerging pathogens feature a trade-off between transmissibility (that's how much they spread) and virulence (that's how deadly they are). AI-based BDTs might generate design capabilities that are able to overcome this trade-off.

Thus, for the first time, humanity might face a security threat from pathogens substantially worse than anything nature might create, including pathogens capable of posing an existential threat. To be honest, this is my main safety concern. But back to the paper and timelines. Here is another snippet of my conversation with Ronen.

I said: I just feel like we are much closer to something really transformative than the public has quite realized. And labs like OpenAI put out that in 10 years we will have something as powerful as a corporation; I say three to five years. Ronen replied: that depends on how many resources are actually spent on training bigger and bigger models.

I have no idea what OpenAI and Google are doing right now, but definitely, if this is our main goal, I think it can easily be five years. I said: or less. Ronen replied: or less. I feel like the bottleneck is maybe the production of GPUs. And I mean, it's not just producing the GPUs.

You also have to build the data centers and connect them to electricity, and so on. I think if you have all that, then, yeah, I don't see the barrier: with more data, higher-quality data, synthetic data, better and better algorithms, and more and better GPUs and TPUs. That's what we mean when we say we don't see a barrier.

Of course, everyone has slightly different definitions of AGI, but almost everyone agrees that the next five to ten years are going to be the most critical in seeing whether more data, better data, better algorithms, or just more and more compute will lead to AGI or superintelligence. I loved how Carl Shulman put it on the Dwarkesh Patel podcast.

If you can generate close to $10 million a year out of a future version of the H100, a chip that costs tens of thousands of dollars (with a huge profit margin now, and that margin could be reduced with large-scale production), that is a big difference. That chip pays for itself almost instantly.

And so you could support paying ten times as much to have these fabs constructed more rapidly. And if AI is starting to be able to contribute, you could have AI contributing more of the real technical work; it's hard for, say, NVIDIA to suddenly find thousands upon thousands of top-quality engineering hires, but AI could provide that.

Now, if AI hasn't reached that level of performance, then this is how you can have things stall out. A world where AI progress stalls out is one where you go to the $100 billion runs, and then, over succeeding years, the trillion-dollar ones, and software progress turns out to stall.

You lose the gains you were getting from moving researchers in from other fields. Lots of physicists and people from other areas of computer science have been going into AI, but you sort of tap out those resources as AI becomes a larger proportion of the research field. And then, okay, you've put in all of these inputs, but they just haven't yielded AGI yet.

I think that set of inputs probably would yield the kind of AI capabilities needed for an intelligence explosion. But if it doesn't, then you're going to have to wait for the slow grind of things like general economic growth, population growth and such, and so things slow down. And that results in my credences in this kind of advanced AI happening being relatively concentrated, like over the next ten years compared to the rest of the century, because we just can't keep up this rapid growth of inputs for very long.

And so I think that's a really important thing to think about.

Thank you so much for learning about Phi-1 with me, and as always, thank you so much for staying all the way to the end. Do try to have a wonderful day.