Back to Index

Orca: The Model Few Saw Coming


Transcript

Do you remember this paper, less than two weeks old? It made waves by concluding that open source models can mimic the style, but not the factuality of ChatGPT. Overall, we can conclude, they say, that model imitation is a false promise. Well, 48 hours ago, we have this, a 51-page report on Orca based on a small 13 billion parameter model.

I don't often comment on open source models because they're simply not competitive with OpenAI's models. But Orca is not just competitive with GPT-3.5. It beats it in quite a few well-established benchmarks and even matches GPT-4 in a couple of tests of reasoning. As always, I've read both papers in full and can also bring in just-released comments from Sam Altman and Ilya Sutskever on competition from open source models.

But let's start with Orca, named presumably because Orcas, or killer whales, are frequent visitors to South American coastlines. And South America is, of course, the land of llamas and vicunas. But all the research was done by Microsoft, which I find interesting, and I'll come back to that at the end.

But why did they make Orca, and why does it perform better than models like Llama, Alpaca, and Vicuna? Well, they say here in the abstract that those other models lack rigorous evaluation, resulting in overestimating the small model's capability, as they tend to learn to imitate the style, but not the reasoning, of LFMs, large foundation models.

To address these challenges, we develop Orca, a 13 billion parameter model that learns to imitate the reasoning process of the larger models. Orca learns by looking at GPT-4's step-by-step thought processes, and is guided by teacher assistance from ChatGPT, which is GPT-3.5.

And to give you a taste of what's to come, Orca surpasses conventional state-of-the-art models, such as Vicuna, by more than 100% in complex zero-shot reasoning benchmarks, like Big Bench Hard, which I'll talk about, and by 42% on AGIEval.

It goes on: Orca reaches parity with ChatGPT on Big Bench Hard, and shows competitive performance in professional and academic examinations like the SAT, LSAT, GRE, and GMAT. And I know many of you will be interested in this footnote. We are working with our legal team to publicly release a diff of the model weights in accordance with LLaMA's release policy.

So if this is anything like LLaMA, it's going to be leaked across the internet imminently. I'm going to show you so many tests and benchmarks in a moment, but just to give you a sample, here is Orca outperforming ChatGPT in the Vicuna evaluation set, and matching Text-DaVinci-003 in the SAT, LSAT, GRE, and GMAT.

And as I'll touch on later, this was zero-shot without chain of thought or any advanced methods. You can watch pretty much any of my other videos to see how advanced prompt engineering would probably boost those results still further. For those who didn't know, 13 billion parameters is about 7% the size of GPT-3, which is 175 billion parameters, and possibly around 1% or 2% of GPT-4's size.
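Just to put rough numbers on that, here's my own back-of-the-envelope arithmetic. GPT-4's parameter count has never been published, so the figure I use for it is pure speculation:

```python
# Rough size comparison; GPT-4's parameter count is unpublished, so the
# figure below is only the speculated order of magnitude mentioned above.
orca_params = 13e9
gpt3_params = 175e9
speculated_gpt4_params = 1e12  # purely an assumption for illustration

print(f"Orca vs GPT-3: {orca_params / gpt3_params:.1%}")                        # ~7.4%
print(f"Orca vs speculated GPT-4: {orca_params / speculated_gpt4_params:.1%}")  # ~1.3%
```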

That gives you an indication of the difference in size between Orca and these models that it's competing with. And if that doesn't make any sense, a smaller size means it can be run on much smaller, much less powerful devices, like a desktop or even possibly a laptop. The authors start off by giving a little slap to the other paper, you know, that one that said, "Model imitation is a false promise." And they continue that, "Contrary to this assertion, it is possible to reduce the gap with proprietary LLMs on multiple zero-shot benchmarks that require sophisticated reasoning." As we'll see, models like Vicuna claim to have 90% of ChatGPT's quality, but when it came to reasoning tasks, or more technical tasks, it basically flopped.

Here's a chart I'll come back to, outlining some of the more technical challenges you can give a language model. We should remember that Vicuna is a fine-tuned version of the LLaMA model, and it's competitive with, or even better than, PaLM 2. But give it some of the harder challenges for a language model, and it really struggles, as you can see in this column.

Take logical deduction, where it only scored 1.2%. Well, this Orca model was 2,900% better than that, scoring 36%, and it was a lot more competitive with ChatGPT. I'm gonna come back to the Big Bench benchmark, but look for a second at Causal Judgment, where Orca, a 13 billion parameter model, matches GPT-4, which is about 100 times the size.
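As a quick sanity check on that "2,900% better" figure, this is just my own arithmetic rather than a number from the paper's tables:

```python
# Quick sanity check of the "2,900% better" claim (my own arithmetic,
# not a number taken from the paper).
vicuna_score = 1.2   # logical deduction, percent correct
orca_score = 36.0    # logical deduction, percent correct

relative_improvement = (orca_score - vicuna_score) / vicuna_score * 100
print(f"Relative improvement: {relative_improvement:.0f}%")  # -> 2900%
```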

But back to how they actually did it. Models like Alpaca and Vicuna were given lots of queries and responses from ChatGPT or GPT-4. But what these authors did is leverage system instructions, asking GPT-4 and ChatGPT to think step by step and explain their answers. This gave Orca access to detailed responses from the model that explained the reasoning process of the teacher as it generates the response.

It allowed these parent models, GPT-3.5 and GPT-4, to be much better tutors for this young Orca. Also, they let the teachers, ChatGPT (which is GPT-3.5) and GPT-4, give far more examples to their student: 5 million and 1 million examples, respectively. Compare that to the other models you may have heard of, like Alpaca, Wizard, Vicuna, and so on.

They had tens of thousands or the low hundreds of thousands of examples. But again, the key difference is the explanations, the step-by-step thinking that the smaller Orca could then imitate. They give a quick demo here of how the other open source models learn from their GPT parents with a simplistic question and answer format.

In contrast, the authors leveraged system messages to get ChatGPT and GPT-4 to think step-by-step, leading to much richer explanations, as you can see in this diagram. It wasn't just "Let's think step-by-step," by the way; there were also things like "Explain like I'm five." They also wanted the tasks to be as complex and diverse as possible, so they used the Flan collection.

This was released by Google in February and focused on balancing the kind of prompts and tasks that you fine-tune the language models on. You can see here the 16 system messages that they give to ChatGPT and GPT-4, and you can see here the kind of difference that that makes.
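To make that concrete, here is a minimal sketch of what collecting one explanation-rich training example might look like. This is my own illustration, not the authors' pipeline; it assumes the OpenAI Python client, and the exact system message is just one variant of the styles they describe:

```python
# A minimal sketch of explanation-style data collection, not the authors' actual
# pipeline. Assumes the OpenAI Python client and an illustrative system message.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_MESSAGE = (
    "You are a helpful assistant. Think step by step and justify your answer, "
    "explaining it as if to a five-year-old."
)

def collect_explanation(question: str, model: str = "gpt-4") -> str:
    """Ask the teacher model for an explanation-rich answer to one FLAN-style prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Each (system message, question, teacher explanation) triple becomes one training example.
example = collect_explanation("Pick which sentence is not logical: ...")
```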

Imagine a language model trying to learn from this human. The human is asked, "Pick which sentence is not logical." Sentence A, "People in the desert often look forward to flood," or sentence B, "People in the desert often look forward to rain." The human responds, "There is no reason to look forward to a flood because floods cause damage." The answer is sentence A.

Now, yes, a language model can learn from that, but by leveraging those system assistant messages, look at the kind of response that GPT-4 gives. Now, Orca can learn a lot more from that explanation, and that's one of the main reasons it's better than all the other open-source models. Because remember, Vicuna is the best of the open-source models.

In this leaderboard, it has an Elo of 1054, better even than PaLM 2 Bison. All the models higher than it are proprietary. But there is another reason why Orca performs so much better. You might have wondered, why didn't they just use only GPT-4? Well, yes, there were cost and time considerations, but there was another factor that they found.

They used ChatGPT, or GPT-3.5, as an intermediate teacher. That teacher was able to reduce the gap in capabilities, so Orca got smarter and better able to learn. It's a bit like progressive learning, where you first learn from easier examples and then move on to harder ones.

After that, they gave it outputs from GPT-4. Notice, by the way, what happens if you skip the ChatGPT teaching assistant and only train on those 1 million examples from GPT-4. What happens is it's a bit like a student struggling in a class that's too advanced for them. Orca actually performs worse in those circumstances, averaging 37%.

But with that intermediate teacher beforehand, it gets 41.7%. Speaking of time, it only took about 200 hours to train Orca on 20 A100 GPUs. They did take a few weeks to collect the data from ChatGPT and GPT-4. But presumably, if they're planning to open source this, which they say they are, then that step could be skipped by others.
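If you want to picture that two-stage recipe, here is a toy sketch. It is entirely my own illustration, not the paper's training code: the fine_tune function is a hypothetical stand-in for a real supervised fine-tuning loop, and the tiny lists stand in for the roughly 5 million ChatGPT and 1 million GPT-4 explanation examples:

```python
# A toy sketch of the progressive-learning recipe described above (my own
# illustration, not the paper's code). fine_tune() is a hypothetical stand-in
# for a real supervised fine-tuning loop.

def fine_tune(model: dict, dataset: list[dict], stage: str) -> dict:
    # Placeholder: in practice this would update the model's weights on
    # (system message, question, teacher explanation) triples.
    print(f"{stage}: training on {len(dataset)} examples")
    return model

student = {"name": "13b-student"}  # hypothetical base model

chatgpt_explanations = [{"question": "q", "explanation": "easier, shorter reasoning"}] * 5
gpt4_explanations = [{"question": "q", "explanation": "richer, harder reasoning"}] * 1

# Stage 1: ChatGPT (GPT-3.5) acts as an intermediate teacher on easier examples.
student = fine_tune(student, chatgpt_explanations, stage="stage 1 (ChatGPT)")

# Stage 2: only then move on to the GPT-4 explanations.
student = fine_tune(student, gpt4_explanations, stage="stage 2 (GPT-4)")
```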

So, let's look at some more of the results. First, for open-ended generation, not multiple choice: Orca is 95% of ChatGPT quality and 85% of GPT-4's quality, as assessed by GPT-4. But they wanted to quickly move on to some more definitive tasks.

There is a problem with using GPT-4 as an assessor, though. For example, they observed that there is a positive bias in GPT-4's evaluation toward the response of the first model in the comparison. This reminded me of the unfaithful reasoning paper that I talked about in one of my recent videos.
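Incidentally, a standard way to control that first-position bias is to judge each pair twice with the order swapped. Here is my own illustration of the idea, not the authors' evaluation code, with judge() as a hypothetical GPT-4 judging call:

```python
# My own illustration of one common mitigation for first-position bias, not the
# authors' evaluation code. judge() is a hypothetical GPT-4 judging call that
# sees two answers and returns "A" or "B" for the better one.
from typing import Callable

def debiased_preference(
    judge: Callable[[str, str, str], str],
    prompt: str,
    answer_1: str,
    answer_2: str,
) -> str:
    first = judge(prompt, answer_1, answer_2)   # answer_1 shown in first position
    second = judge(prompt, answer_2, answer_1)  # order swapped
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # the verdict flipped with the ordering, so call it a tie

# Usage: debiased_preference(my_gpt4_judge, question, orca_answer, chatgpt_answer)
```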

You can't always trust GPT-4 to give its true reasoning. But here it is in more objective multiple choice questions. And notice how much harder many of these tests are for even these advanced language models. I am fortunate and proud to have attained a perfect score in some of the tests in this chart, like the GRE and GMAT.

They were part of the AQuA-RAT test that they gave the models. So, I can say that they really are quite challenging, hence why GPT-4 only gets around 40%. But you can see that throughout, Orca outperforms Vicuna by quite a margin and is very competitive with Text-DaVinci-003. Of course, overall, it does lag behind GPT-4.

But this is all zero-shot. A bit later on, I'll come back to the range of methods that we could use to further improve Orca. The percentages, by the way, are the improvements over Vicuna, again the second best open source model. So far, we've looked at human-centric benchmarks, like the GMAT and GRE.

These are grouped under the lovely name AGIEval. And as we've seen, even the top models lag behind the top human performers. But what about a benchmark specifically for language models? It's called Big Bench Hard. The original Big Bench had 207 tasks, but language models got so good that they had to narrow down the benchmark to just the 23 challenging tasks where human raters still did better than language models.

Now, it turns out when you add chain of thought prompting to the models, they do even better. And there are even fewer tasks that humans are better at. But anyway, all you have to remember is that these are 23 of the hardest tasks for language models. And I'll just let you compare the results for yourself.

But the trend is really quite clear: Orca massively outperforming the previous best open source model, Vicuna, beating even ChatGPT on average, but still, of course, lagging behind GPT-4, except for a few tasks. Look at Web of Lies, where Orca actually edges ahead of GPT-4. That would be a question like this: Alexis says Shonda tells the truth. Jim lies. Antoine says Jim tells the truth. Shonda says Antoine lies. Does Alexis tell the truth?
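Just to show how mechanical that puzzle really is, here is my own worked solution, not the benchmark's answer key, propagating who must be truthful along the chain of statements:

```python
# Entirely my own working of the Web of Lies question above, not the benchmark's
# answer key: each person either always tells the truth or always lies, so we can
# propagate honesty along the chain of statements.

def speaker_is_truthful(claims_subject_truthful: bool, subject_is_truthful: bool) -> bool:
    # A speaker is truthful exactly when their claim matches reality.
    return claims_subject_truthful == subject_is_truthful

jim = False                                   # Given: Jim lies.
antoine = speaker_is_truthful(True, jim)      # Antoine says Jim tells the truth.
shonda = speaker_is_truthful(False, antoine)  # Shonda says Antoine lies.
alexis = speaker_is_truthful(True, shonda)    # Alexis says Shonda tells the truth.

print("Does Alexis tell the truth?", "Yes" if alexis else "No")  # -> Yes
```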

Or what about temporal sequences, where Orca absolutely crushes Vicuna and doubles ChatGPT's performance? That would be a situation like this. Now, I'm not going to read it all out, but essentially you have to figure out when the timings match up. Basically, keeping track of time.

Going back to that footnote about working with their legal team to publicly release a diff of the model weights: this does seem a little bit naive to me. I mean, that's what Meta said when they released LLaMA, but then everyone and their grandma just used the language model for whatever. I do wonder what it means when they say, "We are working with our legal team." And it is particularly interesting to me that this was all done by Microsoft.

I'm going to go into a little bit of speculation here about why I think they conducted this research. You might remember that leaked memo from Google, "We have no moat," which even mentioned Vicuna.

My theory is that the Microsoft researchers were testing a point from that memo: that training giant models from scratch not only throws away the pre-training, but also any iterative, open source improvements that have been made on top.

It doesn't take long for those improvements to dominate, making the full retrain extremely costly. Maybe Microsoft is hesitating about future investments in GPT-5 or GPT-6, and they really wanted to test whether it's easy to imitate those large models on the cheap. If it is, then why would Microsoft invest billions in a new giant model?

That's my own theory as to why Microsoft is working on this. But let me know in the comments what your theory is. In the conclusion, the authors state that "Orca suggests that learning from step-by-step explanations could significantly improve the quality of models regardless of their size," and that they hope these insights will inform the design of more robust evaluation methods.

That goes along with the advancement of alignment and post-training techniques, and the more effective use of powerful models like GPT-4 as teachers. And maybe they should have added: also with ChatGPT as an intermediate teacher. I'm going to end with the thoughts of the leaders of OpenAI, Ilya Sutskever and Sam Altman, on open source models.

And I think there is a bit of a contrast between the two answers. Ilya Sutskever thinks that the gap is growing ever wider. To the open source versus non-open source models question: you don't want to think about it in binary, black-and-white terms where, like, there is a secret sauce that will never be rediscovered.

What I will say is: will GPT-4 ever be reproduced by open source models? Perhaps one day it will be. But when it is, there will be a much more powerful model in the companies. So there will always be a gap between the open source models and the private models.

And this gap may even be increasing with time. The amount of effort and engineering and research that it takes to produce one such neural net keeps increasing. And so even if there are open source models, they will be less and less produced by small groups of dedicated researchers and engineers.

And it will increasingly only be the province of a company, a big company. While Sam Altman seems to say that even if open source models do catch up, OpenAI will always have a different kind of moat. What are your thoughts about the "We have no moat" document that was released lately?

The leaked document. The thing that is special about OpenAI, and I think the thing that is so misunderstood about that document, aside from the fact that we have, like, a gigantic number of users and people that have formed some sort of relationship with us and our products, is that what OpenAI is special at is figuring out what comes next.

It is easy to copy something once you know it can be done, and in that sense, sure. But it is very hard to figure out what to do next. And the ideas, the big ideas, the medium-sized ideas, the small ideas, and the careful execution on them that it takes to get from here to superintelligence, that's what our moat is.

Anyway, this video could have been at least three times longer. There was so much I had to edit out for brevity. If you're interested in me talking more about open source models, do let me know in the comments. I've got much more to say. As always, thank you so much for watching to the end, and have a wonderful day.