Orca: The Model Few Saw Coming
Do you remember this paper, less than two weeks old? 00:00:03.720 |
It made waves by concluding that open source models 00:00:06.940 |
can mimic the style, but not the factuality of ChatGPT. 00:00:32.620 |
But Orca is not just competitive with GPT-3.5. 00:00:36.480 |
It beats it in quite a few well-established benchmarks 00:00:39.960 |
and even matches GPT-4 in a couple of tests of reasoning. 00:00:55.580 |
named presumably because Orcas, or killer whales, 00:00:58.880 |
are frequent visitors to South American coastlines. 00:01:18.960 |
that those other models lack rigorous evaluation, 00:01:22.100 |
resulting in overestimating the small model's capability. 00:01:28.000 |
So, if you're interested in learning more about Orca, 00:01:28.080 |
we develop Orca, a 13 billion parameter model, 00:01:44.880 |
and is guided by teacher assistance from ChatGPT, 00:01:52.460 |
Orca surpasses conventional state-of-the-art models, 00:02:21.500 |
And I know many of you will be interested in this footnote. 00:02:33.940 |
it's going to be leaked across the internet imminently. 00:03:01.620 |
would probably boost those results still further. 00:03:05.460 |
13 billion parameters is about 7% the size of GPT-3, 00:03:12.660 |
and possibly around 1% or 2% of GPT-4's size. 00:03:24.800 |
a smaller size means it can be run on much smaller hardware. 00:03:32.820 |
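Just to sanity-check those percentages, here is a quick back-of-the-envelope calculation. GPT-3's 175 billion parameters are public; the 1 trillion figure used for GPT-4 below is purely an assumption for illustration, matching the video's own hedging.

```python
# Rough size comparison. GPT-3 is 175B parameters; GPT-4's size is not public,
# so the 1T figure below is only an assumption to reproduce the "1-2%" estimate.
orca_params = 13e9
gpt3_params = 175e9
gpt4_params_guess = 1e12  # speculative

print(f"Orca vs GPT-3: {orca_params / gpt3_params:.1%}")                  # ~7.4%
print(f"Orca vs GPT-4 (assumed): {orca_params / gpt4_params_guess:.1%}")  # ~1.3%
```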
The authors start off by giving a little slap 00:03:41.300 |
it is possible to reduce the gap with proprietary LLMs 00:04:00.360 |
outlining some of the more technical challenges 00:04:04.480 |
We should remember that Vicuna is a fine-tuned version 00:04:09.080 |
and it's competitive or even better than PaLM 2. 00:04:22.440 |
Well, this Orca model was 2,900% better than that, 00:04:28.020 |
and it was a lot more competitive with ChatGPT. 00:04:31.020 |
I'm gonna come back to the Big Bench benchmark, 00:04:38.940 |
matches GPT-4, which is about 100 times the size. 00:04:52.060 |
But what they did is they leveraged system instructions, 00:05:01.360 |
to get detailed responses from the model that explained the reasoning process 00:05:07.000 |
It allowed these parent models of GPT-3.5 and GPT-4 00:05:10.840 |
to be much better tutors for this young Orca. 00:05:21.060 |
5 million and 1 million examples, respectively. 00:05:24.200 |
That compares to the other models you may have heard of, 00:05:30.600 |
or the low hundreds of thousands of examples. 00:05:33.240 |
But again, the key difference is the explanations. 00:05:44.980 |
The other models were trained with a simplistic question and answer format. 00:05:48.480 |
In contrast, the authors leveraged system messages 00:05:51.480 |
to get ChatGPT and GPT-4 to think step-by-step, 00:05:59.960 |
It wasn't just, "Let's think step-by-step," by the way, 00:06:13.960 |
and focused on balancing the kind of prompts and tasks 00:06:25.460 |
and you can see here the kind of difference that that makes. 00:06:27.940 |
Imagine a language model trying to learn from this human. 00:06:30.940 |
The human is asked, "Pick which sentence is not logical." 00:06:33.940 |
Sentence A, "People in the desert often look forward to flood," 00:06:37.940 |
or sentence B, "People in the desert often look forward to rain." 00:06:40.940 |
The human responds, "There is no reason to look forward to a flood." 00:06:47.940 |
Now, yes, a language model can learn from that, 00:06:49.940 |
but by leveraging those system assistant messages, 00:06:52.940 |
look at the kind of response that GPT-4 gives. 00:07:00.920 |
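To make that concrete, here is a minimal sketch of how explanation-rich training data like this could be collected, assuming the older (pre-1.0) openai Python SDK; the system messages and the collect_example helper are illustrative, not the paper's exact wording or code.

```python
# Sketch of "explanation tuning" data collection: pair a FLAN-style instruction
# with a system message that asks the teacher (GPT-4 or ChatGPT) to explain its
# reasoning, then store the triple for fine-tuning the student model.
import json
import openai  # pre-1.0 SDK interface; assumes openai.api_key is configured

SYSTEM_MESSAGES = [  # illustrative paraphrases, not the paper's exact set
    "You are a helpful assistant. Think step-by-step and justify your answer.",
    "Explain your answer as if teaching a five-year-old, then state the final answer.",
]

def collect_example(instruction: str, system_message: str, teacher: str = "gpt-4") -> dict:
    """Ask the teacher model for a detailed, reasoned answer to one instruction."""
    response = openai.ChatCompletion.create(
        model=teacher,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": instruction},
        ],
    )
    answer = response["choices"][0]["message"]["content"]
    # The student is later fine-tuned to map (system message, instruction) -> explanation.
    return {"system": system_message, "instruction": instruction, "output": answer}

if __name__ == "__main__":
    example = collect_example(
        "Pick which sentence is not logical: (A) People in the desert often look "
        "forward to flood. (B) People in the desert often look forward to rain.",
        SYSTEM_MESSAGES[0],
    )
    print(json.dumps(example, indent=2))
```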
it's better than all the other open-source models. 00:07:03.920 |
Because remember, Vicuna is the best of the open-source models. 00:07:12.920 |
All the models higher than it are proprietary. 00:07:15.920 |
But there is another reason why Orca performs so much better. 00:07:21.920 |
Well, yes, there were cost and time considerations, 00:07:24.920 |
but there was another factor that they found. 00:07:36.900 |
So Orca got smarter and better able to learn. 00:07:48.900 |
what happens if you skip the ChatGPT teaching assistant 00:07:52.900 |
and only train on those 1 million examples from GPT-4. 00:07:57.880 |
it's a bit like a student struggling in a class 00:08:01.880 |
Orca actually performs worse in those circumstances, 00:08:05.880 |
But with that intermediate teacher beforehand, it performs significantly better. 00:08:17.880 |
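Here is a toy sketch of that progressive-learning idea: train on the larger ChatGPT-explained set first, then on the smaller GPT-4-explained set. The fine_tune stub only records what it saw; it stands in for a real training loop and is not the authors' code.

```python
# Toy illustration of Orca-style progressive learning: stage 1 uses the 5M
# ChatGPT explanations (the intermediate teacher), stage 2 the 1M GPT-4
# explanations. Skipping stage 1 was the ablation that performed worse.
from typing import Dict, List

def fine_tune(model_state: Dict, examples: List[dict], stage: str) -> Dict:
    """Stand-in for a real fine-tuning loop; here it only records what was seen."""
    model_state.setdefault("stages", []).append((stage, len(examples)))
    return model_state

def progressive_learning(chatgpt_examples: List[dict], gpt4_examples: List[dict]) -> Dict:
    student = {"name": "13b-student-sketch"}
    student = fine_tune(student, chatgpt_examples, stage="chatgpt")  # easier teacher first
    student = fine_tune(student, gpt4_examples, stage="gpt4")        # stronger teacher second
    return student

if __name__ == "__main__":
    print(progressive_learning([{}] * 5, [{}] * 1))
```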
They did take a few weeks to collect the data 00:08:45.860 |
There is a problem of using GPT-4 as an assessor. 00:08:49.860 |
For example, they observed that there is a positive bias 00:08:59.840 |
This reminded me of the unfaithful reasoning paper 00:09:01.840 |
that I talked about in one of my recent videos. 00:09:04.840 |
You can't always trust GPT-4 to give its true reasoning. 00:09:07.840 |
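For anyone who hasn't seen this kind of evaluation, here is a minimal sketch of what using GPT-4 as an assessor looks like, again assuming the older openai SDK; the prompt wording is mine, not the paper's.

```python
# Minimal GPT-4-as-judge sketch: ask GPT-4 which of two candidate answers is
# better. Free-form judging like this is where the biases mentioned above can
# creep in, so results from it should be read with care.
import openai  # pre-1.0 SDK; assumes openai.api_key is configured

def gpt4_judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Which answer is better, A or B? Reply with a single letter."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"].strip()
```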
But here it is in more objective multiple choice questions. 00:09:10.840 |
And notice how much harder many of these tests are 00:09:15.840 |
I am fortunate and proud to have attained a perfect score 00:09:25.840 |
So, I can say that they really are quite challenging. 00:09:37.820 |
Of course, overall, it does lag behind GPT-4. 00:09:42.820 |
A bit later on, I'll come back to the range of methods 00:09:45.820 |
that we could use to further improve Orca. 00:09:48.820 |
The percentages, by the way, are the improvements on Vicuna. 00:09:54.820 |
So far, we've looked at human-centric benchmarks. 00:09:59.800 |
These are grouped under the lovely name AGIEval. 00:10:06.800 |
But what about a benchmark specifically for language models? 00:10:21.800 |
where human raters still did better than language models. 00:10:24.800 |
Now, it turns out when you add chain of thought prompting 00:10:29.780 |
And there are even fewer tasks that humans are better at. 00:10:31.780 |
But anyway, all you have to remember is that these are 23 challenging Big-Bench tasks. 00:10:37.780 |
And I'll just let you compare the results for yourself. 00:10:42.780 |
Orca massively outperforming the previous best open source model, Vicuna. 00:10:57.760 |
which is a very common task in the form of GPT-4. 00:11:22.760 |
But essentially, you have to figure out when the timings match up. 00:11:27.740 |
I mean, that's what Meta said when they released Llama. 00:13:31.740 |
But then, everyone and their grandma just used the language model for whatever. 00:13:35.740 |
I do wonder what it means when they say, "We are working with our legal team." 00:13:39.740 |
And it is particularly interesting to me that this was all done by Microsoft. 00:13:43.740 |
I'm going to go into a little bit of speculation here about why I think they conducted this research. 00:13:48.740 |
You might remember that leaked memo from Google, "We have no moat." 00:13:54.740 |
And talked about how it circumvented restrictions on the online market. 00:13:57.740 |
And my theory is that the Microsoft researchers were testing this point from the memo. 00:14:05.740 |
The point was that training giant models from scratch not only throws away the pre-training, 00:14:09.740 |
but also any iterative, open source improvements that have been made on top. 00:14:13.740 |
It doesn't take long for those improvements to dominate, making the full retrain extremely costly. 00:14:18.740 |
Maybe Microsoft is hesitating about future investments in GPT-5 or GPT-6. 00:14:24.740 |
And they really want to test out if it's easy to imitate 00:14:26.740 |
those large models on the cheap. 00:14:29.740 |
If it is, then why would Microsoft invest billions in a new giant model? 00:14:34.740 |
That's my own theory as to why Microsoft is working on this. 00:14:37.740 |
But let me know in the comments what your theory is. 00:14:41.740 |
"AUCA suggests that learning from step-by-step explanations could significantly improve the quality of models regardless of their size." 00:14:49.740 |
And that they hope these insights will inform the design of more robust evaluation methods. 00:14:55.740 |
For example, the development of the most advanced training techniques. 00:14:57.740 |
And the advancement of alignment and post-training techniques. 00:14:59.740 |
And the more effective use of powerful models like GPT-4 as teachers. 00:15:05.740 |
And maybe they should have said, and also with ChatGPT as an intermediate teacher. 00:15:09.740 |
I'm going to end with the thoughts of the leaders of OpenAI, Ilya Sutskever and Sam Altman, on open source models. 00:15:15.740 |
And I think there is a bit of a contrast between the two answers. 00:15:18.740 |
Ilya Sutskever thinks that the gap is growing ever wider. 00:15:24.740 |
The open source versus non-open source models question. 00:15:27.740 |
You don't want to think about it in binary black and white terms where, like, there is a secret sauce that will never be rediscovered. 00:15:37.740 |
What I will say, on whether GPT-4 will ever be reproduced by open source models, is that perhaps one day it will be. 00:15:45.740 |
But by the time it is, there will be a much more powerful model in the companies. 00:15:50.740 |
So there will always be a gap between the open source models and the private models. 00:15:56.740 |
And this gap may even be increasing with time. 00:16:01.740 |
The amount of effort and engineering and research that it takes to produce one such neural net keeps increasing. 00:16:09.740 |
And so even if there are open source models, they will be less and less produced by small groups of dedicated researchers and engineers. 00:16:20.740 |
And it will only be the province of a company. 00:16:24.740 |
While Sam Altman seems to say that even if open source models do catch up, OpenAI will always have a different kind of moat. 00:16:32.740 |
What are your thoughts about the "We have no moat" document that was released lately? 00:16:41.740 |
The thing that is special about OpenAI, and I think the thing that is so misunderstood by that document, aside from the fact that we have, like, a gigantic number of users, 00:16:51.740 |
and people that like have formed some sort of relationship with us and our products, 00:16:57.740 |
is what OpenAI is special about is figuring out what comes next. 00:17:03.740 |
It is the ability, it is easy to copy something once you know it can be done, and in that sense, sure. 00:17:09.740 |
It is very hard to go figure out what to do next. 00:17:12.740 |
And the ideas, the big ideas, the medium-sized ideas, the small ideas, and the careful execution on them that it takes to get from here to superintelligence, 00:17:22.740 |
Anyway, this video could have been at least three times longer. 00:17:25.740 |
There was so much I had to edit out for brevity. 00:17:28.740 |
If you're interested in me talking more about open source models, do let me know in the comments. 00:17:34.740 |
As always, thank you so much for watching to the end, and have a wonderful day.