Orca: The Model Few Saw Coming
Do you remember this paper, less than two weeks old? 00:00:03.720 |
It made waves by concluding that open source models 00:00:06.940 |
can mimic the style, but not the factuality of ChatGPT. 00:00:32.620 |
But Orca is not just competitive with GPT-3.5. 00:00:36.480 |
It beats it in quite a few well-established benchmarks 00:00:39.960 |
and even matches GPT-4 in a couple of tests of reasoning. 00:00:55.580 |
named presumably because Orcas, or killer whales, 00:00:58.880 |
are frequent visitors to South American coastlines. 00:01:18.960 |
that those other models lack rigorous evaluation, 00:01:22.100 |
resulting in overestimating the small model's capability. 00:01:28.000 |
So, if you're interested in learning more about Orca, 00:01:28.080 |
we develop Orca, a 13 billion parameter model, 00:01:44.880 |
and is guided by teacher assistance from ChatGPT, 00:01:52.460 |
Orca surpasses conventional state-of-the-art models, 00:02:21.500 |
And I know many of you will be interested in this footnote. 00:02:33.940 |
it's going to be leaked across the internet imminently. 00:03:01.620 |
would probably boost those results still further. 00:03:05.460 |
13 billion parameters is about 7% the size of GPT-3, 00:03:12.660 |
and possibly around 1% or 2% of GPT-4's size. 00:03:24.800 |
a smaller size means it can be run on much smaller hardware. 00:03:32.820 |
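Just to sanity-check those percentages, here is a quick back-of-the-envelope calculation. GPT-3's 175 billion parameters are public; the 1 trillion figure used for GPT-4 below is purely an assumption for illustration, matching the video's own hedging.

```python
# Rough size comparison. GPT-3 is 175B parameters; GPT-4's size is not public,
# so the 1T figure below is only an assumption to reproduce the "1-2%" estimate.
orca_params = 13e9
gpt3_params = 175e9
gpt4_params_guess = 1e12  # speculative

print(f"Orca vs GPT-3: {orca_params / gpt3_params:.1%}")                  # ~7.4%
print(f"Orca vs GPT-4 (assumed): {orca_params / gpt4_params_guess:.1%}")  # ~1.3%
```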
The authors start off by giving a little slap 00:03:41.300 |
it is possible to reduce the gap with proprietary LLMs 00:04:00.360 |
outlining some of the more technical challenges 00:04:04.480 |
We should remember that Vicuna is a fine-tuned version 00:04:09.080 |
and it's competitive or even better than PaLM 2. 00:04:22.440 |
Well, this Orca model was 2,900% better than that, 00:04:28.020 |
and it was a lot more competitive with ChatGPT. 00:04:31.020 |
I'm gonna come back to the Big Bench benchmark, 00:04:38.940 |
matches GPT-4, which is about 100 times the size. 00:04:52.060 |
But what they did is they leveraged system instructions, 00:05:01.360 |
to get detailed responses from the model that explained the reasoning process 00:05:07.000 |
It allowed these parent models of GPT-3.5 and GPT-4 00:05:10.840 |
to be much better tutors for this young Orca. 00:05:21.060 |
5 million and 1 million examples, respectively. 00:05:24.200 |
That compares to the other models you may have heard of, 00:05:30.600 |
or the low hundreds of thousands of examples. 00:05:33.240 |
But again, the key difference is the explanations. 00:05:44.980 |
The other models were trained with a simplistic question and answer format. 00:05:48.480 |
In contrast, the authors leveraged system messages 00:05:51.480 |
to get ChatGPT and GPT-4 to think step-by-step, 00:05:59.960 |
It wasn't just, "Let's think step-by-step," by the way, 00:06:13.960 |
and focused on balancing the kind of prompts and tasks 00:06:25.460 |
and you can see here the kind of difference that that makes. 00:06:27.940 |
Imagine a language model trying to learn from this human. 00:06:30.940 |
The human is asked, "Pick which sentence is not logical." 00:06:33.940 |
Sentence A, "People in the desert often look forward to flood," 00:06:37.940 |
or sentence B, "People in the desert often look forward to rain." 00:06:40.940 |
The human responds, "There is no reason to look forward to a flood." 00:06:47.940 |
Now, yes, a language model can learn from that, 00:06:49.940 |
but by leveraging those system assistant messages, 00:06:52.940 |
look at the kind of response that GPT-4 gives. 00:07:00.920 |
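To make that concrete, here is a minimal sketch of how explanation-rich training data like this could be collected, assuming the older (pre-1.0) openai Python SDK; the system messages and the collect_example helper are illustrative, not the paper's exact wording or code.

```python
# Sketch of "explanation tuning" data collection: pair a FLAN-style instruction
# with a system message that asks the teacher (GPT-4 or ChatGPT) to explain its
# reasoning, then store the triple for fine-tuning the student model.
import json
import openai  # pre-1.0 SDK interface; assumes openai.api_key is configured

SYSTEM_MESSAGES = [  # illustrative paraphrases, not the paper's exact set
    "You are a helpful assistant. Think step-by-step and justify your answer.",
    "Explain your answer as if teaching a five-year-old, then state the final answer.",
]

def collect_example(instruction: str, system_message: str, teacher: str = "gpt-4") -> dict:
    """Ask the teacher model for a detailed, reasoned answer to one instruction."""
    response = openai.ChatCompletion.create(
        model=teacher,
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": instruction},
        ],
    )
    answer = response["choices"][0]["message"]["content"]
    # The student is later fine-tuned to map (system message, instruction) -> explanation.
    return {"system": system_message, "instruction": instruction, "output": answer}

if __name__ == "__main__":
    example = collect_example(
        "Pick which sentence is not logical: (A) People in the desert often look "
        "forward to flood. (B) People in the desert often look forward to rain.",
        SYSTEM_MESSAGES[0],
    )
    print(json.dumps(example, indent=2))
```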
it's better than all the other open-source models. 00:07:03.920 |
Because remember, Vicuna is the best of the open-source models. 00:07:12.920 |
All the models higher than it are proprietary. 00:07:15.920 |
But there is another reason why Orca performs so much better. 00:07:21.920 |
Well, yes, there were cost and time considerations, 00:07:24.920 |
but there was another factor that they found. 00:07:36.900 |
So Orca got smarter and better able to learn. 00:07:48.900 |
what happens if you skip the ChatGPT teaching assistant 00:07:52.900 |
and only train on those 1 million examples from GPT-4. 00:07:57.880 |
it's a bit like a student struggling in a class 00:08:01.880 |
Orca actually performs worse in those circumstances, 00:08:05.880 |
But with that intermediate teacher beforehand, it performs significantly better. 00:08:17.880 |
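Here is a toy sketch of that progressive-learning idea: train on the larger ChatGPT-explained set first, then on the smaller GPT-4-explained set. The fine_tune stub only records what it saw; it stands in for a real training loop and is not the authors' code.

```python
# Toy illustration of Orca-style progressive learning: stage 1 uses the 5M
# ChatGPT explanations (the intermediate teacher), stage 2 the 1M GPT-4
# explanations. Skipping stage 1 was the ablation that performed worse.
from typing import Dict, List

def fine_tune(model_state: Dict, examples: List[dict], stage: str) -> Dict:
    """Stand-in for a real fine-tuning loop; here it only records what was seen."""
    model_state.setdefault("stages", []).append((stage, len(examples)))
    return model_state

def progressive_learning(chatgpt_examples: List[dict], gpt4_examples: List[dict]) -> Dict:
    student = {"name": "13b-student-sketch"}
    student = fine_tune(student, chatgpt_examples, stage="chatgpt")  # easier teacher first
    student = fine_tune(student, gpt4_examples, stage="gpt4")        # stronger teacher second
    return student

if __name__ == "__main__":
    print(progressive_learning([{}] * 5, [{}] * 1))
```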
They did take a few weeks to collect the data 00:08:45.860 |
There is a problem of using GPT-4 as an assessor. 00:08:49.860 |
For example, they observed that there is a positive bias 00:08:59.840 |
This reminded me of the unfaithful reasoning paper 00:09:01.840 |
that I talked about in one of my recent videos. 00:09:04.840 |
You can't always trust GPT-4 to give its true reasoning. 00:09:07.840 |
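For anyone who hasn't seen this kind of evaluation, here is a minimal sketch of what using GPT-4 as an assessor looks like, again assuming the older openai SDK; the prompt wording is mine, not the paper's.

```python
# Minimal GPT-4-as-judge sketch: ask GPT-4 which of two candidate answers is
# better. Free-form judging like this is where the biases mentioned above can
# creep in, so results from it should be read with care.
import openai  # pre-1.0 SDK; assumes openai.api_key is configured

def gpt4_judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Which answer is better, A or B? Reply with a single letter."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"].strip()
```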
But here it is in more objective multiple choice questions. 00:09:10.840 |
And notice how much harder many of these tests are 00:09:15.840 |
I am fortunate and proud to have attained a perfect score 00:09:25.840 |
So, I can say that they really are quite challenging. 00:09:37.820 |
Of course, overall, it does lag behind GPT-4. 00:09:42.820 |
A bit later on, I'll come back to the range of methods 00:09:45.820 |
that we could use to further improve Orca. 00:09:48.820 |
The percentages, by the way, are the improvements on Vicuna. 00:09:54.820 |
So far, we've looked at human-centric benchmarks. 00:09:59.800 |
These are grouped under the lovely name AGIEval. 00:10:06.800 |
But what about a benchmark specifically for language models? 00:10:21.800 |
where human raters still did better than language models. 00:10:24.800 |
Now, it turns out when you add chain of thought prompting 00:10:29.780 |
And there are even fewer tasks that humans are better at. 00:10:31.780 |
But anyway, all you have to remember is that these are 23 challenging Big-Bench tasks. 00:10:37.780 |
And I'll just let you compare the results for yourself. 00:10:42.780 |
Orca massively outperforming the previous best open source model, Vicuna. 00:10:57.760 |
which is a very common task in the form of GPT-4. 00:11:22.760 |
But essentially, you have to figure out when the timings match up. 00:11:27.740 |
I mean, that's what Meta said when they released Llama. 00:13:31.740 |
But then, everyone and their grandma just used the language model for whatever. 00:13:35.740 |
I do wonder what it means when they say, "We are working with our legal team." 00:13:39.740 |
And it is particularly interesting to me that this was all done by Microsoft. 00:13:43.740 |
I'm going to go into a little bit of speculation here about why I think they conducted this research. 00:13:48.740 |
You might remember that leaked memo from Google, "We have no moat." 00:13:54.740 |
And talked about how it circumvented restrictions on the online market. 00:13:57.740 |
And my theory is that the Microsoft researchers were testing this point from the memo. 00:14:05.740 |
The point was that training giant models from scratch not only throws away the pre-training, 00:14:09.740 |
but also any iterative, open source improvements that have been made on top. 00:14:13.740 |
It doesn't take long for those improvements to dominate, making the full retrain extremely costly. 00:14:18.740 |
Maybe Microsoft is hesitating about future investments in GPT-5 or GPT-6. 00:14:24.740 |
And they really want to test out if it's easy to imitate 00:14:26.740 |
those large models on the cheap. 00:14:29.740 |
If it is, then why would Microsoft invest billions in a new giant model? 00:14:34.740 |
That's my own theory as to why Microsoft is working on this. 00:14:37.740 |
But let me know in the comments what your theory is. 00:14:41.740 |
"AUCA suggests that learning from step-by-step explanations could significantly improve the quality of models regardless of their size." 00:14:49.740 |
And that they hope these insights will inform the design of more robust evaluation methods. 00:14:55.740 |
For example, the development of the most advanced training techniques. 00:14:57.740 |
And the advancement of alignment and post-training techniques. 00:14:59.740 |
And the more effective use of powerful models like GPT-4 as teachers. 00:15:05.740 |
And maybe they should have said, and also with ChatGPT as an intermediate teacher. 00:15:09.740 |
I'm going to end with the thoughts of the leaders of OpenAI, Ilya Sutskever and Sam Altman, on open source models. 00:15:15.740 |
And I think there is a bit of a contrast between the two answers. 00:15:18.740 |
Ilya Sutskever thinks that the gap is growing ever wider. 00:15:24.740 |
The open source versus non-open source models question. 00:15:27.740 |
You don't want to think about it in binary black and white terms where, like, there is a secret sauce that will never be rediscovered. 00:15:37.740 |
What I will say, on whether GPT-4 will ever be reproduced by open source models, is that perhaps one day it will be. 00:15:45.740 |
But by the time it is, there will be a much more powerful model in the companies. 00:15:50.740 |
So there will always be a gap between the open source models and the private models. 00:15:56.740 |
And this gap may even be increasing with time. 00:16:01.740 |
The amount of effort and engineering and research that it takes to produce one such neural net keeps increasing. 00:16:09.740 |
And so even if there are open source models, they will be less and less produced by small groups of dedicated researchers and engineers. 00:16:20.740 |
And it will only be the province of a company. 00:16:24.740 |
While Sam Altman seems to say that even if open source models do catch up, OpenAI will always have a different kind of moat. 00:16:32.740 |
What are your thoughts about the "We have no moat" document that was released lately? 00:16:41.740 |
The thing that is special about OpenAI, and I think the thing that is so misunderstood by that document, aside from the fact that we have, like, a gigantic number of users, 00:16:51.740 |
and people that like have formed some sort of relationship with us and our products, 00:16:57.740 |
is what OpenAI is special about is figuring out what comes next. 00:17:03.740 |
It is the ability, it is easy to copy something once you know it can be done, and in that sense, sure. 00:17:09.740 |
It is very hard to go figure out what to do next. 00:17:12.740 |
And the ideas, the big ideas, the medium-sized ideas, the small ideas, and the careful execution on them that it takes to get from here to superintelligence, 00:17:22.740 |
Anyway, this video could have been at least three times longer. 00:17:25.740 |
There was so much I had to edit out for brevity. 00:17:28.740 |
If you're interested in me talking more about open source models, do let me know in the comments. 00:17:34.740 |
As always, thank you so much for watching to the end, and have a wonderful day.