
Orca: The Model Few Saw Coming



00:00:00.000 | Do you remember this paper, less than two weeks old?
00:00:03.720 | It made waves by concluding that open source models
00:00:06.940 | can mimic the style, but not the factuality of ChatGPT.
00:00:11.740 | Overall, we can conclude, they say,
00:00:14.120 | that model imitation is a false promise.
00:00:17.380 | Well, 48 hours ago, we have this,
00:00:20.440 | a 51-page report on Orca
00:00:23.600 | based on a small 13 billion parameter model.
00:00:27.060 | I don't often comment on open source models
00:00:29.440 | because they're simply not competitive
00:00:31.020 | with OpenAI's models.
00:00:32.620 | But Orca is not just competitive with GPT-3.5.
00:00:36.480 | It beats it in quite a few well-established benchmarks
00:00:39.960 | and even matches GPT-4 in a couple of tests of reasoning.
00:00:44.160 | As always, I've read both papers in full
00:00:46.320 | and can also bring in just-released comments
00:00:48.720 | from Sam Altman and Ilya Sutskever
00:00:51.100 | on competition from open source models.
00:00:53.700 | But let's start with Orca,
00:00:55.580 | named presumably because Orcas, or killer whales,
00:00:58.880 | are frequent visitors to South American coastlines.
00:01:01.820 | And South America is, of course,
00:01:03.300 | the land of llamas and vicunas.
00:01:05.560 | But all the research was done by Microsoft,
00:01:08.560 | which I find interesting,
00:01:10.060 | and I'll come back to that at the end.
00:01:11.920 | But why did they make Orca,
00:01:13.360 | and why does it perform better than models
00:01:15.360 | like Llama, Alpaca, and Vicuna?
00:01:17.340 | Well, they say here in the abstract
00:01:18.960 | that those other models lack rigorous evaluation,
00:01:22.100 | resulting in overestimating the small model's capability
00:01:25.760 | as they tend to learn to imitate the style,
00:01:27.880 | but not the reasoning of LFMs,
00:01:31.120 | large foundation models.
00:01:32.380 | To address these challenges,
00:01:33.740 | we develop Orca, a 13 billion parameter model,
00:01:36.640 | that learns to imitate the reasoning process
00:01:39.420 | of the larger models.
00:01:40.820 | Orca learns by looking at GPT-4's
00:01:43.240 | step-by-step thought processes,
00:01:44.880 | and is guided by teacher assistance from ChatGPT,
00:01:48.840 | which is GPT-3.5.
00:01:50.600 | And to give you a taste of what's to come,
00:01:52.460 | Orca surpasses conventional state-of-the-art models,
00:01:55.560 | such as Vicuna,
00:01:56.660 | by more than 100%
00:01:58.120 | in complex zero-shot reasoning benchmarks,
00:02:02.200 | like the Big Bench Hard,
00:02:04.000 | which I'll talk about,
00:02:04.780 | and by 42% on AGIEval.
00:02:08.300 | It goes on,
00:02:09.060 | Orca reaches parity with ChatGPT
00:02:12.300 | on the Big Bench Hard,
00:02:13.920 | and shows competitive performance
00:02:16.200 | in professional and academic examinations
00:02:18.120 | like the SAT, LSAT, GRE, and GMAT.
00:02:21.500 | And I know many of you will be interested in this footnote.
00:02:24.160 | We are working with our legal team
00:02:26.260 | to publicly release
00:02:28.100 | a diff of the model weights
00:02:29.640 | in accordance with Llama's release policy.
00:02:32.020 | So if this is anything like Llama,
00:02:33.940 | it's going to be leaked across the internet imminently.
00:02:36.240 | I'm going to show you so many tests
00:02:38.000 | and benchmarks in a moment,
00:02:39.660 | but just to give you a sample,
00:02:41.280 | here is Orca outperforming ChatGPT
00:02:44.380 | in the Vicuna evaluation set,
00:02:46.520 | and matching Text-DaVinci-003
00:02:48.340 | in the SAT, LSAT, GRE, and GMAT.
00:02:51.620 | And as I'll touch on later,
00:02:52.860 | this was zero-shot without chain of thought
00:02:55.760 | or any advanced methods.
00:02:57.280 | You can watch,
00:02:58.080 | pretty much any of my other videos
00:02:59.520 | to see how advanced prompt engineering
00:03:01.620 | would probably boost those results still further.
00:03:04.340 | For those who didn't know,
00:03:05.460 | 13 billion parameters is about 7% the size of GPT-3,
00:03:10.420 | which is 175 billion parameters,
00:03:12.660 | and possibly around 1% or 2% of GPT-4's size.
00:03:17.560 | That gives you an indication
00:03:18.860 | of the difference in size between Orca
00:03:21.360 | and these models that it's competing with.
00:03:23.320 | And if that doesn't make any sense,
00:03:24.800 | a smaller size means it can be run on much smaller,
00:03:28.060 | much less advanced devices,
00:03:29.940 | like a desktop or even possibly a laptop.
00:03:32.820 | The authors start off by giving a little slap
00:03:35.060 | to the other paper,
00:03:35.900 | you know that one that said,
00:03:36.820 | "Model imitation is a false promise."
00:03:38.940 | And they continue that,
00:03:39.860 | "Contrary to this assertion,
00:03:41.300 | it is possible to reduce the gap with proprietary LLMs
00:03:45.940 | on multiple zero-shot benchmarks
00:03:48.300 | that require sophisticated reasoning."
00:03:50.380 | As we'll see, models like Vicuna claim
00:03:52.300 | to have 90% of ChatGPT's quality,
00:03:55.180 | but when it came to reasoning tasks,
00:03:57.300 | or more technical tasks,
00:03:58.040 | it basically flopped.
00:03:59.040 | Here's a chart I'll come back to,
00:04:00.360 | outlining some of the more technical challenges
00:04:02.840 | you can give a language model.
00:04:04.480 | We should remember that Vicuna is a fine-tuned version
00:04:07.640 | of the Llama model,
00:04:09.080 | and it's competitive with or even better than PaLM 2.
00:04:12.600 | But give it some of the harder challenges
00:04:14.840 | for a language model,
00:04:16.000 | and it really struggles,
00:04:17.320 | as you can see in this column.
00:04:18.920 | Take logical deduction,
00:04:20.280 | where it only scored 1.2%.
00:04:22.440 | Well, this Orca model was 2,900% better than that,
00:04:26.480 | scoring 36%,
00:04:28.020 | and it was a lot more competitive with ChatGPT.
00:04:31.020 | I'm gonna come back to the Big Bench benchmark,
00:04:33.140 | but look for a second at Causal Judgment,
00:04:35.420 | where Orca, a 13 billion parameter model,
00:04:38.940 | matches GPT-4, which is about 100 times the size.
00:04:43.420 | But back to how they actually did it.
00:04:45.340 | Models like Alpaca and Vicuna
00:04:47.260 | were given lots of query and responses
00:04:49.900 | from ChatGPT or GPT-4.
00:04:52.060 | But what they did is they leveraged system instructions,
00:04:55.300 | asking models like GPT-4 and ChatGPT
00:04:59.000 | to think step by step.
00:04:59.000 | This gave Orca access to detailed responses
00:05:01.360 | from the model that explained the reasoning process
00:05:04.500 | of the teacher as it generates the response.
00:05:07.000 | It allowed these parent models of GPT-3.5 and GPT-4
00:05:10.840 | to be much better tutors for this young Orca.
00:05:14.340 | Also, they let the teachers of ChatGPT,
00:05:16.580 | which is 3.5, and GPT-4,
00:05:18.820 | give far more examples to their student,
00:05:21.060 | 5 million and 1 million examples, respectively.
00:05:24.200 | That compares to the other models you may have heard of,
00:05:26.320 | like Alpaca, Wizard, Vicuna, and so on.
00:05:27.980 | They had tens of thousands
00:05:30.600 | or the low hundreds of thousands of examples.
00:05:33.240 | But again, the key difference is the explanations,
00:05:36.220 | the step-by-step thinking
00:05:37.760 | that the smaller Orca could then imitate.
00:05:40.180 | They give a quick demo here
00:05:41.180 | of how the other open source models
00:05:42.980 | learn from their GPT parents
00:05:44.980 | with a simplistic question and answer format.
00:05:48.480 | In contrast, the authors leveraged system messages
00:05:51.480 | to get ChatGPT and GPT-4 to think step-by-step,
00:05:55.780 | leading to much richer explanations.
00:05:57.960 | As you can see in this diagram.
00:05:59.960 | It wasn't just, "Let's think step-by-step," by the way,
00:06:02.460 | also things like, "Explain like I'm five."
00:06:04.960 | They also wanted the task to be as complex
00:06:07.960 | and diverse as possible,
00:06:09.460 | so they used the Flan collection.
00:06:11.960 | This was released by Google in February
00:06:13.960 | and focused on balancing the kind of prompts and tasks
00:06:17.460 | that you fine-tune the language models on.
00:06:19.960 | You can see here the 16 system messages
00:06:22.460 | that they give to ChatGPT and GPT-4,
00:06:25.460 | and you can see here the kind of difference that that makes.
00:06:27.940 | Imagine a language model trying to learn from this human.
00:06:30.940 | The human is asked, "Pick which sentence is not logical."
00:06:33.940 | Sentence A, "People in the desert often look forward to flood,"
00:06:37.940 | or sentence B, "People in the desert often look forward to rain."
00:06:40.940 | The human responds, "There is no reason to look forward to a flood
00:06:43.940 | because floods cause damage."
00:06:45.940 | The answer is sentence A.
00:06:47.940 | Now, yes, a language model can learn from that,
00:06:49.940 | but by leveraging those system assistant messages,
00:06:52.940 | look at the kind of response that GPT-4 gives.
00:06:55.940 | Now, Orca can learn a lot more
00:06:57.920 | from that explanation,
00:06:58.920 | and that's one of the main reasons
00:07:00.920 | it's better than all the other open-source models.
00:07:03.920 | Because remember, Vicuna is the best of the open-source models.
00:07:07.920 | In this leaderboard, it has an ELO of 1054,
00:07:10.920 | better even than PaLM 2 Bison.
00:07:12.920 | All the models higher than it are proprietary.
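
To make that data-collection idea concrete, here is a minimal Python sketch of how one might query a teacher model with a system message and store the explanation alongside the question. The system prompt, default model name, and record format are my own illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch: collecting "explanation tuning" data from a teacher model.
# The system prompt, model name, and record format are assumptions for
# illustration; the Orca paper's real pipeline is not public.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_MESSAGE = (
    "You are a helpful assistant. Think step-by-step and justify your answer."
)

def collect_explanation(question: str, model: str = "gpt-4") -> dict:
    """Ask the teacher model for a reasoned answer and keep the full record."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": question},
        ],
    )
    return {
        "system": SYSTEM_MESSAGE,
        "question": question,
        "explanation": response.choices[0].message.content,
    }

if __name__ == "__main__":
    record = collect_explanation(
        "Which sentence is not logical? A: People in the desert often look "
        "forward to flood. B: People in the desert often look forward to rain."
    )
    print(json.dumps(record, indent=2))
```
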
00:07:15.920 | But there is another reason why Orca performs so much better.
00:07:18.920 | You might have wondered,
00:07:19.920 | why didn't they just use only GPT-4?
00:07:21.920 | Well, yes, there were cost and time considerations,
00:07:24.920 | but there was another factor that they found.
00:07:26.920 | They were able to use ChatGPT, or GPT-3.5,
00:07:30.900 | as an intermediate teacher.
00:07:32.900 | That teacher, ChatGPT,
00:07:33.900 | was able to reduce the gap in capabilities.
00:07:36.900 | So Orca got smarter and better able to learn.
00:07:39.900 | A bit like progressive learning,
00:07:40.900 | where you first learn from easier examples,
00:07:43.900 | then followed by harder ones.
00:07:44.900 | After that, they gave it outputs from GPT-4.
00:07:47.900 | Notice, by the way,
00:07:48.900 | what happens if you skip the ChatGPT teaching assistant
00:07:52.900 | and only train on those 1 million examples from GPT-4.
00:07:56.900 | What happens is
00:07:57.880 | it's a bit like a student struggling in a class
00:07:59.880 | that's too advanced for them.
00:08:01.880 | Orca actually performs worse in those circumstances,
00:08:04.880 | averaging 37%.
00:08:05.880 | But with that intermediate teacher beforehand,
00:08:08.880 | it gets 41.7%.
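
As a rough sketch of that progressive-learning idea, the snippet below fine-tunes first on the larger ChatGPT-generated set and only then on the smaller GPT-4 set. The function names, the model interface, and the epoch counts are hypothetical placeholders, not the authors' training code.

```python
# Hedged sketch of progressive learning: train on the easier (ChatGPT)
# explanations first, then continue on the harder (GPT-4) explanations.
# `model.train_step`, the datasets, and the epoch counts are all assumed.

def finetune(model, dataset, epochs):
    """Placeholder for one fine-tuning pass over `dataset`."""
    for _ in range(epochs):
        for example in dataset:
            model.train_step(example)  # assumed training API
    return model

def progressive_training(model, chatgpt_examples, gpt4_examples):
    # Stage 1: the ~5 million ChatGPT explanations act as the intermediate teacher.
    model = finetune(model, chatgpt_examples, epochs=1)
    # Stage 2: the ~1 million GPT-4 explanations refine the student further.
    model = finetune(model, gpt4_examples, epochs=1)
    return model
```
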
00:08:10.880 | Speaking of time,
00:08:11.880 | it only took about 200 hours to train Orca
00:08:14.880 | on 20 A100 GPUs.
00:08:17.880 | They did take a few weeks to collect the data
00:08:19.880 | from ChatGPT and GPT-4.
00:08:21.880 | But presumably,
00:08:22.880 | if they're planning to open source this,
00:08:24.880 | which they say they are,
00:08:25.880 | then that data collection step could be skipped, saving a few weeks.
00:08:30.860 | So, let's look at some more of the results.
00:08:31.860 | First, for open-ended generation,
00:08:32.860 | not multiple choice.
00:08:33.860 | Orca is 95% of ChatGPT quality
00:08:37.860 | and 85% of GPT-4's quality
00:08:40.860 | as assessed by GPT-4.
00:08:42.860 | But they wanted to quickly move on
00:08:44.860 | to some more definitive tasks.
00:08:45.860 | There is a problem of using GPT-4 as an assessor.
00:08:49.860 | For example, they observed that there is a positive bias
00:08:52.860 | in GPT-4 evaluation toward the response
00:08:55.860 | of the first model in the comparison set.
00:08:59.840 | This reminded me of the unfaithful reasoning paper
00:09:01.840 | that I talked about in one of my recent videos.
00:09:04.840 | You can't always trust GPT-4 to give its true reasoning.
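
One standard way to soften that first-position bias, sketched below, is to have the judge score each pair in both orders and average the results. The `judge` function here is a hypothetical stand-in for a GPT-4-based comparison call; this is my illustration, not the procedure the authors describe.

```python
# Hedged sketch: reduce an LLM judge's "first answer wins" bias by scoring
# both orderings and averaging. `judge` is a hypothetical function returning
# a dict of scores like {"first": 7.0, "second": 5.0} for the two answers.

def debiased_score(judge, question, answer_a, answer_b):
    scores_a_first = judge(question, first=answer_a, second=answer_b)
    scores_a_second = judge(question, first=answer_b, second=answer_a)
    # Average each answer's score across both positions.
    score_a = (scores_a_first["first"] + scores_a_second["second"]) / 2
    score_b = (scores_a_first["second"] + scores_a_second["first"]) / 2
    return score_a, score_b
```
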
00:09:07.840 | But here it is in more objective multiple choice questions.
00:09:10.840 | And notice how much harder many of these tests are
00:09:13.840 | for even these advanced language models.
00:09:15.840 | I am fortunate and proud to have attained a perfect score
00:09:18.840 | in some of the tests in this chart,
00:09:20.840 | like the GRE and GMAT.
00:09:21.840 | They were part of the AQUA-RAT test
00:09:23.840 | that they gave the models.
00:09:25.840 | So, I can say that they really are quite challenging.
00:09:27.820 | Hence why GPT-4 only gets a 40%.
00:09:30.820 | But you can see that throughout,
00:09:31.820 | Orca outperforms Vicuna by quite a margin,
00:09:34.820 | and is very competitive with Text-DaVinci-003.
00:09:37.820 | Of course, overall, it does lag behind GPT-4.
00:09:40.820 | But this is all zero-shot.
00:09:42.820 | A bit later on, I'll come back to the range of methods
00:09:45.820 | that we could use to further improve on Orca.
00:09:48.820 | The percentages, by the way, are the improvements on Vicuna.
00:09:51.820 | Again, the second best open source model.
00:09:54.820 | So far, we've looked at human-centric benchmarks.
00:09:57.800 | Like the GMAT and GRE.
00:09:59.800 | These are grouped with the lovely name AGIEval.
00:10:02.800 | And as we've seen, even the top models
00:10:04.800 | lag behind the top human performers.
00:10:06.800 | But what about a benchmark specifically for language models?
00:10:10.800 | It's called BigBench Hard.
00:10:12.800 | The original BigBench had 207 tasks.
00:10:15.800 | But language models got so good,
00:10:17.800 | that they had to narrow down the benchmark
00:10:19.800 | to just the 23 challenging tasks
00:10:21.800 | where human raters still did better than language models.
00:10:24.800 | Now, it turns out when you add chain of thought prompting
00:10:27.780 | to the models, they do even better.
00:10:29.780 | And there are even fewer tasks that humans are better at.
00:10:31.780 | But anyway, all you have to remember is that these are 23
00:10:34.780 | of the hardest tasks for language models.
00:10:37.780 | And I'll just let you compare the results for yourself.
00:10:40.780 | But the trend is really quite clear.
00:10:42.780 | Orca massively outperforming the previous best open source model, Vicuna.
00:10:47.780 | Beating even ChatGPT on average.
00:10:50.780 | But still, of course, lagging behind GPT-4.
00:10:53.780 | Except for a few tasks.
00:10:55.780 | Look at Web of Lies,
00:10:57.760 | where Orca actually does better than GPT-4.
00:10:59.760 | That would be a question like this:
00:11:01.760 | Alexis says Shonda tells the truth.
00:11:03.760 | Jim lies.
00:11:04.760 | Antoine says Jim tells the truth.
00:11:06.760 | Shonda says Antoine lies.
00:11:08.760 | Does Alexis tell the truth?
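
That question is really just propagating truth values along a chain of statements. The small sketch below works through this exact example mechanically; it is only an illustration of the reasoning, not anything from the paper, and it concludes that Alexis does tell the truth.

```python
# Hedged sketch: resolve the Web of Lies example by propagating truth values.

truthful = {"Jim": False}  # given: Jim lies

# Antoine says Jim tells the truth; Antoine is truthful only if that claim is correct.
truthful["Antoine"] = truthful["Jim"]

# Shonda says Antoine lies; Shonda is truthful only if Antoine really does lie.
truthful["Shonda"] = not truthful["Antoine"]

# Alexis says Shonda tells the truth.
truthful["Alexis"] = truthful["Shonda"]

print("Does Alexis tell the truth?", "Yes" if truthful["Alexis"] else "No")  # Yes
```
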
00:11:10.760 | Or what about temporal sequences?
00:11:12.760 | Where Orca absolutely crushes Vicuna
00:11:15.760 | and doubles ChatGPT's performance.
00:11:18.760 | That would be a situation like this.
00:11:20.760 | Now, I'm not going to read it all out.
00:11:22.760 | But essentially, you have to figure out when the timings match up.
00:11:24.760 | Basically, keeping track of time.
00:13:27.740 | This does seem a little bit naive to me.
00:13:29.740 | I mean, that's what Meta said when they released Llama.
00:13:31.740 | But then, everyone and their grandma just used the language model for whatever.
00:13:35.740 | I do wonder what it means when they say, "We are working with our legal team."
00:13:39.740 | And it is particularly interesting to me that this was all done by Microsoft.
00:13:43.740 | I'm going to go into a little bit of speculation here about why I think they conducted this research.
00:13:48.740 | You might remember that leaked memo from Google, "We have no moat."
00:13:52.740 | And they even mentioned Vicuna.
00:13:54.740 | And talked about how it circumvented restrictions on the online market.
00:14:01.740 | And my theory is that the Microsoft researchers were testing this point from the memo.
00:14:05.740 | The point was that training giant models from scratch not only throws away the pre-training.
00:14:09.740 | But also any iterative, open source improvements that have been made on top.
00:14:13.740 | It doesn't take long for those improvements to dominate, making the full retrain extremely costly.
00:14:18.740 | Maybe Microsoft is hesitating about future investments in GPT-5 or GPT-6.
00:14:24.740 | And they really want to test out if it's easy to imitate those large models on the cheap.
00:14:29.740 | If it is, then why would Microsoft invest billions in a new giant model?
00:14:34.740 | That's my own theory as to why Microsoft is working on this.
00:14:37.740 | But let me know in the comments what your theory is.
00:14:39.740 | In the conclusion, the authors state that
00:14:41.740 | "AUCA suggests that learning from step-by-step explanations could significantly improve the quality of models regardless of their size."
00:14:49.740 | And that they hope these insights will inform the design of more robust evaluation methods,
00:14:57.740 | the advancement of alignment and post-training techniques,
00:14:59.740 | and the more effective use of powerful models like GPT-4 as teachers.
00:15:05.740 | And maybe they should have said, and also with ChatGPT as an intermediate teacher.
00:15:09.740 | I'm going to end with the thoughts of the leaders of OpenAI, Ilya Sutskever and Sam Altman, on open source models.
00:15:15.740 | And I think there is a bit of a contrast between the two answers.
00:15:18.740 | Ilya Sutskever thinks that the gap is growing ever wider.
00:15:22.740 | To the open source versus non-open source models question:
00:15:27.740 | You don't want to think about it in binary black and white terms where, like, there is a secret sauce that will never be rediscovered.
00:15:37.740 | What I will say about whether GPT-4 will ever be reproduced by open source models: perhaps one day it will be.
00:15:45.740 | But when it will be, it will be a much more powerful model in the companies.
00:15:50.740 | So there will always be a gap between the open source models
00:15:53.740 | and the private models.
00:15:56.740 | And this gap may even be increasing with time.
00:16:01.740 | The amount of effort and engineering and research that it takes to produce one such neural net keeps increasing.
00:16:09.740 | And so even if there are open source models, they will be less and less produced by small groups of dedicated researchers and engineers.
00:16:20.740 | And it will only be the province of a company.
00:16:22.740 | A big company.
00:16:24.740 | While Sam Altman seems to say that even if open source models do catch up, OpenAI will always have a different kind of moat.
00:16:32.740 | What are your thoughts about the "We have no moat" document that was released lately?
00:16:38.740 | The leaked document.
00:16:41.740 | The thing that is special about OpenAI, and I think the thing that is so misunderstood by that document, aside from the fact that we have, like, a gigantic number of users,
00:16:51.740 | and people that like have formed some sort of relationship with us and our products,
00:16:57.740 | is what OpenAI is special about is figuring out what comes next.
00:17:03.740 | It is the ability, it is easy to copy something once you know it can be done, and in that sense, sure.
00:17:09.740 | It is very hard to go figure out what to do next.
00:17:12.740 | And the ideas, the big ideas, the medium-sized ideas, the small ideas, and the careful execution on them that it takes to get from here to superintelligence,
00:17:20.740 | that's what our moat is.
00:17:22.740 | Anyway, this video could have been at least three times longer.
00:17:25.740 | There was so much I had to edit out for brevity.
00:17:28.740 | If you're interested in me talking more about open source models, do let me know in the comments.
00:17:32.740 | I've got much more to say.
00:17:34.740 | As always, thank you so much for watching to the end, and have a wonderful day.