
Phi-2, Imagen-2, Optimus-Gen-2: Small New Models to Change the World?



00:00:00.000 | You might have thought that 2023 was winding down for generative AI with more focus on
00:00:06.480 | mistletoe and merriment than models and the MMLU.
00:00:10.920 | But no, this was the week of powerful new small models that could change the landscape
00:00:16.720 | of 2024 AI.
00:00:18.660 | This video is about Phi 2 and what it means, as well as the madness of MMLU brinkmanship
00:00:24.000 | with a reminder of all of that exam's mistakes.
00:00:27.120 | Plus Optimus Gen 2, Imagen 2 announced just today, and much more.
00:00:32.160 | Phi 2 was announced by Satya Nadella, the CEO of Microsoft, last month, but what is it
00:00:37.520 | and what does it mean?
00:00:38.520 | In a nutshell, Phi 2 is a 2.7 billion parameter model, a small model by today's standards.
00:00:44.720 | So small in fact that it could fit locally on a smartphone.
00:00:47.560 | According to the benchmarks, and I will get into those, it outperforms models of comparable
00:00:51.960 | size like ones trained using Mamba, as well as Google's new Gemini Nano.
00:00:56.960 | If that wasn't enough, it also outperforms models 20 to 25 times its size.
00:01:01.560 | I've been following the Phi series of models since late June when I interviewed Ronen Eldan,
00:01:06.740 | one of the key authors of the original Phi 1 paper.
00:01:09.900 | There were strong hints of Phi 1's significance in my original July 2nd video on Textbooks
00:01:15.400 | Are All You Need.
00:01:16.400 | I'm going to try to give you all the necessary background on Phi 1, Phi 1.5, and Phi 2 in
00:01:21.580 | less than three minutes.
00:01:22.580 | That's borderline impossible, but I'm going to try.
00:01:24.920 | Essentially, they retrieved a pile of permissively licensed open code from what's appropriately
00:01:29.480 | called The Stack.
00:01:30.760 | There's 10 times more code in there than they used.
00:01:32.760 | They just can't use it for legal reasons.
00:01:34.920 | They extracted just the Python code from the many programming languages in that data set
00:01:39.600 | and also filtered out duplications.
00:01:41.800 | For Phi 1, they then gave GPT-4 a task: filter for textbook-quality code.
00:01:46.960 | That's code with appropriate comments and good syntax.
00:01:49.360 | To cut costs, they swapped in a tiny classifier to finish the job that GPT-4 started,
00:01:53.840 | with that classifier essentially imitating the labeling that GPT-4 had kicked off.
00:01:57.640 | They then got GPT-3.5 to generate its own diverse textbook quality data and synthetic
00:02:03.360 | Q&A exercises.
00:02:05.080 | For more detail on how they did that, check out the original video in July, which featured
00:02:09.000 | an interview with one of the key contributors.
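
Here is a minimal sketch of that classifier-distillation idea, with toy data and a trivial heuristic standing in for the GPT-4 labelling step. It illustrates the approach described above, not Microsoft's actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def label_with_strong_model(snippet: str) -> int:
    """Stand-in for the expensive GPT-4 judgement 'is this textbook-quality code?'.
    A trivial heuristic here, purely so the sketch runs end to end."""
    return int("#" in snippet and "def " in snippet)

# Toy corpus standing in for the deduplicated Python subset of The Stack.
corpus = [
    "def add(a, b):\n    # Add two numbers.\n    return a + b",
    "x=1;y=2;print(x+y)",
    "def area(r):\n    # Area of a circle of radius r.\n    return 3.14159 * r * r",
    "import os\nprint(os.getcwd())",
]

# 1. The strong model labels a small (expensive) sample.
labels = [label_with_strong_model(s) for s in corpus]

# 2. Those labels are distilled into a tiny, cheap classifier...
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
features = vectorizer.fit_transform(corpus)
classifier = LogisticRegression(max_iter=1000).fit(features, labels)

# 3. ...which then filters the rest of the corpus at negligible cost.
kept = [s for s in corpus if classifier.predict(vectorizer.transform([s]))[0] == 1]
print(f"kept {len(kept)} of {len(corpus)} snippets")
```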
00:02:11.360 | Phi 1 got 50% on HumanEval, a coding benchmark.
00:02:14.760 | And all of this worked so well that for Phi 1.5, announced in September, they added synthetic
00:02:20.000 | exercises in common sense reasoning, logic, science, and theory of mind.
00:02:25.000 | Furthermore, they trained only on synthetic data, leaving that additional filtered Stack
00:02:29.600 | data for a separate model called Phi 1.5 Web.
00:02:32.880 | What they realized was that this better quality data allowed for more passes over the data,
00:02:37.320 | called epochs; think of it as a student rereading those textbooks.
00:02:41.080 | In a September exclusive, The Information reported that Microsoft was now heavily backing
00:02:45.480 | those researchers behind Phi.
00:02:47.320 | As I mentioned in a video at the time, that gave them the money and compute needed to
00:02:51.380 | scale up Phi.
00:02:52.560 | That brings us to the release in the last 24 hours of Phi 2, 2.7 billion parameters,
00:02:58.440 | trained in just 14 days on less than 100 A100 GPUs.
00:03:03.200 | For reference, it's reported that Microsoft has as many as 150,000 H100s, the successor
00:03:09.720 | of the A100s.
00:03:11.160 | But what does it mean that the model is now bigger, 2.7 billion parameters?
00:03:14.800 | Well, more parameters mean more connections can be made, more dots joined, so to speak.
00:03:19.240 | And with more compute, of course, that means that they can feed in more data and go over
00:03:23.040 | that data more times.
00:03:24.480 | The amount of data, including rereads or epochs, was 1.4 trillion tokens, which we know is
00:03:29.560 | five times more than Phi 1.5 Web.
00:03:32.600 | What we don't yet have is that full data set, though the model is open sourced.
00:03:36.480 | There's a link in the description describing how to download Phi 2.
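
If you want to run it yourself, a minimal sketch with Hugging Face Transformers might look like the following. It assumes the weights are mirrored under a microsoft/phi-2 repo and that an "Instruct:/Output:" prompt format applies; check the official link in the description for the canonical instructions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed Hugging Face repo id; see the official link for the source of truth

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # at 2.7B parameters, fp16 fits on a single consumer GPU
    device_map="auto",
    trust_remote_code=True,     # the Phi models shipped custom modelling code at release
)

prompt = "Instruct: Explain in one sentence what an epoch is in model training.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```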
00:03:40.000 | A side benefit, of course, of training on synthetic data is that it tends to be less
00:03:44.840 | toxic data.
00:03:45.840 | The toxicity scores went down across the board for Phi 2, and this is before any reinforcement
00:03:50.760 | learning from human feedback.
00:03:52.480 | If you're familiar with the meme, this could mean bye bye Shoggoth thanks to synthetic
00:03:56.880 | data.
00:03:57.880 | And there are so many potential lessons from Phi 2.
00:04:00.360 | As one of the researchers on the project said, our Phi 2 efforts prove that we have been
00:04:04.720 | wasting enormous amounts of compute on rather ineffective training data.
00:04:09.260 | As he put it, throwing a kitchen sink at a model has a big price tag and leads to lower
00:04:14.440 | quality.
00:04:15.440 | On screen are the comparisons to Gemini Nano 2 and Mistral 7 billion parameters.
00:04:20.600 | And notice they've also included a comparison to Llama 2 70 billion parameters.
00:04:25.520 | And yes, given that one of the other topics of this video is going to be about the flaws
00:04:29.480 | of benchmarks, I know what many of you are thinking.
00:04:31.880 | Can we fully trust these benchmark figures?
00:04:34.080 | Of course, I researched as best I could about Phi 2 contamination, but I was fairly convinced
00:04:39.240 | by the multiple lines of evidence provided in the video linked from Sébastien Bubeck,
00:04:44.400 | the lead author.
00:04:45.400 | The relevant clip is from minute 19 onwards.
00:04:48.040 | And the clues that something like this was coming were around if you knew where to look.
00:04:52.440 | When reading the original Phi 1.5 paper, I noted this in the discussion section.
00:04:57.360 | They said, perhaps achieving ChatGPT's level of capability, and remember that model
00:05:01.360 | is over 175 billion parameters, at the 1 billion parameter scale is actually achievable.
00:05:08.160 | Now you let me know what you think of Phi 2 in your testing.
00:05:12.080 | And the only cautionary note I would give is this.
00:05:14.520 | I remember from the original Phi 1 paper that they mentioned that their models have a certain
00:05:19.200 | sensitivity to prompt variations.
00:05:21.820 | In other words, how you phrase the prompt has significant effects on its performance.
00:05:27.160 | And as the length of the prompt increases, the models tend to forget, ignore or misinterpret
00:05:32.240 | parts of the prompt.
00:05:33.280 | But if these results bear out, this is the takeaway.
00:05:35.920 | It looks like you can mimic the performance on certain tasks of models trained with a
00:05:40.880 | thousand X more compute.
00:05:43.000 | So now let's imagine a 1.5 trillion parameter model trained this way.
00:05:48.160 | Essentially imitating what a 1.5 quadrillion parameter large language model would be like.
00:05:53.960 | That's more parameters than we have synapses in the human brain.
00:05:57.520 | And of course, that's not even factoring in allowing the model to think for longer with
00:06:01.340 | extra test-time compute and system verification, a la AlphaCode 2.
00:06:06.220 | Things could certainly get interesting in 2024.
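
For a rough yardstick of what "compute" means in that comparison, a common back-of-the-envelope estimate (an approximation added here for context, not a figure from Microsoft) is that training compute scales with parameter count times tokens seen:

```latex
C \approx 6ND
\qquad\Rightarrow\qquad
C_{\text{Phi-2}} \approx 6 \times (2.7\times10^{9}) \times (1.4\times10^{12})
\approx 2.3\times10^{22}\ \text{FLOPs}
```

On that scale, a thousand times more compute would put the comparison models somewhere in the region of 10^25 FLOPs.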
00:06:09.320 | Before we move on completely from benchmarks and Microsoft though, there is one more chart
00:06:13.720 | that Microsoft put out that I want to highlight.
00:06:15.920 | Here they show that with their prompting system, you can get 90.1% on the MMLU.
00:06:22.320 | Now regular watchers of this channel probably know what's coming next.
00:06:25.840 | For hopefully the last time that I'm mentioning it on this channel, that benchmark is flawed
00:06:30.500 | in many respects.
00:06:31.500 | In fact, I'm going to end this video with a clip from my original Smart GPT video running
00:06:36.000 | through just some of the mistakes on that benchmark.
00:06:39.040 | I'm frankly shocked that at the end of 2023, it's still being used to compare models to
00:06:43.460 | two decimal places.
00:06:45.000 | Of course, there Microsoft was comparing its prompting techniques to Google's Gemini
00:06:49.000 | Ultra and many pointed out the somewhat disingenuous way that Google presented its Gemini model,
00:06:54.720 | particularly when it comes to video analysis.
00:06:57.000 | Yes, that was a poor show from Google, but I made the point in the comments at the time
00:07:01.580 | that such a demo is possible in the near term.
00:07:05.540 | I drew this.
00:07:07.360 | Can you tell me what I, what is that?
00:07:11.940 | The drawing appears to be of a duck or a similar bird on water.
00:07:15.080 | Before we move on from Google though, they today, just in the last hour, released Imagen 2.
00:07:21.680 | They announced it back in May, but it's now available via API.
00:07:24.980 | And be honest, would you have believed that this image was generated by a text-to-image
00:07:30.280 | model?
00:07:31.280 | Almost for the first time, I would say that no, I can't even tell that that's made by AI.
00:07:37.000 | Imagen 2, by the way, is a diffusion model, and using it, you are indemnified by Google
00:07:41.680 | against copyright claims.
00:07:42.680 | Furthermore, all generations are watermarked.
00:07:45.040 | The quality looks stunning.
00:07:46.640 | I mean, particularly if you look at the top right and bottom middle images, you've got
00:07:50.580 | to admit that that is a frankly shocking level of photorealism.
00:07:54.700 | If by this point next year it gets to this level for text-to-video, we'll be living in
00:07:59.560 | a different world.
00:08:00.560 | But I did say that the theme of today's video is small models.
00:08:04.360 | And here is the 10-kilogram-lighter Generation 2 humanoid robot from Tesla.
00:08:10.000 | This is Optimus Gen 2.
00:08:28.080 | Watching this video makes me think of touch, temperature and pressure sensitivity as whole
00:08:47.360 | new modalities yet to be fully explored.
00:08:50.240 | I spoke to one of the leads for robotics at Google DeepMind for AI Insiders, but I also
00:08:54.760 | now want to reach out to Tesla because Optimus is getting close.
00:08:58.520 | But yes, speaking of AI Insiders, today is the full launch of that Patreon tier.
00:09:03.880 | First and foremost, you're supporting the channel and I've written a personal message
00:09:06.920 | to everyone who's joined.
00:09:08.360 | And honestly, I didn't expect this, but the Discord for AI Insiders has proven to be great
00:09:13.000 | for networking, whether you're an AI engineer, in the C-suite, or just interested in AI.
00:09:17.600 | But what is the content that's actually on AI Insiders?
00:09:20.240 | Well, let me walk you through the four categories.
00:09:23.120 | First we have what I would call classic AI explained videos.
00:09:26.160 | Bonus content, just like you'd see on the main channel.
00:09:28.560 | In this collection, I released a video two or three days ago on AGI timelines, featuring
00:09:33.040 | extracts from six expert interviews I conducted over the last month.
00:09:36.760 | Plus, of course, my own timelines.
00:09:38.560 | Next we have the AI Insiders podcast.
00:09:40.640 | I call it my "let's think sip by sip" podcast; it's not my normal kind of video, it's more a stream
00:09:45.400 | of consciousness, audio-only reflection.
00:09:47.780 | With access, it's available on Spotify or wherever else you get your podcasts.
00:09:51.820 | Then we have the tutorials I've been working for months on.
00:09:55.120 | These are more for those who are LLM curious and like everything else, feature expert extracts
00:10:00.280 | and will be continually updated in the weeks and months to come.
00:10:03.840 | Finally, we have the insiders arena and let me click on this one.
00:10:07.280 | This is where you and any other member with a passion for AI can submit explainers and
00:10:12.640 | I'll pick the best of the bunch, edit them and film an intro.
00:10:16.280 | The debut video in this series is from none other than the legendary swyx on the Rise
00:10:21.540 | of the AI Engineer.
00:10:23.120 | He makes a joke about my thumbnails, but I'm fine with it.
00:10:25.880 | Anyway, the next video could be you and the best of the bunch will feature a cameo on
00:10:30.560 | AI explained.
00:10:31.960 | Many people have already told me that they expense AI Insiders for work, and I do want
00:10:36.400 | to reiterate that just by watching to the end of my videos, you are supporting me more
00:10:41.560 | than I can possibly expect, and I'm going to do something unusual at this point.
00:10:45.080 | I'm going to wish you a wonderful day before the end of the video.
00:10:49.480 | Because I'm going to end with a few minutes on the mistakes of the MMLU, which hopefully
00:10:54.160 | I'm having to point out for the very last time.
00:10:56.960 | As always, though, have a wonderful day.
00:10:59.640 | Back to breaking the benchmark.
00:11:01.480 | Here is the question that started it all off.
00:11:03.680 | As you can see, the question makes no sense.
00:11:06.040 | The text says demand reduction, and the answer options are: 1, 3 and 4; 2, 3 and 4; 1, 2 and 3; or 1, 2 and 4.
00:11:13.600 | What on earth is that about?
00:11:15.040 | Now remember, it was only human grading that enabled us to spot this.
00:11:19.440 | I reckon most companies like OpenAI rely on auto-grading by exact match, which would immediately
00:11:25.360 | toss out any answer like these as a null because an answer of A, B, C or D hasn't been given.
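
To make that concrete, here is a minimal sketch of what an exact-match grading harness typically looks like. It is an illustration added here, not OpenAI's actual grading code.

```python
import re

def grade_exact_match(model_output: str, gold_letter: str):
    """Toy exact-match grader: pull out a single standalone letter A-D and compare it to the key.

    Returns None (a 'null') when no clean A/B/C/D answer can be extracted, which is exactly
    what happens when the question itself is malformed and the model rambles instead."""
    match = re.search(r"\b([ABCD])\b", model_output.upper())
    if match is None:
        return None  # a malformed question quietly becomes an unscored null, never flagged for review
    return match.group(1) == gold_letter.upper()

print(grade_exact_match("These options make no sense, but perhaps 1, 3 and 4?", "B"))  # None
print(grade_exact_match("D", "B"))                                                     # False
```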
00:11:32.120 | Now I should say it was human grading, together with GPT-4 itself, that caught this.
00:11:36.880 | Here is poor GPT 3.5 bravely giving an answer to a question that doesn't make any sense
00:11:43.560 | at all.
00:11:44.560 | I know a couple of times it changed its mind and was like, no, no, no, D, not B.
00:11:48.640 | What then followed was weeks and weeks of me following up every quote unquote mistake
00:11:53.640 | with the official source the question came from.
00:11:56.280 | When I found the original source, I realised what the problem was.
00:12:00.880 | Sometimes they just hadn't pasted all of these statements.
00:12:03.700 | When you can see all four of these statements, the answer options make a lot more sense.
00:12:08.520 | Now I know what some of you may be thinking.
00:12:10.340 | Maybe it's just business ethics; that's just one subject and, what, a dozen
00:12:13.880 | questions?
00:12:14.880 | What's the big deal?
00:12:15.880 | Well, first of all, business ethics only has a hundred questions.
00:12:18.440 | So 13 of them missing vital context completely undermines that entire subject.
00:12:24.360 | And second of all, it wasn't just business ethics and it wasn't just this same problem.
00:12:28.680 | It wouldn't always be about missing statements.
00:12:30.840 | Check out these examples from high school chemistry and there's high school psychology,
00:12:35.340 | professional psychology, microeconomics, professional law, professional accounting.
00:12:39.160 | And trust me, it didn't stop there.
00:12:40.640 | I was genuinely shocked.
00:12:42.540 | There were innumerable factual errors and I would try to trace down the origin of each
00:12:48.760 | and see what the source said.
00:12:50.400 | By the way, the problem wasn't just with one source.
00:12:53.320 | It was with quite a few of these sources.
00:12:56.120 | Let's at random take one of these questions.
00:12:57.960 | How many human polyoma viruses are known at present?
00:13:01.560 | A hundred, one, 10 or unknown.
00:13:03.560 | This question comes from Oxford University Press, chapter 21, question two.
00:13:07.800 | I researched the question myself and also checked what this multiple choice quiz said
00:13:12.820 | the answer was.
00:13:13.820 | Let's tick 10, which is answer C and then submit my answers to the quiz.
00:13:18.740 | And let's see.
00:13:19.740 | Yes, it's correct.
00:13:20.740 | It's 10.
00:13:21.740 | By the way, the actual answer seems to be 14 as of 2023, but that's fairly close.
00:13:26.540 | What does the MMLU say?
00:13:28.300 | It says the answer is A, 100.
00:13:30.740 | And it goes on and on like this.
00:13:33.460 | Some of the worst offenders are the virology and college chemistry sections.
00:13:38.240 | Just wrong answer after wrong answer after wrong answer.
00:13:41.240 | Here's another example.
00:13:42.600 | This is what the MMLU says is the answer to this question, B.
00:13:46.360 | I tracked down the question to a fall 2011 final exam in which the answer was B.
00:13:52.080 | The MMLU had mixed up the order of the options and therefore picked B when that option was
00:13:58.240 | "drug users" instead of "men".
00:13:59.680 | But there's one more slight problem.
00:14:01.880 | Research suggests that both of those answers are inaccurate.
00:14:04.940 | And that happened multiple times where even the source was somewhat dodgy with its answers.
00:14:09.980 | Here is another page of mistakes of virology and another page for college chemistry.
00:14:15.780 | And one example that will particularly shock AI researchers.
00:14:19.500 | Here we have a question for which the MMLU says the correct answer is A.
00:14:23.540 | The original source says that the answer is 8, which isn't even an option.
00:14:27.860 | But this question was in the dev set.
00:14:30.380 | And if you remember from earlier in the video, that is the set of 5 questions they use to
00:14:34.700 | teach the model what kind of answer to give when they're benchmarking in the MMLU.
00:14:39.320 | In other words, all 100 results in college chemistry for every model benched on the MMLU
00:14:44.980 | are compromised.
00:14:45.980 | For example, a model that is particularly good at imitating reasoning will now be imitating
00:14:51.780 | an incorrect answer.
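
For context on why one dev-set error poisons a whole subject: MMLU scores are usually reported few-shot, with the five dev questions and their labelled answers prepended to every test question as exemplars. Here is a minimal sketch of that prompt construction, with illustrative field names rather than the official evaluation harness.

```python
def build_five_shot_prompt(dev_examples, test_question):
    """MMLU-style few-shot prompt: the five dev exemplars first, the test question last."""
    blocks = []
    for ex in dev_examples:  # the subject's five dev-set questions
        choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", ex["choices"]))
        # If ex["answer"] is wrong, every single test prompt for this subject inherits the error.
        blocks.append(f"{ex['question']}\n{choices}\nAnswer: {ex['answer']}\n")
    choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", test_question["choices"]))
    blocks.append(f"{test_question['question']}\n{choices}\nAnswer:")
    return "\n".join(blocks)
```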
00:14:52.780 | Now you might be thinking, surely now Philip and GPT-4 are done with finding errors in
00:14:57.800 | the MMLU.
00:14:58.800 | Unfortunately not.
00:14:59.900 | We carry on into new categories.
00:15:02.540 | Here was a question from econometrics, where again the source was incorrect.
00:15:06.740 | But we also have misspellings, grammatical ambiguity, and formatting ambiguity throughout
00:15:12.300 | the test.
00:15:13.300 | I'm not going to go through all of these, but any one of them could potentially confuse
00:15:16.940 | a model.
00:15:17.940 | We already know that models are very sensitive to the inputs you give.
00:15:21.500 | Are there any more categories?
00:15:22.980 | Yes, there are.
00:15:24.220 | There are loads of juicy examples here, but I can't get to them all.
00:15:27.640 | How about examples of multi-question dependence?
00:15:30.320 | For example, this came up in the philosophy section.
00:15:33.440 | "According to Singer, compliance with his principle requires...", but of course it doesn't say which
00:15:38.080 | of his principles.
00:15:39.080 | Or this one.
00:15:40.080 | "Singer's argument begins with the assumption that..." Now, if you look at the original question,
00:15:44.240 | it gave the context of the arguments.
00:15:46.560 | It talked about the arguments and principles from his book Famine, Affluence, and Morality.
00:15:51.040 | But in the MMLU, there is no such context.
00:15:54.200 | And again and again this comes up.
00:15:56.180 | High school bio.
00:15:57.240 | You can see the two examples here.
00:15:59.160 | Now there is one final category I want to talk about.
00:16:02.220 | No clear answer.
00:16:03.220 | I'm not going to categorically call this a mistake, but what answer would you pick here?
00:16:07.160 | This is in public relations.
00:16:08.660 | When an attitude is communicated, what does it become?
00:16:11.640 | An opinion?
00:16:12.640 | A belief?
00:16:13.640 | A behaviour?
00:16:14.640 | A point of view?
00:16:15.640 | I think that's pretty ambiguous.
00:16:16.900 | Those kinds of questions were particularly prevalent in moral scenarios and public relations.
00:16:22.540 | Or how about this?
00:16:23.600 | Are cryptocurrencies expensive or cheap?
00:16:26.200 | Is there an easy answer to that?
00:16:27.660 | And there was one question that I did about 3 hours of research for.
00:16:31.180 | What is the biggest cause of death in children under 5 years old?
00:16:34.720 | And there are multiple sources that give conflicting answers.
00:16:38.140 | That type of question, where it depends what source you ask, was massively prevalent in
00:16:43.200 | the global facts category.
00:16:45.120 | And then we get controversial questions like this in security studies.
00:16:48.840 | How do biological differences affect the roles that men and women must perform for the state?
00:16:54.620 | With the correct answer being that gender roles are a social construct.
00:16:58.820 | I feel GPT-4's answer is far more nuanced.
00:17:01.860 | This question touches on complex and controversial topics and while there is evidence to support
00:17:06.420 | or refute elements within each of the provided statements, none of them fully captures the
00:17:10.300 | nuanced relationship between biology, gender roles, society and state responsibilities.
00:17:14.260 | It also picks up on that language "must perform for the state".
00:17:17.340 | Now remember, these are just the examples that I found on a subset of the full test.
00:17:22.560 | Extrapolated out, that would suggest that hundreds of questions are ambiguous or erroneous.
00:17:28.440 | Now a 1, 2 or 3% inaccuracy rate didn't really matter when models were performing
00:17:33.860 | at around 25 or 30%.
00:17:36.280 | That's close to random and that was GPT-3's original performance.
00:17:40.120 | But now, when we're talking about AGI or human-expert-level accuracy, and models are
00:17:44.880 | being judged on tenths of a percent, 1, 2 or 3% really makes a big difference.