Phi-2, Imagen-2, Optimus-Gen-2: Small New Models to Change the World?
00:00:00.000 |
You might have thought that 2023 was winding down for generative AI with more focus on 00:00:06.480 |
mistletoe and merriment than models and the MMLU. 00:00:10.920 |
But no, this was the week of powerful new small models that could change the landscape. 00:00:18.660 |
This video is about Phi 2 and what it means, as well as the madness of MMLU brinkmanship 00:00:24.000 |
with a reminder of all of that exam's mistakes. 00:00:27.120 |
Plus Optimus Gen 2, Imagen 2 (announced just today), and much more. 00:00:32.160 |
Phi 2 was announced by Satya Nadella, the CEO of Microsoft, last month, but what is it? 00:00:38.520 |
In a nutshell, Phi 2 is a 2.7 billion parameter model, a small model by today's standards. 00:00:44.720 |
So small in fact that it could fit locally on a smartphone. 00:00:47.560 |
According to the benchmarks, and I will get into those, it outperforms models of comparable 00:00:51.960 |
size like ones trained using Mamba, as well as Google's new Gemini Nano. 00:00:56.960 |
If that wasn't enough, it also outperforms models 20 to 25 times its size. 00:01:01.560 |
I've been following the Phi series of models since late June when I interviewed Ronan Eldan, 00:01:06.740 |
one of the key authors of the original Phi 1 paper. 00:01:09.900 |
There were strong hints of Phi 1's significance in my original July 2nd video on Textbooks Are All You Need. 00:01:16.400 |
I'm going to try to give you all the necessary background on Phi 1, Phi 1.5, and Phi 2 in the next few minutes. 00:01:22.580 |
That's borderline impossible, but I'm going to try. 00:01:24.920 |
Essentially, they retrieved a pile of permissively licensed open code from what's appropriately 00:01:30.760 |
called The Stack. There's ten times more code in there than they actually used. 00:01:34.920 |
They extracted just the Python code from the many programming languages in that data set. 00:01:41.800 |
For Phi 1, they then gave GPT-4 a task: filter for textbook-quality code. 00:01:46.960 |
That's code with appropriate comments and good syntax. 00:01:49.360 |
To cut costs, they swapped in a tiny classifier to finish the job that GPT-4 started, 00:01:53.840 |
with that classifier essentially imitating the labeling that GPT-4 had kicked off. 00:01:57.640 |
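To make that pipeline concrete: the idea is to distill GPT-4's quality labels into a small classifier over code embeddings, which then filters the full corpus cheaply. Here is a minimal sketch of that idea, with hypothetical stand-ins (a sentence-transformer embedder, logistic regression, and a made-up file of GPT-4 labels) rather than the paper's actual components:

```python
# Minimal sketch: distill GPT-4's "textbook quality" labels into a
# cheap classifier, then let the classifier filter the full corpus.
# The file name, embedder, and classifier are hypothetical stand-ins.
import json

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# A small sample that GPT-4 already labeled (1 = textbook quality, 0 = not).
labeled = [json.loads(line) for line in open("gpt4_labeled_sample.jsonl")]
X = embedder.encode([ex["code"] for ex in labeled])
y = np.array([ex["label"] for ex in labeled])

clf = LogisticRegression(max_iter=1000).fit(X, y)

def keep(code_snippet: str, threshold: float = 0.5) -> bool:
    """Cheap stand-in for GPT-4: keep only snippets scored as textbook quality."""
    prob = clf.predict_proba(embedder.encode([code_snippet]))[0, 1]
    return prob >= threshold
```

The design point is that GPT-4 only labels a small sample; the cheap classifier then scores the millions of remaining files.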
They then got GPT-3.5 to generate its own diverse textbook-quality data and synthetic exercises. 00:02:05.080 |
For more detail on how they did that, check out the original video in July, which featured 00:02:09.000 |
an interview with one of the key contributors. 00:02:11.360 |
Phi 1 got 50% on HumanEval, a coding benchmark. 00:02:14.760 |
And all of this worked so well that for Phi 1.5, announced in September, they added synthetic 00:02:20.000 |
exercises in common sense reasoning, logic, science, and theory of mind. 00:02:25.000 |
Furthermore, they trained only on synthetic data, leaving that additional filtered Stack 00:02:29.600 |
data for a separate model called Phi 1.5 Web. 00:02:32.880 |
What they realized was that this better quality data allowed for more passes over the data, 00:02:37.320 |
called epochs. Think of it as a student rereading those textbooks. 00:02:41.080 |
In a September exclusive, The Information reported that Microsoft was now heavily backing the Phi team. 00:02:47.320 |
As I mentioned in a video at the time, that gave them the money and compute needed to scale up. 00:02:52.560 |
That brings us to the release in the last 24 hours of Phi 2, 2.7 billion parameters, 00:02:58.440 |
trained in just 14 days on less than 100 A100 GPUs. 00:03:03.200 |
For reference, it's reported that Microsoft has as many as 150,000 H100s, the successor to the A100. 00:03:11.160 |
But what does it mean that the model is now bigger, 2.7 billion parameters? 00:03:14.800 |
Well, more parameters mean more connections can be made, more dots joined, so to speak. 00:03:19.240 |
And with more compute, of course, that means that they can feed in more data and go over it more times. 00:03:24.480 |
The amount of data, including rereads or epochs, was 1.4 trillion tokens, which we know is a mix of synthetic and filtered web data. 00:03:32.600 |
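As a rough sanity check on those figures (my own back-of-envelope arithmetic using the standard C ≈ 6ND approximation for training compute, where N is parameters and D is training tokens, not anything Microsoft published):

```latex
C \approx 6ND = 6 \times (2.7 \times 10^{9}) \times (1.4 \times 10^{12})
  \approx 2.3 \times 10^{22} \ \text{FLOPs}
```

Assuming 96 GPUs for illustration, A100s at a peak of roughly 312 TFLOP/s in BF16 supply about 3.6 × 10^22 FLOPs over 14 days, so the reported run would imply hardware utilization somewhere in the region of 60%: tight, but plausible for a small dense model.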
What we don't yet have is that full data set, though the model is open sourced. 00:03:36.480 |
There's a link in the description describing how to download Phi 2. 00:03:40.000 |
A side benefit, of course, of training on synthetic data is that it tends to be less toxic. 00:03:45.840 |
The toxicity scores went down across the board for Phi 2, and this is before any reinforcement learning from human feedback. 00:03:52.480 |
If you're familiar with the meme, this could mean bye-bye Shoggoth, thanks to synthetic data. 00:03:57.880 |
And there are so many potential lessons from Phi 2. 00:04:00.360 |
As one of the researchers on the project said: "Our Phi 2 efforts prove that we have been 00:04:04.720 |
wasting enormous amounts of compute on rather ineffective training data." 00:04:09.260 |
As he put it, throwing a kitchen sink at a model has a big price tag and leads to lower quality. 00:04:15.440 |
On screen are the comparisons to Gemini Nano 2 and Mistral 7 billion parameters. 00:04:20.600 |
And notice they've also included a comparison to Llama 2 70 billion parameters. 00:04:25.520 |
And yes, given that one of the other topics of this video is going to be about the flaws 00:04:29.480 |
of benchmarks, I know what many of you are thinking. 00:04:34.080 |
Of course, I researched as best I could about Phi 2 contamination, but I was fairly convinced 00:04:39.240 |
by the multiple lines of evidence provided in the video linked from Sébastien Bubeck. 00:04:48.040 |
And the clues that something like this was coming were around if you knew where to look. 00:04:52.440 |
When reading the original Phi 1.5 paper, I noted this in the discussion section. 00:04:57.360 |
They said: perhaps achieving ChatGPT's level of capability, and remember that model 00:05:01.360 |
is over 175 billion parameters, at the 1 billion parameter scale is actually achievable. 00:05:08.160 |
Now you let me know what you think of Phi 2 in your testing. 00:05:12.080 |
And the only cautionary note I would give is this. 00:05:14.520 |
I remember from the original Phi 1 paper that they mentioned that their models have a certain sensitivity to how prompts are phrased. 00:05:21.820 |
In other words, how you phrase the prompt has significant effects on its performance. 00:05:27.160 |
And as the length of the prompt increases, the models tend to forget, ignore or misinterpret parts of it. 00:05:33.280 |
But if these results bear out, this is the takeaway. 00:05:35.920 |
It looks like you can mimic the performance on certain tasks of models trained with far more parameters and compute. 00:05:43.000 |
So now let's imagine a 1.5 trillion parameter model trained this way. 00:05:48.160 |
Essentially imitating what a 1.5 quadrillion parameter large language model would be like. 00:05:53.960 |
That's more parameters than we have synapses in the human brain. 00:05:57.520 |
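For scale (a back-of-envelope comparison; synapse counts are themselves only rough estimates):

```latex
1.5 \text{ quadrillion} = 1.5 \times 10^{15} \text{ parameters}
\quad \text{vs.} \quad
\sim 10^{14}\text{--}10^{15} \text{ synapses in the adult human brain}
```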
And of course, that's not even bringing in allowing the model to think for longer with 00:06:01.340 |
extra test-time compute and system verification à la AlphaCode 2. 00:06:06.220 |
Things could certainly get interesting in 2024. 00:06:09.320 |
Before we move on completely from benchmarks and Microsoft though, there is one more chart 00:06:13.720 |
that Microsoft put out that I want to highlight. 00:06:15.920 |
Here they show that with their prompting system, you can get 90.1% on the MMLU. 00:06:22.320 |
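For context, that 90.1% comes from prompting (Microsoft's Medprompt work) rather than a new model, combining dynamic few-shot selection, chain-of-thought, and a choice-shuffling ensemble. Here is a minimal sketch of just the choice-shuffling vote, with a hypothetical `ask_model` standing in for whatever model API you use:

```python
# Minimal sketch of a choice-shuffling ensemble: ask the same question
# several times with the options permuted, then majority-vote on the
# option *text* so that position bias cannot skew the tally.
import random
from collections import Counter

LETTERS = "ABCD"

def ask_model(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in: should return one letter

def choice_shuffle_vote(question: str, options: list[str], n_votes: int = 5) -> str:
    votes = []
    for _ in range(n_votes):
        shuffled = random.sample(options, k=len(options))  # fresh ordering
        prompt = question + "\n" + "\n".join(
            f"{LETTERS[i]}. {opt}" for i, opt in enumerate(shuffled)
        ) + "\nAnswer with a single letter."
        letter = ask_model(prompt).strip().upper()[:1]
        if letter in LETTERS[: len(shuffled)]:
            votes.append(shuffled[LETTERS.index(letter)])  # map back to text
    return Counter(votes).most_common(1)[0][0]
```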
Now regular watchers of this channel probably know what's coming next. 00:06:25.840 |
For hopefully the last time that I'm mentioning it on this channel, that benchmark is flawed. 00:06:31.500 |
In fact, I'm going to end this video with a clip from my original Smart GPT video running 00:06:36.000 |
through just some of the mistakes on that benchmark. 00:06:39.040 |
I'm frankly shocked that at the end of 2023, it's still being used to compare models to each other. 00:06:45.000 |
Of course, there Microsoft was comparing its prompting techniques to Google's Gemini 00:06:49.000 |
Ultra and many pointed out the somewhat disingenuous way that Google presented its Gemini model, 00:06:54.720 |
particularly when it comes to video analysis. 00:06:57.000 |
Yes, that was a poor show from Google, but I made the point in the comments at the time 00:07:01.580 |
that such a demo is possible in the near term. 00:07:11.940 |
"The drawing appears to be of a duck or a similar bird on water." 00:07:15.080 |
Before we move on from Google though, they today, just in the last hour, released Imagen 2. 00:07:21.680 |
They announced it back in May, but it's now available via API. 00:07:24.980 |
And be honest, would you have believed that this image is generated by a text-to-image model? 00:07:31.280 |
Almost for the first time, I would say that no, I can't even tell that that's made by AI. 00:07:37.000 |
Imagen 2, by the way, is a diffusion model, and using it, you are indemnified by Google. 00:07:42.680 |
Furthermore, all generations are watermarked. 00:07:46.640 |
I mean, particularly if you look at the top right and bottom middle images, you've got 00:07:50.580 |
to admit that that is a frankly shocking level of photorealism. 00:07:54.700 |
If, by this point next year, it gets to this level for text-to-video, we'll be living in a very different world. 00:08:00.560 |
But I did say that the theme of today's video is small models. 00:08:04.360 |
And here is the 10-kilogram-lighter Generation 2 humanoid robot from Tesla. 00:08:28.080 |
Watching this video makes me think of touch, temperature and pressure sensitivity as whole new modalities to explore. 00:08:50.240 |
I spoke to one of the robotics leads at Google DeepMind for AI Insiders, but I also 00:08:54.760 |
now want to reach out to Tesla because Optimus is getting close. 00:08:58.520 |
But yes, speaking of AI Insiders, today is the full launch of that Patreon tier. 00:09:03.880 |
First and foremost, you're supporting the channel, and I've written a personal message for everyone who joins. 00:09:08.360 |
And honestly, I didn't expect this, but the Discord for AI Insiders has proven to be great 00:09:13.000 |
for networking, whether you're an AI engineer, in the C-suite, or just interested in AI. 00:09:17.600 |
But what is the content that's actually on AI Insiders? 00:09:20.240 |
Well, let me walk you through the four categories. 00:09:23.120 |
First, we have what I would call classic AI Explained videos: 00:09:26.160 |
bonus content, just like you'd see on the main channel. 00:09:28.560 |
In this collection, I released a video two or three days ago on AGI timelines, featuring 00:09:33.040 |
extracts from six expert interviews I conducted over the last month. 00:09:40.640 |
Second, there's the podcast. I call it my "Let's Think Sip by Sip" podcast; it's not my normal kind of video, more a stream of consciousness. 00:09:47.780 |
With access, it's available on Spotify or wherever else you get your podcasts. 00:09:51.820 |
Then we have the tutorials I've been working for months on. 00:09:55.120 |
These are more for those who are LLM-curious and, like everything else, feature expert extracts 00:10:00.280 |
and will be continually updated in the weeks and months to come. 00:10:03.840 |
Finally, we have the Insiders Arena, and let me click on this one. 00:10:07.280 |
This is where you and any other member with a passion for AI can submit explainers. 00:10:12.640 |
I'll pick the best of the bunch, edit them and film an intro. 00:10:16.280 |
The debut video in this series is from none other than the legendary swyx on the rise of the AI engineer. 00:10:23.120 |
He makes a joke about my thumbnails, but I'm fine with it. 00:10:25.880 |
Anyway, the next video could be you, and the best of the bunch will feature a cameo on the main channel. 00:10:31.960 |
Many people have already told me that they expense AI Insiders for work, and I do want 00:10:36.400 |
to reiterate that just by watching to the end of my videos, you are supporting me more 00:10:41.560 |
than I can possibly expect, and I'm going to do something unusual at this point. 00:10:45.080 |
I'm going to wish you a wonderful day before the end of the video. 00:10:49.480 |
Because I'm going to end with a few minutes on the mistakes of the MMLU, which hopefully makes clear why I'm so skeptical of benchmark brinkmanship. 00:11:01.480 |
Here is the question that started it all off. 00:11:06.040 |
The text says demand reduction, and the answer options are: 1, 3 and 4; 2, 3 and 4; 1, 2 and 3; or 1, 2 and 4. 00:11:15.040 |
Now remember, it was only human grading that enabled us to spot this. 00:11:19.440 |
I reckon most companies like OpenAI rely on auto-grading by exact match, which would immediately 00:11:25.360 |
toss out any answer like these as a null, because an answer of A, B, C or D hasn't been given. 00:11:32.120 |
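To illustrate the failure mode, here is a minimal sketch of that kind of exact-match auto-grading (my guess at the general pattern, not any lab's actual harness):

```python
# Minimal sketch of exact-match auto-grading for multiple choice.
# Any output that isn't a clean single letter is scored as null, so a
# model answering "1, 2 and 3" never surfaces the broken question.
VALID = {"A", "B", "C", "D"}

def grade(model_output: str, gold: str) -> bool | None:
    answer = model_output.strip().upper().rstrip(".")
    if answer not in VALID:
        return None  # tossed out silently; no human ever looks at why
    return answer == gold

print(grade("B", "B"))           # True
print(grade("1, 2 and 3", "B"))  # None: the flawed question goes unnoticed
```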
Now, I should say it was human grading, along with GPT-4 itself, that caught this. 00:11:36.880 |
Here is poor GPT-3.5 bravely giving an answer to a question that doesn't make any sense. 00:11:44.560 |
I know a couple of times it changed its mind and was like, no, no, no, D, not B. 00:11:48.640 |
What then followed was weeks and weeks of me following up every "mistake" 00:11:53.640 |
with the official source the question came from. 00:11:56.280 |
When I found the original source, I realised what the problem was. 00:12:00.880 |
Sometimes they just hadn't pasted all of these statements. 00:12:03.700 |
When you can see all four of these statements, the answer options make a lot more sense. 00:12:10.340 |
Maybe you're thinking it's just business ethics, that's just one subject, and what, it's a dozen questions? 00:12:15.880 |
Well, first of all, business ethics only has a hundred questions. 00:12:18.440 |
So 13 of them missing vital context completely undermines that entire subject. 00:12:24.360 |
And second of all, it wasn't just business ethics and it wasn't just this same problem. 00:12:28.680 |
It wouldn't always be about missing statements. 00:12:30.840 |
Check out these examples from high school chemistry and there's high school psychology, 00:12:35.340 |
professional psychology, microeconomics, professional law, professional accounting. 00:12:42.540 |
There were innumerable factual errors, and I would try to trace down the origin of each one. 00:12:50.400 |
By the way, the problem wasn't just with one source. 00:12:57.960 |
How many human polyomaviruses are known at present? 00:13:03.560 |
This question comes from Oxford University Press, chapter 21, question two. 00:13:07.800 |
I researched the question myself and also checked what this multiple-choice quiz said was correct. 00:13:13.820 |
Let me tick 10, which is answer C, and then submit my answers to the quiz. 00:13:21.740 |
By the way, the actual answer seems to be 14 as of 2023, but that's fairly close. 00:13:33.460 |
Some of the worst offenders are the virology and college chemistry sections. 00:13:38.240 |
Just wrong answer after wrong answer after wrong answer. 00:13:42.600 |
This is what the MMLU says is the answer to this question, B. 00:13:46.360 |
I tracked down the question to a fall 2011 final exam in which the answer was B. 00:13:52.080 |
But the MMLU had mixed up the order of the options, so its B corresponded to a different drug than the exam's B. 00:14:01.880 |
Research suggests that both of those answers are inaccurate. 00:14:04.940 |
And that happened multiple times where even the source was somewhat dodgy with its answers. 00:14:09.980 |
Here is another page of mistakes for virology and another page for college chemistry. 00:14:15.780 |
And one example that will particularly shock AI researchers. 00:14:19.500 |
Here we have a question for which the MMLU says the correct answer is A. 00:14:23.540 |
The original source says that the answer is 8, which isn't even an option. 00:14:30.380 |
And if you remember from earlier in the video, that question is part of the set of 5 questions they use to 00:14:34.700 |
teach the model what kind of answer to give when they're benchmarking on the MMLU. 00:14:39.320 |
In other words, all 100 results in college chemistry for every model benched on the MMLU are compromised. 00:14:45.980 |
For example, a model that is particularly good at imitating reasoning will now be imitating flawed answers. 00:14:52.780 |
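To see why one bad answer there poisons everything: the standard MMLU setup prepends the same five dev-set questions, with their labeled answers, to every test question in a subject. A minimal sketch of that prompt construction (simplified from the usual evaluation harnesses):

```python
# Minimal sketch of standard 5-shot MMLU prompting. The same five dev
# examples head EVERY test prompt for a subject, so a single wrong dev
# label teaches every model being evaluated the wrong pattern.
def format_question(q: dict) -> str:
    opts = "\n".join(f"{l}. {o}" for l, o in zip("ABCD", q["options"]))
    return f"{q['question']}\n{opts}\nAnswer:"

def five_shot_prompt(dev_examples: list[dict], test_q: dict) -> str:
    shots = "\n\n".join(
        format_question(ex) + f" {ex['answer']}" for ex in dev_examples[:5]
    )
    return shots + "\n\n" + format_question(test_q)
```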
Now you might be thinking, surely now Philip and GPT-4 are done with finding errors in the MMLU. 00:15:02.540 |
Here was a question from econometrics, where again the source was incorrect. 00:15:06.740 |
But we also have misspellings, grammatical ambiguity, and formatting ambiguity throughout the benchmark. 00:15:13.300 |
I'm not going to go through all of these, but any one of them could potentially confuse a model. 00:15:17.940 |
We already know that models are very sensitive to the inputs you give them. 00:15:24.220 |
There are loads of juicy examples here, but I can't get to them all. 00:15:27.640 |
How about examples of multi-question dependence? 00:15:30.320 |
For example, this came up in the philosophy section. 00:15:33.440 |
"According to Singer, compliance with his principle requires..." But of course it doesn't say which principle. 00:15:40.080 |
Or: "Singer's argument begins with the assumption that..." Now if you look at the original source, 00:15:46.560 |
it talked about the arguments and principles from his book Famine, Affluence, and Morality. 00:15:59.160 |
Now there is one final category I want to talk about. 00:16:03.220 |
I'm not going to categorically call this a mistake, but what answer would you pick here? 00:16:08.660 |
When an attitude is communicated, what does it become? 00:16:16.900 |
Those kinds of questions were particularly prevalent in moral scenarios and public relations. 00:16:27.660 |
And there was one question that I did about 3 hours of research for. 00:16:31.180 |
What is the biggest cause of death in children under 5 years old? 00:16:34.720 |
And there are multiple sources that give conflicting answers. 00:16:38.140 |
That type of question, where it depends what source you ask, was massively prevalent in certain subjects. 00:16:45.120 |
And then we get controversial questions like this in security studies. 00:16:48.840 |
How do biological differences affect the roles that men and women must perform for the state? 00:16:54.620 |
With the correct answer being that gender roles are a social construct. 00:17:01.860 |
Here's the response I got: "This question touches on complex and controversial topics, and while there is evidence to support 00:17:06.420 |
or refute elements within each of the provided statements, none of them fully captures the 00:17:10.300 |
nuanced relationship between biology, gender roles, society and state responsibilities." 00:17:14.260 |
It also picks up on that language "must perform for the state". 00:17:17.340 |
Now remember, these are just the examples that I found on a subset of the full test. 00:17:22.560 |
Extrapolated out, that would suggest that hundreds of questions are ambiguous or erroneous. 00:17:28.440 |
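Putting a number on that extrapolation (mine, not a published figure; it assumes the error rate in the subjects I sampled is representative): the MMLU test split has 14,042 questions, so

```latex
14{,}042 \text{ questions} \times 2\% \approx 280 \text{ ambiguous or erroneous questions}
```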
Now a 1, 2 or 3% inaccuracy rate didn't really matter when models were performing barely above chance. 00:17:36.280 |
That's close to random and that was GPT-3's original performance. 00:17:40.120 |
But now when we're talking about AGI or Human Expert Level Accuracy and models are 00:17:44.880 |
being judged on tenths of a percent, 1, 2 or 3% really makes a big difference.