
Phi-2, Imagen-2, Optimus-Gen-2: Small New Models to Change the World?


Transcript

You might have thought that 2023 was winding down for generative AI, with more focus on mistletoe and merriment than models and the MMLU. But no: this was the week of powerful new small models that could change the landscape of AI in 2024. This video is about Phi 2 and what it means, as well as the madness of MMLU brinkmanship, with a reminder of all of that exam's mistakes.

Plus Optimus Gen 2, Imagen 2 announced just today, and much more. Phi 2 was announced by Satya Nadella, the CEO of Microsoft, last month, but what is it and what does it mean? In a nutshell, Phi 2 is a 2.7 billion parameter model, a small model by today's standards.

So small in fact that it could fit locally on a smartphone. According to the benchmarks, and I will get into those, it outperforms models of comparable size like ones trained using Mamba, as well as Google's new Gemini Nano. If that wasn't enough, it also outperforms models 20 to 25 times its size.

I've been following the Phi series of models since late June, when I interviewed Ronan Eldan, one of the key authors of the original Phi 1 paper. There were strong hints of Phi 1's significance in my original July 2nd video on "Textbooks Are All You Need". I'm going to try to give you all the necessary background on Phi 1, Phi 1.5 and Phi 2 in less than three minutes.

That's borderline impossible, but I'm going to try. Essentially, they retrieved a pile of permissively licensed open code from what's appropriately called the Stack. There's 10 times more than they used in there; they just can't use it for legal reasons. They extracted just the Python code from the many programming languages in that dataset and also filtered out duplicates.

For Phi 1, they then gave GPT-4 a task: filter for textbook-quality code. That's code with appropriate comments and good syntax. To cut costs, they swapped in a tiny classifier to finish the job that GPT-4 started, with that classifier essentially imitating the labelling that GPT-4 had kicked off. They then got GPT-3.5 to generate its own diverse textbook-quality data and synthetic Q&A exercises.
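
If you want a feel for how that filtering step works in code, here's a minimal sketch under my own assumptions, not the team's actual pipeline: the paper uses embeddings from a pretrained code model feeding a small classifier, whereas this stand-in uses TF-IDF features and logistic regression, and the seed snippets and labels are made up purely for illustration.

```python
# Hedged sketch of "imitate the expensive labeller with a cheap classifier".
# Not the team's actual pipeline: features and model here are stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical seed set: a small sample of Python snippets that GPT-4 has
# already labelled 1 (textbook quality) or 0 (not).
seed_snippets = [
    'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b',
    "x=eval(input());print(x*3)",
]
seed_labels = [1, 0]

quality_filter = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # cheap text features
    LogisticRegression(max_iter=1000),
)
quality_filter.fit(seed_snippets, seed_labels)

# Score the rest of the (much larger) filtered Stack without calling GPT-4 again.
candidates = [
    'def mean(xs):\n    """Arithmetic mean of a list."""\n    return sum(xs) / len(xs)',
    "print(open('f').read()[::2])",
]
keep = [c for c, p in zip(candidates, quality_filter.predict(candidates)) if p == 1]
print(f"{len(keep)} of {len(candidates)} snippets kept for the training set")
```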

For more detail on how they did that, check out the original video from July, which featured an interview with one of the key contributors. Phi 1 got 50% on HumanEval, a coding benchmark. All of this worked so well that for Phi 1.5, announced in September, they added synthetic exercises in common-sense reasoning, logic, science and theory of mind.

Furthermore, they trained only on synthetic data, leaving that additional filtered Stack data for a separate model called Phi 1.5 Web. What they realised was that this better-quality data allowed for more passes over the data, called epochs; think of it as a student rereading those textbooks. In a September exclusive, The Information reported that Microsoft was now heavily backing the researchers behind Phi.

As I mentioned in a video at the time, that gave them the money and compute needed to scale up Phi. That brings us to the release in the last 24 hours of Phi 2: 2.7 billion parameters, trained in just 14 days on fewer than 100 A100 GPUs. For reference, it's reported that Microsoft has as many as 150,000 H100s, the successor to the A100.
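
As a rough sanity check on those numbers, and this is my own back-of-the-envelope arithmetic rather than anything Microsoft reported, you can use the standard six times parameters times tokens approximation for dense transformer training together with the 1.4 trillion token figure I'll come to in a moment; the exact GPU count and the utilisation figure are assumptions.

```python
# Back-of-the-envelope compute check (my own estimate, not a Microsoft figure).
# Dense-transformer training compute is roughly 6 * parameters * tokens FLOPs.
params = 2.7e9                       # Phi 2 parameter count
tokens = 1.4e12                      # reported token count, including repeated epochs
needed = 6 * params * tokens         # ~2.3e22 FLOPs

# Assume 96 A100s ("fewer than 100") for 14 days at 312 TFLOPS peak bf16 each.
peak = 96 * 14 * 24 * 3600 * 312e12  # ~3.6e22 FLOPs of peak throughput
print(f"needed: {needed:.1e} FLOPs, peak budget: {peak:.1e} FLOPs")
print(f"implied utilisation: {needed / peak:.0%}")
# ~60%: high, so treat this purely as a rough consistency check on the claim.
```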

But what does it mean that the model is now bigger, at 2.7 billion parameters? Well, more parameters mean more connections can be made, more dots joined, so to speak. And with more compute, of course, they can feed in more data and go over that data more times.

The amount of data, including rereads or epochs, was 1.4 trillion tokens, which we know is five times more than Phi 1.5 Web. What we don't yet have is the full dataset, though the model itself has been released. There's a link in the description describing how to download Phi 2.
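
If you want to try it yourself, here's a minimal sketch of loading it with the Hugging Face transformers library, assuming the checkpoint is published under the microsoft/phi-2 model ID; treat the link in the description as the authoritative source.

```python
# Minimal sketch of loading Phi 2 with Hugging Face transformers.
# Assumes the checkpoint is published under the "microsoft/phi-2" model ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,   # 2.7B parameters fits on a single consumer GPU in fp16
    device_map="auto",           # requires the accelerate package
    trust_remote_code=True,      # only needed if the repo ships custom modelling code
)

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```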

A side benefit, of course, of training on synthetic data is that it tends to be less toxic. The toxicity scores went down across the board for Phi 2, and this is before any reinforcement learning from human feedback. If you're familiar with the meme, this could mean bye-bye, Shoggoth, thanks to synthetic data.

And there are so many potential lessons from Phi 2. As one of the researchers on the project said, our Phi 2 efforts prove that we have been wasting enormous amounts of compute on rather ineffective training data. As he put it, throwing a kitchen sink at a model has a big price tag and leads to lower quality.

On screen are the comparisons to Gemini Nano 2 and Mistral 7 billion parameters. And notice they've also included a comparison to Llama 2 70 billion parameters. And yes, given that one of the other topics of this video is going to be the flaws of benchmarks, I know what many of you are thinking.

Can we fully trust these benchmark figures? Of course, I researched Phi 2 contamination as best I could, and I was fairly convinced by the multiple lines of evidence provided in the video linked from Sébastien Bubeck, the lead author. The relevant clip is from minute 19 onwards. And the clues that something like this was coming were around, if you knew where to look.

When reading the original Phi 1.5 paper, I noted this in the discussion section. They said that perhaps achieving ChatGPT's level of capability, and remember that model is over 175 billion parameters, at the 1 billion parameter scale is actually achievable. Now, you let me know what you think of Phi 2 in your own testing.

And the only cautionary note I would give is this: I remember from the original Phi 1 paper that they mentioned their models have a certain sensitivity to prompt variations. In other words, how you phrase the prompt has significant effects on performance. And as the length of the prompt increases, the models tend to forget, ignore or misinterpret parts of the prompt.

But if these results bear out, this is the takeaway: it looks like you can mimic the performance, on certain tasks, of models trained with 1,000x more compute. So now let's imagine a 1.5 trillion parameter model trained this way, essentially imitating what a 1.5 quadrillion parameter large language model would be like.

That's more parameters than we have synapses in the human brain. And of course, that's not even bringing in allowing the model to think for longer with extra test-time compute and system verification, à la AlphaCode 2. Things could certainly get interesting in 2024. Before we move on completely from benchmarks and Microsoft, though, there is one more chart that Microsoft put out that I want to highlight.

Here they are showing that with their prompting system you can get 90.1% on the MMLU. Now, regular watchers of this channel probably know what's coming next. For hopefully the last time that I'm mentioning it on this channel: that benchmark is flawed in many respects. In fact, I'm going to end this video with a clip from my original SmartGPT video running through just some of the mistakes in that benchmark.

I'm frankly shocked that, at the end of 2023, it's still being used to compare models to two decimal places. Of course, there Microsoft was comparing its prompting techniques to Google's Gemini Ultra, and many pointed out the somewhat disingenuous way that Google presented its Gemini model, particularly when it comes to video analysis.

Yes, that was a poor show from Google, but I made the point in the comments at the time that such a demo is possible in the near term. "I drew this. Can you tell me what... what is that?" "The drawing appears to be of a duck or a similar bird on water."

Before we move on from Google, though: today, just in the last hour, they released Imagen 2. They announced it back in May, but it's now available via API. And, be honest, would you have believed that this image was generated by a text-to-image model? Almost for the first time, I would say that no, I can't even tell that that's made by AI.

Imagen 2, by the way, is a diffusion model, and when using it you are indemnified by Google against copyright claims. Furthermore, all generations are watermarked. The quality looks stunning. I mean, particularly if you look at the top-right and bottom-middle images, you've got to admit that that is a frankly shocking level of photorealism.

If, by this point next year, it gets to this level for text-to-video, we'll be living in a different world. But I did say that the theme of today's video is small models. And here is the 10-kilogram-lighter Generation 2 humanoid robot from Tesla. This is Optimus Gen 2.

Watching this video makes me think of touch, temperature and pressure sensitivity as whole new modalities yet to be fully explored. I spoke to one of the leads for robotics at Google DeepMind for AI Insiders, but I also now want to reach out to Tesla, because Optimus is getting close.

But yes, speaking of AI Insiders, today is the full launch of that Patreon tier. First and foremost, you're supporting the channel, and I've written a personal message to everyone who's joined. And honestly, I didn't expect this, but the Discord for AI Insiders has proven to be great for networking, whether you're an AI engineer, in the C-suite, or just interested in AI.

But what is the content that's actually on AI Insiders? Well, let me walk you through the four categories. First, we have what I would call classic AI Explained videos: bonus content, just like you'd see on the main channel. In this collection, I released a video two or three days ago on AGI timelines, featuring extracts from six expert interviews I conducted over the last month.

Plus, of course, my own timelines. Next, we have the AI Insiders podcast. I call it my "let's think sip by sip" podcast; it's not my normal kind of video, more a stream-of-consciousness, audio-only reflection. With access, it's available on Spotify or wherever else you get your podcasts.

Then we have the tutorials I've been working on for months. These are more for those who are LLM-curious and, like everything else, feature expert extracts and will be continually updated in the weeks and months to come. Finally, we have the Insiders Arena, and let me click on this one.

This is where you and any other member with a passion for AI can submit explainers, and I'll pick the best of the bunch, edit them and film an intro. The debut video in this series is from none other than the legendary swyx, on the rise of the AI engineer.

He makes a joke about my thumbnails, but I'm fine with it. Anyway, the next video could be yours, and the best of the bunch will feature a cameo on AI Explained. Many people have already told me that they expense AI Insiders for work, and I do want to reiterate that just by watching to the end of my videos, you are supporting me more than I could possibly expect. And I'm going to do something unusual at this point.

I'm going to wish you a wonderful day before the end of the video. Why? Because I'm going to end with a few minutes on the mistakes of the MMLU, which hopefully I'm having to point out for the very last time. As always, though, have a wonderful day. Back to breaking the benchmark.

Here is the question that started it all off. As you can see, the question makes no sense. The text says "demand reduction", and the answer options are "1, 3 and 4", "2, 3 and 4", "1, 2 and 3", or "1, 2 and 4". What on earth is that about? Now remember, it was only human grading that enabled us to spot this.

I reckon most companies, like OpenAI, rely on auto-grading by exact match, which would immediately toss out any answer like these as a null, because an answer of A, B, C or D hasn't been given. Now, I should say it was human grading that caught this, and GPT-4 itself.
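
To be clear about what I mean by exact-match auto-grading, here's a generic illustration of the idea, not OpenAI's actual evaluation code: anything that isn't a bare A, B, C or D gets scored as wrong, including a perfectly reasonable refusal or explanation.

```python
# Generic illustration of exact-match auto-grading (not any lab's real
# evaluation code): only a bare A/B/C/D letter can ever score a point.
VALID = {"A", "B", "C", "D"}

def exact_match_grade(model_output: str, gold: str) -> bool:
    answer = model_output.strip().upper()[:1]  # keep only the first character
    if answer not in VALID:
        return False  # malformed question, refusal or explanation: tossed out as a null
    return answer == gold

print(exact_match_grade("B", "B"))                            # True
print(exact_match_grade("This question is malformed.", "B"))  # False, scored as a null
```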

Here is poor GPT-3.5 bravely giving an answer to a question that doesn't make any sense at all. I know, a couple of times it changed its mind and was like, no, no, no, D, not B. What then followed was weeks and weeks of me following up every "mistake" with the official source the question came from.

When I found the original source, I realised what the problem was: sometimes they just hadn't pasted in all of these statements. When you can see all four of the statements, the answer options make a lot more sense. Now, I know what some of you may be thinking: maybe it's just business ethics, that's just one subject, and, what, it's a dozen questions?

What's the big deal? Well, first of all, business ethics only has a hundred questions, so 13 of them missing vital context completely undermines that entire subject. And second of all, it wasn't just business ethics, and it wasn't just this same problem; it wouldn't always be about missing statements. Check out these examples from high school chemistry, and there's high school psychology, professional psychology, microeconomics, professional law, professional accounting.

And trust me, it didn't stop there. I was genuinely shocked. There were innumerable factual errors and I would try to trace down the origin of each and see what the source said. By the way, the problem wasn't just with one source. It was with quite a few of these sources.

Let's take one of these questions at random: how many human polyomaviruses are known at present? 100, 1, 10, or unknown? This question comes from Oxford University Press, chapter 21, question two. I researched the question myself and also checked what this multiple-choice quiz said the answer was.

Let's tick 10, which is answer C, and then submit my answers to the quiz. And let's see: yes, it's correct, it's 10. By the way, the actual answer seems to be 14 as of 2023, but that's fairly close. What does the MMLU say? It says the answer is A, 100.

And it goes on and on like this. Some of the worst offenders are the virology and college chemistry sections: just wrong answer after wrong answer after wrong answer. Here's another example. This is what the MMLU says is the answer to this question: B. I tracked down the question to a Fall 2011 final exam, in which the answer was B.

The MMLU, however, had mixed up the order of the options, so its B pointed at "drug users" instead of "men". But there's one more slight problem: research suggests that both of those answers are inaccurate. And that happened multiple times, where even the source was somewhat dodgy with its answers.

Here is another page of mistakes for virology, and another page for college chemistry. And here's one example that will particularly shock AI researchers: a question for which the MMLU says the correct answer is A, while the original source says the answer is 8, which isn't even an option.

But this question was in the dev set. And if you remember from earlier in the video, that is the set of five questions they use to teach the model what kind of answer to give when benchmarking on the MMLU. In other words, all 100 results in college chemistry, for every model benched on the MMLU, are compromised.
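
For those unfamiliar with how that dev set gets used, here's a hedged sketch of MMLU-style five-shot prompting, a generic illustration rather than the official evaluation harness: the five dev questions and answers for a subject are prepended to every test question, so one wrong dev answer ends up in every single college chemistry prompt.

```python
# Hedged sketch of MMLU-style five-shot prompting (generic illustration,
# not the official harness). The dev examples below are hypothetical.
dev_set = [  # the real dev set has five labelled examples per subject
    {"question": "Dev question 1 ...", "choices": ["w", "x", "y", "z"], "answer": "A"},
    {"question": "Dev question 2 ...", "choices": ["w", "x", "y", "z"], "answer": "C"},
]

def format_question(question, choices, answer=None):
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    suffix = f" {answer}" if answer is not None else ""
    return f"{question}\n{options}\nAnswer:{suffix}"

def build_few_shot_prompt(dev_set, test_question, test_choices):
    # A wrong dev-set "answer" label is copied into every prompt built here.
    shots = [format_question(ex["question"], ex["choices"], ex["answer"]) for ex in dev_set]
    shots.append(format_question(test_question, test_choices))
    return "\n\n".join(shots)

print(build_few_shot_prompt(dev_set, "Test question ...", ["w", "x", "y", "z"]))
```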

For example, a model that is particularly good at imitating reasoning will now be imitating an incorrect answer. Now, you might be thinking: surely by now Philip and GPT-4 are done with finding errors in the MMLU. Unfortunately not. We carry on into new categories. Here was a question from econometrics where, again, the source was incorrect.

But we also have misspellings, grammatical ambiguity and formatting ambiguity throughout the test. I'm not going to go through all of these, but any one of them could potentially confuse a model. We already know that models are very sensitive to the inputs you give them. Are there any more categories? Yes, there are.

There are loads of juicy examples here, but I can't get to them all. How about examples of multi-question dependence? For example, this came up in the philosophy section: "According to Singer, compliance with his principle requires...", but of course it doesn't say which of his principles. Or this one: "Singer's argument begins with the assumption that...". Now, if you look at the original question, it gave the context of the arguments.

It talked about the arguments and principles from his essay Famine, Affluence, and Morality. But in the MMLU, there is no such context. And again and again this comes up. High school biology: you can see the two examples here. Now, there is one final category I want to talk about.

No clear answer. I'm not going to categorically call this a mistake, but what answer would you pick here? This is in public relations: when an attitude is communicated, what does it become? An opinion, a belief, a behaviour, or a point of view? I think that's pretty ambiguous. Those kinds of questions were particularly prevalent in moral scenarios and public relations.

Or how about this? Are cryptocurrencies expensive or cheap? Is there an easy answer to that? And there was one question that I did about 3 hours of research for. What is the biggest cause of death in children under 5 years old? And there are multiple sources that give conflicting answers.

That type of question, where it depends which source you ask, was massively prevalent in the global facts category. And then we get controversial questions like this in security studies: how do biological differences affect the roles that men and women must perform for the state? With the correct answer being that gender roles are a social construct.

I feel GPT-4's answer is far more nuanced: "This question touches on complex and controversial topics, and while there is evidence to support or refute elements within each of the provided statements, none of them fully captures the nuanced relationship between biology, gender roles, society and state responsibilities." It also picks up on that language, "must perform for the state".

Now remember, these are just the examples that I found in a subset of the full test. Extrapolated out, that would suggest that hundreds of questions are ambiguous or erroneous. Now, a 1, 2 or 3% inaccuracy rate didn't really matter when models were performing at around 25 or 30%; that's close to random, and that was GPT-3's original performance.

But now, when we're talking about AGI or human expert-level accuracy, and models are being judged on tenths of a percent, 1, 2 or 3% really makes a big difference.