
An Actually Big Week in AI: AutoGen, The A-Phone, Mistral 7B, GPT-Fathom and Meta Hunts CharacterAI


Chapters

0:00
0:35 GPT Vision Use Cases
1:32 Meta AI to 4 Billion People?
3:00 CIA-Bot
3:48 The Altman Phone
4:47 AutoGen
8:26 Mistral 7B
9:58 Orca @ MSFT
11:20 GPT-Fathom

Transcript

Many times we hear of AI news that we only realize was significant in hindsight. This week came developments that are big from any perspective. I don't just mean the whole new category of use cases for GPT Vision, or Meta bringing language models to billions of people. There's also AutoGen as the new AutoGPT, what I'm calling the Altman phone, Mistral's new 7-billion-parameter model, Orca potentially replacing OpenAI models at Microsoft, and yesterday's fascinating GPT-Fathom paper.

I'm going to start with a use case for GPT-4 with vision that I didn't see coming. If you spot a user interface online that you like, you can try to recreate it with GPT-4. As @skirano demonstrated on X, you can ask it to imitate the layout and give you the HTML code.

Now I know many of you will say that doesn't look too impressive, but wait until you see the next demo. GPT-4 with vision can iterate on its designs. Because it can see its own output, it can recognize flaws and improve on them. This is from Matt Shumer, CEO of HyperWrite AI.

You can see it trying to design that futuristic Google homepage and improving with each generation. We've gone beyond text feedback now into a visual feedback loop. I honestly wonder if someone is going to do something similar for DALL-E 3: use GPT-4 with vision to keep iterating on outputs until they match your prompt perfectly.
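To make that visual feedback loop concrete, here is a minimal sketch in Python. It is not the code from the demo. The function name `refine_with_visual_feedback` and the three callables (`generate`, `render`, `critique`) are hypothetical placeholders for whatever model calls and screenshot tooling you would actually plug in.

```python
from typing import Callable, Optional


def refine_with_visual_feedback(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    render: Callable[[str], bytes],
    critique: Callable[[str, bytes], Optional[str]],
    max_rounds: int = 5,
) -> str:
    """Generate an artifact, look at its rendering, and revise it.

    All three callables are hypothetical stand-ins:
      generate(prompt, feedback) -> artifact, e.g. HTML from a language model
      render(artifact)           -> image bytes, e.g. a screenshot of that HTML
      critique(prompt, image)    -> a description of flaws, or None if satisfied
    """
    feedback: Optional[str] = None
    artifact = ""
    for _ in range(max_rounds):
        artifact = generate(prompt, feedback)  # first round has no feedback yet
        image = render(artifact)               # let the model "see" its output
        feedback = critique(prompt, image)     # visual critique of the rendering
        if feedback is None:                   # nothing left to fix, stop early
            break
    return artifact
```

The `max_rounds` cap is a design choice of this sketch: a critic that never reports success would otherwise loop forever.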

Now the next development probably won't impress too many of the viewers watching my channel, but it might have a big impact on the up to 4 billion people who use a Meta product or service. We're talking Instagram, WhatsApp and of course Facebook. Now I'm not going to play you all the promotional materials, but essentially they've got a bunch of celebrities, including MrBeast and quite a few others you can see here, to put their name to a series of 28 AI chatbots.

Now I don't particularly want to speak to any of these people, but I'm sure that millions if not billions of people want to pretend to do so at least. Now remember it was only yesterday that Character AI, which deals in fictional chatbots, was valued at more than $5 billion.

And Zuckerberg is clearly taking aim at Character AI. Here he is in the metaverse with Lex Fridman: "I don't think anyone out there is really doing what we're doing here. I think that there are people who are doing kind of like fictional or consumer oriented character type stuff, but the extent to which we're building it out with the avatars and expressiveness and making it so that they can interact across all the different apps and they'll have profiles."

Let's be honest, if the Apple Vision Pro takes VR mainstream, then a lot of people will want to sit down and have an artificial chat with their favorite celebrity. And it doesn't take much to imagine other use cases. Now one group who can definitely think of some use cases for chatbots is the CIA.

They are apparently getting their own ChatGPT-style tool. Now of course we already know that the CIA monitors data that passes through the US, but there's so much of it that it must have been impossible to sort through. With a large language model, I don't think that's going to be true anymore.

That's going to be useful for catching criminals and the FBI is getting this chatbot too. But it does remind me a lot of this article about large language models being great for state censorship. For similar reasons, it would have been impossible for a country like China to monitor all of its civilian communications.

But now, as the article says, there is no real way to prevent this. It's only a matter of time before well-resourced state actors begin implementing and advancing such systems. But in lighter news, yesterday we had this from The Verge. Jony Ive of Apple fame, together with Sam Altman of OpenAI fame, are coming up with the iPhone of artificial intelligence, fueled by over $1 billion in funding from SoftBank's CEO.

It would be OpenAI's first consumer device and we might even have a few clues about what would distinguish it. Jony Ive has previously said that Apple had a moral responsibility to mitigate the addictive nature of its technology. And according to the Financial Times, the project with OpenAI could allow Ive to create an interactive computing device that's less reliant on screens.

Of course we saw this week that ChatGPT can now take audio and visual input. Now apparently the discussions are said to be serious. So let me know in the comments what you would want out of, let's call it, the Altman phone. But now I want to talk about AutoGen from Microsoft.

Many of you wondered if I'd been following this development and yes, I have been. But rather than just talk about what they claim, I wanted to try it out for myself. I had heard that AutoGen is poised to fundamentally transform and extend what large language models are capable of.

And it would do this through multi-agent conversations, joint chats and agent customization. Think of it as a more sophisticated AutoGPT allowing the easy creation of sub-agents to achieve a goal. One of the agents or models could be an engineer writing code, another one an executor executing code, or a product manager coming up with the implementation plan.

But to be honest I was like, I've heard of this kind of thing before, does it actually work? I discussed use cases with the amazing AI architect Nico Giraud and we came up with this demo: a maths question that GPT-4 with Code Interpreter, or Advanced Data Analysis, almost always gets wrong.

For anyone who's interested, it is in the style of GMAT data sufficiency. AutoGen was able to easily create sub-agents to delegate tasks to, and it could break down these compositional problems in coding or, here for example, math. Now the AutoGen system can be run with a human in the loop, essentially as one of the agents or maybe as the commander, and AutoGen can use tools and execute code.

All these agents are essentially in a group chat working together, chipping in when necessary or when called for by a planner. It got this difficult mathematical problem right three times out of three, and yes, this is just an anecdotal demo. I'm going to be doing much more research on AutoGen and I'm reaching out to some of the authors.
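To illustrate the group chat idea, here is a toy sketch in plain Python. It deliberately does not use the AutoGen library or its API. Every name in it (`Agent`, `GroupChat`, `planner`, `engineer`, `executor`) is a hypothetical stand-in, and the agents are hard-coded functions rather than language models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Message = Tuple[str, str]  # (speaker name, content)


@dataclass
class Agent:
    name: str
    respond: Callable[[List[Message]], str]  # reads shared history, returns a reply


@dataclass
class GroupChat:
    agents: List[Agent]
    # The "planner": returns the name of the next speaker, or None to stop.
    pick_next: Callable[[List[Message]], Optional[str]]
    history: List[Message] = field(default_factory=list)

    def run(self, task: str, max_turns: int = 10) -> List[Message]:
        by_name = {a.name: a for a in self.agents}
        self.history.append(("user", task))
        for _ in range(max_turns):
            speaker = self.pick_next(self.history)
            if speaker is None:
                break
            reply = by_name[speaker].respond(self.history)
            self.history.append((speaker, reply))
        return self.history


# Toy agents: an "engineer" that writes code and an "executor" that runs it.
engineer = Agent("engineer", lambda history: "sum(range(1, 11))")
executor = Agent(
    "executor",
    # Evaluates the engineer's last message with only sum and range available.
    lambda history: str(
        eval(history[-1][1], {"__builtins__": {}}, {"sum": sum, "range": range})
    ),
)


def planner(history: List[Message]) -> Optional[str]:
    # Fixed hand-off order: user -> engineer -> executor -> stop.
    last_speaker = history[-1][0]
    return {"user": "engineer", "engineer": "executor"}.get(last_speaker)


log = GroupChat([engineer, executor], planner).run("Add the numbers 1 to 10.")
for speaker, content in log:
    print(f"{speaker}: {content}")
```

A human in the loop would just be one more `Agent` whose `respond` function asks a person for input. A real planner would be a model choosing the next speaker rather than a fixed lookup table.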

I'm going to be collecting the results I found in my previous video on reasoning in LLMs. I must say that AutoGen made me think again about this recent tweet from Sam Altman. He said, "short timelines and slow takeoff will be a pretty good call I think." Short timelines means artificial general intelligence not being several decades away, and slow takeoff means no exponential self-improvement on the scale of days, weeks or months when AGI comes out.

But the quote continues, "but the way people define the start of the takeoff may make it seem otherwise." I read that as saying that if you put too strict a definition on what AGI means, if you continually move the goalposts such that AGI means the exact same thing as superintelligence, being smarter than all humans combined, well, under that definition, the moment AGI arrives things could start changing extremely rapidly.

But if we stop moving the goalposts and admit that things like AutoGen could be pretty radical now, and that according to definitions from, say, 10 years ago many people would argue we have AGI now, well, in that scenario you could argue we might be in a fairly slow takeoff that could be measured in years, potentially even more than a decade. Now of course I might be misinterpreting his words.

But I know that Philip from 10 years ago would have looked at GPT-4 with vision and AutoGen and been absolutely gobsmacked. And I do think that some combination of the techniques that we have available today, together with scaling, will bring us to AGI. That's why I would agree with the majority of forecasters on Metaculus that the first AGI will be based on deep learning.

It's free to sign up to Metaculus using the link in the description. You get the double benefit of seeing what other people are predicting and the chance to put your forecast in writing. That way, when you're right, you get to boast in my comments about how right you were.

Thanks as always to Metaculus for sponsoring the video. And I wonder if the next video will be about AGI, and if the next development is going to change any of these predictions. That development is the release of Mistral 7B, the 7B standing for 7 billion parameters. Now apparently it outperforms the 13-billion-parameter Llama 2 on all benchmarks.

And based on my limited tests with Perplexity Chat, where you can pick the Mistral 7 billion model, I can roughly believe that that is the performance level. It does make plenty of mistakes, of course. Here, though, you can see it beating a range of Llama models, including Llama 2 13 billion, on a range of benchmarks.

As I've pointed out several times on the channel, there are problems with the benchmarks, though, and we'll have to wait and see about data contamination. Now Mistral does admit that there are no moderation mechanisms. I tried pretty much any example you can think of, and it always helped out.

For pretty obvious reasons, I can't even demonstrate the kind of things that I asked and that it happily replied to. But remember, this is only their teaser model. A much larger model is coming, and I'm sure it will be more capable. And Mistral 7B is released under the Apache 2.0 license, which is extremely permissive.

When the Mistral models are further fine-tuned, I'm sure we are going to see more benchmarks broken. But I would ask the question: does this constitute a race to the bottom, in which the company that spends the least amount of money on protections wins out? Of course, because it's only 7 billion parameters, it's not capable of that much. But a future Mistral 70 billion or 700 billion parameter model might make headlines for good and bad reasons.

But smaller models do seem to be the trend, with even Microsoft looking to downsize. As The Information put it, Microsoft is trying to lessen its addiction to OpenAI as AI costs soar. The article talks about how Microsoft researchers are making distilled models that mimic larger models like GPT-4 but are smaller and cost far less to operate. Notably, they mention Orca and the Phi series of models. I am proud to have been one of the first people to cover the Orca model.

And my video on it, called "The model few people saw coming", was retweeted by the lead author of Orca. Of course, I also interviewed one of the lead creators of the Phi series of models in a separate video. As the article points out, the fine-tuned Orca model performs much better than the base model of Llama 2.

And as I remember the Orca paper pointing out, it performs nearly as well as GPT-4 on certain tasks. In fact, now I speak of it, I remember predicting at the end of the Orca video, and you can go and watch it, something like: I wonder if Microsoft is funding Orca to see if it can substitute for GPT-4 or the next GPT-5.

Actually, now I look back, that was a pretty amazing prediction. Indeed, it looks like Microsoft might well substitute Orca for GPT-4 for certain use cases, particularly as it uses less than a tenth of the computing power that GPT-4 uses.

But speaking of being ahead of the curve, here is GPT-Fathom. This is a fascinating paper that I don't think anyone has talked about. Of course, that's probably because it came out yesterday. But nevertheless, it achieves part of what I talked about toward the end of my SmartGPT video: systematically and fairly evaluating the leading LLMs on a range of benchmarks, not this patchwork of different conditions, benchmarks, settings and models that we currently have. I'm hoping to speak to one of the lead authors of this paper. But for now, let me just give you a taste of the highlights.

First we have this epic chart. I want you to focus on one part in particular. At the top right you can see the 1st of March version of GPT-3.5 and then the 13th of June version. For GPT-4 we have the 14th of March version and the 13th of June version.

And the paper later describes the seesaw progress. It's not a straight line upwards. In other words, the conspiracy that ChatGPT is getting dumber is, in some areas, actually correct. Take the first row, which is Natural Questions. Performance went down slightly between these two versions. And then for TriviaQA, for GPT-4, it went down slightly.

And you can look across the benchmarks to see the same things happening. And this is not just noise, as the paper later points out. The June version of GPT-3.5 saw its score on the MATH benchmark dramatically degrade from 32.0 to 15.0. That's compared to the March 1st version.
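To put a swing like that in perspective, here is a small Python sketch that expresses a before-and-after score both in points and as a relative change. The helper name `score_change` is my own, not something from the paper, and the only figures used are the two quoted above.

```python
from typing import Tuple


def score_change(before: float, after: float) -> Tuple[float, float]:
    """Return (change in points, relative change in percent)."""
    points = after - before
    relative = 100.0 * points / before
    return points, relative


# GPT-3.5 on MATH, March version vs June version, as quoted above.
points, relative = score_change(32.0, 15.0)
print(f"{points:+.1f} points ({relative:+.1f}%)")  # -17.0 points (-53.1%)
```

So the June version lost a little over half of the March version's score on that benchmark. The same helper works for any other pair of scores.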

In contrast, on the DROP benchmark, GPT-4's performance increased from 78.7 to 87.2 between the two versions, which is on par with state of the art. However, between those same two versions, its score on a different benchmark plummeted from 82.2 to 68.7. This is not just noise, and OpenAI admits that when they release a new model, while the majority of metrics might improve, there may be some tasks where the performance gets worse.

Which brings me to the web-based version of GPT-4 and ChatGPT. The paper notes that the dated API models, that's those with the four numbers at the end, consistently perform slightly better than their front-end counterparts, i.e. the web version of the models. To be honest, I had noticed dissimilarities when benchmarking SmartGPT.

They do say that Code Interpreter, now called Advanced Data Analysis, has significantly improved the coding benchmark performance, getting 85.2 pass@1 on HumanEval. So you're not hallucinating when you see differences between the web GPT models and the API models.

Anyway, there is so much more from this paper that I want to cover, preferably after I've spoken to one of the lead authors. But for now the video is long enough. We have covered some pretty epic topics today. If you enjoyed or learnt anything from this overview, please do let me know in the comments.

Shameless plug at this point for my Patreon, which is also linked in the description. And as ever, thank you so much for watching to the end. And have a wonderful day.