An Actually Big Week in AI: AutoGen, The A-Phone, Mistral 7B, GPT-Fathom and Meta Hunts CharacterAI
Chapters
0:00
0:35 GPT Vision Use Cases
1:32 Meta AI to 4 Billion People?
3:00 CIA-Bot
3:48 The Altman Phone
4:47 AutoGen
8:26 Mistral 7B
9:58 Orca @ MSFT
11:20 GPT-Fathom
00:00:00.000 |
Many times we hear of AI news that we only realize was significant in hindsight. 00:00:05.800 |
This week came developments that are big from any perspective. 00:00:10.320 |
I don't just mean the whole new category of use cases for GPT vision, 00:00:15.020 |
or meta bringing language models to billions of people. 00:00:18.940 |
There's also Autogen as the new AutoGPT, what I'm calling the Altman phone, 00:00:26.980 |
Orca potentially replacing OpenAI models at Microsoft, 00:00:30.420 |
and yesterday's fascinating GPT Fathom paper. 00:00:34.340 |
I'm going to start with a use case for GPT-4 vision that I didn't see coming. 00:00:38.920 |
If you spot a user interface online that you like, you can try to recreate it with GPT-4. 00:00:45.340 |
As @Scorano demonstrated on X, you can ask it to imitate the layout and give you the HTML code. 00:00:52.360 |
Now I know many of you will say that doesn't look too impressive, but 00:00:58.480 |
GPT-4 with vision can iterate on its designs. 00:01:02.160 |
Because it can see its own output, it can recognize flaws and improve on them. 00:01:06.720 |
This is from Matt Shumer, CEO of HyperWrite AI. 00:01:10.520 |
You can see it trying to design that futuristic Google homepage and improving with each generation. 00:01:16.760 |
We've gone beyond text feedback now into a visual feedback loop. 00:01:20.880 |
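The visual feedback loop can be sketched with the OpenAI chat-completions API, which accepts images as base64 data URLs. This is a minimal sketch, not anything shown in the video: `build_vision_request` is my own helper name, and I'm assuming the `gpt-4-vision-preview` model identifier from around this period.

```python
import base64

def build_vision_request(screenshot_png: bytes, goal: str) -> dict:
    """Build a chat-completions payload that shows the model a screenshot
    of its own output and asks it to critique and regenerate the HTML."""
    b64 = base64.b64encode(screenshot_png).decode()
    return {
        "model": "gpt-4-vision-preview",  # assumed model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"This is a screenshot of my current page. Goal: {goal}. "
                         "List the visual flaws, then output improved HTML."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 2000,
    }

# Each loop iteration: render the HTML (e.g. with a headless browser),
# screenshot it, then send the screenshot back:
#   resp = client.chat.completions.create(**build_vision_request(shot, goal))
```

The key point is that the model's own rendered output goes back in as an image, so the critique is visual rather than textual.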
I honestly wonder if someone is going to do something similar for DALL-E 3, 00:01:26.520 |
where you continue to iterate outputs until it matches your prompt perfectly. 00:01:31.000 |
Now the next development probably won't impress too many of the viewers watching my channel, 00:01:35.680 |
but it might have a big impact on the up to 4 billion people who use a meta product or service. 00:01:42.680 |
We're talking Instagram, WhatsApp and of course Facebook. 00:01:45.760 |
Now I'm not going to play you all the promotional materials, 00:01:48.280 |
but essentially they've got a bunch of celebrities, including MrBeast and quite a few others you can see here, lending their likenesses to AI chatbot characters. 00:01:58.120 |
Now I don't particularly want to speak to any of these people, 00:02:00.560 |
but I'm sure that millions if not billions of people want to pretend to do so at least. 00:02:06.040 |
Now remember, it was only yesterday that Character.AI, which deals in fictional chatbots, was valued at more than $5 billion. 00:02:14.120 |
And Zuckerberg is clearly taking aim at Character AI. 00:02:18.160 |
Here he is in the metaverse with Lex Fridman. 00:02:21.680 |
I don't think anyone out there is really doing... 00:02:27.080 |
I think that there are people who are doing kind of like fictional or consumer oriented character type stuff, 00:02:32.880 |
but the extent to which we're building it out with the avatars and expressiveness 00:02:39.680 |
and making it so that they can interact across all the different apps and they'll have profiles. 00:02:45.520 |
Let's be honest, if the Apple Vision Pro takes VR mainstream, 00:02:49.720 |
then a lot of people will want to sit down and have an artificial chat with their favorite celebrity. 00:02:55.840 |
And it doesn't take much to imagine other use cases. 00:02:59.000 |
Now one group who can definitely think of some use cases for chatbots is the CIA. 00:03:04.360 |
They are apparently getting their own ChatGPT-style tool. 00:03:08.280 |
Now of course we already know that the CIA monitors data that passes through the US, 00:03:13.360 |
but there's so much of it that it must have been impossible to sort through. 00:03:17.200 |
With a large language model, I don't think that's going to be true anymore. 00:03:20.280 |
That's going to be useful for catching criminals and the FBI is getting this chatbot too. 00:03:25.600 |
But it does remind me a lot of this article about large language models being great for state censorship. 00:03:32.480 |
For similar reasons, it would have been impossible for a country like China to monitor all of its civilian communications. 00:03:39.280 |
But now as the article says, there is no real way to prevent this. 00:03:43.280 |
It's only a matter of time before well-resourced state actors begin implementing and advancing such systems. 00:03:48.960 |
But in lighter news, yesterday we had this from The Verge. 00:03:52.560 |
Jony Ive of Apple fame, together with Sam Altman of 00:03:55.360 |
OpenAI fame, are coming up with the iPhone of Artificial Intelligence. 00:04:00.720 |
Fueled by over $1 billion in funding from SoftBank CEO Masayoshi Son. 00:04:04.960 |
It would be OpenAI's first consumer device and we might even have a few clues about what would distinguish it. 00:04:12.400 |
Jony Ive has previously said that Apple had a moral responsibility to mitigate the addictive nature of its technology. 00:04:20.240 |
And according to the Financial Times, the project with OpenAI could allow 00:04:25.120 |
Ive to create an interactive computing device that's less reliant on screens. 00:04:30.800 |
Of course we saw this week that ChatGPT can now take audio and visual input. 00:04:35.360 |
Now apparently the discussions are said to be serious. 00:04:38.720 |
So let me know in the comments what you would want out of let's call it the Altman phone. 00:04:43.760 |
But now I want to talk about AutoGen from Microsoft. 00:04:47.520 |
Many of you wondered if I'd been following this development, and yes I have been. 00:04:51.520 |
But rather than just talk about what they claim, I wanted to try it out myself. 00:04:56.000 |
I had heard that AutoGen is poised to fundamentally transform and extend what large language models are capable of. 00:05:02.960 |
And it would do this through multi-agent conversations, joint chats and agent customization. 00:05:08.880 |
Think of it as a more sophisticated AutoGPT allowing the easy creation of sub-agents to achieve a goal. 00:05:16.160 |
One of the agents or models could be an engineer writing code, another one an executor executing code, or a product manager coming up with the plan. 00:05:25.440 |
But to be honest I was like I've heard of this kind of thing before, does it actually work? 00:05:29.600 |
I discussed use cases with the amazing AI architect Nico Giraud and we came up with this demo. 00:05:35.520 |
A maths question that GPT-4 with Code Interpreter, or Advanced Data Analysis, almost always gets wrong. 00:05:41.520 |
For anyone who's interested, it's in the style of GMAT data sufficiency. 00:05:45.440 |
AutoGen was able to easily create sub-agents to delegate tasks to, and it could break down these 00:05:51.120 |
compositional problems in coding or, here for example, maths. 00:05:54.400 |
Now the AutoGen system can be run with a human in the loop, essentially as one of the agents or maybe as the commander, and AutoGen can use tools and execute code. 00:06:04.880 |
All these agents are essentially in a group chat working together chipping in when necessary or when called for by a planner. 00:06:11.440 |
It got this difficult mathematical problem right three times out of three and yes this is just an anecdotal demo. 00:06:17.840 |
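The group-chat pattern described above can be illustrated with a toy, dependency-free mock. To be clear, this is not the real AutoGen API (which uses classes like `AssistantAgent`, `UserProxyAgent` and `GroupChat` from the `pyautogen` package); every name here is my own invention, just to show the shape of the delegation loop: shared messages, a planner picking the next speaker, an engineer writing code and an executor running it.

```python
class Agent:
    def __init__(self, name, act):
        self.name, self.act = name, act   # act: message history -> reply

class ToyGroupChat:
    """All agents share one message list; a planner picks who speaks next."""
    def __init__(self, agents, pick_next):
        self.agents = {a.name: a for a in agents}
        self.pick_next = pick_next
        self.messages = []

    def run(self, task, max_turns=6):
        self.messages.append(("user", task))
        for _ in range(max_turns):
            speaker = self.pick_next(self.messages)
            if speaker is None:           # planner decides we're done
                break
            reply = self.agents[speaker].act(self.messages)
            self.messages.append((speaker, reply))
        return self.messages[-1][1]

# The "engineer" writes code for the task; the "executor" runs it and
# reports the computed value back to the chat.
engineer = Agent("engineer", lambda msgs: "result = sum(range(1, 101))")
executor = Agent("executor", lambda msgs: str(
    (lambda env: (exec(msgs[-1][1], env), env["result"])[1])({})))

def planner(msgs):
    # user -> engineer -> executor -> stop
    return {"user": "engineer", "engineer": "executor"}.get(msgs[-1][0])

chat = ToyGroupChat([engineer, executor], planner)
answer = chat.run("Sum the integers 1..100")   # -> "5050"
```

In real AutoGen the planner and agents are LLM-backed and code execution is sandboxed, but the control flow is essentially this loop.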
I'm going to be doing much more research on AutoGen, and I'm reaching out to some of the authors. 00:06:24.160 |
I'm going to be combining the results with those I found in my previous video on reasoning in LLMs. 00:06:29.200 |
I must say that AutoGen made me think again about this recent tweet from Sam Altman. 00:06:34.160 |
He said "short timelines and slow takeoff will be a pretty good call I think." 00:06:39.200 |
Short timelines meaning artificial general intelligence not being several decades away. 00:06:43.680 |
And slow takeoff as in not exponential self-improvement on the scale of days, 00:06:53.920 |
though, as he added, "the way people define the start of the takeoff may make it seem otherwise." 00:06:58.000 |
I read that as saying that if you put too strict a definition on what AGI means, 00:07:02.960 |
if you continually move the goalposts such that AGI means the exact same thing as superintelligence, 00:07:11.120 |
well, under that definition, the moment AGI arrives things could start changing extremely rapidly. 00:07:17.120 |
But if we stop moving the goalposts and admit that things like AutoGen could be pretty radical now, 00:07:23.680 |
that according to definitions from say 10 years ago many people would argue we have AGI now, 00:07:28.720 |
well, in that scenario you could argue we might be in a fairly slow takeoff that could be measured in years. 00:07:36.720 |
Now of course I might be misinterpreting his words. 00:07:38.800 |
But I know that Philip from 10 years ago would have looked at GPT-4 with vision and AutoGen and called them AGI. 00:07:47.040 |
And I do think that some combination of the techniques that we have available today, together with scaling, could get us there. 00:07:54.880 |
That's why I would agree with the majority of forecasters on Metaculus 00:07:59.040 |
that the first AGI will be based on deep learning. 00:08:02.400 |
It's free to sign up to Metaculus using the link in the description, 00:08:06.400 |
and you get the double benefit of seeing what other people are predicting 00:08:10.560 |
and also the chance to put your forecast in writing. 00:08:14.240 |
That way, when you're right, you get to boast in my comments about how right you were. 00:08:18.960 |
Thanks as always to Metaculus for sponsoring the video. 00:08:22.000 |
And I wonder if my next video will be about AGI, 00:08:23.200 |
and if the next development is going to change any of these predictions. 00:08:26.320 |
That development is the release of Mistral 7B, as in 7 billion parameters. 00:08:31.840 |
Now apparently it outperforms Llama 2 13B on all benchmarks. 00:08:38.480 |
And based on my limited tests with Perplexity Chat. 00:08:41.360 |
Where you can pick the Mistral 7 billion model. 00:08:44.240 |
I can roughly believe that that is the performance level. 00:08:49.120 |
Here though you can see it beating a range of Llama models. 00:08:56.960 |
As I've pointed out several times on the channel. 00:08:59.200 |
There are problems with the benchmarks though. 00:09:01.440 |
And we'll have to wait and see about data contamination. 00:09:04.800 |
Now Mistral does admit that there are no moderation mechanisms. 00:09:08.880 |
And I tried pretty much any example you can think of; 00:09:15.280 |
I can't even demonstrate the kinds of things that I asked. 00:09:19.600 |
But remember this is only their teaser model. 00:09:26.320 |
And it's released under the Apache 2.0 license. 00:09:31.200 |
When the Mistral models are further fine-tuned, 00:09:34.080 |
I'm sure we are going to see more benchmarks broken, though this risks becoming a race to the bottom 00:09:41.040 |
in which the company that spends the least amount of money on protections wins out. 00:09:45.600 |
Of course, because it's only 7 billion parameters, 00:09:54.960 |
it might make headlines for good and bad reasons. 00:10:05.120 |
Microsoft is trying to lessen its addiction to OpenAI. 00:10:09.840 |
The article talks about how Microsoft researchers are making distilled models 00:10:16.880 |
that imitate larger models but are smaller and cost far less to operate. 00:10:19.680 |
Notably they mention Orca and the Phi series. 00:10:23.040 |
I am proud to have been one of the first people to cover the Orca model. 00:10:32.720 |
Of course I also interviewed one of the lead creators. 00:10:40.160 |
The fine-tuned Orca model performs much better than the base model of Llama 2. 00:10:45.200 |
And as I remember the Orca paper pointing out. 00:10:47.840 |
It performs nearly as well as GPT-4 on certain tasks. 00:10:53.440 |
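The Orca recipe specifically fine-tunes a small model on explanation traces generated by GPT-4, but the general idea of distillation, training a small student to match a large teacher's softened output distribution, can be shown with a toy objective. This is a generic sketch of the classic distillation loss, not Orca's actual training code:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; higher temperature softens them."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions: the student is
    penalised for assigning low probability where the teacher is confident."""
    p = softmax(teacher_logits, temperature)   # teacher's soft targets
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]   # a confident teacher over three tokens
aligned = [3.8, 1.1, 0.1]   # student that mimics the teacher
wrong   = [0.2, 1.0, 4.0]   # student that disagrees

# The loss rewards mimicry: the aligned student scores far lower.
assert distillation_loss(teacher, aligned) < distillation_loss(teacher, wrong)
```

Minimising this loss over many examples is what transfers the teacher's behaviour into a model a fraction of the size, which is exactly why such models cost far less to operate.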
I remember predicting at the end of the Orca video 00:11:01.440 |
that Microsoft would test it to see if it can substitute for GPT-4 or the next GPT-5. 00:11:08.640 |
Indeed, it looks like Microsoft might well substitute Orca for GPT-4, 00:11:14.160 |
particularly as it uses less than a tenth of the computing power that GPT-4 uses. 00:11:25.760 |
You may not have heard of the GPT-Fathom paper; of course, that's probably because it came out yesterday. 00:11:33.600 |
It's about systematically and fairly evaluating the leading LLMs. 00:11:43.760 |
I'm hoping to speak to one of the lead authors of this paper. 00:11:46.560 |
But for now let me just give you a taste of the highlights. 00:11:51.520 |
I want you to focus on one part in particular. 00:11:54.080 |
At the top right you can see the 1st of March version of GPT-3.5. 00:12:04.880 |
And the paper later describes the seesaw pattern of progress, 00:12:09.360 |
in other words the conspiracy theory that ChatGPT is getting dumber. 00:12:15.440 |
Take the first row which is natural questions. 00:12:18.160 |
Performance went down slightly between these two versions. 00:12:25.200 |
And you can look across the benchmarks to see the same things happening. 00:12:28.720 |
And this is not just noise as the paper later points out. 00:12:31.680 |
The June version of GPT-3.5 saw its score on the MATH benchmark dramatically degrade from 32.0 to 15.0. 00:12:44.720 |
Meanwhile, GPT-4's performance on one benchmark increases from 78.7 to 87.2, 00:12:55.520 |
while its score on a different benchmark plummeted from 82.2 to 68.7. 00:13:01.920 |
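The seesaw pattern is easy to see if you line up the quoted scores and take the deltas. The numbers are the ones quoted here; the benchmark names other than MATH are placeholders, since I only gave the scores:

```python
# (March score, June score) for each model/benchmark pair quoted above.
scores = {
    "GPT-3.5 / MATH":      (32.0, 15.0),
    "GPT-4 / benchmark A": (78.7, 87.2),
    "GPT-4 / benchmark B": (82.2, 68.7),
}

def seesaw_report(scores):
    """Return the June-minus-March delta per benchmark; mixed signs
    across rows are the 'seesaw': progress in one place, regression in another."""
    return {name: round(june - march, 1) for name, (march, june) in scores.items()}

deltas = seesaw_report(scores)
# {'GPT-3.5 / MATH': -17.0, 'GPT-4 / benchmark A': 8.5, 'GPT-4 / benchmark B': -13.5}
assert any(d > 0 for d in deltas.values()) and any(d < 0 for d in deltas.values())
```

The point of the exercise: no single "is it getting dumber?" answer exists, because the same model update moves different benchmarks in opposite directions.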
And OpenAI admits that when they release a new model, 00:13:07.680 |
There may be some tasks where the performance gets worse. 00:13:10.560 |
Which brings me to the web-based versions of GPT-4 and ChatGPT. 00:13:18.480 |
The dated API models, that's those with the four numbers at the end, 00:13:20.800 |
consistently perform slightly better than their web front-end counterparts. 00:13:27.040 |
To be honest, I had noticed dissimilarities when benchmarking SmartGPT. 00:13:35.520 |
The June update, for example, significantly improved coding benchmark performance. 00:13:42.320 |
So you're not hallucinating when you see differences between the web GPT models. 00:13:48.320 |
Anyway, there is so much more from this paper that I'll save for a future video, 00:13:51.840 |
preferably after I've spoken to one of the lead authors. 00:13:57.360 |
We have covered some pretty epic topics today. 00:14:00.320 |
If you enjoyed or learnt anything from this overview, please do let me know in the comments.