
An Actually Big Week in AI: AutoGen, The A-Phone, Mistral 7B, GPT-Fathom and Meta Hunts CharacterAI


Chapters

0:00
0:35 GPT Vision Use Cases
1:32 Meta AI to 4 Billion People?
3:00 CIA-Bot
3:48 The Altman Phone
4:47 AutoGen
8:26 Mistral 7B
9:58 Orca @ MSFT
11:20 GPT-Fathom

Whisper Transcript | Transcript Only Page

00:00:00.000 | Many times we hear of AI news that we only realize was significant in hindsight.
00:00:05.800 | This week came developments that are big from any perspective.
00:00:10.320 | I don't just mean the whole new category of use cases for GPT vision,
00:00:15.020 | or meta bringing language models to billions of people.
00:00:18.940 | There's also Autogen as the new AutoGPT, what I'm calling the Altman phone,
00:00:24.400 | Mistral's new 7 billion parameter model,
00:00:26.980 | Orca potentially replacing OpenAI models at Microsoft,
00:00:30.420 | and yesterday's fascinating GPT Fathom paper.
00:00:34.340 | I'm going to start with a use case for GPT-4 vision that I didn't see coming.
00:00:38.920 | If you spot a user interface online that you like, you can try to recreate it with GPT-4.
00:00:45.340 | As Pietro Schirano demonstrated on X, you can ask it to imitate the layout and give you the HTML code.
00:00:52.360 | Now I know many of you will say that doesn't look too impressive,
00:00:55.760 | but wait until you...
00:00:56.760 | see the next demo.
00:00:58.480 | GPT-4 with vision can iterate on its designs.
00:01:02.160 | Because it can see its own output, it can recognize flaws and improve on them.
00:01:06.720 | This is from Matt Shumer, CEO of HyperWrite AI.
00:01:10.520 | You can see it trying to design that futuristic Google homepage and improving with each generation.
00:01:16.760 | We've gone beyond text feedback now into a visual feedback loop.
00:01:20.880 | I honestly wonder if someone is going to do something similar for DALL-E 3.
00:01:25.020 | Use GPT-4 vision to...
00:01:26.520 | continue to iterate outputs until it matches your prompt perfectly.
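That visual feedback loop can be sketched in a few lines. This is a toy illustration with stubbed model calls, not anyone's actual implementation; a real version would swap `generate` and `critique` for calls to an image model and a vision-capable model:

```python
# Toy sketch of a visual feedback loop: a "designer" produces output,
# a "critic" (standing in for GPT-4 with vision looking at a render)
# scores it and suggests a fix, and the loop keeps the best attempt.
# Both model calls are stubs invented for illustration.

def generate(prompt, feedback=None):
    """Stub designer: emits a design, folding in any feedback."""
    design = "<html><!-- %s -->" % prompt
    if feedback:
        design += "<!-- revised per: %s -->" % feedback
    return design + "</html>"

def critique(design):
    """Stub critic: returns a score and a suggested improvement."""
    score = 10 if "revised" in design else 4
    return score, "center the search box"

def refine(prompt, rounds=3):
    """Generate, critique, and regenerate, keeping the best design."""
    feedback = None
    best, best_score = "", -1
    for _ in range(rounds):
        design = generate(prompt, feedback)
        score, feedback = critique(design)
        if score > best_score:
            best, best_score = design, score
    return best
```

Because the critic sees the output of each round, flaws get folded back into the next generation, which is exactly the loop in the demo.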
00:01:31.000 | Now the next development probably won't impress too many of the viewers watching my channel,
00:01:35.680 | but it might have a big impact on the up to 4 billion people who use a meta product or service.
00:01:42.680 | We're talking Instagram, WhatsApp and of course Facebook.
00:01:45.760 | Now I'm not going to play you all the promotional materials,
00:01:48.280 | but essentially they've got a bunch of celebrities including Mr Beast and quite a few others you can see here,
00:01:53.640 | to put their name to a series of...
00:01:56.320 | 28 AI chatbots.
00:01:58.120 | Now I don't particularly want to speak to any of these people,
00:02:00.560 | but I'm sure that millions if not billions of people want to pretend to do so at least.
00:02:06.040 | Now remember it was only yesterday that Character AI, which deals in fictional chatbots, was valued at more than $5 billion.
00:02:14.120 | And Zuckerberg is clearly taking aim at Character AI.
00:02:18.160 | Here he is in the metaverse with Lex Fridman.
00:02:21.680 | I don't think anyone out there is really doing...
00:02:26.080 | what we're doing here.
00:02:27.080 | I think that there are people who are doing kind of like fictional or consumer oriented character type stuff,
00:02:32.880 | but the extent to which we're building it out with the avatars and expressiveness
00:02:39.680 | and making it so that they can interact across all the different apps and they'll have profiles.
00:02:45.520 | Let's be honest, if the Apple Vision Pro takes VR mainstream,
00:02:49.720 | then a lot of people will want to sit down and have an artificial chat with their favorite celebrity.
00:02:55.840 | And it doesn't take much to imagine other use cases.
00:02:59.000 | Now one group who can definitely think of some use cases for chatbots is the CIA.
00:03:04.360 | They are apparently getting their own ChatGPT-style tool.
00:03:08.280 | Now of course we already know that the CIA monitors data that passes through the US,
00:03:13.360 | but there's so much of it that it must have been impossible to sort through.
00:03:17.200 | With a large language model, I don't think that's going to be true anymore.
00:03:20.280 | That's going to be useful for catching criminals and the FBI is getting this chatbot too.
00:03:25.600 | But it does remind me a lot of this article about large language models being great for state censorship.
00:03:32.480 | For similar reasons, it would have been impossible for a country like China to monitor all of its civilian communications.
00:03:39.280 | But now as the article says, there is no real way to prevent this.
00:03:43.280 | It's only a matter of time before well-resourced state actors begin implementing and advancing such systems.
00:03:48.960 | But in lighter news, yesterday we had this from The Verge.
00:03:52.560 | Jony Ive of Apple fame, together with Sam Altman of
00:03:55.360 | OpenAI fame, are coming up with the iPhone of Artificial Intelligence.
00:04:00.720 | Fueled by over $1 billion in funding from the SoftBank CEO.
00:04:04.960 | It would be OpenAI's first consumer device and we might even have a few clues about what would distinguish it.
00:04:12.400 | Jony Ive has previously said that Apple had a moral responsibility to mitigate the addictive nature of its technology.
00:04:20.240 | And according to the Financial Times, the project with OpenAI could allow
00:04:25.120 | Ive to create an interactive computing device that's less reliant on screens.
00:04:30.800 | Of course we saw this week that ChatGPT can now take audio and visual input.
00:04:35.360 | Now apparently the discussions are said to be serious.
00:04:38.720 | So let me know in the comments what you would want out of let's call it the Altman phone.
00:04:43.760 | But now I want to talk about Autogen from Microsoft.
00:04:47.520 | Many of you wondered if I'd been following this development and yes I have been.
00:04:51.520 | But rather than just talk about what they claim, I wanted to try
00:04:54.880 | it out for myself.
00:04:56.000 | I had heard that Autogen is poised to fundamentally transform and extend what large language models are capable of.
00:05:02.960 | And it would do this through multi-agent conversations, joint chats and agent customization.
00:05:08.880 | Think of it as a more sophisticated AutoGPT allowing the easy creation of sub-agents to achieve a goal.
00:05:16.160 | One of the agents or models could be an engineer writing code, another one an executor executing code, or a product manager coming up with the
00:05:24.640 | implementation plan.
00:05:25.440 | But to be honest I was like I've heard of this kind of thing before, does it actually work?
00:05:29.600 | I discussed use cases with the amazing AI architect Nico Giraud and we came up with this demo.
00:05:35.520 | A maths question that GPT-4 with Code Interpreter, or Advanced Data Analysis, almost always gets wrong.
00:05:41.520 | For anyone who's interested, it's in the style of GMAT data sufficiency.
00:05:45.440 | Autogen was able to easily create sub-agents to delegate tasks to, and it could break down these
00:05:51.120 | compositional problems in coding or, here for example, maths.
00:05:54.400 | Now the Autogen system can be run with a human in the loop, essentially as one of the agents or maybe as the commander, and Autogen can use tools and execute code.
00:06:04.880 | All these agents are essentially in a group chat, working together, chipping in when necessary or when called for by a planner.
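As a rough illustration of that group-chat pattern, here is a toy sketch in plain Python. It is not AutoGen's actual API; the agent roles and the planner's routing rule are invented for illustration:

```python
# Toy group chat: a planner picks which agent speaks next, and every
# agent sees the full shared history. Illustrates the pattern only;
# the real AutoGen library has its own agent classes and API.

def engineer(history):
    return "engineer: drafted code for -> " + history[0]

def executor(history):
    return "executor: ran the code, output checks out"

def planner(history):
    """Routing rule (invented): engineer speaks first, then executor."""
    return "engineer" if len(history) == 1 else "executor"

def group_chat(task, agents, turns=2):
    """Run the task through the group chat for a fixed number of turns."""
    history = [task]
    for _ in range(turns):
        speaker = planner(history)
        history.append(agents[speaker](history))
    return history
```

A human in the loop would just be another entry in the `agents` dict whose function asks for input instead of calling a model.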
00:06:11.440 | It got this difficult mathematical problem right three times out of three and yes this is just an anecdotal demo.
00:06:17.840 | I'm going to be doing much more research on Autogen and I'm reaching out to some of the authors.
00:06:24.160 | I'm going to be collecting the results I found in my previous video on reasoning in LLMs.
00:06:29.200 | I must say that Autogen made me think again about this recent tweet from Sam Altman.
00:06:34.160 | He said "short timelines and slow takeoff will be a pretty good call I think."
00:06:39.200 | Short timelines meaning artificial general intelligence not being several decades away.
00:06:43.680 | And slow takeoff as in not exponential self-improvement on the scale of days,
00:06:49.520 | weeks or months when AGI comes out.
00:06:52.000 | But the quote continues "but the way
00:06:53.920 | people define the start of the takeoff may make it seem otherwise."
00:06:58.000 | I read that as saying: if you put too strict a definition on what AGI means,
00:07:02.960 | if you continually move the goalposts such that AGI means the exact same thing as superintelligence,
00:07:08.960 | being smarter than all humans combined.
00:07:11.120 | Well under that definition the moment AGI arrives things could start changing extremely rapidly.
00:07:17.120 | But if we stop moving the goalposts and admit that things like Autogen could be pretty radical now,
00:07:23.680 | then according to definitions from, say, 10 years ago, many people would argue we have AGI now.
00:07:28.720 | Well in that scenario you could argue we might be in a fairly slow takeoff that could be measured in years.
00:07:34.720 | Potentially even more than a decade.
00:07:36.720 | Now of course I might be misinterpreting his words.
00:07:38.800 | But I know that Philip from 10 years ago would have looked at GPT-4 with vision and Autogen.
00:07:44.800 | And been absolutely gobsmacked.
00:07:47.040 | And I do think that some combination of the techniques that we have available today together with scaling.
00:07:53.440 | Will bring us to AGI.
00:07:54.880 | That's why I would agree with the majority of forecasters on Metaculus
00:07:59.040 | that the first AGI will be based on deep learning.
00:08:02.400 | It's free to sign up to Metaculus using the link in the description,
00:08:06.400 | and you get the double benefit of seeing what other people are predicting,
00:08:10.560 | and also you get the chance to put your forecast in writing.
00:08:14.240 | That way when you're right you get to boast in my comments about how right you were.
00:08:18.960 | Thanks as always to Metaculus for sponsoring the video.
00:08:22.000 | And I wonder if the next video will be about AGI,
00:08:23.200 | and if the next development is going to change any of these predictions.
00:08:26.320 | That development is the release of Mistral 7B, as in 7 billion parameters.
00:08:31.840 | Now apparently it outperforms Llama 2 13 billion parameters on all benchmarks.
00:08:38.480 | And based on my limited tests with Perplexity Chat,
00:08:41.360 | where you can pick the Mistral 7 billion model,
00:08:44.240 | I can roughly believe that that is the performance level.
00:08:47.120 | It does make plenty of mistakes of course.
00:08:49.120 | Here though you can see it beating a range of Llama models, including
00:08:52.960 | Llama 2 13 billion, on a range of benchmarks.
00:08:56.960 | As I've pointed out several times on the channel,
00:08:59.200 | there are problems with the benchmarks though,
00:09:01.440 | and we'll have to wait and see about data contamination.
00:09:04.800 | Now Mistral does admit that there are no moderation mechanisms,
00:09:08.880 | and I tried pretty much any example you can think of,
00:09:12.160 | and it always helped out.
00:09:13.600 | For pretty much obvious reasons,
00:09:15.280 | I can't even demonstrate the kind of things that I asked
00:09:18.080 | and it happily replied to.
00:09:19.600 | But remember this is only their teaser model.
00:09:22.000 | A much larger model is coming,
00:09:22.720 | and I'm sure it will be more capable.
00:09:26.320 | And it's released under the Apache 2.0 license,
00:09:29.280 | which is extremely permissive.
00:09:31.200 | When the Mistral models are further fine-tuned,
00:09:34.080 | I'm sure we are going to see more benchmarks broken.
00:09:37.040 | But I would ask the question:
00:09:38.560 | does this constitute a race to the bottom,
00:09:41.040 | in which the company that spends the least amount of money on protections wins out?
00:09:45.600 | Of course, because it's only 7 billion parameters,
00:09:48.240 | it's not capable of that much.
00:09:50.160 | But a future Mistral 70 billion
00:09:52.480 | or 700 billion parameter model
00:09:54.960 | might make headlines for good and bad reasons.
00:09:58.640 | But smaller models do seem to be the trend,
00:10:00.960 | with even Microsoft looking to downsize.
00:10:03.680 | As The Information put it,
00:10:05.120 | Microsoft is trying to lessen its addiction to OpenAI
00:10:08.400 | as AI costs soar.
00:10:09.840 | The article talks about how Microsoft researchers are making distilled models
00:10:14.160 | that mimic larger models like GPT-4
00:10:16.880 | but are smaller and cost far less to operate.
00:10:19.680 | Notably they mention Orca and the Phi series
00:10:22.240 | of models.
00:10:23.040 | I am proud to have been one of the first people to cover the Orca model,
00:10:26.560 | and my video on it, called
00:10:27.840 | "The Model Few People Saw Coming",
00:10:29.680 | was retweeted by the lead author of Orca.
00:10:32.720 | Of course I also interviewed one of the lead creators
00:10:36.000 | of the Phi series of models
00:10:37.840 | in a separate video.
00:10:39.040 | As the article points out,
00:10:40.160 | the fine-tuned Orca model performs much better than the base model of Llama 2.
00:10:45.200 | And as I remember the Orca paper pointing out,
00:10:47.840 | it performs nearly as well as GPT-4 on certain tasks.
00:10:52.000 | In fact, now I speak of it,
00:10:53.440 | I remember predicting at the end of the Orca video,
00:10:56.000 | you can go and watch it,
00:10:57.280 | I remember saying something like:
00:10:59.040 | I wonder if Microsoft is funding Orca
00:11:01.440 | to see if it can substitute for GPT-4 or the next GPT-5.
00:11:05.920 | Actually, now I look back,
00:11:06.960 | that was a pretty amazing prediction.
00:11:08.640 | Indeed it looks like Microsoft might well substitute Orca for GPT-4
00:11:12.720 | for certain use cases,
00:11:14.160 | particularly as it uses less than a tenth of the computing power that GPT-4 uses.
00:11:19.120 | But speaking of being ahead of the curve,
00:11:20.800 | here is, in fact,
00:11:21.760 | a fascinating paper
00:11:23.600 | that I don't think anyone has talked about.
00:11:25.760 | Of course that's probably because it came out yesterday,
00:11:27.920 | but nevertheless
00:11:28.880 | it achieves part of what I talked about
00:11:30.960 | toward the end of my SmartGPT video:
00:11:33.600 | systematically and fairly evaluating the leading LLMs
00:11:37.600 | on a range of benchmarks,
00:11:39.040 | not this patchwork of different conditions,
00:11:41.200 | benchmarks,
00:11:41.920 | settings and models that we currently have.
00:11:43.760 | I'm hoping to speak to one of the lead authors of this paper.
00:11:46.560 | But for now let me just give you a taste of the highlights.
00:11:49.680 | First we have this epic chart.
00:11:51.520 | I want you to focus on one part in particular.
00:11:54.080 | At the top right you can see the 1st of March version of GPT-3.5,
00:11:58.480 | and then the 13th of June version.
00:12:00.400 | For GPT-4 we have the 14th of March version
00:12:03.200 | and the 13th of June version.
00:12:04.880 | And the paper later describes the seesaw progress.
00:12:08.000 | It's not a straight line upwards.
00:12:09.360 | In other words, the conspiracy that ChatGPT is getting dumber
00:12:12.560 | in some areas is actually correct.
00:12:15.440 | Take the first row, which is Natural Questions:
00:12:18.160 | performance went down slightly between these two versions.
00:12:21.280 | And then for TriviaQA, for GPT-4,
00:12:23.840 | it went down slightly.
00:12:25.200 | And you can look across the benchmarks to see the same things happening.
00:12:28.720 | And this is not just noise as the paper later points out.
00:12:31.680 | The June version of GPT-3.5 saw its score on MATH dramatically degrade from 32.0 to 15.0,
00:12:40.480 | that's compared to the March 1st version.
00:12:42.720 | In contrast, on the DROP benchmark,
00:12:44.720 | GPT-4's performance increases from 78.7 to 87.2
00:12:49.840 | between the two versions,
00:12:51.040 | which is on par with state of the art.
00:12:53.040 | However, between those same two versions,
00:12:55.520 | its score on a different benchmark plummeted from 82.2 to 68.7.
00:13:00.720 | This is not just noise.
00:13:01.920 | And OpenAI admits that when they release a new model,
00:13:04.800 | while the majority of metrics might improve,
00:13:07.680 | there may be some tasks where the performance gets worse.
00:13:10.560 | Which brings me to the web based version of GPT-4 and ChatGPT.
00:13:15.680 | The paper notes that the dated API models,
00:13:18.480 | that's those with the four numbers at the end,
00:13:20.800 | consistently perform slightly better than their front-end counterparts,
00:13:24.800 | i.e. the web version of the models.
00:13:27.040 | To be honest I had noticed dissimilarities when benchmarking SmartGPT.
00:13:31.280 | They do say that Code Interpreter,
00:13:33.360 | now called Advanced Data Analysis,
00:13:35.520 | has significantly improved the coding benchmark performance,
00:13:38.880 | getting 85.2% pass@1 on HumanEval.
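For reference, pass@1 figures like this are usually computed with the unbiased pass@k estimator introduced alongside HumanEval: for n samples per problem, of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimator of the chance that at least one of k samples
    (drawn from n generations, of which c passed the tests) is correct."""
    if n - c < k:
        return 1.0  # too few failing samples left to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a single sample per problem, pass@1 reduces to the plain pass rate; with 10 samples of which 5 pass, `pass_at_k(10, 5, 1)` gives 0.5.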
00:13:42.320 | So you're not hallucinating when you see differences between the web GPT models
00:13:46.880 | and the API models.
00:13:48.320 | Anyway, there is so much more from this paper
00:13:50.560 | that I want to cover,
00:13:51.840 | preferably after I've spoken to one of the lead authors.
00:13:54.960 | But for now the video is long enough.
00:13:57.360 | We have covered some pretty epic topics today.
00:14:00.320 | If you enjoyed or learnt anything from this overview,
00:14:03.200 | please do let me know in the comments.
00:14:05.200 | Shameless plug at this point for my Patreon,
00:14:07.600 | which is also linked in the description.
00:14:09.600 | And as ever,
00:14:10.320 | thank you so much for watching to the end,
00:14:12.560 | and have a wonderful day.