
‘Her’ AI, Almost Here? Llama 3, Vasa-1, and Altman ‘Plugging Into Everything You Want To Do’



00:00:00.000 | And just as I was finishing editing the video you're about to see,
00:00:04.000 | Llama 3 was dropped by Meta. But rather than do a full video on that, I'm going to give you the
00:00:09.920 | TL;DR. That's because Meta aren't releasing their biggest and best model and the research paper is
00:00:15.440 | coming later. They have, though, tonight released two smaller models that are, to say
00:00:20.960 | the least, competitive with other models in their class. Note that Llama 3 70B is competitive with Gemini
00:00:28.080 | 1.5 Pro and Claude 3 Sonnet, although without their context window sizes. And here you can see the
00:00:34.640 | human-evaluated comparisons between Llama 3 70B, released tonight, and Mistral Medium,
00:00:39.760 | Claude 3 Sonnet, and GPT-3.5. What Meta appear to have found, although there were early glimpses of this
00:00:45.360 | in the original Llama paper, is that model performance continues to improve even after a
00:00:50.720 | model is trained on two orders of magnitude more data than the Chinchilla-optimal amount. So
00:00:57.040 | essentially they saturated their models with quality data, giving a special emphasis to coding
00:01:02.560 | data.
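As a rough illustration of that claim, here is a minimal Python sketch. It assumes the common Chinchilla heuristic of roughly 20 training tokens per parameter and Meta's reported figure of about 15 trillion training tokens for Llama 3; both numbers are approximations supplied here, not figures quoted in the video.

```python
# Back-of-envelope check on "two orders of magnitude more data than the
# Chinchilla-optimal amount". Assumptions (not from the video): the
# ~20 tokens-per-parameter Chinchilla heuristic, and Meta's reported
# ~15 trillion training tokens for Llama 3.
TOKENS_PER_PARAM = 20
LLAMA3_TRAINING_TOKENS = 15e12

for name, params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    chinchilla_optimal = TOKENS_PER_PARAM * params
    ratio = LLAMA3_TRAINING_TOKENS / chinchilla_optimal
    print(f"{name}: Chinchilla-optimal ~{chinchilla_optimal / 1e12:.2f}T tokens, "
          f"actual ~15T, i.e. about {ratio:.0f}x optimal")

# Llama 3 8B: ~0.16T optimal vs ~15T actual, about 94x (the "two orders
# of magnitude"); for the 70B it is roughly 11x.
```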
00:01:07.200 | They do say that they're going to release multiple models with new capabilities, including multi-modality, conversing in multiple languages, a longer context window, and stronger overall
00:01:12.400 | capabilities. But before we get to the main video, here is the quick comparison you're probably all
00:01:18.160 | curious about. The mystery model that's still training versus the new GPT-4 Turbo and Claude
00:01:24.320 | 3 Opus. For the infamous MMLU, the performance is about the same for all three models. For the
00:01:31.200 | Google-proof graduate STEM assessment, GPQA, the performance is again almost identical, with
00:01:36.480 | Claude 3 just about having the lead. For the coding benchmark HumanEval, although that's a
00:01:41.680 | deeply flawed benchmark, GPT-4 still seems to be in the lead. For mathematics, somewhat surprisingly,
00:01:47.120 | many would say, GPT-4 crushes this new Llama 3 model. So despite the fact that they haven't
00:01:52.560 | given us a paper, we can say the two smaller models released tonight are super competitive
00:01:58.080 | with other models of their size and that this mystery model will be of a GPT-4 and Claude 3
00:02:04.320 | Opus class. And now I must move on from Llama 3 because I think in the last 48 hours there
00:02:10.240 | was an announcement that is arguably even more interesting. Using just a single photo of you,
00:02:17.520 | we can now get you to say anything. Have you ever had, maybe you're in that place right now where
00:02:23.200 | you want to turn your life around and you know somewhere deep in your soul
00:02:28.080 | there could be some decisions that you have to make. It is proving much easier than many people
00:02:34.800 | thought to use AI to imitate not just human writing, voices, artwork and music but now even
00:02:41.760 | our facial expressions. And, by the way, in real time, unlike Sora from OpenAI. But what does this
00:02:48.960 | even mean? For one, I think it is now almost certain that you will be able to have a real
00:02:54.640 | time Zoom call with the next generation of models out later this year. I think that will change how
00:03:01.200 | billions of people interact with AI. How intelligent those models will be and how soon has been the
00:03:08.400 | subject of a striking new debate this week. Of course I'll cover that, along with controversy over the new
00:03:14.400 | and imposing Atlas robot, AI nurses outperforming real ones, and much more. The VASA-1 paper from
00:03:21.440 | Microsoft came out in the last 48 hours; I've read the paper in full and I'm going to give you
00:03:26.640 | only the most relevant highlights. But why pick out VASA-1 when there have been papers and demos
00:03:32.320 | of relatively realistic deepfakes this year? Well, it's all about the facial expressions,
00:03:37.920 | the blinking, the expressiveness of the lips and eyebrows. "Surprises me still. I ran it on
00:03:44.160 | someone just last night. It was fascinating. You know, she had complained of, she had complained
00:03:51.440 | of shoulder like pain in her arm." No model at this resolution has been this good. I think a
00:03:57.680 | significant segment of the public, if shown this for the first time with no prep, could believe
00:04:02.640 | that these were real. You can control not only the emotion that the avatar is conveying, from
00:04:07.360 | happiness to anger, but also their distance from the camera and the direction of their gaze. "I
00:04:14.160 | would say that we as readers are not meant to look at him in any other way but with disdain,
00:04:19.760 | especially in how he treats his daughter, okay? But of course he is able to clearly see through
00:04:25.440 | Morris." And even though the VASA-1 model was only trained on real videos, which I'll get to
00:04:30.480 | in a moment, it can do things like this. "I'm a paparazzi. I don't play no Yahtzee. I go pop,
00:04:37.440 | pop, pop, pop, pop, pop. See, I tell the truth from what I see and sell it to Perez Hilty."
00:04:42.640 | And the creators of VASA-1 say this on the first page of their paper: "This paves the way for
00:04:48.480 | real-time engagements with lifelike avatars that emulate human conversational behaviors."
00:04:54.880 | At the moment, the resolution is almost HD at 40 frames per second. They also mention,
00:05:00.720 | which is crucial, negligible starting latency.
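As a quick sanity check on what real time at 40 frames per second implies, here is the simple arithmetic (the 40 FPS figure is the paper's; the budget calculation is added here):

```python
# Per-frame time budget implied by real-time generation at 40 FPS.
fps = 40
frame_budget_ms = 1000 / fps
print(f"Each frame must be produced in under {frame_budget_ms:.0f} ms")
# -> Each frame must be produced in under 25 ms
```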
00:05:06.640 | Let me try to demonstrate. Again, all you need is an image and an audio clip from anywhere, AI-generated or otherwise. "But you know what I
00:05:12.560 | decided to do? I decided to focus." Now, somewhat ambitiously, the authors mention that this
00:05:19.520 | technology will amplify the richness of human-to-human interaction. I agree more
00:05:25.600 | with the end of the paragraph, where they talk about social interaction in healthcare.
00:05:30.560 | A few weeks ago, we learned that Hippocratic AI and NVIDIA had teamed up to release AI nurses costing less than $9
00:05:37.600 | an hour. I'll show you the performance metrics, but here's a taster. "This is Linda
00:05:42.800 | calling from Memorial Hospital on a recorded line. Is this Albert Wu?" "Yes, it is." "Wonderful. I'm
00:05:48.720 | calling on behalf of Dr. Brown, your cardiologist. To protect your privacy, can you please share your
00:05:53.440 | date of birth?" "It's January 1st. I mean, she's not trying to kill me, right? I thought that after
00:05:58.560 | all these years of me teasing her, she's finally trying to get back at me." "Rest assured, your
00:06:02.240 | wife isn't out to get you. And there's no need to worry about a negative interaction with your
00:06:06.480 | lisinopril. Your latest lab results show your potassium levels are within the normal range,
00:06:11.520 | which is between 3.5 and 5." And according to ratings given by human nurses, these AI nurses,
00:06:18.320 | even without a video avatar, outperformed real nurses in terms of bedside manner and educating the patient. On a
00:06:25.280 | technical level, they outperformed in identifying a medication's impact on lab values, identifying
00:06:31.520 | disallowed over-the-counter medications, and way outperformed in detecting toxic dosages.
00:06:37.600 | So now imagine your next nurse appointment looking like this. "I'd love to begin with you firstly,
00:06:43.280 | just because I read that you started out in advertising and now you run a wellness business."
00:06:49.520 | "These principles will not only make your user's journey more pleasant, they'll contribute to
00:06:53.120 | better business metrics as well. Users hate being interrupted and they hate getting broken
00:06:57.600 | experiences. Keeping these principles in mind in your app design makes for a better user journey."
00:07:02.160 | Now let's briefly touch on their methodology. What they did so differently was to map all
00:07:06.720 | possible facial dynamics (lip motion, non-lip expression, eye gaze, and blinking) onto a latent
00:07:13.200 | space. Think of that as a compute-efficient, condensed machine representation of the actual 3D
00:07:19.920 | complexity of facial movements. Previous methods focused much more just on the lips and had much
00:07:25.200 | more rigid expressions. The authors also revealed that it is a diffusion transformer model: they
00:07:30.800 | used the transformer architecture to map audio to facial expressions and head movements. So the
00:07:38.000 | model actually first takes the audio clip and generates the appropriate head movements and
00:07:44.240 | facial expressions, or at least a latent variable representing those things. Only then, using those
00:07:49.920 | facial and head motion codes, does their method produce video frames, which of course also draws on
00:07:54.880 | the appearance and identity features extracted from the input image.
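To make that two-stage flow concrete, here is a heavily simplified Python sketch. Every class and function name below is a hypothetical placeholder, not Microsoft's actual code or API; it only illustrates the order of operations described above: audio is first turned into motion latents by a diffusion transformer, and only then does a decoder render frames, conditioned on appearance and identity features extracted once from the single input image.

```python
import numpy as np

# Hypothetical placeholders sketching a VASA-1-style two-stage pipeline;
# this illustrates the data flow only, not the real model.

class MotionDiffusionTransformer:
    """Stage 1: map audio features to holistic facial-dynamics and
    head-motion latents (lip motion, non-lip expression, gaze, blinks)."""

    def generate_latents(self, audio_features: np.ndarray) -> np.ndarray:
        # The real model runs iterative diffusion denoising conditioned
        # on the audio; here we just return dummy latents, one per frame.
        num_frames, latent_dim = len(audio_features), 256
        return np.zeros((num_frames, latent_dim))

class FrameDecoder:
    """Stage 2: render video frames from the motion latents plus
    appearance/identity features extracted from the single input image."""

    def extract_appearance(self, image: np.ndarray) -> np.ndarray:
        return np.zeros(512)  # placeholder appearance/identity code

    def render(self, latents: np.ndarray, appearance: np.ndarray) -> list:
        # Every frame shares the same appearance code; only motion varies.
        return [("frame", i) for i in range(len(latents))]

def generate_talking_head(image: np.ndarray, audio_features: np.ndarray) -> list:
    motion_latents = MotionDiffusionTransformer().generate_latents(audio_features)
    decoder = FrameDecoder()
    appearance = decoder.extract_appearance(image)
    return decoder.render(motion_latents, appearance)
```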
00:07:59.760 | Buried deep in the paper, you might be surprised by just how little data it takes to train VASA-1. They used the public
00:08:06.000 | VoxCeleb2 dataset. I looked it up and it calls itself a large-scale dataset, but it's just 2,000
00:08:12.320 | hours. For reference, YouTube might have 2 billion hours. And according to leaks, OpenAI
00:08:19.200 | trained on a million hours of YouTube data. Now I know this dataset is curated, but the point
00:08:24.560 | remains about the kind of results you could get with this little data. Now in fairness, they did
00:08:29.040 | also mention supplementing with their own smaller dataset of 3,500 subjects. But the scale of
00:08:35.600 | data remains really quite small.
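For a sense of those ratios, here is the comparison as quick arithmetic, using only the rough figures quoted above:

```python
# Rough data-scale comparison using the figures mentioned in the video.
voxceleb2_hours = 2_000          # VoxCeleb2, VASA-1's main training set
openai_leak_hours = 1_000_000    # leaked figure for OpenAI's YouTube data
youtube_hours = 2_000_000_000    # rough estimate of all YouTube video

print(f"OpenAI's reported training set: {openai_leak_hours // voxceleb2_hours}x VoxCeleb2")
print(f"All of YouTube: {youtube_hours // voxceleb2_hours:,}x VoxCeleb2")
# -> OpenAI's reported training set: 500x VoxCeleb2
# -> All of YouTube: 1,000,000x VoxCeleb2
```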
00:08:41.120 | But here is the 15-second headline comparing their method to real video and to previous methods: the lip-syncing accuracy is unprecedented and the
00:08:47.680 | synchronization to the audio is state-of-the-art. The video quality is improved, but of course still
00:08:53.680 | far from reality. They say they're working on better imitation of hair and clothing and extending
00:08:59.600 | to the full upper body. Now for fairly obvious reasons, Microsoft are not planning to release
00:09:04.880 | VASA-1, saying: "We have no plans to release an online demo, API, product, or any related offerings
00:09:10.880 | until we are certain that the technology will be used responsibly and in accordance with
00:09:16.080 | proper regulations." I'm not quite sure how you could ever be certain of that. So likely a
00:09:21.520 | VASA-1 equivalent will be released open source on the dark web in the coming years. Now of course to get
00:09:27.920 | to Her levels of realism, we'd also need an AI to analyze our own emotions. "Are you social or
00:09:35.520 | anti-social?" "I guess I haven't been social in a while, mostly because-" "In your voice, I sense
00:09:41.680 | hesitance. Would you agree with that?" "Was I sounding hesitant?" "Yes." "Oh, sorry if I was sounding
00:09:47.360 | hesitant. I was just trying to be more accurate." "Would you like your OS to have a male or female
00:09:54.080 | voice?" "Female, I guess." But you're probably not surprised to learn that there's a company
00:10:00.320 | focused squarely on that: Hume AI. I'm going to start a conversation and have the AI analyze the
00:10:06.480 | emotions in my voice. Should be interesting. Tonight, I am actually debuting a new newsletter
00:10:12.320 | called Signal to Noise and the link will be in the description. I'm pretty pumped. Determination,
00:10:19.440 | calmness? I don't think I'm that calm. Concentration? Okay, I'll take it. And yes,
00:10:23.760 | that wasn't just to test Hume AI, that's a real announcement. I have worked for months on this
00:10:29.200 | one and I'm really proud of how it looks and sounds. It's free to sign up and the inspiration
00:10:35.200 | behind the name was this. As all of you guys watching on YouTube know, there's a lot of
00:10:40.080 | noise around but not as much signal. And on this channel, I try to maintain a high signal-to-noise
00:10:46.720 | ratio. I basically only make videos on this channel when there's something that's happened
00:10:51.040 | that I actually find interesting myself. And it will be the same with this newsletter. I'm only
00:10:56.080 | actually going to do posts when there's something interesting that's happened. And more than that,
00:11:00.240 | I'm going to give every post a "Does it change everything?" dice rating. That's my quirky way
00:11:06.480 | of analyzing whether the entire industry is actually stunned. So absolutely no spam, quality
00:11:12.320 | writing, at least in my opinion, and a "Does it change everything?" rating that you can see at
00:11:17.360 | a glance. Each post is like a three- or four-minute read, and the philosophy was that I wanted a newsletter
00:11:23.200 | that I would be excited about. And only for those who really want to support the hype-free ethos of
00:11:29.600 | the channel and the newsletter, there is the Insider Essentials tier. You'll get exclusive
00:11:34.480 | posts, sample Insider videos, and access to an experimental SmartGPT 2.0. Absolutely no
00:11:40.880 | obligation to join. I would be overjoyed if you simply sign up to the free newsletter. Whether
00:11:46.560 | you're subbing for free or with Essentials, do check your spam because sometimes the welcome
00:11:51.440 | message goes there. As always, if you want all my extra video content and professional networking
00:11:57.200 | and tip sharing, do sign up on AI Insiders on Patreon. At least so far, I've been able to
00:12:03.280 | individually welcome every single new member. But of course, while deepfakes progress,
00:12:08.240 | robot agility is also progressing. Here's the new Atlas from Boston Dynamics.
00:12:21.680 | Now the other most famous robot on the scene is the Figure 01, which I talked about in a
00:12:42.240 | recent video. And just two hours ago, the CEO of the company that makes the Figure 01 said this,
00:12:48.160 | speaking of Boston Dynamics' new Atlas: "Won't be the last time we get copied. If it's not obvious
00:12:53.600 | yet, Figure is doing the best mechanical design in the world for robotics." And he was referencing
00:12:58.960 | the waist design of the new Atlas robot. Now, whether that comment is more about PR and posture,
00:13:05.520 | only time will tell. But before we completely leave the topic of AI social interaction and Her,
00:13:12.160 | here's Sam Altman from two days ago. He suggests that the personalization of AI to you might be
00:13:19.360 | even more important than its inherent intelligence. I do start
00:13:49.280 | to wonder if that's part of a deliberate strategy from OpenAI. In my recent Stargate video,
00:13:55.440 | I talked about how Microsoft are spending a hundred billion dollars. But this week,
00:13:59.840 | Hassabis said that Google will be spending more than that on compute. So if it is true that Google
00:14:05.920 | starts to race away with the power of their models, that could be one way that OpenAI competes:
00:14:11.680 | get more data from more users and personalize their AI to you, likely with a video avatar.
00:14:18.480 | And don't forget, we got very early hints of this with the GPT Store. OpenAI are now paying US
00:14:24.880 | builders based on user engagement with their GPTs. At the moment, that user engagement is apparently
00:14:31.120 | really quite low, but throw in a lifelike video avatar and that might change quite quickly. Of
00:14:36.560 | course, those models would only become truly addictive for many when they were as smart as
00:14:42.400 | the average human. There are those though, of course, that say that's never going to happen,
00:14:46.400 | including the creators of some cutting edge models. Here's Arthur Mensch, co-founder of
00:14:51.280 | Mistral. "The whole AGI rhetoric, artificial general intelligence, is about creating God.
00:14:56.320 | I don't believe in God. I'm a strong atheist, so I don't believe in AGI." I'm not personally sure
00:15:02.160 | about the link there, but it's an interesting quote. Then we have Yann LeCun, a famous LLM
00:15:07.360 | skeptic. He's previously said that something like AGI definitely wouldn't be coming in the next five
00:15:12.240 | years. Three days ago, he said this, "There is no question that AI will eventually reach and surpass
00:15:18.080 | human intelligence in all domains, but it won't happen next year." He then went on in parentheses
00:15:24.640 | to say that auto-regressive LLMs may indeed constitute a component of AGI. That does seem
00:15:30.800 | to me to be a slight change in emphasis from previous statements. Others like the CEO of
00:15:37.200 | Anthropic have much more aggressive timelines. For the context of what you're about to hear from
00:15:42.480 | Dario Amodei, ASL-3 (AI Safety Level 3) refers to systems that substantially increase the risk of catastrophic
00:15:49.440 | misuse or show low-level autonomous capabilities, whereas AI Safety Level 4 indicates systems that
00:15:56.640 | involve qualitative escalations in catastrophic misuse potential and autonomy. On timelines,
00:16:02.720 | just this week, he said this, "When you imagine how many years away, just roughly, ASL 3 is and
00:16:09.040 | how many years away ASL 4 is, right, you've thought a lot about this exponential scaling curve. If you
00:16:14.960 | just had to guess, what are we talking about?" "Yeah, I think ASL 3 is, you know, could easily
00:16:20.160 | happen this year or next year. I think ASL..." "Oh, Jesus Christ." "No, no, I told you, I'm a
00:16:25.840 | believer in exponentials. I think ASL 4 could happen anywhere from 2025 to 2028." "So that is
00:16:31.920 | fast." "Yeah, no, no, I'm truly talking about the near future here. I'm not talking about
00:16:36.560 | 50 years away." So, depending on who you listen to, AGI either doesn't exist or is coming pretty
00:16:42.560 | imminently. But I have to end as I began with Her. Some say that the movie Her was set in the year
00:16:48.640 | 2025 and that's starting to seem pretty appropriate. Now, whether or not it's actually
00:16:53.760 | released, I do think we, humanity, will be technologically capable of something approximating
00:16:59.840 | Her by next year. Let me know if you agree. Thank you so much for watching to the end of the video.
00:17:05.440 | Please do check out my new newsletter. I'm super proud of it. And as always, have a wonderful day.