‘Her’ AI, Almost Here? Llama 3, VASA-1, and Altman ‘Plugging Into Everything You Want To Do’
And just as I was finishing editing the video you're about to see, Llama 3 was dropped by Meta. But rather than do a full video on that, I'm going to give you the TL;DR. That's because Meta aren't releasing their biggest and best model, and the research paper is coming later. They have, though, tonight released two smaller models that are competitive, to say the least, with other models in their class. Note that Llama 3 70B is competitive with Gemini Pro 1.5 and Claude 3 Sonnet, although without their context window size. And here you can see the human-evaluated comparisons between the Llama 3 70B released tonight and Mistral Medium,
Claude Sonnet, and GPT-3.5. What Meta appear to have found, although there were early glimpses of this in the original Llama paper, is that model performance continues to improve even after a model is trained on two orders of magnitude more data than the Chinchilla-optimal amount. So essentially they saturated their models with quality data, giving a special emphasis to coding data.
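As a rough sanity check on that claim, here is a back-of-the-envelope sketch in Python. It assumes the widely cited Chinchilla rule of thumb of roughly 20 training tokens per parameter, plus the roughly 15 trillion training tokens Meta reported for Llama 3; treat both numbers as approximations:

```python
# Back-of-the-envelope: how far past "Chinchilla-optimal" was Llama 3 trained?
# Assumptions: ~20 tokens per parameter (the usual Chinchilla rule of thumb)
# and ~15T training tokens (Meta's reported figure for Llama 3).
CHINCHILLA_TOKENS_PER_PARAM = 20
LLAMA3_TRAINING_TOKENS = 15e12

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

for name, n_params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    optimal = chinchilla_optimal_tokens(n_params)
    ratio = LLAMA3_TRAINING_TOKENS / optimal
    print(f"{name}: optimal ≈ {optimal:.1e} tokens; trained on ≈ {ratio:.0f}× that")

# Llama 3 8B: optimal ≈ 1.6e+11 tokens; trained on ≈ 94× that
# Llama 3 70B: optimal ≈ 1.4e+12 tokens; trained on ≈ 11× that
```

On those assumptions, the 8B model in particular saw roughly two orders of magnitude more data than the Chinchilla-optimal amount, which is what the claim above refers to.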
They do say that they're going to release multiple models with new capabilities, including multi-modality, conversing in multiple languages, a longer context window, and stronger overall capabilities. But before we get to the main video, here is the quick comparison you're probably all curious about: the mystery model that's still training versus the new GPT-4 Turbo and Claude 3 Opus. For the infamous MMLU, the performance is about the same for all three models. For the Google-proof graduate STEM assessment (GPQA), the performance is again almost identical, with Claude 3 just about having the lead. For the coding benchmark HumanEval, although that's a deeply flawed benchmark, GPT-4 still seems to be in the lead. For mathematics, somewhat surprisingly many would say, GPT-4 crushes this new Llama 3 model. So despite the fact that they haven't given us a paper, we can say the two smaller models released tonight are super competitive with other models of their size, and that this mystery model will be of a GPT-4 and Claude 3 Opus class.
And now I must move on from Llama 3, because I think in the last 48 hours there was an announcement that is arguably even more interesting. Using just a single photo of you, we can now get you to say anything. "Have you ever had, maybe you're in that place right now, where you want to turn your life around, and you know somewhere deep in your soul there could be some decisions that you have to make." It is proving much easier than many people thought to use AI to imitate not just human writing, voices, artwork, and music, but now even our facial expressions. And, by the way, in real time, unlike Sora from OpenAI. But what does this even mean? For one, I think it is now almost certain that you will be able to have a real-time Zoom call with the next generation of models out later this year. I think that will change how billions of people interact with AI.
How intelligent those models will be, and how soon, has been the subject of a striking new debate this week. Of course I'll cover that, plus controversy over the new and imposing Atlas robot, AI nurses outperforming real ones, and much more. The VASA-1 paper from Microsoft came out in the last 48 hours; I've read it in full, and I'm going to give you only the most relevant highlights. But why pick out VASA when there have been papers and demos of relatively realistic deepfakes this year? Well, it's all about the facial expressions: the blinking, the expressiveness of the lips and eyebrows. "Surprises me still. I ran it on someone just last night. It was fascinating. You know, she had complained of, she had complained of shoulder, like, pain in her arm." No model at this resolution has been this good. I think a significant segment of the public, if shown this for the first time with no prep, could believe that these were real.
You can control not only the emotion that the avatar is conveying, from happiness to anger, but also their distance from the camera and the direction of their gaze. "I would say that we as readers are not meant to look at him in any other way but with disdain, especially in how he treats his daughter, okay? But of course he is able to clearly see through Morris." And even though the VASA-1 model was only trained on real videos, which I'll get to in a moment, it can do things like this: "I'm a paparazzi. I don't play no Yahtzee. I go pop, pop, pop, pop, pop, pop. See, I tell the truth from what I see and sell it to Perez Hilty."
And the creators of VASA-1 say this on the first page of their paper: "This paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors." At the moment, the resolution is almost HD, at 40 frames per second. They also mention, which is crucial, negligible starting latency.
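To put those real-time numbers in perspective, here is a quick back-of-the-envelope check in Python. The 40 FPS figure comes from the claim above; the conversational-latency components are purely illustrative assumptions, not numbers from the paper:

```python
# How tight is the real-time budget at 40 frames per second?
FPS = 40
frame_budget_ms = 1000 / FPS  # time available to produce each video frame
print(f"Per-frame budget: {frame_budget_ms:.0f} ms")  # -> 25 ms

# For a live "Zoom call" feel, total mouth-to-ear delay matters too.
# The first two component figures below are illustrative assumptions only.
pipeline_latency_ms = {
    "speech recognition + LLM reply (assumed)": 500,
    "text-to-speech, first audio chunk (assumed)": 200,
    "avatar video, first frame (negligible, per the paper)": 25,
}
print(f"Rough time to first talking frame: {sum(pipeline_latency_ms.values())} ms")
```

The point is that at 40 FPS with negligible start-up latency, the avatar itself is no longer the bottleneck in a live conversation.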
Let me try to demonstrate. Again, all you need is an image and an audio clip from anywhere, AI-generated or otherwise. "But you know what I decided to do? I decided to focus." Now, somewhat ambitiously, the authors mention that this technology will amplify the richness of human-to-human interaction. I would more readily agree with the end of that paragraph, where they talk about social interaction in healthcare.
A few weeks ago, we learned that Hippocratic AI and NVIDIA had teamed up to release AI nurses costing less than $9 an hour. I'll show you the performance metrics, but here's a taster. "This is Linda calling from Memorial Hospital on a recorded line. Is this Albert Wu?" "Yes, it is." "Wonderful. I'm calling on behalf of Dr. Brown, your cardiologist. To protect your privacy, can you please share your date of birth?" "It's January 1st. I mean, she's not trying to kill me, right? I thought that after all these years of me teasing her, she's finally trying to get back at me." "Rest assured, your wife isn't out to get you. And there's no need to worry about a negative interaction with your lisinopril. Your latest lab results show your potassium levels are within the normal range, which is between 3.5 and 5." And according to ratings given by human nurses, these AI nurses, even without a video avatar, outperformed in terms of bedside manner and patient education. On a technical level, they outperformed in identifying a medication's impact on lab values and in identifying disallowed over-the-counter medications, and they way outperformed in detecting toxic dosages.
So now imagine your next nurse appointment looking like this. "I'd love to begin with you firstly, just because I read that you started out in advertising and now you run a wellness business." "These principles will not only make your users' journey more pleasant, they'll contribute to better business metrics as well. Users hate being interrupted, and they hate getting broken experiences. Keeping these principles in mind in your app design makes for a better user journey."
Now let's briefly touch on their methodology. What they did so differently was to map all possible facial dynamics (lip motion, non-lip expression, eye gaze, and blinking) onto a latent space. Think of that as a compute-efficient, condensed machine representation of the actual 3D complexity of facial movements. Previous methods focused much more on just the lips and had much more rigid expressions. The authors also revealed that it was a diffusion transformer model: they used the transformer architecture to map audio to facial expressions and head movements. So the model actually first takes the audio clip and generates the appropriate head movements and facial expressions, or at least a latent variable representing those things. Only then, using those facial and head motion codes, does their method produce video frames, which of course also draws on the appearance and identity features extracted from the input image.
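In code terms, here is a minimal structural sketch of that two-stage idea. To be clear, this is not Microsoft's implementation: every function name, shape, and stub below is invented for illustration; only the overall split, audio to motion latents, then motion latents plus identity to frames, comes from the paper:

```python
# Structural sketch of the VASA-1-style two-stage pipeline described above.
# All names, shapes, and stub models here are invented for illustration.
import numpy as np

LATENT_DIM = 256  # assumed size of the holistic facial-dynamics latent
FPS = 40          # the paper's reported frame rate

def extract_identity(image: np.ndarray) -> np.ndarray:
    """Stand-in for the appearance/identity encoder applied once to the single input photo."""
    return np.random.randn(LATENT_DIM)

def audio_to_motion_latents(audio: np.ndarray, seconds: float) -> np.ndarray:
    """Stand-in for the diffusion transformer: audio conditions a per-frame sequence of
    latents jointly encoding lip motion, expression, eye gaze, blinking, and head pose."""
    n_frames = int(seconds * FPS)
    return np.random.randn(n_frames, LATENT_DIM)

def render_frame(identity: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Stand-in for the decoder that turns (identity, one motion latent) into pixels."""
    return np.zeros((512, 512, 3), dtype=np.uint8)

def generate_talking_head(image: np.ndarray, audio: np.ndarray, seconds: float) -> list:
    identity = extract_identity(image)                 # once, from the single photo
    motions = audio_to_motion_latents(audio, seconds)  # stage 1: audio -> motion codes
    return [render_frame(identity, m) for m in motions]  # stage 2: codes -> video frames

frames = generate_talking_head(np.zeros((512, 512, 3)), np.zeros(16000), seconds=1.0)
print(len(frames), "frames for one second of audio")  # -> 40
```

The key design point is that motion is generated in a compact latent space first, which is far cheaper than diffusing raw pixels frame by frame.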
Buried deep in the paper, you might be surprised by just how little data it takes to train VASA-1. They used the public VoxCeleb2 dataset. I looked it up, and it calls itself a large-scale dataset, but it's just 2,000 hours. For reference, YouTube might have 2 billion hours, and we know, according to leaks, that OpenAI trained on a million hours of YouTube data. Now, I know this dataset is curated, but the point remains about the kind of results you can get with this little data. In fairness, they did also mention supplementing with their own smaller dataset of 3,500 subjects, but the scale of data remains really quite small.
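For a sense of proportion, here is the same comparison as quick arithmetic in Python. Note that the 2 billion hour YouTube estimate and the 1 million hour OpenAI figure are the rough numbers cited above, not verified statistics:

```python
# Relative scale of the video-data figures mentioned above (all approximate).
voxceleb2_hours = 2_000              # the public dataset VASA-1 was trained on
openai_leaked_hours = 1_000_000      # leaked OpenAI YouTube-training figure cited above
youtube_total_hours = 2_000_000_000  # rough estimate cited above

print(f"VoxCeleb2 is {openai_leaked_hours / voxceleb2_hours:,.0f}x smaller than the leaked OpenAI figure")
print(f"VoxCeleb2 is {youtube_total_hours / voxceleb2_hours:,.0f}x smaller than all of YouTube")
# -> 500x smaller and 1,000,000x smaller
```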
But here is the 15-second headline comparing their method to real video and to previous methods: the lip-syncing accuracy is unprecedented, and the synchronization to the audio is state-of-the-art. The video quality is improved, but of course still far from reality. They say they're working on better imitation of hair and clothing, and on extending to the full upper body. Now, for fairly obvious reasons, Microsoft are not planning to release VASA-1, and say they have no plans to release an online demo, API, product, or any related offerings until, at least, they are certain that the technology will be used responsibly and in accordance with proper regulations. I'm not quite sure how you could ever be certain of that, so likely a VASA-1 equivalent will be released open source on the dark web in the coming years.
Now of course, to get to ‘Her’ levels of realism, we'd also need an AI to analyze our own emotions. "Are you social or anti-social?" "I guess I haven't been social in a while, mostly because..." "In your voice, I sense hesitance. Would you agree with that?" "Was I sounding hesitant?" "Yes." "Oh, sorry if I was sounding hesitant. I was just trying to be more accurate." "Would you like your OS to have a male or female voice?" "Female, I guess."
But you're probably not surprised to learn that there's a company focused squarely on that: Hume AI. I'm going to start a conversation and have the AI analyze the emotions in my voice. Should be interesting. "Tonight, I am actually debuting a new newsletter called Signal to Noise, and the link will be in the description. I'm pretty pumped." Determination? Calmness? I don't think I'm that calm. Concentration? Okay, I'll take it.

And yes, that wasn't just to test Hume AI; that's a real announcement. I have worked for months on this one, and I'm really proud of how it looks and sounds. It's free to sign up, and the inspiration behind the name was this: as all of you watching on YouTube know, there's a lot of noise around, but not as much signal. On this channel, I try to maintain a high signal-to-noise ratio; I basically only make videos when something has happened that I actually find interesting myself. It will be the same with this newsletter: I'm only going to do posts when something interesting has happened. More than that, I'm going to give every post a "Does it change everything?" dice rating. That's my quirky way of analyzing whether the entire industry is actually stunned. So: absolutely no spam, quality writing (at least in my opinion), and a "Does it change everything?" rating that you can see at a glance. Each post is a three-to-four-minute read, and the philosophy was that I wanted a newsletter I would be excited about. And only for those who really want to support the hype-free ethos of the channel and the newsletter, there is the Insider Essentials tier. You'll get exclusive posts, sample Insider videos, and access to an experimental SmartGPT 2.0. There is absolutely no obligation to join; I would be overjoyed if you simply sign up to the free newsletter. Whether you're subscribing for free or with Essentials, do check your spam folder, because sometimes the welcome message goes there. As always, if you want all my extra video content and professional networking and tip-sharing, do sign up for AI Insiders on Patreon. At least so far, I've been able to individually welcome every single new member.
But of course, while deepfakes progress, robot agility is also progressing. Here's the new Atlas from Boston Dynamics. Now, the other most famous robot on the scene is Figure 01, which I talked about in a recent video. And just two hours ago, the CEO of the company that makes Figure 01 said this, speaking of Boston Dynamics: "New Atlas won't be the last time we get copied. If it's not obvious yet, Figure is doing the best mechanical design in the world for robotics." He was referencing the waist design of the new Atlas robot. Now, whether that comment is more about PR and posturing, only time will tell.
But before we completely leave the topic of AI social interaction and ‘Her’, here's Sam Altman from two days ago. He suggests that the personalization of AI to you might be even more important than its inherent intelligence. I do start to wonder if that's part of a deliberate strategy from OpenAI. In my recent Stargate video, I talked about how Microsoft are spending a hundred billion dollars. But this week, Hassabis said that Google will be spending more than that on compute. So if it is true that Google starts to race away with the power of their models, that could be one way that OpenAI competes: get more data from more users and personalize their AI to you, likely with a video avatar. And don't forget, we got very early hints of this with the GPT Store. OpenAI are now paying US builders based on user engagement with their GPTs. At the moment, that user engagement is apparently really quite low, but throw in a lifelike video avatar and that might change quite quickly.
Of course, those models would only become truly addictive for many when they were as smart as the average human. There are those, though, who say that's never going to happen, including the creators of some cutting-edge models. Here's Arthur Mensch, co-founder of Mistral: "The whole AGI rhetoric, artificial general intelligence, is about creating God. I don't believe in God. I'm a strong atheist, so I don't believe in AGI." I'm not personally sure about the link there, but it's an interesting quote. Then we have Yann LeCun, a famous LLM skeptic. He's previously said that something like AGI definitely wouldn't be coming in the next five years. Three days ago, he said this: "There is no question that AI will eventually reach and surpass human intelligence in all domains, but it won't happen next year." He then went on, in parentheses, to say that auto-regressive LLMs may indeed constitute a component of AGI. That does seem to me to be a slight change in emphasis from his previous statements.
Others, like the CEO of Anthropic, have much more aggressive timelines. For context on what you're about to hear from Dario Amodei: ASL-3 (AI Safety Level 3) refers to systems that substantially increase the risk of catastrophic misuse or show low-level autonomous capabilities, whereas ASL-4 indicates systems that involve qualitative escalations in catastrophic misuse potential and autonomy. On timelines, just this week, he said this: "When you imagine how many years away, just roughly, ASL 3 is and how many years away ASL 4 is, right, you've thought a lot about this exponential scaling curve. If you just had to guess, what are we talking about?" "Yeah, I think ASL 3 could easily happen this year or next year. I think ASL..." "Oh, Jesus Christ." "No, no, I told you, I'm a believer in exponentials. I think ASL 4 could happen anywhere from 2025 to 2028." "So that is fast." "Yeah, no, no, I'm truly talking about the near future here. I'm not talking about 50 years away."
So, depending on who you listen to, AGI either doesn't exist or is coming pretty imminently. But I have to end as I began, with ‘Her’. Some say that the movie ‘Her’ was set in the year 2025, and that's starting to seem pretty appropriate. Now, whether or not it's actually released, I do think we, humanity, will be technologically capable of something approximating ‘Her’ by next year. Let me know if you agree. Thank you so much for watching to the end of the video. Please do check out my new newsletter, I'm super proud of it, and as always, have a wonderful day.