
‘Her’ AI, Almost Here? Llama 3, Vasa-1, and Altman ‘Plugging Into Everything You Want To Do’


Transcript

And just as I was finishing editing the video you're about to see, Llama 3 was dropped by Meta. But rather than do a full video on that, I'm going to give you the TL;DR. That's because Meta aren't releasing their biggest and best model and the research paper is coming later.

They have, though, tonight released two smaller models that are, to say the least, competitive with other models in their class. Note that Llama 3 70B is competitive with Gemini Pro 1.5 and Claude 3 Sonnet, although without their context window size. And here you can see the human-evaluated comparisons between the Llama 3 70B released tonight and Mistral Medium, Claude 3 Sonnet, and GPT-3.5.

What Meta appear to have found, although there were early glimpses of this in the original Llama paper, is that model performance continues to improve even after a model is trained on two orders of magnitude more data than the Chinchilla-optimal amount. So essentially they saturated their models with quality data, giving a special emphasis to coding data.
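To put a rough number on that claim, here is my own back-of-the-envelope sketch, assuming the common ~20-tokens-per-parameter Chinchilla heuristic and Meta's reported figure of over 15 trillion pretraining tokens; none of this comes from a Llama 3 paper, since there isn't one yet.

```python
# Back-of-the-envelope check of the "two orders of magnitude" claim.
# Assumptions: ~20 tokens per parameter as the Chinchilla-optimal heuristic,
# and Meta's reported 15T-token pretraining run.
params = 8e9                       # Llama 3 8B parameter count
chinchilla_tokens = 20 * params    # ~1.6e11 tokens deemed compute-optimal
actual_tokens = 15e12              # reported pretraining tokens

ratio = actual_tokens / chinchilla_tokens
print(f"Chinchilla-optimal: {chinchilla_tokens:.1e} tokens")
print(f"Reported training:  {actual_tokens:.1e} tokens")
print(f"Ratio: ~{ratio:.0f}x, i.e. roughly two orders of magnitude")
```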

They do say that they're going to release multiple models with new capabilities, including multimodality, conversing in multiple languages, a longer context window, and stronger overall capabilities. But before we get to the main video, here is the quick comparison you're probably all curious about: the mystery model that's still training versus the new GPT-4 Turbo and Claude 3 Opus.

For the infamous MMLU, the performance is about the same for all three models. For the Google-proof graduate STEM assessment, GPQA, the performance is again almost identical, with Claude 3 just about having the lead. For the coding benchmark HumanEval, although that's a deeply flawed benchmark, GPT-4 still seems to be in the lead.

For mathematics, somewhat surprisingly many would say, GPT-4 crushes this new Llama 3 model. So despite the fact that they haven't given us a paper, we can say that the two smaller models released tonight are super competitive with other models of their size, and that this mystery model will be of GPT-4 and Claude 3 Opus class.

And now I must move on from Llama 3, because I think in the last 48 hours there was an announcement that is arguably even more interesting. Using just a single photo of you, we can now get you to say anything. "Have you ever had, maybe you're in that place right now, where you want to turn your life around, and you know somewhere deep in your soul there could be some decisions that you have to make."

It is proving much easier than many people thought to use AI to imitate not just human writing, voices, artwork and music but now even our facial expressions. And by the way, in real time unlike Sora from OpenAI. But what does this even mean? For one, I think it is now almost certain that you will be able to have a real time Zoom call with the next generation of models out later this year.

I think that will change how billions of people interact with AI. How intelligent those models will be, and how soon, has been the subject of a striking new debate this week. Of course I'll cover that, plus controversy over the new and imposing Atlas robot, AI nurses outperforming real ones, and much more.

The VASA-1 paper from Microsoft came out in the last 48 hours. I've read the paper in full, and I'm going to give you only the most relevant highlights. But why pick out VASA-1 when there have been papers and demos of relatively realistic deepfakes this year? Well, it's all about the facial expressions: the blinking, the expressiveness of the lips and eyebrows.

"Surprises me still. I ran it on someone just last night. It was fascinating. You know, she had complained of, she had complained of shoulder like pain in her arm." No model at this resolution has been this good. I think a significant segment of the public, if shown this for the first time with no prep, could believe that these were real.

You can control not only the emotion that the avatar is conveying, from happiness to anger, but also their distance from the camera and the direction of their gaze. "I would say that we as readers are not meant to look at him in any other way but with disdain, especially in how he treats his daughter, okay?

But of course he is able to clearly see through Morris." And even though the VASA-1 model was only trained on real videos, which I'll get to in a moment, it can do things like this. "I'm a paparazzi. I don't play no Yahtzee. I go pop, pop, pop, pop, pop, pop.

See, I tell the truth from what I see and sell it to Perez Hilty." And the creators of VASA-1 say this on the first page of their paper: "This paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors." At the moment, the resolution is almost HD, at 40 frames per second. They also mention, which is crucial, negligible starting latency.

Let me try to demonstrate. Again, all you need is an image and an audio clip from anywhere, AI-generated or otherwise. "But you know what I decided to do? I decided to focus." Now, somewhat ambitiously, the authors mention that this technology will amplify the richness of human-to-human interaction. I agree more with the end of that paragraph, where they talk about social interaction in healthcare.

A few weeks ago, we learned that Hippocratic AI and NVIDIA had teamed up to release AI nurses costing less than $9 an hour. I'll show you the performance metrics, but here's a taster. "This is Linda calling from Memorial Hospital on a recorded line. Is this Albert Wu?" "Yes, it is." "Wonderful.

I'm calling on behalf of Dr. Brown, your cardiologist. To protect your privacy, can you please share your date of birth?" "It's January 1st. I mean, she's not trying to kill me, right? I thought that after all these years of me teasing her, she's finally trying to get back at me." "Rest assured, your wife isn't out to get you.

And there's no need to worry about a negative interaction with your lisinopril. Your latest lab results show your potassium levels are within the normal range, which is between 3.5 and 5." And according to ratings given by human nurses, these AI nurses, even without a video avatar, outperformed in terms of bedside manner and educating the patient.

On a technical level, they outperformed in identifying a medication's impact on lab values, identifying disallowed over-the-counter medications, and way outperformed in detecting toxic dosages. So now imagine your next nurse appointment looking like this. "I'd love to begin with you, firstly, just because I read that you started out in advertising and now you run a wellness business." "These principles will not only make your users' journey more pleasant, they'll contribute to better business metrics as well.

Users hate being interrupted and they hate getting broken experiences. Keeping these principles in mind in your app design makes for a better user journey." Now let's briefly touch on their methodology. What they did differently was to map all possible facial dynamics (lip motion, non-lip expression, eye gaze, and blinking) onto a single latent space.

Think of that as a compute-efficient, condensed machine representation of the actual 3D complexity of facial movements. Previous methods focused much more on just the lips and had much more rigid expressions. The authors also revealed that it is a diffusion transformer model: they used the transformer architecture to map audio to facial expressions and head movements.

So the model first takes the audio clip and generates the appropriate head movements and facial expressions, or at least a latent variable representing those things. Only then, using those facial and head motion codes, does their method produce video frames, which of course also draws on the appearance and identity features extracted from the input image.
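To make that two-stage flow concrete, here is a minimal sketch of the pipeline as described above. The model itself is not public, so every function below is a hypothetical stub that only illustrates the data flow, not Microsoft's implementation; the 16 kHz sample rate is just an assumption for the arithmetic, while 512x512 at roughly 40 fps matches the paper's demos.

```python
# Hypothetical sketch of a VASA-1-style pipeline (stubs only, not released code):
# single image -> appearance/identity features; audio -> latent motion codes
# via a diffusion transformer; then a decoder renders the video frames.
import numpy as np

def encode_face(face_image: np.ndarray):
    """Stub: extract appearance and identity features from one input photo."""
    appearance, identity = np.zeros(256), np.zeros(128)   # placeholder features
    return appearance, identity

def diffusion_transformer(audio: np.ndarray, emotion: str, gaze: tuple):
    """Stub: map the audio clip (plus optional control signals such as emotion
    and gaze direction) to per-frame latent codes for head pose and holistic
    facial dynamics (lips, non-lip expression, eye gaze, blinking)."""
    num_frames = len(audio) // 400            # assumed 16 kHz audio at 40 fps
    return np.zeros((num_frames, 64))         # one motion latent per frame

def render_frames(motion_latents, appearance, identity):
    """Stub: decode video frames conditioned on the motion latents together
    with the appearance/identity features from the single photo."""
    return np.zeros((len(motion_latents), 512, 512, 3), dtype=np.uint8)

# Usage: one photo plus one audio clip in, a talking-head video out.
photo = np.zeros((512, 512, 3), dtype=np.uint8)
speech = np.zeros(16000 * 5)                              # 5 s of 16 kHz audio
appearance, identity = encode_face(photo)
latents = diffusion_transformer(speech, emotion="happiness", gaze=(0.0, 0.0))
video = render_frames(latents, appearance, identity)
print(video.shape)                                        # (200, 512, 512, 3)
```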

Buried deep in the paper is a detail that might surprise you: just how little data it takes to train VASA-1. They used the public VoxCeleb2 dataset. I looked it up, and it calls itself a large-scale dataset, but it's just 2,000 hours. For reference, YouTube might have 2 billion hours.

And we know, according to leaks, that OpenAI trained on a million hours of YouTube data. Now, I know this dataset is curated, but the point remains about the kind of results you can get with this little data. In fairness, they did also mention supplementing with their own smaller dataset of 3,500 subjects.

But the scale of the data remains really quite small. Here is the 15-second headline comparing their method to real video and to previous methods: the lip-syncing accuracy is unprecedented, and the synchronization to the audio is state-of-the-art. The video quality is improved, but of course still far from reality. They say they're working on better imitation of hair and clothing, and on extending to the full upper body.

Now, for fairly obvious reasons, Microsoft are not planning to release VASA-1, and say they have no plans to release an online demo, API, product, or any related offerings until, at least, they are certain that the technology will be used responsibly and in accordance with proper regulations. I'm not quite sure how you could ever be certain of that.

So, likely, a VASA-1 equivalent will be released open source on the dark web in the coming years. Now, of course, to get to Her levels of realism, we'd also need an AI to analyze our own emotions. "Are you social or anti-social?" "I guess I haven't been social in a while, mostly because..." "In your voice, I sense hesitance.

Would you agree with that?" "Was I sounding hesitant?" "Yes." "Oh, sorry if I was sounding hesitant. I was just trying to be more accurate." "Would you like your OS to have a male or female voice?" "Female, I guess." But you're probably not surprised to learn that there's a company focused squarely on that: Hume AI.

I'm going to start a conversation and have the AI analyze the emotions in my voice. Should be interesting. Tonight, I am actually debuting a new newsletter called Signal to Noise and the link will be in the description. I'm pretty pumped. Determination, calmness? I don't think I'm that calm. Concentration?

Okay, I'll take it. And yes, that wasn't just to test Hume AI, that's a real announcement. I have worked for months on this one and I'm really proud of how it looks and sounds. It's free to sign up and the inspiration behind the name was this. As all of you guys watching on YouTube know, there's a lot of noise around but not as much signal.

And on this channel, I try to maintain a high signal to noise ratio. I basically only make videos on this channel when there's something that's happened that I actually find interesting myself. And it will be the same with this newsletter. I'm only actually going to do posts when there's something interesting that's happened.

And more than that, I'm going to give every post a "Does it change everything?" dice rating. That's my quirky way of analyzing whether the entire industry is actually stunned. So absolutely no spam, quality writing, at least in my opinion, and a "Does it change everything?" rating that you can see at a glance.

Each post is like a three-to-four-minute read, and the philosophy was that I wanted a newsletter that I would be excited about. And purely for those who really want to support the hype-free ethos of the channel and the newsletter, there is the Insider Essentials tier. You'll get exclusive posts, sample Insider videos, and access to an experimental Smart GPT 2.0.

Absolutely no obligation to join; I would be overjoyed if you simply sign up to the free newsletter. Whether you're subscribing for free or with Essentials, do check your spam folder, because sometimes the welcome message goes there. As always, if you want all my extra video content and professional networking and tip sharing, do sign up for AI Insiders on Patreon.

At least so far, I've been able to individually welcome every single new member. But of course, while deepfakes progress, robot agility is also progressing. Here's the new Atlas from Boston Dynamics. Now, the other most famous robot on the scene is Figure 01, which I talked about in a recent video.

And just two hours ago, the CEO of the company that makes Figure 01 said this, speaking of Boston Dynamics: "The new Atlas won't be the last time we get copied. If it's not obvious yet, Figure is doing the best mechanical design in the world for robotics." And he was referencing the waist design of the new Atlas robot.

Now, whether that comment is more about PR and posturing, only time will tell. But before we completely leave the topic of AI social interaction and Her, here's Sam Altman from two days ago. He suggests that the personalization of AI to you might be even more important than its inherent intelligence.

I do start to wonder if that's part of a deliberate strategy from OpenAI. In my recent Stargate video, I talked about how Microsoft are spending a hundred billion dollars. But this week, Hassabis said that Google will be spending more than that on compute. So if it is true that Google starts to race away with the power of their models, that could be one way that OpenAI competes: get more data from more users and personalize their AI to you, likely with a video avatar.

And don't forget, we got very early hints of this with the GPT store. OpenAI are now paying US builders based on user engagement with their GPTs. At the moment, that user engagement is apparently really quite low, but throw in a lifelike video avatar and that might change quite quickly.

Of course, those models would only become truly addictive for many when they are as smart as the average human. There are those, though, of course, who say that's never going to happen, including the creators of some cutting-edge models. Here's Arthur Mensch, co-founder of Mistral: "The whole AGI rhetoric, artificial general intelligence, is about creating God.

I don't believe in God. I'm a strong atheist, so I don't believe in AGI." I'm not personally sure about the link there, but it's an interesting quote. Then we have Yann LeCun, a famous LLM skeptic. He's previously said that something like AGI definitely wouldn't be coming in the next five years.

Three days ago, he said this: "There is no question that AI will eventually reach and surpass human intelligence in all domains, but it won't happen next year." He then went on, in parentheses, to say that autoregressive LLMs may indeed constitute a component of AGI. That does seem to me to be a slight change in emphasis from his previous statements.

Others, like the CEO of Anthropic, have much more aggressive timelines. For the context of what you're about to hear from Dario Amodei: ASL-3 (AI Safety Level 3) refers to systems that substantially increase the risk of catastrophic misuse or show low-level autonomous capabilities, whereas ASL-4 indicates systems that involve qualitative escalations in catastrophic misuse potential and autonomy.

On timelines, just this week, he said this, "When you imagine how many years away, just roughly, ASL 3 is and how many years away ASL 4 is, right, you've thought a lot about this exponential scaling curve. If you just had to guess, what are we talking about?" "Yeah, I think ASL 3 is, you know, could easily happen this year or next year.

I think ASL..." "Oh, Jesus Christ." "No, no, I told you, I'm a believer in exponentials. I think ASL 4 could happen anywhere from 2025 to 2028." "So that is fast." "Yeah, no, no, I'm truly talking about the near future here. I'm not talking about 50 years away." So, according to who you listen to, AGI either doesn't exist or is coming pretty imminently.

But I have to end as I began: with Her. Some say that the movie Her was set in the year 2025, and that's starting to seem pretty appropriate. Now, whether or not it's actually released, I do think we, humanity, will be technologically capable of something approximating Her by next year.

Let me know if you agree. Thank you so much for watching to the end of the video. Please do check out my new newsletter. I'm super proud of it. And as always, have a wonderful day.