
Udio, the Mysterious GPT Update, and Infinite Attention


Transcript

It's been a strange 48 hours in the world of AI, with releases like Udio that have reminded millions of people what AI is capable of, and models that can pay you infinite attention. But we also got befuddling updates from OpenAI that suggest that not all is smooth sailing. I'll start, of course, with the new model on udio.com and how musicians are reacting.

Then I'll cover the perplexing manner of the release of GPT-4 Turbo with Vision and touch on a fascinating new infinite-context paper from Google. But now let's hear three 20-second extracts from Udio to give you an inkling, if you haven't heard it already, of what it's capable of. Here's Dune, the Broadway musical.

And now for some quite frankly amazing AI-generated classical music. And next, something I'm going to bleep a little bit, but which represents the reaction of Uncharted Labs, who are behind Udio, to their servers going down. And of course, I have been playing about with Udio like almost everyone has, and did you know it can do stand-up comedy?

Now I'm not sure if this guy is talking about me, but I thought I'd let you know that this kind of thing is possible. And how about a quick, direct comparison between Udio and Suno V3? Now, I prefer Udio there, but you do sometimes get complete gobbledygook. Now will.i.am calls Udio the best tech on earth, and he says Uncharted Labs, the company behind Udio, is really aiming to be an ally for creatives and artists.

Now it should of course be noted that will.i.am is an investor in Udio, but again they repeat that Udio is about building AI tools to enable the next generation of music creators. Now of course everyone has their own opinion, but let's now get a taste of the reaction from some musicians.

One says it's pretty scary thinking what is going to exist a year or two from now, and what it means for musicians, listeners, and the industry as a whole. The top comment says I would buy a band t-shirt, but never buy a shirt for an AI, which makes sense.

But here are two more common reactions. I am a music professional, producer/composer. This is highly advanced, and I thought this stuff was years away. And one more, I've already gone full circle with it, past the confusion and devastation, and now I'm just curious what Gregorian chant would sound like with, I can't even pronounce that, and blast beats.

So definitely a mixed reaction from musicians. Personally, I don't think it's too much of an exaggeration to call this the ChatGPT moment for music generation. Suno often has a slight tinniness that gives it away, even for those not following AI, but with Udio, I think you could convince many people that they're listening to human music, just like ChatGPT felt like human text if you didn't look too closely.

I could well see before the end of this year, hundreds of millions of people using this for entertainment. Imagine every school child in the world walking out of their lesson in whichever language with a catchy tune about what they've learned. So yes, I do believe that Udio is the biggest news of this week.

But of course, we had the mysterious release of a new GPT-4 Turbo model from OpenAI. And why do I call it mysterious? Well, not because it wasn't named GPT-4.5. They probably thought it wasn't enough of a step forward to give it that name. The strangeness was the repeated emphasis on it being better than previous iterations, but without any detail.

They called it majorly improved. Where are the benchmarks though? And now here's some more mystery. All the top players at OpenAI like Greg Brockman and Mira Murati tweeted out the news of the new model. But strangely, for the first time, Sam Altman didn't. Now this isn't about reading any tea leaves, it's just a very strange announcement from OpenAI.

I ran my own maths and logic benchmarks and I couldn't see much of a difference. It failed the same questions that the January version of GPT-4 Turbo failed. Of course, the functionality improved with function calling within vision. But what intrigued me was the repeated claims that GPT-4 reasoning had been further improved.

Naturally, on this channel, that's what I was most focused on. The cutting edge of intelligence. Here though is some of the best benchmarking work that I could find. On the noted MATH benchmark from Dan Hendrycks, you could see a bump in its performance on the hardest style of questions, from 35% to around 45%.

Even one level down, the performance bumped up from 57% to 66%. The difference on the easier questions wasn't nearly as pronounced. It seems pretty clear that the dataset got augmented with some high-level mathematics and code. Otherwise, not too much changed. Here's another example, LiveCodeBench. You can't complain about contamination because they source their questions from after the training date of the models.

And again, as you can see, performance has increased, particularly for harder questions. These are sourced from contests like LeetCode. And that applies not just to code generation, but self-repair. Again though, we're not talking about massive leaps, just small bumps. Here though is the clearest assessment from Epoch AI. The diamond set of the GPQA are the hardest kind of graduate questions.

We're talking Google-proof STEM questions that even PhDs find hard. And yes, there was a bump, maybe by 2% or 3%, but GPT-4 Turbo, April edition, is still lower performing than Claude 3 Opus. Of course, the deeper question is whether or not this indicates some inherent limitation in simply training on more and more advanced data.

It's a bit like the current paradigm can only go so far, even with better data. Of course, you can watch any of my other videos to see why I don't think that will be much of a bottleneck for much longer. Now it would be remiss of me not to spend a few seconds touching on two releases from the open-weights community.

I'm not going to call it the open-source community because they're not releasing their training datasets. I'm talking about the new Mixtral 8x22B Mixture of Experts model and Cohere's Command R+. Now you can judge for yourself, but they land around the level of Claude 3 Sonnet, which is the medium-sized model.

Of course, that is a proprietary model. Some people may have expected the open-weights community to have caught up to GPT-4 by now, but that's not quite the case. Of course, let's wait to see if Llama 3 can further bridge that gap. Now before we get to Google, there was one more announcement of a model I want to touch on.

So as I've done once before on this channel, I reached out to the company to ask about a sponsorship. I've probably turned down thousands of sponsorship offers, but I'm happy to say that this part of the video is sponsored by AssemblyAI. So what happened? They released Universal-1.

And basically the reason I reached out to them is because it's really darn good. I'm often transcribing videos and rarely do the models get characters like GPT correct, let alone names like Satya Nadella. Universal-1 did. So yes, Universal-1 is the model I personally use and you can see some comparisons to other models in this chart.

It does seem to hallucinate less than Whisper and takes 38 seconds to process an hour of audio. Anyway, Universal-1 only came out like a week ago and I think it's epic, but let me know what you think. The link, of course, will be in the description. But now from yesterday, a quite fascinating paper from Google.

It's about transformer models that could have infinite context. Not 1 million or 10 million, but infinite. I must say unusually for this channel, I haven't had a chance to finish the paper before talking about it. I wanted to include it in this video for a reason. Of course, the prospect of feeding in entire libraries is fascinating.

But my theory is that this approach might be behind Gemini 1.5's long-context ability. If you remember, Gemini 1.5, whose API is now widely available, was able to process up to at least 10 million tokens. Notice the phrase "at least" there. If you're not familiar with tokens, think of 10 million tokens as being around 8 million words.

And if that's a daunting number, think of 8 entire sets of Harry Potter novels. Now, on the day that Gemini 1.5 came out, I called it the biggest development of that day, despite it being the same day that Sora came out. I would still stick to that to this day.
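As a rough aid to the token arithmetic above, here is a minimal sketch of the conversion. The two constants are assumptions taken from the figures quoted in this transcript (10 million tokens to about 8 million words, and about 8 Harry Potter sets for those words). Real ratios vary by tokenizer and language, and the function names are hypothetical.

```python
# Assumption: words-per-token ratio implied by the transcript's own figures
# (10 million tokens ~ 8 million words); actual ratios depend on the tokenizer.
WORDS_PER_TOKEN = 0.8

# Assumption: roughly one million words per full Harry Potter series,
# implied by the transcript's "8 entire sets" for ~8 million words.
WORDS_PER_SERIES = 1_000_000


def tokens_to_words(tokens: int) -> int:
    """Estimate how many words a given number of tokens corresponds to."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_series(words: int) -> float:
    """Express a word count as a multiple of one full series."""
    return words / WORDS_PER_SERIES


if __name__ == "__main__":
    words = tokens_to_words(10_000_000)
    print(f"{words:,} words, about {words_to_series(words):.0f} full series")
```

Changing `WORDS_PER_TOKEN` to another rule of thumb (for example 0.75) shifts the estimate accordingly, which is why the figure is best read as an order of magnitude.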

Gemini 1.5 could find metaphorical needles in videos 3 hours long or audio 22 hours long. And the performance just kept improving up to and beyond 10 million tokens. But back to yesterday's paper, why do I think there's any link? Now, one hint is that one of the authors, Manaal Faruqui, and sorry if I'm mispronouncing your name, was also an author on the original Gemini papers.

The other hint comes from the paper itself, where they call their approach a "plug-and-play long-context adaptation capability" with which they can "continually pre-train existing LLMs". In other words, it appears that you can take existing LLMs and just continue pre-training them with this approach to make them great at long context, or indeed infinite context.

Is that part of what happened to Gemini 1 Pro to turn it into Gemini 1.5 Pro? Anyway, it is interesting that Google published this, while still being a bit cagey about some crucial details. They do conclude though that this approach enables LLMs to process infinitely long contexts, even though they've got bounded memory and computation resources.
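To make "infinitely long input with bounded memory" a little more concrete, here is a toy sketch. It is not the paper's method, which this transcript does not describe. It is an illustrative assumption, loosely in the style of linear-attention associative memories, and the class and method names are hypothetical. The only point is that the stored state stays the same size however many segments are streamed through it.

```python
import math


def feature(x):
    """ELU(x) + 1 applied elementwise, so every feature is strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class FixedSizeMemory:
    """Toy associative memory: a d_key x d_value matrix plus a normaliser.

    Illustrative assumption only; not taken from the Google paper.
    """

    def __init__(self, d_key, d_value):
        self.M = [[0.0] * d_value for _ in range(d_key)]
        self.z = [0.0] * d_key

    def update(self, keys, values):
        """Fold one segment of (key, value) pairs into the fixed-size state."""
        for k, v in zip(keys, values):
            fk = feature(k)
            for i, fki in enumerate(fk):
                self.z[i] += fki
                for j, vj in enumerate(v):
                    self.M[i][j] += fki * vj

    def retrieve(self, query):
        """Read out a weighted average of the stored values for this query."""
        fq = feature(query)
        denom = sum(a * b for a, b in zip(fq, self.z))
        d_value = len(self.M[0])
        if denom == 0.0:  # nothing stored yet
            return [0.0] * d_value
        return [
            sum(fq[i] * self.M[i][j] for i in range(len(fq))) / denom
            for j in range(d_value)
        ]

    def size(self):
        """Number of floats held, independent of how much input was seen."""
        return len(self.M) * len(self.M[0]) + len(self.z)


if __name__ == "__main__":
    mem = FixedSizeMemory(d_key=2, d_value=2)
    before = mem.size()
    for step in range(1000):  # stream 1000 one-token segments
        mem.update([[0.1 * (step % 7), -0.2]], [[float(step % 3), 1.0]])
    print(before, mem.size())  # the state has not grown
```

The trade-off in a scheme like this is that old content is compressed rather than stored verbatim, so recall is approximate. How the actual paper handles that is beyond what this transcript covers.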

Now I am going to consult with some colleagues before I say much more about this paper, but just think about some of the possibilities. Imagine a model being able to process every film made by a particular director, or every work of French literature from a particular period, or every email that you've ever sent.

But let's not get too far ahead of ourselves because it's not like Google don't have their own issues. This week we learned that apparently Demis Hassabis said that he thought it would be especially difficult for Google to catch up to its rival OpenAI with generated video. He also apparently mused about leaving Google and raising billions of dollars to start a new research lab.

If he did leave to start his own lab, that would swiftly become a very competitive lab. To bring us back to the start, that's actually how Udio was born. We learned from The Information that Udio is the work of Uncharted Labs, made up primarily of former Google DeepMind staff.

Those researchers had created the model Lyria back in the spring of last year. That could be a very similar model to what we now have in Udio, but the company didn't unveil it until November of last year and Google still hasn't made it available to the public. It seems like Demis Hassabis isn't the only one with some frustration at Google.

But before I end the video, I must give Google great credit for this release within the last 24 hours. With deep learning, of course, they trained these ultra cute football players. And yes, I'm calling it football. These two players weren't manually designed to do the moves they're doing. Through deep reinforcement learning, they learnt to anticipate ball movements and block opponent shots.

And these guys were trained in simulation, which I talked about in my recent NVIDIA video. Compared to a pre-scripted baseline, these agents walked three times faster, turned four times faster and kicked the ball 30% faster. Soon therefore, we could have our own mini Erling Haaland. So quite the rollercoaster 48 hours in AI.

As always, let me know what you think in the comments. Feel free to hop on board my Patreon. But regardless, thank you so much for watching and have a wonderful day.