Back to Index

Not Slowing Down: GAIA-1 to GPT Vision Tips, Nvidia B100 to Bard vs LLaVA


Transcript

For AI progress to slow down, it would need to run out of data, compute, and algorithmic efficiency. But developments this week suggest that the field isn't running out of any of these things, let alone all of them. I'm going to give you a glimpse of what this means in robotics, audio, and vision, and end with some practical tips to help you use GPT Vision, as well as comparing it to Bard and LLaVA.

But let's start with GAIA-1 from Wayve, which is generating the synthetic video that you can see now. And no, I'm not just bringing it up because it looks cool: the CEO this week said, "I believe synthetic training data is the future for AI because it's safer, cheaper and infinitely scalable."

That's my point: when synthetic data gets this good, we're not going to run out of data. Many of you may not know that GPT-4 itself was trained on some synthetic data, and if you're interested, do check out my videos on Orca and Phi-1 to see how much synthetic data can help smaller language models.

And the synthetic video data you just saw came from a scrappy outsider training on fewer than 100 Nvidia A100s. Now imagine the kind of synthetic data that Tesla could come up with, with the equivalent of 300,000 A100s.

And of course, Tesla already has billions of hours of real-world data; compare that to the 4,700 hours that GAIA-1 was trained on. Now, many of you might say it's crazy that things are improving this quickly, and a lot of that is down to synthetic video data.

And yes, it's cool that a model like this can generate unlimited data, including adversarial examples. What does that mean in this context? Well, for example, people jaywalking across the road in the fog. Even Tesla, with its billions of hours of real-world data, probably only saw that scenario a limited number of times.

But impressive as it is, isn't this just for autonomous driving? No, not even close: this is also for real-world robotics. Just two days ago, researchers released a new model called UniSim, and it's a very interesting model because it can simulate a range of things, like unveiling toothpaste and then picking it up in multiple steps.

Now, you probably don't need me to tell you why unlimited training data for robotics might be useful. I'll let you watch this imagined demo of a robot closing the bottom drawer and opening the top drawer.

…at 45%. Final failure: I asked it to list the bottom three countries in terms of the percentage visiting the science/technology museum, and this time it skipped over both Japan and South Korea to list the EU as the one with the third-lowest percentage.

So what is my tip? Let me know in the comments if any of you find this helpful. Well, drawing a bit on few-shotting and self-consistency, I gave it three different angles of the same chart. But even more crucially than that, perhaps, I asked it to recreate the data from the chart as tables.

I then said: check for any dissimilarities and resolve them by majority vote. The reason I did this is that I noticed it could sometimes output a correct table and still get the analysis wrong, even though it's simple mathematics. So what this was doing was splitting the task in two.

First, it was reducing the chance of minor reading errors, because a one-off mistake in any single angle gets outvoted by the other two. And then it was getting it to do the analysis only after it had already recreated the tables. And look at the difference: this time, when I asked it about the bottom three countries, it got it right.

And then I asked it again about the zoo/aquarium question, the one it got wrong twice before, as you saw. This time it correctly picked out China at 51%. If you're wondering, by the way, how I got different versions of the same image, it was by pressing Windows + Shift + S and then just highlighting like this.
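If you wanted to wire this tip up programmatically rather than in the chat interface, here's a minimal sketch, assuming the OpenAI Python SDK (v1) and the gpt-4-vision-preview model name; the file names and prompt wording are my own illustrations, not what was used in the video.

```python
# A minimal sketch of the "three angles + recreate the tables + majority vote"
# tip, assuming the OpenAI Python SDK (v1) and the gpt-4-vision-preview model.
# File names and prompt wording are illustrative, not from the video.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode a local screenshot so it can be sent inline."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Three crops of the same chart, e.g. captured with Windows + Shift + S.
crops = ["chart_a.png", "chart_b.png", "chart_c.png"]
image_parts = [
    {"type": "image_url",
     "image_url": {"url": f"data:image/png;base64,{encode_image(p)}"}}
    for p in crops
]

# Step 1: transcription only -- no analysis yet. Disagreements between the
# three readings are resolved by majority vote, self-consistency style.
step1 = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "These are three crops of the same chart. Recreate the "
                "underlying data as a table for each crop separately, then "
                "check for any dissimilarities and resolve them by majority "
                "vote, outputting one final table.")},
            *image_parts,
        ],
    }],
    max_tokens=800,
)
final_table = step1.choices[0].message.content

# Step 2: do the analysis only over the reconciled table, not the raw image.
step2 = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": f"Using only this table:\n{final_table}\n"
                   "List the bottom three countries by percentage.",
    }],
    max_tokens=300,
)
print(step2.choices[0].message.content)
```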

Anyway, I think that's a cool tip; try it out, and let me know in the comments if it's at all helpful. But finally, let's compare LLaVA and Bard to GPT Vision. On text, LLaVA didn't do as well: it wasn't able to notice that this coffee cup missed out the B in "sip by sip".

Bard not only noticed but even came up with an amazing metric to quantify the distance between the two texts: the prompt and what actually came out on the cup.
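The video doesn't show which metric Bard actually produced; a plausible candidate for a distance between two short texts is Levenshtein (edit) distance, sketched here from scratch so nothing is assumed about Bard's implementation. The exact misprint on the cup is also an assumption, made purely for illustration.

```python
# Levenshtein (edit) distance: a plausible stand-in for the kind of metric
# Bard produced; the exact metric and the cup's misprint are assumptions.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# If the cup dropped the "b" in "by", that's one deletion: distance 1.
print(levenshtein("sip by sip", "sip y sip"))  # -> 1
```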

Another difference I found between the models was when it came to faces. I asked what was the fate of this character, Saruman. GPT-4 successfully identified the character as Saruman and gave his fate. Bard flat out refused, saying, "Sorry, I can't help with images of people yet," while LLaVA was kind of helpful, saying the character in the image is Gandalf. What about some of those table questions like the ones I was giving GPT-4 earlier? Well, Bard kind of flopped, saying that the answer was the US, but at least it got the percentage correct at 51%.

LLaVA did less well, saying that the answer was Brazil. Now, maybe this demo doesn't reflect the full capabilities of LLaVA, because I read the paper that came with the announcement of LLaVA 1.5. Apparently it got 80% on VQAv2 (visual question answering v2). That's less than one of Google's models, which I've talked about before, PaLI at 17 billion parameters, but apparently better than GPT-4.

So you guys can let me know if I'm missing anything about this; I'll be happy to answer any questions you have. Just before I move on, though, I can't help but say that I was really impressed by GPT-4's analysis of this image. I asked what is poignant and unexpected about this image, and it picked up on the contrast between the devastating event that's unfolding and the seemingly calm demeanor of the observers.

It picked up on almost every detail of the image, and it was a fantastic answer. I saw that yesterday someone had the idea of putting the Mona Lisa into GPT Vision, asking it to describe the image, then getting DALL-E 3 to generate an image based on that description, and putting it into a recursive loop.
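As a sketch of how such a loop could be wired up, again assuming the OpenAI Python SDK (v1), the gpt-4-vision-preview and dall-e-3 model names, and a hypothetical starting image URL:

```python
# A minimal sketch of the describe-and-regenerate loop; model names, prompts,
# and the starting URL are assumptions, not details from the video.
from openai import OpenAI

client = OpenAI()
image_url = "https://example.com/mona_lisa.png"  # hypothetical starting image

for step in range(5):  # five rounds of the recursive loop
    # 1) Ask GPT Vision to describe the current image.
    description = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=300,
    ).choices[0].message.content

    # 2) Feed the description to DALL-E 3 and loop on the generated image.
    image_url = client.images.generate(
        model="dall-e-3", prompt=description, n=1,
    ).data[0].url
    print(f"step {step}: {image_url}")
```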

And this was the result of that recursive loop. And with the explosion in synthetic data and compute, I predict the world will get equally crazy quite soon. Thank you as ever for watching to the end and have a wonderful day.