Not Slowing Down: GAIA-1 to GPT Vision Tips, Nvidia B100 to Bard vs LLaVA
For AI progress to slow down it would need to run out of data, compute and algorithmic efficiency. 00:00:06.960 |
But developments this week suggest that the field isn't running out of any of these things, 00:00:11.220 |
let alone all of them. I'm going to give you a glimpse of what this means in robotics, 00:00:15.720 |
audio and vision, and end with some practical tips to help you use GPT Vision, as well as 00:00:22.080 |
comparing it to Bard and LLaVA. But let's start with GAIA-1 from Wayve, which is generating the 00:00:28.420 |
synthetic video that you can see now. And no I'm not just bringing it up because it looks cool, 00:00:33.380 |
the CEO this week said: "I believe synthetic training data is the future for AI because it's safer, 00:00:40.120 |
cheaper and infinitely scalable." That's my point: when synthetic data gets this good, we're not going 00:00:46.040 |
to run out of data. Many of you may not know that GPT-4 itself was trained on some synthetic data, 00:00:51.960 |
and if you're interested do check out my videos on Orca and Phi-1 to see how much synthetic data 00:00:58.020 |
can actually help smaller language models. And the 00:01:00.840 |
synthetic video data you just saw came from a scrappy outsider training on fewer than 100 00:01:06.900 |
Nvidia A100s. Now imagine the kind of synthetic data that Tesla could come up with, with the 00:01:12.500 |
equivalent of 300,000 A100s. And of course Tesla already has billions of hours of real world data 00:01:20.300 |
that compares to the 4,700 hours that GAIA-1 was trained on. Now many of you might say 00:01:26.860 |
it's crazy how quickly things are improving with synthetic video data. And it's cool that a model like 00:01:32.820 |
this can generate unlimited data, including adversarial examples. What does that mean, by 00:01:37.760 |
the way, in this context? Well, for example, people jaywalking across the road in the fog. 00:01:43.400 |
Even Tesla with its billions of hours of real world data probably only saw that scenario a 00:01:49.060 |
limited number of times. But impressive as it is, isn't this just for autonomous driving? No, not 00:01:54.160 |
even close, this is also for real-world robotics. Just two days ago a new model 00:01:58.360 |
called UniSim was announced, and it can simulate a range of robotic tasks, 00:02:09.900 |
like picking up toothpaste in multiple steps. Now you probably don't need me 00:02:16.100 |
to tell you why unlimited training data for robotics might be useful. I'll let you watch 00:02:21.860 |
this imaginary demo of a robot closing the bottom drawer and opening it. 00:10:28.020 |
at 45%. Final failure, I asked it to list the bottom three countries in terms of the percentage 00:10:34.960 |
visiting the science slash technology museum and this time it skipped over both Japan and South 00:10:40.440 |
Korea to list the EU as the one with the third lowest percentage. So what is my tip? And let me 00:10:47.180 |
know in the comments if any of you find this helpful. Well, drawing a bit on few-shotting 00:10:52.400 |
and self-consistency, I gave it three different angles of the same chart. But even more crucially 00:10:58.260 |
than that perhaps, I asked it to recreate the data as tables. I then said check for any 00:11:04.120 |
dissimilarities and resolve them by majority vote. The reason I did this is that I noticed that 00:11:09.400 |
sometimes it could output a correct table and still get the analysis wrong even though it's 00:11:14.520 |
simple mathematics. So what this was doing was splitting the task up into two. First, it was 00:11:19.600 |
reducing the chance of minor errors, because any one-off misreading gets outvoted by the other two 00:11:22.380 |
versions of the chart. And then it was getting it to do the analysis only after it had already recreated the tables. And look at 00:11:30.840 |
the difference. This time when I asked it about the bottom three countries, it got it right. And 00:11:35.840 |
then I asked it again, what was it, about the zoo slash aquarium. That was the one it got wrong 00:11:41.180 |
twice before as you saw. This time it correctly picked out China at 51%. If you're wondering by 00:11:47.200 |
the way how I got different versions of the same image, it was by pressing Windows, Shift and 00:11:52.360 |
S, and then just highlighting like this. Anyway, I think that's a cool tip. Try it out. 00:11:58.640 |
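If you'd rather script this tip than paste screenshots into ChatGPT, here is a minimal sketch of the same idea against the OpenAI Python API. To be clear, the model name, file names and exact prompt wording below are my own assumptions, not something shown in the video; the structure is the point: several views of the same chart, recreate the table first, majority-vote any disagreements, and only then do the analysis.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def encode_image(path: str) -> str:
    """Read a local screenshot and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# Three screenshots of the SAME chart, cropped slightly differently
# (file names here are hypothetical).
paths = ["chart_a.png", "chart_b.png", "chart_c.png"]

content = [{"type": "text", "text": (
    "You are given three versions of the same chart. "
    "Step 1: recreate the underlying data as a table from each image separately. "
    "Step 2: check the three tables for any dissimilarities and resolve them by majority vote. "
    "Step 3: only after producing the final table, list the bottom three countries by percentage."
)}]
for path in paths:
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"},
    })

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{"role": "user", "content": content}],
    max_tokens=1500,
)
print(response.choices[0].message.content)
```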
Let me know in the comments if it's at all helpful. But finally, let's compare LLaVA and Bard to GPT 00:12:04.680 |
Vision. On text, LLaVA didn't do as well. It wasn't able to notice that this coffee cup missed out the 00:12:11.620 |
B in "sip by sip". Bard not only noticed, but even came up with an amazing metric to find the 00:12:18.780 |
distance between the two texts, the prompt and what came out. 00:12:22.340 |
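The video doesn't name the metric Bard produced, but the textbook way to measure the distance between a prompt and what actually got rendered is Levenshtein edit distance; here's a minimal sketch, using the cup's missing letter as a hypothetical example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

# The cup dropped one letter from the prompt, so the distance is 1.
print(levenshtein("sip by sip", "sip y sip"))  # -> 1
```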
Another difference I found between the models was when it came to faces. I asked what was the 00:12:27.400 |
fate of this character, Saruman. GPT-4 successfully said the character is Saruman and gave the fate 00:12:33.700 |
of that character. Bard flat out refused saying sorry I can't help with images of people yet. 00:12:38.960 |
While LLaVA was only kind of helpful, saying the character in the image is Gandalf. What about some of those 00:12:46.220 |
table questions like I was giving GPT-4 earlier? Well, Bard kind of flopped saying that the answer 00:12:52.320 |
was the US, but at least got the percentage correct at 51%. LLaVA did less well, saying that the answer 00:12:59.740 |
was Brazil. Now maybe this demo doesn't reflect the full capabilities of LLaVA, because I read the 00:13:06.040 |
paper that came with the announcement of LLaVA 1.5. Apparently it got 80% in visual question 00:13:11.760 |
answering version 2. That's less than one of Google's models which I've talked about before, 00:13:15.640 |
PaLI, at 17 billion parameters, but apparently better than GPT-4. So you guys can let me know if I'm missing 00:13:21.360 |
anything about this. 00:13:22.300 |
Just before I move on though, I can't help but say that I was really impressed by GPT-4's 00:13:28.960 |
analysis of this image. I asked what is poignant and unexpected about this image and it picked up 00:13:34.940 |
on the contrast between the devastating event that's unfolding and the seemingly calm demeanor 00:13:40.340 |
of the observers. It picked up on almost every detail of the image and it was a fantastic answer. 00:13:46.300 |
I saw that yesterday someone had the idea of putting the Mona Lisa into GPT Vision, asking it 00:13:52.280 |
to describe the image. It then got DALL-E 3 to generate an image based on that description 00:13:57.760 |
and put it into a recursive loop. And this was the result of that recursive loop. 00:14:03.340 |
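For anyone who wants to reproduce that loop, here is a rough sketch of how it might be wired up. The model names, the starting URL and the five-iteration cap are assumptions on my part; the video doesn't show any code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def describe(image_url: str) -> str:
    """Ask the vision model for a detailed description of an image."""
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe this image in as much detail as possible."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
        max_tokens=500,
    )
    return resp.choices[0].message.content

def generate(prompt: str) -> str:
    """Generate an image from the description and return its URL."""
    resp = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    return resp.data[0].url

url = "https://example.com/mona_lisa.jpg"  # hypothetical starting image URL
for step in range(5):  # each pass drifts a little further from the original
    url = generate(describe(url))
    print(f"iteration {step + 1}: {url}")
```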
And with the explosion in synthetic data and compute, I predict the world will get equally crazy quite soon. 00:14:09.500 |
Thank you as ever for watching to the end and have a wonderful day.