Not Slowing Down: GAIA-1 to GPT Vision Tips, Nvidia B100 to Bard vs LLaVA

00:00:00.000 | For AI progress to slow down it would need to run out of data, compute and algorithmic efficiency.

00:00:06.960 | But developments this week suggest that the field isn't running out of any of these things,

00:00:11.220 | let alone all of them. I'm going to give you a glimpse of what this means in robotics,

00:00:15.720 | audio and vision and end with some practical tips to help you use GPT Vision as well as

00:00:22.080 | comparing it to Bard and LLaVA. But let's start with GAIA-1 from Wayve, which is generating the

00:00:28.420 | synthetic video that you can see now. And no I'm not just bringing it up because it looks cool,

00:00:33.380 | the CEO this week said, "I believe synthetic training data is the future for AI because it's safer,

00:00:40.120 | cheaper and infinitely scalable." That's my point: when synthetic data gets this good we're not going

00:00:46.040 | to run out of data. Many of you may not know that GPT-4 itself was trained on some synthetic data

00:00:51.960 | and if you're interested do check out my videos on Orca and Phi-1 to see how much synthetic data

00:00:58.020 | can actually help smaller language models. And the

00:01:00.840 | synthetic video data you just saw came from a scrappy outsider training on fewer than 100

00:01:06.900 | Nvidia A100s. Now imagine the kind of synthetic data that Tesla could come up with, with the

00:01:12.500 | equivalent of 300,000 A100s. And of course Tesla already has billions of hours of real world data

00:01:20.300 | that compares to the 4,700 hours that Gaia-1 was trained on. Now many of you might say that yes

00:01:26.860 | it's crazy that things are improving this quickly with synthetic video data. And yes it's cool that a model like

00:01:32.820 | this can generate unlimited data including adversarial examples. What does that mean by

00:01:37.760 | the way in this context? Well for example people walking across the road jaywalking in the fog.

00:01:43.400 | Even Tesla with its billions of hours of real world data probably only saw that scenario a

00:01:49.060 | limited number of times. But impressive as it is isn't this just for autonomous driving? No not

00:01:54.160 | even close, this is also for real-world robotics. Just two days ago a new model

00:01:58.360 | called UniSim was unveiled. It can simulate a range of things like

00:02:09.900 | unveiling toothpaste, picking up the toothpaste in multiple steps. Now you probably don't need me

00:02:16.100 | to tell you why unlimited training data for robotics might be useful. I'll let you watch

00:02:21.860 | this imaginary demo of a robot closing the bottom drawer and opening it.

00:02:28.340 | And opening the top drawer.

00:10:28.020 | at 45%. Final failure, I asked it to list the bottom three countries in terms of the percentage

00:10:34.960 | visiting the science slash technology museum and this time it skipped over both Japan and South

00:10:40.440 | Korea to list the EU as the one with the third lowest percentage. So what is my tip and let me

00:10:47.180 | know in the comments if any of you find this helpful. Well, drawing a bit on few-shotting

00:10:52.400 | and self-consistency, I gave it three different angles of the same chart. But even more crucially

00:10:58.260 | than that perhaps, I asked it to recreate the data from the tables. I then said check for any

00:11:04.120 | dissimilarities and resolve them by majority vote. The reason I did this is that I noticed that

00:11:09.400 | sometimes it could output a correct table and still get the analysis wrong even though it's

00:11:14.520 | simple mathematics. So what this was doing was splitting the task up into two. First it was

00:11:19.600 | reducing the chance of minor errors by taking a majority vote across the three versions. And then

00:11:22.380 | it was getting it to do the analysis only after it had already recreated the tables. And look at

00:11:30.840 | the difference. This time when I asked it about the bottom three countries, it got it right. And

00:11:35.840 | then I asked it again, what was it, about the zoo slash aquarium. That was the one it got wrong

00:11:41.180 | twice before as you saw. This time it correctly picked out China at 51%. If you're wondering by

00:11:47.200 | the way how I got different versions of the same image, it was by pressing Windows, Shift and

00:11:52.360 | S and then just highlighting like this. Anyway, I think that's a cool tip. Try it out. Let me know

00:11:58.640 | in the comments if it's at all helpful. But finally, let's compare LLaVA and Bard to GPT

00:12:04.680 | Vision. On text, LLaVA didn't do as well. It wasn't able to notice that this coffee cup missed out the

00:12:11.620 | B in sip by sip. Bard not only noticed but even came up with an amazing metric to find the

00:12:18.780 | distance between the two texts, the prompt and what came out.
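
Returning to the chart tip from a moment ago (recreate the table from three screenshots, then resolve any disagreements by majority vote), here is a minimal sketch of that voting step in Python. It is only an illustration of the idea, not what GPT Vision does internally. The function name `majority_vote` and the readings are hypothetical: apart from China at 51%, the numbers are made up for the example.

```python
from collections import Counter


def majority_vote(tables):
    """Merge several readings of the same table, cell by cell.

    Each reading is a dict mapping a label to the value read off the chart.
    A value is kept only if more than half of all readings agree on it;
    otherwise the cell is set to None so it can be checked by hand.
    """
    merged = {}
    all_labels = set().union(*(t.keys() for t in tables))
    for label in sorted(all_labels):
        values = [t[label] for t in tables if label in t]
        winner, count = Counter(values).most_common(1)[0]
        merged[label] = winner if count > len(tables) / 2 else None
    return merged


# Three hypothetical readings of one chart column (illustrative values only).
readings = [
    {"China": 51, "Brazil": 30, "Japan": 34},
    {"China": 51, "Brazil": 30, "Japan": 43},  # one misread digit pair
    {"China": 57, "Brazil": 30, "Japan": 34},  # a different misread
]

merged = majority_vote(readings)
print(merged)
```

Only once the table has been settled like this would the second step, the actual analysis (for example sorting to find the bottom three), be run on `merged`.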

00:12:22.340 | Another difference I found between the models was when it came to faces. I asked what was the

00:12:27.400 | fate of this character, Saruman. GPT-4 successfully said the character is Saruman and gave the fate

00:12:33.700 | of that character. Bard flat out refused saying sorry I can't help with images of people yet.

00:12:38.960 | While LLaVA was kind of helpful, saying the character in the image is Gandalf. What about some of those

00:12:46.220 | table questions like I was giving GPT-4 earlier? Well, Bard kind of flopped saying that the answer

00:12:52.320 | was the US but at least got the percentage correct at 51%. LLaVA did less well, saying that the answer

00:12:59.740 | was Brazil. Now maybe this demo doesn't reflect the full capabilities of LLaVA, because I read the

00:13:06.040 | paper that came with the announcement of LLaVA-1.5. Apparently it got 80% in Visual Question

00:13:11.760 | Answering version 2. That's less than one of Google's models which I've talked about before,

00:13:15.640 | PaLI 17 billion, but apparently better than GPT-4. So you guys can let me know if I'm missing

00:13:21.360 | anything about this. I'll be happy to answer any questions you have.

00:13:22.300 | Just before I move on though, I can't help but say that I was really impressed by GPT-4's

00:13:28.960 | analysis of this image. I asked what is poignant and unexpected about this image and it picked up

00:13:34.940 | on the contrast between the devastating event that's unfolding and the seemingly calm demeanor

00:13:40.340 | of the observers. It picked up on almost every detail of the image and it was a fantastic answer.

00:13:46.300 | I saw that yesterday someone had the idea of putting the Mona Lisa into GPT Vision, asking it

00:13:52.280 | to describe the image. It then got DALL-E 3 to generate an image based on that description

00:13:57.760 | and put it into a recursive loop. And this was the result of that recursive loop. And with the

00:14:03.340 | explosion in synthetic data and compute, I predict the world will get equally crazy quite soon.

00:14:09.500 | Thank you as ever for watching to the end and have a wonderful day.