back to index

Not Slowing Down: GAIA-1 to GPT Vision Tips, Nvidia B100 to Bard vs LLaVA


Whisper Transcript | Transcript Only Page

00:00:00.000 | For AI progress to slow down it would need to run out of data, compute and algorithmic efficiency.
00:00:06.960 | But developments this week suggest that the field isn't running out of any of these things,
00:00:11.220 | let alone all of them. I'm going to give you a glimpse of what this means in robotics,
00:00:15.720 | audio and vision and end with some practical tips to help you use GPT vision as well as
00:00:22.080 | comparing it to BARD and LAVA. But let's start with Gaia-1 from Wave which is generating the
00:00:28.420 | synthetic video that you can see now. And no I'm not just bringing it up because it looks cool,
00:00:33.380 | the CEO this week said I believe synthetic training data is the future for AI because it's safer,
00:00:40.120 | cheaper and infinitely scalable. That's my point, when synthetic data gets this good we're not going
00:00:46.040 | to run out of data. Many of you may not know that GPT-4 itself was trained on some synthetic data
00:00:51.960 | and if you're interested do check out my videos on Orca and Phi-1 to see how much synthetic data
00:00:58.020 | can actually be used to train your AI. And if you're interested do check out my videos on Orca
00:00:58.400 | and Phi-1 to see how much synthetic data can actually help smaller language models. And the
00:01:00.840 | synthetic video data you just saw came from a scrappy outsider training on fewer than 100
00:01:06.900 | Nvidia A100s. Now imagine the kind of synthetic data that Tesla could come up with, with the
00:01:12.500 | equivalent of 300,000 A100s. And of course Tesla already has billions of hours of real world data
00:01:20.300 | that compares to the 4,700 hours that Gaia-1 was trained on. Now many of you might say that yes
00:01:26.860 | it's crazy that things are improving but it's not. And that's because there's a lot of data that's
00:01:28.380 | being used to improve this quickly with synthetic video data. And yes it's cool that a model like
00:01:32.820 | this can generate unlimited data including adversarial examples. What does that mean by
00:01:37.760 | the way in this context? Well for example people walking across the road jaywalking in the fog.
00:01:43.400 | Even Tesla with its billions of hours of real world data probably only saw that scenario a
00:01:49.060 | limited number of times. But impressive as it is isn't this just for autonomous driving? No not
00:01:54.160 | even close this is also for real world robotics. Just two days ago Tesla launched a new model
00:01:58.360 | called the Unisim. And it's a very interesting model. It's a very interesting model because it
00:02:01.660 | is a very simple model. It's a very simple model that can be used to simulate a lot of things.
00:02:04.900 | It's a very simple model that can be used to simulate a lot of things. It's a very simple model
00:02:07.240 | that can be used to simulate a lot of things. But it can simulate a range of things like
00:02:09.900 | unveiling toothpaste, picking up the toothpaste in multiple steps. Now you probably don't need me
00:02:16.100 | to tell you why unlimited training data for robotics might be useful. I'll let you watch
00:02:21.860 | this imaginary demo of a robot closing the bottom drawer and opening it.
00:02:28.340 | And opening the top drawer.
00:02:58.320 | And opening the top drawer.
00:03:28.300 | And opening the top drawer.
00:03:58.280 | And opening the top drawer.
00:04:28.260 | And opening the top drawer.
00:04:58.240 | And opening the top drawer.
00:05:28.220 | And opening the top drawer.
00:05:58.200 | And opening the top drawer.
00:06:28.180 | And opening the top drawer.
00:06:58.160 | And opening the top drawer.
00:07:28.140 | And opening the top drawer.
00:07:58.120 | And opening the top drawer.
00:08:28.100 | And opening the top drawer.
00:08:58.080 | And opening the top drawer.
00:09:28.060 | And opening the top drawer.
00:09:58.040 | And opening the top drawer.
00:10:28.020 | at 45%. Final failure, I asked it to list the bottom three countries in terms of the percentage
00:10:34.960 | visiting the science slash technology museum and this time it skipped over both Japan and South
00:10:40.440 | Korea to list the EU as the one with the third lowest percentage. So what is my tip and let me
00:10:47.180 | know in the comments if any of you find this helpful. Well, drawing a bit on fuse shotting
00:10:52.400 | and self-consistency, I gave it three different angles of the same chart. But even more crucially
00:10:58.260 | than that perhaps, I asked it recreate the data from the tables. I then said check for any
00:11:04.120 | dissimilarities and resolve them by majority vote. The reason I did this is that I noticed that
00:11:09.400 | sometimes it could output a correct table and still get the analysis wrong even though it's
00:11:14.520 | simple mathematics. So what this was doing was splitting the task up into two. First it was
00:11:19.600 | reducing the chance of minor errors by giving it a higher score than the other two. And then
00:11:22.380 | it was getting it to do the analysis only after it had already recreated the tables. And look at
00:11:30.840 | the difference. This time when I asked it about the bottom three countries, it got it right. And
00:11:35.840 | then I asked it again, what was it, about the zoo slash aquarium. That was the one it got wrong
00:11:41.180 | twice before as you saw. This time it correctly picked out China at 51%. If you're wondering by
00:11:47.200 | the way how I got different versions of the same image, it was by pressing windows shift and
00:11:52.360 | S and then just highlighting like this. Anyway, I think that's a cool tip. Try it out. Let me know
00:11:58.640 | in the comments if it's at all helpful. But finally, let's compare Lava and Bard to GPT
00:12:04.680 | Vision. On text, Lava didn't do as well. It wasn't able to notice that this coffee cup missed out the
00:12:11.620 | B in sip by sip. Bard not only noticed but even came up with an amazing metric to find the
00:12:18.780 | distance between the two texts, the prompt and what came out.
00:12:22.340 | Another difference I found between the models was when it came to faces. I asked what was the
00:12:27.400 | fate of this character, Saruman. GPT-4 successfully said the character is Saruman and gave the fate
00:12:33.700 | of that character. Bard flat out refused saying sorry I can't help with images of people yet.
00:12:38.960 | While Lava was kind of helpful saying the character in the image is Gandalf. What about some of those
00:12:46.220 | table questions like I was giving GPT-4 earlier? Well, Bard kind of flopped saying that the answer
00:12:52.320 | was the US but at least got the percentage correct at 51%. Lava did less well saying that the answer
00:12:59.740 | was Brazil. Now maybe this demo doesn't reflect the full capabilities of Lava because I read the
00:13:06.040 | paper that came with the announcement of Lava 1.5. Apparently it got 80% in visual question
00:13:11.760 | answering version 2. That's less than one of Google's models which I've talked about before,
00:13:15.640 | Parley 17 billion, but apparently better than GPT-4. So you guys can let me know if I'm missing
00:13:21.360 | anything about this. I'll be happy to answer any questions you have.
00:13:22.300 | Just before I move on though, I can't help but say that I was really impressed by GPT-4's
00:13:28.960 | analysis of this image. I asked what is poignant and unexpected about this image and it picked up
00:13:34.940 | on the contrast between the devastating event that's unfolding and the seemingly calm demeanor
00:13:40.340 | of the observers. It picked up on almost every detail of the image and it was a fantastic answer.
00:13:46.300 | I saw that yesterday someone had the idea of putting the Mona Lisa into GPT vision asking it
00:13:52.280 | to describe the image. It then got DALI 3 to generate an image based on that description
00:13:57.760 | and put it into a recursive loop. And this was the result of that recursive loop. And with the
00:14:03.340 | explosion in synthetic data and compute, I predict the world will get equally crazy quite soon.
00:14:09.500 | Thank you as ever for watching to the end and have a wonderful day.