Tesla AI Day Highlights

00:00:00.000 | Tesla AI Day presented the most amazing real-world AI

00:00:03.640 | and engineering effort I have ever seen in my life.

00:00:07.760 | I wrote this and I meant it.

00:00:10.200 | Why was it amazing to me?

00:00:12.060 | No, not primarily because of the Tesla bot.

00:00:15.160 | It was amazing because I believe the autonomous driving task

00:00:18.540 | and the general real-world robotics perception

00:00:20.800 | and planning task is a lot harder

00:00:23.040 | than people generally think.

00:00:24.760 | And I also believed the scale of effort

00:00:27.680 | in algorithms, data, annotation, simulation,

00:00:30.480 | inference compute and training compute required

00:00:33.160 | to solve these problems is something no one

00:00:36.400 | would be able to do in the near term.

00:00:38.520 | Yesterday was the first time I saw in one place

00:00:42.160 | just the kind and the scale of effort

00:00:44.820 | that has a chance to solve this,

00:00:47.180 | the autonomous driving problem

00:00:48.840 | and the general real-world robotics perception

00:00:51.440 | and planning problem.

00:00:52.760 | This includes the neural network architecture and pipeline,

00:00:56.340 | the autopilot compute hardware in the car,

00:00:58.840 | Dojo compute hardware for training,

00:01:01.080 | the data and the annotation,

00:01:03.040 | the simulation for rare edge cases,

00:01:05.400 | and yes, the generalized application of all of the above

00:01:09.860 | beyond the car robot to the humanoid form.

00:01:13.920 | Let's go through the big innovations.

00:01:15.920 | The neural network.

00:01:18.440 | Each of these is a difficult

00:01:20.360 | and I would say brilliant design idea

00:01:22.840 | that is either a step or a leap forward

00:01:25.240 | from the state of the art in machine learning.

00:01:27.580 | First is to predict in vector space, not in image space.

00:01:31.520 | This alone is a big leap beyond

00:01:33.740 | what is usually done in computer vision

00:01:35.720 | that usually operates in the image space,

00:01:38.000 | in the two-dimensional image.

00:01:40.220 | The thing about reality is that it happens out there

00:01:43.420 | in the three-dimensional world

00:01:44.980 | and it doesn't make sense to be doing

00:01:46.460 | all the machine learning on the 2D projections

00:01:49.020 | onto images.

00:01:50.260 | Like many good ideas, this is an obvious one,

00:01:53.020 | but a very difficult one.
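
To make the image-space point concrete, here is a minimal sketch assuming an idealized pinhole camera. The focal length, image center, and the two points are made-up illustration values, not anything from the presentation. It shows two different 3D points landing on the same pixel, which is the information a 2D projection throws away.

```python
def project(point_xyz, focal_px=1000.0, cx=640.0, cy=360.0):
    """Project a 3D camera-frame point (x right, y down, z forward, in metres)
    to pixel coordinates with an ideal pinhole model (hypothetical intrinsics)."""
    x, y, z = point_xyz
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal_px * x / z + cx, focal_px * y / z + cy)

# Two different points in the 3D world: one close, one four times farther away.
near_point = (1.0, 0.5, 10.0)
far_point = (4.0, 2.0, 40.0)

pixel_near = project(near_point)
pixel_far = project(far_point)

# Both land on the same pixel, so the image alone cannot tell them apart.
same_pixel = pixel_near == pixel_far
```

Predicting directly in a 3D (vector) space means the network has to resolve this ambiguity itself instead of leaving it to downstream code.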

00:01:55.060 | Second is the fusion of camera sensor data

00:01:57.900 | before the detections.

00:01:59.660 | The detections performed by the different heads

00:02:02.420 | of the multitask neural network.

00:02:04.700 | For now, the fusion is at the multiscale feature level.

00:02:08.460 | Again, in retrospect, an obvious

00:02:11.140 | but a very difficult engineering step

00:02:13.120 | of doing the detection and the machine learning

00:02:16.300 | on all of the sensors combined

00:02:19.340 | as opposed to doing them individually

00:02:21.020 | and combining only the decisions.
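
A toy sketch of the difference between combining decisions and combining features. The threshold and the per-camera evidence scores are hypothetical numbers chosen for illustration; a real network fuses learned feature maps, not single scores.

```python
THRESHOLD = 0.5  # hypothetical detection threshold

def detect(score):
    """Declare a detection when the evidence reaches the threshold."""
    return score >= THRESHOLD

# Hypothetical evidence for one object that straddles two camera views,
# so each camera only sees part of it.
camera_evidence = {"left": 0.4, "right": 0.4}

# Late fusion: each camera decides on its own, then the decisions are combined.
late_fusion = any(detect(score) for score in camera_evidence.values())

# Feature-level fusion: combine the evidence first, then decide once.
fused_score = sum(camera_evidence.values())
early_fusion = detect(fused_score)
```

In this toy case neither camera is confident alone, so late fusion misses the object, while fusing before detection finds it.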

00:02:23.740 | Third is using video context to model

00:02:25.860 | not just vector space, but time.

00:02:28.120 | At each frame, concatenating positional encodings,

00:02:31.780 | multicam features, and ego kinematics.

00:02:34.540 | Using a pretty cool spatial

00:02:36.580 | recurrent neural network architecture

00:02:39.020 | that forms a 2D grid around the car

00:02:41.180 | where each cell of the grid is an RNN,

00:02:43.940 | recurrent neural network.

00:02:45.400 | The other cool aspect of this

00:02:47.020 | is that you can then build a map

00:02:49.120 | in the space of RNN features.

00:02:51.740 | And then perhaps do planning in that space,

00:02:54.540 | which is a fascinating concept.
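
The grid-of-RNNs idea can be sketched roughly as follows. This is a toy under stated assumptions: a 3x3 grid, a scalar hidden state per cell, fixed weights, and the rule that only currently visible cells are updated. None of these details come from the talk as summarized here.

```python
import math

GRID = 3  # hypothetical 3x3 grid of cells around the car

def step(hidden, observation, visible, w_h=0.5, w_x=1.0):
    """One time step: every visible cell runs a tiny recurrent update,
    every occluded cell keeps its previous hidden state (its memory)."""
    new_hidden = [row[:] for row in hidden]
    for r in range(GRID):
        for c in range(GRID):
            if visible[r][c]:
                new_hidden[r][c] = math.tanh(
                    w_h * hidden[r][c] + w_x * observation[r][c]
                )
    return new_hidden

def grid(value):
    """Build a GRID x GRID grid filled with one value."""
    return [[value] * GRID for _ in range(GRID)]

hidden = grid(0.0)

# Frame 1: the car sees a feature of strength 1.0 in one cell.
obs, vis = grid(0.0), grid(False)
obs[0][1], vis[0][1] = 1.0, True
hidden = step(hidden, obs, vis)
after_seen = hidden[0][1]

# Frame 2: that cell is occluded, so its state is carried forward unchanged.
hidden = step(hidden, grid(0.0), grid(False))
after_occluded = hidden[0][1]
```

The grid of hidden states at the end is the "map in the space of RNN features": it remembers what was seen in a cell even when that cell is no longer visible.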

00:02:56.660 | Andrej Karpathy, I think, also mentioned

00:02:59.020 | some future improvements,

00:03:00.580 | performing the fusion earlier and earlier

00:03:03.260 | in the neural network.

00:03:04.500 | So currently the fusion of space and time

00:03:06.500 | is late in the network.

00:03:08.140 | Moving the fusion earlier on

00:03:10.540 | takes us further toward full end-to-end driving

00:03:15.380 | with multiple modalities.

00:03:16.820 | Seamlessly fusing, integrating

00:03:19.460 | the multiple sources of sensory data.

00:03:21.740 | Finally, the place where there's currently,

00:03:24.060 | from my understanding, the least amount

00:03:25.940 | of utilization of neural networks is planning.

00:03:29.780 | So obviously optimal planning in action space

00:03:33.100 | is intractable, so you have to come up

00:03:35.020 | with a bunch of heuristics.

00:03:36.620 | You can do those manually,

00:03:38.900 | or you could do those through learning.

00:03:40.620 | So the idea that was presented

00:03:41.900 | is to use neural networks as heuristics.

00:03:44.660 | In a similar way that neural networks

00:03:46.580 | were used as heuristics

00:03:48.220 | in the Monte Carlo tree search

00:03:49.780 | for MuZero and AlphaZero

00:03:51.620 | to play different games,

00:03:53.220 | to play Go, to play chess.

00:03:54.980 | This allows you to significantly prune

00:03:56.660 | the search through action space

00:03:59.420 | for a plan that doesn't get stuck

00:04:01.140 | in the local optima and gets pretty close

00:04:02.980 | to the global optima.
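
A small sketch of how a heuristic prunes search. To be clear about the substitution: this uses A* on an empty toy grid rather than Monte Carlo tree search, and a hand-written Manhattan distance stands in for a learned neural network heuristic. It is not the planner that was presented, only an illustration of the pruning effect.

```python
import heapq

def search(start, goal, size, heuristic):
    """A* on an empty size x size grid. Returns (path_cost, nodes_expanded)."""
    frontier = [(heuristic(start, goal), 0, start)]
    best_cost = {start: 0}
    expanded = 0
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if cost > best_cost.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        expanded += 1
        if node == goal:
            return cost, expanded
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < size and 0 <= nxt[1] < size:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    priority = new_cost + heuristic(nxt, goal)
                    heapq.heappush(frontier, (priority, new_cost, nxt))
    return None, expanded

def no_heuristic(node, goal):
    """Blind search: every node looks equally promising."""
    return 0

def learned_heuristic_stand_in(node, goal):
    """Stand-in for a network's cost-to-go estimate (here: Manhattan distance)."""
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

blind_cost, blind_expanded = search((0, 0), (9, 0), 10, no_heuristic)
guided_cost, guided_expanded = search((0, 0), (9, 0), 10, learned_heuristic_stand_in)
```

Both searches find a plan of the same cost, but the guided one expands far fewer nodes, which is the sense in which a good heuristic prunes the search through action space.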

00:04:04.340 | I really appreciated that the presentation

00:04:06.700 | didn't dumb anything down,

00:04:09.140 | but maybe in all the technical details,

00:04:10.860 | it was easy to miss just how much

00:04:12.540 | brilliant innovation there was here.

00:04:14.980 | The move to predicting in vector space

00:04:17.420 | is truly brilliant.

00:04:18.580 | Of course, you can only do that

00:04:19.740 | if you have the data

00:04:20.700 | and you have the annotation for it.

00:04:22.340 | But just to take that step

00:04:24.660 | is already taking a step outside the box

00:04:27.020 | of the way things are currently done

00:04:28.340 | in computer vision.

00:04:29.660 | Then fusing seamlessly across

00:04:33.340 | many camera sensors,

00:04:35.620 | incorporating time into the whole thing

00:04:37.300 | in a way that's differentiable

00:04:38.660 | with these spatial RNNs.

00:04:40.740 | And then of course, using that beautiful

00:04:42.500 | mess of features,

00:04:44.220 | both on the individual image side

00:04:47.060 | and the RNN side to make plans

00:04:50.420 | using a neural network architecture

00:04:51.940 | as a heuristic.

00:04:53.380 | I mean, all of that is just brilliant.

00:04:55.700 | The other critical part of making all of this work

00:04:57.820 | is the data and the data annotation.

00:04:59.900 | First is the manual labeling.

00:05:01.300 | So to make the neural networks

00:05:03.100 | that predict in vector space work,

00:05:04.860 | you have to label in vector space.

00:05:06.540 | So you have to create in-house tools.

00:05:08.020 | And as it turns out,

00:05:09.420 | Tesla hired an in-house team of annotators

00:05:12.100 | to use those tools to then perform

00:05:14.700 | the labeling of vector space

00:05:16.100 | and then project it out into the image space.

00:05:18.460 | First of all, that saves a lot of work.

00:05:20.060 | And second of all,

00:05:21.020 | that means you're directly performing

00:05:23.340 | the annotation in the space

00:05:24.820 | in which you're doing the prediction.

00:05:26.380 | Obviously, as was always the case,

00:05:28.140 | as is the case with self-supervised learning,

00:05:30.300 | auto labeling is the key to this whole thing.

00:05:33.140 | One of the interesting things that was presented

00:05:34.980 | is the use of clips of data

00:05:37.100 | that includes video, IMU, GPS, odometry, and so on

00:05:40.500 | from multiple vehicles at the same location in time

00:05:43.660 | to generate labels of both the static world

00:05:46.500 | and the moving objects and their kinematics.

00:05:49.980 | That's really cool.

00:05:50.820 | You have these little clips,

00:05:52.940 | these buckets of data from different vehicles,

00:05:55.860 | and they're kind of annotating each other.

00:05:57.900 | You're registering them together

00:05:59.380 | to then combine a solid annotation

00:06:02.660 | of that particular part of road at that particular time.
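
A rough sketch of clips "annotating each other". The assumptions are heavy and worth stating: each clip is reduced to a single vehicle position plus one relative observation of the same stop sign, all numbers are invented, and registration is a pure translation with no rotation or timing error. Real registration is far more involved.

```python
from statistics import mean

# Hypothetical clips from three vehicles. "vehicle_xy" is the vehicle position
# in a shared map frame (metres, e.g. from GPS and odometry); "sign_offset" is
# where that vehicle saw the same stop sign relative to itself.
clips = [
    {"vehicle_xy": (10.0, 0.0), "sign_offset": (5.2, 2.0)},
    {"vehicle_xy": (12.0, 1.0), "sign_offset": (2.9, 1.1)},
    {"vehicle_xy": (8.0, -1.0), "sign_offset": (7.0, 2.9)},
]

def to_map_frame(clip):
    """Register one observation into the shared frame (translation only)."""
    vx, vy = clip["vehicle_xy"]
    ox, oy = clip["sign_offset"]
    return (vx + ox, vy + oy)

observations = [to_map_frame(clip) for clip in clips]

# The combined label averages the registered observations, so noise in any
# single clip matters less as more vehicles contribute.
label = (
    mean(point[0] for point in observations),
    mean(point[1] for point in observations),
)
```

Each extra vehicle adds another observation to the average, which is one simple way to see why this kind of labeling gets stronger as the fleet grows.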

00:06:06.300 | That's amazing because the more the fleet grows,

00:06:08.500 | the stronger that kind of auto labeling becomes.

00:06:12.300 | And the more edge cases you're able to catch that way.

00:06:14.700 | Speaking of edge cases,

00:06:15.820 | that's what Tesla is using simulation for,

00:06:19.040 | is to simulate rare edge cases

00:06:20.580 | that are not going to appear often in the data,

00:06:22.460 | even when that data set grows incredibly large.

00:06:25.780 | And also, they're using it for annotation

00:06:27.880 | of ultra complex scenes where accurate labeling

00:06:30.900 | of real world data is basically impossible,

00:06:32.940 | like a scene with like a hundred pedestrians,

00:06:35.860 | which I think is the example they used.

00:06:38.180 | So I honestly think the innovations

00:06:39.600 | on the neural network architecture

00:06:41.020 | and the data annotation is really just a big leap.

00:06:44.340 | Then there's the continued innovation

00:06:46.220 | on the autopilot computer side,

00:06:48.100 | the neural network compiler that optimizes latency,

00:06:51.180 | and so on.

00:06:52.380 | There's, I think I remember really nice testing

00:06:56.460 | and debugging tools for like variants

00:06:59.980 | of candidate trained neural networks

00:07:02.180 | to be deployed in the future,

00:07:03.380 | where you can compare different neural networks together.

00:07:05.760 | That's almost like developer tools

00:07:07.860 | for to-be-deployed neural networks.

00:07:11.140 | And it was mentioned that almost 10,000 GPUs

00:07:13.900 | are currently being used to continually retrain the network.

00:07:18.180 | I forget what the number was,

00:07:19.420 | but I think every week or every two weeks,

00:07:21.860 | the network is fully retrained end to end.

00:07:25.180 | The other really big innovation,

00:07:26.900 | but unlike the neural network and the data annotation,

00:07:30.020 | this is in the future, so to be deployed still,

00:07:32.660 | it's still under development,

00:07:34.020 | is the Dojo computer, which is used for training.

00:07:37.940 | So the autopilot computer is the computer on the car

00:07:40.820 | that is doing the inference,

00:07:41.940 | and Dojo computer is the thing

00:07:43.740 | that you would have in a data center

00:07:45.260 | that performs the training of the neural network.

00:07:47.900 | There's a, what they're calling a single training tile

00:07:51.300 | that is nine petaflops.

00:07:53.820 | It's made up of D1 chips that are built in house by Tesla.

00:07:57.220 | Each chip with super fast IO,

00:07:59.440 | each tile also with super fast IO.

00:08:02.780 | So you can basically connect

00:08:04.060 | an arbitrary number of these together,

00:08:06.220 | each with a power supply and cooling.

00:08:08.620 | And then I think they connected like a million nodes

00:08:12.860 | to have a compute center.

00:08:14.740 | I forget what the name is, but it's 1.1 exaflop.
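
A quick back-of-the-envelope check of how these figures fit together, assuming the per-tile figure is nine petaflops and the cluster figure is 1.1 exaflops. It ignores any interconnect or utilization overhead.

```python
TILE_FLOPS = 9e15       # assumed: nine petaflops per training tile
CLUSTER_FLOPS = 1.1e18  # 1.1 exaflops quoted for the full compute center

# Number of tiles needed to reach the quoted cluster figure.
tiles_needed = CLUSTER_FLOPS / TILE_FLOPS
```

That works out to a little over 120 tiles, so the two quoted numbers are consistent with a cluster built from on the order of 120 training tiles.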

00:08:17.860 | So combined with the fact

00:08:19.620 | that this can arbitrarily scale,

00:08:22.140 | I think this is basically contending

00:08:24.300 | to be the world's most powerful

00:08:26.340 | neural network training computer.

00:08:28.180 | Again, the entire picture that was presented

00:08:30.820 | on AI Day is amazing because the, what would you call it?

00:08:35.420 | The Tesla AI machine can improve arbitrarily

00:08:38.940 | through the iterative data engine process

00:08:40.940 | of auto labeling plus manual labeling of edge cases.

00:08:44.440 | So like that labeling stage,

00:08:46.340 | plus the data collection, retraining, deploying.

00:08:49.540 | And then again, you go back to the data collection,

00:08:52.540 | the labeling, retraining, and deploying.

00:08:56.000 | And you can go through this loop as many times as you want

00:08:59.980 | to arbitrarily improve the performance of the network.
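
The loop described above can be sketched as a toy simulation. Everything here is hypothetical: the scenario names, the labeling budget of one scenario type per cycle, and the stand-in for retraining (adding a scenario to a set of handled cases). It only illustrates the shape of the loop, not how a network is actually retrained.

```python
# Hypothetical stream of scenarios the fleet encounters in each deployment cycle.
fleet_scenarios = ["highway", "stop_sign", "cut_in", "construction", "highway", "cut_in"]

def run_data_engine(known, scenarios, cycles):
    """Run the collect -> label -> retrain -> deploy loop for a number of cycles.
    Returns how many failures were collected in each cycle."""
    known = set(known)
    failures_per_cycle = []
    for _ in range(cycles):
        # Data collection: cases the deployed model does not handle yet.
        failures = [s for s in scenarios if s not in known]
        failures_per_cycle.append(len(failures))
        # Labeling: a limited budget, one scenario type per cycle.
        newly_labeled = set(failures[:1])
        # Retraining and redeploying: the model now handles the labeled cases.
        known |= newly_labeled
    return failures_per_cycle

history = run_data_engine({"highway"}, fleet_scenarios, cycles=4)
```

Here the failure count falls from 4 to 0 over four cycles. The toy reaches zero only because its scenario list is finite; the real world keeps producing new edge cases, which is why the loop keeps running.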

00:09:02.900 | I still think nobody knows how difficult

00:09:05.260 | the autonomous driving problem is,

00:09:08.300 | but I also think this loop does not have a ceiling.

00:09:11.780 | I still think there's a big place for driver sensing.

00:09:14.560 | I still think you have to solve

00:09:15.880 | the human robot interaction problem

00:09:17.800 | to make the experience more pleasant,

00:09:19.860 | but damn it, this loop of manual and auto labeling

00:09:24.060 | that leads to retraining, that leads to deployment,

00:09:25.860 | it goes back to the data collection

00:09:28.020 | and the auto labeling and the manual labeling

00:09:29.860 | is incredible.

00:09:31.420 | Second reason this whole effort is amazing

00:09:33.860 | is that Dojo can essentially become

00:09:36.140 | an AI training as a service,

00:09:38.900 | directly taking on AWS and Google Cloud.

00:09:41.860 | So there's no reason it needs to be utilized

00:09:44.300 | specifically for the autopilot computer.

00:09:47.060 | The simplicity of the way they described

00:09:48.720 | the deployment of PyTorch across these nodes,

00:09:50.780 | you can basically use it for any kind

00:09:52.620 | of machine learning problem,

00:09:53.940 | especially one that requires scale.

00:09:55.940 | Finally, the third reason all of this was amazing

00:09:58.360 | is that the neural network architecture

00:10:00.020 | and data engine pipeline is applicable

00:10:02.220 | to much more than just roads and driving.

00:10:05.020 | It can be used in the home, in the factory,

00:10:07.620 | and by robots of basically any form,

00:10:09.740 | as long as it has cameras and actuators,

00:10:11.860 | including, yes, the humanoid form.

00:10:15.040 | As someone who loves robotics,

00:10:17.180 | the presentation of a humanoid Tesla bot

00:10:19.980 | was truly exciting.

00:10:21.980 | Of course, for me personally,

00:10:23.220 | the lifelong dream has been to build the mind,

00:10:26.900 | the robot that becomes a friend and a companion to humans,

00:10:30.500 | not just a servant that performs boring and dangerous tasks.

00:10:35.500 | But to me, these two problems should,

00:10:38.120 | and I think will be solved in parallel.

00:10:41.100 | The Tesla bot, if successful,

00:10:43.200 | just might solve the latter problem

00:10:45.040 | of perception, movement, and object manipulation.

00:10:48.460 | And I hope to play a small part

00:10:51.200 | in solving the former problem of human-robot interaction,

00:10:55.400 | and yes, friendship.

00:10:57.440 | I'm not going to mention love when talking about robots.

00:11:01.320 | Either way, all of this, to me,

00:11:03.480 | paints a picture of an exciting future.

00:11:06.180 | Thanks for watching.

00:11:07.360 | Hope to see you next time.

00:11:09.040 | (upbeat music)

Tesla AI Day Highlights | Lex Fridman

Chapters