
Tesla AI Day Highlights | Lex Fridman


Chapters

0:00 Overview
1:16 Neural network architecture
4:55 Data and annotation
6:44 Autopilot & DOJO
8:28 Summary: 3 key ideas
9:55 Tesla Bot

Whisper Transcript

00:00:00.000 | Tesla AI Day presented the most amazing real-world AI
00:00:03.640 | and engineering effort I have ever seen in my life.
00:00:07.760 | I wrote this and I meant it.
00:00:10.200 | Why was it amazing to me?
00:00:12.060 | No, not primarily because of the Tesla bot.
00:00:15.160 | It was amazing because I believe the autonomous driving task
00:00:18.540 | and the general real-world robotics perception
00:00:20.800 | and planning task is a lot harder
00:00:23.040 | than people generally think.
00:00:24.760 | And I also believed the scale of effort
00:00:27.680 | in algorithms, data, annotation, simulation,
00:00:30.480 | inference compute, and training compute required
00:00:33.160 | to solve these problems is something no one
00:00:36.400 | would be able to do in the near term.
00:00:38.520 | Yesterday was the first time I saw in one place
00:00:42.160 | just the kind and the scale of effort
00:00:44.820 | that has a chance to solve this,
00:00:47.180 | the autonomous driving problem
00:00:48.840 | and the general real-world robotics perception
00:00:51.440 | and planning problem.
00:00:52.760 | This includes the neural network architecture and pipeline,
00:00:56.340 | the autopilot compute hardware in the car,
00:00:58.840 | dojo compute hardware for training,
00:01:01.080 | the data and the annotation,
00:01:03.040 | the simulation for rare edge cases,
00:01:05.400 | and yes, the generalized application of all of the above
00:01:09.860 | beyond the car robot to the humanoid form.
00:01:13.920 | Let's go through the big innovations.
00:01:15.920 | The neural network.
00:01:18.440 | Each of these is a difficult
00:01:20.360 | and I would say brilliant design idea
00:01:22.840 | that is either a step or a leap forward
00:01:25.240 | from the state of the art in machine learning.
00:01:27.580 | First is to predict in vector space, not in image space.
00:01:31.520 | This alone is a big leap beyond
00:01:33.740 | what is usually done in computer vision
00:01:35.720 | that usually operates in the image space,
00:01:38.000 | in the two-dimensional image.
00:01:40.220 | The thing about reality is that it happens out there
00:01:43.420 | in the three-dimensional world
00:01:44.980 | and it doesn't make sense to be doing
00:01:46.460 | all the machine learning on the 2D projections
00:01:49.020 | onto images.
00:01:50.260 | Like many good ideas, this is an obvious one,
00:01:53.020 | but a very difficult one.
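
To make the contrast concrete, here is a minimal sketch, not Tesla's actual architecture, of an image-space head versus a head that decodes camera features directly into a bird's-eye-view vector-space grid; every module name, shape, and the transformer-style camera-to-BEV projection are assumptions.

```python
# Illustrative sketch only -- not Tesla's code. Contrasts a 2D image-space
# head with a head that predicts directly into a bird's-eye-view (BEV)
# "vector space" grid around the vehicle. All shapes/names are assumptions.
import torch
import torch.nn as nn

class ImageSpaceHead(nn.Module):
    """Conventional approach: predict per-pixel outputs in the 2D image."""
    def __init__(self, in_ch=256, num_classes=10):
        super().__init__()
        self.head = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, image_feats):           # (B, C, H_img, W_img)
        return self.head(image_feats)         # logits live in image space

class VectorSpaceHead(nn.Module):
    """Decode camera features into a top-down grid around the car."""
    def __init__(self, in_ch=256, bev_ch=128, bev_size=64):
        super().__init__()
        self.bev_size = bev_size
        # A learned query grid attends over flattened camera features
        # (a transformer-style camera-to-BEV projection, heavily simplified).
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, bev_ch))
        self.attn = nn.MultiheadAttention(bev_ch, num_heads=8, batch_first=True)
        self.proj = nn.Linear(in_ch, bev_ch)

    def forward(self, image_feats):           # (B, C, H, W) fused camera feats
        B, C, H, W = image_feats.shape
        kv = self.proj(image_feats.flatten(2).transpose(1, 2))  # (B, H*W, bev_ch)
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev, _ = self.attn(q, kv, kv)                           # (B, S*S, bev_ch)
        return bev.view(B, self.bev_size, self.bev_size, -1)    # BEV grid
```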
00:01:55.060 | Second is the fusion of camera sensor data
00:01:57.900 | before the detections, where the detections
00:01:59.660 | are performed by the different heads
00:02:02.420 | of the multitask neural network.
00:02:04.700 | For now, the fusion is at the multiscale feature level.
00:02:08.460 | Again, in retrospect, an obvious
00:02:11.140 | but a very difficult engineering step
00:02:13.120 | of doing the detection and the machine learning
00:02:16.300 | on all of the sensors combined
00:02:19.340 | as opposed to doing them individually
00:02:21.020 | and combining only the decisions.
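
A hedged sketch of the distinction: instead of running a detector per camera and merging the resulting boxes, per-camera features are concatenated and a single multitask head operates on the fused tensor. The shapes and the simple concat-plus-conv fusion operator here are stand-ins, not the presented design.

```python
# Illustrative sketch -- feature-level fusion across cameras, as opposed to
# running detection per camera and merging the resulting boxes. All names,
# shapes, and the fusion operator (simple concat + conv) are assumptions.
import torch
import torch.nn as nn

class MultiCamFusion(nn.Module):
    def __init__(self, num_cams=8, feat_ch=256):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_ch, 3, stride=2, padding=1)  # stand-in
        self.fuse = nn.Conv2d(num_cams * feat_ch, feat_ch, kernel_size=1)
        self.det_head = nn.Conv2d(feat_ch, 10, kernel_size=1)  # shared multitask head

    def forward(self, cams):                  # (B, num_cams, 3, H, W)
        B, N, C, H, W = cams.shape
        feats = self.backbone(cams.view(B * N, C, H, W))     # per-camera features
        feats = feats.view(B, -1, *feats.shape[-2:])         # stack cameras channel-wise
        fused = self.fuse(feats)              # fusion happens BEFORE any detection
        return self.det_head(fused)           # one head sees all cameras at once
```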
00:02:23.740 | Third is using video context to model
00:02:25.860 | not just vector space, but time.
00:02:28.120 | At each frame, positional encodings,
00:02:31.780 | multicam features, and ego kinematics
00:02:34.540 | are concatenated and fed into a pretty cool
00:02:36.580 | spatial recurrent neural network architecture
00:02:39.020 | that forms a 2D grid around the car,
00:02:41.180 | where each cell of the grid is an RNN,
00:02:43.940 | a recurrent neural network.
00:02:45.400 | The other cool aspect of this
00:02:47.020 | is that you can then build a map
00:02:49.120 | in the space of RNN features.
00:02:51.740 | And then perhaps do planning in that space,
00:02:54.540 | which is a fascinating concept.
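
Here is a toy version of that idea, assuming a shared GRU cell as the per-cell recurrence; the cell type, shapes, and update rule are my assumptions based only on the talk's description. Run it once per frame over a clip and the grid of hidden states becomes the map-like feature space mentioned above.

```python
# Toy sketch of a spatial RNN: a 2D grid of hidden states around the car,
# each cell updated by a shared GRU as new frames arrive. The concatenation
# of positional encoding, multicam features, and ego kinematics follows the
# talk's description; all shapes and the GRU choice are my assumptions.
import torch
import torch.nn as nn

class SpatialRNN(nn.Module):
    def __init__(self, feat_dim=128, kin_dim=6, pos_dim=16, hidden=128, grid=32):
        super().__init__()
        self.grid = grid
        self.hidden_dim = hidden
        self.cell = nn.GRUCell(feat_dim + kin_dim + pos_dim, hidden)
        self.pos_enc = nn.Parameter(torch.randn(grid * grid, pos_dim))

    def init_state(self, batch):
        return torch.zeros(batch * self.grid * self.grid, self.hidden_dim)

    def forward(self, frame_feats, ego_kin, state):
        # frame_feats: (B, grid*grid, feat_dim) multicam features per cell
        # ego_kin:     (B, kin_dim) velocity/acceleration of the ego vehicle
        B, G, _ = frame_feats.shape
        kin = ego_kin.unsqueeze(1).expand(B, G, -1)
        pos = self.pos_enc.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([frame_feats, kin, pos], dim=-1).reshape(B * G, -1)
        state = self.cell(x, state)           # every grid cell shares one GRU
        return state                          # persistent map in feature space
```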
00:02:56.660 | Andrej Karpathy, I think, also mentioned
00:02:59.020 | some future improvements,
00:03:00.580 | performing the fusion earlier and earlier
00:03:03.260 | in the neural network.
00:03:04.500 | So currently the fusion of space and time
00:03:06.500 | are late in the network.
00:03:08.140 | Moving the fusion earlier on
00:03:10.540 | takes us further toward full end-to-end driving
00:03:15.380 | with multiple modalities.
00:03:16.820 | Seamlessly fusing, integrating
00:03:19.460 | the multiple sources of sensory data.
00:03:21.740 | Finally, the place where there's currently,
00:03:24.060 | from my understanding, the least amount
00:03:25.940 | of utilization of neural networks is planning.
00:03:29.780 | So obviously optimal planning in action space
00:03:33.100 | is intractable, so you have to come up
00:03:35.020 | with a bunch of heuristics.
00:03:36.620 | You can do those manually,
00:03:38.900 | or you could do those through learning.
00:03:40.620 | So the idea that was presented
00:03:41.900 | is to use neural networks as heuristics.
00:03:44.660 | In a similar way that neural networks
00:03:46.580 | were used as heuristics
00:03:48.220 | in the Monte Carlo tree search
00:03:49.780 | for MuZero and AlphaZero
00:03:51.620 | to play different games,
00:03:53.220 | to play Go, to play chess.
00:03:54.980 | This allows you to significantly prune
00:03:56.660 | the search through action space
00:03:59.420 | for a plan that doesn't get stuck
00:04:01.140 | in local optima and gets pretty close
00:04:02.980 | to the global optimum.
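
A minimal sketch of that pruning pattern in the spirit of AlphaZero/MuZero, not Tesla's actual planner: a small policy network scores candidate maneuvers and the search only expands the top-k. The network, the action set, and the rollout are placeholders.

```python
# Minimal sketch of neural-network-guided search pruning, in the spirit of
# AlphaZero/MuZero-style planners -- not Tesla's actual planner. The policy
# network, state encoding, and rollout here are placeholders/assumptions.
import torch
import torch.nn as nn

policy_net = nn.Sequential(                  # scores 32 candidate maneuvers
    nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))

def expand(state_feats, top_k=5):
    """Instead of searching all 32 candidate maneuvers, expand only the
    top_k that the learned heuristic considers promising."""
    with torch.no_grad():
        scores = policy_net(state_feats)      # (32,) action logits
    return torch.topk(scores, top_k).indices  # prune the rest of action space

# Search loop sketch: at each node, only the heuristic's top-k actions
# are simulated further, keeping the search tree tractable.
state = torch.randn(64)
for depth in range(3):
    candidates = expand(state)
    # ... simulate each candidate, pick the best, update state (omitted)
```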
00:04:04.340 | I really appreciated that the presentation
00:04:06.700 | didn't dumb anything down,
00:04:09.140 | but amid all the technical details,
00:04:10.860 | it was easy to miss just how much
00:04:12.540 | brilliant innovation there was here.
00:04:14.980 | The move to predicting in vector space
00:04:17.420 | is truly brilliant.
00:04:18.580 | Of course, you can only do that
00:04:19.740 | if you have the data
00:04:20.700 | and you have the annotation for it.
00:04:22.340 | But just to take that step
00:04:24.660 | is already taking a step outside the box
00:04:27.020 | of the way things are currently done
00:04:28.340 | in computer vision.
00:04:29.660 | Then fusing seamlessly across
00:04:33.340 | many camera sensors,
00:04:35.620 | incorporating time into the whole thing
00:04:37.300 | in a way that's differentiable
00:04:38.660 | with these spatial RNNs.
00:04:40.740 | And then of course, using that beautiful
00:04:42.500 | mess of features,
00:04:44.220 | both on the individual image side
00:04:47.060 | and the RNN side to make plans
00:04:50.420 | using a neural network architecture
00:04:51.940 | as a heuristic.
00:04:53.380 | I mean, all of that is just brilliant.
00:04:55.700 | The other critical part of making all of this work
00:04:57.820 | is the data and the data annotation.
00:04:59.900 | First is the manual labeling.
00:05:01.300 | So to make the neural networks
00:05:03.100 | that predict in vector space work,
00:05:04.860 | you have to label in vector space.
00:05:06.540 | So you have to create in-house tools.
00:05:08.020 | And as it turns out,
00:05:09.420 | Tesla hired an in-house team of annotators
00:05:12.100 | to use those tools to then perform
00:05:14.700 | the labeling in vector space
00:05:16.100 | and then project it out into the image space.
00:05:18.460 | First of all, that saves a lot of work.
00:05:20.060 | And second of all,
00:05:21.020 | that means you're directly performing
00:05:23.340 | the annotation in the space
00:05:24.820 | in which you're doing the prediction.
00:05:26.380 | Obviously, as was always the case,
00:05:28.140 | as is the case with self-supervised learning,
00:05:30.300 | auto labeling is the key to this whole thing.
00:05:33.140 | One of the interesting things that was presented
00:05:34.980 | is the use of clips of data
00:05:37.100 | that includes video, IMU, GPS, odometry, and so on
00:05:40.500 | from multiple vehicles at the same location and time
00:05:43.660 | to generate labels of both the static world
00:05:46.500 | and the moving objects and their kinematics.
00:05:49.980 | That's really cool.
00:05:50.820 | You have these little clips,
00:05:52.940 | these buckets of data from different vehicles,
00:05:55.860 | and they're kind of annotating each other.
00:05:57.900 | You're registering them together
00:05:59.380 | to then combine a solid annotation
00:06:02.660 | of that particular part of road at that particular time.
00:06:06.300 | That's amazing because the more the fleet grows,
00:06:08.500 | the stronger that kind of auto labeling becomes.
00:06:12.300 | And the more edge cases you're able to catch that way.
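
A hedged sketch of the registration idea: each clip contributes 3D points plus a pose from GPS/odometry, everything is transformed into a shared world frame, and the aggregate becomes the label. The naive pose transform and voxel consensus below are stand-ins for whatever optimization Tesla actually runs.

```python
# Hedged sketch of multi-clip auto-labeling: clips (video + IMU/GPS/odometry)
# from different vehicles at the same location are registered into a common
# frame, and the aggregate becomes the label. The simple pose transform and
# voxel consensus below are stand-ins for Tesla's actual (unpublished) method.
import numpy as np

def register_clip(points_local, pose):
    """Transform one clip's 3D points into the shared world frame.
    pose: (R, t) rotation matrix and translation from GPS/odometry."""
    R, t = pose
    return points_local @ R.T + t

def auto_label(clips):
    """Fuse many registered clips into one static-world label."""
    world_points = np.concatenate(
        [register_clip(pts, pose) for pts, pose in clips], axis=0)
    # Aggregate: snap points to a 0.5 m voxel grid and keep occupied cells.
    voxels = np.unique(np.round(world_points / 0.5).astype(int), axis=0)
    return voxels * 0.5                       # consensus static geometry

# Each vehicle contributes (points, pose); more clips -> stronger label.
clips = [(np.random.randn(100, 3), (np.eye(3), np.zeros(3))) for _ in range(4)]
label = auto_label(clips)
```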
00:06:14.700 | Speaking of edge cases,
00:06:15.820 | that's what Tesla is using simulation for,
00:06:19.040 | is to simulate rare edge cases
00:06:20.580 | that are not going to appear often in the data,
00:06:22.460 | even when that data set grows incredibly large.
00:06:25.780 | And also, they're using it for annotation
00:06:27.880 | of ultra-complex scenes where accurate labeling
00:06:30.900 | of real-world data is basically impossible,
00:06:32.940 | like a scene with a hundred pedestrians,
00:06:35.860 | which I think is the example they used.
00:06:38.180 | So I honestly think the innovations
00:06:39.600 | on the neural network architecture
00:06:41.020 | and the data annotation are really just a big leap.
00:06:44.340 | Then there's the continued innovation
00:06:46.220 | on the autopilot computer side,
00:06:48.100 | the neural network compiler that optimizes latency,
00:06:51.180 | and so on.
00:06:52.380 | There were, I think, really nice testing
00:06:56.460 | and debugging tools for variants
00:06:59.980 | of candidate trained neural networks
00:07:02.180 | to be deployed in the future,
00:07:03.380 | where you can compare different neural networks
00:07:05.760 | against each other. That's almost like
00:07:07.860 | developer tools for to-be-deployed neural networks.
00:07:11.140 | And it was mentioned that almost 10,000 GPUs
00:07:13.900 | are currently being used to continually retrain the network.
00:07:18.180 | I forget what the number was,
00:07:19.420 | but I think every week or every two weeks,
00:07:21.860 | the network is fully retrained end to end.
00:07:25.180 | The other really big innovation,
00:07:26.900 | but unlike the neural network and the data annotation,
00:07:30.020 | this is in the future, so to be deployed still,
00:07:32.660 | it's still under development,
00:07:34.020 | is the Dojo computer, which is used for training.
00:07:37.940 | So the autopilot computer is the computer on the car
00:07:40.820 | that is doing the inference,
00:07:41.940 | and Dojo computer is the thing
00:07:43.740 | that you would have in a data center
00:07:45.260 | that performs the training of the neural network.
00:07:47.900 | There's what they're calling a single training tile
00:07:51.300 | that is nine petaflops.
00:07:53.820 | It's made up of D1 chips that are built in house by Tesla.
00:07:57.220 | Each chip with super fast IO,
00:07:59.440 | each tile also with super fast IO.
00:08:02.780 | So you can basically connect
00:08:04.060 | an arbitrary number of these together,
00:08:06.220 | each with a power supply and cooling.
00:08:08.620 | And then I think they connected like a million nodes
00:08:12.860 | to have a compute center.
00:08:14.740 | I forget what the name is, but it's 1.1 exaflop.
00:08:17.860 | So combined with the fact
00:08:19.620 | that this can arbitrarily scale,
00:08:22.140 | I think this is basically contending
00:08:24.300 | to be the world's most powerful
00:08:26.340 | neural network training computer.
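
The stated numbers roughly hang together; the per-chip and per-tile figures below are my recollection of the presented specs, so treat this as a back-of-the-envelope check rather than confirmed detail.

```python
# Back-of-the-envelope check of the Dojo figures from the presentation.
# Per-chip/per-tile constants are my recollection of the stated specs --
# treat every number here as an assumption, not confirmed detail.
TILE_PFLOPS = 9                     # one training tile (BF16), per the talk
TARGET_EFLOPS = 1.1                 # the full training machine
tiles = TARGET_EFLOPS * 1000 / TILE_PFLOPS
print(f"tiles needed: {tiles:.0f}")             # ~122, on the order of 120 tiles

CHIPS_PER_TILE = 25                 # D1 chips per tile (from memory)
NODES_PER_CHIP = 354                # training nodes per D1 chip (from memory)
nodes = 120 * CHIPS_PER_TILE * NODES_PER_CHIP
print(f"total nodes: {nodes:,}")                # ~1,062,000 -- the "million nodes"
```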
00:08:28.180 | Again, the entire picture that was presented
00:08:30.820 | on AI Day is amazing because the, what would you call it?
00:08:35.420 | The Tesla AI machine can improve arbitrarily
00:08:38.940 | through the iterative data engine process
00:08:40.940 | of auto labeling plus manual labeling of edge cases.
00:08:44.440 | So like that labeling stage,
00:08:46.340 | plus the data collection, retraining, deploying.
00:08:49.540 | And then again, you go back to the data collection,
00:08:52.540 | the labeling, retraining, and deploying.
00:08:56.000 | And you can go through this loop as many times as you want
00:08:59.980 | to arbitrarily improve the performance of the network.
00:09:02.900 | I still think nobody knows how difficult
00:09:05.260 | the autonomous driving problem is,
00:09:08.300 | but I also think this loop does not have a ceiling.
00:09:11.780 | I still think there's a big place for driver sensing.
00:09:14.560 | I still think you have to solve
00:09:15.880 | the human robot interaction problem
00:09:17.800 | to make the experience more pleasant,
00:09:19.860 | but damn it, this loop of manual and auto labeling
00:09:24.060 | that leads to retraining, that leads to deployment,
00:09:25.860 | it goes back to the data collection
00:09:28.020 | and the auto labeling and the manual labeling
00:09:29.860 | is incredible.
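
Written out, the loop is simple; this runnable skeleton just restates the cycle, with every function body a trivial stub standing in for a large subsystem.

```python
# The data engine loop as described, written as a runnable skeleton.
# Every function body is a trivial stub standing in for a large subsystem.
def collect_data(fleet):      return [f"clip_from_{car}" for car in fleet]
def auto_label(clips):        return [(c, "auto") for c in clips]
def find_edge_cases(clips):   return clips[:1]          # the hard tail
def manually_label(clips):    return [(c, "manual") for c in clips]
def retrain(model, labels):   return model + 1          # new model version
def deploy(model, fleet):     print(f"deploying v{model} to {len(fleet)} cars")

def data_engine_iteration(fleet, model):
    clips = collect_data(fleet)                 # triggers, disengagements, etc.
    labels = auto_label(clips)                  # multi-clip auto labeling
    labels += manually_label(find_edge_cases(clips))  # humans handle edge cases
    model = retrain(model, labels)              # full retrain every week or two
    deploy(model, fleet)                        # back into the cars
    return model                                # loop again -- no obvious ceiling

model = 0
for _ in range(3):                              # each pass improves the network
    model = data_engine_iteration(["car_a", "car_b"], model)
```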
00:09:31.420 | Second reason this whole effort is amazing
00:09:33.860 | is that Dojo can essentially become
00:09:36.140 | an AI training as a service,
00:09:41.860 | directly taking on AWS and Google Cloud.
00:09:41.860 | So there's no reason it needs to be utilized
00:09:44.300 | specifically for the autopilot computer.
00:09:47.060 | Given the simplicity of the way they described
00:09:48.720 | the deployment of PyTorch across these nodes,
00:09:50.780 | you can basically use it for any kind
00:09:52.620 | of machine learning problem,
00:09:53.940 | especially one that requires scale.
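
Dojo's software stack was not shown in detail, but the claim is that ordinary PyTorch distributed training maps onto it. For reference, this is what that workload looks like in stock PyTorch with DistributedDataParallel; nothing here is Dojo-specific.

```python
# What "deploying PyTorch across nodes" looks like in stock PyTorch --
# ordinary DistributedDataParallel, not Dojo-specific code. The point is
# that any model trained this way could, in principle, target such a cluster.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(steps=100):
    dist.init_process_group("gloo")       # backend depends on the hardware
    model = DDP(nn.Linear(512, 10))       # gradients sync across all workers
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(steps):
        x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()                   # all-reduce happens under the hood
        opt.step()

if __name__ == "__main__":
    train()   # launch with: torchrun --nproc_per_node=8 this_file.py
```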
00:09:55.940 | Finally, the third reason all of this was amazing
00:09:58.360 | is that the neural network architecture
00:10:00.020 | and data engine pipeline is applicable
00:10:02.220 | to much more than just roads and driving.
00:10:05.020 | It can be used in the home, in the factory,
00:10:07.620 | and by robots of basically any form,
00:10:09.740 | as long as it has cameras and actuators,
00:10:11.860 | including, yes, the humanoid form.
00:10:15.040 | As someone who loves robotics,
00:10:17.180 | the presentation of a humanoid Tesla bot
00:10:19.980 | was truly exciting.
00:10:21.980 | Of course, for me personally,
00:10:23.220 | the lifelong dream has been to build the mind,
00:10:26.900 | the robot that becomes a friend and a companion to humans,
00:10:30.500 | not just a servant that performs boring and dangerous tasks.
00:10:35.500 | But to me, these two problems should,
00:10:38.120 | and I think will be solved in parallel.
00:10:41.100 | The Tesla bot, if successful,
00:10:43.200 | just might solve the latter problem
00:10:45.040 | of perception, movement, and object manipulation.
00:10:48.460 | And I hope to play a small part
00:10:51.200 | in solving the former problem of human-robot interaction,
00:10:55.400 | and yes, friendship.
00:10:57.440 | I'm not going to mention love when talking about robots.
00:11:01.320 | Either way, all of this, to me,
00:11:03.480 | paints a picture of an exciting future.
00:11:06.180 | Thanks for watching.
00:11:07.360 | Hope to see you next time.
00:11:09.040 | (upbeat music)