
Tesla AI Day Highlights | Lex Fridman


Chapters

0:00 Overview
1:16 Neural network architecture
4:55 Data and annotation
6:44 Autopilot & Dojo
8:28 Summary: 3 key ideas
9:55 Tesla Bot

Transcript

Tesla AI Day presented the most amazing real-world AI and engineering effort I have ever seen in my life. I wrote this and I meant it. Why was it amazing to me? No, not primarily because of the Tesla bot. It was amazing because I believe the autonomous driving task and the general real-world robotics perception and planning task is a lot harder than people generally think.

And I also believed the scale of effort in algorithms, data, annotation, simulation, inference compute, and training compute required to solve these problems is something no one would be able to do in the near term. Yesterday was the first time I saw in one place just the kind and the scale of effort that has a chance to solve this: the autonomous driving problem and the general real-world robotics perception and planning problem.

This includes the neural network architecture and pipeline, the Autopilot compute hardware in the car, the Dojo compute hardware for training, the data and the annotation, the simulation for rare edge cases, and yes, the generalized application of all of the above beyond the car robot to the humanoid form. Let's go through the big innovations.

The neural network. Each of these is a difficult and, I would say, brilliant design idea that is either a step or a leap forward from the state of the art in machine learning. First is predicting in vector space, not in image space. This alone is a big leap beyond what is usually done in computer vision, which operates in image space, in the two-dimensional image.

The thing about reality is that it happens out there in the three-dimensional world and it doesn't make sense to be doing all the machine learning on the 2D projections onto images. Like many good ideas, this is an obvious one, but a very difficult one. Second is the fusion of camera sensor data before the detections.

These detections are performed by the different heads of the multitask neural network. For now, the fusion is at the multi-scale feature level. Again, in retrospect, an obvious but very difficult engineering step: doing the detection and the machine learning on all of the sensors combined, as opposed to doing them individually and combining only the decisions.
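
To make that idea concrete, here's a rough PyTorch sketch of feature-level fusion into a shared vector-space (bird's-eye-view) grid before any detection heads run. Everything here, the tiny backbone, the learned projection, and the 7-channel head, is a placeholder I made up to show the structure, not Tesla's actual architecture.

```python
import torch
import torch.nn as nn

class MultiCamBEVFusion(nn.Module):
    def __init__(self, num_cams=8, feat_dim=64, bev_size=32):
        super().__init__()
        # Shared per-camera backbone (a stand-in for a real image backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Learned projection from pooled camera features into a BEV grid.
        # A real system would use camera geometry / attention; this is a placeholder.
        self.to_bev = nn.Linear(num_cams * feat_dim, feat_dim)
        self.bev_size = bev_size
        # The detection head sees only the fused vector-space features, never raw images.
        self.det_head = nn.Conv2d(feat_dim, 7, 1)  # e.g. class score + 3D box parameters

    def forward(self, images):                                    # (B, num_cams, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.backbone(images.view(b * n, c, h, w))        # per-camera features
        feats = feats.mean(dim=(2, 3)).view(b, n, -1)             # pool each camera
        fused = self.to_bev(feats.flatten(1))                     # fuse BEFORE any detection
        bev = fused[:, :, None, None].expand(-1, -1, self.bev_size, self.bev_size)
        return self.det_head(bev.contiguous())                    # predictions in vector space

out = MultiCamBEVFusion()(torch.randn(2, 8, 3, 128, 128))
print(out.shape)  # torch.Size([2, 7, 32, 32])
```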

Third is using video context to model not just vector space, but time. At each frame, you concatenate positional encodings, multicam features, and ego kinematics, using a pretty cool spatial recurrent neural network architecture that forms a 2D grid around the car, where each cell of the grid is an RNN, a recurrent neural network.

The other cool aspect of this is that you can then build a map in the space of RNN features, and then perhaps do planning in that space, which is a fascinating concept. Andrej Karpathy, I think, also mentioned some future improvements: performing the fusion earlier and earlier in the neural network.
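
Here's a minimal sketch of what such a spatial RNN grid could look like: a 2D grid of cells around the car, shared GRU weights, per-cell hidden state, and each frame's input formed by concatenating a positional encoding, the fused multi-camera features, and the ego kinematics. All dimensions and the GRU choice are my own assumptions for illustration, not the real design.

```python
import torch
import torch.nn as nn

class SpatialRNNGrid(nn.Module):
    def __init__(self, grid=16, feat_dim=64, kin_dim=4, hidden=32):
        super().__init__()
        self.hidden = hidden
        # One set of GRU weights shared by every grid cell; hidden state is per-cell.
        self.cell = nn.GRUCell(input_size=2 + feat_dim + kin_dim, hidden_size=hidden)
        # Fixed (x, y) positional encoding for each cell of the grid around the car.
        ys, xs = torch.meshgrid(torch.linspace(0, 1, grid),
                                torch.linspace(0, 1, grid), indexing="ij")
        self.register_buffer("pos", torch.stack([xs, ys], dim=-1).view(-1, 2))

    def forward(self, bev_feats, ego_kin, state=None):
        # bev_feats: (B, N, feat_dim) fused multi-camera features per grid cell
        # ego_kin:   (B, kin_dim) ego kinematics, e.g. speed and yaw rate
        b, n, _ = bev_feats.shape
        h = bev_feats.new_zeros(b * n, self.hidden) if state is None else state.view(b * n, -1)
        pos = self.pos.unsqueeze(0).expand(b, -1, -1)                  # (B, N, 2)
        kin = ego_kin.unsqueeze(1).expand(-1, n, -1)                   # broadcast to every cell
        x = torch.cat([pos, bev_feats, kin], dim=-1).view(b * n, -1)   # concat per cell
        h = self.cell(x, h)                                            # recurrent update
        return h.view(b, n, self.hidden)                               # a "map" in RNN-feature space

grid, state = SpatialRNNGrid(), None
for _ in range(5):                                    # five consecutive video frames
    state = grid(torch.randn(2, 16 * 16, 64), torch.randn(2, 4), state)
print(state.shape)                                    # torch.Size([2, 256, 32])
```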

So currently the fusion of space and time happens late in the network. Moving the fusion earlier takes us further toward full end-to-end driving with multiple modalities, seamlessly fusing, integrating the multiple sources of sensory data. Finally, the place where there's currently, from my understanding, the least amount of utilization of neural networks is planning.

Obviously, optimal planning in action space is intractable, so you have to come up with a bunch of heuristics. You can do those manually, or you can do those through learning. So the idea that was presented is to use neural networks as heuristics, in a similar way that neural networks were used as heuristics in the Monte Carlo tree search for MuZero and AlphaZero to play different games, to play Go, to play chess.

This allows you to significantly prune the search through action space for a plan that doesn't get stuck in local optima and gets pretty close to the global optimum. I really appreciated that the presentation didn't dumb anything down, but amid all the technical details, it was easy to miss just how much brilliant innovation there was here.
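
As a toy illustration of the "neural network as a heuristic" idea, here's a tiny beam search over a made-up action space where a learned value network decides which branches to keep. The dynamics, action set, and value network are all stand-ins; this only shows how a learned score can prune an otherwise intractable search.

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))   # learned heuristic
ACTIONS = torch.tensor([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0],
                        [0.0, -1.0], [0.0, 1.0]])                           # toy steer/accel deltas

def step(state, action):
    # Trivial stand-in dynamics: position advances by velocity, velocity by the action.
    pos, vel = state[:2], state[2:]
    return torch.cat([pos + vel, vel + 0.1 * action])

def plan(start, depth=4, beam=3):
    frontier = [(start, [])]                               # (state, action sequence so far)
    for _ in range(depth):
        candidates = [(step(s, a), seq + [a]) for s, seq in frontier for a in ACTIONS]
        with torch.no_grad():
            scores = torch.stack([value_net(s) for s, _ in candidates]).squeeze(-1)
        keep = scores.topk(beam).indices                   # prune with the learned heuristic
        frontier = [candidates[i] for i in keep]           # only promising branches survive
    return frontier[0][1]                                  # best action sequence found

print(plan(torch.zeros(4)))
```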

The move to predicting in vector space is truly brilliant. Of course, you can only do that if you have the data and you have the annotation for it. But just to take that step is already taking a step outside the box of the way things are currently done in computer vision.

Then fusing seamlessly across many camera sensors, incorporating time into the whole thing in a way that's differentiable with these spatial RNNs. And then of course, using that beautiful mess of features, both on the individual image side and the RNN side, to make plans, using a neural network as a heuristic.

I mean, all of that is just brilliant. The other critical part of making all of this work is the data and the data annotation. First is the manual labeling. So to make the neural networks that predict in vector space work, you have to label in vector space. So you have to create in-house tools.

And as it turns out, Tesla hired an in-house team of annotators to use those tools to then perform the labeling in vector space and project it out into the image space. First of all, that saves a lot of work. And second of all, that means you're directly performing the annotation in the space in which you're doing the prediction.
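
The projection step is just classic camera geometry. Here's a minimal pinhole-camera sketch of taking one 3D label and projecting it into pixel coordinates; the intrinsics and extrinsics are made-up placeholders.

```python
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],           # hypothetical camera intrinsics
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.0])   # hypothetical camera extrinsics

def project(point_3d):
    """Project one 3D label point (vehicle frame, metres) into pixel coordinates."""
    cam = R @ point_3d + t                    # vehicle frame -> camera frame
    uvw = K @ cam                             # camera frame -> homogeneous pixels
    return uvw[:2] / uvw[2]                   # perspective divide

# A single 3D annotation (e.g. a lane-line vertex 20 m ahead, 1.5 m below the camera)
# yields a consistent 2D label in every camera it projects into.
print(project(np.array([0.0, -1.5, 20.0])))   # -> approx [640., 285.]
```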

Obviously, as has always been the case with self-supervised learning, auto-labeling is the key to this whole thing. One of the interesting things that was presented is the use of clips of data, which include video, IMU, GPS, odometry, and so on, from multiple vehicles at the same location and time to generate labels of both the static world and the moving objects and their kinematics.

That's really cool. You have these little clips, these buckets of data from different vehicles, and they're kind of annotating each other. You're registering them together to then produce a solid annotation of that particular stretch of road at that particular time. That's amazing because the more the fleet grows, the stronger that kind of auto-labeling becomes.
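
A cartoon version of that registration idea, with made-up numbers: each clip carries its own pose from GPS and odometry plus locally observed points, and mapping everything into one shared world frame lets overlapping clips reinforce each other into a single label.

```python
import numpy as np

def to_world(points_local, yaw, position):
    """Rigid 2D transform from a vehicle's local frame into the shared world frame."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s], [s, c]])
    return points_local @ R.T + position

# Two clips observing the same lane-line point from different poses (made-up data).
clip_a = to_world(np.array([[10.0, 0.2]]), yaw=0.00, position=np.array([100.0, 50.0]))
clip_b = to_world(np.array([[ 5.1, 0.1]]), yaw=0.05, position=np.array([104.8, 49.9]))

# Registering and averaging the overlapping observations gives a single,
# fleet-strength label for that piece of road.
consolidated = (clip_a + clip_b) / 2
print(consolidated)
```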

And the more edge cases you're able to catch that way. Speaking of edge cases, that's what Tesla is using simulation for: simulating rare edge cases that are not going to appear often in the data, even when that data set grows incredibly large. They're also using it for annotation of ultra-complex scenes where accurate labeling of real-world data is basically impossible, like a scene with a hundred pedestrians, which I think is the example they used.

So I honestly think the innovations in the neural network architecture and the data annotation are really just a big leap. Then there's the continued innovation on the Autopilot computer side, the neural network compiler that optimizes latency, and so on. There were, I remember, really nice testing and debugging tools for variants of candidate trained neural networks to be deployed in the future, where you can compare different neural networks against each other.

That's almost like developer tools for to-be-deployed neural networks. And it was mentioned that almost 10,000 GPUs are currently being used to continually retrain the network. I forget what the number was, but I think every week or every two weeks, the network is fully retrained end to end.
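
The comparison workflow itself is simple to picture. Here's a tiny sketch, purely my own and not Tesla's tooling, of evaluating a current and a candidate network on the same held-out data and diffing the metrics before deciding what to ship.

```python
import torch
import torch.nn as nn

def evaluate(model, clips, targets):
    with torch.no_grad():
        return nn.functional.mse_loss(model(clips), targets).item()

clips, targets = torch.randn(64, 16), torch.randn(64, 1)    # stand-in evaluation set
current   = nn.Linear(16, 1)                                 # stand-in deployed network
candidate = nn.Linear(16, 1)                                 # stand-in retrained candidate

report = {name: evaluate(m, clips, targets)
          for name, m in [("current", current), ("candidate", candidate)]}
print(report)   # ship the candidate only if its metrics hold up
```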

The other really big innovation is the Dojo computer, which is used for training. But unlike the neural network and the data annotation, this is in the future, still to be deployed, still under development. So the Autopilot computer is the computer in the car that does the inference, and the Dojo computer is the thing you would have in a data center that performs the training of the neural network.

There's what they're calling a single training tile, which is 9 petaflops. It's made up of D1 chips that are built in-house by Tesla, each chip with super fast I/O, each tile also with super fast I/O. So you can basically connect an arbitrary number of these together, each with a power supply and cooling.

And then I think they connected like a million nodes to have a compute center. I forget what the name is, but it's 1.1 exaflop. So combined with the fact that this can arbitrarily scale, I think this is basically contending to be the world's most powerful neural network training computer.
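
A quick back-of-the-envelope check, using the quoted per-tile and aggregate numbers and my own arithmetic: 1.1 exaflops at roughly 9 petaflops per tile implies on the order of 120 tiles wired together.

```python
# Rough arithmetic only; both figures are the ones quoted in the presentation.
tile_pflops = 9                       # ~9 petaflops per training tile
target_eflops = 1.1                   # quoted aggregate compute
tiles_needed = target_eflops * 1000 / tile_pflops
print(round(tiles_needed))            # ~122 tiles
```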

Again, the entire picture that was presented on AI Day is amazing because the, what would you call it, the Tesla AI machine can improve arbitrarily through the iterative data engine process: auto-labeling plus manual labeling of edge cases, plus the data collection, retraining, and deploying.

And then again, you go back to the data collection, the labeling, retraining, and deploying. And you can go through this loop as many times as you want to arbitrarily improve the performance of the network. I still think nobody knows how difficult the autonomous driving problem is, but I also think this loop does not have a ceiling.

I still think there's a big place for driver sensing. I still think you have to solve the human-robot interaction problem to make the experience more pleasant. But damn it, this loop of manual and auto labeling that leads to retraining, that leads to deployment, and then goes back to the data collection and the auto labeling and the manual labeling, is incredible.
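
The shape of that loop is simple enough to write down. Here's a skeleton of the data engine as I understood it; every function body is a trivial stand-in so the loop runs, and none of it is a real Tesla API.

```python
def collect_clips(model):      return [f"clip_{model}_{i}" for i in range(3)]   # fleet gathers data
def auto_label(clips):         return [(c, "auto") for c in clips]              # clips label each other
def label_edge_cases(clips):   return [(clips[-1], "manual")]                   # humans take the hard tail
def retrain(model, labels):    return model + 1                                 # full periodic retrain
def deploy(model):             print(f"deployed model v{model}")                # back out to the cars

model = 0
for _ in range(3):                      # the loop has no obvious ceiling
    clips = collect_clips(model)
    labels = auto_label(clips) + label_edge_cases(clips)
    model = retrain(model, labels)
    deploy(model)                       # then go around again
```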

The second reason this whole effort is amazing is that Dojo can essentially become AI training as a service, directly taking on AWS and Google Cloud. So there's no reason it needs to be utilized specifically for the Autopilot computer. Given the simplicity of the way they described the deployment of PyTorch across these nodes, you can basically use it for any kind of machine learning problem, especially one that requires scale.
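
For a sense of what "PyTorch across many nodes" looks like in general, here's the standard DistributedDataParallel pattern; nothing in it is Dojo-specific, it's just the kind of generic multi-worker training that an AI-training-as-a-service offering would wrap.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int):
    # Each worker joins the same process group (gloo backend keeps this CPU-only).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(16, 1))             # gradients sync across workers
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    for _ in range(10):
        x, y = torch.randn(32, 16), torch.randn(32, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                             # all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    torch.multiprocessing.spawn(train, args=(2,), nprocs=2)
```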

Finally, the third reason all of this was amazing is that the neural network architecture and data engine pipeline is applicable to much more than just roads and driving. It can be used in the home, in the factory, and by robots of basically any form, as long as it has cameras and actuators, including, yes, the humanoid form.

As someone who loves robotics, the presentation of a humanoid Tesla bot was truly exciting. Of course, for me personally, the lifelong dream has been to build the mind, the robot that becomes a friend and a companion to humans, not just a servant that performs boring and dangerous tasks. But to me, these two problems should, and I think will, be solved in parallel.

The Tesla bot, if successful, just might solve the latter problem of perception, movement, and object manipulation. And I hope to play a small part in solving the former problem of human-robot interaction, and yes, friendship. I'm not going to mention love when talking about robots. Either way, all of this, to me, paints a picture of an exciting future.

Thanks for watching. Hope to see you next time. (upbeat music)