Rishabh Garg, Tesla Optimus — Challenges in High Performance Robotics Systems


Chapters

0:00 Introduction to high-performance robotics challenges
0:15 The problem of unexplained robot behavior
0:54 Root cause analysis: policy vs. software
1:17 Designing a toy robotics system for analysis
1:24 System architecture: sensors, CPU, GPU, actuators, CAN bus
1:57 The initial, simple code loop
2:14 Expectation vs. reality: unexpected loop execution gaps
2:42 The impact of CAN bus data rate on loop execution
3:13 Potential solutions: accepting delay vs. multithreading
4:00 A new, pipelined design for reduced cycle time
4:32 New problems: "stuttering" and abnormal motor behavior
4:49 Data collection with external transceivers and "candump"
5:24 Expected vs. actual message plots: missed messages and jitter
6:12 Using cycle time plots to identify desynchronization
6:58 Transmit phase desynchronization: missed and queued data
8:03 Receive phase desynchronization: stale data and overcompensation
8:38 Resolving synchronization issues: kernel primitives and padding
9:25 The impact of logging on system performance
11:09 Reception and priority inversion
12:02 Conclusion and summary of key takeaways

Transcript

Good afternoon, everyone. Really excited to be here today. Really exciting stuff so far: so many models, so many new ideas. And today I want to talk about what happens between the controller and the wire. Now, we have seen so many policies that work, that control robots. But we still need to get that data to the actuators.

We need to get data from the sensors and feed the whole system. And what happens if your carefully crafted policy does not work as expected? Is the issue in the policy, or is it in the software system? So today we'll look at a lot of instances where the issue looks like it's in the policy, but it's actually in the software system.

And along the way, we'll design a very small toy robotics system. So why this talk? Well, robots are complex: so many systems, so many different software components. And yet we're focused on one big question. When things go wrong on the robot, when you don't see that motor move, what's the root cause?

Is it the policy that is not giving the command, or is it the software system? This is a question that I grapple with almost every day. So I want to talk about what I've seen so far and how to diagnose these issues on the robot. So let's go to the buildup.

Let's try to build a very small toy robot with a general architecture. This is what a general robot would look like: some actuators, a CPU, maybe a hardware accelerator such as a GPU, and some sensors. Perfect. Now, one of the most critical aspects is the communication protocol. So for our talk, we'll use CAN.

CAN is great. CAN is an open standard: everyone can use it, and it's cheap. And it has enough data rate and enough compatibility for a lot of the components out there. So we'll stick to CAN, and we'll see how that influences a lot of the design decisions down the line.
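For concreteness, on a Linux host the bus is typically exposed through SocketCAN; a minimal sketch of opening a raw socket on a hypothetical `can0` interface might look like this (standard SocketCAN boilerplate, not code shown in the talk):

```cpp
// Sketch: standard Linux SocketCAN setup for a raw CAN socket.
// Assumes a "can0" interface has already been configured and brought up.
#include <cstdio>
#include <cstring>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/can.h>
#include <linux/can/raw.h>

int open_can_socket(const char* ifname) {
    int s = socket(PF_CAN, SOCK_RAW, CAN_RAW);       // raw CAN socket
    if (s < 0) { std::perror("socket"); return -1; }

    struct ifreq ifr {};
    std::strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    if (ioctl(s, SIOCGIFINDEX, &ifr) < 0) {          // name -> index
        std::perror("ioctl"); close(s); return -1;
    }

    struct sockaddr_can addr {};
    addr.can_family = AF_CAN;
    addr.can_ifindex = ifr.ifr_ifindex;
    if (bind(s, reinterpret_cast<struct sockaddr*>(&addr), sizeof(addr)) < 0) {
        std::perror("bind"); close(s); return -1;
    }
    return s;  // read()/write() struct can_frame on this descriptor
}
```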

All right. So let's also start simple with the code. We'll start with receiving the data, giving it to the policy, and sending the result back out. Nothing fancy, right? And let's assume that the policy takes approximately two milliseconds. And this is what we should expect to see, right?
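Before we get to the timing, here's a rough sketch of what that first loop might look like; SensorData, Command, and the three helper functions are hypothetical placeholders, not code from the talk:

```cpp
// Sketch of the naive single-threaded loop: RX, policy, TX, every 2 ms.
// SensorData, Command, and the three helpers are hypothetical placeholders.
#include <chrono>
#include <thread>

struct SensorData {}; struct Command {};
SensorData read_sensors();              // hypothetical: blocking CAN receive
Command run_policy(const SensorData&);  // hypothetical: ~2 ms of compute
void send_commands(const Command&);     // hypothetical: blocking CAN transmit

void control_loop() {
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::milliseconds(2);
    auto next_tick = clock::now() + period;
    while (true) {
        SensorData in = read_sensors();   // RX
        Command out = run_policy(in);     // policy
        send_commands(out);               // TX
        std::this_thread::sleep_until(next_tick);  // hold the 2 ms cadence
        next_tick += period;
    }
}
```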

Our loop's running every two milliseconds. We are able to see our policy output. We read data. We send it out. Standard. But as soon as we deploy it on the robot, this is what happens. There's a gap. Every two milliseconds, there's a gap. Wait. What's going on? Well, let's look at the loop again.

So at the edge of the loop, we have question marks. We see that we're transmitting and receiving CAN data. So let's look at the CAN bus. Maybe we'll find some hints there. Okay. So let's say we have 100 bits per message. And we have about 10 messages, five to be sent out, five to be received.

That gives us a total of 1,000 bits. And for a CAN bus operating at one megabit per second, that's about 0.1 milliseconds per message, or 1 millisecond for all 10 messages. You can see how even a small number of messages saturates the CAN bus, to the point that the transmission time is on the same order as the loop time, the time our system takes to run.
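Written out, the arithmetic is:

```latex
t_{\text{bus}} \;=\; \frac{10 \times 100\ \text{bits}}{1\ \text{Mbit/s}}
               \;=\; \frac{1000\ \text{bits}}{10^{6}\ \text{bits/s}}
               \;=\; 1\ \text{ms}
\qquad (\approx 0.1\ \text{ms per message})
```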

And this explains the one millisecond gap. So, great, but then what do we do about it? It's almost unavoidable, right? We cannot get around this one millisecond gap. Well, that's solution number one: you just accept the delay. The loop becomes three milliseconds, and hopefully that's not too bad. But a system would not be high-performance if we let that stop us.

So we'll multithread and we'll pipeline. We'll figure out how we can work around that one millisecond and how we can organize our tasks differently to still get that two millisecond loop time. So here we'll take a moment to pause and see that the loop has multiple components, broken down into three now.

TX, RX, and the policy. We'll run the communication in one thread and the policy in another. And now we'll see how we take this simple building block and stagger it so that we can actually achieve faster loop times. And this is it. So what we do: we seed the policy the first time.

We get some data. We feed it to the policy. But before the policy concludes, we start receiving the next set of data, and that's for the next iteration. When the next iteration starts, we transmit the output from the last policy run, and we continue with the next iteration of the policy.
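A minimal sketch of this staggered, two-thread arrangement might look like the following; the types, helper functions, and flag-based handoff are illustrative, not the talk's actual code:

```cpp
// Pipelined sketch: a comm thread overlaps CAN TX/RX with the policy's
// compute. Types, helpers, and the flag-based handoff are illustrative;
// pacing to the 2 ms tick and shutdown handling are omitted.
#include <condition_variable>
#include <mutex>

struct SensorData {}; struct Command {};
SensorData receive_frames();            // hypothetical CAN RX (~0.5 ms)
void transmit_frames(const Command&);   // hypothetical CAN TX (~0.5 ms)
Command run_policy(const SensorData&);  // hypothetical policy (~2 ms)

std::mutex m;
std::condition_variable cv;
SensorData latest_rx; Command latest_cmd;
bool rx_ready = false, cmd_ready = false;

void publish_rx(const SensorData& rx) {
    { std::lock_guard<std::mutex> lk(m); latest_rx = rx; rx_ready = true; }
    cv.notify_all();
}

void comm_thread() {
    // Bootstrap: nothing to transmit yet, so seed the first two policy
    // iterations with receive-only cycles ("seed the policy the first time").
    publish_rx(receive_frames());
    publish_rx(receive_frames());
    while (true) {
        Command cmd;
        {   std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, []{ return cmd_ready; });  // last policy finished
            cmd_ready = false; cmd = latest_cmd;
        }
        transmit_frames(cmd);         // TX the previous iteration's command,
        publish_rx(receive_frames()); // then RX data for the *next* one --
                                      // both overlap the current policy run
    }
}

void policy_thread() {
    while (true) {
        SensorData in;
        {   std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, []{ return rx_ready; });   // fresh sensor data
            in = latest_rx; rx_ready = false;
        }
        Command out = run_policy(in);  // ~2 ms, concurrent with TX/RX above
        {   std::lock_guard<std::mutex> lk(m);
            latest_cmd = out; cmd_ready = true;
        }
        cv.notify_all();
    }
}
```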

Essentially, we have parallelized our RX and TX, but each policy iteration still gets its data at the same cadence. This is great. We might have solved our problems. Let's move on. So we deployed the system on the robot. And now we see new problems. Our system is stuttering.

Our actuators sound like they're catching up, or we're seeing weird motions on the actuator. This has to be the policy; there's no way this can be software. Well, let's investigate more. Let's get some more data from the CAN bus. So here we have our CAN bus again.

And we see our CPU, GPU, all our accelerators. What we'll do is get an external transceiver. These are, again, very cheap, readily available devices that you can get anywhere. We connect it to the CAN bus and pull data off the bus. We take this data and feed it to another host computer, let's say a laptop.

And on there we can run utilities like candump, which will give you timestamped data of which message was seen at what time. So once we get this raw data off the bus, we can start plotting it. And this is what we should expect: every two milliseconds, there's a message on the bus being sent out.

It should be very nicely spaced, and it should reach the actuators in time. And if we see this on the bus, we're really happy. Now, what happens a lot of the time in real systems is that you will not see this; you will see something like this. Here we'll see that between messages three and four, there's almost no gap.

What happened there? And between two and three, there's four milliseconds of gap. It's almost like message number three was just late and four was on time. And because of that, we had this weird jitter where the actuator would try to catch up or try to follow two commands at the same time.

Okay, the same thing happened with seven and eight. So let's take a deeper look, but first let's plot this differently. There's a plot called the cycle time plot, where we plot the time since the last message. Time since last message is just a way of saying: the last message came in at a two-millisecond interval.

This one should also come at two milliseconds. So we should see a flat line around the two-millisecond mark. But here we see some messages jump to four milliseconds, and the one after that drops to zero. This is expected, because if a message is delayed, its cycle time will be long.

But then for the next one, the cycle time will be much closer to zero, because that one was not late, and the difference between the last message and the current one is basically nothing. Okay, so now we've characterized the system. We know what's going on, and we can start solving it.
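For reference, a cycle-time plot like this can be computed straight from the candump capture; a minimal sketch, assuming each line begins with a "(seconds.microseconds)" timestamp the way candump -ta prints it:

```cpp
// Sketch: read candump-style lines from stdin and print each message's
// cycle time (time since the previous message). Assumes "-ta"-style
// lines beginning with "(seconds.microseconds)".
#include <cstdio>
#include <iostream>
#include <string>

int main() {
    std::string line;
    double prev = -1.0;
    while (std::getline(std::cin, line)) {
        double ts = 0.0;
        if (std::sscanf(line.c_str(), "(%lf)", &ts) != 1) continue;
        if (prev >= 0.0)
            std::printf("%.6f cycle_ms=%.3f\n", ts, (ts - prev) * 1e3);
        prev = ts;
        // For a real analysis, group by CAN ID first; interleaved IDs
        // would otherwise hide per-message jitter.
    }
}
```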

But this is what's going on on the TX side. So let's see. We missed sending the data and queued it. Why would that happen? Well, policies are not very real-time. Sometimes they take longer, sometimes shorter. And what happens if a policy takes longer?

Well, you miss the slot when you were supposed to send the data out. So all you can do is queue it somewhere. You can store it, but it cannot go out on time anymore. And when the next iteration comes around, that's when you send both the last message and the current message.

So you'll see two messages go out on the bus at the same time. And this can also happen if our TX and RX threads start desynchronizing. This is one of the issues most commonly seen with multi-threaded systems, and it's why synchronization in these systems is so important.
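One way to at least make this failure mode visible is to pace the TX thread on an absolute clock and flag overruns explicitly, instead of silently queueing; a sketch, with illustrative numbers:

```cpp
// Sketch: pace the TX thread on an absolute clock and flag overruns
// explicitly instead of silently queueing a late frame next to the
// following one. Period and reporting are illustrative.
#include <chrono>
#include <cstdio>
#include <thread>

void tx_loop() {
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::milliseconds(2);
    auto deadline = clock::now() + period;
    while (true) {
        std::this_thread::sleep_until(deadline);
        const auto late = std::chrono::duration_cast<std::chrono::microseconds>(
            clock::now() - deadline).count();
        if (late > 2000) {
            // We slipped a whole tick: decide explicitly what to do with the
            // stale command (drop it, or send and mark it late) rather than
            // letting two frames pile onto the bus back to back.
            std::fprintf(stderr, "TX overrun by %lld us\n", (long long)late);
        }
        // transmit_frames(latest_cmd);  // hypothetical CAN TX goes here
        deadline += period;
    }
}
```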

But let's say we do synchronize and are able to fix our TX side. Well, we see some improvement, but not everything is solved. Okay, but now this has to be the policy. Our graphs are looking fine. Everything on the bus is fine.

This has to be the policy; there's nothing else left. Well, there's one last issue that we have to check, and that is: what happens if we desynchronize on the RX side? What happens if our thread is delayed? Well, now our policy will not get the new data, and it will work with the last data.

And because of that, the output will also be based on the last data. So in iteration number two, we'll actually still have a relatively old command. And in iteration number three, we'll jump ahead and skip one set of data entirely.

And because of that, we'll see a sort of skipping, catching-up behavior on the motors, which will sound almost like jitter. Okay, so how do we resolve these two things? Well, there are synchronization primitives: condition variables, semaphores. These are very low-level system primitives that are widely used in robotics, and they should be used for this toy system as well.

But if these are not available, which is sometimes the case when we're not working with a Linux-based system but with a real-time OS or a microcontroller, we can just add padding. Just have some cushion, right? Have some cushion so that if some desynchronization happens, the right RX still goes into the right policy iteration and comes out the other side in a timely manner, and we don't miss messages.
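As a sketch of the condition-variable-plus-cushion idea: the policy waits briefly for fresh data and knowingly falls back to the last sample on timeout. The 300 microsecond budget and all the names are illustrative:

```cpp
// Sketch: RX handoff with a padding budget. The policy waits briefly for
// fresh data and knowingly reuses the last sample on timeout instead of
// silently desynchronizing. The 300 us budget and names are illustrative.
#include <chrono>
#include <condition_variable>
#include <mutex>

struct SensorData {};

std::mutex rx_m;
std::condition_variable rx_cv;
SensorData rx_latest;
bool rx_fresh = false;

// RX thread: called when a full set of frames has arrived.
void publish(const SensorData& d) {
    { std::lock_guard<std::mutex> lk(rx_m); rx_latest = d; rx_fresh = true; }
    rx_cv.notify_one();
}

// Policy thread: called at the top of each cycle.
SensorData acquire(bool& is_stale) {
    std::unique_lock<std::mutex> lk(rx_m);
    // ~300 us of cushion absorbs small RX jitter without missing the tick.
    is_stale = !rx_cv.wait_for(lk, std::chrono::microseconds(300),
                               []{ return rx_fresh; });
    rx_fresh = false;
    return rx_latest;  // stale or not, the caller now knows which
}
```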

Okay, perfect. So this makes our system fairly robust, fairly high-performance. But there are a few other related problems that will happen with a system like this, which we should also talk about. So let's talk about logging. Logging is benign, right? We just log that, hey, a message came in.

We just want to log the data we got and the output we produced. It's fine, right? But if we log too much, at some point we have to send those logs to disk, and that is very costly. Imagine what happens if your main control loop starts logging and one day just decides, hey, I'm done, I'll start putting this on the hard disk.

Well, your robot will stay frozen for 30 milliseconds, as we saw on a Raspberry Pi with an SD card. So that's bad. How do we fix that? Well, we just throw more CPU at it. We add another CPU, and now all our logging is handled by that third CPU.
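In code, the usual shape is a bounded queue drained by a dedicated writer thread, so the control loop only ever enqueues; a sketch, where the queue size, file name, and drop-on-full policy are all illustrative choices:

```cpp
// Sketch: asynchronous logging. The control loop only enqueues; a dedicated
// writer thread does the (potentially tens-of-milliseconds) disk writes.
// Queue size, file name, and drop-on-full policy are illustrative.
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>

std::mutex log_m;
std::condition_variable log_cv;
std::deque<std::string> log_q;
std::atomic<unsigned> log_dropped{0};
constexpr std::size_t kMaxQueue = 4096;

void log_async(std::string msg) {   // called from the control loop: cheap
    {   std::lock_guard<std::mutex> lk(log_m);
        if (log_q.size() >= kMaxQueue) { ++log_dropped; return; }  // never block
        log_q.push_back(std::move(msg));
    }
    log_cv.notify_one();
}

void log_writer_thread() {          // pinned to the extra "logging" CPU
    std::ofstream out("robot.log");
    while (true) {
        std::unique_lock<std::mutex> lk(log_m);
        log_cv.wait(lk, []{ return !log_q.empty(); });
        std::string msg = std::move(log_q.front());
        log_q.pop_front();
        lk.unlock();
        out << msg << '\n';         // the slow disk I/O happens here
    }
}
```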

Cool. Okay, so now we're seeing how multithreading slowly gets baked into the system, how the robot operates under real-time deadline guarantees, and how we can avoid the pitfalls. Perfect. Let's talk about something a little more low-level again: microcontrollers. Microcontrollers are fairly simple, and their logging doesn't actually go through a whole disk and file system path.

They just log to some other peripheral, and that takes time. In fact, for UART, it can be on the order of milliseconds, depending on how much we're logging. So here's an interesting problem. Let's say we drop a packet, and we log that, hey, we dropped the packet. Well, that log itself takes enough time that we drop the next packet.

And then, because we dropped the next packet, we log again. So we basically just keep logging, and we see a complete blackout on the CAN bus. And it's very hard to debug: why am I getting logs and seeing packet drops but no data? These are mysterious things, and in my experience it's really good to know about these pitfalls before diving into the system, and to realize that even just a log statement can be a problem.
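A common mitigation is to count in the hot path and summarize at a slow, fixed rate; a sketch, where uart_printf and millis stand in for whatever the platform actually provides:

```cpp
// Microcontroller sketch: count drops in the hot path, summarize once per
// second, instead of paying milliseconds of UART time per dropped packet.
// uart_printf and millis stand in for whatever the platform provides.
#include <cstdint>

extern void uart_printf(const char* fmt, ...);  // hypothetical, slow (UART)
extern uint32_t millis();                       // hypothetical ms tick count

static volatile uint32_t g_drop_count = 0;

void on_packet_drop() {   // cheap: safe to call from the hot path / ISR
    ++g_drop_count;
}

void logging_tick() {     // called from the non-critical main loop
    static uint32_t last_report = 0;
    const uint32_t now = millis();
    if (now - last_report >= 1000 && g_drop_count > 0) {
        uart_printf("dropped %lu packets in last %lu ms\n",
                    (unsigned long)g_drop_count,
                    (unsigned long)(now - last_report));
        g_drop_count = 0;
        last_report = now;
    }
}
```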

Finally, there's also priority inversion. In the Linux kernel, there are several steps in how data reaches a user process. It's not direct: it takes a while between the interrupt, the kernel's handling, and delivery to the user process. In robotics, we tend to boost the priority of all our processes so high that we almost start blocking the kernel itself.

If the kernel doesn't run, we won't get the data. So we're trying to get the data while blocking the very thing that delivers it. This is inversion in action, and you will see your system drop out for almost seconds at a time.
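A sketch of the fix on Linux: give the control thread real-time priority, but deliberately leave headroom below the kernel's threaded IRQ handlers (which default to SCHED_FIFO priority 50 on PREEMPT_RT) so the data path that feeds us can still run. The exact value here is illustrative:

```cpp
// Sketch (Linux): real-time priority for the control thread, with headroom
// left below the kernel's threaded IRQ handlers (SCHED_FIFO 50 by default
// on PREEMPT_RT) so the data path that feeds us can still run.
#include <pthread.h>
#include <sched.h>

bool set_control_priority(pthread_t t) {
    sched_param p{};
    p.sched_priority = 40;  // illustrative: high, but below the IRQ threads
    return pthread_setschedparam(t, SCHED_FIFO, &p) == 0;
}
```

Called, for example, as set_control_priority(pthread_self()) when the control thread starts.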

So again, this is something we fix by making sure we know the parts of the pipeline. We set the right priorities, and we make sure that our system as a whole works well together. So this is how software and robotics have to work together.

We have to talk about the hardware, about profiling, about priorities. So let's take a recap from the top. We went over a pipeline and saw how to reduce cycle time and beat the communication delays. We saw how desynchronization can cause unexpected jitter that is hard to diagnose.

It could be the policy, it could be the system, so we want to make sure that doesn't happen. We covered logging strategies, so that we don't block the system while we're trying to tell the user that, hey, this is happening. And finally, priority inversion, and how to avoid starvation. And that's how we start designing high-performance robotic systems, at least at a very basic level.

And that's my talk for today. Thank you so much for being here and listening.