George Hotz: Winning - A Reinforcement Learning Approach

00:00:07.680 | You said that the meaning of life is to win.

00:00:10.320 | If you look five years into the future, what does winning look like?

00:00:16.960 | I... there's a lot of... I can go into like technical depth to what I mean by that, to win.

00:00:27.840 | It may not mean... I was criticized for that in the comments, like,

00:00:30.800 | "Doesn't this guy want to like save the penguins in Antarctica?" or like...

00:00:34.800 | Oh man, you know, listen to what I'm saying. I'm not talking about like I have a yacht or something.

00:00:41.200 | I am an agent. I am put into this world.

00:00:45.440 | And I don't really know what my purpose is.

00:00:49.920 | But if you're a reinforcement... if you're an intelligent agent and you're put into a world,

00:00:53.840 | what is the ideal thing to do?

00:00:55.600 | Well, the ideal thing, mathematically, you can go back to like Schmidhuber's theories about this,

00:00:59.600 | is to build a compressive model of the world,

00:01:02.880 | to build a maximally compressive... to explore the world such that your exploration function

00:01:08.160 | maximizes the derivative of compression of the past.

00:01:11.200 | Schmidhuber has a paper about this.
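
The idea of rewarding "the derivative of compression of the past" can be sketched in a few lines. This is only an illustration under stated assumptions: the compressor is a Laplace-smoothed symbol-frequency model chosen for simplicity, and the names `code_length_bits` and `compression_progress` are hypothetical, not taken from Schmidhuber's paper. The intrinsic reward here is the number of bits saved when the same history is re-encoded with the model updated on the newest observation.

```python
import math
from collections import Counter


def code_length_bits(history, counts, alphabet_size):
    """Bits needed to encode `history` under a Laplace-smoothed frequency model."""
    total = sum(counts.values()) + alphabet_size
    return sum(-math.log2((counts[s] + 1) / total) for s in history)


def compression_progress(history, new_symbol, alphabet_size=2):
    """Bits saved on the full history after the model learns from `new_symbol`.

    Assumption: the "model" is just symbol counts; a real agent would use a
    far richer predictor or compressor.
    """
    new_history = history + [new_symbol]
    old_counts = Counter(history)        # model before seeing the new symbol
    new_counts = Counter(new_history)    # model after seeing it
    old_bits = code_length_bits(new_history, old_counts, alphabet_size)
    new_bits = code_length_bits(new_history, new_counts, alphabet_size)
    return old_bits - new_bits           # positive means the past got cheaper to encode


if __name__ == "__main__":
    # A regular stream: each further 0 makes the history slightly more compressible.
    print(compression_progress([0] * 8, 0))
```

An agent following this goal would prefer actions whose observations yield the largest value of `compression_progress`, that is, the data from which its model can still learn the most.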

00:01:13.120 | And like, I took that kind of as like a personal goal function.

00:01:17.760 | So what I mean by "to win", I mean like, maybe this is religious, but like I think that in the future,

00:01:23.680 | I might be given a real purpose or I may decide this purpose myself.

00:01:27.120 | And then at that point, now I know what the game is and I know how to win.

00:01:30.640 | I think right now, I'm still just trying to figure out what the game is.

00:01:33.120 | But once I know...

00:01:33.920 | So you have imperfect information, you have a lot of uncertainty about the reward function

00:01:40.960 | and you're discovering it.

00:01:42.400 | But the purpose is...

00:01:43.360 | That's a better way to put it.

00:01:44.240 | The purpose is to maximize it while you have a lot of uncertainty around it.

00:01:50.320 | And you're both reducing the uncertainty and maximizing at the same time.

00:01:53.840 | And so that's at the technical level.
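
Maximizing a reward while still learning what it is can be illustrated with a multi-armed bandit. The sketch below uses Thompson sampling, a standard technique that is not named in the conversation and is swapped in here purely as an example; the function name `thompson_sampling` and the arm payoff probabilities are made up for illustration. Each pull both earns reward and narrows the agent's uncertainty about which arm pays best.

```python
import random


def thompson_sampling(true_means, steps=2000, seed=0):
    """Bernoulli bandit solved with Thompson sampling.

    `true_means` are hidden payoff probabilities (hypothetical values);
    the agent only ever sees the 0/1 rewards it collects.
    """
    rng = random.Random(seed)
    n = len(true_means)
    successes = [1] * n   # Beta(1, 1) prior: total uncertainty about each arm
    failures = [1] * n
    pulls = [0] * n
    total_reward = 0
    for _ in range(steps):
        # Sample a plausible payoff for each arm from its current belief.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(n)]
        arm = max(range(n), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Updating the belief reduces uncertainty; choosing the best sample maximizes.
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
        total_reward += reward
    return pulls, total_reward


if __name__ == "__main__":
    pulls, total = thompson_sampling([0.2, 0.8])
    print(pulls, total)
```

Early on the samples are spread widely, so both arms get tried; as evidence accumulates the beliefs sharpen and the agent concentrates on the better arm.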

00:01:56.720 | What is the...

00:01:57.520 | If you believe in the universal prior, what is the universal reward function?

00:02:01.680 | That's the better way to put it.

00:02:02.800 | So that "win" is interesting.

00:02:06.080 | I think I speak for everyone in saying that I wonder what that reward function is for you.

00:02:13.760 | And I look forward to seeing that in five years and 10 years.

00:02:19.360 | I think a lot of people, including myself, are cheering you on, man.

00:02:22.240 | So I'm happy you exist and I wish you the best of luck.

00:02:26.320 | Thank you.

00:02:27.860 | Thank you.


George Hotz: Winning - A Reinforcement Learning Approach | AI Podcast Clips