
George Hotz: Winning - A Reinforcement Learning Approach | AI Podcast Clips



00:00:00.000 | [Music]
00:00:07.680 | You said that the meaning of life is to win.
00:00:10.320 | If you look five years into the future, what does winning look like?
00:00:14.960 | So...
00:00:16.960 | I can go into technical depth about what I mean by that, by "win".
00:00:27.840 | It may not mean... I was criticized for that in the comments, like,
00:00:30.800 | "Doesn't this guy want to save the penguins in Antarctica?"
00:00:34.800 | Oh man, listen to what I'm saying. I'm not talking about having a yacht or something.
00:00:39.680 | Yeah.
00:00:41.200 | I am an agent. I am put into this world.
00:00:45.440 | And I don't really know what my purpose is.
00:00:49.920 | But if you're a reinforcement learning agent, if you're an intelligent agent and you're put into a world,
00:00:53.840 | what is the ideal thing to do?
00:00:55.600 | Well, the ideal thing, mathematically (you can go back to Schmidhuber's theories about this),
00:00:59.600 | is to build a maximally compressive model of the world,
00:01:02.880 | and to explore the world such that your exploration function
00:01:08.160 | maximizes the derivative of compression of the past.
00:01:11.200 | Schmidhuber has a paper about this.
00:01:13.120 | And I took that as kind of a personal goal function.
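
(To make the Schmidhuber idea concrete: below is a minimal sketch of the compression-progress reward, assuming a toy Laplace-smoothed unigram model as the "compressor"; the alphabet, stream, and model choice are illustrative assumptions, not Schmidhuber's actual implementation. The intrinsic reward at each step is the number of bits the model update saves when re-encoding the history.)

```python
import math
from collections import Counter

class AdaptiveModel:
    """Toy 'compressor': Laplace-smoothed unigram model over symbols.

    Code length of a symbol under the model is -log2 p(symbol).
    (An illustrative assumption; any improvable predictive model works.)
    """
    def __init__(self, alphabet_size):
        self.counts = Counter()
        self.total = 0
        self.k = alphabet_size

    def code_length(self, history):
        # Total bits this model needs to encode the whole history.
        return sum(-math.log2((self.counts[s] + 1) / (self.total + self.k))
                   for s in history)

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

def compression_progress(model, history, symbol):
    """Intrinsic reward: bits saved on the history by learning from `symbol`."""
    new_history = history + [symbol]
    before = model.code_length(new_history)   # cost under the old model
    model.update(symbol)
    after = model.code_length(new_history)    # cost under the updated model
    return before - after                     # "derivative of compression of the past"

# A repetitive stream yields shrinking progress: it becomes boring,
# which is the core of the curiosity story.
model, history = AdaptiveModel(alphabet_size=2), []
for t, s in enumerate("aaaaab"):
    r = compression_progress(model, history, s)
    history.append(s)
    print(f"t={t} symbol={s} progress={r:+.3f} bits")
```
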
00:01:17.760 | So when I say "to win", maybe this is religious, but I think that in the future
00:01:23.680 | I might be given a real purpose, or I may decide on this purpose myself.
00:01:27.120 | And then at that point, now I know what the game is and I know how to win.
00:01:30.640 | I think right now, I'm still just trying to figure out what the game is.
00:01:33.120 | But once I know...
00:01:33.920 | So you have imperfect information, you have a lot of uncertainty about the reward function
00:01:40.960 | and you're discovering it.
00:01:42.080 | Exactly.
00:01:42.400 | But the purpose is...
00:01:43.360 | That's a better way to put it.
00:01:44.240 | The purpose is to maximize it while you have a lot of uncertainty around it.
00:01:50.320 | And you're both reducing the uncertainty and maximizing at the same time.
00:01:53.520 | Yeah.
00:01:53.840 | And so that's at the technical level.
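
(The exchange above is the classic exploration-exploitation problem, and "reducing the uncertainty and maximizing at the same time" has a standard minimal illustration: Thompson sampling on a two-armed Bernoulli bandit. The reward rates below are made-up numbers for the sketch.)

```python
import random

TRUE_RATES = [0.3, 0.6]   # hidden reward function (made up for this sketch)
alpha = [1, 1]            # Beta posterior per arm: successes + 1
beta = [1, 1]             # Beta posterior per arm: failures + 1

for t in range(1000):
    # Sample a plausible reward rate for each arm from its posterior,
    # then play the arm whose sample is highest. Uncertain arms get
    # tried (exploration); apparently good arms get played (exploitation).
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(2)]
    arm = samples.index(max(samples))
    reward = 1 if random.random() < TRUE_RATES[arm] else 0
    # The Bayesian update shrinks uncertainty about the chosen arm.
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", [a / (a + b) for a, b in zip(alpha, beta)])
```
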
00:01:56.720 | What is the...
00:01:57.520 | If you believe in the universal prior, what is the universal reward function?
00:02:01.680 | That's the better way to put it.
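
(For reference, the universal prior invoked here is Solomonoff's: the prior probability of an observation string is the total weight of all programs that would produce it, with shorter programs weighted exponentially more. In standard notation, with U a universal prefix Turing machine and ℓ(p) the length of program p:)

```latex
% Solomonoff's universal prior: the probability of a string x is the
% summed weight of every program p whose output begins with x.
M(x) \;=\; \sum_{p \,:\; U(p) = x*} 2^{-\ell(p)}
```

Hutter's AIXI pairs this prior with expected-reward maximization, which is the sense in which one can ask what a "universal reward function" would be.
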
00:02:02.800 | So that "win" is interesting.
00:02:06.080 | I think I speak for everyone in saying that I wonder what that reward function is for you.
00:02:13.760 | And I look forward to seeing that in five years and 10 years.
00:02:19.360 | I think a lot of people, including myself, are cheering you on, man.
00:02:22.240 | So I'm happy you exist and I wish you the best of luck.
00:02:26.320 | Thank you.
00:02:27.860 | Thank you.
00:02:45.220 | [ ROBOTIC WHIRRING ]