
Thinking Deeper in Gemini — Jack Rae, Google DeepMind



00:00:00.000 | Hi, everybody.
00:00:25.440 | Yeah, my name is Jack.
00:00:26.640 | I'm a researcher at Google, and I'm the tech lead of Thinking Within Gemini.
00:00:31.320 | I'm going to give a brief, deep dive into thinking from the research perspective within
00:00:38.980 | Gemini.
00:00:39.980 | I'm going to give this talk in three stages.
00:00:43.040 | One is to give a research motivation for why we are excited about thinking in
00:00:47.280 | terms of unblocking bottlenecks towards intelligence, and I'm going to give a few examples of how,
00:00:54.600 | in our current models, our most advanced systems,
00:01:03.660 | if you can just identify the most pressing bottlenecks, the crucial issues and shortcomings, you will often then
00:01:09.480 | find a solution, and of how that is linked to thinking.
00:01:12.280 | I'm then going to talk a little bit more just pragmatically about what is thinking in Gemini,
00:01:19.340 | and why is it interesting to developers.
00:01:21.340 | And I think -- the slides are still not here.
00:01:28.680 | We did do a rehearsal this morning where the slides were there.
00:01:31.400 | But, yeah.
00:01:32.400 | Keynote speaker slide.
00:01:33.400 | Yeah.
00:01:34.400 | Someone's -- I can see someone.
00:01:36.400 | Yeah.
00:01:37.400 | Keynote speaker folder.
00:01:40.400 | I think it's in the keynote speaker.
00:01:52.460 | Anyway.
00:01:53.460 | It's going to come up soon.
00:01:55.460 | You are close, person.
00:01:56.460 | Yeah.
00:01:57.460 | But -- I'm also going to talk a little bit about what's next.
00:02:02.460 | So, Logan did a great job of kind of giving an incredible overview of Gemini as a whole
00:02:24.460 | ecosystem, everything that's going on.
00:02:26.460 | I'm going to really be focusing on kind of what we're excited about in the reasoning space.
00:02:32.160 | So, with intelligence bottlenecks, the message of this section is really
00:02:37.420 | about progress.
00:02:38.800 | So, progress has really been marked by identifying key bottlenecks towards intelligence and then
00:02:44.420 | solving them.
00:02:45.560 | And I'm going to kind of give some examples throughout history.
00:02:47.660 | I'm going to actually rewind the clock to 1948.
00:02:50.700 | Claude Shannon invents the language model in "A Mathematical Theory of Communication."
00:02:54.480 | He builds a language model, a 2-gram, using a textbook of hand-calculated word statistics.
00:03:00.940 | And he samples from it, and he kind of marvels at the samples.
00:03:03.860 | He feels like these are getting pretty good.
00:03:05.600 | This 2-gram word model is a lot better than a unigram character model.
00:03:09.260 | But he remarks, "I think this would be better if we could really make
00:03:13.200 | a better language model and scale up this current method."
00:03:16.080 | So he really wanted to just scale up the n-gram.
00:03:18.080 | That was the bottleneck: a small amount of data and very elementary statistics.
00:03:23.880 | And unfortunately for Claude Shannon, the solution was pretty hard.
00:03:27.100 | He needed the digitization of human knowledge.
00:03:29.420 | And he needed modern computing to be able to aggregate these statistics at scale.
00:03:32.800 | So, you know, that wasn't so easy for him to solve.
00:03:35.120 | He had it a bit more tricky.
00:03:37.080 | But fast forward a few decades.
00:03:38.300 | At Google, in the 2000s, my colleagues, such as Jeff Dean, are training n-gram language models
00:03:45.840 | over trillions of tokens.
00:03:47.460 | These are powering, at the time, the most sophisticated speech recognition and translation systems.
00:03:53.060 | And a lot of progress has been made.
00:03:54.680 | But the bottleneck with these systems was that these n-gram language models
00:03:58.600 | were restricted to very short contexts.
00:04:00.880 | And they were, because there's an exponential storage cost with context length.
00:04:06.640 | And there wasn't really a way around that while just sticking with n-grams.
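To make the count-based setup and the context-length bottleneck concrete, here is a minimal illustrative sketch (a toy, not the production systems described in the talk; the corpus and vocabulary size are made up for illustration):

```python
from collections import Counter, defaultdict

# Toy count-based bigram (2-gram) model in the spirit of Shannon's word statistics.
corpus = "the cat sat on the mat the cat ran".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def p_next(prev: str, nxt: str) -> float:
    """P(nxt | prev) estimated from raw counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(p_next("the", "cat"))  # 2/3: "the" is followed by "cat" twice and "mat" once

# The bottleneck: a table over n-word contexts can have up to V**n entries for a
# vocabulary of size V, so storage grows exponentially with context length.
V = 100_000
for n in (2, 3, 5):
    print(n, f"{float(V) ** n:.0e} possible n-grams")
```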
00:04:09.860 | The solution was the early introduction of deep learning in 2010, with the introduction
00:04:16.440 | of recurrent neural language models, so recurrent neural networks applied to modeling text, where
00:04:21.900 | the recurrent neural networks could avoid this problem by storing a compressed representation
00:04:26.740 | of the past in the state of the network.
00:04:28.920 | And they could now start to model, beyond a 5-gram, sentences or even paragraphs.
00:04:33.820 | And this was a massive step change and improvement.
00:04:37.220 | However, a couple of years later, people would notice, even there, there was a bottleneck.
00:04:41.220 | So the recurrent neural network's representation of the past is in a fixed-size state.
00:04:46.260 | And this fixed-size state, there's only so much information you could put into it.
00:04:51.380 | And so, as a result, it was often observed to be a lossy representation
00:04:56.100 | of its context.
00:04:57.320 | The solution that was derived, I think, once people kind of really encountered this information
00:05:02.420 | bottleneck in the past, was actually just keep everything around, in terms of your past neural
00:05:07.980 | embeddings, and use an attention operator to aggregate things on the fly.
00:05:12.140 | So this was the birth of attention, and then shortly after, transformers.
00:05:15.840 | So transformers then, kind of, led to the modern deep learning revolution as we know it.
00:05:21.140 | And much other progress was made, but if we skip forward 10 years, we are then in 2024, and
00:05:26.620 | we have large language models.
00:05:29.360 | They're increasingly powerful general conversational agents.
00:05:31.720 | We have models such as Gemini, ChatGPT, people are using them for all sorts of use cases.
00:05:37.280 | And there, that's where we kind of come to the bottleneck that's relevant to this talk, which
00:05:40.940 | is that although these models are very, very powerful, they're still trained to respond immediately
00:05:45.920 | to requests.
00:05:46.920 | So, in other words, in terms of a compute bottleneck, there is a constant amount of
00:05:50.600 | compute that they apply at test time to transition from your request or your question to the response
00:05:56.660 | or your answer.
00:05:57.660 | So, the bottleneck of test time compute, this is relevant to thinking.
00:06:02.660 | So, we can unpack this a little bit more.
00:06:04.400 | So, when we talk about a fixed amount of test time compute, the test time compute is interesting
00:06:08.900 | to you because that's the compute that the model is spending on your particular problem, your
00:06:12.820 | particular question.
00:06:14.100 | And the way it actually kind of mechanically works is you have some text in your request.
00:06:20.200 | It gets translated to tokens, and then it's going to go through a language model.
00:06:24.240 | And at the transition from the request to its response, it's going to pass some computation
00:06:28.800 | up through a large language model, which will have some parallel computation for every layer,
00:06:33.060 | and it'll have some iterative computation across layers.
00:06:36.280 | So, that computation is really where the model can apply its intelligence to your particular
00:06:40.220 | problem, and it's a fixed size.
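As a rough illustration of what "fixed" means here (a common back-of-envelope approximation, not an official Gemini figure; the parameter count is hypothetical):

```python
# A dense decoder applies roughly 2 * N FLOPs per generated token, where N is the
# number of parameters -- the same amount whether the question is trivial or very hard.
params = 100e9                      # hypothetical 100B-parameter model
flops_per_token = 2 * params        # ~2e11 FLOPs, fixed per token

answer_tokens = 50                  # a short, immediate response
print(f"{flops_per_token * answer_tokens:.1e} FLOPs")  # ~1.0e+13 FLOPs, regardless of difficulty
```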
00:06:43.000 | One solution if you wanted a smarter model and more computation is just to make the model
00:06:46.540 | larger.
00:06:47.540 | And then you can have more compute, and you can get a smarter response.
00:06:50.840 | However, it's still not really enough.
00:06:53.280 | Users might want to be able to think a thousand or a million times and have a very large dynamic
00:06:57.400 | range and a lot of compute for very hard or challenging or valuable tasks.
00:07:02.140 | And also, users might want to have a very dynamic application of test time compute.
00:07:06.560 | So, less compute for simpler requests, more compute for harder requests, and have this process
00:07:10.080 | be very dynamic and instigated by the model.
00:07:13.620 | And that is what motivates thinking.
00:07:15.620 | So, thinking in Gemini: mechanically, I'm sure almost everyone in this room is familiar with
00:07:22.760 | this general process where we will now have a model, and we insert a thinking stage in which
00:07:28.620 | the model can emit some additional text before it decides to emit a final answer.
00:07:34.160 | So, going back to this notion of test time compute now, we've added an additional kind of loop of
00:07:40.160 | computation where the model can kind of iteratively loop and perform additional test time compute
00:07:46.160 | during this thinking stage, and this loop can potentially be thousands or tens of thousands of iterations, which gives you tens of thousands of times more compute before the model decides to commit to what its response will be.
00:07:57.700 | And also, because it's a loop, it's dynamic.
00:07:59.700 | So, the model can learn how many iterations of this loop to apply before it decides to actually commit to its answer.
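Schematically, the loop looks something like the sketch below (purely illustrative, not Gemini's implementation; `next_token` and the `</think>` marker are dummy stand-ins so the toy runs end to end):

```python
THINK_END = "</think>"  # assumed end-of-thinking marker; purely illustrative

def next_token(context: list[str]) -> str:
    # Dummy stand-in for one model forward pass. A real model decides dynamically
    # when it has thought enough; this toy just stops after five thought tokens.
    n_thoughts = sum(t.startswith("thought") for t in context)
    return THINK_END if n_thoughts >= 5 else f"thought-{n_thoughts}"

def generate_with_thinking(prompt: str, max_thinking_tokens: int = 10_000) -> list[str]:
    context = [prompt]
    # Thinking stage: each iteration is one more forward pass of test-time compute
    # before the model commits to a response.
    for _ in range(max_thinking_tokens):
        token = next_token(context)
        context.append(token)
        if token == THINK_END:  # the model itself decides when to stop thinking
            break
    # Response stage: the answer is conditioned on the prompt plus all the thoughts.
    context.append("<final answer>")
    return context

print(generate_with_thinking("What is 17 * 24?"))
```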
00:08:05.700 | So, we train this model to think, to use this kind of thinking stage via reinforcement learning.
00:08:13.240 | So, after we pre-train Gemini, we then have a reinforcement learning stage where we train it on many different tasks, and we give it positive or negative rewards depending on whether or not it solves the task correctly.
00:08:27.240 | And this is essentially a very general training recipe really, and it's kind of remarkable it works, but the model is able to just get a very vague signal of what is correct and what is not correct,
00:08:38.780 | and to back-propagate this through this loop of thinking stage such that it can try and shape how it uses its thinking computation and thinking tokens in order to be more useful.
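Schematically, this "very vague signal" can be written as a generic policy-gradient objective with a binary correctness reward (a sketch of the kind of setup described here, not Gemini's actual recipe):

```latex
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{(x,\,t,\,y)\,\sim\,\pi_\theta}\!\Big[\, r(x, y)\, \nabla_\theta \log \pi_\theta(t, y \mid x) \Big],
\qquad
r(x, y) \;=\; \begin{cases} +1 & \text{if the answer } y \text{ is correct,}\\ -1 & \text{otherwise,} \end{cases}
```

where x is the request, t the thinking tokens, and y the final answer, so the reward on the answer shapes how the thinking tokens are used.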
00:08:48.780 | In fact, we weren't really sure this would work.
00:08:52.740 | It wasn't clear how much structure we should put into something like a reasoning stage.
00:08:57.280 | And although I think probably many people here have now seen reasoning traces and played with these models, I'll just show you a historical artifact.
00:09:04.280 | From one of the times we were trying to use reinforcement learning, we started to see cool emergent behavior.
00:09:10.500 | So, in this problem, there's kind of like an integer prediction problem.
00:09:13.780 | This was just like a kind of a particular example, in this case, kind of like a MATSI example.
00:09:21.980 | And what we saw was the model was using its thinking tokens to actually first pose a hypothesis, and then test out the hypothesis.
00:09:29.520 | And then it found that basically things weren't really working, and it kind of states that this formula doesn't hold.
00:09:35.020 | It rejects its own idea.
00:09:36.520 | And then it tries an alternative approach.
00:09:38.520 | And I think it's easy to become desensitized to technology because it's so amazing every single day.
00:09:43.520 | But we were truly blown away when we saw the general recipe of reinforcement learning was creating all sorts of interesting emergent behavior, trying different ideas, self-correction.
00:09:52.060 | And I think these days, we see a lot of different strategies that the model learns to do.
00:09:56.600 | So, it learns to break down the problem into various components, explore multiple solutions, draft fragments of code and build these up in a modular way, perform intermediate calculations and use tools.
00:10:09.600 | All under the umbrella of using more test-time compute to give you a smarter response.
00:10:15.140 | Okay.
00:10:16.140 | So, I've talked a bit about why we are interested in thinking in terms of the path to AGI and unblocking bottlenecks of intelligence, and just a little bit about mechanically what it is.
00:10:26.140 | Why is it interesting to developers?
00:10:27.140 | Obviously, the number one reason is we think this is driving more capable models, and it also stacks on top of our current paradigms of how we accelerate model progress.
00:10:37.640 | So, with thinking, we can accelerate this process by scaling the amount of test-time compute, and we find that this can stack as a paradigm on top of pre-existing paradigms, such as pre-training, where you can scale the amount of pre-training data and model size,
00:10:54.680 | and also post-training where you can scale the quality and diversity of human feedback for many different types of tasks, and as a result, within Google, by investing in all of these and really accelerating all of them, we get kind of a multiplicative effect.
00:11:09.980 | And why is this interesting to developers?
00:11:11.880 | I think it results in just overall faster model improvement, which is very nice.
00:11:17.880 | We also see, if we kind of look back over a lineage of recent Gemini launches, you know, there's improved reasoning performance, and we can actually map this to how much test-time compute these models will devote to problems.
00:11:32.880 | So, there's kind of like a log scale, test-time compute on the x-axis, and performance across like math, code, and some science topics.
00:11:39.880 | And we see that there's this trend of increasing reasoning performance, and it tracks very well with increasing test-time compute.
00:11:46.680 | And on the far left, you know, you have 2.0 Flash Experimental.
00:11:50.680 | This was a model that was not launched with Thinking back in December last year, so ancient history.
00:11:57.680 | And now we have, on the right-hand side, the first launched version of 2.5 Pro.
00:12:05.680 | So, test-time scaling is working empirically, but it's not just capability that matters.
00:12:12.680 | It's also interesting from the notion of being able to steer the model's quality over cost.
00:12:18.680 | So, you know, before, you had the option of choosing a discrete number of possible model sizes,
00:12:24.680 | and that was a way to gauge how much quality you wanted and also how much
00:12:31.680 | cost you wanted to incur for any given task.
00:12:34.680 | But it was kind of a discrete choice.
00:12:36.680 | Now, with Thinking, we can have a continuous budget, which allows you to have a much more granular slider
00:12:43.680 | of how much capability you want for any given kind of class of tasks.
00:12:48.680 | And we have Thinking Budgets now launched in Flash and Pro in the 2.5 series.
00:12:55.680 | And this allows you to have very granular choice of cost to performance,
00:12:59.680 | and also allows us to then push the frontier and allow you to kind of augment
00:13:05.680 | and drive cost higher and performance higher if your application requires it.
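For developers, the budget is exposed as a per-request setting. Roughly, with the google-genai Python SDK it looks like the sketch below (field names reflect the SDK as I understand it and may differ across versions; the prompt and budget value are just examples):

```python
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the trade-offs between quicksort and mergesort.",
    config=types.GenerateContentConfig(
        # Continuous dial on test-time compute: larger budgets allow more thinking
        # tokens (higher quality on hard tasks, at more cost and latency).
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)
```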
00:13:10.680 | So, okay, I think a lot of this stuff is really covering ground up to the present day.
00:13:18.680 | So, what's next and what are we excited about?
00:13:21.680 | So, we're very excited about just generally improving the models and having better reasoning, of course.
00:13:26.680 | We're also excited about making the thinking process as efficient as possible.
00:13:30.680 | Really, we want thinking to just work for you and be quite adaptive and be something that you don't have to actively spend a lot of energy tuning.
00:13:38.680 | And a big part of that is ensuring our models are very efficient in how they use their thoughts.
00:13:43.680 | This is definitely an area of progress.
00:13:45.680 | I think we can find examples of our models overthinking on tasks.
00:13:48.680 | And this is just an area of research to get these things faster and faster and as cost effective as possible.
00:13:53.680 | We're very proud of how cost effective our Gemini models are, and this is just an area for improvement as well.
00:14:00.680 | And there's also deeper thinking, which is really about scaling the amount of inference compute further to drive even higher capability.
00:14:07.680 | So, people may be familiar with Gemini deep research, where you can kind of type in a query and then the model will go away for a long period of time and research a topic.
00:14:16.680 | We also have now announced at I/O, and we're launching to trusted testers, a notion of Deep Think.
00:14:22.680 | Deep Think is a very high thinking budget mode, built on top of 2.5 Pro.
00:14:28.680 | And its desired application is for things where you have a very hard problem, and you're happy to essentially fire off the query,
00:14:36.680 | and then have some asynchronous process that's running for a while, and you'll come back to arrive at a stronger solution.
00:14:42.680 | And its key idea is we leverage much deeper chains of thought and parallel chains of thought that can integrate with each other to produce better responses.
00:14:51.680 | We find this enhances model performance on very tough multimodal, code, and math problems.
00:14:57.680 | An example would be the USA Math Olympiad.
00:15:03.680 | This is a task on which the state-of-the-art model in January had completely negligible performance.
00:15:05.680 | 2.5 Pro -- and it's now probably even better with the one updated today --
00:15:10.680 | was at about the 50th percentile of all participants in the Math Olympiad.
00:15:15.680 | And with Deep Think, it goes up to the 65th percentile.
00:15:19.680 | And the interesting thing about deep think is as we continue to both improve the base model and improve the algorithmic ingredients that go into deep think, those two will stack together as well.
00:15:27.680 | Here is kind of like a video animation of one of these USA Math Olympiad algebra problems.
00:15:37.680 | And the key idea really with this video is just this notion of having multiple iterative ideas.
00:15:43.680 | So maybe the model starts out with some proof-by-contradiction idea, but then it explores two different aspects, some Rolle's theorem, Newton's inequalities, it integrates them, and eventually arrives at some correct proof.
00:15:53.680 | There is not that much you can take away from this video, but it looks pretty cool, so I added it.
00:15:58.680 | One thing that, you know, beyond the math we talked about a little in the previous slides, I'm very excited about is any application where the model can spend longer and longer thinking on very open-ended coding tasks and, in one shot or with very few interactions, vibe-code
00:16:15.680 | things that would have taken us months in the past.
00:16:18.680 | And one example that I like, from a researcher's perspective, is that some of my colleagues vibe-coded from DeepMind's original DQN paper,
00:16:27.680 | which was a revolution in deep reinforcement learning: Gemini vibe-coded the training setup, the algorithm, even an Atari emulator, such that it could play some of the games.
00:16:39.680 | And, you know, this is remarkable to me because these kinds of things would have taken me and my colleagues months in the past, and they are starting to happen in minutes.
00:16:50.680 | One thing I'm quite excited about looking forward to the future is not really the landscape of models,
00:16:56.680 | but coming back to, like, what's our gold standard, which is the human mind.
00:17:00.680 | I would love for our models to be able to contemplate from a very small set of knowledge and think about it incredibly deeply such that we can push the frontier.
00:17:09.680 | And one example I often think about is Srinivasa Ramanujan, who's one of the world's greatest mathematicians from the early 20th century.
00:17:15.680 | And famously, he just had this one math textbook.
00:17:19.680 | He was kind of cut away from the mathematical community, but just from a small set of problems, he spent many textbooks' worth of thinking, going through problems, inventing his own theories to further extend ideas, and he invented an incredible quantity of mathematics, really just by deeply thinking from a small source subset.
00:17:40.680 | And this is where I think we are going with thinking.
00:17:43.680 | We want a model to be able to be incredibly data efficient and actually go to millions or beyond of inference tokens where the model is really building up knowledge and artifacts such that we can eventually start to push the frontier of human understanding.
00:17:58.680 | So with that said, thank you very much.