LLM Scientific Reasoning: How to Make AI Capable of Nobel Prize Discoveries: Hubert Misztela

Chapters
0:00 Introduction
1:40 Background
3:18 Reasoning before the retrieval
8:12 Finding the perfect ingredient
10:00 What is reasoning
12:15 How to solve the problem
13:35 Design choices
13:55 Testing
00:00:00.000 |
Hello everyone, there's a lot of interesting talks so thank you for being here. My name is Hubert, 00:00:19.040 |
I work in pharma, for Novartis. On a daily basis I work on designing small molecules with 00:00:26.560 |
generative AI. So designing small graphs which would, hopefully, cure our diseases. 00:00:33.920 |
But today I would like to talk about scientific reasoning, scientific discoveries, 00:00:40.960 |
specifically how to use LLMs for that. And this is a joint work with my colleague Derek Lowe, 00:00:46.320 |
who is a medicinal chemist. You might know him if you are interested in drug design. 00:00:50.720 |
So today I'm going to talk about a few things, right? Probably I'm going to leave you with more 00:00:56.080 |
questions than responses, more problems than solutions, but I think it's worthwhile. 00:01:00.960 |
So first I'm going to show you an interesting paradox in biology which led to a Nobel Prize 00:01:07.440 |
discovery. Then I'm going to be talking about reasoning with RAGs. And you've seen many different 00:01:13.760 |
themes of RAG and agentic RAG during this conference, so I think it fits really nicely 00:01:19.680 |
with all of the other topics. Then a scientific discovery framework, which is how we can really build something, 00:01:28.240 |
how we can build the system, or the RAG specifically, which would help us in those scientific discoveries. 00:01:33.600 |
And in the end I'll show you a few examples of the experiments. 00:01:38.720 |
So back in the 1990s, early 90s, scientists were trying to improve the color of flowers, 00:01:50.480 |
petunia flowers. You know, this is what you do when you're a scientist. So they were trying to 00:01:54.240 |
overexpress a specific gene. So they wanted to get the color stronger. But what they got, 00:02:01.920 |
they got the color flipped, which was really surprising. Nobody really knew what was happening 00:02:06.320 |
and why this was happening, right? And this kind of result is what biologists were getting, 00:02:13.120 |
from different subdomains of biology. I'm not going to get into the details. I'm not a biologist, 00:02:16.720 |
I'm a computer scientist, so I'm going to spare you all of that. But the interesting thing here is that 00:02:22.240 |
there were three different sets of results, sets of experiments, from different subdomains, even with 00:02:29.680 |
different names. Everybody knew that something was messing with genes, but nobody knew why, 00:02:35.040 |
right? So these three phenomena were only resolved after eight years, more or less. 00:02:41.600 |
And this is what led to a Nobel Prize, the one in Physiology 00:02:49.120 |
or Medicine. And so the question is, can we really use LLMs to speed this up, right? It took eight years. 00:02:59.600 |
There might be something in the literature which could maybe give a hint of how we can go faster, right? And 00:03:05.680 |
the question is, what are the underlying causes, or what might be the underlying causes of these three 00:03:10.720 |
phenomena, right? And now you're all thinking, okay, let's run the RAG, let's ask the question. Probably 00:03:15.600 |
it's going to get it quite quickly, right? Let's pause for a second and try to think how we can set this up 00:03:23.760 |
and try to experiment. But before that, okay, we all know what RAG is, what naive RAG is. We have a question, 00:03:31.520 |
we process the question, we have processed the data beforehand, we represent both of them in a 00:03:37.600 |
given latent space, we do some kind of retrieval over the embeddings, and you get the response. And 00:03:43.680 |
there is a lot of different tricks. I wanted to show you just a bunch of them, because these tricks, 00:03:50.960 |
like HyDE, for example, or reranking, or document indexing, are things relating to different pieces 00:03:58.400 |
of the RAG pipeline. And they are all for different purposes, slightly different purposes, right? 00:04:03.440 |
They might try to fix the problem of matching between the question and the answer, 00:04:07.840 |
they might try to fix the problem of the right retrieval of the top documents or top chunks, 00:04:14.320 |
and so on. But the problem is, and there is a really interesting paper on this, if LLMs, and we know that, 00:04:23.120 |
cannot answer slightly trickier questions, right? A little bit more convoluted questions, 00:04:31.600 |
how can we expect that LLMs would understand the question, and then on top of that, 00:04:38.320 |
give us the right response with RAG, right? So we are trying to run before we crawl. 00:04:44.240 |
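(To make the baseline concrete, here is a minimal sketch of that naive RAG loop. The `embed` and `generate` functions are placeholders for whatever embedding model and LLM you use; the names are illustrative, not from the talk.)

```python
# Minimal naive-RAG sketch: embed, retrieve by cosine similarity, generate.
# `embed` and `generate` are placeholders for your embedding model and LLM.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: return one embedding vector per input text."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call your LLM with the assembled prompt."""
    raise NotImplementedError

def naive_rag(question: str, chunks: list[str], top_k: int = 5) -> str:
    chunk_vecs = embed(chunks)                       # index the corpus
    q_vec = embed([question])[0]                     # embed the question in the same space
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    top = np.argsort(sims)[::-1][:top_k]             # retrieve the most similar chunks
    context = "\n\n".join(chunks[i] for i in top)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```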
So, I was thinking about it, and when you look at the literature, the papers, and the solutions during 00:04:55.600 |
this conference, you're going to realize there are themes kind of repeatedly referring to something which 00:05:02.320 |
is happening before the retrieval. And one example of reasoning before the retrieval 00:05:09.040 |
is routing. Routing, which you have in iterating flows, is really the idea of doing something with 00:05:17.680 |
the question, because we cannot throw it at the embeddings and just get the response, right? So, 00:05:22.960 |
I'm naming it reasoning for a purpose, and you're going to understand it in a few slides, right? So, 00:05:32.400 |
we need to understand a little bit more what the question 00:05:35.920 |
is about, right? And on the other hand, it's not only the question, because we also try to 00:05:42.880 |
do something with the data, right? So, GraphRAG or GraphReader, very interesting papers, 00:05:50.960 |
if you look at them, you're going to see that they are trying to represent the knowledge 00:05:56.080 |
from these documents in a little bit different way, specifically with graphs. You might use different 00:06:02.240 |
representations in many different ways, but the point is we are doing something before the retrieval, 00:06:07.440 |
which really is reasoning, to deal with the more semantically complex question. The reasoning after 00:06:15.680 |
the retrieval is usually delegated to the LLM, right? Because this is all about LLMs, but that 00:06:22.640 |
doesn't have to be the case. So, the point here really is that the more convoluted the semantics of 00:06:34.320 |
the question, the harder it's going to be for your standard RAG without those reasoning steps. So, 00:06:42.000 |
you probably want to try to add these reasoning steps before it, and I'm going to show you some 00:06:46.400 |
experiments with that. And a small aside, by the way: as I was saying, I work on a little bit different 00:06:53.920 |
flavor of AI, right? Which is generative, but for graphs. So, kind of old style, encoder-decoder, 00:07:02.080 |
VAEs; you have a latent space, and you navigate that. So, when I saw this thing which I drew, 00:07:09.600 |
which is reasoning, embedding, retrieval, and then again reasoning, I realized that this is very similar 00:07:18.720 |
to what we do in a very well-known pipeline in chemistry, where you take a data set, which in 00:07:25.760 |
that case are molecules, you represent them in the latent space, which are embeddings effectively, 00:07:30.720 |
and then you navigate the embeddings to optimize for a given purpose. So, you have the 00:07:36.320 |
latent space, and you want to extract the points which are the most interesting for your purpose, 00:07:41.280 |
right? So, in our case, we are trying to extract the embeddings which are the most similar to the question, 00:07:46.320 |
right? So, we do simple similarity. But in other domains, like chemistry, we run optimization over 00:07:53.280 |
that latent space, and this is type of reasoning, really, right? So, I was like, aha, okay. So, 00:07:58.720 |
a kind of common theme is appearing here. Now, the question is, do you really 00:08:07.440 |
need it? How do I know if my question is complex or not? So, we all know needle in a haystack, right? 00:08:16.240 |
Test or experiment or benchmark. So, I'm not going to be focusing on that too much. So, let me go quite 00:08:25.280 |
quickly, you know, like find the perfect ingredient, the ingredient to build the perfect pizza, right? 00:08:31.280 |
And a little bit extended version of that is multi-needle in a haystack, discussed yesterday on 00:08:40.080 |
one of the tracks as well, right? So, when you think about it, this is not only about the response which 00:08:48.400 |
you're going to get, but this is also about the question and really relationship between the question 00:08:52.880 |
and the response. And as I was saying, the more semantically convoluted the question is, 00:08:59.600 |
the harder it's going to be. So, how can we generalize it? And I was thinking about databases, 00:09:03.440 |
right? I'm an engineer. So, one needle, one response, or one concept, one piece of information is like 00:09:12.240 |
one-to-one relationship, like in databases. Then you have one-to-n, and then it occurred to me that, 00:09:18.880 |
okay, if I have a few concepts put together in one question, this is a multi-needle problem versus 00:09:29.440 |
finding one needle. And then you can extend it still to a few concepts hidden in a few chunks of 00:09:35.040 |
the document. So, really, you need to do a little bit more. You already see that, okay, you need to parse 00:09:40.320 |
somehow the complexity of the question so that you can get the right embeddings extracted. 00:09:46.080 |
And this is what kind of triggers your reasoning. If you have a few concepts in the question, 00:09:55.120 |
you probably need the reasoning before the retrieval so that you know what to retrieve and how to do it. 00:10:01.040 |
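(As a rough illustration of "reasoning before the retrieval", here is a sketch that first asks the LLM to decompose a multi-concept question into sub-queries and then retrieves for each one separately. `generate` and `retrieve_top_k` are placeholders, as in the earlier sketch, not the speaker's code.)

```python
# Sketch: decompose a multi-concept question into sub-queries before retrieving.
# `generate` (LLM call) and `retrieve_top_k` (similarity search) are placeholders,
# defined along the lines of the naive-RAG sketch above.
import json

def decompose(question: str) -> list[str]:
    prompt = (
        "Break the question below into the distinct concepts it contains. "
        "Return a JSON list of short, self-contained search queries, one per concept.\n\n"
        f"Question: {question}"
    )
    return json.loads(generate(prompt))

def reason_then_retrieve(question: str, chunks: list[str]) -> str:
    context: list[str] = []
    for sub_query in decompose(question):            # the reasoning step before retrieval
        context.extend(retrieve_top_k(sub_query, chunks, k=3))
    joined = "\n\n".join(dict.fromkeys(context))     # de-duplicate while keeping order
    return generate(f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:")
```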
So, I was mentioning reasoning, right? So, what really is reasoning? And there are many views on 00:10:11.600 |
that because, for example, we use chain of thought as a type of reasoning, right? But I'm thinking about 00:10:16.240 |
reasoning as processing information in some logical way, right? So, we all know how we can do the aggregation, 00:10:23.680 |
how we can do simple arithmetic over the data. We also know what's logical reasoning, right? 00:10:30.080 |
So, there's also causal reasoning. If you extract specific entities from the data and causal relationships 00:10:37.280 |
between them, you might ask the LLM, okay, what more can you hypothesize about this data, or what maybe you 00:10:44.720 |
can deduce, right? And there are papers about it. And this kind of starts working. Then you have algorithmic 00:10:50.560 |
reasoning, which I mentioned before. You have also probabilistic reasoning. So, we were trying 00:10:57.200 |
to make LLMs reason in a probabilistic fashion, right? Like Bayesian inference. There's also 00:11:02.720 |
structured way of reasoning. This is a little bit different from causal because you might have 00:11:06.400 |
a structure and you can expect some kind of compositionality over that structure, right? 00:11:13.760 |
And there's also, of course, the ARC competition on Kaggle, recently released with a $1 million prize for getting closer 00:11:23.280 |
to AGI, right? This is reasoning over geometry. That's why it's so challenging, because this is not a 00:11:29.680 |
typical type of reasoning, right? So, why am I saying this? Because usually what we do, we expect LLMs to 00:11:41.760 |
perform all of these types of reasoning well. And do we really need LLMs for that? 00:11:48.400 |
Probably, yes, or maybe not. But you can also delegate these reasoning types to specific tools, 00:11:55.440 |
which are specialists in that. So, you can do causal inference with libraries for causal inference. 00:12:01.360 |
You can do algorithmic reasoning with Python code, right? That's why we have a 00:12:07.520 |
REPL as a tool, right? Attached to your agent so that you can generate the code. And 00:12:13.040 |
this is kind of algorithmic reasoning, right? 00:12:14.720 |
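(A toy sketch of what such a Python REPL tool could look like when attached to an agent; a real one would need sandboxing, timeouts, and resource limits.)

```python
# Toy "Python REPL" tool: run LLM-generated code and return what it printed.
# A production version needs sandboxing, timeouts, and resource limits.
import contextlib
import io

def python_repl_tool(code: str) -> str:
    buffer = io.StringIO()
    namespace: dict = {}
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)            # delegate the algorithmic reasoning to Python
    except Exception as exc:                 # surface errors back to the agent
        return f"Error: {exc!r}"
    return buffer.getvalue()

# Example: the agent asks for an aggregation instead of doing the arithmetic itself.
print(python_repl_tool("counts = [3, 7, 12]\nprint(sum(counts) / len(counts))"))
```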
So, let's try to come back to the problem and think how we can solve it, really, right? 00:12:21.920 |
So, we want to find the cause of these three phenomena. And we define the type of reasoning, 00:12:26.480 |
the type of retrieval, which is the relation between the question and the answer. 00:12:33.280 |
And the particularity in our case is also that we don't want to build a RAG to 00:12:39.200 |
answer many different questions over a specific data set; we want to answer really one question, 00:12:45.600 |
right? And this is the question. We want to answer this one question and we want to 00:12:50.000 |
process our data set as many times as we possibly can so that we extract all of the relevant themes in 00:12:55.360 |
it. But the trick here is that LLMs during training have seen Wikipedia. So, they know about 00:13:05.360 |
this problem and RNA interference, because it happened in 1998. So, we need groundedness, right? And this 00:13:11.120 |
is another aspect of it. So, when you think about different methods and where you are in your specific 00:13:17.360 |
use case on that computational efficiency versus groundedness trade-off, you immediately see, 00:13:23.680 |
okay, which approaches might be more suitable for my case. And as I'm going to show you later on, 00:13:28.560 |
the relevance classifier is one of the main things which we use, because we can process all of the data set. 00:13:37.440 |
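(A relevance classifier in this sense can be as simple as a yes/no prompt run over every chunk in the corpus. Below is a rough sketch; `generate` is the placeholder LLM call from the earlier sketches, and the problem statement and prompt wording are illustrative, not the ones used in the talk.)

```python
# Sketch of a relevance classifier: pass every chunk through the LLM with a yes/no prompt.
# `generate` is a placeholder LLM call; the problem statement below is illustrative.
PROBLEM = (
    "Unexplained gene-silencing phenomena seen when overexpressing or introducing genes "
    "(e.g. the petunia co-suppression results)."
)

def is_relevant(chunk: str) -> bool:
    prompt = (
        f"Research problem: {PROBLEM}\n\n"
        f"Text:\n{chunk}\n\n"
        "Is this text relevant to the research problem? Answer 'yes' or 'no' only."
    )
    return generate(prompt).strip().lower().startswith("yes")

def filter_corpus(chunks: list[str]) -> list[str]:
    # Unlike similarity search, this touches every chunk, so nothing is silently skipped.
    return [c for c in chunks if is_relevant(c)]
```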
So, the overall design choices, which you can see here, usually depend upon the specific question 00:13:46.240 |
you have, the reasoning you need, the relationship between question and answer, and other aspects 00:13:52.000 |
like groundedness versus efficiency. So, when we have defined what type of question that is, how do we 00:14:02.640 |
really want to test whether our solution is capable of doing this kind of discovery, right? So, you state 00:14:08.720 |
the question, you define the type of question, but then you need to do the knowledge cutoff, which is 00:14:13.520 |
basically you need to remove from the LLM the knowledge about the discovery. So, what we do, we use RAG 00:14:19.440 |
and in that RAG we have only the scientific papers from before the discovery, right? So that it doesn't cheat. 00:14:27.280 |
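(One simple way to enforce that knowledge cutoff is to filter the corpus by publication date before indexing anything. A sketch, where the `Paper` record and its `published` field are my own illustrative naming.)

```python
# Sketch: enforce the knowledge cutoff by indexing only papers published before the discovery.
# The `Paper` record and field names are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    title: str
    text: str
    published: date

# The RNA interference discovery was published in 1998; pick the exact cutoff accordingly.
CUTOFF = date(1998, 1, 1)

def pre_discovery_corpus(papers: list[Paper]) -> list[Paper]:
    """Keep only papers the field had access to before the discovery."""
    return [p for p in papers if p.published < CUTOFF]
```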
So, we present it with the state of the situation as it was before the discovery, so that we can simulate 00:14:33.440 |
that situation, right? And you can think about it as training an agent on that specific, 00:14:41.600 |
on that specific scientific problem, on that specific data set, and the overarching goal is to 00:14:46.880 |
have a system which would be able to make those discoveries on many different scientific problems. 00:14:56.000 |
Now, the question is, how do we define the success of this experiment? So, the first level of success 00:15:06.640 |
would be, okay, find some hypotheses, or what we really already know from this data set, right? So, it's not 00:15:15.040 |
that we are asking specific questions. We want the RAG to extract, okay, what do you know about 00:15:20.640 |
anything related really to that problem? And don't read the text, forgive me the amount of 00:15:27.840 |
text on these slides, but look at the graph, right? So, we want the RAG to find those relationships between 00:15:34.880 |
different facts in our data set. Then the next level would be to find like less obvious links, because 00:15:42.480 |
you remember there were three different subdomains of biology, right? So, those linkages are not obvious. 00:15:48.000 |
So, we wanted to get a little bit further. Ideally, if it's exhaustive, so finding all of the 00:15:54.320 |
facts from the data set, right? And the next level would be making new hypotheses, right? Which would be, 00:16:03.200 |
okay, if I see this, I can go maybe one step further and hypothesize about a new relationship. 00:16:10.160 |
And the ideal situation, or like the highest level of success, and I'm not expecting to have this 00:16:16.000 |
anytime soon. Otherwise, we're going to meet in Sweden. But this is really about finding not only 00:16:22.400 |
what can be, what is related between those different facts in the literature, but also explaining how that 00:16:28.960 |
happens. And the interesting thing is that when the discovery happened, made by humans, 00:16:33.440 |
we didn't know how it was happening. We know what was happening, but we didn't know how. 00:16:38.240 |
Okay, so the last part, I'm going to go quite quickly, because I have just less than two minutes. 00:16:44.640 |
So, the naive RAG, we all know. I'm going to show a very, very nice trick, which is not very prevalent, 00:16:51.120 |
actually. Because when you have the distances, similarities between the question embedding and 00:16:57.840 |
the embeddings in your database, you might ask yourself, of course, how many embeddings, 00:17:03.040 |
how many chunks are enough, right? Really. But you can think about it mathematically. Because if 00:17:08.480 |
you see those distances, you can calculate the variances within the different clusters 00:17:15.120 |
of those embeddings and between them. This is called Jenks natural breaks, a very simple thing to do. 00:17:22.480 |
And you can extract the top cluster, which would represent the lowest variance between 00:17:29.600 |
the chunks extracted, so they are kind of telling you that they are probably giving you the most 00:17:34.880 |
information, right? There is a snippet of how you can do it. I'm going to go quite quickly. 00:17:43.840 |
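(The slide's snippet is not reproduced in this transcript. Here is a minimal two-cluster version of the idea: for two classes, Jenks natural breaks reduces to finding the split that minimizes the within-cluster variance of the similarity scores; the `jenkspy` package offers the general multi-class version.)

```python
# Sketch: decide "how many chunks are enough" with a natural-breaks style cut.
# For two classes, Jenks natural breaks amounts to finding the split that
# minimizes the total within-cluster variance of the similarity scores.
import numpy as np

def top_cluster(chunks: list[str], sims: np.ndarray) -> list[str]:
    order = np.argsort(sims)[::-1]              # best-matching chunks first
    scores = sims[order]
    best_split, best_cost = 1, float("inf")
    for split in range(1, len(scores)):         # try every possible break point
        top, rest = scores[:split], scores[split:]
        cost = top.var() * len(top) + rest.var() * len(rest)
        if cost < best_cost:
            best_split, best_cost = split, cost
    # keep only the high-similarity cluster above the natural break
    return [chunks[int(i)] for i in order[:best_split]]
```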
In this case, we then ran into the groundedness problem as well, because it was using knowledge from after the discovery. 00:17:50.320 |
So we needed to make the prompt stronger, right? We call it strict prompting. And then it got better, 00:17:58.080 |
right? And then what we did, we built the relevance classifier, which is really passing all of the chunks 00:18:03.920 |
in our data set through the LLM and asking whether they are relevant to our problem. Unfortunately, 00:18:08.960 |
that wasn't very informative, because it was kind of redundant with the embedding distances, 00:18:15.760 |
right? There was a discussion yesterday about it. But when we did the analysis of how relevant 00:18:22.960 |
each paper is in the context of advancing the hypothesis, we built a more sophisticated prompt, thinking 00:18:31.360 |
a little bit more like a scientist. This is when we got the hypotheses a little bit further, 00:18:36.640 |
right? So it found us all of the hypotheses in the literature related to DNA, but also one related to RNA, 00:18:42.400 |
and this is the right one, right? So we see already that it's going in the right direction. I'm going to skip it 00:18:46.800 |
in the interest of time, you're going to have access to it. So my point here is when we started doing this reasoning over 00:18:53.520 |
the question and the database before the retrieval, we started getting closer to the results, 00:18:58.880 |
which are the ground-truth results, without cheating. So the conclusion is: scientific discovery requires 00:19:04.640 |
solving problems harder than simple Q&A. Knowing your problem can help define a more efficient RAG 00:19:10.400 |
architecture. Needle in a haystack might be generalized. I'm not saying that this is like 00:19:14.560 |
the best way of doing it, but you already see something new and interesting here, and harder problems 00:19:19.440 |
might need reasoning, right? And of course you cannot forget about the brute force because if you have 00:19:26.320 |
the use case where you can do it, check your LLM, maybe it's going to do better than the embedding distances. 00:19:35.440 |
So thank you very much, and forgive me for rushing, I hope this has helped.