LLM Scientific Reasoning: How to Make AI Capable of Nobel Prize Discoveries: Hubert Misztela


Chapters

0:00 Introduction
1:40 Background
3:18 Reasoning before the retrieval
8:12 Finding the perfect ingredient
10:00 What is reasoning
12:15 How to solve the problem
13:35 Designing choices
13:55 Testing

Transcript

Hello everyone, there are a lot of interesting talks, so thank you for being here. My name is Hubert, I work for the pharma company Novartis. On a daily basis I work on designing small molecules with generative AI, so designing small graphs that would, we hope, treat or cure our diseases. But today I would like to talk about scientific reasoning, scientific discoveries, specifically how to use LLMs for that.

And this is joint work with my colleague Derek Lowe, who is a medicinal chemist. You might know him if you are interested in drug design. So today I'm going to talk about a few things, right? I'm probably going to leave you with more questions than answers, more problems than solutions, but I think it's worthwhile.

So first I'm going to show you an interesting paradox in biology which led to a Nobel Prize discovery. Then I'm going to talk about reasoning with RAGs. You've seen many different takes on RAGs and agentic RAGs during this conference, so I think it fits really nicely with all of the other topics.

Then a scientific discovery framework, which is how we can really build a system, RAGs specifically, that would help us with those scientific discoveries. And at the end I'll show you a few examples of the experiments. So back in the early 1990s, scientists were trying to improve the color of flowers, petunia flowers.

You know, this is what you do when you're a scientist. So they were overexpressing a specific gene; they wanted to make the color stronger. But what they got instead was the color lost, which was really surprising. Nobody really knew what was happening, or why it was happening, right?

And biologists were getting these kinds of results from different subdomains of biology. I'm not going to get into the details; I'm not a biologist, I'm a computer scientist, so I'll spare you all of that. But the interesting thing here is that there were three different sets of results, sets of experiments, from different subdomains, even with different names.

Everybody knew it was related to something messing with the genes, but nobody knew why, right? These three phenomena were only resolved after eight years, more or less. And this is what led to a Nobel Prize in physiology or medicine. And so the question is, can we really use LLMs to speed this up, right?

It took eight years. There might be something in the literature which could maybe give a hint of how we could go faster, right? And the question is, what are the underlying causes, or what might be the underlying causes, of these three phenomena, right? And now you're all thinking, okay, let's run the RAG, let's ask the question.

It's probably going to get it quite quickly, right? Let's pause for a second and try to think about how we can set this up and experiment. But before that, okay, we all know what RAG, naive RAG, is. We have a question, we process the question, we process the data beforehand, we represent both of them in a given latent space, we do some kind of retrieval over the embeddings, and you get the response.
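As a minimal sketch of that naive pipeline (the `embed` and `llm` helpers below are hypothetical stand-ins for whatever embedding model and LLM you use, not any particular API):

```python
import numpy as np

# Hypothetical stand-ins: wire these to your embedding model and LLM.
def embed(text: str) -> np.ndarray: ...
def llm(prompt: str) -> str: ...

def naive_rag(question: str, chunks: list[str], k: int = 5) -> str:
    # Represent the question and the data in the same latent space.
    q = embed(question)
    vecs = np.stack([embed(c) for c in chunks])
    # Retrieval = cosine similarity over the embeddings.
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(sims)[::-1][:k]
    context = "\n\n".join(chunks[i] for i in top)
    # Generation: the LLM answers from the retrieved chunks.
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```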

And there are a lot of different tricks. I wanted to show you just a bunch of them, because these tricks, like HyDE, for example, or re-ranking, or document indexing, relate to different pieces of the RAG pipeline. And they are all for slightly different purposes, right? They might try to fix the problem of matching between the question and the answer, they might try to fix the problem of retrieving the right top documents or top chunks, and so on.
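For instance, HyDE attacks the question-answer mismatch by embedding a hypothetical answer instead of the question; a sketch, reusing the stand-in helpers from the previous snippet:

```python
import numpy as np

def hyde_retrieve(question: str, chunks: list[str], k: int = 5) -> list[str]:
    # Let the LLM draft a plausible (possibly wrong) answer passage; its
    # embedding tends to land closer to real answer chunks than the
    # question's embedding does.
    draft = llm(f"Write a short passage that plausibly answers: {question}")
    q = embed(draft)
    vecs = np.stack([embed(c) for c in chunks])
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]
```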

But the problem is, and there is a really interesting paper about this, if LLMs, as we know, cannot answer slightly trickier, more convoluted questions, how can we expect that an LLM would understand the question and then, on top of that, give us the right response with RAG, right?

So we are trying to run before we crawl. So, I was thinking about it, and when you look at the literature, the papers, and the solutions during this conference, you're going to notice themes repeatedly referring to something which happens before the retrieval. And one example of reasoning before the retrieval is routing.

Routing, which you have in iterative flows, is really the idea of doing something with the question, because we cannot just throw it at the embeddings and get the response, right? So, I'm calling it reasoning for a reason, and you're going to understand it in a few slides; a sketch of such a router follows.
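Again with the hypothetical `llm` helper: classify the question first, then retrieve only from the matching index (the label set here is illustrative):

```python
def route(question: str,
          labels: tuple[str, ...] = ("chemistry", "biology", "clinical")) -> str:
    # Reasoning before retrieval: decide *where* to search before searching.
    answer = llm(
        f"Classify the question into exactly one of: {', '.join(labels)}.\n"
        f"Question: {question}\nLabel:"
    ).strip().lower()
    return answer if answer in labels else labels[0]

# indexes = {"chemistry": chem_chunks, "biology": bio_chunks, "clinical": clin_chunks}
# response = naive_rag(question, indexes[route(question)])
```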

We need to understand a little bit more about what the question is about, right? And on the other hand, it's not only the question, because we also try to do something with the data, right? GraphRAG or GraphReader, very interesting papers; if you look at them, you're going to see that they try to represent the knowledge from the documents in a slightly different way, specifically with graphs.

You might use different representations in many different ways, but the point is we are doing something before the retrieval, which really is reasoning, to deal with more semantically complex questions. The reasoning after the retrieval is usually delegated to the LLM, right? Because this is all about LLMs, but that doesn't have to be the case.

So, the point here really is: the more convoluted the semantics of the question, the harder it's going to be for your standard RAG without those reasoning steps. So, you probably want to add these reasoning steps before it, and I'm going to show you some experiments with that.

And a small aside, by the way: as I was saying, I work on a slightly different flavor of generative AI, for graphs. Kind of old style: encoder-decoder, VAEs; you have a latent space, and you navigate it. So, when I saw this thing which I drew, which is reasoning, embedding, retrieval, and then reasoning again, I realized that this is very similar to a very well-known pipeline in chemistry, where you take a data set, which in that case is molecules, you represent them in a latent space, which is embeddings effectively, and then you navigate the embeddings to optimize for a given purpose.

So, you have the latent space, and you want to extract the points which are the most interesting for your purpose, right? In our case, we are trying to extract the embeddings which are the most similar to the question, right? So, we do simple similarity.

But in other domains, like chemistry, we run optimization over that latent space, and that is a type of reasoning, really, right? So, I was like, aha, okay, a common theme is appearing here. Now, the question is, do you really need it? How do I know if my question is complex or not?

Well, we all know needle in a haystack, right? The test, or experiment, or benchmark. I'm not going to focus on that too much. So, let me go quite quickly: you know, find the perfect ingredient, the ingredient to build the perfect pizza, right? And a slightly extended version of that is multi-needle in a haystack, discussed yesterday on one of the tracks as well, right?

So, when you think about it, this is not only about the response which you're going to get; this is also about the question, and really the relationship between the question and the response. And as I was saying, the more semantically convoluted the question is, the harder it's going to be.

So, how can we generalize this? I was thinking about databases, right? I'm an engineer. One needle, one response, or one concept, one piece of information, is like a one-to-one relationship, like in databases. Then you have one-to-n, and then it occurred to me that, okay, if I have a few concepts put together in one question, this is a multi-needle thing versus finding one needle.

And then you can extend it further, to a few concepts hidden in a few chunks of the document. So, really, you need to do a little bit more. You can already see that, okay, you need to somehow parse the complexity of the question so that you can get the right embeddings extracted.

And this is what kind of triggers your reasoning. If you have a few concepts in the question, you probably need reasoning before the retrieval so that you know what to retrieve and how to do it. So, I was mentioning reasoning, right? What is reasoning, really? There are many views on that because, for example, we use chain of thought as a type of reasoning, right?

But I'm thinking about reasoning as processing information in some logical way, right? So, we all know how to do aggregation, how to do simple arithmetic over the data. We also know what logical reasoning is, right? There's also causal reasoning: if you extract specific entities from the data and the causal relationships between them, you might ask the LLM, okay, what more can you hypothesize about this data, or what can you deduce, right?

And there are papers about it. And this kind of starts working. Then you have algorithmic reasoning, which I mentioned before. You also have probabilistic reasoning; we were trying to make LLMs reason in a probabilistic fashion, right? Like Bayesian inference. There's also a structured way of reasoning. This is a little bit different from causal, because you might have a structure and you can expect some kind of compositionality over that structure, right?

And there's also, of course, the ARC Kaggle competition, recently released with a $1 million prize for getting closer to AGI, right? This is reasoning over geometry. That's why it's so challenging, because it's not a typical type of reasoning, right? So, why am I saying this? Because usually we expect LLMs to perform all of these types of reasoning well.

And do we really need LLMs for that? Probably yes, or maybe not. But you can also delegate these reasoning types to specific tools which are specialists in them. So, you can do causal inference with libraries for causal inference. You can do algorithmic reasoning with Python, right? That's why we attach a REPL as a tool to the agent, so that it can generate and run code; that is algorithmic reasoning, right?
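A sketch of that delegation, with algorithmic reasoning handed to a Python REPL tool; the dispatch is illustrative, not any particular agent framework, and `llm` is the same hypothetical helper as before:

```python
import io
import contextlib

def python_repl(code: str) -> str:
    # Specialist tool: exact computation instead of LLM arithmetic.
    # In practice this must run in a sandbox, not a bare exec().
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

def solve_algorithmically(question: str) -> str:
    # The LLM writes the algorithm; the REPL runs it.
    code = llm(f"Write Python code that prints the answer to: {question}")
    return python_repl(code)
```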

So, let's come back to the problem and think about how we can solve it, really, right? We want to find the cause of these three phenomena. And we define the type of reasoning and the type of retrieval, which is the relationship between the question and the answer.

And the particularity in our case is that we don't want to build a RAG to answer many different questions over a specific data set; we want to answer really one question, right? And this is the question. We want to answer this one question, and we want to process our data set as many times as we possibly can, so that we extract all of the relevant themes in it.

But the trick here is that LLMs have seen Wikipedia during training. So they know about this problem and about RNA interference, because the discovery happened in 1998. So we need groundedness, right? And this is another aspect of it. When you think about different methods and where your specific use case sits on computational efficiency versus groundedness, you immediately see which approaches might be more suitable for your case.

And as I'm going to show you later on, a relevance classifier is one of the main things we use, because it lets us process all of the data set. So, the overall design choices you see here usually depend on the specific question you have, the reasoning you need, the relationship between question and answer, and other aspects like groundedness versus efficiency.

Once we have defined what type of question this is, how do we really test whether our solution is capable of making this kind of discovery, right? You state the question, you define the type of question, but then you need a knowledge cutoff, which basically means you need to remove the knowledge of the discovery from the LLM.

So what we do, we use RAG, and in that RAG we have only the scientific papers from before the discovery, right? So that it doesn't cheat. We present it with the state of the situation as it was before the discovery, so that we can simulate that situation, right?
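A sketch of that cutoff on the corpus side, assuming each paper record carries a publication year (RNA interference was published in 1998, so only earlier work gets indexed):

```python
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    year: int
    text: str

def pre_discovery_corpus(papers: list[Paper], cutoff_year: int = 1998) -> list[str]:
    # Only pre-discovery literature is indexed, so retrieval cannot leak
    # the answer; the LLM's parametric knowledge still has to be handled
    # separately (the strict prompting discussed later).
    return [p.text for p in papers if p.year < cutoff_year]
```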

And you can think about this as training an agent on that specific scientific problem, on that specific data set, where the overarching goal is to have a system which would be able to make those discoveries on many different scientific problems by applying the same scheme.

Now, the question is, how do we define the success of this experiment? The first level of success would be, okay, find some hypotheses, what we really know from this data set, right? It's not that we are asking specific questions; we want the RAG to extract anything really related to that problem.

And don't read the text, forgive me the amount of text on these slides, but look at the graph, right? We want the RAG to find those relationships between different facts in our data set. The next level would be to find less obvious links, because, you remember, there were three different subdomains of biology, right?

So, those linkages are not obvious; we wanted to get a little bit further. Ideally it would be exhaustive, finding all of the facts in the data set, right? And the next level would be making new hypotheses, right? Which would be, okay, if I see this, I can maybe go one step further and hypothesize about a new relationship.

And the ideal situation, the highest level of success, and I'm not expecting to have this anytime soon, otherwise we're going to meet in Sweden, is really about finding not only what is related between those different facts in the literature, but also explaining how that happens.

And the interesting thing is that when the discovery was made, by humans, we didn't know how it was happening. We knew what was happening, but we didn't know how. Okay, so the last part, I'm going to go quite quickly, because I have less than two minutes. The naive RAG, we all know.

I'm going to show a very, very nice trick, which is not very prevalent, actually. When you have the distances, the similarities between the question embedding and the embeddings in your database, you might ask yourself, of course, how many embeddings, how many chunks, are enough, right? But you can think about it mathematically.

Because if you look at those distances, you can calculate the variances within the different clusters of those embeddings and between them. This is called Jenks natural breaks, a very simple thing to do. And you can extract the top cluster, which would represent the lowest variance between the chunks extracted.

So those chunks are probably the ones giving you the most information, right? Here is a snippet of how you can do it. I'm going to go quite quickly.
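A self-contained sketch of the two-class case (the talk's exact snippet isn't reproduced here): try every split of the sorted similarity scores and keep the one minimizing the summed within-cluster variance; the top cluster is what you pass to the LLM. Libraries such as jenkspy generalize this to more classes.

```python
import numpy as np

def jenks_top_cluster(sims: np.ndarray) -> np.ndarray:
    """Indices of the top cluster of similarity scores, via a
    two-class Jenks natural break (optimal split of sorted 1-D data)."""
    order = np.argsort(sims)[::-1]      # highest similarity first
    s = sims[order]
    best_split, best_cost = 1, np.inf
    for i in range(1, len(s)):          # candidate break points
        # Sum of squared deviations within each side of the break.
        cost = (((s[:i] - s[:i].mean()) ** 2).sum()
                + ((s[i:] - s[i:].mean()) ** 2).sum())
        if cost < best_cost:
            best_split, best_cost = i, cost
    return order[:best_split]           # the low-variance top cluster

# chunks_to_use = [chunks[i] for i in jenks_top_cluster(sims)]
```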

In this first run, though, we didn't get the groundedness, because it was using knowledge from after the discovery. So we needed to make the prompt stronger, right? We call it strict prompting. And then it got better, right? And then what we did, we used the relevance classifier, which is really passing all of the chunks in our data set through the LLM and asking whether each is relevant to our problem.
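A sketch of that classifier, one LLM call per chunk (the prompt wording is illustrative, not the one from the talk): maximally grounded, but expensive, which is the efficiency-versus-groundedness trade-off from before.

```python
def relevant_chunks(chunks: list[str], problem: str) -> list[str]:
    # Brute force over the whole data set: every chunk is judged by the
    # LLM, not just the nearest embeddings.
    kept = []
    for chunk in chunks:
        verdict = llm(
            f"Problem under study: {problem}\n\n"
            f"Passage:\n{chunk}\n\n"
            "Is this passage relevant to the problem? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            kept.append(chunk)
    return kept
```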

Unfortunately, that wasn't very informative, because it was largely redundant with the distances from the embeddings, right? There was a discussion yesterday about it. But when we did the analysis of how relevant each paper is in the context of advancing a hypothesis, we built a more sophisticated prompt, thinking a little bit more like a scientist.

This is when we got the hypotheses a little bit further, right? It found us all of the hypotheses in the literature related to DNA, but also one related to RNA, and this is the right one, right? So we see already that it's going in the right direction. I'm going to skip ahead in the interest of time; you're going to have access to it.

So my point here is, when we started doing this reasoning over the question and the database before the retrieval, we started getting closer to the ground-truth results, without cheating. So the conclusion is: scientific discovery requires solving problems harder than simple Q&A. Knowing your problem can help you define a more efficient RAG architecture.

Needle in a haystack might be generalized. I'm not saying that this is the best way of doing it, but you can already see something new and interesting here, and harder problems might need reasoning, right? And of course you cannot forget about brute force, because if you have a use case where you can do it, just ask your LLM over everything; maybe it's going to do better than the embedding distances.

So thank you very much, and forgive me for rushing; I hope this has helped. Thank you.