LLM Scientific Reasoning: How to Make AI Capable of Nobel Prize Discoveries: Hubert Misztela

Chapters
0:00 Introduction
1:40 Background
3:18 Reasoning before the retrieval
8:12 Finding the perfect ingredient
10:00 What is reasoning
12:15 How to solve the problem
13:35 Design choices
13:55 Testing
00:00:00.000 |
Hello everyone, there's a lot of interesting talks so thank you for being here. My name is Hubert, 00:00:19.040 |
I work in pharma, for Novartis. On a daily basis I work on designing small molecules with 00:00:26.560 |
generative AI. So designing small graphs which would, hopefully, cure our diseases. 00:00:33.920 |
But today I would like to talk about scientific reasoning, scientific discoveries, 00:00:40.960 |
specifically how to use LLMs for that. And this is a joint work with my colleague Derek Lowe, 00:00:46.320 |
who is a medicinal chemist. You might know him if you are interested in drug design. 00:00:50.720 |
So today I'm going to talk about a few things, right? Probably I'm going to leave you with more 00:00:56.080 |
questions than responses, more problems than solutions, but I think it's worthwhile. 00:01:00.960 |
So first I'm going to show you an interesting paradox in biology which led to a Nobel Prize 00:01:07.440 |
discovery. Then I'm going to be talking about reasoning with RAGs. And you've seen many different 00:01:13.760 |
themes of RAG and agentic RAG during this conference, so I think it fits really nicely 00:01:19.680 |
with all of the other topics. Then a scientific discovery framework, which is how we can really build something, 00:01:28.240 |
how we can build the system, or the RAG specifically, which would help us in those scientific discoveries. 00:01:33.600 |
And in the end I'll show you a few examples of the experiments. 00:01:38.720 |
So back in the 1990s, early 90s, scientists were trying to improve the color of flowers, 00:01:50.480 |
petunia flowers. You know, this is what you do when you're a scientist. So they were trying to 00:01:54.240 |
overexpress a specific gene. So they wanted to get the color stronger. But what they got, 00:02:01.920 |
they got the color flipped, which was really surprising. Nobody really knew what was happening 00:02:06.320 |
and why this was happening, right? And this kind of result is what biologists were getting, 00:02:13.120 |
from different subdomains of biology. I'm not going to get into the details. I'm not a biologist, 00:02:16.720 |
I'm a computer scientist, so I'm going to spare you all of that. But the interesting thing here is that 00:02:22.240 |
there were three different sets of results, sets of experiments, from different subdomains, even with 00:02:29.680 |
different names. Everybody knew that something was messing with genes, but nobody knew why, 00:02:35.040 |
right? So these three phenomena were only resolved after eight years, more or less. 00:02:41.600 |
And this is what led to a Nobel Prize, the one in Physiology 00:02:49.120 |
or Medicine. And so the question is, can we really use LLMs to speed this up, right? It took eight years. 00:02:59.600 |
There might be something in the literature which could maybe give a hint of how we can go faster, right? And 00:03:05.680 |
the question is, what are the underlying causes, or what might be the underlying causes of these three 00:03:10.720 |
phenomena, right? And now you're all thinking, okay, let's run the RAG, let's ask the question. Probably 00:03:15.600 |
it's going to get it quite quickly, right? Let's pause for a second and try to think how we can set this up 00:03:23.760 |
and try to experiment. But before that, okay, we all know what RAG is, what naive RAG is. We have a question, 00:03:31.520 |
we process the question, we have processed the data beforehand, we represent both of them in a 00:03:37.600 |
given latent space, we do some kind of retrieval over the embeddings, and you get the response. And 00:03:43.680 |
there is a lot of different tricks. I wanted to show you just a bunch of them, because these tricks, 00:03:50.960 |
like HyDE, for example, or reranking, or document indexing, are things relating to different pieces 00:03:58.400 |
of the RAG pipeline. And they are all for different purposes, slightly different purposes, right? 00:04:03.440 |
They might try to fix the problem of matching between the question and the answer, 00:04:07.840 |
they might try to fix the problem of the right retrieval of the top documents or top chunks, 00:04:14.320 |
and so on. But the problem is, and there is a really interesting paper on this, if LLMs, and we know that, 00:04:23.120 |
cannot answer slightly trickier questions, right? A little bit more convoluted questions, 00:04:31.600 |
how can we expect that LLMs would understand the question, and then on top of that, 00:04:38.320 |
give us the right response with RAG, right? So we are trying to run before we crawl. 00:04:44.240 |
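(To make the baseline concrete, here is a minimal sketch of that naive RAG loop. The `embed` and `generate` functions are placeholders for whatever embedding model and LLM you use; the names are illustrative, not from the talk.)

```python
# Minimal naive-RAG sketch: embed, retrieve by cosine similarity, generate.
# `embed` and `generate` are placeholders for your embedding model and LLM.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: return one embedding vector per input text."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: call your LLM with the assembled prompt."""
    raise NotImplementedError

def naive_rag(question: str, chunks: list[str], top_k: int = 5) -> str:
    chunk_vecs = embed(chunks)                       # index the corpus
    q_vec = embed([question])[0]                     # embed the question in the same space
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    top = np.argsort(sims)[::-1][:top_k]             # retrieve the most similar chunks
    context = "\n\n".join(chunks[i] for i in top)
    return generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```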
So, I was thinking about it, and when you look at the literature, the papers, and the solutions during 00:04:55.600 |
this conference, you're going to realize there are themes kind of repeatedly referring to something which 00:05:02.320 |
is happening before the retrieval. And one example of reasoning before the retrieval 00:05:09.040 |
is routing. Routing, which you have in iterating flows, is really the idea of doing something with 00:05:17.680 |
the question, because we cannot throw it at the embeddings and just get the response, right? So, 00:05:22.960 |
I'm naming it reasoning for a purpose, and you're going to understand it in a few slides, right? So, 00:05:32.400 |
we need to understand a little bit more what the question 00:05:35.920 |
is about, right? And on the other hand, it's not only the question, because we also try to 00:05:42.880 |
do something with the data, right? So, GraphRAG or GraphReader, very interesting papers, 00:05:50.960 |
if you look at them, you're going to see that they are trying to represent the knowledge 00:05:56.080 |
from these documents in a little bit different way, specifically with graphs. You might use different 00:06:02.240 |
representations in many different ways, but the point is we are doing something before the retrieval, 00:06:07.440 |
which really is reasoning, to deal with the more semantically complex question. The reasoning after 00:06:15.680 |
the retrieval is usually delegated to the LLM, right? Because this is all about LLMs, but that 00:06:22.640 |
doesn't have to be the case. So, the point here really is that the more convoluted the semantics of 00:06:34.320 |
the question, the harder it's going to be for your standard RAG without those reasoning steps. So, 00:06:42.000 |
you probably want to try to add these reasoning steps before it, and I'm going to show you some 00:06:46.400 |
experiments with that. And a small aside, by the way: as I was saying, I work on a little bit different 00:06:53.920 |
flavor of AI, right? Which is generative, but for graphs. So, kind of old style, encoder-decoder, 00:07:02.080 |
VAEs; you have a latent space, and you navigate that. So, when I saw this thing which I drew, 00:07:09.600 |
which is reasoning, embedding, retrieval, and then again reasoning, I realized that this is very similar 00:07:18.720 |
to what we do in a very well-known pipeline in chemistry, where you take a data set, which in 00:07:25.760 |
that case are molecules, you represent them in the latent space, which are embeddings effectively, 00:07:30.720 |
and then you navigate the embeddings to optimize for a given purpose. So, you have the 00:07:36.320 |
latent space, and you want to extract the points which are the most interesting for your purpose, 00:07:41.280 |
right? So, in our case, we are trying to extract the embeddings which are the most similar to the question, 00:07:46.320 |
right? So, we do simple similarity. But in other domains, like chemistry, we run optimization over 00:07:53.280 |
that latent space, and this is type of reasoning, really, right? So, I was like, aha, okay. So, 00:07:58.720 |
a kind of common theme is appearing here. Now, the question is, do you really 00:08:07.440 |
need it? How do I know if my question is complex or not? So, we all know needle in a haystack, right? 00:08:16.240 |
Test or experiment or benchmark. So, I'm not going to be focusing on that too much. So, let me go quite 00:08:25.280 |
quickly, you know, like find the perfect ingredient, the ingredient to build the perfect pizza, right? 00:08:31.280 |
And a little bit extended version of that is multi-needle in a haystack, discussed yesterday on 00:08:40.080 |
one of the tracks as well, right? So, when you think about it, this is not only about the response which 00:08:48.400 |
you're going to get, but this is also about the question and really relationship between the question 00:08:52.880 |
and the response. And as I was saying, the more semantically convoluted the question is, 00:08:59.600 |
the harder it's going to be. So, how can we generalize it? And I was thinking about databases, 00:09:03.440 |
right? I'm an engineer. So, one needle, one response, or one concept, one piece of information is like 00:09:12.240 |
one-to-one relationship, like in databases. Then you have one-to-n, and then it occurred to me that, 00:09:18.880 |
okay, if I have a few concepts put together in one question, this is a multi-needle problem versus 00:09:29.440 |
finding one needle. And then you can extend it still to a few concepts hidden in a few chunks of 00:09:35.040 |
the document. So, really, you need to do a little bit more. You already see that, okay, you need to parse 00:09:40.320 |
somehow the complexity of the question so that you can get the right embeddings extracted. 00:09:46.080 |
And this is what kind of triggers your reasoning. If you have a few concepts in the question, 00:09:55.120 |
you probably need the reasoning before the retrieval so that you know what to retrieve and how to do it. 00:10:01.040 |
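(As a rough illustration of "reasoning before the retrieval", here is a sketch that first asks the LLM to decompose a multi-concept question into sub-queries and then retrieves for each one separately. `generate` and `retrieve_top_k` are placeholders, as in the earlier sketch, not the speaker's code.)

```python
# Sketch: decompose a multi-concept question into sub-queries before retrieving.
# `generate` (LLM call) and `retrieve_top_k` (similarity search) are placeholders,
# defined along the lines of the naive-RAG sketch above.
import json

def decompose(question: str) -> list[str]:
    prompt = (
        "Break the question below into the distinct concepts it contains. "
        "Return a JSON list of short, self-contained search queries, one per concept.\n\n"
        f"Question: {question}"
    )
    return json.loads(generate(prompt))

def reason_then_retrieve(question: str, chunks: list[str]) -> str:
    context: list[str] = []
    for sub_query in decompose(question):            # the reasoning step before retrieval
        context.extend(retrieve_top_k(sub_query, chunks, k=3))
    joined = "\n\n".join(dict.fromkeys(context))     # de-duplicate while keeping order
    return generate(f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:")
```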
So, I was mentioning reasoning, right? So, what really is reasoning? And there are many views on 00:10:11.600 |
that because, for example, we use chain of thought as a type of reasoning, right? But I'm thinking about 00:10:16.240 |
reasoning as processing information in some logical way, right? So, we all know how we can do the aggregation, 00:10:23.680 |
how we can do simple arithmetic over the data. We also know what's logical reasoning, right? 00:10:30.080 |
So, there's also causal reasoning. If you extract specific entities from the data and causal relationships 00:10:37.280 |
between them, you might ask the LLM, okay, what more can you hypothesize about this data, or what maybe you 00:10:44.720 |
can deduce, right? And there are papers about it. And this kind of starts working. Then you have algorithmic 00:10:50.560 |
reasoning, which I mentioned before. You have also probabilistic reasoning. So, we were trying 00:10:57.200 |
to make LLMs reason in a probabilistic fashion, right? Like Bayesian inference. There's also 00:11:02.720 |
structured way of reasoning. This is a little bit different from causal because you might have 00:11:06.400 |
a structure and you can expect some kind of compositionality over that structure, right? 00:11:13.760 |
And there's also, of course, the ARC competition on Kaggle, recently released with a $1 million prize for getting closer 00:11:23.280 |
to AGI, right? This is reasoning over geometry. That's why it's so challenging, because this is not a 00:11:29.680 |
typical type of reasoning, right? So, why am I saying this? Because usually what we do, we expect LLMs to 00:11:41.760 |
perform all of these types of reasoning well. And do we really need LLMs for that? 00:11:48.400 |
Probably, yes, or maybe not. But you can also delegate these reasoning types to specific tools, 00:11:55.440 |
which are specialists in that. So, you can do causal inference with libraries for causal inference. 00:12:01.360 |
You can do algorithmic reasoning with Python code, right? That's why we have a 00:12:07.520 |
REPL as a tool, right? Attached to your agent so that you can generate the code. And 00:12:13.040 |
this is kind of algorithmic reasoning, right? 00:12:14.720 |
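(A toy sketch of what such a Python REPL tool could look like when attached to an agent; a real one would need sandboxing, timeouts, and resource limits.)

```python
# Toy "Python REPL" tool: run LLM-generated code and return what it printed.
# A production version needs sandboxing, timeouts, and resource limits.
import contextlib
import io

def python_repl_tool(code: str) -> str:
    buffer = io.StringIO()
    namespace: dict = {}
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)            # delegate the algorithmic reasoning to Python
    except Exception as exc:                 # surface errors back to the agent
        return f"Error: {exc!r}"
    return buffer.getvalue()

# Example: the agent asks for an aggregation instead of doing the arithmetic itself.
print(python_repl_tool("counts = [3, 7, 12]\nprint(sum(counts) / len(counts))"))
```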
So, let's try to come back to the problem and think how we can solve it, really, right? 00:12:21.920 |
So, we want to find the cause of these three phenomena. And we define the type of reasoning, 00:12:26.480 |
the type of retrieval, which is the relation between the question and the answer. 00:12:33.280 |
And the particularity in our case is also that we don't want to build a RAG to 00:12:39.200 |
answer many different questions over a specific data set; we want to answer really one question, 00:12:45.600 |
right? And this is the question. We want to answer this one question and we want to 00:12:50.000 |
process our data set as many times as we possibly can so that we extract all of the relevant themes in 00:12:55.360 |
it. But the trick here is that LLMs during training have seen Wikipedia. So, they know about 00:13:05.360 |
this problem and RNA interference, because it happened in 1998. So, we need groundedness, right? And this 00:13:11.120 |
is another aspect of it. So, when you think about different methods and where you are in your specific 00:13:17.360 |
use case on that computational efficiency versus groundedness trade-off, you immediately see, 00:13:23.680 |
okay, which approaches might be more suitable for my case. And as I'm going to show you later on, 00:13:28.560 |
the relevance classifier is one of the main things which we use, because we can process all of the data set. 00:13:37.440 |
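(A relevance classifier in this sense can be as simple as a yes/no prompt run over every chunk in the corpus. Below is a rough sketch; `generate` is the placeholder LLM call from the earlier sketches, and the problem statement and prompt wording are illustrative, not the ones used in the talk.)

```python
# Sketch of a relevance classifier: pass every chunk through the LLM with a yes/no prompt.
# `generate` is a placeholder LLM call; the problem statement below is illustrative.
PROBLEM = (
    "Unexplained gene-silencing phenomena seen when overexpressing or introducing genes "
    "(e.g. the petunia co-suppression results)."
)

def is_relevant(chunk: str) -> bool:
    prompt = (
        f"Research problem: {PROBLEM}\n\n"
        f"Text:\n{chunk}\n\n"
        "Is this text relevant to the research problem? Answer 'yes' or 'no' only."
    )
    return generate(prompt).strip().lower().startswith("yes")

def filter_corpus(chunks: list[str]) -> list[str]:
    # Unlike similarity search, this touches every chunk, so nothing is silently skipped.
    return [c for c in chunks if is_relevant(c)]
```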
So, the overall design choices, which you can see here, usually depend upon the specific question 00:13:46.240 |
you have, the reasoning you need, the relationship between question and answer, and other aspects 00:13:52.000 |
like groundedness versus efficiency. So, when we have defined what type of question that is, how do we 00:14:02.640 |
really want to test whether our solution is capable of doing this kind of discovery, right? So, you state 00:14:08.720 |
the question, you define the type of question, but then you need to do the knowledge cutoff, which is 00:14:13.520 |
basically you need to remove from the LLM the knowledge about the discovery. So, what we do, we use RAG 00:14:19.440 |
and in that RAG we have only the scientific papers from before the discovery, right? So that it doesn't cheat. 00:14:27.280 |
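(One simple way to enforce that knowledge cutoff is to filter the corpus by publication date before indexing anything. A sketch, where the `Paper` record and its `published` field are my own illustrative naming.)

```python
# Sketch: enforce the knowledge cutoff by indexing only papers published before the discovery.
# The `Paper` record and field names are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    title: str
    text: str
    published: date

# The RNA interference discovery was published in 1998; pick the exact cutoff accordingly.
CUTOFF = date(1998, 1, 1)

def pre_discovery_corpus(papers: list[Paper]) -> list[Paper]:
    """Keep only papers the field had access to before the discovery."""
    return [p for p in papers if p.published < CUTOFF]
```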
So, we present it with the state of the situation as it was before the discovery, so that we can simulate 00:14:33.440 |
that situation, right? And you can think about it as training an agent on that specific, 00:14:41.600 |
on that specific scientific problem, on that specific data set, and the overarching goal is to 00:14:46.880 |
have a system which would be able to make those discoveries on many different scientific problems. 00:14:56.000 |
Now, the question is, how do we define the success of this experiment? So, the first level of success 00:15:06.640 |
would be, okay, find some hypotheses, or what we really already know from this data set, right? So, it's not 00:15:15.040 |
that we are asking specific questions. We want the RAG to extract, okay, what do you know about 00:15:20.640 |
anything related really to that problem? And don't read the text, forgive me the amount of 00:15:27.840 |
text on these slides, but look at the graph, right? So, we want the RAG to find those relationships between 00:15:34.880 |
different facts in our data set. Then the next level would be to find like less obvious links, because 00:15:42.480 |
you remember there were three different subdomains of biology, right? So, those linkages are not obvious. 00:15:48.000 |
So, we wanted to get a little bit further. Ideally, if it's exhaustive, so finding all of the 00:15:54.320 |
facts from the data set, right? And the next level would be making new hypotheses, right? Which would be, 00:16:03.200 |
okay, if I see this, I can go maybe one step further and hypothesize about a new relationship. 00:16:10.160 |
And the ideal situation, or like the highest level of success, and I'm not expecting to have this 00:16:16.000 |
anytime soon. Otherwise, we're going to meet in Sweden. But this is really about finding not only 00:16:22.400 |
what can be, what is related between those different facts in the literature, but also explaining how that 00:16:28.960 |
happens. And the interesting thing is that when the discovery happened, made by humans, 00:16:33.440 |
we didn't know how it was happening. We know what was happening, but we didn't know how. 00:16:38.240 |
Okay, so the last part, I'm going to go quite quickly, because I have just less than two minutes. 00:16:44.640 |
So, the naive RAG, we all know. I'm going to show a very, very nice trick, which is not very prevalent, 00:16:51.120 |
actually. Because when you have the distances, similarities between the question embedding and 00:16:57.840 |
the embeddings in your database, you might ask yourself, of course, how many embeddings, 00:17:03.040 |
how many chunks are enough, right? Really. But you can think about it mathematically. Because if 00:17:08.480 |
you see those distances, you can calculate the variances within the different clusters 00:17:15.120 |
of those embeddings and between them. This is called Jenks natural breaks, a very simple thing to do. 00:17:22.480 |
And you can extract the top cluster, which would represent the lowest variance between 00:17:29.600 |
the chunks extracted, so they are kind of telling you that they are probably giving you the most 00:17:34.880 |
information, right? There is a snippet of how you can do it. I'm going to go quite quickly. 00:17:43.840 |
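(The slide's snippet is not reproduced in this transcript. Here is a minimal two-cluster version of the idea: for two classes, Jenks natural breaks reduces to finding the split that minimizes the within-cluster variance of the similarity scores; the `jenkspy` package offers the general multi-class version.)

```python
# Sketch: decide "how many chunks are enough" with a natural-breaks style cut.
# For two classes, Jenks natural breaks amounts to finding the split that
# minimizes the total within-cluster variance of the similarity scores.
import numpy as np

def top_cluster(chunks: list[str], sims: np.ndarray) -> list[str]:
    order = np.argsort(sims)[::-1]              # best-matching chunks first
    scores = sims[order]
    best_split, best_cost = 1, float("inf")
    for split in range(1, len(scores)):         # try every possible break point
        top, rest = scores[:split], scores[split:]
        cost = top.var() * len(top) + rest.var() * len(rest)
        if cost < best_cost:
            best_split, best_cost = split, cost
    # keep only the high-similarity cluster above the natural break
    return [chunks[int(i)] for i in order[:best_split]]
```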
In this case, we then ran into the groundedness problem as well, because it was using knowledge from after the discovery. 00:17:50.320 |
So we needed to make the prompt stronger, right? We call it strict prompting. And then it got better, 00:17:58.080 |
right? And then what we did, we built the relevance classifier, which is really passing all of the chunks 00:18:03.920 |
in our data set through the LLM and asking whether they are relevant to our problem. Unfortunately, 00:18:08.960 |
that wasn't very informative, because it was kind of redundant with the embedding distances, 00:18:15.760 |
right? There was a discussion yesterday about it. But when we did the analysis of how relevant 00:18:22.960 |
each paper is in the context of advancing the hypothesis, we built a more sophisticated prompt, thinking 00:18:31.360 |
a little bit more like a scientist. This is when we got the hypotheses a little bit further, 00:18:36.640 |
right? So it found us all of the hypotheses in the literature related to DNA, but also one related to RNA, 00:18:42.400 |
and this is the right one, right? So we see already that it's going in the right direction. I'm going to skip it 00:18:46.800 |
in the interest of time, you're going to have access to it. So my point here is when we started doing this reasoning over 00:18:53.520 |
the question and the database before the retrieval, we started getting closer to the results, 00:18:58.880 |
which are the ground-truth results, without cheating. So the conclusion is: scientific discovery requires 00:19:04.640 |
solving problems harder than simple Q&A. Knowing your problem can help define a more efficient RAG 00:19:10.400 |
architecture. Needle in a haystack might be generalized. I'm not saying that this is like 00:19:14.560 |
the best way of doing it, but you already see something new and interesting here, and harder problems 00:19:19.440 |
might need reasoning, right? And of course you cannot forget about the brute force because if you have 00:19:26.320 |
the use case where you can do it, check your LLM, maybe it's going to do better than the embedding distances. 00:19:35.440 |
So thank you very much, and forgive me for rushing, I hope this has helped.