
Chan-Zuckerberg rbio1: Training scientific reasoning LLMs with biological world models as verifiers


Whisper Transcript

00:00:00.000 | Yeah, go ahead.
00:00:00.500 | Awesome.
00:00:02.260 | Okay.
00:00:04.060 | Hey, guys.
00:00:04.740 | My name is Ankit.
00:00:05.700 | I work as an RL researcher.
00:00:09.160 | I have my own startup.
00:00:10.320 | I'll talk about that later.
00:00:12.340 | But yeah, presenting this paper, which is basically built on the idea that you can use something like a soft verifier instead of hard verification.
00:00:21.620 | In bio this becomes an important problem, because hard verification means going to a lab, actually running the experiment, and only then getting a reward to backpropagate to the model.
00:00:34.540 | So instead of that, since you already have a lot of real-world data in bio,
00:00:38.860 | they trained a sort of surrogate model, which calculates the probability of how two genes are going to interact with each other, which goes up and which goes down, and they essentially use that surrogate prediction
00:00:51.380 | as the reward signal for the model to backpropagate.
00:00:55.000 | They use more techniques around it and they find some encouraging results, but that's the crux of the paper.
00:01:00.120 | So, yeah, to start with: they do not have access to exact rules facilitating formal verification, right?
00:01:11.620 | So testing a hypothesis in a lab is the way to do it.
00:01:15.200 | Real experiments are expensive and do not scale with computation.
00:01:20.840 | So this is where you have approximate oracles with prior knowledge, which utilize a prediction model instead of, you know, going with experimental data, and they use RL on top of that.
00:01:35.920 | That's what I explained before.
00:01:39.260 | And the result here is that soft verification essentially distills the biology world models into an LLM, and it can achieve performance on the leading benchmarks comparable to state-of-the-art models.
00:01:54.100 | And they also combine verifiers with chain of thought: verifying the chain-of-thought reasoning, providing a reward if the reasoning is correct, and that improves the performance further.
00:02:07.620 | So, yeah, at one point you have a lot of annotated data in the industry, but there was no way to use that data as such, because people just created a sort of model around it, and it did not really do anything outside of that.
00:02:37.020 | So they want a way for people to be able to reason with it, to be able to utilize that knowledge and, you know, talk to it, as simple as that.
00:02:49.660 | So, yeah, there is something known as virtual cell models that have come up recently.
00:02:54.120 | Virtual cell models were promoted by the Arc Institute as well, where they are essentially creating a virtual model of the cell and a model of how it would interact with the environment.
00:03:10.460 | This shows that you don't need experimental data for all of this, and you can generate predictions for any cell state transition, such as a diseased to healthy state and vice versa.
00:03:22.780 | And there are very specific modalities for this, such as transcriptomics, imaging, genomics, or even language, but most of these models are not trained on multiple modalities.
00:03:35.620 | And that's why each is a very specific model for a very specific task.
00:03:39.060 | You don't have a generic model that can do multiple things, and that's the goal of this paper.
00:03:47.280 | Here, you have to develop a method which allows for integration of world models into a common space, and they use language as the connecting modality.
00:03:56.820 | The benefit is that it allows transformation of complex models into conversations and can engage users by distilling a biological model into an LLM.
00:04:09.280 | They're able to distill the knowledge derived from experimental data into natural language, which enables interactive and human-readable dialogue, and which can be seen as aligning a reasoning LLM to a biological model; essentially the paper talks about a biology world model and an LLM a lot.
00:04:29.460 | So the whole idea is that you have a lot of experiments in bio, and you end up in a place where you can put all of that into an LLM and then just interact with it.
00:04:40.680 | But yeah, it keeps talking about that for three to four pages, and they build on a lot of aggregation of world models that already exist in a representation space.
00:04:52.800 | And then they use models of biology to train models which are just as performant and can bypass the need for lab-scale data generation.
00:05:01.660 | They mention it at the end of the paper: their base model is Qwen 2.5.
00:05:07.300 | It sits at four billion parameters.
00:05:10.180 | So everything that happens, happens at the four-billion-parameter level.
00:05:15.640 | But yeah, the results are very interesting.
00:05:17.520 | Even at that level, they compare the results of the base model against how much they were able to improve the model.
00:05:24.260 | And the core capability that they are relying on is generalization, especially out of domain, as well as on unseen problems, to see how far it goes.
00:05:40.160 | Generalization also covers a couple of different things here.
00:05:42.340 | Sometimes they say that if a model has not seen the test data, that is generalization.
00:05:47.000 | Sometimes they say that it's a novel problem, and that is generalization.
00:05:50.480 | So this is the core aspect here.
00:05:56.820 | They use something like a surrogate model: a reward formulation by a domain-specific biological model, where they calculate a probability, which gives you an idea of how likely the predicted outcome is.
00:06:15.640 | To give you a simpler example, think of it as Isomorphic Labs generating a protein structure.
00:06:22.900 | For that reward, it cannot be that you just go to a lab and test out that protein structure.
00:06:27.660 | You already have a lot of data.
00:06:29.200 | And an easier way is to just have another model which predicts the stability of the protein structure.
00:06:35.940 | If the stability is 0.8, you pretty much have 0.8 as the reward.
00:06:45.120 | So they call this soft verification throughout the paper, because this is not a hard verification.
00:06:50.380 | Hard verification in their language is going to a lab and testing this out.
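To make that contrast concrete, here's a minimal sketch; the function names and the binary framing are mine for illustration, not from the paper:

```python
def hard_reward(prediction: bool, experiment_outcome: bool) -> float:
    """Hard verification: run the real experiment, reward agreement."""
    return 1.0 if prediction == experiment_outcome else 0.0

def soft_reward(prediction: bool, surrogate_prob: float) -> float:
    """Soft verification: no experiment, just a surrogate model's
    probability that the event happens (e.g., stability 0.8 gives
    reward 0.8 for predicting 'stable')."""
    return surrogate_prob if prediction else 1.0 - surrogate_prob
```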
00:06:58.960 | So with this, they arrive at a model which they can talk to, which can reason and do things.
00:07:05.640 | Their entire experiment is on a narrow perturbation-prediction setup: what happens to other genes if you knock down a given gene in a given cell.
00:07:20.340 | So, yeah, in that sense, it's quite narrow.
00:07:22.880 | But I do think that it's going to generalize.
00:07:25.800 | Let me interrupt here.
00:07:28.180 | I think just for the sake of people who aren't familiar with genetics, either you or I should describe what knockdown means.
00:07:36.860 | Do you want to do that?
00:07:39.280 | You should go ahead with that.
00:07:40.800 | I have a very basic understanding of it.
00:07:42.640 | Yeah, I mean, I don't have a really complicated understanding either, but basically it just means you use CRISPR or other techniques.
00:07:48.460 | So CRISPR is a way that you can basically remove a gene from a genome, or reduce its expression, and then look at the impact of that.
00:07:59.660 | So that's called knockdown, just to be clear.
00:08:03.200 | Yeah, so typically you would look for the change in expression of other genes, or other phenotypic responses, like, you know, disease or not disease, or something like that.
00:08:14.520 | Sorry.
00:08:16.860 | Sorry to interrupt.
00:08:17.980 | Not an issue.
00:08:20.000 | So basically, I mean, this is pretty much the strategy that they use.
00:08:24.480 | They use GRPO throughout.
00:08:25.960 | And one surprising thing was that they use GRPO even when they are already getting good trajectories.
00:08:31.760 | But I mean, it is GRPO throughout.
00:08:36.160 | So in part B of the figure, where you see training, you have prompts and completions, and they compute advantages.
00:08:41.220 | The third part is the source of verification. There are some hard verifiers that they use, but the ideas are: one, experimental data, where you backpropagate a hard verification reward to the LLM.
00:08:56.340 | Then you have a virtual cell model that you use as an input.
00:09:00.560 | You get a soft answer.
00:09:01.920 | You use a soft verifier based on that, get a reward, and backpropagate it to the LLM.
00:09:06.820 | The third part was they created a lot of knowledge sources in biology.
00:09:11.200 | So when a reasoning LLM produces a reasoning trace, what they ended up doing was essentially creating a reward based on the reasoning trace and how much it corresponds to those knowledge sources.
00:09:22.840 | They add this as an additional reward to either A or B and then backpropagate that as well.
00:09:29.600 | And the claim here is that this is what improves the model performance a lot compared to the base model.
00:09:43.600 | And it performs at a level comparable to the state of the art on the bio benchmarks.
00:09:53.860 | Not to give a spoiler, but I think that this claim is a little bit off, because when you look at the beginning of the paper you think they're saying: oh, if we combine experimental data plus virtual cell model plus knowledge sources, then we get better than any of the three.
00:10:15.500 | When in fact, the experiments all show that experimental data alone outperforms the virtual cell model and knowledge sources, even when the virtual cell model is added on top of experimental data, right?
00:10:35.940 | And I think that's a really disappointing result for me, and they don't really talk about it.
00:10:40.620 | So I just wanted to be clear about the claim versus the reality here.
00:10:45.580 | Yeah, that's true.
00:10:48.460 | So that was something that struck me as odd as well.
00:10:52.460 | But I do take it as: say you don't have enough resources for experimental data, or the experimentation is slow.
00:11:01.780 | You can pretty much use a proxy and get to a level which is close to what an experimental benchmark would be.
00:11:09.100 | Yeah, absolutely, it's not a total loss, but the way they write the paper makes you think it's the former, when it was really the latter.
00:11:19.020 | Yeah, so I guess that's their incentive, right?
00:11:24.940 | So anyway, basically the whole idea becomes that you have something like what we were talking about with knockdown, right?
00:11:34.220 | "Is a knockdown in this particular cell line likely to result in differential expression of gene V?" is what we are predicting here.
00:11:40.620 | And that's the task.
00:11:42.140 | Usually, earlier this was done through bespoke models that are trained on task-specific data, or foundation models
00:11:48.060 | pre-trained on large-scale corpora, which are the kinds of proxy models that we are talking about, the prediction models.
00:11:55.100 | And the recent literature and experiments in this regard show that the bespoke models tend to outperform models that fall under the second category,
00:12:09.180 | which is foundation models pre-trained on large-scale corpora.
00:12:13.020 | But creating bespoke models requires large amounts of specialized training data that is either annotated or obtained experimentally in a lab,
00:12:22.940 | which is costly; most labs do not have those kinds of resources, and most of the data is also biased towards one side, which we'll talk about later.
00:12:33.900 | But yeah, the bespoke models already exist in a manner, so they tend to use those previous bespoke models as the reward signal.
00:12:47.340 | So, yeah: reasoning models trained using soft verifiers learn to generalize on out-of-distribution datasets and tasks, bypassing the need to train on experimental data,
00:12:56.620 | which is the final thing.
00:12:57.980 | They also have something known as a foundation model for transcriptomics, which uses the kind of
00:13:05.660 | biological knowledge that models have learned during pre-training runs, applies it to the task-specific questions,
00:13:16.140 | and is basically able to predict better.
00:13:21.100 | So, essentially, you take Qwen, which is a general model, and you get to rbio,
00:13:27.260 | which can learn to reason about tasks like perturbation, which was not really a part of its original training data,
00:13:34.460 | although we don't know what went into Qwen's training data.
00:13:37.820 | But there is a very likely chance that the model wasn't trained on these specific reasoning tasks,
00:13:44.380 | like reasoning in situations like perturbation.
00:13:48.380 | Then, of course, this is probably the summary of the entire paper.
00:13:56.220 | I mean, I'll just go through it once again before I come to the method.
00:14:00.700 | Essentially, the whole idea is that you're distilling knowledge from a biology world model into a reasoning model.
00:14:06.460 | So, essentially, they find that rbio can learn to reason about tasks like perturbation, which I just mentioned.
00:14:19.420 | Then, they also show that virtual cell models can be used to tutor rbio1 with soft verification,
00:14:28.060 | and the kind of knowledge that it learns from this is inherently transferable and generalizable.
00:14:35.180 | And, of course, multiple sources of verification can be composed together and become better.
00:14:40.700 | All of this they were doing without test-time chain of thought.
00:14:45.980 | All they did was use RL to train further. Think of something like Sonnet 3.5 and 3.6 before test-time compute came out, or Kimi's current version.
00:14:57.660 | So my opinion here, and I want to hear yours, Ankit, when we get to this section, this number four here, multiple sources of verification:
00:15:07.820 | I haven't seen a paper that uses the technique they use here to combine verifiers, and I think it's a really clever
00:15:15.100 | insight about GRPO, because they use this sort of group-relative trick, right?
00:15:23.020 | So, basically, you're kind of normalizing to your relative performance within the group, so that you can combine signals
00:15:33.500 | from multiple verifiers, and they're automatically normalized to the same scale. And so, I don't know,
00:15:40.620 | I feel like that's maybe the algorithmic nugget from this paper, and they really don't do a good job of explaining it.
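A minimal sketch of that idea as I read it: each verifier's rewards are normalized against their own group statistics before summing, so a hard 0/1 signal and a soft probability land on the same scale. The names and the exact composition are my assumptions, not the paper's.

```python
import numpy as np

def combined_group_advantages(rewards_by_verifier: dict[str, list[float]]) -> np.ndarray:
    """For one GRPO group of completions, normalize each verifier's
    rewards group-relatively ((r - mean) / std), then sum across
    verifiers to get a single advantage per completion."""
    total = None
    for rewards in rewards_by_verifier.values():
        r = np.asarray(rewards, dtype=float)
        adv = (r - r.mean()) / (r.std() + 1e-8)  # group-relative scale
        total = adv if total is None else total + adv
    return total

# Example: four completions scored by a hard experimental verifier (0/1)
# and a soft virtual-cell-model verifier (probabilities).
advantages = combined_group_advantages({
    "experimental": [1.0, 0.0, 0.0, 1.0],
    "vcm_soft":     [0.9, 0.2, 0.4, 0.7],
})
```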
00:15:47.980 | So, yeah, I mean, the paper is more focused on the bio topics; I was reading it more from the
00:15:54.940 | algorithmic side of it. And I do think that it makes sense that the more reward signal you can
00:16:01.340 | provide to a model, the better it's going to perform, right? I mean, we were just debating, based on the
00:16:07.260 | Thinking Machines blog, how even a few bits of information in a reward signal work pretty well.
00:16:14.140 | So, on paper, at least in theory, it makes sense. But how you do it, and whether the signals are
00:16:20.460 | confusing, whether the credit assignment is right or wrong, is what probably ends up making a difference.
00:16:28.060 | Yeah, I mean, we saw with the DeepSeek paper, whenever we read that recently, and also the previous
00:16:35.580 | DeepSeek one, that they tend to use rollouts, right? They train these specialized models for each domain,
00:16:45.580 | and then roll them out, like, take the best rollouts from those
00:16:56.220 | specialized models, and then just do regular pre-training on that. And I thought,
00:17:04.940 | or post-training, I guess. So I found this an interesting contrast to that, where here we're
00:17:10.940 | saying we're just going to throw all the different reward models together into one giant reward
00:17:17.420 | function that you pick from on the fly, or something like that. And so,
00:17:25.100 | yeah, anyway, that's what I really took away from this paper on the algorithmic side.
00:17:29.420 | So, of course, they use GRPO, which is the first method. They have a clipped surrogate objective
00:17:39.660 | with an epsilon; the clipping means they never fully trust a single update.
00:17:45.260 | This is the typical GRPO math, with the clipping as well,
00:17:51.660 | and essentially you end up with a reward which is just based on the probability that the verifier
00:17:58.140 | assigns to the given gene interaction.
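For reference, the GRPO clipped objective as it is usually written in the GRPO literature; this is the standard form, not copied verbatim from this paper:

```latex
J(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)\right],
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\,o_{i,<t})},
\qquad
\hat{A}_i = \frac{R_i - \operatorname{mean}(R_1,\dots,R_G)}{\operatorname{std}(R_1,\dots,R_G)}
```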
00:18:05.580 | So this is the whole idea around it. You have a verifier; in the typical GRPO
00:18:13.340 | notation, the probability of the output given the question q and the verifier v. Then you have hard verification, where the reward comes directly
00:18:19.500 | from experiments: it's simply one if the prediction is true and the experimental result is true,
00:18:25.500 | one if the prediction is false and the experimental result is false as well, and zero otherwise. That's typically the first
00:18:31.500 | method that you're supposed to use here. Then you have the soft verification
00:18:37.980 | that comes in using virtual cell models. So essentially, instead of going back to
00:18:44.700 | something like a one when it was true-and-true or false-and-false,
00:18:49.340 | they have a probability of whether it's going to happen or not. So essentially,
00:18:57.340 | it's not an experiment; it's just a prediction. And that's where the verification
00:19:03.100 | is soft: not an experiment, but a prediction model. So what you
00:19:13.820 | have is that they take the two genes in the question and basically pass them through a virtual cell model to
00:19:20.460 | predict whether the perturbation would happen or not.
00:19:27.100 | They also have pointwise mutual information scores around this. They basically take a log;
00:19:33.340 | there's a lot of math that goes on here. The whole idea is that you have to normalize this
00:19:37.900 | and bring it to a level where the reward signal is meaningful and does not outshine
00:19:42.540 | the reward signal from the hard verification.
00:19:45.580 | So that's there. And the positive label comes from the top 0.05 of interactions with the highest
00:19:53.580 | scores in their experiments. Now, the third method is using prior knowledge as verification:
00:20:00.220 | they use knowledge sources and use those to assign a reward. That's another soft reward.
00:20:06.700 | That's just analyzing the chain of thought, the reasoning, and the output that is there,
00:20:12.060 | and assigning a smaller reward based on whether or not the model is on the right track,
00:20:17.420 | via ontology-based scores, ROUGE-based scores, keyword-based scores, and likelihood estimations.
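A rough sketch of what keyword- and ROUGE-style scoring of a reasoning trace could look like. This is purely illustrative, with hypothetical term lists; the paper's exact scoring functions aren't spelled out here.

```python
def keyword_reward(reasoning: str, expected_terms: list[str]) -> float:
    """Fraction of expected domain keywords that appear in the trace."""
    text = reasoning.lower()
    hits = sum(term.lower() in text for term in expected_terms)
    return hits / max(len(expected_terms), 1)

def rouge1_recall(reasoning: str, reference: str) -> float:
    """ROUGE-1-style unigram recall of a knowledge-base reference
    sentence against the reasoning trace."""
    trace_tokens = set(reasoning.lower().split())
    ref_tokens = reference.lower().split()
    return sum(tok in trace_tokens for tok in ref_tokens) / max(len(ref_tokens), 1)
```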
00:20:23.500 | So those are the different methods that they use. Essentially, their entire
00:20:29.900 | reward, and I'm going to skip through all the math,
00:20:36.220 | is the summation of all three methods. And that's
00:20:43.340 | the final thing that they backpropagate, with whichever method is applicable;
00:20:48.220 | if the VCM cannot give you a result, then you don't use that term.
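So, in my own notation (the indicators just mark which verifiers are applicable to a given question), the composed reward reads roughly as:

```latex
R(o \mid q) \;=\; \mathbb{1}_{\text{exp}}\,R_{\text{exp}}(o \mid q)
\;+\; \mathbb{1}_{\text{vcm}}\,R_{\text{vcm}}(o \mid q)
\;+\; \mathbb{1}_{\text{know}}\,R_{\text{know}}(o \mid q)
```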
00:20:52.540 | So I was kind of
00:20:55.900 | baffled why they used ROUGE score and keyword search there, rather than
00:21:04.540 | using LLM-as-a-judge.
00:21:07.340 | I don't know. Yeah, I think they could have done that; it's just that they went with one method and
00:21:13.740 | they thought it was working. The other side is that, in this case, the reasoning
00:21:19.900 | chains would have a lot of very specific keywords based on the subject matter, so it's an easier
00:21:26.620 | score to use for that. But for a generic model, that probably won't work; you'd need stronger methods.
00:21:33.260 | Yeah. I also thought this last one here, the likelihood-based verifier: I don't know, I didn't
00:21:38.700 | quite understand it. Did you? My interpretation was to look at the sort of log-likelihood of the ground truth, or, sorry, the log-likelihood of the knowledge source under that model.
00:22:01.260 | And they use that as a score. Is that how you interpreted it?
00:22:04.540 | I interpreted it most simply; of course, they take a log and everything. But these likelihood-
00:22:13.100 | based verifiers are based on this thing where they're essentially estimating it as well, like
00:22:20.780 | I was talking about with the ROUGE-based verifiers and others, right? Ultimately. Yeah.
00:22:26.060 | Yeah. The other ones I understand, but the log one, I don't understand how
00:22:31.900 | they calculate that. Yeah. So, I mean, they didn't go through the method that well, but the whole
00:22:37.500 | idea here is: can this trajectory lead to the right answer, what is the estimated likelihood that this
00:22:43.820 | trajectory, if continued further, or maybe with some more information, could have led to the right
00:22:49.660 | answer. It's their own model here. So, like, here, I mean, I've read this, but yeah,
00:23:00.220 | just give me a minute. Yeah. So essentially, one of the versions that they end up doing is exactly
00:23:07.820 | this, where they essentially have a... I'll have to get back to you on this, but at this
00:23:17.260 | point, what I remember was: hey, it's not just keyword-based matching and ROUGE. It's
00:23:24.060 | also that they're looking at a trajectory and estimating whether it can reach the right answer.
00:23:28.220 | And that's the log-likelihood. Now, how they did it is something I'm not
00:23:36.620 | a hundred percent sure about. But yeah. Okay. Yeah. Cool.
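For what it's worth, one plausible reading of a likelihood-based verifier, and this is a guess rather than the paper's documented method, is a length-normalized log-probability of a knowledge statement under a language model, conditioned on the reasoning trace:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def likelihood_score(model_name: str, reasoning: str, statement: str) -> float:
    """Mean log-probability of `statement` tokens given the reasoning
    trace as context; higher means the trace supports the statement.
    (Sketch: ignores tokenizer boundary effects at the seam.)"""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prefix_len = tok(reasoning, return_tensors="pt").input_ids.shape[1]
    full = tok(reasoning + " " + statement, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full).logits
    targets = full[0, prefix_len:]                        # statement tokens
    logprobs = torch.log_softmax(logits[0, prefix_len - 1:-1], dim=-1)
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    return token_lp.mean().item()
```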
00:23:43.420 | So that's that. Then, yeah, they do have a normalization for the soft scores.
00:23:51.260 | One part is a threshold that they calculate. I mean,
00:23:57.660 | it's a simple normalization: if the score is below the threshold, it maps into the range below 0.5,
00:24:03.980 | and if it's above, you do it the same way, 0.5 plus the scaled remainder. So there is a normalization
00:24:12.300 | that happens, and that's probably part of what helps this model. But yeah, there is not a very strong
00:24:19.340 | reasoning as to why they did this normalization. They just said, hey, normalization is good,
00:24:23.500 | let's do it. But I don't see a very good reason as to why this normalization even exists,
00:24:31.580 | or why the threshold for the MLP is just 1.5.
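My reading of that scheme, as a sketch; the `lo`/`hi` score bounds and the example range are my assumptions, with the 1.5 threshold taken from the discussion above:

```python
def normalize_soft_score(score: float, threshold: float,
                         lo: float, hi: float) -> float:
    """Map a raw verifier score in [lo, hi] into [0, 1] so that the
    calibrated threshold lands exactly at 0.5: below-threshold scores
    fall in [0, 0.5), above-threshold scores in [0.5, 1]."""
    if score < threshold:
        return 0.5 * (score - lo) / (threshold - lo)
    return 0.5 + 0.5 * (score - threshold) / (hi - threshold)

# e.g., a raw MLP score in a hypothetical [0, 3] range, threshold 1.5:
reward = normalize_soft_score(2.1, threshold=1.5, lo=0.0, hi=3.0)
```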
00:24:31.580 | Yeah, I didn't understand that section very well either.
00:24:41.980 | Yeah. And they are also very clear about that: because their dataset is biased,
00:24:48.380 | what they want to look at is TPR instead of TNR or anything else.
00:24:54.620 | That is the metric that matters, because the datasets are heavily imbalanced between positives and negatives,
00:25:00.300 | so the TPR is the one. So that's another thing here. And that's probably also introduced
00:25:07.820 | because in one of the experiments the TNR was going to essentially 1.0 in the base
00:25:14.860 | model. So I'll come to that. So basically, the whole idea is that you distill the
00:25:23.260 | knowledge from a model of biology into reasoning models through soft verification, the same
00:25:30.220 | thing that they have been talking about for the last two sections. It is a good mechanism; it is
00:25:34.940 | probably a mechanism which can work. But just coming to the numbers first: of course, this is again a
00:25:40.380 | summary of the verifiers they had, in terms of the example prompts and the questions
00:25:47.740 | that they are asking. One is experimental data. A couple of them are an MLP-based perturbation model and
00:25:52.780 | the Transcriptformer foundation model; both of those come from VCMs. Then one comes from the
00:25:59.260 | knowledge base. Again, repeating the same thing, but we've covered that. So, I don't know if you
00:26:06.140 | picked this up from the paper: did you see how they actually translate it? Like, you have some
00:26:13.020 | questions. First of all, how do you generate those questions? And then, how do you
00:26:17.820 | translate that into something that the verifier can consume? Right? Like, I understand
00:26:24.860 | what goes into the verifier. I understand how to template a question for it,
00:26:30.460 | but did they just take gene pairs and then feed them into the template? Because that
00:26:36.220 | doesn't tell me... okay, I mean, I guess as long as you're persuaded that the
00:26:42.380 | LLM will be able to… So they're looking at human-level reasoning there. The data-
00:26:49.980 | sets they created were something like leave-one-out, in both the experimental and MLP cases, and that's
00:26:58.060 | how they were doing it. But yeah, initially the whole idea is: if I take this gene and this gene
00:27:05.340 | interaction, if I knock this gene down, what happens next? Okay. So they just generated a bunch of
00:27:13.180 | questions; they picked random gene pairs or something like that and generated a bunch of questions
00:27:18.060 | about them. Yeah. So there are literally four or five genes that they start with.
00:27:23.340 | There are not that many that they start with. Oh, I see. Yeah, I didn't catch that,
00:27:32.540 | that they were only looking at a few genes. I would expect you to go over thousands.
00:27:37.820 | Oh, I think they're only looking at a few specific genes; I'll confirm that. So initially, for training,
00:27:44.220 | that was there. For example, when you're talking about prompts, right,
00:27:51.260 | likely the prompts are almost the same, but somewhat modified based on which
00:27:56.700 | foundation model you are getting a reward from. So they are almost the same, but yeah, they do not go through a lot of genes there.
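My guess at what the templating might look like; the gene names and the wording here are made up for illustration, not taken from the paper:

```python
GENES = ["TP53", "MYC", "KRAS", "EGFR"]  # hypothetical example genes

TEMPLATE = ("If {a} is knocked down in this cell line, will {b} show "
            "differential expression? Answer yes or no.")

questions = [TEMPLATE.format(a=a, b=b)
             for a in GENES for b in GENES if a != b]
```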
00:28:03.180 | Yeah. So, okay, here:
00:28:13.660 | basically, what they call out-of-distribution is data outside the training dataset, which the model hasn't seen.
00:28:20.460 | It's not, like, out-of-domain generalization; it's just out of distribution.
00:28:26.060 | So it "learns to generalize out of distribution" in the sense that the problems are the same kind, just unseen.
00:28:31.020 | So they have an MLP, which they train for soft verification,
00:28:39.660 | to predict the differential expression. This is what they're using as a reward, as the soft
00:28:44.220 | verification reward model via a VCM. And then you have one-hot encodings and gene embeddings;
00:28:51.820 | both of them are used in different ways, in different ablations, to get different
00:28:58.380 | signals. So, with that, yeah, they compared it with
00:29:07.660 | experimental data on one cell line, which is in distribution:
00:29:17.900 | test and train splits are from the same cell line. I'll come to this, because it's an interesting
00:29:24.060 | graph and the numbers are also interesting. So essentially it's one cell line, and
00:29:29.020 | they start by getting gene pairs from there. Then they leave one part out and test on that one
00:29:35.340 | part. And the same here: they leave one out and then they test on it.
00:29:41.500 | One setup is basically training and testing on the same dataset. Then,
00:29:47.900 | they have four genes; they leave one out, test on that one gene, and train on the other three.
00:29:52.300 | The same goes for the MLP as well. So, in terms of results, yeah, you have
00:30:01.740 | the one-cell-line experimental and MLP variants. Then you also have one-hot,
00:30:08.460 | and these are essentially the different verifier variants. So what they show
00:30:15.980 | is: Summer is their original model. Yeah, Summer is a model trained on experimental data
00:30:23.820 | and detailed biological knowledge in the loop. GEARS is a state-of-the-art specialized
00:30:28.380 | perturbation prediction model trained solely on experimental data. And the base reasoning model
00:30:34.060 | is Qwen 2.5 3B. So essentially, with the numbers that they have, the dataset is essentially
00:30:41.740 | the same, and here, this is not that perturbation dataset. Sorry.
00:30:48.380 | Yeah, the TPR is what they look at mostly. It is getting to 82. The rbio experimental
00:30:55.820 | one-cell-line model works really well because it's trained on the test distribution. But other than that, Summer:
00:31:00.460 | it doesn't perform as well. Their MLP variant performs very well, with gene embeddings as well
00:31:07.020 | as one-hot. Then GEARS, and Qwen 2.5, the base model, aren't performing that well either. So there is
00:31:14.540 | a big jump that they see here, especially in the true positive rate. And on TNR, the one that performs
00:31:22.940 | best is GEARS, because it has a very low positive rate anyway. But just to be clear, I don't
00:31:31.980 | think they trained on the test set. It was in distribution, right? It was in distribution.
00:31:38.060 | Yeah, just to be clear: they didn't leave it out completely. That's what I meant.
00:31:42.220 | Yeah. Yeah. Yeah.
00:31:46.300 | So essentially it's a method of teaching the model how to predict perturbations based on some
00:31:57.180 | sophisticated models that we have already built. A real-world parallel could be RLHF;
00:32:02.220 | that could be a good parallel if they had done it slightly differently. The whole
00:32:08.620 | idea there is that you have a bunch of text where you have to predict what a human would predict,
00:32:15.980 | or rather, which output a human would prefer. In this case, they're pretty much doing the same thing,
00:32:21.580 | but they have a lot of sophisticated models for what is going to result in a perturbation, and
00:32:27.340 | they're teaching a generic LLM to understand that and basically be very good at that task.
00:32:33.180 | So a real takeaway for me here is that small models are also very good students. They can learn
00:32:40.860 | almost any world model from anywhere, which may or may not be their own world model, but they can learn
00:32:46.220 | it, and they get very good at a task just by learning from a specific world model in any domain.
00:32:55.660 | So, of course, the models exhibit transfer
00:33:02.540 | to perturbation, which is what the results show. Then the next part is, obviously, that
00:33:08.700 | when you combine different verifiers, it improves generalization even further.
00:33:12.380 | That's the next figure, not this one. Yeah, I'll just come to this figure. They essentially
00:33:19.660 | combined all the sources of verification, and then they showed the numbers went even further up.
00:33:24.700 | So yeah, you have experimental, then you have just the direct ones, and it went even further.
00:33:35.500 | So coming back to that. Yeah.
00:33:37.020 | But isn't it disappointing that they didn't beat Summer, right? I don't think they did in any of those cases,
00:33:45.580 | or very few of them. No, no, sorry. In the... yeah, this one.
00:33:51.340 | This one, yeah. I mean, yeah, Summer is almost as good, but they do beat it on the experimental
00:33:57.980 | all-cell-lines setting. They don't beat it otherwise. And if you're looking at TPR, they're not beating it by much.
00:34:03.980 | But then, the amount of compute that they use is very low compared to the original model,
00:34:08.940 | so it's probably getting there. We are probably going to get to a point where we are able to
00:34:13.340 | maybe outperform them, or at least get the same kind of logic that we're
00:34:19.900 | getting from the specific domain models here. Yeah, I'm not very sure about this claim, I'll
00:34:26.860 | be honest, but "verification sources keep improving performance" is their essential claim,
00:34:32.300 | and they also go to another figure and show that. But yeah, that's there. Coming back to
00:34:39.580 | one of the other results: this is a model trained on experimental data and soft verification using VCMs,
00:34:46.620 | aggregate model performance, and then the metrics. So again, the same kind of results here: they're
00:34:53.980 | beating Summer comprehensively, but not on TNR. The TNR is always the one where others are
00:34:59.340 | winning. But in their own case, the red and the orange lines, the ones that they
00:35:05.420 | trained, almost always perform really well, except for the F1 score and MCC. But yeah, overall,
00:35:11.340 | they mostly talk about TPR, because they're getting so many good results there.
00:35:16.860 | The other part, I mean, the amazing part here, is obviously how much they outperformed the base
00:35:25.980 | model, because the base model is pretty dumb about these topics and they're still able to raise the
00:35:30.620 | performance there. Which is quite fascinating to me, because if it can happen in a very specialized
00:35:36.220 | domain like biology, it can definitely happen in a very different domain, as long as you have a
00:35:41.260 | prediction model or a surrogate model for that domain. I mean, provided the results hold up at a higher
00:35:50.060 | scale, obviously. So that's... yeah, those are the additional verifiers.
00:36:01.420 | You have the base model on TPR. You also have the one which is just the experimental model. Then you see
00:36:09.340 | how close the other models can get without using the experimental path.
00:36:14.620 | This one actually shows the experimental path. This was the one that I was so disappointed
00:36:21.820 | in, right? Like, yeah, I feel like this is an incomplete result, because I think that with
00:36:30.300 | better training methodology and dynamics, they should
00:36:37.660 | be able to improve over the experimental data, as long as you're looking out of distribution.
00:36:46.380 | Right? So, I think what they're saying here is that if you have in-distribution experimental
00:36:51.980 | data, that's the gold standard. And I can perfectly accept that you're maybe not going to get to that
00:36:59.020 | with a world model, but you can get close, and the other parts of the paper show that. But I was really
00:37:04.460 | disappointed to see that the performance goes down when you add more
00:37:10.380 | data. And I think that there's some evidence, and I'd have to think through what it was again, but I
00:37:17.020 | see some evidence that they should be able to improve on that, so that it at least stays as good as the gold
00:37:22.540 | standard, if not gets better. Makes sense. Okay. So again: the effect of validation on verifiers.
00:37:31.020 | They keep doing the same kind of thing there. So essentially,
00:37:40.220 | this whole chart is around that. Then: a lot of this was done without CoT.
00:37:48.300 | Like, they just gave a prompt, generated a trajectory, and backpropagated,
00:37:54.460 | but there was no test-time compute or CoT happening in this case. So then they just added this line to the
00:38:01.260 | system prompt, "a biologist will evaluate each step of the problem using logical reasoning and evidence
00:38:06.380 | from the prompt," and essentially moved towards a test-time-compute-based approach. And of course,
00:38:12.140 | the performance jumped even further here. Like... I think this is the CoT one. Yeah.
00:38:20.460 | So this is the system prompt, this is a conversation, and then you have to use
00:38:26.300 | <think> and </think>; essentially that's the system prompt. And then it ended up with even better
00:38:33.180 | performance. But I think that's a generic thing, where chain of thought essentially works even better.
00:38:40.860 | Like, you have something like a 0.88 here, 0.91 here. Right.
00:38:45.180 | And this line essentially improved the performance further compared to the previous one. Sorry, my bad.
00:38:55.420 | Yeah. Oh yeah. Okay. Right. So, like, I would have...
00:38:59.740 | yeah, I would have liked to have seen
00:39:07.100 | the non-chain-of-thought and the chain-of-thought results side by side here.
00:39:10.380 | Yeah, but I mean, you have to go up and down. Oh, I see. No, it is there. No, no,
00:39:17.420 | it's not there. So, two things here. They do have one before it, which wasn't chain of thought.
00:39:24.220 | Yeah, here. Then they have one with chain of thought prompted. Yeah. Then they added another
00:39:31.100 | line, which is the one where "the biologist will evaluate each step of the problem, using logical reasoning
00:39:36.300 | and evidence from the prompt." Yeah. No, I can see that it's better. Okay. Got it.
00:39:42.460 | It's just slightly better. Just that. Yeah. Okay. Got it.
00:39:45.420 | And anyway, approaching scores of 0.9, I mean, at that point, improving even a percentage
00:39:53.340 | point takes a lot of effort. So probably they also needed a better dataset, since they weren't
00:39:58.940 | outperforming the baseline there. But then you also see that, hey, the base CoT model on MCC is at
00:40:06.060 | 0.2 and theirs is at 0.49, so there is a big jump that happens. It's just that, I mean,
00:40:13.420 | their dataset could have been the bottleneck. Yeah.
00:40:15.980 | That begs the question, though: if it's a 3B model, how much of the improvement would you get,
00:40:24.460 | like, could you just replace the experimental verifier with, you know, a larger model and do the same?
00:40:35.340 | Because presumably whatever god model is trained on all sorts of, you know, papers about
00:40:42.940 | genetics and the entire ontology database and things like that.
00:40:47.100 | Yeah, I think there's a lot to build on top of this, both in the biology domain and in generic
00:40:54.540 | domains, because I'm trying to replicate the same thing for a math-domain task. And it turns out
00:41:00.940 | that DeepSeek has already done something like that before, using a surrogate model to predict
00:41:05.500 | whether or not a proof would hold up. But yeah, if I have results, I'll share them.
00:41:10.540 | So essentially, now they're looking at answering questions completely out of distribution,
00:41:17.260 | like Alzheimer's and everything. And for the qualitative evaluation (they haven't done this
00:41:23.420 | experiment in a quantitative manner), the finding is that, hey,
00:41:28.220 | it basically generates good and consistent reasoning with very few scientific inaccuracies.
00:41:36.780 | So they trained for 800K steps, taking 10 days of GPU time,
00:41:44.540 | which is not a lot of money. Like, it's not even $6,000.
00:41:50.540 | So you can really teach a very small and, frankly, dumb model like Qwen 2.5 3B
00:42:00.620 | a totally different domain's world model at that cost. And probably there is a lot more you can
00:42:06.220 | scale. If it scales, there's a lot more you can do.
00:42:12.780 | And of course they used Hugging Face TRL for the reinforcement learning. So I don't know if there
00:42:18.300 | are other methods, REINFORCE-style variants or other frameworks, that they could use; that's
00:42:23.340 | probably going to give them a bit of a boost as well, because the credit assignment there is not as great
00:42:29.260 | as others'. Cool. So, this is the paper. They have a lot of prompts further down as well,
00:42:39.660 | and around these ideas there is a lot of work that has happened previously as well.
00:42:45.340 | One thing I wanted to highlight was... yeah, not this one, sorry. Yeah: surrogate models have
00:42:54.700 | been used previously as well, and they do cite that. I wanted to highlight that part,
00:42:59.740 | but I'll have to search for it. Cool. So that's the paper.
00:43:08.540 | Uh, yeah. Questions?
00:43:10.540 | I mean, I've talked a lot. My overall impression here is, I mean, it is a
00:43:23.900 | preprint, so that's totally fine, but I think it obscures a lot with jargon, and the math
00:43:35.660 | could be more clear. I think my takeaways from the paper were: it's an interesting
00:43:43.980 | technique for combining different soft verifiers. And I think this is particularly ripe for
00:43:49.900 | biology, because there are all these bespoke models out there. And I'm guessing that there are
00:43:55.820 | probably other domains, in science in particular, that have a similar dynamic, like chemistry or,
00:44:01.820 | I don't know, physics, in which it's very expensive to do an experiment. And so maybe
00:44:07.660 | you can use some sort of model as a soft verifier for the reasoning process.
00:44:16.140 | And I think it's an interesting way of thinking about building a multimodal model.
00:44:21.180 | I was disappointed, and I think this is the most important thing, I was
00:44:29.420 | disappointed to see that they were saying that using these verifiers in distribution
00:44:37.580 | lowers your performance. And I think it would be good if,
00:44:44.940 | like, I would like to see an experiment in this paper where, first of all, they just fix that,
00:44:50.220 | but also, it seems to me like you should be able to use a small in-distribution training set plus
00:44:58.220 | verifiers to create, you know, sort of a model that performs as well as
00:45:07.980 | the experimental data in distribution, and better than it out of distribution.
00:45:17.900 | Yeah, I do think that's the case, but I also think that they set out on a very
00:45:22.540 | narrow problem statement. It should generalize, but this is also like the first step in that
00:45:28.300 | experiment, whether they can really codify world models into an LLM, and that's why the dataset was
00:45:35.660 | narrow, the problem statement was narrow, and the model was small. The technique, if it works,
00:45:42.300 | is probably a good thing, simply because right now, where we are at with reward models, we are
00:45:48.140 | kind of assuming static priors. These kinds of surrogate models work better as reward
00:45:54.140 | models simply because you can keep capturing the changing priors that people have. It could be as
00:45:59.500 | simple as: hey, what is the probability that this business plan succeeds (a random example, of course), given
00:46:05.020 | thousands of previous examples? You can pretty much use that
00:46:10.780 | as a predictive model for something. And we have used predictive analytics in the past,
00:46:15.500 | just not with an LLM. Those are probably the domains where this can work as well; it's just that you need
00:46:21.420 | access to that kind of data to be able to build out the predictive model first,
00:46:26.860 | build out that soft verifier first, and then go for something like this.
00:46:30.620 | Yeah, I agree. So, as a stepping stone towards a model that you're able to
00:46:39.900 | converse with generally about biology data, and have it reason through complex questions, I think it's a
00:46:47.260 | good stepping stone in that direction. I agree.
00:46:49.180 | So, I came in very late, around five minutes ago. Maybe my
00:47:01.500 | question might be good for anybody else who came in a little late. But I guess, from
00:47:08.380 | what I'm gathering here, the model that the paper authors trained was
00:47:16.700 | one that was given a certain specific domain of biological knowledge
00:47:23.900 | that was contextualized by biological models. And there was kind of a surprise at the model's
00:47:33.180 | ability to generalize, to the extent it did, toward those mathematical models that contextualize
00:47:39.740 | that domain space, and at its ability to respond in and interact with that environment, along
00:47:46.300 | with some reinforcement learning. Is that kind of a solid approximation?
00:47:51.340 | Yeah, I mean, that's pretty much it: you have a base model which is not trained on anything
00:47:57.260 | in biology, and it is able to compete with very specific, task-specific models in bio that
00:48:05.100 | were made through a lot of experimentation and collection of a lot of data.
00:48:09.500 | Right. So, yeah, that's what you're getting at. And it's especially a very small model that is able to
00:48:16.540 | compete with probably the bigger foundation models, though not LLMs in that
00:48:21.340 | regard. So, you know, an interesting thing for me is that what this paper shows kind of extends
00:48:29.340 | outside of the domain of biology to the domain of any scientific theory that can
00:48:35.660 | be parameterized by models that describe the fields of the space that said
00:48:44.220 | machine learning model would be attempting to generalize toward, you know. I'd be interested:
00:48:53.500 | were all of the biological models that contextualize the space
00:48:59.900 | differential equations? Or were they, I guess, all in a similar mathematical
00:49:09.020 | form, or did they extend out to different mathematical forms? I think by mathematical forms you are
00:49:18.940 | looking at MLPs. Differential equations. I'm talking about the biological models that contextualize the
00:49:26.140 | space that the machine learning model was generalizing toward. Like, yeah, I assume most biological
00:49:31.980 | models are differential equations, right? I'm not very sure about that. Okay. No, no, I think
00:49:40.380 | some are, for sure, but it looked to me like the ones that they were using were
00:49:47.020 | different types of neural networks, from simple multi-layer perceptrons to transformer-based
00:49:53.420 | models. So the verifiers themselves were also neural networks.
00:50:03.340 | Okay. Okay. Yeah. And I think, if you think about the mathematical models of these things,
00:50:08.860 | there are definitely differential-equation-based ones, but some of them are not, right? Okay.
00:50:13.740 | So, just to answer the question from Peter: I came across this paper because I was looking for
00:50:24.780 | different verification techniques (I was writing a blog post about it), and I searched on o3 and it surfaced
00:50:31.180 | this as an interesting thing. And I think that was somewhere around August, when the paper had just come
00:50:36.380 | out. I read through it and I understood that, hey, if this generalizes and the techniques hold up, this can
00:50:42.060 | be far more reaching than just this domain. So I thought, let's pick it up. And it took me two months
00:50:48.380 | to pick it up.
00:50:49.100 | Wait, wait, wait, wait. What? Tell me more about this blog post on verification.
00:50:55.900 | How... yeah, what's going on there?
00:50:58.700 | So essentially it's a blog post around environments, but the key question in an environment
00:51:05.260 | is: how do you verify the task has been done? So the whole idea here is that, at some point,
00:51:12.060 | everyone is assuming that the priors will stay the same, and that's where you can have a very
00:51:16.220 | static reward function that can just work. But that's not going to be the case, because
00:51:21.980 | for humans as well, priors change with evolution. I'll send you the link as well; it's a long
00:51:26.940 | blog post, I already put it in the chat. But yeah, in short, the idea here is that if you assume
00:51:33.260 | that RL is basically generalizing the priors, instead of, you know, hill-climbing in an iterative manner,
00:51:38.540 | say it is generalizing the priors, then the model needs to have priors. The priors can be coached
00:51:45.420 | in through, one, generating enough training data. But this is the second idea, where
00:51:51.420 | they can be coached in by just giving a very specific domain reward model, or a surrogate reward
00:51:58.860 | model as they are called in certain cases. And then there are cases where that is not going to be enough. For example,
00:52:03.900 | when you are buying on Amazon, or someone else is buying on Amazon, that's a very personalized,
00:52:09.180 | preference-model scenario, where I don't know how to train it today. But that's the third kind of
00:52:15.500 | reward model that these LLMs would need, because you can't have a very generic static prior for, hey,
00:52:20.940 | can an LLM buy, say, a jacket on Amazon. You want something which takes into account preferences of
00:52:28.220 | different people, and it cannot just be verbalized preferences. You need a sort of recommendation
00:52:32.140 | system there. And that's the kind of trajectory we are moving towards; it's just that we are not there
00:52:38.620 | yet, and I don't know what it looks like. The middle part here is the surrogate aspect,
00:52:43.500 | where you model a real-world thing into a prediction model, and then you end up in a place where
00:52:50.780 | you are able to at least codify most of it into an LLM that can generalize.
00:52:56.300 | Yeah, it's very confusing, actually. I'm looking forward to this post.
00:53:02.220 | Yeah. We tried... yeah, I'm pro more recommendation systems
00:53:13.500 | in our pipelines in general. Your comment made me realize, actually, that these systems are a lot closer than
00:53:21.500 | I thought. Yeah, I usually put them in a different mental bucket, but maybe I should
00:53:28.460 | not. Yeah. But the architecture point that needs to be solved is: how do you have different
00:53:34.860 | reward models for different preferences in the same model? That's probably something
00:53:42.140 | DeepMind has a solution for with their reward heads, but nothing much has come out. I've seen a couple of
00:53:47.660 | papers around it. That's probably one way of doing it, but yeah, we don't know. And these labs are not
00:53:53.580 | really publishing those kinds of things anymore. I mean, for all we know, Anthropic
00:53:59.580 | would have discovered it a year ago as well. No one has published anything. So we don't know,
00:54:05.660 | unless we know the people who have done it.
00:54:10.860 | There are some comments in the chat.
00:54:14.300 | I think they went on a discussion. I'm just confused as to what that discussion is about.
00:54:26.060 | Yeah. Yeah. See, maybe CJ, you want to jump in?
00:54:35.660 | Oh, I was just responding to what I understood somebody to be saying, not directly. Yeah.
00:54:41.980 | Yeah. But kind of just reaffirming how you guys explained to me that the models that were
00:54:49.340 | verifying the model being tested are neural networks. And I was contrasting that with how the
00:54:56.220 | laws of a system might be expressed through differential equations, which is why I assumed it was
00:55:01.660 | differential equations that were contextualizing the system. But I understand what you guys mean:
00:55:06.300 | no, these other neural networks, these other transformers, are acting as direct verification for
00:55:14.060 | the model being tested, this transformer. Yeah. Correct. Yeah.
00:55:20.060 | Awesome. Well, I think that we are out of time. Sean, do we have a paper for next...
00:55:31.580 | we have a paper next week already, right? Dude, as you can clearly tell,
00:55:36.780 | I have trouble keeping track. The source of truth is in Discord; we'll coordinate in
00:55:42.460 | Discord. But Ankit, I'm very impressed by this coverage. I'm, you know,
00:55:46.620 | really looking forward to hearing more from you. This is the spirit of the paper club:
00:55:50.780 | discussing stuff like this.
00:55:54.300 | Yeah. Yeah. I'm really, really glad you picked this paper.
00:55:59.500 | And it's different. I hope this does generalize, because when we have something like this, we can build
00:56:05.740 | startups on it. Essentially, yeah, it will generalize if the theory works,
00:56:13.100 | and people have been trying to put this into practice, but so far we haven't seen much good stuff in the world
00:56:22.380 | model space. But, like, you know, Meta also did this with their code world model version.
00:56:26.780 | Yeah. Yeah. Maybe it'll work this time. I don't know.
00:56:32.060 | Let's see.
00:56:37.340 | Awesome. All right. Thank you guys. Thank you. Later guys.
00:56:41.180 | Bye. Bye. Thank you.