
Yeah, go ahead. Awesome. Okay. Hey, guys. My name is Ankit. I work as an RL researcher. I have my own startup; I'll talk about that later. But yeah, presenting this paper, which is basically an idea where you can use something like a soft verifier instead of hard verification. In bio, this becomes an important problem because hard verification means going to a lab, actually running the experiment, and only then getting the reward to backpropagate to the model.
So instead of that, you already have a lot of real-world data in bio. They trained a sort of surrogate model, which calculates the probability of how two genes are going to interact with each other, and which goes up and which goes down, and they essentially use that surrogate prediction
as a reward signal to backpropagate into the model. They layer more techniques on top of that and find some encouraging results, but that's the crux of the method. So, yeah, to start with, they do not have access to exact rules facilitating formal verification, right? Testing hypotheses in a lab is the only real way to do it.
Real experiments are expensive and do not scale with computation. So this is where you have approximate oracles with prior knowledge, which use a prediction model instead of, you know, going with experimental data, and they use RL on top of that. That's what I explained before. And the result here is that soft verification essentially distills biology world models into an LLM and can achieve performance on the leading benchmarks comparable to state-of-the-art models.
And essentially, they also combine verifiers with chain of thought, like verifying the chain-of-thought reasoning and providing a reward if the reasoning is correct, and that improves the performance further. So, yeah, at this point you have a lot of annotated data in the industry, but there is no way to use that data as such, because people just built a sort of model around it, and it did not really do anything outside of that.
So they want a way for people to be able to reason with it, to be able to utilize that knowledge and, you know, talk to it, as simple as that. So, yeah, there is something known as virtual cell models that have come up recently. Virtual cell models were, I think, promoted by the Arc Institute as well, where they are essentially creating a virtual model of the cell and a model of how it would interact with its environment.
Which shows that you don't need experimental data for all of this, and you can generate predictions for any cell state transition, such as a diseased to healthy state and vice versa. And each of these models targets a very specific modality, such as transcriptomics, imaging, genomics, or even language, but most of them are not trained on multiple modalities.
And that's why each one is a very specific model for a very specific task. You don't have a generic model that can do multiple things, and that's the goal of this paper. Here, you have to develop a method which allows for integration of world models into a common space, and they use language as the connecting modality.
The benefit is that it allows transformation of complex models into conversations and can engage users by distilling a biological model into an LLM. They're able to distill the knowledge derived from experimental data into natural language that enables interactive and human-readable dialogue, which can be seen as an alignment of a reasoning LLM to a biological model. The paper talks about a biology world model and an LLM a lot.
So the whole idea is that you have a lot of experiments in bio, and you want to end up in a place where you can put all of that into an LLM and then just interact with it. But yeah, it keeps talking about that for three to four pages, and they build on a lot of prior work aggregating world models that already exist in a representation space.
And then they use models of biology to train models which are just as performant and can bypass the need for lab-scale data generation. They mention it late in the paper: their base model is Qwen 2.5, at around four billion parameters. So everything that happens, happens at that few-billion-parameter level.
But yeah, the results are very interesting even at that level. They compare the results of the base model against how much they were able to improve it. And the core capability they are relying on is generalization, especially out of domain, as well as looking at unseen problems to see how the model handles them.
Generalization here is a bit loosely defined, though. Sometimes they say that if a model has not seen the test data, that is generalization; sometimes they say it's a novel problem, and that is generalization. So this is the core aspect here: they use something like a surrogate model, a reward formulated by a domain-specific biological model, where they calculate a probability that gives you an idea of how likely the outcome is.
To give you a simpler example, think of it as Isomorphic Labs generating a protein structure. For the reward, it can't be that you just go to a lab and test out that protein structure. You already have a lot of data, and an easier route is to have another model which predicts the stability of the protein structure.
If the predicted stability is 0.8, you pretty much use 0.8 as the reward. They call this soft verification throughout the paper, because it is not hard verification; hard verification in their language is going to a lab and testing things out. And they arrive at a model which they can talk to, which can reason and do things.
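To make that concrete in code: a minimal, hypothetical sketch of the soft-verifier idea, using the protein-stability example above. StabilityPredictor is a made-up stand-in, not a real library or the paper's model.

```python
# Hypothetical sketch of soft verification: a surrogate model's predicted
# probability is used directly as the RL reward, no lab experiment needed.

class StabilityPredictor:
    """Stand-in for a pretrained surrogate model (e.g., a stability predictor)."""
    def predict_stability(self, protein_sequence: str) -> float:
        # In reality this would run a trained model; here it's a fixed placeholder.
        return 0.8

def soft_reward(protein_sequence: str, surrogate: StabilityPredictor) -> float:
    # Soft verification: the surrogate's predicted stability *is* the reward.
    return surrogate.predict_stability(protein_sequence)

print(soft_reward("MKTAYIAKQR", StabilityPredictor()))  # -> 0.8
```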
Their entire experiment is on a narrow perturbation prediction setting: given a couple of genes, what happens to other genes if you knock down a gene in a given cell. So, yeah, in that sense it's quite narrow, but I do think it's going to generalize. Let me interrupt here.
I think just for the sake of people who aren't familiar with genetics, either you or I should describe what knockdown means. Do you want to do that? You should go ahead with that; I have a very basic understanding of it. Yeah, I mean, I don't have a really sophisticated one either, but basically it just means you use CRISPR or other techniques.
So CRISPR is a way that you can basically remove a gene from a genome and then look at the impact of that. That's called knockdown, just to be clear. Yeah, so typically you would look for the change in expression of other genes, or other phenotypic responses like, you know, disease or not disease, something like that.
Sorry, sorry to interrupt. Not an issue. So basically, this is pretty much the strategy that they use. They use GRPO throughout. One surprising thing was that they use GRPO even when they are already getting good trajectories, but still, it's GRPO throughout. So in the B part of the figure where you see training, you have prompts and completions, and they compute advantages from those.
Then there's the source of verification. There are some hard verifiers that they have used, but the ideas are: one, you have experimental data, and you backpropagate a hard verification reward to the LLM. Two, you have a virtual cell model that you use.
You get a soft answer from it, use a soft verifier based on that, get a reward, and backpropagate it to the LLM. The third part is that they created a lot of knowledge sources in biology. So when the reasoning LLM has a reasoning trace, what they end up doing is essentially creating a reward based on how much the reasoning trace corresponds to those knowledge sources.
They add this as an additional reward on top of either A or B and then backpropagate that as well. And the claim here is that this is what improves the model's performance a lot compared to the base model, and that it performs at a level comparable to the state of the art on the bio benchmarks.
Not to give a spoiler, but I think that this claim is a little bit off, because when you look at the beginning of the paper you think they're saying: oh, if we combine experimental data plus virtual cell model plus knowledge sources, then we get better than any of the three alone.
When in fact, the results all show that experimental data alone outperforms the virtual cell model and knowledge sources, even when the virtual cell model is added on top of experimental data, right? And I think that's a really disappointing result for me, and they don't really talk about it.
So I just wanted to be clear about the claim versus the reality here. Yeah, that's true. That was something that struck me as odd as well. But I do take it as: say you don't have enough resources for experimental data, or the experimentation is slow.
You can pretty much use a proxy and get to a level which is close to what an experimental benchmark would give you. Yeah, absolutely, it's not a total loss, but the way they write the paper makes you think it's the former when it was really the latter. Yeah, I guess that's their incentive, right?
So anyway, basically the whole idea becomes the knockdown setting we were just talking about, right? Is knocking down a gene in this particular cell line likely to result in differential expression of gene V? That is what we are predicting here, and that's the task. Earlier this was usually done through bespoke models trained on task-specific data, or foundation models pre-trained on large-scale corpora, which are the kind of proxy prediction models we are talking about.
And the recent literature and experiments show that bespoke models tend to outperform models in the second category, the foundation models pre-trained on large-scale corpora. But creating bespoke models requires a large amount of specialized training data that is either annotated or obtained experimentally in a lab, which is costly; most labs do not have that kind of resources, and most of the data is also biased towards one side, which we'll talk about later.
But yeah, the bespoke models already exist in a sense, so they use those existing bespoke models as the reward signal. So, yeah, reasoning models trained using soft verifiers learn to generalize on out-of-distribution datasets and tasks, bypassing the need to train on experimental data, which is the final claim.
They also use a foundation model for transcriptomics, which takes the kind of biological knowledge that models have learned during pre-training runs, applies it to the task-specific questions, and is basically able to predict better. So, essentially, you take Qwen, which is a general model, and you get to rBio, which can learn to reason about tasks like perturbation, which was not really a part of its original training data, although we don't know exactly what went into Qwen's training data.
But there is a very likely chance that the model wasn't trained on specific reasoning tasks like reasoning about perturbations. Then, of course, this figure is probably the summary of the entire paper; I'll just go through it once again before I come to the method.
Essentially, the whole idea is that you're distilling knowledge from a biology world model into a reasoning model. They find that rBio can learn to reason about tasks like perturbation, which I just mentioned. Then they also show that virtual cell models can be used to tutor rBio with soft verification, and the kind of knowledge it learns from this is inherently transferable and generalizable.
And, of course, multiple sources of verification can be composed together and do even better. For all of this, they were not using test-time chain of thought; all they did was use RL to train further. Think of something like Sonnet 3.5 and 3.6 before test-time compute came out, or Kimi's current version.
So my opinion here, and I want to hear yours, Ankit, when we get to this section, this number four here, multiple sources of verification: I haven't seen a paper that uses the technique they use here to combine verifiers, and I think it's a really clever insight about GRPO, because they use this sort of group-relative structure, right?
So, basically, you're normalizing against your relative performance within the group, so you can combine signals from multiple verifiers and they're automatically normalized to the same scale. I feel like that's maybe the algorithmic nugget of this paper, and they really don't do a good job of explaining it.
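To spell out that nugget: a minimal sketch of the group-relative normalization in GRPO, which is what lets rewards from verifiers on different scales be combined; the numbers below are made up purely for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    # GRPO-style advantage: each completion is scored relative to the other
    # completions sampled for the same prompt, so absolute reward scale drops out.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical group of 4 completions for one prompt, scored by two verifiers
# that live on very different scales (a 0/1 hard check and a log-scale soft score).
hard = np.array([1.0, 0.0, 1.0, 0.0])
soft = np.array([-1.2, -5.1, -2.3, -4.8])

combined = hard + soft                      # naive sum of per-verifier rewards
print(group_relative_advantages(combined))  # normalized within the group
```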
So, yeah, the paper is more focused on the bio topics, but I was reading it more from the algorithmic side. And I do think it makes sense that the more reward signal you can provide to a model, the better it's going to perform, right?
We were essentially debating, based on the Thinking Machines blog, how even a few bits of information in a reward signal work pretty well. So on paper, at least in theory, it makes sense. But how you do it, whether the signals are confusing, and whether the credit assignment is right or wrong is what probably ends up making the difference.
Yeah, we saw with the DeepSeek paper we read recently, and also the previous DeepSeek one, that they tend to use rollouts, right? They train specialized models for each domain, then take the best rollouts from those specialized models and just do regular, you know, regular pre-training on that.
Or post-training, I guess. So I found this an interesting contrast to that, where here we're saying we're just going to throw all the different reward models together into one giant reward function that you pick from on the fly, or something like that.
Anyway, that's what I really took away from this paper on the algorithmic side. So, of course, they use GRPO, which is the first method. They have a clipped surrogate objective with an epsilon, where the clipping means they don't fully follow large policy-ratio updates.
This is typical GRPO math; you have the clipping, and essentially you end up with a reward that is just based on the probability assigned to the given gene interaction and the verification of it. That's the whole idea around it.
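For reference, the standard GRPO objective has roughly this shape (this is the usual formulation, up to token-level details, not copied from the paper):

```latex
\mathcal{J}_{\text{GRPO}}(\theta) =
  \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
    \min\!\Big(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,A_i\Big)\right]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right),
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},
\qquad
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}.
```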
You have a verifier, in the typical GRPO notation, scoring a completion given the question and the gene pair. Then you have hard verification, where the signal comes directly from experiments: it is simply one if the prediction is true and the experiment says true, one if the prediction is false and the experiment says false as well, and zero otherwise.
That's the first method you're supposed to use here. Then you have the soft verification that comes in via virtual cell models. Essentially, instead of a binary one when it was true/true or false/false, they have a probability of whether the perturbation is going to have an effect or not.
So essentially, it's not an experiment, it's just a prediction. That's where you have soft verification: the verification is not an experiment but a prediction model. So what you have is that they just take the two genes and pass them through a virtual cell model to predict whether the perturbation effect would happen or not.
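A minimal sketch of how I read those two reward sources (my own illustration; the VCM interface here is hypothetical, not the paper's API):

```python
# Hard verification: binary agreement with the lab result.
def hard_reward(predicted_up: bool, experiment_up: bool) -> float:
    # 1 if prediction and experiment agree (true/true or false/false), else 0.
    return 1.0 if predicted_up == experiment_up else 0.0

# Soft verification: reward the model's stated answer with the probability a
# virtual cell model (VCM) assigns to it. `vcm` is a hypothetical stand-in object.
def soft_vcm_reward(gene_u: str, gene_v: str, predicted_up: bool, vcm) -> float:
    p_change = vcm.predict_perturbation_prob(gene_u, gene_v)  # hypothetical API
    return p_change if predicted_up else (1.0 - p_change)
```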
They also have pointwise mutual information scores around this. There's a lot of math here, but the whole idea is that you have to normalize it and bring it to a level where the reward signal is meaningful and does not drown out the reward signal from the hard verification.
So that's there, and it comes from the top 0.05 of interactions with the highest scores in the rBio experiments. The third method is using prior knowledge: they use knowledge sources and use those to assign a reward. That's another soft reward. It analyzes the chain of thought, the reasoning, and the output, and assigns a smaller reward based on an ontology, or on whether the model is on the right track, via ROUGE-based scores, keyword-based scores, and likelihood estimations.
So those are the different methods they use. I'm going to skip through the math, but essentially their entire reward is the summation of all three methods. That's the final thing they backpropagate, using whichever methods are applicable; if the VCM cannot give you a result for an example, then you simply don't use that component.
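And a toy sketch of the third, knowledge-source reward plus how I read the combination (the keyword overlap below is a crude stand-in for their ROUGE/keyword/likelihood scores, not their actual scoring):

```python
def knowledge_reward(reasoning_trace: str, expected_terms: set) -> float:
    # Crude proxy for "the chain of thought is on track": fraction of expected
    # domain terms from the knowledge source that appear in the reasoning.
    trace = reasoning_trace.lower()
    hits = sum(1 for term in expected_terms if term.lower() in trace)
    return hits / max(len(expected_terms), 1)

def total_reward(components: list) -> float:
    # Sum whichever reward sources are available for this example; a verifier
    # that can't score it (e.g., the VCM has no prediction) contributes nothing.
    return sum(c for c in components if c is not None)

print(total_reward([1.0, None, 0.4]))  # -> 1.4
```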
So I was kind of baffled why they use ROUGE score and keyword search there rather than using an LLM as a judge. I don't know. Yeah, I think they could have done that; they just went with one method and found that it was working.
The other side is that in this case the reasoning chains would have a lot of very specific keywords based on the subject matter, so it's an easy score to compute here; but for a generic model that probably won't work, and you'd need stronger methods. Yeah. I also thought about this last one here, the likelihood-based verifier. I don't know.
I didn't quite understand it. My interpretation was that you look at, sort of, the log likelihood of the ground truth or the output, sorry, the log likelihood of the knowledge source under that model, and they use that as a score.
Is that how you interpreted it? I interpreted it more simply. Of course they take a log and everything, but these likelihood-based verifiers are doing an estimation as well, like the ROUGE-based verifiers and the others I was talking about.
Right, ultimately. Yeah. The other ones I understand, but the log one, I don't understand how they calculate that. Yeah, so they didn't go through the method that well, but the whole idea is estimating the likelihood that this trajectory, if continued further or maybe with some more information, could have led to the right answer.
It's their own model here. I've read this, but give me a minute. So essentially one of the versions they end up doing is exactly this. I'll have to get back to you on it, but what I remember is that it's not just keyword matching and ROUGE.
It's also that they're looking at a trajectory and estimating whether it can reach the right answer, and that's the log likelihood. Now, exactly how they compute it is something I'm not a hundred percent sure about. Yeah. Okay. Cool.
Then they do have a normalization for the soft scores. It's a simple normalization against a threshold that they calculate: if the score is below the threshold, it gets mapped below 0.5, and if it's above the threshold, it gets mapped to 0.5 plus something. So there is a normalization that happens, and that's probably part of what helps this model. But there is not a very strong argument for why they did this normalization; they just say that normalization is good, let's do it.
Let's do it. But I don't see a very good reasoning here as to why this normalization even exists where the threshold MLP is just 1.5. Yeah. I didn't, I didn't understand that section very well either. Yeah. And, uh, they are also very clear about that because their data set is biased, uh, what they want to look at is TPR instead of a TNR or anything, because, uh, is the metric, the value because data sets are heavily imbalanced and significant number of two true positive, uh, uh, and the TPR is the one.
So that's another thing here, and it's probably also motivated by one of the experiments where the TNR went to essentially one for a base model. I'll come to that. So basically the whole idea is that you distill the knowledge from a model of biology into reasoning models through soft verification, the same thing they have been saying for the last two sections.
It is a good mechanism, and probably a mechanism that can work, but coming to the numbers first: this is again a summary of the verifiers they had, in terms of the example prompts and the questions they are asking. One is experimental data.
A couple of them are the MLP-based perturbation model and the TranscriptFormer foundation model; both of those are VCMs. Then one comes from the knowledge base. It's repeating the same thing we've covered. So, I don't know if you picked this up in the paper: did you see how they actually translate things?
Like, you have some questions. First of all, how do you generate those questions? And then how do you translate them into something the verifier can consume? I understand what goes into the verifier, and I understand how to template a question for it, but did they just take gene pairs and feed them into the template?
That doesn't tell me, okay, I guess, as long as you're persuaded that the LLM will be able to… So they're looking at human-level reasoning there. The datasets they created used something like a leave-one-out split, in both the experimental and MLP settings, and that's how they were doing it.
But yeah, initially the whole idea is: for this gene interaction, if I knock this gene down, what happens next? Okay, so they just generated a bunch of questions, like they picked gene pairs or something like that and generated a bunch of questions about them.
Yeah, so there are literally four or five genes that they start with; there aren't that many. Oh, I see. I didn't catch that, that they were only looking at a few genes. I would expect you to go over thousands. Oh, I think they're only looking at a few specific genes.
I'll confirm that. So initially, for training, that was the setup. For example, when you're talking about prompts, the prompts are almost the same, just somewhat modified based on which foundation model you are getting a reward from. So that part is almost the same, but yeah, they do not go through a lot of genes there.
Yeah. Okay, so here: basically, what they call out of distribution is out of the training dataset, i.e. something the model hasn't seen. It's not generalization in the strong sense of out of domain; it's just out of distribution, because the problems are still of the same kind.
So they have an MLP, which they train for soft verification to predict differential expression. That's what they're using as the soft-verification reward model via the VCM. And then you have one-hot encodings and Gene2vec embeddings; both are used in different ablations to get different signals.
So they compared it with experimental data on one cell line that is in distribution, where the test and train splits are from the same cell line. I'll come to this because it's an interesting graph and the numbers are also interesting.
So essentially it's one cell line, and they start by getting gene pairs from there. Then they leave one part out and test on that held-out part. The same thing here: they leave one out and test on it. One setting is basically training and testing on the same dataset.
Then they have four genes; they leave one out, test on that one gene, and train on the other three. The same goes for the MLP setting. So in terms of results, you have the one-cell-line experimental setting and the MLP setting as well.
Then you also have the one-hot variant; these are essentially different gene featurizations. So in what they show, Summer is their reference model: a model trained on experimental data with detailed biological knowledge in the loop.
GEARS is a state-of-the-art specialized perturbation prediction model trained solely on experimental data. And the base reasoning model is Qwen 2.5 3B. So essentially, for the numbers they report, the dataset is the same here; it's not that other perturbation dataset.
Sorry, yeah, the TPR is what they look at mostly; it gets to around 0.82. The rBio experimental one-cell-line model works really well because it's trained on the test set. Other than that, Summer doesn't perform as well. Their MLP variant performs very well, with both Gene2vec and one-hot features; then GEARS and Qwen 2.5, the base model, aren't performing as well either.
So there is a big jump that they see here, especially in the true positive rate. On TNR, the one that performs best is GEARS, because it has a very low positive rate anyway. But just to be clear, I don't think they trained on the test set.
It was in distribution, right? It was in distribution, yeah, just to be clear. They didn't leave it out completely, that's what I meant. Yeah. So essentially it's a method of teaching the model how to predict perturbations based on some sophisticated models that we have already built.
A real-world parallel could be that you have a bunch of reward models; essentially RLHF could be a good parallel, if done a bit differently. The whole idea there is that you have a bunch of text where you have to predict what a human would predict, right?
Or rather, which one a human would prefer. In this case they're doing pretty much the same thing, but they have a lot of sophisticated models for what is going to result in a perturbation, and they're teaching a generic LLM to understand that and basically get very good at that task.
So a real takeaway for me here is that small models are also very good students. They can learn almost any world model from anywhere, whether or not it was originally their own, and they can get very good at a task just by learning from a specific world model in a given domain.
So, of course, they exhibit transfer to perturbation tasks, which is what the results show. Then the next part is that when you combine different verifiers, it improves generalization even further. That's the next figure, not this one. Yeah.
I'll come to this figure. They essentially combine all the sources of verification and show that the numbers went up even further. So yeah, you have the experimental one, then the direct ones, and it went even further. So coming back to that. Yeah, but it's disappointing that they didn't beat Summer, right?
I don't think they do in any of those cases, or only very few of them. No, sorry, in this one. This one, yeah. I mean, Summer is almost as good, but they do beat it in the experimental all-cell-lines setting. They don't beat it otherwise. And if you're looking at TPR, they're not beating it by much, but then the amount of compute they use is very low compared to the original model.
So it's probably getting there. We are probably going to get to a point where we are able to outperform them, or at least get the same kind of signal that we're getting from the specific domain models here. Yeah, I'm not very sure about this claim.
I'll be honest, but their essential claim is that adding verification sources keeps improving performance; they go to another figure to show that. But yeah, coming back to one of the other results: this is models trained on experimental data and soft verification using VCMs, with aggregate model performance across the metrics.
So again, the same kind of results here: they're beating Summer fairly comprehensively, though not on every metric. TNR is always the one where the others win, but the red and the orange lines, the models they have trained, almost always perform really well, except for F1 score and MCC.
But yeah, overall they talk about TPR so much because they're getting many good results on it. The other part, the amazing part here, is obviously how much they outperformed the base model. The base model is pretty clueless about these topics, and they're still able to raise the performance there, which is quite fascinating for me, because if it can happen in a very specialized domain like biology, it can definitely happen in a very different domain, as long as you have a prediction model or a surrogate model for that domain.
Provided the results hold up at a higher scale, obviously. So yeah, those are the additional verifiers. You have the base model on TPR, you also have the purely experimental model, and then you see how close the other models can get without using the experimental path.
This one actually shows the experimental path. This was the one I was so disappointed in, right? I feel like this is an incomplete result, because I think that with better training methodology and dynamics they should be able to improve over the experimental data, at least when you're looking out of distribution.
Right. What I think they're saying here is that if you have in-distribution experimental data, that's the gold standard. And I can perfectly accept that you're maybe not going to get to that with a world model, but you can get close, and the other parts of the paper show that. But I was really disappointed to see that the performance goes down when you add more data.
And I think there's some evidence, I'd have to think through what it was again, but I see some evidence that they should be able to improve on that so it at least stays as good as the gold standard, if not gets better. Makes sense. Okay. So again, the effect of additional verifiers: they keep doing the same kind of thing there.
Yeah. So essentially this whole chart is around that. Now, a lot of this was done without CoT: they just gave a prompt, generated trajectories, and backpropagated, but there was no test-time compute or chain of thought happening in this case.
So they just added this line to the system prompt, "a biologist will evaluate each step of the problem using logical reasoning and evidence from the prompt," essentially moving towards a test-time-compute setup. And of course, the performance jumped even further here. I think this is the CoT one.
Yeah, so this is the system prompt, this is a conversation, and then you have to use <think> and </think> tags; essentially, that's the system prompt. And it ended up with even better performance, but I think that's a generic thing where chain of thought just works better.
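Roughly, the chain-of-thought setup described here would look something like this in a standard chat format (the quoted sentence is from the talk; the rest of the wording and the gene names are my own placeholders, not the paper's exact prompt):

```python
system_prompt = (
    "A biologist will evaluate each step of the problem using logical "
    "reasoning and evidence from the prompt. "
    "Put your step-by-step reasoning inside <think> ... </think> tags, "
    "then give the final answer."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": (
        "If gene KLF1 is knocked down in this cell line, is gene GATA1 "
        "likely to show differential expression?"  # hypothetical example question
    )},
]
```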
You have something like a 0.88 here and a 0.91 here, right? And this added line essentially improved the performance further compared to the previous one. Sorry, my bad. Yeah. Okay. Right. So I would have liked to have seen the non-chain-of-thought and the chain-of-thought results side by side here.
Yeah, but you have to scroll up and down. Oh, I see. No, it is there. No, no, it's not there. So, two things here: they do have one before it, which wasn't chain of thought, here, and then they have one with chain of thought prompted.
Yeah. Then they added another line, which is the one here: the biologist will evaluate each step of the problem using logical reasoning and evidence from the prompt. Yeah. No, I can see that it's better, but yeah. Okay, got it. It's just slightly better. Just that. Yeah. Okay.
Got it. And anyway, once you're approaching scores of 0.9, improving even a percentage point takes a lot of effort, so they probably also needed a better dataset for the cases where they weren't outperforming. But then you also see that, hey, the base CoT model is at 0.2 MCC and theirs is at 0.49.
So there is a big jump that happens; it's just that their dataset could have been better. Yeah. That begs the question, though: if it's a 3B model, how much of the improvement could you get by just replacing the experiment with, you know, a larger model and doing the same thing? Because presumably whatever giant frontier model is trained on all sorts of papers about genetics and the entire ontology database and things like that.
Yeah, I think there's a lot to build on top of this, both in the biology domain and in generic domains, because I'm trying to replicate the same thing for a math-domain task. And it turns out that DeepSeek has already done something like that before, using a surrogate model to predict whether or not a proof would hold up.
But yeah, if I get results, I'll share them. So now they're looking at answering questions completely out of distribution, like about Alzheimer's and so on. They haven't done this experiment in a quantitative manner, but the qualitative evaluation is that there's generally good and consistent reasoning and very few scientific inaccuracies.
They were trained for 800K steps, taking about 10 days of GPU time, which is not a lot of money; it's not even $6,000. So you can really teach a very small and fairly clueless model like Qwen 2.5 3B a totally different domain's world model at that kind of cost.
And if it scales, there's a lot more you can do. Of course, they used Hugging Face TRL (Transformer Reinforcement Learning). I don't know if there are other methods, like REINFORCE-style approaches or other trainers, you could use instead.
That would probably give them a bit of a boost as well, because the credit assignment here is not as good as in some other setups. Cool. So this is the paper. They have a lot of prompts further down as well, and around these ideas there is a lot of prior work too.
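Going back to the TRL mention: a hedged sketch of what GRPO training with a custom verifier reward looks like in recent TRL releases that ship GRPOTrainer (the model name, dataset, and reward logic are placeholders, not the paper's actual setup):

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Tiny placeholder dataset with the standard "prompt" column.
train_dataset = Dataset.from_dict({
    "prompt": ["If gene A is knocked down, will gene B's expression change? Answer yes or no."],
})

def verifier_reward(completions, **kwargs):
    # Stand-in for the hard/soft/knowledge rewards discussed earlier.
    return [1.0 if "yes" in completion.lower() else 0.0 for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=verifier_reward,
    args=GRPOConfig(output_dir="rbio-grpo-sketch"),
    train_dataset=train_dataset,
)
trainer.train()
```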
One thing I wanted to highlight was, yeah, not this one, sorry. Surrogate models have been used previously as well, and they do cite that; I wanted to highlight that part, but I'll have to search for it. Cool. So this is the paper. Yeah, questions? I mean, I've talked a lot.
My overall impression here is, I mean, it is a preprint, so that's totally fine, but I think it obscures a lot with jargon, and the math could be more clear. My takeaways from the paper were that it's an interesting technique for combining different soft verifiers.
And I think this is particularly well suited to biology because there are all these bespoke models out there. I'm guessing there are probably other domains, in science in particular, that have a similar dynamic, like chemistry or physics, where it's very expensive to do an experiment.
And so maybe you can use some sort of model as a soft verifier for the reasoning process. And it's an interesting way of thinking about building a multimodal model. I was disappointed, though, because I think this next point is the most important thing.
I was disappointed to see them saying that using these verifiers in distribution lowers your performance. I would like to see an experiment in this paper where, first of all, they just fix that, but also, it seems to me you should be able to use a small in-distribution training set plus verifiers to create a model that performs as well as the experimental data in distribution and better than it out of distribution.
Yeah, I do think that's the case, but I also think they set out on a very narrow problem statement. It should generalize, but this is also the first step in that experiment, testing whether they can really codify world models into an LLM, and that's why the dataset was narrow.
The problem statement was narrow and the model was small. But if the technique works, it's probably a good thing, simply because right now, where we are with reward models, we are kind of assuming static priors. Surrogate models work better as reward models simply because you can keep capturing the changing priors that people have.
It could be as simple as: hey, what is the probability that this business plan succeeds? A random example, of course, but given thousands of previous examples, you can pretty much use that as a predictive model for something. We have used predictive analytics in the past, just not as a reward for an LLM.
Those are probably the domains where this can work as well; it's just that you need access to that kind of data to build out the predictive model, or those knowledge sources, build out that soft verifier first, and then go for something like this. Yeah. So I agree.
So, as a stepping stone towards a model that you're able to converse with generally about biology data and have it reason through complex questions, I think it's a good step in that direction. I agree. So, I came in very late.
I came in around five minutes ago, so maybe my question might be good for anybody else who came in a little late. From what I'm gathering here, the model that the paper authors trained was one that was given a certain specific domain of biological knowledge that was contextualized by biological models.
And there was some surprise at the model's ability to generalize to the extent it did toward those mathematical models that contextualize that domain space, and at its ability to respond to and interact with that environment, along with some reinforcement learning. Is that a solid approximation?
Yeah, that's pretty much it. You have a base model, which is not trained on anything specific in biology, and it is able to compete with very task-specific models in bio that were built through a lot of experimentation and a lot of data collection.
Right, so yeah, that's what you're getting at. And especially a very small model that is able to compete with the bigger foundation models, though not LLMs in that regard. So an interesting thing for me is that what this paper shows kind of extends outside the domain of biology to the domain of any scientific theory that can be parameterized by models describing the space that said machine learning model would be attempting to generalize toward. I'd be interested:
Were all of the biological models that contextualize the space differential equations, or were they all in a similar mathematical form, or did they extend out to different mathematical forms? I think, by mathematical forms, what you're looking at here is MLPs.
Differential equations. I'm talking about the biological models that contextualize the space the machine learning model was generalizing over. I assume most biological models are differential equations, right? I'm not very sure about that. Okay. No, I think some are for sure, but it looked to me like the ones they were using were different types of neural networks.
From simple multi-layer perceptrons to transformer-based models. So the verifiers themselves were also neural networks. Okay. Yeah. And if you think about the mathematical models of these things, there are definitely differential-equation-based ones, but some of them are not, right?
Like, okay. So just to answer Peter's question: I came across this paper because I was looking for different verification techniques. I was writing a blog post about it, searched with o3, and it surfaced this as an interesting thing. I think that was somewhere around August, when the paper had just come out.
I read through it and realized that, hey, if this generalizes and the techniques hold up, this can be far more reaching than just this domain. So I thought, let's pick it up, and it took me two months to get to it. Wait, wait, wait. What? Tell me more about this blog post on verification.
Yeah, what's going on there? So essentially it's a blog post around environments, but the key thing in an environment is: how do you verify that the task has been done? The whole idea here is that at some point everyone is assuming that the priors stay the same.
That's where you can have a very static reward function that just works, but that's not going to be the case, because for humans as well, priors change and evolve. I'll send you the link; it's a longish blog post, I already posted it. In short, the idea is that if you assume RL is basically generalizing the priors, instead of, you know, hill climbing in an iterative manner, then the model needs to have priors.
The priors can be coached in by, one, generating enough training data; but this paper is the second idea, where they can be coached in by giving a very domain-specific reward model, or a surrogate reward model as it's called in some cases. And there are cases where even that is not going to be enough.
For example, when you are buying on Amazon, or someone else is buying on Amazon, that's a very personalized, preference-model scenario, and I don't know how to train for it today. But that's the third kind of reward model these LLMs would need, because you can't have a very generic static prior for, hey, can an LLM buy, say, a jacket on Amazon.
You want something that takes into account preferences for different people, and it cannot just be verbalized preferences; you need a sort of recommendation system there. That's the kind of trajectory we are moving towards, we're just not there yet, and I don't know what it looks like. The middle part here is the surrogate aspect, where you model a real-world thing as a prediction model.
And then you end up in a place where you are able to at least codify most of it into an LLM that can generalize. Yeah, it's all pretty confusing, actually. Yeah, I'm looking forward to this post. Yeah, I'm pro more recommendation systems in our pipelines in general.
Your comment made me realize, actually, that these systems are a lot closer than I thought. I usually put them in a different mental bucket, but maybe I should not. Yeah. But the architecture point that needs to be solved is: how do you have different reward models for different preferences
within the same model? That's probably something DeepMind has a solution for with their reward heads, but nothing much has come out; I've seen a couple of papers around it. That's probably one way of doing it, but you don't know, and these labs are not really publishing those kinds of things anymore.
I mean, for all we know, Anthropic could have discovered it a year ago as well. No one has published anything, so we don't know unless we know the people who have done it. There are some comments in the chat; I think they went off on a discussion.
I'm just confused as to what that discussion is about. Yeah. Maybe, CJ, you want to jump in? Oh, I was just responding to what I understood somebody to be saying, not directly, but kind of reaffirming how you guys explained to me that the models verifying the model being tested are neural networks.
And I was contrasting that with how the laws of a system might be expressed through differential equations, which is why I assumed it was differential equations that were contextualizing the system. But I understand what you guys mean: no, these other neural networks, these other transformers, are acting as direct verification for the model being tested.
This transformer, yeah. Correct. Yeah. Awesome. Well, I think we are out of time. Sean, do we have a paper for next week already? Dude, as you can clearly tell, I have trouble keeping track.
The source of truth is in Discord; we'll coordinate in Discord. But Ankit, I'm very impressed by this coverage. I'm really looking forward to hearing more from you. This is the spirit of the paper club, discussing stuff like this.
Yeah, I'm really glad you picked this paper, and it's different. I hope this does generalize, because then we have something we can build startups on. It will generalize if the theory works. People have been trying to push this, but so far we haven't seen much good stuff in the world model space; you know, Meta also did something like this, like a code world model version.
Yeah. Yeah. Uh, maybe it'll work this time. I don't know. Let's see. Awesome. All right. Thank you guys. Thank you. Later guys. Bye. Bye. Thank you.