RLHF vs SFT to break out of local maxima 📈

The general intuition is that for something like code, which is factual in the sense that you can check whether it is correct or not, RLHF is not the way to go. You would rather do supervised fine-tuning and have a human write the code. But in fact, because humans make mistakes, because even in code some preferences emerge, and maybe for other reasons that we don't know, RLHF is so much more scalable, cheaper, and easier that it generally just leads to better performance.

And maybe we can come up with a compromise. We actually suggested Teacher Critic, which reconciles and unifies supervised fine-tuning and RLHF: when you collect human preferences and you have two outputs, but both are very bad (in code, for instance), you ask the human to edit the best answer to make it correct.

So now you are doing SFT in the cases where all the answers were really bad, which lets you get out of your model's local minimum.
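The routing described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's actual pipeline: the `Annotation` record, the `chosen_is_acceptable` flag, and the `edit_fn` callback (standing in for the human editor) are all hypothetical names. It also assumes that a pair where both answers are bad is used only for SFT and not kept as a preference pair, which the talk does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Annotation:
    """One human comparison of two model outputs (hypothetical schema)."""
    prompt: str
    chosen: str                 # the answer the annotator preferred
    rejected: str               # the other answer
    chosen_is_acceptable: bool  # annotator's call: is the preferred answer good enough?


def route_annotations(
    annotations: List[Annotation],
    edit_fn: Callable[[str, str], str],
) -> Tuple[List[Tuple[str, str, str]], List[Tuple[str, str]]]:
    """Split annotations into preference pairs (for RLHF) and SFT examples.

    edit_fn(prompt, answer) stands in for the human who rewrites the best
    of two bad answers into a correct one.
    """
    preference_pairs: List[Tuple[str, str, str]] = []
    sft_examples: List[Tuple[str, str]] = []
    for a in annotations:
        if a.chosen_is_acceptable:
            # Usual case: keep the comparison as preference data.
            preference_pairs.append((a.prompt, a.chosen, a.rejected))
        else:
            # Both outputs were bad: ask for an edit and use it as an SFT target.
            corrected = edit_fn(a.prompt, a.chosen)
            sft_examples.append((a.prompt, corrected))
    return preference_pairs, sft_examples
```

In practice `edit_fn` would be an annotation step rather than a function call. The point is only that the same comparison interface can feed both kinds of training data, with the edit step reserved for the cases where preference alone has nothing good to prefer.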

