RLHF vs SFT to break out of local maxima 📈

00:00:00.000 | The intuition in general is that, for instance, for code, because this is factual, you can

00:00:05.320 | check if the code is correct or not, RLHF is not the way to go.

00:00:08.600 | You would prefer to do supervised fine-tuning, with a human writing the code.

00:00:12.400 | But in fact, because humans make mistakes, because actually even in code there are some

00:00:16.200 | preferences that emerge, things like that, and maybe for some other reasons that we don't know,

00:00:21.480 | RLHF is so much more scalable, it costs less, it's easier, that it leads in general to just

00:00:26.160 | better performance.

00:00:27.660 | And maybe we can come up with a compromise. We actually suggested Teacher Critic, which

00:00:32.840 | reconciles and unifies supervised fine-tuning and RLHF, such that when you collect human preferences,

00:00:38.560 | and you have two outputs, but both are very bad, in code for instance, you will ask

00:00:43.580 | the human to edit the better answer to make it correct.

00:00:47.380 | So now you are doing SFT when all the answers are really bad, so that you can get out

00:00:53.180 | of the local minimum of your model.
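
The routing described above can be sketched roughly as follows. This is only an illustration of the idea as stated in the talk, not the speaker's actual implementation: the `Annotation` fields, the `route_annotations` function, the `edit_fn` callback standing in for the human editor, and the `min_quality` threshold are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Annotation:
    """One human preference judgment over two model outputs (hypothetical schema)."""
    prompt: str
    chosen: str          # the output the annotator preferred
    rejected: str        # the other output
    chosen_score: float  # assumed absolute quality rating of the preferred output, in [0, 1]


def route_annotations(
    annotations: List[Annotation],
    edit_fn: Callable[[str, str], str],
    min_quality: float = 0.5,  # assumed cutoff for "both outputs are very bad"
) -> Tuple[List[Tuple[str, str, str]], List[Tuple[str, str]]]:
    """Split annotations into preference pairs (RLHF) and edited targets (SFT)."""
    preference_pairs: List[Tuple[str, str, str]] = []
    sft_examples: List[Tuple[str, str]] = []
    for a in annotations:
        if a.chosen_score < min_quality:
            # Even the preferred output is bad: ask the human to edit it into a
            # correct answer and use the result as a supervised target.
            sft_examples.append((a.prompt, edit_fn(a.prompt, a.chosen)))
        else:
            # At least one usable output: keep the ordinary preference pair.
            preference_pairs.append((a.prompt, a.chosen, a.rejected))
    return preference_pairs, sft_examples
```

In this sketch the preference pairs would feed reward modeling as usual, while the edited examples give the model a correct target in exactly the cases where none of its own samples was good enough to learn from.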