back to indexRLHF vs SFT to break out of local maxima 📈
00:00:00.000 |
The intuition in general is that, for instance, for code, because this is factual, you can 00:00:05.320 |
check if the code is correct or not, RLHF is not the way to go. 00:00:08.600 |
You prefer to do supervised fine-tuning as a human to write the code. 00:00:12.400 |
But in fact, because humans make mistakes, because actually even in code there are some 00:00:16.200 |
preferences that emerge, things like that, and maybe for some other reasons that we don't 00:00:21.480 |
RLHF is so much more scalable, it costs less, it's easier, that it leads in general to just 00:00:27.660 |
And maybe we can come with a compromise, we actually suggested Teacher Critic, where it 00:00:32.840 |
reconciliates and unifies supervised fine-tuning and RLHF, such that when you do human preference, 00:00:38.560 |
and you have two outputs, but both are very bad in the code, for instance, you will ask 00:00:43.580 |
the human to edit the best answer to make it correct now. 00:00:47.380 |
So now you are doing SFT when all the answers were really bad, such that you can get out