Maybe you've heard about this technique but haven't completely understood it, especially the PPO part. This explanation might help.
We will focus on text-to-text language models.
Reinforcement Learning from Human Feedback (RLHF) was successfully applied in ChatGPT, hence its recent surge in popularity.
RLHF is especially useful in two scenarios:
- You can’t create a good loss function
    - Example: how do you calculate a metric that measures whether the model's output was funny?
- You want to train with production data, but you can’t easily label your production data
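The first scenario is exactly what a learned reward model addresses: instead of hand-crafting a loss for a fuzzy property like "funny", you collect human preference judgments between pairs of outputs and fit a reward function to them. Below is a minimal toy sketch of that idea, assuming responses are represented by small feature vectors (all data here is synthetic and made up for illustration), using the Bradley-Terry objective commonly used for reward modelling:

```python
import numpy as np

# Toy illustration (synthetic data): instead of hand-crafting a loss for
# "is this output funny?", we learn a reward model from human preference
# pairs. Each response is a small feature vector; humans say which of two
# responses they preferred, and we fit reward weights to agree with them.

rng = np.random.default_rng(0)

# Hidden "true" preference direction, unknown to the learner; it stands in
# for the human's actual taste.
true_w = np.array([1.5, -2.0, 0.5])

# Simulated preference dataset: (chosen, rejected) response pairs.
pairs = []
for _ in range(200):
    a, b = rng.normal(size=3), rng.normal(size=3)
    chosen, rejected = (a, b) if a @ true_w > b @ true_w else (b, a)
    pairs.append((chosen, rejected))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fit reward weights w by gradient ascent on the Bradley-Terry
# log-likelihood: log sigmoid(r(chosen) - r(rejected)), with r(x) = w @ x.
w = np.zeros(3)
lr = 0.1
for _ in range(100):
    grad = np.zeros(3)
    for chosen, rejected in pairs:
        margin = w @ chosen - w @ rejected
        grad += (1.0 - sigmoid(margin)) * (chosen - rejected)
    w += lr * grad / len(pairs)

# The learned reward should now rank the pairs the way the "human" did.
accuracy = np.mean([w @ c > w @ r for c, r in pairs])
print(f"preference accuracy: {accuracy:.2f}")
```

Once such a reward model exists, it provides the scalar training signal that the RL step (PPO, discussed later) optimizes against, which is how RLHF sidesteps the missing loss function.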