Aspect | RLHF (e.g. ChatGPT) | PPO + ReAct Demo
---|---|---
Base model | Full LLM is fine-tuned with PPO on human feedback | LLM is frozen; only a small prefix policy is trained
Reward model | Learned from human preference data (e.g. ranking outputs) | Simple task-specific reward (correct/incorrect math answer) |
Feedback source | Human annotators or trained reward models | Automatic programmatic scoring |
Compute | Large-scale distributed training (millions of samples) | Small-scale CPU/GPU, runs locally |
Goal | General alignment (helpfulness, safety, style) | Narrow task performance (correct math answers)
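
To make the "automatic programmatic scoring" row concrete, a minimal reward function for the math task might look like the sketch below. This is an illustrative assumption, not the demo's actual code: the function name `math_reward`, the answer-extraction regex, and the tolerance are all hypothetical.

```python
import re

def math_reward(model_output: str, expected_answer: float, tol: float = 1e-6) -> float:
    """Binary correct/incorrect reward: 1.0 if the last number in the
    model's output matches the expected answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    try:
        predicted = float(numbers[-1])
    except ValueError:
        return 0.0
    return 1.0 if abs(predicted - expected_answer) <= tol else 0.0

# Example: scoring the policy's answer to "What is 12 * 7?"
print(math_reward("The answer is 84.", expected_answer=84))  # 1.0
print(math_reward("I think it's 72.", expected_answer=84))   # 0.0
```

Because the score is computed programmatically, no human annotators or learned reward model are needed, which is what keeps the demo small enough to run locally.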