
RLHF vs PPO + ReAct

| Aspect | RLHF (e.g. ChatGPT) | PPO + ReAct Demo |
|---|---|---|
| Base model | Full LLM is fine-tuned with PPO on human feedback | LLM is frozen; only a small prefix policy is trained |
| Reward model | Learned from human preference data (e.g. ranking outputs) | Simple task-specific reward (correct/incorrect math answer) |
| Feedback source | Human annotators or trained reward models | Automatic programmatic scoring |
| Compute | Large-scale distributed training (millions of samples) | Small-scale CPU/GPU, runs locally |
| Goal | General alignment (helpfulness, safety, style) | Narrow task performance (e.g. getting math answers right) |
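
To make the right-hand column concrete, here is a minimal sketch of the demo's setup: a frozen LLM, a small trainable prefix policy, and an automatic correct/incorrect reward. The LLM stub, candidate prefixes, reward function, and hyperparameters below are illustrative assumptions rather than the demo's actual code, and PPO is reduced to a single clipped-surrogate update per sample.

```python
# Minimal sketch: frozen "LLM" + trainable prefix policy + programmatic reward.
# All names and numbers here are illustrative assumptions, not the demo's code.
import random
import torch
from torch.distributions import Categorical

# Candidate ReAct-style prefixes the policy can prepend to the prompt.
PREFIXES = [
    "Answer directly:",
    "Think step by step, then answer:",
    "Reason about the problem, act, then give the final number:",
]

def frozen_llm(prefix: str, a: int, b: int) -> int:
    """Stand-in for the frozen LLM: better prefixes yield more correct answers."""
    accuracy = 0.3 + 0.3 * PREFIXES.index(prefix)  # toy behaviour, not a real model
    return a + b if random.random() < accuracy else a + b + 1

def reward(pred: int, target: int) -> float:
    """Task-specific programmatic reward: 1 for a correct math answer, else 0."""
    return 1.0 if pred == target else 0.0

# Trainable prefix policy: a categorical distribution over the prefixes.
logits = torch.zeros(len(PREFIXES), requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.05)
clip_eps = 0.2

for step in range(200):
    a, b = random.randint(1, 9), random.randint(1, 9)

    # Sample a prefix and record its log-prob under the "old" policy.
    old_dist = Categorical(logits=logits.detach())
    action = old_dist.sample()
    old_logp = old_dist.log_prob(action)

    pred = frozen_llm(PREFIXES[action.item()], a, b)
    advantage = reward(pred, a + b) - 0.5  # crude constant baseline

    # PPO clipped surrogate objective on the sampled action.
    new_logp = Categorical(logits=logits).log_prob(action)
    ratio = torch.exp(new_logp - old_logp)
    surrogate = torch.min(
        ratio * advantage,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage,
    )
    loss = -surrogate

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("Learned prefix:", PREFIXES[int(torch.argmax(logits))])
```

The point of the sketch is the contrast with RLHF in the table: nothing about the "LLM" is updated, the reward is computed by code rather than learned from human preferences, and the only trainable parameters are the handful of prefix logits, so the whole loop runs in seconds on a CPU.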