Aspect | RLHF (e.g. ChatGPT) | PPO + ReAct Demo
---|---|---
Base model | Full LLM is fine-tuned with PPO on human feedback | LLM is frozen; only a small prefix policy is trained
Reward model | Learned from human preference data (e.g. ranking outputs) | Simple task-specific reward (correct/incorrect math answer) |
Feedback source | Human annotators or trained reward models | Automatic programmatic scoring |
Compute | Large-scale distributed training (millions of samples) | Small-scale CPU/GPU, runs locally |
Goal | General alignment (helpfulness, safety, style) | Narrow task performance (correct math answers)
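
To make the "automatic programmatic scoring" row concrete, a minimal reward function for the math task might look like the sketch below. This is an illustrative assumption, not the demo's actual code: the function name `math_reward`, the answer-extraction regex, and the tolerance are all hypothetical.

```python
import re

def math_reward(model_output: str, expected_answer: float, tol: float = 1e-6) -> float:
    """Binary correct/incorrect reward: 1.0 if the last number in the
    model's output matches the expected answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    try:
        predicted = float(numbers[-1])
    except ValueError:
        return 0.0
    return 1.0 if abs(predicted - expected_answer) <= tol else 0.0

# Example: scoring the policy's answer to "What is 12 * 7?"
print(math_reward("The answer is 84.", expected_answer=84))  # 1.0
print(math_reward("I think it's 72.", expected_answer=84))   # 0.0
```

Because the score is computed programmatically, no human annotators or learned reward model are needed, which is what keeps the demo small enough to run locally.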