Value Iteration vs Policy Iteration
| Criteria | Value Iteration | Policy Iteration |
|---|---|---|
| Steps | Consists of a single step combining policy improvement and truncated policy evaluation | Consists of two distinct steps: policy evaluation and policy improvement |
| Convergence | Converges to the optimal value function only in the limit; in practice it is stopped once value changes fall below a threshold | Often reaches the optimal policy in fewer iterations than value iteration, though each iteration is more expensive |
| Complexity | Less complex per iteration, as it involves only one step | More complex per iteration, as it involves two distinct steps |
| Examples | Value Iteration method | Policy Iteration method |
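To make the "single step" concrete, here is a minimal Python sketch of the value-iteration backup. The MDP encoding (`P[s][a]` as a list of `(probability, next_state, reward)` tuples) is an assumption chosen for the sketch, not something from the original:

```python
import numpy as np

# Minimal sketch of the value-iteration backup. The MDP format is an
# assumption for illustration: P[s][a] is a list of
# (probability, next_state, reward) tuples.
def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-6):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # The single fused step: a one-sweep (truncated) evaluation of
            # each action, followed immediately by improvement (the max).
            q = [sum(p * (r + gamma * V[ns]) for p, ns, r in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # stop once the largest value change is negligible
            return V
```

Policy iteration would instead run the inner evaluation to (near) convergence for a fixed policy before performing a separate greedy improvement step, which is why each of its iterations costs more but fewer are typically needed.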
Monte Carlo Methods vs Temporal Difference Learning
| Criteria | Monte Carlo Methods | Temporal Difference Learning |
|---|---|---|
| Reward Information | Uses the complete return (reward information from the whole episode) for updates | Uses the immediate reward plus the estimated return of the next state for updates |
| Bias/Variance | Unbiased with high variance | Biased with low variance |
| Learning Speed | Slower learning, as it waits until the end of the episode | Faster learning, as it updates estimates based on other estimates (bootstrapping) |
| Examples | First-Visit MC, Every-Visit MC | TD(0), SARSA, Q-Learning |
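The difference in update targets is easiest to see side by side. A minimal sketch, where the container names (`V` as a dict of state values, `episode` as a list of `(state, reward)` pairs) are assumptions for illustration:

```python
# Two update targets for state-value estimation, shown side by side.
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: wait until the episode ends, then use the
    full return G as the target (unbiased, but high variance)."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """TD(0): update from a single transition, bootstrapping on the current
    estimate of the next state (biased, but low variance)."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
```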
Model-Based vs Model-Free Methods
| Criteria | Model-Based Methods | Model-Free Methods |
|---|---|---|
| Model Requirement | Require a model of the environment (its transition and reward dynamics) | Do not require a model of the environment |
| Sample Efficiency | More sample efficient, as they can plan using the model | Less sample efficient, as they learn directly from experience |
| Complexity | More complex, as they need to maintain and learn the model | Less complex, as they do not need to maintain a model |
| Examples | Dyna-Q, Monte Carlo Tree Search (MCTS) | SARSA, Q-Learning, DQN |
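As an illustration of why a model buys sample efficiency, here is a sketch of one Dyna-Q step. The names and the deterministic-environment assumption are choices made for the sketch:

```python
import random
from collections import defaultdict

# Sketch of one Dyna-Q step: direct Q-learning from real experience, plus
# n_planning simulated updates replayed from a learned model. Assumes a
# deterministic environment, Q = defaultdict(float) keyed by
# (state, action), and model mapping (state, action) -> (reward, next_state).
def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.1, gamma=0.95, n_planning=10):
    # Direct RL: ordinary Q-learning update from the real transition
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    # Model learning: remember what the environment did
    model[(s, a)] = (r, s_next)
    # Planning: extra updates from remembered transitions, at no additional
    # environment cost -- the source of the sample efficiency
    for _ in range(n_planning):
        (ps, pa), (pr, pns) = random.choice(list(model.items()))
        best = max(Q[(pns, b)] for b in actions)
        Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
```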
On-Policy vs Off-Policy Methods
| Criteria | On-Policy Methods | Off-Policy Methods |
|---|---|---|
| Policy Used for Learning | Learn about the same policy that is currently being used to make decisions | Learn about a target policy (often the optimal one) while following a different behavior policy for exploration |
| Examples | SARSA, On-Policy First-Visit MC | Q-Learning, Off-Policy MC Control with Weighted Importance Sampling |
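The split shows up in a single line of the update rule: what the bootstrap target uses. A sketch, where `Q` is assumed to be a dict keyed by `(state, action)`:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses a_next, the action the behavior policy
    actually selected in s_next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the greedy action in s_next, regardless
    of what the behavior policy will actually do."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```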
Value-Based vs Policy-Based vs Actor-Critic
| Criteria | Value-Based Methods | Policy-Based Methods | Actor-Critic Methods |
|---|---|---|---|
| Description | Learn a value function and select actions based on it | Directly learn a parameterized policy without relying on a value function | Learn both a policy (actor) and a value function (critic) |
| Examples | Q-Learning, DQN, Value Iteration | REINFORCE and other policy gradient methods | Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG) |
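A minimal tabular sketch of the actor-critic idea, under assumed structures chosen for illustration: `H[s]` is a NumPy array of softmax action preferences (the actor) and `V` is a dict of state values (the critic). One TD error drives both updates:

```python
import numpy as np

# Tabular one-step actor-critic sketch (assumed names and structures).
def actor_critic_step(H, V, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    # Critic: TD error = how much better the outcome was than expected
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha_critic * td_error
    # Actor: gradient of log softmax(H[s])[a] is onehot(a) - probs
    probs = np.exp(H[s] - H[s].max())
    probs /= probs.sum()
    grad = -probs
    grad[a] += 1.0
    H[s] += alpha_actor * td_error * grad
```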
Single-Agent vs Multi-Agent
| Criteria | Single-Agent Methods | Multi-Agent Methods |
|---|---|---|
| Description | Designed for environments with a single decision-making entity | Designed for environments with multiple interacting entities |
| Examples | SARSA, Q-Learning, DQN, TD(0) | Multi-Agent DQN, Independent Q-Learning |
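Independent Q-Learning, the simplest multi-agent baseline, just runs one ordinary Q-learning update per agent and treats the other agents as part of the (now non-stationary) environment. A sketch with assumed names:

```python
from collections import defaultdict

# Independent Q-Learning sketch: one Q-table per agent, each updated with
# plain Q-learning on that agent's own observations.
def independent_q_update(Qs, agent_id, obs, action, reward, next_obs,
                         actions, alpha=0.1, gamma=0.99):
    Q = Qs[agent_id]  # this agent's own Q-table
    best_next = max(Q[(next_obs, b)] for b in actions)
    Q[(obs, action)] += alpha * (reward + gamma * best_next - Q[(obs, action)])

# Usage: Qs = {i: defaultdict(float) for i in range(n_agents)}, then call
# once per agent after every joint environment step.
```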
Tabular vs Function Approximation
| Criteria | Tabular Methods | Function Approximation Methods |
|---|---|---|
| Description | Maintain a table of values for each state-action pair | Use a function approximator (such as a neural network) to generalize across states |
| Examples | SARSA, Q-Learning, TD(0), Policy Iteration, Value Iteration, Monte Carlo methods | DQN, A2C, DDPG, linear function approximation with gradient descent |
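The jump from a table to an approximator is clearest in semi-gradient TD(0) with linear features. In this sketch, `w` (the weight vector) and the feature vectors are assumed to be NumPy arrays of the same length:

```python
import numpy as np

# Semi-gradient TD(0) with linear function approximation (sketch). Instead
# of one table entry per state, V(s) is approximated as w @ x_s, so a single
# update generalizes to every state with overlapping features.
def linear_td0_update(w, x_s, r, x_next, done, alpha=0.01, gamma=0.99):
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_next
    td_error = r + gamma * v_next - v_s
    w += alpha * td_error * x_s  # gradient of w @ x_s with respect to w is x_s
    return w
```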
Method | Tabular Method | Requires Knowledge of the Transition Function | Learns from Trajectories | Suitable for Continuing Tasks | Bootstrapping | Suitable for Off-Policy Learning |
---|---|---|---|---|---|---|
Policy Iteration | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ |
Value Iteration | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ |
Monte Carlo (first / every visit) | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ (with importance sampling) |
TD(0) | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ (with importance sampling) |
SARSA | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ (expected SARSA) |
Q-Learning | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ |