Value Iteration vs Policy Iteration
| Criteria | Value Iteration | Policy Iteration |
|---|---|---|
| Steps | Consists of a single step combining policy improvement and truncated policy evaluation | Consists of two distinct steps: policy evaluation and policy improvement |
| Convergence | Converges to the optimal value function only in the limit; in practice it is stopped once value changes fall below a threshold | Often reaches the optimal policy in fewer iterations than value iteration, though each iteration is more expensive |
| Complexity | Less complex per iteration, as it involves only one step | More complex per iteration, as it involves two distinct steps |
| Examples | Value Iteration method | Policy Iteration method |
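To make the "single step" concrete, here is a minimal Python sketch of the value-iteration backup. The MDP encoding (`P[s][a]` as a list of `(probability, next_state, reward)` tuples) is an assumption chosen for the sketch, not something from the original:

```python
import numpy as np

# Minimal sketch of the value-iteration backup. The MDP format is an
# assumption for illustration: P[s][a] is a list of
# (probability, next_state, reward) tuples.
def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-6):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # The single fused step: a one-sweep (truncated) evaluation of
            # each action, followed immediately by improvement (the max).
            q = [sum(p * (r + gamma * V[ns]) for p, ns, r in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # stop once the largest value change is negligible
            return V
```

Policy iteration would instead run the inner evaluation to (near) convergence for a fixed policy before performing a separate greedy improvement step, which is why each of its iterations costs more but fewer are typically needed.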
Monte Carlo Methods vs Temporal Difference Learning
| Criteria | Monte Carlo Methods | Temporal Difference Learning |
|---|---|---|
| Reward Information | Uses the complete return (reward information from the whole episode) for updates | Uses the immediate reward plus the estimated return of the next state for updates |
| Bias/Variance | Unbiased with high variance | Biased with low variance |
| Learning Speed | Slower learning, as it waits until the end of the episode | Faster learning, as it updates estimates based on other estimates (bootstrapping) |
| Examples | First-Visit MC, Every-Visit MC | TD(0), SARSA, Q-Learning |
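The difference in update targets is easiest to see side by side. A minimal sketch, where the container names (`V` as a dict of state values, `episode` as a list of `(state, reward)` pairs) are assumptions for illustration:

```python
# Two update targets for state-value estimation, shown side by side.
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: wait until the episode ends, then use the
    full return G as the target (unbiased, but high variance)."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """TD(0): update from a single transition, bootstrapping on the current
    estimate of the next state (biased, but low variance)."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
```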
Model-Based vs Model-Free Methods
| Criteria | Model-Based Methods | Model-Free Methods |
|---|---|---|
| Model Requirement | Require a model of the environment (its transition and reward dynamics) | Do not require a model of the environment |
| Sample Efficiency | More sample efficient, as they can plan using the model | Less sample efficient, as they learn directly from experience |
| Complexity | More complex, as they need to maintain and learn the model | Less complex, as they do not need to maintain a model |
| Examples | Dyna-Q, Monte Carlo Tree Search (MCTS) | SARSA, Q-Learning, DQN |
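As an illustration of why a model buys sample efficiency, here is a sketch of one Dyna-Q step. The names and the deterministic-environment assumption are choices made for the sketch:

```python
import random
from collections import defaultdict

# Sketch of one Dyna-Q step: direct Q-learning from real experience, plus
# n_planning simulated updates replayed from a learned model. Assumes a
# deterministic environment, Q = defaultdict(float) keyed by
# (state, action), and model mapping (state, action) -> (reward, next_state).
def dyna_q_step(Q, model, s, a, r, s_next, actions,
                alpha=0.1, gamma=0.95, n_planning=10):
    # Direct RL: ordinary Q-learning update from the real transition
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    # Model learning: remember what the environment did
    model[(s, a)] = (r, s_next)
    # Planning: extra updates from remembered transitions, at no additional
    # environment cost -- the source of the sample efficiency
    for _ in range(n_planning):
        (ps, pa), (pr, pns) = random.choice(list(model.items()))
        best = max(Q[(pns, b)] for b in actions)
        Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
```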
On-Policy vs Off-Policy Methods
| Criteria | On-Policy Methods | Off-Policy Methods |
|---|---|---|
| Policy Used for Learning | Learn about the same policy that is currently being used to make decisions | Learn about a target policy (often the optimal one) while following a different behavior policy for exploration |
| Examples | SARSA, On-Policy First-Visit MC | Q-Learning, Off-Policy MC Control with Weighted Importance Sampling |
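The split shows up in a single line of the update rule: what the bootstrap target uses. A sketch, where `Q` is assumed to be a dict keyed by `(state, action)`:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses a_next, the action the behavior policy
    actually selected in s_next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the greedy action in s_next, regardless
    of what the behavior policy will actually do."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```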
Value-Based vs Policy-Based vs Actor-Critic
| Criteria | Value-Based Methods | Policy-Based Methods | Actor-Critic Methods |
|---|---|---|---|
| Description | Learn a value function and select actions based on it | Directly learn a parameterized policy without relying on a value function | Learn both a policy (actor) and a value function (critic) |
| Examples | Q-Learning, DQN, Value Iteration | REINFORCE and other policy gradient methods | Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG) |
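A minimal tabular sketch of the actor-critic idea, under assumed structures chosen for illustration: `H[s]` is a NumPy array of softmax action preferences (the actor) and `V` is a dict of state values (the critic). One TD error drives both updates:

```python
import numpy as np

# Tabular one-step actor-critic sketch (assumed names and structures).
def actor_critic_step(H, V, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    # Critic: TD error = how much better the outcome was than expected
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha_critic * td_error
    # Actor: gradient of log softmax(H[s])[a] is onehot(a) - probs
    probs = np.exp(H[s] - H[s].max())
    probs /= probs.sum()
    grad = -probs
    grad[a] += 1.0
    H[s] += alpha_actor * td_error * grad
```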
Single-Agent vs Multi-Agent
| Criteria | Single-Agent Methods | Multi-Agent Methods |
|---|---|---|
| Description | Designed for environments with a single decision-making entity | Designed for environments with multiple interacting entities |
| Examples | SARSA, Q-Learning, DQN, TD(0) | Multi-Agent DQN, Independent Q-Learning |
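Independent Q-Learning, the simplest multi-agent baseline, just runs one ordinary Q-learning update per agent and treats the other agents as part of the (now non-stationary) environment. A sketch with assumed names:

```python
from collections import defaultdict

# Independent Q-Learning sketch: one Q-table per agent, each updated with
# plain Q-learning on that agent's own observations.
def independent_q_update(Qs, agent_id, obs, action, reward, next_obs,
                         actions, alpha=0.1, gamma=0.99):
    Q = Qs[agent_id]  # this agent's own Q-table
    best_next = max(Q[(next_obs, b)] for b in actions)
    Q[(obs, action)] += alpha * (reward + gamma * best_next - Q[(obs, action)])

# Usage: Qs = {i: defaultdict(float) for i in range(n_agents)}, then call
# once per agent after every joint environment step.
```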
Tabular vs Function Approximation
| Criteria | Tabular Methods | Function Approximation Methods |
|---|---|---|
| Description | Maintain a table of values for each state-action pair | Use a function approximator (such as a neural network) to generalize across states |
| Examples | SARSA, Q-Learning, TD(0), Policy Iteration, Value Iteration, Monte Carlo methods | DQN, A2C, DDPG, linear function approximation with gradient descent |
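The jump from a table to an approximator is clearest in semi-gradient TD(0) with linear features. In this sketch, `w` (the weight vector) and the feature vectors are assumed to be NumPy arrays of the same length:

```python
import numpy as np

# Semi-gradient TD(0) with linear function approximation (sketch). Instead
# of one table entry per state, V(s) is approximated as w @ x_s, so a single
# update generalizes to every state with overlapping features.
def linear_td0_update(w, x_s, r, x_next, done, alpha=0.01, gamma=0.99):
    v_s = w @ x_s
    v_next = 0.0 if done else w @ x_next
    td_error = r + gamma * v_next - v_s
    w += alpha * td_error * x_s  # gradient of w @ x_s with respect to w is x_s
    return w
```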
Method | Tabular Method | Requires Knowledge of the Transition Function | Learns from Trajectories | Suitable for Continuing Tasks | Bootstrapping | Suitable for Off-Policy Learning |
---|---|---|---|---|---|---|
Policy Iteration | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ |
Value Iteration | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ |
Monte Carlo (first / every visit) | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ (with importance sampling) |
TD(0) | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ (with importance sampling) |
SARSA | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ (expected SARSA) |
Q-Learning | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ |