Small overview of different RL techniques
1. Value Iteration vs Policy Iteration (sketch 1 below the list)

   | Criteria | Value Iteration | Policy Iteration |
   | --- | --- | --- |
   | Steps | A single step that combines policy improvement with truncated policy evaluation | Two distinct steps: policy evaluation, then policy improvement |
   | Convergence | Converges to the optimal value function only in the limit; in practice it is stopped once the value change falls below a threshold | Converges to the optimal policy in a finite number of iterations, often fewer than value iteration needs |
   | Complexity | Less complex per iteration, as it performs only one backup per state | More complex per iteration, as the evaluation step is itself iterative (or a linear solve) |
   | Examples | Value Iteration | Policy Iteration |

2. Monte Carlo Methods vs Temporal Difference Learning (sketch 2 below the list)

   | Criteria | Monte Carlo Methods | Temporal Difference Learning |
   | --- | --- | --- |
   | Reward Information | Uses the complete return (all rewards of the episode) for updates | Uses the immediate reward plus the estimated return of the next state |
   | Bias/Variance | Unbiased, with high variance | Biased, with low variance |
   | Learning Speed | Slower, as it waits until the end of the episode | Faster, as it updates estimates based on other estimates (bootstrapping) |
   | Examples | First-Visit MC, Every-Visit MC | TD(0), SARSA, Q-Learning |

3. Model-Based vs Model-Free Methods (sketch 3 below the list)

   | Criteria | Model-Based Methods | Model-Free Methods |
   | --- | --- | --- |
   | Model Requirement | Require a model of the environment | Do not require a model of the environment |
   | Sample Efficiency | More sample efficient, as they can plan using the model | Less sample efficient, as they learn directly from experience |
   | Complexity | More complex, as they need to maintain and learn the model | Less complex, as they do not need to maintain a model |
   | Examples | Dyna-Q, Monte Carlo Tree Search (MCTS) | SARSA, Q-Learning, DQN |

4. On-Policy vs Off-Policy Methods (sketch 4 below the list)

   | Criteria | On-Policy Methods | Off-Policy Methods |
   | --- | --- | --- |
   | Policy Used for Learning | Learn about the policy currently being used to make decisions | Learn about an optimal policy while following a separate exploration policy |
   | Examples | SARSA, On-Policy First-Visit MC | Q-Learning, Off-Policy MC Control with Weighted Importance Sampling |

5. Value-Based vs Policy-Based vs Actor-Critic (sketch 5 below the list)

   | Criteria | Value-Based Methods | Policy-Based Methods | Actor-Critic Methods |
   | --- | --- | --- | --- |
   | Description | Learn a value function and select actions based on it | Directly learn a policy without using a value function | Learn both a policy (actor) and a value function (critic) |
   | Examples | Q-Learning, DQN, Value Iteration, Policy Iteration | REINFORCE, Policy Gradient | Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG) |

6. Single-Agent vs Multi-Agent (sketch 6 below the list)

   | Criteria | Single-Agent Methods | Multi-Agent Methods |
   | --- | --- | --- |
   | Description | Designed for environments with a single decision-making entity | Designed for environments with multiple interacting entities |
   | Examples | SARSA, Q-Learning, DQN, TD(0) | Multi-Agent DQN, Independent Q-Learning |

7. Tabular vs Function Approximation (sketch 7 below the list)

   | Criteria | Tabular Methods | Function Approximation Methods |
   | --- | --- | --- |
   | Description | Maintain a table of values for each state-action pair | Use a function approximator (like a neural network) to generalize across states |
   | Examples | SARSA, Q-Learning, TD(0), Policy Iteration, Value Iteration, Monte Carlo methods | DQN, A2C, DDPG, Linear Function Approximation, Function Approximation with Gradient Descent |

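Sketch 1 (Value Iteration vs Policy Iteration): a minimal NumPy sketch of the two dynamic-programming loops side by side, assuming a tiny randomly generated MDP; the arrays `P`, `R`, the discount and the tolerance are illustrative choices, not part of the overview above.

```python
# Sketch only: toy MDP with made-up random dynamics P[s, a, s'] and rewards R[s, a].
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probabilities
R = rng.normal(size=(n_states, n_actions))                        # expected rewards

def value_iteration(tol=1e-8):
    V = np.zeros(n_states)
    while True:
        # one combined step: greedy backup = truncated evaluation + improvement
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new

def policy_iteration():
    policy = np.zeros(n_states, dtype=int)
    while True:
        # step 1: policy evaluation (solve the linear system for V^pi exactly)
        P_pi = P[np.arange(n_states), policy]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # step 2: policy improvement (greedy with respect to V^pi)
        new_policy = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

print(value_iteration()[0], policy_iteration()[0])  # both should report the same greedy policy
```

Policy iteration typically needs fewer outer iterations, but each of its iterations is more expensive than a single value-iteration sweep.
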
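Sketch 2 (Monte Carlo vs TD(0)): the two prediction updates for a state-value table, assuming an episode is available as a list of `(state, reward)` pairs; the data format, `gamma` and `alpha` are assumptions made for illustration.

```python
from collections import defaultdict

gamma, alpha = 0.99, 0.1
V = defaultdict(float)  # state-value estimates

def mc_every_visit_update(episode):
    """Monte Carlo: wait until the episode ends, then update with the complete return G."""
    G = 0.0
    for state, reward in reversed(episode):  # reward received after leaving `state`
        G = reward + gamma * G               # full return from this step onwards
        V[state] += alpha * (G - V[state])   # unbiased target, high variance

def td0_update(state, reward, next_state, done):
    """TD(0): update immediately from one transition, bootstrapping on V[next_state]."""
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])  # biased target, low variance
```
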
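Sketch 3 (model-based vs model-free): a sketch of Dyna-Q, which wraps the model-free Q-learning update with extra planning updates replayed from a learned one-step model; the deterministic model, the dictionary layout `Q[(state, action)]` and the `actions` argument are simplifying assumptions.

```python
import random
from collections import defaultdict

gamma, alpha, n_planning = 0.95, 0.1, 10
Q = defaultdict(float)   # Q[(state, action)]
model = {}               # learned one-step model: (s, a) -> (r, s')

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_q_step(s, a, r, s_next, actions):
    q_update(s, a, r, s_next, actions)  # model-free update from the real transition
    model[(s, a)] = (r, s_next)         # model learning (deterministic for simplicity)
    for _ in range(n_planning):         # planning: extra updates from simulated experience
        ps, pa = random.choice(list(model))
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next, actions)
```
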
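Sketch 4 (on-policy vs off-policy): the SARSA and Q-learning updates differ only in the target. SARSA uses the action the behaviour policy actually takes next (on-policy); Q-learning uses the greedy action regardless of what is taken next (off-policy). The table layout `Q[(state, action)]` is an assumption.

```python
from collections import defaultdict

gamma, alpha = 0.99, 0.1
Q = defaultdict(float)  # Q[(state, action)]

def sarsa_update(s, a, r, s_next, a_next):
    # on-policy: the target uses the next action actually chosen (e.g. epsilon-greedily)
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, actions):
    # off-policy: the target uses the greedy action, whatever the behaviour policy does next
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```
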
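Sketch 5 (value-based vs policy-based vs actor-critic): a tabular softmax policy updated by REINFORCE (pure policy gradient, no value function) and by a simple one-step actor-critic, where the critic's TD error replaces the Monte Carlo return; the sizes, learning rates and episode format are illustrative assumptions.

```python
import numpy as np

n_states, n_actions, gamma, lr_actor, lr_critic = 5, 2, 0.99, 0.1, 0.1
theta = np.zeros((n_states, n_actions))  # policy (actor) parameters
V = np.zeros(n_states)                   # value function (critic)

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_update(states, actions, rewards):
    """Policy-based: the Monte Carlo return weights the log-policy gradient; no critic."""
    G = 0.0
    for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
        G = r + gamma * G
        grad = -policy(s)
        grad[a] += 1.0                    # d log pi(a|s) / d theta[s] for a softmax policy
        theta[s] += lr_actor * G * grad

def actor_critic_update(s, a, r, s_next, done):
    """Actor-critic: the critic's one-step TD error replaces the full return."""
    td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += lr_critic * td_error          # critic update
    grad = -policy(s)
    grad[a] += 1.0
    theta[s] += lr_actor * td_error * grad  # actor update
```
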
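Sketch 6 (single-agent vs multi-agent): Independent Q-Learning simply gives each agent its own Q-table and runs the ordinary single-agent update, treating the other agents as part of the environment; the agent names and per-step arguments are hypothetical.

```python
from collections import defaultdict

gamma, alpha = 0.95, 0.1
Q = {agent: defaultdict(float) for agent in ("agent_0", "agent_1")}  # one table per agent

def independent_q_update(agent, s, a, r, s_next, actions):
    # each agent runs a plain single-agent Q-learning update on its own table,
    # which makes the environment appear non-stationary from its point of view
    best_next = max(Q[agent][(s_next, a2)] for a2 in actions)
    Q[agent][(s, a)] += alpha * (r + gamma * best_next - Q[agent][(s, a)])
```
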
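Sketch 7 (tabular vs function approximation): the same Q-learning update kept as an explicit table versus approximated with a linear function of features (a semi-gradient update); the feature map `phi(state, action)` is a hypothetical function returning a fixed-length NumPy vector.

```python
import numpy as np
from collections import defaultdict

gamma, alpha, n_features = 0.99, 0.05, 8
Q_table = defaultdict(float)  # tabular: one entry per (state, action) pair
w = np.zeros(n_features)      # function approximation: one weight vector shared by all states

def tabular_update(s, a, r, s_next, actions):
    target = r + gamma * max(Q_table[(s_next, a2)] for a2 in actions)
    Q_table[(s, a)] += alpha * (target - Q_table[(s, a)])

def linear_fa_update(phi, s, a, r, s_next, actions):
    # q(s, a) = w . phi(s, a): states that share features share value estimates
    q_next = max(w @ phi(s_next, a2) for a2 in actions)
    td_error = r + gamma * q_next - w @ phi(s, a)
    w[:] += alpha * td_error * phi(s, a)  # semi-gradient Q-learning step
```
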

Summary of the classic (tabular) methods:

| Method | Tabular Method | Requires Knowledge of the Transition Function | Learning from Trajectory | Suitable for Continuing Tasks | Bootstrapping | Suitable for Off-Policy Learning |
| --- | --- | --- | --- | --- | --- | --- |
| Policy Iteration | ✓ | ✓ | ✗ | ✓ | ✓ | n/a |
| Value Iteration | ✓ | ✓ | ✗ | ✓ | ✓ | n/a |
| Monte Carlo (first / every visit) | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ (with importance sampling) |
| TD(0) | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ (with importance sampling) |
| SARSA | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ (expected SARSA) |
| Q-Learning | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ |