| Aspect | On-Policy | Off-Policy |
|---|---|---|
| Definition | Learns the value function for the policy being used for action selection | Can learn about a different policy than the one being used for action selection |
| Policy Updating | Uses the same policy for both learning and action selection | Can use different policies for learning and action selection |
| Data Collection | Collects data using the current policy | Can use data collected from any policy |
| Exploration | Typically requires a balance between exploration and exploitation | Can learn from data collected using a highly exploratory policy |
| Sample Efficiency | Generally less sample efficient | Often more sample efficient |
| Stability | Usually more stable during learning | Can be less stable, especially with function approximation |
| Examples | SARSA, Actor-Critic | Q-Learning, DQN |
| Convergence | Converges to a locally optimal policy | Can converge to the optimal policy (in tabular cases) |
| Flexibility | Less flexible, tied to the current policy | More flexible, can learn about multiple policies |
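
To make the "Examples" row concrete, here is a minimal sketch contrasting the SARSA (on-policy) and Q-Learning (off-policy) update rules on a tabular Q-table. The state/action counts, learning rate `alpha`, discount `gamma`, and `epsilon` are illustrative assumptions, not values taken from the table.

```python
import numpy as np

# Illustrative hyperparameters (assumptions, not from the table above).
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state):
    """Behaviour policy used to select actions in the environment."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: the TD target uses the action the behaviour policy
    # actually takes next (a_next), so learning and acting share one policy.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(s, a, r, s_next):
    # Off-policy: the TD target uses the greedy (max) action regardless of
    # what the behaviour policy will do next, so it learns about the greedy
    # policy while the data comes from an exploratory one.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The only difference between the two updates is the bootstrap term: SARSA plugs in the next action sampled from the behaviour policy, while Q-Learning takes the max over actions, which is what makes it off-policy.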