| Aspect | On-Policy | Off-Policy |
|---|---|---|
| Definition | Learns the value function of the same policy that is used to select actions | Can learn about a policy (the target policy) different from the one selecting actions (the behavior policy) |
| Policy Updating | Behavior policy and target policy are the same | Behavior policy and target policy can differ |
| Data Collection | Must collect data with the current policy | Can reuse data collected by any policy (e.g., from a replay buffer) |
| Exploration | Must balance exploration and exploitation within a single policy | Can learn a greedy target policy from data gathered by a highly exploratory behavior policy |
| Sample Efficiency | Generally less sample efficient; data becomes stale as the policy changes | Often more sample efficient; old data remains usable |
| Stability | Usually more stable during learning | Can be less stable, especially when combined with function approximation and bootstrapping |
| Examples | SARSA, Actor-Critic | Q-Learning, DQN |
| Convergence | Tabular SARSA converges to the optimal policy only if exploration decays appropriately (GLIE); otherwise it converges to the best policy under its exploration scheme | Tabular Q-learning converges to the optimal policy even while following an exploratory policy |
| Flexibility | Less flexible; tied to the current policy | More flexible; can learn about multiple policies from the same data |
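The core distinction in the table shows up directly in the TD update rules of SARSA (on-policy) and Q-learning (off-policy). The following is a minimal sketch on a hypothetical 2-state, 2-action tabular problem (states, actions, and hyperparameters are illustrative, not from any specific source): SARSA bootstraps from the action the behavior policy actually takes next, while Q-learning bootstraps from the greedy action regardless of what is taken.

```python
import random

# Illustrative hyperparameters (assumed, not from the table).
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Tabular Q-values for a hypothetical 2-state, 2-action MDP.
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

def epsilon_greedy(Q, s):
    """Behavior policy: explore with probability EPSILON, else act greedily."""
    if random.random() < EPSILON:
        return random.choice((0, 1))
    return max((0, 1), key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2):
    # On-policy: the target uses Q[s2, a2], where a2 is the action the
    # behavior policy actually selected in s2.
    Q[(s, a)] += ALPHA * (r + GAMMA * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2):
    # Off-policy: the target uses the greedy (max) action in s2,
    # independent of what the behavior policy does next.
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in (0, 1)) - Q[(s, a)])

# On one transition where the behavior policy explores (takes a2 = 0
# even though action 1 looks better), the two rules give different targets.
Q[(1, 1)] = 10.0
sarsa_q = dict(Q)
sarsa_update(sarsa_q, s=0, a=0, r=1.0, s2=1, a2=0)  # bootstraps from Q[(1, 0)] = 0
ql_q = dict(Q)
q_learning_update(ql_q, s=0, a=0, r=1.0, s2=1)      # bootstraps from Q[(1, 1)] = 10
print(sarsa_q[(0, 0)], ql_q[(0, 0)])  # → 0.5 5.0
```

Because Q-learning's target ignores the exploratory action, it can learn the greedy policy's values from any behavior policy's data; SARSA's estimate instead reflects the exploration it actually performs.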
Created August 1, 2024 18:52