Analysis of An Optimistic Perspective on Offline Reinforcement Learning

NicolaBernini commented Jul 31, 2020

Overview (Latex All the Things Version)

How can RL be categorized at a very high level?

There are two RL paradigms, both sketched in code after the list below:

  • Online RL learns by interacting with the environment: every observation comes from the current best policy, i.e. the policy obtained by incorporating each new observation as soon as it is available. It works like this:

    • policy $\pi_{t}$ is used to observe $o_{t}$
    • policy is immediately updated $\pi_{t+1} = f_{update}(o_{t}; \pi_{t})$
  • Offline RL (or Batch RL) is radically different from the previous approach: the policy is updated in batch mode, which means:

    • policy $\pi_{t}$ is used to collect a batch of N observations $\{o_{i;t}\}_{i=1,...,N}$
      • the subscript $t$ indicates that all these observations were collected with $\pi_{t}$
    • policy is then updated $\pi_{t+1} = f_{update}(\{o_{i;t}\}_{i=1,...,N}; \pi_{t})$
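A minimal Python sketch of the two update schedules above may make the contrast clearer; `env`, `policy`, and `f_update` are hypothetical placeholders used only for illustration, not the interfaces from the paper.

```python
# Hypothetical sketch of the two update schedules; `env`, `policy`,
# and `f_update` are placeholders, not the paper's actual implementation.

def online_rl(env, policy, f_update, num_steps):
    """Online RL: the policy is updated after every single observation."""
    obs = env.reset()
    for t in range(num_steps):
        action = policy(obs)
        obs = env.step(action)             # o_t, observed under pi_t
        policy = f_update([obs], policy)   # pi_{t+1} = f_update(o_t; pi_t)
    return policy


def offline_batch_rl(env, policy, f_update, num_iterations, batch_size):
    """Offline / Batch RL: a whole batch is collected with the same policy
    before a single update is performed."""
    for t in range(num_iterations):
        obs = env.reset()
        batch = []
        for i in range(batch_size):        # {o_{i;t}} all collected with pi_t
            action = policy(obs)
            obs = env.step(action)
            batch.append(obs)
        policy = f_update(batch, policy)   # pi_{t+1} = f_update({o_{i;t}}; pi_t)
    return policy
```

The only difference between the two loops is when `f_update` is called: after every step in the online case, and once per collected batch in the offline case.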

The latter method is generalized by Off-Policy RL, where there is a distinction between

  • the exploration policy (or policies), aimed at collecting observations, and
  • the exploitation policy

This framework allows the agent to learn a policy without any environment interaction at all, using just a dataset of previously collected experiences.
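As an illustration of this fully offline setting, here is a minimal sketch where a tabular Q-learning update is applied to a fixed dataset of transitions; the tabular update is just one simple choice for illustration and is not the method studied in the paper.

```python
from collections import defaultdict

def offline_q_learning(dataset, num_actions, epochs=10, alpha=0.1, gamma=0.99):
    """Learn a policy purely from a fixed dataset of (s, a, r, s') transitions
    collected beforehand by some behavior (exploration) policy."""
    Q = defaultdict(lambda: [0.0] * num_actions)
    for _ in range(epochs):
        for s, a, r, s_next in dataset:          # replayed, never re-collected
            target = r + gamma * max(Q[s_next])  # bootstrap from next state
            Q[s][a] += alpha * (target - Q[s][a])

    def policy(state):
        # Greedy exploitation policy derived from the learned Q-values
        return max(range(num_actions), key=lambda a: Q[state][a])

    return policy
```

The dataset is replayed for several epochs but never extended, which is exactly what separates this setting from online RL, where new observations keep arriving from the latest policy.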
