# DeepMind x UCL RL Lecture Series
Environment - dynamics of the problem
Reward - specifies the goal
Agent - made up of:
* agent state
* policy
* value function estimate - optional
* model - optional
## 1. Inside the agent: state
each time step:
observation -> state -> policy + predictions
agent-environment interaction loop: the environment (with its own environment state) sends an observation to the agent, and the agent sends an action back
history H_t = the sequence of observations, actions and rewards up to time t
full observability: the observation is the environment state
S_t = O_t = environment state
Markov Decision Process (MDP)
a state is Markovian if the probability of the next reward and state doesn't change when we add more history:
p(r, s | S_t, A_t) = p(r, s | H_t, A_t)
the state contains all we need to know
adding more history doesn't help
once the state is known, the history can be thrown away
the full history is Markov, but it keeps growing
the state is some compression of the history
use S_t to denote the agent state, not the environment state (sometimes the same, but don't assume that)
### partial observability:
observations are not Markovian
a robot with a camera isn't told its absolute location
a poker-playing agent only observes the public cards (not the other players' cards)
now using the observation as the state would not be Markovian
this is called a partially observable Markov decision process (POMDP)
the environment state can still be Markov, but the agent doesn't know it
we might still be able to construct a Markov agent state
the agent's actions depend on its state
the agent state is a function of the history, e.g.
S_t = O_t
or more generally: S_{t+1} = u(S_t, A_t, R_{t+1}, O_{t+1})
where u is the state update function
the agent state is often much smaller than the environment state and the full history (compression)
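A minimal sketch (my own, not from the lecture) of such a state update: the agent state is a fixed-size, decayed summary of the observation stream, so it compresses the history instead of storing it. The feature vectors and the `decay` parameter are illustrative assumptions.

```python
import numpy as np

def state_update(s, a, r, o, decay=0.9):
    """Toy u(S_t, A_t, R_{t+1}, O_{t+1}) -> S_{t+1}: an exponentially
    decayed trace of observation features (ignores a and r for brevity)."""
    return decay * s + (1.0 - decay) * np.asarray(o, dtype=float)

# the state stays a fixed size no matter how long the history grows
s = np.zeros(3)                                   # initial agent state
for o in [[1, 0, 0], [0, 1, 0], [0, 0, 1]]:       # a short stream of observations
    s = state_update(s, a=None, r=0.0, o=o)
print(s)
```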
## 2. Inside the agent: Policy
a policy defines the agent's behaviour
it is a map from agent state to action
deterministic policy always gives the same answer
A = pi(S)
stochastic policy gives probability of action given a state
pi(A|S) = p(A|S)
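A small sketch of the two kinds of policy, assuming a discrete action set; the action names and the uniform distribution are just for illustration.

```python
import numpy as np

ACTIONS = ["N", "E", "S", "W"]
rng = np.random.default_rng(0)

def deterministic_policy(state):
    """A = pi(S): the same state always maps to the same action."""
    return ACTIONS[hash(state) % len(ACTIONS)]

def stochastic_policy(state):
    """pi(A|S) = p(A|S): sample an action from a distribution over actions."""
    probs = np.full(len(ACTIONS), 1.0 / len(ACTIONS))   # uniform, purely illustrative
    return rng.choice(ACTIONS, p=probs)

print(deterministic_policy((0, 0)), stochastic_policy((0, 0)))
```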
### value function
expected return
v_pi(s) = E[ G_t | S_t = s, pi ]
        = E[ R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... | S_t = s, pi ]
the value function depends on the policy: every action is selected according to pi
the discount factor gamma is in [0, 1]
it trades off the importance of immediate vs long-term rewards
gamma = 1 means all rewards are equally important
the value depends on a policy
it can be used to evaluate the desirability of states
and to select between actions
recursive form: G_t = R_{t+1} + gamma * G_{t+1}
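A quick numeric check of the discounted return and its recursive form, using made-up rewards and gamma:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):        # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]              # illustrative rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```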
Bellman equation:
v_pi(s) = E[ R_{t+1} + gamma * v_pi(S_{t+1}) | S_t = s, A_t ~ pi(s) ]
where A_t ~ pi(s) means the action is chosen by policy pi in state s (even if pi is deterministic)
the optimal value is the highest possible value:
v*(s) = max_a E[ R_{t+1} + gamma * v*(S_{t+1}) | S_t = s, A_t = a ]
it does not depend on a policy
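A tiny value-iteration sketch that repeatedly applies the Bellman optimality backup; the two-state MDP below is made up purely for illustration.

```python
import numpy as np

# made-up MDP: P[s][a] = list of (prob, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
gamma = 0.9
v = np.zeros(2)

for _ in range(100):     # repeated backup: v(s) <- max_a E[ r + gamma * v(s') ]
    v = np.array([
        max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in P
    ])
print(v)                 # converges towards v*(s) for each state
```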
agents approximate value functions
## 3. Inside the Agent: Model
a model predicts what the environment will do next
a model of the world
P predicts the next state:
P(s, a, s') ~= p(S_{t+1} = s' | S_t = s, A_t = a)
inputs: state, action, next state
output: an approximation to the actual probability of seeing that next state, given that we observed this state and action
R approximates the (immediate) reward:
R(s, a) ~= E[ R_{t+1} | S_t = s, A_t = a ]
a model does not immediately give us a good policy
we still need to plan
we could also consider stochastic (generative) models
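A minimal sketch (my own) of a learned tabular model: count observed transitions and average the rewards to approximate P(s, a, s') and R(s, a).

```python
from collections import defaultdict

class TabularModel:
    """Approximate P(s, a, s') and R(s, a) from experienced transitions."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                       # (s, a) -> summed reward
        self.visits = defaultdict(int)                             # (s, a) -> visit count

    def update(self, s, a, r, s_next):
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def P(self, s, a, s_next):
        """Estimate of p(S_{t+1} = s' | S_t = s, A_t = a)."""
        n = self.visits[(s, a)]
        return self.next_counts[(s, a)][s_next] / n if n else 0.0

    def R(self, s, a):
        """Estimate of E[R_{t+1} | S_t = s, A_t = a]."""
        n = self.visits[(s, a)]
        return self.reward_sum[(s, a)] / n if n else 0.0
```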
example (maze / gridworld, see the sketch after this list):
state: the agent's (x, y) location
rewards: -1 per time step
actions: N, E, S, W
policy pi(s): actions along the shortest path to the goal
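A sketch of that example under stated assumptions (a 4x4 grid with the goal in one corner, both invented here): value iteration on the -1-per-step rewards yields the shortest-path policy.

```python
# made-up 4x4 grid, goal in one corner, -1 reward per step, gamma = 1
GOAL = (3, 3)
MOVES = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}
states = [(x, y) for x in range(4) for y in range(4)]

def step(s, a):
    if s == GOAL:
        return s, 0.0                                   # goal is absorbing, no further cost
    (x, y), (dx, dy) = s, MOVES[a]
    return (min(max(x + dx, 0), 3), min(max(y + dy, 0), 3)), -1.0

# value iteration with deterministic dynamics: v(s) <- max_a [ r + v(s') ]
v = {s: 0.0 for s in states}
for _ in range(50):
    v = {s: max(step(s, a)[1] + v[step(s, a)[0]] for a in MOVES) for s in states}

# greedy policy with respect to v = shortest path to the goal
pi = {s: max(MOVES, key=lambda a: step(s, a)[1] + v[step(s, a)[0]]) for s in states}
print(v[(0, 0)], pi[(0, 0)])                            # 6 steps from the far corner
```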
## Agent Categories
value based
* value function
* no policy (it is implicit)
policy based
* policy
* no value function
actor critic
* policy
* value function
model free
* policy and/or value function
* no model
model based
* optionally a policy and/or value function
* model
## subproblems
* prediction: evaluate the future for a given policy (see the sketch after this list)
* control: optimise the future, i.e. find the best policy
pi*(s) = argmax_pi v_pi(s)
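A sketch of the prediction subproblem: iterative policy evaluation of a fixed (uniform) policy on the same kind of made-up two-state MDP used earlier.

```python
import numpy as np

# made-up MDP again: P[s][a] = list of (prob, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
pi = {s: {a: 0.5 for a in P[s]} for s in P}    # fixed uniform policy to evaluate
gamma = 0.9

# prediction: v_pi(s) = E[ R_{t+1} + gamma * v_pi(S_{t+1}) | S_t = s, A_t ~ pi(s) ]
v = np.zeros(2)
for _ in range(200):
    v = np.array([
        sum(pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
        for s in P
    ])
print(v)    # value of each state under the fixed policy pi
```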
2 fundamental problems:
1. learning: the environment is initially unknown and the agent learns by interacting with it
2. planning:
* a model of the environment is given (or has been learnt)
* the agent plans in this model without external interaction
a.k.a. reasoning, pondering, thought, search, planning
all components are functions
* policies: pi: S -> A
* value functions: v: S -> R
* models: m: S -> S and/or r: S -> R
* state update: u: S x O -> S
deep learning can learn functions
but be careful: the data can be correlated (risk of overfitting)
and non-stationary, e.g. the policy keeps changing, which changes the value function we are trying to learn
deep reinforcement learning: deep learning + reinforcement learning
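A minimal sketch of learning one of those functions with deep learning: a small value network v: S -> R regressed onto made-up targets. PyTorch and all the sizes here are my own assumptions, not something the lecture specifies.

```python
import torch
import torch.nn as nn

# value function as a learned function v: S -> R
value_net = nn.Sequential(
    nn.Linear(4, 32),                  # 4 = assumed size of the agent-state vector
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimiser = torch.optim.Adam(value_net.parameters(), lr=1e-3)

states = torch.randn(64, 4)            # made-up batch of agent states
targets = torch.randn(64, 1)           # made-up value targets (e.g. sampled returns)

for _ in range(100):                   # regress the network onto the targets
    loss = nn.functional.mse_loss(value_net(states), targets)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
print(loss.item())
```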
example: Atari
observation o_t is the pixels
action a_t is the joystick input
reward r_t is the score
the rules of the game are unknown
learn directly from game play
pick actions on the joystick, see the pixels and the score