# DeepMind x UCL RL Lecture Series
Environment - dynamics of the problem
Reward - specifies the goal
Agent:
* agent state
* policy
* value function estimate (optional)
* model (optional)
## 1. Inside the agent: state
Each time step:
observation -> state -> policy + predictions
The environment has its own environment state.
environment --> observation --> agent
environment <-- action <-- agent
history = all observations, actions and rewards up to time t
Full observability: the observation equals the environment state,
S_t = O_t = environment state
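A minimal sketch of this interaction loop, assuming a hypothetical Gym-style environment with `reset`/`step` methods and a placeholder random agent (the names and interface are illustrative, not from the lecture):

```python
import random

class RandomAgent:
    """Stands in for a real policy: picks actions uniformly at random."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, observation):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=1000):
    observation = env.reset()
    history = []                  # H_t: observations, actions and rewards so far
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)                 # agent --> action --> environment
        observation, reward, done = env.step(action)    # environment --> observation, reward --> agent
        history.append((observation, action, reward))
        total_reward += reward
        if done:
            break
    return total_reward, history
```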
Markov Decision Process (MDP)
A state is Markovian if the probability of the next reward and state does not change when we add more history:
p(r, s | S_t, A_t) = p(r, s | H_t, A_t)
The state contains all we need to know: adding more history does not help,
and once the state is known the history can be thrown away.
The full history is Markov, but it keeps growing; the state is some compression of the history.
Use S_t to denote the agent state, not the environment state (sometimes the same, but don't assume that).
### partial observability
Observations are not Markovian:
* a robot with a camera isn't told its absolute location
* a poker-playing agent only observes the public cards (not the other players' cards)
Using the observation as the state would then not be Markovian.
This is called a partially observable Markov decision process (POMDP).
The environment state can still be Markov, but the agent doesn't know it.
We might still be able to construct a Markov agent state.
The agent's actions depend on its state, and the agent state is a function of the history:
S_t = O_t (simplest case), or more generally
S_{t+1} = u(S_t, A_t, R_{t+1}, O_{t+1})
where u is the state update function.
The agent state is often much smaller than the environment state and the full history (compression).
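A minimal sketch of one possible state update function u, assuming the agent state is just a fixed-length window of recent observations (the window size and structure are assumptions for illustration):

```python
from collections import deque

def make_state_update(window=4):
    """Returns u(s, a, r, o): keeps the last `window` observations as the agent state."""
    def u(state, action, reward, observation):
        new_state = deque(state, maxlen=window)   # copy the previous agent state
        new_state.append(observation)             # S_{t+1} = u(S_t, A_t, R_{t+1}, O_{t+1})
        return tuple(new_state)
    return u

u = make_state_update(window=4)
state = ()   # initial (empty) agent state
# each time step: state = u(state, action, reward, next_observation)
```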
## 2. Inside the agent: Policy
A policy defines the agent's behaviour.
It is a map from agent state to action.
A deterministic policy always gives the same action in the same state:
A = pi(S)
A stochastic policy gives a probability for each action given a state:
pi(A|S) = p(A|S)
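A small sketch of the difference, with a made-up action set and probabilities (not from the lecture):

```python
import random

ACTIONS = ["N", "E", "S", "W"]

def deterministic_policy(state):
    # A = pi(S): the same state always maps to the same action
    return ACTIONS[hash(state) % len(ACTIONS)]

def stochastic_policy(state):
    # pi(A|S): sample an action from a probability distribution over actions
    probs = [0.4, 0.3, 0.2, 0.1]   # assumed distribution for this state
    return random.choices(ACTIONS, weights=probs, k=1)[0]
```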
### value function
The value is the expected return:
v_pi(s) = E[G_t | S_t = s, pi]
        = E[R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + ... | S_t = s, pi]
The value function depends on the policy: every action is selected according to pi.
The discount factor gamma is in [0, 1].
It trades off the importance of immediate vs long-term rewards (gamma = 1 means all rewards are equally important).
Values can be used to evaluate the desirability of states and to select between actions.
Recursive form: G_t = R_{t+1} + gamma * G_{t+1}
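A quick sketch of computing a discounted return from a list of rewards using that recursive form (the reward values are made up):

```python
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([-1, -1, -1, 10], gamma=0.9))  # return from the first time step
```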
Bellman equation:
v_pi(s) = E[R_{t+1} + gamma * v_pi(S_{t+1}) | S_t = s, A_t ~ pi(s)]
where a ~ pi(s) means a is chosen by policy pi in state s (even if pi is deterministic).
The optimal value is the highest possible value:
v*(s) = max_a E[R_{t+1} + gamma * v*(S_{t+1}) | S_t = s, A_t = a]
This does not depend on a policy.
Agents often approximate value functions.
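A sketch of turning the Bellman optimality equation into an iterative backup (value iteration) on a tiny tabular MDP; the states, transitions and rewards are invented for illustration:

```python
# P[s][a] = list of (probability, next_state, reward), a made-up two-state MDP
P = {
    "s0": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 1.0)]},
    "s1": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 0.5)]},
}
gamma = 0.9
v = {s: 0.0 for s in P}

for _ in range(100):   # repeat the Bellman backup until values (approximately) converge
    v = {
        s: max(
            sum(p * (r + gamma * v[s_next]) for p, s_next, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }

print(v)   # approximately v*(s) for each state
```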
## 3. Inside the Agent: Model
A model predicts what the environment will do next (a model of the world).
P predicts the next state:
P(s, a, s') ~= p(S_{t+1} = s' | S_t = s, A_t = a)
inputs: state, action, next state
output: an approximation to the actual probability of seeing that next state, given the previous state and action
R approximates the immediate reward:
R(s, a) ~= E[R_{t+1} | S_t = s, A_t = a]
A model does not immediately give us a good policy: we still need to plan.
We can also consider stochastic (generative) models.
Example (grid world):
* state: the agent's location (x, y)
* rewards: -1 per time step
* actions: N, E, S, W
* policy pi(s): actions along the shortest path (sketched below)
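A small sketch of that example: a deterministic model of the grid (the size and goal location are assumptions), plus a shortest-path policy obtained by planning in the model with breadth-first search:

```python
from collections import deque

WIDTH, HEIGHT = 5, 5
GOAL = (4, 4)
ACTIONS = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def model(state, action):
    """Model of the world: returns (next_state, reward)."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx = min(max(x + dx, 0), WIDTH - 1)
    ny = min(max(y + dy, 0), HEIGHT - 1)
    return (nx, ny), -1.0        # -1 reward per time step

def shortest_path_policy(start):
    """Plan in the model (BFS) and return the first action on a shortest path to GOAL."""
    queue = deque([(start, None)])
    seen = {start}
    while queue:
        state, first_action = queue.popleft()
        if state == GOAL:
            return first_action
        for a in ACTIONS:
            next_state, _ = model(state, a)
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, first_action or a))
    return None

print(shortest_path_policy((0, 0)))   # e.g. "N" or "E"
```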
## Agent Categories
value based
* value function
* no explicit policy (the policy is implicit in the values)
policy based
* policy
* no value function
actor critic
* policy
* value function
model free
* policy and/or value function
* no model
model based
* optionally policy and/or value function
* model
## Subproblems
* prediction: evaluate the future for a given policy
* control: optimise the future, i.e. find the best policy
pi*(s) = argmax_pi v_pi(s)
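A sketch of the two subproblems on a tabular MDP, reusing the same made-up transition dictionary P and discount gamma as the value-iteration sketch above (prediction evaluates a fixed policy; control greedifies with respect to the optimal values):

```python
def predict(P, pi, gamma=0.9, sweeps=100):
    """Prediction: evaluate v_pi for a fixed deterministic policy pi (a dict state -> action)."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        v = {
            s: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][pi[s]])
            for s in P
        }
    return v

def control(P, gamma=0.9, sweeps=100):
    """Control: compute (approximately) optimal values, then act greedily with respect to them."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        v = {
            s: max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
            for s in P
        }
    pi_star = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    return pi_star, v
```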
Two fundamental problems:
1. learning: the environment is initially unknown
2. planning:
* we have a model of the environment (given or learnt)
* the agent plans in this model without external interaction
* a.k.a. reasoning, pondering, thought, search, planning
All components are functions:
* policies: pi: S -> A
* value functions: v: S -> R
* models: m: S -> S and/or r: S -> R
* state update: u: S x O -> S
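The same signatures written out as Python type aliases, just to make the "everything is a function" view concrete (the concrete State/Action/Obs types, and including the action in the model and update signatures, are assumptions for illustration):

```python
from typing import Callable, Tuple

State = Tuple[float, ...]    # placeholder agent-state type
Action = int                 # placeholder action type
Obs = Tuple[float, ...]      # placeholder observation type

Policy = Callable[[State], Action]                          # pi: S -> A
ValueFunction = Callable[[State], float]                    # v: S -> R
TransitionModel = Callable[[State, Action], State]          # m: S x A -> S
RewardModel = Callable[[State, Action], float]              # r: S x A -> R
StateUpdate = Callable[[State, Action, float, Obs], State]  # u: S x A x R x O -> S
```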
Deep learning can learn such functions, but:
* data can be correlated, which can lead to overfitting
* the problem is non-stationary, e.g. the policy keeps changing, which changes the value function
Deep reinforcement learning = deep learning + reinforcement learning.
Example: Atari
* observation O_t is the pixels
* action A_t is the joystick input
* reward R_t is the score
* the rules of the game are unknown
* learn directly from game play: pick actions on the joystick, see pixels and score
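A sketch of that setup as an interaction loop, assuming the Gymnasium Atari environments are installed (`pip install "gymnasium[atari]"`); the environment id and exact API may differ between versions, and the agent here is random rather than learning:

```python
import gymnasium as gym

env = gym.make("ALE/Breakout-v5")          # assumed environment id
observation, info = env.reset()            # O_t: raw pixels (an image array)

total_score = 0.0
done = False
while not done:
    action = env.action_space.sample()     # A_t: a joystick action (random, no learning)
    observation, reward, terminated, truncated, info = env.step(action)
    total_score += reward                  # R_t: the score received this step
    done = terminated or truncated

print("episode score:", total_score)
env.close()
```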