# DeepMind x UCL RL Lecture Series
Environment - dynamics of the problem
Reward - specifies the goal
Agent:
* agent state
* policy
* value function estimate (optional)
* model (optional)
## 1. Inside the agent: state
Each time step:
observation -> state -> policy + predictions
The environment has its own environment state.
environment --> observation --> agent
environment <-- action <-- agent
history = all observations, actions and rewards up to time t
Full observability: the observation equals the environment state,
S_t = O_t = environment state
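A minimal sketch of this interaction loop, assuming a hypothetical Gym-style environment with `reset`/`step` methods and a placeholder random agent (the names and interface are illustrative, not from the lecture):

```python
import random

class RandomAgent:
    """Stands in for a real policy: picks actions uniformly at random."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, observation):
        return random.choice(self.actions)

def run_episode(env, agent, max_steps=1000):
    observation = env.reset()
    history = []                  # H_t: observations, actions and rewards so far
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)                 # agent --> action --> environment
        observation, reward, done = env.step(action)    # environment --> observation, reward --> agent
        history.append((observation, action, reward))
        total_reward += reward
        if done:
            break
    return total_reward, history
```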
Markov Decision Process (MDP)
A state is Markovian if the probability of the next reward and state does not change when we add more history:
p(r, s | S_t, A_t) = p(r, s | H_t, A_t)
The state contains all we need to know: adding more history does not help,
and once the state is known the history can be thrown away.
The full history is Markov, but it keeps growing; the state is some compression of the history.
Use S_t to denote the agent state, not the environment state (sometimes the same, but don't assume that).
### partial observability
Observations are not Markovian:
* a robot with a camera isn't told its absolute location
* a poker-playing agent only observes the public cards (not the other players' cards)
Using the observation as the state would then not be Markovian.
This is called a partially observable Markov decision process (POMDP).
The environment state can still be Markov, but the agent doesn't know it.
We might still be able to construct a Markov agent state.
The agent's actions depend on its state, and the agent state is a function of the history:
S_t = O_t (simplest case), or more generally
S_{t+1} = u(S_t, A_t, R_{t+1}, O_{t+1})
where u is the state update function.
The agent state is often much smaller than the environment state and the full history (compression).
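A minimal sketch of one possible state update function u, assuming the agent state is just a fixed-length window of recent observations (the window size and structure are assumptions for illustration):

```python
from collections import deque

def make_state_update(window=4):
    """Returns u(s, a, r, o): keeps the last `window` observations as the agent state."""
    def u(state, action, reward, observation):
        new_state = deque(state, maxlen=window)   # copy the previous agent state
        new_state.append(observation)             # S_{t+1} = u(S_t, A_t, R_{t+1}, O_{t+1})
        return tuple(new_state)
    return u

u = make_state_update(window=4)
state = ()   # initial (empty) agent state
# each time step: state = u(state, action, reward, next_observation)
```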
## 2. Inside the agent: Policy
A policy defines the agent's behaviour.
It is a map from agent state to action.
A deterministic policy always gives the same action in the same state:
A = pi(S)
A stochastic policy gives a probability for each action given a state:
pi(A|S) = p(A|S)
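A small sketch of the difference, with a made-up action set and probabilities (not from the lecture):

```python
import random

ACTIONS = ["N", "E", "S", "W"]

def deterministic_policy(state):
    # A = pi(S): the same state always maps to the same action
    return ACTIONS[hash(state) % len(ACTIONS)]

def stochastic_policy(state):
    # pi(A|S): sample an action from a probability distribution over actions
    probs = [0.4, 0.3, 0.2, 0.1]   # assumed distribution for this state
    return random.choices(ACTIONS, weights=probs, k=1)[0]
```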
### value function
The value is the expected return:
v_pi(s) = E[G_t | S_t = s, pi]
        = E[R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + ... | S_t = s, pi]
The value function depends on the policy: every action is selected according to pi.
The discount factor gamma is in [0, 1].
It trades off the importance of immediate vs long-term rewards (gamma = 1 means all rewards are equally important).
Values can be used to evaluate the desirability of states and to select between actions.
Recursive form: G_t = R_{t+1} + gamma * G_{t+1}
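A quick sketch of computing a discounted return from a list of rewards using that recursive form (the reward values are made up):

```python
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([-1, -1, -1, 10], gamma=0.9))  # return from the first time step
```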
Bellman equation:
v_pi(s) = E[R_{t+1} + gamma * v_pi(S_{t+1}) | S_t = s, A_t ~ pi(s)]
where a ~ pi(s) means a is chosen by policy pi in state s (even if pi is deterministic).
The optimal value is the highest possible value:
v*(s) = max_a E[R_{t+1} + gamma * v*(S_{t+1}) | S_t = s, A_t = a]
This does not depend on a policy.
Agents often approximate value functions.
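A sketch of turning the Bellman optimality equation into an iterative backup (value iteration) on a tiny tabular MDP; the states, transitions and rewards are invented for illustration:

```python
# P[s][a] = list of (probability, next_state, reward), a made-up two-state MDP
P = {
    "s0": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 1.0)]},
    "s1": {"left": [(1.0, "s0", 0.0)], "right": [(1.0, "s1", 0.5)]},
}
gamma = 0.9
v = {s: 0.0 for s in P}

for _ in range(100):   # repeat the Bellman backup until values (approximately) converge
    v = {
        s: max(
            sum(p * (r + gamma * v[s_next]) for p, s_next, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }

print(v)   # approximately v*(s) for each state
```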
## 3. Inside the Agent: Model
A model predicts what the environment will do next (a model of the world).
P predicts the next state:
P(s, a, s') ~= p(S_{t+1} = s' | S_t = s, A_t = a)
inputs: state, action, next state
output: an approximation to the actual probability of seeing that next state, given the previous state and action
R approximates the immediate reward:
R(s, a) ~= E[R_{t+1} | S_t = s, A_t = a]
A model does not immediately give us a good policy: we still need to plan.
We can also consider stochastic (generative) models.
Example (grid world):
* state: the agent's location (x, y)
* rewards: -1 per time step
* actions: N, E, S, W
* policy pi(s): actions along the shortest path (sketched below)
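A small sketch of that example: a deterministic model of the grid (the size and goal location are assumptions), plus a shortest-path policy obtained by planning in the model with breadth-first search:

```python
from collections import deque

WIDTH, HEIGHT = 5, 5
GOAL = (4, 4)
ACTIONS = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def model(state, action):
    """Model of the world: returns (next_state, reward)."""
    x, y = state
    dx, dy = ACTIONS[action]
    nx = min(max(x + dx, 0), WIDTH - 1)
    ny = min(max(y + dy, 0), HEIGHT - 1)
    return (nx, ny), -1.0        # -1 reward per time step

def shortest_path_policy(start):
    """Plan in the model (BFS) and return the first action on a shortest path to GOAL."""
    queue = deque([(start, None)])
    seen = {start}
    while queue:
        state, first_action = queue.popleft()
        if state == GOAL:
            return first_action
        for a in ACTIONS:
            next_state, _ = model(state, a)
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, first_action or a))
    return None

print(shortest_path_policy((0, 0)))   # e.g. "N" or "E"
```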
## Agent Categories
value based
* value function
* no explicit policy (the policy is implicit in the values)
policy based
* policy
* no value function
actor critic
* policy
* value function
model free
* policy and/or value function
* no model
model based
* optionally policy and/or value function
* model
## Subproblems
* prediction: evaluate the future for a given policy
* control: optimise the future, i.e. find the best policy
pi*(s) = argmax_pi v_pi(s)
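A sketch of the two subproblems on a tabular MDP, reusing the same made-up transition dictionary P and discount gamma as the value-iteration sketch above (prediction evaluates a fixed policy; control greedifies with respect to the optimal values):

```python
def predict(P, pi, gamma=0.9, sweeps=100):
    """Prediction: evaluate v_pi for a fixed deterministic policy pi (a dict state -> action)."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        v = {
            s: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][pi[s]])
            for s in P
        }
    return v

def control(P, gamma=0.9, sweeps=100):
    """Control: compute (approximately) optimal values, then act greedily with respect to them."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        v = {
            s: max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]) for a in P[s])
            for s in P
        }
    pi_star = {
        s: max(P[s], key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))
        for s in P
    }
    return pi_star, v
```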
Two fundamental problems:
1. learning: the environment is initially unknown
2. planning:
* we have a model of the environment (given or learnt)
* the agent plans in this model without external interaction
* a.k.a. reasoning, pondering, thought, search, planning
All components are functions:
* policies: pi: S -> A
* value functions: v: S -> R
* models: m: S -> S and/or r: S -> R
* state update: u: S x O -> S
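The same signatures written out as Python type aliases, just to make the "everything is a function" view concrete (the concrete State/Action/Obs types, and including the action in the model and update signatures, are assumptions for illustration):

```python
from typing import Callable, Tuple

State = Tuple[float, ...]    # placeholder agent-state type
Action = int                 # placeholder action type
Obs = Tuple[float, ...]      # placeholder observation type

Policy = Callable[[State], Action]                          # pi: S -> A
ValueFunction = Callable[[State], float]                    # v: S -> R
TransitionModel = Callable[[State, Action], State]          # m: S x A -> S
RewardModel = Callable[[State, Action], float]              # r: S x A -> R
StateUpdate = Callable[[State, Action, float, Obs], State]  # u: S x A x R x O -> S
```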
Deep learning can learn such functions, but:
* data can be correlated, which can lead to overfitting
* the problem is non-stationary, e.g. the policy keeps changing, which changes the value function
Deep reinforcement learning = deep learning + reinforcement learning.
Example: Atari
* observation O_t is the pixels
* action A_t is the joystick input
* reward R_t is the score
* the rules of the game are unknown
* learn directly from game play: pick actions on the joystick, see pixels and score
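A sketch of that setup as an interaction loop, assuming the Gymnasium Atari environments are installed (`pip install "gymnasium[atari]"`); the environment id and exact API may differ between versions, and the agent here is random rather than learning:

```python
import gymnasium as gym

env = gym.make("ALE/Breakout-v5")          # assumed environment id
observation, info = env.reset()            # O_t: raw pixels (an image array)

total_score = 0.0
done = False
while not done:
    action = env.action_space.sample()     # A_t: a joystick action (random, no learning)
    observation, reward, terminated, truncated, info = env.step(action)
    total_score += reward                  # R_t: the score received this step
    done = terminated or truncated

print("episode score:", total_score)
env.close()
```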