Definition of intelligence: the ability to learn to make decisions to achieve goals
People and animals learn by interacting with their environment
Reinforcement learning is based on the reward hypothesis:
Any goal can be formalized as the outcome of maximizing a cumulative reward
There are distinct reasons to learn:
- Find solutions
- Adapt online, deal with unforeseen circumstances
At each step $t$, the agent:
- Receives observation $O_t$ (and reward $R_t$)
- Executes action $A_t$
The environment:
- Receives action $A_t$
- Emits observation $O_{t+1}$ (and reward $R_{t+1}$)
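The agent and environment steps above can be sketched as a simple loop. This is a minimal illustration, not a reference implementation: `CoinFlipEnvironment`, `EchoAgent`, and `run` are hypothetical names invented for this example, and the toy task (repeat the observed coin value to earn reward 1) is an assumption chosen only to keep the loop short.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical toy environment: reward 1 if the action matches the coin."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._coin = self._rng.randint(0, 1)

    def reset(self):
        # Emit the first observation O_0 (no reward yet).
        return self._coin

    def step(self, action):
        # Receive A_t, then emit O_{t+1} and R_{t+1}.
        reward = 1.0 if action == self._coin else 0.0
        self._coin = self._rng.randint(0, 1)
        return self._coin, reward


class EchoAgent:
    """Hypothetical agent: its action is simply the latest observation."""

    def act(self, observation, reward):
        # Receive O_t (and R_t), execute A_t.
        return observation


def run(env, agent, num_steps):
    """Run the interaction loop and return the list R_1, ..., R_num_steps."""
    observation, reward = env.reset(), None
    rewards = []
    for _ in range(num_steps):
        action = agent.act(observation, reward)
        observation, reward = env.step(action)
        rewards.append(reward)
    return rewards
```

Here the agent sees $O_t$ and $R_t$ before choosing $A_t$, and the environment answers with $O_{t+1}$ and $R_{t+1}$, matching the indexing used in the notes.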
- A reward $R_t$ is a scalar feedback signal
- Indicates how well the agent is doing at step $t$; it defines the goal
- The agent's job is to maximize cumulative reward $$ G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots $$
- We call this the return
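As a small worked example of the return, the sketch below sums the rewards that follow step $t$. It assumes a finite episode (the sum in the notes is open-ended), and the function name `undiscounted_return` and the sample reward list are made up for illustration.

```python
def undiscounted_return(rewards, t=0):
    """Compute G_t = R_{t+1} + R_{t+2} + ... for a finite episode.

    Assumption: rewards[k] holds R_{k+1}, the reward received after action A_k.
    """
    return sum(rewards[t:])


rewards = [0.0, 1.0, 0.0, 2.0]  # R_1, R_2, R_3, R_4 (illustrative values)
g0 = undiscounted_return(rewards, t=0)  # R_1 + R_2 + R_3 + R_4
g2 = undiscounted_return(rewards, t=2)  # R_3 + R_4
```

With these values, `g0` is 3.0 and `g2` is 2.0: the return from a later step only counts the rewards still to come.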
- The history is the full sequence of observations, actions, rewards
- This history is used to construct the agent state $S_t$
Suppose the agent sees the full environment state
- observation = environment state
- The agent state could just be this observation: $$ S_t = O_t = \text{environment state} $$
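To make the relation between history and agent state concrete, here is a minimal sketch for the fully observable case. The representation of the history as a list of `(action, reward, observation)` tuples and the function name `last_observation_state` are assumptions made for this example only.

```python
def last_observation_state(history):
    """Agent state for the fully observable case: S_t = O_t.

    Assumption: history is a list of (action, reward, observation) tuples,
    where the first entry is (None, None, O_0).
    """
    _, _, observation = history[-1]
    return observation


# Illustrative history: O_0, then (A_0, R_1, O_1), then (A_1, R_2, O_2).
history = [(None, None, "s0")]
history.append(("left", 0.0, "s1"))
history.append(("right", 1.0, "s2"))

state = last_observation_state(history)  # the latest observation, "s2"
```

The state function here discards everything except the latest observation, which is only reasonable because the observation is assumed to equal the environment state.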