@jknthn
Created June 6, 2018 19:53
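
Tabular Q-learning for a discrete Gym environment (e.g. FrozenLake): a Q table maps each state to per-action value estimates, and every step moves Q[S][A] toward the bootstrapped target reward + discount_factor * max(Q[S_prime].values()).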
import random

import utils  # helper module from the accompanying article

def Q_learning(env, episodes=100, step_size=0.01, discount_factor=0.99, exploration_rate=0.01):
    policy = utils.create_random_policy(env)  # Random policy, only needed by the util that builds the Q table
    Q = create_state_action_dictionary(env, policy)  # 1. Initialize Q dictionary, formatted: { S1: { A1: 0.0, A2: 0.0, ... }, ... }
    # 2. Loop through the number of episodes
    for episode in range(episodes):
        env.reset()  # Gym environment reset
        S = env.env.s  # 3. Get the initial state
        finished = False
        # 4. Loop to the end of the episode
        while not finished:
            # 5. Decide on the action: explore with probability exploration_rate, otherwise act greedily w.r.t. Q
            if random.random() < exploration_rate:
                A = random.choice(list(Q[S].keys()))
            else:
                A = greedy_policy(Q)[S]
            S_prime, reward, finished, _ = env.step(A)  # 6. Take the next step
            # 7. Update rule: move Q[S][A] toward reward + discount_factor * max_a Q[S_prime][a]
            Q[S][A] = Q[S][A] + step_size * (reward + discount_factor * max(Q[S_prime].values()) - Q[S][A])
            S = S_prime  # 8. Update state for the next step
    return greedy_policy(Q), Q
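
The helpers (utils.create_random_policy, create_state_action_dictionary, greedy_policy) are defined elsewhere in the article this gist accompanies. Below is a minimal sketch of what they could look like, assuming a discrete Gym environment and dictionaries shaped like the comments above; these implementations are assumptions, not the gist's own code:

def create_random_policy(env):
    # Uniform random policy: {state: {action: probability, ...}, ...}
    # (the gist imports this from its utils module)
    n_actions = env.action_space.n
    return {s: {a: 1.0 / n_actions for a in range(n_actions)}
            for s in range(env.observation_space.n)}

def create_state_action_dictionary(env, policy):
    # Q table with every (state, action) pair from the policy initialized to 0.0
    return {s: {a: 0.0 for a in actions} for s, actions in policy.items()}

def greedy_policy(Q):
    # Map each state to the action with the highest current Q value
    return {s: max(actions, key=actions.get) for s, actions in Q.items()}

A quick way to exercise the function, assuming the classic pre-0.26 Gym API the gist uses (env.step returns a 4-tuple, env.env.s exposes the current state) and the FrozenLake-v0 environment:

import gym

env = gym.make('FrozenLake-v0')
policy, Q = Q_learning(env, episodes=10000, step_size=0.1, discount_factor=0.99, exploration_rate=0.1)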