@SolClover
Created October 16, 2022 06:30
Define functions to use in training and evaluation
import numpy as np

# Acting policy (epsilon-greedy): selects an action for exploration/exploitation during training.
# Note: `env` is expected to exist in the enclosing scope (e.g., a Gym environment).
def epsilon_greedy(Qtable, state, epsilon):
    # Draw a random number and compare it to epsilon: if lower, explore; otherwise exploit
    randnum = np.random.uniform(0, 1)
    if randnum < epsilon:
        action = env.action_space.sample()    # explore: random action
    else:
        action = np.argmax(Qtable[state, :])  # exploit: best known action
    return action
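As a quick illustration (a sketch, not part of the original gist; the Q-table below is made up), setting epsilon to 0 forces pure exploitation, so the function simply returns the argmax action:

# Hypothetical one-state, four-action Q-table, for illustration only
Qtable_demo = np.array([[0.1, 0.5, 0.2, 0.0]])
epsilon_greedy(Qtable_demo, state=0, epsilon=0.0)  # returns 1, the index of the largest value (0.5)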
# Update the Q-table with the SARSA rule. The update is on-policy: next_action is
# chosen by the same epsilon-greedy policy that is used for acting.
# Note: `alpha` (learning rate) and `gamma` (discount factor) are read from the enclosing scope.
def update_Q(Qtable, state, action, reward, next_state, next_action):
    # Q(S_t, A_t) = Q(S_t, A_t) + alpha * [R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]
    Qtable[state][action] = Qtable[state][action] + alpha * (
        reward + gamma * Qtable[next_state][next_action] - Qtable[state][action]
    )
    return Qtable
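To make the update concrete, here is one hand-worked step (a sketch; the alpha, gamma, and Q-values are illustrative assumptions, since update_Q reads the hyperparameters from the enclosing scope):

alpha, gamma = 0.1, 0.9          # assumed hyperparameters, for this example only
Qtable_demo = np.zeros((2, 2))   # hypothetical 2-state, 2-action Q-table
Qtable_demo[1][0] = 1.0          # pretend Q(S_{t+1}, A_{t+1}) is already 1.0
update_Q(Qtable_demo, state=0, action=0, reward=1.0, next_state=1, next_action=0)
# New Q(0, 0) = 0 + 0.1 * (1.0 + 0.9 * 1.0 - 0) = 0.19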
# Greedy policy used during evaluation: always return the highest-valued action from the Q-table
def eval_greedy(Qtable, state):
    action = np.argmax(Qtable[state, :])
    return action
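The three functions fit together as sketched below. This is a minimal sketch, not the author's original training script: the environment, episode count, and hyperparameter values are assumptions for illustration, and the classic Gym reset/step API is assumed (newer Gymnasium versions return extra values).

import gym

env = gym.make("FrozenLake-v1")        # assumed environment, for illustration
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # assumed hyperparameters, read as globals by the functions above
Qtable = np.zeros((env.observation_space.n, env.action_space.n))

for episode in range(1000):            # assumed episode count
    state = env.reset()
    action = epsilon_greedy(Qtable, state, epsilon)
    done = False
    while not done:
        next_state, reward, done, info = env.step(action)
        # SARSA: the next action comes from the same epsilon-greedy policy
        next_action = epsilon_greedy(Qtable, next_state, epsilon)
        Qtable = update_Q(Qtable, state, action, reward, next_state, next_action)
        state, action = next_state, next_action

# Evaluation: act greedily with respect to the learned Q-table
state = env.reset()
done = False
while not done:
    state, reward, done, info = env.step(eval_greedy(Qtable, state))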