
Q-Table learning in OpenAI grid world.

OscarBalcells commented Sep 29, 2018

Hello Juliani, thanks for the nice post on Medium. I know this code is already quite old, but I still wanted to ask you a question. When you update the Q-value of the state you took the action in, Q[s,a] = Q[s,a] + lr*( r + y*np.max(Q[s1,:]) - Q[s,a] ), you are in theory multiplying gamma by the expected future reward after taking action a, yet in the code you multiply gamma by the largest value among the next state's Q-values, np.max(Q[s1,:]). Am I misunderstanding something about "plus the maximum discounted (γ) future reward expected according to our own table for the next state (s') we would end up in", or is there a mistake in the code? (I'm probably wrong, haha.)
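
For reference, here is a minimal sketch of the tabular Q-learning loop the comment is asking about (assuming gym's FrozenLake-v0 and the variable names Q, lr, y, s, s1 used in the gist; hyperparameter values are illustrative). In this update, np.max(Q[s1,:]) is the table's own estimate of the best return achievable from the next state, so discounting it by gamma is exactly the "maximum discounted future reward expected for the next state" from the post:

```python
import gym
import numpy as np

# Minimal tabular Q-learning sketch (assumed environment: FrozenLake-v0,
# with the old gym step/reset API used around 2018).
env = gym.make("FrozenLake-v0")
Q = np.zeros((env.observation_space.n, env.action_space.n))
lr = 0.8   # learning rate
y = 0.95   # discount factor (gamma)

for episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        # Pick the greedy action, with decaying noise for exploration.
        a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1.0 / (episode + 1)))
        s1, r, done, _ = env.step(a)
        # Bellman update: np.max(Q[s1, :]) estimates the best future return
        # from the next state s1, so y * np.max(Q[s1, :]) is the discounted
        # expected future reward used as the bootstrap target.
        Q[s, a] = Q[s, a] + lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
        s = s1
```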
