This comment has been minimized.
This comment has been minimized.
Hey! I was trying to figure out why my implementation of this wasn't working and I found out that this code only works if you add noise. Even epsilon-greedy approaches fail to get any reward. Removing |
This comment has been minimized.
This comment has been minimized.
It makes sense that the randomness is necessary. If there's no randomness, then |
This comment has been minimized.
This comment has been minimized.
Thanks for the code. Question: Frozen lake returns |
This comment has been minimized.
This comment has been minimized.
|
This comment has been minimized.
Hello Juliani, thanks for the nice post in Medium. I know this code is already very old, but I still wanted to ask you a question anyways. When you update the QValue of the state you took the action in Q[s,a] = Q[s,a] + lr*( r + y*np.max(Q[s1,:1]) - Q[s,a] ) you are in theory multiplying gamma by the expected future rewards after you've taken action a, however in the code you multiply gamma by the biggest value in the next state's q values np.max(Q[s1,:]). Am I understanding something wrong about "plus the maximum discounted (γ) future reward expected according to our own table for the next state (s’) we would end up in" or is there a mistake in the code? (I'm probably wrong haha)