Basic Q-Learning algorithm using Tensorflow

@lbollar

lbollar commented Feb 1, 2017

Why are we updating both the 0 index and the a[0] index here?

targetQ[0,a[0]] = r + y*maxQ1

Thanks for the articles and gists!

@lbollar

lbollar commented Feb 1, 2017

Very sorry, I thought targetQ was basically a vector; I now see that it is 2-dimensional. The devil is in the details... Still, great stuff, excited to read the entire series.
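
In case it helps anyone else: allQ comes back from the network with shape (1, n_actions), so targetQ is a one-row matrix and [0, a[0]] picks the chosen action's entry in that single row. A quick numpy-only illustration (n_actions = 4 is just an example):

import numpy as np

allQ = np.array([[0.1, 0.4, 0.2, 0.3]])   # network output for one state, shape (1, 4)
a = np.argmax(allQ, axis=1)               # array([1]), so a[0] is the chosen action

targetQ = allQ
targetQ[0, a[0]] = 0.9                    # overwrite only the chosen action's Q-value
print(targetQ)                            # [[0.1  0.9  0.2  0.3]]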

@arjay55

arjay55 commented Jul 3, 2017

Hi. I'm just starting the reinforcement learning tutorial. Is my understanding correct that the neural network is trained after every evaluation of the Q target values?
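
To check my reading, here is a stripped-down version of what I think the whole loop does. This is my own reconstruction, not the gist verbatim; the network layout, names, and hyperparameters are my guesses:

import gym
import numpy as np
import tensorflow as tf

env = gym.make('FrozenLake-v0')
n_states, n_actions = env.observation_space.n, env.action_space.n

# one-layer Q-network: one-hot state in, Q-values for every action out
inputs1 = tf.placeholder(tf.float32, shape=[1, n_states])
W = tf.Variable(tf.random_uniform([n_states, n_actions], 0, 0.01))
Qout = tf.matmul(inputs1, W)
predict = tf.argmax(Qout, 1)

# the target Q-values are fed in and the squared error is minimized
nextQ = tf.placeholder(tf.float32, shape=[1, n_actions])
loss = tf.reduce_sum(tf.square(nextQ - Qout))
updateModel = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

y = 0.99   # discount factor
e = 0.1    # exploration rate

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2000):
        s = env.reset()
        for j in range(99):
            # Q-values for the current state, epsilon-greedy action choice
            a, allQ = sess.run([predict, Qout],
                               feed_dict={inputs1: np.identity(n_states)[s:s + 1]})
            if np.random.rand(1) < e:
                a[0] = env.action_space.sample()
            s1, r, d, _ = env.step(a[0])
            # target: keep allQ, replace only the chosen action's entry
            Q1 = sess.run(Qout, feed_dict={inputs1: np.identity(n_states)[s1:s1 + 1]})
            targetQ = allQ
            targetQ[0, a[0]] = r + y * np.max(Q1)
            # one gradient step for every environment step, no replay buffer
            sess.run(updateModel,
                     feed_dict={inputs1: np.identity(n_states)[s:s + 1], nextQ: targetQ})
            s = s1
            if d:
                break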

@eviltnan

eviltnan commented Nov 14, 2017

Looks like e = 1./((i/50) + 10) was intended to be e = 1./((i/50.) + 10); i is an integer, so i/50 is integer division. Or am I missing something?
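
For example:

i = 49
print(1. / ((i / 50) + 10))    # Python 2: 0.1, because 49/50 == 0 with integer division
print(1. / ((i / 50.) + 10))   # ~0.0911 on both versions, because 49/50. == 0.98

So under Python 2, e only drops once every 50 episodes instead of decaying smoothly.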

@victor-iyiola

victor-iyiola commented Nov 19, 2017

@eviltnan You're right if you're on Python 2. The code will still run fine, but you can add from __future__ import absolute_import, division, print_function at the top of the file to get true division on Python 2 as well.
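
i.e. at the very top of the script (future imports have to come before any other statement):

from __future__ import absolute_import, division, print_function

print(49 / 50)   # 0.98 on Python 2 too, now that true division is enabled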

@Garrus007

Garrus007 commented Nov 20, 2017

Why don't you use a replay memory instead of training the network right after performing an action?

targetQ = allQ
targetQ[0,a[0]] = r + y*maxQ1

The code above makes a vector of the previous Q-values, except that the Q-value for action a is new, calculated by the Bellman equation, doesn't it?

I tried to implement a Q-network with replay memory, but it doesn't work: it plays worse than with randomly initialized weights. Something like this:

D = []  # replay memory

for i in range(1000):
    s = env.reset()

    for j in range(99):
        a = argmax(predict(s))    # predict returns Q(s, *) for all actions
        s1, r, done, _ = env.step(a)
        D.append((s, a, r, s1, done))
        s = s1
        if done:
            break

    # now do replay
    batch = random.sample(D, batch_size)
    for transition in batch:
        s, a, r, s1, done = transition
        expected_q = r + gamma * max(predict(s1))
        sess.run(train_step, {state: s, action: a, expected: expected_q})

So, I need to train my network so that Q(s, a) -> r + gamma * max(Q(s1, *)). It's easy to calculate the expected value. But for Q(s, a) I should get my prediction for s, which is a vector over all actions, and then pick the entry for the action: predict[a].

Here it is:

expected = tf.placeholder(tf.float32, shape=())
action = tf.placeholder(tf.int32, shape=())
pr_reward = prediction[0][action]  # prediction is the network output, shape [1, n_actions]
error = tf.square(expected - pr_reward)
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(error)

I think my problem is that the error uses only one predicted and one expected value. Every example I've seen used a batch. I'm new to NNs and TensorFlow. Should the loss function operate on vectors (batches)?
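
Maybe what I'm missing is something like this batched version of the loss? Just a guess: prediction would have shape [batch, n_actions] here, and n_actions would come from the environment.

expected = tf.placeholder(tf.float32, shape=[None])   # r + gamma * max(Q(s1, *)) for each sampled transition
action = tf.placeholder(tf.int32, shape=[None])       # action taken in each sampled transition

# prediction has shape [batch, n_actions]; select Q(s, a) row by row with a one-hot mask
action_mask = tf.one_hot(action, n_actions)
pr_reward = tf.reduce_sum(prediction * action_mask, axis=1)

error = tf.reduce_mean(tf.square(expected - pr_reward))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(error)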
