Basic Q-Learning algorithm using Tensorflow

lbollar commented Feb 1, 2017

Why are we updating both index 0 and index a[0] here?

targetQ[0,a[0]] = r + y*maxQ1

Thanks for the articles and gists!


lbollar commented Feb 1, 2017

Very sorry, I thought targetQ was basically a vector; I now see that it is two-dimensional. The devil is in the details... Still, great stuff, excited to read the entire series.
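For anyone else tripped up by the same thing, here is a minimal NumPy sketch of the indexing (the numbers are made up; in the gist, allQ comes from sess.run(Qout, ...)):

```python
import numpy as np

# allQ has shape (1, n_actions): one row, because the batch size is 1.
# That is also why the chosen action comes back as a length-1 array a.
allQ = np.array([[0.1, 0.5, 0.2, 0.3]])
a = np.array([1])                  # chosen action index, wrapped in an array
r, y, maxQ1 = 1.0, 0.99, 0.7       # reward, discount, max Q of next state

targetQ = allQ.copy()
targetQ[0, a[0]] = r + y * maxQ1   # row 0 (the only row), column a[0]
```

So the `0` selects the single row of the batch and `a[0]` selects the column for the action actually taken; all other action values are left as the network's own predictions.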


arjay55 commented Jul 3, 2017

Hi. I'm just starting the reinforcement learning tutorial. Is my understanding correct that the neural network is trained on every evaluation of the Q target values?


eviltnan commented Nov 14, 2017

Looks like e = 1./((i/50) + 10) was intended to be e = 1./((i/50.) + 10): i is an integer, so i/50 would be integer division. Or am I missing something?


victor-iyiola commented Nov 19, 2017

@eviltnan You're right if you're on Python 2, where / between integers is floor division. The code will still run fine, but you can add from __future__ import division (often written together with absolute_import and print_function) to get Python 3 semantics.
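A quick sketch of the difference (i = 49 is just an arbitrary episode index for illustration):

```python
from __future__ import division  # on Python 2, makes / behave like Python 3

i = 49
e_floor = 1. / ((i // 50) + 10)  # floor division: 49 // 50 == 0, so e == 0.1
e_true  = 1. / ((i / 50) + 10)   # true division: 49 / 50 == 0.98
```

With integer division, e stays pinned at 0.1 for the first 50 episodes instead of decaying smoothly, so the exploration schedule is coarser than intended but the code still runs.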


Garrus007 commented Nov 20, 2017

Why don't you use replay memory, and instead train the network right after performing an action?

targetQ = allQ
targetQ[0,a[0]] = r + y*maxQ1

Doesn't the following code make a vector of the previous Q values, where only the Q for action a is new, calculated by the Bellman equation?

I tried to implement a Q-network with replay memory, but it doesn't work; it plays worse than randomly initialized weights. Something like this:

D = []  # replay memory

for i in range(1000):
    s = env.reset()
    for j in range(99):
        a = argmax(predict(s))    # predict returns Q(s, *) for all actions
        s1, r, done, _ = env.step(a)
        D.append((s, a, r, s1, done))
        s = s1
        if done:
            break

    # now do replay
    batch = random.sample(D, batch_size)
    for transition in batch:
        s, a, r, s1, done = transition
        expected_q = r + gamma * max(predict(s1))
        sess.run(train_step, {state: s, action: a, expected: expected_q})

So I need to train my network so that Q(s, a) -> r + gamma * max(Q(s1, *)). The expected value is easy to calculate, but for Q(s, a) I have to get my prediction for s, which is a vector over all actions, and then pick the action: predict[a].

Here it is:

expected = tf.placeholder(tf.float32, shape=())
action = tf.placeholder(tf.int32, shape=())
pr_reward = prediction[0][action]  # prediction is the network output, shape (1, n_actions)
error = tf.square(expected - pr_reward)
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(error)

I think my problem is that the error uses only one predicted value and one expected value. Every example I've seen used a batch. I am new to NNs and TensorFlow. Should the loss function operate on vectors (batches)?
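It usually should, yes. Here is a NumPy sketch of the batched target-and-loss computation (random stand-in values; in TF 1.x you would do the same row-wise gather with tf.one_hot or tf.gather_nd and feed whole arrays through the placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_actions, gamma = 4, 2, 0.99

# stand-ins for the network outputs on a sampled batch of transitions
q_s  = rng.random((batch_size, n_actions))    # Q(s, *) for each row
q_s1 = rng.random((batch_size, n_actions))    # Q(s1, *) for each row
a    = rng.integers(0, n_actions, batch_size) # actions taken
r    = rng.random(batch_size)                 # rewards
done = np.array([False, False, True, False])  # terminal flags

# Bellman targets for the whole batch; terminal states get no bootstrap term
target = r + gamma * q_s1.max(axis=1) * (~done)

# pick Q(s, a) per row -- the batched analogue of prediction[0][action]
q_sa = q_s[np.arange(batch_size), a]

loss = np.mean((target - q_sa) ** 2)  # one scalar loss over the batch
```

Averaging the squared error over a batch gives a much less noisy gradient than a single transition, which is one reason single-sample updates on top of replay can perform worse than random play. Masking the bootstrap term on terminal transitions also matters; without it the targets for end-of-episode states are wrong.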


sushantMoon commented Jun 25, 2019

@awjuliani, about the line

_,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})

shouldn't this be

_,W = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})

i.e. W instead of W1, since we want to fetch the updated values of W?

Is there anything I am missing?
