### lbollar commented Feb 1, 2017

 Why are we updating both index 0 and index `a` here? `targetQ[0,a] = r + y*maxQ1` Thanks for the articles and gists!

### lbollar commented Feb 1, 2017

 Very sorry, I thought `targetQ` was basically a vector; I now see that it is 2-dimensional. The devil is in the details... Still, great stuff, and I'm excited to read the entire series.
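A small NumPy illustration of the indexing: the network output for a single state has shape `(1, 4)`, so `targetQ[0, a]` means "row 0 (the only row), column `a`", and only that one entry changes. The Q-values, reward, and `maxQ1` below are made-up numbers. `.copy()` is used so that `allQ` stays unchanged; a plain `targetQ = allQ` would make both names refer to the same array.

```python
import numpy as np

# Made-up Q-values for one state and 4 actions, shaped (1, 4) like the network output
allQ = np.array([[0.1, 0.5, 0.2, 0.0]])
a = 1        # action taken
r = 0.0      # reward received
y = 0.99     # discount factor
maxQ1 = 0.4  # max over a' of Q(s1, a') for the next state

targetQ = allQ.copy()
targetQ[0, a] = r + y * maxQ1  # row 0, column a; the other three entries keep their predictions
print(targetQ)
```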

### arjay55 commented Jul 3, 2017

 Hi. I'm just starting the reinforcement learning tutorial. Is my understanding correct that the neural network is trained every time the Q target values are evaluated?
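As a rough illustration of training on every step, here is a NumPy sketch of a single update for one transition: the target is evaluated from the current weights, then one gradient-descent step is taken on the summed squared error between target and prediction. It assumes a linear Q-network on one-hot states (16 states and 4 actions, as in FrozenLake) and plain gradient descent; `online_q_step` and its default `gamma` and `lr` are hypothetical and not taken from the tutorial.

```python
import numpy as np

def online_q_step(W, s, a, r, s1, gamma=0.99, lr=0.1):
    """Hypothetical helper: one online update of a linear Q-network on one-hot states.

    Evaluates the Bellman target for a single transition (s, a, r, s1), then takes
    one gradient step on loss = sum((targetQ - Q(s, .))**2).
    """
    n_states = W.shape[0]
    x = np.eye(n_states)[s:s + 1]    # (1, n_states) one-hot row for state s
    x1 = np.eye(n_states)[s1:s1 + 1]  # (1, n_states) one-hot row for next state s1
    allQ = x @ W                      # (1, n_actions) current Q(s, .)
    maxQ1 = np.max(x1 @ W)            # max over actions of Q(s1, .)
    targetQ = allQ.copy()
    targetQ[0, a] = r + gamma * maxQ1  # only the taken action gets a new target
    grad_W = -2.0 * x.T @ (targetQ - allQ)  # gradient of the summed squared error w.r.t. W
    return W - lr * grad_W
```

Because the input is one-hot, only row `s` of `W` changes, and within that row only column `a`, since every other entry of `targetQ` equals the current prediction.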

### eviltnan commented Nov 14, 2017

 Looks like `e = 1./((i/50) + 10)` was intended to be `e = 1./((i/50.) + 10)`: `i` is an integer, so `i/50` would be an integer division. Or am I missing something?
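To see the difference, the two schedules can be written side by side in Python 3, where `/` is always true division and `//` reproduces the Python 2 behaviour of `/` on two integers. The function names are illustrative only.

```python
def epsilon_py2_style(i):
    # i // 50 mimics Python 2's integer division: epsilon only drops every 50 episodes
    return 1.0 / ((i // 50) + 10)

def epsilon_float(i):
    # true division: epsilon decreases a little after every episode
    return 1.0 / ((i / 50.0) + 10)

for i in (0, 25, 49, 50, 75):
    print(i, epsilon_py2_style(i), epsilon_float(i))
```

Both start at 0.1 and both decay towards zero; the integer version just does it in steps.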

### victor-iyiola commented Nov 19, 2017

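 @eviltnan You're right if you're on Python 2. The code will still run; `e` just decreases in steps every 50 episodes instead of after every episode. To get true division on Python 2 as well, you can add `from __future__ import absolute_import, division, print_function` at the top of the file (the `division` import is the one that matters here).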

### Garrus007 commented Nov 20, 2017

 Why don't you use replay memory, instead of training the network right after performing each action?

```
targetQ = allQ
targetQ[0,a] = r + y*maxQ1
```

Does this code build a vector that holds the previous Q values, except that the Q value for action `a` is replaced by the new one computed with the Bellman equation?

I tried to implement a Q-network with replay memory, but it doesn't work: it plays worse than a network with randomly initialized weights. Something like this:

```
import random
import numpy as np

D = []  # replay memory
for i in range(1000):
    s = env.reset()
    for j in range(99):
        a = np.argmax(predict(s))  # predict returns Q(s, *) for all actions
        s1, r, done, _ = env.step(a)
        D.append((s, a, r, s1, done))
        s = s1
        if done:
            break
    # now do replay
    batch = random.sample(D, min(32, len(D)))  # 32 is an arbitrary batch size
    for s, a, r, s1, done in batch:
        expected_q = r + gamma * np.max(predict(s1))
        sess.run(train_step, {state: s, action: a, expected: expected_q})
```

So I need to train my network towards `Q(s, a) -> r + gamma * max(Q(s1, *))`. The expected value is easy to calculate. But for `Q(s, a)` I have to take my prediction for `s`, which is a vector over all actions, and pick the entry for the action taken: `predict[a]`. Here it is:

```
expected = tf.placeholder(tf.float32, shape=())
action = tf.placeholder(tf.int32, shape=())
pr_reward = prediction[action]  # prediction is the network output
error = tf.square(expected - pr_reward)
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(error)
```

I think my problem is that the error uses only one predicted value and one expected value, while every example I have seen uses batches. I am new to neural networks and TensorFlow. Should the loss function operate on vectors (batches)?
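On the closing question, one common way to train on a batch is to compute Q(s, ·) for every sampled transition, select the entry for each taken action, and average the squared errors against the Bellman targets. The sketch below does this in plain NumPy rather than TensorFlow, for a linear Q-network on one-hot states like the tutorial's 16-state setup. The function name `batched_q_update`, the learning rate, and the `dones` masking (no bootstrapping from terminal states) are illustrative assumptions, not code from the gist.

```python
import numpy as np

def batched_q_update(W, states, actions, rewards, next_states, dones, gamma=0.99, lr=0.1):
    """Hypothetical helper: one gradient step on a mean-squared-error loss over a batch.

    Q(s, .) = onehot(s) @ W. Targets are treated as constants (semi-gradient).
    """
    n_states = W.shape[0]
    X = np.eye(n_states)[states]        # (B, n_states) one-hot inputs
    X1 = np.eye(n_states)[next_states]  # (B, n_states) one-hot next states
    q = X @ W                           # (B, n_actions) Q(s, *)
    q_next = X1 @ W                     # (B, n_actions) Q(s1, *)
    # Bellman targets; terminal transitions (done = 1) do not bootstrap from s1
    targets = rewards + gamma * q_next.max(axis=1) * (1.0 - dones)
    idx = np.arange(len(actions))
    q_sa = q[idx, actions]              # (B,) Q(s, a) for the actions actually taken
    td_error = q_sa - targets
    loss = np.mean(td_error ** 2)
    # Only the selected entries contribute to the gradient of the mean squared error
    grad_q = np.zeros_like(q)
    grad_q[idx, actions] = 2.0 * td_error / len(actions)
    grad_W = X.T @ grad_q               # (n_states, n_actions)
    return W - lr * grad_W, loss
```

In TensorFlow, the same selection step is often done by multiplying the network output by `tf.one_hot(actions, n_actions)` and summing over axis 1, which turns a batch of outputs into a vector of Q(s, a) values.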

### sushantMoon commented Jun 25, 2019

 @awjuliani, about the line `_,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})`: shouldn't this be `_,W = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})`, i.e. `W` instead of `W1`, since we want to update the values of `W`? Is there something I am missing?