Skip to content

Instantly share code, notes, and snippets.

@tuxdna
Last active April 13, 2024 04:02
Show Gist options
  • Star 11 You must be signed in to star a gist
  • Fork 5 You must be signed in to fork a gist
  • Save tuxdna/7e29dd37300e308a80fc1559c343c545 to your computer and use it in GitHub Desktop.
Save tuxdna/7e29dd37300e308a80fc1559c343c545 to your computer and use it in GitHub Desktop.
Policy Iteration in Python
#!/usr/bin/env python
# coding=utf-8
import numpy as np
"""
1: Procedure Policy_Iteration(S,A,P,R)
2: Inputs
3: S is the set of all states
4: A is the set of all actions
5: P is state transition function specifying P(s'|s,a)
6: R is a reward function R(s,a,s')
7: Output
8: optimal policy π
9: Local
10: action array π[S]
11: Boolean variable noChange
12: real array V[S]
13: set π arbitrarily
14: repeat
15: noChange ←true
16: Solve V[s] = ∑s'∈S P(s'|s,π[s])(R(s,a,s')+γV[s'])
17: for each s∈S do
18: Let QBest=V[s]
19: for each a ∈A do
20: Let Qsa=∑s'∈S P(s'|s,a)(R(s,a,s')+γV[s'])
21: if (Qsa > QBest) then
22: π[s]←a
23: QBest ←Qsa
24: noChange ←false
25: until noChange
26: return π
"""
states = [0,1,2,3,4]
actions = [0,1]
N_STATES = len(states)
N_ACTIONS = len(actions)
P = np.zeros((N_STATES, N_ACTIONS, N_STATES)) # transition probability
R = np.zeros((N_STATES, N_ACTIONS, N_STATES)) # rewards
P[0,0,1] = 1.0
P[1,1,2] = 1.0
P[2,0,3] = 1.0
P[3,1,4] = 1.0
P[4,0,4] = 1.0
R[0,0,1] = 1
R[1,1,2] = 10
R[2,0,3] = 100
R[3,1,4] = 1000
R[4,0,4] = 1.0
gamma = 0.75
# initialize policy and value arbitrarily
policy = [0 for s in range(N_STATES)]
V = np.zeros(N_STATES)
print "Initial policy", policy
# print V
# print P
# print R
is_value_changed = True
iterations = 0
while is_value_changed:
is_value_changed = False
iterations += 1
# run value iteration for each state
for s in range(N_STATES):
V[s] = sum([P[s,policy[s],s1] * (R[s,policy[s],s1] + gamma*V[s1]) for s1 in range(N_STATES)])
# print "Run for state", s
for s in range(N_STATES):
q_best = V[s]
# print "State", s, "q_best", q_best
for a in range(N_ACTIONS):
q_sa = sum([P[s, a, s1] * (R[s, a, s1] + gamma * V[s1]) for s1 in range(N_STATES)])
if q_sa > q_best:
print "State", s, ": q_sa", q_sa, "q_best", q_best
policy[s] = a
q_best = q_sa
is_value_changed = True
print "Iterations:", iterations
# print "Policy now", policy
print "Final policy"
print policy
print V
@tuxdna
Copy link
Author

tuxdna commented Jun 5, 2016

Initial policy [0, 0, 0, 0, 0]
State 1 : q_sa 85.0 q_best 0.0
State 3 : q_sa 1000.75 q_best 0.0
State 4 : q_sa 1.75 q_best 1.0
Iterations: 1
State 0 : q_sa 64.75 q_best 1.0
State 2 : q_sa 850.5625 q_best 100.0
State 3 : q_sa 1001.3125 q_best 1000.75
State 4 : q_sa 2.3125 q_best 1.75
Iterations: 2
State 1 : q_sa 647.921875 q_best 85.0
State 2 : q_sa 850.984375 q_best 850.5625
State 3 : q_sa 1001.734375 q_best 1001.3125
State 4 : q_sa 2.734375 q_best 2.3125
Iterations: 3
State 0 : q_sa 486.94140625 q_best 64.75
State 1 : q_sa 648.23828125 q_best 647.921875
State 2 : q_sa 851.30078125 q_best 850.984375
State 3 : q_sa 1002.05078125 q_best 1001.734375
State 4 : q_sa 3.05078125 q_best 2.734375
Iterations: 4
State 0 : q_sa 487.178710938 q_best 486.94140625
State 1 : q_sa 648.475585938 q_best 648.23828125
State 2 : q_sa 851.538085938 q_best 851.30078125
State 3 : q_sa 1002.28808594 q_best 1002.05078125
State 4 : q_sa 3.2880859375 q_best 3.05078125
Iterations: 5
State 0 : q_sa 487.356689453 q_best 487.178710938
State 1 : q_sa 648.653564453 q_best 648.475585938
State 2 : q_sa 851.716064453 q_best 851.538085938
State 3 : q_sa 1002.46606445 q_best 1002.28808594
State 4 : q_sa 3.46606445312 q_best 3.2880859375
Iterations: 6
State 0 : q_sa 487.49017334 q_best 487.356689453
State 1 : q_sa 648.78704834 q_best 648.653564453
State 2 : q_sa 851.84954834 q_best 851.716064453
State 3 : q_sa 1002.59954834 q_best 1002.46606445
State 4 : q_sa 3.59954833984 q_best 3.46606445312
Iterations: 7
State 0 : q_sa 487.590286255 q_best 487.49017334
State 1 : q_sa 648.887161255 q_best 648.78704834
State 2 : q_sa 851.949661255 q_best 851.84954834
State 3 : q_sa 1002.69966125 q_best 1002.59954834
State 4 : q_sa 3.69966125488 q_best 3.59954833984
Iterations: 8
State 0 : q_sa 487.665370941 q_best 487.590286255
State 1 : q_sa 648.962245941 q_best 648.887161255
State 2 : q_sa 852.024745941 q_best 851.949661255
State 3 : q_sa 1002.77474594 q_best 1002.69966125
State 4 : q_sa 3.77474594116 q_best 3.69966125488
Iterations: 9
State 0 : q_sa 487.721684456 q_best 487.665370941
State 1 : q_sa 649.018559456 q_best 648.962245941
State 2 : q_sa 852.081059456 q_best 852.024745941
State 3 : q_sa 1002.83105946 q_best 1002.77474594
State 4 : q_sa 3.83105945587 q_best 3.77474594116
Iterations: 10
State 0 : q_sa 487.763919592 q_best 487.721684456
State 1 : q_sa 649.060794592 q_best 649.018559456
State 2 : q_sa 852.123294592 q_best 852.081059456
State 3 : q_sa 1002.87329459 q_best 1002.83105946
State 4 : q_sa 3.8732945919 q_best 3.83105945587
Iterations: 11
State 0 : q_sa 487.795595944 q_best 487.763919592
State 1 : q_sa 649.092470944 q_best 649.060794592
State 2 : q_sa 852.154970944 q_best 852.123294592
State 3 : q_sa 1002.90497094 q_best 1002.87329459
State 4 : q_sa 3.90497094393 q_best 3.8732945919
Iterations: 12
State 0 : q_sa 487.819353208 q_best 487.795595944
State 1 : q_sa 649.116228208 q_best 649.092470944
State 2 : q_sa 852.178728208 q_best 852.154970944
State 3 : q_sa 1002.92872821 q_best 1002.90497094
State 4 : q_sa 3.92872820795 q_best 3.90497094393
Iterations: 13
State 0 : q_sa 487.837171156 q_best 487.819353208
State 1 : q_sa 649.134046156 q_best 649.116228208
State 2 : q_sa 852.196546156 q_best 852.178728208
State 3 : q_sa 1002.94654616 q_best 1002.92872821
State 4 : q_sa 3.94654615596 q_best 3.92872820795
Iterations: 14
State 0 : q_sa 487.850534617 q_best 487.837171156
State 1 : q_sa 649.147409617 q_best 649.134046156
State 2 : q_sa 852.209909617 q_best 852.196546156
State 3 : q_sa 1002.95990962 q_best 1002.94654616
State 4 : q_sa 3.95990961697 q_best 3.94654615596
Iterations: 15
State 0 : q_sa 487.860557213 q_best 487.850534617
State 1 : q_sa 649.157432213 q_best 649.147409617
State 2 : q_sa 852.219932213 q_best 852.209909617
State 3 : q_sa 1002.96993221 q_best 1002.95990962
State 4 : q_sa 3.96993221273 q_best 3.95990961697
Iterations: 16
State 0 : q_sa 487.86807416 q_best 487.860557213
State 1 : q_sa 649.16494916 q_best 649.157432213
State 2 : q_sa 852.22744916 q_best 852.219932213
State 3 : q_sa 1002.97744916 q_best 1002.96993221
State 4 : q_sa 3.97744915955 q_best 3.96993221273
Iterations: 17
State 0 : q_sa 487.87371187 q_best 487.86807416
State 1 : q_sa 649.17058687 q_best 649.16494916
State 2 : q_sa 852.23308687 q_best 852.22744916
State 3 : q_sa 1002.98308687 q_best 1002.97744916
State 4 : q_sa 3.98308686966 q_best 3.97744915955
Iterations: 18
State 0 : q_sa 487.877940152 q_best 487.87371187
State 1 : q_sa 649.174815152 q_best 649.17058687
State 2 : q_sa 852.237315152 q_best 852.23308687
State 3 : q_sa 1002.98731515 q_best 1002.98308687
State 4 : q_sa 3.98731515224 q_best 3.98308686966
Iterations: 19
State 0 : q_sa 487.881111364 q_best 487.877940152
State 1 : q_sa 649.177986364 q_best 649.174815152
State 2 : q_sa 852.240486364 q_best 852.237315152
State 3 : q_sa 1002.99048636 q_best 1002.98731515
State 4 : q_sa 3.99048636418 q_best 3.98731515224
Iterations: 20
State 0 : q_sa 487.883489773 q_best 487.881111364
State 1 : q_sa 649.180364773 q_best 649.177986364
State 2 : q_sa 852.242864773 q_best 852.240486364
State 3 : q_sa 1002.99286477 q_best 1002.99048636
State 4 : q_sa 3.99286477314 q_best 3.99048636418
Iterations: 21
State 0 : q_sa 487.88527358 q_best 487.883489773
State 1 : q_sa 649.18214858 q_best 649.180364773
State 2 : q_sa 852.24464858 q_best 852.242864773
State 3 : q_sa 1002.99464858 q_best 1002.99286477
State 4 : q_sa 3.99464857985 q_best 3.99286477314
Iterations: 22
State 0 : q_sa 487.886611435 q_best 487.88527358
State 1 : q_sa 649.183486435 q_best 649.18214858
State 2 : q_sa 852.245986435 q_best 852.24464858
State 3 : q_sa 1002.99598643 q_best 1002.99464858
State 4 : q_sa 3.99598643489 q_best 3.99464857985
Iterations: 23
State 0 : q_sa 487.887614826 q_best 487.886611435
State 1 : q_sa 649.184489826 q_best 649.183486435
State 2 : q_sa 852.246989826 q_best 852.245986435
State 3 : q_sa 1002.99698983 q_best 1002.99598643
State 4 : q_sa 3.99698982617 q_best 3.99598643489
Iterations: 24
State 0 : q_sa 487.88836737 q_best 487.887614826
State 1 : q_sa 649.18524237 q_best 649.184489826
State 2 : q_sa 852.24774237 q_best 852.246989826
State 3 : q_sa 1002.99774237 q_best 1002.99698983
State 4 : q_sa 3.99774236963 q_best 3.99698982617
Iterations: 25
State 0 : q_sa 487.888931777 q_best 487.88836737
State 1 : q_sa 649.185806777 q_best 649.18524237
State 2 : q_sa 852.248306777 q_best 852.24774237
State 3 : q_sa 1002.99830678 q_best 1002.99774237
State 4 : q_sa 3.99830677722 q_best 3.99774236963
Iterations: 26
State 0 : q_sa 487.889355083 q_best 487.888931777
State 1 : q_sa 649.186230083 q_best 649.185806777
State 2 : q_sa 852.248730083 q_best 852.248306777
State 3 : q_sa 1002.99873008 q_best 1002.99830678
State 4 : q_sa 3.99873008291 q_best 3.99830677722
Iterations: 27
State 0 : q_sa 487.889672562 q_best 487.889355083
State 1 : q_sa 649.186547562 q_best 649.186230083
State 2 : q_sa 852.249047562 q_best 852.248730083
State 3 : q_sa 1002.99904756 q_best 1002.99873008
State 4 : q_sa 3.99904756219 q_best 3.99873008291
Iterations: 28
State 0 : q_sa 487.889910672 q_best 487.889672562
State 1 : q_sa 649.186785672 q_best 649.186547562
State 2 : q_sa 852.249285672 q_best 852.249047562
State 3 : q_sa 1002.99928567 q_best 1002.99904756
State 4 : q_sa 3.99928567164 q_best 3.99904756219
Iterations: 29
State 0 : q_sa 487.890089254 q_best 487.889910672
State 1 : q_sa 649.186964254 q_best 649.186785672
State 2 : q_sa 852.249464254 q_best 852.249285672
State 3 : q_sa 1002.99946425 q_best 1002.99928567
State 4 : q_sa 3.99946425373 q_best 3.99928567164
Iterations: 30
State 0 : q_sa 487.89022319 q_best 487.890089254
State 1 : q_sa 649.18709819 q_best 649.186964254
State 2 : q_sa 852.24959819 q_best 852.249464254
State 3 : q_sa 1002.99959819 q_best 1002.99946425
State 4 : q_sa 3.9995981903 q_best 3.99946425373
Iterations: 31
State 0 : q_sa 487.890323643 q_best 487.89022319
State 1 : q_sa 649.187198643 q_best 649.18709819
State 2 : q_sa 852.249698643 q_best 852.24959819
State 3 : q_sa 1002.99969864 q_best 1002.99959819
State 4 : q_sa 3.99969864272 q_best 3.9995981903
Iterations: 32
State 0 : q_sa 487.890398982 q_best 487.890323643
State 1 : q_sa 649.187273982 q_best 649.187198643
State 2 : q_sa 852.249773982 q_best 852.249698643
State 3 : q_sa 1002.99977398 q_best 1002.99969864
State 4 : q_sa 3.99977398204 q_best 3.99969864272
Iterations: 33
State 0 : q_sa 487.890455487 q_best 487.890398982
State 1 : q_sa 649.187330487 q_best 649.187273982
State 2 : q_sa 852.249830487 q_best 852.249773982
State 3 : q_sa 1002.99983049 q_best 1002.99977398
State 4 : q_sa 3.99983048653 q_best 3.99977398204
Iterations: 34
State 0 : q_sa 487.890497865 q_best 487.890455487
State 1 : q_sa 649.187372865 q_best 649.187330487
State 2 : q_sa 852.249872865 q_best 852.249830487
State 3 : q_sa 1002.99987286 q_best 1002.99983049
State 4 : q_sa 3.9998728649 q_best 3.99983048653
Iterations: 35
State 0 : q_sa 487.890529649 q_best 487.890497865
State 1 : q_sa 649.187404649 q_best 649.187372865
State 2 : q_sa 852.249904649 q_best 852.249872865
State 3 : q_sa 1002.99990465 q_best 1002.99987286
State 4 : q_sa 3.99990464867 q_best 3.9998728649
Iterations: 36
State 0 : q_sa 487.890553487 q_best 487.890529649
State 1 : q_sa 649.187428487 q_best 649.187404649
State 2 : q_sa 852.249928487 q_best 852.249904649
State 3 : q_sa 1002.99992849 q_best 1002.99990465
State 4 : q_sa 3.99992848651 q_best 3.99990464867
Iterations: 37
State 0 : q_sa 487.890571365 q_best 487.890553487
State 1 : q_sa 649.187446365 q_best 649.187428487
State 2 : q_sa 852.249946365 q_best 852.249928487
State 3 : q_sa 1002.99994636 q_best 1002.99992849
State 4 : q_sa 3.99994636488 q_best 3.99992848651
Iterations: 38
State 0 : q_sa 487.890584774 q_best 487.890571365
State 1 : q_sa 649.187459774 q_best 649.187446365
State 2 : q_sa 852.249959774 q_best 852.249946365
State 3 : q_sa 1002.99995977 q_best 1002.99994636
State 4 : q_sa 3.99995977366 q_best 3.99994636488
Iterations: 39
State 0 : q_sa 487.89059483 q_best 487.890584774
State 1 : q_sa 649.18746983 q_best 649.187459774
State 2 : q_sa 852.24996983 q_best 852.249959774
State 3 : q_sa 1002.99996983 q_best 1002.99995977
State 4 : q_sa 3.99996983024 q_best 3.99995977366
Iterations: 40
State 0 : q_sa 487.890602373 q_best 487.89059483
State 1 : q_sa 649.187477373 q_best 649.18746983
State 2 : q_sa 852.249977373 q_best 852.24996983
State 3 : q_sa 1002.99997737 q_best 1002.99996983
State 4 : q_sa 3.99997737268 q_best 3.99996983024
Iterations: 41
State 0 : q_sa 487.89060803 q_best 487.890602373
State 1 : q_sa 649.18748303 q_best 649.187477373
State 2 : q_sa 852.24998303 q_best 852.249977373
State 3 : q_sa 1002.99998303 q_best 1002.99997737
State 4 : q_sa 3.99998302951 q_best 3.99997737268
Iterations: 42
State 0 : q_sa 487.890612272 q_best 487.89060803
State 1 : q_sa 649.187487272 q_best 649.18748303
State 2 : q_sa 852.249987272 q_best 852.24998303
State 3 : q_sa 1002.99998727 q_best 1002.99998303
State 4 : q_sa 3.99998727213 q_best 3.99998302951
Iterations: 43
State 0 : q_sa 487.890615454 q_best 487.890612272
State 1 : q_sa 649.187490454 q_best 649.187487272
State 2 : q_sa 852.249990454 q_best 852.249987272
State 3 : q_sa 1002.99999045 q_best 1002.99998727
State 4 : q_sa 3.9999904541 q_best 3.99998727213
Iterations: 44
State 0 : q_sa 487.890617841 q_best 487.890615454
State 1 : q_sa 649.187492841 q_best 649.187490454
State 2 : q_sa 852.249992841 q_best 852.249990454
State 3 : q_sa 1002.99999284 q_best 1002.99999045
State 4 : q_sa 3.99999284058 q_best 3.9999904541
Iterations: 45
State 0 : q_sa 487.89061963 q_best 487.890617841
State 1 : q_sa 649.18749463 q_best 649.187492841
State 2 : q_sa 852.24999463 q_best 852.249992841
State 3 : q_sa 1002.99999463 q_best 1002.99999284
State 4 : q_sa 3.99999463043 q_best 3.99999284058
Iterations: 46
State 0 : q_sa 487.890620973 q_best 487.89061963
State 1 : q_sa 649.187495973 q_best 649.18749463
State 2 : q_sa 852.249995973 q_best 852.24999463
State 3 : q_sa 1002.99999597 q_best 1002.99999463
State 4 : q_sa 3.99999597282 q_best 3.99999463043
Iterations: 47
State 0 : q_sa 487.89062198 q_best 487.890620973
State 1 : q_sa 649.18749698 q_best 649.187495973
State 2 : q_sa 852.24999698 q_best 852.249995973
State 3 : q_sa 1002.99999698 q_best 1002.99999597
State 4 : q_sa 3.99999697962 q_best 3.99999597282
Iterations: 48
State 0 : q_sa 487.890622735 q_best 487.89062198
State 1 : q_sa 649.187497735 q_best 649.18749698
State 2 : q_sa 852.249997735 q_best 852.24999698
State 3 : q_sa 1002.99999773 q_best 1002.99999698
State 4 : q_sa 3.99999773471 q_best 3.99999697962
Iterations: 49
State 0 : q_sa 487.890623301 q_best 487.890622735
State 1 : q_sa 649.187498301 q_best 649.187497735
State 2 : q_sa 852.249998301 q_best 852.249997735
State 3 : q_sa 1002.9999983 q_best 1002.99999773
State 4 : q_sa 3.99999830104 q_best 3.99999773471
Iterations: 50
State 0 : q_sa 487.890623726 q_best 487.890623301
State 1 : q_sa 649.187498726 q_best 649.187498301
State 2 : q_sa 852.249998726 q_best 852.249998301
State 3 : q_sa 1002.99999873 q_best 1002.9999983
State 4 : q_sa 3.99999872578 q_best 3.99999830104
Iterations: 51
State 0 : q_sa 487.890624044 q_best 487.890623726
State 1 : q_sa 649.187499044 q_best 649.187498726
State 2 : q_sa 852.249999044 q_best 852.249998726
State 3 : q_sa 1002.99999904 q_best 1002.99999873
State 4 : q_sa 3.99999904433 q_best 3.99999872578
Iterations: 52
State 0 : q_sa 487.890624283 q_best 487.890624044
State 1 : q_sa 649.187499283 q_best 649.187499044
State 2 : q_sa 852.249999283 q_best 852.249999044
State 3 : q_sa 1002.99999928 q_best 1002.99999904
State 4 : q_sa 3.99999928325 q_best 3.99999904433
Iterations: 53
State 0 : q_sa 487.890624462 q_best 487.890624283
State 1 : q_sa 649.187499462 q_best 649.187499283
State 2 : q_sa 852.249999462 q_best 852.249999283
State 3 : q_sa 1002.99999946 q_best 1002.99999928
State 4 : q_sa 3.99999946244 q_best 3.99999928325
Iterations: 54
State 0 : q_sa 487.890624597 q_best 487.890624462
State 1 : q_sa 649.187499597 q_best 649.187499462
State 2 : q_sa 852.249999597 q_best 852.249999462
State 3 : q_sa 1002.9999996 q_best 1002.99999946
State 4 : q_sa 3.99999959683 q_best 3.99999946244
Iterations: 55
State 0 : q_sa 487.890624698 q_best 487.890624597
State 1 : q_sa 649.187499698 q_best 649.187499597
State 2 : q_sa 852.249999698 q_best 852.249999597
State 3 : q_sa 1002.9999997 q_best 1002.9999996
State 4 : q_sa 3.99999969762 q_best 3.99999959683
Iterations: 56
State 0 : q_sa 487.890624773 q_best 487.890624698
State 1 : q_sa 649.187499773 q_best 649.187499698
State 2 : q_sa 852.249999773 q_best 852.249999698
State 3 : q_sa 1002.99999977 q_best 1002.9999997
State 4 : q_sa 3.99999977322 q_best 3.99999969762
Iterations: 57
State 0 : q_sa 487.89062483 q_best 487.890624773
State 1 : q_sa 649.18749983 q_best 649.187499773
State 2 : q_sa 852.24999983 q_best 852.249999773
State 3 : q_sa 1002.99999983 q_best 1002.99999977
State 4 : q_sa 3.99999982991 q_best 3.99999977322
Iterations: 58
State 0 : q_sa 487.890624872 q_best 487.89062483
State 1 : q_sa 649.187499872 q_best 649.18749983
State 2 : q_sa 852.249999872 q_best 852.24999983
State 3 : q_sa 1002.99999987 q_best 1002.99999983
State 4 : q_sa 3.99999987243 q_best 3.99999982991
Iterations: 59
State 0 : q_sa 487.890624904 q_best 487.890624872
State 1 : q_sa 649.187499904 q_best 649.187499872
State 2 : q_sa 852.249999904 q_best 852.249999872
State 3 : q_sa 1002.9999999 q_best 1002.99999987
State 4 : q_sa 3.99999990433 q_best 3.99999987243
Iterations: 60
State 0 : q_sa 487.890624928 q_best 487.890624904
State 1 : q_sa 649.187499928 q_best 649.187499904
State 2 : q_sa 852.249999928 q_best 852.249999904
State 3 : q_sa 1002.99999993 q_best 1002.9999999
State 4 : q_sa 3.99999992824 q_best 3.99999990433
Iterations: 61
State 0 : q_sa 487.890624946 q_best 487.890624928
State 1 : q_sa 649.187499946 q_best 649.187499928
State 2 : q_sa 852.249999946 q_best 852.249999928
State 3 : q_sa 1002.99999995 q_best 1002.99999993
State 4 : q_sa 3.99999994618 q_best 3.99999992824
Iterations: 62
State 0 : q_sa 487.89062496 q_best 487.890624946
State 1 : q_sa 649.18749996 q_best 649.187499946
State 2 : q_sa 852.24999996 q_best 852.249999946
State 3 : q_sa 1002.99999996 q_best 1002.99999995
State 4 : q_sa 3.99999995964 q_best 3.99999994618
Iterations: 63
State 0 : q_sa 487.89062497 q_best 487.89062496
State 1 : q_sa 649.18749997 q_best 649.18749996
State 2 : q_sa 852.24999997 q_best 852.24999996
State 3 : q_sa 1002.99999997 q_best 1002.99999996
State 4 : q_sa 3.99999996973 q_best 3.99999995964
Iterations: 64
State 0 : q_sa 487.890624977 q_best 487.89062497
State 1 : q_sa 649.187499977 q_best 649.18749997
State 2 : q_sa 852.249999977 q_best 852.24999997
State 3 : q_sa 1002.99999998 q_best 1002.99999997
State 4 : q_sa 3.9999999773 q_best 3.99999996973
Iterations: 65
State 0 : q_sa 487.890624983 q_best 487.890624977
State 1 : q_sa 649.187499983 q_best 649.187499977
State 2 : q_sa 852.249999983 q_best 852.249999977
State 3 : q_sa 1002.99999998 q_best 1002.99999998
State 4 : q_sa 3.99999998297 q_best 3.9999999773
Iterations: 66
State 0 : q_sa 487.890624987 q_best 487.890624983
State 1 : q_sa 649.187499987 q_best 649.187499983
State 2 : q_sa 852.249999987 q_best 852.249999983
State 3 : q_sa 1002.99999999 q_best 1002.99999998
State 4 : q_sa 3.99999998723 q_best 3.99999998297
Iterations: 67
State 0 : q_sa 487.89062499 q_best 487.890624987
State 1 : q_sa 649.18749999 q_best 649.187499987
State 2 : q_sa 852.24999999 q_best 852.249999987
State 3 : q_sa 1002.99999999 q_best 1002.99999999
State 4 : q_sa 3.99999999042 q_best 3.99999998723
Iterations: 68
State 0 : q_sa 487.890624993 q_best 487.89062499
State 1 : q_sa 649.187499993 q_best 649.18749999
State 2 : q_sa 852.249999993 q_best 852.24999999
State 3 : q_sa 1002.99999999 q_best 1002.99999999
State 4 : q_sa 3.99999999282 q_best 3.99999999042
Iterations: 69
State 0 : q_sa 487.890624995 q_best 487.890624993
State 1 : q_sa 649.187499995 q_best 649.187499993
State 2 : q_sa 852.249999995 q_best 852.249999993
State 3 : q_sa 1002.99999999 q_best 1002.99999999
State 4 : q_sa 3.99999999461 q_best 3.99999999282
Iterations: 70
State 0 : q_sa 487.890624996 q_best 487.890624995
State 1 : q_sa 649.187499996 q_best 649.187499995
State 2 : q_sa 852.249999996 q_best 852.249999995
State 3 : q_sa 1003.0 q_best 1002.99999999
State 4 : q_sa 3.99999999596 q_best 3.99999999461
Iterations: 71
State 0 : q_sa 487.890624997 q_best 487.890624996
State 1 : q_sa 649.187499997 q_best 649.187499996
State 2 : q_sa 852.249999997 q_best 852.249999996
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999697 q_best 3.99999999596
Iterations: 72
State 0 : q_sa 487.890624998 q_best 487.890624997
State 1 : q_sa 649.187499998 q_best 649.187499997
State 2 : q_sa 852.249999998 q_best 852.249999997
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999773 q_best 3.99999999697
Iterations: 73
State 0 : q_sa 487.890624998 q_best 487.890624998
State 1 : q_sa 649.187499998 q_best 649.187499998
State 2 : q_sa 852.249999998 q_best 852.249999998
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.9999999983 q_best 3.99999999773
Iterations: 74
State 0 : q_sa 487.890624999 q_best 487.890624998
State 1 : q_sa 649.187499999 q_best 649.187499998
State 2 : q_sa 852.249999999 q_best 852.249999998
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999872 q_best 3.9999999983
Iterations: 75
State 0 : q_sa 487.890624999 q_best 487.890624999
State 1 : q_sa 649.187499999 q_best 649.187499999
State 2 : q_sa 852.249999999 q_best 852.249999999
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999904 q_best 3.99999999872
Iterations: 76
State 0 : q_sa 487.890624999 q_best 487.890624999
State 1 : q_sa 649.187499999 q_best 649.187499999
State 2 : q_sa 852.249999999 q_best 852.249999999
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999928 q_best 3.99999999904
Iterations: 77
State 0 : q_sa 487.890624999 q_best 487.890624999
State 1 : q_sa 649.187499999 q_best 649.187499999
State 2 : q_sa 852.249999999 q_best 852.249999999
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999946 q_best 3.99999999928
Iterations: 78
State 0 : q_sa 487.890625 q_best 487.890624999
State 1 : q_sa 649.1875 q_best 649.187499999
State 2 : q_sa 852.25 q_best 852.249999999
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.9999999996 q_best 3.99999999946
Iterations: 79
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.9999999997 q_best 3.9999999996
Iterations: 80
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999977 q_best 3.9999999997
Iterations: 81
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999983 q_best 3.99999999977
Iterations: 82
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999987 q_best 3.99999999983
Iterations: 83
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.9999999999 q_best 3.99999999987
Iterations: 84
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999993 q_best 3.9999999999
Iterations: 85
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999995 q_best 3.99999999993
Iterations: 86
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999996 q_best 3.99999999995
Iterations: 87
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999997 q_best 3.99999999996
Iterations: 88
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999998 q_best 3.99999999997
Iterations: 89
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999998 q_best 3.99999999998
Iterations: 90
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999999 q_best 3.99999999998
Iterations: 91
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999999 q_best 3.99999999999
Iterations: 92
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999999 q_best 3.99999999999
Iterations: 93
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 3.99999999999 q_best 3.99999999999
Iterations: 94
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 3.99999999999
Iterations: 95
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 96
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 97
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 98
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 99
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 100
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 101
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 102
State 0 : q_sa 487.890625 q_best 487.890625
State 1 : q_sa 649.1875 q_best 649.1875
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 103
State 0 : q_sa 487.890625 q_best 487.890625
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 104
State 1 : q_sa 649.1875 q_best 649.1875
State 2 : q_sa 852.25 q_best 852.25
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 105
State 0 : q_sa 487.890625 q_best 487.890625
State 4 : q_sa 4.0 q_best 4.0
Iterations: 106
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 107
State 2 : q_sa 852.25 q_best 852.25
State 4 : q_sa 4.0 q_best 4.0
Iterations: 108
State 1 : q_sa 649.1875 q_best 649.1875
State 4 : q_sa 4.0 q_best 4.0
Iterations: 109
State 0 : q_sa 487.890625 q_best 487.890625
State 3 : q_sa 1003.0 q_best 1003.0
State 4 : q_sa 4.0 q_best 4.0
Iterations: 110
State 2 : q_sa 852.25 q_best 852.25
State 4 : q_sa 4.0 q_best 4.0
Iterations: 111
State 1 : q_sa 649.1875 q_best 649.1875
State 4 : q_sa 4.0 q_best 4.0
Iterations: 112
State 0 : q_sa 487.890625 q_best 487.890625
State 4 : q_sa 4.0 q_best 4.0
Iterations: 113
State 4 : q_sa 4.0 q_best 4.0
Iterations: 114
State 4 : q_sa 4.0 q_best 4.0
Iterations: 115
State 4 : q_sa 4.0 q_best 4.0
Iterations: 116
State 4 : q_sa 4.0 q_best 4.0
Iterations: 117
State 4 : q_sa 4.0 q_best 4.0
Iterations: 118
State 4 : q_sa 4.0 q_best 4.0
Iterations: 119
State 4 : q_sa 4.0 q_best 4.0
Iterations: 120
State 4 : q_sa 4.0 q_best 4.0
Iterations: 121
State 4 : q_sa 4.0 q_best 4.0
Iterations: 122
State 4 : q_sa 4.0 q_best 4.0
Iterations: 123
State 4 : q_sa 4.0 q_best 4.0
Iterations: 124
Iterations: 125
Final policy
[0, 1, 0, 1, 0]
[  487.890625   649.1875     852.25      1003.           4.      ]

@trentwoodbury
Copy link

I think you may want to change your definition of q_sa to be:

q_sa = sum([R[s, a, s1] + ( P[s, a, s1] * gamma * V[s1]) for s1 in range(N_STATES)])

This is because Q = R(s,a) + gamma * sum(P(s,a,s')*V(s')).

@YingxiaoKong
Copy link

for policy evaluation and policy iteration, do you need to do policy iteration after policy evaluation reach convergence?

@sasmitBiceps
Copy link

love you my guy, helped me to solve my assignment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment