Last active
October 11, 2022 21:27
A Policy-Gradient algorithm that solves Contextual Bandit problems.
Instead of using slim, you can use plain tf as follows:
state_in_OH = tf.one_hot(self.state_in, s_size)
output = tf.layers.dense(state_in_OH, a_size, tf.nn.sigmoid, use_bias=False, kernel_initializer=tf.ones_initializer())
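To see what those two lines compute, here is a minimal pure-Python sketch (no TensorFlow needed). It assumes the dense layer is just a bias-free matrix multiply followed by a sigmoid, so feeding it a one-hot state amounts to picking one row of the kernel. The helper names `one_hot`, `dense_no_bias` and `sigmoid`, and the example sizes and weights, are made up for illustration and are not part of the gist.

```python
import math

def one_hot(index, size):
    # Vector of zeros with a single 1.0 at position `index`.
    return [1.0 if i == index else 0.0 for i in range(size)]

def dense_no_bias(x, kernel):
    # Plain matrix-vector product; kernel has shape (len(x), n_out).
    n_out = len(kernel[0])
    return [sum(x[i] * kernel[i][j] for i in range(len(x))) for j in range(n_out)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

s_size, a_size = 3, 4  # hypothetical number of states (bandits) and actions (arms)

# Start from an all-ones kernel, like tf.ones_initializer() would produce.
kernel = [[1.0] * a_size for _ in range(s_size)]
# Pretend training has already changed the row for state 1 (made-up values).
kernel[1] = [0.5, -1.0, 2.0, 0.0]

state = 1
output = [sigmoid(z) for z in dense_no_bias(one_hot(state, s_size), kernel)]

# The same result, obtained by looking up the row for `state` directly.
row_lookup = [sigmoid(w) for w in kernel[state]]
```

So each state effectively owns its own row of per-action weights, which is why no bias term is needed here.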
Thanks Arthur! This is a helpful tutorial for beginners like me. Here is a TensorFlow 2 implementation that may be helpful for someone.
Thanks for the implementation. I wonder how this implementation is a policy network? I don't see where the policy gradient is used.
@pooriaPoorsarvi As seen above, we already have the responsible_weight variable. We then take its negative log likelihood as the loss, so that minimizing the loss maximizes the likelihood (tf optimizers can only minimize). There is no need to consider the other classes.
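A small pure-Python sketch of that idea, under some assumptions: the loss is taken to be `-log(responsible_weight) * reward`, the policy output is `sigmoid(w)` per action, and the update is one step of plain gradient descent. The function names `loss_and_grad` and `sgd_step`, the learning rate, and the starting weights are all illustrative and not taken from the gist.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(weights, action, reward):
    # Output of the policy for the action that was actually taken.
    responsible_weight = sigmoid(weights[action])
    # Negative log likelihood, scaled by the reward received.
    loss = -math.log(responsible_weight) * reward
    # d/dw log(sigmoid(w)) = 1 - sigmoid(w); every other action gets zero gradient.
    grad = [0.0] * len(weights)
    grad[action] = -reward * (1.0 - responsible_weight)
    return loss, grad

def sgd_step(weights, action, reward, lr=0.1):
    # One step of gradient descent on the loss above.
    _, grad = loss_and_grad(weights, action, reward)
    return [w - lr * g for w, g in zip(weights, grad)]

weights = [1.0, 1.0, 1.0, 1.0]  # weights for one state, all ones at the start
after_good = sgd_step(weights, action=2, reward=1.0)   # chosen action was rewarded
after_bad = sgd_step(weights, action=2, reward=-1.0)   # chosen action was punished
```

Minimizing this loss raises the weight of an action that earned a positive reward and lowers it after a negative one, while the weights of the other actions stay untouched. That is the policy-gradient update in this setup.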