@xkrishnam
Last active September 29, 2022 06:17
TensorFlow 2 implementation of the policy gradient method for solving n-armed bandit problems.
@xkrishnam (Author)

[image: loss graph]

@daniel-xion commented Aug 22, 2022

Thanks for the code. May I know what the context is in this case (since it is called contextual)?
Also, I tried to add several more layers to the neural network (so it works better for a large number of bandits), but was unable to do so correctly.
In the above code, get_weights() for ww returns a 3x4 array. If a hidden layer of another size is added (say, one with size 8), ww.get_weights() for the last layer gives an 8x4 array. Then one cannot use np.argmax(ww[0][a]) to find the best action from this new array.

@xkrishnam (Author)

Context here simply means that the algorithm also considers information about the state of the environment (the context) when generating actions, in order to get higher rewards (i.e. it does not just generate random actions and optimize the loss).
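
To make that concrete, here is a minimal sketch of a single interaction step. The names (num_bandits, num_actions, pull_arm) are illustrative rather than taken from the gist; the sizes just match the 3x4 weight array discussed above:

```python
import numpy as np
import tensorflow as tf

num_bandits, num_actions = 3, 4  # sizes matching the 3x4 weight array in this thread

# One-layer agent: one-hot context in, per-action preferences out.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(num_actions, activation="sigmoid",
                          use_bias=False, input_shape=(num_bandits,))
])

state = np.random.randint(num_bandits)      # the environment picks a bandit: this is the context
context = tf.one_hot([state], num_bandits)  # encode the context for the network, shape (1, num_bandits)
prefs = model(context).numpy()[0]
action = int(np.argmax(prefs))              # greedy choice for illustration; training would add exploration
# reward = pull_arm(state, action)          # pull_arm is a hypothetical environment function
```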

Second, you can add more layers, but the output of the last layer should be number_of_bandits * number_of_possible_actions; that means you can put the extra layers before the current first layer (i.e. layer1).

Or you can make the code more generic so it can be used more effectively; when I wrote it, my only intention was to convert the existing TF1 solution to TF2.
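
For what it's worth, here is a minimal sketch of that more generic version: add hidden layers freely, and read off the best action per context by querying the trained model with each one-hot state rather than inspecting the last layer's weights (which is why np.argmax(ww[0][a]) stops working once the shapes change). The code is illustrative, not the gist's actual implementation:

```python
import numpy as np
import tensorflow as tf

num_bandits, num_actions, hidden = 3, 4, 8

# Deeper agent: hidden widths are free to vary, because we no longer
# rely on the last weight matrix being num_bandits x num_actions.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden, activation="relu", input_shape=(num_bandits,)),
    tf.keras.layers.Dense(num_actions, activation="sigmoid"),
])

# ... train with the same policy-gradient loss as before ...

# Generic evaluation: feed every one-hot context through the trained
# model and take the argmax, instead of reading ww.get_weights().
contexts = tf.eye(num_bandits)  # identity matrix = all one-hot states at once
best_actions = np.argmax(model(contexts).numpy(), axis=1)
for s, a in enumerate(best_actions):
    print(f"bandit {s}: best action {a}")
```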
