This is an attempt to create a generic (hence the relatively long code) agent for different OpenAI Gym environments. The model is based on Q-learning with experience replay, with the collected Q-values approximated by a neural network (TensorFlow). At each step, the action with the maximum Q-value for the given state is selected. The exploration rate starts at 0.6 and is quickly annealed to the standard 0.1 value. The neural network used for the cartpole environment is quite simple, with one ReLU hidden layer and a linear activation on the output layer. The model is loosely based on an excellent tutorial written by Tambet Matiisen on his blog.
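For concreteness, here is a minimal sketch of such a network and the epsilon-greedy policy with an annealed exploration rate, written with the Keras API. The hidden-layer width, learning rate, and annealing horizon are illustrative assumptions, not the values used in this project:

```python
import numpy as np
import tensorflow as tf

def build_q_network(state_dim, n_actions, hidden_units=64):
    # One ReLU hidden layer, linear output: one Q-value per action.
    # hidden_units=64 and the learning rate are illustrative choices.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="linear"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

def annealed_epsilon(step, start=0.6, end=0.1, anneal_steps=1000):
    # Linearly anneal exploration from 0.6 down to 0.1; the number of
    # annealing steps is an assumption.
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

def select_action(model, state, epsilon, n_actions):
    # Epsilon-greedy: explore with probability epsilon,
    # otherwise take the action with the maximum Q-value.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    q_values = model.predict(state[np.newaxis, :], verbose=0)
    return int(np.argmax(q_values[0]))
```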
The main challenge I experienced when adapting this agent to the cartpole environment was selecting a proper reward model. The default reward of +1 for every step the pole stayed upright was not very successful. Instead, I assigned a reward of 0.0 for every step the pole was upright and a penalty of -1.0 when the pole fell.
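A reward model along those lines can be expressed as a small Gym wrapper. This is a sketch using the classic four-tuple step API; `CartPoleRewardWrapper` is a hypothetical name, and a full implementation would also want to avoid penalizing episodes that end by reaching the time limit rather than by the pole falling:

```python
import gym

class CartPoleRewardWrapper(gym.Wrapper):
    """Replace the default +1-per-step reward with 0 while the pole
    is upright and -1.0 when the episode terminates."""

    def step(self, action):
        state, _, done, info = self.env.step(action)
        # 0.0 while balancing; -1.0 penalty on termination.
        # Note: a robust version would check whether `done` was caused
        # by failure or by the environment's time limit.
        reward = -1.0 if done else 0.0
        return state, reward, done, info

env = CartPoleRewardWrapper(gym.make("CartPole-v0"))
```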