https://www.alexirpan.com/2018/02/14/rl-hard.html
Prioritized Experience Replay https://github.com/jaromiru/AI-blog/blob/master/Seaquest-DDQN-PER.py
https://github.com/Ullar-Kask/TD3-PER/blob/master/Pytorch/src/PER.py https://knowledge.udacity.com/questions/46815 https://knowledge.udacity.com/questions/56910 https://knowledge.udacity.com/questions/54781
the paper https://arxiv.org/pdf/1511.05952.pdf https://www.semanticscholar.org/paper/A-novel-DDPG-method-with-prioritized-experience-Hou-Liu/027d002d205e49989d734603ff0c2f7cbfa6b6dd
https://wpumacay.github.io/research_blog/posts/deeprlnd-project1-part3/ https://medium.com/@kinwo/learning-to-play-tennis-from-scratch-with-self-play-using-ddpg-ac7389eb980e https://towardsdatascience.com/training-two-agents-to-play-tennis-8285ebfaec5f
https://openai.com/blog/learning-dexterity/
https://towardsdatascience.com/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b
Great job outlining ideas for future experiments with the project!
As noted in the report, you should also try implementing Prioritized Experience Replay. It improves performance, significantly reduces training time, and should also help stabilize learning to some extent. A fast implementation of Prioritized Experience Replay is possible using a special data structure called a sum tree. I found a good implementation here.
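To make the idea concrete, here is a minimal sketch of a sum tree for priority-proportional sampling (class and method names are illustrative, not taken from the linked repository). Each parent node stores the sum of its children's priorities, so both updating a priority and sampling a transition proportionally to its priority take O(log n):

```python
import numpy as np

class SumTree:
    """Binary tree whose leaves hold transition priorities and whose
    internal nodes hold the sum of their children. The root is the
    total priority mass, used for proportional sampling in PER."""

    def __init__(self, capacity):
        self.capacity = capacity                 # max stored transitions
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes + leaves
        self.data = [None] * capacity            # the transitions themselves
        self.write = 0                           # next leaf to overwrite (ring buffer)
        self.size = 0

    def total(self):
        return self.tree[0]                      # root = sum of all priorities

    def add(self, priority, transition):
        idx = self.write + self.capacity - 1     # leaf index in the tree array
        self.data[self.write] = transition
        self.update(idx, priority)
        self.write = (self.write + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def update(self, idx, priority):
        change = priority - self.tree[idx]
        self.tree[idx] = priority
        while idx != 0:                          # propagate the change up to the root
            idx = (idx - 1) // 2
            self.tree[idx] += change

    def get(self, s):
        """Sample the leaf covering mass s in [0, total()): descend left
        if s fits in the left subtree, else subtract it and go right."""
        idx = 0
        while idx < self.capacity - 1:           # stop when a leaf is reached
            left = 2 * idx + 1
            if s <= self.tree[left]:
                idx = left
            else:
                s -= self.tree[left]
                idx = left + 1
        return idx, self.tree[idx], self.data[idx - self.capacity + 1]
```

To sample a minibatch, draw `s` uniformly from `[0, tree.total())` (typically one draw per stratified segment) and call `tree.get(s)`; transitions with larger TD-error priorities are then replayed more often.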
Also, please check the following posts to get familiar with more reinforcement learning algorithms:
Asynchronous Actor-Critic Agents (A3C)
[Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO)](https://medium.com/@sanketgujar95/trust-region-policy-optimization-trpo-and-proximal-policy-optimization-ppo-e6e7075f39ed)
Here is an implementation of PPO on the Tennis environment. Training was slow, but the final average score reached almost 1.25 (with some fluctuation). You should definitely try PPO in the future.
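For reference, the core of PPO is its clipped surrogate objective. A minimal PyTorch sketch (the function name and default clip value are illustrative; the inputs are assumed to be tensors of per-sample log-probabilities and advantage estimates):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO: limits how far the updated
    policy can move from the one that collected the data.
    Returns a scalar loss to minimize (negative of the objective)."""
    ratio = torch.exp(new_log_probs - old_log_probs)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) objective per sample, then average.
    return -torch.min(unclipped, clipped).mean()
```

The clipping removes the incentive to push the probability ratio outside `[1 - eps, 1 + eps]`, which is what makes PPO stable enough to train without the second-order machinery TRPO needs.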