πŸ‘‰ Unity ML-Agents `Banana Collectors` Project Submission

  1. For this toy game, two Deep Q-Network methods were tried. Since the observations (states) are simple low-dimensional vectors rather than pixels, convolutional layers are not used, and the evaluation results confirm that linear layers are sufficient to solve the problem.
    • Double DQN, with 3 linear layers (hidden dims: 256*64, later also tried with 64*64); a sketch of its target update follows this list
    • Dueling DQN, with 2 shared linear layers + 2 split linear layers (hidden dims: 64*64)
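β–ͺ️ For reference, here is a minimal, self-contained sketch of the Double DQN target computation: the online network selects the greedy action and the target network evaluates it, which reduces Q-value overestimation. The helper `make_q_net` and the dummy tensors are illustrative assumptions, not code from the project.

    import torch
    import torch.nn as nn

    ## Banana environment sizes (37-dim observation, 4 discrete actions)
    state_size, action_size, batch_size, gamma = 37, 4, 8, 0.99

    def make_q_net():
        ## simple 3-linear-layer Q-network, mirroring the "3 linear layers" variant above
        return nn.Sequential(nn.Linear(state_size, 64), nn.ReLU(),
                             nn.Linear(64, 64), nn.ReLU(),
                             nn.Linear(64, action_size))

    qnetwork_local, qnetwork_target = make_q_net(), make_q_net()

    ## dummy batch standing in for a sampled replay-buffer minibatch
    next_states = torch.randn(batch_size, state_size)
    rewards     = torch.randn(batch_size, 1)
    dones       = torch.zeros(batch_size, 1)

    with torch.no_grad():
        ## online network picks the greedy action, target network evaluates it
        best_actions = qnetwork_local(next_states).argmax(dim=1, keepdim=True)
        q_next = qnetwork_target(next_states).gather(1, best_actions)
        q_targets = rewards + gamma * q_next * (1 - dones)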

β–ͺ️ The Dueling DQN architecture is shown below.

(Figure: the dueling network architecture, with separate value and advantage streams.)

β–ͺ️ Since both the advantage and the value stream propagate gradients back into the shared layers (the last convolutional layer in the original Dueling DQN paper, the last shared linear layer here), the combined gradient entering those layers is rescaled by 1/√2. This simple heuristic mildly increases stability.

        ## shared fully connected layers
        self.layer1 = nn.Linear(state_size, 64)
        self.layer2 = nn.Linear(64, 64)
        ## dueling streams: per-action advantage and scalar state value
        self.layer3_adv = nn.Linear(in_features=64, out_features=action_size) ## advantage
        self.layer3_val = nn.Linear(in_features=64, out_features=1) ## state value

    def forward(self, state):
        x = F.relu(self.layer1(state))
        x = F.relu(self.layer2(x))
        adv, val = self.layer3_adv(x), self.layer3_val(x)
        ## Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)), rescaled by 1/sqrt(2)
        return (val + adv - adv.mean(dim=1, keepdim=True)) / (2 ** 0.5)

β–ͺ️ In addition, we clip the gradients so that their norm is at most 10. This clipping is not standard practice in deep RL, but it is common in recurrent-network training (Bengio et al., 2013).

        ## clip the gradients
        nn.utils.clip_grad_norm_(self.qnetwork_local.parameters(), 10.)
        nn.utils.clip_grad_norm_(self.qnetwork_target.parameters(), 10.) 
  2. The following picture shows the train and eval scores (rewards) for both architectures. Since this is a toy project, the trained models are not formally evaluated; roughly, Dueling DQN performs slightly better, with an average score of 17 vs. 13 for Double DQN over 10 episodes.

  3. Project artifacts:
    • All the notebooks (trained on Google Colab, evaluated on a local machine)
    • The project folder p1_navigation (which contains the checkpoints dqn_checkpoint_2000.pth and dueling_dqn_checkpoint_2000.pth)
    • A video recording (which demonstrates how the trained models run on the local machine)