Atari of Dopamine

Things to care about when replicating Dopamine's Atari pipeline:

I. Environment

a. gym environment: full_action_space=False, frameskip=1, repeat_action_probability=0.25.

b. reset: Stack 4 frames. The first three are zeros, the last one is the first image.

c. step: Generate 4 frames in the internal environment. Sum the rewards. Stop early if a terminal state is reached. For the last 2 frames, take the grayscale versions, apply a max pooling over them, then resize the result to 84 by 84 with the cv2.INTER_AREA interpolation (see the sketch below).

Tip: use env.ale.getScreenGrayscale to get the frame from the ALE. It is faster than fetching the RGB observation and converting it.
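
Below is a minimal sketch of the preprocessing in I.a-c. It assumes gymnasium with the ALE environments registered (ale-py), opencv-python, and numpy; the class and attribute names are illustrative, not Dopamine's actual AtariPreprocessing wrapper.

```python
# Minimal preprocessing sketch (not Dopamine's implementation).
import ale_py  # registers the ALE namespace; newer gymnasium may need gym.register_envs(ale_py)
import cv2
import gymnasium as gym
import numpy as np


class AtariPreprocessing:
    """Frame skip of 4, max pooling of the last 2 frames, grayscale, 84x84 resize, stack of 4."""

    def __init__(self, game: str = "Pong"):
        # I.a: no frame skip inside the ALE, sticky actions with probability 0.25.
        self.env = gym.make(
            f"ALE/{game}-v5",
            full_action_space=False,
            frameskip=1,
            repeat_action_probability=0.25,
        )
        self.frames = np.zeros((84, 84, 4), dtype=np.uint8)
        # Two raw grayscale screens, used for the max pooling.
        self.screen_buffer = np.zeros((2, *self.env.observation_space.shape[:2]), dtype=np.uint8)

    def _resize(self, frame: np.ndarray) -> np.ndarray:
        return cv2.resize(frame, (84, 84), interpolation=cv2.INTER_AREA)

    def reset(self) -> np.ndarray:
        # I.b: the first three stacked frames are zeros, the last one is the first image.
        self.env.reset()
        self.frames[...] = 0
        self.env.unwrapped.ale.getScreenGrayscale(self.screen_buffer[0])
        self.frames[..., -1] = self._resize(self.screen_buffer[0])
        return self.frames

    def step(self, action: int):
        # I.c: repeat the action for 4 ALE frames, sum the rewards, stop early on a
        # terminal state, and max-pool the grayscale versions of the last 2 frames.
        reward_sum, terminal = 0.0, False
        for frame_index in range(4):
            _, reward, terminated, truncated, _ = self.env.step(action)
            reward_sum += reward
            terminal = terminated or truncated
            if frame_index >= 2:
                self.env.unwrapped.ale.getScreenGrayscale(self.screen_buffer[frame_index - 2])
            if terminal:
                break
        pooled = np.maximum(self.screen_buffer[0], self.screen_buffer[1])
        self.frames = np.roll(self.frames, -1, axis=-1)
        self.frames[..., -1] = self._resize(pooled)
        return self.frames, reward_sum, terminal
```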

II. Network

a. initializer: DQN: nn.initializers.variance_scaling(scale=1.0, mode="fan_avg", distribution="truncated_normal") | IQN: nn.initializers.variance_scaling(scale=1.0 / jnp.sqrt(3.0), mode="fan_in", distribution="uniform").

b. Divide the state by 255.

c. Torso: Conv2D features=32, kernel_size=(8, 8), strides=(4, 4) | Conv2D features=64, kernel_size=(4, 4), strides=(2, 2) | Conv2D features=64, kernel_size=(3, 3), strides=(1, 1).

d. Head: Dense features=512 | Dense features=n_actions (the full network is sketched after this section).

e. batch_size: 32.
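
A minimal flax sketch of II.a-d for the DQN case. The module name is illustrative, padding is left at flax's default, and the network is written for a single unbatched (84, 84, 4) state.

```python
# Sketch of the DQN network in II, assuming flax.linen and jax.
import flax.linen as nn
import jax.numpy as jnp


class AtariDQNNetwork(nn.Module):
    n_actions: int

    @nn.compact
    def __call__(self, state: jnp.ndarray) -> jnp.ndarray:
        # II.a: DQN initializer.
        initializer = nn.initializers.variance_scaling(
            scale=1.0, mode="fan_avg", distribution="truncated_normal"
        )
        # II.b: divide the state by 255.
        x = state.astype(jnp.float32) / 255.0
        # II.c: torso.
        x = nn.relu(nn.Conv(32, kernel_size=(8, 8), strides=(4, 4), kernel_init=initializer)(x))
        x = nn.relu(nn.Conv(64, kernel_size=(4, 4), strides=(2, 2), kernel_init=initializer)(x))
        x = nn.relu(nn.Conv(64, kernel_size=(3, 3), strides=(1, 1), kernel_init=initializer)(x))
        x = x.reshape(-1)  # flatten, assuming a single (84, 84, 4) state
        # II.d: head.
        x = nn.relu(nn.Dense(512, kernel_init=initializer)(x))
        return nn.Dense(self.n_actions, kernel_init=initializer)(x)
```

To process the batch of 32 states sampled from the replay buffer (II.e), apply the module with jax.vmap over the batch axis.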

III. Replay buffer

a. You can use the one from Dopamine. max_size = 1 000 000 samples, initial_size = 20 000 samples.

b. Make sure to clip the reward between -1 and 1 (see the sketch below).
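
A sketch of III with the reward clipped at insertion time. The constructor and add arguments follow my reading of dopamine.replay_memory.circular_replay_buffer and may differ slightly between Dopamine versions; treat them as assumptions.

```python
# Sketch of III; argument names are assumptions and may vary across Dopamine versions.
import numpy as np
from dopamine.replay_memory import circular_replay_buffer

replay_buffer = circular_replay_buffer.OutOfGraphReplayBuffer(
    observation_shape=(84, 84),
    stack_size=4,
    replay_capacity=1_000_000,  # III.a: max_size
    batch_size=32,
)

def store_transition(observation, action, reward, terminal):
    # III.b: clip the reward to [-1, 1] before it enters the buffer.
    replay_buffer.add(observation, action, np.clip(reward, -1.0, 1.0), terminal)
```

Sampling should only start once the buffer holds the 20 000 initial samples.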

IV. Metric

a. Interquartile mean (IQM) over the seeds/games. The metric is the average, over one training epoch, of the undiscounted sum of rewards over one trajectory (a sketch of the IQM is given below).
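
A minimal numpy sketch of the interquartile mean over a flat array of per-seed/per-game scores; the rliable library offers a more complete implementation with bootstrap confidence intervals.

```python
# Interquartile mean (IQM) sketch, assuming numpy.
import numpy as np

def interquartile_mean(scores: np.ndarray) -> float:
    """Mean of the scores left after discarding the bottom and top 25%."""
    sorted_scores = np.sort(scores.flatten())
    n_trimmed = len(sorted_scores) // 4
    return float(np.mean(sorted_scores[n_trimmed : len(sorted_scores) - n_trimmed]))
```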

V. Training

a. Composed of 200 epochs. An epoch is composed of 250 000 training steps, or more if the current episode is not done when reaching 250 000 training steps. A training step corresponds to one interaction with the environment. The training starts after 20 000 environment steps have been collected.

b. Every 4 training steps, a gradient step is performed. The target network is updated to the online network every 8 000 training steps.

c. Exploration: starting_epsilon = 1, ending_epsilon = 0.01, duration_epsilon = 250 000 training steps, horizon (maximum episode length) = 27 000 training steps (the decay schedule is sketched after this section).

d. Gamma = 0.99.

e. Optimizer: Adam with learning_rate = 6.25 × 10⁻⁵ and epsilon = 1.5 × 10⁻⁴.
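
A sketch of the schedule and optimizer settings in V, assuming optax and jax.numpy; the constant and function names are illustrative, not Dopamine's exact API.

```python
# Sketch of V.b-e, assuming optax and jax.numpy; names are illustrative.
import jax.numpy as jnp
import optax

GAMMA = 0.99                  # V.d
INITIAL_COLLECT = 20_000      # V.a: steps collected before training starts
UPDATE_PERIOD = 4             # V.b: one gradient step every 4 training steps
TARGET_UPDATE_PERIOD = 8_000  # V.b: target network sync period

def epsilon(training_step: int) -> jnp.ndarray:
    # V.c: linear decay from 1.0 to 0.01 over 250 000 training steps,
    # starting once the 20 000 initial samples have been collected.
    fraction = jnp.clip((training_step - INITIAL_COLLECT) / 250_000, 0.0, 1.0)
    return 1.0 + fraction * (0.01 - 1.0)

# V.e: Adam with the Atari hyperparameters above.
optimizer = optax.adam(learning_rate=6.25e-5, eps=1.5e-4)
```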
