Atari of Dopamine

Things to care about when replicating Dopamine's Atari pipeline:

I. Environment

a. gym environment: full_action_space=False, frameskip=1, repeat_action_probability=0.25.

b. reset: Stack 4 frames. The first three are zeros, the last one is the first image.

c. step: Generate 4 frames in the internal environment. Sum the rewards. Stop early if a terminal state is reached. For the last 2 frames, take the grayscale versions, apply a max pooling over them, then resize the result to 84 by 84 with the cv2.INTER_AREA interpolation (see the sketch below).

Tip: use env.ale.getScreenGrayscale to get the frame from the ALE. It is faster than fetching the RGB observation and converting it.
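
Below is a minimal sketch of the preprocessing in I.a-c. It assumes gymnasium with the ALE environments registered (ale-py), opencv-python, and numpy; the class and attribute names are illustrative, not Dopamine's actual AtariPreprocessing wrapper.

```python
# Minimal preprocessing sketch (not Dopamine's implementation).
import ale_py  # registers the ALE namespace; newer gymnasium may need gym.register_envs(ale_py)
import cv2
import gymnasium as gym
import numpy as np


class AtariPreprocessing:
    """Frame skip of 4, max pooling of the last 2 frames, grayscale, 84x84 resize, stack of 4."""

    def __init__(self, game: str = "Pong"):
        # I.a: no frame skip inside the ALE, sticky actions with probability 0.25.
        self.env = gym.make(
            f"ALE/{game}-v5",
            full_action_space=False,
            frameskip=1,
            repeat_action_probability=0.25,
        )
        self.frames = np.zeros((84, 84, 4), dtype=np.uint8)
        # Two raw grayscale screens, used for the max pooling.
        self.screen_buffer = np.zeros((2, *self.env.observation_space.shape[:2]), dtype=np.uint8)

    def _resize(self, frame: np.ndarray) -> np.ndarray:
        return cv2.resize(frame, (84, 84), interpolation=cv2.INTER_AREA)

    def reset(self) -> np.ndarray:
        # I.b: the first three stacked frames are zeros, the last one is the first image.
        self.env.reset()
        self.frames[...] = 0
        self.env.unwrapped.ale.getScreenGrayscale(self.screen_buffer[0])
        self.frames[..., -1] = self._resize(self.screen_buffer[0])
        return self.frames

    def step(self, action: int):
        # I.c: repeat the action for 4 ALE frames, sum the rewards, stop early on a
        # terminal state, and max-pool the grayscale versions of the last 2 frames.
        reward_sum, terminal = 0.0, False
        for frame_index in range(4):
            _, reward, terminated, truncated, _ = self.env.step(action)
            reward_sum += reward
            terminal = terminated or truncated
            if frame_index >= 2:
                self.env.unwrapped.ale.getScreenGrayscale(self.screen_buffer[frame_index - 2])
            if terminal:
                break
        pooled = np.maximum(self.screen_buffer[0], self.screen_buffer[1])
        self.frames = np.roll(self.frames, -1, axis=-1)
        self.frames[..., -1] = self._resize(pooled)
        return self.frames, reward_sum, terminal
```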

II. Network

a. initializer: DQN: nn.initializers.variance_scaling(scale=1.0, mode="fan_avg", distribution="truncated_normal") | IQN: nn.initializers.variance_scaling(scale=1.0 / jnp.sqrt(3.0), mode="fan_in", distribution="uniform").

b. Divide the state by 255.

c. Torso: Conv2D features=32, kernel_size=(8, 8), strides=(4, 4) | Conv2D features=64, kernel_size=(4, 4), strides=(2, 2) | Conv2D features=64, kernel_size=(3, 3), strides=(1, 1).

d. Head: Dense features=512 | Dense features=n_actions (the full network is sketched after this section).

e. batch_size: 32.
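
A minimal flax sketch of II.a-d for the DQN case. The module name is illustrative, padding is left at flax's default, and the network is written for a single unbatched (84, 84, 4) state.

```python
# Sketch of the DQN network in II, assuming flax.linen and jax.
import flax.linen as nn
import jax.numpy as jnp


class AtariDQNNetwork(nn.Module):
    n_actions: int

    @nn.compact
    def __call__(self, state: jnp.ndarray) -> jnp.ndarray:
        # II.a: DQN initializer.
        initializer = nn.initializers.variance_scaling(
            scale=1.0, mode="fan_avg", distribution="truncated_normal"
        )
        # II.b: divide the state by 255.
        x = state.astype(jnp.float32) / 255.0
        # II.c: torso.
        x = nn.relu(nn.Conv(32, kernel_size=(8, 8), strides=(4, 4), kernel_init=initializer)(x))
        x = nn.relu(nn.Conv(64, kernel_size=(4, 4), strides=(2, 2), kernel_init=initializer)(x))
        x = nn.relu(nn.Conv(64, kernel_size=(3, 3), strides=(1, 1), kernel_init=initializer)(x))
        x = x.reshape(-1)  # flatten, assuming a single (84, 84, 4) state
        # II.d: head.
        x = nn.relu(nn.Dense(512, kernel_init=initializer)(x))
        return nn.Dense(self.n_actions, kernel_init=initializer)(x)
```

To process the batch of 32 states sampled from the replay buffer (II.e), apply the module with jax.vmap over the batch axis.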

III. Replay buffer

a. You can use the one from Dopamine. max_size = 1 000 000 samples, initial_size = 20 000 samples.

b. Make sure to clip the reward between -1 and 1 (see the sketch below).
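
A sketch of III with the reward clipped at insertion time. The constructor and add arguments follow my reading of dopamine.replay_memory.circular_replay_buffer and may differ slightly between Dopamine versions; treat them as assumptions.

```python
# Sketch of III; argument names are assumptions and may vary across Dopamine versions.
import numpy as np
from dopamine.replay_memory import circular_replay_buffer

replay_buffer = circular_replay_buffer.OutOfGraphReplayBuffer(
    observation_shape=(84, 84),
    stack_size=4,
    replay_capacity=1_000_000,  # III.a: max_size
    batch_size=32,
)

def store_transition(observation, action, reward, terminal):
    # III.b: clip the reward to [-1, 1] before it enters the buffer.
    replay_buffer.add(observation, action, np.clip(reward, -1.0, 1.0), terminal)
```

Sampling should only start once the buffer holds the 20 000 initial samples.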

IV. Metric

a. Interquartile mean (IQM) over the seeds/games. The metric is the average, over one training epoch, of the undiscounted sum of rewards over one trajectory (a sketch of the IQM is given below).
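
A minimal numpy sketch of the interquartile mean over a flat array of per-seed/per-game scores; the rliable library offers a more complete implementation with bootstrap confidence intervals.

```python
# Interquartile mean (IQM) sketch, assuming numpy.
import numpy as np

def interquartile_mean(scores: np.ndarray) -> float:
    """Mean of the scores left after discarding the bottom and top 25%."""
    sorted_scores = np.sort(scores.flatten())
    n_trimmed = len(sorted_scores) // 4
    return float(np.mean(sorted_scores[n_trimmed : len(sorted_scores) - n_trimmed]))
```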

V. Training

a. Composed of 200 epochs. An epoch is composed of 250 000 training steps, or more if the current episode is not done when reaching 250 000 training steps. A training step corresponds to one interaction with the environment. The training starts after 20 000 environment steps have been collected.

b. Every 4 training steps, a gradient step is performed. The target network is updated to the online network every 8 000 training steps.

c. Exploration: starting_epsilon = 1, ending_epsilon = 0.01, duration_epsilon = 250 000 training steps, horizon (maximum episode length) = 27 000 training steps (the decay schedule is sketched after this section).

d. Gamma = 0.99.

e. Optimizer: Adam with learning_rate = 6.25 × 10⁻⁵ and epsilon = 1.5 × 10⁻⁴.
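
A sketch of the schedule and optimizer settings in V, assuming optax and jax.numpy; the constant and function names are illustrative, not Dopamine's exact API.

```python
# Sketch of V.b-e, assuming optax and jax.numpy; names are illustrative.
import jax.numpy as jnp
import optax

GAMMA = 0.99                  # V.d
INITIAL_COLLECT = 20_000      # V.a: steps collected before training starts
UPDATE_PERIOD = 4             # V.b: one gradient step every 4 training steps
TARGET_UPDATE_PERIOD = 8_000  # V.b: target network sync period

def epsilon(training_step: int) -> jnp.ndarray:
    # V.c: linear decay from 1.0 to 0.01 over 250 000 training steps,
    # starting once the 20 000 initial samples have been collected.
    fraction = jnp.clip((training_step - INITIAL_COLLECT) / 250_000, 0.0, 1.0)
    return 1.0 + fraction * (0.01 - 1.0)

# V.e: Adam with the Atari hyperparameters above.
optimizer = optax.adam(learning_rate=6.25e-5, eps=1.5e-4)
```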
