reset: Stack frames. The first three are zeros; the last one is the first image.
step: Generate frames for the internal environment. Sum up the reward. Stop if a terminal state is reached. For the last frames, take the grayscale version, then apply a max pooling, and then resize the image to by with cv2.INTER_AREA interpolation.
Tip: use env.ale.getScreenGrayscale to get the frame from the ALE. It is faster.
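A minimal sketch of such a reset/step wrapper, assuming a Gymnasium-style 5-tuple `step` API. The class name `AtariPreprocessing` and the frame-skip, stack size, and 84x84 resolution below are assumptions (the usual Atari values), since the actual numbers are not listed above.

```python
import collections

import cv2
import numpy as np


class AtariPreprocessing:
    """Sketch of the reset/step preprocessing described above (assumed values)."""

    def __init__(self, env, frame_skip=4, stack_size=4, screen_size=84):
        self.env = env
        self.frame_skip = frame_skip      # assumed number of internal frames per step
        self.screen_size = screen_size    # assumed output resolution
        self.frames = collections.deque(maxlen=stack_size)

    def _screen(self):
        # Grab the grayscale screen directly from the ALE (faster than rendering).
        return self.env.ale.getScreenGrayscale()

    def _resize(self, frame):
        return cv2.resize(frame, (self.screen_size, self.screen_size),
                          interpolation=cv2.INTER_AREA)

    def reset(self):
        self.env.reset()
        first = self._resize(self._screen())
        # The first (stack_size - 1) frames are zeros, the last one is the first image.
        for _ in range(self.frames.maxlen - 1):
            self.frames.append(np.zeros_like(first))
        self.frames.append(first)
        return np.stack(self.frames, axis=-1)

    def step(self, action):
        total_reward, terminal = 0.0, False
        raw_frames = []
        for _ in range(self.frame_skip):
            _, reward, terminal, truncated, info = self.env.step(action)
            total_reward += reward
            raw_frames.append(self._screen())
            if terminal or truncated:
                break
        # Max-pool over the last raw frames, then resize the pooled image.
        pooled = np.max(np.stack(raw_frames[-2:], axis=0), axis=0)
        self.frames.append(self._resize(pooled))
        return np.stack(self.frames, axis=-1), total_reward, terminal, info
```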
Torso: Conv2D features = , kernel_size=(, ), strides=(, ) | Conv2D features = , kernel_size=(, ), strides=(, ) | Conv2D features = , kernel_size=(, ), strides=(, ).
Head: Dense features = | Dense features=n_actions.
batch_size: .
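A sketch of the torso/head network in Flax (linen). The feature counts, kernel sizes, strides, and hidden width below are the classic Nature-DQN values, used here only as placeholders since the actual numbers are not given above; inputs are assumed to be batched uint8 frame stacks of shape (batch, height, width, stack).

```python
import flax.linen as nn
import jax
import jax.numpy as jnp


class DQNNetwork(nn.Module):
    """Conv torso followed by a dense head (placeholder hyperparameters)."""
    n_actions: int

    @nn.compact
    def __call__(self, x):
        x = x.astype(jnp.float32) / 255.0  # normalise pixel values
        # Torso: three Conv2D layers (Nature-DQN sizes, assumed).
        x = nn.relu(nn.Conv(features=32, kernel_size=(8, 8), strides=(4, 4))(x))
        x = nn.relu(nn.Conv(features=64, kernel_size=(4, 4), strides=(2, 2))(x))
        x = nn.relu(nn.Conv(features=64, kernel_size=(3, 3), strides=(1, 1))(x))
        x = x.reshape((x.shape[0], -1))    # flatten, keeping the batch dimension
        # Head: one hidden Dense layer, then one output per action.
        x = nn.relu(nn.Dense(features=512)(x))
        return nn.Dense(features=self.n_actions)(x)


# Usage example with a dummy observation batch.
net = DQNNetwork(n_actions=6)
dummy = jnp.zeros((1, 84, 84, 4), jnp.uint8)
params = net.init(jax.random.PRNGKey(0), dummy)
q_values = net.apply(params, dummy)  # shape (1, n_actions)
```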
III. Replay buffer
You can use the one from Dopamine. max_size = samples, initial_size = samples.
Make sure to clip the reward between -1 and 1.
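A hedged sketch of using Dopamine's out-of-graph replay buffer with clipped rewards. The capacity, batch size, observation shape, and stack size below are placeholders (the actual max_size/initial_size are not given above), and the exact module path may vary between Dopamine versions.

```python
import numpy as np
from dopamine.replay_memory import circular_replay_buffer

# Placeholder sizes: the actual max_size / initial_size are not given above.
replay = circular_replay_buffer.OutOfGraphReplayBuffer(
    observation_shape=(84, 84),
    stack_size=4,
    replay_capacity=1_000_000,
    batch_size=32,
)


def store_transition(observation, action, reward, terminal):
    # Clip the reward to [-1, 1] before storing the transition.
    replay.add(observation, action, np.clip(reward, -1.0, 1.0), terminal)
```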
IV. Metric
Interquartile mean over the seeds/games. The metric is the undiscounted sum of rewards over one trajectory, averaged over one training epoch.
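A small NumPy sketch of the interquartile mean (the mean of the middle 50% of scores), aggregated over the pooled seed/game returns; the example values are made up.

```python
import numpy as np


def interquartile_mean(scores):
    """Mean of the scores between the 25th and 75th percentiles."""
    scores = np.sort(np.asarray(scores).ravel())
    n = len(scores)
    # Drop the bottom and top quarters, keep the middle half.
    return scores[n // 4: n - n // 4].mean()


# Example: per-seed/per-game returns averaged over one training epoch (dummy values).
print(interquartile_mean([12.0, 250.0, 300.0, 310.0, 320.0, 330.0, 340.0, 9000.0]))
```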
V. Training
The training is composed of epochs. An epoch is composed of training steps, or more if the episode is not done when reaching that number of training steps. A training step corresponds to one interaction with the environment. The training starts after environment steps have been collected.
Every training steps, a gradient step is performed. The target network is updated to match the online network every training steps.
Exploration: starting_epsilon = , ending_epsilon = , duration_epsilon = training steps, horizon = training steps.
Gamma = .
Optimizer: Adam with learning_rate = and epsilon = .
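A sketch of the exploration schedule and optimizer using Optax. All numeric values below are placeholders labelled as assumptions, since the actual epsilons, duration, learning rate, and Adam epsilon are not given above; the parameter tree is a dummy stand-in for the real network parameters.

```python
import jax.numpy as jnp
import optax

# Linear epsilon decay (assumed placeholder values).
epsilon_schedule = optax.linear_schedule(
    init_value=1.0,            # starting_epsilon (assumed)
    end_value=0.01,            # ending_epsilon (assumed)
    transition_steps=250_000,  # duration_epsilon in training steps (assumed)
)
print(epsilon_schedule(0), epsilon_schedule(125_000), epsilon_schedule(250_000))

# Adam optimizer (assumed placeholder learning_rate and epsilon).
optimizer = optax.adam(learning_rate=6.25e-5, eps=1.5e-4)

# Example update with a dummy parameter tree standing in for the network parameters.
params = {"w": jnp.zeros((4, 2)), "b": jnp.zeros((2,))}
opt_state = optimizer.init(params)
grads = {"w": jnp.ones((4, 2)), "b": jnp.ones((2,))}
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```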