The optimization algorithm selects the parameters to maximize the rewards the agent receives. As with many RL tasks, we seek to assign rewards for behaviors that we think lead to the goal (winning the game), without over-crafting the reward to our own expectations. In this appendix we outline our current reward structure.
Each hero's score is a linear combination of separate signals from the game, and the agent's reward is the increase in its score from one tick to the next (before the averaging and reweighting described below). These signals and weights were designed by our local Dota experts at the start of the project, and have only been tweaked a handful of times since.
Most of the signals are "individual" ones related to a single hero (a minimal sketch of this score-to-reward bookkeeping follows the table):
Individual | Weight | Awarded for |
---|---|---|
Experience | 0.002 | Per unit of experience. |
Gold | 0.006 | Per unit of gold gained[1]. |
Mana | 0.75 | Mana (fraction of total). |
Hero Health | 2.0 | Gaining (or losing) health[2]. |
Last Hit | 0.16 | Last Hitting an enemy creep[3]. |
Deny | 0.2 | Last Hitting an allied creep[3]. |
Kill | -0.6 | Killing an enemy hero[3]. |
Death | -1.0 | Dying. |
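As a concrete illustration, here is a minimal sketch of how such a score and per-tick reward could be computed. The weights mirror the table above; the `HeroSignals` container, its field names, and the helper functions are hypothetical stand-ins for the game-state accessors, not the actual training code.

```python
from dataclasses import dataclass

# Weights from the "Individual" table above.
INDIVIDUAL_WEIGHTS = {
    "experience": 0.002,
    "gold": 0.006,
    "mana_fraction": 0.75,
    "health_score": 2.0,   # quartic-interpolated health, see footnote [2]
    "last_hits": 0.16,
    "denies": 0.2,
    "kills": -0.6,
    "deaths": -1.0,
}

@dataclass
class HeroSignals:
    """Raw per-hero signals read from the game at one tick (hypothetical container)."""
    experience: float = 0.0
    gold: float = 0.0
    mana_fraction: float = 0.0
    health_score: float = 0.0
    last_hits: int = 0
    denies: int = 0
    kills: int = 0
    deaths: int = 0

def individual_score(signals: HeroSignals) -> float:
    """Linear combination of the raw signals with the weights above."""
    return sum(w * getattr(signals, name) for name, w in INDIVIDUAL_WEIGHTS.items())

def tick_reward(prev: HeroSignals, curr: HeroSignals) -> float:
    """The reward at a tick is the increase in score since the previous tick."""
    return individual_score(curr) - individual_score(prev)
```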
For each important building class, all heroes on the team receive a fixed score while the building is alive, plus a bonus score linear in the building's remaining health (a helper implementing this is sketched after the table):
Score for live building = Weight * (1 + 2 * Health Fraction).
Buildings | Weight |
---|---|
Shrine | 0.75 |
Tower (T1) | 0.75 |
Tower (T2) | 1.0 |
Tower (T3) | 1.5 |
Tower (T4) | 0.75 |
Barracks | 2.0 |
Ancient[4] | 2.5 |
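A small helper capturing the formula above might look like the following; the dictionary keys and the health-fraction representation are assumptions for illustration.

```python
# Weights from the "Buildings" table above.
BUILDING_WEIGHTS = {
    "shrine": 0.75,
    "tower_t1": 0.75,
    "tower_t2": 1.0,
    "tower_t3": 1.5,
    "tower_t4": 0.75,
    "barracks": 2.0,
    "ancient": 2.5,
}

def building_score(buildings: dict) -> float:
    """Team-wide score from buildings.

    `buildings` maps a building class to its health fraction in [0, 1],
    or to None once the building is destroyed (an assumed representation).
    Each live building contributes Weight * (1 + 2 * Health Fraction).
    """
    score = 0.0
    for kind, health_fraction in buildings.items():
        if health_fraction is None:   # dead buildings contribute nothing
            continue
        score += BUILDING_WEIGHTS[kind] * (1.0 + 2.0 * health_fraction)
    return score
```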
The whole team also receives extra rewards for key objectives near the end of the game:
Extra Team | Weight | Awarded for |
---|---|---|
Megas | 4.0 | Killing the last enemy barracks (unlocking mega creeps).
Win | 2.5[4] | Winning the game. |
In addition to the above reward signals, our agent receives a special reward to encourage exploration called "lane assignments." During training, we assign each hero a subset of the three lanes in the game. The model observes this assignment, and receives a negative reward (-0.02) if it leaves the designated lanes early in the game. This forces the model to see a variety of different lane assignments. During evaluation, we set all heroes' lane assignments to allow them to be in all lanes.
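The sketch below shows one way such a penalty could be wired up. The lane-sampling scheme and the early-game cutoff are our assumptions; the write-up only specifies the -0.02 penalty and that it applies early in the game.

```python
import random

LANES = ("top", "mid", "bot")
LANE_PENALTY = -0.02

def sample_lane_assignment() -> frozenset:
    """Assign a hero a random non-empty subset of the three lanes (assumed sampler)."""
    k = random.randint(1, len(LANES))
    return frozenset(random.sample(LANES, k))

def lane_penalty(current_lane: str, assignment: frozenset,
                 game_time_minutes: float, early_game_cutoff_minutes: float) -> float:
    """Negative reward while the hero is outside its assigned lanes early in the game."""
    outside_assignment = current_lane not in assignment
    if game_time_minutes < early_game_cutoff_minutes and outside_assignment:
        return LANE_PENALTY
    return 0.0
```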
The hero's individual rewards are further processed to account for the competitive and cooperative aspects of the game in three ways:
Each team's mean reward is subtracted from the rewards of the enemy team:
hero_rewards[i] -= mean(enemy_rewards)
This ensures that the sum of all ten rewards is zero, thereby preventing the two teams from finding positive-sum situations. It also ensures that whenever we assign reward for a signal like "killing a tower," the agent automatically receives a comparable signal for the reverse: "defending a tower that would have died."
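Written out for both teams, this step might look like the sketch below; the array-per-team bookkeeping is our choice of representation.

```python
import numpy as np

def zero_sum(radiant_rewards: np.ndarray, dire_rewards: np.ndarray):
    """Subtract each team's mean reward from every hero on the enemy team."""
    radiant_out = radiant_rewards - dire_rewards.mean()
    dire_out = dire_rewards - radiant_rewards.mean()
    # With five heroes per team, the ten adjusted rewards sum to zero.
    return radiant_out, dire_out
```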
At the start of training we want to reward agents for their own actions, so they can more easily learn the correlation between their actions and their rewards. Later on, however, we want them to take their teammates' situations into account rather than greedily optimizing their own reward. For this reason we average the reward across the team's heroes using a hyperparameter τ called "team spirit":
hero_rewards[i] = τ * mean(hero_rewards) + (1 - τ) * hero_rewards[i]
We anneal τ from 0.2 at the start of training to 0.97 at the end of our current experiment.
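In code, the team-spirit blend is a single mixing step. The linear annealing schedule below is an illustrative assumption, since the write-up only gives the endpoints 0.2 and 0.97.

```python
def apply_team_spirit(team_rewards, tau):
    """Blend each hero's reward with the team mean, weighted by team spirit tau."""
    team_mean = sum(team_rewards) / len(team_rewards)
    return [tau * team_mean + (1.0 - tau) * r for r in team_rewards]

def annealed_tau(training_progress):
    """Assumed linear schedule from tau=0.2 (start) to tau=0.97 (end of training)."""
    p = min(max(training_progress, 0.0), 1.0)
    return 0.2 + (0.97 - 0.2) * p
```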
The majority of the reward mass in our agent's rollout experience comes from the later part of the game. This is due to a variety of factors: the late-game portion is longer, the units have more abilities and thus get more kills and gold, and so on. However, the early game can be very important; if the agent plays badly at the start, it can be hard to recover. We want our training scheme to place sufficient importance on the early part of the game, so we scale up rewards early in the game and scale down rewards late in the game by multiplying all rewards by a time-dependent factor:
hero_rewards[i] *= 0.6 ** (T / 10 minutes), where T is the current game time.
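Composing the three steps gives a sketch like the one below. The function name and the ordering of the steps (which mirrors the order of presentation here) are our assumptions.

```python
import numpy as np

def postprocess_rewards(radiant, dire, tau, game_time_minutes):
    """Apply the zero-sum, team-spirit, and time-decay steps to raw per-hero rewards."""
    radiant = np.asarray(radiant, dtype=float)
    dire = np.asarray(dire, dtype=float)
    # 1. Zero-sum: subtract the enemy team's mean reward from each hero.
    radiant, dire = radiant - dire.mean(), dire - radiant.mean()
    # 2. Team spirit: blend each hero's reward with its team's mean.
    radiant = tau * radiant.mean() + (1.0 - tau) * radiant
    dire = tau * dire.mean() + (1.0 - tau) * dire
    # 3. Time decay: multiply by 0.6 ** (T / 10 minutes).
    decay = 0.6 ** (game_time_minutes / 10.0)
    return decay * radiant, decay * dire
```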
[1]: The agent receives reward when it gains gold but does not lose reward when it loses gold (e.g. by buying an item.)
[2]: Hero health is scored according to a quartic interpolation between 0 (dead) and 1 (full health), which places heavier weight on health changes when the hero is close to dying.
[3]: This score supplements the score for the gold and experience gained. The explicit "Kill" score is negative to reduce the reward the agent receives for a kill, but the total reward for a kill (including the gold and experience) is still positive.
[4]: The goal of Dota is to destroy the enemy ancient, so when this building dies the game ends. The total reward for winning is thus 10.0 (2.5 for the ancient going from alive to dead, 5.0 for the ancient losing its health, and 2.5 bonus).