The algorithm depicted was programmed in Inkling, a meta-level programming language developed by Bons.ai (https://bons.ai/). The following program, when compiled to neural networks, solved the environment.
simulator lunarlander_simulator (LunarLanderConfig)
    send schema (GameState)
end

schema GameState
    Float32 x_position,
    Float32 y_position,
    Float32 x_velocity,
    Float32 y_velocity,
    Float32 angle,
    Float32 rotation,
    Float32 left_leg,
    Float32 right_leg
end

schema LanderAction
    Int8{0, 1, 2, 3} action
end

schema LunarLanderConfig
    Int8 episode_length,
    Int8 num_episodes,
    Int8 deque_size
end

concept stay_stable is classifier
    predicts (LanderAction)
    follows input(GameState)
end

concept go_center is classifier
    predicts (LanderAction)
    follows input(GameState)
end

concept land is classifier
    predicts (LanderAction)
    follows stay_stable, go_center, input(GameState)
    feeds output
end

curriculum stable_curriculum
    train stay_stable
    with simulator lunarlander_simulator
    objective stable_objective

    lesson stable
        configure
            constrain episode_length with Int8{-1},
            constrain num_episodes with Int8{-1},
            constrain deque_size with UInt8{1}
        until
            maximize stable_objective
end

curriculum center_curriculum
    train go_center
    with simulator lunarlander_simulator
    objective center_objective

    lesson center
        configure
            constrain episode_length with Int8{-1},
            constrain num_episodes with Int8{-1},
            constrain deque_size with UInt8{1}
        until
            maximize center_objective
end

curriculum landing_curriculum
    train land
    with simulator lunarlander_simulator
    objective landing_objective

    lesson landing
        configure
            constrain episode_length with Int8{-1},
            constrain num_episodes with Int8{-1},
            constrain deque_size with UInt8{1}
        until
            maximize landing_objective
end
Finally, besides the standard reward function, we used center_objective and stable_objective, which gave Gaussian rewards (standard deviation 0.09) for being at position x = 0 and at angle Theta = 0, respectively.
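The exact functional form of these shaping rewards is not given in the text; a minimal sketch, assuming an unnormalized Gaussian of the relevant state variable (peaking at 1 when the variable is 0, with the quoted standard deviation of 0.09), is:

```python
import math

SIGMA = 0.09  # standard deviation quoted in the text

def gaussian_reward(value, sigma=SIGMA):
    # Unnormalized Gaussian: equals 1 at value == 0 and decays with |value|.
    # Normalization and any clipping are assumptions; only sigma is given.
    return math.exp(-(value ** 2) / (2.0 * sigma ** 2))

def stable_objective(state):
    # Rewards keeping the lander upright (angle near 0).
    return gaussian_reward(state["angle"])

def center_objective(state):
    # Rewards staying over the pad center (x position near 0).
    return gaussian_reward(state["x_position"])
```

With this form, a deviation of one standard deviation (e.g. angle = 0.09) yields a reward of exp(-1/2), so the shaping signal falls off quickly outside a narrow band around the target.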