@Santara
Created January 16, 2020 12:53

Experiment 1

Motivation

Train a single agent to drive three cars in Alpine

Details

  • Action space: steering-acceleration-brake
  • Car models:
    • car1-stock1
    • acura-nsx-sz
    • car1-trb1
  • Track: Alpine
  • Neural network architecture: PPO default
  • Training algorithm: PPO
  • max_steps: 1000
  • noisy observations
  • noisy actions

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-16_14-22-20bsqfy716/checkpoint_221/checkpoint-221

Issues

  • max_steps was set to 1000
  • track length and width were wrong

Experiment 2

Motivation

Train a single agent to drive three cars in Alpine

Details

  • Action space: steering-acceleration-brake
  • Car models:
    • car1-stock1
    • acura-nsx-sz
    • car1-trb1
  • Track: Alpine
  • Neural network architecture: PPO default
  • Training algorithm: PPO
  • max_steps: 25000
  • noisy observations
  • noisy actions

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-16_16-00-21vj0or2ia/checkpoint_1671/checkpoint-1671

Experiment 3

Motivation

Train a single agent to drive three cars in Alpine

Details

  • Action space: steering-acceleration-brake
  • Car models:
    • car1-stock1
    • car1-stock2
    • acura-nsx-sz
    • car1-trb1
    • p406
    • 155-DTM
  • Track: Alpine
  • Neural network architecture: PPO default
  • Training algorithm: PPO
  • max_steps: 25000

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-16_20-31-43ilasyeke/checkpoint_51/checkpoint-51

Fate

Experiment failed due to NaN reward.

  • Added a filter for NaN rewards in reward_manager
  • Placed trainer.train() in a try-except block to catch errors and save the checkpoint (see the sketch below)
  • Added "ignore_worker_failures" to the training config to see what happens when training continues even after worker failures.

New Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-17_04-51-45zqcjhn23/

/home/anirban/ray_results/PPO_madras_env_2019-12-17_04-51-45zqcjhn23/checkpoint_842/checkpoint-842

Observations

  • p406 has immense body roll
  • Cars are still wobbly (some more than others)
  • All 6 cars can finish the lap in Alpine-1, but they do not generalize well to Spring
  • The policy does not generalize to buggy

Experiment 4

Motivation

Single car all roads

Details

  • Action space: steering-acceleration-brake
  • Car models:
    • car1-stock1
  • Roads:
    • aalborg
    • alpine-1
    • alpine-2
    • brondehach
    • g-track-1
    • g-track-2
    • g-track-3
    • corkscrew
    • eroad
    • e-track-2
    • e-track-3
    • e-track-4
    • e-track-6
    • forza
    • ole-road-1
    • ruudskogen
    • street-1
    • wheel-1
    • wheel-2
    • spring

P.S. "e-track-1" gives Segmentation fault from torcs: no track observations after about 1000 steps

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-17_12-26-490nxx9rdv/checkpoint_48/checkpoint-48
/home/anirban/ray_results/PPO_madras_env_2019-12-17_14-24-06q9xr0gg9/checkpoint_78/checkpoint-78
/home/anirban/ray_results/PPO_madras_env_2019-12-17_15-46-019zdm3e92/checkpoint_209/checkpoint-209
/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-11-22a2tpiv5s/checkpoint_344/checkpoint-344
/home/anirban/ray_results/PPO_madras_env_2019-12-17_17-16-151sfd7wsu/checkpoint_1277/checkpoint-1277

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_12-26-490nxx9rdv/checkpoint_48/checkpoint-48")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_14-24-06q9xr0gg9/checkpoint_78/checkpoint-78")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_15-46-019zdm3e92/checkpoint_209/checkpoint-209")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-11-22a2tpiv5s/checkpoint_344/checkpoint-344")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-35-32a8nef0xr/checkpoint_372/checkpoint-372")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-46-52xbot_yvb/checkpoint_407/checkpoint-407")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-58-32oxheh0v5/checkpoint_492/checkpoint-492")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_17-11-43r7e5myyo/checkpoint_500/checkpoint-500")

Removing aalborg

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_17-16-151sfd7wsu/checkpoint_1277/checkpoint-1277")

Experiment 5

Motivation

A single car learning from scratch to drive among up to 9 parked traffic cars

Track

Alpine-1

Ego vehicle model

car1-stock1

Action space

steering-acceleration-brake

Experiment directory

/home/anirban/ray_results/PPO_madras_env_2019-12-17_20-51-25wwi1tpuj/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-17_20-51-25wwi1tpuj/checkpoint_641/checkpoint-641

Observations

  • Car learns to drive forward
  • Car sometimes also brakes to prevent collision with the car in front but collides most of the time
  • Car always chooses to take the left lane to overtake. It is a default first step behavior. This causes the agent to hit the car in front, especially when it initializes in the right lane.
  • A curriculum is necessary to teach the agent four things: learning to drive on an open track, learning to prevent front collision, learning to prevent rear-end collision, and learning to overtake

Experiment 6

Motivation

Training buggy to drive in Alpine-1

Edits to xml

The front left and right brake disk diameters were originally 250 mm, which was out of bounds because the rim diameter is 200 mm. To prevent this error, I reduced the disk diameters to 200 mm.

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-18_06-15-44v5_kk1jo/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-18_06-15-44v5_kk1jo/checkpoint_211/checkpoint-211

Observations

  • The buggy learns to run, but its top speed is only ~70 kmph although the target speed is 100 kmph
  • The buggy learns to drift at tight corners but sometimes turns backwards in the process
  • Policy does not generalize to spring. The agent drives off track right at the beginning. Clearly the agent had memorized Alpine-1
  • Policy does not generalize to other cars. This is because buggy has very different engine characteristics than other cars. Buggy has exceptionally high torque at low RPMs. This is what makes it so difficult to control. The only other car that it can even start is the baja-bug

Experiment 7

Motivation

Train one agent to drive both buggy and car1-stock1 in Alpine-1. Can an RL agent learn this with the SingleAgentSimpleLapObs observations, steer-accel-brake actions, and the following reward settings?

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 10.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 5.0 max_ang_accl: 2.0

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-18_11-23-20sj0j59xe/
/home/anirban/ray_results/PPO_madras_env_2019-12-18_21-45-191x6ogb6f/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-18_11-23-20sj0j59xe/checkpoint_691/checkpoint-691

Retraining from the above checkpoint: /home/anirban/ray_results/PPO_madras_env_2019-12-18_21-45-191x6ogb6f/checkpoint_1302/checkpoint-1302

Observations

  • The policy can start both the cars
  • Buggy is more stable and lives longer with this policy than a policy that is trained only on car1-stock1
  • car1-stock1 runs smoothly and completes the lap, but its lateral motion is not completely smooth
  • The lateral motion is particularly vigorous in case of buggy and this is what sets it off track
  • Training longer with more angular acceleration penalty might help
  • After training longer, car1-stock1 can complete the race most of the time but buggy drives very poorly
  • TODO(santara) Visualize and evaluate more checkpoints from /home/anirban/ray_results/PPO_madras_env_2019-12-18_21-45-191x6ogb6f/

Experiment 8

Motivation

Train one agent to drive both buggy and car1-stock1 in Alpine-1. Can an RL agent learn this with the SingleAgentSimpleLapObs observations, lanepos-speed actions, and the following reward settings?

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 10.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 5.0 max_ang_accl: 2.0

The conjecture is that when the engine responses of the two cars are drastically different, a low-level controller can handle the engine commands while a high-level controller issues a uniform set of desired values for both cars.

P.S. The PID parameters were actually tuned for car1-trb1. An improvement would be to tune the parameters for each of the cars.
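A minimal sketch of what such a low-level controller could look like: the policy outputs a desired lane position and speed, and PID loops convert them to steer/accel/brake. The gains, observation keys, and clipping are placeholders for illustration, not the actual MADRaS PID implementation.

```python
class SimplePID:
    """Textbook PID loop; gains below are placeholders, not MADRaS' tuned values."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt=0.02):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


steer_pid = SimplePID(kp=5.1, ki=0.001, kd=1e-6)
speed_pid = SimplePID(kp=10.5, ki=0.05, kd=2.8)


def low_level_control(desired_lane_pos, desired_speed, obs):
    """Map the 2-action command (lane position, speed) to steer/accel/brake."""
    steer = steer_pid.step(desired_lane_pos - obs["trackPos"])
    throttle = speed_pid.step(desired_speed - obs["speedX"])
    accel = max(0.0, min(1.0, throttle))   # positive throttle -> accelerator
    brake = max(0.0, min(1.0, -throttle))  # negative throttle -> brake
    return steer, accel, brake
```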

PID latency

5

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_

After resuming from: /home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_/checkpoint_1031/checkpoint-1031 :

/home/anirban/ray_results/PPO_madras_env_2019-12-20_15-01-31g5dbqhfr

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_/checkpoint_1031/checkpoint-1031
/home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_/checkpoint_411/checkpoint-411

Observations

  • VOILA! This agent is so good that it can drive almost every car in TORCS!
    • It can drift Buggy and Baja-Bug! Seriously!
  • Average speed is different for different cars
  • Acura-NSX-SZ and car1-stock2 cannot drift well with the current PID controller, so the agent learns to stop them and take sharp turns slowly
  • We should study the engine response of p406 cars. They behave very differently from the other agents
  • kc-daytona, kc-2000gt, and kc-giulietta drift well at low speeds but go out of control while drifting at speeds ~100kmph
  • The straight stretch of Alpine-1 is muddy and the wheels get stuck. The agent actually learns to stop, downshift and gather the torque necessary to overcome the mud - a behavior that was never observed in the 3-action space agents
  • Agent chooses different speed ranges for different cars - how does it know???
  • Agent finds it difficult to start some of the newer cars off track

Experiment 9

Motivation

2-action space, 4 traffic agents

PID latency

5

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-20_17-53-147rou1mfy/

Restoring from /home/anirban/ray_results/PPO_madras_env_2019-12-20_17-53-147rou1mfy/checkpoint_21/checkpoint-21:

/home/anirban/ray_results/PPO_madras_env_2019-12-20_19-40-30jw5hac_0/

Checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-20_17-53-147rou1mfy/checkpoint_21/checkpoint-21
/home/anirban/ray_results/PPO_madras_env_2019-12-20_19-40-30jw5hac_0/checkpoint_72/checkpoint-72
/home/anirban/ray_results/PPO_madras_env_2019-12-20_21-52-32s6vbfj4u/checkpoint_213/checkpoint-213

Observations

Experiment 10

Motivation

To drive buggy on Alpine with lanepos-vel control

PID latency

5

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-23_15-10-29v4er474i/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-23_15-10-29v4er474i/checkpoint_201/checkpoint-201

Observation

Pretty good performance, but it needs some more training. Sometimes goes off track at the sharpest bend.

Experiment 11

Motivation

To drive car1-stock1 in Alpine with lanepos-vel control

PID latency

5

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-23_21-09-08utsi2qkt/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-23_21-09-08utsi2qkt/checkpoint_1041/checkpoint-1041

Observations

  • The maneuvers of the car are not smooth:
    • Pointless lane switches at the beginning
    • Car struggles to accelerate off the start line and navigate through the straight muddy stretch later in the track
    • Corners are not smooth drifts. The car slows down, halts and goes. It even goes off track while going around the tight corners.
    • Agent does not learn to take tight corners to reduce distance travelled, unlike the one trained on steer-accel-brake

Experiment 12

Motivation

Train an agent to drive car1-stock1 in Alpine-1 using 3-action space

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/checkpoint_701/checkpoint-701

Experiment 19

Motivation

Train a single agent to drive car1-stock1 on spring from scratch

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-25_10-34-225wglbdwh/
/home/anirban/ray_results/PPO_madras_env_2019-12-25_11-25-573rhqmu92/
/home/anirban/ray_results/PPO_madras_env_2019-12-29_23-49-20r7a3oaik/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-25_10-34-225wglbdwh/checkpoint_41/checkpoint-41
/home/anirban/ray_results/PPO_madras_env_2019-12-25_11-25-573rhqmu92/checkpoint_1882/checkpoint-1882
/home/anirban/ray_results/PPO_madras_env_2019-12-29_23-49-20r7a3oaik/checkpoint_3633/checkpoint-3633

Experiment 20

Curriculum learning for driving in Spring. Agent pre-trained in Alpine-1. Training resumed without decreasing learning rate from: /home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/checkpoint_701/checkpoint-701

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_1402/checkpoint-1402

Observations

  • /home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_1402/checkpoint-1402 is a bad checkpoint: the agent drives some distance and then stands still
  • Average speed low
  • Lots of sidewise motion

Better checkpoints: /home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_792/checkpoint-792, /home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_782/checkpoint-782, /home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_772/checkpoint-772

Possible improvements

  • Refer to earlier experiment
  • Increase jerk penalty
  • Decrease the learning rate

Experiment 21

Noting this in the paper

Motivation

Second attempt at curriculum learning for car1-stock1 in spring after starting from /home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/checkpoint_701/checkpoint-701 with lower learning rate of 1e-6

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-25_21-11-49kvt17rg2/
/home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-25_21-11-49kvt17rg2/checkpoint_1112/checkpoint-1112

/home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/checkpoint_2923/checkpoint-2923

Observation

Healthy training - train for some more time starting from /home/anirban/ray_results/PPO_madras_env_2019-12-25_21-11-49kvt17rg2/checkpoint_1112/checkpoint-1112

Good checkpoint, but it still can't handle the big L-shaped left turns: /home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/checkpoint_2923/checkpoint-2923

  • Very high speeds; coercing episode length to 40000 actually helped
  • The agent drives very smoothly in alpine-1 with barely any sidewise movement
  • But the agent fails to start off in corkscrew :(
    • The success of a policy is highly dependent on the initial state distribution.
    • The agent brakes hard when it senses a bend ahead.
      • Add temporal context to the observation; this should solve the problem (see the sketch after this list). The same problem occurs with /home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/checkpoint_2673/checkpoint-2673
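One possible way to add temporal context, as suggested in the last bullet, is to stack the last few observation vectors with a wrapper; this is only an assumed sketch, not what was actually implemented in MADRaS.

```python
from collections import deque

import gym
import numpy as np


class TemporalContextWrapper(gym.ObservationWrapper):
    """Stack the last `k` flat observations so the policy can see short-term
    history (e.g. whether it has already started slowing for a bend)."""

    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = deque(maxlen=k)
        low = np.tile(env.observation_space.low, k)
        high = np.tile(env.observation_space.high, k)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        for _ in range(self.k):
            self.frames.append(obs)
        return self.observation(obs)

    def observation(self, obs):
        self.frames.append(obs)
        return np.concatenate(self.frames).astype(np.float32)
```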

Remark: the agent must learn to slow down for the L-curve. Train the agent longer with larger max_steps.

TRAIN AN AGENT ON ALPINE-1 AND CORKSCREW FOR IT TO GENERALIZE TO SPRING. ALPINE HAS SHARP RIGHT TURNS BUT NO SHARP LEFT TURN. CORKSCREW HAS A SHARP LEFT TURN. AGENTS TRAINED ON ALPINE-1 USUALLY GO OFF TRACK IN SPRING AROUND 15000 STEPS WHEN IT ENCOUNTERS THE SHARP LEFT TURN. AN AGENT TRAINED ON BOTH ALPINE-1 AND CORKSCREW WILL BE ABLE TO LEARN TO GENERALIZE TO THE WHOLE OF SPRING.

Experiment 23

Motivation

Train an agent to drive car1-stock1 in alpine-1 and corkscrew

Present a 50-50 blend of the two tracks using random track assignment in MADRaS (see the sketch below).
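A sketch of how the 50-50 random track assignment could be done at episode reset; the sampling function and the point where MADRaS consumes the track name are assumptions for illustration.

```python
import random

# 50-50 blend of the two tracks; weights could be changed for other blends.
TRACK_BLEND = {"alpine-1": 0.5, "corkscrew": 0.5}

def sample_track(blend=TRACK_BLEND):
    tracks, weights = zip(*blend.items())
    return random.choices(tracks, weights=weights, k=1)[0]

# e.g. called inside the env's reset() before the TORCS race is (re)started:
# self.track_name = sample_track()
```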

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-26_08-05-124ejof36t/checkpoint_11/checkpoint-11
/home/anirban/ray_results/PPO_madras_env_2019-12-26_13-43-22mpipeiml/checkpoint_314/checkpoint-314

This is the best checkpoint so far, but it has some issues:

  • It can solve both alpine-1 and corkscrew about 10% of the time
  • It cannot solve Spring. It goes off track at the very first right hairpin bend.

/home/anirban/ray_results/PPO_madras_env_2019-12-26_14-35-37pzu8g3uf/checkpoint_745/checkpoint-745
/home/anirban/ray_results/PPO_madras_env_2019-12-26_16-03-36mg86dj5p/checkpoint_1606/checkpoint-1606

EVALUATION PENDING

Experiment 24

Motivation

Experiment #23 does not converge well. We are going to try a curriculum learning strategy in which we will finetune the agent trained on alpine-1 to generalize to corkscrew (train with a 50-50 blend of tracks) with a lower learning rate of 1e-5
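A hedged sketch of the finetuning setup this describes: rebuild the PPO trainer with the lower learning rate, restore the Alpine-1 checkpoint, and continue training on the 50-50 track blend. The checkpoint path placeholder and the rest of the config are assumptions.

```python
from ray.rllib.agents.ppo import PPOTrainer

ALPINE1_CHECKPOINT = "/path/to/alpine-1/checkpoint"  # placeholder, not the real path

# Only the learning rate is shown; all other settings are assumed to match the
# original Alpine-1 run, and "madras_env" is assumed to be registered elsewhere.
trainer = PPOTrainer(env="madras_env", config={"lr": 1e-5})
trainer.restore(ALPINE1_CHECKPOINT)

for _ in range(100):  # finetune on the 50-50 alpine-1/corkscrew blend
    trainer.train()
```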

Training checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-26_17-44-493m9bbtyy/checkpoint_1358/checkpoint-1358

Better checkpoints:

/home/anirban/ray_results/PPO_madras_env_2019-12-26_17-44-493m9bbtyy/checkpoint_1202/checkpoint-1202
/home/anirban/ray_results/PPO_madras_env_2019-12-26_17-44-493m9bbtyy/checkpoint_1192/checkpoint-1192

The agent still goes off track at the tight S bend in corkscrew.

Experiment 25

Motivation

An agent trained in Alpine-1 doesn't generalize well to corkscrew because it never learns to negotiate sharp L-shaped left turns. Training on a 50-50 blend of corkscrew and alpine-1 also does not generalize well. Initializing with alpine-1 and finetuning on a 50-50 blend of corkscrew and alpine-1 does not fare well either. The motivation of this experiment is to test how effective pre-training on corkscrew and finetuning on alpine-1 is.

Training directory

Training car1-stock1 on corkscrew from scratch:

/home/anirban/ray_results/PPO_madras_env_2019-12-27_12-04-23z77stse8/

corkscrew policy trained from scratch:

/home/anirban/ray_results/PPO_madras_env_2019-12-27_12-04-23z77stse8/checkpoint_561/checkpoint-561

Observations

WOW! This policy can actually complete Spring 43% of the time. This suggests that corkscrew is a closer MDP to Spring than Alpine-1 is.

==================================================
Car: car1-stock1
Track: spring
Average distance covered: 0.6523207230607263
Average speed: 90.70068030268457
Successful race completion rate: 0.43243243243243246
Num trajectories evaluated: 37
==================================================

Experiment 26

Motivation

For training on Spring, instead of starting from the Alpine-1 policy, start with the corkscrew-trained policy /home/anirban/ray_results/PPO_madras_env_2019-12-27_12-04-23z77stse8/checkpoint_561/checkpoint-561 and see if it can achieve a high race completion rate on Spring within 1000 steps of training with learning rate 5e-7.

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-27_18-01-52dbzhq7id/

Checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-27_18-01-52dbzhq7id/checkpoint_2502/checkpoint-2502

Experiment 27

Motivation

Teach an agent pre-trained on alpine-1 with the 2-action space to drive with at most 2 traffic cars.

First phase of training

1 or 2 traffic cars with 50-50 probability

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-28_10-03-21lqpv40b2/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-28_10-03-21lqpv40b2/checkpoint_21/checkpoint-21
/home/anirban/ray_results/PPO_madras_env_2019-12-28_11-11-08u06e4bkq/checkpoint_82/checkpoint-82

Observations

3 Traffic agents:

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

    ==================================================
    Successful race completion rate: 0.5844155844155844
    Num trajectories evaluated: 77
    ==================================================

Evaluating on 4 traffic agents:

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 55, high: 60; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

    ==================================================
    Successful race completion rate: 0.4067796610169492
    Num trajectories evaluated: 59
    ==================================================

    ==================================================
    Successful race completion rate: 0.3884297520661157
    Num trajectories evaluated: 121
    ==================================================

  • As the episode length is very short (mostly under 500 steps, average 200 steps), too many rollouts (~30-50) need to be done to fill a train batch of size 4000. Reduce the train_batch_size next time (see the config sketch after this list)
  • Increase collision penalty?
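A sketch of how the smaller train batch could be configured in RLlib; the values mirror the "second phase of training" settings noted below (train_batch_size 1000, lr 5e-6), and the remaining keys are assumed defaults.

```python
from ray.rllib.agents.ppo import PPOTrainer

config = {
    "train_batch_size": 1000,        # was 4000; episodes average ~200 steps
    "lr": 5e-6,
    "ignore_worker_failures": True,  # keep training through TORCS worker crashes
}
# "madras_env" is assumed to be registered with tune.register_env() elsewhere.
trainer = PPOTrainer(env="madras_env", config=config)
```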

Observations specific to the evaluation with 4 traffic agents:

  1. The agent learns to choose the left or the right lane as required in most of the episodes
  2. The agent does not slow down to prevent an imminent collision: could be due to high PID latency? Reduce PID latency and check.
  3. The agent sometimes fails to judge the gap between the parked car in front and the edge of the track and makes an unnecessary lane change which causes it to collide with the car in the front
  4. The agent does not speed up to prevent a collision at the rear end.
  5. The agent learns to evade a parked car in the right lane but fails to evade a parked car in the left lane most of the time. This is probably because the agent was trained with 1 or 2 traffic agents. TORCS starts assigning lanes from the right. When there is only one traffic agent, the learning agent is assigned the left lane. If the learning agent always runs forward in its (left) lane, the traffic agent blocks its way roughly 50% of the time, so the remaining 50% of the time the agent can achieve Rank 1 without having to overtake even once. When there are 2 traffic agents, the learning agent is assigned the right lane by TORCS. In that case, almost 100% of the time the learning agent has a traffic agent parked in front of it in the right lane, and it has to learn to dodge it by switching to the left lane. That's why the agent learns to evade collisions significantly better in the right lane than the left.

Second phase of training

Starting with /home/anirban/ray_results/PPO_madras_env_2019-12-28_11-11-08u06e4bkq/checkpoint_82/checkpoint-82

train_batch_size = 1000 (down 4x), lr = 5e-6 (down 10x), CollisionPenalty scale = 20

We train with 2 traffic agents at all times.

Training checkpoint for this part: /home/anirban/ray_results/PPO_madras_env_2019-12-28_15-12-41iu1gxm4t/checkpoint_433/checkpoint-433

Even better: /home/anirban/ray_results/PPO_madras_env_2019-12-28_15-12-41iu1gxm4t/checkpoint_373/checkpoint-373

Evaluating on the same 3 traffic agents as in the previous evaluation:

==================================================
    Successful race completion rate: 0.6666666666666666

    Num trajectories evaluated: 87

==================================================

Evaluating on the same 4 traffic agents as in the previous evaluation:

==================================================
    Successful race completion rate: 0.2717391304347826

    Num trajectories evaluated: 92

==================================================

Resuming from /home/anirban/ray_results/PPO_madras_env_2019-12-28_15-12-41iu1gxm4t/checkpoint_373/checkpoint-373 we train with 3 traffic agents

/home/anirban/ray_results/PPO_madras_env_2019-12-28_20-59-05ug3w7bfi/checkpoint_514/checkpoint-514

Evaluating on the same 3 traffic agents as in the previous evaluation:

==================================================
    Successful race completion rate: 0.7613636363636364

    Num trajectories evaluated: 88

==================================================

Evaluating on the same 4 traffic agents as in the previous evaluation:

==================================================
    Successful race completion rate: 0.22105263157894736

    Num trajectories evaluated: 95

==================================================

Resuming from the above with 4 traffic agents, no randomization: /home/anirban/ray_results/PPO_madras_env_2019-12-29_00-01-40us412r8m/checkpoint_975/checkpoint-975

The agent goes off track (OOT) to the right most of the time. The agent does not know how to slow down when the car is about to go off track.

We hypothesize that training on a 50-50 blend of 3 and 4 traffic agents starting from /home/anirban/ray_results/PPO_madras_env_2019-12-28_20-59-05ug3w7bfi/checkpoint_514/checkpoint-514 might help ameliorate this overfitted behavior of overtaking from the right:

/home/anirban/ray_results/PPO_madras_env_2019-12-29_08-21-48cadf4q4z/checkpoint_645/checkpoint-645
/home/anirban/ray_results/PPO_madras_env_2019-12-29_08-21-48cadf4q4z/checkpoint_575/checkpoint-575

The same problem persists. The agent prefers going off track to stopping to make room to overtake. The agent does not learn to slow down to avoid collision either. Collisions are very frequent with this checkpoint, both with 3 and 4 traffic agents.

Next steps: retrain from /home/anirban/ray_results/PPO_madras_env_2019-12-28_20-59-05ug3w7bfi/checkpoint_514/checkpoint-514 with reduced ProgressReward2 and AngAcclPenalty

Before:

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 20.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 8.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 10.0

Now:

  • ProgressReward2: scale: 0.25
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 20.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 1.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 10.0

/home/anirban/ray_results/PPO_madras_env_2019-12-29_11-25-38gbw6ppb7/
/home/anirban/ray_results/PPO_madras_env_2019-12-29_11-25-38gbw6ppb7/checkpoint_705/checkpoint-705

  • The agent does not stop to prevent going off track
  • The agent does not brake enough to respond to a braking agent in the front
  • We must give context to the agent

REMOVE AVERAGE SPEED REWARD BECAUSE IT ONLY APPLIES WHEN THE AGENT RUNS THE WHOLE LENGTH OF THE TRACK.

As a next step, try introducing a penalty for going off track.

Experiment 28

Motivation

Train an agent to drive with 1 to 3 traffic cars from scratch

Settings

lr: 5e-6, train_batch_size: 1000

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-30_15-06-5733u1nyhs/

Checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-30_15-06-5733u1nyhs/checkpoint_381/checkpoint-381

Observation

Evaluation results with 3 traffic cars:

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

    ==================================================
    Successful race completion rate: 0.5238095238095238
    Num trajectories evaluated: 105
    ==================================================

With 4 traffic agents:

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 55, high: 60; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

    ==================================================
    Successful race completion rate: 0.37894736842105264
    Num trajectories evaluated: 95
    ==================================================

  • The agent only learns to turn left and run straight, and it attempts to execute exactly the same behavior in every episode. This is likely because with 1 traffic agent (as in 33.33% of the episodes), the learning agent can exploit this policy and finish the race, which gives the policy extra leverage.
  • Per-overtake rewards must be decoupled from the race-over reward. Currently, CollisionPenalty and SuccessfulOvertakeReward have the same scale, so if the agent overtakes 2 traffic cars and collides with the third, it still gets a net positive reward (see the illustration after this list). Hence the agent does not care to learn to steer right in order to avoid collision in the left lane.
  • Needs more training to learn the multi-modal behavior.
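An illustrative calculation of the decoupling problem described above, assuming both the per-overtake reward and the collision penalty have scale 10:

```python
OVERTAKE_REWARD = 10.0     # assumed scale, equal to the collision penalty
COLLISION_PENALTY = 10.0

# Two overtakes followed by a collision with the third car:
net_reward = 2 * OVERTAKE_REWARD - COLLISION_PENALTY
print(net_reward)  # +10.0 -> the episode is still net positive despite the crash
```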

Starting from /home/anirban/ray_results/PPO_madras_env_2019-12-30_15-06-5733u1nyhs/checkpoint_381/checkpoint-381 with 3 traffic cars at all times:

/home/anirban/ray_results/PPO_madras_env_2019-12-30_20-37-12zhryck8b/checkpoint_1762/checkpoint-1762

Same three traffic agents as the previous evaluation:

==================================================
    Successful race completion rate: 0.5096153846153846

    Num trajectories evaluated: 104

==================================================

Same 4 traffic agents as the previous evaluation:

==================================================
    Successful race completion rate: 0.35789473684210527

    Num trajectories evaluated: 95

==================================================

Analysis of why an agent trained on Alpine-1 with 3-action space fails to take off in some race tracks

Track specific analysis

The agent brakes whenever there is a (slight) left turn in front or when it is closer to the right edge than the left. In Alpine-1 this happens while the car is already in motion, so the agent manages to stay in motion and reach a different state where it does not brake. However, on tracks where the slight left turn needs to be taken at zero or low speed, braking brings the agent to a halt; as a result, the agent gets stuck in the braking state and never comes out of it.

  • corkscrew: Agent sees a left turn at the start line and brakes
  • aalborg: Agent comes close to the right edge of the track at the start line and brakes
  • e-track-4: Agent comes closer (not very close though) to the right edge of the track at the start line and brakes
  • g-track-1: like corkscrew
  • g-track-2: like aalborg and e-track-4
  • g-track-3: right turn in front
  • forza: like aalborg and e-track-4
  • wheel-1: like aalborg and e-track-4
  • brondehach: like aalborg and e-track-4

Experiment 30

Motivation

This is a continuation of experiment 27 and 28 with modified learning settings:

  1. The reward structure has been changed to decouple the per-overtake reward from the race-completed reward.
  2. Collision with one traffic agent after overtaking some of the other traffic agents does not give a net positive reward under any circumstances.
  3. There are 2 or 3 traffic agents. A single traffic agent seems to be lenient toward the agent adopting a trivial fixed policy of turning left and going straight along the left edge of the track.

Reward function

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 15.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 5.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 5.0
  • RankOneReward: scale: 10.0

Training directory

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-03_21-37-58ozx2miou/checkpoint_691/checkpoint-691

Evaluating on 3 traffic agents:

==================================================
    Successful race completion rate: 0.5555555555555556

    Num trajectories evaluated: 117

==================================================
  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1
  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1
  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

Observations:

The agent learns to execute a fixed policy: turn right, take the edge of the right lane, and run straight. It completely ignores the opponent positions. We have to make the agent pay attention to the opponents vector and its own lane position (trackPos).

Experiment 31

Motivation

Repeat Experiment 30 without per-overtake rewards and with a lower angular acceleration penalty

Reward function

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 10.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 1.0, max_ang_accl: 2.0
  • RankOneReward: scale: 10.0

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-04_06-33-2893urk0wb/
/home/anirban/ray_results/PPO_madras_env_2020-01-04_07-01-29ixh5gtae/
/home/anirban/ray_results/PPO_madras_env_2020-01-04_09-44-16g205c9ez/
/home/anirban/ray_results/PPO_madras_env_2020-01-04_10-50-00wsgcfjst/

Evaluated checkpoints

  1. /home/anirban/ray_results/PPO_madras_env_2020-01-04_06-33-2893urk0wb/checkpoint_31/checkpoint-31
  2. /home/anirban/ray_results/PPO_madras_env_2020-01-04_07-01-29ixh5gtae/checkpoint_242/checkpoint-242
  3. /home/anirban/ray_results/PPO_madras_env_2020-01-04_09-44-16g205c9ez/checkpoint_303/checkpoint-303
  4. /home/anirban/ray_results/PPO_madras_env_2020-01-04_10-50-00wsgcfjst/checkpoint_674/checkpoint-674

Observation

  • Agent always takes left in the beginning
  • Agent does not learn to avoid collision in the left lane
  • Agent runs slowly in checkpoint #3 but speed increases a bit in checkpoint #4
  • It does slow down when there is a car in front but does not learn to steer right to avoid collision

Next action

  • I will let it train for longer
  • Consider increasing the collision penalty so much that the agent can never get a positive episode reward if it makes a single collision. Currently the agent is not shy about colliding.

Experiment 32

Motivation

Same as Expt 31 but with very high collision penalty

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-04_19-55-16xzfuziaz/checkpoint_291/checkpoint-291
/home/anirban/ray_results/PPO_madras_env_2020-01-04_22-56-37tmpr7d_m/checkpoint_722/checkpoint-722

Observations

  • Agent does not steer at all
  • The learning curve does not show any growth between update steps: 291 and 722
  • Agent just slows down a bit when there is a car in front but eventually collides
  • Agent shows no intention to overtake

Actionable

  • Agent shows no intention of overtaking - bring back the per-overtake reward
  • The 2-action space does not give the agent quick steering ability - try 3-action space to see if the agent learns to steer more freely
  • The learning rate 5e-6 might be too low. Increase it 2x to 1e-5.
  • Agent still does not pay attention to the trackPos and opponents - consider removing track from observation list while using 2-action space. trackPos should give sufficient information to the agent not to go OOT

Experiment 33

Motivation

To address the actionables from Experiment 32

Changes made to the env and training config

lr: 1e-5, 3-action space

rewards:

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 100.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 1.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 5.0
  • RankOneReward: scale: 10.0

TODO later: remove 'track' from state space

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-05_06-29-1602jz__fz/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_06-29-1602jz__fz/checkpoint_181/checkpoint-181
/home/anirban/ray_results/PPO_madras_env_2020-01-05_07-10-53tfgdnjbw/checkpoint_362/checkpoint-362
/home/anirban/ray_results/PPO_madras_env_2020-01-05_07-53-19wdmej9r5/checkpoint_783/checkpoint-783

Experiment 34

Motivation

Force the agent to learn a zigzag maneuver and thus learn to steer out of the way of obstacles.

Methodology

  • We arrange the traffic agents in a way that the agent must first turn right and then turn left.
  • We stop the agent from taking the edge of the tracks to overtake without turning by

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_09-39-15hwy781dy/checkpoint_311/checkpoint-311
/home/anirban/ray_results/PPO_madras_env_2020-01-05_13-00-57je1t0ldp/checkpoint_382/checkpoint-382

Observation

  • The agent stops and almost comes to a perfect halt, but it keeps trying to go to the left and does not steer right
  • Maybe it will improve with more training
  • The agent remains stuck with the same behavior between checkpoints 311 and 382 - PPO often gets stuck in local minima. Trying to ameliorate this by increasing the learning rate 10x to 1e-4

After training with increased learning rate of 1e-4 from /home/anirban/ray_results/PPO_madras_env_2020-01-05_13-00-57je1t0ldp/checkpoint_382/checkpoint-382

we have: /home/anirban/ray_results/PPO_madras_env_2020-01-05_14-04-198iqqfx34/checkpoint_413/checkpoint-413

Experiment 35

Same as #34 but with 3-action space

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-05_14-47-37iicixb5i/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_14-47-37iicixb5i/checkpoint_1894/checkpoint-1894

Observations

  • The car doesn't even learn to slow down to avoid collision.
  • The car runs straight ahead at full speed and collides. Over time, it learns to gather reward from ProgressReward2 by running faster.

Experiment 36

Same as 34 but with modified reward function

rewards:

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 10.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 1.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 5.0
  • RankOneReward: scale: 10.0

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_22-52-440h5i8uu6/checkpoint_11/checkpoint-11
/home/anirban/ray_results/PPO_madras_env_2020-01-05_23-05-49klpltfym/checkpoint_22/checkpoint-22

Experiment 37

Motivation

Train the agent to detect imminent collision in front and change lanes to the right.

Settings

  • Num traffic cars: 3
  • Width of the road: -0.9 to 0.9

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-05_23-26-03xqnlpuna/
/home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_23-26-03xqnlpuna/checkpoint_61/checkpoint-61
/home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322

The agent learns to turn right and evade collisions most of the time, but the policy does not generalize to other traffic configurations.

Experiment 38

Motivation

Start from the output of Experiment 37 and finetune the policy to achieve generalization to a wide variety of traffic conditions

Settings

lr reduced 10x from #37 to 1e-6

Traffic configuration changed to the following:

Num traffic cars: 1 to 4. Traffic agents are parked so as to make it essential for the agent to steer.

Restored from checkpoint: /home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322 from Experiment 37.

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-06_11-59-47j4g9sq0f/checkpoint_513/checkpoint-513

Observation

  • The agent barely makes any progress in learning to generalize
  • Tested the agent on the traffic config of experiment 37 and saw that it can solve the environment very efficiently. This shows that the policy has not changed enough from /home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322

Action points

  • Train again with higher learning rate? Maybe the minimum that /home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322 represents is too stable a solution and it is hard to bring the agent out of it.
  • Train with a 50-50 blend of two traffic configurations, in one of which, the agent needs to turn right and the other, it needs to turn left

Experiment 39

Motivation

The agent is presented with two traffic conditions - one in which it has to turn right and another in which it has to turn left.

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/checkpoint_1521/checkpoint-1521

Observations

  • The agent learns to slow down when it detects imminent collision
  • In the presence of 3 traffic agents, the agent learns to take right turn to evade the first car
  • However it does not learn to turn left and hits the traffic car in the right lane

Probable cause

  • The agent does not learn to take drastically different actions based on similar opponents input but different trackPos and track inputs
  • I narrowed the track but only implemented that in the done function. The agent's trackPos and track inputs are still in their previous scales. Changing these inputs according to the new track_width might be helpful for the agent to recognize the two different conditions under which it has to turn left or right.

Action points

  • Properly scale the trackPos and track inputs to conform with the modified track-width
  • Use a deeper neural network, in case, the network capacity is the bottleneck here
  • Start from the right-turning checkpoint and finetune it only on the 2 traffic agent setting. Observe if and when the agent learns to turn left and whether it preserves its right turn behavior while doing so.
  • Maybe the KL divergence cutoff of PPO is too small to allow the policy to have variance high enough to capture the nonlinearity involved in selecting left/right turns depending on the agent's lane position - try DDPG and SAC; read up on PPO in Spinning Up in RL

Experiment 40

Same as #39 but with DDPG and 3-action space

Training directory

/home/anirban/ray_results/DDPG_madras_env_2020-01-07_20-45-44dit_leh7/

Evaluated checkpoints

/home/anirban/ray_results/DDPG_madras_env_2020-01-07_20-45-44dit_leh7/checkpoint_217/checkpoint-217

Experiment 41

Same as #39 but with DDPG and 2-action space

Training directory

/home/anirban/ray_results/DDPG_madras_env_2020-01-08_06-04-45bzrjqtd6/

Evaluated checkpoints

/home/anirban/ray_results/DDPG_madras_env_2020-01-08_06-04-45bzrjqtd6/checkpoint_101/checkpoint-101

Observation

Stopped at 101 for a quick experiment - resume later

Experiment 42

Motivation

Retrain the policy from #39 with only 2 traffic agents and see how long it takes the agent to learn to take a left. Let's see if we can recover one policy that can do both right and left turns.

Restoring checkpoint: /home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/checkpoint_1521/checkpoint-1521

lr: 5e-6

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-08_11-33-29e1exjxo2/checkpoint_1651/checkpoint-1651

Observation

The agent does not recover from the previous behavior. Repeating with a higher learning rate and target KL divergence.

Experiment 43

Retrain the policy from #39 with only 2 traffic agents and see how long it takes the agent to learn to take a left. Let's see if we can recover one policy that can do both right and left turns.

lr: 5e-5, kl_target: 0.03

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-08_15-08-04yg7_ht_i/
/home/anirban/ray_results/PPO_madras_env_2020-01-08_16-51-05o_soa0ds/
/home/anirban/ray_results/PPO_madras_env_2020-01-08_18-20-24afb6eso_/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-08_15-08-04yg7_ht_i/checkpoint_1615/checkpoint-1615
/home/anirban/ray_results/PPO_madras_env_2020-01-08_16-51-05o_soa0ds/checkpoint_1676/checkpoint-1676
/home/anirban/ray_results/PPO_madras_env_2020-01-08_18-20-24afb6eso_/checkpoint_1749/checkpoint-1749

Experiment 44

Retrain #39 from /home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/checkpoint_301/checkpoint-301 with learning rate 1e-5. The anticipation is that earlier in training the policy is more flexible and can adapt to the left/right change.

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-08_21-32-57wlb7e74i/checkpoint_382/checkpoint-382
/home/anirban/ray_results/PPO_madras_env_2020-01-08_23-28-28xkab3zix/checkpoint_500/checkpoint-500

Observation

The agent's behavior does not change.

Explanation

The probable cause could be that the agent's lane-position action was not properly adjusted to the changed trackPos limits. The trackPos limits are restricted to +/- 0.5, so lane positions outside that range get very low values because they result in episode termination. With 3 traffic agents, the agent always starts in the left lane. When it sees a car in front, any left-turning action takes it off track. As a result, all left turns get devalued when there is a car in front, and the agent becomes reluctant to take left turns in that situation.

Action items

Instead of blocking away half the action space, rescale the lane-position action's -1/+1 range to the trackPos limits set in the sim_options file (see the sketch below).
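A minimal sketch of the rescaling this action item calls for, assuming the trackPos limits of +/- 0.5 mentioned above; the function name and defaults are illustrative.

```python
def rescale_lane_pos_action(action, track_pos_low=-0.5, track_pos_high=0.5):
    """Linearly map a policy action in [-1, 1] onto the trackPos limits set in
    sim_options, so the whole action range stays usable instead of half of it
    leading to episode termination."""
    # -1 -> track_pos_low, +1 -> track_pos_high
    return track_pos_low + (action + 1.0) * 0.5 * (track_pos_high - track_pos_low)
```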

Experiment 45

Same as #39 with action space fixed. Now -1 and +1 correspond to track_limits specified in sim_options

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311

Observation

  1. VOILA! The agent learns to overtake on both the left and the right.

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311 on 2 traffic agents

==================================================
    Successful race completion rate: 0.927461139896373

    Num trajectories evaluated: 193

==================================================

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311 on 3 traffic agents

==================================================
    Successful race completion rate: 0.839572192513369

    Num trajectories evaluated: 187

==================================================

Testing generalization...

The agent does not generalize well to more traffic agents. With 4 traffic agents it achieves a 0% successful race completion rate, and with 5 traffic agents the result is as follows:

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311 on 5 traffic agents

==================================================
    Successful race completion rate: 0.03879310344827586

    Num trajectories evaluated: 232

==================================================

Actionable items

Learn to generalize to more agents

Experiment 46

4-5 traffic agents, starting from /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-09_22-38-51tq67mqm1/
/home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-09_22-38-51tq67mqm1/checkpoint_347/checkpoint-347
/home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_968/checkpoint-968

Observation

The agent learns to navigate the 5-traffic scene with 39% success rate but fails to get any success in the 4-traffic scenario.

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_918/checkpoint-918 on 5 traffic agents

==================================================
    Successful race completion rate: 0.3904761904761905

    Num trajectories evaluated: 105

==================================================

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_928/checkpoint-928 on 5 traffic agents

==================================================
    Successful race completion rate: 0.35185185185185186

    Num trajectories evaluated: 108

==================================================

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_808/checkpoint-808 on 5 traffic agents

==================================================
    Successful race completion rate: 0.2956521739130435

    Num trajectories evaluated: 115

==================================================

Actionable

Try training on 3 or 4 traffic agents to teach the agent to cope with a 4-traffic-agent scenario without forgetting the 3-traffic scenario.

Experiment 47

Motivation

3-4 traffic agents starting from /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-10_21-44-174bw2i3gn/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-10_21-44-174bw2i3gn/checkpoint_693/checkpoint-693
/home/anirban/ray_results/PPO_madras_env_2020-01-10_21-44-174bw2i3gn/checkpoint_562/checkpoint-562

Observations

  • The agent still does not learn to navigate when starting in the right lane (the 4 traffic car case)
  • Although the agent aces the 3 traffic car configuration
  • As 3 traffic cars appear 50% of the time, the agent gets the episode completion reward of 100, 50% of the time. The agent seems to remain content with this.

Actionable

  • Train the agent on 3-4 traffic cars from scratch. Starting from an expert on 3 traffic cars may have created a bias for 3/5 traffic car configurations in the model
  • Present the 4 traffic configuration more often than the 3 traffic car configuration.

Experiment 48

Motivation

Train on 3-4 traffic agents from scratch

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-11_07-26-071k5vnc5m/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_10-02-047frykyio/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_11-38-42nn_hh2gg/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2020-01-11_07-26-071k5vnc5m/checkpoint_181/checkpoint-181
/home/anirban/ray_results/PPO_madras_env_2020-01-11_10-02-047frykyio/checkpoint_272/checkpoint-272
/home/anirban/ray_results/PPO_madras_env_2020-01-11_11-38-42nn_hh2gg/checkpoint_433/checkpoint-433

Actionable

Reduce the target speed of the driving agent to 25 kmph. We don't need the agent to have a high speed while maneuvering through a narrow road with parked traffic.

Experiment 49

Motivation

Train on 3-4 traffic agents from scratch at 50 km per hour

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-11_14-48-14wj9k39q7/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-11_14-48-14wj9k39q7/checkpoint_121/checkpoint-121

Observations

Training gets saturated

Experiment 50

Motivation

Start from #45: /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311. Train on 4-5 traffic agents, reducing the target speed to 50 kmph.

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-11_16-38-573dk5n2ma/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_18-18-11w1fvm0st/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-11_16-38-573dk5n2ma/checkpoint_314/checkpoint-314
/home/anirban/ray_results/PPO_madras_env_2020-01-11_18-18-11w1fvm0st/checkpoint_425/checkpoint-425

Observations

  • VOILA! The agent overcomes both 4 and 5 traffic agents with high success rate starting from both the left and right sides of the road.

  • The agent generalizes to up to 9 traffic cars parked alternately on both sides of the road.

  • The agent generalizes across different car models - even baja-bug and buggy

  • Limitations:

    • The agent collides if the traffic cars switch lanes in front of the agent
    • As the initial stretch of road in aalborg is straight, the agent does not learn to make decisions keeping in mind the turns of the road. It fails to overtake all the agents when tested in Corkscrew which has a left turn in the initial part of the race.

Experiment 51

Motivation

Make the driver from #50 (/home/anirban/ray_results/PPO_madras_env_2020-01-11_16-38-573dk5n2ma/checkpoint_314/checkpoint-314) learn to generalize to multiple tracks by training it on a 50-50 blend of 4 and 5 traffic agents and a 50-50 blend of aalborg and corkscrew.

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-11_22-05-24d3eqobre/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-11_22-05-24d3eqobre/checkpoint_945/checkpoint-945
/home/anirban/ray_results/PPO_madras_env_2020-01-12_07-14-03nwuhirj8/checkpoint_956/checkpoint-956

==================================================
STOCHASTIC ACTIONS
==================================================

Experiment 52

Motivation

Train an agent to drive around corkscrew with N(0, 0.1) Gaussian action noise and 3 action space
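A sketch of one way to inject the N(0, 0.1) Gaussian action noise described here, using a wrapper around the environment; the wrapper itself is an assumed mechanism, not necessarily how MADRaS implements noisy actions.

```python
import gym
import numpy as np


class GaussianActionNoise(gym.ActionWrapper):
    """Add N(0, sigma) noise to each action dimension (steer, accel, brake) and
    clip back to the action-space bounds. sigma=0.1 matches this experiment;
    sigma=0.5 would match Experiment 54."""

    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def action(self, action):
        noisy = action + np.random.normal(0.0, self.sigma, size=np.shape(action))
        return np.clip(noisy, self.action_space.low, self.action_space.high)
```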

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_05-44-367hvs2_e9/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_05-44-367hvs2_e9/checkpoint_751/checkpoint-751

Experiment 53

Motivation

Train an agent to drive around corkscrew with observation noise and no action noise

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_07-42-41eom5mt1d/
/home/anirban/ray_results/PPO_madras_env_2020-01-13_09-36-26v7fotu2_/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_07-42-41eom5mt1d/checkpoint_68/checkpoint-68
/home/anirban/ray_results/PPO_madras_env_2020-01-13_09-36-26v7fotu2_/checkpoint_599/checkpoint-599

Experiment 54

Motivation

Train an agent to drive around corkscrew with N(0, 0.5) Gaussian action noise and 3 action space

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_11-25-330k3tfabm/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_11-25-330k3tfabm/checkpoint_2172/checkpoint-2172

Experiment 55

Motivation

Train an agent to drive around corkscrew with N(0, 0.1) Gaussian action noise and observation noise and 3 action space

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_16-54-209wqonv6u/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_16-54-209wqonv6u/checkpoint_1141/checkpoint-1141

Experiment 56

Motivation

Baseline experiment for this section: train an agent to drive around corkscrew with no observation or action noise.

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_21-03-554xigbvl1/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_21-03-554xigbvl1/checkpoint_1377/checkpoint-1377

Observation

The agent goes off track due to timeout. Increasing max-steps to 15000

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-14_07-08-089zxr4g7_/

Last checkpoint

/home/anirban/ray_results/PPO_madras_env_2020-01-14_07-08-089zxr4g7_/checkpoint_671/checkpoint-671

TODO(santara)

  • Engine characteristic graph (torque@rpm) for all cars - modify the code for full automation
  • Plot the training graphs of the first 2 experiments -> done
  • Train a single agent to drive one 4WD and one RWD car / one new car and buggy -> done
  • Train a policy just on buggy to see the decision profile -> done
  • Extract drive-train and CG information -> done
  • Read the curriculum learning for robotics papers and write the curriculum learning section of the MADRaS paper -> done
  • Add context to observation and dilated-conv/recurrent policy
  • While training in traffic, there should be a minimum number of traffic agents -> done