Train a single agent to drive three cars in Alpine
- Action space: steering-acceleration-brake
- Car models:
- car1-stock1
- acura-nsx-sz
- car1-trb1
- Track: Alpine
- Neural network architecture: PPO default
- Training algorithm: PPO
- max_steps: 1000
- noisy observations
- noisy actions
/home/anirban/ray_results/PPO_madras_env_2019-12-16_14-22-20bsqfy716/checkpoint_221/checkpoint-221
- max_steps was set to 1000
- track length and width were wrong
Train a single agent to drive three cars in Alpine
- Action space: steering-acceleration-brake
- Car models:
- car1-stock1
- acura-nsx-sz
- car1-trb1
- Track: Alpine
- Neural network architecture: PPO default
- Training algorithm: PPO
- max_steps: 25000
- noisy observations
- noisy actions
/home/anirban/ray_results/PPO_madras_env_2019-12-16_16-00-21vj0or2ia/checkpoint_1671/checkpoint-1671
Train a single agent to drive three cars in Alpine
- Action space: steering-acceleration-brake
- Car models:
- car1-stock1
- car1-stock2
- acura-nsx-sz
- car1-trb1
- p406
- 155-DTM
- Track: Alpine
- Neural network architecture: PPO default
- Training algorithm: PPO
- max_steps: 25000
/home/anirban/ray_results/PPO_madras_env_2019-12-16_20-31-43ilasyeke/checkpoint_51/checkpoint-51
Experiment failed due to NaN reward.
- Added filter for nan reward in reward_manager
- Placed trainer.train() in a try-except block to catch errors and save the checkpoint
- Added "ignore_worker_failures": True to the training config to see whether training can continue after worker failures.
/home/anirban/ray_results/PPO_madras_env_2019-12-17_04-51-45zqcjhn23/
/home/anirban/ray_results/PPO_madras_env_2019-12-17_04-51-45zqcjhn23/checkpoint_842/checkpoint-842
- p406 has immense body roll
- cars are still wobbly - some more some less
- all 6 cars can finish the lap in Alpine-1, but they do not all generalize well to spring
- policy does not generalize to buggy
Single car all roads
- Action space: steering-acceleration-brake
- Car models:
- car1-stock1
- Roads:
- aalborg
- alpine-1
- alpine-2
- brondehach
- g-track-1
- g-track-2
- g-track-3
- corkscrew
- eroad
- e-track-2
- e-track-3
- e-track-4
- e-track-6
- forza
- ole-road-1
- ruudskogen
- street-1
- wheel-1
- wheel-2
- spring
P.S. "e-track-1" gives Segmentation fault from torcs: no track observations after about 1000 steps
/home/anirban/ray_results/PPO_madras_env_2019-12-17_12-26-490nxx9rdv/checkpoint_48/checkpoint-48
/home/anirban/ray_results/PPO_madras_env_2019-12-17_14-24-06q9xr0gg9/checkpoint_78/checkpoint-78
/home/anirban/ray_results/PPO_madras_env_2019-12-17_15-46-019zdm3e92/checkpoint_209/checkpoint-209
/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-11-22a2tpiv5s/checkpoint_344/checkpoint-344
/home/anirban/ray_results/PPO_madras_env_2019-12-17_17-16-151sfd7wsu/checkpoint_1277/checkpoint-1277
trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_12-26-490nxx9rdv/checkpoint_48/checkpoint-48")
trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_14-24-06q9xr0gg9/checkpoint_78/checkpoint-78")
trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_15-46-019zdm3e92/checkpoint_209/checkpoint-209")
trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-11-22a2tpiv5s/checkpoint_344/checkpoint-344")
trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-35-32a8nef0xr/checkpoint_372/checkpoint-372")
trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-46-52xbot_yvb/checkpoint_407/checkpoint-407")
trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-58-32oxheh0v5/checkpoint_492/checkpoint-492")
trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_17-11-43r7e5myyo/checkpoint_500/checkpoint-500")
trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_17-16-151sfd7wsu/checkpoint_1277/checkpoint-1277")
Single car learning to drive among up to 9 parked traffic cars, from scratch
Alpine-1
car1-stock1
steering-acceleration-brake
/home/anirban/ray_results/PPO_madras_env_2019-12-17_20-51-25wwi1tpuj/
/home/anirban/ray_results/PPO_madras_env_2019-12-17_20-51-25wwi1tpuj/checkpoint_641/checkpoint-641
- Car learns to drive forward
- Car sometimes also brakes to prevent collision with the car in front but collides most of the time
- Car always chooses to take the left lane to overtake. It is a default first step behavior. This causes the agent to hit the car in front, especially when it initializes in the right lane.
- A curriculum is necessary to teach the agent four things: learning to drive on an open track, learning to prevent front collisions, learning to prevent rear-end collisions, and learning to overtake
Training buggy to drive in Alpine-1
The front left and right brake disk diameters were originally 250 mm, which was out of bounds because the rim diameter is 200 mm. To prevent this error, I reduced the disk diameter to 200 mm.
/home/anirban/ray_results/PPO_madras_env_2019-12-18_06-15-44v5_kk1jo/
/home/anirban/ray_results/PPO_madras_env_2019-12-18_06-15-44v5_kk1jo/checkpoint_211/checkpoint-211
- The buggy learns to run but its top speed is ~70 kmph although the target speed is 100 kmph
- The buggy learns to drift at tight corners but sometimes turns backwards in the process
- Policy does not generalize to spring. The agent drives off track right at the beginning. Clearly the agent had memorized Alpine-1
- Policy does not generalize to other cars. This is because buggy has very different engine characteristics than other cars. Buggy has exceptionally high torque at low RPMs. This is what makes it so difficult to control. The only other car that it can even start is the baja-bug
Train one agent to drive both buggy and car1-stock1 in Alpine-1. Can an RL agent learn to do it with the SingleAgentSimpleLapObs observations, steer-accel-brake actions, and these reward settings?
- ProgressReward2: scale: 1.0
- AvgSpeedReward: scale: 1.0
- CollisionPenalty: scale: 10.0
- TurnBackwardPenalty: scale: 10.0
- AngAcclPenalty: scale: 5.0, max_ang_accl: 2.0
/home/anirban/ray_results/PPO_madras_env_2019-12-18_11-23-20sj0j59xe/
/home/anirban/ray_results/PPO_madras_env_2019-12-18_21-45-191x6ogb6f/
/home/anirban/ray_results/PPO_madras_env_2019-12-18_11-23-20sj0j59xe/checkpoint_691/checkpoint-691
Retraining from the above checkpoint: /home/anirban/ray_results/PPO_madras_env_2019-12-18_21-45-191x6ogb6f/checkpoint_1302/checkpoint-1302
- The policy can start both the cars
- Buggy is more stable and lives longer with this policy than a policy that is trained only on car1-stock1
- car1-stock1 runs well and completes the lap, but its lateral motion is not completely smooth
- The lateral motion is particularly vigorous in case of buggy and this is what sets it off track
- Training longer with more angular acceleration penalty might help
- After training longer, car1-stock1 can complete the race most of the time but buggy drives very poorly
- TODO(santara) Visualize and evaluate more checkpoints from /home/anirban/ray_results/PPO_madras_env_2019-12-18_21-45-191x6ogb6f/
Train one agent to drive both buggy and car1-stock1 in Alpine-1. Can an RL agent learn to do it with the SingleAgentSimpleLapObs observations, lanepos-speed actions, and these reward settings?
- ProgressReward2: scale: 1.0
- AvgSpeedReward: scale: 1.0
- CollisionPenalty: scale: 10.0
- TurnBackwardPenalty: scale: 10.0
- AngAcclPenalty: scale: 5.0, max_ang_accl: 2.0
The conjecture is that when the engine responses of the two cars are drastically different, a low-level controller can handle the engine commands while a high-level controller issues a uniform set of desired set-points for both cars.
P.S. The PID parameters were actually tuned for car1-trb1. An improvement would be to tune the parameters for each of the cars.
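The hierarchical scheme above can be sketched as a PID layer that turns the agent's (lanepos, speed) desires into steer/accel commands. The class names, observation keys (`trackPos`, `speedX` in the TORCS style), and gains below are illustrative, not the tuned car1-trb1 values:

```python
class PID:
    """Textbook PID controller; gains passed in are illustrative."""
    def __init__(self, kp, ki, kd, dt=0.02):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def low_level_control(target_lanepos, target_speed, obs, steer_pid, accel_pid):
    """Translate high-level (lanepos, speed) desires into steer/accel commands."""
    steer = steer_pid.step(target_lanepos - obs["trackPos"])
    accel = accel_pid.step(target_speed - obs["speedX"])
    # Clip to the simulator's action range; negative accel acts as brake.
    clip = lambda x: max(-1.0, min(1.0, x))
    return clip(steer), clip(accel)
```

The point of the design is that the RL policy only ever sees a car-agnostic action interface; all car-specific engine behavior is absorbed by the PID gains.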
5
/home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_
After resuming from: /home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_/checkpoint_1031/checkpoint-1031 :
/home/anirban/ray_results/PPO_madras_env_2019-12-20_15-01-31g5dbqhfr
/home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_/checkpoint_1031/checkpoint-1031
/home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_/checkpoint_411/checkpoint-411
- VOILA! This agent is so good that it can drive almost every car in TORCS!
- It can drift Buggy and Baja-Bug! Seriously!
- Average speed is different for different cars
- Acura-NSX-SZ and car1-stock2 cannot drift well with the current PID controller, so the agent learns to stop them and take sharp turns slowly
- We should study the engine response of the p406 cars. They behave very differently from the other cars
- kc-daytona, kc-2000gt, and kc-giulietta drift well at low speeds but go out of control while drifting at speeds ~100kmph
- The straight stretch of Alpine-1 is muddy and the wheels get stuck. The agent actually learns to stop, downshift and gather the torque necessary to overcome the mud - a behavior that was never observed in the 3-action space agents
- Agent chooses different speed ranges for different cars - how does it know???
- Agent finds it difficult to start some of the newer cars off track
2-action space, 4 traffic agents
5
/home/anirban/ray_results/PPO_madras_env_2019-12-20_17-53-147rou1mfy/
Restoring from /home/anirban/ray_results/PPO_madras_env_2019-12-20_17-53-147rou1mfy/checkpoint_21/checkpoint-21:
/home/anirban/ray_results/PPO_madras_env_2019-12-20_19-40-30jw5hac_0/
/home/anirban/ray_results/PPO_madras_env_2019-12-20_17-53-147rou1mfy/checkpoint_21/checkpoint-21
/home/anirban/ray_results/PPO_madras_env_2019-12-20_19-40-30jw5hac_0/checkpoint_72/checkpoint-72
/home/anirban/ray_results/PPO_madras_env_2019-12-20_21-52-32s6vbfj4u/checkpoint_213/checkpoint-213
To drive buggy on Alpine with lanepos-vel control
5
/home/anirban/ray_results/PPO_madras_env_2019-12-23_15-10-29v4er474i/
/home/anirban/ray_results/PPO_madras_env_2019-12-23_15-10-29v4er474i/checkpoint_201/checkpoint-201
Pretty good performance but it needs some more training. Sometimes goes off track at the sharpest bend.
To drive car1-stock1 in Alpine with lanepos-vel control
5
/home/anirban/ray_results/PPO_madras_env_2019-12-23_21-09-08utsi2qkt/
/home/anirban/ray_results/PPO_madras_env_2019-12-23_21-09-08utsi2qkt/checkpoint_1041/checkpoint-1041
- The maneuvers of the car are not smooth:
- Pointless lane switches at the beginning
- Car struggles to accelerate off the start line and navigate through the straight muddy stretch later in the track
- Corners are not smooth drifts. The car slows down, halts and goes. It even goes off track while going around the tight corners.
- Agent does not learn to take tight corners to reduce distance travelled, unlike the one trained on steer-accel-brake
Train an agent to drive car1-stock1 in Alpine-1 using 3-action space
/home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/
/home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/checkpoint_701/checkpoint-701
Train a single agent to drive car1-stock1 on spring from scratch
/home/anirban/ray_results/PPO_madras_env_2019-12-25_10-34-225wglbdwh/
/home/anirban/ray_results/PPO_madras_env_2019-12-25_11-25-573rhqmu92/
/home/anirban/ray_results/PPO_madras_env_2019-12-29_23-49-20r7a3oaik/
/home/anirban/ray_results/PPO_madras_env_2019-12-25_10-34-225wglbdwh/checkpoint_41/checkpoint-41
/home/anirban/ray_results/PPO_madras_env_2019-12-25_11-25-573rhqmu92/checkpoint_1882/checkpoint-1882
/home/anirban/ray_results/PPO_madras_env_2019-12-29_23-49-20r7a3oaik/checkpoint_3633/checkpoint-3633
Curriculum learning for driving in Spring. Agent pre-trained in Alpine-1. Training resumed without decreasing learning rate from: /home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/checkpoint_701/checkpoint-701
/home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9
/home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_1402/checkpoint-1402
- /home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_1402/checkpoint-1402 is a bad checkpoint: the agent drives some distance and then stands still
- Average speed low
- Lots of sidewise motion
Better checkpoints:
/home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_792/checkpoint-792
/home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_782/checkpoint-782
/home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_772/checkpoint-772
- Refer to earlier experiment
- Increase jerk penalty
- Decrease the learning rate
Noting this in the paper.
Second attempt at curriculum learning for car1-stock1 in spring after starting from /home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/checkpoint_701/checkpoint-701 with lower learning rate of 1e-6
/home/anirban/ray_results/PPO_madras_env_2019-12-25_21-11-49kvt17rg2/ /home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/
/home/anirban/ray_results/PPO_madras_env_2019-12-25_21-11-49kvt17rg2/checkpoint_1112/checkpoint-1112
/home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/checkpoint_2923/checkpoint-2923
Healthy training - train for some more time starting from /home/anirban/ray_results/PPO_madras_env_2019-12-25_21-11-49kvt17rg2/checkpoint_1112/checkpoint-1112
Good checkpoint but still can't handle the big L-shaped left turns: /home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/checkpoint_2923/checkpoint-2923
- Very high speeds. Coercing episode length to 40000 actually helped
- agent drives very smoothly in alpine-1 with barely any sidewise movement
- but agent fails to start off in corkscrew :(
- the success of a policy is highly dependent on the initial state distribution.
- agent brakes hard when it senses a bend ahead.
- Adding temporal context to the observation should solve this problem. The same problem occurs with /home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/checkpoint_2673/checkpoint-2673
Remark: the agent must learn to slow down for the L-curve. Train the agent longer with larger max_steps.
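The "temporal context" idea above is typically implemented as frame stacking: feed the last k observation vectors instead of one, so the policy can sense braking trends. A hypothetical wrapper sketch (the real MADRaS observation pipeline may differ):

```python
from collections import deque

class TemporalContext:
    """Stack the last k observation vectors so the policy can infer motion trends."""
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def _concat(self):
        # Flatten the k stored frames into one long observation vector.
        return [x for frame in self.frames for x in frame]

    def reset(self, obs):
        self.frames.clear()
        for _ in range(self.k):  # pad the history with the first frame
            self.frames.append(obs)
        return self._concat()

    def step(self, obs):
        self.frames.append(obs)  # oldest frame falls off automatically
        return self._concat()
```

The stacked observation is k times longer, so the policy network's input size must be adjusted accordingly.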
TRAIN AN AGENT ON ALPINE-1 AND CORKSCREW FOR IT TO GENERALIZE TO SPRING. ALPINE HAS SHARP RIGHT TURNS BUT NO SHARP LEFT TURN. CORKSCREW HAS A SHARP LEFT TURN. AGENTS TRAINED ON ALPINE-1 USUALLY GO OFF TRACK IN SPRING AROUND 15000 STEPS, WHEN THEY ENCOUNTER THE SHARP LEFT TURN. AN AGENT TRAINED ON BOTH ALPINE-1 AND CORKSCREW SHOULD BE ABLE TO GENERALIZE TO THE WHOLE OF SPRING.
Train an agent to drive car1-stock1 in alpine-1 and corkscrew
Present a 50-50 blend of the two tracks using random track assignment in MADRaS.
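The 50-50 blend can be sketched as a random track draw at every episode reset. The function name and track strings below are illustrative; MADRaS's actual track-assignment config may differ:

```python
import random

def sample_track(tracks=("alpine-1", "corkscrew"), rng=random):
    """Pick the track for the next episode, uniformly at random.

    Passing a seeded random.Random as rng makes the schedule reproducible.
    """
    return rng.choice(tracks)
```

Over many episodes each track appears about half the time, which is all the "50-50 blend" requires.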
/home/anirban/ray_results/PPO_madras_env_2019-12-26_08-05-124ejof36t/checkpoint_11/checkpoint-11
/home/anirban/ray_results/PPO_madras_env_2019-12-26_13-43-22mpipeiml/checkpoint_314/checkpoint-314
This is the best checkpoint, but it has some issues:
- can solve both alpine-1 and corkscrew about 10% of the time
- cannot solve Spring; it goes off track at the very first right hairpin bend
/home/anirban/ray_results/PPO_madras_env_2019-12-26_14-35-37pzu8g3uf/checkpoint_745/checkpoint-745
/home/anirban/ray_results/PPO_madras_env_2019-12-26_16-03-36mg86dj5p/checkpoint_1606/checkpoint-1606
Experiment #23 does not converge well. We are going to try a curriculum learning strategy in which we will finetune the agent trained on alpine-1 to generalize to corkscrew (train with a 50-50 blend of tracks) with a lower learning rate of 1e-5
/home/anirban/ray_results/PPO_madras_env_2019-12-26_17-44-493m9bbtyy/checkpoint_1358/checkpoint-1358
Better checkpoints:
/home/anirban/ray_results/PPO_madras_env_2019-12-26_17-44-493m9bbtyy/checkpoint_1202/checkpoint-1202
/home/anirban/ray_results/PPO_madras_env_2019-12-26_17-44-493m9bbtyy/checkpoint_1192/checkpoint-1192
The agent still goes off track at the tight S bend in corkscrew.
An agent trained in Alpine-1 doesn't generalize well to corkscrew because it never learns to negotiate sharp L-shaped left turns. Training on a 50-50 blend of corkscrew and alpine-1 does not generalize well either. Initializing with alpine-1 and finetuning on a 50-50 blend of corkscrew and alpine-1 doesn't fare well either. The motivation of this experiment is to test how effective pre-training on corkscrew and finetuning on alpine-1 is.
Training car1-stock1 on corkscrew from scratch:
/home/anirban/ray_results/PPO_madras_env_2019-12-27_12-04-23z77stse8/
/home/anirban/ray_results/PPO_madras_env_2019-12-27_12-04-23z77stse8/checkpoint_561/checkpoint-561
WOW! This policy can actually complete spring 43% of the time. Corkscrew is thus a closer MDP to spring than Alpine-1 is.
==================================================
Car: car1-stock1
Track: spring
Average distance covered: 0.6523207230607263
Average speed: 90.70068030268457
Successful race completion rate: 0.43243243243243246
Num trajectories evaluated: 37
==================================================
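The summary block above is a simple aggregation over evaluation trajectories. A sketch of how it could be computed; the field names (`distance`, `avg_speed`, `completed`) are assumptions, not the eval script's actual schema, and "average distance covered" is taken to be a fraction of the track length:

```python
def summarize(trajectories, track_length):
    """Aggregate per-trajectory logs into an evaluation summary.

    Each trajectory dict is assumed to carry 'distance' (meters),
    'avg_speed' (kmph) and 'completed' (bool).
    """
    n = len(trajectories)
    return {
        "avg_distance_covered": sum(t["distance"] for t in trajectories) / n / track_length,
        "avg_speed": sum(t["avg_speed"] for t in trajectories) / n,
        "race_completion_rate": sum(1 for t in trajectories if t["completed"]) / n,
        "num_trajectories": n,
    }
```

Reporting distance as a fraction of track length makes runs on tracks of different lengths directly comparable.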
For training on spring, instead of starting from the Alpine-1 policy, start with the corkscrew-trained policy /home/anirban/ray_results/PPO_madras_env_2019-12-27_12-04-23z77stse8/checkpoint_561/checkpoint-561 and see if it can achieve a high race completion rate on spring within 1000 steps of training with learning rate 5e-7.
/home/anirban/ray_results/PPO_madras_env_2019-12-27_18-01-52dbzhq7id/
/home/anirban/ray_results/PPO_madras_env_2019-12-27_18-01-52dbzhq7id/checkpoint_2502/checkpoint-2502
Teach an agent pre-trained on alpine-1 with the 2-action space to drive with at most 2 traffic cars.
1 or 2 traffic cars with 50-50 probability
/home/anirban/ray_results/PPO_madras_env_2019-12-28_10-03-21lqpv40b2/
/home/anirban/ray_results/PPO_madras_env_2019-12-28_10-03-21lqpv40b2/checkpoint_21/checkpoint-21
/home/anirban/ray_results/PPO_madras_env_2019-12-28_11-11-08u06e4bkq/checkpoint_82/checkpoint-82
3 Traffic agents:
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 45, high: 50}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 30, high: 40}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 15, high: 25}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
==================================================
Successful race completion rate: 0.5844155844155844
Num trajectories evaluated: 77
==================================================
Evaluating on 4 traffic agents:
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 55, high: 60}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 45, high: 50}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 30, high: 40}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 15, high: 25}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
==================================================
Successful race completion rate: 0.4067796610169492
Num trajectories evaluated: 59
==================================================
==================================================
Successful race completion rate: 0.3884297520661157
Num trajectories evaluated: 121
==================================================
- As the episode length is very short (mostly under 500 steps, average 200 steps), too many rollouts (~30-50) need to be done to fill a replay buffer of size 4000. Reduce train_batch_size next time
- Increase collision penalty?
Observations specific to the evaluation with 4 traffic agents:
- The agent learns to choose the left or the right lane as required in most of the episodes
- The agent does not slow down to prevent an imminent collision: could be due to high PID latency? Reduce PID latency and check.
- The agent sometimes fails to judge the gap between the parked car in front and the edge of the track and makes an unnecessary lane change which causes it to collide with the car in the front
- The agent does not speed up to prevent a collision at the rear end.
- The agent learns to evade a parked car in the right lane but fails to evade a parked car in the left lane most of the time. This is probably because the agent was trained with 1 or 2 traffic agents. TORCS starts assigning lanes from the right. When there is only one traffic agent, the learning agent is assigned the left lane. If the learning agent always drives forward in its (left) lane, the traffic agent blocks its way roughly 50% of the time; the remaining 50% of the time, the agent can achieve Rank 1 without having to overtake even once. When there are 2 traffic agents, the learning agent is assigned the right lane by TORCS. In that case, almost 100% of the time, the learning agent has a traffic agent parked in front of it in the right lane and has to learn to dodge it by switching to the left lane. That's why the agent learns to evade collisions significantly better in the right lane than in the left.
Starting with /home/anirban/ray_results/PPO_madras_env_2019-12-28_11-11-08u06e4bkq/checkpoint_82/checkpoint-82
train_batch_size: 1000 (down 4x)
lr: 5e-6 (down 10x)
CollisionPenalty scale: 20
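The changes above, expressed as config overrides (`train_batch_size` and `lr` are standard RLlib config keys; the reward-scale nesting mirrors the reward settings listed earlier in these notes, not necessarily the exact MADRaS config schema):

```python
# Standard RLlib keys, reduced from the previous run's values.
rllib_overrides = {
    "train_batch_size": 1000,  # down 4x from 4000
    "lr": 5e-6,                # down 10x from 5e-5
}

# Reward-scale change in the MADRaS env config (key nesting is illustrative).
reward_overrides = {
    "CollisionPenalty": {"scale": 20.0},
}
```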
We train with 2 traffic agents at all times.
Training checkpoint for this part: /home/anirban/ray_results/PPO_madras_env_2019-12-28_15-12-41iu1gxm4t/checkpoint_433/checkpoint-433
Even better: /home/anirban/ray_results/PPO_madras_env_2019-12-28_15-12-41iu1gxm4t/checkpoint_373/checkpoint-373
Evaluating on the same 3 traffic agents as in the previous evaluation:
==================================================
Successful race completion rate: 0.6666666666666666
Num trajectories evaluated: 87
==================================================
Evaluating on the same 4 traffic agents as in the previous evaluation:
==================================================
Successful race completion rate: 0.2717391304347826
Num trajectories evaluated: 92
==================================================
Resuming from /home/anirban/ray_results/PPO_madras_env_2019-12-28_15-12-41iu1gxm4t/checkpoint_373/checkpoint-373 we train with 3 traffic agents
/home/anirban/ray_results/PPO_madras_env_2019-12-28_20-59-05ug3w7bfi/checkpoint_514/checkpoint-514
Evaluating on the same 3 traffic agents as in the previous evaluation:
==================================================
Successful race completion rate: 0.7613636363636364
Num trajectories evaluated: 88
==================================================
Evaluating on the same 4 traffic agents as in the previous evaluation:
==================================================
Successful race completion rate: 0.22105263157894736
Num trajectories evaluated: 95
==================================================
Resuming from above with 4 traffic agents, no randomization:
/home/anirban/ray_results/PPO_madras_env_2019-12-29_00-01-40us412r8m/checkpoint_975/checkpoint-975
The agent goes off track to the right most of the time. The agent does not know how to slow down when the car is about to go off track.
We hypothesize that training on a 50-50 blend of 3 and 4 traffic agents, starting from /home/anirban/ray_results/PPO_madras_env_2019-12-28_20-59-05ug3w7bfi/checkpoint_514/checkpoint-514, might help ameliorate this overfit behavior of overtaking from the right:
/home/anirban/ray_results/PPO_madras_env_2019-12-29_08-21-48cadf4q4z/checkpoint_645/checkpoint-645
/home/anirban/ray_results/PPO_madras_env_2019-12-29_08-21-48cadf4q4z/checkpoint_575/checkpoint-575
The same problem persists. The agent prefers going off track to stopping and making room to overtake. The agent does not learn to slow down to avoid collision either. Collisions are very frequent with this checkpoint, both with 3 and 4 traffic agents.
Next steps: retrain from /home/anirban/ray_results/PPO_madras_env_2019-12-28_20-59-05ug3w7bfi/checkpoint_514/checkpoint-514 with reduced ProgressReward2 and AngAcclPenalty
Before:
rewards:
  ProgressReward2: {scale: 1.0}
  AvgSpeedReward: {scale: 1.0}
  CollisionPenalty: {scale: 20.0}
  TurnBackwardPenalty: {scale: 10.0}
  AngAcclPenalty: {scale: 8.0, max_ang_accl: 2.0}
  SuccessfulOvertakeReward: {scale: 10.0}

Now:
rewards:
  ProgressReward2: {scale: 0.25}
  AvgSpeedReward: {scale: 1.0}
  CollisionPenalty: {scale: 20.0}
  TurnBackwardPenalty: {scale: 10.0}
  AngAcclPenalty: {scale: 1.0, max_ang_accl: 2.0}
  SuccessfulOvertakeReward: {scale: 10.0}
/home/anirban/ray_results/PPO_madras_env_2019-12-29_11-25-38gbw6ppb7/
/home/anirban/ray_results/PPO_madras_env_2019-12-29_11-25-38gbw6ppb7/checkpoint_705/checkpoint-705
- The agent does not stop to prevent going off track
- The agent does not brake enough to respond to a braking agent in the front
- We must give context to the agent
REMOVE AVERAGE SPEED REWARD BECAUSE IT ONLY APPLIES WHEN THE AGENT RUNS THE WHOLE LENGTH OF THE TRACK.
As a next step, try introducing a penalty for going off track.
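The proposed off-track penalty could look like the sketch below. The class interface is modeled loosely on the existing reward names (CollisionPenalty, AngAcclPenalty) and is hypothetical, not the actual reward_manager API:

```python
class OffTrackPenalty:
    """Penalty for leaving the track surface (hypothetical reward term).

    TORCS reports trackPos as 0 at the track center and +/-1 at the track
    edges, so |trackPos| > 1 means the car is off the track.
    """
    def __init__(self, scale=10.0):
        self.scale = scale

    def compute(self, obs):
        # Penalize only the overshoot beyond the track edge, scaled linearly.
        overshoot = max(0.0, abs(obs["trackPos"]) - 1.0)
        return -self.scale * overshoot
```

A graded penalty (proportional to overshoot) gives the agent a gradient back toward the track, unlike a flat penalty that fires only on episode termination.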
Train an agent to drive with 1 to 3 traffic cars from scratch
lr: 5e-6 train_batch_size: 1000
/home/anirban/ray_results/PPO_madras_env_2019-12-30_15-06-5733u1nyhs/
/home/anirban/ray_results/PPO_madras_env_2019-12-30_15-06-5733u1nyhs/checkpoint_381/checkpoint-381
Evaluation results with 3 traffic cars:
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 45, high: 50}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 30, high: 40}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 15, high: 25}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
==================================================
Successful race completion rate: 0.5238095238095238
Num trajectories evaluated: 105
==================================================
With 4 traffic agents:
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 55, high: 60}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 45, high: 50}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 30, high: 40}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 15, high: 25}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
==================================================
Successful race completion rate: 0.37894736842105264
Num trajectories evaluated: 95
==================================================
- The agent only learns to turn left and run straight, and attempts to execute this exact behavior in every episode. This is likely because with 1 traffic agent (the case in 33.33% of the episodes), the learning agent can exploit this policy and finish the race, which gives the policy extra leverage.
- Per overtake rewards must be decoupled from the race over reward. Currently, CollisionPenalty and SuccessfulOvertakeReward have the same scale. If the agent overtakes 2 traffic cars and collides with the third, it gets a net positive reward. Hence the agent does not care to learn to steer right in order to avoid collision on the left lane.
- Needs more training to learn the multi-modal behavior.
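The coupling flaw in the second bullet can be checked with a little arithmetic. The function below is a toy model of the shaped return, with all other reward terms omitted for clarity:

```python
def shaped_return(overtakes, collided, overtake_scale, collision_scale):
    """Net reward from per-overtake bonuses and an episode-ending collision only."""
    reward = overtake_scale * overtakes
    if collided:
        reward -= collision_scale
    return reward
```

With equal scales (10/10), overtaking 2 cars and then colliding still nets +10, so the collision is under-punished; with scales such as 5/15 the same episode nets -5, making collisions strictly bad.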
Starting from /home/anirban/ray_results/PPO_madras_env_2019-12-30_15-06-5733u1nyhs/checkpoint_381/checkpoint-381 with 3 traffic cars at all times:
/home/anirban/ray_results/PPO_madras_env_2019-12-30_20-37-12zhryck8b/checkpoint_1762/checkpoint-1762
Same three traffic agents as the previous evaluation:
==================================================
Successful race completion rate: 0.5096153846153846
Num trajectories evaluated: 104
==================================================
Same 4 traffic agents as the previous evaluation:
==================================================
Successful race completion rate: 0.35789473684210527
Num trajectories evaluated: 95
==================================================
Analysis of why an agent trained on Alpine-1 with 3-action space fails to take off in some race tracks
The agent brakes whenever there is a (slight) left turn ahead or when it is closer to the right edge than the left. In Alpine-1 this happens while the car is in motion, so the agent manages to stay in motion and reach a different state where it does not brake. However, on tracks where the slight left turn must be taken at zero or low speed, braking brings the agent to a halt; the agent then gets stuck in the braking state and never comes out.
- corkscrew: Agent sees a left turn at the start line and brakes
- aalborg: Agent comes close to the right edge of the track at the start line and brakes
- e-track-4: Agent comes closer (not very close though) to the right edge of the track at the start line and brakes
- g-track-1: like corkscrew
- g-track-2: like aalborg and e-track-4
- g-track-3: right turn in front
- forza: like aalborg and e-track-4
- wheel-1: like aalborg and e-track-4
- brondehach: like aalborg and e-track-4
This is a continuation of experiments 27 and 28 with modified learning settings:
- The reward structure has been changed to decouple per overtake reward from race completed reward.
- Collision with one traffic agent after overtaking some of the other traffic agents does not give a net positive reward under any circumstances.
- Traffic agents number 2 or 3. A single traffic agent seems to let the learning agent get away with a silly fixed policy of turning left and going straight along the left edge of the track.
ProgressReward2: {scale: 1.0}
AvgSpeedReward: {scale: 1.0}
CollisionPenalty: {scale: 15.0}
TurnBackwardPenalty: {scale: 10.0}
AngAcclPenalty: {scale: 5.0, max_ang_accl: 2.0}
SuccessfulOvertakeReward: {scale: 5.0}
RankOneReward: {scale: 10.0}
/home/anirban/ray_results/PPO_madras_env_2020-01-03_21-37-58ozx2miou/checkpoint_691/checkpoint-691
Evaluating on 3 traffic agents:
==================================================
Successful race completion rate: 0.5555555555555556
Num trajectories evaluated: 117
==================================================
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 45, high: 50}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 30, high: 40}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
- ParkedAgent:
    target_speed: 50
    parking_lane_pos: {low: -0.8, high: 0.8}
    parking_dist_from_start: {low: 15, high: 25}
    collision_time_window: 1.2
    pid_settings:
      accel_pid: [10.5, 0.05, 2.8]  # a_x, a_y, a_z
      steer_pid: [5.1, 0.001, 0.000001]
      accel_scale: 1.0
      steer_scale: 0.1
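The pid_settings above feed PID controllers for acceleration and steering. A textbook sketch of the update law with the steering gains from the config (the dt value and the exact MADRaS implementation are assumptions):

```python
class PID:
    """Minimal textbook PID controller in the shape suggested by
    pid_settings above; not the actual MADRaS controller code."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt=0.02):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

steer_pid = PID(5.1, 0.001, 0.000001)  # gains from the config above
```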
The agent learns to execute a fixed policy: turn right, take the edge of the right lane, and run straight, completely ignoring the opponent positions. We have to make the agent pay attention to the opponents vector and its own lane position (trackPos).
Repeat Experiment 30 without per-overtake rewards and lower angular acceleration penalty
ProgressReward2:
  scale: 1.0
AvgSpeedReward:
  scale: 1.0
CollisionPenalty:
  scale: 10.0
TurnBackwardPenalty:
  scale: 10.0
AngAcclPenalty:
  scale: 1.0
  max_ang_accl: 2.0
RankOneReward:
  scale: 10.0
/home/anirban/ray_results/PPO_madras_env_2020-01-04_06-33-2893urk0wb/
/home/anirban/ray_results/PPO_madras_env_2020-01-04_07-01-29ixh5gtae/
/home/anirban/ray_results/PPO_madras_env_2020-01-04_09-44-16g205c9ez/
/home/anirban/ray_results/PPO_madras_env_2020-01-04_10-50-00wsgcfjst/
- /home/anirban/ray_results/PPO_madras_env_2020-01-04_06-33-2893urk0wb/checkpoint_31/checkpoint-31
- /home/anirban/ray_results/PPO_madras_env_2020-01-04_07-01-29ixh5gtae/checkpoint_242/checkpoint-242
- /home/anirban/ray_results/PPO_madras_env_2020-01-04_09-44-16g205c9ez/checkpoint_303/checkpoint-303
- /home/anirban/ray_results/PPO_madras_env_2020-01-04_10-50-00wsgcfjst/checkpoint_674/checkpoint-674
- Agent always takes left in the beginning
- Agent does not learn to avoid collision in the left lane
- Agent runs slowly in checkpoint #3 but speed increases a bit in checkpoint #4
- It does slow down when there is a car in front but does not learn to steer right to avoid collision
- I will let it train for longer
- Consider increasing the collision penalty so much that the agent can never get a positive episode reward if it makes even a single collision. Currently the agent does not shy away from colliding.
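A quick way to size such a penalty: if the positive per-step rewards are bounded, the collision penalty must exceed the maximum total positive reward an episode can accrue. A back-of-envelope helper (the bound is illustrative, not measured from the environment):

```python
def min_collision_penalty(max_step_reward, max_steps):
    """Smallest penalty guaranteeing a net-negative episode reward
    after a single collision, assuming positive rewards are capped
    at max_step_reward per step (an assumed bound)."""
    return max_step_reward * max_steps
```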
Same as Expt 31 but with very high collision penalty
/home/anirban/ray_results/PPO_madras_env_2020-01-04_19-55-16xzfuziaz/checkpoint_291/checkpoint-291
/home/anirban/ray_results/PPO_madras_env_2020-01-04_22-56-37tmpr7d_m/checkpoint_722/checkpoint-722
- Agent does not steer at all
- The learning curve does not show any growth between update steps: 291 and 722
- Agent just slows down a bit when there is a car in front but eventually collides
- Agent shows no intention of overtaking - bring back the per-overtake reward
- The 2-action space does not give the agent quick steering ability - try 3-action space to see if the agent learns to steer more freely
- The learning rate 5e-6 might be too low. Increase it 2x to 1e-5.
- Agent still does not pay attention to the trackPos and opponents - consider removing track from observation list while using 2-action space. trackPos should give sufficient information to the agent not to go OOT
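For reference, the two action parameterizations being compared can be sketched as bounds plus a clip helper. The exact bounds and semantics in MADRaS are assumptions here: the 3-action space is taken to be raw (steering, acceleration, brake), while the 2-action space drives PID setpoints (lane position, target speed):

```python
def clip_action(action, low, high):
    # Clamp each action dimension to its valid range.
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]

# Assumed bounds, not taken from the MADRaS source:
STEER_ACCEL_BRAKE = ([-1.0, 0.0, 0.0], [1.0, 1.0, 1.0])  # 3-action space
LANE_POS_SPEED    = ([-1.0, -1.0],     [1.0, 1.0])       # 2-action space
```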
To address the actionables from Experiment 32
lr: 1e-5
3-action space
rewards:
  ProgressReward2:
    scale: 1.0
  AvgSpeedReward:
    scale: 1.0
  CollisionPenalty:
    scale: 100.0
  TurnBackwardPenalty:
    scale: 10.0
  AngAcclPenalty:
    scale: 1.0
    max_ang_accl: 2.0
  SuccessfulOvertakeReward:
    scale: 5.0
  RankOneReward:
    scale: 10.0
TODO later: remove 'track' from state space
/home/anirban/ray_results/PPO_madras_env_2020-01-05_06-29-1602jz__fz/
/home/anirban/ray_results/PPO_madras_env_2020-01-05_06-29-1602jz__fz/checkpoint_181/checkpoint-181
/home/anirban/ray_results/PPO_madras_env_2020-01-05_07-10-53tfgdnjbw/checkpoint_362/checkpoint-362
/home/anirban/ray_results/PPO_madras_env_2020-01-05_07-53-19wdmej9r5/checkpoint_783/checkpoint-783
Force the agent to learn a zigzag maneuver and thus learn to steer out of the way of obstacles.
- We arrange the traffic agents in a way that the agent must first turn right and then turn left.
- We stop the agent from taking the edge of the tracks to overtake without turning by
/home/anirban/ray_results/PPO_madras_env_2020-01-05_09-39-15hwy781dy/checkpoint_311/checkpoint-311 /home/anirban/ray_results/PPO_madras_env_2020-01-05_13-00-57je1t0ldp/checkpoint_382/checkpoint-382
- Agent almost comes to a perfect halt but keeps trying to go left and does not steer right
- Maybe it will improve upon more training
- Agent remains stuck with the same behavior between update steps 311 and 382 - PPO often gets stuck in local minima. Trying to ameliorate this by increasing the learning rate 10x to 1e-4
After training with increased learning rate of 1e-4 from /home/anirban/ray_results/PPO_madras_env_2020-01-05_13-00-57je1t0ldp/checkpoint_382/checkpoint-382
we have: /home/anirban/ray_results/PPO_madras_env_2020-01-05_14-04-198iqqfx34/checkpoint_413/checkpoint-413
Same as #34 but with 3-action space
/home/anirban/ray_results/PPO_madras_env_2020-01-05_14-47-37iicixb5i/
/home/anirban/ray_results/PPO_madras_env_2020-01-05_14-47-37iicixb5i/checkpoint_1894/checkpoint-1894
- Car doesn't even learn to slow down to avoid collision.
- Car runs straight ahead at full speed and collides. Over time, it learns to gather rewards from ProgressReward2 by running faster.
Same as 34 but with modified reward function
rewards:
  ProgressReward2:
    scale: 1.0
  AvgSpeedReward:
    scale: 1.0
  CollisionPenalty:
    scale: 10.0
  TurnBackwardPenalty:
    scale: 10.0
  AngAcclPenalty:
    scale: 1.0
    max_ang_accl: 2.0
  SuccessfulOvertakeReward:
    scale: 5.0
  RankOneReward:
    scale: 10.0
/home/anirban/ray_results/PPO_madras_env_2020-01-05_22-52-440h5i8uu6/checkpoint_11/checkpoint-11
/home/anirban/ray_results/PPO_madras_env_2020-01-05_23-05-49klpltfym/checkpoint_22/checkpoint-22
Train the agent to detect imminent collision in front and change lanes to the right.
- Num traffic cars: 3
- Width of the road: -0.9 to 0.9
/home/anirban/ray_results/PPO_madras_env_2020-01-05_23-26-03xqnlpuna/
/home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/
/home/anirban/ray_results/PPO_madras_env_2020-01-05_23-26-03xqnlpuna/checkpoint_61/checkpoint-61
/home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322
Agent learns to turn right and evade collisions most of the time. But policy does not generalize to other traffic configurations.
Start from the output of Experiment 37 and finetune the policy to achieve generalization to a wide variety of traffic conditions
lr reduced 10x from #37 to 1e-6
Traffic configuration changed to the following:
Num traffic cars: 1 to 4
Traffic agents are parked to make it essential for the agent to steer.
Restored from checkpoint: /home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322 from Experiment 37.
/home/anirban/ray_results/PPO_madras_env_2020-01-06_11-59-47j4g9sq0f/checkpoint_513/checkpoint-513
- The agent barely makes any progress in learning to generalize
- Tested the agent on the traffic config of experiment 37 and saw that it can solve the environment very efficiently. This shows that the policy has not changed enough from /home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322
- Train again with higher learning rate? Maybe the minimum that /home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322 represents is too stable a solution and it is hard to bring the agent out of it.
- Train with a 50-50 blend of two traffic configurations, in one of which, the agent needs to turn right and the other, it needs to turn left
The agent is presented with two traffic conditions - one in which it has to turn right and another in which it has to turn left.
/home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/
/home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/checkpoint_1521/checkpoint-1521
- The agent learns to slow down when it detects imminent collision
- In the presence of 3 traffic agents, the agent learns to take right turn to evade the first car
- However it does not learn to turn left and hits the traffic car in the right lane
- The agent does not learn to take drastically different actions based on similar opponents input but different trackPos and track inputs
- I narrowed the track but only implemented that in the done function. The agent's trackPos and track inputs are still in their previous scales. Changing these inputs according to the new track_width might be helpful for the agent to recognize the two different conditions under which it has to turn left or right.
- Properly scale the trackPos and track inputs to conform with the modified track-width
- Use a deeper neural network, in case the network capacity is the bottleneck here
- Start from the right-turning checkpoint and finetune it only on the 2 traffic agent setting. Observe if and when the agent learns to turn left and whether it preserves its right turn behavior while doing so.
- Maybe the KL divergence cutoff of PPO is too small to allow the policy to have variance high enough to capture the nonlinearity involved in the selection of left/right turns depending on lane position of the agent - try DDPG and SAC; read up PPO from Spinning Up in RL
Same as #39 but with DDPG and 3-action space
/home/anirban/ray_results/DDPG_madras_env_2020-01-07_20-45-44dit_leh7/
/home/anirban/ray_results/DDPG_madras_env_2020-01-07_20-45-44dit_leh7/checkpoint_217/checkpoint-217
Same as #39 but with DDPG and 2-action space
/home/anirban/ray_results/DDPG_madras_env_2020-01-08_06-04-45bzrjqtd6/
/home/anirban/ray_results/DDPG_madras_env_2020-01-08_06-04-45bzrjqtd6/checkpoint_101/checkpoint-101
Stopped at 101 for a quick experiment - resume later
Retrain the policy from #39 with only 2 traffic agents and see how long it takes the agent to learn to take a left. Let's see if we can recover one policy that can do both right and left turns.
Restoring checkpoint: /home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/checkpoint_1521/checkpoint-1521
lr: 5e-6
/home/anirban/ray_results/PPO_madras_env_2020-01-08_11-33-29e1exjxo2/checkpoint_1651/checkpoint-1651
The agent does not recover from the previous behavior. Repeating with a higher learning rate and target KL divergence.
Retrain the policy from #39 with only 2 traffic agents and see how long it takes the agent to learn to take a left. Let's see if we can recover one policy that can do both right and left turns.
lr: 5e-5 kl_target: 0.03
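Expressed as an RLlib-style config fragment (illustrative; the key names are assumed to match the trainer config used in these runs):

```python
# Hypothetical override dict for the PPO trainer config:
ppo_overrides = {
    "lr": 5e-5,         # 10x higher than the 5e-6 of the previous attempt
    "kl_target": 0.03,  # loosened KL constraint to permit larger policy updates
}
```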
/home/anirban/ray_results/PPO_madras_env_2020-01-08_15-08-04yg7_ht_i/
/home/anirban/ray_results/PPO_madras_env_2020-01-08_16-51-05o_soa0ds/
/home/anirban/ray_results/PPO_madras_env_2020-01-08_18-20-24afb6eso_/
/home/anirban/ray_results/PPO_madras_env_2020-01-08_15-08-04yg7_ht_i/checkpoint_1615/checkpoint-1615
/home/anirban/ray_results/PPO_madras_env_2020-01-08_16-51-05o_soa0ds/checkpoint_1676/checkpoint-1676
/home/anirban/ray_results/PPO_madras_env_2020-01-08_18-20-24afb6eso_/checkpoint_1749/checkpoint-1749
Retrain #39 from /home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/checkpoint_301/checkpoint-301 with learning rate 1e-5. The anticipation is that earlier on in the training the policy would be more flexible to adaptation to left/right change
/home/anirban/ray_results/PPO_madras_env_2020-01-08_21-32-57wlb7e74i/checkpoint_382/checkpoint-382
/home/anirban/ray_results/PPO_madras_env_2020-01-08_23-28-28xkab3zix/checkpoint_500/checkpoint-500
The agent's behavior does not change.
The probable cause is that the agent's lane-position action was not properly adjusted to the changed trackPos limits. The trackPos limits are restricted to +/- 0.5, and lane positions outside that range cause episode termination, so the corresponding actions receive very low values. With 3 traffic agents, the agent always starts in the left lane. When it sees a car in front, every action of taking a left drives it out of track, so all left turns get devalued when there is a car in front. As a result, the agent becomes reluctant to take left turns when a car is in front.
Instead of blocking away half the action space, rescale the lane-position action's -1/1 to the trackPos limits set in the sim_options file.
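The proposed fix is a plain linear rescaling of the action range. A sketch, assuming trackPos limits of +/- 0.5 as set in sim_options:

```python
def rescale_lane_action(a, lo=-0.5, hi=0.5):
    """Map the policy's raw lane-position action a in [-1, 1] linearly
    onto the trackPos limits [lo, hi] from sim_options (limit values
    assumed), so no part of the action range maps off-track."""
    a = max(-1.0, min(1.0, a))
    return lo + (a + 1.0) * 0.5 * (hi - lo)
```

This way the full [-1, 1] action interval stays useful instead of half of it being devalued by episode terminations.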
Same as #39 with action space fixed. Now -1 and +1 correspond to track_limits specified in sim_options
/home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/
/home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311
- VOILA! The agent learns to overtake on both the left and the right.
Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311 on 2 traffic agents
==================================================
Successful race completion rate: 0.927461139896373
Num trajectories evaluated: 193
==================================================
Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311 on 3 traffic agents
==================================================
Successful race completion rate: 0.839572192513369
Num trajectories evaluated: 187
==================================================
Testing generalization...
The agent does not generalize well to more traffic agents. For 4 traffic agents it achieves a 0% successful race completion rate, and for 5 traffic agents the result is the following:
Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311 on 5 traffic agents
==================================================
Successful race completion rate: 0.03879310344827586
Num trajectories evaluated: 232
==================================================
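The completion rates reported in these evaluation blocks are simple empirical frequencies over the evaluated trajectories:

```python
def success_rate(results):
    """results: one boolean per evaluated trajectory
    (successful race completion or not)."""
    return sum(results) / len(results)
```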
Learn to generalize to more agents
4-5 traffic agent starting from /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311
/home/anirban/ray_results/PPO_madras_env_2020-01-09_22-38-51tq67mqm1/
/home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/
/home/anirban/ray_results/PPO_madras_env_2020-01-09_22-38-51tq67mqm1/checkpoint_347/checkpoint-347
/home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_968/checkpoint-968
The agent learns to navigate the 5-traffic scene with a 39% success rate but gets no success at all in the 4-traffic scenario.
Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_918/checkpoint-918 on 5 traffic agents
==================================================
Successful race completion rate: 0.3904761904761905
Num trajectories evaluated: 105
==================================================
Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_928/checkpoint-928 on 5 traffic agents
==================================================
Successful race completion rate: 0.35185185185185186
Num trajectories evaluated: 108
==================================================
Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_808/checkpoint-808 on 5 traffic agents
==================================================
Successful race completion rate: 0.2956521739130435
Num trajectories evaluated: 115
==================================================
Try training on 3 or 4 traffic agents to teach the agent to cope with a 4-traffic-agent scenario without forgetting the 3-traffic scenario.
3-4 traffic agents starting from /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311
/home/anirban/ray_results/PPO_madras_env_2020-01-10_21-44-174bw2i3gn/
/home/anirban/ray_results/PPO_madras_env_2020-01-10_21-44-174bw2i3gn/checkpoint_693/checkpoint-693
/home/anirban/ray_results/PPO_madras_env_2020-01-10_21-44-174bw2i3gn/checkpoint_562/checkpoint-562
- The agent still does not learn to navigate starting in the right lane (the 4 traffic car case)
- Although the agent aces the 3 traffic car configuration
- As 3 traffic cars appear 50% of the time, the agent gets the episode completion reward of 100, 50% of the time. The agent seems to remain content with this.
- Train the agent on 3-4 traffic cars from scratch. Starting from an expert on 3 traffic cars may have created a bias for 3/5 traffic car configurations in the model
- Present the 4 traffic configuration more often than the 3 traffic car configuration.
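The second actionable amounts to sampling the per-episode traffic count with unequal weights. A sketch (the 30/70 split is an assumed example, not a value from the log):

```python
import random

def sample_num_traffic(weights=None, rng=random):
    """Draw the number of traffic cars for the next episode,
    oversampling the harder 4-traffic configuration."""
    weights = weights or {3: 0.3, 4: 0.7}  # assumed split
    configs, w = zip(*weights.items())
    return rng.choices(configs, weights=w, k=1)[0]
```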
Train on 3-4 traffic agents from scratch
/home/anirban/ray_results/PPO_madras_env_2020-01-11_07-26-071k5vnc5m/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_10-02-047frykyio/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_11-38-42nn_hh2gg/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_07-26-071k5vnc5m/checkpoint_181/checkpoint-181
/home/anirban/ray_results/PPO_madras_env_2020-01-11_10-02-047frykyio/checkpoint_272/checkpoint-272
/home/anirban/ray_results/PPO_madras_env_2020-01-11_11-38-42nn_hh2gg/checkpoint_433/checkpoint-433
Reduce the target speed of the driving agent to 25 kmph. We don't need the agent to have a high speed while maneuvering through a narrow road with parked traffic.
Train on 3-4 traffic agents from scratch at 50 km per hour
/home/anirban/ray_results/PPO_madras_env_2020-01-11_14-48-14wj9k39q7/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_14-48-14wj9k39q7/checkpoint_121/checkpoint-121
Training gets saturated
Start from #45: /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311. Train on 4-5 traffic agents, reducing the target speed to 50 kmph.
/home/anirban/ray_results/PPO_madras_env_2020-01-11_16-38-573dk5n2ma/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_18-18-11w1fvm0st/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_16-38-573dk5n2ma/checkpoint_314/checkpoint-314
/home/anirban/ray_results/PPO_madras_env_2020-01-11_18-18-11w1fvm0st/checkpoint_425/checkpoint-425
- VOILA! The agent overcomes both 4 and 5 traffic agents with a high success rate, starting from both the left and right sides of the road.
- The agent generalizes to up to 9 traffic cars parked alternately on both sides of the road.
- The agent generalizes across different car models - even baja-bug and buggy.
- Limitations:
  - The agent collides if the traffic cars switch lanes in front of the agent
  - As the initial stretch of road in aalborg is straight, the agent does not learn to make decisions keeping in mind the turns of the road. It fails to overtake all the agents when tested in Corkscrew, which has a left turn in the initial part of the race.
Make the driver in #50 /home/anirban/ray_results/PPO_madras_env_2020-01-11_16-38-573dk5n2ma/checkpoint_314/checkpoint-314 learn to generalize to multiple tracks by training it on 50-50 4-5 traffic and 50-50 aalborg-corkscrew.
/home/anirban/ray_results/PPO_madras_env_2020-01-11_22-05-24d3eqobre/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_22-05-24d3eqobre/checkpoint_945/checkpoint-945
/home/anirban/ray_results/PPO_madras_env_2020-01-12_07-14-03nwuhirj8/checkpoint_956/checkpoint-956
==================================================
STOCHASTIC ACTIONS
==================================================
Train an agent to drive around corkscrew with N(0, 0.1) Gaussian action noise and 3 action space
/home/anirban/ray_results/PPO_madras_env_2020-01-13_05-44-367hvs2_e9/
/home/anirban/ray_results/PPO_madras_env_2020-01-13_05-44-367hvs2_e9/checkpoint_751/checkpoint-751
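The Gaussian action noise in this section amounts to perturbing each action dimension before it reaches the simulator. A sketch (clipping back to the valid bounds is an assumption about the implementation):

```python
import random

def noisy_action(action, sigma=0.1, low=-1.0, high=1.0, rng=random):
    """Add N(0, sigma) noise per action dimension, then clamp to the
    valid range. sigma=0.1 and sigma=0.5 match the experiments here;
    the clamping bounds are assumed."""
    return [min(max(a + rng.gauss(0.0, sigma), low), high) for a in action]
```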
Train an agent to drive around corkscrew with observation noise and no action noise
/home/anirban/ray_results/PPO_madras_env_2020-01-13_07-42-41eom5mt1d/
/home/anirban/ray_results/PPO_madras_env_2020-01-13_09-36-26v7fotu2_/
/home/anirban/ray_results/PPO_madras_env_2020-01-13_07-42-41eom5mt1d/checkpoint_68/checkpoint-68
/home/anirban/ray_results/PPO_madras_env_2020-01-13_09-36-26v7fotu2_/checkpoint_599/checkpoint-599
Train an agent to drive around corkscrew with N(0, 0.5) Gaussian action noise and 3 action space
/home/anirban/ray_results/PPO_madras_env_2020-01-13_11-25-330k3tfabm/
/home/anirban/ray_results/PPO_madras_env_2020-01-13_11-25-330k3tfabm/checkpoint_2172/checkpoint-2172
Train an agent to drive around corkscrew with N(0, 0.1) Gaussian action noise and observation noise and 3 action space
/home/anirban/ray_results/PPO_madras_env_2020-01-13_16-54-209wqonv6u/
/home/anirban/ray_results/PPO_madras_env_2020-01-13_16-54-209wqonv6u/checkpoint_1141/checkpoint-1141
Baseline experiment for this section: train an agent to drive around corkscrew with no observation or action noise.
/home/anirban/ray_results/PPO_madras_env_2020-01-13_21-03-554xigbvl1/
/home/anirban/ray_results/PPO_madras_env_2020-01-13_21-03-554xigbvl1/checkpoint_1377/checkpoint-1377
The agent goes off track due to timeout. Increasing max-steps to 15000
/home/anirban/ray_results/PPO_madras_env_2020-01-14_07-08-089zxr4g7_/
/home/anirban/ray_results/PPO_madras_env_2020-01-14_07-08-089zxr4g7_/checkpoint_671/checkpoint-671
TODO(santara)
- Engine characteristic graph (torque@rpm) for all cars - modify the code for full automation
- Plot the training graphs of the first 2 experiments-------------------------------------------------------------> done
- Train a single agent to drive 1 4WD and 1 RWD cars / 1 new car and buggy. --------------------------------------> done
- Train a policy just on buggy to see the decision profile. -----------------------------------------------------> done
- Extract drive-train and CG information--------------------------------------------------------------------------> done
- Read the curriculum learning for robotics papers and write the curriculum learning section of the MADRaS paper--> done
- Add context to observation and dilated-conv/recurrent policy
- While training in traffic, there should be a minimum number of traffic agents-----------------------------------> done