@Santara
Created January 16, 2020 12:53

Experiment 1

Motivation

Train a single agent to drive three cars in Alpine

Details

  • Action space: steering-acceleration-brake
  • Car models:
    • car1-stock1
    • acura-nsx-sz
    • car1-trb1
  • Track: Alpine
  • Neural network architecture: PPO default
  • Training algorithm: PPO
  • max_steps: 1000
  • noisy observations
  • noisy actions

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-16_14-22-20bsqfy716/checkpoint_221/checkpoint-221

Issues

  • max_steps was set to 1000
  • track length and width were wrong

Experiment 2

Motivation

Train a single agent to drive three cars in Alpine

Details

  • Action space: steering-acceleration-brake
  • Car models:
    • car1-stock1
    • acura-nsx-sz
    • car1-trb1
  • Track: Alpine
  • Neural network architecture: PPO default
  • Training algorithm: PPO
  • max_steps: 25000
  • noisy observations
  • noisy actions

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-16_16-00-21vj0or2ia/checkpoint_1671/checkpoint-1671

Experiment 3

Motivation

Train a single agent to drive three cars in Alpine

Details

  • Action space: steering-acceleration-brake
  • Car models:
    • car1-stock1
    • car1-stock2
    • acura-nsx-sz
    • car1-trb1
    • p406
    • 155-DTM
  • Track: Alpine
  • Neural network architecture: PPO default
  • Training algorithm: PPO
  • max_steps: 25000

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-16_20-31-43ilasyeke/checkpoint_51/checkpoint-51

Fate

Experiment failed due to NaN reward.

  • Added a filter for NaN rewards in reward_manager
  • Placed trainer.train() in a try-except block to catch errors and save the checkpoint (see the sketch below)
  • Added "ignore_worker_failures" to the training config to see what happens when training continues even after worker failures.

New Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-17_04-51-45zqcjhn23/

/home/anirban/ray_results/PPO_madras_env_2019-12-17_04-51-45zqcjhn23/checkpoint_842/checkpoint-842

Observations

  • p406 has immense body roll
  • Cars are still wobbly (some more than others)
  • All 6 cars can finish the lap in Alpine-1, but they do not generalize well to Spring
  • The policy does not generalize to buggy

Experiment 4

Motivation

Single car all roads

Details

  • Action space: steering-acceleration-brake
  • Car models:
    • car1-stock1
  • Roads:
    • aalborg
    • alpine-1
    • alpine-2
    • brondehach
    • g-track-1
    • g-track-2
    • g-track-3
    • corkscrew
    • eroad
    • e-track-2
    • e-track-3
    • e-track-4
    • e-track-6
    • forza
    • ole-road-1
    • ruudskogen
    • street-1
    • wheel-1
    • wheel-2
    • spring

P.S. "e-track-1" gives Segmentation fault from torcs: no track observations after about 1000 steps

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-17_12-26-490nxx9rdv/checkpoint_48/checkpoint-48
/home/anirban/ray_results/PPO_madras_env_2019-12-17_14-24-06q9xr0gg9/checkpoint_78/checkpoint-78
/home/anirban/ray_results/PPO_madras_env_2019-12-17_15-46-019zdm3e92/checkpoint_209/checkpoint-209
/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-11-22a2tpiv5s/checkpoint_344/checkpoint-344
/home/anirban/ray_results/PPO_madras_env_2019-12-17_17-16-151sfd7wsu/checkpoint_1277/checkpoint-1277

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_12-26-490nxx9rdv/checkpoint_48/checkpoint-48")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_14-24-06q9xr0gg9/checkpoint_78/checkpoint-78")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_15-46-019zdm3e92/checkpoint_209/checkpoint-209")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-11-22a2tpiv5s/checkpoint_344/checkpoint-344")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-35-32a8nef0xr/checkpoint_372/checkpoint-372")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-46-52xbot_yvb/checkpoint_407/checkpoint-407")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_16-58-32oxheh0v5/checkpoint_492/checkpoint-492")

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_17-11-43r7e5myyo/checkpoint_500/checkpoint-500")

Removing aalborg

trainer.restore("/home/anirban/ray_results/PPO_madras_env_2019-12-17_17-16-151sfd7wsu/checkpoint_1277/checkpoint-1277")

Experiment 5

Motivation

A single car learning from scratch to drive among up to 9 parked traffic cars

Track

Alpine-1

Ego vehicle model

car1-stock1

Action space

steering-acceleration-brake

Experiment directory

/home/anirban/ray_results/PPO_madras_env_2019-12-17_20-51-25wwi1tpuj/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-17_20-51-25wwi1tpuj/checkpoint_641/checkpoint-641

Observations

  • Car learns to drive forward
  • Car sometimes also brakes to prevent collision with the car in front but collides most of the time
  • Car always chooses to take the left lane to overtake. It is a default first step behavior. This causes the agent to hit the car in front, especially when it initializes in the right lane.
  • A curriculum is necessary to teach the agent four things: learning to drive on an open track, learning to prevent front collision, learning to prevent rear-end collision, and learning to overtake

Experiment 6

Motivation

Training buggy to drive in Alpine-1

Edits to xml

The front left and right brake disk diameters were originally 250 mm, which was out of bounds because the rim diameter is 200 mm. To prevent this error, I reduced the disk diameters to 200 mm.

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-18_06-15-44v5_kk1jo/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-18_06-15-44v5_kk1jo/checkpoint_211/checkpoint-211

Observations

  • The buggy learns to run, but its top speed is only ~70 kmph although the target speed is 100 kmph
  • The buggy learns to drift at tight corners but sometimes turns backwards in the process
  • Policy does not generalize to spring. The agent drives off track right at the beginning. Clearly the agent had memorized Alpine-1
  • Policy does not generalize to other cars. This is because buggy has very different engine characteristics than other cars. Buggy has exceptionally high torque at low RPMs. This is what makes it so difficult to control. The only other car that it can even start is the baja-bug

Experiment 7

Motivation

Train one agent to drive both buggy and car1-stock1 in Alpine-1. Can an RL agent learn this with the SingleAgentSimpleLapObs observations, steer-accel-brake actions, and the following reward settings?

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 10.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 5.0 max_ang_accl: 2.0

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-18_11-23-20sj0j59xe/
/home/anirban/ray_results/PPO_madras_env_2019-12-18_21-45-191x6ogb6f/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-18_11-23-20sj0j59xe/checkpoint_691/checkpoint-691

Retraining from the above checkpoint: /home/anirban/ray_results/PPO_madras_env_2019-12-18_21-45-191x6ogb6f/checkpoint_1302/checkpoint-1302

Observations

  • The policy can start both the cars
  • Buggy is more stable and lives longer with this policy than a policy that is trained only on car1-stock1
  • car1-stock1 runs smoothly and completes the lap, but its lateral motion is not completely smooth
  • The lateral motion is particularly vigorous in case of buggy and this is what sets it off track
  • Training longer with more angular acceleration penalty might help
  • After training longer, car1-stock1 can complete the race most of the time but buggy drives very poorly
  • TODO(santara) Visualize and evaluate more checkpoints from /home/anirban/ray_results/PPO_madras_env_2019-12-18_21-45-191x6ogb6f/

Experiment 8

Motivation

Train one agent to drive both buggy and car1-stock1 in Alpine-1. Can an RL agent learn this with the SingleAgentSimpleLapObs observations, lanepos-speed actions, and the following reward settings?

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 10.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 5.0 max_ang_accl: 2.0

The conjecture is that when the engine responses of the two cars are drastically different, a low-level controller can handle the engine commands while a high-level controller issues a uniform set of desired values for both cars.

P.S. The PID parameters were actually tuned for car1-trb1. An improvement would be to tune the parameters for each of the cars.
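A minimal sketch of what such a low-level controller could look like: the policy outputs a desired lane position and speed, and PID loops convert them to steer/accel/brake. The gains, observation keys, and clipping are placeholders for illustration, not the actual MADRaS PID implementation.

```python
class SimplePID:
    """Textbook PID loop; gains below are placeholders, not MADRaS' tuned values."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt=0.02):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


steer_pid = SimplePID(kp=5.1, ki=0.001, kd=1e-6)
speed_pid = SimplePID(kp=10.5, ki=0.05, kd=2.8)


def low_level_control(desired_lane_pos, desired_speed, obs):
    """Map the 2-action command (lane position, speed) to steer/accel/brake."""
    steer = steer_pid.step(desired_lane_pos - obs["trackPos"])
    throttle = speed_pid.step(desired_speed - obs["speedX"])
    accel = max(0.0, min(1.0, throttle))   # positive throttle -> accelerator
    brake = max(0.0, min(1.0, -throttle))  # negative throttle -> brake
    return steer, accel, brake
```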

PID latency

5

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_

After resuming from: /home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_/checkpoint_1031/checkpoint-1031 :

/home/anirban/ray_results/PPO_madras_env_2019-12-20_15-01-31g5dbqhfr

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_/checkpoint_1031/checkpoint-1031
/home/anirban/ray_results/PPO_madras_env_2019-12-20_07-59-33uqlymzk_/checkpoint_411/checkpoint-411

Observations

  • VOILA! This agent is so good that it can drive almost every car in TORCS!
    • It can drift Buggy and Baja-Bug! Seriously!
  • Average speed is different for different cars
  • Acura-NSX-SZ and car1-stock2 cannot drift well with the current PID controller, so the agent learns to stop them and take sharp turns slowly
  • We should study the engine response of p406 cars. They behave very differently from the other agents
  • kc-daytona, kc-2000gt, and kc-giulietta drift well at low speeds but go out of control while drifting at speeds ~100kmph
  • The straight stretch of Alpine-1 is muddy and the wheels get stuck. The agent actually learns to stop, downshift and gather the torque necessary to overcome the mud - a behavior that was never observed in the 3-action space agents
  • Agent chooses different speed ranges for different cars - how does it know???
  • Agent finds it difficult to start some of the newer cars off track

Experiment 9

Motivation

2-action space, 4 traffic agents

PID latency

5

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-20_17-53-147rou1mfy/

Restoring from /home/anirban/ray_results/PPO_madras_env_2019-12-20_17-53-147rou1mfy/checkpoint_21/checkpoint-21:

/home/anirban/ray_results/PPO_madras_env_2019-12-20_19-40-30jw5hac_0/

Checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-20_17-53-147rou1mfy/checkpoint_21/checkpoint-21
/home/anirban/ray_results/PPO_madras_env_2019-12-20_19-40-30jw5hac_0/checkpoint_72/checkpoint-72
/home/anirban/ray_results/PPO_madras_env_2019-12-20_21-52-32s6vbfj4u/checkpoint_213/checkpoint-213

Observations

Experiment 10

Motivation

To drive buggy on Alpine with lanepos-vel control

PID latency

5

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-23_15-10-29v4er474i/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-23_15-10-29v4er474i/checkpoint_201/checkpoint-201

Observation

Pretty good performance, but it needs some more training. Sometimes goes off track at the sharpest bend.

Experiment 11

Motivation

To drive car1-stock1 in Alpine with lanepos-vel control

PID latency

5

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-23_21-09-08utsi2qkt/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-23_21-09-08utsi2qkt/checkpoint_1041/checkpoint-1041

Observations

  • The maneuvers of the car are not smooth:
    • Pointless lane switches at the beginning
    • Car struggles to accelerate off the start line and navigate through the straight muddy stretch later in the track
    • Corners are not smooth drifts. The car slows down, halts and goes. It even goes off track while going around the tight corners.
    • Agent does not learn to take tight corners to reduce distance travelled, unlike the one trained on steer-accel-brake

Experiment 12

Motivation

Train an agent to drive car1-stock1 in Alpine-1 using 3-action space

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/checkpoint_701/checkpoint-701

Experiment 19

Motivation

Train a single agent to drive car1-stock1 on spring from scratch

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-25_10-34-225wglbdwh/
/home/anirban/ray_results/PPO_madras_env_2019-12-25_11-25-573rhqmu92/
/home/anirban/ray_results/PPO_madras_env_2019-12-29_23-49-20r7a3oaik/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-25_10-34-225wglbdwh/checkpoint_41/checkpoint-41
/home/anirban/ray_results/PPO_madras_env_2019-12-25_11-25-573rhqmu92/checkpoint_1882/checkpoint-1882
/home/anirban/ray_results/PPO_madras_env_2019-12-29_23-49-20r7a3oaik/checkpoint_3633/checkpoint-3633

Experiment 20

Curriculum learning for driving in Spring. Agent pre-trained in Alpine-1. Training resumed without decreasing learning rate from: /home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/checkpoint_701/checkpoint-701

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_1402/checkpoint-1402

Observations

  • /home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_1402/checkpoint-1402 is a bad checkpoint: the agent drives some distance and then stands still
  • Average speed low
  • Lots of sidewise motion

Better checkpoints: /home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_792/checkpoint-792, /home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_782/checkpoint-782, /home/anirban/ray_results/PPO_madras_env_2019-12-25_15-43-37emh90ba9/checkpoint_772/checkpoint-772

Possible improvements

  • Refer to earlier experiment
  • Increase jerk penalty
  • Decrease the learning rate

Experiment 21

Noting this in the paper

Motivation

Second attempt at curriculum learning for car1-stock1 in spring after starting from /home/anirban/ray_results/PPO_madras_env_2019-12-24_12-12-15wk33ecf3/checkpoint_701/checkpoint-701 with lower learning rate of 1e-6

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-25_21-11-49kvt17rg2/
/home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-25_21-11-49kvt17rg2/checkpoint_1112/checkpoint-1112

/home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/checkpoint_2923/checkpoint-2923

Observation

Healthy training - train for some more time starting from /home/anirban/ray_results/PPO_madras_env_2019-12-25_21-11-49kvt17rg2/checkpoint_1112/checkpoint-1112

Good checkpoint, but it still can't handle the big L-shaped left turns: /home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/checkpoint_2923/checkpoint-2923

  • Very high speeds; coercing episode length to 40000 actually helped
  • The agent drives very smoothly in alpine-1 with barely any sidewise movement
  • But the agent fails to start off in corkscrew :(
    • The success of a policy is highly dependent on the initial state distribution.
    • The agent brakes hard when it senses a bend ahead.
      • Add temporal context to the observation; this should solve the problem (see the sketch after this list). The same problem occurs with /home/anirban/ray_results/PPO_madras_env_2019-12-26_20-00-167zm0hxxb/checkpoint_2673/checkpoint-2673
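One possible way to add temporal context, as suggested in the last bullet, is to stack the last few observation vectors with a wrapper; this is only an assumed sketch, not what was actually implemented in MADRaS.

```python
from collections import deque

import gym
import numpy as np


class TemporalContextWrapper(gym.ObservationWrapper):
    """Stack the last `k` flat observations so the policy can see short-term
    history (e.g. whether it has already started slowing for a bend)."""

    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = deque(maxlen=k)
        low = np.tile(env.observation_space.low, k)
        high = np.tile(env.observation_space.high, k)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        for _ in range(self.k):
            self.frames.append(obs)
        return self.observation(obs)

    def observation(self, obs):
        self.frames.append(obs)
        return np.concatenate(self.frames).astype(np.float32)
```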

Remark: the agent must learn to slow down for the L-curve. Train the agent longer with larger max_steps.

TRAIN AN AGENT ON ALPINE-1 AND CORKSCREW FOR IT TO GENERALIZE TO SPRING. ALPINE HAS SHARP RIGHT TURNS BUT NO SHARP LEFT TURN. CORKSCREW HAS A SHARP LEFT TURN. AGENTS TRAINED ON ALPINE-1 USUALLY GO OFF TRACK IN SPRING AROUND 15000 STEPS WHEN IT ENCOUNTERS THE SHARP LEFT TURN. AN AGENT TRAINED ON BOTH ALPINE-1 AND CORKSCREW WILL BE ABLE TO LEARN TO GENERALIZE TO THE WHOLE OF SPRING.

Experiment 23

Motivation

Train an agent to drive car1-stock1 in alpine-1 and corkscrew

Present a 50-50 blend of the two tracks using random track assignment in MADRaS (see the sketch below).
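A sketch of how the 50-50 random track assignment could be done at episode reset; the sampling function and the point where MADRaS consumes the track name are assumptions for illustration.

```python
import random

# 50-50 blend of the two tracks; weights could be changed for other blends.
TRACK_BLEND = {"alpine-1": 0.5, "corkscrew": 0.5}

def sample_track(blend=TRACK_BLEND):
    tracks, weights = zip(*blend.items())
    return random.choices(tracks, weights=weights, k=1)[0]

# e.g. called inside the env's reset() before the TORCS race is (re)started:
# self.track_name = sample_track()
```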

Training directories

/home/anirban/ray_results/PPO_madras_env_2019-12-26_08-05-124ejof36t/checkpoint_11/checkpoint-11
/home/anirban/ray_results/PPO_madras_env_2019-12-26_13-43-22mpipeiml/checkpoint_314/checkpoint-314

This is the best checkpoint so far, but it has some issues:

  • It can solve both alpine-1 and corkscrew about 10% of the time
  • It cannot solve Spring. It goes off track at the very first right hairpin bend.

/home/anirban/ray_results/PPO_madras_env_2019-12-26_14-35-37pzu8g3uf/checkpoint_745/checkpoint-745
/home/anirban/ray_results/PPO_madras_env_2019-12-26_16-03-36mg86dj5p/checkpoint_1606/checkpoint-1606

EVALUATION PENDING

Experiment 24

Motivation

Experiment #23 does not converge well. We are going to try a curriculum learning strategy in which we will finetune the agent trained on alpine-1 to generalize to corkscrew (train with a 50-50 blend of tracks) with a lower learning rate of 1e-5
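A hedged sketch of the finetuning setup this describes: rebuild the PPO trainer with the lower learning rate, restore the Alpine-1 checkpoint, and continue training on the 50-50 track blend. The checkpoint path placeholder and the rest of the config are assumptions.

```python
from ray.rllib.agents.ppo import PPOTrainer

ALPINE1_CHECKPOINT = "/path/to/alpine-1/checkpoint"  # placeholder, not the real path

# Only the learning rate is shown; all other settings are assumed to match the
# original Alpine-1 run, and "madras_env" is assumed to be registered elsewhere.
trainer = PPOTrainer(env="madras_env", config={"lr": 1e-5})
trainer.restore(ALPINE1_CHECKPOINT)

for _ in range(100):  # finetune on the 50-50 alpine-1/corkscrew blend
    trainer.train()
```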

Training checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-26_17-44-493m9bbtyy/checkpoint_1358/checkpoint-1358

Better checkpoints:

/home/anirban/ray_results/PPO_madras_env_2019-12-26_17-44-493m9bbtyy/checkpoint_1202/checkpoint-1202
/home/anirban/ray_results/PPO_madras_env_2019-12-26_17-44-493m9bbtyy/checkpoint_1192/checkpoint-1192

The agent still goes off track at the tight S bend in corkscrew.

Experiment 25

Motivation

An agent trained in Alpine-1 doesn't generalize well to corkscrew because it never learns to negotiate sharp L-shaped left turns. Training on a 50-50 blend of corkscrew and alpine-1 also does not generalize well. Initializing with alpine-1 and finetuning on a 50-50 blend of corkscrew and alpine-1 does not fare well either. The motivation of this experiment is to test how effective pre-training on corkscrew and finetuning on alpine-1 is.

Training directory

Training car1-stock1 on corkscrew from scratch:

/home/anirban/ray_results/PPO_madras_env_2019-12-27_12-04-23z77stse8/

corkscrew policy trained from scratch:

/home/anirban/ray_results/PPO_madras_env_2019-12-27_12-04-23z77stse8/checkpoint_561/checkpoint-561

Observations

WOW! This policy can actually complete Spring 43% of the time. This suggests that corkscrew is a closer MDP to Spring than Alpine-1 is.

==================================================
Car: car1-stock1
Track: spring
Average distance covered: 0.6523207230607263
Average speed: 90.70068030268457
Successful race completion rate: 0.43243243243243246
Num trajectories evaluated: 37
==================================================

Experiment 26

Motivation

For training on Spring, instead of starting from the Alpine-1 policy, start with the corkscrew-trained policy /home/anirban/ray_results/PPO_madras_env_2019-12-27_12-04-23z77stse8/checkpoint_561/checkpoint-561 and see if it can achieve a high race completion rate on Spring within 1000 steps of training with learning rate 5e-7.

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-27_18-01-52dbzhq7id/

Checkpoints

/home/anirban/ray_results/PPO_madras_env_2019-12-27_18-01-52dbzhq7id/checkpoint_2502/checkpoint-2502

Experiment 27

Motivation

Teach an agent pre-trained on alpine-1 with the 2-action space to drive with at most 2 traffic cars.

First phase of training

1 or 2 traffic cars with 50-50 probability

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-28_10-03-21lqpv40b2/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-28_10-03-21lqpv40b2/checkpoint_21/checkpoint-21
/home/anirban/ray_results/PPO_madras_env_2019-12-28_11-11-08u06e4bkq/checkpoint_82/checkpoint-82

Observations

3 Traffic agents:

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

    ==================================================
    Successful race completion rate: 0.5844155844155844
    Num trajectories evaluated: 77
    ==================================================

Evaluating on 4 traffic agents:

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 55, high: 60; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

    ==================================================
    Successful race completion rate: 0.4067796610169492
    Num trajectories evaluated: 59
    ==================================================

    ==================================================
    Successful race completion rate: 0.3884297520661157
    Num trajectories evaluated: 121
    ==================================================

  • As the episode length is very short (mostly under 500 steps, average 200 steps), too many rollouts (~30-50) need to be done to fill a train batch of size 4000. Reduce the train_batch_size next time (see the config sketch after this list)
  • Increase collision penalty?
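A sketch of how the smaller train batch could be configured in RLlib; the values mirror the "second phase of training" settings noted below (train_batch_size 1000, lr 5e-6), and the remaining keys are assumed defaults.

```python
from ray.rllib.agents.ppo import PPOTrainer

config = {
    "train_batch_size": 1000,        # was 4000; episodes average ~200 steps
    "lr": 5e-6,
    "ignore_worker_failures": True,  # keep training through TORCS worker crashes
}
# "madras_env" is assumed to be registered with tune.register_env() elsewhere.
trainer = PPOTrainer(env="madras_env", config=config)
```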

Observations specific to the evaluation with 4 traffic agents:

  1. The agent learns to choose the left or the right lane as required in most of the episodes
  2. The agent does not slow down to prevent an imminent collision: could be due to high PID latency? Reduce PID latency and check.
  3. The agent sometimes fails to judge the gap between the parked car in front and the edge of the track and makes an unnecessary lane change which causes it to collide with the car in the front
  4. The agent does not speed up to prevent a collision at the rear end.
  5. The agent learns to evade a parked car in the right lane but fails to evade a parked car in the left lane most of the time. This is probably because the agent was trained with 1 or 2 traffic agents. TORCS starts assigning lanes from the right. When there is only one traffic agent, the learning agent is assigned the left lane. If the learning agent always runs forward in its (left) lane, the traffic agent blocks its way roughly 50% of the time, so the remaining 50% of the time the agent can achieve Rank 1 without having to overtake even once. When there are 2 traffic agents, the learning agent is assigned the right lane by TORCS. In that case, almost 100% of the time the learning agent has a traffic agent parked in front of it in the right lane, and it has to learn to dodge it by switching to the left lane. That's why the agent learns to evade collisions significantly better in the right lane than the left.

Second phase of training

Starting with /home/anirban/ray_results/PPO_madras_env_2019-12-28_11-11-08u06e4bkq/checkpoint_82/checkpoint-82

train_batch_size = 1000 (down 4x), lr = 5e-6 (down 10x), CollisionPenalty scale = 20

We train with 2 traffic agents at all times.

Training checkpoint for this part: /home/anirban/ray_results/PPO_madras_env_2019-12-28_15-12-41iu1gxm4t/checkpoint_433/checkpoint-433

Even better: /home/anirban/ray_results/PPO_madras_env_2019-12-28_15-12-41iu1gxm4t/checkpoint_373/checkpoint-373

Evaluating on the same 3 traffic agents as in the previous evaluation:

==================================================
    Successful race completion rate: 0.6666666666666666

    Num trajectories evaluated: 87

==================================================

Evaluating on the same 4 traffic agents as in the previous evaluation:

==================================================
    Successful race completion rate: 0.2717391304347826

    Num trajectories evaluated: 92

==================================================

Resuming from /home/anirban/ray_results/PPO_madras_env_2019-12-28_15-12-41iu1gxm4t/checkpoint_373/checkpoint-373 we train with 3 traffic agents

/home/anirban/ray_results/PPO_madras_env_2019-12-28_20-59-05ug3w7bfi/checkpoint_514/checkpoint-514

Evaluating on the same 3 traffic agents as in the previous evaluation:

==================================================
    Successful race completion rate: 0.7613636363636364

    Num trajectories evaluated: 88

==================================================

Evaluating on the same 4 traffic agents as in the previous evaluation:

==================================================
    Successful race completion rate: 0.22105263157894736

    Num trajectories evaluated: 95

==================================================

Resuming from the above with 4 traffic agents, no randomization: /home/anirban/ray_results/PPO_madras_env_2019-12-29_00-01-40us412r8m/checkpoint_975/checkpoint-975

The agent goes off track (OOT) to the right most of the time. The agent does not know how to slow down when the car is about to go off track.

We hypothesize that training on a 50-50 blend of 3 and 4 traffic agents starting from /home/anirban/ray_results/PPO_madras_env_2019-12-28_20-59-05ug3w7bfi/checkpoint_514/checkpoint-514 might help ameliorate this overfitted behavior of overtaking from the right:

/home/anirban/ray_results/PPO_madras_env_2019-12-29_08-21-48cadf4q4z/checkpoint_645/checkpoint-645
/home/anirban/ray_results/PPO_madras_env_2019-12-29_08-21-48cadf4q4z/checkpoint_575/checkpoint-575

The same problem persists. The agent prefers going off track to stopping to make room to overtake. The agent does not learn to slow down to avoid collision either. Collisions are very frequent with this checkpoint, both with 3 and 4 traffic agents.

Next steps: retrain from /home/anirban/ray_results/PPO_madras_env_2019-12-28_20-59-05ug3w7bfi/checkpoint_514/checkpoint-514 with reduced ProgressReward2 and AngAcclPenalty

Before:

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 20.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 8.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 10.0

Now:

  • ProgressReward2: scale: 0.25
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 20.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 1.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 10.0

/home/anirban/ray_results/PPO_madras_env_2019-12-29_11-25-38gbw6ppb7/
/home/anirban/ray_results/PPO_madras_env_2019-12-29_11-25-38gbw6ppb7/checkpoint_705/checkpoint-705

  • The agent does not stop to prevent going off track
  • The agent does not brake enough to respond to a braking agent in the front
  • We must give context to the agent

REMOVE AVERAGE SPEED REWARD BECAUSE IT ONLY APPLIES WHEN THE AGENT RUNS THE WHOLE LENGTH OF THE TRACK.

As a next step, try introducing a penalty for going off track.

Experiment 28

Motivation

Train an agent to drive with 1 to 3 traffic cars from scratch

Settings

lr: 5e-6, train_batch_size: 1000

Training directory

/home/anirban/ray_results/PPO_madras_env_2019-12-30_15-06-5733u1nyhs/

Checkpoint

/home/anirban/ray_results/PPO_madras_env_2019-12-30_15-06-5733u1nyhs/checkpoint_381/checkpoint-381

Observation

Evaluation results with 3 traffic cars:

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

    ==================================================
    Successful race completion rate: 0.5238095238095238
    Num trajectories evaluated: 105
    ==================================================

With 4 traffic agents:

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 55, high: 60; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

    ==================================================
    Successful race completion rate: 0.37894736842105264
    Num trajectories evaluated: 95
    ==================================================

  • The agent only learns to turn left and run straight, and it attempts to execute exactly the same behavior in every episode. This is likely because with 1 traffic agent (as in 33.33% of the episodes), the learning agent can exploit this policy and finish the race, which gives the policy extra leverage.
  • Per-overtake rewards must be decoupled from the race-over reward. Currently, CollisionPenalty and SuccessfulOvertakeReward have the same scale, so if the agent overtakes 2 traffic cars and collides with the third, it still gets a net positive reward (see the illustration after this list). Hence the agent does not care to learn to steer right in order to avoid collision in the left lane.
  • Needs more training to learn the multi-modal behavior.
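An illustrative calculation of the decoupling problem described above, assuming both the per-overtake reward and the collision penalty have scale 10:

```python
OVERTAKE_REWARD = 10.0     # assumed scale, equal to the collision penalty
COLLISION_PENALTY = 10.0

# Two overtakes followed by a collision with the third car:
net_reward = 2 * OVERTAKE_REWARD - COLLISION_PENALTY
print(net_reward)  # +10.0 -> the episode is still net positive despite the crash
```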

Starting from /home/anirban/ray_results/PPO_madras_env_2019-12-30_15-06-5733u1nyhs/checkpoint_381/checkpoint-381 with 3 traffic cars at all times:

/home/anirban/ray_results/PPO_madras_env_2019-12-30_20-37-12zhryck8b/checkpoint_1762/checkpoint-1762

Same three traffic agents as the previous evaluation:

==================================================
    Successful race completion rate: 0.5096153846153846

    Num trajectories evaluated: 104

==================================================

Same 4 traffic agents as the previous evaluation:

==================================================
    Successful race completion rate: 0.35789473684210527

    Num trajectories evaluated: 95

==================================================

Analysis of why an agent trained on Alpine-1 with 3-action space fails to take off in some race tracks

Track specific analysis

The agent brakes whenever there is a (slight) left turn in front or when it is closer to the right edge than the left. In Alpine-1 this happens while the car is already in motion, so the agent manages to stay in motion and reach a different state where it does not brake. However, on tracks where the slight left turn needs to be taken at zero or low speed, braking brings the agent to a halt; as a result, the agent gets stuck in the braking state and never comes out of it.

  • corkscrew: Agent sees a left turn at the start line and brakes
  • aalborg: Agent comes close to the right edge of the track at the start line and brakes
  • e-track-4: Agent comes closer (not very close though) to the right edge of the track at the start line and brakes
  • g-track-1: like corkscrew
  • g-track-2: like aalborg and e-track-4
  • g-track-3: right turn in front
  • forza: like aalborg and e-track-4
  • wheel-1: like aalborg and e-track-4
  • brondehach: like aalborg and e-track-4

Experiment 30

Motivation

This is a continuation of experiment 27 and 28 with modified learning settings:

  1. The reward structure has been changed to decouple the per-overtake reward from the race-completed reward.
  2. Collision with one traffic agent after overtaking some of the other traffic agents does not give a net positive reward under any circumstances.
  3. There are 2 or 3 traffic agents. A single traffic agent seems to be lenient toward the agent adopting a trivial fixed policy of turning left and going straight along the left edge of the track.

Reward function

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 15.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 5.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 5.0
  • RankOneReward: scale: 10.0

Training directory

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-03_21-37-58ozx2miou/checkpoint_691/checkpoint-691

Evaluating on 3 traffic agents:

==================================================
    Successful race completion rate: 0.5555555555555556

    Num trajectories evaluated: 117

==================================================
  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 45, high: 50; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1
  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 30, high: 40; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1
  • ParkedAgent: target_speed: 50; parking_lane_pos: low: -0.8, high: 0.8; parking_dist_from_start: low: 15, high: 25; collision_time_window: 1.2; pid_settings: accel_pid: [10.5 (a_x), 0.05 (a_y), 2.8 (a_z)], steer_pid: [5.1, 0.001, 0.000001]; accel_scale: 1.0; steer_scale: 0.1

Observations:

The agent learns to execute a fixed policy: turn right, take the edge of the right lane, and run straight. It completely ignores the opponent positions. We have to make the agent pay attention to the opponents vector and its own lane position (trackPos).

Experiment 31

Motivation

Repeat Experiment 30 without per-overtake rewards and with a lower angular acceleration penalty

Reward function

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 10.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 1.0, max_ang_accl: 2.0
  • RankOneReward: scale: 10.0

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-04_06-33-2893urk0wb/
/home/anirban/ray_results/PPO_madras_env_2020-01-04_07-01-29ixh5gtae/
/home/anirban/ray_results/PPO_madras_env_2020-01-04_09-44-16g205c9ez/
/home/anirban/ray_results/PPO_madras_env_2020-01-04_10-50-00wsgcfjst/

Evaluated checkpoints

  1. /home/anirban/ray_results/PPO_madras_env_2020-01-04_06-33-2893urk0wb/checkpoint_31/checkpoint-31
  2. /home/anirban/ray_results/PPO_madras_env_2020-01-04_07-01-29ixh5gtae/checkpoint_242/checkpoint-242
  3. /home/anirban/ray_results/PPO_madras_env_2020-01-04_09-44-16g205c9ez/checkpoint_303/checkpoint-303
  4. /home/anirban/ray_results/PPO_madras_env_2020-01-04_10-50-00wsgcfjst/checkpoint_674/checkpoint-674

Observation

  • Agent always takes left in the beginning
  • Agent does not learn to avoid collision in the left lane
  • Agent runs slowly in checkpoint #3 but speed increases a bit in checkpoint #4
  • It does slow down when there is a car in front but does not learn to steer right to avoid collision

Next action

  • I will let it train for longer
  • Consider increasing the collision penalty so much that the agent can never get a positive episode reward if it makes a single collision. Currently the agent is not shy about colliding.

Experiment 32

Motivation

Same as Expt 31 but with very high collision penalty

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-04_19-55-16xzfuziaz/checkpoint_291/checkpoint-291
/home/anirban/ray_results/PPO_madras_env_2020-01-04_22-56-37tmpr7d_m/checkpoint_722/checkpoint-722

Observations

  • Agent does not steer at all
  • The learning curve does not show any growth between update steps: 291 and 722
  • Agent just slows down a bit when there is a car in front but eventually collides
  • Agent shows no intention to overtake

Actionable

  • Agent shows no intention of overtaking - bring back the per-overtake reward
  • The 2-action space does not give the agent quick steering ability - try 3-action space to see if the agent learns to steer more freely
  • The learning rate 5e-6 might be too low. Increase it 2x to 1e-5.
  • Agent still does not pay attention to the trackPos and opponents - consider removing track from observation list while using 2-action space. trackPos should give sufficient information to the agent not to go OOT

Experiment 33

Motivation

To address the actionables from Experiment 32

Changes made to the env and training config

lr: 1e-5, 3-action space

rewards:

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 100.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 1.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 5.0
  • RankOneReward: scale: 10.0

TODO later: remove 'track' from state space

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-05_06-29-1602jz__fz/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_06-29-1602jz__fz/checkpoint_181/checkpoint-181
/home/anirban/ray_results/PPO_madras_env_2020-01-05_07-10-53tfgdnjbw/checkpoint_362/checkpoint-362
/home/anirban/ray_results/PPO_madras_env_2020-01-05_07-53-19wdmej9r5/checkpoint_783/checkpoint-783

Experiment 34

Motivation

Force the agent to learn a zigzag maneuver and thus learn to steer out of the way of obstacles.

Methodology

  • We arrange the traffic agents in a way that the agent must first turn right and then turn left.
  • We stop the agent from taking the edge of the tracks to overtake without turning by

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_09-39-15hwy781dy/checkpoint_311/checkpoint-311
/home/anirban/ray_results/PPO_madras_env_2020-01-05_13-00-57je1t0ldp/checkpoint_382/checkpoint-382

Observation

  • The agent stops and almost comes to a perfect halt, but it keeps trying to go to the left and does not steer right
  • Maybe it will improve with more training
  • The agent remains stuck with the same behavior between checkpoints 311 and 382 - PPO often gets stuck in local minima. Trying to ameliorate this by increasing the learning rate 10x to 1e-4

After training with increased learning rate of 1e-4 from /home/anirban/ray_results/PPO_madras_env_2020-01-05_13-00-57je1t0ldp/checkpoint_382/checkpoint-382

we have: /home/anirban/ray_results/PPO_madras_env_2020-01-05_14-04-198iqqfx34/checkpoint_413/checkpoint-413

Experiment 35

Same as #34 but with 3-action space

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-05_14-47-37iicixb5i/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_14-47-37iicixb5i/checkpoint_1894/checkpoint-1894

Observations

  • The car doesn't even learn to slow down to avoid collision.
  • The car runs straight ahead at full speed and collides. Over time, it learns to gather reward from ProgressReward2 by running faster.

Experiment 36

Same as 34 but with modified reward function

rewards:

  • ProgressReward2: scale: 1.0
  • AvgSpeedReward: scale: 1.0
  • CollisionPenalty: scale: 10.0
  • TurnBackwardPenalty: scale: 10.0
  • AngAcclPenalty: scale: 1.0, max_ang_accl: 2.0
  • SuccessfulOvertakeReward: scale: 5.0
  • RankOneReward: scale: 10.0

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_22-52-440h5i8uu6/checkpoint_11/checkpoint-11
/home/anirban/ray_results/PPO_madras_env_2020-01-05_23-05-49klpltfym/checkpoint_22/checkpoint-22

Experiment 37

Motivation

Train the agent to detect imminent collision in front and change lanes to the right.

Settings

  • Num traffic cars: 3
  • Width of the road: -0.9 to 0.9

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-05_23-26-03xqnlpuna/
/home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-05_23-26-03xqnlpuna/checkpoint_61/checkpoint-61
/home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322

The agent learns to turn right and evade collisions most of the time, but the policy does not generalize to other traffic configurations.

Experiment 38

Motivation

Start from the output of Experiment 37 and finetune the policy to achieve generalization to a wide variety of traffic conditions

Settings

lr reduced 10x from #37 to 1e-6

Traffic configuration changed to the following:

Num traffic cars: 1 to 4. Traffic agents are parked so as to make it essential for the agent to steer.

Restored from checkpoint: /home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322 from Experiment 37.

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-06_11-59-47j4g9sq0f/checkpoint_513/checkpoint-513

Observation

  • The agent barely makes any progress in learning to generalize
  • Tested the agent on the traffic config of experiment 37 and saw that it can solve the environment very efficiently. This shows that the policy has not changed enough from /home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322

Action points

  • Train again with higher learning rate? Maybe the minimum that /home/anirban/ray_results/PPO_madras_env_2020-01-06_06-40-2316vkyevv/checkpoint_322/checkpoint-322 represents is too stable a solution and it is hard to bring the agent out of it.
  • Train with a 50-50 blend of two traffic configurations, in one of which, the agent needs to turn right and the other, it needs to turn left

Experiment 39

Motivation

The agent is presented with two traffic conditions - one in which it has to turn right and another in which it has to turn left.

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/checkpoint_1521/checkpoint-1521

Observations

  • The agent learns to slow down when it detects imminent collision
  • In the presence of 3 traffic agents, the agent learns to take right turn to evade the first car
  • However it does not learn to turn left and hits the traffic car in the right lane

Probable cause

  • The agent does not learn to take drastically different actions based on similar opponents input but different trackPos and track inputs
  • I narrowed the track but only implemented that in the done function. The agent's trackPos and track inputs are still in their previous scales. Changing these inputs according to the new track_width might be helpful for the agent to recognize the two different conditions under which it has to turn left or right.

Action points

  • Properly scale the trackPos and track inputs to conform with the modified track-width
  • Use a deeper neural network, in case, the network capacity is the bottleneck here
  • Start from the right-turning checkpoint and finetune it only on the 2 traffic agent setting. Observe if and when the agent learns to turn left and whether it preserves its right turn behavior while doing so.
  • Maybe the KL divergence cutoff of PPO is too small to allow the policy to have variance high enough to capture the nonlinearity involved in selecting left/right turns depending on the agent's lane position - try DDPG and SAC; read up on PPO in Spinning Up in RL

Experiment 40

Same as #39 but with DDPG and 3-action space

Training directory

/home/anirban/ray_results/DDPG_madras_env_2020-01-07_20-45-44dit_leh7/

Evaluated checkpoints

/home/anirban/ray_results/DDPG_madras_env_2020-01-07_20-45-44dit_leh7/checkpoint_217/checkpoint-217

Experiment 41

Same as #39 but with DDPG and 2-action space

Training directory

/home/anirban/ray_results/DDPG_madras_env_2020-01-08_06-04-45bzrjqtd6/

Evaluated checkpoints

/home/anirban/ray_results/DDPG_madras_env_2020-01-08_06-04-45bzrjqtd6/checkpoint_101/checkpoint-101

Observation

Stopped at 101 for a quick experiment - resume later

Experiment 42

Motivation

Retrain the policy from #39 with only 2 traffic agents and see how long it takes the agent to learn to take a left. Let's see if we can recover one policy that can do both right and left turns.

Restoring checkpoint: /home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/checkpoint_1521/checkpoint-1521

lr: 5e-6

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-08_11-33-29e1exjxo2/checkpoint_1651/checkpoint-1651

Observation

The agent does not recover from the previous behavior. Repeating with a higher learning rate and target KL divergence.

Experiment 43

Retrain the policy from #39 with only 2 traffic agents and see how long it takes the agent to learn to take a left. Let's see if we can recover one policy that can do both right and left turns.

lr: 5e-5, kl_target: 0.03

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-08_15-08-04yg7_ht_i/
/home/anirban/ray_results/PPO_madras_env_2020-01-08_16-51-05o_soa0ds/
/home/anirban/ray_results/PPO_madras_env_2020-01-08_18-20-24afb6eso_/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-08_15-08-04yg7_ht_i/checkpoint_1615/checkpoint-1615
/home/anirban/ray_results/PPO_madras_env_2020-01-08_16-51-05o_soa0ds/checkpoint_1676/checkpoint-1676
/home/anirban/ray_results/PPO_madras_env_2020-01-08_18-20-24afb6eso_/checkpoint_1749/checkpoint-1749

Experiment 44

Retrain #39 from /home/anirban/ray_results/PPO_madras_env_2020-01-06_20-06-46m1paevmi/checkpoint_301/checkpoint-301 with learning rate 1e-5. The anticipation is that earlier in training the policy is more flexible and can adapt to the left/right change.

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-08_21-32-57wlb7e74i/checkpoint_382/checkpoint-382
/home/anirban/ray_results/PPO_madras_env_2020-01-08_23-28-28xkab3zix/checkpoint_500/checkpoint-500

Observation

The agent's behavior does not change.

Explanation

The probable cause could be that the agent's lane-position action was not properly adjusted to the changed trackPos limits. The trackPos limits are restricted to +/- 0.5, so lane positions outside that range get very low values because they result in episode termination. With 3 traffic agents, the agent always starts in the left lane. When it sees a car in front, any left-turning action takes it off track. As a result, all left turns get devalued when there is a car in front, and the agent becomes reluctant to take left turns in that situation.

Action items

Instead of blocking away half the action space, rescale the lane-position action's -1/+1 range to the trackPos limits set in the sim_options file (see the sketch below).
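A minimal sketch of the rescaling this action item calls for, assuming the trackPos limits of +/- 0.5 mentioned above; the function name and defaults are illustrative.

```python
def rescale_lane_pos_action(action, track_pos_low=-0.5, track_pos_high=0.5):
    """Linearly map a policy action in [-1, 1] onto the trackPos limits set in
    sim_options, so the whole action range stays usable instead of half of it
    leading to episode termination."""
    # -1 -> track_pos_low, +1 -> track_pos_high
    return track_pos_low + (action + 1.0) * 0.5 * (track_pos_high - track_pos_low)
```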

Experiment 45

Same as #39 with action space fixed. Now -1 and +1 correspond to track_limits specified in sim_options

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311

Observation

  1. VOILA! The agent learns to overtake on both the left and the right.

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311 on 2 traffic agents

==================================================
    Successful race completion rate: 0.927461139896373

    Num trajectories evaluated: 193

==================================================

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311 on 3 traffic agents

==================================================
    Successful race completion rate: 0.839572192513369

    Num trajectories evaluated: 187

==================================================

Testing generalization...

The agent does not generalize well to more traffic agents. With 4 traffic agents it achieves a 0% successful race completion rate, and with 5 traffic agents the result is as follows:

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311 on 5 traffic agents

==================================================
    Successful race completion rate: 0.03879310344827586

    Num trajectories evaluated: 232

==================================================

Actionable items

Learn to generalize to more agents

Experiment 46

4-5 traffic agents, starting from /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-09_22-38-51tq67mqm1/
/home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-09_22-38-51tq67mqm1/checkpoint_347/checkpoint-347
/home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_968/checkpoint-968

Observation

The agent learns to navigate the 5-traffic scene with 39% success rate but fails to get any success in the 4-traffic scenario.

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_918/checkpoint-918 on 5 traffic agents

==================================================
    Successful race completion rate: 0.3904761904761905

    Num trajectories evaluated: 105

==================================================

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_928/checkpoint-928 on 5 traffic agents

==================================================
    Successful race completion rate: 0.35185185185185186

    Num trajectories evaluated: 108

==================================================

Evaluating /home/anirban/ray_results/PPO_madras_env_2020-01-10_06-06-49x01428kp/checkpoint_808/checkpoint-808 on 5 traffic agents

==================================================
    Successful race completion rate: 0.2956521739130435

    Num trajectories evaluated: 115

==================================================

Actionable

Try training on 3 or 4 traffic agents to teach the agent to cope with a 4-traffic-agent scenario without forgetting the 3-traffic scenario.

Experiment 47

Motivation

3-4 traffic agents starting from /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-10_21-44-174bw2i3gn/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-10_21-44-174bw2i3gn/checkpoint_693/checkpoint-693
/home/anirban/ray_results/PPO_madras_env_2020-01-10_21-44-174bw2i3gn/checkpoint_562/checkpoint-562

Observations

  • The agent still does not learn to navigate when starting in the right lane (the 4 traffic car case)
  • Although the agent aces the 3 traffic car configuration
  • As 3 traffic cars appear 50% of the time, the agent gets the episode completion reward of 100, 50% of the time. The agent seems to remain content with this.

Actionable

  • Train the agent on 3-4 traffic cars from scratch. Starting from an expert on 3 traffic cars may have created a bias for 3/5 traffic car configurations in the model
  • Present the 4 traffic configuration more often than the 3 traffic car configuration.

Experiment 48

Motivation

Train on 3-4 traffic agents from scratch

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-11_07-26-071k5vnc5m/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_10-02-047frykyio/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_11-38-42nn_hh2gg/

Evaluated checkpoint

/home/anirban/ray_results/PPO_madras_env_2020-01-11_07-26-071k5vnc5m/checkpoint_181/checkpoint-181
/home/anirban/ray_results/PPO_madras_env_2020-01-11_10-02-047frykyio/checkpoint_272/checkpoint-272
/home/anirban/ray_results/PPO_madras_env_2020-01-11_11-38-42nn_hh2gg/checkpoint_433/checkpoint-433

Actionable

Reduce the target speed of the driving agent to 25 kmph. We don't need the agent to have a high speed while maneuvering through a narrow road with parked traffic.

Experiment 49

Motivation

Train on 3-4 traffic agents from scratch at 50 km per hour

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-11_14-48-14wj9k39q7/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-11_14-48-14wj9k39q7/checkpoint_121/checkpoint-121

Observations

Training gets saturated

Experiment 50

Motivation

Start from #45: /home/anirban/ray_results/PPO_madras_env_2020-01-09_09-28-57fmm3iuzh/checkpoint_311/checkpoint-311. Train on 4-5 traffic agents, reducing the target speed to 50 kmph.

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-11_16-38-573dk5n2ma/
/home/anirban/ray_results/PPO_madras_env_2020-01-11_18-18-11w1fvm0st/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-11_16-38-573dk5n2ma/checkpoint_314/checkpoint-314
/home/anirban/ray_results/PPO_madras_env_2020-01-11_18-18-11w1fvm0st/checkpoint_425/checkpoint-425

Observations

  • VOILA! The agent overcomes both 4 and 5 traffic agents with high success rate starting from both the left and right sides of the road.

  • The agent generalizes to up to 9 traffic cars parked alternately on both sides of the road.

  • The agent generalizes across different car models - even baja-bug and buggy

  • Limitations:

    • The agent collides if the traffic cars switch lanes in front of the agent
    • As the initial stretch of road in aalborg is straight, the agent does not learn to make decisions keeping in mind the turns of the road. It fails to overtake all the agents when tested in Corkscrew which has a left turn in the initial part of the race.

Experiment 51

Motivation

Make the driver from #50 (/home/anirban/ray_results/PPO_madras_env_2020-01-11_16-38-573dk5n2ma/checkpoint_314/checkpoint-314) learn to generalize to multiple tracks by training it on a 50-50 blend of 4 and 5 traffic agents and a 50-50 blend of aalborg and corkscrew.

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-11_22-05-24d3eqobre/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-11_22-05-24d3eqobre/checkpoint_945/checkpoint-945
/home/anirban/ray_results/PPO_madras_env_2020-01-12_07-14-03nwuhirj8/checkpoint_956/checkpoint-956

==================================================
STOCHASTIC ACTIONS
==================================================

Experiment 52

Motivation

Train an agent to drive around corkscrew with N(0, 0.1) Gaussian action noise and 3 action space
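A sketch of one way to inject the N(0, 0.1) Gaussian action noise described here, using a wrapper around the environment; the wrapper itself is an assumed mechanism, not necessarily how MADRaS implements noisy actions.

```python
import gym
import numpy as np


class GaussianActionNoise(gym.ActionWrapper):
    """Add N(0, sigma) noise to each action dimension (steer, accel, brake) and
    clip back to the action-space bounds. sigma=0.1 matches this experiment;
    sigma=0.5 would match Experiment 54."""

    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def action(self, action):
        noisy = action + np.random.normal(0.0, self.sigma, size=np.shape(action))
        return np.clip(noisy, self.action_space.low, self.action_space.high)
```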

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_05-44-367hvs2_e9/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_05-44-367hvs2_e9/checkpoint_751/checkpoint-751

Experiment 53

Motivation

Train an agent to drive around corkscrew with observation noise and no action noise

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_07-42-41eom5mt1d/
/home/anirban/ray_results/PPO_madras_env_2020-01-13_09-36-26v7fotu2_/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_07-42-41eom5mt1d/checkpoint_68/checkpoint-68
/home/anirban/ray_results/PPO_madras_env_2020-01-13_09-36-26v7fotu2_/checkpoint_599/checkpoint-599

Experiment 54

Motivation

Train an agent to drive around corkscrew with N(0, 0.5) Gaussian action noise and 3 action space

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_11-25-330k3tfabm/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_11-25-330k3tfabm/checkpoint_2172/checkpoint-2172

Experiment 55

Motivation

Train an agent to drive around corkscrew with N(0, 0.1) Gaussian action noise and observation noise and 3 action space

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_16-54-209wqonv6u/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_16-54-209wqonv6u/checkpoint_1141/checkpoint-1141

Experiment 56

Motivation

Baseline experiment for this section: train an agent to drive around corkscrew with no observation or action noise.

Training directories

/home/anirban/ray_results/PPO_madras_env_2020-01-13_21-03-554xigbvl1/

Evaluated checkpoints

/home/anirban/ray_results/PPO_madras_env_2020-01-13_21-03-554xigbvl1/checkpoint_1377/checkpoint-1377

Observation

The agent goes off track due to timeout. Increasing max-steps to 15000

Training directory

/home/anirban/ray_results/PPO_madras_env_2020-01-14_07-08-089zxr4g7_/

Last checkpoint

/home/anirban/ray_results/PPO_madras_env_2020-01-14_07-08-089zxr4g7_/checkpoint_671/checkpoint-671

TODO(santara)

  • Engine characteristic graph (torque@rpm) for all cars - modify the code for full automation
  • Plot the training graphs of the first 2 experiments -> done
  • Train a single agent to drive one 4WD and one RWD car / one new car and buggy -> done
  • Train a policy just on buggy to see the decision profile -> done
  • Extract drive-train and CG information -> done
  • Read the curriculum learning for robotics papers and write the curriculum learning section of the MADRaS paper -> done
  • Add context to observation and dilated-conv/recurrent policy
  • While training in traffic, there should be a minimum number of traffic agents -> done