floodsung /
Created June 2, 2016 08:02
This solution is adapted mainly from John Schulman's Deep Reinforcement Learning Lab script at the Machine Learning Summer School 2016.
import numpy as np
import gym
from gym.spaces import Discrete, Box
# -------------------------------------------
# Policies
# -------------------------------------------
class DeterministicDiscreteActionLinearPolicy(object):
require 'midilib'
seq =[0], 'rb') { |file| }
events = [ ]
id = 0
seq.tracks.each do |track|
sorenbouma /
Last active April 17, 2018 05:54
This is a basic Python implementation of the Cross-Entropy Method for reinforcement learning on OpenAI Gym's CartPole environment.
import gym
import numpy as np
import matplotlib.pyplot as plt
env = gym.make('CartPole-v0')
# vector of means (mu) and standard deviations (sigma) for each parameter
zsal /
Created June 27, 2016 02:57
John Schulman MLSS Lab 1: CartPole-v0
# Most of the code is from John Schulman's MLSS talk on Deep Reinforcement Learning
import numpy as np
import gym
from gym.spaces import Discrete, Box
# ================================================================
# Policies
# ================================================================
fnurl /
Last active January 15, 2020 07:07
A script that produces a JSON page index file for markdown files (extension `.md`) in a directory and its subdirectories (e.g. a Hugo site's `content` directory) for use with Algolia Docsearch.
import os
import sys
import yaml
import json
# base url to use
base_url = "http://localhost:1313"
# The attribute mapping for docsearch.
kashif /
Last active November 7, 2023 12:56
Cross Entropy Method

How do we solve the policy optimization problem, that is, how do we maximize the total reward under some parametrized policy?

Discounted future reward

To begin with, the total reward for an episode is the sum of all the rewards collected in it. If our environment is stochastic, we can never be sure that we will get the same rewards the next time we perform the same actions, so the further we look into the future, the more the total future reward may diverge. For that reason it is common to use the discounted future reward instead, where the parameter `discount`, called the discount factor, lies between 0 and 1 and weights each reward less the further in the future it occurs.
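
As a minimal sketch of the discounted future reward, assuming the standard definition in which the reward at step `t` is weighted by `discount ** t`. The function name `discounted_return` and the sample rewards are illustrative and do not come from the original gist.

```python
def discounted_return(rewards, discount=0.99):
    """Sum of rewards, each weighted by discount**t (assumed standard definition)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + discount * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


episode_rewards = [1.0, 1.0, 1.0]  # hypothetical rewards from one episode
print(discounted_return(episode_rewards, discount=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With `discount=1.0` this reduces to the plain sum of rewards, while values closer to 0 make the agent care mostly about immediate rewards.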

A good strategy for an agent would be to always choose an action that maximizes the (discounted) future reward. In other words, we want to maximize the expected reward per episode.
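
One way to search for policy parameters that maximize such an objective is the cross-entropy method named in the title. The sketch below is an illustration under stated assumptions, not the gist's own code: it uses a toy `toy_score` function as a stand-in for "expected reward per episode", a diagonal Gaussian over parameters, and a hypothetical `min_std` floor to keep the search from collapsing too early.

```python
import random
import statistics


def cross_entropy_method(score, dim, n_iters=30, batch_size=100,
                         elite_frac=0.2, init_std=5.0, min_std=0.05, seed=0):
    """Maximize score(theta) by repeatedly refitting a diagonal Gaussian to elite samples."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(2, int(batch_size * elite_frac))
    for _ in range(n_iters):
        # Sample candidate parameter vectors from the current Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(batch_size)]
        # Keep the highest-scoring candidates (the "elite" set).
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        # Refit the per-dimension mean and standard deviation to the elite set.
        columns = list(zip(*elite))
        mean = [statistics.fmean(col) for col in columns]
        # min_std is an assumed safeguard, not part of the basic method.
        std = [max(statistics.pstdev(col), min_std) for col in columns]
    return mean


def toy_score(theta):
    # Hypothetical objective with its maximum at theta = (3, -2).
    return -((theta[0] - 3.0) ** 2 + (theta[1] + 2.0) ** 2)


best = cross_entropy_method(toy_score, dim=2)
print(best)
```

In a reinforcement learning setting, `score` would instead run one or more episodes with the policy defined by `theta` and return the (discounted) total reward.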

squiter / install_ruby
Created October 7, 2016 22:26
How to install Ruby with Tcl and Tk for Coursera Programming Languages Part C
set -euo pipefail
eph2795 /
Created February 18, 2017 19:50
import gym
from tqdm import tqdm_notebook
import numpy as np
import matplotlib.pyplot as plt
import warnings
def get_random_policy():
    return np.random.choice(n_actions, tuple(bins))
# add to bashrc
# files
alias sampletree='mkdir -p sample/{train,test,valid}'
lsn(){ matchdir=`pwd`/$2; find $matchdir -type f | grep -v sample | shuf -n $1 | awk -F`pwd` '{print "."$NF}' ; }
# shuffle mv/cp
cpn(){ matchdir=`pwd`/$2; find $matchdir -type f | grep -v sample | shuf -n $1 | awk -F`pwd` '{print "."$NF" sample"$NF}' | xargs -t -n2 cp ; }
mvn(){ matchdir=`pwd`/$2; todir=`pwd`/$3; find $matchdir -type f | grep -v sample | shuf -n $1 | awk -F`pwd` -v todir="$todir" '{print $0" "todir}' | xargs -t -n2 mv ; }
pat-coady / racetrack_sarsa.ipynb
Last active April 21, 2024 21:01
Sutton and Barto Racetrack: Sarsa
