Last active January 3, 2016 20:39
Using the epsilon-greedy strategy to generate the n-armed bandit testbed.
function [] = n_armed_testbed(nB,nA,nP,sigma)
% Generates the n-armed bandit testbed (10 arms by default).
% Inputs:
% nB: the number of bandits
% nA: the number of arms
% nP: the number of plays (times we will pull an arm)
% sigma: the standard deviation of the return from each of the arms
% Written by:
% --
% John L. Weatherwax 2007-11-13
% email:
% Please send comments and especially bug reports to the
% above email address.
%close all;
if( nargin<1 ) % the number of bandits:
  nB = 2000;
end
if( nargin<2 ) % the number of arms:
  nA = 10;
end
if( nargin<3 ) % the number of plays (times we will pull an arm):
  nP = 10000;
end
if( nargin<4 ) % the standard deviation of the return from each of the arms:
  sigma = 1.0;
end
% generate the TRUE reward Q^{\star}:
qStarMeans = mvnrnd( zeros(nB,nA), eye(nA) );
% run an experiment for each epsilon:
% 0 => fully greedy
% 1 => explore on each trial
epsArray = [ 0, 0.01, 0.1 ]; %, 1 ];
% assume we have at least ONE draw from each "arm" (one noisy sample centered on the qStarMeans matrix):
qT0 = mvnrnd( qStarMeans, eye(nA) );
avgReward = zeros(length(epsArray),nP);
perOptAction = zeros(length(epsArray),nP);
cumReward = zeros(length(epsArray),nP);
cumProb = zeros(length(epsArray),nP);
for ei=1:length(epsArray),
  tEps = epsArray(ei);

  %qT = qT0; % <- initialize to one draw per arm (then use qN = ones( nB, nA ))
  qT = zeros(size(qT0)); % <- initialize to zero draws per arm (no knowledge)
  qN = zeros( nB, nA ); % keep track of the number of draws on each arm (zero, to match qT above)
  qS = qT; % keep track of the SUM of the rewards (qT = qS./qN)

  allRewards = zeros(nB,nP);
  pickedMaxAction = zeros(nB,nP);
  for bi=1:nB, % pick a bandit
    for pi=1:nP, % make a play
      % determine if this move is exploratory or greedy:
      if( rand(1) <= tEps ) % pick a RANDOM arm (uniformly from 1..nA):
        arm = randi(nA);
      else % pick the GREEDY arm:
        [dum,arm] = max( qT(bi,:) ); clear dum;
      end
      % determine if the arm selected is the best possible:
      [dum,bestArm] = max( qStarMeans(bi,:) );
      if( arm==bestArm ) pickedMaxAction(bi,pi) = 1; end
      % get the reward from drawing on that arm:
      reward = qStarMeans(bi,arm) + sigma*randn(1); % introduce some noise, or the pure greedy approach would be the best
      allRewards(bi,pi) = reward;
      % update qN,qS,qT:
      qN(bi,arm) = qN(bi,arm)+1;
      qS(bi,arm) = qS(bi,arm)+reward;
      qT(bi,arm) = qS(bi,arm)/qN(bi,arm);
    end
  end

  avgRew = mean(allRewards,1);
  avgReward(ei,:) = avgRew(:).';
  percentOptAction = mean(pickedMaxAction,1);
  perOptAction(ei,:) = percentOptAction(:).';
  csAR = cumsum(allRewards,2); % do a cumulative sum across plays for each bandit
  csRew = mean(csAR,1);
  cumReward(ei,:) = csRew(:).';
  csPA = cumsum(pickedMaxAction,2)./cumsum(ones(size(pickedMaxAction)),2);
  csProb = mean(csPA,1);
  cumProb(ei,:) = csProb(:).';
end
% produce the average rewards plot:
figure; hold on; clrStr = 'brkc'; all_hnds = [];
for ei=1:length(epsArray),
  %all_hnds(ei) = plot( [ 0, avgReward(ei,:) ], [clrStr(ei)] );
  all_hnds(ei) = plot( 1:nP, avgReward(ei,:), [clrStr(ei),'-'] );
end
legend( all_hnds, { '0', '0.01', '0.1' }, 'Location', 'SouthEast' );
axis tight; grid on;
xlabel( 'plays' ); ylabel( 'Average Reward' );
% produce the percent optimal action plot:
figure; hold on; clrStr = 'brkc'; all_hnds = [];
for ei=1:length(epsArray),
  %all_hnds(ei) = plot( [ 0, avgReward(ei,:) ], [clrStr(ei)] );
  all_hnds(ei) = plot( 1:nP, perOptAction(ei,:), [clrStr(ei),'-'] );
end
legend( all_hnds, { '0', '0.01', '0.1' }, 'Location', 'SouthEast' );
axis( [ 0, nP, 0, 1 ] ); axis tight; grid on;
xlabel( 'plays' ); ylabel( '% Optimal Action' );
% produce the cumulative average rewards plot:
figure; hold on; clrStr = 'brkc'; all_hnds = [];
for ei=1:length(epsArray),
  all_hnds(ei) = plot( 1:nP, cumReward(ei,:), [clrStr(ei),'-'] );
end
legend( all_hnds, { '0', '0.01', '0.1' }, 'Location', 'SouthEast' );
axis tight; grid on;
xlabel( 'plays' ); ylabel( 'Cumulative Average Reward' );
% produce the cumulative percent optimal action plot:
figure; hold on; clrStr = 'brkc'; all_hnds = [];
for ei=1:length(epsArray),
  all_hnds(ei) = plot( 1:nP, cumProb(ei,:), [clrStr(ei),'-'] );
end
legend( all_hnds, { '0', '0.01', '0.1' }, 'Location', 'SouthEast' );
axis( [ 0, nP, 0, 1 ] ); axis tight; grid on;
xlabel( 'plays' ); ylabel( 'Cumulative % Optimal Action' );
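
The testbed above keeps a running sum (qS) and a count (qN) for each arm and recomputes the estimate as qS/qN. An equivalent incremental form, Q <- Q + (r - Q)/N, gives the same sample average without storing the sum. A minimal sketch, using made-up rewards for a single arm (the names rewards, qInc and qSum are illustrative and not part of the gist):

```matlab
% Compare the sum/count estimate with the incremental-mean update.
rewards = [1.0, 0.5, 2.0, -1.0];   % hypothetical rewards from one arm

qS = 0; qN = 0;   % running sum and draw count, as in the testbed
qInc = 0;         % incrementally updated estimate

for k = 1:numel(rewards)
    r = rewards(k);
    qN = qN + 1;
    qS = qS + r;
    qInc = qInc + (r - qInc)/qN;   % Q <- Q + (r - Q)/N
end

qSum = qS/qN;     % estimate recomputed from sum and count
```

Both qSum and qInc end up as the plain average of the rewards seen so far, so the choice between the two forms is mainly about what state you want to carry around.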

Things learned:

  1. nargin: checking the number of inputs passed in, which lets a function set default argument values
  2. mvnrnd: drawing samples from a multivariate Gaussian distribution
  3. using a string like 'rbg' to represent an array of colors when plotting multiple graphs with different colors in one plot
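
The three points above can be sketched in one small function file. This is only an illustration: demo_defaults and its output fields are hypothetical names, not part of the gist, and mvnrnd assumes the Statistics and Machine Learning Toolbox is available.

```matlab
function out = demo_defaults(nA, sigma)
% DEMO_DEFAULTS  Hypothetical helper (not part of the gist) showing
% nargin-based defaults, mvnrnd, and indexing into a color string.

if nargin < 1, nA = 10; end       % default number of arms
if nargin < 2, sigma = 1.0; end   % default reward standard deviation

% One row of true arm means drawn from N(0, I). With an identity
% covariance this is equivalent to randn(1, nA).
qStar = mvnrnd(zeros(1, nA), eye(nA));

% A char array of color codes; clrStr(k) picks the k-th color.
clrStr = 'brkc';                  % blue, red, black, cyan

out.nA = nA;
out.sigma = sigma;
out.qStarSize = size(qStar);
out.secondColor = clrStr(2);      % 'r'
end
```

Calling demo_defaults() uses both defaults, demo_defaults(5) overrides only the number of arms, and demo_defaults(5, 0.5) overrides both.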


Some good ways to plot graphs with legends
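
The legend pattern used in the gist can be sketched as follows: collect one handle per curve, then pass the handles and a matching cell array of labels to legend. The curves here are made-up placeholders rather than testbed output, and the labels are built from epsArray instead of being typed by hand, which is a variation on the gist rather than what it does.

```matlab
% Plot one curve per epsilon and attach legend entries via plot handles.
epsArray = [0, 0.01, 0.1];
clrStr = 'brk';                   % one color code per curve
x = 1:100;

fig = figure('Visible', 'off');   % off-screen, so this also runs without a display
hold on;
hnds = [];
for ei = 1:numel(epsArray)
    y = 1 - exp(-x*(0.01 + epsArray(ei)));   % placeholder data, not real results
    hnds(ei) = plot(x, y, [clrStr(ei), '-']);
end

% Build labels from the values so the legend stays in sync with epsArray.
labels = arrayfun(@(e) sprintf('eps = %g', e), epsArray, 'UniformOutput', false);
legend(hnds, labels, 'Location', 'SouthEast');
grid on; xlabel('plays'); ylabel('value');
```

Passing the handles explicitly keeps each legend entry tied to the intended curve, even if other objects are later added to the same axes.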
