numpde/peer_learning.md

## peer_learning.md

      
Reinforcement learning through imitation of successful peers

Introduction

Reinforcement learning is a mode of machine learning
driven by feedback from the environment
on how good a sequence of the learning agent's actions turns out to be.
We consider here a reinforcement learning mechanism
for neural networks
that is similar to policy gradients
(see A. Karpathy's introduction)
but with the following distinction:
several agents collected in a "culture" interact with the environment independently,
and
rather than updating its policy according to its own score, each agent learns from the experience of a peer with a better score.
Briefly, agents imitate their more successful peers.
This learning mechanism may also be seen as an evolutionary algorithm applied to behavioral memes.
In particular:

- No structural compatibility between the neural networks of individual agents is necessary beyond the input and output layers,
  which allows for coevolution of different designs.
- It is easy to parallelize the computation across the independent agents.
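
The imitation step described above can be sketched without Keras. This is only a minimal illustration, assuming a linear softmax policy trained by plain gradient descent on the cross-entropy; the names `LinearAgent`, `act`, `imitate` and `evolve_once` are hypothetical and do not appear in peer_learning.py.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Row-wise softmax, shifted for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class LinearAgent:
    """Hypothetical agent: one linear layer followed by a softmax."""
    def __init__(self, idim, odim):
        self.W = 0.01 * rng.standard_normal((idim, odim))
        self.states, self.actions = [], []   # experience of the last round
        self.fitness = 0.0

    def act(self, s):
        # Sample an action from the policy and record the state-action pair
        p = softmax(s[None, :] @ self.W)[0]
        a = rng.choice(len(p), p=p)
        self.states.append(s)
        self.actions.append(a)
        return a

    def imitate(self, peer, lr=0.1, epochs=8):
        # Supervised step: raise the probability of the peer's recorded actions
        S = np.array(peer.states)
        A = np.eye(self.W.shape[1])[peer.actions]   # one-hot targets
        for _ in range(epochs):
            P = softmax(S @ self.W)
            self.W -= lr * S.T @ (P - A) / len(S)   # cross-entropy gradient

def evolve_once(pool):
    # Rank by fitness, then let every agent but the best imitate a fitter peer
    pool.sort(key=lambda ag: -ag.fitness)
    for n in range(1, len(pool)):
        pool[n].imitate(pool[rng.integers(n)])
```

An agent never reads another agent's weights, only its recorded state-action pairs, which is why the networks may differ in design and why the rounds of one generation can be played in parallel.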

This "cultural diffusion" has been observed in bumblebees, for example.
The environment we look at is a very simplified version of Pac-Man.
At each time step the agent (dark dot) performs one of five actions: stay/up/down/left/right.
The aim is to collect as many tokens (brighter dots) as possible within a round of 100 time steps (frames).
The score for the round is the number of collected tokens.
Five tokens appear in random locations a) initially and b) as soon as all present tokens have been collected.
The playfield is periodic (it wraps around at the boundary), and it is presented to the agent shifted by the agent's coordinates, so that the agent always sees itself at the origin.
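
The agent-centred view of a periodic playfield can be illustrated as follows. The sketch assumes the 12 x 15 grid used in peer_learning.py; the function name `relative_view` is hypothetical. Shifting every token by the agent's coordinates modulo the grid size gives the same matrix as rolling the absolute token matrix with `numpy.roll`.

```python
import numpy as np

XM, YM = 12, 15   # grid size, as in peer_learning.py

def relative_view(tokens, agent):
    """Token-count matrix as seen from the agent, with wrap-around."""
    (xa, ya) = agent
    view = np.zeros((XM, YM))
    for (x, y) in tokens:
        view[(x - xa) % XM, (y - ya) % YM] += 1
    return view

tokens = [(0, 0), (11, 14), (3, 7)]
agent = (1, 2)
view = relative_view(tokens, agent)

# Same result by rolling the absolute token matrix against the agent's position
absolute = relative_view(tokens, (0, 0))
rolled = np.roll(absolute, shift=(-agent[0], -agent[1]), axis=(0, 1))
```

In peer_learning.py this matrix is flattened to a row vector of length 180 before it is passed to the network.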
Here is what the response of an agent typically looks like
a) before training, b) with some training, c) with substantial training:


(See the archive peer_learning_plot1.zip attached for producing these images.)
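
For orientation, the rules above can be exercised with an agent that moves uniformly at random, which corresponds to the untrained case a). This is a self-contained sketch that mirrors the round logic of peer_learning.py but does not use Keras; `play_random_round` is a hypothetical name, and no particular score is claimed.

```python
import numpy as np

XM, YM = 12, 15
MOVES = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]   # stay and four directions

def play_random_round(steps=100, rng=None):
    """Play one round with uniformly random moves; return the score."""
    if rng is None:
        rng = np.random.default_rng()
    (xa, ya) = (0, 0)
    tokens, score = [], 0
    for _ in range(steps):
        if not tokens:
            # Respawn five tokens once the field is empty
            tokens = [(int(rng.integers(XM)), int(rng.integers(YM)))
                      for _ in range(5)]
        (dx, dy) = MOVES[rng.integers(len(MOVES))]
        (xa, ya) = ((xa + dx) % XM, (ya + dy) % YM)
        # Collect every token at the new location
        while (xa, ya) in tokens:
            tokens.remove((xa, ya))
            score += 1
    return score

scores = [play_random_round(rng=np.random.default_rng(k)) for k in range(20)]
print("mean score of a random agent:", np.mean(scores))
```

Such a baseline gives a reference point against which the fitness values printed by the main script can be compared.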
Installation and license

If in doubt, use python3.
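Required libraries: numpy, keras (with tensorflow or theano), matplotlib.
The code uses the Keras 1 argument names `init` and `nb_epoch`, which Keras 2 renamed to `kernel_initializer` and `epochs`.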
License: CC BY 4.0.
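
A quick way to check that the required libraries can be imported before running the script is sketched below. It uses only the standard library; the function name `missing_modules` is hypothetical.

```python
import importlib.util

def missing_modules(names=("numpy", "keras", "matplotlib")):
    """Return the subset of `names` that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_modules()
print("missing:", missing if missing else "none")
```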


## peer_learning.py
import numpy as np
import matplotlib.pyplot as plt

from numpy.random import randint, seed, choice

from keras.models import Sequential
from keras.layers import Dense

class Env :
    # Grid size
    xm = 12
    ym = 15

    def __init__(self, I=100) :
        # I = number of moves per round
        self.I = I
        # X = history of the last round
        self.X = []

    def P2R(self, P) :
        # Convert a list of token locations to a matrix
        R = np.zeros((self.xm, self.ym))
        for (x, y) in P : R[x, y] += 1
        return R

    def __call__(self, ai, show=False) :
        # Announce new round
        ai.ready()
        self.X = []
        # Initial location of the agent
        (xa, ya) = (0, 0)
        # List of token locations
        P = []
        # Initial score
        score = 0
        # Coordinates wrap-around helper
        MOD = (lambda x, y : (x % self.xm, y % self.ym))

        # Play the round
        for i in range(self.I) :
            # Put tokens
            if (not P) :
                while (len(P) < 5) :
                    P.append((randint(0, self.xm), randint(0, self.ym)))

            self.X.append( (P[:], (xa, ya)) )

            # State relative to the agent's location (as a row vector)
            S = self.P2R([MOD(x - xa, y - ya) for (x, y) in P]).reshape((1,-1))
            # Ask the agent for a direction
            (dx, dy) = [(0,0), (-1,0), (1,0), (0,-1), (0,+1)][ ai.move(S) ]
            # Move the agent in that direction
            (xa, ya) = MOD(xa + dx, ya + dy)

            # Pick up tokens
            while (xa, ya) in P :
                P.remove((xa, ya))
                score += 1

        self.X.append( (P[:], (xa, ya)) )

        if show : self.visualize()

        return score

    def visualize(self, i=None) :
        X = self.X if (i is None) else [self.X[i]]

        for (P, p) in X :
            im = self.P2R(P)
            im[p] = -1
            plt.cla(); plt.axis('off')
            plt.imshow(im, interpolation='nearest', origin='lower', vmin=-1, vmax=5)
            if (i is None) : plt.ion(); plt.show(); plt.pause(0.001)


class AI :
    def __init__(self) :
        # Experience buffer
        self.X = []

        # Size of visual input
        idim = Env.xm * Env.ym
        # Number of possible actions
        odim = 5

        self.nn = Sequential()
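        # Single softmax layer mapping the flattened view to action probabilities
        # (`init` is the Keras 1 argument name; Keras 2 calls it `kernel_initializer`)
        self.nn.add(Dense(odim, activation='softmax', init='normal', input_dim=idim))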

        self.nn.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    @staticmethod
    def pdf(a) :
        # Convert numpy.ndarray to a list with sum = 1
        # Corrects float32 rounding errors of the NN prediction
        b = [float(x) for x in list(a.flat)]
        b = [x/sum(b) for x in b]
        return b

    def ready(self) :
        # Reset the experience buffer
        self.X = []

    def move(self, S) :
        # Ask the neural network for a direction
        # Interpret as probabilities on classes
        p = self.pdf(self.nn.predict(S))
        # Choose one direction according to those proba
        d = choice(range(len(p)), p=p)
        # d-th unit vector
        a = np.eye(1, len(p), k=d)
        # Record state-action pair
        self.X.append((S, a))
        # Return the chosen direction
        return d

    def learn_from(self, peer, nb_epoch=8) :
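        # Stack the peer's recorded states (inputs) and one-hot actions (targets);
        # np.vstack needs a sequence here, recent NumPy rejects a bare generator
        I = np.vstack([S for (S, _) in peer.X])
        O = np.vstack([a for (_, a) in peer.X])
        # Train the NN on the I-O pairs of the peer
        # (`nb_epoch` is the Keras 1 argument name; Keras 2 calls it `epochs`)
        self.nn.fit(I, O, nb_epoch=nb_epoch, verbose=0)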


class Culture :
    def __init__(self, agent_type=None, N=17) :
        self.pool = [agent_type() for _ in range(N)]

    def evolve(self, fitness_criterion, gens=99) :
        for gen in range(gens) :
            # Measure the fitness of each agent
            for agent in self.pool :
                agent.fitness = fitness_criterion(agent)

            # Agents learn from fitter peers
            self.pool.sort(key=(lambda agent: -agent.fitness))
            for (n, agent) in enumerate(self.pool) :
                if n : agent.learn_from(choice(self.pool[:n]))


if (__name__ == "__main__") :
    seed(0) # for numpy.random

    c = Culture(AI)

    for era in range(1000) :
        print("Evolving culture", "(Era #{})".format(era))
        c.evolve(Env())
        print("Fitness:", [agent.fitness for agent in c.pool])

        # Visualize the performance of a random agent
        Env(I=80)(choice(c.pool), show=True)

## peer_learning_plot1.zip

      