@jmatta1, last active November 1, 2017 19:42
Discussion of mean-first vs. covariance-function-first extraction of autocorrelation times

I was browsing the issues for emcee when I noticed an issue from @fardal suggesting a somewhat modified method for calculating the autocorrelation time: issue 209. The method currently used in emcee averages all the walkers, extracts the normalized autocovariance function from the resulting "average chain", then uses a standard method to extract the autocorrelation time from that normalized autocovariance function. @fardal's method instead extracts the autocovariance function of every walker's chain individually, averages those individual autocovariance functions, and then extracts the autocorrelation time from the averaged autocovariance function. @fardal's method seems to dramatically reduce the variance seen between runs of the autocorrelation calculation. Another thing I have noticed is that the standard emcee method produces a skewed distribution of estimated autocorrelation times, while this method seems to produce a symmetric, Gaussian-looking distribution.

My theory about the reason for the improvement is that each walker's chain can be quite different even when all the walkers are exploring the same region. When you average the walkers first, you therefore lose a great deal of information. By finding the autocovariance functions first, far less information is lost in the averaging, as in the sketch below.
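A minimal sketch of the difference in the order of operations (not the exact emcee code). It assumes a chains array of shape (n_walkers, n_steps); autocov, mean_first, and cov_first are illustrative helper names, and the direct O(n^2) autocovariance is used here purely for clarity:

import numpy as np

def autocov(x):
    """Direct (O(n^2)) autocovariance of a 1-D series; for illustration only."""
    x = x - x.mean()
    n = len(x)
    return np.array([np.dot(x[:n - t], x[t:]) / n for t in range(n)])

def mean_first(chains):
    # average the walkers into a single "average chain", then take its
    # autocovariance (the order emcee used at the time)
    return autocov(chains.mean(axis=0))

def cov_first(chains):
    # take each walker's autocovariance individually, then average the
    # autocovariance functions (@fardal's suggested order)
    return np.mean([autocov(walker) for walker in chains], axis=0)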

However, @fardal's example implementation uses the O(n^2) algorithm for calculating the autocovariance function. In the attached files I have modified the implementation to use the FFT method for calculating the autocovariance function, and made the implementation somewhat more flexible with respect to running independently on a cluster, so that many runs can be performed to obtain smooth distributions.
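The FFT method rests on the Wiener-Khinchin theorem: the autocovariance is the inverse FFT of the squared magnitude of the FFT of the centered, zero-padded series. A sketch (autocov_fft is an illustrative name), equivalent up to floating-point error to the direct sum in the sketch above:

import numpy as np

def autocov_fft(x):
    """FFT-based autocovariance, O(n log n) instead of O(n^2)."""
    x = x - x.mean()
    n = len(x)
    # zero-pad to at least 2n so the FFT's circular correlation does not
    # wrap around; the scripts below pad to a power of two for speed
    f = np.fft.fft(x, n=2**int(np.ceil(np.log2(2 * n))))
    acov = np.fft.ifft(f * np.conjugate(f))[:n].real
    return acov / n  # same normalization as the direct sum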

There are two separate scripts. The first script (calc_autocorr.py) performs a given number of runs spread across a given number of cores and calculates the autocorrelation time using the mean-first approach (with emcee default parameters). It then calculates the autocorrelation time using the covariance-function-first approach for window factors (see "Monte Carlo Methods in Statistical Mechanics: Foundations and New Algorithms" by Alan D. Sokal) in the range [1, 30]. Finally, it saves the full list of calculated autocorrelation times into a compressed numpy array file. The second script (gen_plot.py) reads the list of autocorrelation times and generates histograms and summary statistics for the mean-first approach and the desired window factors of the covariance-function-first approach.

Usage for calc_autocorr.py:
./calc_autocorr.py <Num Trials> <Num Cores> <AC Time Index> <output file>
AC Time Index = 0 --> Autocorrelation time = 9
AC Time Index = 1 --> Autocorrelation time = 20
AC Time Index = 2 --> Autocorrelation time = 30
AC Time Index = 3 --> Autocorrelation time = 60
AC Time Index = 4 --> Autocorrelation time = 200

Usage for gen_plot.py:
./gen_plot.py <Numpy Array File> <Num Bins> <Window Length> [Window Length] ...
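For example, a run of 1000 trials across 4 cores against the tau = 20 chain, followed by plotting with 50 bins and window factors of 5 and 10, could look like this (the output file name is just a placeholder):

./calc_autocorr.py 1000 4 1 results.npz
./gen_plot.py results.npz 50 5 10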

calc_autocorr.py:

#! /usr/bin/python
"""Tweaked tests of autocorrelation length."""
import sys
import multiprocessing

import numpy as np
from emcee import ensemble

CHAIN_PARAMS = [(10000, 0.8),                    # Autocorrelation time = 9
                (20000, 0.904761904762),         # Autocorrelation time = 20
                (30000, 0.9354838709677),        # Autocorrelation time = 30
                (60000, 0.9672131147541),        # Autocorrelation time = 60
                (200000, 0.990050200903734685)]  # Autocorrelation time = 200


def autoregressive(nsteps, phi):
    # set the number of walkers
    m_walkers = 64
    # implement an AR(1) model, which can be written as:
    #     X_t = c + phi*X_(t-1) + epsilon_t
    # where c is some constant offset, phi is the autoregression coefficient,
    # and epsilon_t is a mean-zero random number with a given variance
    # this is what we want the sqrt of the variance of the time series to be
    sigma_total = 1.0
    # this is the sqrt of the variance of the random numbers required to get
    # the correct total variance; it works because the variance
    # (sigma_total^2) of an AR(1) model is sigma_e^2/(1 - phi^2);
    # solve for sigma_e to get the expression below
    sigma_e = sigma_total * np.sqrt(1. - phi**2)
    # the (unnormalized) autocovariance function is:
    #     B(n) = E(X(T+n)*X(T)) - avg^2 = sigma_e^2/(1 - phi^2) * phi^|n|
    # because of how we have defined sigma_e, for this particular AR(1) model
    # the autocovariance function collapses to B(n) = phi^|n|
    # the normalized autocovariance function is C(t) = B(t)/B(0); since
    # B(0) = 1 for this AR(1) model, C(t) = B(t)
    # the integrated autocorrelation time, when there is no noise, has two
    # definitions; "Monte Carlo Methods in Statistical Mechanics: Foundations
    # and New Algorithms" by Alan D. Sokal (lectures at the Cargese Summer
    # School on "Functional Integration: Basics and Applications",
    # September 1996) gives:
    #     tau = 1/2 * Sum[C(t), {t, -inf, inf}]
    # Goodman and Weare, "Ensemble Samplers with Affine Invariance",
    # Communications in Applied Mathematics and Computational Science,
    # Vol. 5, No. 1, pp. 65-80, 2010, gives:
    #     tau = Sum[C(t), {t, -inf, inf}]
    # I use the latter definition
    # for phi < 1, phi^|n| goes to zero for large n, so we can truncate the
    # summation well before n = infinity; the truncated summation could be
    # implemented as:
    #     truelength = (phi**np.abs(np.arange(-1000, 1000))).sum()
    # however, because the summation is symmetric in n, we can "fold" it into:
    #     tau = 1 + 2*Sum[C(t), {t, 1, inf}]
    # which cuts the number of calculations required roughly in half
    truelength = 1.0 + 2.0 * (phi**np.arange(1, 1000)).sum()
    # initialize the chain
    chain = np.zeros((m_walkers, nsteps))
    # set the first point to a random number with variance sigma_total
    chain[:, 0] = sigma_total * np.random.normal(size=m_walkers)
    # set each of the subsequent points according to the AR(1) model
    c = np.ones(m_walkers)
    for i in xrange(1, nsteps):
        chain[:, i] = (phi * chain[:, i-1] +
                       sigma_e * np.random.normal(size=m_walkers) + c)
    chain.shape = (m_walkers, nsteps, 1)  # make the last index the (1-d) parameter dimension
    return chain, truelength


def findauto(params, windowList):
    ns = params.shape[0]
    mc = params.shape[1]
    # get the first power of two greater than or equal to the number of samples
    maxLen = 2**int(np.ceil(np.log2(ns)))
    # collapse the chain to a 2D array from the 3D that was passed in
    # (the 3D shape exists for compatibility with emcee)
    chain = params[:, :, 0]
    # generate an empty array to hold each walker's autocovariance function
    chaincorr = np.zeros((mc, ns))
    # calculate the average of each walker's chain
    avgs = chain.mean(axis=0)
    for j in xrange(mc):
        # calculate the forward fourier transform of the *centered* walker chain
        temp = np.fft.fft(chain[:, j] - avgs[j], n=maxLen)
        # calculate the inverse fourier transform of the squared magnitudes of
        # the forward transform, taking the real parts to remove the numerical
        # precision errors that give non-zero imaginary components
        chaincorr[j, :] = np.fft.ifft(temp * np.conjugate(temp))[:ns].real
    # average the per-walker autocovariance functions
    corr = chaincorr.mean(axis=0)
    # convert the averaged autocovariance function into the normalized
    # autocovariance function
    corr /= corr[0]
    # this is like the sum above used to find tau, but because we multiply the
    # whole thing by 2 we subtract one, since otherwise the t=0 point would be
    # double counted
    nclust_cum = 2. * corr.cumsum() - 1.
    return [find_edges(nclust_cum, ns, x) for x in windowList]


def find_edges(nclust_cum, ns, windowLength):
    # make an array of window sizes, needed for the integrated autocorrelation
    # time finding algorithm for noisy time series given by Sokal
    index = np.arange(ns)
    # find points where the window size exceeds some safety factor times the
    # cumulative sum
    end = index > windowLength * nclust_cum
    if end.any():
        # the cumulative sum at the first point fulfilling the above condition
        # is the autocorrelation time
        imax = np.where(end)[0][0]
        nclust = nclust_cum[imax - 1]
    else:
        # if there were no such points, then there were not enough samples to
        # correctly determine the autocorrelation time for the given window
        # length
        nclust = nclust_cum[-1]
        print 'WARNING: autocorr not converged.'
    return nclust


def findautofromemcee(chain):
    m_walkers, nsteps, dim = chain.shape

    def foofn():
        pass

    # build a dummy sampler, inject the chain, and use emcee's built-in
    # (mean-first) autocorrelation time estimate
    sampler = ensemble.EnsembleSampler(m_walkers, dim, foofn, a=2.0)
    sampler._chain = chain
    return sampler.acor[0]


def do_single_trial(params):
    # pull the parameters out of the tuple passed in
    index, nsamples, phi = params
    # give the user something to stare at (if not running on a cluster)
    if index % 10 == 0:
        print index
    # generate the chain
    chain, truelength = autoregressive(nsamples, phi)
    # get the transpose of the chain, the format findauto expects
    biechain = chain[:, :, 0].T[:, :, np.newaxis]
    # get the emcee (mean-first) estimate
    meanfirst = findautofromemcee(chain)
    # get the covariance-function-first autocorrelation times
    windowList = range(1, 31)
    temp = findauto(biechain, windowList)
    # build the output list
    output = [truelength, meanfirst]
    output.extend(temp)
    return output


def main(ntrials, numcores, nsamples, phi, outfile):
    # make the 'thread pool'
    mp_pool = multiprocessing.Pool(processes=numcores)
    # generate a list of parameters to map across
    params = [(i, nsamples, phi) for i in range(ntrials)]
    # map the trial function across the parameter list
    values = mp_pool.map(do_single_trial, params)
    temp = np.array(values)
    np.savez_compressed(outfile, data=temp)


HELP_STR = """Usage:
{0:s} <Num Trials> <Num Cores> <AC Time Index> <output file>
AC Time Index = 0 --> Autocorrelation time = 9
AC Time Index = 1 --> Autocorrelation time = 20
AC Time Index = 2 --> Autocorrelation time = 30
AC Time Index = 3 --> Autocorrelation time = 60
AC Time Index = 4 --> Autocorrelation time = 200"""

if __name__ == "__main__":
    if len(sys.argv) != 5:
        print HELP_STR.format(sys.argv[0])
        sys.exit()
    params = CHAIN_PARAMS[int(sys.argv[3])]
    main(int(sys.argv[1]), int(sys.argv[2]), params[0], params[1], sys.argv[4])
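For a quick interactive sanity check without the command-line wrapper, one could call the script's own functions directly, here with the tau = 20 parameters from CHAIN_PARAMS and an arbitrary window factor of 5:

chain, true_tau = autoregressive(20000, 0.904761904762)  # true tau = 20
biechain = chain[:, :, 0].T[:, :, np.newaxis]            # reshape to (nsteps, nwalkers, 1)
tau_mean_first = findautofromemcee(chain)                # emcee's mean-first estimate
tau_cov_first = findauto(biechain, [5])[0]               # cov-first, window factor C = 5
print true_tau, tau_mean_first, tau_cov_first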
gen_plot.py:

#! /usr/bin/python
"""Plot results of autocorrelation length trials."""
import sys

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties


def main(data_file, bins, windows):
    # load the trial results written by calc_autocorr.py
    npfile = np.load(data_file, encoding='bytes')
    temp = npfile['data']
    npfile.close()
    # column 0 is the true autocorrelation time (identical for every trial),
    # column 1 is the mean-first estimate, and column 1+i holds the
    # covariance-function-first estimate for window factor i
    truelength = temp[0, 0]
    meanfirst = temp[:, 1]
    corrfirst = [temp[:, 1 + i] for i in windows]
    # build the summary-statistics string displayed on the plot
    out_dict = {}
    out_dict['true_ac'] = truelength
    add_to_fmt_dict(out_dict, meanfirst)
    out_str = FMT_STR1.format(**out_dict)
    for i, array in zip(windows, corrfirst):
        out_dict['wind'] = i
        add_to_fmt_dict(out_dict, array)
        out_str += FMT_STR2.format(**out_dict)
    # histogram the mean-first estimates
    bin_count = bins
    _ = plt.hist(meanfirst, histtype='step', color='b', bins=bin_count,
                 label='Mean-first')
    # histogram the covariance-function-first estimates, one per window factor
    color_ind = 0
    colors = ['r', 'k', 'c', 'm', 'y']
    for i, array in zip(windows, corrfirst):
        _ = plt.hist(array, color=colors[color_ind % len(colors)],
                     histtype='step', bins=bin_count,
                     label='Cov-first C={0:d}'.format(i))
        color_ind += 1
    # draw a vertical line at the true autocorrelation time
    ylim = np.array(plt.ylim())
    xlim = np.array(plt.xlim())
    plt.plot(np.zeros(2) + truelength, ylim, 'g-', label='True AutoCorr')
    # place the summary statistics on the plot in a monospace font
    font = FontProperties()
    font.set_family('monospace')
    plt.text(0.65 * xlim[1], 0.75 * ylim[1], out_str, fontproperties=font,
             va='top', ha='left')
    plt.xlabel("Calculated Autocorrelation Time (Samples)")
    plt.ylabel("Counts Per Bin")
    plt.legend()
    plt.show()


def add_to_fmt_dict(out_dict, array):
    out_dict["avg"] = array.mean()
    out_dict["std"] = array.std()
    out_dict["min"] = array.min()
    out_dict["max"] = array.max()


FMT_STR1 = """True AutoCorrelation: {true_ac:3.1f}
                    Avg   |   Std   |   Min   |   Max
Mean-First:     {avg:7.3f} | {std:7.3f} | {min:7.3f} | {max:7.3f}"""

FMT_STR2 = """
Cov-First C={wind:2d}: {avg:7.3f} | {std:7.3f} | {min:7.3f} | {max:7.3f}"""

HELP_STR = """Usage:
{0:s} <Numpy Array File> <Num Bins> <Window Length> [Window Length] ..."""

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print HELP_STR.format(sys.argv[0])
        sys.exit()
    bins = int(sys.argv[2])
    windows = [int(x) for x in sys.argv[3:]]
    main(sys.argv[1], bins, windows)