"""
Bars indexed by total volume, with each set # of shares traded creating a distinct bar.
We can transform minute bars into an approximation for volume bars, but ideally we would use tick bars
to maintain information for all parameters across bars.
Let's set a target bar volume equal to the maximum minutely volume in our dataset
(otherwise we'd have to split minute bars, which we can't).
Then, we construct our bars in such a way as to minimize the distance from this target:
Combine minute bars while maintaining the time of the first and final minutes,
adding volume of bars together, and keeping track of the high and low over the interval.
At each minute, check if the difference from the target volume would be minimized by adding or excluding the next minute.
Select the optimal option.
This is meant to prevent the scenario where your target volume is 100K shares,
and 2 bars of 99.9K shares follow each other. Adding the bars would result in essentially twice our target,
while this strategy would result in 2 volume bars much closer to the target.
This regularizes the volume of our bars.
"""
#############################
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, jarque_bera
from statsmodels.tsa.stattools import adfuller as adf

# Since our base data are not ticks, we cannot recover OHLCV fields if a minute bar is split,
# so the target volume per bar is set to the maximum single-minute volume in our interval.
# Each volume bar then consists of at least one full time bar; this causes some volume
# inequity between bars, but it is the best we can do without tick data.
tgt_volume = prices['volume'].max()
# Indices in time bars where volume bars should end
bar_ends = []
bar_volume = 0
skip = False  # Used to skip an index in the loop when the 2nd volume is consumed
# Iterate over each pair of neighboring minute bars' volumes and select the option
# that minimizes the difference from the target
for index, (volume1, volume2) in enumerate(zip(prices['volume'], prices['volume'][1:])):
    # Skip this iteration if the index was already consumed by the previous pair
    if skip:
        skip = False
        continue
    # If the target is surpassed within our pair, select the difference-minimizing index
    if bar_volume + volume1 + volume2 > tgt_volume:
        d1 = abs(tgt_volume - (bar_volume + volume1))
        d2 = abs(tgt_volume - (bar_volume + volume1 + volume2))
        # Include the 2nd index and set the loop to skip the next iteration
        if d1 > d2:
            bar_ends.append(index + 1)
            bar_volume = 0
            skip = True
        # Exclude the 2nd index, business as usual
        else:
            bar_ends.append(index)
            bar_volume = 0
    # If the pair wouldn't cross the target, just add the volume of the first index
    else:
        bar_volume += volume1
# Transfer and combine information from time bars into volume bars
volume_bars = prices.iloc[bar_ends].copy()
bar_count = 0
bar_volume = 0
min_price = prices.iloc[0]['low']
max_price = prices.iloc[0]['high']
open_price = prices.iloc[0]['open_price']
close_price = prices.iloc[0]['close_price']
for timestamp in prices.index:
    bar_volume += prices.loc[timestamp, 'volume']
    min_price = min(min_price, prices.loc[timestamp, 'low'])
    max_price = max(max_price, prices.loc[timestamp, 'high'])
    if timestamp in volume_bars.index:
        # Use .loc[row, col] rather than chained indexing so the writes actually land
        volume_bars.loc[timestamp, 'open_price'] = open_price
        volume_bars.loc[timestamp, 'high'] = max_price
        volume_bars.loc[timestamp, 'low'] = min_price
        close_price = prices.loc[timestamp, 'close_price']
        volume_bars.loc[timestamp, 'close_price'] = close_price
        volume_bars.loc[timestamp, 'volume'] = bar_volume
        # Reset the running aggregates for the next bar
        bar_volume = 0
        min_price = float('inf')
        max_price = float('-inf')
        bar_count += 1
        if bar_count < len(volume_bars):
            open_price = prices.loc[volume_bars.iloc[bar_count].name, 'open_price']
######################################## vectorized impl ###########################
def volume_bars_vectorized(ohlcv, volume_threshold):
    """Create volume-based OHLCV bars using pandas and numpy to make
    the computation more efficient.

    Parameters
    ----------
    ohlcv : pd.DataFrame
        columns = open_price, high, low, close_price, volume; index = datetime
    volume_threshold : int
        Number of shares traded per bar

    Returns
    -------
    pd.DataFrame
        DataFrame containing OHLCV data, indexed by the datetime at the end of each bar.
    """
    cum_vol = ohlcv['volume'].cumsum()
    grouper = cum_vol // volume_threshold
    # The minute on which cumulative volume crosses the threshold belongs to the
    # bar it completes, so shift that minute's group back by one
    mask = grouper != grouper.shift(1)
    mask.iloc[0] = False
    grouper = (grouper - mask.astype(int)).values
    volume_ohlcv = (ohlcv.reset_index().groupby(grouper)
                    .agg({'open_price': 'first', 'high': 'max',
                          'low': 'min', 'close_price': 'last',
                          'volume': 'sum', 'index': 'last'})).set_index('index')
    volume_ohlcv = volume_ohlcv[['open_price', 'high', 'low', 'close_price', 'volume']]
    volume_ohlcv.index.name = None
    return volume_ohlcv
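# A minimal sketch of the grouper trick above on toy numbers (hypothetical data,
# not from the dataset): with a threshold of 10 and minute volumes [4, 4, 4, 4],
# cumulative volume is [4, 8, 12, 16] and cum_vol // 10 gives groups [0, 0, 1, 1].
# Shifting the crossing minute back yields [0, 0, 0, 1], so the third minute
# (which pushes the bar past the threshold) closes the first bar.
import pandas as pd
toy = pd.Series([4, 4, 4, 4])
g = toy.cumsum() // 10
m = g != g.shift(1)
m.iloc[0] = False
print((g - m.astype(int)).tolist())  # [0, 0, 0, 1]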
# Compare mean daily volume to our per-bar target
prices.groupby(prices.index.date)['volume'].sum().mean(), tgt_volume
volume_bars_ = volume_bars_vectorized(prices, tgt_volume)
############################### plots #######
# Compare per-bar volume: loop implementation (top) vs. vectorized (bottom)
ax1 = plt.subplot2grid((2, 1), (0, 0), colspan=1, rowspan=1)
ax1.set_ylim(0.2e7, 1.2e7)
ax2 = plt.subplot2grid((2, 1), (1, 0), colspan=1, rowspan=1)
ax2.set_ylim(0.2e7, 1.2e7)
ax1.plot(volume_bars['volume'])
ax2.plot(volume_bars_['volume'])
#####################################
"""
We can see that the volatility of volume around our target is much greater in the vectorized solution.
Both implementations suffer from look-ahead bias
(a solution would involve setting the volume target as some multiple of past volume,
with an exception case if one minute's volume exceeds this target),
and both accurately model price changes.
One runs extremely quickly, one more accurately approaches our volume target.
If you are incorporating volume bars into your research, pick whichever you prefer.
We will be continuing with the slower solution for this analysis.
"""
############################# plot price #############
plt.plot(volume_bars['close_price'])
plt.title('SPY Volume Bars')
plt.xlabel('Date')
plt.ylabel('Price');
# Little information was lost in our sampling method (on long timescales at least);
# the price behaviour remains clear.
####################### plot volume ##############
plt.plot(volume_bars['volume'])
plt.title('Volume per SPY Volume Bar')
plt.xlabel('Date')
plt.ylabel('Volume (Shares Traded)');
"""
That's pretty spiky, but decidedly centered at our target.
Theoretical worst case volume bar multiples should be 0.5x and 1.5x our target,
less if we use a higher multiple of max volume.
Let's see if we constructed this properly.
"""
print('Target Volume: ', tgt_volume)
print('Mean Volume: ', volume_bars['volume'].mean())
print('Max Volume Multiple: ', volume_bars['volume'].max()/tgt_volume)
print('Min Volume Multiple: ', volume_bars['volume'].min()/tgt_volume)
print('# Under: ', len(volume_bars[volume_bars['volume'] < tgt_volume]))
print('# Over : ', len(volume_bars[volume_bars['volume'] > tgt_volume]))
"""
That's pretty good! Our outliers are about 30% from our target,
but the mean is spot on and almost the exact same amount of bars surpass our target as those that fail to meet it.
Using tick data would largely eliminate this inefficiency,
but this approximation seems pretty good.
"""
### Let's look at returns, as this is where we want to analyze the statistical properties.
returns = volume_bars['close_price'].pct_change().dropna()  # Returns per bar
returns.plot()
plt.ylim(-0.021, 0.021)
plt.title('Returns of SPY per Volume Bar'); plt.xlabel('Date'); plt.ylabel('Returns');
"""
We do still see the variance changing over time, but more spread out.
This may actually prove problematic in our stationarity tests,
as a greater proportion of our data exhibits 'abnormal' variance.
The tradeoff is that it may better describe the trends in that generally volatile time period.
"""
############## Ah well, let's see.
# Histogram + distribution of returns -- if everything is OK, then it should look normal
ax = sns.distplot(returns, fit=norm, kde=False)
plt.title('Distribution of SPY Returns per Volume Bar')
plt.xlabel('Return')
plt.ylabel('Instances');
print('Sample size: ', len(returns))
print('Mean: ', returns.mean())
print('Std Dev: ', returns.std())
print('Jarque-Bera Test Results', jarque_bera(returns))
print('Augmented Dickey-Fuller Test Results', adf(returns, maxlag=1)[0:2])
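# A short interpretation sketch, assuming scipy.stats.jarque_bera and statsmodels'
# adfuller as imported above (both report a test statistic and a p-value):
# rejecting the Jarque-Bera null means the returns are not normally distributed;
# rejecting the ADF null (a unit root) is evidence the return series is stationary.
jb_stat, jb_p = jarque_bera(returns)
adf_stat, adf_p = adf(returns, maxlag=1)[0:2]
print('Normality rejected at 5%: ', jb_p < 0.05)
print('Unit root rejected at 5% (stationary): ', adf_p < 0.05)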