"""
Bars indexed by total volume, with each set # of shares traded creating a distinct bar.
We can transform minute bars into an approximation for volume bars, but ideally we would use tick bars
to maintain information for all parameters across bars.
Let's set a target bar volume equal to the maximum minutely volume in our dataset
(otherwise we'd have to split minute bars, which we can't).
Then, we construct our bars in such a way as to minimize the distance from this target:
Combine minute bars while maintaining the time of the first and final minutes,
adding volume of bars together, and keeping track of the high and low over the interval.
At each minute, check if the difference from the target volume would be minimized by adding or excluding the next minute.
Select the optimal option.
This is meant to prevent the scenario where your target volume is 100K shares,
and 2 bars of 99.9K shares follow each other. Adding the bars would result in essentially twice our target,
while this strategy would result in 2 volume bars much closer to the target.
This regularizes the volume of our bars.
"""
#############################
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, jarque_bera
from statsmodels.tsa.stattools import adfuller as adf

# Since our base data are not ticks, we cannot recover OHLCV fields if a minute bar is split,
# so the target volume per bar is set to the maximum single-minute volume in our interval.
# Each volume bar then consists of at least one full time bar; this causes some volume
# inequity between bars, but it is the best we can do without tick data.
tgt_volume = prices['volume'].max()
# Indices in time bars where volume bars should end
bar_ends = []
bar_volume = 0
skip = False  # Used to skip an index in the loop when the 2nd volume is consumed
# Iterate over each pair of neighboring minute bars' volumes and select the option
# that minimizes the difference from the target
for index, (volume1, volume2) in enumerate(zip(prices['volume'], prices['volume'][1:])):
    # Skip this iteration if the index was already consumed by the previous pair
    if skip:
        skip = False
        continue
    # If the target is surpassed within our pair, select the difference-minimizing index
    if bar_volume + volume1 + volume2 > tgt_volume:
        d1 = abs(tgt_volume - (bar_volume + volume1))
        d2 = abs(tgt_volume - (bar_volume + volume1 + volume2))
        # Include the 2nd index and set the loop to skip the next iteration
        if d1 > d2:
            bar_ends.append(index + 1)
            bar_volume = 0
            skip = True
        # Exclude the 2nd index, business as usual
        else:
            bar_ends.append(index)
            bar_volume = 0
    # If the pair wouldn't cross the target, just add the volume of the first index
    else:
        bar_volume += volume1
# Transfer and combine information from time bars into volume bars
volume_bars = prices.iloc[bar_ends].copy()
bar_count = 0
bar_volume = 0
min_price = prices.iloc[0]['low']
max_price = prices.iloc[0]['high']
open_price = prices.iloc[0]['open_price']
close_price = prices.iloc[0]['close_price']
for timestamp in prices.index:
    bar_volume += prices.loc[timestamp, 'volume']
    min_price = min(min_price, prices.loc[timestamp, 'low'])
    max_price = max(max_price, prices.loc[timestamp, 'high'])
    if timestamp in volume_bars.index:
        # Use .loc[row, col] rather than chained indexing so the writes actually land
        volume_bars.loc[timestamp, 'open_price'] = open_price
        volume_bars.loc[timestamp, 'high'] = max_price
        volume_bars.loc[timestamp, 'low'] = min_price
        close_price = prices.loc[timestamp, 'close_price']
        volume_bars.loc[timestamp, 'close_price'] = close_price
        volume_bars.loc[timestamp, 'volume'] = bar_volume
        # Reset the running aggregates for the next bar
        bar_volume = 0
        min_price = float('inf')
        max_price = float('-inf')
        bar_count += 1
        if bar_count < len(volume_bars):
            open_price = prices.loc[volume_bars.iloc[bar_count].name, 'open_price']
######################################## vectorized impl ###########################
def volume_bars_vectorized(ohlcv, volume_threshold):
    """Create volume-based OHLCV bars using pandas and numpy to make
    the computation more efficient.

    Parameters
    ----------
    ohlcv : pd.DataFrame
        columns = open_price, high, low, close_price, volume; index = datetime
    volume_threshold : int
        Number of shares traded per bar

    Returns
    -------
    pd.DataFrame
        DataFrame containing OHLCV data, indexed by the datetime at the end of each bar.
    """
    cum_vol = ohlcv['volume'].cumsum()
    grouper = cum_vol // volume_threshold
    # The minute on which cumulative volume crosses the threshold belongs to the
    # bar it completes, so shift that minute's group back by one
    mask = grouper != grouper.shift(1)
    mask.iloc[0] = False
    grouper = (grouper - mask.astype(int)).values
    volume_ohlcv = (ohlcv.reset_index().groupby(grouper)
                    .agg({'open_price': 'first', 'high': 'max',
                          'low': 'min', 'close_price': 'last',
                          'volume': 'sum', 'index': 'last'})).set_index('index')
    volume_ohlcv = volume_ohlcv[['open_price', 'high', 'low', 'close_price', 'volume']]
    volume_ohlcv.index.name = None
    return volume_ohlcv
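# A minimal sketch of the grouper trick above on toy numbers (hypothetical data,
# not from the dataset): with a threshold of 10 and minute volumes [4, 4, 4, 4],
# cumulative volume is [4, 8, 12, 16] and cum_vol // 10 gives groups [0, 0, 1, 1].
# Shifting the crossing minute back yields [0, 0, 0, 1], so the third minute
# (which pushes the bar past the threshold) closes the first bar.
import pandas as pd
toy = pd.Series([4, 4, 4, 4])
g = toy.cumsum() // 10
m = g != g.shift(1)
m.iloc[0] = False
print((g - m.astype(int)).tolist())  # [0, 0, 0, 1]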
# Compare mean daily volume to our per-bar target
prices.groupby(prices.index.date)['volume'].sum().mean(), tgt_volume
volume_bars_ = volume_bars_vectorized(prices, tgt_volume)
############################### plots #######
# Compare per-bar volume: loop implementation (top) vs. vectorized (bottom)
ax1 = plt.subplot2grid((2, 1), (0, 0), colspan=1, rowspan=1)
ax1.set_ylim(0.2e7, 1.2e7)
ax2 = plt.subplot2grid((2, 1), (1, 0), colspan=1, rowspan=1)
ax2.set_ylim(0.2e7, 1.2e7)
ax1.plot(volume_bars['volume'])
ax2.plot(volume_bars_['volume'])
#####################################
"""
We can see that the volatility of volume around our target is much greater in the vectorized solution.
Both implementations suffer from look-ahead bias
(a solution would involve setting the volume target as some multiple of past volume,
with an exception case if one minute's volume exceeds this target),
and both accurately model price changes.
One runs extremely quickly, one more accurately approaches our volume target.
If you are incorporating volume bars into your research, pick whichever you prefer.
We will be continuing with the slower solution for this analysis.
"""
############################# plot price #############
plt.plot(volume_bars['close_price'])
plt.title('SPY Volume Bars')
plt.xlabel('Date')
plt.ylabel('Price');
# Little information was lost in our sampling method (on long timescales at least);
# the price behaviour remains clear.
####################### plot volume ##############
plt.plot(volume_bars['volume'])
plt.title('Volume per SPY Volume Bar')
plt.xlabel('Date')
plt.ylabel('Volume (Shares Traded)');
"""
That's pretty spiky, but decidedly centered at our target.
Theoretical worst case volume bar multiples should be 0.5x and 1.5x our target,
less if we use a higher multiple of max volume.
Let's see if we constructed this properly.
"""
print('Target Volume: ', tgt_volume)
print('Mean Volume: ', volume_bars['volume'].mean())
print('Max Volume Multiple: ', volume_bars['volume'].max()/tgt_volume)
print('Min Volume Multiple: ', volume_bars['volume'].min()/tgt_volume)
print('# Under: ', len(volume_bars[volume_bars['volume'] < tgt_volume]))
print('# Over : ', len(volume_bars[volume_bars['volume'] > tgt_volume]))
"""
That's pretty good! Our outliers are about 30% from our target,
but the mean is spot on and almost the exact same amount of bars surpass our target as those that fail to meet it.
Using tick data would largely eliminate this inefficiency,
but this approximation seems pretty good.
"""
### Let's look at returns, as this is where we want to analyze the statistical properties.
returns = volume_bars['close_price'].pct_change().dropna()  # Returns per bar
returns.plot()
plt.ylim(-0.021, 0.021)
plt.title('Returns of SPY per Volume Bar'); plt.xlabel('Date'); plt.ylabel('Returns');
"""
We do still see the variance changing over time, but more spread out.
This may actually prove problematic in our stationarity tests,
as a greater proportion of our data exhibits 'abnormal' variance.
The tradeoff is that it may better describe the trends in that generally volatile time period.
"""
############## Ah well, let's see.
# Histogram + distribution of returns -- if everything is OK, then it should look normal
ax = sns.distplot(returns, fit=norm, kde=False)
plt.title('Distribution of SPY Returns per Volume Bar')
plt.xlabel('Return')
plt.ylabel('Instances');
print('Sample size: ', len(returns))
print('Mean: ', returns.mean())
print('Std Dev: ', returns.std())
print('Jarque-Bera Test Results', jarque_bera(returns))
print('Augmented Dickey-Fuller Test Results', adf(returns, maxlag=1)[0:2])
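# A short interpretation sketch, assuming scipy.stats.jarque_bera and statsmodels'
# adfuller as imported above (both report a test statistic and a p-value):
# rejecting the Jarque-Bera null means the returns are not normally distributed;
# rejecting the ADF null (a unit root) is evidence the return series is stationary.
jb_stat, jb_p = jarque_bera(returns)
adf_stat, adf_p = adf(returns, maxlag=1)[0:2]
print('Normality rejected at 5%: ', jb_p < 0.05)
print('Unit root rejected at 5% (stationary): ', adf_p < 0.05)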