Udacity AI for Trading Notes

1. Concepts

  • Nassim Taleb's home page

  • Quantdare

  • Quantrocket (tutorials, zipline, data)

  • Zipline documentation (Stephan Jansen) & link to github

  • Zipline tutorial (Stephan Jansen)

  • Alphalens ("Alphalens is a Python Library for performance analysis of predictive (alpha) stock factors.")

  • OHLC, ticks, timeframes

  • Volume. Higher volume at beginning ('price discovery') and end than in the middle of the day.

  • Pre-market: typ. 04:00 - 09:00; post-market: typ. 16:00 - 20:00. Normally low volume, but can be used for additional trading information.

  • Markets in different time zones: E.g. information about price development in Hong Kong can be used to anticipate the initial movement of the same stock on the LSE (however others will do the same).

  • Lack of data when exchanges are closed creates time-gaps that may have to be taken into account when analyzing time series. Approaches:

    • Ignore
    • Normalize price changes by length of the gap (because more news and events can happen during that time)
    • Consider on a case-by-case basis
  • Additionally, fundamental data, corporate actions, news etc. should be considered.

  • Corporate action examples: stock splits and dividends. For analysis, historical prices have to be adjusted (scaled) accordingly.

  • Behavioral: Immediately after a split, prices tend to rise ("feels cheaper").

  • Fundamental analysis examples: Sales per share, earnings per share, dividends per share, PE ratio

  • Returns:

    • raw return r = (p_t -p_(t-1))/p_(t-1)
    • log return R = ln(p_t/p_(t-1)) ~= r
    • Exactly: R = ln(r + 1) or r = e^R - 1
  • Some properties of log returns:

    • They correspond to the "continuous compounding rate", i.e. if you have amount x_(t-1) at the beginning and x_t at the end of the period, the log return ln(x_t/x_(t-1)) is the interest rate that produces this growth when compounding is applied over arbitrarily small time intervals.
    • Additivity: e.g. the log return over n months is the sum of the log returns for each month.
    • Numerical stability: Numerically adding logarithms of small numbers is more stable than multiplying the numbers directly.
  • While the distribution of log returns can be approximated by a normal distribution (if we ignore fat tails etc.), raw returns (~e^R) are distributed approximately log normally (the log normal distribution is strictly positive and heavily skewed to the positive side) --> see lesson 7, Nr. 5 (including interesting Jupyter Notebook). "Long-term prices and cumulative returns can be modeled as approximately lognormally distributed because they are products of independently, identically distributed (IID) random variables. On the other hand, log returns sum over time. Therefore, if R1 = ln(p1/p0) and R2 = ln(p2/p1) are normal, their sum, the two-period log return, is also normal. Even if they are not normal, as long as they are IID, their long-term sum will be approximately normal, thanks to the Central Limit Theorem. This is one reason why using log returns can be convenient for modeling purposes."
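
A minimal sketch of these relations in pandas/numpy (the price values are made up):

import numpy as np
import pandas as pd

# Hypothetical daily close prices
prices = pd.Series([100.0, 101.5, 99.8, 102.3, 103.0])

raw_returns = prices.pct_change()               # r = (p_t - p_(t-1)) / p_(t-1)
log_returns = np.log(prices / prices.shift(1))  # R = ln(p_t / p_(t-1))

# R = ln(1 + r) and r = e^R - 1
assert np.allclose(log_returns.dropna(), np.log1p(raw_returns.dropna()))

# Additivity: the log return over the whole period is the sum of the per-period log returns
assert np.isclose(np.log(prices.iloc[-1] / prices.iloc[0]), log_returns.sum())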

Two definitions of "alpha"

One specific definition of alpha is the extra return that an actively managed fund can deliver, that exceeds the performance of passively investing (buy and hold) in a portfolio of stocks. Another specific definition of alpha, which we’ll primarily focus on in this course, is that of an alpha vector.

An alpha vector is a list of numbers, one for each stock in a portfolio, that gives us a signal as to the relative future performance of these stocks.

Converting "3d" to "2d" array

Data structure in csv:

ticker date open high low close volume adj_close adj_volume
ABC 2017-09-13 160.01 160.51 158.22 159.29 44580353.0 159.07 44260255.0

Different tickers in one file. Solution: Create separate dataframes for each column starting from "open". "date" becomes index. Content of "ticker" column provides new column names (one column per ticker). Example for "close" column:

import pandas as pd

def csv_to_close(csv_filepath, field_names):
    """Reads in data from a csv file and produces a DataFrame with close data.
    
    Parameters
    ----------
    csv_filepath : str
        The name of the csv file to read
    field_names : list of str
        The column names to use for the fields in the csv file

    Returns
    -------
    close : DataFrame
        Close prices for each ticker and date
    """

    df = pd.read_csv(csv_filepath, names=field_names)
    
    return df.pivot(index='date', columns='ticker', values='close')

Resampling data

Assuming the above "ohlc" dataframes contain daily data and we want to convert to weekly, "resampling" can be used.

def days_to_weeks(open_prices, high_prices, low_prices, close_prices):
    """Converts daily OHLC prices to weekly OHLC prices.
    
    Parameters
    ----------
    open_prices : DataFrame
        Daily open prices for each ticker and date
    high_prices : DataFrame
        Daily high prices for each ticker and date
    low_prices : DataFrame
        Daily low prices for each ticker and date
    close_prices : DataFrame
        Daily close prices for each ticker and date

    Returns
    -------
    open_prices_weekly : DataFrame
        Weekly open prices for each ticker and date
    high_prices_weekly : DataFrame
        Weekly high prices for each ticker and date
    low_prices_weekly : DataFrame
        Weekly low prices for each ticker and date
    close_prices_weekly : DataFrame
        Weekly close prices for each ticker and date
    """
    
    open_w = open_prices.resample('W').first()
    close_w = close_prices.resample('W').last()
    high_w = high_prices.resample('W').max()
    low_w = low_prices.resample('W').min()
    
    return open_w, high_w, low_w, close_w

Calculating percentage returns

Index should be date index. Attention: First row will have NaNs!

def calculate_returns(close):
    """
    Compute returns for each ticker and date in close.
    
    Parameters
    ----------
    close : DataFrame
        Close prices for each ticker and date
    
    Returns
    -------
    returns : DataFrame
        Returns for each ticker and date
    """
    
    return (close - close.shift(1))/close.shift(1)

Generating entry triggers via type conversion

def generate_positions(prices):
    """
    Generate the following signals:
     - Long 30 shares of stock when the price is above 50 dollars
     - Short 10 shares when it's below 20 dollars
    
    Parameters
    ----------
    prices : DataFrame
        Prices for each ticker and date
    
    Returns
    -------
    final_positions : DataFrame
        Final positions for each ticker and date
    """
    signal_long = (prices > 50).astype(int)   # DataFrame of 0s and 1s
    signal_short = (prices < 20).astype(int)
    long_pos = signal_long * 30
    short_pos = signal_short * (-10)
    
    return long_pos + short_pos

Select top performing industries

prices:

date GOOG AAPL C ...
2019-02-20 7 11 3 ...

sector:

GOOG AAPL C ...
SecA SecB SecC ...

We return a set of sectors because several of the top-performing stocks may be in the same sector. (Note that different stocks may have the same performance, and that we only look at the top performing stocks on the given date, not at the performance of the sector as a whole.)

def date_top_industries(prices, sector, date, top_n):
    """
    Get the set of the top industries for the date
    
    Parameters
    ----------
    prices : DataFrame
        Prices for each ticker and date
    sector : Series
        Sector name for each ticker
    date : Date
        Date to get the top performers
    top_n : int
        Number of top performers to get
    
    Returns
    -------
    top_industries : set
        Top industries for the date
    """

    top_stocks = list(prices.loc[date].nlargest(top_n).index)
    top_industries = set(sector[top_stocks].tolist())
    
    return top_industries
    # or: return set(sector.loc[prices.loc[date].nlargest(top_n).index])

t-test

  • t-statistic : (x_ave - mu)/s_x_ave, where x_ave = sample mean; mu = population mean, s_x_ave = standard error of the sample mean = standard deviation / sqrt(n); n = number of samples
  • p-value : The probability of obtaining a sample mean at least as extreme as x_ave when the population mean is mu (depends on the size of the fluctuations within the sample and the sample size, i.e. on the standard error s_x_ave)
  • To accept the mean (positive) return of a strategy backtest as meaningful for future results, the t-statistic for the null hypothesis mu=0 (i.e. zero return) is calculated and a p-value smaller than a threshold alpha is required (e.g. alpha=0.1).
  • scipy.stats package: "ttest_1samp − Calculates the T-test for the mean of ONE group of scores. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations ‘a’ is equal to the given population mean, popmean." (Two-sided means: the test does not discriminate whether the sample mean is larger or smaller than the population mean. With the parameter "alternative", this could be specified.)
  • Scipy tutorial --> Analysing one sample - descriptive statistics
  • Numpy/scipy doc
import pandas as pd
import numpy as np
import scipy.stats as stats

def analyze_returns(net_returns):
    """
    Perform a t-test, with the null hypothesis being that the mean return is zero.
    
    Parameters
    ----------
    net_returns : Pandas Series
        A Pandas Series of net returns, indexed by date
    
    Returns
    -------
    t_value
        t-statistic from t-test
    p_value
        Corresponding p-value
    """
    # TODO: Perform one-tailed t-test on net_returns
    # Hint: You can use stats.ttest_1samp() to perform the test.
    #       However, this performs a two-tailed t-test.
    #       You'll need to divide the p-value by 2 to get the result of a one-tailed p-value.
    null_hypothesis = 0.0
    result = stats.ttest_1samp(net_returns, null_hypothesis)
    
    return result.statistic, result.pvalue/2
    
def test_run(filename='net_returns.csv'):
    """Test run analyze_returns() with net strategy returns from a file."""
    net_returns = pd.read_csv(filename, index_col=0, header=0).squeeze("columns")
    t, p = analyze_returns(net_returns)
    print("t-statistic: {:.3f}\np-value: {:.6f}".format(t, p))


if __name__ == '__main__':
    test_run()

Review of Project one:

One-liner possibility for get_top_n function:

import numpy as np

def get_top_n(prev_returns, top_n):
    return (prev_returns.rank(axis=1, method='average', na_option='keep', ascending=False) <= top_n).astype(np.int64)

Momentum Strategies:

Confidence testing:

  • T-test
  • T-test assumptions
  • KS-test (Kolmogorov-Smirnov) a.k.a. goodness-of-fit: Key takeaways: "A goodness-of-fit is a statistical test that tries to determine whether a set of observed values match those expected under the applicable model. They can show you whether your sample data fit an expected set of data from a population with normal distribution. There are multiple types of goodness-of-fit tests, but the most common is the chi-square test. The Kolmogorov-Smirnov test determines whether a sample comes from a specific distribution of a population."
  • P-Value definition
  • Null Hypothesis
  • Two-tailed test
  • One-tailed test
  • Z-test: Key takeaways: "A z-test is a statistical test to determine whether two population means are different when the variances are known and the sample size is large...Z-tests are closely related to t-tests, but t-tests are best performed when an experiment has a small sample size. Z-tests assume the standard deviation is known, while t-tests assume it is unknown."

2. Trading Strategies

Types:

  • Single asset strategies
  • Pairwise strategies
  • Cross sectional strategies (equity statistical arbitrage, equity market neutral investing)
  • Alternative-data based strategies

Steps for creation of a cross-sectional strategy:

  1. Data collection,
  2. Universe definition,
  3. Alpha factor identification (alpha discovery; signal research): From a trading hypothesis, identify numerical signals (alphas) that decide which assets to buy / sell at what time.
  4. Alpha combination: In modern markets, a single alpha is normally not enough but several alphas must be combined to obtain a viable strategy. "Model stacking / ensembling" in machine learning.
  5. Portfolio construction
  6. Trading

Trading signal: Any numerical signal that informs a trade.

Alpha vector: Vector of numbers representing the portfolio weight of each stock in our universe (can be negative if short). (The alphas in the vector are a specific type of trading signal)

Risk analysis: The strategy should include a mathematical model of the risks.

  • Systematic risks (e.g. inflation, recession, interest rates, GDP...)
    • Sector-specific risks (legislation, material costs,...)
  • Idiosyncratic risks (inherent to individual stock: labor strike, managerial change,...)

Objective function: Which quantities to maximise/minimise (e.g. max return, min variability)

Other constraints: E.g. min/max number of assets; long-only; min market cap;...

**Outliers**:

  • Can be caused by missing or false data or by real events. Precaution: Examine backtesting dataset for outliers (e.g. price changes considerably larger than average changes). Avoid low-volume (money value volume) stocks with sudden price moves. Avoid "gappy" stocks. Avoid companies depending on single or few products, e.g. single-drug biotech companies....
  • Check backtesting results for "strange" statistical features, e.g. skewed / distorted distribution of returns. Q-Q plot (return distribution quantiles vs. normal distribution quantiles) - marked / systematic deviations from a straight line should be explainable.
  • Effect of outliers on trading signals can be mitigated by basing trading signals on moving averages instead of single closing prices (at the cost of signal delay) or by averaging over multiple stocks, e.g. of one sector. Ongoing research: how to employ machine learning in outlier detection.

Distributions and regression

  • Distribution examples: Uniform, Log-normal, Normal, Exponential
  • X~D : The random variable X follows probability distribution D
  • P(x|D): The probability of x given D is given by the probability density function p of the distribution D.
  • Note that for discrete variables, a probability mass function gives a probability for each value. However, for a probability density function, which describes a continuous variable, the probability is defined as the area under the curve within a range (between two points). For a single point, the area under the curve is actually zero, which is why probabilities for continuous variables are defined within a range instead of a single point.
  • Matplotlib: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html

Histogram plot for a Pandas series:

import pandas as pd
import matplotlib.pyplot as plt

def plot_histogram(sample, title, bins=16, **kwargs):
    """Plot the histogram of a given sample of random values.

    Parameters
    ----------
    sample : pandas.Series
        raw values to build histogram
    title : str
        plot title/header
    bins : int
        number of bins in the histogram
    kwargs : dict 
        any other keyword arguments for plotting (optional)
    """
    fig, ax = plt.subplots()
    sample.hist(ax=ax, bins=bins, **kwargs)
    ax.set_title(title)
    plt.show()
    
    return

Possibilities to check for normality of a distribution

  • Many statistical methods like regression assume that the underlying data samples are normally distributed. If they are not, the conclusions drawn may be wrong. Therefore it is important to test a distribution for normality.

  • Visual:

    • Histogram
    • Boxplot (see below).
    • QQ Plot (Plotting the quantiles of the distribution under investigation versus the quantiles of the normal distribution. Typ. quantiles = quartiles, deciles, percentiles (100 bins)). If the two distributions are the same, the plot will be a straight 45 degree line.
  • Mathematical tests: Null hypothesis = data is normally distributed. Calculate a p value. If e.g. p<0.05, the confidence that the distribution is not normal is >95%.

    • Shapiro-Wilk test. One of the best performing tests. The Wikipedia article lists other tests with high performance.
    • D’Agostino-Pearson test. Mostly tests whether kurtosis and skewness deviate from those of a normal distribution, so it can perform worse than Shapiro-Wilk for certain forms of distributions.
    • Kolmogorov-Smirnov test. Can be used to compare any two distributions (like the QQ plot). May perform worse than Shapiro-Wilk.
    • To be thorough, use different tests. If only one, use Shapiro-Wilk.
  • Example for generating samples of different distributions, box plot, histograms, QQ plot, Shapiro-Wilk and Kolmogorov-Smirnov test: test-normality.ipynb
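
A minimal sketch of the Shapiro-Wilk and Kolmogorov-Smirnov tests with scipy.stats (the sample below is synthetic and deliberately non-normal):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

# Shapiro-Wilk: null hypothesis = the sample comes from a normal distribution
sw_stat, sw_p = stats.shapiro(sample)

# Kolmogorov-Smirnov against a normal distribution with the sample's mean and std
ks_stat, ks_p = stats.kstest(sample, 'norm', args=(sample.mean(), sample.std()))

print(f"Shapiro-Wilk p={sw_p:.4f}, KS p={ks_p:.4f}")  # small p -> reject normality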

Box-Whisker Plot

  • Similar to a horizontal "candle".
  • The body is between the first quartile (Q1 - i.e. 25% of data points below the body) and the third quartile (Q3 - i.e. 25% of data points above the body); the median (Q2) is at its middle.
  • Inter Quartile Range (IQR): Distance Q3 - Q1
  • Whiskers: Between Q1 - 1.5*IQR and Q1; between Q3 and Q3 + 1.5*IQR (Wikipedia: Because the whiskers must end at an observed data point, the whisker lengths can look unequal, even though 1.5 IQR is the same for both sides.)
  • Data points below / above the whiskers are considered outliers and shown as single dots.
  • For a symmetrical distribution, median = mean
  • For a normal distribution, the box plot is symmetric and 24.65% of data points are within each whisker (and as always 50% in the box)
  • Stock returns are left-skewed and have fat tails, i.e. the box plot is not symmetric

Check for time-constant variance of the data

One of the assumptions of linear regression is that its input data are homoscedastic (i.e. the variance is not time dependent). A visual way to check whether our data is homoscedastic is a scatter plot.

If our data is heteroscedastic, a linear regression estimate of the coefficients may be less accurate (further from the actual value), and we may get a smaller p-value than should be expected, which means we may assume (incorrectly) that we have an accurate estimate of the regression coefficient, and assume that it’s statistically significant when it’s not.

Test for homoscedasticity: Breusch-Pagan Test

Transform data to become normally distributed and homoscedastic

  • Financial data can be transformed into a more homoscedastic dataset by taking relative changes, i.e. dividing prices of subsequent trading days instead of using the raw prices.
  • For normalization of the distribution, the log function can be applied.
  • More general: Box-Cox transformation, T(x) = (x^lambda - 1)/lambda, or for lambda=0: T(x) = ln(x). Different lambdas can be tried to optimize normality / homoscedasticity. (The Box-Cox transformation is monotonic, i.e. it preserves the ordering of the data points.)
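
A minimal sketch of the Box-Cox transformation with scipy.stats (the data is synthetic; Box-Cox requires strictly positive input):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
prices = rng.lognormal(mean=4.0, sigma=0.3, size=500)   # strictly positive, right-skewed

transformed, fitted_lambda = stats.boxcox(prices)       # lambda chosen by maximum likelihood
print(f"fitted lambda: {fitted_lambda:.3f}")            # lambda near 0 ~ log transform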

Linear Regression

  • Find coefficient (slope) and intercept so that the sum of squared residuals (vertical distances of the data points to the regression line) is minimized ("ordinary least squares" method).
  • If residuals are normally distributed with mean 0 and constant standard deviation, we can assume they are random and there is no bias.
  • Multiple regression considers linear dependence of multiple independent variables, i.e. with multiple slope values to be determined.
  • Multivariate linear regression refers to multiple dependent variables (and multivariate multiple has multiple dependent and multiple independent variables, so the coefficients form a matrix and the intercepts a vector).
  • R-squared: Between 0 and 1; if 1, all variation in our data is captured by the selected independent variables.
  • Adjusted R-squared: Penalizes additional independent variables, which helps to keep the number of independent variables small.
  • F-test: Tests the Null hypothesis that the coefficients and the intercept are equal to zero (and thus there is no dependence on the independent variables). Should be p < 0.05.
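
A minimal sketch with statsmodels OLS, which reports the R-squared, adjusted R-squared and F-test p-value mentioned above (the data is simulated):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=250)
y = 0.5 * x + rng.normal(scale=0.2, size=250)   # linear relation plus noise

X = sm.add_constant(x)                          # adds the intercept term
results = sm.OLS(y, X).fit()

print(results.params)                           # intercept and slope
print(results.rsquared, results.rsquared_adj)   # R-squared, adjusted R-squared
print(results.f_pvalue)                         # F-test: are all coefficients zero?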

Breusch-Pagan test

The Breusch-Pagan test is one of many tests for homoscedasticity/heteroscedasticity. It takes the residuals from a regression, and checks if they are dependent upon the independent variables that we fed into the regression. The test does this by performing a second regression of the residuals against the independent variables, and checking if the coefficients from that second regression are statistically significant (non-zero). If the coefficients of this second regression are significant, then the residuals depend upon the independent variables. If the residuals depend upon the independent variables, then it means that the variance of the data depends on the independent variables. In other words, the data is likely heteroscedastic. So if the p-value of the Breusch-Pagan test is ≤ 0.05, we can assume with a 95% confidence that the distribution is heteroscedastic (not homoscedastic).

Breusch-Pagan test from statsmodels Python package: We input the residuals from the regression of the dependent variable against the independent variables. We also input the independent variables that may affect the variance of the data. The function outputs a p-value.
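
A minimal sketch of that statsmodels call (het_breuschpagan from statsmodels.stats.diagnostic); the data is simulated and heteroscedastic by construction:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 1.0 + 0.5 * x + rng.normal(scale=0.1 + 0.5 * np.abs(x), size=300)  # noise scale grows with |x|

X = sm.add_constant(x)
residuals = sm.OLS(y, X).fit().resid

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.6f}")  # <= 0.05 -> likely heteroscedastic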

Linear Regression Jupyter Notebooks

  • test_normality.ipynb: Creating a normally and a lognormally distributed dataset with scipy.stats; boxplots and Q-Q plots with matplotlib, Shapiro-Wilk and Kolmogorov-Smirnov tests with scipy.stats.
  • regression.ipynb: Creating "random walk" simulated stock price series based on normal distributions from numpy ("market average" as one series; beta factor for two stocks from normally distributed "noise" as two other series). Linear regression between both stocks using sklearn.linear_model.LinearRegression.

Time Series Analysis

  • Stock prices are non-stationary, i.e. sample mean and sample standard deviation change over time. In order to obtain more stationary data, we use log returns instead of prices.

  • Autoregressive model: Predict next value of a time series as a linear combination of previous values. The number of previous values used is called lag. An AR(n) model uses the n preceding values to predict the next: y_t = alpha + B_1 * y_(t-1) + B_2 * y_(t-2) + ... + epsilon_t. Intercept alpha and error term epsilon_t.

  • Vector Autoregressive model: Multivariate version of autoregression to model dependency of (past values of) different stocks on (the present value) of another stock.

  • Moving Average models: Predict the next value of a time series as a linear combination of the average and the residuals of previous time steps (i.e. time series value at (t-n) minus the average). MA(q) model of lag q: y_t = mu + eps_t + Theta_1 * eps_(t-1) + Theta_2 * eps_(t-2) + ... + Theta_q * eps_(t-q). Average mu and residuals eps_t. Question: is mu a moving average? To determine a good value for the lag, check the autocorrelation of the present value with previous values. High positive or negative autocorrelation indicates good predictors. Cut off where the autocorrelation becomes small.

  • Autoregressive (Integrated) Moving Average model: AR(I)MA(p,q) combination of both models; with two independent lags. Integrated refers to using differences of data points at subsequent times instead of the data points themselves to obtain stationary time series. The original time series is then the "integral" (assuming piecewise linear development between data points) of the time difference series. For stock prices, we use differences of the log of prices. The log returns of prices are called integrated of order 0, I(0). The prices are integrated of order 1, I(1).

  • Augmented Dickey Fuller test: Test for stationarity of a time series. If p<=0.05, the time series can be considered stationary and I(0). Otherwise, create a new time series as the difference of subsequent data points and check that with the Dickey-Fuller test for stationarity. Repeat (n times) until a stationary time series is obtained. The original time series is then I(n). The stationary time series can be run through an ARMA model and the model for the original is obtained by n-fold "integration".

  • SARIMA model (seasonally adjusted ARIMA): Form the difference between values exactly one year apart to remove seasonality effects. Otherwise apply the approach described above to obtain a stationary series. The seasonality adjustment may also be performed after the "simple" time difference.

  • Note: In general, autoregressive moving average models are not able to forecast stock returns because stock returns are non-stationary and also quite noisy. There isn't much correlation between previous periods with the current period. Volatility tends to have more of a correlation with past volatility. In general, using past stock returns to predict future stock returns is rather difficult.

  • Kalman Filters: (Only summary given).

    • All prediction-relevant information from previous time steps is summarized in a state for time t-1, and this state plus the time series value at t-1 is used to predict the value at time t. The state is updated for each time step. We do not need to select a specific lag as for the moving average and the autoregression analysis.
    • The Kalman Filter takes the time series of two stocks and generates its “smoothed” estimate for a "magic number" (state) at each new time period. Kalman Filters are often used in control systems for vehicles such as cars, planes, rockets, and robots. They’re similar to the application in pairs trading because they take noisy indirect measurements at each new time period in order to estimate state variables (location, direction, speed) of a system.
    • One way Kalman Filters are used in trading is for choosing the hedge ratio in pairs trading.
  • Particle Filters: (Only summary given).

    • Define a large number of models (particles) initialized with randomly selected parameters
    • At each time step, "reward" particles with a good prediction. Over time, only the best "particles" will remain ("genetic algorithm").
    • When predictions of multiple particles are closely clustered, this indicates higher confidence. When they are wider spread, the confidence is lower.
    • Particle filters can be used for data with non-normal distributions and for non-linear relationships.
    • Check out Sebastian Thrun’s lesson on particle filters in the free “Intro to Artificial Intelligence” Udacity course (lesson 16 “HMMs and Filters”, Node 18 “Particle Filters”).
  • Recurrent neural networks: (Only summary given).

AR(I)MA Jupyter notebook

  • autoregression_quiz.ipynb: Using statsmodels to simulate a price time series with autoregressive properties; using statsmodels.plot_acf to plot autocorrelation of "logarithmic returns"; using partial autocorrelation (plot_pacf) to isolate the autocorrelation effects of different lags from each other; conduct the Ljung-Box test from statsmodels; fit an ARMA model to the simulated autoregressive series; fit an ARIMA model.
    • Ljung-Box test: test the null hypothesis that a chosen lag is not autocorrelated with the current period. If p<=0.05, we can assume that there is an autocorrelation between the selected lag and the current period.
    • Statsmodels ARMA model
    • Statsmodels ARIMA model
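
A minimal sketch of the ADF test and an ARMA/ARIMA fit with statsmodels, run on a simulated random-walk price series (all names and parameters are illustrative):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(4)
log_prices = pd.Series(4.6 + np.cumsum(rng.normal(scale=0.01, size=500)))  # random walk

adf_stat, p_value, *_ = adfuller(log_prices)
print(f"ADF p-value on log prices: {p_value:.3f}")   # typically > 0.05 -> not stationary, I(1)

log_returns = log_prices.diff().dropna()
adf_stat, p_value, *_ = adfuller(log_returns)
print(f"ADF p-value on log returns: {p_value:.3f}")  # typically <= 0.05 -> stationary, I(0)

model = ARIMA(log_returns, order=(1, 0, 1)).fit()    # ARMA(1,1) on the stationary series
print(model.summary())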

Volatility

Use in trading:

  • Measuring risk
  • Defining position sizes
  • Designing alpha factors
  • Pricing options
  • Trading volatility directly

Calculate volatility:

  • Start with log returns of prices; r_i = ln(p_i/p_(i-1)). For m prices, you get (m-1) log returns.
  • sigma = sqrt(1/(n-1) * sum_i (r_i - r_mean)^2). n is the number of log returns, i.e. m-1.
  • For daily prices, this will be daily volatility. Needs to be extrapolated (annualized) to annual volatility for comparability. sigma_ann. = sqrt(252) * sigma_day
    • Reason: Assuming the daily log returns all have the same underlying probability distribution with the same standard deviation and are independent of each other, the relation var(A+B) = var(A) + var(B) holds, so var(r_ann.) = sum(var(r_day)).
    • typ. ann. stock volatility (i.e. standard deviation!): 0.1-0.5
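
A minimal sketch of this annualization, assuming prices is a pandas Series of daily closes:

import numpy as np

def annualized_volatility(prices, trading_days=252):
    """Annualized volatility from a Series of daily close prices."""
    log_returns = np.log(prices / prices.shift(1)).dropna()
    daily_vol = log_returns.std()               # sample standard deviation (ddof=1)
    return np.sqrt(trading_days) * daily_vol
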
def get_most_volatile(prices):
    """Return the ticker symbol for the most volatile stock.
    
    Parameters
    ----------
    prices : pandas.DataFrame
        a pandas.DataFrame object with columns: ['ticker', 'date', 'price']    
    Returns
    -------
    ticker : string
        ticker symbol for the most volatile stock
    """
    prices = prices.drop(labels='date', axis=1).groupby(by=['ticker']).std()
    return prices.price.idxmax()

Rolling Windows:

  • Stock market volatility changes over time.
  • To capture fluctuations in volatility, calculate over a rolling time window until yesterday. The length of a suitable time window depends on the intended holding period of the strategy. Longer time window --> less reactive; shorter time window --> more prone to fluctuations.
def calculate_simple_moving_average(rolling_window, close):
    """
    Compute the simple moving average.
    
    Parameters
    ----------
    rolling_window: int
        Rolling window length
    close : DataFrame
        Close prices for each ticker and date
    
    Returns
    -------
    simple_moving_average : DataFrame
        Simple moving average for each ticker and date
    """
    # TODO: Implement Function
    
    return close.rolling(window=rolling_window).sum()/rolling_window

Exponentially weighted moving average:

  • sigma_t^2 = (r_(t-1)^2 + lambda * r_(t-2)^2 + lambda^2 * r_(t-3)^2 + ... + lambda^(n-1) * r_(t-n)^2) / (1 + lambda + lambda^2 + ... + lambda^(n-1)) (the volatility is sigma_t, i.e. the square root of the expression above!)
    • Corresponds to calculating a "modified" variance as a weighted average of squared log returns over a time window, var_mod_n = 1/n * sum_i(alpha_i * r_i^2), with sum_i(alpha_i) = 1 and alpha_i decreasing exponentially, i.e. alpha_(i+1) = lambda * alpha_i; lambda between 0 and 1.
    • Note: "normal" variance var_n = 1/(n-1) * sum_i((r_ave - r_i)^2); so we're neglecting the '-1' (i.e. n large) and assuming that r_ave << r_i (r_ave small compared to standard deviation)
    • See also Pandas User Guide Window Overview, section about exponentially weighted windows; e.g. for understanding the relation to "classical" EMAs for stock data.
import pandas as pd
import numpy as np

def estimate_volatility(prices, l):
    """Create an exponential moving average model of the volatility of a stock
    price, and return the most recent (last) volatility estimate.
    
    Parameters
    ----------
    prices : pandas.Series
        A series of adjusted closing prices for a stock.
        
    l : float
        The 'lambda' parameter of the exponential moving average model. Making
        this value smaller will cause the model to weight older terms less 
        relative to more recent terms.
        
    Returns
    -------
    last_vol : float
        The last element of your exponential moving average volatility model series.
    
    """
    # TODO: Implement the exponential moving average volatility model and return the last value.
    ret = np.log(prices) - np.log(prices.shift(1))
    exp_wgt_vol = np.sqrt(np.square(ret).ewm(alpha=1-l).mean())
    last_vol = exp_wgt_vol.iloc[-1]
    return last_vol

Markets and volatility

  • Volatility is not (or only to a lesser extent) determined by incoming news that can have an influence on the asset's price.
  • Volatility correlates with volume: high demand/supply drives prices up/down.
  • Because of this, in times of high volatility, mean reversion strategies tend to be more successful, while in times of low volatility, momentum strategies tend to be more successful. (Price changes in times of high volatility are caused by high demand rather than by price-driving news, so the price moves back when demand decreases again.)
  • Volatility is higher when markets go down; lower when they go up. ("Gauge for fear")
  • VIX measures annualized implied volatility of 30-day S&P 500 options (and is itself tradable)
  • Published by CBOE, together with other volatility indexes. E.g. VVIX measures VIX volatility.

Using volatility in trading (examples only!)

  • Selecting a volatility range can be used to limit the investing universe
    • E.g. for low-volatility stocks, a mean reversion strategy could work better than for high-volatility. Reason: normally the lv stock stays closer to its fair price, so expected to return there in case of deviations.
    • Contrary to expectation, lv stocks perform better than hv stocks. Possible explanation: these stocks are ignored by investors as "boring" (?). Specific ETFs focus on lv stocks, e.g. USMV, SPLV.
  • Normalize trading signals by volatility (generally a good practice)
    • E.g. the same "momentum signal" for a lv stock is more "valuable" than for a hv stock.
  • Determine position size by (forecasted) volatility (i.e. invest less if volatility is higher)
    • Example: Pos. Size = R/(sigma * M * LastClose); R = $ Amount at risk (i.e. book loss) when an M-sigma event occurs against the position. sigma = volatility; M = trader-defined integer (e.g. 2 for a 2-sigma event)
    • Similar formula might also be used to limit the portfolio risk, if the portfolio volatility and last-day value of the whole portfolio is used.
  • Volatility can also be used to adjust TP and SL levels to a wider margin when volatility increases, to limit trading frequency.

Breakout strategies

  • With MA and Bollinger Bands:
    • BBs are drawn 2 sigma above and below the MA, where sigma is calculated over the same time window as the MA.
    • When the price falls below the lower BB and then returns inside, go long; when the price rises above the upper BB and then crosses back down inside, go short.
  • With rolling Max and Min:
    • Go long when the price breaks the Max of the previous N days (e.g. N=20). Go short when it breaks the Min.
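
A minimal sketch of both variants, assuming close is a DataFrame of daily closes per ticker; the window length and band width are the illustrative values from above:

def bollinger_bands(close, window=20, num_std=2):
    """Rolling mean plus/minus num_std rolling standard deviations."""
    ma = close.rolling(window).mean()
    sd = close.rolling(window).std()
    return ma - num_std * sd, ma, ma + num_std * sd

def rolling_breakout_signal(close, window=20):
    """+1 when the close breaks above the previous window's max, -1 when it breaks below its min."""
    prev_high = close.rolling(window).max().shift(1)
    prev_low = close.rolling(window).min().shift(1)
    return (close > prev_high).astype(int) - (close < prev_low).astype(int)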

Pairs trading and mean reversion

Drift and volatility model

  • Model stock prices with "Brownian motion"; fundamental for estimating option prices beyond the Black-Scholes model as well as prices of bonds.
  • Equivalent to drift-volatility model: dp_t = drift term + volatility term = p_t * mu * dt + p_t * sigma_t * epsilon_t * sqrt(dt). p_t price at time t; mu expected stock return (estimated usually from historic returns); dt time interval; sigma_t time dependent volatility; epsilon_t "white noise term" with mean zero and stdev 1 - models stock movements not accounted for by the model. epsilon_t * sqrt(dt) is called a "Wiener process".
  • A stock price modeled by a drift-volatility model will show "mean reverting" behavior around the linear drift.
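
A minimal sketch of a discretized drift-volatility (geometric Brownian motion) price path; all parameter values are illustrative:

import numpy as np

def simulate_gbm(p0=100.0, mu=0.05, sigma=0.2, dt=1/252, n_steps=252, seed=0):
    """Simulate p_(t+1) = p_t * (1 + mu*dt + sigma*eps*sqrt(dt)) with eps ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=n_steps)
    increments = 1 + mu * dt + sigma * eps * np.sqrt(dt)
    return p0 * np.concatenate(([1.0], np.cumprod(increments)))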

Pairs trading / Cointegration

  • For two economically closely linked stocks that mostly move in tandem, short the one that recently had the larger price increase and long the other one. Close the position when the difference has diminished (or SL when the gap keeps increasing). The idea is that this is a market neutral position.
  • For pair trading scenarios, we use the actual stock prices, not log returns!
  • Recap Time Series (above):
    • Time series of stock prices are (assumed to be) I(1), i.e. differences of subsequent prices are I(0) or stationary.
    • Log returns of stock prices are I(0)
  • For certain pairs of stocks, we can get I(0) time series also as the linear combination of both I(1) stock price time series. Such pairs of stocks are called cointegrated
  • Hedge Ratio: Ratio of stock A to stock B. Simple: Calculate the price quotient. More precise: do a linear regression of B against A and take the coefficient (slope parameter) as hedge ratio. This takes into account multiple previous prices, instead of just the last.
  • Spread: Difference between actual price of B and price predicted by the regression, i.e. the residual / error term of the regression. Spread = y_t - (intercept + coefficient * x_t)
  • Stocks that are suitable for pair trading: The spread should be stationary (I(0)), i.e. its mean, variance and covariance should be stable over time.
  • If the spread between stock A and stock B is stationary, A and B are called cointegrated.
  • The hedge ratio (linear regression coefficient β) is called the Coefficient of Cointegration
  • Cointegrated stocks do not need to be highly correlated and vice versa.
  • Test for stationarity of the spread: Augmented Dickey Fuller Test (ADF)
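
A minimal sketch of the hedge-ratio regression, the spread, and the ADF check (series_a and series_b are hypothetical pandas Series of prices):

import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

def compute_spread(series_a, series_b):
    """Regress B against A; return the hedge ratio and the spread (the residuals)."""
    X = sm.add_constant(series_a)
    results = sm.OLS(series_b, X).fit()
    hedge_ratio = results.params.iloc[1]                 # slope coefficient
    spread = series_b - (results.params.iloc[0] + hedge_ratio * series_a)
    return hedge_ratio, spread

def spread_is_stationary(spread, alpha=0.05):
    """Augmented Dickey-Fuller test: True if the spread looks stationary."""
    return adfuller(spread)[1] <= alpha                  # index 1 is the p-value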

Augmented Dickey Fuller Test

To check if two series are cointegrated, we can use the Augmented Dickey Fuller (ADF) Test. First, let’s get some intuition to see what the ADF test is doing. It’s trying to determine whether the linear combination of the two series (which is also a time series) is stationary.

A series is stationary when its mean and covariance are constant, and also when the autocorrelation between one time period and another only depends on the time duration between them, and not the specific point in time of each observation.

If you could represent a series as an AR(1) model y_t = β * y_(t−1) + ϵ_t, let’s think about what happens if the β is greater than one. We can imagine putting in a value for y_(t−1) to get an estimate for y_t; then for the next day, we’ll use that value as y_(t−1) to put into the model and estimate the new y_t. We’d end up having a series that trends in one direction, so its mean is not constant, and therefore it is not stationary.

Next, if we had a β equal to one, then y_t = y_(t−1) + ϵ_t. We call this special case a random walk, and it means that the current price is equal to the previous price plus some white noise. Even though the mean of this series is constant, its covariance between one time period and another depends upon the point in time of the observations, so it is also not stationary.

Finally, if we had a β of less than one, then we notice that y_t depends upon less than 100% of the value of its previous value y_(t−1), with some added random noise ϵ_t. The series doesn’t trend in a particular direction. Its variance is also constant, and its covariance between any two data points doesn’t depend on the point in time of the data point. You can think of the series like a bouncing rubber ball that’s being tapped lightly by random raindrops. Without the rain, the bouncing ball would have smaller and smaller bounces, and eventually stop bouncing. With random raindrops falling on the ball, some raindrops would make the ball bounce more, others would make the ball bounce less. So overall, the ball maintains a constant bounce height over time.

So conceptually, the Augmented Dickey Fuller Test is a hypothesis test for which the null hypothesis is that a series is a random walk (its β is equal to one), and so the null hypothesis assumes that the series is not stationary. The alternate hypothesis is that β is less than one, and therefore it’s a stationary series. So if the ADF produces a p-value of 0.05 or less, we can say with a 95% confidence level that the series is stationary.

Enhanced explanation:

If we have an AR(p) model y_t = β_1 * y_(t−1) + … + β_p * y_(t−p) + ϵ_t, we can put all the terms that are not the white noise ϵ to the left, like this: y_t − β_1 * y_(t−1) − … − β_p * y_(t−p) = ϵ_t.

Then we set the left side of the equation equal to zero. What we have on the left is called the “characteristic equation”. You might recall from learning algebra that when we set an equation equal to zero, we usually are trying to solve for the roots (the values that make the equation equal to zero). Before we can solve for the roots of this equation, we need to rewrite it differently using something called backward shift notation.

Backward shift notation looks like this: B^n y_t = y_(t−n). So when we see y_(t−1), we’ll replace it with B^1 y_t. If we see y_(t−2), we’ll replace it with B^2 y_t. The nice thing about backward shift notation is that we can describe our lags in terms of y_t, which will come in handy in the part that’s coming up.

So we can change this equation: y_t − β_1 * y_(t−1) − … − β_p * y_(t−p) = 0 into this: y_t − β_1(B^1 y_t) − … − β_p(B^p y_t) = 0

Notice how we can now factor out the y_t, so we have: y_t(1 − β_1 B − … − β_p B^p) = 0

Okay, let’s look at some examples to see what this means. We saw previously that an AR(1) model with a coefficient of one: y_t = y_(t−1) + ϵ_t is called a random walk, and that a random walk is not stationary. If we write the characteristic equation of the random walk, it looks like this: y_t − y_(t−1) = ϵ_t = 0. Next, we rewrite it with backward shift notation: y_t − B y_t = 0. Then we factor out the y_t to get: y_t (1−B) = 0 and we solve for B to get B = 1. The root equals one, and you might hear people say that the series has a unit root, or that its root “equals unity”.

Next, let’s look at an AR(1) series where the β coefficient is less than one (let’s say β is 1/2): y_t = 1/2 * y_(t−1) + ϵ_t. The characteristic equation looks like this: y_t − 1/2 y_(t−1) = ϵ_t = 0. In backward shift notation, it looks like: y_t − 1/2 B y_t = 0. Factor out the y_t: y_t (1 − 1/2 B) = 0.

Solving for B is solving for the unit root of the characteristic equation. So we get 1 = 1/2 B, and so B = 2. Since the root is greater than one, we can say that the series is stationary.

Note that for series with more than one lag, we can solve for more than one root.

The Augmented Dickey Fuller Test has a null hypothesis that a series has a unit root. In other words, the null hypothesis is that a series is a random walk, which is not stationary. The alternate hypothesis is that the roots of the series are all greater than one, which suggests that the series is stationary. If the ADF gives a p-value of 0.05 or less, we reject the null hypothesis and can assume that the series is stationary.

Engle-Granger Test

The Engle Granger Test is used to check whether two series are cointegrated. It involves two steps. First, calculate the hedge ratio by running a regression on one series against the other y_t = β * x_t. We call the β the “hedge ratio”.

Second, we take y_t − β * x_t to create a series that may be stationary. We’ll call this new series z_t. Then we use the ADF test to check if that series z_t is stationary. If z_t is stationary, we can assume that the x and y series are cointegrated.
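
statsmodels bundles both Engle-Granger steps into a single function; a minimal sketch (series_y and series_x are hypothetical price Series):

from statsmodels.tsa.stattools import coint

def are_cointegrated(series_y, series_x, alpha=0.05):
    """Engle-Granger cointegration test via statsmodels.tsa.stattools.coint."""
    t_stat, p_value, critical_values = coint(series_y, series_x)
    return p_value <= alpha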

How to find meaningful pairs

  • Analysing all stocks e.g. in the S&P 500 for cointegration would require nearly 250k pairs to evaluate, which is too time consuming and may lead to spurious results.
  • Simple subdivisions like grouping stocks by industry may give too obvious candidates
  • ML clustering analysis of stocks can provide good candidates, but the cointegration candidates should still be supported by meaningful arguments.

Pairs trading executed

  • Perform a time series analysis of the spread. Take action when the spread is substantially wider or narrower than its average.
  • Short the spread: If the spread widens, short the asset that has risen, long the asset that has fallen.
  • Long the spread: If the spread narrows, long the asset that has fallen, short the asset that has risen.
  • Z-score: (value - mean)/(standard deviation).
  • Define a value of Z-score > 1 for "short spread" and < -1 for "long spread". Perform backtesting.
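
A minimal sketch of turning the spread into long/short-the-spread signals via its z-score; the rolling window length is an assumption:

def spread_signals(spread, window=30, threshold=1.0):
    """spread is a pandas Series; returns +1 = long the spread, -1 = short the spread, 0 = no action."""
    z_score = (spread - spread.rolling(window).mean()) / spread.rolling(window).std()
    return (z_score < -threshold).astype(int) - (z_score > threshold).astype(int)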

Example notebook pairs_trading.ipynb generates two time series from one random dataset with noise, calculates the hedge ratio, performs a linear regression to obtain the spread, and tests stationarity of the spread by applying the augmented Dickey-Fuller test from statsmodels.

Variations of Pairs Trading or Mean Reversion Trading

Note that it’s also possible to extend pairs trading to more than two stocks. We can identify multiple pairs and include these pairs in the same portfolio. We can also analyze stocks that are in the same industry. If we grouped the stocks within the same industry into a virtual portfolio and calculated the return of that industry, this portfolio return would represent the general expected movement of all stocks within the industry. Then, for each individual stock series, we can calculate the spread between its return and the portfolio return. We can assume that stocks within the same industry may revert towards the industry average. So when the spread between the single stock and the industry changes significantly, we can use that as a signal to buy or sell.

Cointegration with 2 or more stocks: Generalizing the 2-stock pairs trading method

We can extend cointegration from two stocks to three stocks using a method called the Johansen test. First let’s see an example of how this works with two stocks.

The Johansen test gives us coefficients that we can multiply to each of the two stock series, so that a linear combination produces a number, and we can use it the same way we used the spread in the prior pairs trading method.

w1 * stock1 + w2 * stock2 = spread

In other words, if the first stock series moves up significantly relative to the second stock, we can see this by an increase in the “spread” beyond its historical average. We will assume that the spread will revert down towards its historical average, so we’ll short the first stock that is relatively high, and long the second stock that is relatively low.

So far, this looks pretty much like what you did before, except instead of computing a hedge ratio to multiply to one stock, the Johansen test gives you one coefficient to multiply to each of the two stock series.

Now let’s extend this concept to three stocks. If we analyze three stock series with the Johansen test, we can determine whether all three stocks together have a cointegrated relationship, and that a linear combination of all three form a stationary series. Note that for the purpose of cointegration trading we use the original price series, and do not convert them to log returns. The Johansen test also lets us decide whether only two series are needed to form a stationary series, but for now, let’s assume that we find a trio of stocks that are cointegrated.

The Johansen test gives us three coefficients, one for each stock series. We take the linear combination to get a spread.

w1 * stock1 + w2 * stock2 + w3 * stock3 = spread

We get the historical average of the spread. Then we check if the spread deviates significantly from that average. For example, let’s say the spread increases significantly. So we check whether each of the three individual series moved up or down significantly to result in the change in spread. We short the series that are relatively high, and long the series that are relatively low. To determine how much to long or short, we again use the weights that are given by the Johansen test (w1,w2,w3).

For example, let’s say the spread has gotten larger. Let’s also pretend that w1 is 0.5, w2 is 0.3, and w3 is -0.1. Notice that the weights do not need to sum to 1. We’ll long or short the number of shares for each stock in these proportions. So for instance, if we traded 5 shares of stock1, we’ll trade 3 shares of stock2, and one share of stock3.

If we notice that stock1 is higher than normal, stock2 is lower than normal, and stock3 is lower than normal, then let’s see whether we long or short a stock, and by how much.

Since stock1 is higher than usual (relative to the others), we short 5 shares of stock1 because we expect it should revert by decreasing relative to the others.

Since stock2 is lower than normal, we long it by 3 shares, because we expect it to revert by increasing relative to the others.

Since stock3 is lower than normal, we would also long it by 1 share, but notice that w3 is a negative number (-0.1). Whenever we see a negative weight, it means we change a buy to a sell, or change a sell to a buy. So we long -1 share, which is actually shorting 1 share.

Johansen test details

Recall from the lesson on time series, that a vector autoregression attempts to describe a stock’s current value based on not only its prior values, but also the prior values of other stocks. Let’s use two stocks as an example: Note, I’m using the variable names “IBM” and “GE” to refer to the price series of these stocks. The μ refers to a historical average for each stock’s time series. The “e” refers to an error term for each stock.

IBM_t = μ_IBM + β_(1,1) * IBM_(t−1) + β_(1,2) * GE_(t−1) + e_(1,t)
GE_t = μ_GE + β_(2,1) * IBM_(t−1) + β_(2,2) * GE_(t−1) + e_(2,t)

We normally use matrices to make this easier to work with, so the equations above can be written as vector autoregression with a lag of one:

x_t = μ + B * x_(t−1) + e_t

To make things simpler to write, we’ll write the 2 x 2 matrix of betas with a capital B, and we’ll denote the vector of the two stocks with a lowercase x. We’ll write the vector of μ’s with a single μ, and so on.

For a lag of p, this formula looks like x_t = μ + B_1 * x_(t−1) + ... + B_p * x_(t−p) + e_t

Now, if you recall from studying cointegrated time series, taking the time-wise difference may help us create a stationary series. So we’ll denote the time-wise difference as Δx_t = x_t − x_(t−1).

Next, we can define Δx_t using a Vector Error Correction Model (VECM) like this:

Δx_t = μ + B * x_(t−1) + C_1 * Δx_(t−1) + ... + C_p * Δx_(t−p) + e_t

Notice how the B * x_(t−1) term is just the vector of the previous periods’ values, and not the time-wise difference like all the other terms to its right. The Johansen test checks how many rows in the matrix B are needed to form a cointegrated series. To do this, it uses an eigenvalue decomposition, to determine how likely the matrix B has a rank of 0, or 1, 2, or 3, up to the number of stocks that we’re looking at (most likely 2 or 3). If we were trying to see if 3 stocks were cointegrated, and the Johansen test estimated that the rank of matrix B was 3 (so all coefficient vectors linearly independent), then we’d assume that all three stocks form a cointegrated relationship. If, on the other hand, the Johansen test results showed that the rank of matrix B was likely 2, then only 2 of the 3 stocks are necessary to form a cointegrated relationship. So we’d want to try out all the pairs of stocks to see which two are cointegrated. If the rank was zero, then that means there was no cointegration among the stocks that we looked at.

To determine the rank, the Johansen test actually does a hypothesis test on whether the rank is 0, 1, 2 or 3, up to the number of stocks there are in the test (probably 2 or 3). Looking at the t-statistic or p-value can let you decide with a certain level of confidence if at least two or even three of these stocks form a cointegrated series.

The Johansen test gives us a vector that we can use as the weights we assign to each stock. If you are curious, this is the eigenvector corresponding to the largest eigenvalue in the eigenvalue decomposition. But again, let’s not worry about how to do eigenvalue decomposition, and just see how to use this vector of weights. These are the weights that we mentioned earlier when computing the linear combination of the stock prices, which is used in the same way as the spread.

So if we get w1, w2, w3 from the eigenvector w, we use these as weights on each stock, as we saw earlier:

w1 * stock1 + w2 * stock2 + w3 * stock3 = spread

To summarize, the Johansen test figures out whether a group of stocks is cointegrated, and if so, how to calculate a “spread” that we’ll keep track of for temporary deviations from its historical average. It also gives us the proportion of shares to trade for each stock.
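
A minimal sketch of running the Johansen test with statsmodels (prices is a hypothetical DataFrame with one column per stock; the det_order and k_ar_diff values are illustrative):

from statsmodels.tsa.vector_ar.vecm import coint_johansen

def johansen_weights(prices, det_order=0, k_ar_diff=1):
    """Return the trace statistics, their critical values, and the eigenvector used as spread weights."""
    result = coint_johansen(prices, det_order, k_ar_diff)
    trace_stats = result.lr1        # trace test statistics for rank 0, 1, ...
    critical_values = result.cvt    # 90% / 95% / 99% critical values
    weights = result.evec[:, 0]     # eigenvector belonging to the largest eigenvalue
    return trace_stats, critical_values, weights

# The spread is then the weighted linear combination of the price columns:
# spread = prices.values @ weights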

Project 2: Breakout Strategy

  • Price df: Start from a df with daily close prices of S&P500 stocks
  • Signal df for entry: Calculate another df of the same shape with 0, 1, -1 to indicate no signal / enter long / enter short for each stock symbol and each day. In this case, the signal indicates a breakout above the high / below the low of the previous days
  • Filter signal: Don't allow multiple successive long (or short) signals within a given number of days.
  • Delayed-exit profit df: The trade is simply exited after a fixed number of days. So we can calculate a third df of the shape of the price df that contains for each date and ticker the profit if entering on that day and exiting that many days later (we use the log return here).
  • By multiplying the delayed-exit profit df and the signal df, we obtain the profit df for this breakout-entry / time-delay-exit strategy.
  • Returns should be normally distributed (with a mean > 0 if the strategy is meaningful?) if we plot a histogram of returns for all trades and all stocks under this strategy.
  • One way to statistically validate the normality is by applying a Kolmogorov-Smirnov test (see above, "Possibilities to check for normality of a distribution" and test-normality.ipynb). In fact, the K-S test checks if two distributions are equal (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html); here we supply a normal distribution as the distribution to compare our test distribution to.
  • To find stocks that are outliers (i.e. produce a significantly non-normal distribution of returns under this strategy), we produce a df with two columns, with one row for each trade for all symbols. First column = ticker symbol; second column = trade result (return).
  • The normal distribution is fully characterized by its mean and variance; for these parameters we supply the mean and variance of all trades for all symbols under this strategy
  • We calculate the ks-test test statistic ("D-value"?) and p-value for each symbol. Those with a high test statistic and a low p-value (probability of null hypothesis: distributions are equal) are considered outliers.
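
A minimal sketch of the per-ticker KS test against a normal distribution parameterized by the overall mean and standard deviation (long_short_df is the hypothetical two-column DataFrame described above):

import pandas as pd
from scipy import stats

def ks_test_per_ticker(long_short_df):
    """long_short_df has columns ['ticker', 'return'], one row per trade."""
    overall_mean = long_short_df['return'].mean()
    overall_std = long_short_df['return'].std()
    rows = {}
    for ticker, group in long_short_df.groupby('ticker'):
        ks_stat, p_value = stats.kstest(group['return'], 'norm',
                                        args=(overall_mean, overall_std))
        rows[ticker] = {'ks_stat': ks_stat, 'p_value': p_value}
    return pd.DataFrame.from_dict(rows, orient='index')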

Reviewer hints:

3. Stocks, Indices, Funds

Indices

  • An index is scaled to (e.g.) 100 at its starting date
  • Price weighted index (NIKKEI, DOW JONES): Sum of prices of constituents, divided by the sum of prices at the start date, multiplied by 100
  • Market Cap weighted index (S&P500, EURO STOXX, HANG SENG, ...)
  • A cap can be introduced to limit the weight of the largest components. E.g. 15% per constituent in Hang Seng.
  • Large / Mid / Small Cap indices (e.g. S&P 500 = Large Cap; S&P 400 = Mid Cap; S&P 600 = Small Cap)
  • Growth / Value Stocks
  • Adding or removing from an index (e.g. mergers & acquisitions; bankruptcies; violation of index criteria): The index must be rebalanced at a certain date
  • Often, market cap calculation for index weighting is free-float adjusted (e.g. Hang Seng)
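
A minimal sketch of the two weighting schemes on made-up numbers:

import pandas as pd

prices_start = pd.Series({'A': 50.0, 'B': 200.0, 'C': 10.0})
prices_now = pd.Series({'A': 55.0, 'B': 210.0, 'C': 12.0})
shares_outstanding = pd.Series({'A': 1e9, 'B': 1e8, 'C': 5e9})

# Price-weighted index (NIKKEI / Dow style): ratio of price sums, scaled to 100
price_weighted = prices_now.sum() / prices_start.sum() * 100

# Market-cap-weighted index (S&P 500 style): ratio of total market caps, scaled to 100
cap_weighted = (prices_now * shares_outstanding).sum() / (prices_start * shares_outstanding).sum() * 100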

Mutual Funds

  • Actively managed (alpha) funds; passively managed (beta) funds = index funds
  • Alpha vs. beta: refers to intercept (alpha) and slope (beta) of a regression of portfolio excess return vs. market excess return. Excess return = return - risk free rate. Beta = 1: portfolio return matches market return. Alpha > 0: Even when the market doesn't return more than the risk free rate, the portfolio has a positive excess return.
  • "Smart Beta" funds or portfolios: Deviation from pure market cap weighting (or index reproduction) by introducing a different weighting in the expectation to reduce risk or increase returns. E.g. consider P/E or P/B ratio, dividend payments, equal weighting (same dollar amount per constituent), square root of market cap (to relatively increase the weight of smaller companies) etc. a.k.a "a little alpha on top of beta".
  • Mutual Funds: Restricted by regulation to long-only. No lockup period, i.e. money can be invested/pulled out every day.
  • Hedge Funds: Can also trade options and derivatives. Normally higher minimum investment; lockup period.
  • Relative Return: Excess return over benchmark. For an actively managed fund, also called active return - should be positive to reflect the value of the management. For a passive fund, the relative return should be zero; it corresponds to the tracking error TE = sqrt(252) * daily tracking error = sqrt(252) * std_dev(return_portfolio - return_benchmark).
  • Absolute Return: Usually used as performance measure for hedge funds. Their benchmark is the (slowly changing) US treasury rate or UK LIBOR. Thus the absolute return, instead of the excess return, is normally considered.
  • Hedging: Employment of market neutral strategies, e.g. by using options or futures as hedges against market turns.
  • NAV (Net Asset Value) of a fund: NAV = (AUM - expenses)/(number of shares). AUM = Assets under management; total value of all assets in the portfolio.
  • Expense ratios: Gross Expense Ratio: Expenses / AUM; Net Expense Ratio: (Expenses - Discounts) / AUM (e.g. discounts to attract new customers. Less relevant in the long run.)
  • Open end mutual fund: Number of shares outstanding increases / decreases when investors buy or redeem shares. Fund needs to keep some cash (interest becomes part of return) to pay out redeemed shares.
  • Closed end funds: Fixed number of shares outstanding. Need not reduce returns by holding back some cash. Shares can be traded on the stock exchange if investors want to redeem. Due to supply and demand, closed end funds often trade at a premium or discount compared to their fair value / NAV.
  • Transaction costs: When rebalancing stocks within a fund for large volumes, market makers may request lower buy prices / higher sell prices to cover their risk. For the fund, this becomes part of the transaction costs. Fund managers may decide not to perform rebalancing trades if the costs outweigh the benefits. Higher rebalancing frequency means higher transaction costs. Large fund companies may trade between their own funds to reduce transaction costs.
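
A minimal sketch of the annualized tracking error defined in the Relative Return bullet above, assuming two aligned pandas Series of daily returns:

```python
import numpy as np
import pandas as pd

def annualized_tracking_error(portfolio_returns: pd.Series,
                              benchmark_returns: pd.Series,
                              trading_days: int = 252) -> float:
    # TE = sqrt(252) * std_dev(daily portfolio return - daily benchmark return)
    active_returns = portfolio_returns - benchmark_returns
    return np.sqrt(trading_days) * active_returns.std()
```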

Exchange traded funds (ETFs)

  • Can avoid drawbacks of open- and closed-ended funds (see above).
  • Can be used to achieve exposure to foreign markets or commodities markets (commodities normally via futures)
  • Futures and Forwards: Futures are standardized forwards. Forwards are bespoke agreements between two parties about a future transaction and are not tradable. Futures contracts have standard contract sizes (also called "lot sizes") and standard due dates. Futures positions can be closed out (by opening an opposite position; clearing is handled by the futures exchange, so the original position is cancelled) or rolled forward (closing the original position shortly before its due date and opening a new position with a later due date). If a futures position is neither closed out nor rolled forward, execution is mandatory.
  • Commodities ETFs handle rolling / management of futures for individual investors (and allow smaller position sizes for the individual).
  • Foreign market ETFs avoid the necessity for an individual investor to trade at international exchanges / in different time zones.
  • Hedging with ETFs: Hedge funds can short ETFs to obtain market-neutral positions (overall market or specific sectors, depending on investment focus in the portfolio). The portfolio return is then only determined by the "alpha", i.e. outperformance of the market / sector return by the individually selected stocks.
  • Create Process of an ETF:
    • ETF Sponsor (e.g. Blackrock, Vanguard) designs a portfolio of stocks
    • AP (Authorized Participant) (e.g. Merrill Lynch, Goldman Sachs) buys the required stocks in the defined proportions on the market
    • AP delivers stocks to the sponsor and receives ETF shares
    • AP sells ETF shares to individual investors
    • ETF sponsors don't take cash to invest or deal directly with investors!
  • Redeem Process:
    • AP buys shares from individual investors
    • AP returns shares to ETF sponsor and receives stocks in return (reduces number of shares outstanding)
    • AP sells stocks on the market for cash
  • Tax efficiency: Because the ETF sponsor doesn't sell stocks at a profit, it can avoid capital gains tax (? but this would result in higher capital gains tax for the AP if the stocks are actually sold).
  • Arbitrage: APs can use arbitrage deals in ETF shares to make their part in the processes worth their while. Arbitrage opportunities arise when the ETF price exceeds the NAV (premium) or is lower than the NAV (discount) at a specific stock exchange. The difference between the ETF price and the NAV is called the basis and is measured in basis points; 1 basis point = 0.01%.
    • If the ETF trades at a premium compared to the value of underlying stocks, the AP can buy stocks and request new ETF shares from the ETF sponsor in turn (create). It can sell these ETF shares at a profit. The demand will increase the stock prices. The offer will decrease the price of the ETF shares.
    • Vice versa, when the ETF trades at a discount, the AP can buy and redeem ETF shares to sell the underlying stocks with a profit.

Portfolio Risk and Return

  • Idiosyncratic / specific risk: Risk attributable to one specific stock
  • Market / systematic risk: Risk common to all stocks (e.g. inflation, interest rates,...)
  • Portfolio mean: E(r_P) = sum_i(x_i * E(r_i)), where E(r_P) is the expected value of the portfolio return, E(r_i) the expected return of portfolio component i and x_i the weight of that component; sum_i(x_i) = 1.
    • E(r_i) = sum_n(p(n) * r_(i,n)), where p(n) is the probability of a specific future scenario and r_(i,n) the return of portfolio component i under that scenario (should rather be an integral over a continuum of scenarios)
  • Portfolio variance:
    • sigma_P^2 = sum_n(p(n)[r_P - E(r_P)]^2) = ... = sum_i(x_i^2 * sigma_i^2) + sum_(i,j; i!=j)(x_i * x_j * Cov(r_i, r_j));
    • Covariance Cov(r_i, r_j) = sum_n(p(n)(r_i - E(r_i))(r_j - E(r_j))) = rho_(r_i, r_j) * sigma_i * sigma_j
    • rho_(r_i, r_j) is the correlation coefficient between r_i and r_j. It can vary between +1 and -1.
    • sum_n again denotes the sum over "all future scenarios"; sum_i the sum over portfolio constituents
    • By having two perfectly anti-correlated components (rho=-1), it is possible to eliminate portfolio variance completely! Because market risk factors influence all portfolio components, perfect anti-correlation is not realistic.
    • For two perfectly correlated components (rho=+1), the portfolio volatility equals the weighted sum of the individual volatilities, sigma_P = x_A sigma_A + x_B sigma_B --> no benefit of risk reduction by diversification.
    • Risk reduction by diversification: For all rhos in between, the portfolio variance will be smaller than the sum of the variances of the constituents!
    • Math: The expression for portfolio variance is an example of a quadratic form (in each term, the sum of the exponents of the variables is two). Like every quadratic form, it can be written as sigma_P^2 = x^T P x with a symmetric matrix P. Here,
      • x is the vector of portfolio weights,
      • P = [(Cov(r_i,r_i), Cov(r_i,r_j)), (Cov(r_j,r_i), Cov(r_j,r_j))] is the Covariance matrix.
      • Cov(r_i, r_j) = E[(r_i - r^bar_i)(r_j - r^bar_j)] where r^bar_i is the mean of the values of r_i.
      • For discrete observations with equal probability and zero mean, Cov(r_i, r_j) = 1/(n-1) sum_k(r_(i,k) * r_(j,k)) = 1/(n-1) * r_i^T r_j, where the sum over k runs over all observations, n is the number of observations, and r_i, r_j are the vectors of observations.
      • Covariance with numpy; see m3l4_covariance.ipynb
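
The referenced notebook is not included here; a minimal numpy sketch of the same covariance calculation, assuming one row per asset and one column per observation:

```python
import numpy as np

# Daily returns: one row per asset, one column per observation.
returns = np.array([[0.010, -0.020, 0.004, 0.015],
                    [0.020, -0.010, 0.002, 0.010]])

# np.cov treats each row as a variable and uses the 1/(n-1) normalization by default.
cov_matrix = np.cov(returns)
print(cov_matrix)   # 2x2 covariance matrix of the two assets
```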

Efficient frontier

  • The efficient frontier in a portfolio return vs. portfolio volatility plot is the line that represents the highest achievable return for a given volatility. Rationally, portfolio weights should be chosen to construct a portfolio that lies on the efficient frontier. These portfolios are called efficient portfolios.
  • For a given volatility, returns higher than that of the efficient portfolio are not achievable, so only the region on or below the efficient frontier is accessible; everything strictly below the frontier is inefficient.
  • For a given set of stocks and their returns, it is not possible to select weights such that the volatility is lower than some minimum larger than zero. Together with the resulting return, this is the low-volatility endpoint of the efficient frontier. It is called the minimum variance portfolio.

Capital market line

  • Add a risk-free asset to the set of assets for which the efficient frontier is calculated (r_f: risk-free rate of return > 0; sigma_f = 0, i.e. zero volatility).
  • Real risk-free assets don't exist, but 3 month Treasury bills are considered as coming close (In the U.S., Treasury bills mature in 1 year or less, Treasury notes mature in 2 to 10 years, and Treasury bonds mature in 20 to 30 years)
  • In the return vs. volatility diagram, the risk-free asset will fall on the y (volatility=0) axis. For any risky portfolio, including those on the efficient frontier, the portfolio and the risk-free asset can be combined with arbitrary weights (adding up to one). The volatility vs. return of such a combination will fall on a straight line between the point on the y axis representing the risk-free asset and the point representing the risky portfolio risk vs. return.
  • The optimal portfolio is obtained when the straight line forms a tangent to the efficient frontier. The risk-free asset should be combined with the risky portfolio on the efficient frontier at the tangency point. (Except at that point, the tangent line lies above the efficient frontier, i.e. every other point on it offers a higher return for a given risk than the corresponding portfolio on the efficient frontier.)
  • Any risk-return combination on this capital market line can be achieved by varying the weighting of the risk-free asset and the risky portfolio. For returns higher than that of the risky portfolio, the weight for the risk-free asset will become negative, representing a borrowing of money (shorting) at the risk-free rate to invest in the risky portfolio.
  • The slope of the capital market line is the Sharpe ratio (r_M - r_f)/sigma_M; r_M: return of the risky portfolio; r_f: return of the risk-free asset; sigma_M: volatility of the risky portfolio. The numerator r_M - r_f is also called the excess return, differential return, or risk premium.

Calculating the Sharpe ratio

  • Article by William F. Sharpe
  • Average risk premium D_ave = sum_1^T(D_t)/T, with D_t = r_(portfolio, t) - r_(risk-free, t)
  • Standard deviation of D: sigma_D = sqrt(sum_1^T(D_t - D_ave)^2/(T-1))
  • Sharpe ratio = D_ave/sigma_D (according to the Sharpe article, this is the ex-post or historical Sharpe ratio)
  • Annualized Sharpe ratio = sqrt(252) * Sharpe ratio
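
A minimal sketch of the ex-post Sharpe ratio defined above, assuming a pandas Series of daily portfolio returns and a constant daily risk-free rate:

```python
import numpy as np
import pandas as pd

def annualized_sharpe_ratio(daily_returns: pd.Series,
                            daily_risk_free: float = 0.0,
                            trading_days: int = 252) -> float:
    excess = daily_returns - daily_risk_free   # D_t
    d_ave = excess.mean()                      # average risk premium
    sigma_d = excess.std(ddof=1)               # sample standard deviation of D
    return np.sqrt(trading_days) * d_ave / sigma_d
```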

Other risk measures

  • Semi-deviation: Calculate the volatility (standard deviation) by only including returns smaller than the average return (see the sketch after this list)
  • Value-at-Risk (VaR): "For example, if a portfolio of stocks has a one-day 95% VaR of $1 million, that means that there is a 0.05 probability that the portfolio will fall in value by more than $1 million over a one-day period if there is no trading. Informally, a loss of $1 million or more on this portfolio is expected on 1 day out of 20 days (because of 5% probability)." --> It is not specified how much the loss can be; the probability for a specific loss depends on the shape of the tails of the probability distribution.
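
One possible sketch of the semi-deviation measure above (conventions differ on whether to average over all observations or only the downside ones; this version averages over the downside returns):

```python
import numpy as np

def semi_deviation(returns: np.ndarray) -> float:
    # Deviations from the overall mean, counting only returns below that mean.
    mean = returns.mean()
    downside = returns[returns < mean] - mean
    return np.sqrt((downside ** 2).mean())
```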

Capital Asset Pricing Model

The CAPM (pronounced cap-M) is a model that describes the relationship between systematic risk and expected return for assets. The CAPM assumes that the excess return of a stock is determined by the market return and the stock’s relationship with the market’s movement. It is the foundation of the more advanced multi-factor models used by portfolio managers for portfolio construction.

Recap: the systematic risk, or market risk, is undiversifiable risk that’s inherent to the entire market. In contrast, the idiosyncratic risk is the asset-specific risk.

For a stock, the return of stock i equals the return of the risk free asset plus β times the difference between the market return and the risk free return. β equals the covariance of stock i and the market divided by the variance of the market.

r_i − r_f = β_i * (r_m − r_f); r_i = stock return, r_f = risk free rate, r_m = market return, β_i = cov(r_i,r_m)/σ_m^2

β describes which direction and by how much a stock or portfolio moves relative to the market. For example, if a stock has a β of 1.1, this indicates that if the market’s excess return is 5%, the stock’s excess return would be 1.1 * 5%, or 5.5%.

Compensating Investors for Risk:

Generally speaking, investors need to be compensated in two ways: time value of money and risk. The time value of money is represented by the risk free return. This is the compensation to investors for putting down investments over a period of time. β * (r_m − r_f) represents the risk exposure to the market. It is the additional excess return the investor would require for taking on the given market exposure, β. r_m − r_f is the risk premium, and β reflects the exposure of an asset to the overall market risk.

When the β_i for stock i equals 1, stock i moves up and down with the same magnitude as the market. When β_i is greater (less) than 1, stock i moves up and down more (less) than the market.

Let’s look at a simple example. If the risk free return is 2%, β_i of stock i equals 1.2 and the market return is 10%, the return of stock i equals 11.6%.

r_f = 2% β_i = 1.2 r_m = 10% r_i = 2% + 1.2 * (10% - 2%) = 11.6%
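
A small sketch of the beta definition and the worked example above; the return arrays are assumed to be aligned time series:

```python
import numpy as np

def capm_beta(stock_returns: np.ndarray, market_returns: np.ndarray) -> float:
    # beta_i = cov(r_i, r_m) / var(r_m)
    covariance = np.cov(stock_returns, market_returns)[0, 1]
    return covariance / market_returns.var(ddof=1)

# CAPM expected return for the worked example above:
r_f, beta_i, r_m = 0.02, 1.2, 0.10
r_i = r_f + beta_i * (r_m - r_f)   # 0.116, i.e. 11.6%
```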

Security Market Line

The Security Market Line is the graphical representation of CAPM and it represents the relation between the risk and return of stocks. Please note that it is different from the capital market line. The y-axis is expected returns but the x-axis is beta. (You may recall that for the capital market line we learned earlier, the x-axis was the standard deviation of a portfolio.) As beta increases, the level of risk increases. Hence, investors demand higher returns to compensate for the higher risk.

The Security Market Line is commonly used to evaluate if a stock should be included in a portfolio. At time points when the stock is above the security market line, it is considered “undervalued” because the stock offers a greater return against its systematic risk. In contrast, when the stock is below the line, it is considered overvalued because the expected return does not overcome the inherent risk. The SML is also used to compare similar securities with approximately similar returns or similar risks.

Portfolio return and beta in CAPM

  • r_portfolio = sum_i(w_i * (r_f + beta_i * (r_m - r_f))); w_i, beta_i: weight and beta factor of stock i; r_m, r_f: market and risk free return.
  • beta_portfolio = sum_i(w_i * beta_i)

Portfolio Optimization

Basic:

  • Finding a minimum (or maximum): set the first-order partial derivatives to 0 and check the Hessian matrix H (example for two variables x, y):
  • det(H(a,b)) > 0 and fxx(a,b)>0: (a,b) is local minimum
  • det(H(a,b)) > 0 and fxx(a,b)<0: (a,b) is local maximum
  • det(H(a,b)) < 0: (a,b) is saddle point
  • det(H(a,b)) = 0: inconclusive.

Terminology:

  • Objective function, cost function: The function to be optimized (can be taken as: minimized)
  • Optimization variable, in our case, the vector of portfolio weights
  • Constraints, inequality or equality conditions that must be fulfilled. Otherwise: unconstrained problem.
  • Optimal / solution vector: The value of the optimization variable that yields the smallest objective function value given the constraints.
  • Domain of the problem: The set of points for which the objective function and all constraints are defined
  • Feasible set: Subset of the domain where all constraints are fulfilled. The problem is feasible if the feasible set is not empty.
  • The optimization problem is unbounded below, if the objective function reaches neg. infinity within the feasible set.
  • While general optimization problems are hard to solve, well-developed methods exist for convex cost functions (they curve upward everywhere; the straight line segment between any two points on the curve lies above the curve between those points). Then there is only one global minimum.
  • Convex optimization problem: Objective is convex; inequality constraints convex; equality constraints can be written as f(x) = a^T x + b. If you find a local minimum, it is also the global minimum.

Simple example: 2 asset portfolio - minimum variance portfolio

  • Objective function: minimize sigma_P^2 = x_A^2 sigma_A^2 + x_B^2 sigma_B^2 + 2 x_A x_B sigma_A sigma_B rho_(r_A,r_B) where rho is the correlation between the returns r_A and r_B.
  • constraint: x_A + x_B = 1
  • Can be analytically solved (quadratic equation); solution:
  • x_(A,min) = (sigma_B^2 - sigma_A sigma_B rho)/(sigma_A^2 + sigma_B^2 - 2 sigma_A sigma_B rho), x_B = 1 - x_A
  • mu_P = mu_A x_A + mu_B x_B; mu_P is the portfolio mean return; mu_A and mu_B the individual asset mean returns.
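
A minimal sketch of the closed-form two-asset solution above; the volatilities and correlation are made-up numbers:

```python
def min_variance_weights(sigma_a: float, sigma_b: float, rho: float):
    # x_A = (sigma_B^2 - sigma_A sigma_B rho) / (sigma_A^2 + sigma_B^2 - 2 sigma_A sigma_B rho)
    cov_ab = rho * sigma_a * sigma_b
    x_a = (sigma_b**2 - cov_ab) / (sigma_a**2 + sigma_b**2 - 2 * cov_ab)
    return x_a, 1.0 - x_a

sigma_a, sigma_b, rho = 0.10, 0.05, 0.25
x_a, x_b = min_variance_weights(sigma_a, sigma_b, rho)
portfolio_variance = (x_a**2 * sigma_a**2 + x_b**2 * sigma_b**2
                      + 2 * x_a * x_b * rho * sigma_a * sigma_b)
```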

Example: Covariance matrix with numpy, m3l4_covariance.ipynb

Formulating portfolio optimization problems

So far, we've discussed one way to formulate a portfolio optimization problem. We learned to set the portfolio variance as the objective function, while imposing the constraint that the portfolio weights should sum to 1. However, in practice you may frame the problem a little differently. Let's talk about some of the different ways to set up a portfolio optimization problem.

Common Constraints

There are several common constraints that show up in these problems. Earlier, we were allowing our portfolio weights to be negative or positive, as long as they summed to 1. If a weight turned out to be negative, we would consider the absolute value of that number to be the size of the short position to take on that asset. If your strategy does not allow you to take short positions, your portfolio weights will all need to be positive numbers. In order to enforce this in the optimization problem, you would add the constraint that every x_i in the x vector is positive.

no short selling: 0≤x_i≤1,i=1,2,…,n

You may choose to impose constraints that would limit your portfolio allocations in individual sectors, such as technology or energy. You could do this by limiting the sum of weights for assets in each sector.

sector limits: x_biotech_1 + x_biotech_2 + x_biotech_3 ≤ M, M = % of portfolio to invest in biotech companies

If your optimization objective seeks to minimize portfolio variance, you might also incorporate into the problem a goal for the total portfolio return. You can do this by adding a constraint on the portfolio return.

constraint on portfolio return: x^Tμ ≥ r_min, r_min = minimum acceptable portfolio return

Maximizing Portfolio Return

We can also flip the problem around by maximizing returns instead of minimizing variance. Instead of minimizing variance, it often makes sense to impose a constraint on the variance in order to manage risk. Then you could maximize mean returns, which is equivalent to minimizing the negative mean returns. This makes sense when your employer has told you, “I want the best return possible, but you must limit your losses to p percent!”

objective: minimize: −x^Tμ

constraint: x^TPx ≤ p, p = maximum permissible portfolio variance

Maximizing Portfolio Return And Minimizing Portfolio Variance

Indeed, you could also create an objective function that both maximizes returns and minimizes variance, and controls the tradeoff between the two goals with a parameter, b. In this case, you have two terms in your objective function, one representing the portfolio mean, and one representing the portfolio variance, and the variance term is multiplied by b.

How does one determine the parameter b? Well, it’s very dependent on the individual and the situation, and depends on the level of risk aversion appropriate. It basically represents how much return you are willing to give up for each unit of variance you take on.

objective: minimize: −x^Tμ + bx^TPx, b = tradeoff parameter

A Math Note: the L2-Norm

There’s another way to formulate an optimization objective that relies on a new piece of notation, so I’ll just take a moment to explain that now. Say we just want to minimize the difference between two quantities. Then we need a measure of the difference, but generalized into many dimensions. For portfolio optimization problems, each dimension is an asset in the portfolio. When we want to measure the distance between two vectors, we use something called the Euclidean norm or L2-norm. This is just the square root of the sum of the squared differences of the vectors’ components. We write it with double bars and a 2 subscript.

d = sqrt((a_x−b_x)^2 + (a_y−b_y)^2 + (a_z−b_z)^2) = ∥a−b∥_2

Note that this reduces to the familiar Pythagorean theorem in 2 dimensions.

Minimizing Distance to a Set of Target Weights

Back to portfolio optimization! One way to formulate an optimization problem is to use the L2 norm and minimize the difference between your vector of portfolio weights and a set of predefined target portfolio weights x^∗. The goal would be to get the weights as close as possible to the set of target weights while respecting a set of constraints. As an example, these target weights might be values thought to be proportional to future returns for each asset, in other words, an alpha vector.

objective: minimize: ∥x − x^∗∥_2, x^∗ = a set of target portfolio weights

Tracking an Index

What if you want to minimize portfolio variance, but have the portfolio track an index at the same time? In this case, you would want terms in your objective function representing both portfolio variance and the relationship between your portfolio weights and the index weights, q. There are a few ways to set this up, but one intuitive way is to simply minimize the difference between your portfolio weights and the weights on the assets in the index, and minimize portfolio variance at the same time. The tradeoff between these goals would be determined by a parameter, λ.

objective: minimize: x^TPx + λ∥x − q∥_2, q = a set of index weights, λ = a tradeoff parameter

Backtracking algorithm for asset allocation with constraints

Besides the solution approach described below (with cvxpy), another approach is the backtracking algorithm, which is basically an "exhaustive search" in the space of all possible weights for each asset. To be finite, a step size for the weights search has to be defined (e.g. 1%). Additionally, the search tree should be "pruned" by applying boundary conditions to the weights, like sum of weights equal to one or maximum/minimum allowed weights per asset.

Python package for optimization problems: cvxpy

cvxpy is a Python package for solving convex optimization problems. It allows you to express the problem in a human-readable way, calls a solver, and unpacks the results.

How to use cvxpy

Import: First, you need to import the package: import cvxpy as cvx

Steps: Optimization problems involve finding the values of a variable that minimize an objective function under a set of constraints on the range of possible values the variable can take. So we need to use cvxpy to declare the variable, objective function and constraints, and then solve the problem.

Optimization variable: Use cvx.Variable() to declare an optimization variable. For portfolio optimization, this will be x, the vector of weights on the assets. Use the argument to declare the size of the variable; e.g. x = cvx.Variable(2) declares that x is a vector of length 2. In general, variables can be scalars, vectors, or matrices.

Objective function: Use cvx.Minimize() to declare the objective function. For example, if the objective function is (x − y)^2, you would declare it to be: objective = cvx.Minimize((x - y)**2).

Constraints: You must specify the problem constraints with a list of expressions. For example, if the constraints are x + y = 1 and x − y ≥ 1 you would create the list: constraints = [x + y == 1, x - y >= 1]. Equality and inequality constraints are elementwise, whether they involve scalars, vectors, or matrices. For example, together the constraints 0 <= x and x <= 1 mean that every entry of x is between 0 and 1. You cannot construct inequalities with < and >. Strict inequalities don’t make sense in a real world setting. Also, you cannot chain constraints together, e.g., 0 <= x <= 1 or x == y == 2.

Quadratic form: Use cvx.quad_form() to create a quadratic form. For example, if you want to minimize portfolio variance, and you have a covariance matrix P, the quantity cvx.quad_form(x, P) represents the quadratic form x^TPx, the portfolio variance.

Norm: Use cvx.norm() to create a norm term. For example, to minimize the distance between x and another vector, b, i.e. ∥x − b∥_2, create a term in the objective function cvx.norm(x-b, 2). The second argument specifies the type of norm; for an L2-norm, use the argument 2.

Constants: Constants are the quantities in objective or constraint expressions that are not Variables. You can use your numeric library of choice to construct matrix and vector constants. For instance, if x is a cvxpy Variable in the expression A*x + b, A and b could be Numpy ndarrays, Numpy matrices, or SciPy sparse matrices. A and b could even be different types.

Optimization problem: The core step in using cvxpy to solve an optimization problem is to specify the problem. Remember that an optimization problem involves minimizing an objective function, under some constraints, so to specify the problem, you need both of these. Use cvx.Problem() to declare the optimization problem. For example, problem = cvx.Problem(objective, constraints), where objective and constraints are quantities you've defined earlier. Problems are immutable. This means that you cannot modify a problem’s objective or constraints after you have created it. If you find yourself wanting to add a constraint to an existing problem, you should instead create a new problem.

Solve: Use problem.solve() to run the optimization solver.

Status: Use problem.status to access the status of the problem and check whether it has been determined to be infeasible or unbounded.

Results: Use problem.value to access the optimal value of the objective function. Use e.g. x.value to access the optimal value of the optimization variable.

Examples: m3l4_cvxpy_basic.ipynb, m3l4_cvxpy_advanced.ipynb
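
Putting the pieces above together, a minimal cvxpy sketch of an index-tracking optimization; the covariance matrix, index weights, and λ are made-up toy values, not taken from the course notebooks:

```python
import cvxpy as cvx
import numpy as np

# Toy inputs: 3-asset covariance matrix, index weights, tradeoff parameter.
P = np.array([[0.10, 0.02, 0.01],
              [0.02, 0.08, 0.03],
              [0.01, 0.03, 0.09]])
index_weights = np.array([0.5, 0.3, 0.2])
lam = 0.5

x = cvx.Variable(3)
objective = cvx.Minimize(cvx.quad_form(x, P) + lam * cvx.norm(x - index_weights, 2))
constraints = [cvx.sum(x) == 1, x >= 0]   # fully invested, long only

problem = cvx.Problem(objective, constraints)
problem.solve()

print(problem.status)   # e.g. 'optimal'
print(x.value)          # optimized portfolio weights
```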

Rebalancing strategies

  • Rebalancing serves to bring back asset weights to the "optimal" values after assets have appreciated at different rates over some time. The optimization problem must be re-run with updated parameters.

  • Rebalancing costs: Transaction costs, capital gains tax, time & labor. To estimate rebalancing costs, one can assume that they are proportional to the portfolio turnover, i.e. the sum of the absolute differences of all asset weights before and after the rebalancing (see the sketch after this list).

  • Rebalancing triggers: Cash flows (dividends, capital gains, new contributions. Contributions / withdrawals can be used to rebalance), changes in model parameters (e.g. due to mergers and acquisitions. To be decided from which size of deviation rebalancing should occur)

  • Rebalancing strategies: At fixed temporal intervals; when deviations from the desired weights reach a certain threshold
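
A minimal sketch of the turnover estimate mentioned in the rebalancing-costs bullet above; the weight vectors are made up:

```python
import numpy as np

def portfolio_turnover(weights_before: np.ndarray, weights_after: np.ndarray) -> float:
    # Sum of absolute weight changes over all assets for one rebalancing.
    return np.abs(weights_after - weights_before).sum()

turnover = portfolio_turnover(np.array([0.40, 0.35, 0.25]),
                              np.array([0.45, 0.30, 0.25]))   # 0.10
```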

Problems of portfolio optimization

  • Past returns may not be a good predictor of future returns
  • Variance of stocks may not be a good measure of risk (e.g. if the distribution of returns is asymmetric / non-gaussian)
  • For estimating the covariance of a large number of stocks reliably, many datapoints are required (e.g. for 50 stocks, about 5 years of daily data should be available)
  • Estimates (of returns and variances) are noisy. "Robust optimization" takes into account confidence intervals. Instead of actual return values, only the relative ranks could be considered.
  • Single-period optimization: The optimization described above has one fixed timeframe in mind. If different time horizons were considered, different optimal actions (e.g. long/short a stock) could result in different time periods. This is considered by multi-period optimization.
  • Transaction costs could be considered as constraints in the optimization problem.

Smart Beta and Portfolio Optimization

In this project, you will build a smart beta portfolio and compare it to a benchmark index. To find out how well the smart beta portfolio did, you’ll calculate the tracking error against the index. You’ll then build a portfolio by using quadratic programming to optimize the weights. Your code will rebalance this portfolio and calculate turnover to evaluate the performance. You’ll use this metric to find the optimal rebalancing frequency. For the dataset, we'll be using end-of-day data from Quotemedia.

Solution: project_3_smart_beta_and_portfolio_optimization.ipynb

4. Factors

  • Factors contain potentially predictive information for the future movement of stocks, e.g. momentum, fundamental information, signals from social media.

  • Factors must be transformed into weights / signals for individual stocks

  • Alpha Factors: Drivers of mean returns

  • Risk Factors: Drivers of volatility

  • To make different factors better comparable, it is standard practice to de-mean and rescale factor weights.

  • Note: "Weight" in the following means the portfolio share of each asset. "Factor value" is a numerical representation for the strength of the considered factor for an asset. E.g. for momentum, the factor value could be the 12 month return.

  • De-mean: Sum of (standardized) weights equals zero; achieved by subtracting the mean of raw factor values from each raw factor value.

  • Rescale: Sum of absolute (standardized) weights equals one; achieved by dividing each de-meaned value by the sum of the absolute de-meaned values (see the sketch after this list).

  • The sum of the demeaned weights is always zero, the sum of the rescaled weights is always zero, the sum of the absolute value of the rescaled weights is always one, the sum of the rescaled short positions is always -0.5

  • Reason for de-meaning: Making the portfolio dollar neutral so that its development is (mostly) independent of the overall market trend ("isolate the effect of the factors"). Long and short positions are balanced (even though the "raw" weights could describe a long-only portfolio).

  • Notional: The dollar value associated with a portfolio.

  • Leverage: The leverage ratio is the sum of the (absolute) magnitudes of all positions in dollars, divided by the notional. By rescaling the portfolio weights (i.e. dividing by the sum of absolute weights), we ensure a portfolio leverage of one.
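
A minimal sketch of the de-mean and rescale steps described above; the raw factor values are made up:

```python
import numpy as np

def standardize_alpha(raw_factor_values: np.ndarray) -> np.ndarray:
    # De-mean, then rescale so that the absolute weights sum to one (leverage = 1).
    demeaned = raw_factor_values - raw_factor_values.mean()
    return demeaned / np.abs(demeaned).sum()

weights = standardize_alpha(np.array([0.6, 0.2, 0.1, 0.1]))
print(weights.sum())           # ~0 (dollar neutral)
print(np.abs(weights).sum())   # 1 (leverage of one)
```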

ZIPLINE

Zipline is a Pythonic event-driven system for backtesting, developed and used as the backtesting and live-trading engine by the crowd-sourced investment fund Quantopian. Since Quantopian closed in late 2020, the domain that had hosted these docs expired. The library is used extensively in the book Machine Learning for Algorithmic Trading by Stefan Jansen, who is trying to keep the library up to date and available to his readers and the wider Python algotrading community.

Code examples: Zipline_Pipeline_Primer.ipynb, Zipline_Pipeline_Exercise.ipynb

Factor Models

  • A Factor Model tries to identify a number of latent variables that can model / explain the common variability of asset returns of certain groups of assets.

  • Linear factor model: Return of asset i: r_i = b_i1 * f_1 + b_i2 * f_2 + ... + b_iK * f_K + s_i, where the f_n are the numerical values of the factors at a given time and s_i is an "error term" that describes the portion of the return that cannot be attributed to the factors.

  • terminology:

    • Factor returns (the f_k) may be: macro-economic variables, returns on pre-specified portfolios, returns on zero-investment strategies (long and short positions of equal value) giving maximum exposure to fundamental or macro-economic factors, returns on benchmark portfolios representing asset classes, or something else.
    • The b_{ij} coefficients may be called: factor exposures, factor sensitivities, factor loadings, factor betas, asset exposures, style or something else.
    • The s_i term may be called: idiosyncratic return, security-specific return, non-factor return, residual return, selection return or something else.
  • Factor model assumptions:

    • Factor returns are uncorrelated to the residuals, corr(f_n, s_i) = 0 for every n and i. (This can always be achieved by selecting suitable b_in, e.g. through multi-linear regression)
    • Residuals of different stocks are uncorrelated, corr(s_i, s_j) = 0 for every i != j. This means that all correlation between two assets is assumed to be captured in the factors. s_i captures the idiosyncratic risk of a specific asset. It requires that the factor model includes a sufficient number of important factors to capture the correlation. However, too many factors capture too much "noise", i.e. overfit.
  • Covariance matrix using the factor model: When using the two uncorrelation assumptions described and when transforming the returns to have zero means (by subtracting any non-zero mean), it can be shown that the asset return covariance E(rr^T) can be expressed as

    • E(rr^T) = BFB^T + S where
    • E() denotes the expectation value,
    • r is the (N dimensional) vector of asset returns,
    • B is the NxK matrix of factor exposures for each asset (K = number of factors),
    • F = E(ff^T) the KxK matrix of factor covariances,
    • S = E(ss^T) the NxN diagonal matrix of the variances of the residuals (as the off-diagonal covariances are zero).
  • For a factor model, we have Return r = Bf + s and Risk E(rr^T) = BFB^T + S

  • Portfolio factor exposure for a portfolio with weights x: B^Tx

  • Risk and Alpha Factors:

    • Assumption: we can subdivide factors into "Risk factors" that are predictive of the variance of the portfolio and "Alpha factors" that are descriptive of the returns. In portfolio optimization, we want to put constraints on the variance but not on the returns. It is therefore possible (in terms of portfolio optimization) to disregard any alpha factors and consider F as the covariance matrix of all factors that have a considerable impact on variance, while S is the (co)variance that cannot be explained by these factors, including the variance related to the (dropped) alpha factors.
  • In this model, when using E(rr^T) = BFB^T + S to constrain portfolio risk, the terms B and F don't include any information about alpha, and S contains no explicit alpha information either. "They are usually bought by practitioners from commercial providers."

  • The alpha factors are identified separately and enter into the objective function for the portfolio optimization.

Risk factors vs. alpha factors

"Real" factors will be not purely risk or alpha factors. When "ranking" factors, there will be a gradual variation of these characteristics.

In general, risk factors are significant contributors to the variance of asset returns, and less predictive of the mean of returns. Risk factors are identified to control risk. One way to control an asset's exposure to a risk factor is to hold an equal amount long as short. For instance, a dollar neutral portfolio with equal amounts long and short is controlling for risks that the overall market may move up or down.

In general, factors that are significant in describing the mean of asset returns can be candidates for alpha factors. Alpha factors are used to give some indication of whether each stock in the portfolio may have positive expected returns or negative expected returns. For example, a former alpha factor was the market capitalization of a stock. Small cap stocks tend to have higher future returns compared to large cap stocks.

Usually, we'd choose 20 to 60 risk factors that describe overall stock variance as much as possible. So risk factors as a whole account for more of the overall movement of stocks. On the other hand, alpha factors contribute to smaller movements of stocks, which is okay: we seek to identify alpha factors because they give some indication of the direction of expected returns, even if their effect is small compared to risk factors. An important reason to identify risk factors and then neutralize a portfolio's exposure to them is that, if we didn't, the asset movements due to risk factors would overwhelm the movements that are due to the alpha factors.

Risk factors are well-known by the investment community, so investors will track those factors when optimizing their portfolios. This also means that it's unlikely that any one investor can gain a competitive advantage (higher than normal returns) using risk factors (Investors "trade away" the competitive advantage; this has for example happened to the small cap factor after its publication in 1981).

Alpha factors are less well-known by the investment community, because they're generated by in-house research (i.e. they are proprietary) to help the fund generate higher than normal returns. So alpha factors are said to be drivers of the mean of returns because they're used to help push a portfolio's overall returns higher than what would be expected from a passive buy and hold strategy.

When a former alpha factor becomes more and more known, it turns into a risk factor because more and more people are following its signal, thus amplifying price movements in either direction.

Different investors may treat the same factor as either an alpha or a risk factor. Part 2: Use AI to select factors.

Factor models and types of factors

  • Alpha factors can be broadly categorized into momentum and reversal factors. Example of a momentum factor: numerical value of 12 month return. Example of a reversal factor: negative of the weekly return (assuming that higher weekly returns will lead to profit taking).
  • Price-Volume factors: Refers to all factors that are constructed from asset price and trading volume information. It can be measured at different frequencies; adjusted or unadjusted prices; bid-ask; OHLC; any combination thereof; statistical properties like (popular) the first four moments: mean, variance (squared volatility), skewness (asymmetry of the distribution), kurtosis (fat-tailedness); min and max over different time windows. This information is broadly available for all stocks (as opposed to e.g. analyst estimates as factors) and frequently updated (as opposed to e.g. most fundamental data as factors). It may lead to higher trading / rebalancing frequency, requiring a careful cost - benefit analysis.
  • Volume factors: "Net buy" or "net sell", volume as indicator for price significance or significance of other factors like momentum (momentum "conditioned" on volume), short interest (high short interest may propel momentum if prices rise - short squeeze)
  • Fundamental factors: Usually updated quarterly. Lower trading frequency may allow for larger position sizes. Example: Market cap ("size effect").
  • Fundamental ratios: earnings p. share/price (can become negative!), book/price; less influenced by accounting decisions: cash flow or EBITDA. Cash flow is more volatile than earnings because it immediately reflects the effect of large investments.
  • Event driven factors: e.g. natural disasters, major accidents, government changes, M&A / spin-offs, new inventions/discoveries (one-off events). Some regularity: interest rate changes, index changes or major weight changes; new earnings releases or earnings guidance; product announcements.
  • Pre and post event: E.g. for earnings announcements vs analyst expectations. "Post earnings announcement drift": prices tend to drift in the same direction for about two months after an earnings announcement. Well known so less exploitable.
  • Analyst ratings: Sentiment factor. Herd behavior of analysts. "Sell side" recommendations (banks) for "buy side" (fund managers). Factor could be e.g. a weighted average of analyst ratings (weighted by some reputation measure) or upgrades - downgrades.
  • Alternative data: Social media, consumer behavior, satellite images
  • Sentiment analysis on news and social media: Following financial news is important for quant analysts. Sentiment analysis using NLP can transform news and social media messages into "buy"/"hold"/"sell" categories. High public attention may indicate the end of a substantial movement (overextension vs. fundamentals).
  • NLP can be used to enhance fundamental research: Automated pre-processing / labeling of quarterly reports, financial statements, regulatory filings can filter data for subsequent fundamental analysis. Sentiment analysis of 10-K filings (SEC; 1 per year. Plus three 10-Q = quarterly). E.g. amount of legal risks, perceived uncertainties from competitors, customers, suppliers... can be transformed into "business outlook" factors. Sector classification using 10-K: The article describes (among others) the use of NLP to assign S&P500 stocks to industry sectors. Sector assignment is indicative of correlation and can also be used in factor models. It reviews critically the standard GICS sector classification of the S&P500, which is in parts not very indicative for correlation among stocks.
  • Other alternative data: Tracking usage of supermarket car parks or fill level of floating-roof oil tanks from satellite images; self-representation of companies in social media / employment offers; marketing activity in social media (counter indicator);... Examples for sources for alternative data: https://www.buildfax.com/, https://www.thinknum.com/, https://orbitalinsight.com/.

Risk Factor Models

Purpose: Model, control, if possible neutralize the portfolio's risk.

  • Portfolio return in a factor model: r_P = sum_i(β_(P,i) * f_i) + s_P, with β_(P,i) = sum_j(x_j * β_(j,i)) and s_P = sum_j(x_j * s_j), where β_(P,i) is the portfolio factor exposure to factor f_i and β_(j,i) the exposure of portfolio component j to factor f_i. s_P is the portfolio specific return, s_j the specific return of component j and x_j the weight of component j in the portfolio.
  • Factor Model of portfolio variance: var(r_P) = X^T(BFB^T + S)X; X: vector of portfolio weights; B: matrix of portfolio constituent factor exposures; F: covariance matrix of factors; S: matrix of specific (idiosyncratic) (co)variances - ideally a diagonal matrix.

Example: factor_model_portfolio_return.ipynb.
Define two (dummy) "factors" by calculating the daily mean and median returns over a large number of assets. Build a portfolio by selecting two of these assets with specific weights. Calculate the factor exposure to both factors of the two assets by linear regression of factor return vs. asset return. Calculate portfolio factor exposure and portfolio return. Calculate the common return of the portfolio = part of portfolio return explained by the factors and the specific return = the remainder which is not explained by the factors.

  • Variance of a single stock in a 2 factor model: Var(r_i) = Var(β_(i,1)f_1) + Var(β_(i,2)f_2) + 2 Cov(β_(i,1)f_1, β_(i,2)f_2) + Var(s_i) = β_(i,1)^2 * Var(f_1) + β_(i,2)^2 * Var(f_2) + 2 β_(i,1)*β_(i,2) * Cov(f_1,f_2) + Var(s_i). The first three terms are called the systematic variance (determined by the factors); the fourth is the specific / idiosyncratic variance (not covered by the factors). f_j are the factor returns, β_(i,j) the factor exposure of stock i to factor j and s_i the specific (factor independent) return of stock i. Important to note that the factor exposures can be taken out of the factor variance and covariance calculation on the rhs!
  • Covariance of two stocks in a 2 factor model: Cov(r_i,r_j) = Cov(β_(i,1)f_1 + β_(i,2)f_2 + s_i, β_(j,1)f_1 + β_(j,2)f_2 + s_j). When multiplying out the 2 times 3 terms, we end up with 9 terms. However, by definition, the s_n are uncorrelated with the f_m and, by assumption, uncorrelated with each other. Therefore, the 5 of the 9 terms that contain an s term are equal to 0. The remaining 4 are: Cov(r_i,r_j) = β_(i,1)β_(j,1)Var(f_1) + β_(i,2)β_(j,2)Var(f_2) + β_(i,1)β_(j,2)Cov(f_1,f_2) + β_(i,2)β_(j,1)Cov(f_2,f_1).
  • Covariance Matrix of assets in terms of factors: By using the above formulas, the covariance matrix of the two stocks, ((Var(r_i), Cov(r_i,r_j)), ((Cov(r_j,r_i), Var(r_j))), can be expressed entirely in terms of the risk factor variances and covariances, multiplied by the constant factor exposures. For a large number of assets, this provides a considerable simplification of the calculation!

Example: covariance_matrix_assets.ipynb.
Calculate the covariance matrix of two assets by expressing the terms in a two factor model, weighted with the factor exposures. First by calculating the matrix elements one by one using the formulas above, then by doing numpy matrix calculation using the formula A = BFB^T + S, where A is the covariance matrix of asset returns, B is the matrix of factor exposures, F is the covariance matrix of the factors, S is the diagonal matrix of specific variances.

  • Portfolio variance of two assets in terms of risk factors: Var(r_P) = x_i^2 * Var(r_i) + x_j^2 * Var(r_j) + 2 x_i x_j * Cov(r_i,r_j). The formulas above can be used to express the terms by factors and factor exposures. The overall formula can be written in matrix / vector notation.
  • Portfolio variance of n assets in terms of risk factors (matrix formula): Considering the portfolio weight vector X (the i-th element is the weight of asset i in the portfolio), we get the matrix formula Var(r_P) = X^T(BFB^T + S)X, where (for 2 assets) X = (x_i, x_j)^T, B = ((β_(i,1), β_(i,2)), (β_(j,1), β_(j,2))), F = ((Var(f_1), Cov(f_1,f_2)), (Cov(f_2,f_1), Var(f_2))), S = ((Var(s_i), 0), (0, Var(s_j)))
    • The term BFB^T + S is the covariance matrix of assets.

Example: portfolio_variance.ipynb.
The portfolio variance of a two-asset portfolio is calculated using the explicit terms described above as well as the matrix formula.
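
A minimal numpy sketch of the matrix formula Var(r_P) = X^T(BFB^T + S)X with made-up two-asset, two-factor inputs (not the notebook's data):

```python
import numpy as np

X = np.array([0.6, 0.4])            # portfolio weights
B = np.array([[0.9, 0.3],           # factor exposures (assets x factors)
              [1.1, -0.2]])
F = np.array([[0.0200, 0.0015],     # factor covariance matrix
              [0.0015, 0.0100]])
S = np.diag([0.005, 0.008])         # specific variances (diagonal matrix)

asset_covariance = B @ F @ B.T + S
portfolio_variance = X @ asset_covariance @ X
print(portfolio_variance)
```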

Types of Risk Models

  • Time-Series Risk Models: E.g. CAPM (single factor = market excess return over risk-free rate), Fama-French 3 factor model.
  • Cross-Sectional Risk Models
  • PCA Risk Models

Time Series Risk Model

Example CAPM as one-factor model:

  • For stock i in our stock universe (market), e.g. S&P500, r_i - r_f = β_i * f_m + c_i; r_i is the stock return, r_f the risk free rate; f_m = r_m - r_f is the single factor of this model, namely the market excess return. All these are time series. We can estimate the market exposure β_i of the CAPM for stock i, which equals the factor exposure in this model, as well as the intercept c_i by doing a linear regression of r_i - r_f vs. f_m
    • Note 1: This is not a uniquely defined methodology; there are other estimation methods which may yield different results for β_i and c_i.
    • Note 2: The sample (length of the time window) to be used for the regression must be selected. One approach would be to make it dependent on the envisaged trading frequency, e.g. several weeks for daily trading up to years for longer holding periods.
  • For a factor model, we also need the residual s_i, which is the part of the return of stock i not explained by the model. The time series for s_i is given by s_i = (r_i - r_f) - (β_i * f_m + c_i). The variances of these residual series provide the (diagonal) matrix of specific variances.
  • By calculating the variances and covariances of the of the different terms over a selected time window (see Note 2 above regarding the choice of time window), we can calculate the portfolio variance with the matrix formula Var(r_P) = X^T(BFB^T + S)X

Multi-factor model

  • "Classical" Fama-French factors: Market, size (SMB = small minus big) and value (HML = high - low). I.e. besides the average "market" return, research suggests that smaller businesses (by market cap) perform on average better than larger businesses and stocks with a low market value compared to their fundamentals perform on average better than those with a high value compared to fundamentals. This research is taken as hypothethis for the formulation of corresponding factors.
  • To find out the effect of a hypothetical factor, it's common to use a dollar neutral theoretical portfolio, i.e. go long on assets that have more of a particular trait and short on assets that have less of the trait. Assume an investment of equal dollar amount in each asset.
  • E.g. "size factor" time series f_s = (r_small - r_big)/2; r_small = sum_N_small(r_i)/N_small; r_big = sum_N_big(r_i)/N_big; "small" if the first decile of stocks in our universe sorted by market cap; "big" is the last decile.
  • E.g. "value factor: Sort stocks by book value/market value ratio; those with high book/market are called Growth(Low); those with low book/market are called Value(High). This time the original FF model goes short all stocks in the first three deciles and long the stocks in the last three deciles. A corresponding formula as for the size factor provides the time series for the value factor f_v.
  • Complete Fama-French approach: first sort our market universe (return r_m) by size and create the "small" (s) and "big" (b) portfolios as described above. The middle section of 8 deciles is discarded. Then subdivide each of the two portfolios into three parts according to "value" as described for the value factor, labeled "growth" (g), "neutral" (n) and "value" (v). This provides 6 sub-portfolios with return time series r_sg, r_sn etc. Then the definition of the factors is: Market f_m = r_m - r_f
    SMB f_s = ((r_sg + r_sn + r_sv) - (r_bg + r_bn + r_bv)) / 3; HML f_v = ((r_sv + r_bv) - (r_sg + r_bg)) / 2.
    • Note: The two "neutral" portfolios are only taken into account for the SMB factor!
  • With these three factor time series, the covariance matrix of the factors F can be calculated.
  • The factor exposures β_(i,x) (x in m,s,v) of a stock i can be calculated by doing a multilinear regression of r_i vs. β_(i,m) * f_m + β_(i,s) * f_s + β_(i,v) * f_v. This provides the matrix B of factor exposures.
  • The specific return time series can be calculated as s_i = r_i - (β_(i,m) * f_m + β_(i,s) * f_s + β_(i,v) * f_v). The variances of these time series provide the specific variance matrix S.
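
A minimal sketch of estimating one row of B and one diagonal entry of S by multilinear regression, using simulated factor and stock return series (the "true" exposures are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

T = 250
rng = np.random.default_rng(0)
factor_returns = rng.normal(0, 0.01, size=(T, 3))   # columns: f_m, f_s, f_v
stock_returns = factor_returns @ np.array([1.0, 0.4, -0.2]) + rng.normal(0, 0.005, T)

reg = LinearRegression().fit(factor_returns, stock_returns)
betas = reg.coef_                                                 # one row of B
specific_returns = stock_returns - reg.predict(factor_returns)   # s_i time series
specific_variance = specific_returns.var(ddof=1)                  # one diagonal entry of S
```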

Cross-Sectional risk model

  • "Most practitioners tend to buy commercial risk models that are built and maintained by other companies (e.g. MSCI Barra, Axioma, North Field). It is the task of these models to ensure that the factors and specific returns of stocks are independent of each other. The practitioners then focus on identifying alpha factors" (Video 10 and 16 statements)
  • These are often so called cross-sectional risk models.
  • Cross-sectional risk models may use categorical variables like country or sector of a company as factors.
  • The factor exposure of an asset can be expressed by one-hot encoding, i.e. β_(i,country_j) = (1 if country_j is home country; 0 otherwise), or a number between 0 and 1 expressing the percentage of revenues in that country.
  • Important: The factor exposure is determined by the category, not by a regression of the time series as above.
  • In contrast, we need regression now to find the time series of factors f_country_j: For a fixed time step t_n, we plot r_i(t_n) for all assets i vs. β_(i,country_j). f_country_j(t_n) is the slope of the linear regression line in this plot.
    • Note: Does this work with one-hot encoding, when β_(i,country_j) is either 0 or 1 with no intermediate values? (Yes: with a 0/1 dummy exposure, the regression coefficient f_country_j(t_n) is essentially the average return at t_n of the assets assigned to that country.)
  • The time series of specific returns can again be calculated as the residuals of the linear regressions, i.e. s_i(t_n) = r_i(t_n) - r_estim(t_n) with r_estim(t_n) = sum_countries(β_(i,country_j) * f_country_j(t_n)) + c_i(t_n)
  • By repeating for each time step, we get the time series and can again calculate the matrices B, F and S.
  • This approach can also be applied to the fundamental (Fama-French) factors! Above, we constructed theoretical long-short model portfolios to obtain the time series of factors and determined the factor exposures by linear regression. Here, we define the factor exposures first, e.g. β_value = book/market or β_size = market_cap. Second, we calculate f_size and f_value by doing a multilinear regression of all returns of assets in our portfolio at a fixed time step vs. β_size and β_value. In practice, a smaller estimation universe of assets would be used for the linear regression.
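
A minimal sketch of the cross-sectional regression for a single time step t_n, with made-up exposures and returns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One row per asset, one column per factor exposure (e.g. beta_size, beta_value).
exposures_t = np.array([[ 1.2, -0.3],
                        [ 0.4,  0.8],
                        [-0.5,  0.1],
                        [ 0.9, -0.6]])
returns_t = np.array([0.012, 0.004, -0.007, 0.009])   # asset returns at t_n

reg = LinearRegression().fit(exposures_t, returns_t)
factor_returns_t = reg.coef_                               # f_size(t_n), f_value(t_n)
specific_returns_t = returns_t - reg.predict(exposures_t)  # s_i(t_n)
```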

Statistical Risk Model (PCA)

  • Short recap of vector spaces (for an extensive treatment see the YouTube series "Essence of linear algebra"):
    • Basis: minimum set of mutually linear independent vectors that span the vector space
    • Transformation between bases: A matrix where each column expresses a vector of the target basis in terms of the original basis, i.e. each row is the "share" of a vector of the original basis. For transforming back from the new to the old basis, we get the inverse of that matrix.
  • PCA in words: Finding a basis for a vector space defined by a set of points that fulfills specific criteria:
    • The first axis is selected to maximise the variance of datapoints along that axis, i.e. the projection of the points on that axis is maximally spread out (the sum of squares of the distances along the axis from an origin is maximised).
    • At the same time, this minimises the reproduction error, i.e. the sum of the squared (perpendicular) distances of the points to the axis is minimised. The reason is that, for each point, the squared distance from the origin to its projection on the axis (related to the variance, see above) plus its squared perpendicular distance from the axis equals its squared distance from the origin, which is a constant.
    • The second axis must be perpendicular to the first, and again maximises the variance among all perpendicular axes.
    • The third axis is perpendicular to the first two axes and so on, until the whole vector space is spanned
    • So the set of axes is sorted by "size of variance" and by selecting the first few axes, the majority of the variance can be covered.
    • For a set of N assets, in theory we can have N linear independent axes, unless two assets move in exact unison.
    • PCA_Core.ipynb : Visual example in two dimensions with random data (uses Bokeh for interactive visualization).
  • PCA mathematical:
    1. Mean-centering / mean-normalizing the data; subtract the means along each dimension so that the data are centered around the origin.
    2. Determine the vector w of the first PCA component (maximising the variance) for a set of points {x_i} in an N-dimensional space: We want to maximise the sum of squared distances from the origin along w, which is sum_i((x_i·w)^2)/||w||^2 (x_i·w = |x_i||w|cosΘ is the dot product). If X is the matrix with the x_i as row vectors, we can write this as (Xw)^T(Xw)/(w^Tw) (a Rayleigh quotient). To interpret it as a variance, we have to multiply by 1/(n-1), with n the number of data points. So w = argmax_w((Xw)^T(Xw)/(w^Tw)).
    3. For the next PCA component, we apply the same procedure to the points with their component along w removed, i.e. x_i - (x_i·ŵ)ŵ with ŵ = w/||w||, which lie in the (N-1)-dimensional subspace perpendicular to w.
    4. etc.
  • In terms of asset returns, the principal components are some linear combination of partial returns of different assets, which may or may not be interpretable.
  • The sum of squares of distances to the origin of all data points is proportional to the variance for the mean-centered data. For each data point, the squared distance to the origin is the sum of the squared distances ("variances") along each perpendicular coordinate axis. Thus, the sequence of PCA axes represents the sequence of coordinates capturing the largest, second largest, ... portion of the variance.
  • Usually, we want to find a lower-dimensional representation of the data that captures as much as possible of the variance. We can achieve this by keeping only the first few PCA components. The variation of the data points perpendicular to the selected axes (i.e. along the non-selected PCA components) is the variance that remains unaccounted for.

Example: PCA_Toy_Problem.ipynb.
Two-dimensional, correlated data set for demonstrating the PCA class of scikit-learn.

Example: pca_basics_solution.ipynb. 490 dim. US stock returns data from 2011 to 2016 obtained with zipline. Calculate number of PCA dimensions needed to cover a certain amount of variance. E.g. for 50%, 9 components are required. For 90%, 179 components are required.

**PCA as a factor model:**

  • Calculate PCA components and decide on number of components to be kept = number of factors.
  • Factor model: r = Bf + s; here the factor exposures B are given by the PCA components (written in the original basis, where each stock's return series corresponds to one basis vector), arranged as column vectors. If we keep all components, r = Bf holds exactly. The factor returns f are the "return time series" of portfolios that each exactly represent a single component. Usually such a factor cannot be interpreted as a 'real-world' property!
  • Keeping track of dimensions: For N stocks, T timesteps (return values per stock), and K components, we have: dim(r) = NxT, dim(B) = NxK, dim(f) = KxT, dim(s) = NxT (rows x columns).
  • f = B^T r (because the PCA components are orthonormal, B^TB = 1 if we keep all components)
  • cov(f) = 1/(T-1) f f^T (with f as the KxT matrix defined above) is a diagonal matrix with the factor variances as diagonal components (i.e. the covariance of different factors is zero because the PCA components are orthogonal!)
  • If we use daily time series, we need to multiply by 252 to get the annual variance (Standard deviation is square root of variance!)
  • Idiosyncratic returns (not explained by the model): s = r - Bf.
  • cov(s) = 1/(T-1) s s^T; the "specific risk" matrix (NxN) S is obtained by neglecting the off-diagonal elements of cov(s) (which should be small if the selected PCA components cover most of the original variance). It describes for each stock how much of its variance is not covered by the selected components.
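
A minimal scikit-learn sketch of such a PCA risk model on simulated returns; note that the arrays here use a (time x stocks) layout, i.e. the transpose of the NxT convention above:

```python
import numpy as np
from sklearn.decomposition import PCA

T, N, K = 252, 10, 3                              # time steps, stocks, kept components
rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=(T, N))
returns = returns - returns.mean(axis=0)          # mean-center each stock's returns

pca = PCA(n_components=K)
factor_returns = pca.fit_transform(returns)       # f, shape (T, K)
B = pca.components_.T                             # factor exposures, shape (N, K)

ann_factor = 252                                  # annualize daily variances
factor_cov = np.diag(factor_returns.var(axis=0, ddof=1) * ann_factor)   # F (diagonal)
residuals = returns - factor_returns @ B.T                              # s = r - Bf
S = np.diag(residuals.var(axis=0, ddof=1) * ann_factor)                 # specific risk
```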

Example:

Alpha Factors

Alpha factors are parameters expected to be predictive of future mean returns. They build on deviations from the efficient market hypothesis and take advantage of mispricing.
Sources: e.g. academic research papers

Processing techniques to turn alpha vectors into signals (explained in more detail below): sector neutralization, ranking, Z-scoring, smoothing, conditioning

Evaluation Metrics for alpha factors: Factor returns, Sharpe ratio, information coefficient, information ratio, quantile analysis, turnover analysis

  • Alpha Factors vs. Risk Factors: Risk factors model the movement of stock prices due to commonly known effects, providing systematic returns and systematic volatility. They may significantly impact portfolio volatility without providing adequate returns. In the world of alpha factors, the portfolio exposure to risk factors is to be neutralized, so that the remaining portfolio returns and volatility are due to alpha factors.
  • Alpha Model: An algorithm that transforms data into numbers associated with each stock.
  • Alpha value: refers to a single value for a single stock, for a single time period. Positive for long, negative for short.
  • Alpha vector: has a number (alpha value) for each stock, and the number is proportional to the amount of money we wish to allocate for each stock. The mean over all components equals zero and the sum of absolute values equals one.
  • Alpha factor: a time series of alpha vectors (over multiple time periods).
  • Raw alpha factor: a version of an alpha factor before additional processing to improve the signal.
  • Stock universe: set of stocks under consideration for the portfolio.

Methods for coming up with alpha factors

  • Reading financial news, studying the markets for curious behavior, studying the methods of famous discretionary or quant investors, talking to industry practitioners, academic papers.
  • Academic papers are publicly available, so any information they contain is diffused and doesn't lead to strong signals. However, they can aid in the generation of one's own ideas and serve as a basis for comparison.
  • Free access is usually limited to non-peer-reviewed papers from open-access repositories such as SSRN. Advantage: these papers are more recent and the ideas are still "fresh". Academic papers must be assessed for practical applicability, e.g. do I have access to the data used, is the model still feasible after transaction costs, does it excessively rely on returns of illiquid micro-cap stocks, etc.

Controlling for risk within an alpha factor

  • Market risk: With the simplifying assumption that all individual stock beta factors are equal to one, the market risk is already eliminated by creating mean-zero alpha vectors (long and short positions neutralize each other).
  • Sector risk: After market risk elimination, assign each stock in the portfolio to a sector and then subtract the mean over all sector stocks from the alpha value of each stock in that sector. This (mostly) eliminates sector risk; see the sketch below.
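
A possible pandas sketch of this neutralization (tickers, sectors and the helper name are made up for illustration):

```python
import pandas as pd

def sector_neutralize(alpha: pd.Series, sector: pd.Series) -> pd.Series:
    """Subtract the sector mean from each alpha value, then rescale to leverage 1."""
    demeaned = alpha - alpha.groupby(sector).transform("mean")
    return demeaned / demeaned.abs().sum()

alpha = pd.Series({"AAA": 0.8, "BBB": 0.2, "CCC": -0.5, "DDD": -0.1})
sector = pd.Series({"AAA": "tech", "BBB": "tech", "CCC": "energy", "DDD": "energy"})
print(sector_neutralize(alpha, sector))
```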

Processing techniques for alpha factors

Ranking

  • If the alpha vector changes daily, we would have to rebalance the portfolio daily.
  • Outliers may lead to portfolio imbalances. Countermeasure: winsorizing, i.e. all alpha values smaller than the fifth percentile are replaced by the value at the fifth percentile; correspondingly above the 95th percentile. Also possible: setting a maximum allowed portfolio weight for any single stock.
  • Ranking: (General method in statistics) Transform the alpha signal to be less sensitive to noise and outliers. It is done by assigning an ordinal number (rank) to each stock according to its alpha value.
  • Ranking could also be applied to avoid excessive trades: the rank instead of the raw alpha factor could be used so that the portfolio is only changed when ranks change (too simplified???).

zipline rank method for applying ranking to a factor.
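
A pandas stand-in for winsorizing and ranking a raw alpha vector (not zipline's built-in rank method; the numbers are made up):

```python
import pandas as pd

raw_alpha = pd.Series([3.0, -0.2, 0.5, -8.0, 1.1, 0.0, -0.4, 2.2])

# Winsorize: clip everything below the 5th / above the 95th percentile
clipped = raw_alpha.clip(lower=raw_alpha.quantile(0.05),
                         upper=raw_alpha.quantile(0.95))

# Rank: replace values by ordinal numbers 1..N (robust to outliers and noise)
ranked = clipped.rank(method="first")
print(ranked)
```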

Z-scoring

  • Transform the raw alpha factor (numerical value) by subtracting the mean and dividing by the standard deviation. This makes alpha factors more readily comparable.
  • If the distribution resembles a Gaussian, about 95% of the values will always lie between approximately -2 and 2, irrespective of the alpha factor.
  • Advantage compared to ranking: Even when comparing alpha factors between different stock universes (with different numbers of stocks), the value range is the same, while for ranking it is between 1 and the number of stocks.
  • Disadvantage: Doesn't protect against outliers and noise.
  • To combine the advantages, one could first rank and then Z-score when alpha vectors are generated from different stock universes.
  • Z-scoring of alpha factors is often applied in academic papers. zipline zscore method
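
A plain-pandas sketch of the "rank first, then Z-score" combination (a stand-in for zipline's rank/zscore pipeline methods; the helper name is mine):

```python
import pandas as pd

def rank_zscore(alpha: pd.Series) -> pd.Series:
    """Rank for robustness, then standardize to mean 0 and stdev 1."""
    ranks = alpha.rank()
    return (ranks - ranks.mean()) / ranks.std()

print(rank_zscore(pd.Series([0.3, -1.2, 8.0, 0.1, -0.4])))
```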

Smoothing

  • Financial data are noisy and can be sparse (sparse due to missing values or, e.g. for fundamental data, because they are only measured in longer intervals like quarterly or monthly)
  • To obtain smoother data, rolling window averaging or weighted moving averaging (more recent = more weight) can be applied
  • (In case of fundamental data, to get a daily time series, one would rather copy the monthly/quarterly data to all days in the subsequent month/quarter.) zipline Simple Moving Average; zipline Exponential Weighted Moving Average
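
A pandas sketch of both smoothing variants (stand-ins for zipline's SimpleMovingAverage and ExponentialWeightedMovingAverage factors; the 20-day window is arbitrary):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2016-01-01", periods=250, freq="B")
noisy_alpha = pd.Series(np.random.default_rng(1).normal(size=250), index=dates)

sma = noisy_alpha.rolling(window=20).mean()            # equal weights over the window
ewma = noisy_alpha.ewm(span=20, adjust=True).mean()    # more weight on recent values
```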

Evaluation metrics for alpha factors

Quantopian alphalens python code for performance evaluation (same GitHub organization as zipline, pyfolio ("portfolio and risk analysis in Python"), trading_calendars, ...)
Alphalens documentation

"Alphalens is a Python Library for performance analysis of predictive (alpha) stock factors..."

Factor returns

  • Use theoretical returns over the same time window and for the same stock portfolio to compare different alpha factors
  • "Theoretical portfolio returns" means: We adjust our portfolio daily according to the alpha vector calculated up to the previous day. The alpha vector is standardized (mean 0, sum of absolute values equal to 1; negative components = short). Use the alpha vector values as portfolio weights to calculate the single-day return of the portfolio from the individual stock returns. The time series of daily returns is the factor return (based on historical data); see the sketch below.
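
A sketch of this factor-return calculation, assuming hypothetical `alpha` and `returns` DataFrames with dates as rows and tickers as columns:

```python
import pandas as pd

def factor_returns(alpha: pd.DataFrame, returns: pd.DataFrame) -> pd.Series:
    """Daily returns of a theoretical portfolio weighted by yesterday's alpha vector."""
    demeaned = alpha.sub(alpha.mean(axis=1), axis=0)            # mean 0 per day
    weights = demeaned.div(demeaned.abs().sum(axis=1), axis=0)  # sum(|w|) = 1, leverage 1
    return (weights.shift(1) * returns).sum(axis=1)             # trade on the previous day's alpha
```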

Universe construction rule

  • We must ensure that the stock universe used to test our factors at a specific date in the past is constructed only from information available up to that date, to avoid lookahead bias. A specific example of lookahead bias is survivorship bias, which would come into play if we selected the stock universe from currently existing assets. To avoid this, we could use historical constituent data for an index such as the S&P 500 from an index provider.

Factor returns and leverage

  • Typical factor returns (i.e. the achieved "alpha") are in the range of 1% to 4% per year. In order to achieve the returns expected of a professional trading firm / hedge fund, leverage (borrowing money) with a leverage ratio between 2 and 6 is applied.
  • Leverage ratio = (value of positions)/((own) capital)
  • For alpha factor analysis, we always apply a leverage ratio of 1!
  • Return denominator = sum_i(|α_i|), where the α_i are the components of the alpha vector. For a normalized alpha vector, this is equal to 1 (i.e. the "raw" alpha vector is normalized by dividing each component by the return denominator)
  • Alphalens:

Sharpe ratio

  • Key performance indicator to compare alpha factors. "Naked" returns can always be amplified by leveraging, but the Sharpe ratio is independent of leverage.
  • S = mean(f)/stdev(f) * sqrt(252) where f are the daily factor returns (Note: risk-free rate is neglected here!)
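
The same formula as a small helper (assuming `daily_factor_returns` is a pandas Series of daily factor returns):

```python
import numpy as np
import pandas as pd

def sharpe_ratio(daily_factor_returns: pd.Series) -> float:
    """Annualized Sharpe ratio of daily factor returns, risk-free rate neglected."""
    return daily_factor_returns.mean() / daily_factor_returns.std() * np.sqrt(252)
```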

Ranked Information Coefficient (Rank IC):

  • Correlation between the alpha vector (prediction) and actual forward return:
  • rank_IC_(t-1) = r_S(α_(t-1), r_(t, t+1))
  • where r_S is the Spearman Correlation Coefficient, α_(t-1) the alpha vector for time step t-1, and r_(t, t+1) the forward return in the time interval from t to t+1.
  • The Spearman Correlation r_S(X,Y) = Cov(rk(X), rk(Y))/(Stddev(rk(X))*Stddev(rk(Y))) where rk(X) is the vector containing the ranked X components (i.e. ordinal numbers between 1 and the dimension of X).
  • The "standard" correlation is called the Pearson Correlation ρ(X,Y) = Cov(X,Y)/(Stddev(X)*Stddev(Y)).
  • The absolute value of the Pearson Correlation also equals the square root of the R^2 value of a linear regression between X and Y. The R^2 value measures the proportion of the variance in Y that can be explained by the variance in X.
  • We use the Spearman instead of the Pearson correlation because we are only interested in whether the alpha factor predicts the right ordering/direction. If the direction is always right but the magnitude of the return varies differently from the alpha factor, this reduces the Pearson correlation but would not mean that the alpha factor is less suitable.
  • In alphalens: factor_information_coefficient (code link)
  • Example: rank_ic.ipynb; applying the alphalens information coefficient function to the two factors from the pipeline already used above. See also the sketch below.
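
A scipy-based sketch of a single-day rank IC (a stand-in for the alphalens function mentioned above; the toy numbers are chosen so the ordering is predicted perfectly):

```python
import pandas as pd
from scipy.stats import spearmanr

def rank_ic(alpha_prev: pd.Series, forward_return: pd.Series) -> float:
    """Spearman correlation between yesterday's alpha vector and today's forward return."""
    ic, _ = spearmanr(alpha_prev, forward_return)
    return ic

alpha_prev = pd.Series([0.9, 0.1, -0.3, -0.7], index=["A", "B", "C", "D"])
fwd_ret = pd.Series([0.02, 0.00, -0.01, -0.03], index=["A", "B", "C", "D"])
print(rank_ic(alpha_prev, fwd_ret))  # 1.0: the ordering is predicted perfectly
```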

Information Ratio

  • Recap: Risk Factors are considered to drive portfolio variance; Alpha Factors are considered to drive returns after neutralizing the risk factors. Return driven by risk factors = systematic return, return driven by alpha factors = specific return (also: residual return). r = sum_k(β_k*f_k) + s.
  • Information Ratio IR = sqrt(252)* mean(s)/Stdev(s), where s is the specific return. "It's the Sharpe ratio of the performance that the fund manager (responsible for the alpha) contributes to the portfolio." If we are building a market and common factor neutral portfolio, the portfolio's Sharpe ratio equals the IR!

Fundamental Law of Active Management (Richard Grinold, Ronald Kahn)

  • IR = IC * sqrt(B)
  • Information Ratio IR, Information Coefficient IC (see above; ranked IC), Breadth B (the number of independent applications of the strategy under investigation, e.g. in different markets or at multiple times)
  • The formula is a strong simplification and should not be applied directly. It implies that the IR can be increased by more active trading, and it ignores e.g. transaction costs. It also only considers a single isolated strategy, while a portfolio of strategies can benefit from adding an uncorrelated strategy even if it has a low or negative IR.
  • Corporate Finance Institute, Breaking Down Finance
  • "the IC for even the best quants in the world is relatively not very high. In fact, you might be surprised to learn that the count of profitable trades as a ratio to all trades (a measure similar in spirit to IC) by the best, well-known quants is typically just a bit higher than 50%. And in general, great discretionary investors have the advantage over great quants when it comes to IC. Quants however, have an advantage when it comes to breadth."

Real World Constraints

  • Liquidity: Proxy - Bid-ask spread. For highly liquid stocks in mature markets, it is around 3-5 basis points (1 basis point = 1/100 %). For less liquid / illiquid stocks, bid-ask spread may be 20-30 basis points (for a single trade); large compared to an annual return of around 400 basis points of a "good" alpha model.
  • Transaction Costs: For large trading firms, commissions are a rather negligible part of total transaction costs. Much more significant: market impact. To minimize market impact, trades can be spread over multiple days, but this entails the risk of price movement.
  • Turnover: Liquidity and transaction costs are hard to determine when evaluating an alpha strategy. As a proxy, turnover can be compared. Turnover is calculated as the sum of absolute weight changes of all portfolio components from one day to the next: turnover(t1, t2) = sum_k(|x_(t1,k) - x_(t2,k)|)
  • Factor rank autocorrelation (FRA): Spearman autocorrelation of the factor values at times t-1 and t, corr(rk(α_(t-1)), rk(α_t)). A high FRA means that the ranks of the individual assets don't change much and is thus indicative of low turnover; see the sketch after this list.
  • "If two alpha factors have similar Sharpe ratios, similar quintile performance and similar factor returns, we will prefer the one with the lower turnover ratio / higher FRA." Otherwise, if an alpha factor has high turnover, it must be seen whether the return survives backtesting, paper trading and real trading. Low turnover makes it more feasible to execute trades when stocks are illiquid and to offset transaction costs.
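
A sketch of both turnover proxies, assuming hypothetical T x N DataFrames of daily portfolio weights and alpha values (dates x tickers):

```python
import pandas as pd
from scipy.stats import spearmanr

def daily_turnover(weights: pd.DataFrame) -> pd.Series:
    """Sum of absolute weight changes from one day to the next."""
    return weights.diff().abs().sum(axis=1)

def factor_rank_autocorrelation(alpha: pd.DataFrame) -> pd.Series:
    """Spearman correlation between consecutive days' alpha vectors."""
    fra = {}
    for i in range(1, len(alpha)):
        fra[alpha.index[i]] = spearmanr(alpha.iloc[i - 1], alpha.iloc[i])[0]
    return pd.Series(fra)
```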

Quantile Analysis

  • To get a more detailed picture of the performance of an alpha factor, quantile analysis can be used.
  • Sub-divide the whole stock universe under consideration into quantiles according to the ranked alpha values.
  • For each quantile, analyse return and variance/volatility over a time period, e.g. 1 or 3 years.
  • Ideally, there should be a monotonic relationship between the quantiles and their (risk-adjusted) returns (see the sketch after this list). In specific cases, only the highest and lowest quantile returns may be significantly different from zero, or some quantiles in the lower half and some in the upper half may bear most of the negative/positive returns. It should be considered whether there is a rational reason for this or whether it reveals some arbitrariness in the predictive power of the alpha factor. In any case, if the alpha factor is predictive only for a subset of stocks in the universe, this corresponds to a higher risk.
  • Richard G. Sloan (1996), Do Stock Prices Fully Reflect Information in Accruals and Cash Flows About Future Earnings? (not freely available; only mind map or summary). Message: If quarterly earnings are to a larger degree determined by accruals (e.g. accounts receivable) and to a lesser degree by cash flow from operations (CFO), this is indicative of lower "quality" and potentially lower future earnings. This fact is (was) not fully accounted for in stock prices, so that stocks may show out-/underperformance following future earnings reports, depending on the relation of CFO to accruals in the current report. "The strategy he (Sloan) described is likely not used due to the high information acquisition costs and operational difficulties."
  • Academic researchers often investigate "their" alpha factors differently from practitioners. They look for "broadly applicable" market phenomena by mostly focusing on the outermost quantiles of non-ranked alpha values. To generate trade decisions for each stock in a portfolio, the alpha factor must be applied to the entire universe. Thus, practitioners will have to extend the academic studies before applying those factors.
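
A pandas sketch of a basic quantile breakdown (hypothetical T x N `alpha` and `forward_returns` DataFrames, dates x tickers; quintiles by default):

```python
import pandas as pd

def mean_return_by_quantile(alpha: pd.DataFrame,
                            forward_returns: pd.DataFrame,
                            quantiles: int = 5) -> pd.Series:
    """Mean forward return per alpha quantile, pooled over all dates."""
    stacked = pd.DataFrame({
        "alpha": alpha.stack(),            # long format: one row per (date, ticker)
        "fwd_ret": forward_returns.stack(),
    }).dropna()
    # assign each stock to a quantile per date, based on its ranked alpha value
    stacked["quantile"] = stacked.groupby(level=0)["alpha"].transform(
        lambda x: pd.qcut(x.rank(method="first"), quantiles, labels=False) + 1
    )
    return stacked.groupby("quantile")["fwd_ret"].mean()
```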