anthonyng2/XGBoost.ipynb

## XGBoost.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Macro Trading Strategy with XGBoost\n",
    "\n",
    "XGBoost is short for “Extreme Gradient Boosting”, where the term “Gradient Boosting” was proposed by Friedman in the paper *Greedy Function Approximation: A Gradient Boosting Machine*. XGBoost is based on this original model.\n",
    "\n",
    "XGBoost is used for supervised learning problems.\n",
    "\n",
    "http://xgboost.readthedocs.io/en/latest/model.html\n",
    "\n",
    "I am unable to provide the original data without violating the terms of contracts with vendors. However, you can access the data from a Bloomberg Professional Terminal. I have kept the original tickers for your reference."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Factors\n",
    "\n",
    "The idea for this model is from JP Morgan's May 2017 publication titled **Big Data and AI Strategies**.\n",
    "\n",
    "The model uses the following macro indicators as factors:\n",
    "\n",
    "* High Yield Credit Spreads, CDX_HY\n",
    "* Investment Grade Credit Spreads, CDX_IG\n",
    "* Economic Surprise Index, CESIUSD\n",
    "* Oil, Crude_Oil\n",
    "* US Dollar Index, DXY\n",
    "* Gold, GLD\n",
    "* US 10Yr Treasury, GT10\n",
    "* 10Y-2Y Spread, USYC2Y10\n",
    "\n",
    "In this simple test case, we are attempting to predict the returns of Consumer Discretionary Select Sector SPDR ETF (XLY).\n",
    "\n",
    "One can turn this into a long-short strategy trading the nine SPDR sector ETFs. The basic idea is to use XGBoost to predict the returns of each sector ETF, rank the predictions, go long the top 3 and short the bottom 3.\n",
    "\n",
    "Unfortunately, I am not able to implement this on the Quantopian platform, as it currently does not support XGBoost. For now, this would need to be tested with a vectorized backtest.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from xgboost import XGBRegressor\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dataset = pd.read_csv('master.csv')\n",
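    "# Columns 10-17 hold the eight macro factors; row 0 is dropped so each row pairs with the same-day return below.\n",
    "X = dataset.iloc[1:, 10:18].values\n",
    "# Daily XLY returns (column 1); the first pct_change value is NaN, so skip it.\n",
    "y = dataset.iloc[:, 1].pct_change()[1:].values"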
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Date</th>\n",
       "      <th>XLY</th>\n",
       "      <th>XLF</th>\n",
       "      <th>XLK</th>\n",
       "      <th>XLE</th>\n",
       "      <th>XLV</th>\n",
       "      <th>XLI</th>\n",
       "      <th>XLP</th>\n",
       "      <th>XLB</th>\n",
       "      <th>XLU</th>\n",
       "      <th>CDX_HY</th>\n",
       "      <th>CDX_IG</th>\n",
       "      <th>CESIUSD</th>\n",
       "      <th>Crude_Oil</th>\n",
       "      <th>DXY</th>\n",
       "      <th>GLD</th>\n",
       "      <th>GT10</th>\n",
       "      <th>USYC2Y10</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>09/09/2011</td>\n",
       "      <td>32.358040</td>\n",
       "      <td>6.778456</td>\n",
       "      <td>21.111734</td>\n",
       "      <td>56.789097</td>\n",
       "      <td>28.906340</td>\n",
       "      <td>26.604198</td>\n",
       "      <td>25.542053</td>\n",
       "      <td>29.202261</td>\n",
       "      <td>26.290201</td>\n",
       "      <td>92.000</td>\n",
       "      <td>132.25</td>\n",
       "      <td>-41.8</td>\n",
       "      <td>156.19</td>\n",
       "      <td>77.192</td>\n",
       "      <td>180.70</td>\n",
       "      <td>1.920</td>\n",
       "      <td>174.879</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>09/12/2011</td>\n",
       "      <td>32.741726</td>\n",
       "      <td>6.856050</td>\n",
       "      <td>21.363386</td>\n",
       "      <td>57.060394</td>\n",
       "      <td>29.005796</td>\n",
       "      <td>26.657087</td>\n",
       "      <td>25.567596</td>\n",
       "      <td>29.008575</td>\n",
       "      <td>26.506611</td>\n",
       "      <td>91.313</td>\n",
       "      <td>135.75</td>\n",
       "      <td>-39.5</td>\n",
       "      <td>157.89</td>\n",
       "      <td>77.578</td>\n",
       "      <td>176.67</td>\n",
       "      <td>1.948</td>\n",
       "      <td>174.161</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>09/13/2011</td>\n",
       "      <td>33.125416</td>\n",
       "      <td>6.900391</td>\n",
       "      <td>21.615030</td>\n",
       "      <td>57.217922</td>\n",
       "      <td>29.304174</td>\n",
       "      <td>27.150738</td>\n",
       "      <td>25.661251</td>\n",
       "      <td>29.483980</td>\n",
       "      <td>26.682945</td>\n",
       "      <td>92.375</td>\n",
       "      <td>132.25</td>\n",
       "      <td>-38.5</td>\n",
       "      <td>161.51</td>\n",
       "      <td>76.919</td>\n",
       "      <td>178.54</td>\n",
       "      <td>1.992</td>\n",
       "      <td>178.866</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>09/14/2011</td>\n",
       "      <td>33.673565</td>\n",
       "      <td>6.983530</td>\n",
       "      <td>21.938587</td>\n",
       "      <td>57.926796</td>\n",
       "      <td>29.584467</td>\n",
       "      <td>27.626760</td>\n",
       "      <td>25.967760</td>\n",
       "      <td>29.941778</td>\n",
       "      <td>26.867296</td>\n",
       "      <td>93.000</td>\n",
       "      <td>129.75</td>\n",
       "      <td>-39.4</td>\n",
       "      <td>159.18</td>\n",
       "      <td>76.833</td>\n",
       "      <td>177.21</td>\n",
       "      <td>1.985</td>\n",
       "      <td>179.759</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>09/15/2011</td>\n",
       "      <td>34.239948</td>\n",
       "      <td>7.160888</td>\n",
       "      <td>22.289101</td>\n",
       "      <td>59.073273</td>\n",
       "      <td>29.855724</td>\n",
       "      <td>28.190929</td>\n",
       "      <td>26.274261</td>\n",
       "      <td>30.452398</td>\n",
       "      <td>27.203936</td>\n",
       "      <td>94.000</td>\n",
       "      <td>125.75</td>\n",
       "      <td>-42.2</td>\n",
       "      <td>160.06</td>\n",
       "      <td>76.241</td>\n",
       "      <td>174.40</td>\n",
       "      <td>2.083</td>\n",
       "      <td>189.179</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Date        XLY       XLF        XLK        XLE        XLV  \\\n",
       "0  09/09/2011  32.358040  6.778456  21.111734  56.789097  28.906340   \n",
       "1  09/12/2011  32.741726  6.856050  21.363386  57.060394  29.005796   \n",
       "2  09/13/2011  33.125416  6.900391  21.615030  57.217922  29.304174   \n",
       "3  09/14/2011  33.673565  6.983530  21.938587  57.926796  29.584467   \n",
       "4  09/15/2011  34.239948  7.160888  22.289101  59.073273  29.855724   \n",
       "\n",
       "         XLI        XLP        XLB        XLU  CDX_HY  CDX_IG  CESIUSD  \\\n",
       "0  26.604198  25.542053  29.202261  26.290201  92.000  132.25    -41.8   \n",
       "1  26.657087  25.567596  29.008575  26.506611  91.313  135.75    -39.5   \n",
       "2  27.150738  25.661251  29.483980  26.682945  92.375  132.25    -38.5   \n",
       "3  27.626760  25.967760  29.941778  26.867296  93.000  129.75    -39.4   \n",
       "4  28.190929  26.274261  30.452398  27.203936  94.000  125.75    -42.2   \n",
       "\n",
       "   Crude_Oil     DXY     GLD   GT10  USYC2Y10  \n",
       "0     156.19  77.192  180.70  1.920   174.879  \n",
       "1     157.89  77.578  176.67  1.948   174.161  \n",
       "2     161.51  76.919  178.54  1.992   178.866  \n",
       "3     159.18  76.833  177.21  1.985   179.759  \n",
       "4     160.06  76.241  174.40  2.083   189.179  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1433, 18)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dataset.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[  91.313,  135.75 ,  -39.5  ,  157.89 ,   77.578,  176.67 ,\n",
       "           1.948,  174.161],\n",
       "       [  92.375,  132.25 ,  -38.5  ,  161.51 ,   76.919,  178.54 ,\n",
       "           1.992,  178.866],\n",
       "       [  93.   ,  129.75 ,  -39.4  ,  159.18 ,   76.833,  177.21 ,\n",
       "           1.985,  179.759]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.01185752,  0.01171869,  0.01654769])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
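    "# Chronological split: the first 1403 observations train the model, the remaining 29 are held out for testing.\n",
    "seg = 1403\n",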
    "X_train = X[:seg,:]\n",
    "y_train = y[:seg]\n",
    "X_test = X[seg:,:]\n",
    "y_test = y[seg:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sc_X = StandardScaler()\n",
    "X_train = sc_X.fit_transform(X_train)\n",
    "X_test = sc_X.transform(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "       colsample_bytree=1, eval_metric='logloss', gamma=0,\n",
       "       learning_rate=0.1, max_delta_step=0, max_depth=7,\n",
       "       min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,\n",
       "       nthread=None, objective='reg:linear', random_state=0, reg_alpha=0,\n",
       "       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,\n",
       "       subsample=1)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = XGBRegressor(booster=\"gbtree\", objective=\"reg:linear\", \n",
    "                     max_depth = 7,\n",
    "                     subsample=1, eval_metric='logloss')\n",
    "model.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "XGBoost\n",
      "RMSE: 0.0061\n",
      "R^2 Score: -0.0374\n"
     ]
    }
   ],
   "source": [
    "y_xg_pred = model.predict(X_test)\n",
    "\n",
    "print(\"XGBoost\")\n",
    "print(\"RMSE: {0:.4f}\".format(np.sqrt(mean_squared_error(y_test, y_xg_pred))))\n",
    "print(\"R^2 Score: {0:.4f}\".format(r2_score(y_test, y_xg_pred)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lr = LinearRegression()\n",
    "lr.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Linear Regression\n",
      "RMSE: 0.0060\n",
      "R^2 Score: -0.0034\n"
     ]
    }
   ],
   "source": [
    "y_lr_pred = lr.predict(X_test)\n",
    "\n",
    "print(\"Linear Regression\")\n",
    "print(\"RMSE: {0:.4f}\".format(np.sqrt(mean_squared_error(y_test, y_lr_pred))))\n",
    "print(\"R^2 Score: {0:.4f}\".format(r2_score(y_test, y_lr_pred)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
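    "The RMSE of XGBoost is slightly worse than that of Linear Regression (0.0061 vs. 0.0060). Its $R^2$ is also worse: -0.0374 versus -0.0034, roughly 10x more negative. A negative $R^2$ means both models do worse on the test set than simply predicting the mean return. However, it is difficult to judge either model until one backtests them both.\n",
    "\n",
    "This is just a simple demo of how one can use XGBoost to predict returns. You can now easily extend it to multiple assets."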
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
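
The notebook describes, but does not implement, the long-short extension: predict each sector ETF's return, rank the predictions, go long the top 3 and short the bottom 3. The following is a minimal sketch of only the ranking step, assuming equal weights within each leg. The helper name `long_short_weights` and the predicted values are made up for illustration and are not part of the notebook; in practice the predictions would come from one fitted model per ETF.

```python
import pandas as pd


def long_short_weights(predicted_returns, n_long=3, n_short=3):
    """Equal-weight long the top n_long assets and short the bottom n_short.

    Hypothetical helper for illustration; the notebook defines nothing like it.
    """
    if n_long <= 0 or n_short <= 0:
        raise ValueError("n_long and n_short must be positive")
    if n_long + n_short > len(predicted_returns):
        raise ValueError("n_long + n_short exceeds the number of assets")

    # Rank assets from highest to lowest predicted return.
    ranked = predicted_returns.sort_values(ascending=False)

    # Start flat, then assign equal weights to each leg.
    weights = pd.Series(0.0, index=predicted_returns.index)
    weights.loc[ranked.index[:n_long]] = 1.0 / n_long
    weights.loc[ranked.index[-n_short:]] = -1.0 / n_short
    return weights


# Made-up predictions for the nine sector ETFs (illustrative values only).
preds = pd.Series({
    "XLY": 0.004, "XLF": -0.002, "XLK": 0.006, "XLE": -0.005, "XLV": 0.001,
    "XLI": 0.002, "XLP": -0.001, "XLB": 0.003, "XLU": -0.004,
})
weights = long_short_weights(preds)
print(weights)
```

With equal weights in each leg the portfolio is dollar-neutral (the weights sum to zero). Multiplying these weights by the next period's realized returns gives one step of a vectorized backtest.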