@reouno
Created December 7, 2020 01:44
Compare the return distribution of the day after an up day with the overall distribution
{
"cells": [
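{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Mass search for small edges: stage one\n",
"\n",
"The target is always the E-mini S&P 500 futures (**ES**). The bar interval is free, the prediction method is free, and so is what is predicted (for example, a rise on the next bar, or a decline within N bars).\n",
"\n",
"As a rule, however, estimate **distributions rather than points**. A point estimate only yields something like probability of a rise = XX%, so whenever you want the probability of a rise of at least 1 point, or of a decline staying within 10 points, you have to go back and recompute it from the raw data. Once a distribution has been estimated, any such probability can be computed from that distribution when it is actually needed.\n",
"\n",
"\n",
"- Return distribution of the bar after an up bar\n",
"- Return distribution of the bar after N consecutive down bars\n",
"- Return distribution of ES on the bar after instrument A rose\n",
"- Return distribution of ES on the bar after instrument A fell\n",
"- Return distribution of ES on the bar after instrument A fell two days in a row\n",
"\n",
"The plan is to estimate distributions like these one after another. Ideally there would be a mechanism that, given only such a rule, estimates the distribution and records the result automatically."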
]
},
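{
"cell_type": "markdown",
"metadata": {},
"source": [
"# To start: compare the return distribution of the day after an up day with the overall distribution\n",
"\n",
"Hypothesis: if an up day tends to be followed by a pullback, the return distribution of the day after an up day should be shifted to the left of the all-period return distribution.\n",
"\n",
"## Conclusion\n",
"\n",
"See the last cell for the results.\n",
"\n",
"Returns on the day after an up day are almost the same as all-period returns, so a rule that simply sells (or buys) on the day after an up day is unlikely to earn more than a random trading strategy."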
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'en_US.UTF-8'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%matplotlib inline\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from pandas.api.types import CategoricalDtype\n",
"import matplotlib as mpl\n",
"mpl.rcParams['font.family'] = 'sans-serif'\n",
"mpl.rcParams['font.sans-serif'] = ['Hiragino Maru Gothic Pro', 'Yu Gothic', 'Meiryo', 'Takao', 'IPAexGothic', 'IPAPGothic', 'VL PGothic', 'Noto Sans CJK JP']\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from scipy import stats\n",
"import datetime as dt\n",
"from dateutil.relativedelta import relativedelta\n",
"import locale\n",
"\n",
"# Set the time locale so that month and weekday names are returned in English\n",
"locale.setlocale(locale.LC_TIME, 'en_US.UTF-8')"
]
},
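{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading the E-mini S&P 500 futures and gold futures data\n",
"\n",
"The data are CSV files exported with the [TradeStation Desktop Platform](https://www.tradestation.com/platforms-and-tools/desktop/). Redistribution is prohibited, so the files cannot be published here, but the latest daily data for [E-mini S&P 500 futures](https://www.quandl.com/data/CHRIS/CME_ES1-E-mini-S-P-500-Futures-Continuous-Contract-1-ES1-Front-Month) (the same data used in this analysis) can also be obtained from Quandl through its API and used instead. The DataFrame obtained through Quandl has a slightly different structure, so the cells below need minor changes.\n",
"- The DataFrame obtained through Quandl already has a DatetimeIndex, and the closing price column is 'Last' rather than 'Close'. These are probably the only two differences."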
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/leo/src/pyproject/py-envs/py383env/lib/python3.6/site-packages/pandas/core/series.py:679: RuntimeWarning: divide by zero encountered in log\n",
" result = getattr(ufunc, method)(*inputs, **kwargs)\n"
]
}
],
"source": [
"df_tmp = pd.read_csv('data/e-mini-sp500-200530/e-mini-sp500-daily.csv')\n",
"\n",
"# Convert to a DatetimeIndex\n",
"def to_datetime_index(df):\n",
"    # Add a datetime column built from the Date and Time strings\n",
"    df['datetime'] = (df['Date'] + '-' + df['Time']).map(lambda s: dt.datetime.strptime(s, '%m/%d/%Y-%H:%M'))\n",
"    df = df.set_index('datetime', drop=True)\n",
"    df = df.drop(columns=['Date', 'Time'])\n",
"    return df\n",
"\n",
"df = to_datetime_index(df_tmp)\n",
"\n",
"# Add log-transformed columns (modifies df in place)\n",
"def add_log_values(df):\n",
"    # np.log(0) yields -inf and emits a divide-by-zero RuntimeWarning\n",
"    df['logO'] = np.log(df['Open'])\n",
"    df['logH'] = np.log(df['High'])\n",
"    df['logL'] = np.log(df['Low'])\n",
"    df['logC'] = np.log(df['Close'])\n",
"    df['logV'] = np.log(df['Vol'])\n",
"    df['logOI'] = np.log(df['OI'])\n",
"\n",
"_ = add_log_values(df)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Date</th>\n",
" <th>Time</th>\n",
" <th>Open</th>\n",
" <th>High</th>\n",
" <th>Low</th>\n",
" <th>Close</th>\n",
" <th>Vol</th>\n",
" <th>OI</th>\n",
" <th>datetime</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>09/11/1997</td>\n",
" <td>17:00</td>\n",
" <td>1071.25</td>\n",
" <td>1082.25</td>\n",
" <td>1062.75</td>\n",
" <td>1068.50</td>\n",
" <td>11825</td>\n",
" <td>2909</td>\n",
" <td>1997-09-11 17:00:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>09/12/1997</td>\n",
" <td>17:00</td>\n",
" <td>1070.50</td>\n",
" <td>1089.00</td>\n",
" <td>1066.00</td>\n",
" <td>1071.25</td>\n",
" <td>9759</td>\n",
" <td>4059</td>\n",
" <td>1997-09-12 17:00:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Date Time Open High Low Close Vol OI \\\n",
"0 09/11/1997 17:00 1071.25 1082.25 1062.75 1068.50 11825 2909 \n",
"1 09/12/1997 17:00 1070.50 1089.00 1066.00 1071.25 9759 4059 \n",
"\n",
" datetime \n",
"0 1997-09-11 17:00:00 \n",
"1 1997-09-12 17:00:00 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For reference: the raw DataFrame\n",
"df_tmp.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Open</th>\n",
" <th>High</th>\n",
" <th>Low</th>\n",
" <th>Close</th>\n",
" <th>Vol</th>\n",
" <th>OI</th>\n",
" <th>logO</th>\n",
" <th>logH</th>\n",
" <th>logL</th>\n",
" <th>logC</th>\n",
" <th>logV</th>\n",
" <th>logOI</th>\n",
" </tr>\n",
" <tr>\n",
" <th>datetime</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1997-09-11 17:00:00</th>\n",
" <td>1071.25</td>\n",
" <td>1082.25</td>\n",
" <td>1062.75</td>\n",
" <td>1068.50</td>\n",
" <td>11825</td>\n",
" <td>2909</td>\n",
" <td>6.976581</td>\n",
" <td>6.986797</td>\n",
" <td>6.968615</td>\n",
" <td>6.974011</td>\n",
" <td>9.377971</td>\n",
" <td>7.975565</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1997-09-12 17:00:00</th>\n",
" <td>1070.50</td>\n",
" <td>1089.00</td>\n",
" <td>1066.00</td>\n",
" <td>1071.25</td>\n",
" <td>9759</td>\n",
" <td>4059</td>\n",
" <td>6.975881</td>\n",
" <td>6.993015</td>\n",
" <td>6.971669</td>\n",
" <td>6.976581</td>\n",
" <td>9.185945</td>\n",
" <td>8.308692</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Open High Low Close Vol OI \\\n",
"datetime \n",
"1997-09-11 17:00:00 1071.25 1082.25 1062.75 1068.50 11825 2909 \n",
"1997-09-12 17:00:00 1070.50 1089.00 1066.00 1071.25 9759 4059 \n",
"\n",
" logO logH logL logC logV \\\n",
"datetime \n",
"1997-09-11 17:00:00 6.976581 6.986797 6.968615 6.974011 9.377971 \n",
"1997-09-12 17:00:00 6.975881 6.993015 6.971669 6.976581 9.185945 \n",
"\n",
" logOI \n",
"datetime \n",
"1997-09-11 17:00:00 7.975565 \n",
"1997-09-12 17:00:00 8.308692 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For reference: the DataFrame after adding the log-transformed columns\n",
"df.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Build a DataFrame of price, log price, price difference, and log-difference return (x100)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def to_log_return_ratio_df(df):\n",
"    # Bar-to-bar differences of every column\n",
"    diff_df = df.diff()\n",
"    close_df = df[['Close', 'logC']]\n",
"    diff_df = diff_df.rename(columns={'Close': 'CloseDiff', 'logC': 'logCDiff'})\n",
"    # Copy the slice so the assignment below does not trigger SettingWithCopyWarning\n",
"    close_diff_df = diff_df[['CloseDiff', 'logCDiff']].copy()\n",
"    # Express the log-difference return in percent\n",
"    close_diff_df['logCDiff'] = close_diff_df['logCDiff'] * 100\n",
"    rr_df = pd.concat([close_df, close_diff_df], axis=1)\n",
"    rr_df = rr_df.dropna()\n",
"    return rr_df\n",
"\n",
"rr_df = to_log_return_ratio_df(df)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Close</th>\n",
" <th>logC</th>\n",
" <th>CloseDiff</th>\n",
" <th>logCDiff</th>\n",
" </tr>\n",
" <tr>\n",
" <th>datetime</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1997-09-12 17:00:00</th>\n",
" <td>1071.25</td>\n",
" <td>6.976581</td>\n",
" <td>2.75</td>\n",
" <td>0.257040</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1997-09-15 17:00:00</th>\n",
" <td>1083.75</td>\n",
" <td>6.988183</td>\n",
" <td>12.50</td>\n",
" <td>1.160106</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Close logC CloseDiff logCDiff\n",
"datetime \n",
"1997-09-12 17:00:00 1071.25 6.976581 2.75 0.257040\n",
"1997-09-15 17:00:00 1083.75 6.988183 12.50 1.160106"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For reference: after taking log differences, the first day's row is gone\n",
"rr_df.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Estimating and comparing the distributions"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Add a column holding the log-difference return of the next bar\n",
"rr_df2d = rr_df.rename(columns={'logCDiff': 'logCDiff0'})\n",
"# shift(-1) places the next row's value on each row while keeping the DatetimeIndex,\n",
"# so no index reset or to_numpy() round trip is needed\n",
"rr_df2d['logCDiff1'] = rr_df2d['logCDiff0'].shift(-1)\n",
"\n",
"# The last row of logCDiff1 is NaN (the final bar has no next bar), so drop that row\n",
"rr_df2d = rr_df2d.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Close</th>\n",
" <th>logC</th>\n",
" <th>CloseDiff</th>\n",
" <th>logCDiff0</th>\n",
" <th>logCDiff1</th>\n",
" </tr>\n",
" <tr>\n",
" <th>datetime</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1997-09-12 17:00:00</th>\n",
" <td>1071.25</td>\n",
" <td>6.976581</td>\n",
" <td>2.75</td>\n",
" <td>0.257040</td>\n",
" <td>1.160106</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1997-09-15 17:00:00</th>\n",
" <td>1083.75</td>\n",
" <td>6.988183</td>\n",
" <td>12.50</td>\n",
" <td>1.160106</td>\n",
" <td>2.258050</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Close logC CloseDiff logCDiff0 logCDiff1\n",
"datetime \n",
"1997-09-12 17:00:00 1071.25 6.976581 2.75 0.257040 1.160106\n",
"1997-09-15 17:00:00 1083.75 6.988183 12.50 1.160106 2.258050"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
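],
"source": [
"# For reference: logCDiff0 is the return of the current bar, logCDiff1 the return of the next bar\n",
"# We want the distribution of the next bar's return given that the current bar's return was positive\n",
"rr_df2d.head(2)"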
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Share of up days = 3045 / 5727 = 53.17%\n"
]
}
],
"source": [
"up_df = rr_df2d[rr_df2d['logCDiff0'] > 0]\n",
"n_all = rr_df2d.shape[0]\n",
"n_up = up_df.shape[0]\n",
"print(f'Share of up days = {n_up} / {n_all} = {n_up / n_all * 100:.02f}%')"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All-period log-difference returns: t-distribution parameters\n",
"df=2.3669323230150696, loc=0.07086735101302821, scale=0.7277745201048187\n",
"Returns on the day after an up day: t-distribution parameters\n",
"df=2.360011616962992, loc=0.04749138631505906, scale=0.6510313403914763\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x1296 with 3 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots(3, 1, figsize=(20, 18))\n",
"\n",
"# Plot the all-period return distribution\n",
"# Plot the return distribution of the day after an up day\n",
"# Overlay the two\n",
"# Fit a t distribution to each (open question: should ν be held at the same value for both?)\n",
"plot_data = [rr_df2d['logCDiff0'], up_df['logCDiff1']]\n",
"titles = ['All-period log-difference returns', 'Returns on the day after an up day']\n",
"colors = ['tab:green', 'tab:pink']\n",
"\n",
"# Use the wider x range (the all-period data) for every panel\n",
"xmin = plot_data[0].min()\n",
"xmax = plot_data[0].max()\n",
"\n",
"# Grid on which the fitted t densities are evaluated\n",
"xs = np.linspace(xmin, xmax, 300)\n",
"\n",
"t_params = []\n",
"t_ys = []\n",
"for i in range(2):\n",
"    sns.histplot(plot_data[i], kde=False, stat='density', color='lightblue', ax=ax[i])\n",
"\n",
"    # Fit a t distribution; stats.t.fit returns (df, loc, scale)\n",
"    t_params.append(stats.t.fit(plot_data[i]))\n",
"    t_ys.append(stats.t.pdf(xs, df=t_params[i][0], loc=t_params[i][1], scale=t_params[i][2]))\n",
"\n",
"    ax[i].set_xlim(xmin, xmax)\n",
"    ax[i].plot(xs, t_ys[i], color=colors[i])\n",
"\n",
"    ax[i].set_title(titles[i], fontweight='semibold', fontsize=16)\n",
"\n",
"# Overlay the fitted densities for comparison\n",
"ax[2].plot(xs, t_ys[0], label=titles[0], color=colors[0])\n",
"ax[2].plot(xs, t_ys[1], label=titles[1], color=colors[1])\n",
"ax[2].set_xlim(xmin, xmax)\n",
"ax[2].legend()\n",
"\n",
"# Print the estimated parameters\n",
"for i in range(2):\n",
"    print(f'{titles[i]}: t-distribution parameters')\n",
"    print(f'df={t_params[i][0]}, loc={t_params[i][1]}, scale={t_params[i][2]}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results"
]
},
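{
"cell_type": "markdown",
"metadata": {},
"source": [
"Almost no difference was found between the all-period return distribution and the return distribution of the day after an up day. Judging from how closely the fitted curves overlap, a statistical test is probably unnecessary.\n",
"\n",
"\n",
"#### Note: on the t-distribution parameter ν\n",
"\n",
"When comparing t distributions, should the parameter ν be held fixed? Goodness of fit is compared with measures such as the log-likelihood, AIC, or BIC, so there is no need to fix the parameter, and distributions from different families (for example normal versus t) can be compared as well. Likewise, concrete conditional probabilities computed from the fitted distributions, such as the probability of a rise of at least XX, can be compared even when the distributions differ.\n",
"\n",
"Comparing means and variances is a different matter. The mean and variance of, say, a normal distribution and a t distribution cannot be compared directly, because the two are different distributions in the first place: even when both are fitted to the same data, the estimated mean and variance parameters come out different. For a t distribution the variance is scale² × ν / (ν − 2) (for ν > 2), so the same scale implies a different variance under a different ν.\n",
"\n",
"It follows that when two t distributions have different values of ν, their mean and variance parameters cannot be compared directly. In summary:\n",
"\n",
"- Comparing goodness of fit: no need to fix ν (different distributions can be compared)\n",
"- Comparing probabilities computed from each distribution: no need to fix ν (this is not a comparison of the distributions themselves)\n",
"- Comparing mean and variance values directly: ν must be fixed"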
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
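
The first cell of the notebook above wishes for a mechanism that takes only a rule and collects the matching next-bar returns. A minimal sketch of that conditioning step follows, using only the standard library. The helper names (`log_returns_pct`, `next_bar_returns`, `up_bar`, `n_down_in_a_row`) and the closing prices are made up for illustration; they are not part of the original notebook and not ES data.

```python
import math
from typing import Callable, List, Sequence


def log_returns_pct(closes: Sequence[float]) -> List[float]:
    """Log-difference returns in percent, one per bar after the first."""
    return [100.0 * (math.log(b) - math.log(a)) for a, b in zip(closes, closes[1:])]


def next_bar_returns(
    returns: Sequence[float],
    rule: Callable[[Sequence[float], int], bool],
) -> List[float]:
    """Return of bar i+1 for every bar i at which rule(returns, i) holds."""
    return [returns[i + 1] for i in range(len(returns) - 1) if rule(returns, i)]


def up_bar(returns: Sequence[float], i: int) -> bool:
    """Rule: the current bar closed higher than the previous one."""
    return returns[i] > 0


def n_down_in_a_row(n: int) -> Callable[[Sequence[float], int], bool]:
    """Rule factory: the last n bars up to and including bar i all fell."""
    def rule(returns: Sequence[float], i: int) -> bool:
        return i >= n - 1 and all(r < 0 for r in returns[i - n + 1 : i + 1])
    return rule


# Hypothetical closing prices, for illustration only.
closes = [100.0, 101.0, 100.5, 102.0, 101.0, 100.0, 100.8, 101.5]
rets = log_returns_pct(closes)

after_up = next_bar_returns(rets, up_bar)
after_2_down = next_bar_returns(rets, n_down_in_a_row(2))
print(len(rets), len(after_up), len(after_2_down))
```

Each list returned by `next_bar_returns` plays the role of `up_df['logCDiff1']` in the notebook: it is the sample to which a distribution would then be fitted. Expressing rules as plain functions of the return series and an index is one possible design; it keeps a new rule to a few lines, at the cost of being slower than a vectorized pandas mask on long histories.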