{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
" # How to analyse public Solvency II data of Dutch insurance undertakings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we will use the public Solvency II data of all Dutch insurance undertakings and present it in one large data vector per undertaking. In doing so, we are able to use some easy but powerful machine learning algorithms to analyze the data.\n",
"\n",
"Solvency II data of individual insurance undertakings is published yearly by the Statistics department of DNB. The data represents the financial and solvency situation of each insurance undertaking per the end of each year. Currently, we have two years of data: ultimo 2016 and ultimo 2017. The publication of DNB is in the form of an Excel file with a number of worksheets containing the aggregated data and the individual data. Here we will use the individual data.\n",
"\n",
"We will read the data in a Pandas Data Frame en use Numpy for data manipulations. Furthermore we need the datetime object"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can find the data with the following url\n",
"\n",
"https://statistiek.dnb.nl/en/downloads/index.aspx#/?statistics_type=toezichtdata&theme=verzekeraars\n",
"\n",
"Download the file and make sure that the Excel file is accessible."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_path = \"../../20_local_data/\"\n",
"\n",
"xls = pd.ExcelFile(data_path + \"Data individual insurers (year).xlsx\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Excel file contains several worksheets with data. We want to combine all the data together in one Data Frame. To do that we need some data preparation and data cleaning for each worksheet.\n",
"\n",
"In the following function a worksheet is put into a Data Frame and the columns names are set to lower case. Then an index of the data frame is set to the insurance undertaking name and the reporting period. Then we perform some cleaning (the original worksheets contain some process information). In addition, some worksheets in the file contain 2-dimensional data, that need to be pivoted such that we obtain one large vector with all the data per insurance undertaking in one row."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_sheet(num):\n",
" \n",
" # read entire Excel sheet\n",
" df = xls.parse(num)\n",
"\n",
" # columns names to lower case\n",
" df.columns = [c.lower() for c in df.columns]\n",
"\n",
" # set index to name and period\n",
" df.set_index(['name', 'period'], inplace = True)\n",
"\n",
" # data cleaning (the excel sheet contains some additional data that we don't need)\n",
" drop_list = [i for i in df.columns if 'unnamed' in i or 'selectielijst' in i]\n",
" df.drop(drop_list, axis = 1, inplace = True)\n",
" \n",
" # pivot data frame\n",
" if \"row_name\" in df.columns:\n",
" df.drop(\"row_name\", axis = 1, inplace = True)\n",
" df = df.pivot(columns = 'row_header')\n",
"\n",
" if df.columns.nlevels > 1:\n",
" df.columns = [str(df.columns[i]) for i in range(len(df.columns))]\n",
"\n",
" return df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating one large vector per insurance undertaking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the function above we can read a part of the Excel file and store it in a Pandas data frame. The following worksheets are contained in the Excel file published by DNB.\n",
"\n",
"* Worksheet 14: balance sheet\n",
"* Worksheet 15: premiums - life\n",
"* Worksheet 16: premiums - nonlife\n",
"* Worksheet 17: technical provisions - 1\n",
"* Worksheet 18: technical provisions - 2\n",
"* Worksheet 19: transition and adjustments\n",
"* Worksheet 20: own funds\n",
"* Worksheet 21: solvency capital requirements - 1\n",
"* Worksheet 22: solvency capital requirements - 2\n",
"* Worksheet 23: minimum capital requirements\n",
"* Worksheet 24: additional information life\n",
"* Worksheet 25: additional information non-life\n",
"* Worksheet 26: additional information reinsurance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s read the first worksheet with data and then append the other sheets to it. We shall not read the last three worksheets, because these contain the country specific reports. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = get_sheet(14)\n",
"for l in range(15, 24):\n",
" df_temp = get_sheet(l)\n",
" df = df.join(df_temp, rsuffix = \"_\"+ str(l))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get the shape of the dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Number of rows : \" + str(len(df.index)))\n",
"print(\"Number of columns: \" + str(len(df.columns)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So we have 286 rows of insurance undertakings data for ultimo 2016 and ultimo 2017. And we have 1273 columns with financial and solvency data about each insurance undertaking. Let’s take the data from the year 2017, and select all columns that have floating numbers in them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = df.xs(datetime(2017,12,31), axis = 0, level = 1, drop_level = True)\n",
"\n",
"df = df[[df.columns[c] for c in range(len(df.columns)) if df.dtypes[c]=='float64']]\n",
"\n",
"df.fillna(0, inplace = True)\n",
"\n",
"print(\"Number of rows : \" + str(len(df.index)))\n",
"print(\"Number of columns: \" + str(len(df.columns)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apparently there is one columns with a non-floating values. For ultimo 2017 we have the data of 139 insurance undertaking.\n",
"\n",
"Now we can perform all kinds of numerical analysis. Let’s calculate the total amount of assets of all Dutch insurance undertakings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['assets|total assets , solvency ii value'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That’s almost 487 billion euro at the end of 2017."
]
},
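{
"cell_type": "markdown",
"metadata": {},
"source": [
"As another simple example (a sketch using the same total assets column), we can list the ten largest insurance undertakings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ten largest insurance undertakings by total assets\n",
"df['assets|total assets , solvency ii value'].nlargest(10)"
]
},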
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finding the most similar insurance undertaking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let’s apply some algorithms to this data set. Suppose we want to know what insurance undertakings are similar with respect to their financial and solvency structure. To do that we can calculate the distances between all the data points of each insurance undertakings. An insurance undertaking with a low distance to another insurance undertaking might be similar to that undertaking.\n",
"\n",
"If we divide each row by the total assets we do as if all insurance undertakings have equal size, and then the distances indicate similarity in financial and solvency structure (and not similarity in size)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = df.div(df['assets|total assets , solvency ii value'], axis = 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scikit-learn package provides numerous algorithms to do calculations with distances. Below we apply the NearestNeighbors algorithm to find the neighbors of each insurance undertaking. Then we get the distances and the indices of the data set and store them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import NearestNeighbors\n",
"\n",
"nbrs = NearestNeighbors(n_neighbors = 2, algorithm = 'brute').fit(X.values)\n",
"\n",
"distances, indices = nbrs.kneighbors(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the nearest neighbors of the first ten insurance undertakings in the list?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for i in indices[0:10]:\n",
" print(X.index[i[0]] + \" --> \" + X.index[i[1]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And with the shortest distance between two insurance undertakings we can find the two insurance undertakings that have the highest similarity in their structure."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"min_list = np.where(distances[:,1] == distances[:,1].min())\n",
"\n",
"list(X.index[min_list])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to understand the financial performance it is of course handy to know which insurance undertakings are similar. A more general approach when comparing insurance undertakings is to cluster them into a small number of peer groups."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clustering of insurance undertakings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Can we cluster the insurance undertakings based on the 1272-dimensional data? To do this we apply the t-sne algorithm (that we used before).\n",
"\n",
"First we import all the packages required."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as pyplot\n",
"from sklearn.manifold import TSNE\n",
"from sklearn.cluster import KMeans"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we run the t-sne algorithm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y = TSNE(n_components = 2, perplexity = 5, verbose = 0, random_state = 1).fit_transform(X.values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we plot the results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pyplot.figure(figsize = (7, 7))\n",
"\n",
"pyplot.scatter(x = Y[:, 0], \n",
" y = Y[:, 1], \n",
" s = 7)\n",
"\n",
"kmeans = KMeans(n_clusters = 4, random_state = 0, n_init = 10).fit(Y)\n",
"\n",
"for i in range(len(kmeans.cluster_centers_)):\n",
" \n",
" pyplot.scatter(x = kmeans.cluster_centers_[i,0],\n",
" y = kmeans.cluster_centers_[i,1],\n",
" s = 20,\n",
" c = 'yellow')\n",
" \n",
" pyplot.annotate(str(i), \n",
" xy = (kmeans.cluster_centers_[i, 0], kmeans.cluster_centers_[i, 1]), \n",
" size = 20)\n",
"\n",
"pyplot.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on how you zoom in you see different clusters in this picture. In the above left you see the health insurance undertakings (with more clusters within that set: those offering basic health insurance and other offering additional health insurances, or both). On the right are (mostly) life insurance undertakings, and on the left (middle to below) are non-life insurance undertakings. And both clusters can be divided into several more sub clusters. These clusters can be used in further analysis. For example, you could use these as peer groups of similar insurance undertakings."
]
},
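{
"cell_type": "markdown",
"metadata": {},
"source": [
"To inspect these peer groups (a sketch based on the k-means labels fitted above), we can list the insurance undertakings assigned to each cluster:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# list the undertakings per k-means cluster; kmeans.labels_ aligns with the rows of X\n",
"for c in range(len(kmeans.cluster_centers_)):\n",
"    print(\"cluster \" + str(c) + \": \" + str(list(X.index[kmeans.labels_ == c])))"
]
},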
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clustering of features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Given that we have a 1272-dimensional vector of each insurance undertaking we might wish somehow to cluster the features in the data set. That is, we want to know which columns belong to each other and what columns are different.\n",
"\n",
"An initial form of clustering were the different worksheets in the original Excel file. The data was clustered around the balance sheet, premiums, technical provisions, etc. But can we also find clusters within the total vector without any prior knowledge of the different worksheets?\n",
"\n",
"A simple and effective way is to transpose the data matrix and feed it into the t-sne algorithm. That is, instead of assuming that each feature provides additional information about an insurance undertaking, we assume that each insurance undertaking provides additional information about a feature.\n",
"\n",
"Let’s do this for only the balance sheet. In a balance sheet it is not immediately straightforward how the left side is related to the right side of the balance sheet, i.e. which assets are related to which liabilities. If you cluster all the data of the balance sheet then related items are clustered (irrespective of whether they are assets or liabilities)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = get_sheet(14)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of the scaled values we now take whether or not a data point was reported or not, and then transpose the matrix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# scale data\n",
"X = (df != 0).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we apply the t-sne algorithm. In this case with a lower perplexity. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Y = TSNE(n_components = 2, \n",
" perplexity = 1.0, \n",
" verbose = 0, \n",
" random_state = 0, \n",
" learning_rate = 20, \n",
" n_iter = 10000).fit_transform(X.values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we plot the result with 15 identified clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pyplot.figure(figsize = (7, 7))\n",
"\n",
"pyplot.scatter(x = Y[:, 0], \n",
" y = Y[:, 1], \n",
" s = 5)\n",
"\n",
"kmeans = KMeans(n_clusters = 15, random_state = 0, n_init = 10).fit(Y)\n",
"\n",
"for i in range(len(kmeans.cluster_centers_)):\n",
" \n",
" pyplot.scatter(x = kmeans.cluster_centers_[i,0],\n",
" y = kmeans.cluster_centers_[i,1],\n",
" s = 1,\n",
" c = 'yellow')\n",
" \n",
" pyplot.annotate(str(i), \n",
" xy = (kmeans.cluster_centers_[i, 0], kmeans.cluster_centers_[i, 1]), \n",
" size = 13)\n",
"\n",
"pyplot.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see are large number of different clusters."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a balance sheet it is not immediately straightforward how the left side is related to the right side of the balance sheet, i.e. which assets are related to which liabilities. If you cluster all the data of the balance sheet then related items are clustered (irrespective of whether they are assets or liabilities)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for i in df.T.loc[kmeans.labels_ == 6].index:\n",
" print(i)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the assets held for index-linked and unit-linked contracts are in the same cluster as the technical provisions for index-linked and unit-linked items (and some other related items are found).\n",
"\n",
"However, the relations found are not always perfect. To improve the clustering we should cluster the data that is related in their changes over time. But because we have just two years available (and so just one yearly difference) we presumably do not have enough data to do that. "
]
},
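{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of what such an analysis could start from (a sketch only: the variable names and the float-column selection below are assumptions, not part of the original analysis), the yearly changes per undertaking could be computed as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: year-over-year changes per balance sheet item and undertaking\n",
"# with two reporting periods this yields just one yearly difference per undertaking\n",
"df_bs = get_sheet(14).select_dtypes(include = 'float64')\n",
"\n",
"# difference between consecutive periods within each undertaking\n",
"changes = df_bs.sort_index().groupby(level = 'name').diff().dropna(how = 'all')\n",
"\n",
"print(\"Number of rows : \" + str(len(changes.index)))\n",
"print(\"Number of columns: \" + str(len(changes.columns)))"
]
}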
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}