Insurance Data Science Conference 2018
Gareth Peters, Heriot-Watt University, Edinburgh
In this talk we build on a sequence of papers recently developed to enhance the modelling of life expectancy based on mortality data. Forecasting life expectancy and mortality are two important aspects of the study of demography that influence pension planning, retirement decisions and government policy.
We demonstrate how to develop regression models incorporating stochastic factors such as graduation temporal effects, period effects, cohort effects, stochastic volatility and long memory to enhance the forecasting and estimation of life tables. In addition, we show the mispricing that occurs in standard annuities, pure endowments and Guaranteed Annuity Options (GAOs).
Tom Reynkens, KU Leuven
Insurance companies use predictive models for a variety of analytic tasks including pricing, marketing campaigns, and fraud and churn detection. In practice, these predictive models often use a selection of continuous, ordinal, nominal and spatial variables to detect different risks.
Such models should not only be competitive, but also interpretable by stakeholders (including the policyholder and the regulator), and easy to implement and maintain in a production environment. Therefore, current actuarial literature focuses on generalised linear models (GLMs) where ad hoc techniques or professional expertise are applied to bin (recombine levels) or remove variables. In the statistical and machine learning literature, penalised regression is often used to obtain an automatic data-driven method for variable selection and binning in predictive modelling. Most penalisation strategies work for data where all predictors are of the same type, such as the Lasso for continuous variables and the Fused Lasso for ordinal variables.
We design an estimation strategy for regularised GLMs which includes variable selection and binning through the use of multi-type Lasso penalties. We consider the joint presence of different types of variables and specific penalties for each type. Using the theory of proximal operators, our estimation procedure is computationally efficient, splitting the overall optimisation problem into easier to solve subproblems per variable and its associated penalty. As such, we are able to simultaneously select, estimate and group, in a statistically sound way, any combination of continuous, ordinal, nominal and spatial variables. We illustrate the approach, which is implemented in an R package, using a case-study on motor insurance pricing.
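To give a flavour of the proximal-operator machinery behind such multi-type penalties, the following Python fragment (an illustrative sketch, not the authors' R implementation) shows the soft-thresholding proximal operator of the Lasso penalty and an approximate proximal operator of the 1D Fused Lasso, whose fused (equal-valued) coefficients correspond to binned levels:

```python
import numpy as np

def prox_lasso(beta, lam):
    """Proximal operator of the Lasso penalty lam*||beta||_1:
    componentwise soft-thresholding."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def prox_fused_1d(beta, lam, n_iter=200, step=0.25):
    """Approximate proximal operator of the 1D Fused Lasso penalty
    lam * sum_j |beta_{j+1} - beta_j|, via projected dual ascent.
    Adjacent coefficients driven to equal values are 'binned'."""
    n = len(beta)
    u = np.zeros(n - 1)  # dual variables, one per adjacent difference
    for _ in range(n_iter):
        # D^T u for the first-difference matrix D (D x = np.diff(x))
        Dtu = np.concatenate(([-u[0]], u[:-1] - u[1:], [u[-1]]))
        x = beta - Dtu                                  # primal update
        u = np.clip(u + step * np.diff(x), -lam, lam)   # dual projection
    Dtu = np.concatenate(([-u[0]], u[:-1] - u[1:], [u[-1]]))
    return beta - Dtu

# Nearby coefficients are fused (binned), distant ones stay apart
binned = prox_fused_1d(np.array([1.0, 1.1, 3.0, 3.2]), lam=0.2)
```

A large penalty fuses all coefficients to their mean, i.e. collapses all levels into one bin; the ordinal, nominal and spatial penalties of the talk generalise this difference structure.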
Christian Rohrbeck, Lancaster University
This talk considers the association between weather events, such as rainfall or snow-melt, and the number of water-related property insurance claims. Weather events which cause severe damage are of general interest; decision makers want to take efficient action against them while insurance companies want to set adequate premiums. The modelling is challenging since the underlying dynamics vary across geographical regions due to differences in topography, construction designs and climate.
This talk presents new methodology which improves upon the existing models which in particular fail to model high numbers of claims. The approach includes a clustering algorithm which aggregates claims over consecutive days based on the observed weather metrics and derives more informative explanatory variables. In combination with a statistical framework based on mixture modelling, we achieve a better understanding of the association between claims and weather events. Both the clustering algorithm and the estimation of the statistical model are performed using R. The benefits of the methodology are illustrated by applying it to insurance and weather data from 1997 to 2006 for three Norwegian cities: Oslo, Bergen and Bærum.
Mario Wüthrich, RiskLab, ETH Zurich
We investigate the predictive power of covariates extracted from telematics car driving data using the velocity-acceleration (v-a) heatmaps of Gao and Wüthrich (2017) for claims frequency modeling. These telematics covariates include the K-means classification, the principal components, and the bottleneck activations from a bottleneck neural network. It turns out that the first principal component and the bottleneck activations are more significant predictors of claims frequencies than driver’s age. For this reason we recommend the use of these telematics covariates for car insurance pricing.
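As a minimal sketch of how such a principal-component covariate can be extracted (hypothetical data standing in for the flattened v-a heatmaps of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: one flattened v-a heatmap per driver, i.e. the
# occupancy proportions of a speed x acceleration grid (4x4 = 16 cells)
X = rng.dirichlet(np.ones(16), size=100)

# First principal component scores via SVD of the centred matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]          # one telematics covariate per driver

# Share of variance carried by each component (PC1 is largest)
explained = s**2 / np.sum(s**2)
```

The resulting `pc1_scores` vector would then enter the claims frequency regression alongside classical covariates such as driver's age.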
Valerie du Preez, Dupro
Co-authors: Steven Perkins, Zhixin Lim
Improvements in computational power have given rise to the use of machine learning techniques in a wide variety of areas, including finance, driverless cars, image detection and speech recognition. In a world of high-volume and varied datasets, machine learning techniques are an essential toolkit for deriving actionable insights from the data.
The exponential increase in data generation, capture and storage, along with improved computing power, is likely to benefit actuaries in two key ways. Firstly, improved data and computational capabilities are likely to mean that traditional actuarial tasks can be tackled with increasingly sophisticated approaches. Secondly, many actuaries will have the skills needed to capitalise on new opportunities and expand the profession into new areas.
This session will cover machine learning at a high level and will include an overview of case studies performed by the Modelling Analytics and Insights from Data Working Party of the Institute and Faculty of Actuaries in the UK, to explore how additional techniques around data analysis could be utilised in the future. The views presented will be the presenters’ own.
Truncated regression models for the analysis of operational losses due to fraud: A high performance computing implementation in R
Alberto Glionna, Generali, Italy
Co-authors: Giovanni Millo and Nicola Torelli
In operational risk modelling, losses arising from fraud are often collected only when they exceed a given monetary threshold. In regression-like problems in econometrics and statistics, such observations are known as truncated. In general, great care is needed when truncated data are used in predictive analysis, or more broadly in risk modelling, where the aim is to infer a model for the data. The dualistic frequency-severity nature of fraud and the inevitable truncation of the information challenge traditional methods and may lead to inaccurate results. In this paper we show that more accurate analyses are possible whenever certain conditions are met. A technique called the Augmented (Generalized) Linear Model (AGLM) has been developed: it addresses the problem of truncated fraud data and allows for theoretically unbiased estimates of operational losses. This technique has been implemented and used in Assicurazioni Generali.
Unfortunately, in this context inference on relevant quantities (such as confidence intervals) can only be obtained via computationally heavy solutions with complex and time-consuming algorithms. In this paper we discuss elements of the above-mentioned models alongside a cross-platform, user-friendly high performance computing implementation in R that considerably reduces the computational time, overcoming one of the main limitations of AGLMs. In particular, the packages ‘parallel’ and ‘shiny’ are exploited to speed up both parametric and non-parametric bootstrap on any multicore architecture, up to distributed networks of workstations, through a user-friendly application with an intuitive graphical interface that requires no previous knowledge of R.
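The core idea of fitting a severity model to truncated losses can be sketched in a few lines. The following Python fragment (an illustrative example under a lognormal assumption, not the AGLM itself) maximises the truncated likelihood, i.e. the density f(x)/(1 - F(threshold)) for observed losses above the threshold:

```python
import numpy as np
from scipy import stats, optimize

tau = 10_000.0   # collection threshold: losses below it are never observed

def neg_loglik(params, x, tau):
    """Negative log-likelihood of lognormal losses truncated at tau:
    log f(x; mu, sigma) - log(1 - F(tau; mu, sigma)) for x > tau."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)            # keeps sigma positive
    logf = stats.lognorm.logpdf(x, s=sigma, scale=np.exp(mu))
    logS = stats.lognorm.logsf(tau, s=sigma, scale=np.exp(mu))
    return -np.sum(logf - logS)

# Simulated ground truth, then discard everything below the threshold
rng = np.random.default_rng(1)
full = rng.lognormal(mean=9.0, sigma=1.0, size=20_000)
obs = full[full > tau]

res = optimize.minimize(neg_loglik, x0=[8.0, 0.0], args=(obs, tau))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

Fitting an untruncated lognormal to the same observed sample would overstate both location and tail, which is the bias the truncated likelihood corrects.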
Lara A. Neira Gonzalez, University of Edinburgh
Co-authors: Martin Kreer, Jose-Maria Guerra, Alfredo D Egidio dos Reis
Over the last decade, machine learning techniques have become more popular in fields such as genetics, finance and health. The difficulties that classical statistical techniques face in Big Data situations, such as p >> n, make machine learning algorithms attractive for real application studies. A precise estimation of breakdowns can be applied not only in predictive maintenance but also in the calculation of insurance premiums for industrial equipment. The aim of our study is to estimate the probability of breakdowns by applying a machine learning technique to machine data, using training and test datasets.
Random Forest, a supervised non-parametric technique, was applied with the AUC-based variable importance measure 1000 times under the null hypothesis and once under the alternative on our training sample, in order to calculate empirical p-values. The empirically significant variables from the training sample were then tested for significance using general linear regression on the independent test sample, corrected for multiple testing. After obtaining the risk factors, the probability of breakdowns was estimated.
All 32 variables of one machine and all 6 variables of another machine were empirically significant. After testing on the independent test dataset, 30 variables on the first machine remained significant after Bonferroni correction (at the 90%, 95% and 99% confidence levels): 29 with a p-value < 0.01 and 1 with a p-value < 0.10. The most significant variable showed an R2 of 18.2%. With a model including all significant variables, we reached an accuracy greater than 80% on an independent test dataset when predicting a breakdown. On the other machine, 5 variables were significant: 3 at the 0.01 level, one at the 0.05 level and another at the 0.10 level. The most significant variable explained more than 85% of the variance (R2 = 85.93%). With a model using only this variable, we predicted the probability of breakdown with an accuracy of 100% on the independent test dataset (AUC = 1).
Both studies found significant factors which help us better understand the failure mechanisms of the two machines. Calculating the empirical p-value from 1000 outputs under the null makes the estimation stable and reliable. Using Random Forest for diagnosis achieves high accuracy in predicting breakdowns on both machines; in both cases our model predicts better than a human would, and it predicts the breakdowns of one machine perfectly. The breakdown probabilities from our models provide a sound basis for calculating premiums, and they can be used to create risk profiles, as the models detect the different significant factors that play an important role in the process of each machine.
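The empirical p-value step described above can be sketched as follows (an illustrative Python fragment with simulated null importances, not the study's code):

```python
import numpy as np

def empirical_p_value(obs_importance, null_importances):
    """Empirical p-value of an observed variable-importance score
    against its null distribution (add-one correction keeps p > 0)."""
    null_importances = np.asarray(null_importances)
    exceed = np.sum(null_importances >= obs_importance)
    return (1 + exceed) / (1 + len(null_importances))

# Hypothetical example: one importance score under the alternative,
# compared against 1000 runs under the null hypothesis
rng = np.random.default_rng(2)
null_runs = rng.normal(0.0, 1.0, size=1000)   # stand-in for null outputs
p = empirical_p_value(3.5, null_runs)
```

A variable is declared empirically significant when this p-value falls below the chosen level, before the confirmatory regression on the independent test sample.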
Aniketh Pittea, University of Kent
Co-authors: Jaideep Oberoi and Pradip Tapadar
Actuaries and financial risk managers use an Economic Scenario Generator (ESG) to identify, manage and mitigate risks at a range of horizons. In particular, pension schemes and other long term businesses require ESGs to simulate projections of assets and liabilities in order to devise adequate risk mitigation mechanisms. This requires ESGs to provide reasonable simulations of the joint distribution of several variables that enter the calculation of assets and liabilities. In this paper, we discuss how a graphical model approach is used to develop an ESG, and also provide a specific application.
A wide range of ESGs are currently in use in industry. These models have varying levels of complexity and are often proprietary. They are periodically recalibrated, and tend to incorporate a forecasting dimension. For instance, they may incorporate a Vector Auto Regression model. Alternatively, many rely on a cascading structure, where the forecast of one or more variables is then used to generate values for other variables, and so on. In each case, these models balance the difficult trade-off between accurately capturing both short and long term dynamics and interdependences. We argue that, for the purpose of risk calculations over very long periods, it may be easier and more transparent to use a simpler approach that captures the underlying correlations between the variables in the model. Graphical models achieve this in a parsimonious manner, making them useful for simulating data in larger dimensions. In graphical models, dependence between variables is represented by “edges” in a graph connecting the variables or “nodes”. This approach allows us to assume conditional independence between variables and to set their partial correlations to zero. The two variables could then be connected via one or more intermediate variables, so that they could still be weakly correlated.
We compare different algorithms to select a graphical model, based on p-values, AIC, BIC, and deviance using R (and provide the relevant packages). We find them to yield reasonable results and relatively stable structures in our example. The graphical approach is fairly easy to implement, is flexible and transparent when incorporating new variables, and thus easier to apply across different datasets (e.g. countries). Similar to other reduced form approaches, it may require some constraints to avoid violation of theoretical rules. It is also easy to use this model to introduce arbitrary economic shocks. We provide an example in which we identify a suitable ESG for a pension fund in the United Kingdom that invests in equities and bonds, and pays defined benefits. While more complex modelling of the short term dynamics of processes is certainly feasible, our focus is on the joint distribution of innovations over the long term. To this end, we simply fit an autoregressive process to each of the series in our model and then estimate the graphical structure of the contemporaneous residuals. We find that simulations from this simple structure provide plausible distributions that are comparable to existing models. We also discuss how these models can be used to introduce nonlinear dependence through regime shifts in a simple way.
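The link between missing edges and zero partial correlations can be illustrated in a few lines. In the hypothetical example below (an illustrative Python sketch, not the paper's model), equity and bond returns are correlated only through inflation, so their partial correlation — read off the precision matrix — is (near) zero while their marginal correlation is not:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
inflation = rng.normal(size=n)
equity = 0.8 * inflation + rng.normal(size=n)   # linked via inflation only
bond = -0.6 * inflation + rng.normal(size=n)
X = np.column_stack([equity, bond, inflation])

# Marginal correlation between equity and bond is clearly non-zero
marg = np.corrcoef(X, rowvar=False)[0, 1]

# Partial correlations from the precision matrix: a (near-)zero entry
# corresponds to a missing edge in the graphical model
P = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.diag(P))
partial = -P / np.outer(d, d)
np.fill_diagonal(partial, 1.0)
```

Model selection then amounts to deciding which entries of `partial` to set to zero, which is what the p-value, AIC, BIC and deviance criteria compared in the talk do.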
Overall, we argue that this approach to developing ESGs is a useful tool for actuaries and financial risk managers concerned about long term portfolios.
Marc Rierola, Qatar Re
Qatar Re, as a global multi-line reinsurer, has to deal with a multitude of loss files (result files from probabilistic catastrophe models) from different sources. Loss files can be provided by clients/brokers, or created by our actuaries, the Catastrophe Pricing Analytics team or Risk Management. Because these loss files are presented in different formats and stored in different places using different systems, it is difficult to make good use of all the available information across departments. We are currently building a Loss File Database within Qatar Re to serve as a central storage place for all forms of loss files: a place where each department can upload its data in a predefined format and any other department can easily access it. RShiny plays a crucial role in this process, as we are building an app that allows users to upload, examine, scale, manipulate and download loss files. This RShiny application shall become the single point of contact for anybody in the company seeking loss file information.
Pauli Rämö, Mirai Solutions
Insurance companies require detailed (and ideally actionable) insights into the sources and contributors of their risks arising from losses modeled given premium exposure. It is crucial to have good monitoring and valid predictions of the evolving risks, for example when determining an adequate reinsurance strategy. For rare events incurring large losses, particularly large scale simulations are required in order to obtain realistic and stable risk estimates, such as Expected Shortfall (ES / CVaR) or Value at Risk (VaR).
Google's TensorFlow is a modern IT framework designed for big data analytics, particularly in the field of (deep) machine learning. TensorFlow has attracted a lot of attention over the past couple of years and provides several advantages: definition of computational graphs, lazy execution, improved performance, established framework with active development community support, visualization and profiling tools, and seamless CPU/GPU/TPU support/switching out-of-the-box.
Somewhat out of focus, with all the buzz around deep learning and artificial intelligence, TensorFlow can also be applied to any other computational task suited to "tensor mathematics". In this talk we demonstrate how this can work in the case of reinsurance modeling. We simulate large-scale gross loss data through convolution of frequency and severity distributions and then apply a realistic set of reinsurance contracts (excess of loss, surplus share) to calculate net losses and various risk measures. We show a hands-on implementation using Python and TensorFlow and discuss the advantages and challenges this creates.
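In plain numpy (an illustrative stand-in for the TensorFlow implementation shown in the talk, with hypothetical parameters), the frequency-severity convolution and an aggregate excess-of-loss layer might look like:

```python
import numpy as np

rng = np.random.default_rng(4)
n_sims = 50_000

# Gross annual losses: Poisson frequency convolved with lognormal severity
freq = rng.poisson(lam=2.0, size=n_sims)
gross = np.array([rng.lognormal(10.0, 1.5, size=k).sum() for k in freq])

# A simple aggregate excess-of-loss layer: 1.0M xs 0.5M
retention, limit = 0.5e6, 1.0e6
ceded = np.clip(gross - retention, 0.0, limit)
net = gross - ceded

# Tail risk measures on the net loss distribution
var_99 = np.quantile(net, 0.99)
es_99 = net[net >= var_99].mean()   # expected shortfall beyond the VaR
```

Expressed as tensor operations, the same clip/sum/quantile pipeline maps directly onto a TensorFlow computational graph, which is where the performance and GPU benefits discussed in the talk come in.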
We have developed an R package 'KSgeneral' (available from https://CRAN.R-project.org/package=KSgeneral) that computes the (complementary) cumulative distribution function (cdf) of the one-sample two-sided (or one-sided, as a special case) Kolmogorov-Smirnov (KS) statistic, for any fixed critical level, and an arbitrary, possibly large sample size for a pre-specified purely discrete, mixed or continuous cdf under the null hypothesis. If a data sample is supplied, 'KSgeneral' also computes the p-value corresponding to the value of the KS test statistic computed based on the user provided data sample. The package 'KSgeneral' implements a novel, accurate and efficient method named Exact-KS-FFT, developed by Dimitrova, Kaishev, Tan (2017), available together with the underlying C++ code from http://openaccess.city.ac.uk/18541. The p-value is expressed as a double-boundary non-crossing probability for a homogeneous Poisson process, which is then efficiently computed using Fast Fourier Transform (FFT).
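For reference, the statistic whose exact distribution 'KSgeneral' computes is itself simple to evaluate; a Python sketch for a continuous null cdf (the R package additionally handles purely discrete and mixed cdfs, where this naive formula no longer applies):

```python
import numpy as np

def ks_statistic(sample, cdf):
    """Two-sided one-sample KS statistic D_n = sup_x |F_n(x) - F(x)|
    for a continuous null cdf, evaluated at the jumps of the
    empirical distribution function F_n."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    F = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - F)   # F_n above F
    d_minus = np.max(F - np.arange(0, n) / n)      # F_n below F
    return max(d_plus, d_minus)

# Example: test a small sample against the Uniform(0, 1) cdf
sample = [0.1, 0.4, 0.5, 0.9]
D = ks_statistic(sample, lambda x: np.clip(x, 0.0, 1.0))
```

The Exact-KS-FFT method then converts the tail probability of this statistic into a double-boundary non-crossing probability for a homogeneous Poisson process, computed via FFT.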
Michael Ludkovski, UC Santa Barbara Dept of Statistics; Visiting Professor, London School of Economics
Co-authors: Jimmy Risk (Cal Poly Pomona Dept of Mathematical Sciences)
We consider calculation of VaR/TVaR capital requirements when the underlying economic scenarios are determined by simulatable risk factors. This problem involves computationally expensive nested simulation, since evaluating expected portfolio losses of an outer scenario (aka computing a conditional expectation) requires inner-level Monte Carlo. We introduce several inter-related machine learning techniques to speed up this computation, in particular by properly accounting for the simulation noise. Our main workhorse is an advanced Gaussian Process (GP) regression approach co-authored by the second author as part of the hetGP R package. hetGP constructs a heteroskedastic spatial model to efficiently learn the relationship between the stochastic factors defining scenarios and corresponding portfolio value. Leveraging this emulator, we develop sequential algorithms that adaptively allocate inner simulation budgets to target the quantile region, akin to Bayesian contour-finding. The GP framework also yields better uncertainty quantification for the resulting VaR/TVaR estimators that reduces bias and variance compared to existing methods.
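For orientation, the naive nested-simulation baseline that the talk's GP emulator accelerates looks like this (an illustrative Python sketch with a toy conditional loss and a uniform inner budget — precisely the allocation the adaptive algorithms improve on):

```python
import numpy as np

rng = np.random.default_rng(5)
n_outer, n_inner = 2_000, 500

# Outer level: simulate the risk factor defining each economic scenario
z = rng.normal(size=n_outer)

# Inner level: Monte Carlo estimate of the conditional expected loss
# per scenario (a stand-in for expensive portfolio revaluation; the
# true conditional expectation here is simply z)
inner = z[:, None] + rng.normal(scale=2.0, size=(n_outer, n_inner))
loss = inner.mean(axis=1)        # noisy estimate of E[L | scenario]

# Capital requirement: 99% VaR and TVaR of the outer loss distribution
var_99 = np.quantile(loss, 0.99)
tvar_99 = loss[loss >= var_99].mean()
```

The uniform budget wastes inner simulations on scenarios far from the quantile region; the sequential GP-based algorithms instead concentrate the budget near the VaR contour while modelling the inner simulation noise explicitly.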
Silvana M. Pesenti, Cass Business School, City, University of London
Sensitivity analysis is an important component of model building, interpretation and validation. A model comprises a vector of random input factors, an aggregation function mapping input factors to a random output, and a (baseline) probability measure. A risk measure, such as Value-at-Risk and Expected Shortfall, maps the distribution of the output to the real line. As is common in risk management, the value of the risk measure applied to the output is a decision variable. Therefore, it is of interest to associate a critical increase in the risk measure to specific input factors. We propose a global and model-independent framework, termed 'reverse sensitivity testing', comprising three steps:
- an output stress is specified, corresponding to an increase in the risk measure(s);
- a (stressed) probability measure is derived, minimising the Kullback-Leibler divergence with respect to the baseline probability, under constraints generated by the output stress;
- changes in the distributions of input factors are evaluated.
We argue that a substantial change in the distribution of an input factor corresponds to high sensitivity to that input and introduce a novel sensitivity measure to formalise this insight. Implementation of reverse sensitivity testing in a Monte-Carlo setting can be performed on a single set of input/output scenarios, simulated under the baseline model. Thus the approach circumvents the need for additional computationally expensive evaluations of the aggregation function. We illustrate the proposed approach through a numerical example of a simple insurance portfolio and a model of a London Insurance Market portfolio used in industry.
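The KL-minimising step can be illustrated with the simplest case of a stressed mean (the framework in the talk handles stresses on VaR and Expected Shortfall; the exponential-tilting sketch below, with hypothetical inputs, is only meant to convey the mechanics of reweighting one set of baseline scenarios):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(6)
n = 100_000
x1 = rng.gamma(2.0, 1.0, n)           # input factors under the baseline
x2 = rng.normal(0.0, 1.0, n)
y = x1 + 0.2 * x2                     # aggregation function -> output

# Output stress: raise E[Y] by 10%. Among all measures meeting this
# constraint, exponential tilting w ∝ exp(theta * y) minimises the
# KL divergence from the baseline.
target = 1.10 * y.mean()

def tilted_mean_gap(theta):
    w = np.exp(theta * (y - y.max()))  # shift for numerical stability
    w /= w.sum()
    return np.sum(w * y) - target

theta = brentq(tilted_mean_gap, 0.0, 5.0)
w = np.exp(theta * (y - y.max())); w /= w.sum()

# Sensitivity: how much each input's stressed mean moves
shift1 = np.sum(w * x1) - x1.mean()
shift2 = np.sum(w * x2) - x2.mean()
```

Here `shift1` far exceeds `shift2`, flagging the first input as the one the stressed output is most sensitive to — and, as the abstract notes, everything is computed on the single baseline scenario set, with no re-evaluation of the aggregation function.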
Benjamin C. Dean, Iconoclast Tech LLC
Modelling of cyber risks is an emerging domain in the field of catastrophe modelling. Insurers and re-insurers increasingly wish to model cyber risks given the fast growth of stand-alone cyber insurance product lines and the potential ‘silent’ cyber risk exposure of existing property and casualty policies.
Analysts face a number of challenges in modelling these risks. For instance, the cyber ecosystem contains complex dependencies, which creates systemic risks linked to contagion and cascading effects in the wake of cyber incidents. This results in highly-skewed distributions and the potential for ‘black swan’ phenomena.
Extreme value theory (EVT) has traditionally been used in catastrophe modelling of phenomena such as hurricanes, floods, etc. However, cyber risks are different to traditional catastrophic phenomena in a number of fundamental ways, which necessitates adaptation of analytical techniques such as EVT when modelling cyber risks.
This presentation will explain how cyber risks differ from traditional catastrophe risks and the implications this has when modelling these risks. It will use examples related to severity and cost of data breaches to illustrate these points with the help of R packages such as evd, extRemes and ismev.
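A peaks-over-threshold fit of the kind the talk performs with evd/extRemes can be sketched as follows (an illustrative Python/scipy stand-in on simulated heavy-tailed breach costs, not the presentation's data or code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical breach-cost data: heavy-tailed stand-in sample
costs = rng.pareto(2.5, size=5_000) * 1e5

# Peaks-over-threshold: fit a Generalised Pareto to the exceedances
u = np.quantile(costs, 0.95)
exceedances = costs[costs > u] - u
shape, loc, scale = stats.genpareto.fit(exceedances, floc=0.0)

# Tail quantile (e.g. a 1-in-1000 loss) implied by the fitted tail:
# P(X > x) = P(X > u) * P(GPD > x - u)
p_exceed = np.mean(costs > u)
q999 = u + stats.genpareto.ppf(1 - 0.001 / p_exceed, shape, 0.0, scale)
```

For cyber risks, the talk's point is that the i.i.d. and stationarity assumptions behind this classical fit are strained by contagion and systemic dependence, which is why the EVT machinery needs adaptation.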
SFCR automated analysis using scraping, text mining and machine learning methods for benchmarking and capital modelling
Aurelien Couloumy, Reacfin
The aim of this paper is to propose a new process developed in R to easily manage both unstructured textual data and quantitative data for insurance purposes. The approach is threefold:
- Using scraping methods to gather content automatically;
- Using text mining and natural language processing to analyze documents;
- Exploiting data using both supervised and unsupervised machine learning methods.
To show the benefits of this process, we propose to analyze the Solvency & Financial Condition Reports (SFCR) that insurers have published since 18th May 2017 under the Solvency II regulation. From this work we will derive and analyze a range of information, such as:
- Descriptive statistics about insurers;
- Unconventional insurers clustering;
- Machine learning models to predict key risk drivers.
Based on this application, we will also make some recommendations about data visualization methods. We will finally suggest other applicable uses of this process (such as insurance general conditions or reinsurance treaty management).
Oliver Laslett, Cytora
With the growing speed and scale of data generation, insurers are increasingly deploying advanced data analytics and machine learning techniques to assess and price risks. The application of these techniques promises to transform external data into insight for commercial underwriting. But can machine learning deliver fair underwriting results? The answer crucially depends on whether or not fairness is included in the design and analysis of the machine learning algorithms.
In this talk, we present methods for evaluating the fairness of decisions made by automated algorithms using a number of frequentist and Bayesian approaches. We demonstrate how fairness can be included in the design of machine learning algorithms, by controlling for protected features. Finally, we highlight how applying transparent modelling approaches to external data can deliver fair underwriting in commercial insurance. By implementing these methods, insurers can price risk more accurately, while delivering fairer prices to customers.
Javier Rodriguez Zaurin, Simply Business
During the last 4 years Simply Business has undergone a tech transformation, gradually decommissioning legacy systems and introducing a series of new technologies across the company. This was part of a long-term strategy designed to change the way data are collected and used in the company. As a result, to cite some examples, the Customer Insights team runs analytics on Jupyter notebooks, the Pricing team runs and validates its models using Databricks and pySpark, and the entire data platform relies on cloud-based (AWS) tools streaming constantly into storage systems such as S3 or Redshift. The company is now ready to efficiently utilise machine learning. The challenge ahead, however, is as complex as, if not harder than, the technology transformation described above: the challenge now is “human”. During this presentation I will illustrate, with some precise examples, the difficulties of productionising machine learning algorithms even when all the technology requirements are in place. I will also explain why there is no point in investing in data science/machine learning if the aforementioned tech transformation has not happened beforehand.
Eric Novik, Generable
In this talk, I will introduce the key components of the Bayesian workflow including gathering prior knowledge, formulating a generative model, simulating data, fitting the model, evaluating the quality of the model, and making decisions conditional on model inferences.
I will talk about how Stan fits into this workflow and discuss its strengths and weaknesses. I will use a few examples from product pricing and clinical research to highlight the applicability of the tools and methods to difficult applied problems.