{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Section 3.1: Linear regression, part 1\n",
"\n",
"Thus far in this track we have largely investigated just machine-learning (ML) classification techniques. With linear regression, we will look at one of the most important predictive algorithms in ML.\n",
"\n",
"The essence of linear regression is arguably the simplest form of ML: drawing a line through points. You might have done a simple form of this in your high school physics class: plot the results of a series of experiments on graph paper and then draw a line through as many points as possible (and include as many points below the line as above the line where the points don't fall on the line). That is a very form of linear regression. We will build on that conceptual foundation to address more complex situations, such as having points in more than two dimensions or even points whose relationship seems non-linear.\n",
"\n",
"Formally, linear regression is used to predict a quantitative *response* (the values on a Y axis) that is dependent on one or more *predictors* (values on one or more axes that are orthogonal to $Y$, commonly just thought of collectively as $X$). The working assumption is that the relationship between the predictors and the response is more or less linear. The goal of linear regression is to fit a straight line in the best possible way to minimize the deviation between our observed responses in the dataset and the responses predicted by our line, the linear approximation. \n",
"\n",
"How do we tell that we have the best fit possible for our line? The most common means of assessing the error between the fit of our line -- our model -- and the data is called the [***least squares method***](https://en.wikipedia.org/wiki/Least_squares). This method consists of minimizing the number you get when you square the differences between your predicted values (the line) and the actual values (the data) and add up all of those squared differences for your entire dataset.\n",
"\n",
"> **Learning goal:** By the end of this section, you should be comfortable fitting linear-regression models, and you should have some familiarity with interpreting their output.\n",
"\n",
"## Load the data\n",
"\n",
"In this section and the next (3.2), we will use national statistics gathered by the United Nations (UN) from 2009-2011 (accessed from the United Nations Statistics Division's [Social indicators page](https://unstats.un.org/unsd/demographic/products/socind/) on April 23, 2012). The data includes national health and welfare statistics for 199 countries and territories; these locations are mostly UN members, but the list also includes other areas that are not independent countries (such as Hong Kong).\n",
"\n",
"The dataset includes 199 observations with the following features:\n",
" - **region:** Region of the world\n",
" - **group:** A factor (or categorical variable) with the following levels:\n",
" - **oecd:** Countries that were members of the [Organisation for Economic Co-operation and Development](www.oecd.org) (OECD) as of May 25, 2012\n",
" - **africa:** Countries on the continent of Africa (note: no OECD countries are located in Africa)\n",
" - **other:** For all other countries\n",
" - **fertility:** The total number of children born or likely to be born to a woman in her life time if she were subject to the prevailing rate of age-specific fertility in the population\n",
" - **ppgdp:** Per-capita gross domestic product (GDP) in 2012 US dollars (measure of a country's economic output that accounts for its number of people.)\n",
" - **lifeExpF:** Female life expectancy in years\n",
" - **pctUrban:** Percent of the population urbanized (living in urban areas)\n",
"\n",
"We will need to load several modules for this section to handle the ML and visualizations."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll then load the data."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('UN11.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['pctUrban'].min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simple linear regression\n",
"\n",
"Let's plot out the data to see the relationship between per-capita GDP and female life expectancy."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(df['ppgdp'], df['lifeExpF'], alpha=0.3);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Technical note:** The `alpha` parameter we supplied in the matplotlib `scatter` function; it makes the points semi-transparent so that we can where data points bunch up. Also note the semicolon at the end of the code snippet above; it silences the matplotlib memory-path output for cleaner inline graphing (without additional output above the graph, such as `<matplotlib.collections.PathCollection at 0x7f2c54737f28>`).\n",
"\n",
"Let's now plot a line and see what we get."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression(fit_intercept=True) \n",
"\n",
"X = df['ppgdp'][:, np.newaxis] \n",
"y = df['lifeExpF'] \n",
"\n",
"model.fit(X, y) \n",
"\n",
"x_plot = np.linspace(0, 100000) \n",
"y_plot = model.predict(x_plot[:, np.newaxis]) \n",
"\n",
"plt.scatter(df['ppgdp'], df['lifeExpF'], alpha=0.3) \n",
"plt.plot(x_plot, y_plot, c='orange');"
]
},
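{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Aside:** The technical note below explains why we reshaped the predictor with `[:, np.newaxis]`. As a quick, optional illustration (not part of the original flow), the cell that follows prints the shapes before and after reshaping; it goes through `.values` to get the underlying NumPy array, which behaves the same way across pandas versions."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: compare the 1-D array with its 2-D, single-column form\n",
"ppgdp_1d = df['ppgdp'].values          # shape (199,): a flat array\n",
"ppgdp_2d = ppgdp_1d[:, np.newaxis]     # shape (199, 1): a single-column matrix\n",
"print(ppgdp_1d.shape, ppgdp_2d.shape)"
]
},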
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Technical note:** Notice in the code cell above that we did not fit the model using `model.fit(df['ppgdp'], df['lifeExpF'])`. Instead, we had to use `df['ppgdp'][:, np.newaxis]` for our predictors rather than just `df['ppgdp']`. The addition of `[:, np.newaxis]` changes `df['ppgdp']` from a pandas `Series` to an array in matrix format. (If you're unsure what that looks like, create a new code cell below this cell using **Insert > Insert Cell Below** and then run `df['ppgdp']` and then `df['ppgdp'][:, np.newaxis]` in order to see the difference.\n",
"\n",
"Just how poor is this initial model? Let's calculate the $R^2$ score for it to see. The $R^2$ score (also called the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)) represents the proportion of the variance in our response that is predictable from the predictors -- so 0 is the worst (a model explains none of the variance) and 1 is the best (a model explains all of it)."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import r2_score\n",
"\n",
"model = LinearRegression(fit_intercept=True)\n",
"\n",
"model.fit(df['ppgdp'][:, np.newaxis], df['lifeExpF'])\n",
"\n",
"predictions = model.predict(df['ppgdp'][:, np.newaxis])\n",
"r2_score(df['lifeExpF'], predictions)"
]
},
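{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Aside:** To connect the $R^2$ score back to the least-squares idea, the optional cell below recomputes the same number by hand as $1 - RSS/TSS$, where $RSS$ is the residual sum of squares (the quantity least squares minimizes) and $TSS$ is the total sum of squares. It assumes `predictions` from the previous cell is still in scope."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recompute R^2 by hand: R^2 = 1 - RSS/TSS\n",
"residuals = df['lifeExpF'] - predictions\n",
"rss = np.sum(residuals ** 2)                                   # residual sum of squares\n",
"tss = np.sum((df['lifeExpF'] - df['lifeExpF'].mean()) ** 2)    # total sum of squares\n",
"print(1 - rss / tss)   # should match the r2_score output above"
]
},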
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This first model accounts for only 30 percent of the variability in `lifeExpF` and is really not a very good representation of the relationship between economic activity and female life expectancy.\n",
"\n",
"These results are not good, which stems from the fact that there is no linear relationship between per-capita GDP and female life expectancy. Instead, the relationship has an elbow-like curve to it. When countries are very poor, the data suggests that even modest increases to GDP per capita can dramatically increase female life expectancy, but only up to a point; once countries hit about USD 10,000 per head, additional gains correlated to increases in wealth are much smaller. This suggests a logarithmic relationship between these factors: female life expectancy being not related to GPD, but to its logarithm.\n",
"\n",
"Let's create a new column that contains the logarithm of per-capita GDP by country. Note that, because we are dealing with powers of 10 in the GDP column, we will use the base-10 logarithm rather than the natural logarithm in order to make interpretation easier."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['log_ppgdp'] = np.log10(df['ppgdp'])\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now plot our new `log_ppgdp` column against `lifeExpF` to see if there is a more linear relationship."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression(fit_intercept=True)\n",
"\n",
"X = df['log_ppgdp'][:, np.newaxis]\n",
"y = df['lifeExpF']\n",
"\n",
"model.fit(X, y)\n",
"\n",
"x_min = df['log_ppgdp'].min()\n",
"x_max = df['log_ppgdp'].max()\n",
"x_plot = np.linspace(x_min, x_max)\n",
"y_plot = model.predict(x_plot[:, np.newaxis])\n",
"\n",
"plt.scatter(df['log_ppgdp'], df['lifeExpF'], alpha=0.3)\n",
"plt.plot(x_plot, y_plot, c='orange');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is much better, but it is still far from perfect. The shape of the data seems to have a curve to it and we will examine how to deal with that shortly. However, let's first interpret the model have right here to see what it tells us. How much better is it than the first model? Let's look at the $R^2$ score."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.fit(df['log_ppgdp'][:, np.newaxis], df['lifeExpF'])\n",
"\n",
"predictions = model.predict(df['log_ppgdp'][:, np.newaxis])\n",
"r2_score(df['lifeExpF'], predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `log_ppgdp` rather than `ppgdp` in the model roughly doubles the amount of variance in `lifeExpF` that we can account for with this model (to 60 percent). But what does our model actually mean?"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Model slope: \", model.coef_[0])\n",
"print(\"Model intercept:\", model.intercept_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember that in high school algebra lines were generally defined by an equation of the form\n",
"\n",
"$$\n",
"y = ax + b\n",
"$$\n",
"\n",
"where $a$ is the *slope* and $b$ is the *intercept*. That samer terminology applies in linear regression. The slope refers to our model's predicted change in units of female life expectancy (years) for each unit of the base-10 logarithm of per-capita GDP. In other words, our model predicts that, on average, women's life expectancies increase by 11.6 years every time per-capita GDP increases 10 fold.\n",
"\n",
"The intercept is a little more abstract because it is not directly tied to any data point. It shows the value of the $y$-axis at the point where our line crossed that axis (where $x=0$). If we were still modeling `ppgdp` versus `lifeExpF`, we might interpret the intercept as representing women's baseline life expectancy in a hypothetical country with a per-capita GDP of USD 0: 29.8 years. However, we are modeling `log_ppgdp` versus `lifeExpF`, and the logarithm of 0 is undefined. Therefore, it can be easiest to accept the intercept in our model as a mathematical abstraction necessary to making other parts of our model work. Our model can be stated as:\n",
"\n",
"$$\n",
"{\\rm lifeExpF} = 11.6 \\times {\\rm log\\_ppgdp} + 29.8\n",
"$$"
],
"execution_count": null,
"outputs": []
},
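{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Aside:** As an optional sanity check on this interpretation (assuming the fitted `model` from above is still in scope), the cell below predicts female life expectancy at per-capita GDPs of USD 1,000 and USD 10,000, a 10-fold increase; the difference between the two predictions should equal the model's slope."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# log10(1000) = 3 and log10(10000) = 4, so these inputs differ by one unit of log_ppgdp\n",
"low, high = model.predict(np.array([[3.0], [4.0]]))\n",
"print('Predicted lifeExpF at USD 1,000: ', low)\n",
"print('Predicted lifeExpF at USD 10,000:', high)\n",
"print('Difference (should equal the slope):', high - low)"
]
},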
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Polynomial regression\n",
"\n",
"We can generalize the line equation above in the form favored by statisticians:\n",
"\n",
"$$\n",
"y = β_0 + β_1x + \\epsilon\n",
"$$\n",
"\n",
"where $\\epsilon$ is an unobserved random error that we generally fold into $β_0$. Nothing says that we can have only one $x$ term, however. We can define a linear model for our data of the form\n",
"\n",
"$$\n",
"y = β_0 + β_1x + β_2x^2 + \\epsilon\n",
"$$\n",
"\n",
"This is still a linear relationship because none of our $\\beta$s ever multiply or divide each other. In fact, we can generalize linear models to the form\n",
"\n",
"$$\n",
"y = β_0 + β_1x + β_2x^2 + β_3x^3 + \\cdots + β_nx^n + \\epsilon\n",
"$$\n",
"\n",
"The linearity of our models depend on the linearity of $β_n$, not $x_n$. We will use this fact to use linear regression to model data that does not follow a straight line. Let's apply this to our model of `log_ppgdp` and `lifeExpF`."
],
"execution_count": null,
"outputs": []
},
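{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Aside:** Before fitting, it can help to see what the polynomial feature expansion produces. The optional cell below applies `PolynomialFeatures` to a tiny made-up array (the values are purely illustrative): each input $x$ becomes the row $[1, x, x^2]$."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"# Purely illustrative: expand three made-up values into [1, x, x^2] rows\n",
"tiny = np.array([[2.0], [3.0], [4.0]])\n",
"PolynomialFeatures(degree=2).fit_transform(tiny)"
]
},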
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"poly = PolynomialFeatures(degree=2)\n",
"\n",
"X = df['log_ppgdp'][:, np.newaxis]\n",
"y = df['lifeExpF']\n",
"x_min = df['log_ppgdp'].min()\n",
"x_max = df['log_ppgdp'].max()\n",
"x_plot = np.linspace(x_min, x_max)\n",
"x_fit = poly.fit_transform(x_plot[:, np.newaxis])\n",
"X_ = poly.fit_transform(X)\n",
"\n",
"poly_model = LinearRegression(fit_intercept=True)\n",
"poly_model.fit(X_, y)\n",
"\n",
"x_fit[:, np.newaxis]\n",
"y_fit = poly_model.predict(x_fit)\n",
"\n",
"plt.scatter(df['log_ppgdp'], df['lifeExpF'], alpha=0.3)\n",
"plt.plot(x_fit[:,1], y_fit, c='orange');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Adding the polynomial term provides us with a much more intuitive fit of the data! The `degree=2` parameter that we supply to the `PolynomialFeatures` function dictates that our model takes the form of\n",
"\n",
"$$\n",
"y = β_0 + β_1x + β_2x^2\n",
"$$\n",
"\n",
"Let’s see what the coefficients for our model are."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Model slope: \", poly_model.coef_)\n",
"print(\"Model intercept:\", poly_model.intercept_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"We can state our polynomial model as\n",
"\n",
"$$\n",
"{\\rm lifeExpF} = -6.5 + 32.1 \\times {\\rm log\\_ppgdp} - 2.8 \\times {\\rm log\\_ppgdp}^2\n",
"$$\n",
"\n",
"Using the polynomial model improves predictive power, but it comes at the cost of interpretability. What is the intuitive relationship between `lifeExpF` and `log_ppgdp` now?\n",
"\n",
"> **Technical note:** Fitting the polynomial-regression model above has a lot of steps in it, and performing these transformations (transforming the features for polynomial regression and fitting the regression model) manually can quickly become tedious and error prone. To streamline this type of processing, scikit-learn provides the `Pipeline` object, which you can use to encapsulate several transformations into one step. Let's run this model again using scikit-learn `make_pipeline()`."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import make_pipeline \n",
"\n",
"poly_model = make_pipeline(PolynomialFeatures(2),\n",
" LinearRegression())\n",
"\n",
"X = df['log_ppgdp'][:, np.newaxis]\n",
"y = df['lifeExpF']\n",
"poly_model.fit(X, y)\n",
"x_min = df['log_ppgdp'].min()\n",
"x_max = df['log_ppgdp'].max()\n",
"x_plot = np.linspace(x_min, x_max)\n",
"y_plot = poly_model.predict(x_plot[:, np.newaxis])\n",
"\n",
"plt.scatter(df['log_ppgdp'], df['lifeExpF'], alpha=0.3)\n",
"plt.plot(x_plot, y_plot, c='orange');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That was much simpler to code! But how much did going through the work doing the polynomial regression help our model?"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"poly_model = make_pipeline(PolynomialFeatures(2),\n",
" LinearRegression())\n",
"poly_model.fit(df['log_ppgdp'][:, np.newaxis], df['lifeExpF'])\n",
"\n",
"predictions = poly_model.predict(df['log_ppgdp'][:, np.newaxis])\n",
"r2_score(df['lifeExpF'], predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our improved, polynomial model now accounts for 61.4 percent of the variance in `lifeExpF`. Clearly an improvement, but a modest one.\n",
"\n",
"> **Exercise**\n",
">\n",
"> Go to the code cell in which we fitted the polynomial model using `make_pipeline()` and try different values (>2) in `PolynomialFeatures` to see what using higher-degree polynomials does for our model.\n",
"\n",
"> **Exercise solution**\n",
"Here is a comparison of the outputs for models using three-degree, four-degree, and five-degree polynomials."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"colors = ['teal', 'yellowgreen', 'gold']\n",
"\n",
"x_min = df['log_ppgdp'].min()\n",
"x_max = df['log_ppgdp'].max()\n",
"x_plot = np.linspace(x_min, x_max)\n",
"\n",
"plt.scatter(df['log_ppgdp'], df['lifeExpF'], alpha=0.3, c='gray')\n",
"\n",
"for count, degree in enumerate([3, 4, 5]):\n",
" model = make_pipeline(PolynomialFeatures(degree),\n",
" LinearRegression())\n",
" X = df['log_ppgdp'][:, np.newaxis]\n",
" y = df['lifeExpF']\n",
" model.fit(X, y)\n",
" y_plot = model.predict(x_plot[:, np.newaxis])\n",
" plt.plot(x_plot, y_plot, color=colors[count], \n",
" linewidth=2, label=\"Degree %d\" % degree)\n",
"\n",
"plt.legend(loc='lower right')\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s see what the $R^2$ scores for the different-degree polynomial models are."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for count, degree in enumerate([3, 4, 5]):\n",
" model = make_pipeline(PolynomialFeatures(degree),\n",
" LinearRegression())\n",
" X = df['log_ppgdp'][:, np.newaxis]\n",
" y = df['lifeExpF']\n",
" model.fit(X, y)\n",
" predictions = model.predict(X)\n",
" print(\"Degree %d\" % degree, \"r-squared score:\", \n",
" r2_score(df['lifeExpF'], predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each additional polynomial degree improves the fit of our model (as demonstrated by the incremental improvements to the r-squared scores). However, adding more degrees to the polynomial regressions opens us to the risk of [overfitting](https://en.wikipedia.org/wiki/Overfitting), a process by which our models come to fit the training data too closely and are thus less useful in predicting more generalized data.\n",
"\n",
"Higher-degree polynomials also bring back the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality). Simple linear models need only $N + 1$ sample points to fit, where $N$ is the number of dimensions (2 points in 1 dimension, 3 in 2 dimensions, 4 in three dimensions, and so on). However, each additional polynomial degree increases the number of sample points required for a given dimensionality much faster. Particularly if certain data points are difficult or expensive to come by, you might run out of data in order to fit a high-degree polynomial model."
],
"execution_count": null,
"outputs": []
},
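{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Aside:** One common way to check for overfitting is to hold out part of the data and compare training and test performance. The cell below is a rough sketch, not part of the original analysis: the 80/20 split, the `random_state` value, and the list of degrees are all arbitrary choices. If the test $R^2$ falls behind the training $R^2$ as the degree rises, the extra degrees are fitting noise rather than signal."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X = df['log_ppgdp'].values[:, np.newaxis]\n",
"y = df['lifeExpF'].values\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n",
"\n",
"for degree in [1, 2, 5, 10]:\n",
"    held_out_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())\n",
"    held_out_model.fit(X_train, y_train)\n",
"    print('Degree %d' % degree,\n",
"          '| train r-squared:', r2_score(y_train, held_out_model.predict(X_train)),\n",
"          '| test r-squared:', r2_score(y_test, held_out_model.predict(X_test)))"
]
},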
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple regression\n",
"\n",
"Ultimately, no matter how complex the model we construct between `log_ppgdp` and `lifeExpF`, we will only be able to explain so much of variability because factors other than per-capita GDP affect female life expectancy. Using more than one predictor in our regression model helps us capture more of this richness of detail.\n",
"\n",
"Let's start by plotting the relationship between log per-capita GDP, urbanization, and female life expectancy in three dimensions."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"from mpl_toolkits.mplot3d import Axes3D\n",
"%matplotlib inline\n",
"\n",
"fig = plt.figure(figsize=(6,6))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"ax.scatter(df['log_ppgdp'], df['pctUrban'], df['lifeExpF'])\n",
"\n",
"ax.set_xlabel('GDP per capita (log)')\n",
"ax.set_ylabel('Percent urbanized')\n",
"ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression(fit_intercept=True)\n",
"\n",
"X = df[['log_ppgdp', 'pctUrban']]\n",
"y = df['lifeExpF']\n",
"\n",
"model.fit(X, y)\n",
"\n",
"x1_plot = np.linspace(df['log_ppgdp'].min(), df['log_ppgdp'].max())\n",
"x2_plot = np.linspace(df['pctUrban'].min(), df['pctUrban'].max())\n",
"X1_plot, X2_plot = np.meshgrid(x1_plot, x2_plot)\n",
"y_plot = model.predict(np.c_[X1_plot.ravel(), X2_plot.ravel()])\n",
"\n",
"fig = plt.figure(figsize=(6,6))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"ax.scatter(df['log_ppgdp'], df['pctUrban'], df['lifeExpF'])\n",
"ax.plot_surface(X1_plot, X2_plot, y_plot.reshape(X1_plot.shape), cmap='viridis');\n",
"\n",
"ax.set_xlabel('GDP per capita (log)')\n",
"ax.set_ylabel('Percent urbanized')\n",
"ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How accurate is our multiple-regression model?"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression(fit_intercept=True)\n",
"\n",
"X = df[['log_ppgdp', 'pctUrban']]\n",
"y = df['lifeExpF']\n",
"\n",
"model.fit(X, y)\n",
"\n",
"predictions = model.predict(X)\n",
"r2_score(df['lifeExpF'], predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This model explains 59.8 percent of the variance in `lifeExpF`: better than our initial simple linear model ($R^2=$ 0.596), but not spectacularly so.\n",
"\n",
"What does this new model mean?"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Model slopes: \", model.coef_)\n",
"print(\"Model intercept:\", model.intercept_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our model now has two predictors in it, so it takes the generalized form:\n",
"\n",
"$$\n",
"y = β_0 + β_1x_1 + β_2x_2\n",
"$$\n",
"\n",
"Specifically, our model is:\n",
"\n",
"$$\n",
"{\\rm lifeExpF} = 30.7 + 11 \\times {\\rm log\\_ppgdp} + 0.023 \\times {\\rm pctUrban}\n",
"$$\n",
"\n",
"Multiple regression is a little trickier to interpret than simple regression, but not enormously so. Our model says that if we were to hold all other factors equal, then increasing the per-capita GDP of a country 10 fold will (on average) add 11 years to women's life expectancy. It also says that if we keep everything else the same, then increasing the urbanization of a country by 1 percent will increase women's life expectancy by 0.023 years. (Remember that we can’t think of the intercept as representing a hypothetical baseline country with USD0 GDP and 0 urbanization, because the logarithm of 0 is undefined.) This is another way of showing that adding `pctUrban` to our model provides some additional predictive power to our simple model, but not much. But does it do anything if we add it to a polynomial model?"
],
"execution_count": null,
"outputs": []
},
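{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Aside:** Before moving on to the polynomial version, the optional cell below checks the \"holding everything else equal\" reading of the coefficients. It assumes the two-predictor `model` fitted above is still in scope; the probe values (a per-capita GDP of USD 1,000 versus 10,000, and 50 versus 51 percent urbanization) are made up purely for illustration."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Two hypothetical countries that differ only in log_ppgdp (by 1, i.e., a 10-fold GDP gap)\n",
"probe = pd.DataFrame({'log_ppgdp': [3.0, 4.0], 'pctUrban': [50.0, 50.0]})\n",
"pred = model.predict(probe)\n",
"print('Effect of a 10-fold GDP increase (years):', pred[1] - pred[0])\n",
"\n",
"# Two hypothetical countries that differ only in pctUrban (by one percentage point)\n",
"probe = pd.DataFrame({'log_ppgdp': [3.5, 3.5], 'pctUrban': [50.0, 51.0]})\n",
"pred = model.predict(probe)\n",
"print('Effect of one extra point of urbanization (years):', pred[1] - pred[0])"
]
},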
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"poly_model = make_pipeline(PolynomialFeatures(2),\n",
" LinearRegression())\n",
"\n",
"X = df[['log_ppgdp', 'pctUrban']]\n",
"y = df['lifeExpF']\n",
"\n",
"poly_model.fit(X, y)\n",
"\n",
"x1_plot = np.linspace(df['log_ppgdp'].min(), df['log_ppgdp'].max())\n",
"x2_plot = np.linspace(df['pctUrban'].min(), df['pctUrban'].max())\n",
"X1_plot, X2_plot = np.meshgrid(x1_plot, x2_plot)\n",
"y_plot = poly_model.predict(np.c_[X1_plot.ravel(), X2_plot.ravel()])\n",
"\n",
"fig = plt.figure(figsize=(6,6))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"ax.scatter(df['log_ppgdp'], df['pctUrban'], df['lifeExpF'])\n",
"ax.plot_surface(X1_plot, X2_plot, y_plot.reshape(X1_plot.shape), cmap='viridis');\n",
"\n",
"ax.set_xlabel('GDP per capita (log)')\n",
"ax.set_ylabel('Percent urbanized')\n",
"ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s take a look at the $R^2$ for this model."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"poly_model = make_pipeline(PolynomialFeatures(2),\n",
" LinearRegression())\n",
"\n",
"X = df[['log_ppgdp', 'pctUrban']]\n",
"y = df['lifeExpF']\n",
"\n",
"poly_model.fit(X, y)\n",
"\n",
"predictions = poly_model.predict(X)\n",
"r2_score(df['lifeExpF'], predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the polynomial regression, adding `pctUrban` to our model provides a decent improvement to our model's predictive power (for example, this model's $R^2$ score is higher than those that we got with our two-degree, three-degree, or four-degree models using just `log_ppgdp`).\n",
"\n",
"More than just boosting the $R^2$ score, fitting the multiple polynomial regression provides additional insights from the visualization. If you rotate the visualization above 180 degrees about the $z$-axis, you will notice that while our model predicts increased female life expectancy at high incomes, in poor countries, our model actually shows a *decrease* in female life expectancy in poor countries correlated with increased urbanization.\n",
"\n",
"All of these conclusions come from a model that treats all of the data as coming from a rather monolithic whole. We have other types of data that we can also use in our modeling to try and arrive at different insights."
],
"execution_count": null,
"outputs": []
},
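{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Technical note:** If your environment does not let you rotate the figure interactively, you can set the camera angle programmatically with `view_init`. The optional cell below re-draws the scatterplot from a different azimuth; the `elev` and `azim` values are just examples, and the same call works on any of the 3D axes in this notebook."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig = plt.figure(figsize=(6,6))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"ax.scatter(df['log_ppgdp'], df['pctUrban'], df['lifeExpF'])\n",
"ax.view_init(elev=20, azim=120)   # rotate the camera; values are illustrative\n",
"\n",
"ax.set_xlabel('GDP per capita (log)')\n",
"ax.set_ylabel('Percent urbanized')\n",
"ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"plt.show();"
]
},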
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Categorical data\n",
"\n",
"Our dataset has two categorical features (also known as *factors* in the statistical world): `region` and `group`. There are multiple ways of handling data like this in linear regression; here, we will handle it by building sub-models for it.\n",
"\n",
"To begin moving in that analytical direction, let's start by color coding our 3D scatterplot points by `group`."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"groups = df.groupby('group')\n",
"\n",
"fig = plt.figure(figsize=(6,6))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"for name, group in groups:\n",
" ax.scatter(group['log_ppgdp'], \n",
" group['pctUrban'], \n",
" group['lifeExpF'], label=name)\n",
"ax.legend()\n",
"\n",
"ax.set_xlabel('GDP per capita (log)')\n",
"ax.set_ylabel('Percent urbanized')\n",
"ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"ax.legend()\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unsurprisingly, OECD-member countries cluster at the high end of the income scale and, sadly, African countries lag at the poorer end. Countries from the `other` group include countries ranging from poor ones in Southeast Asia to oil-rich Middle Eastern countries, and thus countries from that group are scattered across the income spectrum.\n",
"\n",
"Now that we have the data points detailed in color, let's fit three separate, simple linear models for each group of countries."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression(fit_intercept=True)\n",
"\n",
"fig = plt.figure(figsize=(6,6))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"\n",
"groups = df.groupby('group')\n",
"cmaps = ['Blues', 'Oranges', 'Greens']\n",
"\n",
"for name, group in groups:\n",
"\n",
" X = group[['log_ppgdp', 'pctUrban']]\n",
" y = group['lifeExpF']\n",
"\n",
" model.fit(X, y)\n",
"\n",
" x1_plot = np.linspace(group['log_ppgdp'].min(), group['log_ppgdp'].max())\n",
" x2_plot = np.linspace(group['pctUrban'].min(), group['pctUrban'].max())\n",
" X1_plot, X2_plot = np.meshgrid(x1_plot, x2_plot)\n",
" y_plot = model.predict(np.c_[X1_plot.ravel(), X2_plot.ravel()])\n",
" \n",
" ax.scatter(group['log_ppgdp'], group['pctUrban'], group['lifeExpF'], label=name)\n",
" cmap_index = sorted(df['group'].unique().tolist()).index(name)\n",
" cmap = cmaps[cmap_index]\n",
" ax.plot_surface(X1_plot, X2_plot, y_plot.reshape(X1_plot.shape), cmap=cmap);\n",
"\n",
"ax.set_xlabel('GDP per capita (log)')\n",
"ax.set_ylabel('Percent urbanized')\n",
"ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Technical note:** The blue plane models the `africa` group, the orange one `oecd`, and `other` is green.\n",
"\n",
"The opacity of the various planes can make it hard to pick out the details, but if you rotate the graph, you should be able to see that while the `other` and `oecd` models behave similarly, the `africa` sub-model exhibits different behavior and responds more dramatically to increases in per-capita GDP and urbanization for increasing women's lifespans.\n",
"\n",
"> **Exercise**\n",
">\n",
"> Changing color map for 3D plots like these can sometime help make different details clearer. Locate the `cmap` parameter in the `plot_surface()` function and change it to `cmap='viridis'`.\n",
"\n",
"> **Exercise solution**"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression(fit_intercept=True)\n",
"\n",
"fig = plt.figure(figsize=(6,6))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"\n",
"groups = df.groupby('group')\n",
"cmaps = ['Blues', 'Oranges', 'Greens']\n",
"\n",
"for name, group in groups:\n",
"\n",
" X = group[['log_ppgdp', 'pctUrban']]\n",
" y = group['lifeExpF']\n",
"\n",
" model.fit(X, y)\n",
"\n",
" x1_plot = np.linspace(group['log_ppgdp'].min(), group['log_ppgdp'].max())\n",
" x2_plot = np.linspace(group['pctUrban'].min(), group['pctUrban'].max())\n",
" X1_plot, X2_plot = np.meshgrid(x1_plot, x2_plot)\n",
" y_plot = model.predict(np.c_[X1_plot.ravel(), X2_plot.ravel()])\n",
"\n",
" ax.scatter(group['log_ppgdp'], group['pctUrban'], group['lifeExpF'], label=name)\n",
" ax.plot_surface(X1_plot, X2_plot, y_plot.reshape(X1_plot.shape), cmap='viridis');\n",
" \n",
"ax.set_xlabel('GDP per capita (log)')\n",
"ax.set_ylabel('Percent urbanized')\n",
"ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How good are these models?"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"groups = df.groupby('group')\n",
"\n",
"for name, group in groups:\n",
"\n",
" X = group[['log_ppgdp', 'pctUrban']]\n",
" y = group['lifeExpF']\n",
"\n",
" model.fit(X, y)\n",
" predictions = model.predict(X) \n",
" \n",
" print(name, \"r-squared score:\", \n",
" r2_score(group['lifeExpF'], \n",
" predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These models are not great. But we will see if they are improved by using polynomial regression later on."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression(fit_intercept=True)\n",
"\n",
"groups = df.groupby('group')\n",
"\n",
"for name, group in groups:\n",
"\n",
" X = group[['log_ppgdp', 'pctUrban']]\n",
" y = group['lifeExpF']\n",
"\n",
" model.fit(X, y)\n",
" \n",
" print(name, \"slopes: \", model.coef_)\n",
" print(name, \"intercept:\", model.intercept_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do these plots based on group tell us? The slopes for `log_ppgdp` are similar between the `africa` and `oecd` groups, but the slope is different for the `other` group. This suggests that there might be some interaction between `group` and `log_ppgdp` in explaining `lifeExpF`. The `pctUrban` slope for the `oecd` group has a different sign than for `africa` or `other`, which might indicate another interaction, but we are on shaky statistical ground here because we have done no testing to see if these differences in slope are statistically significant (for `pctUrban`, the differences—and the numbers—are small).\n",
"\n",
"What do we mean by interactions between groups? Recall that we generalized a linear model with two features as:\n",
"\n",
"$$\n",
"y=β_0+β_1 x_1+β_2 x_2\n",
"$$\n",
"\n",
"However, if $x_1$ and $x_2$ interact—if different values of $x_1$, for example, change the influence of $x_2$ on $y$ — we need to include that in the model like so:\n",
"\n",
"$$\n",
"y=β_0+β_1 x_1+β_2 x_2+β_3x_1x_2\n",
"$$\n",
"\n",
"Our model involves three predictors (`log_ppgdp`, `pctUrban`, and `group`) and has this form:\n",
"\n",
"$$\n",
"y=β_0+β_1 x_1+β_2 x_2+β_3 u_2+β_4 u_3+β_5x_1x_2+β_6 x_1 u_2+β_7 x_1 u_3+β_8 x_2 u_2+β_9 x_2 u_3+β_{10} x_1x_2u_2+β_{11}x_1x_2 u_3\n",
"$$\n",
"\n",
"Think of this as another aspect of the curse of dimensionality: as we add features (especially categorical ones), the number of potential interactions between features that we have to account for increases even faster.\n",
"\n",
"> **Technical note:** Statisticians often use the variable u for categorical features, a convention that we have used here. Also note that we only included $u_2$ and $u_3$ in the generalized equation for the model, even though we have three groups in the categorical feature group. This is not a mistake; it is because one group from the categorical feature gets included in the intercept.\n",
"\n",
"\n",
"> **Question**\n",
">\n",
"> Do you see where these numbers come from? What do you think the intercepts indicate for each of these groups?\n",
"\n",
"Let's now see what happens when we plot polynomial models for each of these groups."
],
"execution_count": null,
"outputs": []
},
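{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Aside:** The plots that follow handle `group` by fitting a separate model per group, which is one way to let the slopes differ. A rough alternative sketch (not what this notebook does, and omitting the three-way terms from the equation above for brevity) is to encode `group` as dummy variables, build the interaction columns explicitly, and fit a single model:"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: one model with explicit group dummies and interaction terms\n",
"dummies = pd.get_dummies(df['group'], prefix='group', drop_first=True)  # one level folds into the intercept\n",
"X_int = pd.concat([df[['log_ppgdp', 'pctUrban']], dummies.astype(float)], axis=1)\n",
"X_int['log_ppgdp_x_pctUrban'] = X_int['log_ppgdp'] * X_int['pctUrban']\n",
"for col in dummies.columns:\n",
"    X_int[col + '_x_log_ppgdp'] = X_int[col] * X_int['log_ppgdp']\n",
"    X_int[col + '_x_pctUrban'] = X_int[col] * X_int['pctUrban']\n",
"\n",
"interaction_model = LinearRegression(fit_intercept=True)\n",
"interaction_model.fit(X_int, df['lifeExpF'])\n",
"r2_score(df['lifeExpF'], interaction_model.predict(X_int))"
]
},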
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"poly_model = make_pipeline(PolynomialFeatures(2),\n",
" LinearRegression())\n",
"\n",
"fig = plt.figure(figsize=(6,6))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
"\n",
"groups = df.groupby('group')\n",
"cmaps = ['Blues_r', 'Oranges_r', 'Greens_r']\n",
"\n",
"for name, group in groups:\n",
"\n",
" X = group[['log_ppgdp', 'pctUrban']]\n",
" y = group['lifeExpF']\n",
"\n",
" poly_model.fit(X, y)\n",
"\n",
" x1_plot = np.linspace(group['log_ppgdp'].min(), group['log_ppgdp'].max())\n",
" x2_plot = np.linspace(group['pctUrban'].min(), group['pctUrban'].max())\n",
" X1_plot, X2_plot = np.meshgrid(x1_plot, x2_plot)\n",
" y_plot = poly_model.predict(np.c_[X1_plot.ravel(), X2_plot.ravel()])\n",
" \n",
" ax.scatter(group['log_ppgdp'], group['pctUrban'], group['lifeExpF'], label=name)\n",
" cmap_index = sorted(df['group'].unique().tolist()).index(name)\n",
" cmap = cmaps[cmap_index]\n",
" ax.plot_surface(X1_plot, X2_plot, y_plot.reshape(X1_plot.shape), cmap=cmap);\n",
"\n",
"ax.set_xlabel('GDP per capita (log)')\n",
"ax.set_ylabel('Percent urbanized')\n",
"ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The differences in shapes for these surfaces suggest interaction between `log_ppgdp`, `pctUrban`, and `group`. However, insightful as these plots can be, the nature of 3D visualization can make them hard to see. Another way to present the data is by breaking each model into its own subplot."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"poly_model = make_pipeline(PolynomialFeatures(2),\n",
" LinearRegression())\n",
"\n",
"fig = plt.figure(figsize=(18, 6))\n",
"\n",
"groups = df.groupby('group')\n",
"cmaps = ['Blues_r', 'Oranges_r', 'Greens_r']\n",
"colors = ['blue', 'orange', 'green']\n",
"\n",
"for name, group in groups:\n",
"\n",
" X = group[['log_ppgdp', 'pctUrban']]\n",
" y = group['lifeExpF']\n",
"\n",
" poly_model.fit(X, y)\n",
"\n",
" x1_plot = np.linspace(group['log_ppgdp'].min(), group['log_ppgdp'].max())\n",
" x2_plot = np.linspace(group['pctUrban'].min(), group['pctUrban'].max())\n",
" X1_plot, X2_plot = np.meshgrid(x1_plot, x2_plot)\n",
" y_plot = poly_model.predict(np.c_[X1_plot.ravel(), X2_plot.ravel()])\n",
" \n",
" index = sorted(df['group'].unique().tolist()).index(name)\n",
" ax = fig.add_subplot(1, 3, index + 1, projection='3d')\n",
" color = colors[index]\n",
" ax.scatter(group['log_ppgdp'], group['pctUrban'], group['lifeExpF'], \n",
" label=name, c=color)\n",
" cmap = cmaps[index]\n",
" ax.plot_surface(X1_plot, X2_plot, y_plot.reshape(X1_plot.shape), cmap=cmap);\n",
"\n",
" ax.set_title(name)\n",
" ax.set_xlabel('GDP per capita (log)')\n",
" ax.set_ylabel('Percent urbanized')\n",
" ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How useful are these models for prediction? Let's look at the $R^2$ scores."
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"poly_model = make_pipeline(PolynomialFeatures(2),\n",
" LinearRegression())\n",
"groups = df.groupby('group')\n",
"\n",
"for name, group in groups:\n",
"\n",
" X = group[['log_ppgdp', 'pctUrban']]\n",
" y = group['lifeExpF']\n",
"\n",
" poly_model.fit(X, y)\n",
" predictions = poly_model.predict(X) \n",
" \n",
" print(name, \"r-squared score:\", \n",
" r2_score(group['lifeExpF'], \n",
" predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not uniformly good. Adding polynomial regression improved the model for the `oecd` group, but worsened it for `africa` and `other`. Let's see if increasing the degrees of the polynomials helps.\n",
"\n",
"> **Exercise**\n",
">\n",
"> Try re-running the $R^2$ scoring code cell above using different polynomial degrees in the `PolynomialFeatures()` function until you get better-fitting models.\n",
"\n",
"> **Possible exercise solution**"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"poly_model = make_pipeline(PolynomialFeatures(5),\n",
" LinearRegression())\n",
"groups = df.groupby('group')\n",
"\n",
"for name, group in groups:\n",
"\n",
" X = group[['log_ppgdp', 'pctUrban']]\n",
" y = group['lifeExpF']\n",
"\n",
" poly_model.fit(X, y)\n",
" predictions = poly_model.predict(X) \n",
" \n",
" print(name, \"r-squared score:\", \n",
" r2_score(group['lifeExpF'], \n",
" predictions))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Exercise**\n",
">\n",
"> Now that you have a better polynomial degree to use in the models, re-run the code to plot them to see what they look like.\n",
"\n",
"> **Exercise solution**"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"poly_model = make_pipeline(PolynomialFeatures(5),\n",
" LinearRegression())\n",
"\n",
"fig = plt.figure(figsize=(18, 6))\n",
"\n",
"groups = df.groupby('group')\n",
"cmaps = ['Blues_r', 'Oranges_r', 'Greens_r']\n",
"colors = ['blue', 'orange', 'green']\n",
"\n",
"for name, group in groups:\n",
"\n",
" X = group[['log_ppgdp', 'pctUrban']]\n",
" y = group['lifeExpF']\n",
"\n",
" poly_model.fit(X, y)\n",
"\n",
" x1_plot = np.linspace(group['log_ppgdp'].min(), group['log_ppgdp'].max())\n",
" x2_plot = np.linspace(group['pctUrban'].min(), group['pctUrban'].max())\n",
" X1_plot, X2_plot = np.meshgrid(x1_plot, x2_plot)\n",
" y_plot = poly_model.predict(np.c_[X1_plot.ravel(), X2_plot.ravel()])\n",
" \n",
" index = sorted(df['group'].unique().tolist()).index(name)\n",
" ax = fig.add_subplot(1, 3, index + 1, projection='3d')\n",
" color = colors[index]\n",
" ax.scatter(group['log_ppgdp'], group['pctUrban'], group['lifeExpF'], \n",
" label=name, c=color)\n",
" cmap = cmaps[index]\n",
" ax.plot_surface(X1_plot, X2_plot, y_plot.reshape(X1_plot.shape), cmap=cmap);\n",
"\n",
" ax.set_title(name)\n",
" ax.set_xlabel('GDP per capita (log)')\n",
" ax.set_ylabel('Percent urbanized')\n",
" ax.set_zlabel('Life expectancy (years)')\n",
"\n",
"plt.show();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The differences in the shapes of these surfaces suggest interactions between `log_ppgdp`, `pctUrban`, and `group`, but we would have to do additional tests to establish what the specifics of those interactions are.\n",
"\n",
"> **Question**\n",
">\n",
"> What do these plots tell you about the dangers of extrapolating too much from a model? Is overfitting a possible concern with these tightly fit models?\n",
"\n",
"> **Takeaway**\n",
">\n",
"> Linear regression can be a flexible tool for modeling the relationships between features in our data, particularly with polynomial regression. However, this flexibility comes with dangers, particularly the hazard of overfitting our models to our data. With multiple regression, we can produce richer models that include more relationships, but interpretation becomes murkier with each additional feature, particularly when categorical features are included."
],
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "2.7.15-final"
}
},
"nbformat": 4,
"nbformat_minor": 2
}