@qdpham
Created February 25, 2022 13:29
Fun-inria MOOC scikit-learn Exercise M4.04: Different outputs for linear regression coefficients
{
"cells": [
{
"cell_type": "markdown",
"id": "3f36b5d0",
"metadata": {},
"source": [
"# 📃 Solution for Exercise M4.04\n",
"\n",
"In the previous notebook, we saw the effect of applying some regularization\n",
"on the coefficient of a linear model.\n",
"\n",
"In this exercise, we will study the advantage of using some regularization\n",
"when dealing with correlated features.\n",
"\n",
"We will first create a regression dataset. This dataset will contain 2,000\n",
"samples and 5 features from which only 2 features will be informative."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2e45e8c6-a2ac-4d7a-871a-088094fe7fd5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"System:\n",
" python: 3.9.7 (default, Sep 16 2021, 08:50:36) [Clang 10.0.0 ]\n",
"executable: /Users/qdp/opt/anaconda3/bin/python\n",
" machine: macOS-10.16-x86_64-i386-64bit\n",
"\n",
"Python dependencies:\n",
" pip: 21.2.4\n",
" setuptools: 58.0.4\n",
" sklearn: 1.0.2\n",
" numpy: 1.20.3\n",
" scipy: 1.7.1\n",
" Cython: 0.29.24\n",
" pandas: 1.3.4\n",
" matplotlib: 3.4.3\n",
" joblib: 1.1.0\n",
"threadpoolctl: 2.2.0\n",
"\n",
"Built with OpenMP: True\n"
]
}
],
"source": [
"# Configuration \n",
"import sklearn\n",
"sklearn.show_versions()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8a51a22a",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_regression\n",
"\n",
"data, target, coef = make_regression(\n",
" n_samples=2_000,\n",
" n_features=5,\n",
" n_informative=2,\n",
" shuffle=False,\n",
" coef=True,\n",
" random_state=0,\n",
" noise=30,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8bdbc1ce",
"metadata": {},
"source": [
"When creating the dataset, `make_regression` returns the true coefficient\n",
"used to generate the dataset. Let's plot this information."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "aab52b75",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Relevant feature #0 9.566665\n",
"Relevant feature #1 40.192077\n",
"Noisy feature #0 0.000000\n",
"Noisy feature #1 0.000000\n",
"Noisy feature #2 0.000000\n",
"dtype: float64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import pandas as pd\n",
"\n",
"feature_names = [\n",
" \"Relevant feature #0\",\n",
" \"Relevant feature #1\",\n",
" \"Noisy feature #0\",\n",
" \"Noisy feature #1\",\n",
" \"Noisy feature #2\",\n",
"]\n",
"coef = pd.Series(coef, index=feature_names)\n",
"coef.plot.barh()\n",
"coef"
]
},
{
"cell_type": "markdown",
"id": "b1c460d7",
"metadata": {},
"source": [
"Create a `LinearRegression` regressor and fit on the entire dataset and\n",
"check the value of the coefficients. Are the coefficients of the linear\n",
"regressor close to the coefficients used to generate the dataset?"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "ca5de3ea",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([10.89587004, 40.41128042, -0.20542454, -0.18954462, 0.11129768])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# solution\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"linear_regression = LinearRegression()\n",
"linear_regression.fit(data, target)\n",
"linear_regression.coef_"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "0f8d6647",
"metadata": {
"tags": [
"solution"
]
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"feature_names = [\n",
" \"Relevant feature #0\",\n",
" \"Relevant feature #1\",\n",
" \"Noisy feature #0\",\n",
" \"Noisy feature #1\",\n",
" \"Noisy feature #2\",\n",
"]\n",
"coef = pd.Series(linear_regression.coef_, index=feature_names)\n",
"_ = coef.plot.barh()"
]
},
{
"cell_type": "markdown",
"id": "3b03ce00",
"metadata": {
"tags": [
"solution"
]
},
"source": [
"We see that the coefficients are close to the coefficients used to generate\n",
"the dataset. The dispersion is indeed cause by the noise injected during the\n",
"dataset generation."
]
},
{
"cell_type": "markdown",
"id": "dc96d427",
"metadata": {},
"source": [
"Now, create a new dataset that will be the same as `data` with 4 additional\n",
"columns that will repeat twice features 0 and 1. This procedure will create\n",
"perfectly correlated features."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "45df16ab",
"metadata": {},
"outputs": [],
"source": [
"# solution\n",
"import numpy as np\n",
"\n",
"data = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1)"
]
},
{
"cell_type": "markdown",
"id": "6aefcec3",
"metadata": {},
"source": [
"Fit again the linear regressor on this new dataset and check the\n",
"coefficients. What do you observe?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "84fee60f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 3.63195668, 13.47042681, -0.20542454, -0.18954462, 0.11129768,\n",
" 3.63195668, 13.47042681, 3.63195668, 13.47042681])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# solution\n",
"linear_regression = LinearRegression()\n",
"linear_regression.fit(data, target)\n",
"linear_regression.coef_"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b40ca761-ede7-4edb-918b-0bd96d3c6522",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Linear regression coefficients by hand: \n",
" [ 5.33306979e+03 7.63990305e+01 -2.06636605e-01 -1.90616627e-01\n",
" 1.11646376e-01 -1.79027400e+03 -1.74340449e+02 4.95808017e+03\n",
" 1.35353488e+02] \n",
"\n",
"Determinant of X.T @ X: -2.9853291155589836e-36\n"
]
}
],
"source": [
"# Linear regression by hand\n",
"import numpy as np\n",
"beta = np.linalg.inv(data.T @ data) @ data.T @ target\n",
"print(\"Linear regression coefficients by hand: \\n \", beta, \"\\n\")\n",
"\n",
"# determinant\n",
"det = np.linalg.det(data.T @ data)\n",
"print(\"Determinant of X.T @ X: \", det)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "e36183f4",
"metadata": {
"tags": [
"solution"
]
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"feature_names = [\n",
" \"Relevant feature #0\",\n",
" \"Relevant feature #1\",\n",
" \"Noisy feature #0\",\n",
" \"Noisy feature #1\",\n",
" \"Noisy feature #2\",\n",
" \"First repetition of feature #0\",\n",
" \"First repetition of feature #1\",\n",
" \"Second repetition of feature #0\",\n",
" \"Second repetition of feature #1\",\n",
"]\n",
"coef = pd.Series(linear_regression.coef_, index=feature_names)\n",
"_ = coef.plot.barh()"
]
},
{
"cell_type": "markdown",
"id": "d2c7ab37",
"metadata": {
"tags": [
"solution"
]
},
"source": [
"We see that the coefficient values are far from what one could expect.\n",
"By repeating the informative features, one would have expected these\n",
"coefficients to be similarly informative.\n",
"\n",
"Instead, we see that some coefficients have a huge norm ~1e14. It indeed\n",
"means that we try to solve an mathematical ill-posed problem. Indeed, finding\n",
"coefficients in a linear regression involves inverting the matrix\n",
"`np.dot(data.T, data)` which is not possible (or lead to high numerical\n",
"errors)."
]
},
{
"cell_type": "markdown",
"id": "2ddd26a3",
"metadata": {},
"source": [
"Create a ridge regressor and fit on the same dataset. Check the coefficients.\n",
"What do you observe?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1905d49b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 3.6313933 , 13.46802113, -0.20549345, -0.18929961, 0.11117205,\n",
" 3.6313933 , 13.46802113, 3.6313933 , 13.46802113])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# solution\n",
"from sklearn.linear_model import Ridge\n",
"\n",
"ridge = Ridge()\n",
"ridge.fit(data, target)\n",
"ridge.coef_"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f087bf65",
"metadata": {
"tags": [
"solution"
]
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"coef = pd.Series(ridge.coef_, index=feature_names)\n",
"_ = coef.plot.barh()"
]
},
{
"cell_type": "markdown",
"id": "5b15c35b",
"metadata": {
"tags": [
"solution"
]
},
"source": [
"We see that the penalty applied on the weights give a better results: the\n",
"values of the coefficients do not suffer from numerical issues. Indeed, the\n",
"matrix to be inverted internally is `np.dot(data.T, data) + alpha * I`.\n",
"Adding this penalty `alpha` allow the inversion without numerical issue."
]
},
{
"cell_type": "markdown",
"id": "e5825c28",
"metadata": {},
"source": [
"Can you find the relationship between the ridge coefficients and the original\n",
"coefficients?"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "83ef8489",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([10.89417991, 40.40406338, -0.61648035, -0.56789883, 0.33351616])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# solution\n",
"ridge.coef_[:5] * 3"
]
},
{
"cell_type": "markdown",
"id": "6df695f6",
"metadata": {
"tags": [
"solution"
]
},
"source": [
"Repeating three times each informative features induced to divide the\n",
"ridge coefficients by three."
]
},
{
"cell_type": "markdown",
"id": "a64488eb",
"metadata": {
"tags": [
"solution"
]
},
"source": [
"<div class=\"admonition tip alert alert-warning\">\n",
"<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
"<p>We advise to always use a penalty to shrink the magnitude of the weights\n",
"toward zero (also called \"l2 penalty\"). In scikit-learn, <tt class=\"docutils literal\">LogisticRegression</tt>\n",
"applies such penalty by default. However, one needs to use <tt class=\"docutils literal\">Ridge</tt> (and even\n",
"<tt class=\"docutils literal\">RidgeCV</tt> to tune the parameter <tt class=\"docutils literal\">alpha</tt>) instead of <tt class=\"docutils literal\">LinearRegression</tt>.</p>\n",
"<p class=\"last\">Other kinds of regularizations exist but will not be covered in this course.</p>\n",
"</div>\n",
"\n",
"## Dealing with correlation between one-hot encoded features\n",
"\n",
"In this section, we will focus on how to deal with correlated features that\n",
"arise naturally when one-hot encoding categorical features.\n",
"\n",
"Let's first load the Ames housing dataset and take a subset of features that\n",
"are only categorical features."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "03cc10c3",
"metadata": {
"tags": [
"solution"
]
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"ames_housing = pd.read_csv(\"../datasets/house_prices.csv\", na_values='?')\n",
"ames_housing = ames_housing.drop(columns=\"Id\")\n",
"\n",
"categorical_columns = [\"Street\", \"Foundation\", \"CentralAir\", \"PavedDrive\"]\n",
"target_name = \"SalePrice\"\n",
"X, y = ames_housing[categorical_columns], ames_housing[target_name]\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y, test_size=0.2, random_state=0\n",
")"
]
},
{
"cell_type": "markdown",
"id": "65787f48",
"metadata": {
"tags": [
"solution"
]
},
"source": [
"\n",
"We previously presented that a `OneHotEncoder` creates as many columns as\n",
"categories. Therefore, there is always one column (i.e. one encoded category)\n",
"that can be inferred from the others. Thus, `OneHotEncoder` creates\n",
"collinear features.\n",
"\n",
"We illustrate this behaviour by considering the \"CentralAir\" feature that\n",
"contains only two categories:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "f724aa4a",
"metadata": {
"tags": [
"solution"
]
},
"outputs": [
{
"data": {
"text/plain": [
"618 Y\n",
"870 N\n",
"92 Y\n",
"817 Y\n",
"302 Y\n",
" ..\n",
"763 Y\n",
"835 Y\n",
"1216 Y\n",
"559 Y\n",
"684 Y\n",
"Name: CentralAir, Length: 1168, dtype: object"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train[\"CentralAir\"]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c4bd2c18",
"metadata": {
"tags": [
"solution"
]
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CentralAir_N</th>\n",
" <th>CentralAir_Y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1163</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1164</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1165</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1166</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1167</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1168 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" CentralAir_N CentralAir_Y\n",
"0 0 1\n",
"1 1 0\n",
"2 0 1\n",
"3 0 1\n",
"4 0 1\n",
"... ... ...\n",
"1163 0 1\n",
"1164 0 1\n",
"1165 0 1\n",
"1166 0 1\n",
"1167 0 1\n",
"\n",
"[1168 rows x 2 columns]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"single_feature = [\"CentralAir\"]\n",
"encoder = OneHotEncoder(sparse=False, dtype=np.int32)\n",
"X_trans = encoder.fit_transform(X_train[single_feature])\n",
"X_trans = pd.DataFrame(\n",
" X_trans,\n",
" columns=encoder.get_feature_names_out(input_features=single_feature),\n",
")\n",
"X_trans"
]
},
{
"cell_type": "markdown",
"id": "3c7e6b09",
"metadata": {
"tags": [
"solution"
]
},
"source": [
"\n",
"Here, we see that the encoded category \"CentralAir_N\" is the opposite of the\n",
"encoded category \"CentralAir_Y\". Therefore, we observe that using a\n",
"`OneHotEncoder` creates two features having the problematic pattern observed\n",
"earlier in this exercise. Training a linear regression model on such a\n",
"of one-hot encoded binary feature can therefore lead to numerical\n",
"problems, especially without regularization. Furthermore, the two one-hot\n",
"features are redundant as they encode exactly the same information in\n",
"opposite ways.\n",
"\n",
"Using regularization helps to overcome the numerical issues that we highlighted\n",
"earlier in this exercise.\n",
"\n",
"Another strategy is to arbitrarily drop one of the encoded categories.\n",
"Scikit-learn provides such an option by setting the parameter `drop` in the\n",
"`OneHotEncoder`. This parameter can be set to `first` to always drop the\n",
"first encoded category or `binary_only` to only drop a column in the case of\n",
"binary categories."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "100b438b",
"metadata": {
"tags": [
"solution"
]
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>CentralAir_Y</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1163</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1164</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1165</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1166</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1167</th>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>1168 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" CentralAir_Y\n",
"0 1\n",
"1 0\n",
"2 1\n",
"3 1\n",
"4 1\n",
"... ...\n",
"1163 1\n",
"1164 1\n",
"1165 1\n",
"1166 1\n",
"1167 1\n",
"\n",
"[1168 rows x 1 columns]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"encoder = OneHotEncoder(drop=\"first\", sparse=False, dtype=np.int32)\n",
"X_trans = encoder.fit_transform(X_train[single_feature])\n",
"X_trans = pd.DataFrame(\n",
" X_trans,\n",
" columns=encoder.get_feature_names_out(input_features=single_feature),\n",
")\n",
"X_trans"
]
},
{
"cell_type": "markdown",
"id": "dae0c25e",
"metadata": {
"tags": [
"solution"
]
},
"source": [
"\n",
"We see that only the second column of the previous encoded data is kept.\n",
"Dropping one of the one-hot encoded column is a common practice,\n",
"especially for binary categorical features. Note however that this breaks\n",
"symmetry between categories and impacts the number of coefficients of the\n",
"model, their values, and thus their meaning, especially when applying\n",
"strong regularization.\n",
"\n",
"Let's finally illustrate how to use this option is a machine-learning pipeline:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "0654d3b3",
"metadata": {
"tags": [
"solution"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R2 score on the testing set: 0.24\n",
"Our model contains 9 features while 13 categories are originally available.\n"
]
}
],
"source": [
"from sklearn.pipeline import make_pipeline\n",
"\n",
"model = make_pipeline(OneHotEncoder(drop=\"first\", dtype=np.int32), Ridge())\n",
"model.fit(X_train, y_train)\n",
"n_categories = [X_train[col].nunique() for col in X_train.columns]\n",
"print(\n",
" f\"R2 score on the testing set: {model.score(X_test, y_test):.2f}\"\n",
")\n",
"print(\n",
" f\"Our model contains {model[-1].coef_.size} features while \"\n",
" f\"{sum(n_categories)} categories are originally available.\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ce7e70b0-fe29-4602-98aa-8005e9c4e2d2",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "tags,-all",
"main_language": "python",
"notebook_metadata_filter": "-all"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
qdpham (author) commented Feb 25, 2022

Hi,

Here are the two files reproducing the issue we discussed in the forum: LinearRegression and Ridge give exactly the same coefficients on my machine.

The first file, fun_linear_models_ex_04.ipynb, was run on the fun-inria server; the second, my_linear_models_ex_04.ipynb, was run on my local machine.

Note that I’ve added a cell for computing the coefficients by hand.
On the fun-inria server, this cell raised an error because the determinant is computed as exactly 0; however, `LinearRegression` still returns results without any explicit warning (although we can see that the coefficients are unreasonable).
On my machine, on the other hand, the determinant is computed as approximately 0, so I could proceed and got somewhat reasonable coefficients.
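
A minimal sketch of why an explicit inverse and a least-squares solver can behave so differently on duplicated columns. It uses NumPy only and a small synthetic dataset (the sizes, seed, and coefficients below are assumptions made for illustration, not the notebook's data). With repeated columns the design matrix is rank deficient, so `X.T @ X` has no true inverse, whereas an SVD-based solver such as `np.linalg.lstsq` returns the minimum-norm solution, which splits each informative coefficient equally across its three copies. As far as I know, `LinearRegression` also relies on a least-squares solver for dense data rather than an explicit inverse, which would be consistent with the `[3.63, 13.47, ...]` pattern in the local run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic problem (assumed for illustration): 5 features,
# only the first two are informative.
X = rng.normal(size=(200, 5))
true_coef = np.array([10.0, 40.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=1.0, size=200)

# Repeat features 0 and 1 twice, as in the exercise: 9 columns but rank 5.
X_dup = np.concatenate([X, X[:, [0, 1]], X[:, [0, 1]]], axis=1)
rank = np.linalg.matrix_rank(X_dup, tol=1e-8)
print("columns:", X_dup.shape[1], "rank:", rank)

# Unique least-squares solution on the original, full-rank data (no intercept).
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Minimum-norm least-squares solution on the rank-deficient data.
# rcond is set explicitly so numerically-zero singular values are discarded.
beta_dup, _, rank_dup, _ = np.linalg.lstsq(X_dup, y, rcond=1e-10)

print("full-rank fit:", np.round(beta_full, 3))
print("min-norm fit: ", np.round(beta_dup, 3))
print("3 * copies of feature 0:", np.round(3 * beta_dup[[0, 5, 7]], 3))
```

Whether the tiny singular values are treated as zero depends on the solver's cutoff and on the underlying LAPACK build, which is one plausible (but unverified) reason the server and a local machine could return different coefficients.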

At the end of the day, I get the same coefficients from LinearRegression and Ridge (regularized) on my machine.
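
The near-identical coefficients are what one would expect if the unregularized solver returns the minimum-norm solution. The sketch below, again NumPy only and on assumed synthetic data, solves the ridge system `(X.T @ X + alpha * I) w = X.T @ y` in closed form (the helper `ridge_closed_form` is hypothetical, written for this illustration) and compares it with the minimum-norm least-squares solution. Along each singular direction, ridge shrinks the solution by `s**2 / (s**2 + alpha)`; with 2,000 samples and `alpha=1.0` (the default of `Ridge`), that factor is very close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2_000

# Synthetic data (assumed for illustration), mimicking the exercise's setup.
X = rng.normal(size=(n_samples, 5))
coef = np.array([10.0, 40.0, 0.0, 0.0, 0.0])
y = X @ coef + rng.normal(scale=30.0, size=n_samples)
X_dup = np.concatenate([X, X[:, [0, 1]], X[:, [0, 1]]], axis=1)

# Center the data, which is one way to account for an intercept.
Xc = X_dup - X_dup.mean(axis=0)
yc = y - y.mean()

# Minimum-norm least-squares solution on the rank-deficient data.
beta_min_norm, *_ = np.linalg.lstsq(Xc, yc, rcond=1e-10)


def ridge_closed_form(X, y, alpha):
    """Solve (X.T @ X + alpha * I) w = X.T @ y (hypothetical helper)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


beta_ridge = ridge_closed_form(Xc, yc, alpha=1.0)

# Shrinkage factor along the weakest non-zero singular direction.
singular_values = np.linalg.svd(Xc, compute_uv=False)
s_min = singular_values[singular_values > 1e-8 * singular_values[0]].min()
shrinkage = s_min**2 / (s_min**2 + 1.0)

print("strongest shrinkage factor:", shrinkage)
print("max |ridge - min-norm|:", np.abs(beta_ridge - beta_min_norm).max())
```

Increasing `alpha` by several orders of magnitude should make the two solutions visibly diverge, which is a quick way to check that the regularization is actually active.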
