Created
January 29, 2024 12:59
-
-
Save jessegrabowski/efd27a4cd8872d4363b15668e0471d5d to your computer and use it in GitHub Desktop.
QR for Multi-Collinearity checking
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "13fbdcda", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"from scipy import linalg" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "39a29ddf", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"X = np.random.normal(size=(10, 10))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "783c6159", | |
"metadata": {}, | |
"source": [ | |
"Definition: A matrix is defined as \"low-rank\" if it has eigenvalues equal to zero." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"id": "d2f54723", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([3.67 , 2.92 , 2.92 , 2.64 , 2.64 , 1.717, 1.717, 1.482, 1.482,\n", | |
" 2.115])" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Showing modulus because it's nicer\n", | |
"np.abs(linalg.eigvals(X)).round(3)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "7e3268b2", | |
"metadata": {}, | |
"source": [ | |
"This matrix is fine, so it inverts like a champ." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "2c81f6dd", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[ 0.043 -0.255 -0.507 -0.067 0.455 -0.148 -0.234 0.351 0.106 0.002]\n", | |
" [ 0.162 0.725 0.457 0.388 -0.126 0.474 -0.289 -0.685 0.195 -0.352]\n", | |
" [-0.22 -0.638 -0.071 -0.415 -0.072 -0.614 0.308 0.244 -0.414 0.46 ]\n", | |
" [-0.187 -1.077 -0.276 -0.373 -0.139 -0.695 0.046 1.004 -0.272 0.662]\n", | |
" [ 0.146 0.431 0.107 -0.039 -0.136 0.307 -0.244 -0.475 0.22 -0.223]\n", | |
" [-0.197 -0.408 -0.469 -0.408 0.178 -0.553 0.349 0.623 -0.216 0.583]\n", | |
" [ 0.405 0.312 0.349 0.186 -0.141 0.395 -0.298 -0.324 0.309 -0.135]\n", | |
" [-0.195 -0.202 -0.319 0.126 0.148 -0.314 0.23 0.09 0.238 0.044]\n", | |
" [-0.094 0.45 0.223 -0.141 0.041 0.514 0.174 -0.482 0.117 -0.145]\n", | |
" [-0.028 -0.112 0.232 -0.009 0.049 -0.171 0.019 0.176 -0.038 -0.08 ]]\n" | |
] | |
} | |
], | |
"source": [ | |
"with np.printoptions(linewidth=10000, precision=3, suppress=True):\n", | |
" print(linalg.inv(X))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "81603e0e", | |
"metadata": {}, | |
"source": [ | |
"Replace a column with a linear combination of another column " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "7c87b5d4", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"X[:, 0] = 3 * X[:, 1] + 5\n", | |
"X[:, 4] = 2 * X[:, 5] - 1" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"id": "b71dd44f", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([7.35 , 7.35 , 2.996, 2.996, 2.841, 2.841, 2.157, 0.987, 0. ,\n", | |
" 0.292])" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.abs(linalg.eigvals(X)).round(3)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "fd7eb069", | |
"metadata": {}, | |
"source": [ | |
"Surprisingly, it still agrees to be inverted, but notice the horrible values we get. The matrix inversion isn't stable." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "38dd1188", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[-4.791e+14 -9.932e+14 2.426e+14 1.895e+14 -9.915e+13 -7.456e+14 9.779e+14 1.018e+15 -7.675e+14 6.569e+14]\n", | |
" [ 1.437e+15 2.980e+15 -7.277e+14 -5.684e+14 2.974e+14 2.237e+15 -2.934e+15 -3.053e+15 2.303e+15 -1.971e+15]\n", | |
" [ 1.071e+00 3.975e+00 2.011e+00 -6.430e-01 -2.321e+00 2.605e+00 -1.499e+00 -4.988e+00 1.442e+00 -1.654e+00]\n", | |
" [ 1.142e+00 3.449e+00 1.549e+00 -6.404e-01 -2.162e+00 2.479e+00 -1.911e+00 -4.094e+00 1.664e+00 -1.475e+00]\n", | |
" [-2.396e+15 -4.966e+15 1.213e+15 9.473e+14 -4.957e+14 -3.728e+15 4.889e+15 5.089e+15 -3.838e+15 3.285e+15]\n", | |
" [ 4.791e+15 9.932e+15 -2.426e+15 -1.895e+15 9.915e+14 7.456e+15 -9.779e+15 -1.018e+16 7.675e+15 -6.569e+15]\n", | |
" [-6.295e-01 -3.264e+00 -1.147e+00 3.860e-01 1.503e+00 -2.109e+00 1.203e+00 3.713e+00 -1.192e+00 1.537e+00]\n", | |
" [-2.241e-01 -5.372e-02 -9.788e-03 1.682e-01 -1.291e-01 -2.296e-01 3.791e-01 -1.160e-01 1.676e-01 4.742e-02]\n", | |
" [-4.876e-01 -1.257e+00 -8.354e-01 -1.149e-01 1.117e+00 -6.551e-01 5.974e-01 1.499e+00 -4.166e-01 5.533e-01]\n", | |
" [ 2.164e-01 3.434e-01 3.544e-02 -1.130e-01 1.664e-01 1.775e-01 -5.022e-01 -2.776e-01 3.602e-01 -4.067e-01]]\n" | |
] | |
} | |
], | |
"source": [ | |
"with np.printoptions(linewidth=10000, precision=3, suppress=True):\n", | |
" print(linalg.inv(X))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "68d0af4c", | |
"metadata": {}, | |
"source": [ | |
"Checking the eigenvalues shows that there is a problem, but it doesn't show where. To see where, use QR decomposition." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "dcd860f4", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"Q, R, P = linalg.qr(X, pivoting=True)\n", | |
"Q2, R2 = linalg.qr(X, pivoting=False)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "3fc9606c", | |
"metadata": {}, | |
"source": [ | |
"QR decomposition breaks the matrix into two pieces, an orthonormal piece and an upper triangular piece. Check the wiki for details.\n", | |
"\n", | |
"Importantly, it also sorts `R` so that the diagonal elements are ordered from lowest to highest. The `P` array shows you what this ordering is. So if you pass `pivoting=False`, you get back the sorted `R`, but you don't know how that sorting came about. If you pass `pivoting=True`, you also get back the ordering of the X columns that result in the sorted `R`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"id": "5af37622", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 22, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.allclose(Q2 @ R2, X)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"id": "95b8be86", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"True" | |
] | |
}, | |
"execution_count": 23, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.allclose(Q @ R, X[:, P])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "be9bc993", | |
"metadata": {}, | |
"source": [ | |
"So what is the `R` anyway? It's a upper triangular matrix that sort of encodes all the unique information about the original matrix `X`. If there is no information in a column of `X`, a row of `R` will sum to zero. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"id": "53b13ded", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[[23.068 0.623 0.04 3.809 -0.032 0.484 1.479 0.718 1.351 1.475]\n", | |
" [ 0. -5.005 -1.236 -0.882 0.396 0.315 1.161 0.757 0.265 -2.238]\n", | |
" [ 0. 0. -3.699 0.779 0.493 -0.365 -0.395 -0.017 0.097 -0.234]\n", | |
" [ 0. 0. 0. 3.367 0.98 0.444 1.053 -1.929 1.983 -1.01 ]\n", | |
" [ 0. 0. 0. 0. -2.661 -0.112 0.013 -2.18 1.488 -0. ]\n", | |
" [ 0. 0. 0. 0. 0. 2.227 0.301 0.586 0.535 -0. ]\n", | |
" [ 0. 0. 0. 0. 0. 0. -2.171 -0.005 -1.014 -0. ]\n", | |
" [ 0. 0. 0. 0. 0. 0. 0. -1.156 -0.96 -0. ]\n", | |
" [ 0. 0. 0. 0. 0. 0. 0. 0. 0.266 0. ]\n", | |
" [ 0. 0. 0. 0. 0. 0. 0. 0. 0. -0. ]]\n" | |
] | |
} | |
], | |
"source": [ | |
"with np.printoptions(linewidth=10000, precision=3, suppress=True):\n", | |
" print(R.round(3))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "61e6df65", | |
"metadata": {}, | |
"source": [ | |
"Notice that we have a row of all zeros at the bottom, and that the last column has only 4 non-zero rows. This points to degeneracies in the X matrix. \n", | |
"\n", | |
"Of course we already knew this by construction. We could also check the column rank of the matrix:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"id": "6afcbf1b", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"9" | |
] | |
}, | |
"execution_count": 38, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.linalg.matrix_rank(X)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "e4a2142c", | |
"metadata": {}, | |
"source": [ | |
"To know which columns to check, we look at the first `n - k` entries of `P`, where `n` is the number of columns in the matrix, and `k` is the column rank of the matrix (or the number of zero eigenvalues, or the number of zero-sum rows in `R`)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 39, | |
"id": "3ce212f0", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([0, 4, 9, 1, 2, 7, 8, 6, 3, 5], dtype=int32)" | |
] | |
}, | |
"execution_count": 39, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"P" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "32777a71", | |
"metadata": {}, | |
"source": [ | |
"We see the offender is `0`, which we made a linear function of column `1`. The next worst one is `4`, which we also know is a linear function of `5`. For whatever reason, the computer doesn't detect that this one is a bad value.\n", | |
"\n", | |
"I have no idea why that is. As you can see, no method is foolproof." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "bced9111", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"gist": { | |
"data": { | |
"description": "QR for finding multi-colinearity", | |
"public": true | |
}, | |
"id": "" | |
}, | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.11.6" | |
}, | |
"toc": { | |
"base_numbering": 1, | |
"nav_menu": {}, | |
"number_sections": true, | |
"sideBar": true, | |
"skip_h1_title": false, | |
"title_cell": "Table of Contents", | |
"title_sidebar": "Contents", | |
"toc_cell": false, | |
"toc_position": {}, | |
"toc_section_display": true, | |
"toc_window_display": false | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment