Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save jessegrabowski/efd27a4cd8872d4363b15668e0471d5d to your computer and use it in GitHub Desktop.
Save jessegrabowski/efd27a4cd8872d4363b15668e0471d5d to your computer and use it in GitHub Desktop.
QR for Multi-Collinearity checking
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "13fbdcda",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy import linalg"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "39a29ddf",
"metadata": {},
"outputs": [],
"source": [
"X = np.random.normal(size=(10, 10))"
]
},
{
"cell_type": "markdown",
"id": "783c6159",
"metadata": {},
"source": [
"Definition: A matrix is defined as \"low-rank\" if it has eigenvalues equal to zero."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d2f54723",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3.67 , 2.92 , 2.92 , 2.64 , 2.64 , 1.717, 1.717, 1.482, 1.482,\n",
" 2.115])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Showing modulus because it's nicer\n",
"np.abs(linalg.eigvals(X)).round(3)"
]
},
{
"cell_type": "markdown",
"id": "7e3268b2",
"metadata": {},
"source": [
"This matrix is fine, so it inverts like a champ."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2c81f6dd",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[ 0.043 -0.255 -0.507 -0.067 0.455 -0.148 -0.234 0.351 0.106 0.002]\n",
" [ 0.162 0.725 0.457 0.388 -0.126 0.474 -0.289 -0.685 0.195 -0.352]\n",
" [-0.22 -0.638 -0.071 -0.415 -0.072 -0.614 0.308 0.244 -0.414 0.46 ]\n",
" [-0.187 -1.077 -0.276 -0.373 -0.139 -0.695 0.046 1.004 -0.272 0.662]\n",
" [ 0.146 0.431 0.107 -0.039 -0.136 0.307 -0.244 -0.475 0.22 -0.223]\n",
" [-0.197 -0.408 -0.469 -0.408 0.178 -0.553 0.349 0.623 -0.216 0.583]\n",
" [ 0.405 0.312 0.349 0.186 -0.141 0.395 -0.298 -0.324 0.309 -0.135]\n",
" [-0.195 -0.202 -0.319 0.126 0.148 -0.314 0.23 0.09 0.238 0.044]\n",
" [-0.094 0.45 0.223 -0.141 0.041 0.514 0.174 -0.482 0.117 -0.145]\n",
" [-0.028 -0.112 0.232 -0.009 0.049 -0.171 0.019 0.176 -0.038 -0.08 ]]\n"
]
}
],
"source": [
"with np.printoptions(linewidth=10000, precision=3, suppress=True):\n",
" print(linalg.inv(X))"
]
},
{
"cell_type": "markdown",
"id": "81603e0e",
"metadata": {},
"source": [
"Replace a column with a linear combination of another column "
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7c87b5d4",
"metadata": {},
"outputs": [],
"source": [
"X[:, 0] = 3 * X[:, 1] + 5\n",
"X[:, 4] = 2 * X[:, 5] - 1"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b71dd44f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([7.35 , 7.35 , 2.996, 2.996, 2.841, 2.841, 2.157, 0.987, 0. ,\n",
" 0.292])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.abs(linalg.eigvals(X)).round(3)"
]
},
{
"cell_type": "markdown",
"id": "fd7eb069",
"metadata": {},
"source": [
"Surprisingly, it still agrees to be inverted, but notice the horrible values we get. The matrix inversion isn't stable."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "38dd1188",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[-4.791e+14 -9.932e+14 2.426e+14 1.895e+14 -9.915e+13 -7.456e+14 9.779e+14 1.018e+15 -7.675e+14 6.569e+14]\n",
" [ 1.437e+15 2.980e+15 -7.277e+14 -5.684e+14 2.974e+14 2.237e+15 -2.934e+15 -3.053e+15 2.303e+15 -1.971e+15]\n",
" [ 1.071e+00 3.975e+00 2.011e+00 -6.430e-01 -2.321e+00 2.605e+00 -1.499e+00 -4.988e+00 1.442e+00 -1.654e+00]\n",
" [ 1.142e+00 3.449e+00 1.549e+00 -6.404e-01 -2.162e+00 2.479e+00 -1.911e+00 -4.094e+00 1.664e+00 -1.475e+00]\n",
" [-2.396e+15 -4.966e+15 1.213e+15 9.473e+14 -4.957e+14 -3.728e+15 4.889e+15 5.089e+15 -3.838e+15 3.285e+15]\n",
" [ 4.791e+15 9.932e+15 -2.426e+15 -1.895e+15 9.915e+14 7.456e+15 -9.779e+15 -1.018e+16 7.675e+15 -6.569e+15]\n",
" [-6.295e-01 -3.264e+00 -1.147e+00 3.860e-01 1.503e+00 -2.109e+00 1.203e+00 3.713e+00 -1.192e+00 1.537e+00]\n",
" [-2.241e-01 -5.372e-02 -9.788e-03 1.682e-01 -1.291e-01 -2.296e-01 3.791e-01 -1.160e-01 1.676e-01 4.742e-02]\n",
" [-4.876e-01 -1.257e+00 -8.354e-01 -1.149e-01 1.117e+00 -6.551e-01 5.974e-01 1.499e+00 -4.166e-01 5.533e-01]\n",
" [ 2.164e-01 3.434e-01 3.544e-02 -1.130e-01 1.664e-01 1.775e-01 -5.022e-01 -2.776e-01 3.602e-01 -4.067e-01]]\n"
]
}
],
"source": [
"with np.printoptions(linewidth=10000, precision=3, suppress=True):\n",
" print(linalg.inv(X))"
]
},
{
"cell_type": "markdown",
"id": "68d0af4c",
"metadata": {},
"source": [
"Checking the eigenvalues shows that there is a problem, but it doesn't show where. To see where, use QR decomposition."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "dcd860f4",
"metadata": {},
"outputs": [],
"source": [
"Q, R, P = linalg.qr(X, pivoting=True)\n",
"Q2, R2 = linalg.qr(X, pivoting=False)"
]
},
{
"cell_type": "markdown",
"id": "3fc9606c",
"metadata": {},
"source": [
"QR decomposition breaks the matrix into two pieces, an orthonormal piece and an upper triangular piece. Check the wiki for details.\n",
"\n",
"Importantly, it also sorts `R` so that the diagonal elements are ordered from lowest to highest. The `P` array shows you what this ordering is. So if you pass `pivoting=False`, you get back the sorted `R`, but you don't know how that sorting came about. If you pass `pivoting=True`, you also get back the ordering of the X columns that result in the sorted `R`."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "5af37622",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.allclose(Q2 @ R2, X)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "95b8be86",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.allclose(Q @ R, X[:, P])"
]
},
{
"cell_type": "markdown",
"id": "be9bc993",
"metadata": {},
"source": [
"So what is the `R` anyway? It's a upper triangular matrix that sort of encodes all the unique information about the original matrix `X`. If there is no information in a column of `X`, a row of `R` will sum to zero. "
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "53b13ded",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[23.068 0.623 0.04 3.809 -0.032 0.484 1.479 0.718 1.351 1.475]\n",
" [ 0. -5.005 -1.236 -0.882 0.396 0.315 1.161 0.757 0.265 -2.238]\n",
" [ 0. 0. -3.699 0.779 0.493 -0.365 -0.395 -0.017 0.097 -0.234]\n",
" [ 0. 0. 0. 3.367 0.98 0.444 1.053 -1.929 1.983 -1.01 ]\n",
" [ 0. 0. 0. 0. -2.661 -0.112 0.013 -2.18 1.488 -0. ]\n",
" [ 0. 0. 0. 0. 0. 2.227 0.301 0.586 0.535 -0. ]\n",
" [ 0. 0. 0. 0. 0. 0. -2.171 -0.005 -1.014 -0. ]\n",
" [ 0. 0. 0. 0. 0. 0. 0. -1.156 -0.96 -0. ]\n",
" [ 0. 0. 0. 0. 0. 0. 0. 0. 0.266 0. ]\n",
" [ 0. 0. 0. 0. 0. 0. 0. 0. 0. -0. ]]\n"
]
}
],
"source": [
"with np.printoptions(linewidth=10000, precision=3, suppress=True):\n",
" print(R.round(3))"
]
},
{
"cell_type": "markdown",
"id": "61e6df65",
"metadata": {},
"source": [
"Notice that we have a row of all zeros at the bottom, and that the last column has only 4 non-zero rows. This points to degeneracies in the X matrix. \n",
"\n",
"Of course we already knew this by construction. We could also check the column rank of the matrix:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "6afcbf1b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.linalg.matrix_rank(X)"
]
},
{
"cell_type": "markdown",
"id": "e4a2142c",
"metadata": {},
"source": [
"To know which columns to check, we look at the first `n - k` entries of `P`, where `n` is the number of columns in the matrix, and `k` is the column rank of the matrix (or the number of zero eigenvalues, or the number of zero-sum rows in `R`)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "3ce212f0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 4, 9, 1, 2, 7, 8, 6, 3, 5], dtype=int32)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"P"
]
},
{
"cell_type": "markdown",
"id": "32777a71",
"metadata": {},
"source": [
"We see the offender is `0`, which we made a linear function of column `1`. The next worst one is `4`, which we also know is a linear function of `5`. For whatever reason, the computer doesn't detect that this one is a bad value.\n",
"\n",
"I have no idea why that is. As you can see, no method is foolproof."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bced9111",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"gist": {
"data": {
"description": "QR for finding multi-colinearity",
"public": true
},
"id": ""
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment