Skip to content

Instantly share code, notes, and snippets.

@hamelin
Last active December 8, 2021 13:44
Show Gist options
  • Save hamelin/654da61ff22502e34fe64ab18140731c to your computer and use it in GitHub Desktop.
Save hamelin/654da61ff22502e34fe64ab18140731c to your computer and use it in GitHub Desktop.
Explains how to build sparse (COO) matrices that one-hot encode square lists
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"id": "a746f54e-83ed-479a-ba27-ce9291637c1b",
"metadata": {},
"source": [
"# Building sparse matrices from square lists\n",
"\n",
"Forget about SciKit Learn's `OneHotEncoder`: it's doing what you want, but your current\n",
"data structure enables a _much_ easier way.\n",
"\n",
"What we need is a binary sparse matrix that encodes which permissions each user has. This means it has to carry 1 for each row corresponding to a user and each column corresponding to a permission this user has; and 0 everywhere else. We can specify such a matrix in **COO format**: we simply indicate the row and column indices of non-zero entries, and the number to store there (which would be 1, in our case)."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "a38ef5d8-dad6-437e-8d5c-c97f8ba02d95",
"metadata": {},
"outputs": [],
"source": [
"from scipy.sparse.coo import coo_matrix # The constructor for our sparse matrix."
]
},
{
"cell_type": "markdown",
"id": "578f8486-88ad-4acf-8c66-0c0a5d4e8d4c",
"metadata": {},
"source": [
"Look up full docs with `help(coo_matrix)`. The short version is that we can go for the fourth signature of the constructor:\n",
"\n",
"```\n",
"M = coo_matrix((data, (i, j)), shape=(m, n))\n",
"```\n",
"\n",
"This means the first parameter is a pair of a vector of data, and another pair; this second pair is the vector of column indexes, and the vector of row indexes; thus the data vector are the numbers to place at each indexed entry (again, in our case, we put 1's). The shape parameter indicates the number of rows and columns to our matrix: in our case, the number of users and the number of permissions. It is not strictly necessary, but if you have some permissions that go unused, then we risk the matrix going truncated on the right."
]
},
{
"cell_type": "markdown",
"id": "86e95af3-fb6c-4287-80f4-f4220765aa08",
"metadata": {},
"source": [
"Now, let's have the full list of permissions. I use here a small subset, which corresponds to the blob container permission set."
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "c7d6b9b9-f247-4fc6-a62a-b3756f18693d",
"metadata": {},
"outputs": [],
"source": [
"permissions = [\"read\", \"add\", \"create\", \"write\", \"delete\", \"list\", \"immutable\"]"
]
},
{
"cell_type": "markdown",
"id": "c5415d92-b33b-4500-a56c-89283c18356a",
"metadata": {},
"source": [
"Now, our users come as a sequence of lists of two elements. First element is the user's name; second element is a list of all of that user's permissions."
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "06151052-60f3-46bc-b8bd-d3a864a68c69",
"metadata": {},
"outputs": [],
"source": [
"users = [\n",
" [\"simon\", [\"read\", \"list\"]],\n",
" [\"ben\", [\"read\", \"add\", \"create\", \"write\", \"delete\", \"list\", \"immutable\"]],\n",
" [\"alex\", [\"read\", \"list\", \"add\", \"create\", \"write\"]],\n",
" [\"mike\", [\"read\", \"list\"]]\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "e1eb7dc0-8317-440e-8c2e-14a33fc45718",
"metadata": {},
"source": [
"To build a sparse matrix, a good first step is to \"name your columns\": you build a dictionary that indicates which permission corresponds to which column of the matrix."
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "aef26ff3-5b67-42bc-8e53-b63576104cd0",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'read': 0,\n",
" 'add': 1,\n",
" 'create': 2,\n",
" 'write': 3,\n",
" 'delete': 4,\n",
" 'list': 5,\n",
" 'immutable': 6}"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"map_columns = {p: i for i, p in enumerate(permissions)}\n",
"map_columns"
]
},
{
"cell_type": "markdown",
"id": "d64f3995-402a-432c-8dba-1d1f8822e87a",
"metadata": {},
"source": [
"Now, let's establish each row and column index where we will put a 1."
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "efc413c0-f047-475b-8d7c-a3cfac91b1e1",
"metadata": {},
"outputs": [],
"source": [
"rows = []\n",
"cols = []\n",
"for row, (_, permissions_user) in enumerate(users):\n",
" for p in permissions_user:\n",
" rows.append(row)\n",
" cols.append(map_columns[p])"
]
},
{
"cell_type": "markdown",
"id": "6299c8d7-de67-4f64-b832-67f8555bde3b",
"metadata": {},
"source": [
"Finally we can instantiate this COO sparse matrix!"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "f212c375-9000-464b-b1d6-a413793ea674",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<4x7 sparse matrix of type '<class 'numpy.float64'>'\n",
"\twith 16 stored elements in COOrdinate format>"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M = coo_matrix(\n",
" (np.ones((len(rows),)), (rows, cols)),\n",
" shape=(len(users), len(permissions)))\n",
"M"
]
},
{
"cell_type": "markdown",
"id": "6b0f8344-2fda-4e59-9325-1e02cd610a66",
"metadata": {},
"source": [
"Let's check that its head (top rows) has the contents we expect. Sparse matrices cannot be indexed directly, but you can select the rows of a matrix by multiplying it by a partial identity on the left."
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "d6b0f826-466e-4bcc-92ae-2b732196ef32",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 0., 0., 0.],\n",
" [0., 1., 0., 0.]])"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.eye(2, len(users))"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "4bb60096-0633-4053-acec-ac6fdd7aecc1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 0., 0., 0., 0., 1., 0.],\n",
" [1., 1., 1., 1., 1., 1., 1.]])"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.eye(2, len(users)) @ M"
]
},
{
"cell_type": "markdown",
"id": "3bd9c701-3347-4060-b450-8a6ab67b388e",
"metadata": {},
"source": [
"For this example notebook, you can look up a whole dense representation of this matrix. At scale, this would crash your Python kernel as it would exceed available memory."
]
},
{
"cell_type": "code",
"execution_count": 49,
"id": "bd7b7f3b-b932-48d4-b13a-07d4a01d8978",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 0., 0., 0., 0., 1., 0.],\n",
" [1., 1., 1., 1., 1., 1., 1.],\n",
" [1., 1., 1., 1., 0., 1., 0.],\n",
" [1., 0., 0., 0., 0., 1., 0.]])"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M.toarray()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "SCIENCE (Python)",
"language": "python",
"name": "conda-env-SCIENCE-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment