create a dataframe from your python distributions
{
"cells": [
{
"cell_type": "markdown",
"id": "bf7a0cfa-baff-49d3-8840-a0ad27483729",
"metadata": {},
"source": [
"# what do we learn when _we look at the distributions on our system_\n",
"<!-- TEASER_END -->"
]
},
{
"cell_type": "markdown",
"id": "6f5931a5-6a80-4577-95bc-cd2f7e37c932",
"metadata": {},
"source": [
"[`importlib.metadata`](https://docs.python.org/3/library/importlib.metadata.html) was added in Python 3.8 and makes it easier to explore the packages installed in your current python environment. we are going to load this data into pandas and see what we can learn from it. this approach is a fabulous way to generate real data for a demonstration."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5ea1eb67-e6db-4797-9b65-8d47270c794d",
"metadata": {},
"outputs": [],
"source": [
"import importlib.metadata, operator, pandas, toolz"
]
},
{
"cell_type": "markdown",
"id": "5dafa65e-a378-45c8-9432-408870c6988a",
"metadata": {},
"source": [
"create a series from `importlib.metadata.distributions`. each distribution contains information about an installed package."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8a477969-afe9-4ddb-80c0-0b3c9d1d7294",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"sphinxcontrib-bibtex <importlib.metadata.PathDistribution object at...\n",
"pyasn1-modules <importlib.metadata.PathDistribution object at...\n",
"dtype: object"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" distributions = pandas.Series({x.metadata.get(\"Name\"): x for x in importlib.metadata.distributions()})\n",
" distributions.sample(2)"
]
},
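{
"cell_type": "markdown",
"id": "sketch-distribution-summary",
"metadata": {},
"source": [
"a minimal, standard-library-only sketch of what each distribution exposes, assuming only the documented `importlib.metadata` attributes. `summarize` is a hypothetical helper named for illustration. note that `files` and `requires` may be `None`, and that two distributions sharing a `Name` would overwrite each other in the dictionary above.\n",
"\n",
"```python\n",
"import importlib.metadata\n",
"\n",
"\n",
"def summarize(dist):\n",
"    # hypothetical helper: flatten one distribution into a plain dict\n",
"    return {\n",
"        'name': dist.metadata.get('Name'),\n",
"        'version': dist.version,\n",
"        # files and requires can be None when that metadata is missing\n",
"        'n_files': len(dist.files or []),\n",
"        'n_requires': len(dist.requires or []),\n",
"    }\n",
"\n",
"\n",
"rows = [summarize(d) for d in importlib.metadata.distributions()]\n",
"print(len(rows), 'distributions found')\n",
"```\n",
"\n",
"a list of plain dicts like `rows` can also be passed to `pandas.DataFrame` directly."
]
},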
{
"cell_type": "markdown",
"id": "dc93b7ca-8e05-43ce-8d41-027b7f7c21af",
"metadata": {},
"source": [
"the distributions can be expanded into a tidy dataframe with the following `features`."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "53a1638d-de87-4229-9bbf-3f13d2b7350e",
"metadata": {},
"outputs": [],
"source": [
" features = ['files', 'version', 'requires', 'metadata']"
]
},
{
"cell_type": "markdown",
"id": "94280efc-f9a3-4c1b-b0a5-a5afa2b53e0f",
"metadata": {},
"source": [
"we'll widen our `distributions` series into a dataframe with one column per feature."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "6af6d10d-ac7a-4ded-a455-240f7cf2f395",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>files</th>\n",
" <th>version</th>\n",
" <th>requires</th>\n",
" <th>metadata</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>blinker</th>\n",
" <td>[blinker-1.4.dist-info/AUTHORS, blinker-1.4.di...</td>\n",
" <td>1.4</td>\n",
" <td>None</td>\n",
" <td>[Metadata-Version, Name, Version, Summary, Hom...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>conda-package-handling</th>\n",
" <td>[../../../bin/cph, conda_package_handling-1.7....</td>\n",
" <td>1.7.3</td>\n",
" <td>[six]</td>\n",
" <td>[Metadata-Version, Name, Version, Summary, Hom...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" files \\\n",
"blinker [blinker-1.4.dist-info/AUTHORS, blinker-1.4.di... \n",
"conda-package-handling [../../../bin/cph, conda_package_handling-1.7.... \n",
"\n",
" version requires \\\n",
"blinker 1.4 None \n",
"conda-package-handling 1.7.3 [six] \n",
"\n",
" metadata \n",
"blinker [Metadata-Version, Name, Version, Summary, Hom... \n",
"conda-package-handling [Metadata-Version, Name, Version, Summary, Hom... "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" df = distributions.apply(\n",
" toolz.compose_left(operator.attrgetter(*features), pandas.Series)\n",
" ).rename(columns=dict(zip(range(len(features)), features)))\n",
" df.sample(2)"
]
},
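{
"cell_type": "markdown",
"id": "sketch-requirement-names",
"metadata": {},
"source": [
"the `requires` column holds either `None` or a list of requirement strings such as `six (>=1.9)`. a rough sketch for pulling out the bare project names, assuming the name is the leading run of letters, digits, `.`, `_` and `-`. `requirement_name` and `dependency_names` are hypothetical helpers; a dedicated parser such as the `packaging` library covers more edge cases.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# assumption: a project name is the leading run of letters, digits, '.', '_' or '-'\n",
"NAME = re.compile('[A-Za-z0-9][A-Za-z0-9._-]*')\n",
"\n",
"\n",
"def requirement_name(requirement):\n",
"    # hypothetical helper: 'six (>=1.9)' -> 'six'\n",
"    match = NAME.match(requirement.strip())\n",
"    return match.group(0) if match else None\n",
"\n",
"\n",
"def dependency_names(requires):\n",
"    # requires may be None, mirroring the requires attribute of a distribution\n",
"    return [requirement_name(r) for r in requires or []]\n",
"\n",
"\n",
"print(dependency_names(['six (>=1.9)', 'webencodings', 'click (~=7.1)']))\n",
"```\n",
"\n",
"applied to the dataframe, something like `df['requires'].apply(dependency_names)` would give one list of names per distribution."
]
},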
{
"cell_type": "markdown",
"id": "db255356-ab1c-4e55-8e1c-e78d3182a475",
"metadata": {},
"source": [
"there are still some goodies nested in the `metadata` column of this dataframe. in the next cell we create a wider dataframe that holds both the `distribution` details and the package `metadata`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "732cc351-8d9c-4a98-81d7-90aac31e93db",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th colspan=\"4\" halign=\"left\">distribution</th>\n",
" <th colspan=\"17\" halign=\"left\">metadata</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>files</th>\n",
" <th>version</th>\n",
" <th>requires</th>\n",
" <th>metadata</th>\n",
" <th>Metadata-Version</th>\n",
" <th>Name</th>\n",
" <th>Version</th>\n",
" <th>Summary</th>\n",
" <th>Home-page</th>\n",
" <th>Author</th>\n",
" <th>...</th>\n",
" <th>Requires-Dist</th>\n",
" <th>Project-URL</th>\n",
" <th>Description-Content-Type</th>\n",
" <th>Maintainer</th>\n",
" <th>Maintainer-email</th>\n",
" <th>License-File</th>\n",
" <th>Provides-Extra</th>\n",
" <th>Description</th>\n",
" <th>Download-URL</th>\n",
" <th>Provides</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>sphinx-book-theme</th>\n",
" <td>[sphinx_book_theme-0.1.4.dist-info/INSTALLER, ...</td>\n",
" <td>0.1.4</td>\n",
" <td>[beautifulsoup4 (&lt;5,&gt;=4.6.1), click (~=7.1), d...</td>\n",
" <td>[Metadata-Version, Name, Version, Summary, Hom...</td>\n",
" <td>2.1</td>\n",
" <td>sphinx-book-theme</td>\n",
" <td>0.1.4</td>\n",
" <td>Jupyter Book: Create an online book with Jupyt...</td>\n",
" <td>https://jupyterbook.org/</td>\n",
" <td>Project Jupyter Contributors</td>\n",
" <td>...</td>\n",
" <td>beautifulsoup4 (&lt;5,&gt;=4.6.1)</td>\n",
" <td>Documentation, https://jupyterbook.org</td>\n",
" <td>text/markdown</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>code_style</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>html5lib</th>\n",
" <td>[html5lib-1.1.dist-info/AUTHORS.rst, html5lib-...</td>\n",
" <td>1.1</td>\n",
" <td>[six (&gt;=1.9), webencodings, genshi ; extra == ...</td>\n",
" <td>[Metadata-Version, Name, Version, Summary, Hom...</td>\n",
" <td>2.1</td>\n",
" <td>html5lib</td>\n",
" <td>1.1</td>\n",
" <td>HTML parser based on the WHATWG HTML specifica...</td>\n",
" <td>https://github.com/html5lib/html5lib-python</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>six (&gt;=1.9)</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>James Graham</td>\n",
" <td>james@hoppipolla.co.uk</td>\n",
" <td>NaN</td>\n",
" <td>all</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>2 rows × 26 columns</p>\n",
"</div>"
],
"text/plain": [
" distribution \\\n",
" files version \n",
"sphinx-book-theme [sphinx_book_theme-0.1.4.dist-info/INSTALLER, ... 0.1.4 \n",
"html5lib [html5lib-1.1.dist-info/AUTHORS.rst, html5lib-... 1.1 \n",
"\n",
" \\\n",
" requires \n",
"sphinx-book-theme [beautifulsoup4 (<5,>=4.6.1), click (~=7.1), d... \n",
"html5lib [six (>=1.9), webencodings, genshi ; extra == ... \n",
"\n",
" \\\n",
" metadata \n",
"sphinx-book-theme [Metadata-Version, Name, Version, Summary, Hom... \n",
"html5lib [Metadata-Version, Name, Version, Summary, Hom... \n",
"\n",
" metadata \\\n",
" Metadata-Version Name Version \n",
"sphinx-book-theme 2.1 sphinx-book-theme 0.1.4 \n",
"html5lib 2.1 html5lib 1.1 \n",
"\n",
" \\\n",
" Summary \n",
"sphinx-book-theme Jupyter Book: Create an online book with Jupyt... \n",
"html5lib HTML parser based on the WHATWG HTML specifica... \n",
"\n",
" \\\n",
" Home-page \n",
"sphinx-book-theme https://jupyterbook.org/ \n",
"html5lib https://github.com/html5lib/html5lib-python \n",
"\n",
" ... \\\n",
" Author ... \n",
"sphinx-book-theme Project Jupyter Contributors ... \n",
"html5lib NaN ... \n",
"\n",
" \\\n",
" Requires-Dist \n",
"sphinx-book-theme beautifulsoup4 (<5,>=4.6.1) \n",
"html5lib six (>=1.9) \n",
"\n",
" \\\n",
" Project-URL \n",
"sphinx-book-theme Documentation, https://jupyterbook.org \n",
"html5lib NaN \n",
"\n",
" \\\n",
" Description-Content-Type Maintainer \n",
"sphinx-book-theme text/markdown NaN \n",
"html5lib NaN James Graham \n",
"\n",
" \\\n",
" Maintainer-email License-File Provides-Extra \n",
"sphinx-book-theme NaN NaN code_style \n",
"html5lib james@hoppipolla.co.uk NaN all \n",
"\n",
" \n",
" Description Download-URL Provides \n",
"sphinx-book-theme NaN NaN NaN \n",
"html5lib NaN NaN NaN \n",
"\n",
"[2 rows x 26 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" df = pandas.concat(dict(\n",
" distribution=df,\n",
" metadata=df[\"metadata\"].apply(\n",
" toolz.compose_left(dict, pandas.Series)\n",
" )\n",
" ), axis=1)\n",
" df.sample(2)"
]
},
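{
"cell_type": "markdown",
"id": "sketch-multi-valued-metadata",
"metadata": {},
"source": [
"one caveat worth checking: the metadata object behaves like an `email.message.Message`, so `dict(metadata)` keeps only the first value of repeated fields such as `Requires-Dist`. a hedged sketch that keeps every value with `get_all`; `metadata_to_dict` and the `MULTI` set are hypothetical names chosen for illustration, and the hand-built `example` message stands in for real package metadata.\n",
"\n",
"```python\n",
"from email.message import Message\n",
"\n",
"# assumption: fields that commonly repeat in package metadata\n",
"MULTI = {'Requires-Dist', 'Classifier', 'Provides-Extra', 'Project-URL'}\n",
"\n",
"\n",
"def metadata_to_dict(metadata):\n",
"    # hypothetical helper: lists for repeated fields, first value otherwise\n",
"    out = {}\n",
"    for key in set(metadata.keys()):\n",
"        values = metadata.get_all(key)\n",
"        out[key] = values if key in MULTI else values[0]\n",
"    return out\n",
"\n",
"\n",
"# a stand-in for the metadata of one distribution, built by hand\n",
"example = Message()\n",
"example['Name'] = 'html5lib'\n",
"example['Requires-Dist'] = 'six (>=1.9)'\n",
"example['Requires-Dist'] = 'webencodings'\n",
"print(metadata_to_dict(example))\n",
"```\n",
"\n",
"swapping `dict` for a helper like this in the cell above would trade scalar columns for list-valued ones, which may or may not suit the analysis."
]
},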
{
"cell_type": "markdown",
"id": "6f592c99-07e5-465a-a5dd-2bf1f9d50a19",
"metadata": {},
"source": [
"## what can we learn about our environment?"
]
},
{
"cell_type": "markdown",
"id": "1235688b-0ba1-4358-a53b-e2ca576b7ce9",
"metadata": {},
"source": [
"### how many distributions does it contain?"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "25ca1574-10ce-49c1-912d-8da7badee345",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'in this environment, there are 363 distributions installed.'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" F\"\"\"in this environment, there are {len(df)} distributions installed.\"\"\""
]
},
{
"cell_type": "markdown",
"id": "7e2c3f6e-84bc-45f1-937f-84a50c6a3493",
"metadata": {},
"source": [
"### how many files are in each distribution?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "78079984-bb21-4aef-9551-3dc2585ce0d4",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>mkdocs-material</th>\n",
" <th>bokeh</th>\n",
" <th>jedi</th>\n",
" <th>pandas</th>\n",
" <th>notebook</th>\n",
" <th>mypy</th>\n",
" <th>panel</th>\n",
" <th>numpy</th>\n",
" <th>Faker</th>\n",
" <th>holoviews</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>files</th>\n",
" <td>8144</td>\n",
" <td>2244</td>\n",
" <td>1763</td>\n",
" <td>1661</td>\n",
" <td>1516</td>\n",
" <td>1511</td>\n",
" <td>1350</td>\n",
" <td>1194</td>\n",
" <td>1017</td>\n",
" <td>951</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" mkdocs-material bokeh jedi pandas notebook mypy panel numpy \\\n",
"files 8144 2244 1763 1661 1516 1511 1350 1194 \n",
"\n",
" Faker holoviews \n",
"files 1017 951 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" df[\"distribution\"][\"files\"].apply(lambda x: len(x or [])).sort_values(ascending=False).iloc[:10].to_frame().T"
]
},
{
"cell_type": "markdown",
"id": "b3be1993-eb87-4045-8146-7ec7386d8ed5",
"metadata": {},
"source": [
"### what is the distribution of licenses?"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8ffbbdd3-9bb9-4996-b214-394b82eafb81",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>MIT</th>\n",
" <th>BSD</th>\n",
" <th>BSD-3-Clause</th>\n",
" <th>UNKNOWN</th>\n",
" <th>MIT License</th>\n",
" <th>Apache 2.0</th>\n",
" <th>ISC</th>\n",
" <th>BSD License</th>\n",
" <th>Apache License, Version 2.0</th>\n",
" <th>Apache Software License</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>License</th>\n",
" <td>100</td>\n",
" <td>62</td>\n",
" <td>37</td>\n",
" <td>36</td>\n",
" <td>17</td>\n",
" <td>13</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" MIT BSD BSD-3-Clause UNKNOWN MIT License Apache 2.0 ISC \\\n",
"License 100 62 37 36 17 13 6 \n",
"\n",
" BSD License Apache License, Version 2.0 Apache Software License \n",
"License 6 5 5 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"metadata\"][\"License\"].value_counts().iloc[:10].to_frame().T"
]
},
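{
"cell_type": "markdown",
"id": "sketch-license-normalization",
"metadata": {},
"source": [
"the counts above split the same license across several spellings (`MIT` and `MIT License`, `BSD` and `BSD License`). a small sketch of grouping spellings before counting, assuming a hand-written alias table. `ALIASES` and `normalize_license` are hypothetical, and the mapping is illustrative rather than a legal classification.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# assumption: a hand-written, illustrative alias table\n",
"ALIASES = {\n",
"    'mit license': 'MIT',\n",
"    'bsd license': 'BSD',\n",
"    'apache 2.0': 'Apache-2.0',\n",
"    'apache license, version 2.0': 'Apache-2.0',\n",
"}\n",
"\n",
"\n",
"def normalize_license(value):\n",
"    # missing values and the literal UNKNOWN are grouped together\n",
"    if not isinstance(value, str) or value.strip().upper() == 'UNKNOWN':\n",
"        return 'UNKNOWN'\n",
"    cleaned = value.strip()\n",
"    return ALIASES.get(cleaned.lower(), cleaned)\n",
"\n",
"\n",
"licenses = ['MIT', 'MIT License', 'BSD', 'BSD License', None, 'UNKNOWN']\n",
"print(Counter(map(normalize_license, licenses)).most_common())\n",
"```\n",
"\n",
"with the dataframe above this could be applied as `df['metadata']['License'].map(normalize_license).value_counts()`."
]
},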
{
"cell_type": "markdown",
"id": "00a9a4ac-4346-4a3b-880b-cd1e65eb35f2",
"metadata": {},
"source": [
"### who authored my packages?"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a374ec51-aeee-43c4-b943-286ef1fc1432",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Jupyter Development Team</th>\n",
" <th>wxyz contributors</th>\n",
" <th>Sébastien Eustace</th>\n",
" <th>Georg Brandl</th>\n",
" <th>IPython Development Team</th>\n",
" <th>Chris Sewell</th>\n",
" <th>Kenneth Reitz</th>\n",
" <th>Executable Book Project</th>\n",
" <th>Armin Ronacher</th>\n",
" <th>Thomas Kluyver</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Author</th>\n",
" <td>16</td>\n",
" <td>12</td>\n",
" <td>10</td>\n",
" <td>8</td>\n",
" <td>7</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Jupyter Development Team wxyz contributors Sébastien Eustace \\\n",
"Author 16 12 10 \n",
"\n",
" Georg Brandl IPython Development Team Chris Sewell Kenneth Reitz \\\n",
"Author 8 7 6 6 \n",
"\n",
" Executable Book Project Armin Ronacher Thomas Kluyver \n",
"Author 6 5 4 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"metadata\"][\"Author\"].value_counts().iloc[:10].to_frame().T"
]
},
{
"cell_type": "markdown",
"id": "9de59f21-af92-4eb2-afe5-bb7fe6f790f5",
"metadata": {},
"source": [
"## conclusion\n",
"\n",
"there is a juicy dataset in your environment just waiting for you to explore. what will you find?"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}