@MikeTrizna
Last active May 17, 2023 19:07
{
"cells": [
{
"cell_type": "markdown",
"id": "f141b24f-07d5-4408-bc07-86f7fad87832",
"metadata": {},
"source": [
"# Data visualization with Pandas and Altair"
]
},
{
"cell_type": "markdown",
"id": "8eb58e71-6d19-474f-a8b3-b1553504a234",
"metadata": {},
"source": [
"## Python Data Visualization Ecosystem\n",
"\n",
"Unlike R, where the community has rallied around a single visualization package (ggplot2), Python users have many different packages to choose from -- all of which have their strengths and weaknesses.\n",
"\n",
"Here is a sampling of a few prominent options:\n",
"\n",
"**Matplotlib**\n",
"\n",
"Matplotlib is the \"grandparent\" of Python plotting libraries. It was written to look and act like MatLab, so it was originally written in a fairly \"non-Pythonic\" way. Since it has been around for the longest time, there are a lot of Python libraries that are built around it, and there have been various efforts to streamline and overhaul the way to interface with it.\n",
"\n",
"Link: https://matplotlib.org/\n",
" \n",
"**Seaborn**\n",
"\n",
"Seaborn is built on top of Matplotlib to provide functions to build various specific statistical plots. But it also incorporates default nice styling, and also attempts to standardize the code.\n",
"\n",
"Link: https://seaborn.pydata.org/\n",
"\n",
"**Plotnine**\n",
"\n",
"Plotnine is also built on top of Matplotlib, and is an effort to be a Python port of R's ggplot plotting library. The original Data Carpentry Python visualization lesson is written to use Altair, so that it can stay in sync with the Data Carpentry R lesson.\n",
"\n",
"Link: https://plotnine.readthedocs.io/en/stable/\n",
"\n",
"**Plotly**\n",
"\n",
"Link: https://plotly.com/python/\n",
"\n",
"**Bokeh**\n",
"\n",
"Link: https://bokeh.org/\n",
"\n",
"**Altair**\n",
"\n",
"Link: https://altair-viz.github.io/\n",
"\n",
"We will be using Altair for most of today's lesson for its combination to adherence to the Grammar of Graphics as well as its widespread adoption by Python users.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "876f3955-825c-4af4-b5bf-42178bfa7ffe",
"metadata": {},
"source": [
"## Visualization with Altair"
]
},
{
"cell_type": "markdown",
"id": "f1559f93-45a2-4121-8bbf-4693824f389c",
"metadata": {},
"source": [
"### Preparing our dataset"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5bb896b5-cb21-4f1b-8094-d0985395f5e1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 35549 entries, 0 to 35548\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 record_id 35549 non-null int64 \n",
" 1 month 35549 non-null int64 \n",
" 2 day 35549 non-null int64 \n",
" 3 year 35549 non-null int64 \n",
" 4 plot_id 35549 non-null int64 \n",
" 5 species_id 34786 non-null object \n",
" 6 sex 33038 non-null object \n",
" 7 hindfoot_length 31438 non-null float64\n",
" 8 weight 32283 non-null float64\n",
"dtypes: float64(2), int64(5), object(2)\n",
"memory usage: 2.4+ MB\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"surveys = pd.read_csv('data/surveys.csv')\n",
"surveys.info()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "76679521-dd3a-4c25-8733-e1324e34c056",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>species_id</th>\n",
" <th>species_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>AB</td>\n",
" <td>303</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>AH</td>\n",
" <td>437</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>AS</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>BA</td>\n",
" <td>46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>CB</td>\n",
" <td>50</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" species_id species_count\n",
"0 AB 303\n",
"1 AH 437\n",
"2 AS 2\n",
"3 BA 46\n",
"4 CB 50"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"species_counts = surveys.groupby('species_id')['record_id'].count().reset_index(name='species_count')\n",
"species_counts.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "91770624-e5cd-483e-b77e-7f88d2b89a38",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"48"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(species_counts)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f0df2442-6f85-48be-90f9-14019f3b9439",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['AB',\n",
" 'AH',\n",
" 'CB',\n",
" 'DM',\n",
" 'DO',\n",
" 'DS',\n",
" 'NL',\n",
" 'OL',\n",
" 'OT',\n",
" 'PB',\n",
" 'PE',\n",
" 'PF',\n",
" 'PM',\n",
" 'PP',\n",
" 'RF',\n",
" 'RM',\n",
" 'SA',\n",
" 'SH',\n",
" 'SS']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"big_species = species_counts[species_counts['species_count'] >= 50]['species_id'].to_list()\n",
"big_species"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "39afb560-4b63-45f7-b430-8e4e94271ee5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 30463 entries, 62 to 35547\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 record_id 30463 non-null int64 \n",
" 1 month 30463 non-null int64 \n",
" 2 day 30463 non-null int64 \n",
" 3 year 30463 non-null int64 \n",
" 4 plot_id 30463 non-null int64 \n",
" 5 species_id 30463 non-null object \n",
" 6 sex 30463 non-null object \n",
" 7 hindfoot_length 30463 non-null float64\n",
" 8 weight 30463 non-null float64\n",
"dtypes: float64(2), int64(5), object(2)\n",
"memory usage: 2.3+ MB\n"
]
}
],
"source": [
"surveys_filtered = surveys[surveys['species_id'].isin(big_species)].dropna()\n",
"surveys_filtered.info()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "07037a8c-4382-4dcc-acd0-5df7c12e08a7",
"metadata": {},
"outputs": [],
"source": [
"surveys_filtered.to_csv('data/surveys_filtered.csv', index=False)"
]
},
{
"cell_type": "markdown",
"id": "cd0284f8-db23-4e5e-bf3b-d0723d6c1ced",
"metadata": {},
"source": [
"### Building your plots iteratively"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "0f80e9cc-d8d3-415f-8f7a-7bd5aaf97927",
"metadata": {},
"outputs": [],
"source": [
"import altair as alt"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7c1cd3c9-08f9-4153-830d-0b9850a46f19",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"vegafusion.enable_widget()"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import vegafusion as vf\n",
"vf.enable_widget()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "048d037b-038d-4ae7-a43e-69b8677d7e73",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5632dd7b9f494fedaaa08738aea0a48d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"source = surveys.sample(50)\n",
"alt.Chart(source).mark_circle().encode(x='weight', \n",
" y='hindfoot_length')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "6679aaa6-fb2e-4cc4-8bf8-8dfeab612c92",
"metadata": {},
"outputs": [],
"source": [
"url = 'https://gist.githubusercontent.com/MikeTrizna/cd01f9bf3e21d6f74823423bdb45a2f3/raw/2d8c36cf78c9b6abf6938451c60defc93c5911a4/surveys_filtered.csv'"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "fde9ba13-dbb6-4d55-9b1a-0cf736cbccc4",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "eb83b76039a9412692b5434ae595d98c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q', \n",
" y='hindfoot_length:Q')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "90307af2-04ae-40cc-b4ce-f28f75fc890f",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d8ae54a6b1c64e1daaa1565dd2617f85",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(surveys_filtered).mark_circle(opacity=0.1,\n",
" color='red').encode(x='weight:Q', \n",
" y='hindfoot_length:Q')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "014cce42-8d78-4792-b31f-d75cfd21d7bc",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "923fc08ca95a493c81168e0cfa7499be",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q', \n",
" y='hindfoot_length:Q',\n",
" color='species_id:N')"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "58b480c5-f33d-487f-9eb5-44f0095fb47e",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d3f0768b7c0343c4bb4be543def79aad",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q', \n",
" y='hindfoot_length:Q',\n",
" color='species_id:N',\n",
" tooltip='species_id:N'\n",
" ).interactive()"
]
},
{
"cell_type": "markdown",
"id": "6f7dc8eb-aee4-43b0-9bba-8e5f4f197bc3",
"metadata": {},
"source": [
"### Faceting"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "f94f09cf-86c4-4d5f-8bbd-8157ee59b5f8",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "08d007387d3e4cb6a772d4f9090abd41",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q', \n",
" y='hindfoot_length:Q',\n",
" facet='sex:N',\n",
" color='species_id:N')"
]
},
{
"cell_type": "markdown",
"id": "5a16cead-b8aa-443d-a46d-642d16970b01",
"metadata": {},
"source": [
"### Boxplot"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "0c85df67-1213-4a30-8811-05f89dc1b01f",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "4b9af9e372b343e7b2ecde8995b2294d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(surveys_filtered).mark_boxplot().encode(x='species_id:N', \n",
" y='weight:Q')"
]
},
{
"cell_type": "markdown",
"id": "8c77efa1-c0bb-4326-aabc-7f6681ebd893",
"metadata": {},
"source": [
"**Challenge**\n",
"\n",
"Make a boxplot of the dataset that shows the distribution of hindfoot_length values by plot_id"
]
},
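{
"cell_type": "markdown",
"id": "2f6b1d3a-7c48-4e15-9a0b-5d8e3c6f4a21",
"metadata": {},
"source": [
"One possible approach to the challenge above, offered as a sketch rather than the lesson's official answer. It assumes `alt` and `surveys_filtered` are still defined from the earlier cells. Encoding `plot_id` as ordinal (`:O`) is a choice made here so that each plot gets its own box; the challenge itself does not specify the encoding type."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a4c7e02-1b5d-4f83-8e67-0c3d2a9b7f54",
"metadata": {},
"outputs": [],
"source": [
"# Sketch of one possible solution, not the author's own\n",
"# plot_id is a numeric label, so treat it as ordinal to get one box per plot\n",
"alt.Chart(surveys_filtered).mark_boxplot().encode(\n",
"    x='plot_id:O',\n",
"    y='hindfoot_length:Q'\n",
")"
]
},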
{
"cell_type": "markdown",
"id": "d2c0250b-51e8-4fc5-ad71-dba779143652",
"metadata": {},
"source": [
"### Built-in grouping"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "a1fb91bf-8c3e-4ddb-9f00-2cb147b0833a",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a7957e11df4840d4ab6081e8504686b8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(surveys_filtered).mark_bar().encode(\n",
" x='plot_id:O',\n",
" y='count():Q',\n",
" color='sex:N'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "353b2b5f-9e8a-40b5-ac17-d9fd8499ed01",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d394e54c42d94ad3b189299a2d3ec3a6",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(surveys_filtered).mark_line().encode(\n",
" x='year:O',\n",
" y='count():Q',\n",
" color='species_id:N'\n",
")"
]
},
{
"cell_type": "markdown",
"id": "db8f46b4-95af-4de3-961e-e54095f9adb7",
"metadata": {},
"source": [
"**Challenge**\n",
"\n",
"Make a bar plot showing the breakdown of sex values by species_id"
]
},
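{
"cell_type": "markdown",
"id": "6d0e8b47-3a92-4c1f-b5d6-8f1a7c2e9b30",
"metadata": {},
"source": [
"One possible approach to the challenge above, again a sketch rather than the lesson's official answer. It assumes `alt` and `surveys_filtered` are still defined from the earlier cells, and it reads \"breakdown\" as a stacked bar per species colored by sex, which is an interpretation of the prompt."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e15b3f9c-4d27-4a68-9c0e-2b7f6d8a1c43",
"metadata": {},
"outputs": [],
"source": [
"# Sketch of one possible solution, not the author's own\n",
"# One bar per species; each bar is split by sex using the color encoding\n",
"alt.Chart(surveys_filtered).mark_bar().encode(\n",
"    x='species_id:N',\n",
"    y='count():Q',\n",
"    color='sex:N'\n",
")"
]
},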
{
"cell_type": "markdown",
"id": "4f20c7cd-ceb4-47e9-8203-fa22ead838c2",
"metadata": {},
"source": [
"### Crossfiltering"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "fd3f284e-bc1f-4788-9ab1-a5672207b14f",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "b9849f07eaa64bc4b65e34b591a3e037",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": []
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"brush = alt.selection_interval()\n",
"\n",
"points = alt.Chart(surveys_filtered).mark_point(opacity=0.1).encode(\n",
" x='weight:Q',\n",
" y='hindfoot_length:Q',\n",
" color=alt.condition(brush, 'species_id:N', alt.value('lightgray'))\n",
").add_params(\n",
" brush\n",
")\n",
"\n",
"bars = alt.Chart(surveys_filtered).mark_bar().encode(\n",
" y='species_id:N',\n",
" color='species_id:N',\n",
" x='count(species_id):Q'\n",
").transform_filter(\n",
" brush\n",
")\n",
"\n",
"points & bars"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a584b02f-dd33-4305-a298-37c8ce5cdc37",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}