MikeTrizna/visualization_with_altair.ipynb

## visualization_with_altair.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f141b24f-07d5-4408-bc07-86f7fad87832",
   "metadata": {},
   "source": [
    "# Data visualization with Pandas and Altair"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8eb58e71-6d19-474f-a8b3-b1553504a234",
   "metadata": {},
   "source": [
    "## Python Data Visualization Ecosystem\n",
    "\n",
    "Unlike R, where the community has rallied around a single visualization package (ggplot2), Python users have many different packages to choose from -- all of which have their strengths and weaknesses.\n",
    "\n",
    "Here is a sampling of a few prominent options:\n",
    "\n",
    "**Matplotlib**\n",
    "\n",
    "Matplotlib is the \"grandparent\" of Python plotting libraries. It was designed to look and behave like MATLAB, so its original interface is fairly \"non-Pythonic\". Because it has been around the longest, many Python libraries are built on top of it, and there have been various efforts to streamline and overhaul its interface.\n",
    "\n",
    "Link: https://matplotlib.org/\n",
    "    \n",
    "**Seaborn**\n",
    "\n",
    "Seaborn is built on top of Matplotlib and provides functions for building specific statistical plots. It also applies attractive default styling and attempts to standardize the plotting code.\n",
    "\n",
    "Link: https://seaborn.pydata.org/\n",
    "\n",
    "**Plotnine**\n",
    "\n",
    "Plotnine is also built on top of Matplotlib, and is an effort to port R's ggplot2 plotting library to Python. The original Data Carpentry Python visualization lesson is written to use Plotnine, so that it can stay in sync with the Data Carpentry R lesson.\n",
    "\n",
    "Link: https://plotnine.readthedocs.io/en/stable/\n",
    "\n",
    "**Plotly**\n",
    "\n",
    "Link: https://plotly.com/python/\n",
    "\n",
    "**Bokeh**\n",
    "\n",
    "Link: https://bokeh.org/\n",
    "\n",
    "**Altair**\n",
    "\n",
    "Link: https://altair-viz.github.io/\n",
    "\n",
    "We will be using Altair for most of today's lesson because it combines adherence to the Grammar of Graphics with widespread adoption among Python users.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "876f3955-825c-4af4-b5bf-42178bfa7ffe",
   "metadata": {},
   "source": [
    "## Visualization with Altair"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1559f93-45a2-4121-8bbf-4693824f389c",
   "metadata": {},
   "source": [
    "### Preparing our dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "5bb896b5-cb21-4f1b-8094-d0985395f5e1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 35549 entries, 0 to 35548\n",
      "Data columns (total 9 columns):\n",
      " #   Column           Non-Null Count  Dtype  \n",
      "---  ------           --------------  -----  \n",
      " 0   record_id        35549 non-null  int64  \n",
      " 1   month            35549 non-null  int64  \n",
      " 2   day              35549 non-null  int64  \n",
      " 3   year             35549 non-null  int64  \n",
      " 4   plot_id          35549 non-null  int64  \n",
      " 5   species_id       34786 non-null  object \n",
      " 6   sex              33038 non-null  object \n",
      " 7   hindfoot_length  31438 non-null  float64\n",
      " 8   weight           32283 non-null  float64\n",
      "dtypes: float64(2), int64(5), object(2)\n",
      "memory usage: 2.4+ MB\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "surveys = pd.read_csv('data/surveys.csv')\n",
    "surveys.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "76679521-dd3a-4c25-8733-e1324e34c056",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>species_id</th>\n",
       "      <th>species_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AB</td>\n",
       "      <td>303</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AH</td>\n",
       "      <td>437</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AS</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>BA</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>CB</td>\n",
       "      <td>50</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  species_id  species_count\n",
       "0         AB            303\n",
       "1         AH            437\n",
       "2         AS              2\n",
       "3         BA             46\n",
       "4         CB             50"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count the records for each species and store the result in a 'species_count' column\n",
    "species_counts = surveys.groupby('species_id')['record_id'].count().reset_index(name='species_count')\n",
    "species_counts.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "91770624-e5cd-483e-b77e-7f88d2b89a38",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "48"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(species_counts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f0df2442-6f85-48be-90f9-14019f3b9439",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['AB',\n",
       " 'AH',\n",
       " 'CB',\n",
       " 'DM',\n",
       " 'DO',\n",
       " 'DS',\n",
       " 'NL',\n",
       " 'OL',\n",
       " 'OT',\n",
       " 'PB',\n",
       " 'PE',\n",
       " 'PF',\n",
       " 'PM',\n",
       " 'PP',\n",
       " 'RF',\n",
       " 'RM',\n",
       " 'SA',\n",
       " 'SH',\n",
       " 'SS']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Keep only the species that have at least 50 records\n",
    "big_species = species_counts[species_counts['species_count'] >= 50]['species_id'].to_list()\n",
    "big_species"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "39afb560-4b63-45f7-b430-8e4e94271ee5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 30463 entries, 62 to 35547\n",
      "Data columns (total 9 columns):\n",
      " #   Column           Non-Null Count  Dtype  \n",
      "---  ------           --------------  -----  \n",
      " 0   record_id        30463 non-null  int64  \n",
      " 1   month            30463 non-null  int64  \n",
      " 2   day              30463 non-null  int64  \n",
      " 3   year             30463 non-null  int64  \n",
      " 4   plot_id          30463 non-null  int64  \n",
      " 5   species_id       30463 non-null  object \n",
      " 6   sex              30463 non-null  object \n",
      " 7   hindfoot_length  30463 non-null  float64\n",
      " 8   weight           30463 non-null  float64\n",
      "dtypes: float64(2), int64(5), object(2)\n",
      "memory usage: 2.3+ MB\n"
     ]
    }
   ],
   "source": [
    "# Keep rows for the well-sampled species, then drop rows with any missing value\n",
    "surveys_filtered = surveys[surveys['species_id'].isin(big_species)].dropna()\n",
    "surveys_filtered.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "07037a8c-4382-4dcc-acd0-5df7c12e08a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "surveys_filtered.to_csv('data/surveys_filtered.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd0284f8-db23-4e5e-bf3b-d0723d6c1ced",
   "metadata": {},
   "source": [
    "### Building your plots iteratively"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0f80e9cc-d8d3-415f-8f7a-7bd5aaf97927",
   "metadata": {},
   "outputs": [],
   "source": [
    "import altair as alt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7c1cd3c9-08f9-4153-830d-0b9850a46f19",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "vegafusion.enable_widget()"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
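    "# Render charts through VegaFusion, which processes the data in the Python kernel\n",
    "# so that datasets above Altair's default 5,000-row limit can be plotted\n",
    "import vegafusion as vf\n",
    "vf.enable_widget()"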
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "048d037b-038d-4ae7-a43e-69b8677d7e73",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5632dd7b9f494fedaaa08738aea0a48d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Start with a random sample of 50 records to keep the first chart light\n",
    "source = surveys.sample(50)\n",
    "alt.Chart(source).mark_circle().encode(\n",
    "    x='weight',\n",
    "    y='hindfoot_length'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "6679aaa6-fb2e-4cc4-8bf8-8dfeab612c92",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = 'https://gist.githubusercontent.com/MikeTrizna/cd01f9bf3e21d6f74823423bdb45a2f3/raw/2d8c36cf78c9b6abf6938451c60defc93c5911a4/surveys_filtered.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "fde9ba13-dbb6-4d55-9b1a-0cf736cbccc4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "eb83b76039a9412692b5434ae595d98c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(\n",
    "    x='weight:Q',\n",
    "    y='hindfoot_length:Q'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "90307af2-04ae-40cc-b4ce-f28f75fc890f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d8ae54a6b1c64e1daaa1565dd2617f85",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alt.Chart(surveys_filtered).mark_circle(opacity=0.1, color='red').encode(\n",
    "    x='weight:Q',\n",
    "    y='hindfoot_length:Q'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "014cce42-8d78-4792-b31f-d75cfd21d7bc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "923fc08ca95a493c81168e0cfa7499be",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(\n",
    "    x='weight:Q',\n",
    "    y='hindfoot_length:Q',\n",
    "    color='species_id:N'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "58b480c5-f33d-487f-9eb5-44f0095fb47e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d3f0768b7c0343c4bb4be543def79aad",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(\n",
    "    x='weight:Q',\n",
    "    y='hindfoot_length:Q',\n",
    "    color='species_id:N',\n",
    "    tooltip='species_id:N'\n",
    ").interactive()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f7dc8eb-aee4-43b0-9bba-8e5f4f197bc3",
   "metadata": {},
   "source": [
    "### Faceting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "f94f09cf-86c4-4d5f-8bbd-8157ee59b5f8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "08d007387d3e4cb6a772d4f9090abd41",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(\n",
    "    x='weight:Q',\n",
    "    y='hindfoot_length:Q',\n",
    "    facet='sex:N',\n",
    "    color='species_id:N'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a16cead-b8aa-443d-a46d-642d16970b01",
   "metadata": {},
   "source": [
    "### Boxplot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "0c85df67-1213-4a30-8811-05f89dc1b01f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4b9af9e372b343e7b2ecde8995b2294d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alt.Chart(surveys_filtered).mark_boxplot().encode(\n",
    "    x='species_id:N',\n",
    "    y='weight:Q'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c77efa1-c0bb-4326-aabc-7f6681ebd893",
   "metadata": {},
   "source": [
    "**Challenge**\n",
    "\n",
    "Make a boxplot of the dataset that shows the distribution of hindfoot_length values by plot_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2c0250b-51e8-4fc5-ad71-dba779143652",
   "metadata": {},
   "source": [
    "### Built-in grouping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "a1fb91bf-8c3e-4ddb-9f00-2cb147b0833a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a7957e11df4840d4ab6081e8504686b8",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alt.Chart(surveys_filtered).mark_bar().encode(\n",
    "    x='plot_id:O',\n",
    "    y='count():Q',\n",
    "    color='sex:N'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "353b2b5f-9e8a-40b5-ac17-d9fd8499ed01",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d394e54c42d94ad3b189299a2d3ec3a6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alt.Chart(surveys_filtered).mark_line().encode(\n",
    "    x='year:O',\n",
    "    y='count():Q',\n",
    "    color='species_id:N'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db8f46b4-95af-4de3-961e-e54095f9adb7",
   "metadata": {},
   "source": [
    "**Challenge**\n",
    "\n",
    "Make a bar plot showing the breakdown of sex values by species_id"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f20c7cd-ceb4-47e9-8203-fa22ead838c2",
   "metadata": {},
   "source": [
    "### Crossfiltering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "fd3f284e-bc1f-4788-9ab1-a5672207b14f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b9849f07eaa64bc4b65e34b591a3e037",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VegaFusionWidget(spec='{\\n  \"config\": {\\n    \"view\": {\\n      \"continuousWidth\": 300,\\n      \"continuousHeight…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": []
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "brush = alt.selection_interval()\n",
    "\n",
    "points = alt.Chart(surveys_filtered).mark_point(opacity=0.1).encode(\n",
    "    x='weight:Q',\n",
    "    y='hindfoot_length:Q',\n",
    "    color=alt.condition(brush, 'species_id:N', alt.value('lightgray'))\n",
    ").add_params(\n",
    "    brush\n",
    ")\n",
    "\n",
    "bars = alt.Chart(surveys_filtered).mark_bar().encode(\n",
    "    y='species_id:N',\n",
    "    color='species_id:N',\n",
    "    x='count(species_id):Q'\n",
    ").transform_filter(\n",
    "    brush\n",
    ")\n",
    "\n",
    "points & bars"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a584b02f-dd33-4305-a298-37c8ce5cdc37",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"id": "f141b24f-07d5-4408-bc07-86f7fad87832",
	"metadata": {},
	"source": [
	"# Data visualization with Pandas and Altair"
	]
	},
	{
	"cell_type": "markdown",
	"id": "8eb58e71-6d19-474f-a8b3-b1553504a234",
	"metadata": {},
	"source": [
	"## Python Data Visualization Ecosystem\n",
	"\n",
	"Unlike R, where the community has rallied around a single visualization package (ggplot2), Python users have many different packages to choose from -- all of which have their strengths and weaknesses.\n",
	"\n",
	"Here is a sampling of a few prominent options:\n",
	"\n",
	"Matplotlib\n",
	"\n",
	"Matplotlib is the \"grandparent\" of Python plotting libraries. It was written to look and act like MatLab, so it was originally written in a fairly \"non-Pythonic\" way. Since it has been around for the longest time, there are a lot of Python libraries that are built around it, and there have been various efforts to streamline and overhaul the way to interface with it.\n",
	"\n",
	"Link: https://matplotlib.org/\n",
	" \n",
	"Seaborn\n",
	"\n",
	"Seaborn is built on top of Matplotlib to provide functions to build various specific statistical plots. But it also incorporates default nice styling, and also attempts to standardize the code.\n",
	"\n",
	"Link: https://seaborn.pydata.org/\n",
	"\n",
	"Plotnine\n",
	"\n",
	"Plotnine is also built on top of Matplotlib, and is an effort to be a Python port of R's ggplot plotting library. The original Data Carpentry Python visualization lesson is written to use Altair, so that it can stay in sync with the Data Carpentry R lesson.\n",
	"\n",
	"Link: https://plotnine.readthedocs.io/en/stable/\n",
	"\n",
	"Plotly\n",
	"\n",
	"Link: https://plotly.com/python/\n",
	"\n",
	"Bokeh\n",
	"\n",
	"Link: https://bokeh.org/\n",
	"\n",
	"Altair\n",
	"\n",
	"Link: https://altair-viz.github.io/\n",
	"\n",
	"We will be using Altair for most of today's lesson for its combination to adherence to the Grammar of Graphics as well as its widespread adoption by Python users.\n",
	"\n"
	]
	},
	{
	"cell_type": "markdown",
	"id": "876f3955-825c-4af4-b5bf-42178bfa7ffe",
	"metadata": {},
	"source": [
	"## Visualization with Altair"
	]
	},
	{
	"cell_type": "markdown",
	"id": "f1559f93-45a2-4121-8bbf-4693824f389c",
	"metadata": {},
	"source": [
	"### Preparing our dataset"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"id": "5bb896b5-cb21-4f1b-8094-d0985395f5e1",
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"<class 'pandas.core.frame.DataFrame'>\n",
	"RangeIndex: 35549 entries, 0 to 35548\n",
	"Data columns (total 9 columns):\n",
	" # Column Non-Null Count Dtype \n",
	"--- ------ -------------- ----- \n",
	" 0 record_id 35549 non-null int64 \n",
	" 1 month 35549 non-null int64 \n",
	" 2 day 35549 non-null int64 \n",
	" 3 year 35549 non-null int64 \n",
	" 4 plot_id 35549 non-null int64 \n",
	" 5 species_id 34786 non-null object \n",
	" 6 sex 33038 non-null object \n",
	" 7 hindfoot_length 31438 non-null float64\n",
	" 8 weight 32283 non-null float64\n",
	"dtypes: float64(2), int64(5), object(2)\n",
	"memory usage: 2.4+ MB\n"
	]
	}
	],
	"source": [
	"import pandas as pd\n",
	"\n",
	"surveys = pd.read_csv('data/surveys.csv')\n",
	"surveys.info()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"id": "76679521-dd3a-4c25-8733-e1324e34c056",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>species_id</th>\n",
	" <th>species_count</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>AB</td>\n",
	" <td>303</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>AH</td>\n",
	" <td>437</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>AS</td>\n",
	" <td>2</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>BA</td>\n",
	" <td>46</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>CB</td>\n",
	" <td>50</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" species_id species_count\n",
	"0 AB 303\n",
	"1 AH 437\n",
	"2 AS 2\n",
	"3 BA 46\n",
	"4 CB 50"
	]
	},
	"execution_count": 2,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"species_counts = surveys.groupby('species_id')['record_id'].count().reset_index(name='species_count')\n",
	"species_counts.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"id": "91770624-e5cd-483e-b77e-7f88d2b89a38",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"48"
	]
	},
	"execution_count": 3,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"len(species_counts)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"id": "f0df2442-6f85-48be-90f9-14019f3b9439",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['AB',\n",
	" 'AH',\n",
	" 'CB',\n",
	" 'DM',\n",
	" 'DO',\n",
	" 'DS',\n",
	" 'NL',\n",
	" 'OL',\n",
	" 'OT',\n",
	" 'PB',\n",
	" 'PE',\n",
	" 'PF',\n",
	" 'PM',\n",
	" 'PP',\n",
	" 'RF',\n",
	" 'RM',\n",
	" 'SA',\n",
	" 'SH',\n",
	" 'SS']"
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"big_species = species_counts[species_counts['species_count'] >= 50]['species_id'].to_list()\n",
	"big_species"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"id": "39afb560-4b63-45f7-b430-8e4e94271ee5",
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"<class 'pandas.core.frame.DataFrame'>\n",
	"Index: 30463 entries, 62 to 35547\n",
	"Data columns (total 9 columns):\n",
	" # Column Non-Null Count Dtype \n",
	"--- ------ -------------- ----- \n",
	" 0 record_id 30463 non-null int64 \n",
	" 1 month 30463 non-null int64 \n",
	" 2 day 30463 non-null int64 \n",
	" 3 year 30463 non-null int64 \n",
	" 4 plot_id 30463 non-null int64 \n",
	" 5 species_id 30463 non-null object \n",
	" 6 sex 30463 non-null object \n",
	" 7 hindfoot_length 30463 non-null float64\n",
	" 8 weight 30463 non-null float64\n",
	"dtypes: float64(2), int64(5), object(2)\n",
	"memory usage: 2.3+ MB\n"
	]
	}
	],
	"source": [
	"surveys_filtered = surveys[surveys['species_id'].isin(big_species)].dropna()\n",
	"surveys_filtered.info()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"id": "07037a8c-4382-4dcc-acd0-5df7c12e08a7",
	"metadata": {},
	"outputs": [],
	"source": [
	"surveys_filtered.to_csv('data/surveys_filtered.csv', index=False)"
	]
	},
	{
	"cell_type": "markdown",
	"id": "cd0284f8-db23-4e5e-bf3b-d0723d6c1ced",
	"metadata": {},
	"source": [
	"### Building your plots iteratively"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"id": "0f80e9cc-d8d3-415f-8f7a-7bd5aaf97927",
	"metadata": {},
	"outputs": [],
	"source": [
	"import altair as alt"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"id": "7c1cd3c9-08f9-4153-830d-0b9850a46f19",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"vegafusion.enable_widget()"
	]
	},
	"execution_count": 8,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"import vegafusion as vf\n",
	"vf.enable_widget()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"id": "048d037b-038d-4ae7-a43e-69b8677d7e73",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "5632dd7b9f494fedaaa08738aea0a48d",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"source = surveys.sample(50)\n",
	"alt.Chart(source).mark_circle().encode(x='weight', \n",
	" y='hindfoot_length')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"id": "6679aaa6-fb2e-4cc4-8bf8-8dfeab612c92",
	"metadata": {},
	"outputs": [],
	"source": [
	"url = 'https://gist.githubusercontent.com/MikeTrizna/cd01f9bf3e21d6f74823423bdb45a2f3/raw/2d8c36cf78c9b6abf6938451c60defc93c5911a4/surveys_filtered.csv'"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"id": "fde9ba13-dbb6-4d55-9b1a-0cf736cbccc4",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "eb83b76039a9412692b5434ae595d98c",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 10,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q', \n",
	" y='hindfoot_length:Q')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"id": "90307af2-04ae-40cc-b4ce-f28f75fc890f",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "d8ae54a6b1c64e1daaa1565dd2617f85",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 11,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"alt.Chart(surveys_filtered).mark_circle(opacity=0.1,\n",
	" color='red').encode(x='weight:Q', \n",
	" y='hindfoot_length:Q')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"id": "014cce42-8d78-4792-b31f-d75cfd21d7bc",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "923fc08ca95a493c81168e0cfa7499be",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 12,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(x='weight:Q', \n",
	" y='hindfoot_length:Q',\n",
	" color='species_id:N')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"id": "58b480c5-f33d-487f-9eb5-44f0095fb47e",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "d3f0768b7c0343c4bb4be543def79aad",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 13,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"# tooltip shows species_id on hover; interactive() enables pan and zoom\n",
	"alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(\n",
	"    x='weight:Q',\n",
	"    y='hindfoot_length:Q',\n",
	"    color='species_id:N',\n",
	"    tooltip='species_id:N'\n",
	").interactive()"
	]
	},
	{
	"cell_type": "markdown",
	"id": "6f7dc8eb-aee4-43b0-9bba-8e5f4f197bc3",
	"metadata": {},
	"source": [
	"### Faceting"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 14,
	"id": "f94f09cf-86c4-4d5f-8bbd-8157ee59b5f8",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "08d007387d3e4cb6a772d4f9090abd41",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 14,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"# facet draws one panel per value of sex\n",
	"alt.Chart(surveys_filtered).mark_circle(opacity=0.1).encode(\n",
	"    x='weight:Q',\n",
	"    y='hindfoot_length:Q',\n",
	"    facet='sex:N',\n",
	"    color='species_id:N'\n",
	")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "5a16cead-b8aa-443d-a46d-642d16970b01",
	"metadata": {},
	"source": [
	"### Boxplot"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 15,
	"id": "0c85df67-1213-4a30-8811-05f89dc1b01f",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "4b9af9e372b343e7b2ecde8995b2294d",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 15,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"alt.Chart(surveys_filtered).mark_boxplot().encode(\n",
	"    x='species_id:N',\n",
	"    y='weight:Q'\n",
	")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "8c77efa1-c0bb-4326-aabc-7f6681ebd893",
	"metadata": {},
	"source": [
	"**Challenge**\n",
	"\n",
	"Make a boxplot that shows the distribution of `hindfoot_length` values for each `plot_id`."
	]
	},
	{
	"cell_type": "markdown",
	"id": "d2c0250b-51e8-4fc5-ad71-dba779143652",
	"metadata": {},
	"source": [
	"### Built-in grouping"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 17,
	"id": "a1fb91bf-8c3e-4ddb-9f00-2cb147b0833a",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "a7957e11df4840d4ab6081e8504686b8",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 17,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"alt.Chart(surveys_filtered).mark_bar().encode(\n",
	"    x='plot_id:O',\n",
	"    y='count():Q',\n",
	"    color='sex:N'\n",
	")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 18,
	"id": "353b2b5f-9e8a-40b5-ac17-d9fd8499ed01",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "d394e54c42d94ad3b189299a2d3ec3a6",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 18,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"alt.Chart(surveys_filtered).mark_line().encode(\n",
	"    x='year:O',\n",
	"    y='count():Q',\n",
	"    color='species_id:N'\n",
	")"
	]
	},
	{
	"cell_type": "markdown",
	"id": "db8f46b4-95af-4de3-961e-e54095f9adb7",
	"metadata": {},
	"source": [
	"**Challenge**\n",
	"\n",
	"Make a bar plot showing the breakdown of `sex` values for each `species_id`."
	]
	},
	{
	"cell_type": "markdown",
	"id": "4f20c7cd-ceb4-47e9-8203-fa22ead838c2",
	"metadata": {},
	"source": [
	"### Crossfiltering"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 21,
	"id": "fd3f284e-bc1f-4788-9ab1-a5672207b14f",
	"metadata": {},
	"outputs": [
	{
	"data": {
	"application/vnd.jupyter.widget-view+json": {
	"model_id": "b9849f07eaa64bc4b65e34b591a3e037",
	"version_major": 2,
	"version_minor": 0
	},
	"text/plain": [
	"VegaFusionWidget(spec='{\\n \"config\": {\\n \"view\": {\\n \"continuousWidth\": 300,\\n \"continuousHeight…"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": []
	},
	"execution_count": 21,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"# Interval selection: click and drag on the scatter plot to select points\n",
	"brush = alt.selection_interval()\n",
	"\n",
	"points = alt.Chart(surveys_filtered).mark_point(opacity=0.1).encode(\n",
	"    x='weight:Q',\n",
	"    y='hindfoot_length:Q',\n",
	"    color=alt.condition(brush, 'species_id:N', alt.value('lightgray'))\n",
	").add_params(\n",
	"    brush\n",
	")\n",
	"\n",
	"# The bar chart counts only the points inside the current selection\n",
	"bars = alt.Chart(surveys_filtered).mark_bar().encode(\n",
	"    y='species_id:N',\n",
	"    color='species_id:N',\n",
	"    x='count(species_id):Q'\n",
	").transform_filter(\n",
	"    brush\n",
	")\n",
	"\n",
	"points & bars"
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3 (ipykernel)",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.10.11"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 5
	}