Skip to content

Instantly share code, notes, and snippets.

@lundeen-bryan
Last active March 25, 2024 19:57
Show Gist options
  • Save lundeen-bryan/32495d61ca9bfab82fb89bbe8d8e096c to your computer and use it in GitHub Desktop.
Save lundeen-bryan/32495d61ca9bfab82fb89bbe8d8e096c to your computer and use it in GitHub Desktop.
Boolean_Indexing_Masking_ChatGPT_Tutorial
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# UNDERSTANDING HOW BOOLEAN INDEXING WORKS\n",
"\n",
"Boolean Indexing or masking is a technique used to select a subset of the data based on some criteria. This is a difficult concept to grasp, but it is a powerful tool for data analysis. The following exercises were created by ChatGPT to help better understand how boolean indexing works."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"URL = \"https://gist.githubusercontent.com/sh7ata/e075ff35b51ebb0d2d577fbe1d19ebc9/raw/b966d02c7c26bcca60703acb1390e938a65a35cb/drinks.csv\"\n",
"df = pd.read_csv(URL)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1: Filtering Countries with Low Wine Servings\n",
"#### Task:\n",
"Filter the dataset to identify countries with wine servings below a certain threshold.\n",
"\n",
"#### Instructions:\n",
"1. Set a threshold for the minimum number of wine servings.\n",
"2. Use boolean indexing to create a mask that identifies countries with wine servings below the threshold.\n",
"3. Apply the mask to filter the dataset and display the countries with low wine servings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For clarity on why we use a threshold in the first two exercises:\n",
"\n",
"1. Define a threshold as a specific value used as a criterion or cutoff point for filtering data.\n",
"2. The threshold serves as a boundary or cutoff point for making decisions about which data points to include or exclude from further analysis.\n",
"3. It helps to define a specific range or subset of data that meets certain criteria, allowing for more focused and targeted analysis.\n",
"\n",
"We don't always have to use a threshold, but sometimes we just use some type of logic to determine which data points to include or exclude from further analysis."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Set the threshold for minimum wine servings\n",
"threshold = 50"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"I think the starting point here is to create a threashold value that we can test against in our filter. So for example, if we want to find wine servings below 50, we would set the threshold to 50.\n",
"\n",
"ChatGPT gives the following definition of a \"threshold\":\n",
"\n",
"A threshold is a value that is used to determine whether a value is \"above\" or \"below\" a certain value. For example, a threshold of 50 means that if a value is 50 or greater, it is considered to be \"above\" the threshold. If a value is less than 50, it is considered to be \"below\" the threshold...The threshold is a specific value or condition used to define the criteria for filtering the data. For example, in Exercise 1, the threshold could represent the minimum number of wine servings below which countries are considered to have low wine consumption."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Create a boolean mask for countries with wine servings below the threshold\n",
"low_wine_mask = df['wine_servings'] < threshold\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"Here we create a variable called low_wine_mask that is a boolean mask that identifies countries with wine servings below the threshold."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"4 True\n",
"Name: wine_servings, dtype: bool"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"low_wine_mask.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# Apply the mask to filter the dataset\n",
"countries_with_low_wine = df[low_wine_mask]"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The \"low_wine_mask\" datatype is: <class 'pandas.core.series.Series'>\n",
"\n",
"The \"countries_with_low_wine\" datatype is: <class 'pandas.core.frame.DataFrame'>\n"
]
}
],
"source": [
"# What is the data type of countries_with_low_wine?\n",
"print(\n",
" 'The \"low_wine_mask\" datatype is:', \n",
" type(low_wine_mask)\n",
")\n",
"print(\n",
" '\\nThe \"countries_with_low_wine\" datatype is:', \n",
" type(countries_with_low_wine)\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"So basically the mask is a series of True/False values that we can use to filter the data. Once filtered, then we can pull other columns from the dataset that match the boolean mask."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>country</th>\n",
" <th>beer_servings</th>\n",
" <th>spirit_servings</th>\n",
" <th>wine_servings</th>\n",
" <th>total_litres_of_pure_alcohol</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Algeria</td>\n",
" <td>25.0</td>\n",
" <td>NaN</td>\n",
" <td>14.0</td>\n",
" <td>0.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Angola</td>\n",
" <td>217.0</td>\n",
" <td>57.0</td>\n",
" <td>45.0</td>\n",
" <td>5.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Antigua &amp; Barbuda</td>\n",
" <td>12.0</td>\n",
" <td>128.0</td>\n",
" <td>45.0</td>\n",
" <td>4.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Armenia</td>\n",
" <td>21.0</td>\n",
" <td>179.0</td>\n",
" <td>11.0</td>\n",
" <td>3.8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Azerbaijan</td>\n",
" <td>21.0</td>\n",
" <td>46.0</td>\n",
" <td>5.0</td>\n",
" <td>1.3</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" country beer_servings spirit_servings wine_servings \\\n",
"2 Algeria 25.0 NaN 14.0 \n",
"4 Angola 217.0 57.0 45.0 \n",
"5 Antigua & Barbuda 12.0 128.0 45.0 \n",
"7 Armenia 21.0 179.0 11.0 \n",
"10 Azerbaijan 21.0 46.0 5.0 \n",
"\n",
" total_litres_of_pure_alcohol \n",
"2 0.7 \n",
"4 5.9 \n",
"5 4.9 \n",
"7 3.8 \n",
"10 1.3 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"countries_with_low_wine.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"I noticed that there are a few `NaN` values in the resulting dataframe. Since the focus of this exercise is to filter the dataset, we can safely ignore these `NaN` values."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Countries with wine servings below 50 :\n",
" country beer_servings spirit_servings wine_servings \\\n",
"2 Algeria 25.0 NaN 14.0 \n",
"4 Angola 217.0 57.0 45.0 \n",
"5 Antigua & Barbuda 12.0 128.0 45.0 \n",
"7 Armenia 21.0 179.0 11.0 \n",
"10 Azerbaijan 21.0 46.0 5.0 \n",
".. ... ... ... ... \n",
"187 Vanuatu 21.0 18.0 11.0 \n",
"188 Venezuela 333.0 1.0 3.0 \n",
"189 Vietnam 111.0 2.0 1.0 \n",
"191 Zambia 32.0 19.0 4.0 \n",
"192 Zimbabwe 64.0 18.0 4.0 \n",
"\n",
" total_litres_of_pure_alcohol \n",
"2 0.7 \n",
"4 5.9 \n",
"5 4.9 \n",
"7 3.8 \n",
"10 1.3 \n",
".. ... \n",
"187 0.9 \n",
"188 7.7 \n",
"189 2.0 \n",
"191 2.5 \n",
"192 4.7 \n",
"\n",
"[114 rows x 5 columns]\n"
]
}
],
"source": [
"# Display the filtered dataset\n",
"print(\n",
" \"Countries with wine servings below\", \n",
" threshold, \n",
" \":\\n\", \n",
" countries_with_low_wine\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2: Finding Countries with High Alcohol Consumption\n",
"#### Task:\n",
"Identify countries with high total alcohol consumption based on the sum of beer, spirit, and wine servings.\n",
"\n",
"#### Instructions:\n",
"1. Calculate the total alcohol consumption for each country by summing the servings of beer, spirit, and wine.\n",
"2. Set a threshold for the minimum total alcohol consumption.\n",
"3. Create a boolean mask to identify countries with total alcohol consumption above the threshold.\n",
"4. Apply the mask to filter the dataset and display the countries with high alcohol consumption.\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Calculate total alcohol consumption for each country\n",
"df['total_alcohol'] = df['beer_servings'] + df['spirit_servings'] + df['wine_servings']\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Set the threshold for minimum total alcohol consumption\n",
"threshold = 400\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"Again we'll set a threshold value that we can test against in our filter. "
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"# Create a boolean mask for countries with total alcohol consumption above the threshold\n",
"high_alcohol_mask = df['total_alcohol'] > threshold\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# Apply the mask to filter the dataset\n",
"countries_with_high_alcohol = df[high_alcohol_mask]\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Countries with total alcohol consumption above 400 :\n",
" country beer_servings spirit_servings wine_servings \\\n",
"3 Andorra 245.0 138.0 312.0 \n",
"6 Argentina 193.0 25.0 221.0 \n",
"8 Australia 261.0 72.0 212.0 \n",
"9 Austria 279.0 75.0 191.0 \n",
"15 Belarus 142.0 373.0 42.0 \n",
"16 Belgium 295.0 84.0 212.0 \n",
"23 Brazil 245.0 145.0 16.0 \n",
"25 Bulgaria 231.0 252.0 94.0 \n",
"44 Cyprus 192.0 154.0 113.0 \n",
"45 Czech Republic 361.0 17.0 134.0 \n",
"48 Denmark 224.0 81.0 278.0 \n",
"57 Estonia 224.0 194.0 59.0 \n",
"60 Finland 263.0 133.0 97.0 \n",
"62 Gabon 347.0 98.0 59.0 \n",
"65 Germany 346.0 117.0 175.0 \n",
"67 Greece 133.0 112.0 218.0 \n",
"68 Grenada 199.0 438.0 28.0 \n",
"75 Hungary 234.0 215.0 185.0 \n",
"81 Ireland 313.0 118.0 165.0 \n",
"93 Latvia 281.0 216.0 62.0 \n",
"98 Lithuania 343.0 244.0 56.0 \n",
"99 Luxembourg 236.0 133.0 271.0 \n",
"132 Paraguay 213.0 117.0 74.0 \n",
"135 Poland 343.0 215.0 56.0 \n",
"136 Portugal 194.0 67.0 339.0 \n",
"140 Romania 297.0 122.0 167.0 \n",
"141 Russian Federation 247.0 326.0 73.0 \n",
"144 St. Lucia 171.0 315.0 71.0 \n",
"151 Serbia 283.0 131.0 127.0 \n",
"155 Slovakia 196.0 293.0 116.0 \n",
"160 Spain 284.0 157.0 112.0 \n",
"182 United Kingdom 219.0 126.0 195.0 \n",
"184 USA 249.0 158.0 84.0 \n",
"\n",
" total_litres_of_pure_alcohol total_alcohol \n",
"3 12.4 695.0 \n",
"6 8.3 439.0 \n",
"8 1.4 545.0 \n",
"9 9.7 545.0 \n",
"15 14.4 557.0 \n",
"16 1.5 591.0 \n",
"23 7.2 406.0 \n",
"25 1.3 577.0 \n",
"44 8.2 459.0 \n",
"45 11.8 512.0 \n",
"48 1.4 583.0 \n",
"57 9.5 477.0 \n",
"60 1.0 493.0 \n",
"62 8.9 504.0 \n",
"65 11.3 638.0 \n",
"67 8.3 463.0 \n",
"68 11.9 665.0 \n",
"75 11.3 634.0 \n",
"81 11.4 596.0 \n",
"93 1.5 559.0 \n",
"98 12.9 643.0 \n",
"99 11.4 640.0 \n",
"132 7.3 404.0 \n",
"135 1.9 614.0 \n",
"136 11.0 600.0 \n",
"140 1.4 586.0 \n",
"141 11.5 646.0 \n",
"144 1.1 557.0 \n",
"151 9.6 541.0 \n",
"155 11.4 605.0 \n",
"160 1.0 553.0 \n",
"182 1.4 540.0 \n",
"184 8.7 491.0 \n"
]
}
],
"source": [
"# Display the filtered dataset\n",
"print(\"Countries with total alcohol consumption above\", threshold, \":\\n\", countries_with_high_alcohol)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"Note that we are only using alcohol consumption as a measurement for all other columns. It looks like columns of data in a dataset are called 'measures' or 'features' in the context of this exercise. For comparison, Power BI uses the term 'measures' to refer to columns of data that are used to define a 'feature' in a Python dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise Three: Filtering Countries with High Spirit Servings\n",
"\n",
"**Objective:** Use boolean indexing to filter countries with high spirit servings.\n",
"\n",
"**Instructions:**\n",
"\n",
"1. Read the provided dataset into a DataFrame.\n",
"2. Create a boolean mask to identify countries with spirit servings above 200.\n",
"3. Use the boolean mask to filter the DataFrame and create a new DataFrame containing only the countries with high spirit servings.\n",
"4. Display the filtered DataFrame to view the countries with high spirit servings.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a boolean mask for countries with spirit servings above 200\n",
"high_spirit_mask = df['spirit_servings'] > 200\n",
"\n",
"# Filter the DataFrame using the boolean mask\n",
"countries_with_high_spirit = df[high_spirit_mask]\n",
"\n",
"# Display the filtered DataFrame\n",
"print(\"Countries with High Spirit Servings:\")\n",
"print(countries_with_high_spirit)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"I asked ChatGPT for a third exercise to help clarify how boolean indexing works but without applying a threshold. Instead I asked for a logical test to show only countries that did not have a null value in the `beer_servings` column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 3: Filtering Out Null `beer_servings`\n",
"\n",
"**SHOW COUNTRIES THAT ALWAYS SERVE BEER**\n",
"\n",
"**Objective**: Apply boolean indexing to exclude countries with null (`NaN`) values in the `beer_servings` column, using the dataset from the previous exercises.\n",
"\n",
"**Instructions**:\n",
"1. First, check for null values in the `beer_servings` column to understand how many entries are missing.\n",
"2. Create a boolean mask that identifies rows where `beer_servings` is not null.\n",
"3. Apply this mask to filter the dataset, excluding any rows with null `beer_servings`.\n",
"\n",
"**Explanation**: \n",
"- This exercise demonstrates handling missing data, a common task in data preprocessing. \n",
"- The `.isnull()` method creates a boolean series marking null entries, while `.notnull()` does the opposite. \n",
"- Filtering with the `not_null_mask` ensures our analysis includes only countries with recorded `beer_servings` data."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Null entries in the \"beer_servings\" column: 0 True\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"188 False\n",
"189 False\n",
"190 False\n",
"191 False\n",
"192 False\n",
"Name: beer_servings, Length: 193, dtype: bool\n"
]
}
],
"source": [
"null_mask = df['beer_servings'].isnull()\n",
"print(\n",
" 'Null entries in the \"beer_servings\" column:', \n",
" null_mask\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entries in the \"beer_servings\" column that are not null: 0 False\n",
"1 True\n",
"2 True\n",
"3 True\n",
"4 True\n",
" ... \n",
"188 True\n",
"189 True\n",
"190 True\n",
"191 True\n",
"192 True\n",
"Name: beer_servings, Length: 193, dtype: bool\n"
]
}
],
"source": [
"not_null_mask = df['beer_servings'].notnull()\n",
"print(\n",
" 'Entries in the \"beer_servings\" column that are not null:', \n",
" not_null_mask\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Countries that always serve beer:\n",
" country beer_servings spirit_servings wine_servings \\\n",
"1 Albania 89.0 132.0 54.0 \n",
"2 Algeria 25.0 NaN 14.0 \n",
"3 Andorra 245.0 138.0 312.0 \n",
"4 Angola 217.0 57.0 45.0 \n",
"5 Antigua & Barbuda 12.0 128.0 45.0 \n",
".. ... ... ... ... \n",
"188 Venezuela 333.0 1.0 3.0 \n",
"189 Vietnam 111.0 2.0 1.0 \n",
"190 Yemen 6.0 NaN NaN \n",
"191 Zambia 32.0 19.0 4.0 \n",
"192 Zimbabwe 64.0 18.0 4.0 \n",
"\n",
" total_litres_of_pure_alcohol total_alcohol \n",
"1 4.9 275.0 \n",
"2 0.7 NaN \n",
"3 12.4 695.0 \n",
"4 5.9 319.0 \n",
"5 4.9 185.0 \n",
".. ... ... \n",
"188 7.7 337.0 \n",
"189 2.0 114.0 \n",
"190 0.1 NaN \n",
"191 2.5 55.0 \n",
"192 4.7 86.0 \n",
"\n",
"[178 rows x 6 columns]\n"
]
}
],
"source": [
"countries_that_always_serve_beer = df[not_null_mask]\n",
"print(\n",
" 'Countries that always serve beer:\\n', \n",
" countries_that_always_serve_beer\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Personal Comments**:\n",
"\n",
"Although I switched from GPT3.5 to GPT4, the model had a very hard time correctly creating the exercise and even more difficulty solving it. I had to solve it myself. For example it started using very generic naming conventions such as calling the final resulting dataframe 'filtered_data' which is a naming convention that I would expect from a junior analyst. ChatGPT should be able to natively consider generic naming conventions as antithetical to a python project. It causes a lot of problem. I tried to reiterate multiple times before giving up and writing the code and solution myself. \n",
"\n",
"Will I remember that I did this a month from now? probably not... LOL. 😂"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.8"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment