pimlock/loading-data-into-dataframe.ipynb

## loading-data-into-dataframe.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "loading-data-into-dataframe.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.9"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/pimlock/91e0021e53f24ea7e1244829e5235712/loading-data-into-dataframe.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UTRxpSlaczHY"
      },
      "source": [
        "# How to load (almost) anything into a DataFrame - examples\n",
        "\n",
        "This notebook is a companion to our blog post on getting data into pandas' DataFrame: https://gretel.ai/blog/a-guide-to-load-almost-anything-into-a-dataframe\n",
        "\n",
        "---\n",
        "\n",
        "**NOTE: when running this notebook in Google Colab, you will need to restart the runtime after new version `pandas` is installed in the first cell.** See this [StackOverflow question](https://stackoverflow.com/questions/53085501/how-to-upgrade-pandas-on-google-colab/53085629) for more details."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AE96QZBNBntN"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e5dBghoGBntN"
      },
      "source": [
        "### Dependencies\n",
        "\n",
        "First, we will install dependencies. Apart from Pandas itself, we are installing optional dependencies. This is required, because we will be using features that are not included with Pandas itself, like e.g. reading from S3. \n",
        "\n",
        "If these dependencies are not installed and you use operation that requires them, Pandas will fail with an error message that will indicate what's the missing package. For example:\n",
        "\n",
        "```\n",
        "Missing optional dependency 'fsspec'.  Use pip or conda to install fsspec.\n",
        "```\n",
        "\n",
        "More information on optional dependencies that Pandas is using can be found in their [documentation](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html#optional-dependencies)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "U8P5xLalBntP"
      },
      "source": [
        "%%capture\n",
        "\n",
        "!pip install -U pandas~=1.2.4\n",
        "\n",
        "# fsspec package is required for downloading data from remote locations.\n",
        "#  Using '[s3]' extra enables downloading from S3.\n",
        "!pip install -U 'fsspec[s3]'\n",
        "\n",
        "# required only when reading data from HTML\n",
        "!pip install -U BeautifulSoup4 html5lib lxml\n",
        "\n",
        "# required only when reading Excel files\n",
        "!pip install -U xlrd~=1.2.0\n",
        "\n",
        "# required only when reading Parquet files\n",
        "!pip install -U pyarrow\n",
        "\n",
        "# plotly is used for simple visualizations.\n",
        "!pip install -U plotly"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pJLjOxlmF4iG"
      },
      "source": [
        "# verify pandas version (this notebook requires pandas=1.2.0 or higher)\n",
        "import pandas as pd\n",
        "\n",
        "# pd.show_versions()\n",
        "\n",
        "if pd.__version__ < \"1.2.0\":\n",
        "  raise Exception(\"This notebook requires pandas version to be > 1.2.0. \"\n",
        "    f\"pandas version in this notebook is {pd.__version__}. \"\n",
        "    \"If you run into this issue, you will need to restart the runtime \"\n",
        "    \"selecting 'Runtime -> Restart and run all'\"\n",
        "  )"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bzbX9WwZBntP"
      },
      "source": [
        "### Logging\n",
        "\n",
        "In this section, you can adjust logging level for `botocore` or `fsspec` to see into what's going on behind the scenes. \n",
        "I found it useful when debugging some issues with getting access to S3 objects."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "pycharm": {
          "name": "#%%\n"
        },
        "id": "28R79_NZBntP"
      },
      "source": [
        "import logging, sys\n",
        "\n",
        "logging.basicConfig(\n",
        "    format=\"%(asctime)s.%(msecs)03dZ - %(levelname)s - %(name)s - %(message)s\",\n",
        "    datefmt=\"%Y-%m-%dT%H:%M:%S\",\n",
        "    level=logging.INFO,\n",
        "    stream=sys.stdout\n",
        ")\n",
        "\n",
        "logger = logging.getLogger(__name__)\n",
        "\n",
        "logging.getLogger('botocore').setLevel(logging.INFO)\n",
        "logging.getLogger('fsspec').setLevel(logging.DEBUG)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WMAsq5hmBntQ"
      },
      "source": [
        "### Temporary files\n",
        "\n",
        "Here I'm setting up a directory for temporary files used throughout this notebook. There is one CSV file with sample data downloaded into that dir."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rnbm2j9-BntQ",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "d282d5f8-e029-4f0a-fd00-2e2c9e7adcd5"
      },
      "source": [
        "from tempfile import mkdtemp\n",
        "from urllib.request import urlretrieve\n",
        "\n",
        "tmp_dir = mkdtemp()\n",
        "\n",
        "# CSV to load from disk\n",
        "sample_csv_path = f\"{tmp_dir}/sample.csv\"\n",
        "urlretrieve(\"https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/2016-olympics/2016-olympics-medals.csv\", sample_csv_path)\n",
        "logger.info(f\"Downloaded sample CSV file to {sample_csv_path}\")\n",
        "\n",
        "# SQLite DB file to use with SQL examples\n",
        "sample_sqlite_db_path = f\"{tmp_dir}/sample.db\"\n",
        "urlretrieve(\"https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/2016-olympics/2016-olympics-medals.db\", sample_sqlite_db_path)\n",
        "logger.info(f\"Downloaded sample SQLite DB file to {sample_sqlite_db_path}\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-05-12T19:38:23.645Z - INFO - __main__ - Downloaded sample CSV file to /tmp/tmped1hp597/sample.csv\n",
            "2021-05-12T19:38:24.016Z - INFO - __main__ - Downloaded sample SQLite DB file to /tmp/tmped1hp597/sample.db\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oXJ7RA6JBntR"
      },
      "source": [
        "---\n",
        "\n",
        "## Examples\n",
        "\n",
        "Ok, now that we have everything set up, let's run through some examples!\n",
        "\n",
        "For simplicity, all of the examples will be using the same dataset, but stored in different format."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jE0SUJ-XBntR"
      },
      "source": [
        "### Example 1: Loading CSV data from the filesystem\n",
        "\n",
        "As simple as it gets - I have a CSV file on my disk and want to load it."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AOuEE7hIBntR",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "fcf8807c-f939-408f-afea-39f20d2a66c9"
      },
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "\n",
        "df = pd.read_csv(\n",
        "    sample_csv_path, \n",
        "    # if your dataset doesn't have column names in the first row, you need to specify them like this\n",
        "    header=0, names=[\"Rank\", \"NOC\", \"Gold\", \"Silver\", \"Bronze\", \"Total\"]\n",
        ")\n",
        "\n",
        "df.info()\n",
        "\n",
        "# after loading, we can explore the data in all different ways pandas support\n",
        "df = df.sort_values('Bronze', ascending=False)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 86 entries, 0 to 85\n",
            "Data columns (total 6 columns):\n",
            " #   Column  Non-Null Count  Dtype \n",
            "---  ------  --------------  ----- \n",
            " 0   Rank    86 non-null     int64 \n",
            " 1   NOC     86 non-null     object\n",
            " 2   Gold    86 non-null     int64 \n",
            " 3   Silver  86 non-null     int64 \n",
            " 4   Bronze  86 non-null     int64 \n",
            " 5   Total   86 non-null     int64 \n",
            "dtypes: int64(5), object(1)\n",
            "memory usage: 4.2+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UpAqUAM1BntS"
      },
      "source": [
        "### Example 2. Loading CSV from remote location\n",
        "\n",
        "Let's explore few ways you can load data from remote location like S3. Other types of remote storage will work in a very similar way.\n",
        "\n",
        "**Note**: all of the S3 objects we use in this example are public, so we don't need AWS credentials to access them. When you're accessing your data in S3 you will likely need to have AWS credentials configured before accessing S3 objects.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pZugv1U0BntS",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "cb0fc61c-23a0-4f17-f548-69821e02f6ee"
      },
      "source": [
        "# 1. download file from S3\n",
        "df_1 = pd.read_csv(\n",
        "    \"s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv\",\n",
        "    storage_options={\"anon\": True},\n",
        ")\n",
        "\n",
        "# 2. cache the file locally and then read it \n",
        "#  - any subsequent calls will use cached version (until it expires)\n",
        "cache_dir = f\"{tmp_dir}/cache\"\n",
        "logger.info(f\"Setting caching dir for fsspec to {cache_dir}\")\n",
        "\n",
        "df_2 = pd.read_csv(\n",
        "    \"filecache::s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv\",\n",
        "    storage_options={\n",
        "        \"s3\": {\"anon\": True},\n",
        "        \"filecache\": {\"cache_storage\": cache_dir}\n",
        "    },\n",
        ")\n",
        "\n",
        "# 3. read compressed file - it will be decompressed on the fly\n",
        "df_3 = pd.read_csv(\n",
        "    \"s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz\",\n",
        "    # Note: by default compression is set to \"infer\", which infers compression type from the filename\n",
        "    compression=\"gzip\",\n",
        "    storage_options={\"anon\": True},\n",
        ")\n",
        "\n",
        "# 4. trying everything\n",
        "#   - read compressed file, \"infer\" the compression\n",
        "#   - cache the file locally\n",
        "\n",
        "df_4 = pd.read_csv(\n",
        "    \"filecache::s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\",\n",
        "    storage_options={\"s3\": {\"anon\": True}},\n",
        ")\n",
        "\n",
        "assert df_1.equals(df_2) and df_2.equals(df_3) and df_3.equals(df_4)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-05-12T19:38:25.071Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv> read: 0 - 262144\n",
            "2021-05-12T19:38:25.172Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv> read: 2343 - 262144\n",
            "2021-05-12T19:38:25.174Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv> read: 2343 - 264487\n",
            "2021-05-12T19:38:25.181Z - INFO - __main__ - Setting caching dir for fsspec to /tmp/tmped1hp597/cache\n",
            "2021-05-12T19:38:25.287Z - DEBUG - fsspec - Copying gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv to local cache\n",
            "2021-05-12T19:38:25.390Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv\n",
            "2021-05-12T19:38:25.491Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 0 - 2\n",
            "2021-05-12T19:38:25.612Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 2 - 10\n",
            "2021-05-12T19:38:25.614Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 10 - 11\n",
            "2021-05-12T19:38:25.616Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 11 - 12\n",
            "2021-05-12T19:38:25.618Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 12 - 13\n",
            "2021-05-12T19:38:25.619Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 13 - 14\n",
            "2021-05-12T19:38:25.621Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 14 - 15\n",
            "2021-05-12T19:38:25.623Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 15 - 16\n",
            "2021-05-12T19:38:25.625Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 16 - 17\n",
            "2021-05-12T19:38:25.627Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 17 - 18\n",
            "2021-05-12T19:38:25.628Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 18 - 19\n",
            "2021-05-12T19:38:25.630Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 19 - 20\n",
            "2021-05-12T19:38:25.632Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 20 - 21\n",
            "2021-05-12T19:38:25.633Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 21 - 22\n",
            "2021-05-12T19:38:25.635Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 22 - 23\n",
            "2021-05-12T19:38:25.637Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 23 - 24\n",
            "2021-05-12T19:38:25.639Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 24 - 25\n",
            "2021-05-12T19:38:25.640Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 25 - 26\n",
            "2021-05-12T19:38:25.642Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 26 - 27\n",
            "2021-05-12T19:38:25.644Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 27 - 28\n",
            "2021-05-12T19:38:25.645Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 28 - 29\n",
            "2021-05-12T19:38:25.647Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 29 - 30\n",
            "2021-05-12T19:38:25.649Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 30 - 31\n",
            "2021-05-12T19:38:25.650Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 31 - 32\n",
            "2021-05-12T19:38:25.652Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 32 - 33\n",
            "2021-05-12T19:38:25.653Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 33 - 34\n",
            "2021-05-12T19:38:25.655Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 34 - 35\n",
            "2021-05-12T19:38:25.657Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 35 - 8227\n",
            "2021-05-12T19:38:25.658Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 1316 - 1317\n",
            "2021-05-12T19:38:25.660Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 1316 - 1318\n",
            "2021-05-12T19:38:25.662Z - DEBUG - fsspec - <File-like object S3FileSystem, gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz> read: 1316 - 1318\n",
            "2021-05-12T19:38:25.790Z - DEBUG - fsspec - Copying gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip to local cache\n",
            "2021-05-12T19:38:25.911Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4YGMOQKzErgQ",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "0c1fb033-77f4-4bf0-8e4c-def9a95cdf48"
      },
      "source": [
        "%%timeit -n 5 -r 1\n",
        "\n",
        "# You can see for yourself that the file is not downloaded again after it's cached.\n",
        "pd.read_csv(\n",
        "    \"filecache::s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\",\n",
        "    storage_options={\"s3\": {\"anon\": True}},\n",
        ")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-05-12T19:38:25.944Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\n",
            "2021-05-12T19:38:25.966Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\n",
            "2021-05-12T19:38:25.982Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\n",
            "2021-05-12T19:38:25.991Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\n",
            "2021-05-12T19:38:26.002Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\n",
            "5 loops, best of 1: 13.6 ms per loop\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QaeVLzTZaYuD"
      },
      "source": [
        "### Example 3. Loading data using various readers\n",
        "\n",
        "This section contains examples of loading data using some of the most popular readers that Pandas support."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hwJa5hEThczV"
      },
      "source": [
        "#### JSON"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZADZlgtea4u0",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "a5e1297c-ee6b-4804-e99c-56f035937f05"
      },
      "source": [
        "# JSONL (aka new line delimited JSON)\n",
        "# Each line in this file is a separate JSON object.\n",
        "\n",
        "_STORAGE_OPTIONS = {\"s3\": {\"anon\": True}}\n",
        "\n",
        "df_json_lines = pd.read_json(\n",
        "    \"filecache::s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.jsonl\",\n",
        "    # this tells pandas to treat each line as a separate JSON object\n",
        "    lines=True,\n",
        "    storage_options=_STORAGE_OPTIONS\n",
        ")\n",
        "df_json_lines.info()\n",
        "\n",
        "# JSON\n",
        "# In this format, the whole file is a JSON object. Because of that, the data\n",
        "#  we want to load will be nested in one of the keys of that top-level object.\n",
        "\n",
        "# We need to use fsspec directly here, as we need to load data as JSON first\n",
        "#  before we load it into pandas.\n",
        "import fsspec\n",
        "import json\n",
        "\n",
        "with fsspec.open(\n",
        "    \"filecache::s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.json\",\n",
        "    **_STORAGE_OPTIONS\n",
        ") as f:\n",
        "    json_data = json.load(f)\n",
        "\n",
        "df_json = pd.json_normalize(json_data, record_path=\"Countries\")\n",
        "df_json.info()\n",
        "\n",
        "assert df_json_lines.equals(df_json)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-05-12T19:38:26.131Z - DEBUG - fsspec - Copying gretel-public-website/datasets/2016-olympics/2016-olympics-medals.jsonl to local cache\n",
            "2021-05-12T19:38:26.257Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.jsonl\n",
            "2021-05-12T19:38:26.271Z - INFO - numexpr.utils - NumExpr defaulting to 2 threads.\n",
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 86 entries, 0 to 85\n",
            "Data columns (total 6 columns):\n",
            " #   Column  Non-Null Count  Dtype \n",
            "---  ------  --------------  ----- \n",
            " 0   Rank    86 non-null     int64 \n",
            " 1   NOC     86 non-null     object\n",
            " 2   Gold    86 non-null     int64 \n",
            " 3   Silver  86 non-null     int64 \n",
            " 4   Bronze  86 non-null     int64 \n",
            " 5   Total   86 non-null     int64 \n",
            "dtypes: int64(5), object(1)\n",
            "memory usage: 4.2+ KB\n",
            "2021-05-12T19:38:26.392Z - DEBUG - fsspec - Copying gretel-public-website/datasets/2016-olympics/2016-olympics-medals.json to local cache\n",
            "2021-05-12T19:38:26.532Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.json\n",
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 86 entries, 0 to 85\n",
            "Data columns (total 6 columns):\n",
            " #   Column  Non-Null Count  Dtype \n",
            "---  ------  --------------  ----- \n",
            " 0   Rank    86 non-null     int64 \n",
            " 1   NOC     86 non-null     object\n",
            " 2   Gold    86 non-null     int64 \n",
            " 3   Silver  86 non-null     int64 \n",
            " 4   Bronze  86 non-null     int64 \n",
            " 5   Total   86 non-null     int64 \n",
            "dtypes: int64(5), object(1)\n",
            "memory usage: 4.2+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GdYDKRw3hf0r"
      },
      "source": [
        "#### Excel"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XusMSJVmhEpq",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "51628f5d-db25-4e93-8312-2078ef336d53"
      },
      "source": [
        "# Excel\n",
        "\n",
        "df_excel = pd.read_excel(\n",
        "    \"filecache::s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.xls\",\n",
        "    # sheet_name=\"Medals\",\n",
        "    storage_options=_STORAGE_OPTIONS\n",
        ")\n",
        "df_excel.info()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-05-12T19:38:26.687Z - DEBUG - fsspec - Copying gretel-public-website/datasets/2016-olympics/2016-olympics-medals.xls to local cache\n",
            "2021-05-12T19:38:26.801Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.xls\n",
            "2021-05-12T19:38:26.804Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.xls\n",
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 86 entries, 0 to 85\n",
            "Data columns (total 6 columns):\n",
            " #   Column  Non-Null Count  Dtype \n",
            "---  ------  --------------  ----- \n",
            " 0   Rank    86 non-null     int64 \n",
            " 1   NOC     86 non-null     object\n",
            " 2   Gold    86 non-null     int64 \n",
            " 3   Silver  86 non-null     int64 \n",
            " 4   Bronze  86 non-null     int64 \n",
            " 5   Total   86 non-null     int64 \n",
            "dtypes: int64(5), object(1)\n",
            "memory usage: 4.2+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NxoB7T36hrX3"
      },
      "source": [
        "#### Parquet"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NXOfLteuhqjy",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "3909f7b2-b861-4e27-fff6-5d029faecc41"
      },
      "source": [
        "# Parquet\n",
        "\n",
        "df_parquet = pd.read_parquet(\n",
        "    \"filecache::s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.snappy.parquet\",\n",
        "    storage_options=_STORAGE_OPTIONS\n",
        ")\n",
        "df_parquet.info()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-05-12T19:38:27.194Z - DEBUG - fsspec - Copying gretel-public-website/datasets/2016-olympics/2016-olympics-medals.snappy.parquet to local cache\n",
            "2021-05-12T19:38:27.293Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.snappy.parquet\n",
            "2021-05-12T19:38:27.398Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.snappy.parquet\n",
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 86 entries, 0 to 85\n",
            "Data columns (total 6 columns):\n",
            " #   Column  Non-Null Count  Dtype \n",
            "---  ------  --------------  ----- \n",
            " 0   Rank    86 non-null     int64 \n",
            " 1   NOC     86 non-null     object\n",
            " 2   Gold    86 non-null     int64 \n",
            " 3   Silver  86 non-null     int64 \n",
            " 4   Bronze  86 non-null     int64 \n",
            " 5   Total   86 non-null     int64 \n",
            "dtypes: int64(5), object(1)\n",
            "memory usage: 4.2+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1HQzfhVwhtdr"
      },
      "source": [
        "#### SQL"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ak-rg6sShu3z",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "db2a9cc1-8b89-4943-963c-4a4eb0a22cf0"
      },
      "source": [
        "# For simplicity, we are reading data from SQLite.\n",
        "#  Pandas supports SQLAlchemy to read from most common SQL databases.\n",
        "\n",
        "import sqlite3\n",
        "con = sqlite3.connect(sample_sqlite_db_path)\n",
        "\n",
        "try:\n",
        "  df_sql = pd.read_sql_query(\"SELECT * FROM medals\", con)\n",
        "\n",
        "  df_sql.info()\n",
        "finally:\n",
        "  con.close()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 86 entries, 0 to 85\n",
            "Data columns (total 6 columns):\n",
            " #   Column  Non-Null Count  Dtype \n",
            "---  ------  --------------  ----- \n",
            " 0   Rank    86 non-null     int64 \n",
            " 1   NOC     86 non-null     object\n",
            " 2   Gold    86 non-null     int64 \n",
            " 3   Silver  86 non-null     int64 \n",
            " 4   Bronze  86 non-null     int64 \n",
            " 5   Total   86 non-null     int64 \n",
            "dtypes: int64(5), object(1)\n",
            "memory usage: 4.2+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nvxX4guqEYdx"
      },
      "source": [
        "#### Compressed files\n",
        "\n",
        "You can point Pandas to a compressed file as well! It will automatically detect the compression algorithm that was used and decompress it on the fly.\n",
        "\n",
        "It works with any of the `read_...` methods."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "H9SoSbtnEvvk",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "439c674c-e9a6-4edc-b195-ce228064065f"
      },
      "source": [
        "# GZIP\n",
        "\n",
        "df_gzip = pd.read_csv(\n",
        "    \"filecache::s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz\",\n",
        "    storage_options=_STORAGE_OPTIONS\n",
        ")\n",
        "df_gzip.info()\n",
        "\n",
        "# ZIP (has to have a single file inside the archive)\n",
        "\n",
        "df_zip = pd.read_csv(\n",
        "    \"filecache::s3://gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\",\n",
        "    storage_options=_STORAGE_OPTIONS\n",
        ")\n",
        "df_zip.info()\n",
        "\n",
        "assert df_gzip.equals(df_zip)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-05-12T19:38:27.740Z - DEBUG - fsspec - Copying gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz to local cache\n",
            "2021-05-12T19:38:27.842Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.gz\n",
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 86 entries, 0 to 85\n",
            "Data columns (total 6 columns):\n",
            " #   Column  Non-Null Count  Dtype \n",
            "---  ------  --------------  ----- \n",
            " 0   Rank    86 non-null     int64 \n",
            " 1   NOC     86 non-null     object\n",
            " 2   Gold    86 non-null     int64 \n",
            " 3   Silver  86 non-null     int64 \n",
            " 4   Bronze  86 non-null     int64 \n",
            " 5   Total   86 non-null     int64 \n",
            "dtypes: int64(5), object(1)\n",
            "memory usage: 4.2+ KB\n",
            "2021-05-12T19:38:27.862Z - DEBUG - fsspec - Opening local copy of gretel-public-website/datasets/2016-olympics/2016-olympics-medals.csv.zip\n",
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 86 entries, 0 to 85\n",
            "Data columns (total 6 columns):\n",
            " #   Column  Non-Null Count  Dtype \n",
            "---  ------  --------------  ----- \n",
            " 0   Rank    86 non-null     int64 \n",
            " 1   NOC     86 non-null     object\n",
            " 2   Gold    86 non-null     int64 \n",
            " 3   Silver  86 non-null     int64 \n",
            " 4   Bronze  86 non-null     int64 \n",
            " 5   Total   86 non-null     int64 \n",
            "dtypes: int64(5), object(1)\n",
            "memory usage: 4.2+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZU7kVc2wBntT"
      },
      "source": [
        "### Example 4. Loading data from a website\n",
        "\n",
        "This one is one of my most favorite ones. I came across that table with medal counts on the [Wikipedia about 2016 Olympics](https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table). I wanted to see how all these countries compare to each other visually.\n",
        "\n",
        "Before, I would try to copy that data into a text editor, make it into a proper CSV and then load that into a `DataFrame`. Turns out, Pandas can read data from tables directly from the HTML!\n",
        "\n",
        "That's very handy for ad-hoc, quick analysis. Bear in mind, it only works for small tables (who would have a huge table on their website anyway?!) and parsing is relatively slow."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "xMm6kjtmBntT",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 667
        },
        "outputId": "d7871b17-8e05-42bc-836d-a4c9129cc292"
      },
      "source": [
        "df_list = pd.read_html(\n",
        "    \"https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table\", \n",
        "    match=\"2016 Summer Olympics medal table\"\n",
        ")\n",
        "\n",
        "# Note - read_html() returns list of DataFrames!\n",
        "df = df_list[0]\n",
        "\n",
        "# Just for fun, let's create a simple visualization\n",
        "\n",
        "import plotly.express as px\n",
        "\n",
        "# Take first 35 countries and reverse (this way plotly will render the best country first)\n",
        "plot_data = df.head(35)[::-1]\n",
        "\n",
        "medals = px.bar(plot_data, x=[\"Gold\", \"Silver\", \"Bronze\"], y=\"NOC\")\n",
        "medals.update_layout(\n",
        "    title_text=\"2016 Olympics results\",\n",
        "    xaxis_title_text=\"Medals won\",\n",
        "    yaxis_title_text=\"\",\n",
        "    legend_title_text=\"\",\n",
        "    height=650\n",
        ")\n",
        "medals.update_layout()\n",
        "medals.update_traces(marker_color=\"#FFD700\", selector=0)\n",
        "medals.update_traces(marker_color=\"#C0C0C0\", selector=1)\n",
        "medals.update_traces(marker_color=\"#CD7F32\", selector=2)\n",
        "medals.show()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<html>\n",
              "<head><meta charset=\"utf-8\" /></head>\n",
              "<body>\n",
              "    <div>            <script src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-AMS-MML_SVG\"></script><script type=\"text/javascript\">if (window.MathJax) {MathJax.Hub.Config({SVG: {font: \"STIX-Web\"}});}</script>                <script type=\"text/javascript\">window.PlotlyConfig = {MathJaxConfig: 'local'};</script>\n",
              "        <script src=\"https://cdn.plot.ly/plotly-latest.min.js\"></script>                <div id=\"31a33c8c-f95a-41b0-98a3-07beb58c615c\" class=\"plotly-graph-div\" style=\"height:650px; width:100%;\"></div>            <script type=\"text/javascript\">                                    window.PLOTLYENV=window.PLOTLYENV || {};                                    if (document.getElementById(\"31a33c8c-f95a-41b0-98a3-07beb58c615c\")) {                    Plotly.newPlot(                        \"31a33c8c-f95a-41b0-98a3-07beb58c615c\",                        [{\"alignmentgroup\": \"True\", \"hovertemplate\": \"variable=Gold<br>value=%{x}<br>NOC=%{y}<extra></extra>\", \"legendgroup\": \"Gold\", \"marker\": {\"color\": \"#FFD700\"}, \"name\": \"Gold\", \"offsetgroup\": \"Gold\", \"orientation\": \"h\", \"showlegend\": true, \"textposition\": \"auto\", \"type\": \"bar\", \"x\": [2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 10, 12, 17, 19, 26, 27, 46], \"xaxis\": \"x\", \"y\": [\"Belgium\\u00a0(BEL)\", \"North Korea\\u00a0(PRK)\", \"Poland\\u00a0(POL)\", \"Serbia\\u00a0(SRB)\", \"Ukraine\\u00a0(UKR)\", \"South Africa\\u00a0(RSA)\", \"Sweden\\u00a0(SWE)\", \"Denmark\\u00a0(DEN)\", \"Argentina\\u00a0(ARG)\", \"Greece\\u00a0(GRE)\", \"Iran\\u00a0(IRI)\", \"Switzerland\\u00a0(SUI)\", \"Colombia\\u00a0(COL)\", \"Kazakhstan\\u00a0(KAZ)\", \"Uzbekistan\\u00a0(UZB)\", \"Canada\\u00a0(CAN)\", \"New Zealand\\u00a0(NZL)\", \"Cuba\\u00a0(CUB)\", \"Croatia\\u00a0(CRO)\", \"Jamaica\\u00a0(JAM)\", \"Kenya\\u00a0(KEN)\", \"Spain\\u00a0(ESP)\", \"Brazil\\u00a0(BRA)*\", \"Hungary\\u00a0(HUN)\", \"Netherlands\\u00a0(NED)\", \"Australia\\u00a0(AUS)\", \"Italy\\u00a0(ITA)\", \"South Korea\\u00a0(KOR)\", \"France\\u00a0(FRA)\", \"Japan\\u00a0(JPN)\", \"Germany\\u00a0(GER)\", \"Russia\\u00a0(RUS)\", \"China\\u00a0(CHN)\", \"Great Britain\\u00a0(GBR)\", \"United States\\u00a0(USA)\"], \"yaxis\": \"y\"}, {\"alignmentgroup\": \"True\", \"hovertemplate\": \"variable=Silver<br>value=%{x}<br>NOC=%{y}<extra></extra>\", \"legendgroup\": \"Silver\", \"marker\": {\"color\": \"#C0C0C0\"}, \"name\": \"Silver\", \"offsetgroup\": \"Silver\", \"orientation\": \"h\", \"showlegend\": true, \"textposition\": \"auto\", \"type\": \"bar\", \"x\": [2, 3, 3, 4, 5, 6, 6, 6, 1, 1, 1, 2, 2, 5, 2, 3, 9, 2, 3, 3, 6, 4, 6, 3, 7, 11, 12, 3, 18, 8, 10, 17, 18, 23, 37], \"xaxis\": \"x\", \"y\": [\"Belgium\\u00a0(BEL)\", \"North Korea\\u00a0(PRK)\", \"Poland\\u00a0(POL)\", \"Serbia\\u00a0(SRB)\", \"Ukraine\\u00a0(UKR)\", \"South Africa\\u00a0(RSA)\", \"Sweden\\u00a0(SWE)\", \"Denmark\\u00a0(DEN)\", \"Argentina\\u00a0(ARG)\", \"Greece\\u00a0(GRE)\", \"Iran\\u00a0(IRI)\", \"Switzerland\\u00a0(SUI)\", \"Colombia\\u00a0(COL)\", \"Kazakhstan\\u00a0(KAZ)\", \"Uzbekistan\\u00a0(UZB)\", \"Canada\\u00a0(CAN)\", \"New Zealand\\u00a0(NZL)\", \"Cuba\\u00a0(CUB)\", \"Croatia\\u00a0(CRO)\", \"Jamaica\\u00a0(JAM)\", \"Kenya\\u00a0(KEN)\", \"Spain\\u00a0(ESP)\", \"Brazil\\u00a0(BRA)*\", \"Hungary\\u00a0(HUN)\", \"Netherlands\\u00a0(NED)\", \"Australia\\u00a0(AUS)\", \"Italy\\u00a0(ITA)\", \"South Korea\\u00a0(KOR)\", \"France\\u00a0(FRA)\", \"Japan\\u00a0(JPN)\", \"Germany\\u00a0(GER)\", \"Russia\\u00a0(RUS)\", \"China\\u00a0(CHN)\", \"Great Britain\\u00a0(GBR)\", \"United States\\u00a0(USA)\"], \"yaxis\": \"y\"}, {\"alignmentgroup\": \"True\", \"hovertemplate\": \"variable=Bronze<br>value=%{x}<br>NOC=%{y}<extra></extra>\", \"legendgroup\": \"Bronze\", \"marker\": {\"color\": \"#CD7F32\"}, \"name\": \"Bronze\", \"offsetgroup\": \"Bronze\", \"orientation\": \"h\", \"showlegend\": true, \"textposition\": \"auto\", \"type\": \"bar\", \"x\": [2, 2, 6, 2, 4, 2, 3, 7, 0, 2, 4, 2, 3, 10, 7, 15, 5, 4, 2, 2, 1, 6, 6, 4, 4, 10, 8, 9, 14, 21, 15, 20, 26, 17, 38], \"xaxis\": \"x\", \"y\": [\"Belgium\\u00a0(BEL)\", \"North Korea\\u00a0(PRK)\", \"Poland\\u00a0(POL)\", \"Serbia\\u00a0(SRB)\", \"Ukraine\\u00a0(UKR)\", \"South Africa\\u00a0(RSA)\", \"Sweden\\u00a0(SWE)\", \"Denmark\\u00a0(DEN)\", \"Argentina\\u00a0(ARG)\", \"Greece\\u00a0(GRE)\", \"Iran\\u00a0(IRI)\", \"Switzerland\\u00a0(SUI)\", \"Colombia\\u00a0(COL)\", \"Kazakhstan\\u00a0(KAZ)\", \"Uzbekistan\\u00a0(UZB)\", \"Canada\\u00a0(CAN)\", \"New Zealand\\u00a0(NZL)\", \"Cuba\\u00a0(CUB)\", \"Croatia\\u00a0(CRO)\", \"Jamaica\\u00a0(JAM)\", \"Kenya\\u00a0(KEN)\", \"Spain\\u00a0(ESP)\", \"Brazil\\u00a0(BRA)*\", \"Hungary\\u00a0(HUN)\", \"Netherlands\\u00a0(NED)\", \"Australia\\u00a0(AUS)\", \"Italy\\u00a0(ITA)\", \"South Korea\\u00a0(KOR)\", \"France\\u00a0(FRA)\", \"Japan\\u00a0(JPN)\", \"Germany\\u00a0(GER)\", \"Russia\\u00a0(RUS)\", \"China\\u00a0(CHN)\", \"Great Britain\\u00a0(GBR)\", \"United States\\u00a0(USA)\"], \"yaxis\": \"y\"}],                        {\"barmode\": \"relative\", \"height\": 650, \"legend\": {\"title\": {\"text\": \"\"}, \"tracegroupgap\": 0}, \"margin\": {\"t\": 60}, \"template\": {\"data\": {\"bar\": [{\"error_x\": {\"color\": \"#2a3f5f\"}, \"error_y\": {\"color\": \"#2a3f5f\"}, \"marker\": {\"line\": {\"color\": \"#E5ECF6\", \"width\": 0.5}}, \"type\": \"bar\"}], \"barpolar\": [{\"marker\": {\"line\": {\"color\": \"#E5ECF6\", \"width\": 0.5}}, \"type\": \"barpolar\"}], \"carpet\": [{\"aaxis\": {\"endlinecolor\": \"#2a3f5f\", \"gridcolor\": \"white\", \"linecolor\": \"white\", \"minorgridcolor\": \"white\", \"startlinecolor\": \"#2a3f5f\"}, \"baxis\": {\"endlinecolor\": \"#2a3f5f\", \"gridcolor\": \"white\", \"linecolor\": \"white\", \"minorgridcolor\": \"white\", \"startlinecolor\": \"#2a3f5f\"}, \"type\": \"carpet\"}], \"choropleth\": [{\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}, \"type\": \"choropleth\"}], \"contour\": [{\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}, \"colorscale\": [[0.0, \"#0d0887\"], [0.1111111111111111, \"#46039f\"], [0.2222222222222222, \"#7201a8\"], [0.3333333333333333, \"#9c179e\"], [0.4444444444444444, \"#bd3786\"], [0.5555555555555556, \"#d8576b\"], [0.6666666666666666, \"#ed7953\"], [0.7777777777777778, \"#fb9f3a\"], [0.8888888888888888, \"#fdca26\"], [1.0, \"#f0f921\"]], \"type\": \"contour\"}], \"contourcarpet\": [{\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}, \"type\": \"contourcarpet\"}], \"heatmap\": [{\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}, \"colorscale\": [[0.0, \"#0d0887\"], [0.1111111111111111, \"#46039f\"], [0.2222222222222222, \"#7201a8\"], [0.3333333333333333, \"#9c179e\"], [0.4444444444444444, \"#bd3786\"], [0.5555555555555556, \"#d8576b\"], [0.6666666666666666, \"#ed7953\"], [0.7777777777777778, \"#fb9f3a\"], [0.8888888888888888, \"#fdca26\"], [1.0, \"#f0f921\"]], \"type\": \"heatmap\"}], \"heatmapgl\": [{\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}, \"colorscale\": [[0.0, \"#0d0887\"], [0.1111111111111111, \"#46039f\"], [0.2222222222222222, \"#7201a8\"], [0.3333333333333333, \"#9c179e\"], [0.4444444444444444, \"#bd3786\"], [0.5555555555555556, \"#d8576b\"], [0.6666666666666666, \"#ed7953\"], [0.7777777777777778, \"#fb9f3a\"], [0.8888888888888888, \"#fdca26\"], [1.0, \"#f0f921\"]], \"type\": \"heatmapgl\"}], \"histogram\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"histogram\"}], \"histogram2d\": [{\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}, \"colorscale\": [[0.0, \"#0d0887\"], [0.1111111111111111, \"#46039f\"], [0.2222222222222222, \"#7201a8\"], [0.3333333333333333, \"#9c179e\"], [0.4444444444444444, \"#bd3786\"], [0.5555555555555556, \"#d8576b\"], [0.6666666666666666, \"#ed7953\"], [0.7777777777777778, \"#fb9f3a\"], [0.8888888888888888, \"#fdca26\"], [1.0, \"#f0f921\"]], \"type\": \"histogram2d\"}], \"histogram2dcontour\": [{\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}, \"colorscale\": [[0.0, \"#0d0887\"], [0.1111111111111111, \"#46039f\"], [0.2222222222222222, \"#7201a8\"], [0.3333333333333333, \"#9c179e\"], [0.4444444444444444, \"#bd3786\"], [0.5555555555555556, \"#d8576b\"], [0.6666666666666666, \"#ed7953\"], [0.7777777777777778, \"#fb9f3a\"], [0.8888888888888888, \"#fdca26\"], [1.0, \"#f0f921\"]], \"type\": \"histogram2dcontour\"}], \"mesh3d\": [{\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}, \"type\": \"mesh3d\"}], \"parcoords\": [{\"line\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"parcoords\"}], \"pie\": [{\"automargin\": true, \"type\": \"pie\"}], \"scatter\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"scatter\"}], \"scatter3d\": [{\"line\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"scatter3d\"}], \"scattercarpet\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"scattercarpet\"}], \"scattergeo\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"scattergeo\"}], \"scattergl\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"scattergl\"}], \"scattermapbox\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"scattermapbox\"}], \"scatterpolar\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"scatterpolar\"}], \"scatterpolargl\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"scatterpolargl\"}], \"scatterternary\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"type\": \"scatterternary\"}], \"surface\": [{\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}, \"colorscale\": [[0.0, \"#0d0887\"], [0.1111111111111111, \"#46039f\"], [0.2222222222222222, \"#7201a8\"], [0.3333333333333333, \"#9c179e\"], [0.4444444444444444, \"#bd3786\"], [0.5555555555555556, \"#d8576b\"], [0.6666666666666666, \"#ed7953\"], [0.7777777777777778, \"#fb9f3a\"], [0.8888888888888888, \"#fdca26\"], [1.0, \"#f0f921\"]], \"type\": \"surface\"}], \"table\": [{\"cells\": {\"fill\": {\"color\": \"#EBF0F8\"}, \"line\": {\"color\": \"white\"}}, \"header\": {\"fill\": {\"color\": \"#C8D4E3\"}, \"line\": {\"color\": \"white\"}}, \"type\": \"table\"}]}, \"layout\": {\"annotationdefaults\": {\"arrowcolor\": \"#2a3f5f\", \"arrowhead\": 0, \"arrowwidth\": 1}, \"autotypenumbers\": \"strict\", \"coloraxis\": {\"colorbar\": {\"outlinewidth\": 0, \"ticks\": \"\"}}, \"colorscale\": {\"diverging\": [[0, \"#8e0152\"], [0.1, \"#c51b7d\"], [0.2, \"#de77ae\"], [0.3, \"#f1b6da\"], [0.4, \"#fde0ef\"], [0.5, \"#f7f7f7\"], [0.6, \"#e6f5d0\"], [0.7, \"#b8e186\"], [0.8, \"#7fbc41\"], [0.9, \"#4d9221\"], [1, \"#276419\"]], \"sequential\": [[0.0, \"#0d0887\"], [0.1111111111111111, \"#46039f\"], [0.2222222222222222, \"#7201a8\"], [0.3333333333333333, \"#9c179e\"], [0.4444444444444444, \"#bd3786\"], [0.5555555555555556, \"#d8576b\"], [0.6666666666666666, \"#ed7953\"], [0.7777777777777778, \"#fb9f3a\"], [0.8888888888888888, \"#fdca26\"], [1.0, \"#f0f921\"]], \"sequentialminus\": [[0.0, \"#0d0887\"], [0.1111111111111111, \"#46039f\"], [0.2222222222222222, \"#7201a8\"], [0.3333333333333333, \"#9c179e\"], [0.4444444444444444, \"#bd3786\"], [0.5555555555555556, \"#d8576b\"], [0.6666666666666666, \"#ed7953\"], [0.7777777777777778, \"#fb9f3a\"], [0.8888888888888888, \"#fdca26\"], [1.0, \"#f0f921\"]]}, \"colorway\": [\"#636efa\", \"#EF553B\", \"#00cc96\", \"#ab63fa\", \"#FFA15A\", \"#19d3f3\", \"#FF6692\", \"#B6E880\", \"#FF97FF\", \"#FECB52\"], \"font\": {\"color\": \"#2a3f5f\"}, \"geo\": {\"bgcolor\": \"white\", \"lakecolor\": \"white\", \"landcolor\": \"#E5ECF6\", \"showlakes\": true, \"showland\": true, \"subunitcolor\": \"white\"}, \"hoverlabel\": {\"align\": \"left\"}, \"hovermode\": \"closest\", \"mapbox\": {\"style\": \"light\"}, \"paper_bgcolor\": \"white\", \"plot_bgcolor\": \"#E5ECF6\", \"polar\": {\"angularaxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"ticks\": \"\"}, \"bgcolor\": \"#E5ECF6\", \"radialaxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"ticks\": \"\"}}, \"scene\": {\"xaxis\": {\"backgroundcolor\": \"#E5ECF6\", \"gridcolor\": \"white\", \"gridwidth\": 2, \"linecolor\": \"white\", \"showbackground\": true, \"ticks\": \"\", \"zerolinecolor\": \"white\"}, \"yaxis\": {\"backgroundcolor\": \"#E5ECF6\", \"gridcolor\": \"white\", \"gridwidth\": 2, \"linecolor\": \"white\", \"showbackground\": true, \"ticks\": \"\", \"zerolinecolor\": \"white\"}, \"zaxis\": {\"backgroundcolor\": \"#E5ECF6\", \"gridcolor\": \"white\", \"gridwidth\": 2, \"linecolor\": \"white\", \"showbackground\": true, \"ticks\": \"\", \"zerolinecolor\": \"white\"}}, \"shapedefaults\": {\"line\": {\"color\": \"#2a3f5f\"}}, \"ternary\": {\"aaxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"ticks\": \"\"}, \"baxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"ticks\": \"\"}, \"bgcolor\": \"#E5ECF6\", \"caxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"ticks\": \"\"}}, \"title\": {\"x\": 0.05}, \"xaxis\": {\"automargin\": true, \"gridcolor\": \"white\", \"linecolor\": \"white\", \"ticks\": \"\", \"title\": {\"standoff\": 15}, \"zerolinecolor\": \"white\", \"zerolinewidth\": 2}, \"yaxis\": {\"automargin\": true, \"gridcolor\": \"white\", \"linecolor\": \"white\", \"ticks\": \"\", \"title\": {\"standoff\": 15}, \"zerolinecolor\": \"white\", \"zerolinewidth\": 2}}}, \"title\": {\"text\": \"2016 Olympics results\"}, \"xaxis\": {\"anchor\": \"y\", \"domain\": [0.0, 1.0], \"title\": {\"text\": \"Medals won\"}}, \"yaxis\": {\"anchor\": \"x\", \"domain\": [0.0, 1.0], \"title\": {\"text\": \"\"}}},                        {\"responsive\": true}                    ).then(function(){\n",
              "                            \n",
              "var gd = document.getElementById('31a33c8c-f95a-41b0-98a3-07beb58c615c');\n",
              "var x = new MutationObserver(function (mutations, observer) {{\n",
              "        var display = window.getComputedStyle(gd).display;\n",
              "        if (!display || display === 'none') {{\n",
              "            console.log([gd, 'removed!']);\n",
              "            Plotly.purge(gd);\n",
              "            observer.disconnect();\n",
              "        }}\n",
              "}});\n",
              "\n",
              "// Listen for the removal of the full notebook cells\n",
              "var notebookContainer = gd.closest('#notebook-container');\n",
              "if (notebookContainer) {{\n",
              "    x.observe(notebookContainer, {childList: true});\n",
              "}}\n",
              "\n",
              "// Listen for the clearing of the current output cell\n",
              "var outputEl = gd.closest('.output');\n",
              "if (outputEl) {{\n",
              "    x.observe(outputEl, {childList: true});\n",
              "}}\n",
              "\n",
              "                        })                };                            </script>        </div>\n",
              "</body>\n",
              "</html>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OaQin8Kdxj8i"
      },
      "source": [
        "#### Exercise\n",
        "\n",
        "Calculate number of medals per person from each country."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Rp6Hj3Q3JH9x",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "0d1299a4-702c-48f0-edf1-6d0826910906"
      },
      "source": [
        "df_list = pd.read_html(\"https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population\", match=\"China\")\n",
        "population = df_list[0]\n",
        "\n",
        "..."
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "Ellipsis"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rRqzeBITIEZq"
      },
      "source": [
        "#### How many different formats do Pandas support?\n",
        "\n",
        "Here's a snippet that gets this info directly from Pandas documentation!"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "M0asZOULGMXS",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "3419e5ec-8196-40bc-deb9-39dc6e13b23e"
      },
      "source": [
        "import pandas as pd\n",
        "df_list = pd.read_html(\n",
        "  \"https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html\", \n",
        "  match=\"Format Type\"\n",
        ")\n",
        "df = df_list[0]; nl = \"\\n\"\n",
        "\n",
        "print(f\"Pandas can read data in {len(df_list[0])} formats: {nl + nl.join(df['Data Description'].tolist())}\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Pandas can read data in 18 formats: \n",
            "CSV\n",
            "Fixed-Width Text File\n",
            "JSON\n",
            "HTML\n",
            "Local clipboard\n",
            "MS Excel\n",
            "OpenDocument\n",
            "HDF5 Format\n",
            "Feather Format\n",
            "Parquet Format\n",
            "ORC Format\n",
            "Msgpack\n",
            "Stata\n",
            "SAS\n",
            "SPSS\n",
            "Python Pickle Format\n",
            "SQL\n",
            "Google BigQuery\n"
          ],
          "name": "stdout"
        }
      ]
    }
  ]
}