phenders/lobbying.ipynb

## lobbying.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Which of the FAMGA companies spent most on lobbying over the last five years?\n",
    "\n",
    "Facebook, Apple, Microsoft, Google, and Amazon (FAMGA) all spend a lot of money on lobbying. The issues they hope to influence lawmakers on range from those you might expect, like \"High-skilled immigration\", \"Cybersecurity issues\", and \"Autonomous vehicle technology,\" to more obscure ones, like \"Issues related to online wine sales\" and \"Saving High Tech Inventors from Egregious Legal Disputes (SHIELD).\" According to data on [Enigma Public](https://public.enigma.com), the total amount spent on lobbying in the U.S. Senate by all companies has more than doubled since 2000. But which of the FAMGA companies spends the most on lobbying? \n",
    "\n",
    "Lobbyists working in the U.S. Senate are required to report all of their activity. You can find these reports on Enigma Public in the [U.S. Senate Lobbying Reports collection](https://public.enigma.com/browse/collection/u-s-senate-lobbying-reports/3b169a69-cbf4-4137-92d4-0ca268816153). Within this collection, [Primary Lobbying Reports](https://public.enigma.com/browse/collection/primary-lobbying-reports/b4c9a21a-79d1-495e-a6ea-d643a3820b43) includes the information about clients and payments. We'll use the datasets in this collection to answer the question about who spends most.\n",
    "\n",
    "### Some initial setup\n",
    "\n",
    "This article assumes a basic knowledge of Python and pandas, as well as some familiarity with the [Enigma Public API](http://docs.enigma.com/public/public_v20_api_about.html). If you haven't used the API before, the tutorial [Getting started with the API](http://docs.enigma.com/public/public_v20_user_api.html) provides information on exploring the API using [Postman](https://www.getpostman.com/). Another tutorial, [Using Python to access Enigma Public](http://docs.enigma.com/public/public_v20_user_python.html), demonstrates how to make API calls from Python.\n",
    "\n",
    "The code samples in this article are shown in Jupyter Notebook. If you want to follow along, you'll need to install [Jupyter Notebook](https://jupyter.readthedocs.io/en/latest/content-quickstart.html) (or [JupyterLab](http://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html)) along with the two required Python modules, [requests](https://docs.python-requests.org/en/master/user/quickstart/) and [pandas](https://pandas.pydata.org/), into your Python environment (`pip install requests` and `pip install pandas`). You can then import the modules into your Jupyter project. The code below shows how to create an HTTP session object and include your [Enigma Public API key](http://docs.enigma.com/public/public_v20_api_authentication.html) in the authorization header."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "import pandas as pd\n",
    "\n",
    "base_url = 'https://public.enigma.com/api/'\n",
    "s = requests.Session()\n",
    "headers = {'authorization': 'Bearer ' + 'YOUR-API-KEY'}\n",
    "s.headers.update(headers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get the snapshot IDs for the datasets of interest\n",
    "\n",
    "The [Primary Lobbying Reports collection](https://public.enigma.com/browse/collection/primary-lobbying-reports/b4c9a21a-79d1-495e-a6ea-d643a3820b43) has data going back to 1999. Since we're interested only in the data for the last five years, you'll need to get the snapshot ID associated with each of the datasets going back to 2013. First, we'll use the `parent_collection_id` parameter on the API's [GET /datasets/](http://docs.enigma.com/public/public_v20_api_get_datasets.html) endpoint to get the datasets within the Primary Lobbying Reports collection, and then get the ID of the current snapshot for datasets where the year in the dataset name is 2013 or later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "collection_id = 'b4c9a21a-79d1-495e-a6ea-d643a3820b43' # Primary Lobbying Reports\n",
    "url = base_url + 'datasets/?parent_collection_id={}'.format(collection_id)\n",
    "response = s.get(url).json()\n",
    "ids = [dataset['current_snapshot']['id'] for dataset in response if int(dataset['display_name'].split()[-1]) >= 2013]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you make a request to the Enigma Public API, the response comes back in JSON format. If you open one of the datasets in Enigma Public (for example, [Lobbying Disclosures - Primary Report 2017](https://public.enigma.com/datasets/lobbying-disclosures-primary-report-2017/bbb5f44a-f9c4-4c01-b112-03d576984d67)) and click the API button, you'll see an example of a JSON response. Navigating the JSON hierarchy to get the information you want can be one of the biggest challenges when working with the API (see [Getting the most recent snapshot ID](http://docs.enigma.com/public/public_v20_user_python.html#snapshot_id)). In the example above, we're going to the `current_snapshot` node and getting the value associated with the `id` key, but saving it only if the numeric value of the last word in the display name (the year) is greater than or equal to 2013. This gives us a list of the required snapshot IDs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get the rows of interest from each snapshot\n",
    "\n",
    "Each of the datasets in the collection is quite large. Since we're interested only in data about five specific companies over five specific years, you can request only the rows where the \"Client Name\" column includes Facebook, Apple, Microsoft, Google, or Amazon, and the \"Year\" column is in the range 2013-2017 (since a report can include filings for previous years).\n",
    "\n",
    "\"Client Name\" and \"Year\" are the column *display names* (the names shown in the column headers in Enigma Public). To request information for specific columns, you need to know their *field names*. If you clicked the API button in Enigma Public as suggested earlier, you'll see a list of field names in the JSON response. The field names in this case are `year` and `client_name`.\n",
    "\n",
    "The [GET /snapshots/id](http://docs.enigma.com/public/public_v20_api_get_snapshots_id.html) API endpoint has an [advanced query mode](http://development-dockerd-0.enigma:8888/public_v20_api_get_snapshots_id.html#advanced) that lets you search specific columns, rather than entire rows. To search a column for multiple terms, you need to \"OR\" the search terms and must include a URL-encoded space character (`%20`) before and after the operator, as done by the Python `join` functions below. Since we're searching two columns, \"AND\" the two column searches together to create the query string. You can then iterate over the snapshot ID list and call the API using the same query for each call. Note that you need to specify a `row_limit` (<10,000) or you won't get any rows back. In the code below, `rows.extend(response['table_rows']['rows'])` creates a list-of-lists with all of the resulting rows for all five years."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "companies = ['Facebook', 'Apple', 'Microsoft', 'Google', 'Amazon']\n",
    "years = ['2013', '2014', '2015', '2016', '2017']\n",
    "query = '(client_name:({})AND(year:({})))'.format('%20OR%20'.join(companies), '%20OR%20'.join(years))\n",
    "rows = []\n",
    "for snapshot_id in ids:\n",
    "    url = base_url + 'snapshots/{}?query_mode=advanced&query={}&row_limit=1000'.format(snapshot_id, query)\n",
    "    response = s.get(url).json()\n",
    "    rows.extend(response['table_rows']['rows'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a dataframe using the field names as the column headers\n",
    "\n",
    "Using pandas, it's easy to convert a list-of-lists into a DataFrame. You can use the list of field names contained in the JSON API response to set the column headers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>year</th>\n",
       "      <th>received</th>\n",
       "      <th>amount</th>\n",
       "      <th>type</th>\n",
       "      <th>period</th>\n",
       "      <th>registrant_id</th>\n",
       "      <th>registrant_name</th>\n",
       "      <th>registrant_general_description</th>\n",
       "      <th>address</th>\n",
       "      <th>...</th>\n",
       "      <th>client_general_description</th>\n",
       "      <th>client_id</th>\n",
       "      <th>self_filer</th>\n",
       "      <th>contact_full_name</th>\n",
       "      <th>is_state_or_local_gov</th>\n",
       "      <th>client_country</th>\n",
       "      <th>client_ppb_country</th>\n",
       "      <th>client_state</th>\n",
       "      <th>client_ppb_state</th>\n",
       "      <th>serialid</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1129</th>\n",
       "      <td>4EC593A7-CD18-425F-86F3-EE782103AD40</td>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-04T14:02:53+00:00</td>\n",
       "      <td>30000</td>\n",
       "      <td>FOURTH QUARTER REPORT</td>\n",
       "      <td>4th Quarter (Oct 1 - Dec 31)</td>\n",
       "      <td>401036920</td>\n",
       "      <td>Bloom Strategic Counsel</td>\n",
       "      <td>Law, public policy, and government relations</td>\n",
       "      <td>1155 F Street, NW\\r\\nSuite 1200\\r\\nWashington,...</td>\n",
       "      <td>...</td>\n",
       "      <td>Online retailer</td>\n",
       "      <td>125</td>\n",
       "      <td>FALSE</td>\n",
       "      <td>Seth Bloom</td>\n",
       "      <td>TRUE</td>\n",
       "      <td>USA</td>\n",
       "      <td>USA</td>\n",
       "      <td>WASHINGTON</td>\n",
       "      <td>WASHINGTON</td>\n",
       "      <td>16034</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1130</th>\n",
       "      <td>904E3F0B-FF0B-4B8B-94D7-8D468A4672CC</td>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-08T11:57:36+00:00</td>\n",
       "      <td>30000</td>\n",
       "      <td>FOURTH QUARTER REPORT</td>\n",
       "      <td>4th Quarter (Oct 1 - Dec 31)</td>\n",
       "      <td>306011</td>\n",
       "      <td>Fidelis Government Relations</td>\n",
       "      <td>Government relations consulting firm</td>\n",
       "      <td>1300 Penn Ave Suite 700\\r\\nWashington, DC 20004</td>\n",
       "      <td>...</td>\n",
       "      <td>None</td>\n",
       "      <td>51</td>\n",
       "      <td>FALSE</td>\n",
       "      <td>TERRY G. ALLEN</td>\n",
       "      <td>TRUE</td>\n",
       "      <td>USA</td>\n",
       "      <td>USA</td>\n",
       "      <td>DISTRICT OF COLUMBIA</td>\n",
       "      <td>WASHINGTON</td>\n",
       "      <td>16525</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1131</th>\n",
       "      <td>2FE1FBFB-A894-4FED-9BB1-FD1012E44422</td>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-20T17:22:01+00:00</td>\n",
       "      <td>60000</td>\n",
       "      <td>FOURTH QUARTER REPORT</td>\n",
       "      <td>4th Quarter (Oct 1 - Dec 31)</td>\n",
       "      <td>400424029</td>\n",
       "      <td>Polaris-Hutton Group, LLC</td>\n",
       "      <td>Government Relations Firm</td>\n",
       "      <td>PO Box 1304\\r\\nAlexandria, VA 22313</td>\n",
       "      <td>...</td>\n",
       "      <td>Software company that manufacturers, licenses ...</td>\n",
       "      <td>48</td>\n",
       "      <td>FALSE</td>\n",
       "      <td>DANIEL J. GANS</td>\n",
       "      <td>TRUE</td>\n",
       "      <td>USA</td>\n",
       "      <td>USA</td>\n",
       "      <td>DISTRICT OF COLUMBIA</td>\n",
       "      <td>WASHINGTON</td>\n",
       "      <td>16981</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1132</th>\n",
       "      <td>4B354BA0-9FA5-4433-AB7A-C2A7FA553F04</td>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-20T21:55:24+00:00</td>\n",
       "      <td>50000</td>\n",
       "      <td>FOURTH QUARTER REPORT</td>\n",
       "      <td>4th Quarter (Oct 1 - Dec 31)</td>\n",
       "      <td>57475</td>\n",
       "      <td>Ernst &amp; Young LLP (Washington Council Ernst &amp; ...</td>\n",
       "      <td>professional services firm</td>\n",
       "      <td>1001 Pennsylvania Avenue, NW\\r\\nSuite 601 Nort...</td>\n",
       "      <td>...</td>\n",
       "      <td>None</td>\n",
       "      <td>2031</td>\n",
       "      <td>FALSE</td>\n",
       "      <td>Nick Giordano</td>\n",
       "      <td>TRUE</td>\n",
       "      <td>USA</td>\n",
       "      <td>USA</td>\n",
       "      <td>DISTRICT OF COLUMBIA</td>\n",
       "      <td>DISTRICT OF COLUMBIA</td>\n",
       "      <td>17245</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1133</th>\n",
       "      <td>53826956-E747-4B33-94F2-766ADEF4F4A7</td>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-21T17:56:06+00:00</td>\n",
       "      <td>None</td>\n",
       "      <td>FOURTH QUARTER REPORT</td>\n",
       "      <td>4th Quarter (Oct 1 - Dec 31)</td>\n",
       "      <td>44641</td>\n",
       "      <td>Brown Winick Graves Gross Baskerville &amp; Schoen...</td>\n",
       "      <td>Law firm</td>\n",
       "      <td>666 Grand Avenue, Suite 2000\\r\\nDes Moines, IA...</td>\n",
       "      <td>...</td>\n",
       "      <td>Internet search and advertising services</td>\n",
       "      <td>125</td>\n",
       "      <td>FALSE</td>\n",
       "      <td>Christopher Sackett</td>\n",
       "      <td>TRUE</td>\n",
       "      <td>USA</td>\n",
       "      <td>USA</td>\n",
       "      <td>DISTRICT OF COLUMBIA</td>\n",
       "      <td>CALIFORNIA</td>\n",
       "      <td>17511</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                        id  year                   received  \\\n",
       "1129  4EC593A7-CD18-425F-86F3-EE782103AD40  2016  2017-01-04T14:02:53+00:00   \n",
       "1130  904E3F0B-FF0B-4B8B-94D7-8D468A4672CC  2016  2017-01-08T11:57:36+00:00   \n",
       "1131  2FE1FBFB-A894-4FED-9BB1-FD1012E44422  2016  2017-01-20T17:22:01+00:00   \n",
       "1132  4B354BA0-9FA5-4433-AB7A-C2A7FA553F04  2016  2017-01-20T21:55:24+00:00   \n",
       "1133  53826956-E747-4B33-94F2-766ADEF4F4A7  2016  2017-01-21T17:56:06+00:00   \n",
       "\n",
       "     amount                   type                        period  \\\n",
       "1129  30000  FOURTH QUARTER REPORT  4th Quarter (Oct 1 - Dec 31)   \n",
       "1130  30000  FOURTH QUARTER REPORT  4th Quarter (Oct 1 - Dec 31)   \n",
       "1131  60000  FOURTH QUARTER REPORT  4th Quarter (Oct 1 - Dec 31)   \n",
       "1132  50000  FOURTH QUARTER REPORT  4th Quarter (Oct 1 - Dec 31)   \n",
       "1133   None  FOURTH QUARTER REPORT  4th Quarter (Oct 1 - Dec 31)   \n",
       "\n",
       "     registrant_id                                    registrant_name  \\\n",
       "1129     401036920                            Bloom Strategic Counsel   \n",
       "1130        306011                       Fidelis Government Relations   \n",
       "1131     400424029                          Polaris-Hutton Group, LLC   \n",
       "1132         57475  Ernst & Young LLP (Washington Council Ernst & ...   \n",
       "1133         44641  Brown Winick Graves Gross Baskerville & Schoen...   \n",
       "\n",
       "                    registrant_general_description  \\\n",
       "1129  Law, public policy, and government relations   \n",
       "1130          Government relations consulting firm   \n",
       "1131                     Government Relations Firm   \n",
       "1132                    professional services firm   \n",
       "1133                                      Law firm   \n",
       "\n",
       "                                                address   ...     \\\n",
       "1129  1155 F Street, NW\\r\\nSuite 1200\\r\\nWashington,...   ...      \n",
       "1130    1300 Penn Ave Suite 700\\r\\nWashington, DC 20004   ...      \n",
       "1131                PO Box 1304\\r\\nAlexandria, VA 22313   ...      \n",
       "1132  1001 Pennsylvania Avenue, NW\\r\\nSuite 601 Nort...   ...      \n",
       "1133  666 Grand Avenue, Suite 2000\\r\\nDes Moines, IA...   ...      \n",
       "\n",
       "                             client_general_description client_id self_filer  \\\n",
       "1129                                    Online retailer       125      FALSE   \n",
       "1130                                               None        51      FALSE   \n",
       "1131  Software company that manufacturers, licenses ...        48      FALSE   \n",
       "1132                                               None      2031      FALSE   \n",
       "1133           Internet search and advertising services       125      FALSE   \n",
       "\n",
       "        contact_full_name is_state_or_local_gov client_country  \\\n",
       "1129           Seth Bloom                  TRUE            USA   \n",
       "1130       TERRY G. ALLEN                  TRUE            USA   \n",
       "1131       DANIEL J. GANS                  TRUE            USA   \n",
       "1132        Nick Giordano                  TRUE            USA   \n",
       "1133  Christopher Sackett                  TRUE            USA   \n",
       "\n",
       "     client_ppb_country          client_state      client_ppb_state serialid  \n",
       "1129                USA            WASHINGTON            WASHINGTON    16034  \n",
       "1130                USA  DISTRICT OF COLUMBIA            WASHINGTON    16525  \n",
       "1131                USA  DISTRICT OF COLUMBIA            WASHINGTON    16981  \n",
       "1132                USA  DISTRICT OF COLUMBIA  DISTRICT OF COLUMBIA    17245  \n",
       "1133                USA  DISTRICT OF COLUMBIA            CALIFORNIA    17511  \n",
       "\n",
       "[5 rows x 23 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(rows, columns=response['table_rows']['fields'])\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract the columns of interest\n",
    "\n",
    "The resulting DataFrame has 23 columns, but you don't need all of these to understand where and when each company spent its lobbying money. Next, we'll extract the columns we want to keep."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>received</th>\n",
       "      <th>amount</th>\n",
       "      <th>registrant_name</th>\n",
       "      <th>client_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1129</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-04T14:02:53+00:00</td>\n",
       "      <td>30000</td>\n",
       "      <td>Bloom Strategic Counsel</td>\n",
       "      <td>Amazon</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1130</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-08T11:57:36+00:00</td>\n",
       "      <td>30000</td>\n",
       "      <td>Fidelis Government Relations</td>\n",
       "      <td>Microsoft Corp.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1131</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-20T17:22:01+00:00</td>\n",
       "      <td>60000</td>\n",
       "      <td>Polaris-Hutton Group, LLC</td>\n",
       "      <td>Microsoft</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1132</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-20T21:55:24+00:00</td>\n",
       "      <td>50000</td>\n",
       "      <td>Ernst &amp; Young LLP (Washington Council Ernst &amp; ...</td>\n",
       "      <td>Microsoft Corporation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1133</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-21T17:56:06+00:00</td>\n",
       "      <td>None</td>\n",
       "      <td>Brown Winick Graves Gross Baskerville &amp; Schoen...</td>\n",
       "      <td>Google Inc.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      year                   received amount  \\\n",
       "1129  2016  2017-01-04T14:02:53+00:00  30000   \n",
       "1130  2016  2017-01-08T11:57:36+00:00  30000   \n",
       "1131  2016  2017-01-20T17:22:01+00:00  60000   \n",
       "1132  2016  2017-01-20T21:55:24+00:00  50000   \n",
       "1133  2016  2017-01-21T17:56:06+00:00   None   \n",
       "\n",
       "                                        registrant_name            client_name  \n",
       "1129                            Bloom Strategic Counsel                 Amazon  \n",
       "1130                       Fidelis Government Relations        Microsoft Corp.  \n",
       "1131                          Polaris-Hutton Group, LLC              Microsoft  \n",
       "1132  Ernst & Young LLP (Washington Council Ernst & ...  Microsoft Corporation  \n",
       "1133  Brown Winick Graves Gross Baskerville & Schoen...            Google Inc.  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = df[['year', 'received', 'amount', 'registrant_name', 'client_name']]\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set the column types as needed\n",
    "\n",
    "When you read data from the JSON response, the values are read as strings. In order to compute aggregates, you'll need to convert the strings in the \"Amount\" column to numeric values. The code below does this, and also shows how to convert the datetime strings to dates in YYYY-MM-DD format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>received</th>\n",
       "      <th>amount</th>\n",
       "      <th>registrant_name</th>\n",
       "      <th>client_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1129</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-04</td>\n",
       "      <td>30000.0</td>\n",
       "      <td>Bloom Strategic Counsel</td>\n",
       "      <td>Amazon</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1130</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-08</td>\n",
       "      <td>30000.0</td>\n",
       "      <td>Fidelis Government Relations</td>\n",
       "      <td>Microsoft Corp.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1131</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-20</td>\n",
       "      <td>60000.0</td>\n",
       "      <td>Polaris-Hutton Group, LLC</td>\n",
       "      <td>Microsoft</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1132</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-20</td>\n",
       "      <td>50000.0</td>\n",
       "      <td>Ernst &amp; Young LLP (Washington Council Ernst &amp; ...</td>\n",
       "      <td>Microsoft Corporation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1133</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-21</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Brown Winick Graves Gross Baskerville &amp; Schoen...</td>\n",
       "      <td>Google Inc.</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      year    received   amount  \\\n",
       "1129  2016  2017-01-04  30000.0   \n",
       "1130  2016  2017-01-08  30000.0   \n",
       "1131  2016  2017-01-20  60000.0   \n",
       "1132  2016  2017-01-20  50000.0   \n",
       "1133  2016  2017-01-21      NaN   \n",
       "\n",
       "                                        registrant_name            client_name  \n",
       "1129                            Bloom Strategic Counsel                 Amazon  \n",
       "1130                       Fidelis Government Relations        Microsoft Corp.  \n",
       "1131                          Polaris-Hutton Group, LLC              Microsoft  \n",
       "1132  Ernst & Young LLP (Washington Council Ernst & ...  Microsoft Corporation  \n",
       "1133  Brown Winick Graves Gross Baskerville & Schoen...            Google Inc.  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['amount'] = df['amount'].apply(pd.to_numeric)\n",
    "df['received'] = df['received'].apply(pd.to_datetime)\n",
    "df['received'] = df['received'].apply(lambda x: x.strftime('%Y-%m-%d'))\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalize the company names\n",
    "\n",
    "As you can see above, lobbyists sometimes refer to their clients using variations of the client's name in the disclosure. For example, it's probably safe to assume Microsoft Corp., Microsoft, and Microsoft Corporation are all the same company. To do the aggregation, you'll need to resolve these variants to a single name. We'll use the list of names created earlier and convert anything in the `client_name` column that contains the name to the shorter version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>year</th>\n",
       "      <th>received</th>\n",
       "      <th>amount</th>\n",
       "      <th>registrant_name</th>\n",
       "      <th>client_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1129</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-04</td>\n",
       "      <td>30000.0</td>\n",
       "      <td>Bloom Strategic Counsel</td>\n",
       "      <td>Amazon</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1130</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-08</td>\n",
       "      <td>30000.0</td>\n",
       "      <td>Fidelis Government Relations</td>\n",
       "      <td>Microsoft</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1131</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-20</td>\n",
       "      <td>60000.0</td>\n",
       "      <td>Polaris-Hutton Group, LLC</td>\n",
       "      <td>Microsoft</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1132</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-20</td>\n",
       "      <td>50000.0</td>\n",
       "      <td>Ernst &amp; Young LLP (Washington Council Ernst &amp; ...</td>\n",
       "      <td>Microsoft</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1133</th>\n",
       "      <td>2016</td>\n",
       "      <td>2017-01-21</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Brown Winick Graves Gross Baskerville &amp; Schoen...</td>\n",
       "      <td>Google</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      year    received   amount  \\\n",
       "1129  2016  2017-01-04  30000.0   \n",
       "1130  2016  2017-01-08  30000.0   \n",
       "1131  2016  2017-01-20  60000.0   \n",
       "1132  2016  2017-01-20  50000.0   \n",
       "1133  2016  2017-01-21      NaN   \n",
       "\n",
       "                                        registrant_name client_name  \n",
       "1129                            Bloom Strategic Counsel      Amazon  \n",
       "1130                       Fidelis Government Relations   Microsoft  \n",
       "1131                          Polaris-Hutton Group, LLC   Microsoft  \n",
       "1132  Ernst & Young LLP (Washington Council Ernst & ...   Microsoft  \n",
       "1133  Brown Winick Graves Gross Baskerville & Schoen...      Google  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for company in companies:\n",
    "    df.loc[df['client_name'].str.contains(company, case=False), 'client_name'] = company\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute the totals\n",
    "\n",
    "Finally we're ready to compute the total amount spent by each company. Since we resolved the company names, we can use the `client_name` column in the panda's `groupby` function to combine all rows with the same company name. Here, we want to sum the values in the \"Amount\" column in the resulting row. We'll then format the numeric values and display the rows in descending order based on the amount."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>client_name</th>\n",
       "      <th>amount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Google</td>\n",
       "      <td>90,037,877.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Microsoft</td>\n",
       "      <td>51,908,500.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Facebook</td>\n",
       "      <td>38,372,000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Amazon</td>\n",
       "      <td>30,431,831.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Apple</td>\n",
       "      <td>23,135,000.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  client_name        amount\n",
       "0      Google  90,037,877.0\n",
       "1   Microsoft  51,908,500.0\n",
       "2    Facebook  38,372,000.0\n",
       "3      Amazon  30,431,831.0\n",
       "4       Apple  23,135,000.0"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "total = df.groupby(['client_name']).agg({'amount': sum}).reset_index()\n",
    "total['amount'] = total.apply(lambda x: \"{:,}\".format(x['amount']), axis=1)\n",
    "total = total.sort_values(by='amount', ascending=0).reset_index(drop=True)\n",
    "total"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have the answer: Google spent the most on lobbying.\n",
    "\n",
    "One caveat -- if you look at the source datasets, you'll see that lobbyists have an opportunity to amend their reports. When they do, the orginal report and the amended report are both listed, meaning some transactions are counted twice. This will reduce the total amount actually spent by each company, but most likely won't impact the overall result -- that Google spent the most by a large margin."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Further research\n",
    "\n",
    "The datasets examined here indicate how much each company spent and who lobbied on their behalf, but don't tell you about the issues they tried to influence. If you take an issue ID from column 1 of the primary report, you can look it up in the corresponding issues report in the [Lobbying Issues collection](https://public.enigma.com/browse/collection/lobbying-issues/6ecaafbc-6962-4dc2-a4ac-b067ba3ba021). If you're interested in exploring further, try joining the primary dataset and issues dataset for each year, using the issue ID as the column to join on."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}