{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Power search using the Enigma Public API\n",
"\n",
"The [Enigma Public web application](https://public.enigma.com/) provides powerful [search capabilities](http://docs.enigma.com/public/public_v20_user_searching.html) to help you locate specific information within the vast range of available datasets. For example, a search for `Michael Cohen` indicates there are 563 matching datasets. For each dataset, the search results page tells you the number of matching rows within each dataset, and filtering options let you narrow down the search results based on parent collection, dataset size, or date last updated. You can then open each matching dataset in turn to view the matching rows and determine if the data is relevant to your research. \n",
"\n",
"A search like this, though, returns a *lot* of data, and many of the hits won't relate to a person called Michael Cohen. For example, one row in the [New York City Building Permits dataset](https://public.enigma.com/datasets/new-york-city-building-permits-issued/862f9e4a-88f3-4231-ad43-c823833ba4fd) lists the building owner as Stephen Cohen and the person filing the application as Michael Ferrone. Since the row includes the words `Michael` and `Cohen`, it counts as a hit. Additionally, other rows may reference a person called Michael who lives on Cohen Street, or a person called Cohen who works at the Paul Michael company. How can you focus your search to locate a person called `Michael Cohen` and display the search results in a way that gives you an immediate view of *all* matching rows across *all* datasets? That's the focus of this tutorial.\n",
"\n",
"#### Some initial setup\n",
"\n",
"This article assumes a basic knowledge of Python and pandas, as well as some familiarity with the [Enigma Public API](http://docs.enigma.com/public/public_v20_api_about.html). If you haven't used the API before, the tutorial [Getting started with the API](http://docs.enigma.com/public/public_v20_user_api.html) provides information on exploring the API using [Postman](https://www.getpostman.com/). Another tutorial, [Using Python to access Enigma Public](http://docs.enigma.com/public/public_v20_user_python.html), demonstrates how to make API calls from Python.\n",
"\n",
"The code samples in this article are shown in Jupyter Notebook. If you want to follow along, you'll need to install [Jupyter Notebook](https://jupyter.readthedocs.io/en/latest/content-quickstart.html) (or [JupyterLab](http://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html)) along with the two required Python modules, [requests](https://docs.python-requests.org/en/master/user/quickstart/) and [pandas](https://pandas.pydata.org/), into your Python environment (`pip install requests` and `pip install pandas`). The code below shows how to import the modules into your Jupyter project, create an HTTP session object, and include your [Enigma Public API key](http://docs.enigma.com/public/public_v20_api_authentication.html) in the authorization header. It also sets up the base URL for the API calls."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import pandas as pd\n",
"\n",
"s = requests.Session()\n",
"headers = {'authorization': 'Bearer ' + 'your_api_key'}\n",
"s.headers.update(headers)\n",
"\n",
"url = 'https://public.enigma.com/api/datasets/'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create your initial query\n",
"\n",
"The `query` parameter on the API's [GET /datasets/](http://docs.enigma.com/public/public_v20_api_get_datasets.html) endpoint lets you search all Enigma Public datasets for the terms you specify. If you specify multiple search terms, they're ANDed together. The `query` parameter supports other [advanced operators](http://docs.enigma.com/public/public_v20_user_searching.html#AdvancedSearch), but we won't use them in this tutorial. Additional parameters we'll use for the search are:\n",
"\n",
"* `match_rows`: When `true`, the API searches for datasets where there's a match within the data records (rows). We'll set this to `true` (the default).\n",
"* `match_metadata`: When `true`, the API searches for datasets where there's a match within the dataset metadata. We'll set this to `false`.\n",
"* `row_limit`: Number of rows to return for each dataset. The default is 0 and the maximum you can request is 10,000. We'll request up to 1,000 from each dataset.\n",
"\n",
"We'll set up these parameters as a dictionary we can use as the `params` argument when we make the API call. Defining the query string separately will help later."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"parameters = {'match_metadata':'false','match_rows':'true', 'row_limit': 1000}\n",
"phrase1 = 'michael cohen'\n",
"parameters['query'] = phrase1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Get count of datasets matching the query string\n",
"\n",
"By default, [GET /datasets/](http://docs.enigma.com/public/public_v20_api_get_datasets.html) returns results for the first 20 datasets. We already know that a search for `michael cohen` yields many more than 20 datasets. Although you can request up to 1,000 by including a [Range header](http://docs.enigma.com/public/public_v20_api_pagination.html) in the request, there's a good chance the API will refuse your request or time out if the quantity of data is too large. For this reason, you should request the data in chunks. Before you can do this though, you need to know how many datasets match your query. You can do this by sending an HTTP `HEAD` request (which returns only the response headers, and not the data) and then reading the total number from the `content-range` header, as shown below."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"563"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response = s.head(url, headers=headers, params=parameters)\n",
"ds_count = int(response.headers.get('content-range').split(\"/\")[1])\n",
"ds_count"
]
},
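{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: the content-range header has a form like\n",
"# 'resources 0-19/563' (this sample value is made up); the total dataset\n",
"# count is the part after the '/':\n",
"sample = 'resources 0-19/563'\n",
"int(sample.split('/')[1])"
]
},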
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Get resulting rows in batches\n",
"\n",
"Now that you know the total number of matching datasets (563), you can request them in batches. The code below requests the datasets in batches of 10 using a `Range` header. The API returns the requested data as a JSON list, where each list item represents one dataset. For each dataset, the JSON includes:\n",
"\n",
"* The dataset metadata (`display_name`, `id`, etc.)\n",
"* Data for the current snapshot, including the field names and up to 1,000 records (since we specified `row_limit: 1000`)\n",
"\n",
"The code iterates through the list of datasets and creates a list-of-lists with all the matching rows, as well as information about each dataset. The second-to-last line truncates any values with more than 100 characters, since some datasets (for example, the [U.S. Patent and Trademark Office datasets](https://public.enigma.com/browse/collection/u-s-patent-and-trademark-office/cfc6f327-763c-48ee-b9ec-82f7d0b00762)) include extremely long descriptions."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"results = []\n",
"for i in range(0, ds_count, 10):\n",
" headers['Range'] = 'resources={}-{}'.format(i, i + 9)\n",
" s.headers.update(headers)\n",
" response = s.get(url, headers=headers, params=parameters).json()\n",
" for dataset in response:\n",
" results.append([dataset['display_name']])\n",
" results.append(['https://public.enigma.com/datasets/' + dataset['id']])\n",
" results.append(dataset['current_snapshot']['table_rows']['fields'])\n",
" rows = dataset['current_snapshot']['table_rows']['rows']\n",
" for row in rows:\n",
" row = [value[:100] for value in row if isinstance(value, str)]\n",
" results.append(row)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a DataFrame\n",
"\n",
"Using pandas, it's easy to convert a list-of-lists into a DataFrame. You can then save this as a CSV file you can analyze offline using your favorite spreadsheet application."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>245</th>\n",
" <th>246</th>\n",
" <th>247</th>\n",
" <th>248</th>\n",
" <th>249</th>\n",
" <th>250</th>\n",
" <th>251</th>\n",
" <th>252</th>\n",
" <th>253</th>\n",
" <th>254</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Medicare Provider Charges - Part D</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>https://public.enigma.com/datasets/ce0ea1d5-50...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>npi</td>\n",
" <td>nppes_provider_last_org_name</td>\n",
" <td>nppes_provider_first_name</td>\n",
" <td>nppes_provider_city</td>\n",
" <td>nppes_provider_state</td>\n",
" <td>specialty_description</td>\n",
" <td>description_flag</td>\n",
" <td>drug_name</td>\n",
" <td>generic_name</td>\n",
" <td>bene_count</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1053463901</td>\n",
" <td>COHEN</td>\n",
" <td>MICHAEL</td>\n",
" <td>KENMORE</td>\n",
" <td>WA</td>\n",
" <td>Dentist</td>\n",
" <td>T</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>11.0</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1083666788</td>\n",
" <td>COHEN</td>\n",
" <td>MICHAEL</td>\n",
" <td>RENNER</td>\n",
" <td>LA</td>\n",
" <td>Obstetrics/Gynecology</td>\n",
" <td>S</td>\n",
" <td>METRONIDAZOLE</td>\n",
" <td>METRONIDAZOLE</td>\n",
" <td>12.0</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 255 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 \\\n",
"0 Medicare Provider Charges - Part D \n",
"1 https://public.enigma.com/datasets/ce0ea1d5-50... \n",
"2 npi \n",
"3 1053463901 \n",
"4 1083666788 \n",
"\n",
" 1 2 \\\n",
"0 None None \n",
"1 None None \n",
"2 nppes_provider_last_org_name nppes_provider_first_name \n",
"3 COHEN MICHAEL \n",
"4 COHEN MICHAEL \n",
"\n",
" 3 4 5 \\\n",
"0 None None None \n",
"1 None None None \n",
"2 nppes_provider_city nppes_provider_state specialty_description \n",
"3 KENMORE WA Dentist \n",
"4 RENNER LA Obstetrics/Gynecology \n",
"\n",
" 6 7 8 9 ... 245 \\\n",
"0 None None None None ... None \n",
"1 None None None None ... None \n",
"2 description_flag drug_name generic_name bene_count ... None \n",
"3 T IBUPROFEN IBUPROFEN 11.0 ... None \n",
"4 S METRONIDAZOLE METRONIDAZOLE 12.0 ... None \n",
"\n",
" 246 247 248 249 250 251 252 253 254 \n",
"0 None None None None None None None None None \n",
"1 None None None None None None None None None \n",
"2 None None None None None None None None None \n",
"3 None None None None None None None None None \n",
"4 None None None None None None None None None \n",
"\n",
"[5 rows x 255 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(results)\n",
"df.to_csv('cohen.csv')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Searching for \"Michael Cohen\"\n",
"\n",
"The code so far returns all rows in Enigma Public that include the words `michael` and `cohen`:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"25731"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
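{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each dataset contributes three bookkeeping rows to `results` (display\n",
"# name, URL, and field names), so an approximate count of the actual\n",
"# matching data rows is:\n",
"len(df) - 3 * ds_count"
]
},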
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After subtracting the rows used to capture the dataset display name, ID, and field names, the number of matching rows is around 24,000. This includes the row containing Stephen Cohen and Michael Ferrone, as well as rows containing any Michael who lives on Cohen Street, etc. We want to narrow this down to rows that reference a person called Michael Cohen. \n",
"\n",
"The API doesn't provide a way to do this directly, but you can search for rows where `michael` and `cohen` are in the same column or adjacent columns, indicating a likely match. This will reduce the number of matching rows considerably.\n",
"\n",
"The code below does this by analyzing each row before appending it to the results. Since it's possible a dataset many not have any rows that match the new proximity matching criteria, we'll delay writing the dataset metadata until we know there's at least one row.\n",
"\n",
"The matching algorithm requires that all cell values are strings. Since some values are read as Booleans, we'll convert these to strings. We'll also filter out `None` values, make everything lower case, and remove any punctuation. The inner `for` loop traverses each row using a \"window\" of size `distance` (set here to 2, but you could set it to 3, for example, if you think there might be a separate column for a middle initial). If all of the search terms appear within the sliding window at any point during the traversal, the `match` flag is set to `True`.\n",
"\n",
"We've moved everything, including the initial setup steps, into a function so we can reuse it later in the tutorial, which is when `phrase2` comes into play. For now we're passing in an empty string for `phrase2`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def proximity_search(phrase1, phrase2):\n",
" import string\n",
" table = str.maketrans('', '', string.punctuation)\n",
" \n",
" s = requests.Session()\n",
" headers = {'authorization': 'Bearer ' + 'your_api_key'}\n",
" s.headers.update(headers)\n",
"\n",
" url = 'https://public.enigma.com/api/datasets/'\n",
" parameters = {'match_metadata':'false','match_rows':'true', 'row_limit': 1000}\n",
" parameters['query'] = ' '.join([phrase1.lower(), phrase2.lower()]) \n",
"\n",
" response = s.head(url, headers=headers, params=parameters)\n",
" ds_count = int(response.headers.get('content-range').split(\"/\")[1])\n",
"\n",
" results = []\n",
" distance = 2\n",
" for i in range(0, ds_count, 10):\n",
" headers['Range'] = 'resources={}-{}'.format(i, i + 9)\n",
" s.headers.update(headers)\n",
" response = s.get(url, headers=headers, params=parameters).json()\n",
" for dataset in response:\n",
" first_match = True\n",
" rows = dataset['current_snapshot']['table_rows']['rows']\n",
" for row in rows:\n",
" row = [str(value) if isinstance(value, bool) else value for value in row ]\n",
" match = False\n",
" for j in range (0, len(row) - distance + 1):\n",
" words = (' '.join(filter(None, row[j: j + distance]))).lower().translate(table)\n",
" if all(s in words.split() for s in phrase1.lower().split()):\n",
" match = True\n",
" break\n",
" if match:\n",
" if first_match:\n",
" results.append([dataset['display_name']])\n",
" results.append(['https://public.enigma.com/datasets/' + dataset['id']])\n",
" results.append(dataset['current_snapshot']['table_rows']['fields'])\n",
" first_match = False\n",
" row = [value[:100] if isinstance(value, str) else '' for value in row ]\n",
" results.append(row)\n",
" return results\n",
"\n",
"results = proximity_search(phrase1, '')"
]
},
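{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of the sliding-window check used above, run on a\n",
"# made-up sample row (not from any real dataset):\n",
"import string\n",
"table = str.maketrans('', '', string.punctuation)\n",
"row = ['123 Main St.', 'COHEN', 'MICHAEL', 'NEW YORK']\n",
"distance = 2\n",
"terms = 'michael cohen'.split()\n",
"match = False\n",
"for j in range(0, len(row) - distance + 1):\n",
"    words = ' '.join(row[j: j + distance]).lower().translate(table)\n",
"    if all(t in words.split() for t in terms):\n",
"        match = True\n",
"        break\n",
"match"
]
},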
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Convert rows to DataFrame\n",
"\n",
"We can now convert the results list-of-lists to a DataFrame and save it as a CSV file."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>245</th>\n",
" <th>246</th>\n",
" <th>247</th>\n",
" <th>248</th>\n",
" <th>249</th>\n",
" <th>250</th>\n",
" <th>251</th>\n",
" <th>252</th>\n",
" <th>253</th>\n",
" <th>254</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Medicare Provider Charges - Part D</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ce0ea1d5-504c-46b4-b8f3-9f431f71512c</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>npi</td>\n",
" <td>nppes_provider_last_org_name</td>\n",
" <td>nppes_provider_first_name</td>\n",
" <td>nppes_provider_city</td>\n",
" <td>nppes_provider_state</td>\n",
" <td>specialty_description</td>\n",
" <td>description_flag</td>\n",
" <td>drug_name</td>\n",
" <td>generic_name</td>\n",
" <td>bene_count</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1053463901</td>\n",
" <td>COHEN</td>\n",
" <td>MICHAEL</td>\n",
" <td>KENMORE</td>\n",
" <td>WA</td>\n",
" <td>Dentist</td>\n",
" <td>T</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>11.0</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1083666788</td>\n",
" <td>COHEN</td>\n",
" <td>MICHAEL</td>\n",
" <td>RENNER</td>\n",
" <td>LA</td>\n",
" <td>Obstetrics/Gynecology</td>\n",
" <td>S</td>\n",
" <td>METRONIDAZOLE</td>\n",
" <td>METRONIDAZOLE</td>\n",
" <td>12.0</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 255 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 \\\n",
"0 Medicare Provider Charges - Part D None \n",
"1 ce0ea1d5-504c-46b4-b8f3-9f431f71512c None \n",
"2 npi nppes_provider_last_org_name \n",
"3 1053463901 COHEN \n",
"4 1083666788 COHEN \n",
"\n",
" 2 3 4 \\\n",
"0 None None None \n",
"1 None None None \n",
"2 nppes_provider_first_name nppes_provider_city nppes_provider_state \n",
"3 MICHAEL KENMORE WA \n",
"4 MICHAEL RENNER LA \n",
"\n",
" 5 6 7 8 \\\n",
"0 None None None None \n",
"1 None None None None \n",
"2 specialty_description description_flag drug_name generic_name \n",
"3 Dentist T IBUPROFEN IBUPROFEN \n",
"4 Obstetrics/Gynecology S METRONIDAZOLE METRONIDAZOLE \n",
"\n",
" 9 ... 245 246 247 248 249 250 251 252 253 \\\n",
"0 None ... None None None None None None None None None \n",
"1 None ... None None None None None None None None None \n",
"2 bene_count ... None None None None None None None None None \n",
"3 11.0 ... None None None None None None None None None \n",
"4 12.0 ... None None None None None None None None None \n",
"\n",
" 254 \n",
"0 None \n",
"1 None \n",
"2 None \n",
"3 None \n",
"4 None \n",
"\n",
"[5 rows x 255 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(results)\n",
"df.to_csv('cohen.csv')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then check the number of matching rows."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"17396"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Focus the search\n",
"\n",
"Although we've halved the number of matching rows, there are still a lot. Now that you have the results in a format you can review, you might want to modify the query string to conduct a more narrowly scoped search based on your initial findings. For example, if you're interested in the Michael Cohen with connections to New York, you might include `new york` in the search query. The way the function is constructed, `phrase2` is used only to narrow down the number of rows returned by the API, and isn't used in the proximity search. Let's see how this affects the results."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>216</th>\n",
" <th>217</th>\n",
" <th>218</th>\n",
" <th>219</th>\n",
" <th>220</th>\n",
" <th>221</th>\n",
" <th>222</th>\n",
" <th>223</th>\n",
" <th>224</th>\n",
" <th>225</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>New York City Building Permits Issued</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>862f9e4a-88f3-4231-ad43-c823833ba4fd</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>bin__</td>\n",
" <td>borough</td>\n",
" <td>permit_type</td>\n",
" <td>permit_status</td>\n",
" <td>house__</td>\n",
" <td>street_name</td>\n",
" <td>zip_code</td>\n",
" <td>issuance_date</td>\n",
" <td>expiration_date</td>\n",
" <td>job_start_date</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1010650</td>\n",
" <td>MANHATTAN</td>\n",
" <td>EW</td>\n",
" <td>ISSUED</td>\n",
" <td>517</td>\n",
" <td>6 AVENUE</td>\n",
" <td>10011</td>\n",
" <td>2000-11-03T00:00:00</td>\n",
" <td>11/03/2001</td>\n",
" <td>11/03/2000</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1046619</td>\n",
" <td>MANHATTAN</td>\n",
" <td>EW</td>\n",
" <td>ISSUED</td>\n",
" <td>49</td>\n",
" <td>EAST 80TH STREET</td>\n",
" <td>10075</td>\n",
" <td>2016-06-28T00:00:00</td>\n",
" <td>05/29/2017</td>\n",
" <td>06/28/2016</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 226 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 \\\n",
"0 New York City Building Permits Issued None None \n",
"1 862f9e4a-88f3-4231-ad43-c823833ba4fd None None \n",
"2 bin__ borough permit_type \n",
"3 1010650 MANHATTAN EW \n",
"4 1046619 MANHATTAN EW \n",
"\n",
" 3 4 5 6 7 \\\n",
"0 None None None None None \n",
"1 None None None None None \n",
"2 permit_status house__ street_name zip_code issuance_date \n",
"3 ISSUED 517 6 AVENUE 10011 2000-11-03T00:00:00 \n",
"4 ISSUED 49 EAST 80TH STREET 10075 2016-06-28T00:00:00 \n",
"\n",
" 8 9 ... 216 217 218 219 220 221 \\\n",
"0 None None ... None None None None None None \n",
"1 None None ... None None None None None None \n",
"2 expiration_date job_start_date ... None None None None None None \n",
"3 11/03/2001 11/03/2000 ... None None None None None None \n",
"4 05/29/2017 06/28/2016 ... None None None None None None \n",
"\n",
" 222 223 224 225 \n",
"0 None None None None \n",
"1 None None None None \n",
"2 None None None None \n",
"3 None None None None \n",
"4 None None None None \n",
"\n",
"[5 rows x 226 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = proximity_search('michael cohen', 'new york')\n",
"df = pd.DataFrame(results)\n",
"df.to_csv('cohen.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1538"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is much better. We've found 1,139 rows that reference a Michael Cohen with connections to New York. There will be some false matches (for example, if there's a cell with Stephen Cohen next to a cell with Michael Ferrone), but having narrowed down the results sufficiently, you can review them manually in the [attached spreadheet](https://docs.google.com/spreadsheets/d/1xRpllp2s6zBzyGBDKtD5f4oxWcxDbjolFjum6YI2HJY/edit#gid=1663626083).\n",
"\n",
"#### Further research\n",
"\n",
"You can use the proximity search function in other ways. For example, the New York Corporate Registrations dataset shows Michael Cohen at 502 Park Ave, New York as the owner of several taxi companies. You might wonder if there are other businesses operated out of the same address."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>85</th>\n",
" <th>86</th>\n",
" <th>87</th>\n",
" <th>88</th>\n",
" <th>89</th>\n",
" <th>90</th>\n",
" <th>91</th>\n",
" <th>92</th>\n",
" <th>93</th>\n",
" <th>94</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bill of Lading Summary - 2016</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>f72f8304-8598-4920-9287-6e12cb2071f5</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>identifier</td>\n",
" <td>trade_update_date</td>\n",
" <td>run_date</td>\n",
" <td>vessel_name</td>\n",
" <td>port_of_unlading</td>\n",
" <td>estimated_arrival_date</td>\n",
" <td>foreign_port_of_lading</td>\n",
" <td>record_status_indicator</td>\n",
" <td>place_of_receipt</td>\n",
" <td>port_of_destination</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016010519011</td>\n",
" <td>2015-12-03T00:00:00</td>\n",
" <td>2016-01-05T00:00:00</td>\n",
" <td>YM UTOPIA</td>\n",
" <td>New York/Newark Area, Newark, New Jersey</td>\n",
" <td>2015-12-30T00:00:00</td>\n",
" <td>Singapore,Singapore</td>\n",
" <td>New</td>\n",
" <td>SINGAPORE</td>\n",
" <td>KKLU</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016111619182</td>\n",
" <td>2016-10-21T00:00:00</td>\n",
" <td>2016-11-16T00:00:00</td>\n",
" <td>MATAQUITO</td>\n",
" <td>New York/Newark Area, Newark, New Jersey</td>\n",
" <td>2016-11-15T00:00:00</td>\n",
" <td>Mundra,India</td>\n",
" <td>New</td>\n",
" <td>NEW DELHI INDIA</td>\n",
" <td>CMDU</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 95 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 \\\n",
"0 Bill of Lading Summary - 2016 None \n",
"1 f72f8304-8598-4920-9287-6e12cb2071f5 None \n",
"2 identifier trade_update_date \n",
"3 2016010519011 2015-12-03T00:00:00 \n",
"4 2016111619182 2016-10-21T00:00:00 \n",
"\n",
" 2 3 4 \\\n",
"0 None None None \n",
"1 None None None \n",
"2 run_date vessel_name port_of_unlading \n",
"3 2016-01-05T00:00:00 YM UTOPIA New York/Newark Area, Newark, New Jersey \n",
"4 2016-11-16T00:00:00 MATAQUITO New York/Newark Area, Newark, New Jersey \n",
"\n",
" 5 6 7 \\\n",
"0 None None None \n",
"1 None None None \n",
"2 estimated_arrival_date foreign_port_of_lading record_status_indicator \n",
"3 2015-12-30T00:00:00 Singapore,Singapore New \n",
"4 2016-11-15T00:00:00 Mundra,India New \n",
"\n",
" 8 9 ... 85 86 87 88 89 \\\n",
"0 None None ... None None None None None \n",
"1 None None ... None None None None None \n",
"2 place_of_receipt port_of_destination ... None None None None None \n",
"3 SINGAPORE KKLU ... None None None None None \n",
"4 NEW DELHI INDIA CMDU ... None None None None None \n",
"\n",
" 90 91 92 93 94 \n",
"0 None None None None None \n",
"1 None None None None None \n",
"2 None None None None None \n",
"3 None None None None None \n",
"4 None None None None None \n",
"\n",
"[5 rows x 95 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = proximity_search('502 park ave', 'new york')\n",
"df = pd.DataFrame(results)\n",
"df.to_csv('park ave.csv')\n",
"df.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}