{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Power search using the Enigma Public API\n",
"\n",
"The [Enigma Public web application](https://public.enigma.com/) provides powerful [search capabilities](http://docs.enigma.com/public/public_v20_user_searching.html) to help you locate specific information within the vast range of available datasets. For example, a search for `Michael Cohen` indicates there are 563 matching datasets. For each dataset, the search results page tells you the number of matching rows within each dataset, and filtering options let you narrow down the search results based on parent collection, dataset size, or date last updated. You can then open each matching dataset in turn to view the matching rows and determine if the data is relevant to your research. \n",
"\n",
"A search like this, though, returns a *lot* of data, and many of the hits won't relate to a person called Michael Cohen. For example, one row in the [New York City Building Permits dataset](https://public.enigma.com/datasets/new-york-city-building-permits-issued/862f9e4a-88f3-4231-ad43-c823833ba4fd) lists the building owner as Stephen Cohen and the person filing the application as Michael Ferrone. Since the row includes the words `Michael` and `Cohen`, it counts as a hit. Additionally, other rows may reference a person called Michael who lives on Cohen Street, or a person called Cohen who works at the Paul Michael company. How can you focus your search to locate a person called `Michael Cohen` and display the search results in a way that gives you an immediate view of *all* matching rows across *all* datasets? That's the focus of this tutorial.\n",
"\n",
"#### Some initial setup\n",
"\n",
"This article assumes a basic knowledge of Python and pandas, as well as some familiarity with the [Enigma Public API](http://docs.enigma.com/public/public_v20_api_about.html). If you haven't used the API before, the tutorial [Getting started with the API](http://docs.enigma.com/public/public_v20_user_api.html) provides information on exploring the API using [Postman](https://www.getpostman.com/). Another tutorial, [Using Python to access Enigma Public](http://docs.enigma.com/public/public_v20_user_python.html), demonstrates how to make API calls from Python.\n",
"\n",
"The code samples in this article are shown in Jupyter Notebook. If you want to follow along, you'll need to install [Jupyter Notebook](https://jupyter.readthedocs.io/en/latest/content-quickstart.html) (or [JupyterLab](http://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html)) along with the two required Python modules, [requests](https://docs.python-requests.org/en/master/user/quickstart/) and [pandas](https://pandas.pydata.org/), into your Python environment (`pip install requests` and `pip install pandas`). The code below shows how to import the modules into your Jupyter project, create an HTTP session object, and include your [Enigma Public API key](http://docs.enigma.com/public/public_v20_api_authentication.html) in the authorization header. It also sets up the base URL for the API calls."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import pandas as pd\n",
"\n",
"s = requests.Session()\n",
"headers = {'authorization': 'Bearer ' + 'your_api_key'}\n",
"s.headers.update(headers)\n",
"\n",
"url = 'https://public.enigma.com/api/datasets/'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create your initial query\n",
"\n",
"The `query` parameter on the API's [GET /datasets/](http://docs.enigma.com/public/public_v20_api_get_datasets.html) endpoint lets you search all Enigma Public datasets for the terms you specify. If you specify multiple search terms, they're ANDed together. The `query` parameter supports other [advanced operators](http://docs.enigma.com/public/public_v20_user_searching.html#AdvancedSearch), but we won't use them in this tutorial. Additional parameters we'll use for the search are:\n",
"\n",
"* `match_rows`: When `true`, the API searches for datasets where there's a match within the data records (rows). We'll set this to `true` (the default).\n",
"* `match_metadata`: When `true`, the API searches for datasets where there's a match within the dataset metadata. We'll set this to `false`.\n",
"* `row_limit`: Number of rows to return for each dataset. The default is 0 and the maximum you can request is 10,000. We'll request up to 1,000 from each dataset.\n",
"\n",
"We'll set up these parameters as a dictionary we can use as the `params` argument when we make the API call. Defining the query string separately will help later."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"parameters = {'match_metadata':'false','match_rows':'true', 'row_limit': 1000}\n",
"phrase1 = 'michael cohen'\n",
"parameters['query'] = phrase1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Get count of datasets matching the query string\n",
"\n",
"By default, [GET /datasets/](http://docs.enigma.com/public/public_v20_api_get_datasets.html) returns results for the first 20 datasets. We already know that a search for `michael cohen` yields many more than 20 datasets. Although you can request up to 1,000 by including a [Range header](http://docs.enigma.com/public/public_v20_api_pagination.html) in the request, there's a good chance the API will refuse your request or time out if the quantity of data is too large. For this reason, you should request the data in chunks. Before you can do this though, you need to know how many datasets match your query. You can do this by sending an HTTP `HEAD` request (which returns only the response headers, and not the data) and then reading the total number from the `content-range` header, as shown below."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"563"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"response = s.head(url, headers=headers, params=parameters)\n",
"ds_count = int(response.headers.get('content-range').split(\"/\")[1])\n",
"ds_count"
]
},
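{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: the content-range header has a form like\n",
"# 'resources 0-19/563' (this sample value is made up); the total dataset\n",
"# count is the part after the '/':\n",
"sample = 'resources 0-19/563'\n",
"int(sample.split('/')[1])"
]
},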
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Get resulting rows in batches\n",
"\n",
"Now that you know the total number of matching datasets (563), you can request them in batches. The code below requests the datasets in batches of 10 using a `Range` header. The API returns the requested data as a JSON list, where each list item represents one dataset. For each dataset, the JSON includes:\n",
"\n",
"* The dataset metadata (`display_name`, `id`, etc.)\n",
"* Data for the current snapshot, including the field names and up to 1,000 records (since we specified `row_limit: 1000`)\n",
"\n",
"The code iterates through the list of datasets and creates a list-of-lists with all the matching rows, as well as information about each dataset. The second-to-last line truncates any values with more than 100 characters, since some datasets (for example, the [U.S. Patent and Trademark Office datasets](https://public.enigma.com/browse/collection/u-s-patent-and-trademark-office/cfc6f327-763c-48ee-b9ec-82f7d0b00762)) include extremely long descriptions."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"results = []\n",
"for i in range(0, ds_count, 10):\n",
" headers['Range'] = 'resources={}-{}'.format(i, i + 9)\n",
" s.headers.update(headers)\n",
" response = s.get(url, headers=headers, params=parameters).json()\n",
" for dataset in response:\n",
" results.append([dataset['display_name']])\n",
" results.append(['https://public.enigma.com/datasets/' + dataset['id']])\n",
" results.append(dataset['current_snapshot']['table_rows']['fields'])\n",
" rows = dataset['current_snapshot']['table_rows']['rows']\n",
" for row in rows:\n",
" row = [value[:100] for value in row if isinstance(value, str)]\n",
" results.append(row)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a DataFrame\n",
"\n",
"Using pandas, it's easy to convert a list-of-lists into a DataFrame. You can then save this as a CSV file you can analyze offline using your favorite spreadsheet application."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>245</th>\n",
" <th>246</th>\n",
" <th>247</th>\n",
" <th>248</th>\n",
" <th>249</th>\n",
" <th>250</th>\n",
" <th>251</th>\n",
" <th>252</th>\n",
" <th>253</th>\n",
" <th>254</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Medicare Provider Charges - Part D</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>https://public.enigma.com/datasets/ce0ea1d5-50...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>npi</td>\n",
" <td>nppes_provider_last_org_name</td>\n",
" <td>nppes_provider_first_name</td>\n",
" <td>nppes_provider_city</td>\n",
" <td>nppes_provider_state</td>\n",
" <td>specialty_description</td>\n",
" <td>description_flag</td>\n",
" <td>drug_name</td>\n",
" <td>generic_name</td>\n",
" <td>bene_count</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1053463901</td>\n",
" <td>COHEN</td>\n",
" <td>MICHAEL</td>\n",
" <td>KENMORE</td>\n",
" <td>WA</td>\n",
" <td>Dentist</td>\n",
" <td>T</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>11.0</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1083666788</td>\n",
" <td>COHEN</td>\n",
" <td>MICHAEL</td>\n",
" <td>RENNER</td>\n",
" <td>LA</td>\n",
" <td>Obstetrics/Gynecology</td>\n",
" <td>S</td>\n",
" <td>METRONIDAZOLE</td>\n",
" <td>METRONIDAZOLE</td>\n",
" <td>12.0</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 255 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 \\\n",
"0 Medicare Provider Charges - Part D \n",
"1 https://public.enigma.com/datasets/ce0ea1d5-50... \n",
"2 npi \n",
"3 1053463901 \n",
"4 1083666788 \n",
"\n",
" 1 2 \\\n",
"0 None None \n",
"1 None None \n",
"2 nppes_provider_last_org_name nppes_provider_first_name \n",
"3 COHEN MICHAEL \n",
"4 COHEN MICHAEL \n",
"\n",
" 3 4 5 \\\n",
"0 None None None \n",
"1 None None None \n",
"2 nppes_provider_city nppes_provider_state specialty_description \n",
"3 KENMORE WA Dentist \n",
"4 RENNER LA Obstetrics/Gynecology \n",
"\n",
" 6 7 8 9 ... 245 \\\n",
"0 None None None None ... None \n",
"1 None None None None ... None \n",
"2 description_flag drug_name generic_name bene_count ... None \n",
"3 T IBUPROFEN IBUPROFEN 11.0 ... None \n",
"4 S METRONIDAZOLE METRONIDAZOLE 12.0 ... None \n",
"\n",
" 246 247 248 249 250 251 252 253 254 \n",
"0 None None None None None None None None None \n",
"1 None None None None None None None None None \n",
"2 None None None None None None None None None \n",
"3 None None None None None None None None None \n",
"4 None None None None None None None None None \n",
"\n",
"[5 rows x 255 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(results)\n",
"df.to_csv('cohen.csv')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Searching for \"Michael Cohen\"\n",
"\n",
"The code so far returns all rows in Enigma Public that include the words `michael` and `cohen`:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"25731"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
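{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Each dataset contributes three bookkeeping rows to `results` (display\n",
"# name, URL, and field names), so an approximate count of the actual\n",
"# matching data rows is:\n",
"len(df) - 3 * ds_count"
]
},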
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After subtracting the rows used to capture the dataset display name, ID, and field names, the number of matching rows is around 24,000. This includes the row containing Stephen Cohen and Michael Ferrone, as well as rows containing any Michael who lives on Cohen Street, etc. We want to narrow this down to rows that reference a person called Michael Cohen. \n",
"\n",
"The API doesn't provide a way to do this directly, but you can search for rows where `michael` and `cohen` are in the same column or adjacent columns, indicating a likely match. This will reduce the number of matching rows considerably.\n",
"\n",
"The code below does this by analyzing each row before appending it to the results. Since it's possible a dataset many not have any rows that match the new proximity matching criteria, we'll delay writing the dataset metadata until we know there's at least one row.\n",
"\n",
"The matching algorithm requires that all cell values are strings. Since some values are read as Booleans, we'll convert these to strings. We'll also filter out `None` values, make everything lower case, and remove any punctuation. The inner `for` loop traverses each row using a \"window\" of size `distance` (set here to 2, but you could set it to 3, for example, if you think there might be a separate column for a middle initial). If all of the search terms appear within the sliding window at any point during the traversal, the `match` flag is set to `True`.\n",
"\n",
"We've moved everything, including the initial setup steps, into a function so we can reuse it later in the tutorial, which is when `phrase2` comes into play. For now we're passing in an empty string for `phrase2`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def proximity_search(phrase1, phrase2):\n",
" import string\n",
" table = str.maketrans('', '', string.punctuation)\n",
" \n",
" s = requests.Session()\n",
" headers = {'authorization': 'Bearer ' + 'your_api_key'}\n",
" s.headers.update(headers)\n",
"\n",
" url = 'https://public.enigma.com/api/datasets/'\n",
" parameters = {'match_metadata':'false','match_rows':'true', 'row_limit': 1000}\n",
" parameters['query'] = ' '.join([phrase1.lower(), phrase2.lower()]) \n",
"\n",
" response = s.head(url, headers=headers, params=parameters)\n",
" ds_count = int(response.headers.get('content-range').split(\"/\")[1])\n",
"\n",
" results = []\n",
" distance = 2\n",
" for i in range(0, ds_count, 10):\n",
" headers['Range'] = 'resources={}-{}'.format(i, i + 9)\n",
" s.headers.update(headers)\n",
" response = s.get(url, headers=headers, params=parameters).json()\n",
" for dataset in response:\n",
" first_match = True\n",
" rows = dataset['current_snapshot']['table_rows']['rows']\n",
" for row in rows:\n",
" row = [str(value) if isinstance(value, bool) else value for value in row ]\n",
" match = False\n",
" for j in range (0, len(row) - distance + 1):\n",
" words = (' '.join(filter(None, row[j: j + distance]))).lower().translate(table)\n",
" if all(s in words.split() for s in phrase1.lower().split()):\n",
" match = True\n",
" break\n",
" if match:\n",
" if first_match:\n",
" results.append([dataset['display_name']])\n",
" results.append(['https://public.enigma.com/datasets/' + dataset['id']])\n",
" results.append(dataset['current_snapshot']['table_rows']['fields'])\n",
" first_match = False\n",
" row = [value[:100] if isinstance(value, str) else '' for value in row ]\n",
" results.append(row)\n",
" return results\n",
"\n",
"results = proximity_search(phrase1, '')"
]
},
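{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch of the sliding-window check used above, run on a\n",
"# made-up sample row (not from any real dataset):\n",
"import string\n",
"table = str.maketrans('', '', string.punctuation)\n",
"row = ['123 Main St.', 'COHEN', 'MICHAEL', 'NEW YORK']\n",
"distance = 2\n",
"terms = 'michael cohen'.split()\n",
"match = False\n",
"for j in range(0, len(row) - distance + 1):\n",
"    words = ' '.join(row[j: j + distance]).lower().translate(table)\n",
"    if all(t in words.split() for t in terms):\n",
"        match = True\n",
"        break\n",
"match"
]
},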
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Convert rows to DataFrame\n",
"\n",
"We can now convert the results list-of-lists to a DataFrame and save it as a CSV file."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>245</th>\n",
" <th>246</th>\n",
" <th>247</th>\n",
" <th>248</th>\n",
" <th>249</th>\n",
" <th>250</th>\n",
" <th>251</th>\n",
" <th>252</th>\n",
" <th>253</th>\n",
" <th>254</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Medicare Provider Charges - Part D</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ce0ea1d5-504c-46b4-b8f3-9f431f71512c</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>npi</td>\n",
" <td>nppes_provider_last_org_name</td>\n",
" <td>nppes_provider_first_name</td>\n",
" <td>nppes_provider_city</td>\n",
" <td>nppes_provider_state</td>\n",
" <td>specialty_description</td>\n",
" <td>description_flag</td>\n",
" <td>drug_name</td>\n",
" <td>generic_name</td>\n",
" <td>bene_count</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1053463901</td>\n",
" <td>COHEN</td>\n",
" <td>MICHAEL</td>\n",
" <td>KENMORE</td>\n",
" <td>WA</td>\n",
" <td>Dentist</td>\n",
" <td>T</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>11.0</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1083666788</td>\n",
" <td>COHEN</td>\n",
" <td>MICHAEL</td>\n",
" <td>RENNER</td>\n",
" <td>LA</td>\n",
" <td>Obstetrics/Gynecology</td>\n",
" <td>S</td>\n",
" <td>METRONIDAZOLE</td>\n",
" <td>METRONIDAZOLE</td>\n",
" <td>12.0</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 255 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 \\\n",
"0 Medicare Provider Charges - Part D None \n",
"1 ce0ea1d5-504c-46b4-b8f3-9f431f71512c None \n",
"2 npi nppes_provider_last_org_name \n",
"3 1053463901 COHEN \n",
"4 1083666788 COHEN \n",
"\n",
" 2 3 4 \\\n",
"0 None None None \n",
"1 None None None \n",
"2 nppes_provider_first_name nppes_provider_city nppes_provider_state \n",
"3 MICHAEL KENMORE WA \n",
"4 MICHAEL RENNER LA \n",
"\n",
" 5 6 7 8 \\\n",
"0 None None None None \n",
"1 None None None None \n",
"2 specialty_description description_flag drug_name generic_name \n",
"3 Dentist T IBUPROFEN IBUPROFEN \n",
"4 Obstetrics/Gynecology S METRONIDAZOLE METRONIDAZOLE \n",
"\n",
" 9 ... 245 246 247 248 249 250 251 252 253 \\\n",
"0 None ... None None None None None None None None None \n",
"1 None ... None None None None None None None None None \n",
"2 bene_count ... None None None None None None None None None \n",
"3 11.0 ... None None None None None None None None None \n",
"4 12.0 ... None None None None None None None None None \n",
"\n",
" 254 \n",
"0 None \n",
"1 None \n",
"2 None \n",
"3 None \n",
"4 None \n",
"\n",
"[5 rows x 255 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(results)\n",
"df.to_csv('cohen.csv')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then check the number of matching rows."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"17396"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Focus the search\n",
"\n",
"Although we've halved the number of matching rows, there are still a lot. Now that you have the results in a format you can review, you might want to modify the query string to conduct a more narrowly scoped search based on your initial findings. For example, if you're interested in the Michael Cohen with connections to New York, you might include `new york` in the search query. The way the function is constructed, `phrase2` is used only to narrow down the number of rows returned by the API, and isn't used in the proximity search. Let's see how this affects the results."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>216</th>\n",
" <th>217</th>\n",
" <th>218</th>\n",
" <th>219</th>\n",
" <th>220</th>\n",
" <th>221</th>\n",
" <th>222</th>\n",
" <th>223</th>\n",
" <th>224</th>\n",
" <th>225</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>New York City Building Permits Issued</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>862f9e4a-88f3-4231-ad43-c823833ba4fd</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>bin__</td>\n",
" <td>borough</td>\n",
" <td>permit_type</td>\n",
" <td>permit_status</td>\n",
" <td>house__</td>\n",
" <td>street_name</td>\n",
" <td>zip_code</td>\n",
" <td>issuance_date</td>\n",
" <td>expiration_date</td>\n",
" <td>job_start_date</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1010650</td>\n",
" <td>MANHATTAN</td>\n",
" <td>EW</td>\n",
" <td>ISSUED</td>\n",
" <td>517</td>\n",
" <td>6 AVENUE</td>\n",
" <td>10011</td>\n",
" <td>2000-11-03T00:00:00</td>\n",
" <td>11/03/2001</td>\n",
" <td>11/03/2000</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1046619</td>\n",
" <td>MANHATTAN</td>\n",
" <td>EW</td>\n",
" <td>ISSUED</td>\n",
" <td>49</td>\n",
" <td>EAST 80TH STREET</td>\n",
" <td>10075</td>\n",
" <td>2016-06-28T00:00:00</td>\n",
" <td>05/29/2017</td>\n",
" <td>06/28/2016</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 226 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2 \\\n",
"0 New York City Building Permits Issued None None \n",
"1 862f9e4a-88f3-4231-ad43-c823833ba4fd None None \n",
"2 bin__ borough permit_type \n",
"3 1010650 MANHATTAN EW \n",
"4 1046619 MANHATTAN EW \n",
"\n",
" 3 4 5 6 7 \\\n",
"0 None None None None None \n",
"1 None None None None None \n",
"2 permit_status house__ street_name zip_code issuance_date \n",
"3 ISSUED 517 6 AVENUE 10011 2000-11-03T00:00:00 \n",
"4 ISSUED 49 EAST 80TH STREET 10075 2016-06-28T00:00:00 \n",
"\n",
" 8 9 ... 216 217 218 219 220 221 \\\n",
"0 None None ... None None None None None None \n",
"1 None None ... None None None None None None \n",
"2 expiration_date job_start_date ... None None None None None None \n",
"3 11/03/2001 11/03/2000 ... None None None None None None \n",
"4 05/29/2017 06/28/2016 ... None None None None None None \n",
"\n",
" 222 223 224 225 \n",
"0 None None None None \n",
"1 None None None None \n",
"2 None None None None \n",
"3 None None None None \n",
"4 None None None None \n",
"\n",
"[5 rows x 226 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = proximity_search('michael cohen', 'new york')\n",
"df = pd.DataFrame(results)\n",
"df.to_csv('cohen.csv')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1538"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is much better. We've found 1,139 rows that reference a Michael Cohen with connections to New York. There will be some false matches (for example, if there's a cell with Stephen Cohen next to a cell with Michael Ferrone), but having narrowed down the results sufficiently, you can review them manually in the [attached spreadheet](https://docs.google.com/spreadsheets/d/1xRpllp2s6zBzyGBDKtD5f4oxWcxDbjolFjum6YI2HJY/edit#gid=1663626083).\n",
"\n",
"#### Further research\n",
"\n",
"You can use the proximity search function in other ways. For example, the New York Corporate Registrations dataset shows Michael Cohen at 502 Park Ave, New York as the owner of several taxi companies. You might wonder if there are other businesses operated out of the same address."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>85</th>\n",
" <th>86</th>\n",
" <th>87</th>\n",
" <th>88</th>\n",
" <th>89</th>\n",
" <th>90</th>\n",
" <th>91</th>\n",
" <th>92</th>\n",
" <th>93</th>\n",
" <th>94</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Bill of Lading Summary - 2016</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>f72f8304-8598-4920-9287-6e12cb2071f5</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>identifier</td>\n",
" <td>trade_update_date</td>\n",
" <td>run_date</td>\n",
" <td>vessel_name</td>\n",
" <td>port_of_unlading</td>\n",
" <td>estimated_arrival_date</td>\n",
" <td>foreign_port_of_lading</td>\n",
" <td>record_status_indicator</td>\n",
" <td>place_of_receipt</td>\n",
" <td>port_of_destination</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016010519011</td>\n",
" <td>2015-12-03T00:00:00</td>\n",
" <td>2016-01-05T00:00:00</td>\n",
" <td>YM UTOPIA</td>\n",
" <td>New York/Newark Area, Newark, New Jersey</td>\n",
" <td>2015-12-30T00:00:00</td>\n",
" <td>Singapore,Singapore</td>\n",
" <td>New</td>\n",
" <td>SINGAPORE</td>\n",
" <td>KKLU</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016111619182</td>\n",
" <td>2016-10-21T00:00:00</td>\n",
" <td>2016-11-16T00:00:00</td>\n",
" <td>MATAQUITO</td>\n",
" <td>New York/Newark Area, Newark, New Jersey</td>\n",
" <td>2016-11-15T00:00:00</td>\n",
" <td>Mundra,India</td>\n",
" <td>New</td>\n",
" <td>NEW DELHI INDIA</td>\n",
" <td>CMDU</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 95 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 \\\n",
"0 Bill of Lading Summary - 2016 None \n",
"1 f72f8304-8598-4920-9287-6e12cb2071f5 None \n",
"2 identifier trade_update_date \n",
"3 2016010519011 2015-12-03T00:00:00 \n",
"4 2016111619182 2016-10-21T00:00:00 \n",
"\n",
" 2 3 4 \\\n",
"0 None None None \n",
"1 None None None \n",
"2 run_date vessel_name port_of_unlading \n",
"3 2016-01-05T00:00:00 YM UTOPIA New York/Newark Area, Newark, New Jersey \n",
"4 2016-11-16T00:00:00 MATAQUITO New York/Newark Area, Newark, New Jersey \n",
"\n",
" 5 6 7 \\\n",
"0 None None None \n",
"1 None None None \n",
"2 estimated_arrival_date foreign_port_of_lading record_status_indicator \n",
"3 2015-12-30T00:00:00 Singapore,Singapore New \n",
"4 2016-11-15T00:00:00 Mundra,India New \n",
"\n",
" 8 9 ... 85 86 87 88 89 \\\n",
"0 None None ... None None None None None \n",
"1 None None ... None None None None None \n",
"2 place_of_receipt port_of_destination ... None None None None None \n",
"3 SINGAPORE KKLU ... None None None None None \n",
"4 NEW DELHI INDIA CMDU ... None None None None None \n",
"\n",
" 90 91 92 93 94 \n",
"0 None None None None None \n",
"1 None None None None None \n",
"2 None None None None None \n",
"3 None None None None None \n",
"4 None None None None None \n",
"\n",
"[5 rows x 95 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = proximity_search('502 park ave', 'new york')\n",
"df = pd.DataFrame(results)\n",
"df.to_csv('park ave.csv')\n",
"df.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}