{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using the Enigma Python SDK\n",
"\n",
"I wrote the [Enigma Public Power Search app](http://www.search-public.info/) a while ago, before the Enigma SDK existed. The app's core feature is a \"proximity search\" function that searches across all of Enigma Public for data records where the words you want are in the same or adjacent columns. Often people search Enigma Public for a specific person, company, location, etc. The standard search returns a lot of false hits and doesn't offer a way to view the resulting rows without opening each dataset in turn. For example, you might be looking for information on \"Michael Cohen\" but you get hits on \"Michael Smith\" who lives on \"Cohen Street,\" or \"Stephen Cohen\" who's an officer of the same company as \"Michael Ferrone.\" Power Search greatly reduces the number of false hits and gives you an immediate view of *all* matching rows across *all* datasets. \n",
"\n",
"The app is written in Python and makes heavy use of the Requests library to make HTTP requests to the Enigma Public API. There's a lot of requesting, a lot of looping to handle pagination, and a lot of digging through long JSON responses. Since an SDK is supposed to wrap finicky HTTP requests, provide helper methods, and let developers interact with native objects instead of navigating through complex JSON hierarchies, I figured the new Enigma Python SDK would simplify things. I wanted to see just how much easier it would be and whether it could actually improve my app's performance, which is admittedly very sluggish.\n",
"\n",
"In this article, I'll show you some of the original proximity search code alongside the new SDK-powered version. I'm no power Pythonista, so you'll have to excuse my hacky code. However, I can offer some insights into the new SDK and why you might want to use it if you're a Python programmer interested in exploring public data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How proximity search works\n",
"\n",
"Here's an outline of how the proximity search function works:\n",
"\n",
"1. It searches Enigma Public to locate all datasets that include the specified search words. This will include all those \"false hits\" mentioned earlier.\n",
"2. It then fetches the matching rows from those datasets. It fetches the data in batches because if the number of matches is large, the connection will time out or the API will refuse your request.\n",
"3. For each batch, it goes through each row of each dataset to determine if the search words are in the same column (distance=1), adjacent columns (distance=2), or within three adjacent columns (distance=3), depending on the user's selection.\n",
"4. It gathers up all the proximity matched rows along with information about the dataset each group came from, and returns these to the app.\n",
"\n",
"It's a brute force algorithm that requires a lot of API interation and fetching of data, but it works! Let's go through each step in turn, comparing the pre-SDK code with the SDK-powered code."
]
},
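{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's that sketch: a stripped-down version of the real matching code later in this article, run against a single made-up row (similar to one you'll see in the results below). With `distance=2`, the search words must fall within two adjacent columns of the same row:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration of the sliding window: slide a window of `distance`\n",
"# columns across the row and check whether all search terms fall inside it.\n",
"row = ['1053463901', 'COHEN', 'MICHAEL', 'KENMORE', 'WA']\n",
"terms = ['michael', 'cohen']\n",
"distance = 2\n",
"match = any(\n",
"    all(t in ' '.join(row[j:j + distance]).lower().split() for t in terms)\n",
"    for j in range(len(row) - distance + 1)\n",
")\n",
"match  # True: 'COHEN' and 'MICHAEL' are in adjacent columns"
]
},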
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting the dataset count\n",
"\n",
"Step 1 is to locate all datasets that include the search words. We won't fetch them yet -- we just need a count so we can fetch them in batches in the next step. Here's how the pre-SDK function works:\n",
"\n",
"1. It sets up the required HTTP session object.\n",
"2. It makes a `HEAD` request (which returns only the response headers and not the actual data) to the [/datasets/](http://docs.enigma.com/public/public_v20_api_get_datasets.html) endpoint. The `query` parameter specifies the search words (for example, \"michael cohen\"). Setting `match_metadata` and `include_serialids` to `false` says we're not interested in metadata hits or serial IDs. \n",
"3. It reads response headers and extracts the total number of matching datasets from the `content-range` header."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"572"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import requests\n",
"import pandas as pd\n",
"\n",
"session = requests.Session()\n",
"headers = {'authorization': 'Bearer ' + '<YOUR_API_KEY>'}\n",
"session.headers.update(headers)\n",
"\n",
"url = 'https://public.enigma.com/api/datasets/'\n",
"phrase = 'michael cohen'\n",
"params = {'query': phrase, 'row_limit':1000, 'match_metadata':'false', 'include_serialids':'false'}\n",
"\n",
"response = session.head(url, headers=headers, params=params)\n",
"ds_count = int(response.headers.get('content-range').split(\"/\")[1])\n",
"ds_count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's reporting that there are 572 datasets on Enigma Public with at least one instance of \"Michael\" and \"Cohen\" on the same row.\n",
"\n",
"Now the SDK. The SDK makes things quite a bit easier for several reasons:\n",
"\n",
"* First, it handles the HTTP session and all HTTP requests for you. This includes automatic retries, connection pooling, and passing of API keys. You create an SDK client object, add your API key, and you're done. If you've never managed HTTP requests, it doesn't matter -- power to the people!\n",
"* Second, you don't have to dig through response headers to find the number of matching datasets. The SDK handles this for you automatically.\n",
"* Third (and this is a real bonus), the SDK's [datasets.list( )](http://docs.enigma.com/public/public_v20_sdk_datasets_list) method returns a [ResourceList](http://docs.enigma.com/public/public_v20_sdk_resourcelist) object that represents the entire batch and supports all standard list operations (indexing, slicing, iterating over, etc.), but fetches data from the server automatically as needed. Nice! We'll use it the next step to get the data without making any further requests in the code.\n",
"\n",
"Here's how the SDK-powered code looks:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"572"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import enigma\n",
"import pandas as pd\n",
"\n",
"public = enigma.Public()\n",
"public.set_auth(apikey='YOUR-API-KEY')\n",
"\n",
"phrase = 'michael cohen'\n",
"datasets = public.datasets.list(query=phrase, row_limit=1000, match_metadata=False, include_serialids=False).all()\n",
"len(datasets)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We got the same result, the code is a lot simpler, and we saved a few lines, but the real benefits come in the next step. In case you're wondering, it took just a fraction longer to return the magic ResourceList than it did to execute the `HEAD` request in the pre-SDK code."
]
},
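{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to check the timings yourself, here's a minimal sketch using Python's `time.perf_counter()`. Wrap the pre-SDK `HEAD` request the same way to compare the two:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"# Time how long the SDK takes to return the ResourceList.\n",
"start = time.perf_counter()\n",
"datasets = public.datasets.list(query=phrase, row_limit=1000, match_metadata=False, include_serialids=False).all()\n",
"print('Returned {} datasets in {:.2f} seconds'.format(len(datasets), time.perf_counter() - start))"
]
},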
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Get resulting rows in batches\n",
"\n",
"Step 2 is to fetch the matching rows from each of the datasets located in step 1. The pre-SDK function makes a `GET` request to the [/datasets/](http://docs.enigma.com/public/public_v20_api_get_datasets.html) endpoint. By default, the API returns results for the first 20 datasets. Although you can request more (up to 1,000) by including a [Range header](http://docs.enigma.com/public/public_v20_api_pagination.html), there's a good chance the API will refuse your request or time out if the quantity of data is too large. For this reason, you should request the data in chunks.\n",
"\n",
"The pre-SDK code below requests the datasets in batches of 10 (an arbitrary choice, but seems to work). The API returns the requested data as a JSON list, where each list item represents one dataset. For each dataset, the JSON includes:\n",
"\n",
"* The dataset metadata (`display_name`, `id`, etc.)\n",
"* Data for the current snapshot, including the field names and up to 1,000 records (since we specified `row_limit: 1000`)\n",
"\n",
"If you're curious to see a sample JSON response, follow this link: https://public.enigma.com/api/datasets/?query=michael%20cohen&row_limit=10.\n",
"\n",
"The code below iterates through the list of datasets and creates a list-of-lists with all the matching rows, as well as information about each dataset. The third-to-last line truncates any values with more than 100 characters, since some datasets (for example, the [U.S. Patent and Trademark Office datasets](https://public.enigma.com/browse/collection/u-s-patent-and-trademark-office/cfc6f327-763c-48ee-b9ec-82f7d0b00762)) include extremely long descriptions."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"25325"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = []\n",
"for i in range(0, ds_count, 10):\n",
" headers['Range'] = 'resources={}-{}'.format(i, i + 9)\n",
" session.headers.update(headers)\n",
" response = session.get(url, headers=headers, params=params).json()\n",
" for dataset in response:\n",
" results.append([dataset['display_name']])\n",
" results.append(dataset['current_snapshot']['table_rows']['fields'])\n",
" rows = dataset['current_snapshot']['table_rows']['rows']\n",
" for row in rows:\n",
" row = [value[:100] for value in row if isinstance(value, str)]\n",
" results.append(row)\n",
"len(results)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Those 572 datasets yielded 25,325 matching rows. Since we set the row limit at 1,000 it's possible there are more. We could have maxed it out at 10,000, but 1,000 seems plenty.\n",
"\n",
"Now let's look at the SDK-powered version. Some things to note in the code below:\n",
"\n",
"* First, there are only two `for` loops instead of three in the pre-SDK version. This is because the SDK handles pagination for us, so we don't need to worry about batches. \n",
"* Second, we're interacting with Python objects using the familiar *object.attribute* notation, rather than navigating complicated JSON trees.\n",
"* Third, we're using the SDK's nifty [TableView](http://docs.enigma.com/public/public_v20_sdk_tableview) class, which offers convenient ways to reference snapshot rows and their fields."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"25325"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = []\n",
"for dataset in datasets:\n",
" tableview = dataset.current_snapshot.table_rows\n",
" results.append([dataset.display_name] + [''] * 10)\n",
" results.append([field.display_name for field in tableview.fields])\n",
" for row in tableview.rows:\n",
" row = [value[:100] if isinstance(value, str) else '' for value in row]\n",
" results.append(row)\n",
"len(results)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We got the same result but, woah, that was a lot faster! I know you're just reading this, but I'm executing the code as I go in Jupyter Notebook. The pre-SDK version took 80 seconds, whereas the SDK version took just 50 seconds. That's about 40% faster.\n",
"\n",
"Side note: You must be thinking even 50 seconds is dreadful, and it is. This is why the Power Search app lets you include secondary search terms, like \"New York\" if you're interested in a Michael Cohen with connections to New York. This greatly reduces the number of hits and speeds up the search quite dramatically. "
]
},
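{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a minimal sketch of that idea. It assumes the `query` parameter simply ANDs any extra words onto the row-level search, the same way it handles the primary phrase (the secondary term here is just an example):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Append a secondary term to narrow the initial dataset search.\n",
"secondary = 'new york'\n",
"narrowed = public.datasets.list(query=phrase + ' ' + secondary, row_limit=1000, match_metadata=False, include_serialids=False).all()\n",
"len(narrowed)"
]
},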
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding proximity matches\n",
"\n",
"Now that we have all the potential matches in memory as a list of lists, we're done with the SDK -- it's pure Python from here on. In case you're curious how a hacker does a proximity search, I'll continue.\n",
"\n",
"The code below uses the same nested `for` loop, but analyzes each row before appending it to the results. Since it's possible a dataset many not have any rows that match the new proximity matching criteria, we delay writing the metadata until we know there's at least one row.\n",
"\n",
"The matching algorithm requires that all cell values are strings. Since some values are read as Booleans, it converts these to strings. It also filters out `None` values, makes everything lower case, and removes any punctuation. The inner `for` loop traverses each row using a \"window\" of size `distance` (set here to 2, but you could set it to 3, for example, if you think there might be a separate column for a middle initial). If all of the search terms appear within the sliding window at any point during the traversal, the `match` flag is set to `True`and we add the row to the results."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import string\n",
"\n",
"table = str.maketrans('', '', string.punctuation)\n",
"\n",
"results = []\n",
"distance = 2\n",
"\n",
"for dataset in datasets:\n",
" tableview = dataset.current_snapshot.table_rows\n",
" first_match = True\n",
" for row in tableview.rows:\n",
" row = [str(value) if isinstance(value, bool) else value for value in row ]\n",
" match = False\n",
" for j in range (0, len(row) - distance + 1):\n",
" words = (' '.join(filter(None, row[j: j + distance]))).lower().translate(table)\n",
" if all(s in words.split() for s in phrase.lower().split()):\n",
" match = True\n",
" break\n",
" if match:\n",
" if first_match:\n",
" results.append([dataset.display_name] + [''] * 10)\n",
" results.append([field.display_name for field in tableview.fields])\n",
" row = [value[:100] if isinstance(value, str) else '' for value in row ]\n",
" results.append(row)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the results, I'll convert the `results` array into a pandas DataFrame so it displays nicely in Jupyter. The table is long, so I'll display just the first few rows, but you could easily use the pandas `to_csv()` function to save the results and view them in a spreadsheet."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" <th>9</th>\n",
" <th>...</th>\n",
" <th>245</th>\n",
" <th>246</th>\n",
" <th>247</th>\n",
" <th>248</th>\n",
" <th>249</th>\n",
" <th>250</th>\n",
" <th>251</th>\n",
" <th>252</th>\n",
" <th>253</th>\n",
" <th>254</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Medicare Provider Charges - Part D</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NPI</td>\n",
" <td>NPPES Provider Org Name</td>\n",
" <td>NPPES Provider First Name</td>\n",
" <td>NPPES Provider City</td>\n",
" <td>NPPES Provider State</td>\n",
" <td>Specialty Description</td>\n",
" <td>Description Flag</td>\n",
" <td>Drug Name</td>\n",
" <td>Generic Name</td>\n",
" <td>Bene Count</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1053463901</td>\n",
" <td>COHEN</td>\n",
" <td>MICHAEL</td>\n",
" <td>KENMORE</td>\n",
" <td>WA</td>\n",
" <td>Dentist</td>\n",
" <td>T</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>IBUPROFEN</td>\n",
" <td>11.0</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Medicare Provider Charges - Part D</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NPI</td>\n",
" <td>NPPES Provider Org Name</td>\n",
" <td>NPPES Provider First Name</td>\n",
" <td>NPPES Provider City</td>\n",
" <td>NPPES Provider State</td>\n",
" <td>Specialty Description</td>\n",
" <td>Description Flag</td>\n",
" <td>Drug Name</td>\n",
" <td>Generic Name</td>\n",
" <td>Bene Count</td>\n",
" <td>...</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 255 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 \\\n",
"0 Medicare Provider Charges - Part D \n",
"1 NPI NPPES Provider Org Name \n",
"2 1053463901 COHEN \n",
"3 Medicare Provider Charges - Part D \n",
"4 NPI NPPES Provider Org Name \n",
"\n",
" 2 3 4 \\\n",
"0 \n",
"1 NPPES Provider First Name NPPES Provider City NPPES Provider State \n",
"2 MICHAEL KENMORE WA \n",
"3 \n",
"4 NPPES Provider First Name NPPES Provider City NPPES Provider State \n",
"\n",
" 5 6 7 8 \\\n",
"0 \n",
"1 Specialty Description Description Flag Drug Name Generic Name \n",
"2 Dentist T IBUPROFEN IBUPROFEN \n",
"3 \n",
"4 Specialty Description Description Flag Drug Name Generic Name \n",
"\n",
" 9 ... 245 246 247 248 249 250 251 252 253 \\\n",
"0 ... None None None None None None None None None \n",
"1 Bene Count ... None None None None None None None None None \n",
"2 11.0 ... None None None None None None None None None \n",
"3 ... None None None None None None None None None \n",
"4 Bene Count ... None None None None None None None None None \n",
"\n",
" 254 \n",
"0 None \n",
"1 None \n",
"2 None \n",
"3 None \n",
"4 None \n",
"\n",
"[5 rows x 255 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(results).head()"
]
},
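{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd rather browse the whole table in a spreadsheet, one line does it (the file name is just an example):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Write the full results table to a CSV file for viewing in a spreadsheet.\n",
"pd.DataFrame(results).to_csv('proximity_results.csv', index=False, header=False)"
]
},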
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Should I use the SDK or the API?\n",
"\n",
"If you're writing a Python app to explore Enigma Public, I strongly recommend the SDK. It offers all the functionality that's available through the API, but makes things much simpler and your app will likely run faster. The SDK is only available for Python though, so if you want to write (say) a JavaScript app, unfortunately you're out of luck.\n",
"\n",
"If you'd like to to build an app using the Python SDK, here's plenty of information available on the [Enigma documentation site]. If you do build something, please let us know. We've love to showcase it on our [projects page](https://public.enigma.com/projects)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}