Some simple Python code to parse the CMS.gov site and extract links to all the dataset zips.
{
"metadata": {
"name": "",
"signature": "sha256:d37a96fd27d722ea7812ba053a6f77a5aa7838687a7eeecd6f2e08f881feab18"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Medicaid Dataset - Basic Data Analysis #\n",
"\n",
"CMS.gov has made available a [synthetic dataset](http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/SynPUFs/DE_Syn_PUF.html) of Medicare/Medicaid claims data. The objective is to allow data enterpreneurs to figure out interesting ways to use the data.\n",
"\n",
"As noted in the [Data Codebook (PDF)](http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/SynPUFs/Downloads/SynPUF_Codebook.pdf), the data has been heavily anonymized and imputed, so predictions arising out of this analysis are not reliable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Download ##\n",
"\n",
"The data is spread across 99 zip files on the site. The data is partitioned by function (Benefit Summary, Inpatient Claims, Outpatient Claims, Carrier Claims, and Prescription Drug Events) and each functional partition is further partitioned into CSV files of approximately 100,000 records each. The rationale is that people can choose to work with subsets of data if needed."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# extract URLs\n",
"import re\n",
"import urllib2\n",
"import os\n",
"\n",
"# extract URLs from main page\n",
"url_prefix = \"http://www.cms.gov\"\n",
"url = \"/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/SynPUFs/DE_Syn_PUF.html\"\n",
"url_pattern = re.compile(r'<a href=\"([^\\\"]+)\">')\n",
"html = urllib2.urlopen(url_prefix + url).read()\n",
"links = []\n",
"for url_suffix in url_pattern.findall(html):\n",
" if \"DESample\" in url_suffix:\n",
" links.append(url_suffix)\n",
" \n",
"# extract zip URLs from each sub-page\n",
"ziplinks = []\n",
"for link in links:\n",
" html = urllib2.urlopen(url_prefix + link).read()\n",
" for ziplink in url_pattern.findall(html):\n",
" if ziplink.endswith(\".zip\"): \n",
" if ziplink.startswith(\"http://\"):\n",
" ziplinks.append(ziplink)\n",
" else:\n",
" ziplinks.append(url_prefix + ziplink)\n",
"\n",
"# print \"\\n\".join(ziplinks)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Resulting list of URLs to zip files are fed into \"curl -O -\" to download all zip files to current directory. Each zip file unzips into a single CSV file."
]
}
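,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a Python alternative to the curl step, the cell below is a minimal sketch that downloads each zip and extracts its CSV, assuming the `ziplinks` list built above; the `data/` target directory name is an assumption for illustration, not part of the original workflow."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# sketch: download each zip archive and extract its single CSV file\n",
"# (the data/ directory is an assumed location, not from the original workflow)\n",
"import os\n",
"import urllib2\n",
"import zipfile\n",
"\n",
"download_dir = \"data\"\n",
"if not os.path.exists(download_dir):\n",
"    os.makedirs(download_dir)\n",
"\n",
"for ziplink in ziplinks:\n",
"    zipname = os.path.join(download_dir, ziplink.split(\"/\")[-1])\n",
"    # save the zip archive to disk\n",
"    with open(zipname, \"wb\") as fzip:\n",
"        fzip.write(urllib2.urlopen(ziplink).read())\n",
"    # each archive unzips into a single CSV file\n",
"    with zipfile.ZipFile(zipname) as zf:\n",
"        zf.extractall(download_dir)"
],
"language": "python",
"metadata": {},
"outputs": []
}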
],
"metadata": {}
}
]
}