Some simple Python code to parse the CMS.gov site and extract links to all the dataset zips.
{
"metadata": {
"name": "",
"signature": "sha256:d37a96fd27d722ea7812ba053a6f77a5aa7838687a7eeecd6f2e08f881feab18"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Medicaid Dataset - Basic Data Analysis #\n",
"\n",
"CMS.gov has made available a [synthetic dataset](http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/SynPUFs/DE_Syn_PUF.html) of Medicare/Medicaid claims data. The objective is to allow data enterpreneurs to figure out interesting ways to use the data.\n",
"\n",
"As noted in the [Data Codebook (PDF)](http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/SynPUFs/Downloads/SynPUF_Codebook.pdf), the data has been heavily anonymized and imputed, so predictions arising out of this analysis are not reliable."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Download ##\n",
"\n",
"The data is spread across 99 zip files on the site. The data is partitioned by function (Benefit Summary, Inpatient Claims, Outpatient Claims, Carrier Claims, and Prescription Drug Events) and each functional partition is further partitioned into CSV files of approximately 100,000 records each. The rationale is that people can choose to work with subsets of data if needed."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# extract URLs\n",
"import re\n",
"import urllib2\n",
"import os\n",
"\n",
"# extract URLs from main page\n",
"url_prefix = \"http://www.cms.gov\"\n",
"url = \"/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/SynPUFs/DE_Syn_PUF.html\"\n",
"url_pattern = re.compile(r'<a href=\"([^\\\"]+)\">')\n",
"html = urllib2.urlopen(url_prefix + url).read()\n",
"links = []\n",
"for url_suffix in url_pattern.findall(html):\n",
" if \"DESample\" in url_suffix:\n",
" links.append(url_suffix)\n",
" \n",
"# extract zip URLs from each sub-page\n",
"ziplinks = []\n",
"for link in links:\n",
" html = urllib2.urlopen(url_prefix + link).read()\n",
" for ziplink in url_pattern.findall(html):\n",
" if ziplink.endswith(\".zip\"): \n",
" if ziplink.startswith(\"http://\"):\n",
" ziplinks.append(ziplink)\n",
" else:\n",
" ziplinks.append(url_prefix + ziplink)\n",
"\n",
"# print \"\\n\".join(ziplinks)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Resulting list of URLs to zip files are fed into \"curl -O -\" to download all zip files to current directory. Each zip file unzips into a single CSV file."
]
}
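,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a Python alternative to the curl step, the cell below is a minimal sketch that downloads each zip and extracts its CSV, assuming the `ziplinks` list built above; the `data/` target directory name is an assumption for illustration, not part of the original workflow."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# sketch: download each zip archive and extract its single CSV file\n",
"# (the data/ directory is an assumed location, not from the original workflow)\n",
"import os\n",
"import urllib2\n",
"import zipfile\n",
"\n",
"download_dir = \"data\"\n",
"if not os.path.exists(download_dir):\n",
"    os.makedirs(download_dir)\n",
"\n",
"for ziplink in ziplinks:\n",
"    zipname = os.path.join(download_dir, ziplink.split(\"/\")[-1])\n",
"    # save the zip archive to disk\n",
"    with open(zipname, \"wb\") as fzip:\n",
"        fzip.write(urllib2.urlopen(ziplink).read())\n",
"    # each archive unzips into a single CSV file\n",
"    with zipfile.ZipFile(zipname) as zf:\n",
"        zf.extractall(download_dir)"
],
"language": "python",
"metadata": {},
"outputs": []
}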
],
"metadata": {}
}
]
}