@linanqiu
Created June 2, 2016 05:04
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives\n",
"\n",
"## Pairwise Covariance\n",
"\n",
"We want the covariance ($cov$) of diseases $x$ and $y$ for all diseases in the database. Furthermore, these covariances are split by sex $s$ and age ranges $l$ and $h$.\n",
"\n",
"Diseases $x$ and $y$ are specified using ICD codes. Sexes $s$ are either $female$ or $male$, and $l, h$ specify lower and upper bound ranges (ranging between 0 and 100). Hence a query is of the form $cov(x, y, s, l, h)$.\n",
"\n",
"For example, a query for the covariance of disease with ICD code 100 and ICD code 200 for females aged between 10 and 25 would be:\n",
"\n",
"$$cov(100, 200, female, 10, 25) = 0.25$$\n",
"\n",
"indicating a covariance of 0.25.\n",
"\n",
"## Covariance Matrices\n",
"\n",
"We also want covariance matrices for all possible diseases for each of the two sexes for defined age ranges. In other words, a set of matrices $\\mathbf{M}_{s, l, h}$ should be produced for the following age increments: 1, 5, 10. These should be cached for fast querying and exported in a machine readable format.\n",
"\n",
"## Design Outcome\n",
"\n",
"There should be two deliverables:\n",
"\n",
"- A REST interface allowing for queries for **pairwise covariance** and the predetermined **covariance matrices**\n",
"- Machine readable output of **covariance matrices**.\n",
"\n",
"\n",
"# Data\n",
"\n",
"We are given the following data sources (in the folder `/original_data/`):\n",
"\n",
"- `adm3.sas7bdat` (53.5 MB): ignored for the purposes of this project\n",
"- `cancer3.sas7bdat` (1.2 MB): ignored for the purposes of this project\n",
"- `opd3.sas7bdat` (5.14 GB): data source for this project\n",
"\n",
"## Format\n",
"\n",
"The files are in SAS data format (ugh), which can be read easily in Python using the `sas7bdat` package.\n",
"\n",
"## Headers\n",
"\n",
"Using the `sas7bdat` package, we can extract the header of `opd3.sas7bdat`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Header:\n",
"\tcol_count_p1: 24\n",
"\tcol_count_p2: 0\n",
"\tcolumn_count: 24\n",
"\tcompression: None\n",
"\tcreator: None\n",
"\tcreator_proc: None\n",
"\tdate_created: 2016-03-23 13:59:06.127000\n",
"\tdate_modified: 2016-03-23 14:43:56.712000\n",
"\tendianess: little\n",
"\tfile_type: DATA\n",
"\tfilename: opd3.sas7bdat\n",
"\theader_length: 65536\n",
"\tlcp: 8\n",
"\tlcs: 0\n",
"\tmix_page_row_count: 322\n",
"\tname: OPD3\n",
"\tos_name: \n",
"\tos_type: \n",
"\tpage_count: 78393\n",
"\tpage_length: 65536\n",
"\tplatform: windows\n",
"\trow_count: 26653278\n",
"\trow_length: 192\n",
"\tsas_release: 9.0401M1\n",
"\tserver_type: X64_7PRO\n",
"\tu64: False\n",
"\n",
"Contents of dataset \"OPD3\":\n",
"Num Name Type Length Format Label\n",
"--- ----------- ------ ------ ------ -----\n",
" 1 FEEYR number 8 BEST \n",
" 2 ATYPE number 8 BEST \n",
" 3 APP_DAY number 8 BEST \n",
" 4 CTYPE number 8 BEST \n",
" 5 SN number 8 BEST \n",
" 6 CITY number 8 BEST \n",
" 7 HTYPE number 8 BEST \n",
" 8 HID number 8 BEST \n",
" 9 DID string 10 $ \n",
" 10 VSTYPE number 8 BEST \n",
" 11 VSDAY number 8 BEST \n",
" 12 BIR number 8 BEST \n",
" 13 PNO string 3 $ \n",
" 14 ICD1 number 8 BEST \n",
" 15 ICD2 number 8 BEST \n",
" 16 ICD3 number 8 BEST \n",
" 17 OPC1 string 1 $ \n",
" 18 T_DOT number 8 BEST \n",
" 19 PDOT number 8 BEST \n",
" 20 TDOT number 8 BEST \n",
" 21 SEX number 8 BEST \n",
" 22 PID string 11 \n",
" 23 APP_DAY2 number 8 \n",
" 24 YR number 8 \n",
"\n"
]
}
],
"source": [
"from sas7bdat import SAS7BDAT as sas\n",
"\n",
"with sas('original_data/opd3.sas7bdat') as f:\n",
" print(f.header)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These variables are interpreted as follows:\n",
"\n",
"- APP_DAY: Apply Date\n",
"- APP_DAY2: Apply Date (in SAS)\n",
"- ATYPE: Apply Type\n",
"- BIR: Birthday YM\n",
"- CTYPE: Case Type\n",
"- CITY: The City Code of Hospital\n",
"- FEEYR: Fee Year/Month\n",
"- VSDAY: Visit Date\n",
"- VSTYPE: Visit Department\n",
"- HTYPE: Level of Hospital\n",
"- HID: Code of Hospital\n",
"- OPC1: Code of Operation\n",
"- ICD1: Code of Diagnosis 1\n",
"- ICD2: Code of Diagnosis 2\n",
"- ICD3: Code of Diagnosis 3\n",
"- SEX: Gender\n",
"- PID: ID\n",
"- PDOT: Copayment Dot\n",
"- PNO: Copayment Code\n",
"- DID: ID of Doctor\n",
"- SN: Serial Number\n",
"- T_DOT: Total Apply Dot\n",
"- TDOT: Total Dot\n",
"- YR: Year\n",
"\n",
"## Preview\n",
"\n",
"We can also preview the first few rows of the data."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>FEEYR</th>\n",
" <th>ATYPE</th>\n",
" <th>APP_DAY</th>\n",
" <th>CTYPE</th>\n",
" <th>SN</th>\n",
" <th>CITY</th>\n",
" <th>HTYPE</th>\n",
" <th>HID</th>\n",
" <th>DID</th>\n",
" <th>VSTYPE</th>\n",
" <th>VSDAY</th>\n",
" <th>BIR</th>\n",
" <th>PNO</th>\n",
" <th>ICD1</th>\n",
" <th>ICD2</th>\n",
" <th>ICD3</th>\n",
" <th>OPC1</th>\n",
" <th>T_DOT</th>\n",
" <th>PDOT</th>\n",
" <th>TDOT</th>\n",
" <th>SEX</th>\n",
" <th>PID</th>\n",
" <th>APP_DAY2</th>\n",
" <th>YR</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>200602.0</td>\n",
" <td>1.0</td>\n",
" <td>20060309.0</td>\n",
" <td>19.0</td>\n",
" <td>297.0</td>\n",
" <td>3201.0</td>\n",
" <td>37.0</td>\n",
" <td>161403.0</td>\n",
" <td>wxtqaajejb</td>\n",
" <td>40.0</td>\n",
" <td>20060211.0</td>\n",
" <td>198207.0</td>\n",
" <td>H10</td>\n",
" <td>5243.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>770.0</td>\n",
" <td>50.0</td>\n",
" <td>720.0</td>\n",
" <td>2.0</td>\n",
" <td>erlra</td>\n",
" <td>16869.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>200605.0</td>\n",
" <td>1.0</td>\n",
" <td>20060607.0</td>\n",
" <td>1.0</td>\n",
" <td>964.0</td>\n",
" <td>2101.0</td>\n",
" <td>35.0</td>\n",
" <td>55584.0</td>\n",
" <td>ccopskyrqu</td>\n",
" <td>11.0</td>\n",
" <td>20060510.0</td>\n",
" <td>198207.0</td>\n",
" <td>D10</td>\n",
" <td>7080.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>396.0</td>\n",
" <td>50.0</td>\n",
" <td>346.0</td>\n",
" <td>2.0</td>\n",
" <td>erlra</td>\n",
" <td>16959.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060807.0</td>\n",
" <td>1.0</td>\n",
" <td>215.0</td>\n",
" <td>203.0</td>\n",
" <td>25.0</td>\n",
" <td>338027.0</td>\n",
" <td>gxurzkwtcm</td>\n",
" <td>0.0</td>\n",
" <td>20060720.0</td>\n",
" <td>197108.0</td>\n",
" <td>D10</td>\n",
" <td>460.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>425.0</td>\n",
" <td>50.0</td>\n",
" <td>375.0</td>\n",
" <td>1.0</td>\n",
" <td>erlsa</td>\n",
" <td>17020.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>200603.0</td>\n",
" <td>1.0</td>\n",
" <td>20060403.0</td>\n",
" <td>1.0</td>\n",
" <td>1080.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>71180.0</td>\n",
" <td>vatnpkveaw</td>\n",
" <td>0.0</td>\n",
" <td>20060318.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>462.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>396.0</td>\n",
" <td>50.0</td>\n",
" <td>346.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>16894.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>200603.0</td>\n",
" <td>1.0</td>\n",
" <td>20060403.0</td>\n",
" <td>1.0</td>\n",
" <td>1202.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>71180.0</td>\n",
" <td>ivdinwvrlk</td>\n",
" <td>0.0</td>\n",
" <td>20060320.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4659.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>396.0</td>\n",
" <td>50.0</td>\n",
" <td>346.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>16894.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>537.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060730.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>465.0</td>\n",
" <td>70.0</td>\n",
" <td>395.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>303.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060718.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>38102.0</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>539.0</td>\n",
" <td>70.0</td>\n",
" <td>469.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>419.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060724.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>38102.0</td>\n",
" <td>4619.0</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>1261.0</td>\n",
" <td>70.0</td>\n",
" <td>1191.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>236.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060714.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>3829.0</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>538.0</td>\n",
" <td>70.0</td>\n",
" <td>468.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>505.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060727.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>531.0</td>\n",
" <td>70.0</td>\n",
" <td>461.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>361.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060721.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>38102.0</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>531.0</td>\n",
" <td>70.0</td>\n",
" <td>461.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>200608.0</td>\n",
" <td>1.0</td>\n",
" <td>20060912.0</td>\n",
" <td>9.0</td>\n",
" <td>58.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060803.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>532.0</td>\n",
" <td>70.0</td>\n",
" <td>462.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17056.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>200609.0</td>\n",
" <td>1.0</td>\n",
" <td>20061009.0</td>\n",
" <td>9.0</td>\n",
" <td>3130.0</td>\n",
" <td>1704.0</td>\n",
" <td>35.0</td>\n",
" <td>212055.0</td>\n",
" <td>isnpfsiwjr</td>\n",
" <td>11.0</td>\n",
" <td>20060913.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>6982.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>250.0</td>\n",
" <td>50.0</td>\n",
" <td>200.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17083.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>200609.0</td>\n",
" <td>1.0</td>\n",
" <td>20061005.0</td>\n",
" <td>29.0</td>\n",
" <td>526.0</td>\n",
" <td>1702.0</td>\n",
" <td>38.0</td>\n",
" <td>94511.0</td>\n",
" <td>aphbnraejv</td>\n",
" <td>60.0</td>\n",
" <td>20060927.0</td>\n",
" <td>199107.0</td>\n",
" <td>S10</td>\n",
" <td>9150.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>480.0</td>\n",
" <td>50.0</td>\n",
" <td>430.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17079.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>200601.0</td>\n",
" <td>1.0</td>\n",
" <td>20060218.0</td>\n",
" <td>1.0</td>\n",
" <td>562.0</td>\n",
" <td>3117.0</td>\n",
" <td>35.0</td>\n",
" <td>316404.0</td>\n",
" <td>qokprusepp</td>\n",
" <td>0.0</td>\n",
" <td>20060115.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>4659.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>325.0</td>\n",
" <td>50.0</td>\n",
" <td>275.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>16850.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>200601.0</td>\n",
" <td>1.0</td>\n",
" <td>20060220.0</td>\n",
" <td>9.0</td>\n",
" <td>2065.0</td>\n",
" <td>3101.0</td>\n",
" <td>35.0</td>\n",
" <td>27213.0</td>\n",
" <td>bkmfenpwkw</td>\n",
" <td>9.0</td>\n",
" <td>20060109.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>46619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>325.0</td>\n",
" <td>50.0</td>\n",
" <td>275.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>16852.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>200601.0</td>\n",
" <td>1.0</td>\n",
" <td>20060220.0</td>\n",
" <td>9.0</td>\n",
" <td>1150.0</td>\n",
" <td>3101.0</td>\n",
" <td>35.0</td>\n",
" <td>27213.0</td>\n",
" <td>bkmfenpwkw</td>\n",
" <td>9.0</td>\n",
" <td>20060105.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>463.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>325.0</td>\n",
" <td>50.0</td>\n",
" <td>275.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>16852.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>200603.0</td>\n",
" <td>1.0</td>\n",
" <td>20060420.0</td>\n",
" <td>9.0</td>\n",
" <td>5244.0</td>\n",
" <td>3101.0</td>\n",
" <td>35.0</td>\n",
" <td>27213.0</td>\n",
" <td>jffpfriwxy</td>\n",
" <td>4.0</td>\n",
" <td>20060322.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>46619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>185.0</td>\n",
" <td>50.0</td>\n",
" <td>135.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>16911.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>200611.0</td>\n",
" <td>1.0</td>\n",
" <td>20061217.0</td>\n",
" <td>9.0</td>\n",
" <td>3614.0</td>\n",
" <td>3101.0</td>\n",
" <td>35.0</td>\n",
" <td>27213.0</td>\n",
" <td>hjoyqiywsh</td>\n",
" <td>4.0</td>\n",
" <td>20061118.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>90.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>220.0</td>\n",
" <td>50.0</td>\n",
" <td>170.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>17152.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>200601.0</td>\n",
" <td>1.0</td>\n",
" <td>20060214.0</td>\n",
" <td>21.0</td>\n",
" <td>1359.0</td>\n",
" <td>3202.0</td>\n",
" <td>18.0</td>\n",
" <td>11078.0</td>\n",
" <td>dmaumlqtom</td>\n",
" <td>60.0</td>\n",
" <td>20060125.0</td>\n",
" <td>192808.0</td>\n",
" <td>R20</td>\n",
" <td>5640.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>510.0</td>\n",
" <td>90.0</td>\n",
" <td>420.0</td>\n",
" <td>2.0</td>\n",
" <td>qhtmb</td>\n",
" <td>16846.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" FEEYR ATYPE APP_DAY CTYPE SN CITY HTYPE HID \\\n",
"0 200602.0 1.0 20060309.0 19.0 297.0 3201.0 37.0 161403.0 \n",
"1 200605.0 1.0 20060607.0 1.0 964.0 2101.0 35.0 55584.0 \n",
"2 200607.0 1.0 20060807.0 1.0 215.0 203.0 25.0 338027.0 \n",
"3 200603.0 1.0 20060403.0 1.0 1080.0 1705.0 35.0 71180.0 \n",
"4 200603.0 1.0 20060403.0 1.0 1202.0 1705.0 35.0 71180.0 \n",
"5 200607.0 1.0 20060814.0 9.0 537.0 1705.0 35.0 38728.0 \n",
"6 200607.0 1.0 20060814.0 9.0 303.0 1705.0 35.0 38728.0 \n",
"7 200607.0 1.0 20060814.0 9.0 419.0 1705.0 35.0 38728.0 \n",
"8 200607.0 1.0 20060814.0 9.0 236.0 1705.0 35.0 38728.0 \n",
"9 200607.0 1.0 20060814.0 9.0 505.0 1705.0 35.0 38728.0 \n",
"10 200607.0 1.0 20060814.0 9.0 361.0 1705.0 35.0 38728.0 \n",
"11 200608.0 1.0 20060912.0 9.0 58.0 1705.0 35.0 38728.0 \n",
"12 200609.0 1.0 20061009.0 9.0 3130.0 1704.0 35.0 212055.0 \n",
"13 200609.0 1.0 20061005.0 29.0 526.0 1702.0 38.0 94511.0 \n",
"14 200601.0 1.0 20060218.0 1.0 562.0 3117.0 35.0 316404.0 \n",
"15 200601.0 1.0 20060220.0 9.0 2065.0 3101.0 35.0 27213.0 \n",
"16 200601.0 1.0 20060220.0 9.0 1150.0 3101.0 35.0 27213.0 \n",
"17 200603.0 1.0 20060420.0 9.0 5244.0 3101.0 35.0 27213.0 \n",
"18 200611.0 1.0 20061217.0 9.0 3614.0 3101.0 35.0 27213.0 \n",
"19 200601.0 1.0 20060214.0 21.0 1359.0 3202.0 18.0 11078.0 \n",
"\n",
" DID VSTYPE VSDAY BIR PNO ICD1 ICD2 ICD3 \\\n",
"0 wxtqaajejb 40.0 20060211.0 198207.0 H10 5243.0 NaN None \n",
"1 ccopskyrqu 11.0 20060510.0 198207.0 D10 7080.0 NaN None \n",
"2 gxurzkwtcm 0.0 20060720.0 197108.0 D10 460.0 NaN None \n",
"3 vatnpkveaw 0.0 20060318.0 199107.0 D10 462.0 NaN None \n",
"4 ivdinwvrlk 0.0 20060320.0 199107.0 D10 4659.0 NaN None \n",
"5 ahnakixwyl 9.0 20060730.0 199107.0 D10 4619.0 NaN None \n",
"6 ahnakixwyl 9.0 20060718.0 199107.0 D10 4619.0 38102.0 None \n",
"7 ahnakixwyl 9.0 20060724.0 199107.0 D10 38102.0 4619.0 None \n",
"8 ahnakixwyl 9.0 20060714.0 199107.0 D10 4619.0 3829.0 None \n",
"9 ahnakixwyl 9.0 20060727.0 199107.0 D10 4619.0 NaN None \n",
"10 ahnakixwyl 9.0 20060721.0 199107.0 D10 4619.0 38102.0 None \n",
"11 ahnakixwyl 9.0 20060803.0 199107.0 D10 4619.0 NaN None \n",
"12 isnpfsiwjr 11.0 20060913.0 199107.0 D10 6982.0 NaN None \n",
"13 aphbnraejv 60.0 20060927.0 199107.0 S10 9150.0 NaN None \n",
"14 qokprusepp 0.0 20060115.0 197901.0 D10 4659.0 NaN None \n",
"15 bkmfenpwkw 9.0 20060109.0 197901.0 D10 46619.0 NaN None \n",
"16 bkmfenpwkw 9.0 20060105.0 197901.0 D10 463.0 NaN None \n",
"17 jffpfriwxy 4.0 20060322.0 197901.0 D10 46619.0 NaN None \n",
"18 hjoyqiywsh 4.0 20061118.0 197901.0 D10 90.0 NaN None \n",
"19 dmaumlqtom 60.0 20060125.0 192808.0 R20 5640.0 NaN None \n",
"\n",
" OPC1 T_DOT PDOT TDOT SEX PID APP_DAY2 YR \n",
"0 770.0 50.0 720.0 2.0 erlra 16869.0 2006.0 \n",
"1 396.0 50.0 346.0 2.0 erlra 16959.0 2006.0 \n",
"2 425.0 50.0 375.0 1.0 erlsa 17020.0 2006.0 \n",
"3 396.0 50.0 346.0 2.0 eihya 16894.0 2006.0 \n",
"4 396.0 50.0 346.0 2.0 eihya 16894.0 2006.0 \n",
"5 465.0 70.0 395.0 2.0 eihya 17027.0 2006.0 \n",
"6 539.0 70.0 469.0 2.0 eihya 17027.0 2006.0 \n",
"7 1261.0 70.0 1191.0 2.0 eihya 17027.0 2006.0 \n",
"8 538.0 70.0 468.0 2.0 eihya 17027.0 2006.0 \n",
"9 531.0 70.0 461.0 2.0 eihya 17027.0 2006.0 \n",
"10 531.0 70.0 461.0 2.0 eihya 17027.0 2006.0 \n",
"11 532.0 70.0 462.0 2.0 eihya 17056.0 2006.0 \n",
"12 250.0 50.0 200.0 2.0 eihya 17083.0 2006.0 \n",
"13 480.0 50.0 430.0 2.0 eihya 17079.0 2006.0 \n",
"14 325.0 50.0 275.0 2.0 qzwib 16850.0 2006.0 \n",
"15 325.0 50.0 275.0 2.0 qzwib 16852.0 2006.0 \n",
"16 325.0 50.0 275.0 2.0 qzwib 16852.0 2006.0 \n",
"17 185.0 50.0 135.0 2.0 qzwib 16911.0 2006.0 \n",
"18 220.0 50.0 170.0 2.0 qzwib 17152.0 2006.0 \n",
"19 510.0 90.0 420.0 2.0 qhtmb 16846.0 2006.0 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import itertools\n",
"import pandas\n",
"\n",
"pandas.set_option('display.max_columns', 999)\n",
"\n",
"with sas('original_data/opd3.sas7bdat') as f:\n",
" gen = f.readlines()\n",
" headers = next(gen)\n",
" rows = list(itertools.islice(gen, 0, 20))\n",
" dataframe = pandas.DataFrame(rows, columns=headers)\n",
"dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Covariance Calculation\n",
"\n",
"## Binary Distribution Sample Covariance\n",
"\n",
"Courtesy of Eric S. Wu, we can calculate covariance using the binary distribution sample covariance formula:\n",
"\n",
"$$cov(a, b) = \\frac{nk_{a,b} - k_a k_b}{n(n-1)}$$\n",
"\n",
"where $k_{a,b}$ is the number of cases where $a$ and $b$ co-occurred, $k_{a}$ is the number of cases where $a$ occurred and $k_{b}$ is the number of cases where $b$ occurred. $n$ is the total number of cases.\n",
"\n",
"In our case, since covariance has three more parameters $s$, $l$, and $h$, we produce the formula:\n",
"\n",
"$$cov(x, y, s, l, h) = \\frac{n_{s, l, h}k_{x, y, s, l, h} - k_{x, s, l, h} k_{y, s, l, h}}{n_{s, l, h}(n_{s, l, h}-1)}$$\n",
"\n",
"This simply further parameterizes the variables across the 3 dimensions $s$, $l$, and $h$.\n",
"\n",
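"The formula can be checked with a few lines of Python. A minimal sketch (the function name `binary_cov` is ours, not from the source):\n",
"\n",
"```python\n",
"def binary_cov(k_x, k_y, k_xy, n):\n",
"    # Sample covariance of two binary indicator variables;\n",
"    # note the n*(n-1) denominator (sample normalization).\n",
"    return (n * k_xy - k_x * k_y) / (n * (n - 1))\n",
"```\n",
"\n",
"For instance, `binary_cov(2, 2, 2, 4)` returns $1/3$: the two diseases occur in the same two of four cases, so they covary positively.\n",
"\n",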
"# System Design\n",
"\n",
"## Dimensionality\n",
"\n",
"There are roughly $\\frac{100 \\times 100}{2} = 5{,}000$ combinations of $l, h$ (since $l \\in [0, 100]$ and $h \\in [l, 100]$; the exact count is $\\binom{102}{2} = 5{,}151$). There are two sexes ($female$ or $male$). Hence, there are about 10,000 different $s, l, h$ combinations. There are 9,787 distinct ICD codes in the database. This means the dimensionality of the $(x, s, l, h)$ tuple is around $10{,}000 \\times 10{,}000 = 10^8$, and that of the $(x, y, s, l, h)$ tuple is around $10{,}000 \\times 10{,}000 \\times 10{,}000 = 10^{12}$. It is therefore intractable to precompute and store every $k_{x, y, s, l, h}$ and $k_{x, s, l, h}$ in a database.\n",
"\n",
"## Design Solution\n",
"\n",
"However, a database with appropriate indices on ICD code, sex, and age can easily support pairwise covariance queries. To produce the covariance matrices, a separate in-memory copy of the raw data can be used, and the pairwise covariance queries made against it. Since these queries are independent, the problem is trivially parallelizable; a CUDA implementation can be considered given time.\n",
"\n",
"## Database Schema\n",
"\n",
"This means that the only relevant data we need to retain in our database is:\n",
"\n",
"- Patient ID. This is given in the original database as `PID`\n",
"- ICD Code. However, ICD Code is currently given as `ICD1`, `ICD2`, and `ICD3`. This represents the case of a single individual being given multiple valid diagnoses in a single visit. This complicates our database unnecessarily, and we coerce cases with multiple diagnoses into multiple cases (rows). This results in each row having only `ICD`.\n",
"- Sex, which is given in the original data as `SEX` (duh)\n",
"- Age, which needs to be calculated for each row of data. It can be derived from the visit date of the patient (`VSDAY`, given as `YYYYMMDD`) and the birth date of the patient (`BIR`, given as `YYYYMM`): we simply take `AGE = VSDAY // 10000 - BIR // 100` using integer division, which yields `AGE` in whole years.\n",
"\n",
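"The age computation above can be sketched directly (the helper name `age_from` is hypothetical):\n",
"\n",
"```python\n",
"def age_from(vsday, bir):\n",
"    # VSDAY is YYYYMMDD and BIR is YYYYMM; integer division\n",
"    # keeps only the year components, giving age in whole years.\n",
"    return int(vsday) // 10000 - int(bir) // 100\n",
"```\n",
"\n",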
"Then, we create a single B-tree on the tuple `(ICD, SEX, AGE)`. We now have the schema:\n",
"\n",
"```SQL\n",
"CREATE SEQUENCE patient_id_seq;\n",
"\n",
"CREATE TABLE patients (\n",
" ID INTEGER DEFAULT nextval('patient_id_seq'),\n",
" PID VARCHAR(11) NOT NULL,\n",
" ICD INTEGER NOT NULL,\n",
" SEX INTEGER NOT NULL,\n",
" AGE INTEGER NOT NULL,\n",
" PRIMARY KEY (ID)\n",
")\n",
"\n",
"CREATE INDEX ICD_SEX_AGE ON patients (ICD, SEX, AGE)\n",
"```\n",
"\n",
"This allows us to quickly query the number of patients with a specific disease, sex, and age range using\n",
"\n",
"```SQL\n",
"WITH pid_x AS (\n",
"  SELECT DISTINCT PID FROM patients WHERE ICD=100 AND SEX=1 AND AGE>=10 AND AGE<20\n",
"), pid_y AS (\n",
"  SELECT DISTINCT PID FROM patients WHERE ICD=200 AND SEX=1 AND AGE>=10 AND AGE<20\n",
"), pid_xy AS (\n",
"  SELECT PID FROM pid_x INTERSECT SELECT PID FROM pid_y\n",
")\n",
"SELECT (SELECT COUNT(*) FROM pid_x) AS kx,\n",
"       (SELECT COUNT(*) FROM pid_y) AS ky,\n",
"       (SELECT COUNT(*) FROM pid_xy) AS kxy\n",
"```\n",
"\n",
"Furthermore, we cache the $n_{s, l, h}$ values:\n",
"\n",
"```SQL\n",
"CREATE TABLE patient_counts(\n",
" AGE INTEGER NOT NULL,\n",
" SEX INTEGER NOT NULL,\n",
" N INTEGER NOT NULL,\n",
" PRIMARY KEY (AGE, SEX)\n",
")\n",
"```\n",
"\n",
"and these values can be extracted quickly via:\n",
"\n",
"```SQL\n",
"SELECT SUM(N)\n",
"FROM patient_counts\n",
"WHERE SEX=sex AND AGE>=l AND AGE<h;\n",
"```\n",
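"\n",
"The same lookup can be sketched in memory, assuming (hypothetically) the cached counts live in a dict keyed by `(sex, age)`:\n",
"\n",
"```python\n",
"def n_slh(counts, sex, l, h):\n",
"    # counts: {(sex, age): patient count}; mirrors\n",
"    # SELECT SUM(N) ... WHERE SEX=sex AND AGE>=l AND AGE<h\n",
"    return sum(n for (s, age), n in counts.items()\n",
"               if s == sex and l <= age < h)\n",
"```\n",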
"\n",
"# Database Setup\n",
"\n",
"## Conversion to CSV\n",
"\n",
"The data is given in `.sas7bdat`, which is painful to work with. Converting it all to `.csv` using the `sas7bdat_to_csv` script provided with the `sas7bdat` Python package turned the 5 GB of SAS data into 4 GB of CSV.\n",
"\n",
"## Create CSVs to Populate DB\n",
"\n",
"This section is done in `create_db_data.py` in the root directory.\n",
"\n",
"We want to iterate through each row of data, doing the following:\n",
"\n",
"1. Retain only the columns `[PID, ICD1, ICD2, ICD3, SEX, VSDAY, BIR]`\n",
"2. Calculate `AGE = VSDAY // 10000 - BIR // 100` (integer division), since `VSDAY` is given as `YYYYMMDD` and `BIR` as `YYYYMM`. This produces `AGE` in whole years.\n",
"3. If `ICD2` or `ICD3` is populated, discard them and create additional rows using the `ICD2` and/or `ICD3` values as `ICD1` values for the new rows.\n",
"4. Rename `ICD1` to `ICD`.\n",
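"\n",
"A minimal sketch of steps 1 through 4 (a simplification, not the actual `parse/parse_raw.py`; missing diagnosis codes are assumed to arrive as `None`):\n",
"\n",
"```python\n",
"def expand_row(row):\n",
"    # row: dict with keys PID, ICD1, ICD2, ICD3, SEX, VSDAY, BIR\n",
"    age = int(row['VSDAY']) // 10000 - int(row['BIR']) // 100\n",
"    for key in ('ICD1', 'ICD2', 'ICD3'):\n",
"        if row.get(key) is not None:  # one output row per populated code\n",
"            yield {'PID': row['PID'], 'ICD': int(row[key]),\n",
"                   'SEX': int(row['SEX']), 'AGE': age}\n",
"```\n",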
"\n",
"This logic is in `parse/parse_raw.py`.\n",
"\n",
"This produces `csv_data/patients.csv` (682.5 MB).\n",
"\n",
"We also want to calculate patient counts to populate the `patient_counts` table. To do this, we count the number of distinct patients for each sex and age.\n",
"\n",
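"A minimal sketch of that count, assuming records are already reduced to `(PID, SEX, AGE)` tuples and that distinct patients are counted per cell (the helper name is ours):\n",
"\n",
"```python\n",
"from collections import defaultdict\n",
"\n",
"def patient_counts(records):\n",
"    # records: iterable of (PID, SEX, AGE); count distinct\n",
"    # patients in each (SEX, AGE) cell\n",
"    cells = defaultdict(set)\n",
"    for pid, sex, age in records:\n",
"        cells[(sex, age)].add(pid)\n",
"    return {cell: len(pids) for cell, pids in cells.items()}\n",
"```\n",
"\n",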
"This produces `csv_data/patient_counts.csv`.\n",
"\n",
"## Creating a DB Instance\n",
"\n",
"I launched an Amazon RDS instance. Read-only credentials to the instance are:\n",
"\n",
"- `host`: `pwas-postgres.c5fx9eaq2kdo.us-east-1.rds.amazonaws.com`\n",
"- `port`: 5432\n",
"- `login`: `guest`\n",
"- `password`: `guest`\n",
"\n",
"For admin privileges, contact the author.\n",
"\n",
"Tables and sequences created are in `create_table.sql`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}