@linanqiu
Created June 2, 2016 05:04
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives\n",
"\n",
"## Pairwise Covariance\n",
"\n",
"We want the covariance ($cov$) of diseases $x$ and $y$ for all diseases in the database. Furthermore, these covariances are split by sex $s$ and age ranges $l$ and $h$.\n",
"\n",
"Diseases $x$ and $y$ are specified using ICD codes. Sexes $s$ are either $female$ or $male$, and $l, h$ specify lower and upper bound ranges (ranging between 0 and 100). Hence a query is of the form $cov(x, y, s, l, h)$.\n",
"\n",
"For example, a query for the covariance of disease with ICD code 100 and ICD code 200 for females aged between 10 and 25 would be:\n",
"\n",
"$$cov(100, 200, female, 10, 25) = 0.25$$\n",
"\n",
"indicating a covariance of 0.25.\n",
"\n",
"## Covariance Matrices\n",
"\n",
"We also want covariance matrices for all possible diseases for each of the two sexes for defined age ranges. In other words, a set of matrices $\\mathbf{M}_{s, l, h}$ should be produced for the following age increments: 1, 5, 10. These should be cached for fast querying and exported in a machine readable format.\n",
"\n",
"## Design Outcome\n",
"\n",
"There should be two deliverables:\n",
"\n",
"- A REST interface allowing for queries for **pairwise covariance** and the predetermined **covariance matrices**\n",
"- Machine readable output of **covariance matrices**.\n",
"\n",
"\n",
"# Data\n",
"\n",
"We are given the following data sources (in the folder `/original_data/`):\n",
"\n",
"- `adm3.sas7bdat` (53.5 MB): ignored for the purposes of this project\n",
"- `cancer3.sas7bdat` (1.2 MB): ignored for the purposes of this project\n",
"- `opd3.sas7bdat` (5.14 GB): data source for this project\n",
"\n",
"## Format\n",
"\n",
"The files are in SAS data format (ugh), which can be read easily in Python using the `sas7bdat` package.\n",
"\n",
"## Headers\n",
"\n",
"Using the `sas7bdat` package, we can extract the header of `opd3.sas7bdat`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Header:\n",
"\tcol_count_p1: 24\n",
"\tcol_count_p2: 0\n",
"\tcolumn_count: 24\n",
"\tcompression: None\n",
"\tcreator: None\n",
"\tcreator_proc: None\n",
"\tdate_created: 2016-03-23 13:59:06.127000\n",
"\tdate_modified: 2016-03-23 14:43:56.712000\n",
"\tendianess: little\n",
"\tfile_type: DATA\n",
"\tfilename: opd3.sas7bdat\n",
"\theader_length: 65536\n",
"\tlcp: 8\n",
"\tlcs: 0\n",
"\tmix_page_row_count: 322\n",
"\tname: OPD3\n",
"\tos_name: \n",
"\tos_type: \n",
"\tpage_count: 78393\n",
"\tpage_length: 65536\n",
"\tplatform: windows\n",
"\trow_count: 26653278\n",
"\trow_length: 192\n",
"\tsas_release: 9.0401M1\n",
"\tserver_type: X64_7PRO\n",
"\tu64: False\n",
"\n",
"Contents of dataset \"OPD3\":\n",
"Num Name Type Length Format Label\n",
"--- ----------- ------ ------ ------ -----\n",
" 1 FEEYR number 8 BEST \n",
" 2 ATYPE number 8 BEST \n",
" 3 APP_DAY number 8 BEST \n",
" 4 CTYPE number 8 BEST \n",
" 5 SN number 8 BEST \n",
" 6 CITY number 8 BEST \n",
" 7 HTYPE number 8 BEST \n",
" 8 HID number 8 BEST \n",
" 9 DID string 10 $ \n",
" 10 VSTYPE number 8 BEST \n",
" 11 VSDAY number 8 BEST \n",
" 12 BIR number 8 BEST \n",
" 13 PNO string 3 $ \n",
" 14 ICD1 number 8 BEST \n",
" 15 ICD2 number 8 BEST \n",
" 16 ICD3 number 8 BEST \n",
" 17 OPC1 string 1 $ \n",
" 18 T_DOT number 8 BEST \n",
" 19 PDOT number 8 BEST \n",
" 20 TDOT number 8 BEST \n",
" 21 SEX number 8 BEST \n",
" 22 PID string 11 \n",
" 23 APP_DAY2 number 8 \n",
" 24 YR number 8 \n",
"\n"
]
}
],
"source": [
"from sas7bdat import SAS7BDAT as sas\n",
"\n",
"with sas('original_data/opd3.sas7bdat') as f:\n",
" print(f.header)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These variables are interpreted as follows:\n",
"\n",
"- APP_DAY: Apply Date\n",
"- APP_DAY2: Apply Date (in SAS)\n",
"- ATYPE: Apply Type\n",
"- BIR: Birthday YM\n",
"- CTYPE: Case Type\n",
"- CITY: The City Code of Hospital\n",
"- FEEYR: Fee Year/Month\n",
"- VSDAY: Visit Date\n",
"- VSTYPE: Visit Department\n",
"- HTYPE: Level of Hospital\n",
"- HID: Code of Hospital\n",
"- OPC1: Code of Operation\n",
"- ICD1: Code of Diagnosis 1\n",
"- ICD2: Code of Diagnosis 2\n",
"- ICD3: Code of Diagnosis 3\n",
"- SEX: Gender\n",
"- PID: ID\n",
"- PDOT: Copayment Dot\n",
"- PNO: Copayment Code\n",
"- DID: ID of Doctor\n",
"- SN: Serial Number\n",
"- T_DOT: Total Apply Dot\n",
"- TDOT: Total Dot\n",
"- YR: Year\n",
"\n",
"## Preview\n",
"\n",
"We can also preview the first few rows of the data."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>FEEYR</th>\n",
" <th>ATYPE</th>\n",
" <th>APP_DAY</th>\n",
" <th>CTYPE</th>\n",
" <th>SN</th>\n",
" <th>CITY</th>\n",
" <th>HTYPE</th>\n",
" <th>HID</th>\n",
" <th>DID</th>\n",
" <th>VSTYPE</th>\n",
" <th>VSDAY</th>\n",
" <th>BIR</th>\n",
" <th>PNO</th>\n",
" <th>ICD1</th>\n",
" <th>ICD2</th>\n",
" <th>ICD3</th>\n",
" <th>OPC1</th>\n",
" <th>T_DOT</th>\n",
" <th>PDOT</th>\n",
" <th>TDOT</th>\n",
" <th>SEX</th>\n",
" <th>PID</th>\n",
" <th>APP_DAY2</th>\n",
" <th>YR</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>200602.0</td>\n",
" <td>1.0</td>\n",
" <td>20060309.0</td>\n",
" <td>19.0</td>\n",
" <td>297.0</td>\n",
" <td>3201.0</td>\n",
" <td>37.0</td>\n",
" <td>161403.0</td>\n",
" <td>wxtqaajejb</td>\n",
" <td>40.0</td>\n",
" <td>20060211.0</td>\n",
" <td>198207.0</td>\n",
" <td>H10</td>\n",
" <td>5243.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>770.0</td>\n",
" <td>50.0</td>\n",
" <td>720.0</td>\n",
" <td>2.0</td>\n",
" <td>erlra</td>\n",
" <td>16869.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>200605.0</td>\n",
" <td>1.0</td>\n",
" <td>20060607.0</td>\n",
" <td>1.0</td>\n",
" <td>964.0</td>\n",
" <td>2101.0</td>\n",
" <td>35.0</td>\n",
" <td>55584.0</td>\n",
" <td>ccopskyrqu</td>\n",
" <td>11.0</td>\n",
" <td>20060510.0</td>\n",
" <td>198207.0</td>\n",
" <td>D10</td>\n",
" <td>7080.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>396.0</td>\n",
" <td>50.0</td>\n",
" <td>346.0</td>\n",
" <td>2.0</td>\n",
" <td>erlra</td>\n",
" <td>16959.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060807.0</td>\n",
" <td>1.0</td>\n",
" <td>215.0</td>\n",
" <td>203.0</td>\n",
" <td>25.0</td>\n",
" <td>338027.0</td>\n",
" <td>gxurzkwtcm</td>\n",
" <td>0.0</td>\n",
" <td>20060720.0</td>\n",
" <td>197108.0</td>\n",
" <td>D10</td>\n",
" <td>460.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>425.0</td>\n",
" <td>50.0</td>\n",
" <td>375.0</td>\n",
" <td>1.0</td>\n",
" <td>erlsa</td>\n",
" <td>17020.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>200603.0</td>\n",
" <td>1.0</td>\n",
" <td>20060403.0</td>\n",
" <td>1.0</td>\n",
" <td>1080.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>71180.0</td>\n",
" <td>vatnpkveaw</td>\n",
" <td>0.0</td>\n",
" <td>20060318.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>462.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>396.0</td>\n",
" <td>50.0</td>\n",
" <td>346.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>16894.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>200603.0</td>\n",
" <td>1.0</td>\n",
" <td>20060403.0</td>\n",
" <td>1.0</td>\n",
" <td>1202.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>71180.0</td>\n",
" <td>ivdinwvrlk</td>\n",
" <td>0.0</td>\n",
" <td>20060320.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4659.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>396.0</td>\n",
" <td>50.0</td>\n",
" <td>346.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>16894.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>537.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060730.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>465.0</td>\n",
" <td>70.0</td>\n",
" <td>395.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>303.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060718.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>38102.0</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>539.0</td>\n",
" <td>70.0</td>\n",
" <td>469.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>419.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060724.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>38102.0</td>\n",
" <td>4619.0</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>1261.0</td>\n",
" <td>70.0</td>\n",
" <td>1191.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>236.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060714.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>3829.0</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>538.0</td>\n",
" <td>70.0</td>\n",
" <td>468.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>505.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060727.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>531.0</td>\n",
" <td>70.0</td>\n",
" <td>461.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>200607.0</td>\n",
" <td>1.0</td>\n",
" <td>20060814.0</td>\n",
" <td>9.0</td>\n",
" <td>361.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060721.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>38102.0</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>531.0</td>\n",
" <td>70.0</td>\n",
" <td>461.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17027.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>200608.0</td>\n",
" <td>1.0</td>\n",
" <td>20060912.0</td>\n",
" <td>9.0</td>\n",
" <td>58.0</td>\n",
" <td>1705.0</td>\n",
" <td>35.0</td>\n",
" <td>38728.0</td>\n",
" <td>ahnakixwyl</td>\n",
" <td>9.0</td>\n",
" <td>20060803.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>4619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>532.0</td>\n",
" <td>70.0</td>\n",
" <td>462.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17056.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>200609.0</td>\n",
" <td>1.0</td>\n",
" <td>20061009.0</td>\n",
" <td>9.0</td>\n",
" <td>3130.0</td>\n",
" <td>1704.0</td>\n",
" <td>35.0</td>\n",
" <td>212055.0</td>\n",
" <td>isnpfsiwjr</td>\n",
" <td>11.0</td>\n",
" <td>20060913.0</td>\n",
" <td>199107.0</td>\n",
" <td>D10</td>\n",
" <td>6982.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>250.0</td>\n",
" <td>50.0</td>\n",
" <td>200.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17083.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>200609.0</td>\n",
" <td>1.0</td>\n",
" <td>20061005.0</td>\n",
" <td>29.0</td>\n",
" <td>526.0</td>\n",
" <td>1702.0</td>\n",
" <td>38.0</td>\n",
" <td>94511.0</td>\n",
" <td>aphbnraejv</td>\n",
" <td>60.0</td>\n",
" <td>20060927.0</td>\n",
" <td>199107.0</td>\n",
" <td>S10</td>\n",
" <td>9150.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>480.0</td>\n",
" <td>50.0</td>\n",
" <td>430.0</td>\n",
" <td>2.0</td>\n",
" <td>eihya</td>\n",
" <td>17079.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>200601.0</td>\n",
" <td>1.0</td>\n",
" <td>20060218.0</td>\n",
" <td>1.0</td>\n",
" <td>562.0</td>\n",
" <td>3117.0</td>\n",
" <td>35.0</td>\n",
" <td>316404.0</td>\n",
" <td>qokprusepp</td>\n",
" <td>0.0</td>\n",
" <td>20060115.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>4659.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>325.0</td>\n",
" <td>50.0</td>\n",
" <td>275.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>16850.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>200601.0</td>\n",
" <td>1.0</td>\n",
" <td>20060220.0</td>\n",
" <td>9.0</td>\n",
" <td>2065.0</td>\n",
" <td>3101.0</td>\n",
" <td>35.0</td>\n",
" <td>27213.0</td>\n",
" <td>bkmfenpwkw</td>\n",
" <td>9.0</td>\n",
" <td>20060109.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>46619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>325.0</td>\n",
" <td>50.0</td>\n",
" <td>275.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>16852.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>200601.0</td>\n",
" <td>1.0</td>\n",
" <td>20060220.0</td>\n",
" <td>9.0</td>\n",
" <td>1150.0</td>\n",
" <td>3101.0</td>\n",
" <td>35.0</td>\n",
" <td>27213.0</td>\n",
" <td>bkmfenpwkw</td>\n",
" <td>9.0</td>\n",
" <td>20060105.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>463.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>325.0</td>\n",
" <td>50.0</td>\n",
" <td>275.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>16852.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>200603.0</td>\n",
" <td>1.0</td>\n",
" <td>20060420.0</td>\n",
" <td>9.0</td>\n",
" <td>5244.0</td>\n",
" <td>3101.0</td>\n",
" <td>35.0</td>\n",
" <td>27213.0</td>\n",
" <td>jffpfriwxy</td>\n",
" <td>4.0</td>\n",
" <td>20060322.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>46619.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>185.0</td>\n",
" <td>50.0</td>\n",
" <td>135.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>16911.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>200611.0</td>\n",
" <td>1.0</td>\n",
" <td>20061217.0</td>\n",
" <td>9.0</td>\n",
" <td>3614.0</td>\n",
" <td>3101.0</td>\n",
" <td>35.0</td>\n",
" <td>27213.0</td>\n",
" <td>hjoyqiywsh</td>\n",
" <td>4.0</td>\n",
" <td>20061118.0</td>\n",
" <td>197901.0</td>\n",
" <td>D10</td>\n",
" <td>90.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>220.0</td>\n",
" <td>50.0</td>\n",
" <td>170.0</td>\n",
" <td>2.0</td>\n",
" <td>qzwib</td>\n",
" <td>17152.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>200601.0</td>\n",
" <td>1.0</td>\n",
" <td>20060214.0</td>\n",
" <td>21.0</td>\n",
" <td>1359.0</td>\n",
" <td>3202.0</td>\n",
" <td>18.0</td>\n",
" <td>11078.0</td>\n",
" <td>dmaumlqtom</td>\n",
" <td>60.0</td>\n",
" <td>20060125.0</td>\n",
" <td>192808.0</td>\n",
" <td>R20</td>\n",
" <td>5640.0</td>\n",
" <td>NaN</td>\n",
" <td>None</td>\n",
" <td></td>\n",
" <td>510.0</td>\n",
" <td>90.0</td>\n",
" <td>420.0</td>\n",
" <td>2.0</td>\n",
" <td>qhtmb</td>\n",
" <td>16846.0</td>\n",
" <td>2006.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" FEEYR ATYPE APP_DAY CTYPE SN CITY HTYPE HID \\\n",
"0 200602.0 1.0 20060309.0 19.0 297.0 3201.0 37.0 161403.0 \n",
"1 200605.0 1.0 20060607.0 1.0 964.0 2101.0 35.0 55584.0 \n",
"2 200607.0 1.0 20060807.0 1.0 215.0 203.0 25.0 338027.0 \n",
"3 200603.0 1.0 20060403.0 1.0 1080.0 1705.0 35.0 71180.0 \n",
"4 200603.0 1.0 20060403.0 1.0 1202.0 1705.0 35.0 71180.0 \n",
"5 200607.0 1.0 20060814.0 9.0 537.0 1705.0 35.0 38728.0 \n",
"6 200607.0 1.0 20060814.0 9.0 303.0 1705.0 35.0 38728.0 \n",
"7 200607.0 1.0 20060814.0 9.0 419.0 1705.0 35.0 38728.0 \n",
"8 200607.0 1.0 20060814.0 9.0 236.0 1705.0 35.0 38728.0 \n",
"9 200607.0 1.0 20060814.0 9.0 505.0 1705.0 35.0 38728.0 \n",
"10 200607.0 1.0 20060814.0 9.0 361.0 1705.0 35.0 38728.0 \n",
"11 200608.0 1.0 20060912.0 9.0 58.0 1705.0 35.0 38728.0 \n",
"12 200609.0 1.0 20061009.0 9.0 3130.0 1704.0 35.0 212055.0 \n",
"13 200609.0 1.0 20061005.0 29.0 526.0 1702.0 38.0 94511.0 \n",
"14 200601.0 1.0 20060218.0 1.0 562.0 3117.0 35.0 316404.0 \n",
"15 200601.0 1.0 20060220.0 9.0 2065.0 3101.0 35.0 27213.0 \n",
"16 200601.0 1.0 20060220.0 9.0 1150.0 3101.0 35.0 27213.0 \n",
"17 200603.0 1.0 20060420.0 9.0 5244.0 3101.0 35.0 27213.0 \n",
"18 200611.0 1.0 20061217.0 9.0 3614.0 3101.0 35.0 27213.0 \n",
"19 200601.0 1.0 20060214.0 21.0 1359.0 3202.0 18.0 11078.0 \n",
"\n",
" DID VSTYPE VSDAY BIR PNO ICD1 ICD2 ICD3 \\\n",
"0 wxtqaajejb 40.0 20060211.0 198207.0 H10 5243.0 NaN None \n",
"1 ccopskyrqu 11.0 20060510.0 198207.0 D10 7080.0 NaN None \n",
"2 gxurzkwtcm 0.0 20060720.0 197108.0 D10 460.0 NaN None \n",
"3 vatnpkveaw 0.0 20060318.0 199107.0 D10 462.0 NaN None \n",
"4 ivdinwvrlk 0.0 20060320.0 199107.0 D10 4659.0 NaN None \n",
"5 ahnakixwyl 9.0 20060730.0 199107.0 D10 4619.0 NaN None \n",
"6 ahnakixwyl 9.0 20060718.0 199107.0 D10 4619.0 38102.0 None \n",
"7 ahnakixwyl 9.0 20060724.0 199107.0 D10 38102.0 4619.0 None \n",
"8 ahnakixwyl 9.0 20060714.0 199107.0 D10 4619.0 3829.0 None \n",
"9 ahnakixwyl 9.0 20060727.0 199107.0 D10 4619.0 NaN None \n",
"10 ahnakixwyl 9.0 20060721.0 199107.0 D10 4619.0 38102.0 None \n",
"11 ahnakixwyl 9.0 20060803.0 199107.0 D10 4619.0 NaN None \n",
"12 isnpfsiwjr 11.0 20060913.0 199107.0 D10 6982.0 NaN None \n",
"13 aphbnraejv 60.0 20060927.0 199107.0 S10 9150.0 NaN None \n",
"14 qokprusepp 0.0 20060115.0 197901.0 D10 4659.0 NaN None \n",
"15 bkmfenpwkw 9.0 20060109.0 197901.0 D10 46619.0 NaN None \n",
"16 bkmfenpwkw 9.0 20060105.0 197901.0 D10 463.0 NaN None \n",
"17 jffpfriwxy 4.0 20060322.0 197901.0 D10 46619.0 NaN None \n",
"18 hjoyqiywsh 4.0 20061118.0 197901.0 D10 90.0 NaN None \n",
"19 dmaumlqtom 60.0 20060125.0 192808.0 R20 5640.0 NaN None \n",
"\n",
" OPC1 T_DOT PDOT TDOT SEX PID APP_DAY2 YR \n",
"0 770.0 50.0 720.0 2.0 erlra 16869.0 2006.0 \n",
"1 396.0 50.0 346.0 2.0 erlra 16959.0 2006.0 \n",
"2 425.0 50.0 375.0 1.0 erlsa 17020.0 2006.0 \n",
"3 396.0 50.0 346.0 2.0 eihya 16894.0 2006.0 \n",
"4 396.0 50.0 346.0 2.0 eihya 16894.0 2006.0 \n",
"5 465.0 70.0 395.0 2.0 eihya 17027.0 2006.0 \n",
"6 539.0 70.0 469.0 2.0 eihya 17027.0 2006.0 \n",
"7 1261.0 70.0 1191.0 2.0 eihya 17027.0 2006.0 \n",
"8 538.0 70.0 468.0 2.0 eihya 17027.0 2006.0 \n",
"9 531.0 70.0 461.0 2.0 eihya 17027.0 2006.0 \n",
"10 531.0 70.0 461.0 2.0 eihya 17027.0 2006.0 \n",
"11 532.0 70.0 462.0 2.0 eihya 17056.0 2006.0 \n",
"12 250.0 50.0 200.0 2.0 eihya 17083.0 2006.0 \n",
"13 480.0 50.0 430.0 2.0 eihya 17079.0 2006.0 \n",
"14 325.0 50.0 275.0 2.0 qzwib 16850.0 2006.0 \n",
"15 325.0 50.0 275.0 2.0 qzwib 16852.0 2006.0 \n",
"16 325.0 50.0 275.0 2.0 qzwib 16852.0 2006.0 \n",
"17 185.0 50.0 135.0 2.0 qzwib 16911.0 2006.0 \n",
"18 220.0 50.0 170.0 2.0 qzwib 17152.0 2006.0 \n",
"19 510.0 90.0 420.0 2.0 qhtmb 16846.0 2006.0 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import itertools\n",
"import pandas\n",
"\n",
"pandas.set_option('display.max_columns', 999)\n",
"\n",
"with sas('original_data/opd3.sas7bdat') as f:\n",
" gen = f.readlines()\n",
" headers = next(gen)\n",
" rows = list(itertools.islice(gen, 0, 20))\n",
" dataframe = pandas.DataFrame(rows, columns=headers)\n",
"dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Covariance Calculation\n",
"\n",
"## Binary Distribution Sample Covariance\n",
"\n",
"Courtesy of Eric S. Wu, we can calculate covariance using the binary distribution sample covariance formula:\n",
"\n",
"$$cov(a, b) = \\frac{nk_{a,b} - k_a k_b}{n(n-1)}$$\n",
"\n",
"where $k_{a,b}$ is the number of cases where $a$ and $b$ co-occurred, $k_{a}$ is the number of cases where $a$ occurred and $k_{b}$ is the number of cases where $b$ occurred. $n$ is the total number of cases.\n",
"\n",
"In our case, since covariance has three more parameters $s$, $l$, and $h$, we produce the formula:\n",
"\n",
"$$cov(x, y, s, l, h) = \\frac{n_{s, l, h}k_{x, y, s, l, h} - k_{x, s, l, h} k_{y, s, l, h}}{n_{s, l, h}(n_{s, l, h}-1)}$$\n",
"\n",
"This simply further parameterizes the variables across the 3 dimensions $s$, $l$, and $h$.\n",
"\n",
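"The formula can be checked with a few lines of Python. A minimal sketch (the function name `binary_cov` is ours, not from the source):\n",
"\n",
"```python\n",
"def binary_cov(k_x, k_y, k_xy, n):\n",
"    # Sample covariance of two binary indicator variables;\n",
"    # note the n*(n-1) denominator (sample normalization).\n",
"    return (n * k_xy - k_x * k_y) / (n * (n - 1))\n",
"```\n",
"\n",
"For instance, `binary_cov(2, 2, 2, 4)` returns $1/3$: the two diseases occur in the same two of four cases, so they covary positively.\n",
"\n",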
"# System Design\n",
"\n",
"## Dimensionality\n",
"\n",
"There are roughly $\\frac{100 \\times 100}{2} = 5{,}000$ combinations of $l, h$ (since $l \\in [0, 100]$ and $h \\in [l, 100]$; the exact count is $\\binom{102}{2} = 5{,}151$). There are two sexes ($female$ or $male$). Hence, there are about 10,000 different $s, l, h$ combinations. There are 9,787 distinct ICD codes in the database. This means the dimensionality of the $(x, s, l, h)$ tuple is around $10{,}000 \\times 10{,}000 = 10^8$, and that of the $(x, y, s, l, h)$ tuple is around $10{,}000 \\times 10{,}000 \\times 10{,}000 = 10^{12}$. It is therefore intractable to precompute and store every $k_{x, y, s, l, h}$ and $k_{x, s, l, h}$ in a database.\n",
"\n",
"## Design Solution\n",
"\n",
"However, a database with appropriate indices on ICD code, sex, and age can easily support pairwise covariance queries. To produce the covariance matrices, a separate in-memory copy of the raw data can be used, and the pairwise covariance queries made against it. Since these queries are independent, the problem is trivially parallelizable; a CUDA implementation can be considered given time.\n",
"\n",
"## Database Schema\n",
"\n",
"This means that the only relevant data we need to retain in our database is:\n",
"\n",
"- Patient ID. This is given in the original database as `PID`\n",
"- ICD Code. However, ICD Code is currently given as `ICD1`, `ICD2`, and `ICD3`. This represents the case of a single individual being given multiple valid diagnoses in a single visit. This complicates our database unnecessarily, and we coerce cases with multiple diagnoses into multiple cases (rows). This results in each row having only `ICD`.\n",
"- Sex, which is given in the original data as `SEX` (duh)\n",
"- Age, which needs to be calculated for each row of data. It can be derived from the visit date of the patient (`VSDAY`, given as `YYYYMMDD`) and the birth date of the patient (`BIR`, given as `YYYYMM`): we simply take `AGE = VSDAY // 10000 - BIR // 100` using integer division, which yields `AGE` in whole years.\n",
"\n",
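"The age computation above can be sketched directly (the helper name `age_from` is hypothetical):\n",
"\n",
"```python\n",
"def age_from(vsday, bir):\n",
"    # VSDAY is YYYYMMDD and BIR is YYYYMM; integer division\n",
"    # keeps only the year components, giving age in whole years.\n",
"    return int(vsday) // 10000 - int(bir) // 100\n",
"```\n",
"\n",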
"Then, we create a single B-tree on the tuple `(ICD, SEX, AGE)`. We now have the schema:\n",
"\n",
"```SQL\n",
"CREATE SEQUENCE patient_id_seq;\n",
"\n",
"CREATE TABLE patients (\n",
" ID INTEGER DEFAULT nextval('patient_id_seq'),\n",
" PID VARCHAR(11) NOT NULL,\n",
" ICD INTEGER NOT NULL,\n",
" SEX INTEGER NOT NULL,\n",
" AGE INTEGER NOT NULL,\n",
" PRIMARY KEY (ID)\n",
")\n",
"\n",
"CREATE INDEX ICD_SEX_AGE ON patients (ICD, SEX, AGE)\n",
"```\n",
"\n",
"This allows us to quickly query the number of patients with a specific disease, sex, and age range using\n",
"\n",
"```SQL\n",
"WITH pid_x AS (\n",
"  SELECT DISTINCT PID FROM patients WHERE ICD=100 AND SEX=1 AND AGE>=10 AND AGE<20\n",
"), pid_y AS (\n",
"  SELECT DISTINCT PID FROM patients WHERE ICD=200 AND SEX=1 AND AGE>=10 AND AGE<20\n",
"), pid_xy AS (\n",
"  SELECT PID FROM pid_x INTERSECT SELECT PID FROM pid_y\n",
")\n",
"SELECT (SELECT COUNT(*) FROM pid_x) AS kx,\n",
"       (SELECT COUNT(*) FROM pid_y) AS ky,\n",
"       (SELECT COUNT(*) FROM pid_xy) AS kxy\n",
"```\n",
"\n",
"Furthermore, we cache the $n_{s, l, h}$ values:\n",
"\n",
"```SQL\n",
"CREATE TABLE patient_counts(\n",
" AGE INTEGER NOT NULL,\n",
" SEX INTEGER NOT NULL,\n",
" N INTEGER NOT NULL,\n",
" PRIMARY KEY (AGE, SEX)\n",
")\n",
"```\n",
"\n",
"and these values can be extracted quickly via:\n",
"\n",
"```SQL\n",
"SELECT SUM(N)\n",
"FROM patient_counts\n",
"WHERE SEX=sex AND AGE>=l AND AGE<h;\n",
"```\n",
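"\n",
"The same lookup can be sketched in memory, assuming (hypothetically) the cached counts live in a dict keyed by `(sex, age)`:\n",
"\n",
"```python\n",
"def n_slh(counts, sex, l, h):\n",
"    # counts: {(sex, age): patient count}; mirrors\n",
"    # SELECT SUM(N) ... WHERE SEX=sex AND AGE>=l AND AGE<h\n",
"    return sum(n for (s, age), n in counts.items()\n",
"               if s == sex and l <= age < h)\n",
"```\n",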
"\n",
"# Database Setup\n",
"\n",
"## Conversion to CSV\n",
"\n",
"The data is given in `.sas7bdat`, which is painful to work with. Converting it all to `.csv` using the `sas7bdat_to_csv` script provided with the `sas7bdat` Python package turned the 5 GB of SAS data into 4 GB of CSV.\n",
"\n",
"## Create CSVs to Populate DB\n",
"\n",
"This section is done in `create_db_data.py` in the root directory.\n",
"\n",
"We want to iterate through each row of data, doing the following:\n",
"\n",
"1. Retain only the columns `[PID, ICD1, ICD2, ICD3, SEX, VSDAY, BIR]`\n",
"2. Calculate `AGE = VSDAY // 10000 - BIR // 100` (integer division), since `VSDAY` is given as `YYYYMMDD` and `BIR` as `YYYYMM`. This produces `AGE` in whole years.\n",
"3. If `ICD2` or `ICD3` is populated, discard them and create additional rows using the `ICD2` and/or `ICD3` values as `ICD1` values for the new rows.\n",
"4. Rename `ICD1` to `ICD`.\n",
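"\n",
"A minimal sketch of steps 1 through 4 (a simplification, not the actual `parse/parse_raw.py`; missing diagnosis codes are assumed to arrive as `None`):\n",
"\n",
"```python\n",
"def expand_row(row):\n",
"    # row: dict with keys PID, ICD1, ICD2, ICD3, SEX, VSDAY, BIR\n",
"    age = int(row['VSDAY']) // 10000 - int(row['BIR']) // 100\n",
"    for key in ('ICD1', 'ICD2', 'ICD3'):\n",
"        if row.get(key) is not None:  # one output row per populated code\n",
"            yield {'PID': row['PID'], 'ICD': int(row[key]),\n",
"                   'SEX': int(row['SEX']), 'AGE': age}\n",
"```\n",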
"\n",
"This logic is in `parse/parse_raw.py`.\n",
"\n",
"This produces `csv_data/patients.csv` (682.5 MB).\n",
"\n",
"We also want to calculate patient counts to populate the `patient_counts` table. To do this, we count the number of distinct patients for each sex and age.\n",
"\n",
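"A minimal sketch of that count, assuming records are already reduced to `(PID, SEX, AGE)` tuples and that distinct patients are counted per cell (the helper name is ours):\n",
"\n",
"```python\n",
"from collections import defaultdict\n",
"\n",
"def patient_counts(records):\n",
"    # records: iterable of (PID, SEX, AGE); count distinct\n",
"    # patients in each (SEX, AGE) cell\n",
"    cells = defaultdict(set)\n",
"    for pid, sex, age in records:\n",
"        cells[(sex, age)].add(pid)\n",
"    return {cell: len(pids) for cell, pids in cells.items()}\n",
"```\n",
"\n",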
"This produces `csv_data/patient_counts.csv`.\n",
"\n",
"## Creating a DB Instance\n",
"\n",
"I launched an Amazon RDS instance. Read-only credentials to the instance are:\n",
"\n",
"- `host`: `pwas-postgres.c5fx9eaq2kdo.us-east-1.rds.amazonaws.com`\n",
"- `port`: 5432\n",
"- `login`: `guest`\n",
"- `password`: `guest`\n",
"\n",
"For admin privileges, contact the author.\n",
"\n",
"Tables and sequences created are in `create_table.sql`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}