Created
February 7, 2023 17:01
MIMIC-IV dataset split for automatic ICD coding
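The notebook in this gist splits patients (SUBJECT_ID), not individual admissions, in the ratio 9 : 1/3 : 2/3, and then maps each patient's admissions (HADM_ID) into the same split. A minimal sketch of that idea on toy data follows. The function name `split_by_subject`, the toy IDs, and the use of a local `random.Random` instance are illustrative assumptions, not the notebook's exact code.

```python
import collections
import random


def split_by_subject(hadm_to_subject, seed=42):
    """Split admission IDs into train/val/test at the patient level (9 : 1/3 : 2/3)."""
    # Group admissions by patient so that a patient never spans two splits
    subject_to_hadm = collections.defaultdict(list)
    for hadm_id, subject_id in hadm_to_subject.items():
        subject_to_hadm[subject_id].append(hadm_id)

    # Sort before shuffling so the result depends only on the seed
    subjects = sorted(subject_to_hadm)
    random.Random(seed).shuffle(subjects)

    n_train = int(len(subjects) * 0.9)
    n_val = int(len(subjects) * 0.1 / 3)
    groups = {
        "train": subjects[:n_train],
        "val": subjects[n_train:n_train + n_val],
        "test": subjects[n_train + n_val:],
    }
    # Expand each patient group back into its admission IDs
    return {
        name: {hadm for subject in group for hadm in subject_to_hadm[subject]}
        for name, group in groups.items()
    }


# Toy data (hypothetical IDs): 30 patients with two admissions each
toy = {f"h{2 * i + k}": f"s{i}" for i in range(30) for k in range(2)}
splits = split_by_subject(toy)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting on patients rather than admissions keeps all notes of one patient in a single split, which avoids leaking patient-specific text between train and test.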
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"id": "726983da", | |
"metadata": {}, | |
"source": [ | |
"## MIMIC-IV dataset split for AnEMIC" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "1caf7499", | |
"metadata": {}, | |
"source": [ | |
"This notebook generates the train/val/test splits of three ICD coding datasets from MIMIC-IV.\n", | |
"1. **MIMIC-IV full**, which contains all admissions with discharge summaries and uses all the ICD codes as the label set.\n", | |
"2. **MIMIC-IV top-50**, which uses the 50 most frequent ICD codes as the label set and contains admissions that have at least one top-50 ICD code.\n", | |
"3. **MIMIC-IV full inclusive**, which is derived from MIMIC-IV full and uses only the ICD codes that appear in all of the train, validation, and test splits of MIMIC-IV full.\n", | |
"\n", | |
"The (presumed) steps for generating the split of the CAML dataset (MIMIC-III) are as follows:\n", | |
"1. Split the set of all SUBJECT_IDs into train:val:test = 9 : 1/3 : 2/3 (i.e. 90% : 3.33% : 6.67%)\n", | |
"2. Derive the HADM_ID splits from the SUBJECT_ID splits of step 1 -> MIMIC-III full\n", | |
"3. From step 2, select the HADM_IDs that have at least one top-50 ICD code\n", | |
"4. Select HADM_IDs randomly to match the number of instances in Shi et al. (8066:1573:1729) -> MIMIC-III top-50\n", | |
"\n", | |
"We follow very similar steps here. However, since ICD-9 and ICD-10 codes are difficult to convert to one another, we generate two versions of each dataset: one containing admissions coded with ICD-9 and another containing admissions coded with ICD-10." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"id": "e192430f", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import collections\n", | |
"import json\n", | |
"import os\n", | |
"import random\n", | |
"import pandas as pd" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f0c178dd", | |
"metadata": {}, | |
"source": [ | |
"### 1. Loading MIMIC-IV" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"id": "33080ef6", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"mimic4_root = '../mimic4/mimic-iv-2.2/'" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"id": "ad7c072f", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Admission table (admissions)\n", | |
"df_admission = pd.read_csv(os.path.join(mimic4_root, 'hosp', 'admissions.csv.gz'),\n", | |
" dtype={'subject_id': 'string', 'hadm_id': 'string'})\n", | |
"\n", | |
"# Diagnosis & procedure\n", | |
"df_diag = pd.read_csv(os.path.join(mimic4_root, 'hosp', 'diagnoses_icd.csv.gz'),\n", | |
" dtype={\"icd_code\": \"string\", \"subject_id\": \"string\", \"hadm_id\": \"string\"})\n", | |
"df_proc = pd.read_csv(os.path.join(mimic4_root, 'hosp', 'procedures_icd.csv.gz'),\n", | |
" dtype={\"icd_code\": \"string\", \"subject_id\": \"string\", \"hadm_id\": \"string\"})\n", | |
"\n", | |
"# Discharge summary\n", | |
"df_disch = pd.read_csv(os.path.join(mimic4_root, 'note', 'discharge.csv.gz'),\n", | |
" dtype={'note_id': 'string', 'subject_id': 'string', 'hadm_id': 'string', 'text': 'string'})" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "dfde76bb", | |
"metadata": {}, | |
"source": [ | |
"### 2. MIMIC-IV full" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"id": "9de78562", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"seed_num = 42\n", | |
"random.seed(seed_num)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"id": "a414eaa0", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"HADM_IDs with discharge summary: 331794\n", | |
"\n", | |
"HADM_IDs with ICD-9 diagnoses : 276803\n", | |
"HADM_IDs with ICD-10 diagnoses : 154059\n", | |
"HADM_IDs with ICD-9 procedures : 155891\n", | |
"HADM_IDs with ICD-10 procedures: 73555\n", | |
"\n", | |
"HADM_IDs with ICD-9 only diag & proc : 276800\n", | |
"HADM_IDs with ICD-10 only diag & proc: 154066\n", | |
"\n", | |
"HADM_IDs with ICD-9 and discharge summary : 209352\n", | |
"HADM_IDs with ICD-10 and discharge summary: 122310\n" | |
] | |
} | |
], | |
"source": [ | |
"# Get the list of HADM_IDs / SUBJECT_IDs that have (diagnoses or procedures) and a discharge summary\n", | |
"# Discharge summary\n", | |
"hadm_disch_set = set(df_disch['hadm_id'].unique())\n", | |
"print(f'HADM_IDs with discharge summary: {len(hadm_disch_set)}')\n", | |
"\n", | |
"# HADM_IDs with ICD-9/10 diagnoses / procedures\n", | |
"# Note that some admissions have both ICD-9 and ICD-10\n", | |
"hadm_diag_icd9_set_temp = set(df_diag[df_diag.icd_version == 9].hadm_id.unique())\n", | |
"hadm_diag_icd10_set_temp = set(df_diag[df_diag.icd_version == 10].hadm_id.unique())\n", | |
"hadm_proc_icd9_set_temp = set(df_proc[df_proc.icd_version == 9].hadm_id.unique())\n", | |
"hadm_proc_icd10_set_temp = set(df_proc[df_proc.icd_version == 10].hadm_id.unique())\n", | |
"\n", | |
"print()\n", | |
"print(f'HADM_IDs with ICD-9 diagnoses : {len(hadm_diag_icd9_set_temp)}')\n", | |
"print(f'HADM_IDs with ICD-10 diagnoses : {len(hadm_diag_icd10_set_temp)}')\n", | |
"print(f'HADM_IDs with ICD-9 procedures : {len(hadm_proc_icd9_set_temp)}')\n", | |
"print(f'HADM_IDs with ICD-10 procedures: {len(hadm_proc_icd10_set_temp)}')\n", | |
"\n", | |
"hadm_icd9_set = (hadm_diag_icd9_set_temp | hadm_proc_icd9_set_temp) - (hadm_diag_icd10_set_temp | hadm_proc_icd10_set_temp)\n", | |
"hadm_icd10_set = (hadm_diag_icd10_set_temp | hadm_proc_icd10_set_temp) - (hadm_diag_icd9_set_temp | hadm_proc_icd9_set_temp)\n", | |
"\n", | |
"print()\n", | |
"print(f'HADM_IDs with ICD-9 only diag & proc : {len(hadm_icd9_set)}')\n", | |
"print(f'HADM_IDs with ICD-10 only diag & proc: {len(hadm_icd10_set)}')\n", | |
"\n", | |
"hadm_icd9_set &= hadm_disch_set\n", | |
"hadm_icd10_set &= hadm_disch_set\n", | |
"\n", | |
"print()\n", | |
"print(f'HADM_IDs with ICD-9 and discharge summary : {len(hadm_icd9_set)}')\n", | |
"print(f'HADM_IDs with ICD-10 and discharge summary: {len(hadm_icd10_set)}')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"id": "b85adb97", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# HADM_ID -> SUBJECT_ID mapping (zipping the two columns avoids a slow row-by-row iterrows loop)\n", | |
"hadm_to_subject = dict(zip(df_admission.hadm_id, df_admission.subject_id))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"id": "791bb9e3", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"SUBJECT_IDs with ICD-9 and discharge summary : 97726\n", | |
"SUBJECT_IDs with ICD-10 and discharge summary: 65683\n" | |
] | |
} | |
], | |
"source": [ | |
"# SUBJECT_ID -> HADM_ID list mapping\n", | |
"# Note that a patient can have admissions coded with different ICD versions (ICD-9 in one admission, ICD-10 in another)\n", | |
"# We keep these mappings so we can find the HADM_IDs with the correct ICD version from a SUBJECT_ID\n", | |
"subject_to_hadm_icd9 = collections.defaultdict(list)\n", | |
"for hadm in hadm_icd9_set:\n", | |
"    subject_to_hadm_icd9[hadm_to_subject[hadm]].append(hadm)\n", | |
"\n", | |
"subject_to_hadm_icd10 = collections.defaultdict(list)\n", | |
"for hadm in hadm_icd10_set:\n", | |
"    subject_to_hadm_icd10[hadm_to_subject[hadm]].append(hadm)\n", | |
"\n", | |
"subject_icd9_list = sorted(subject_to_hadm_icd9.keys())\n", | |
"subject_icd10_list = sorted(subject_to_hadm_icd10.keys())\n", | |
"\n", | |
"print(f'SUBJECT_IDs with ICD-9 and discharge summary : {len(subject_icd9_list)}')\n", | |
"print(f'SUBJECT_IDs with ICD-10 and discharge summary: {len(subject_icd10_list)}')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"id": "3ee761da", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Split at the SUBJECT_ID level and then select HADM_IDs according to the SUBJECT_ID splits\n", | |
"random.shuffle(subject_icd9_list)\n", | |
"N_train_subject_icd9 = int(len(subject_icd9_list) * 0.9)\n", | |
"N_val_subject_icd9 = int(len(subject_icd9_list) * 0.1 / 3)\n", | |
"\n", | |
"train_subject_icd9_set = set(subject_icd9_list[:N_train_subject_icd9])\n", | |
"val_subject_icd9_set = set(subject_icd9_list[N_train_subject_icd9:N_train_subject_icd9+N_val_subject_icd9])\n", | |
"test_subject_icd9_set = set(subject_icd9_list[N_train_subject_icd9+N_val_subject_icd9:])\n", | |
"\n", | |
"train_hadm_icd9_set = set([hadm_id for subject_id in train_subject_icd9_set for hadm_id in subject_to_hadm_icd9[subject_id]])\n", | |
"val_hadm_icd9_set = set([hadm_id for subject_id in val_subject_icd9_set for hadm_id in subject_to_hadm_icd9[subject_id]])\n", | |
"test_hadm_icd9_set = set([hadm_id for subject_id in test_subject_icd9_set for hadm_id in subject_to_hadm_icd9[subject_id]])\n", | |
"\n", | |
"random.shuffle(subject_icd10_list)\n", | |
"N_train_subject_icd10 = int(len(subject_icd10_list) * 0.9)\n", | |
"N_val_subject_icd10 = int(len(subject_icd10_list) * 0.1 / 3)\n", | |
"\n", | |
"train_subject_icd10_set = set(subject_icd10_list[:N_train_subject_icd10])\n", | |
"val_subject_icd10_set = set(subject_icd10_list[N_train_subject_icd10:N_train_subject_icd10+N_val_subject_icd10])\n", | |
"test_subject_icd10_set = set(subject_icd10_list[N_train_subject_icd10+N_val_subject_icd10:])\n", | |
"\n", | |
"train_hadm_icd10_set = set([hadm_id for subject_id in train_subject_icd10_set for hadm_id in subject_to_hadm_icd10[subject_id]])\n", | |
"val_hadm_icd10_set = set([hadm_id for subject_id in val_subject_icd10_set for hadm_id in subject_to_hadm_icd10[subject_id]])\n", | |
"test_hadm_icd10_set = set([hadm_id for subject_id in test_subject_icd10_set for hadm_id in subject_to_hadm_icd10[subject_id]])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"id": "debd66d3", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Full ICD-9 HADM_ID(SUBJECT_ID): 209352( 97726)\n", | |
"Train: 188533 ( 87953)\n", | |
"Val : 7110 ( 3257)\n", | |
"Test : 13709 ( 6516)\n", | |
"\n", | |
"Full ICD-10 HADM_ID(SUBJECT_ID): 122310( 65683)\n", | |
"Train: 110442 ( 59114)\n", | |
"Val : 4017 ( 2189)\n", | |
"Test : 7851 ( 4380)\n" | |
] | |
} | |
], | |
"source": [ | |
"print(f\"Full ICD-9 HADM_ID(SUBJECT_ID): {len(hadm_icd9_set):6d}({len(subject_icd9_list):6d})\")\n", | |
"print(f\"Train: {len(train_hadm_icd9_set):6d} ({len(train_subject_icd9_set):6d})\")\n", | |
"print(f\"Val : {len(val_hadm_icd9_set):6d} ({len(val_subject_icd9_set):6d})\")\n", | |
"print(f\"Test : {len(test_hadm_icd9_set):6d} ({len(test_subject_icd9_set):6d})\")\n", | |
"print()\n", | |
"print(f\"Full ICD-10 HADM_ID(SUBJECT_ID): {len(hadm_icd10_set):6d}({len(subject_icd10_list):6d})\")\n", | |
"print(f\"Train: {len(train_hadm_icd10_set):6d} ({len(train_subject_icd10_set):6d})\")\n", | |
"print(f\"Val : {len(val_hadm_icd10_set):6d} ({len(val_subject_icd10_set):6d})\")\n", | |
"print(f\"Test : {len(test_hadm_icd10_set):6d} ({len(test_subject_icd10_set):6d})\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"id": "8334d03d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Sanity check - no overlap between any pair of HADM_ID splits\n", | |
"sets = [train_hadm_icd9_set, val_hadm_icd9_set, test_hadm_icd9_set,\n", | |
"        train_hadm_icd10_set, val_hadm_icd10_set, test_hadm_icd10_set]\n", | |
"for i in range(6):\n", | |
"    for j in range(i+1, 6):\n", | |
"        assert not (sets[i] & sets[j])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"id": "38c3d334", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Write to files\n", | |
"with open(\"full_icd9_train_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(train_hadm_icd9_set), fd, indent=4)\n", | |
"with open(\"full_icd9_val_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(val_hadm_icd9_set), fd, indent=4)\n", | |
"with open(\"full_icd9_test_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(test_hadm_icd9_set), fd, indent=4)\n", | |
"\n", | |
"with open(\"full_icd10_train_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(train_hadm_icd10_set), fd, indent=4)\n", | |
"with open(\"full_icd10_val_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(val_hadm_icd10_set), fd, indent=4)\n", | |
"with open(\"full_icd10_test_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(test_hadm_icd10_set), fd, indent=4)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "1034056a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "0adbfc7b", | |
"metadata": {}, | |
"source": [ | |
"### 3. MIMIC-IV top-50" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"id": "bd768b4a", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def reformat_icd9_code(icd_code, is_diagnosis_code):\n", | |
"    code = \"\".join(icd_code.split(\".\"))\n", | |
"    if is_diagnosis_code:\n", | |
"        if code.startswith(\"E\"):\n", | |
"            if len(code) > 4:\n", | |
"                code = code[:4] + \".\" + code[4:]\n", | |
"        else:\n", | |
"            if len(code) > 3:\n", | |
"                code = code[:3] + \".\" + code[3:]\n", | |
"    else:\n", | |
"        if len(code) > 2:\n", | |
"            code = code[:2] + \".\" + code[2:]\n", | |
"    return code" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"id": "443c4d55", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_diag_icd9 = df_diag[df_diag.icd_version == 9]\n", | |
"df_diag_icd10 = df_diag[df_diag.icd_version == 10]\n", | |
"df_proc_icd9 = df_proc[df_proc.icd_version == 9]\n", | |
"df_proc_icd10 = df_proc[df_proc.icd_version == 10]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"id": "e97dc189", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Train ICD-9 diag: 8694 unique, 2198671 occurred\n", | |
"Val ICD-9 diag : 3963 unique, 83738 occurred\n", | |
"Test ICD-9 diag : 4833 unique, 159135 occurred\n", | |
"Train ICD-9 proc: 2451 unique, 336154 occurred\n", | |
"Val ICD-9 proc : 1152 unique, 12929 occurred\n", | |
"Test ICD-9 proc : 1431 unique, 24822 occurred\n" | |
] | |
} | |
], | |
"source": [ | |
"train_diag_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, True) for icd_code in\n", | |
" df_diag_icd9[df_diag_icd9.hadm_id.map(lambda x: x in train_hadm_icd9_set)].icd_code])\n", | |
"val_diag_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, True) for icd_code in\n", | |
" df_diag_icd9[df_diag_icd9.hadm_id.map(lambda x: x in val_hadm_icd9_set)].icd_code])\n", | |
"test_diag_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, True) for icd_code in\n", | |
" df_diag_icd9[df_diag_icd9.hadm_id.map(lambda x: x in test_hadm_icd9_set)].icd_code])\n", | |
"print(f\"Train ICD-9 diag: {len(train_diag_icd9_counter):5d} unique, {sum(train_diag_icd9_counter.values()):7d} occurred\")\n", | |
"print(f\"Val ICD-9 diag : {len(val_diag_icd9_counter):5d} unique, {sum(val_diag_icd9_counter.values()):7d} occurred\")\n", | |
"print(f\"Test ICD-9 diag : {len(test_diag_icd9_counter):5d} unique, {sum(test_diag_icd9_counter.values()):7d} occurred\")\n", | |
"\n", | |
"train_proc_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, False) for icd_code in\n", | |
" df_proc_icd9[df_proc_icd9.hadm_id.map(lambda x: x in train_hadm_icd9_set)].icd_code])\n", | |
"val_proc_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, False) for icd_code in\n", | |
" df_proc_icd9[df_proc_icd9.hadm_id.map(lambda x: x in val_hadm_icd9_set)].icd_code])\n", | |
"test_proc_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, False) for icd_code in\n", | |
" df_proc_icd9[df_proc_icd9.hadm_id.map(lambda x: x in test_hadm_icd9_set)].icd_code])\n", | |
"print(f\"Train ICD-9 proc: {len(train_proc_icd9_counter):5d} unique, {sum(train_proc_icd9_counter.values()):7d} occurred\")\n", | |
"print(f\"Val ICD-9 proc : {len(val_proc_icd9_counter):5d} unique, {sum(val_proc_icd9_counter.values()):7d} occurred\")\n", | |
"print(f\"Test ICD-9 proc : {len(test_proc_icd9_counter):5d} unique, {sum(test_proc_icd9_counter.values()):7d} occurred\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"id": "8b3db7de", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Total ICD-9 diag: 8837\n", | |
"Total ICD-9 proc: 2494\n", | |
"Overlap : 0\n" | |
] | |
} | |
], | |
"source": [ | |
"# Sanity check - Is there any ICD-9 code that is both a diagnosis code (ICD-9-CM Volumes 1-2) and a procedure code (ICD-9-CM Volume 3)?\n", | |
"diag_icd9_set = set(train_diag_icd9_counter.keys()) | set(val_diag_icd9_counter.keys()) | set(test_diag_icd9_counter.keys())\n", | |
"proc_icd9_set = set(train_proc_icd9_counter.keys()) | set(val_proc_icd9_counter.keys()) | set(test_proc_icd9_counter.keys())\n", | |
"print(f'Total ICD-9 diag: {len(diag_icd9_set)}')\n", | |
"print(f'Total ICD-9 proc: {len(proc_icd9_set)}')\n", | |
"print(f'Overlap : {len(diag_icd9_set & proc_icd9_set)}')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"id": "2e6bfa9a", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[('401.9', 81673), ('272.4', 57822), ('530.81', 41944), ('250.00', 34407), ('428.0', 32424), ('427.31', 31915), ('414.01', 30278), ('V15.82', 28601), ('311', 28248), ('584.9', 26961), ('244.9', 24117), ('285.9', 21802), ('305.1', 21410), ('403.90', 21385), ('V58.61', 19686), ('300.00', 18204), ('599.0', 18107), ('V58.67', 16854), ('585.9', 15942), ('493.90', 15775), ('272.0', 15739), ('327.23', 15277), ('38.93', 13618), ('V45.82', 13111), ('412', 12835), ('496', 12432), ('V58.66', 12353), ('278.00', 12141), ('276.1', 11840), ('V45.81', 11442), ('733.00', 10755), ('486', 10169), ('V12.51', 10131), ('338.29', 10063), ('V49.86', 9907), ('38.97', 9807), ('274.9', 9704), ('414.00', 9552), ('285.1', 9442), ('276.51', 9214), ('V12.54', 9169), ('276.2', 8819), ('600.00', 8777), ('564.00', 8743), ('357.2', 8406), ('585.6', 8256), ('287.5', 8145), ('427.89', 8138), ('96.6', 7696), ('428.32', 7487)]\n" | |
] | |
} | |
], | |
"source": [ | |
"icd9_counter = train_diag_icd9_counter + val_diag_icd9_counter + test_diag_icd9_counter \\\n", | |
" + train_proc_icd9_counter + val_proc_icd9_counter + test_proc_icd9_counter\n", | |
"print(icd9_counter.most_common(50))\n", | |
"icd9_top50 = set([icd_code for icd_code, _ in icd9_counter.most_common(50)])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"id": "58c1380d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Top-k diag/proc occurrences\n", | |
"df_diag_icd9_top50 = df_diag_icd9[df_diag_icd9.icd_code.map(lambda x: reformat_icd9_code(x, True) in icd9_top50)]\n", | |
"df_proc_icd9_top50 = df_proc_icd9[df_proc_icd9.icd_code.map(lambda x: reformat_icd9_code(x, False) in icd9_top50)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"id": "7aca660b", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Admissions that have at least one top-50 ICD-9 code\n", | |
"train_hadm_50_icd9_set = train_hadm_icd9_set & (set(df_diag_icd9_top50.hadm_id) | set(df_proc_icd9_top50.hadm_id))\n", | |
"val_hadm_50_icd9_set = val_hadm_icd9_set & (set(df_diag_icd9_top50.hadm_id) | set(df_proc_icd9_top50.hadm_id))\n", | |
"test_hadm_50_icd9_set = test_hadm_icd9_set & (set(df_diag_icd9_top50.hadm_id) | set(df_proc_icd9_top50.hadm_id))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"id": "40c34e1b", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Top-50 ICD-9 HADM_ID (SUBJECT_ID)\n", | |
"Train: 170664 (77890)\n", | |
"Val : 6406 ( 2871)\n", | |
"Test : 12405 ( 5756)\n" | |
] | |
} | |
], | |
"source": [ | |
"print(f\"Top-50 ICD-9 HADM_ID (SUBJECT_ID)\")\n", | |
"print(f\"Train: {len(train_hadm_50_icd9_set):6d} ({len(set(map(hadm_to_subject.get, train_hadm_50_icd9_set))):5d})\")\n", | |
"print(f\"Val : {len(val_hadm_50_icd9_set):6d} ({len(set(map(hadm_to_subject.get, val_hadm_50_icd9_set))):5d})\")\n", | |
"print(f\"Test : {len(test_hadm_50_icd9_set):6d} ({len(set(map(hadm_to_subject.get, test_hadm_50_icd9_set))):5d})\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"id": "08694017", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Write to files\n", | |
"with open(\"top50_icd9_train_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(train_hadm_50_icd9_set), fd, indent=4)\n", | |
"with open(\"top50_icd9_val_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(val_hadm_50_icd9_set), fd, indent=4)\n", | |
"with open(\"top50_icd9_test_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(test_hadm_50_icd9_set), fd, indent=4)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "5d7c74d0", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"id": "1ec5d8a8", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Train ICD-10 diag: 15681 unique, 1595623 occurred\n", | |
"Val ICD-10 diag : 4748 unique, 58397 occurred\n", | |
"Test ICD-10 diag : 6387 unique, 111289 occurred\n", | |
"Train ICD-10 proc: 9549 unique, 188681 occurred\n", | |
"Val ICD-10 proc : 1990 unique, 7119 occurred\n", | |
"Test ICD-10 proc : 2772 unique, 13229 occurred\n" | |
] | |
} | |
], | |
"source": [ | |
"train_diag_icd10_counter = collections.Counter(df_diag_icd10[df_diag_icd10.hadm_id.map(lambda x: x in train_hadm_icd10_set)].icd_code)\n", | |
"val_diag_icd10_counter = collections.Counter(df_diag_icd10[df_diag_icd10.hadm_id.map(lambda x: x in val_hadm_icd10_set)].icd_code)\n", | |
"test_diag_icd10_counter = collections.Counter(df_diag_icd10[df_diag_icd10.hadm_id.map(lambda x: x in test_hadm_icd10_set)].icd_code)\n", | |
"print(f\"Train ICD-10 diag: {len(train_diag_icd10_counter):5d} unique, {sum(train_diag_icd10_counter.values()):7d} occurred\")\n", | |
"print(f\"Val ICD-10 diag : {len(val_diag_icd10_counter):5d} unique, {sum(val_diag_icd10_counter.values()):7d} occurred\")\n", | |
"print(f\"Test ICD-10 diag : {len(test_diag_icd10_counter):5d} unique, {sum(test_diag_icd10_counter.values()):7d} occurred\")\n", | |
"\n", | |
"train_proc_icd10_counter = collections.Counter(df_proc_icd10[df_proc_icd10.hadm_id.map(lambda x: x in train_hadm_icd10_set)].icd_code)\n", | |
"val_proc_icd10_counter = collections.Counter(df_proc_icd10[df_proc_icd10.hadm_id.map(lambda x: x in val_hadm_icd10_set)].icd_code)\n", | |
"test_proc_icd10_counter = collections.Counter(df_proc_icd10[df_proc_icd10.hadm_id.map(lambda x: x in test_hadm_icd10_set)].icd_code)\n", | |
"print(f\"Train ICD-10 proc: {len(train_proc_icd10_counter):5d} unique, {sum(train_proc_icd10_counter.values()):7d} occurred\")\n", | |
"print(f\"Val ICD-10 proc : {len(val_proc_icd10_counter):5d} unique, {sum(val_proc_icd10_counter.values()):7d} occurred\")\n", | |
"print(f\"Test ICD-10 proc : {len(test_proc_icd10_counter):5d} unique, {sum(test_proc_icd10_counter.values()):7d} occurred\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"id": "449ad4f5", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Total ICD-10 diag: 16154\n", | |
"Total ICD-10 proc: 9942\n", | |
"Overlap : 0\n" | |
] | |
} | |
], | |
"source": [ | |
"# Sanity check - Is there any ICD-10 code that is both diagnosis (ICD-10-CM) and procedure (ICD-10-PCS)\n", | |
"diag_icd10_set = set(train_diag_icd10_counter.keys()) | set(val_diag_icd10_counter.keys()) | set(test_diag_icd10_counter.keys())\n", | |
"proc_icd10_set = set(train_proc_icd10_counter.keys()) | set(val_proc_icd10_counter.keys()) | set(test_proc_icd10_counter.keys())\n", | |
"print(f'Total ICD-10 diag: {len(diag_icd10_set)}')\n", | |
"print(f'Total ICD-10 proc: {len(proc_icd10_set)}')\n", | |
"print(f'Overlap : {len(diag_icd10_set & proc_icd10_set)}')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"id": "fed9dbb5", | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"[('E785', 44043), ('I10', 43573), ('Z87891', 36296), ('K219', 30802), ('F329', 23232), ('I2510', 22606), ('N179', 19706), ('F419', 19156), ('Z7901', 15319), ('Z794', 15276), ('E039', 15253), ('E119', 13571), ('G4733', 12661), ('D649', 12467), ('E669', 12146), ('I4891', 12033), ('F17210', 11619), ('Y929', 11550), ('Z66', 10743), ('J45909', 10612), ('Z7902', 10515), ('J449', 10269), ('D62', 10132), ('02HV33Z', 10017), ('N390', 9658), ('I129', 9432), ('E1122', 9204), ('E871', 8647), ('I252', 8576), ('N189', 8566), ('E872', 8162), ('Z8673', 7910), ('Z955', 7759), ('Z86718', 7596), ('G8929', 7534), ('I110', 7436), ('K5900', 7098), ('N400', 6816), ('N183', 6804), ('I480', 6699), ('I130', 6516), ('G4700', 6450), ('D696', 6439), ('Z951', 6273), ('M109', 6223), ('Y92239', 5982), ('J9601', 5896), ('J189', 5791), ('Z23', 5713), ('Y92230', 5653)]\n" | |
] | |
} | |
], | |
"source": [ | |
"icd10_counter = train_diag_icd10_counter + val_diag_icd10_counter + test_diag_icd10_counter \\\n", | |
" + train_proc_icd10_counter + val_proc_icd10_counter + test_proc_icd10_counter\n", | |
"print(icd10_counter.most_common(50))\n", | |
"icd10_top50 = set([icd_code for icd_code, _ in icd10_counter.most_common(50)])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"id": "1ca82d05", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_diag_icd10_top50 = df_diag_icd10[df_diag_icd10.icd_code.map(lambda x: x in icd10_top50)]\n", | |
"df_proc_icd10_top50 = df_proc_icd10[df_proc_icd10.icd_code.map(lambda x: x in icd10_top50)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"id": "e90941b7", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"train_hadm_50_icd10_set = train_hadm_icd10_set & (set(df_diag_icd10_top50.hadm_id) | set(df_proc_icd10_top50.hadm_id))\n", | |
"val_hadm_50_icd10_set = val_hadm_icd10_set & (set(df_diag_icd10_top50.hadm_id) | set(df_proc_icd10_top50.hadm_id))\n", | |
"test_hadm_50_icd10_set = test_hadm_icd10_set & (set(df_diag_icd10_top50.hadm_id) | set(df_proc_icd10_top50.hadm_id))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 26, | |
"id": "15ae411d", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Top-50 ICD-10 HADM_ID (SUBJECT_ID)\n", | |
"Train: 104077 (55149)\n", | |
"Val : 3805 ( 2048)\n", | |
"Test : 7368 ( 4068)\n" | |
] | |
} | |
], | |
"source": [ | |
"print(f\"Top-50 ICD-10 HADM_ID (SUBJECT_ID)\")\n", | |
"print(f\"Train: {len(train_hadm_50_icd10_set):6d} ({len(set(map(hadm_to_subject.get, train_hadm_50_icd10_set))):5d})\")\n", | |
"print(f\"Val : {len(val_hadm_50_icd10_set):6d} ({len(set(map(hadm_to_subject.get, val_hadm_50_icd10_set))):5d})\")\n", | |
"print(f\"Test : {len(test_hadm_50_icd10_set):6d} ({len(set(map(hadm_to_subject.get, test_hadm_50_icd10_set))):5d})\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"id": "4772eb02", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Write to files\n", | |
"with open(\"top50_icd10_train_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(train_hadm_50_icd10_set), fd, indent=4)\n", | |
"with open(\"top50_icd10_val_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(val_hadm_50_icd10_set), fd, indent=4)\n", | |
"with open(\"top50_icd10_test_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(test_hadm_50_icd10_set), fd, indent=4)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "afd9ffd7", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"id": "f727e450", | |
"metadata": {}, | |
"source": [ | |
"### 4. MIMIC-IV full inclusive" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 28, | |
"id": "43a35456", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"3369 ICD-9 diag codes appear in all splits\n", | |
"982 ICD-9 proc codes appear in all splits\n" | |
] | |
} | |
], | |
"source": [ | |
"# Common ICD-9 codes\n", | |
"diag_icd9_inc_set = set(train_diag_icd9_counter) & set(val_diag_icd9_counter) & set(test_diag_icd9_counter)\n", | |
"print(f\"{len(diag_icd9_inc_set)} ICD-9 diag codes appear in all splits\")\n", | |
"proc_icd9_inc_set = set(train_proc_icd9_counter) & set(val_proc_icd9_counter) & set(test_proc_icd9_counter)\n", | |
"print(f\"{len(proc_icd9_inc_set)} ICD-9 proc codes appear in all splits\")\n", | |
"icd9_inc_set = diag_icd9_inc_set | proc_icd9_inc_set" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 29, | |
"id": "85995ada", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_diag_icd9_inc = df_diag_icd9[df_diag_icd9.icd_code.map(lambda x: reformat_icd9_code(x, True) in icd9_inc_set)]\n", | |
"df_proc_icd9_inc = df_proc_icd9[df_proc_icd9.icd_code.map(lambda x: reformat_icd9_code(x, False) in icd9_inc_set)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 30, | |
"id": "03b399bc", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"train_hadm_inc_icd9_set = train_hadm_icd9_set & (set(df_diag_icd9_inc.hadm_id) | set(df_proc_icd9_inc.hadm_id))\n", | |
"val_hadm_inc_icd9_set = val_hadm_icd9_set & (set(df_diag_icd9_inc.hadm_id) | set(df_proc_icd9_inc.hadm_id))\n", | |
"test_hadm_inc_icd9_set = test_hadm_icd9_set & (set(df_diag_icd9_inc.hadm_id) | set(df_proc_icd9_inc.hadm_id))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 31, | |
"id": "b45c47d6", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Full inclusive ICD-9 HADM_ID (SUBJECT_ID)\n", | |
"Train: 188397 (87865)\n", | |
"Val : 7108 ( 3256)\n", | |
"Test : 13699 ( 6510)\n" | |
] | |
} | |
], | |
"source": [ | |
"print(f\"Full inclusive ICD-9 HADM_ID (SUBJECT_ID)\")\n", | |
"print(f\"Train: {len(train_hadm_inc_icd9_set):6d} ({len(set(map(hadm_to_subject.get, train_hadm_inc_icd9_set))):5d})\")\n", | |
"print(f\"Val : {len(val_hadm_inc_icd9_set):6d} ({len(set(map(hadm_to_subject.get, val_hadm_inc_icd9_set))):5d})\")\n", | |
"print(f\"Test : {len(test_hadm_inc_icd9_set):6d} ({len(set(map(hadm_to_subject.get, test_hadm_inc_icd9_set))):5d})\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 32, | |
"id": "5bb7693d", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Write to files\n", | |
"with open(\"inc_icd9_train_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(train_hadm_inc_icd9_set), fd, indent=4)\n", | |
"with open(\"inc_icd9_val_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(val_hadm_inc_icd9_set), fd, indent=4)\n", | |
"with open(\"inc_icd9_test_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(test_hadm_inc_icd9_set), fd, indent=4)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "2c6b2db1", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 33, | |
"id": "da424f33", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"3643 ICD-10 diag codes appear in all splits\n", | |
"1228 ICD-10 proc codes appear in all splits\n" | |
] | |
} | |
], | |
"source": [ | |
"# Common ICD-10 codes\n", | |
"diag_icd10_inc_set = set(train_diag_icd10_counter) & set(val_diag_icd10_counter) & set(test_diag_icd10_counter)\n", | |
"print(f\"{len(diag_icd10_inc_set)} ICD-10 diag codes appear in all splits\")\n", | |
"proc_icd10_inc_set = set(train_proc_icd10_counter) & set(val_proc_icd10_counter) & set(test_proc_icd10_counter)\n", | |
"print(f\"{len(proc_icd10_inc_set)} ICD-10 proc codes appear in all splits\")\n", | |
"icd10_inc_set = diag_icd10_inc_set | proc_icd10_inc_set" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"id": "f6ac0bf9", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_diag_icd10_inc = df_diag_icd10[df_diag_icd10.icd_code.map(lambda x: x in icd10_inc_set)]\n", | |
"df_proc_icd10_inc = df_proc_icd10[df_proc_icd10.icd_code.map(lambda x: x in icd10_inc_set)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 35, | |
"id": "664e32db", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"train_hadm_inc_icd10_set = train_hadm_icd10_set & (set(df_diag_icd10_inc.hadm_id) | set(df_proc_icd10_inc.hadm_id))\n", | |
"val_hadm_inc_icd10_set = val_hadm_icd10_set & (set(df_diag_icd10_inc.hadm_id) | set(df_proc_icd10_inc.hadm_id))\n", | |
"test_hadm_inc_icd10_set = test_hadm_icd10_set & (set(df_diag_icd10_inc.hadm_id) | set(df_proc_icd10_inc.hadm_id))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 36, | |
"id": "7a45dc4c", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Full inclusive ICD-10 HADM_ID (SUBJECT_ID)\n", | |
"Train: 110304 (59005)\n", | |
"Val : 4015 ( 2187)\n", | |
"Test : 7846 ( 4375)\n" | |
] | |
} | |
], | |
"source": [ | |
"print(f\"Full inclusive ICD-10 HADM_ID (SUBJECT_ID)\")\n", | |
"print(f\"Train: {len(train_hadm_inc_icd10_set):6d} ({len(set(map(hadm_to_subject.get, train_hadm_inc_icd10_set))):5d})\")\n", | |
"print(f\"Val : {len(val_hadm_inc_icd10_set):6d} ({len(set(map(hadm_to_subject.get, val_hadm_inc_icd10_set))):5d})\")\n", | |
"print(f\"Test : {len(test_hadm_inc_icd10_set):6d} ({len(set(map(hadm_to_subject.get, test_hadm_inc_icd10_set))):5d})\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"id": "990ae6ad", | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Write to files\n", | |
"with open(\"inc_icd10_train_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(train_hadm_inc_icd10_set), fd, indent=4)\n", | |
"with open(\"inc_icd10_val_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(val_hadm_inc_icd10_set), fd, indent=4)\n", | |
"with open(\"inc_icd10_test_split.json\", \"w\") as fd:\n", | |
"    json.dump(sorted(test_hadm_inc_icd10_set), fd, indent=4)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"id": "b7c1baf6", | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3 (ipykernel)", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.9.7" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
Resulting dataset splits (hadm_id) are available at https://github.com/dalgu90/icd-coding-benchmark/tree/mimic4/datasets/mimic4/static
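
For anyone consuming the split files, here is a minimal sketch of reading them back. It assumes the files follow the naming used by the notebook (`{full,top50,inc}_{icd9,icd10}_{train,val,test}_split.json`, each a JSON array of hadm_id strings); the file names in the linked repository may differ. The helper `load_split` and the toy IDs and notes are hypothetical.

```python
import json
import os
import tempfile


def load_split(split_dir, dataset="full", version="icd9"):
    """Load the train/val/test hadm_id sets for one dataset variant."""
    splits = {}
    for part in ("train", "val", "test"):
        # Assumed naming scheme, taken from the notebook's json.dump calls
        path = os.path.join(split_dir, f"{dataset}_{version}_{part}_split.json")
        with open(path) as fd:
            splits[part] = set(json.load(fd))
    return splits


# Demo with made-up hadm_ids written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy = {"train": ["20000001", "20000002"], "val": ["20000003"], "test": ["20000004"]}
    for part, ids in toy.items():
        with open(os.path.join(tmp, f"full_icd9_{part}_split.json"), "w") as fd:
            json.dump(sorted(ids), fd, indent=4)
    loaded = load_split(tmp)

# Select the (hypothetical) notes whose admission falls in the training split
notes = [("20000001", "note A"), ("20000003", "note B"), ("29999999", "note C")]
train_notes = [text for hadm_id, text in notes if hadm_id in loaded["train"]]
print(train_notes)
```

Because the notebook reads `hadm_id` as a string, the IDs in the files are strings too; compare them against string IDs (not integers) when filtering.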