@dalgu90
Created February 7, 2023 17:01
MIMIC-IV dataset split for automatic ICD coding
{
"cells": [
{
"cell_type": "markdown",
"id": "726983da",
"metadata": {},
"source": [
"## MIMIC-IV dataset split for AnEMIC"
]
},
{
"cell_type": "markdown",
"id": "1caf7499",
"metadata": {},
"source": [
"This notebook generates the split (train/val/test) of three ICD coding dataset from MIMIC-IV.\n",
"1. **MIMIC-IV full**, which contains all admissions with discharge summaries and uses all the ICD codes as label sets.\n",
"2. **MIMIC-IV top-50**, which uses the 50 most frequent ICD codes as the label set and contains admissions that have at least one top-50 ICD code.\n",
"3. **MIMIC-IV full inclusive**, which is derived from MIMIC-IV full and uses ICD codes that are included in all the train, validation, and test splits of MIMIC-IV full.\n",
"\n",
"The (presumed) steps for generating the split of the CAML dataset (MIMIC-III) are as follows:\n",
"1. Split the set of all SUBJECT_IDs into train:val:test=9:1/3:2/3\n",
"2. Get HADM_ID splits from the SUBJECT_ID splits from the above step -> MIMIC-III full\n",
"3. From the step 2, select HADM_IDs that has at least one top-50 ICD code\n",
"4. Select HADM_IDs randomly to match the number of instances in Shi et al. (8066:1573:1729) -> MIMIC-III top-50\n",
"\n",
"We follow very similar steps as above. But here, since ICD-9 and ICD-10 codes are difficult to convert to one another, we generate two different versions for each dataset: one containing admissions with ICD-9 codes, and another with admissions with ICD-10 codes."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e192430f",
"metadata": {},
"outputs": [],
"source": [
"import collections\n",
"import json\n",
"import os\n",
"import random\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"id": "f0c178dd",
"metadata": {},
"source": [
"### 1. Loading MIMIC-IV"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "33080ef6",
"metadata": {},
"outputs": [],
"source": [
"mimic4_root = '../mimic4/mimic-iv-2.2/'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "ad7c072f",
"metadata": {},
"outputs": [],
"source": [
"# Admission table (admissions)\n",
"df_admission = pd.read_csv(os.path.join(mimic4_root, 'hosp', 'admissions.csv.gz'),\n",
" dtype={'subject_id': 'string', 'hadm_id': 'string'})\n",
"\n",
"# Diagnosis & procedure\n",
"df_diag = pd.read_csv(os.path.join(mimic4_root, 'hosp', 'diagnoses_icd.csv.gz'),\n",
" dtype={\"icd_code\": \"string\", \"subject_id\": \"string\", \"hadm_id\": \"string\"})\n",
"df_proc = pd.read_csv(os.path.join(mimic4_root, 'hosp', 'procedures_icd.csv.gz'),\n",
" dtype={\"icd_code\": \"string\", \"subject_id\": \"string\", \"hadm_id\": \"string\"})\n",
"\n",
"# Discharge summary\n",
"df_disch = pd.read_csv(os.path.join(mimic4_root, 'note', 'discharge.csv.gz'),\n",
" dtype={'note_id': 'string', 'subject_id': 'string', 'hadm_id': 'string', 'text': 'string'})"
]
},
{
"cell_type": "markdown",
"id": "dfde76bb",
"metadata": {},
"source": [
"### 2. MIMIC-IV full"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9de78562",
"metadata": {},
"outputs": [],
"source": [
"seed_num = 42\n",
"random.seed(seed_num)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a414eaa0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"HADM_IDs with discharge summary: 331794\n",
"\n",
"HADM_IDs with ICD-9 diagnoses : 276803\n",
"HADM_IDs with ICD-10 diagnoses : 154059\n",
"HADM_IDs with ICD-9 procedures : 155891\n",
"HADM_IDs with ICD-10 procedures: 73555\n",
"\n",
"HADM_IDs with ICD-9 only diag & proc : 276800\n",
"HADM_IDs with ICD-10 only diag & proc: 154066\n",
"\n",
"HADM_IDs with ICD-9 and discharge summary : 209352\n",
"HADM_IDs with ICD-10 and discharge summary: 122310\n"
]
}
],
"source": [
"# Get the list of HADM_ID / SUBJECT_ID that has (diagnoses or procedures) and discharge summary\n",
"# Discharge summary\n",
"hadm_disch_set = set(df_disch['hadm_id'].unique())\n",
"print(f'HADM_IDs with discharge summary: {len(hadm_disch_set)}')\n",
"\n",
"# HADM_IDs with ICD-9/10 diagnoses / procedures\n",
"# Note that some admissions have both ICD-9 and ICD-10\n",
"hadm_diag_icd9_set_temp = set(df_diag[df_diag.icd_version == 9].hadm_id.unique())\n",
"hadm_diag_icd10_set_temp = set(df_diag[df_diag.icd_version == 10].hadm_id.unique())\n",
"hadm_proc_icd9_set_temp = set(df_proc[df_proc.icd_version == 9].hadm_id.unique())\n",
"hadm_proc_icd10_set_temp = set(df_proc[df_proc.icd_version == 10].hadm_id.unique())\n",
"\n",
"print()\n",
"print(f'HADM_IDs with ICD-9 diagnoses : {len(hadm_diag_icd9_set_temp)}')\n",
"print(f'HADM_IDs with ICD-10 diagnoses : {len(hadm_diag_icd10_set_temp)}')\n",
"print(f'HADM_IDs with ICD-9 procedures : {len(hadm_proc_icd9_set_temp)}')\n",
"print(f'HADM_IDs with ICD-10 procedures: {len(hadm_proc_icd10_set_temp)}')\n",
"\n",
"hadm_icd9_set = (hadm_diag_icd9_set_temp | hadm_proc_icd9_set_temp) - (hadm_diag_icd10_set_temp | hadm_proc_icd10_set_temp)\n",
"hadm_icd10_set = (hadm_diag_icd10_set_temp | hadm_proc_icd10_set_temp) - (hadm_diag_icd9_set_temp | hadm_proc_icd9_set_temp)\n",
"\n",
"print()\n",
"print(f'HADM_IDs with ICD-9 only diag & proc : {len(hadm_icd9_set)}')\n",
"print(f'HADM_IDs with ICD-10 only diag & proc: {len(hadm_icd10_set)}')\n",
"\n",
"hadm_icd9_set &= hadm_disch_set\n",
"hadm_icd10_set &= hadm_disch_set\n",
"\n",
"print()\n",
"print(f'HADM_IDs with ICD-9 and discharge summary : {len(hadm_icd9_set)}')\n",
"print(f'HADM_IDs with ICD-10 and discharge summary: {len(hadm_icd10_set)}')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b85adb97",
"metadata": {},
"outputs": [],
"source": [
"# HADM_ID -> SUBJECT_ID mapping\n",
"hadm_to_subject = {row.hadm_id: row.subject_id for _, row in df_admission.iterrows()}"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "791bb9e3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SUBJECT_IDs with ICD-9 and discharge summary : 97726\n",
"SUBJECT_IDs with ICD-10 and discharge summary: 65683\n"
]
}
],
"source": [
"# SUBJECT_ID -> HADM_ID list mapping\n",
"# Note that a patient have different ICD codes (ICD-9 in one admission, ICD-10 in another)\n",
"# We are keeping these mappings so we can find HADM_IDs with correct ICD version from SUBJECT_ID\n",
"subject_to_hadm_icd9 = collections.defaultdict(list)\n",
"for hadm in hadm_icd9_set:\n",
" subject_to_hadm_icd9[hadm_to_subject[hadm]].append(hadm)\n",
" \n",
"subject_to_hadm_icd10 = collections.defaultdict(list)\n",
"for hadm in hadm_icd10_set:\n",
" subject_to_hadm_icd10[hadm_to_subject[hadm]].append(hadm)\n",
"\n",
"subject_icd9_list = sorted(subject_to_hadm_icd9.keys())\n",
"subject_icd10_list = sorted(subject_to_hadm_icd10.keys())\n",
"\n",
"print(f'SUBJECT_IDs with ICD-9 and discharge summary : {len(subject_icd9_list)}')\n",
"print(f'SUBJECT_IDs with ICD-10 and discharge summary: {len(subject_icd10_list)}')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3ee761da",
"metadata": {},
"outputs": [],
"source": [
"# Split at the SUBJECT_ID level and then select HADM_IDs according to the SUBJECT_ID splits\n",
"random.shuffle(subject_icd9_list)\n",
"N_train_subject_icd9 = int(len(subject_icd9_list) * 0.9)\n",
"N_val_subject_icd9 = int(len(subject_icd9_list) * 0.1 / 3)\n",
"\n",
"train_subject_icd9_set = set(subject_icd9_list[:N_train_subject_icd9])\n",
"val_subject_icd9_set = set(subject_icd9_list[N_train_subject_icd9:N_train_subject_icd9+N_val_subject_icd9])\n",
"test_subject_icd9_set = set(subject_icd9_list[N_train_subject_icd9+N_val_subject_icd9:])\n",
"\n",
"train_hadm_icd9_set = set([hadm_id for subject_id in train_subject_icd9_set for hadm_id in subject_to_hadm_icd9[subject_id]])\n",
"val_hadm_icd9_set = set([hadm_id for subject_id in val_subject_icd9_set for hadm_id in subject_to_hadm_icd9[subject_id]])\n",
"test_hadm_icd9_set = set([hadm_id for subject_id in test_subject_icd9_set for hadm_id in subject_to_hadm_icd9[subject_id]])\n",
"\n",
"random.shuffle(subject_icd10_list)\n",
"N_train_subject_icd10 = int(len(subject_icd10_list) * 0.9)\n",
"N_val_subject_icd10 = int(len(subject_icd10_list) * 0.1 / 3)\n",
"\n",
"train_subject_icd10_set = set(subject_icd10_list[:N_train_subject_icd10])\n",
"val_subject_icd10_set = set(subject_icd10_list[N_train_subject_icd10:N_train_subject_icd10+N_val_subject_icd10])\n",
"test_subject_icd10_set = set(subject_icd10_list[N_train_subject_icd10+N_val_subject_icd10:])\n",
"\n",
"train_hadm_icd10_set = set([hadm_id for subject_id in train_subject_icd10_set for hadm_id in subject_to_hadm_icd10[subject_id]])\n",
"val_hadm_icd10_set = set([hadm_id for subject_id in val_subject_icd10_set for hadm_id in subject_to_hadm_icd10[subject_id]])\n",
"test_hadm_icd10_set = set([hadm_id for subject_id in test_subject_icd10_set for hadm_id in subject_to_hadm_icd10[subject_id]])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "debd66d3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Full ICD-9 HADM_ID(SUBJECT_ID): 209352( 97726)\n",
"Train: 188533 ( 87953)\n",
"Val : 7110 ( 3257)\n",
"Test : 13709 ( 6516)\n",
"\n",
"Full ICD-10 HADM_ID(SUBJECT_ID): 122310( 65683)\n",
"Train: 110442 ( 59114)\n",
"Val : 4017 ( 2189)\n",
"Test : 7851 ( 4380)\n"
]
}
],
"source": [
"print(f\"Full ICD-9 HADM_ID(SUBJECT_ID): {len(hadm_icd9_set):6d}({len(subject_icd9_list):6d})\")\n",
"print(f\"Train: {len(train_hadm_icd9_set):6d} ({len(train_subject_icd9_set):6d})\")\n",
"print(f\"Val : {len(val_hadm_icd9_set):6d} ({len(val_subject_icd9_set):6d})\")\n",
"print(f\"Test : {len(test_hadm_icd9_set):6d} ({len(test_subject_icd9_set):6d})\")\n",
"print()\n",
"print(f\"Full ICD-10 HADM_ID(SUBJECT_ID): {len(hadm_icd10_set):6d}({len(subject_icd10_list):6d})\")\n",
"print(f\"Train: {len(train_hadm_icd10_set):6d} ({len(train_subject_icd10_set):6d})\")\n",
"print(f\"Val : {len(val_hadm_icd10_set):6d} ({len(val_subject_icd10_set):6d})\")\n",
"print(f\"Test : {len(test_hadm_icd10_set):6d} ({len(test_subject_icd10_set):6d})\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "8334d03d",
"metadata": {},
"outputs": [],
"source": [
"# Sanity check - no overlap between any pair of HADM_ID split\n",
"sets = [train_hadm_icd9_set, val_hadm_icd9_set, test_hadm_icd9_set,\n",
" train_hadm_icd10_set, val_hadm_icd10_set, test_hadm_icd10_set]\n",
"for i in range(6):\n",
" for j in range(i+1, 6):\n",
" assert(not(sets[i] & sets[j]))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "38c3d334",
"metadata": {},
"outputs": [],
"source": [
"# Write to files\n",
"with open(\"full_icd9_train_split.json\", \"w\") as fd:\n",
" json.dump(sorted(train_hadm_icd9_set), fd, indent=4)\n",
"with open(\"full_icd9_val_split.json\", \"w\") as fd:\n",
" json.dump(sorted(val_hadm_icd9_set), fd, indent=4)\n",
"with open(\"full_icd9_test_split.json\", \"w\") as fd:\n",
" json.dump(sorted(test_hadm_icd9_set), fd, indent=4)\n",
"\n",
"with open(\"full_icd10_train_split.json\", \"w\") as fd:\n",
" json.dump(sorted(train_hadm_icd10_set), fd, indent=4)\n",
"with open(\"full_icd10_val_split.json\", \"w\") as fd:\n",
" json.dump(sorted(val_hadm_icd10_set), fd, indent=4)\n",
"with open(\"full_icd10_test_split.json\", \"w\") as fd:\n",
" json.dump(sorted(test_hadm_icd10_set), fd, indent=4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1034056a",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "0adbfc7b",
"metadata": {},
"source": [
"### 3. MIMIC-IV top-50"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "bd768b4a",
"metadata": {},
"outputs": [],
"source": [
"def reformat_icd9_code(icd_code, is_diagnosis_code):\n",
" code = \"\".join(icd_code.split(\".\"))\n",
" if is_diagnosis_code:\n",
" if code.startswith(\"E\"):\n",
" if len(code) > 4:\n",
" code = code[:4] + \".\" + code[4:]\n",
" else:\n",
" if len(code) > 3:\n",
" code = code[:3] + \".\" + code[3:]\n",
" else:\n",
" if len(code) > 2:\n",
" code = code[:2] + \".\" + code[2:]\n",
" return code"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "443c4d55",
"metadata": {},
"outputs": [],
"source": [
"df_diag_icd9 = df_diag[df_diag.icd_version == 9]\n",
"df_diag_icd10 = df_diag[df_diag.icd_version == 10]\n",
"df_proc_icd9 = df_proc[df_proc.icd_version == 9]\n",
"df_proc_icd10 = df_proc[df_proc.icd_version == 10]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "e97dc189",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train ICD-9 diag: 8694 unique, 2198671 occured\n",
"Val ICD-9 diag : 3963 unique, 83738 occured\n",
"Test ICD-9 diag : 4833 unique, 159135 occured\n",
"Train ICD-9 proc: 2451 unique, 336154 occured\n",
"Val ICD-9 diag : 1152 unique, 12929 occured\n",
"Test ICD-9 diag : 1431 unique, 24822 occured\n"
]
}
],
"source": [
"train_diag_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, True) for icd_code in\n",
" df_diag_icd9[df_diag_icd9.hadm_id.map(lambda x: x in train_hadm_icd9_set)].icd_code])\n",
"val_diag_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, True) for icd_code in\n",
" df_diag_icd9[df_diag_icd9.hadm_id.map(lambda x: x in val_hadm_icd9_set)].icd_code])\n",
"test_diag_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, True) for icd_code in\n",
" df_diag_icd9[df_diag_icd9.hadm_id.map(lambda x: x in test_hadm_icd9_set)].icd_code])\n",
"print(f\"Train ICD-9 diag: {len(train_diag_icd9_counter):5d} unique, {sum(train_diag_icd9_counter.values()):7d} occured\")\n",
"print(f\"Val ICD-9 diag : {len(val_diag_icd9_counter):5d} unique, {sum(val_diag_icd9_counter.values()):7d} occured\")\n",
"print(f\"Test ICD-9 diag : {len(test_diag_icd9_counter):5d} unique, {sum(test_diag_icd9_counter.values()):7d} occured\")\n",
"\n",
"train_proc_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, False) for icd_code in\n",
" df_proc_icd9[df_proc_icd9.hadm_id.map(lambda x: x in train_hadm_icd9_set)].icd_code])\n",
"val_proc_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, False) for icd_code in\n",
" df_proc_icd9[df_proc_icd9.hadm_id.map(lambda x: x in val_hadm_icd9_set)].icd_code])\n",
"test_proc_icd9_counter = collections.Counter([reformat_icd9_code(icd_code, False) for icd_code in\n",
" df_proc_icd9[df_proc_icd9.hadm_id.map(lambda x: x in test_hadm_icd9_set)].icd_code])\n",
"print(f\"Train ICD-9 proc: {len(train_proc_icd9_counter):5d} unique, {sum(train_proc_icd9_counter.values()):7d} occured\")\n",
"print(f\"Val ICD-9 diag : {len(val_proc_icd9_counter):5d} unique, {sum(val_proc_icd9_counter.values()):7d} occured\")\n",
"print(f\"Test ICD-9 diag : {len(test_proc_icd9_counter):5d} unique, {sum(test_proc_icd9_counter.values()):7d} occured\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8b3db7de",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total ICD-9 diag: 8837\n",
"Total ICD-9 proc: 2494\n",
"Overlap : 0\n"
]
}
],
"source": [
"# Sanity check - Is there any ICD-9 code that is both diagnosis (ICD-9-CM) and procedure (ICD-9-PCS)\n",
"diag_icd9_set = set(train_diag_icd9_counter.keys()) | set(val_diag_icd9_counter.keys()) | set(test_diag_icd9_counter.keys())\n",
"proc_icd9_set = set(train_proc_icd9_counter.keys()) | set(val_proc_icd9_counter.keys()) | set(test_proc_icd9_counter.keys())\n",
"print(f'Total ICD-9 diag: {len(diag_icd9_set)}')\n",
"print(f'Total ICD-9 proc: {len(proc_icd9_set)}')\n",
"print(f'Overlap : {len(diag_icd9_set & proc_icd9_set)}')"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "2e6bfa9a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('401.9', 81673), ('272.4', 57822), ('530.81', 41944), ('250.00', 34407), ('428.0', 32424), ('427.31', 31915), ('414.01', 30278), ('V15.82', 28601), ('311', 28248), ('584.9', 26961), ('244.9', 24117), ('285.9', 21802), ('305.1', 21410), ('403.90', 21385), ('V58.61', 19686), ('300.00', 18204), ('599.0', 18107), ('V58.67', 16854), ('585.9', 15942), ('493.90', 15775), ('272.0', 15739), ('327.23', 15277), ('38.93', 13618), ('V45.82', 13111), ('412', 12835), ('496', 12432), ('V58.66', 12353), ('278.00', 12141), ('276.1', 11840), ('V45.81', 11442), ('733.00', 10755), ('486', 10169), ('V12.51', 10131), ('338.29', 10063), ('V49.86', 9907), ('38.97', 9807), ('274.9', 9704), ('414.00', 9552), ('285.1', 9442), ('276.51', 9214), ('V12.54', 9169), ('276.2', 8819), ('600.00', 8777), ('564.00', 8743), ('357.2', 8406), ('585.6', 8256), ('287.5', 8145), ('427.89', 8138), ('96.6', 7696), ('428.32', 7487)]\n"
]
}
],
"source": [
"icd9_counter = train_diag_icd9_counter + val_diag_icd9_counter + test_diag_icd9_counter \\\n",
" + train_proc_icd9_counter + val_proc_icd9_counter + test_proc_icd9_counter\n",
"print(icd9_counter.most_common(50))\n",
"icd9_top50 = set([icd_code for icd_code, _ in icd9_counter.most_common(50)])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "58c1380d",
"metadata": {},
"outputs": [],
"source": [
"# Top-k diag/proc occurrences\n",
"df_diag_icd9_top50 = df_diag_icd9[df_diag_icd9.icd_code.map(lambda x: reformat_icd9_code(x, True) in icd9_top50)]\n",
"df_proc_icd9_top50 = df_proc_icd9[df_proc_icd9.icd_code.map(lambda x: reformat_icd9_code(x, False) in icd9_top50)]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "7aca660b",
"metadata": {},
"outputs": [],
"source": [
"# Admissions that has top-k ICD-9 codes\n",
"train_hadm_50_icd9_set = train_hadm_icd9_set & (set(df_diag_icd9_top50.hadm_id) | set(df_proc_icd9_top50.hadm_id))\n",
"val_hadm_50_icd9_set = val_hadm_icd9_set & (set(df_diag_icd9_top50.hadm_id) | set(df_proc_icd9_top50.hadm_id))\n",
"test_hadm_50_icd9_set = test_hadm_icd9_set & (set(df_diag_icd9_top50.hadm_id) | set(df_proc_icd9_top50.hadm_id))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "40c34e1b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top-50 ICD-9 HADM_ID (SUBJECT_ID)\n",
"Train: 170664 (77890)\n",
"Val : 6406 ( 2871)\n",
"Test : 12405 ( 5756)\n"
]
}
],
"source": [
"print(f\"Top-50 ICD-9 HADM_ID (SUBJECT_ID)\")\n",
"print(f\"Train: {len(train_hadm_50_icd9_set):6d} ({len(set(map(hadm_to_subject.get, train_hadm_50_icd9_set))):5d})\")\n",
"print(f\"Val : {len(val_hadm_50_icd9_set):6d} ({len(set(map(hadm_to_subject.get, val_hadm_50_icd9_set))):5d})\")\n",
"print(f\"Test : {len(test_hadm_50_icd9_set):6d} ({len(set(map(hadm_to_subject.get, test_hadm_50_icd9_set))):5d})\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "08694017",
"metadata": {},
"outputs": [],
"source": [
"# Write to files\n",
"with open(\"top50_icd9_train_split.json\", \"w\") as fd:\n",
" json.dump(sorted(train_hadm_50_icd9_set), fd, indent=4)\n",
"with open(\"top50_icd9_val_split.json\", \"w\") as fd:\n",
" json.dump(sorted(val_hadm_50_icd9_set), fd, indent=4)\n",
"with open(\"top50_icd9_test_split.json\", \"w\") as fd:\n",
" json.dump(sorted(test_hadm_50_icd9_set), fd, indent=4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d7c74d0",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 21,
"id": "1ec5d8a8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train ICD-10 diag: 15681 unique, 1595623 occured\n",
"Val ICD-10 diag : 4748 unique, 58397 occured\n",
"Test ICD-10 diag : 6387 unique, 111289 occured\n",
"Train ICD-10 proc: 9549 unique, 188681 occured\n",
"Val ICD-10 proc : 1990 unique, 7119 occured\n",
"Test ICD-10 proc : 2772 unique, 13229 occured\n"
]
}
],
"source": [
"train_diag_icd10_counter = collections.Counter(df_diag_icd10[df_diag_icd10.hadm_id.map(lambda x: x in train_hadm_icd10_set)].icd_code)\n",
"val_diag_icd10_counter = collections.Counter(df_diag_icd10[df_diag_icd10.hadm_id.map(lambda x: x in val_hadm_icd10_set)].icd_code)\n",
"test_diag_icd10_counter = collections.Counter(df_diag_icd10[df_diag_icd10.hadm_id.map(lambda x: x in test_hadm_icd10_set)].icd_code)\n",
"print(f\"Train ICD-10 diag: {len(train_diag_icd10_counter):5d} unique, {sum(train_diag_icd10_counter.values()):7d} occured\")\n",
"print(f\"Val ICD-10 diag : {len(val_diag_icd10_counter):5d} unique, {sum(val_diag_icd10_counter.values()):7d} occured\")\n",
"print(f\"Test ICD-10 diag : {len(test_diag_icd10_counter):5d} unique, {sum(test_diag_icd10_counter.values()):7d} occured\")\n",
"\n",
"train_proc_icd10_counter = collections.Counter(df_proc_icd10[df_proc_icd10.hadm_id.map(lambda x: x in train_hadm_icd10_set)].icd_code)\n",
"val_proc_icd10_counter = collections.Counter(df_proc_icd10[df_proc_icd10.hadm_id.map(lambda x: x in val_hadm_icd10_set)].icd_code)\n",
"test_proc_icd10_counter = collections.Counter(df_proc_icd10[df_proc_icd10.hadm_id.map(lambda x: x in test_hadm_icd10_set)].icd_code)\n",
"print(f\"Train ICD-10 proc: {len(train_proc_icd10_counter):5d} unique, {sum(train_proc_icd10_counter.values()):7d} occured\")\n",
"print(f\"Val ICD-10 proc : {len(val_proc_icd10_counter):5d} unique, {sum(val_proc_icd10_counter.values()):7d} occured\")\n",
"print(f\"Test ICD-10 proc : {len(test_proc_icd10_counter):5d} unique, {sum(test_proc_icd10_counter.values()):7d} occured\")"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "449ad4f5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total ICD-10 diag: 16154\n",
"Total ICD-10 proc: 9942\n",
"Overlap : 0\n"
]
}
],
"source": [
"# Sanity check - Is there any ICD-10 code that is both diagnosis (ICD-10-CM) and procedure (ICD-10-PCS)\n",
"diag_icd10_set = set(train_diag_icd10_counter.keys()) | set(val_diag_icd10_counter.keys()) | set(test_diag_icd10_counter.keys())\n",
"proc_icd10_set = set(train_proc_icd10_counter.keys()) | set(val_proc_icd10_counter.keys()) | set(test_proc_icd10_counter.keys())\n",
"print(f'Total ICD-10 diag: {len(diag_icd10_set)}')\n",
"print(f'Total ICD-10 proc: {len(proc_icd10_set)}')\n",
"print(f'Overlap : {len(diag_icd10_set & proc_icd10_set)}')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "fed9dbb5",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('E785', 44043), ('I10', 43573), ('Z87891', 36296), ('K219', 30802), ('F329', 23232), ('I2510', 22606), ('N179', 19706), ('F419', 19156), ('Z7901', 15319), ('Z794', 15276), ('E039', 15253), ('E119', 13571), ('G4733', 12661), ('D649', 12467), ('E669', 12146), ('I4891', 12033), ('F17210', 11619), ('Y929', 11550), ('Z66', 10743), ('J45909', 10612), ('Z7902', 10515), ('J449', 10269), ('D62', 10132), ('02HV33Z', 10017), ('N390', 9658), ('I129', 9432), ('E1122', 9204), ('E871', 8647), ('I252', 8576), ('N189', 8566), ('E872', 8162), ('Z8673', 7910), ('Z955', 7759), ('Z86718', 7596), ('G8929', 7534), ('I110', 7436), ('K5900', 7098), ('N400', 6816), ('N183', 6804), ('I480', 6699), ('I130', 6516), ('G4700', 6450), ('D696', 6439), ('Z951', 6273), ('M109', 6223), ('Y92239', 5982), ('J9601', 5896), ('J189', 5791), ('Z23', 5713), ('Y92230', 5653)]\n"
]
}
],
"source": [
"icd10_counter = train_diag_icd10_counter + val_diag_icd10_counter + test_diag_icd10_counter \\\n",
" + train_proc_icd10_counter + val_proc_icd10_counter + test_proc_icd10_counter\n",
"print(icd10_counter.most_common(50))\n",
"icd10_top50 = set([icd_code for icd_code, _ in icd10_counter.most_common(50)])"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "1ca82d05",
"metadata": {},
"outputs": [],
"source": [
"df_diag_icd10_top50 = df_diag_icd10[df_diag_icd10.icd_code.map(lambda x: x in icd10_top50)]\n",
"df_proc_icd10_top50 = df_proc_icd10[df_proc_icd10.icd_code.map(lambda x: x in icd10_top50)]"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "e90941b7",
"metadata": {},
"outputs": [],
"source": [
"train_hadm_50_icd10_set = train_hadm_icd10_set & (set(df_diag_icd10_top50.hadm_id) | set(df_proc_icd10_top50.hadm_id))\n",
"val_hadm_50_icd10_set = val_hadm_icd10_set & (set(df_diag_icd10_top50.hadm_id) | set(df_proc_icd10_top50.hadm_id))\n",
"test_hadm_50_icd10_set = test_hadm_icd10_set & (set(df_diag_icd10_top50.hadm_id) | set(df_proc_icd10_top50.hadm_id))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "15ae411d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top-50 ICD-10 HADM_ID (SUBJECT_ID)\n",
"Train: 104077 (55149)\n",
"Val : 3805 ( 2048)\n",
"Test : 7368 ( 4068)\n"
]
}
],
"source": [
"print(f\"Top-50 ICD-10 HADM_ID (SUBJECT_ID)\")\n",
"print(f\"Train: {len(train_hadm_50_icd10_set):6d} ({len(set(map(hadm_to_subject.get, train_hadm_50_icd10_set))):5d})\")\n",
"print(f\"Val : {len(val_hadm_50_icd10_set):6d} ({len(set(map(hadm_to_subject.get, val_hadm_50_icd10_set))):5d})\")\n",
"print(f\"Test : {len(test_hadm_50_icd10_set):6d} ({len(set(map(hadm_to_subject.get, test_hadm_50_icd10_set))):5d})\")"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "4772eb02",
"metadata": {},
"outputs": [],
"source": [
"# Write to files\n",
"with open(\"top50_icd10_train_split.json\", \"w\") as fd:\n",
" json.dump(sorted(train_hadm_50_icd10_set), fd, indent=4)\n",
"with open(\"top50_icd10_val_split.json\", \"w\") as fd:\n",
" json.dump(sorted(val_hadm_50_icd10_set), fd, indent=4)\n",
"with open(\"top50_icd10_test_split.json\", \"w\") as fd:\n",
" json.dump(sorted(test_hadm_50_icd10_set), fd, indent=4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afd9ffd7",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "f727e450",
"metadata": {},
"source": [
"### 4. MIMIC-IV full inclusive"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "43a35456",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3369 ICD-9 diag code commonly appear in all splits\n",
"982 ICD-9 proc code commonly appear in all splits\n"
]
}
],
"source": [
"# Common ICD-9 codes\n",
"diag_icd9_inc_set = set(train_diag_icd9_counter) & set(val_diag_icd9_counter) & set(test_diag_icd9_counter)\n",
"print(f\"{len(diag_icd9_inc_set)} ICD-9 diag code commonly appear in all splits\")\n",
"proc_icd9_inc_set = set(train_proc_icd9_counter) & set(val_proc_icd9_counter) & set(test_proc_icd9_counter)\n",
"print(f\"{len(proc_icd9_inc_set)} ICD-9 proc code commonly appear in all splits\")\n",
"icd9_inc_set = diag_icd9_inc_set | proc_icd9_inc_set"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "85995ada",
"metadata": {},
"outputs": [],
"source": [
"df_diag_icd9_inc = df_diag_icd9[df_diag_icd9.icd_code.map(lambda x: reformat_icd9_code(x, True) in icd9_inc_set)]\n",
"df_proc_icd9_inc = df_proc_icd9[df_proc_icd9.icd_code.map(lambda x: reformat_icd9_code(x, False) in icd9_inc_set)]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "03b399bc",
"metadata": {},
"outputs": [],
"source": [
"train_hadm_inc_icd9_set = train_hadm_icd9_set & (set(df_diag_icd9_inc.hadm_id) | set(df_proc_icd9_inc.hadm_id))\n",
"val_hadm_inc_icd9_set = val_hadm_icd9_set & (set(df_diag_icd9_inc.hadm_id) | set(df_proc_icd9_inc.hadm_id))\n",
"test_hadm_inc_icd9_set = test_hadm_icd9_set & (set(df_diag_icd9_inc.hadm_id) | set(df_proc_icd9_inc.hadm_id))"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "b45c47d6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Full inclusive ICD-9 HADM_ID (SUBJECT_ID)\n",
"Train: 188397 (87865)\n",
"Val : 7108 ( 3256)\n",
"Test : 13699 ( 6510)\n"
]
}
],
"source": [
"print(f\"Full inclusive ICD-9 HADM_ID (SUBJECT_ID)\")\n",
"print(f\"Train: {len(train_hadm_inc_icd9_set):6d} ({len(set(map(hadm_to_subject.get, train_hadm_inc_icd9_set))):5d})\")\n",
"print(f\"Val : {len(val_hadm_inc_icd9_set):6d} ({len(set(map(hadm_to_subject.get, val_hadm_inc_icd9_set))):5d})\")\n",
"print(f\"Test : {len(test_hadm_inc_icd9_set):6d} ({len(set(map(hadm_to_subject.get, test_hadm_inc_icd9_set))):5d})\")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "5bb7693d",
"metadata": {},
"outputs": [],
"source": [
"# Write to files\n",
"with open(\"inc_icd9_train_split.json\", \"w\") as fd:\n",
" json.dump(sorted(train_hadm_inc_icd9_set), fd, indent=4)\n",
"with open(\"inc_icd9_val_split.json\", \"w\") as fd:\n",
" json.dump(sorted(val_hadm_inc_icd9_set), fd, indent=4)\n",
"with open(\"inc_icd9_test_split.json\", \"w\") as fd:\n",
" json.dump(sorted(test_hadm_inc_icd9_set), fd, indent=4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c6b2db1",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 33,
"id": "da424f33",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3643 ICD-10 diag code commonly appear in all splits\n",
"1228 ICD-10 proc code commonly appear in all splits\n"
]
}
],
"source": [
"# Common ICD-10 codes\n",
"diag_icd10_inc_set = set(train_diag_icd10_counter) & set(val_diag_icd10_counter) & set(test_diag_icd10_counter)\n",
"print(f\"{len(diag_icd10_inc_set)} ICD-10 diag code commonly appear in all splits\")\n",
"proc_icd10_inc_set = set(train_proc_icd10_counter) & set(val_proc_icd10_counter) & set(test_proc_icd10_counter)\n",
"print(f\"{len(proc_icd10_inc_set)} ICD-10 proc code commonly appear in all splits\")\n",
"icd10_inc_set = diag_icd10_inc_set | proc_icd10_inc_set"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "f6ac0bf9",
"metadata": {},
"outputs": [],
"source": [
"df_diag_icd10_inc = df_diag_icd10[df_diag_icd10.icd_code.map(lambda x: x in icd10_inc_set)]\n",
"df_proc_icd10_inc = df_proc_icd10[df_proc_icd10.icd_code.map(lambda x: x in icd10_inc_set)]"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "664e32db",
"metadata": {},
"outputs": [],
"source": [
"train_hadm_inc_icd10_set = train_hadm_icd10_set & (set(df_diag_icd10_inc.hadm_id) | set(df_proc_icd10_inc.hadm_id))\n",
"val_hadm_inc_icd10_set = val_hadm_icd10_set & (set(df_diag_icd10_inc.hadm_id) | set(df_proc_icd10_inc.hadm_id))\n",
"test_hadm_inc_icd10_set = test_hadm_icd10_set & (set(df_diag_icd10_inc.hadm_id) | set(df_proc_icd10_inc.hadm_id))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "7a45dc4c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Full inclusive ICD-9 HADM_ID (SUBJECT_ID)\n",
"Train: 110304 (59005)\n",
"Val : 4015 ( 2187)\n",
"Test : 7846 ( 4375)\n"
]
}
],
"source": [
"print(f\"Full inclusive ICD-9 HADM_ID (SUBJECT_ID)\")\n",
"print(f\"Train: {len(train_hadm_inc_icd10_set):6d} ({len(set(map(hadm_to_subject.get, train_hadm_inc_icd10_set))):5d})\")\n",
"print(f\"Val : {len(val_hadm_inc_icd10_set):6d} ({len(set(map(hadm_to_subject.get, val_hadm_inc_icd10_set))):5d})\")\n",
"print(f\"Test : {len(test_hadm_inc_icd10_set):6d} ({len(set(map(hadm_to_subject.get, test_hadm_inc_icd10_set))):5d})\")"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "990ae6ad",
"metadata": {},
"outputs": [],
"source": [
"# Write to files\n",
"with open(\"inc_icd10_train_split.json\", \"w\") as fd:\n",
" json.dump(sorted(train_hadm_inc_icd10_set), fd, indent=4)\n",
"with open(\"inc_icd10_val_split.json\", \"w\") as fd:\n",
" json.dump(sorted(val_hadm_inc_icd10_set), fd, indent=4)\n",
"with open(\"inc_icd10_test_split.json\", \"w\") as fd:\n",
" json.dump(sorted(test_hadm_inc_icd10_set), fd, indent=4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7c1baf6",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
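
The subject-level split in section 2 of the notebook can be tried without access to MIMIC-IV. Below is a minimal sketch on made-up IDs, assuming the same 9 : 1/3 : 2/3 ratio and seed 42. The function `split_by_subject` and the dictionary `toy_hadm_to_subject` are hypothetical names introduced for illustration and are not part of the notebook. The sketch also uses a private `random.Random` instance rather than the module-level seed, so it will not reproduce the notebook's exact shuffles.

```python
import collections
import random


def split_by_subject(hadm_to_subject, seed=42):
    """Split admissions so that all admissions of one subject share a split.

    hadm_to_subject maps an admission ID to its subject ID.
    Returns a dict with 'train', 'val', and 'test' sets of admission IDs.
    """
    # Invert the mapping: subject ID -> list of admission IDs.
    subject_to_hadm = collections.defaultdict(list)
    for hadm_id, subject_id in hadm_to_subject.items():
        subject_to_hadm[subject_id].append(hadm_id)

    # Sort before shuffling so the result depends only on the seed.
    subjects = sorted(subject_to_hadm)
    random.Random(seed).shuffle(subjects)

    # Same size arithmetic as the notebook: 90% train, 1/3 of the rest val.
    n_train = int(len(subjects) * 0.9)
    n_val = int(len(subjects) * 0.1 / 3)
    subject_groups = {
        "train": subjects[:n_train],
        "val": subjects[n_train:n_train + n_val],
        "test": subjects[n_train + n_val:],
    }

    # Expand each group of subjects back into admission IDs.
    return {
        name: {h for s in group for h in subject_to_hadm[s]}
        for name, group in subject_groups.items()
    }


# Made-up toy data: 60 subjects with two admissions each.
toy_hadm_to_subject = {f"h{i}": f"s{i // 2}" for i in range(120)}
toy_splits = split_by_subject(toy_hadm_to_subject)
print({name: len(ids) for name, ids in toy_splits.items()})
```

Because the shuffle happens at the subject level, no subject contributes admissions to more than one split, which is the property the notebook's sanity check asserts at the HADM_ID level.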