A simple pipeline for characterizing patient representations

Characterizing Patient Representations


Purpose

Unlike other fields which perform comprehensive diagnostics or characterization prior to analysis, the evaluation of patient representation learning methods has largely been limited to the context of a specific downstream use case and is usually performed as part of model interpretation. While it cannot solve all of the aforementioned challenges, data-driven characterization of patient representations, independent of model development, may provide invaluable and unexpected insight and is an important first step towards understanding if these methods can be used to help automate CP development. To this end, we sought to answer the following questions:

RSQ1: What combinations of data type and sampling window create the best patient representations and does performance differ by disease group?

RSQ2: How does data-driven characterization of patient representation impact the explainability of downstream tasks like clustering?

To address these questions, we examined the impact of sampling window (i.e., all data, all data before the index date, and all data available after the index date), and data type (i.e., only conditions, only drug exposures, only measurements, and all data types) on patient representations within the context of rare disease. The work presented in this paper makes the following contributions:

  1. We perform, to the best of our knowledge, the first comprehensive characterization of patient representations learned from EHR data. As an evaluation, the similarity between any two patients is used as a diagnostic biomarker and assessed by its effectiveness for identifying similar patients; this approach is generalizable to a wide range of patient representation learning algorithms and downstream learning tasks.
  2. We conduct extensive ablation studies to examine the impact of learning patient representations using different data types and sampling windows. We perform these studies on four different types of rare disease patients and medically complex controls. We evaluate the resulting representations using two tasks: (i) Leave-one-patient-out classification, which was assessed using a wide variety of performance metrics and plots; and (ii) K-means clustering, which included domain expert review of the most important features for each cluster.

The scripts in this Gist implement the work described above.
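As a rough sketch of the core idea implemented below (illustrative only; the toy data and variable names here are placeholders, not the ones used in the scripts), each patient is represented as a TF-IDF-weighted bag of clinical concepts and patient-patient similarity is computed with cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# toy corpus: one space-delimited string of concept codes per patient
patient_docs = ['c1 c2 c3', 'c1 c4', 'c2 c3 c5']
tfidf_matrix = TfidfVectorizer().fit_transform(patient_docs)

# cosine similarity between patient 0 and every patient; the highest-scoring
# neighbors (excluding patient 0 itself) are the "most similar" patients
scores = linear_kernel(tfidf_matrix[0:1], tfidf_matrix).flatten()
neighbors = [i for i in scores.argsort()[::-1] if i != 0]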



Running the Code

Computational Requirements

All preprocessing of data, construction of patient representations, and evaluation tasks were initially performed using Python 3.6.2 on a machine running macOS Sierra with 16 GB of RAM and a 2.7 GHz processor with 8 cores. The Python scikit-learn (v.0.22.1) library was used to create the BoW+TF-IDF representations and perform clustering, and ROC plots were visualized using Matplotlib (v.3.3.2).
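To recreate a comparable environment, an install along the following lines should work (only the scikit-learn and Matplotlib versions come from the text above; the remaining packages are inferred from the imports in these scripts and are left unpinned):

pip install scikit-learn==0.22.1 matplotlib==3.3.2 numpy pandas scipy google-cloud-bigquery oauth2client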

Script Order

Running these scripts assumes that you have already generated the data and saved it locally.

  • To run the scripts, place all of the scripts within the same directory.
  • Navigate to main.py and run all of the code chunks in the main() function (see the example below).
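For example, assuming Python 3 is on your path and that main.py calls main() when executed directly:

python main.py

Otherwise, run the code chunks in main() interactively (e.g., in an IDE or notebook).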


Contact: Please contact Tiffany Callahan (callahantiff [at] gmail.com) with any questions!

The published manuscript describing this work is available on medRxiv: https://www.medrxiv.org/content/10.1101/2022.07.26.22278073v1

#############################################################################################
# Purpose: Script creates a connection to OMOP de-identified data stored in Google BigQuery
# version 1.0.0
#############################################################################################
import datetime
import pickle
from oauth2client.client import GoogleCredentials
from google.cloud import bigquery
def GBQData(query, project):
'''
Function takes a string containing a user query to be run against the de-identified OMOP database and returns the
query results as a list. Each row of the list corresponds to a row of the results. Function is designed to
re-establish client connection for each query run against the database.
:param query: A string containing a user query.
:param project: A string specifying the project in Google Cloud Platform to use.
:return: A list of query results.
'''
print("Started running query at {} against the OMOP de-id database: ".format(str(datetime.datetime.now())))
#authenticate and set client to GCP project
client = bigquery.Client(project)
# run query
user_query = client.run_sync_query(query)
user_query.use_legacy_sql = False
user_query.run()
# convert query results to list
results = []
while len(results) == 0:
results = [x for x in user_query.fetch_data()]
if len(results) == 0: print('Query returned no results, trying again')
else: break
print("Finished processing query at: {}".format(str(datetime.datetime.now())))
print('Query returned: {} results'.format(len(results)))
return results
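# Example usage (illustrative only; the query string and 'my-gcp-project' are placeholders, not values
# used in this work):
# results = GBQData("SELECT person_id FROM person LIMIT 10", "my-gcp-project")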
##########################################################################################
# Purpose: Script to create and characterize patient representations.
# version 1.0.0
##########################################################################################
# import and load needed scripts
import patient_data
import matplotlib.pyplot as plt
import matplotlib
# matplotlib.style.use('ggplot')
import numpy as np
import pandas as pd
import pickle
from scipy import interp
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.metrics.pairwise import linear_kernel
import random
import re
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import collections
from operator import itemgetter
import operator
from sklearn.preprocessing import scale
# https://buhrmann.github.io/tfidf-analysis.html
# http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
def query_saver(query_file, file_name):
"""
Function takes two strings as input, the first is for the path/file storing the query and the second is for the
file path storing the location to write the results of the function to. Using the first string the function runs
the query referenced by the string and writes the results to the second string.
:param query_file: The path/file storing the query.
:param file_name: The location to write the results.
:return:
query_data: A list of lists holding the results of the running the query.
"""
query_data = patient_data.data_query(query_file)
outfile = open(file_name, 'wb')
pickle.dump(query_data, outfile)
outfile.close()
print "Query Ran:", str(file_name.split('/')[-1].split('.')[0])
return query_data
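# Example usage (illustrative; both file paths are placeholders):
# query_results = query_saver('Queries/example_query.txt', 'QueryData/example_query_results')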
def add_vectors(cond_labels, data1, data2, data3):
"""
Function takes a pandas DataFrame of patient labels and three pandas DataFrames of patient data, where each
DataFrame holds a different type of clinical data (e.g. conditions, medications, and labs). With these data,
the function creates two new lists: the first contains the patient identifiers from the label DataFrame and
the second contains, for each patient, a single string concatenating that patient's concepts across the three
clinical data types.
:param cond_labels: a pandas DataFrame containing a 'pat_id' column of patient identifiers
:param data1: a pandas DataFrame with 'pat_id' and 'pat_conds' columns for clinical data type 1
:param data2: a pandas DataFrame with 'pat_id' and 'pat_conds' columns for clinical data type 2
:param data3: a pandas DataFrame with 'pat_id' and 'pat_conds' columns for clinical data type 3
:return:
pat_ids: A list of patient identifiers.
add_vecs: A list of strings, one per patient, containing that patient's concepts from all clinical data types.
"""
pat_ids = []; add_vecs = []
for pat in cond_labels['pat_id']:
d1 = ""; d2 = ""; d3 = ""
if pat in list(data1['pat_id']): idx = data1.index[data1['pat_id'] == pat][0]; d1 = data1['pat_conds'][idx]
if pat in list(data2['pat_id']): idx = data2.index[data2['pat_id'] == pat][0]; d2 = data2['pat_conds'][idx]
if pat in list(data3['pat_id']): idx = data3.index[data3['pat_id'] == pat][0]; d3 = data3['pat_conds'][idx]
vec_sum = " ".join([d1, d2, d3]); pat_ids.append(pat); add_vecs.append(vec_sum)
# CHECK - file has data
if len(pat_ids) != len(add_vecs): raise ValueError('The number of patients and vectors is not equal')
else: return pat_ids, add_vecs
def db_maker(labels, cond, med, lab, file_path):
"""
Function takes a pandas DataFrame of patient labels, three query-result objects containing patient data
(conditions, medications, and labs), and a string that stores file path information. Using this information
the function generates, saves, and returns 4 pandas DataFrames.
:param labels: A pandas DataFrame containing patient ids and associated condition labels.
:param cond: Patient condition information; element 0 is a dict mapping patient ids to condition strings.
:param med: Patient medication information; element 0 is a dict mapping patient ids to medication strings.
:param lab: Patient lab information; element 0 is a dict mapping patient ids to lab strings.
:param file_path: A string containing information on where to write generated pandas datasets.
:return:
cond_db: A Pandas DataFrame of patient condition information.
med_db: A Pandas DataFrame of patient medication information.
lab_db: A Pandas DataFrame of patient lab information.
combo_db: A Pandas DataFrame of patient condition, medication, and lab information.
"""
cond_db = labels.merge(
pd.DataFrame(dict(pat_id=cond[0].keys(), pat_conds=cond[0].values())),
left_on='pat_id', right_on='pat_id', how='left').dropna(axis=0, how='any').reset_index(drop=True)
lb = cond_db.iloc[:, 0:3].copy()
# medications
med_db = lb.merge(
pd.DataFrame(dict(pat_id=med[0].keys(), pat_conds=med[0].values())),
left_on='pat_id', right_on='pat_id', how='left').dropna(axis=0, how='any').reset_index(drop=True)
# labs
lab_db = lb.merge(
pd.DataFrame(dict(pat_id=lab[0].keys(), pat_conds=lab[0].values())),
left_on='pat_id', right_on='pat_id', how='left').dropna(axis=0, how='any').reset_index(drop=True)
# combine vectors for conditions, medications, and labs
pat_comb = add_vectors(lb, cond_db, med_db, lab_db)
combo_db = lb.merge(
pd.DataFrame(dict(pat_id=pat_comb[0], pat_vecs=pat_comb[1])),
left_on='pat_id', right_on='pat_id', how='left').dropna(axis=0, how='any').reset_index(drop=True)
# save data - raw datasets
cond_db.to_pickle(str(file_path) + "_cond_db"); med_db.to_pickle(str(file_path) + "_med_db")
lab_db.to_pickle(str(file_path) + "_lab_db"); combo_db.to_pickle(str(file_path) + "_combo_db")
return cond_db, med_db, lab_db, combo_db
def patient_bow(corpus):
"""
Function takes a corpus of (patient identifier, concept string) pairs, where the concept string contains all
concepts of a given clinical feature for that patient. The function uses this information to create a count
matrix and a TF-IDF transformed matrix, and returns both matrices along with the feature names and
vocabularies produced by the two vectorizers.
:param corpus: A list of (patient identifier, concept string) pairs, where the concept string represents all
concepts of a given clinical feature for that patient.
:return:
count_matrix: A matrix where each row represents a patient and each column represents a concept.
tfidf_matrix: A matrix where each row represents a patient and each column represents a concept and counts
are weighted by TF-IDF.
concepts, feature_dict: The feature names and vocabulary dictionary from the TF-IDF vectorizer.
concepts_cnt, feature_dict_cnt: The feature names and vocabulary dictionary from the count vectorizer.
"""
# STEP 1: create vectorized representation of patient concepts
count_vector = CountVectorizer(stop_words=None, lowercase=True)
count_matrix = count_vector.fit_transform([content for _, content in corpus])
# pat_counts.shape # should be number patients by number unique dx codes
# count_matrix.toarray()
# STEP 2: TF-IDF transformation + L2 normalization
tfidf_transform = TfidfVectorizer(analyzer='word', use_idf=True, norm='l2', stop_words=None, lowercase=True)
tfidf_matrix = tfidf_transform.fit_transform([content for file, content in corpus])
# tfidf_matrix.shape
# get location of each feature as well as a list of unique features
feature_dict_cnt = count_vector.vocabulary_
concepts_cnt = count_vector.get_feature_names()
feature_dict = tfidf_transform.vocabulary_
concepts = tfidf_transform.get_feature_names()
return count_matrix, tfidf_matrix, concepts, feature_dict, concepts_cnt, feature_dict_cnt
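# Example usage (illustrative; the corpus below is toy data): each corpus entry is a
# (patient id, space-delimited concept string) pair, matching how main() builds the corpus.
# corpus = [('pat_1', 'code_a code_b code_c'), ('pat_2', 'code_a code_d')]
# count_matrix, tfidf_matrix, concepts, feature_dict, concepts_cnt, feature_dict_cnt = patient_bow(corpus)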
def similarity_search(tfidf_matrix, index_patient, top_n):
"""
Function takes as input a tfidf matrix, an integer representing a patient id, and an integer representing the
number of similar patients to return. The function uses this information and calculates the cosine similarity
between the index patient and all other included patients. The results are sorted in descending order and the
top_n most similar patients are returned as (patient index, cosine similarity score) pairs.
:param tfidf_matrix: A matrix where each row represents a patient and each column represents a concept and counts are
weighted by TF-IDF.
:param index_patient: An integer representing a patient id.
:param top_n: An integer representing the number of similar patients to return.
:return:
similar_patients: A list of (patient index, cosine similarity score) pairs for the top_n most similar
patients.
"""
# http://markhneedham.com/blog/2016/07/27/scitkit-learn-tfidf-and-cosine-similarity-
# calculate similarity
cosine_similarities = linear_kernel(tfidf_matrix[index_patient:index_patient + 1], tfidf_matrix).flatten()
rel_pat_indices = [i for i in cosine_similarities.argsort()[::-1] if i != index_patient]
similar_patients = [(patient, cosine_similarities[patient]) for patient in rel_pat_indices][0:top_n]
return similar_patients
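# Example usage (illustrative): retrieve the 5 patients most similar to the patient in row 0 of the
# TF-IDF matrix returned by patient_bow(); the result is a list of (row index, cosine score) pairs.
# top_matches = similarity_search(tfidf_matrix, 0, 5)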
def similarity_test(tfidf_matrix, test_data1, index_id, cond_type):
"""
Function takes a matrix of counts weighted with tf-idf, a pandas data frame containing ids, conditions,
and patient conditions, a number representing the identifier of the index patient, and a string representing
the disease.
:param tfidf_matrix: A matrix of counts weighted with tf-idf.
:param test_data1: A pandas data frame containing patient ids, condition labels, and patient conditions.
:param index_id: A number representing the identifier of the index patient.
:param cond_type: A string representing the disease.
:return:
p_sim: Pandas DataFrame for index patient including condition labels and similarity scores.
"""
# get index of id - used to pull correct row from tf-idf matrix
index_pat = test_data1.index[test_data1.pat_id == index_id][0]
id = []; cond = []; scores = []
for index, score in similarity_search(tfidf_matrix, index_pat, len(test_data1) - 1):
id.append(test_data1['pat_id'][index])
conds_list = test_data1['pat_label'][test_data1.index[test_data1.pat_id == test_data1['pat_id'][index]]]
if list(conds_list)[0] == cond_type: cond.append(1.0)
else: cond.append(0.0)
scores.append(score)
p_sim = pd.DataFrame(dict(pat_id=id, conds=cond, score=scores))
# verify that the number of rows is correct
if len(test_data1) - 1 != len(p_sim): raise ValueError('Output file is the wrong length')
else: return p_sim
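# Example usage (illustrative; 'CF' and the patient id are placeholders): rank all other patients against
# the patient whose pat_id is 12345, labeling each neighbor 1.0/0.0 by whether it shares the 'CF' label.
# p_sim = similarity_test(tfidf_matrix, test_data, 12345, 'CF')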
def youden_index(data_frame):
"""
Function takes a data frame as input and uses it to calculate Youden index and confusion matrix information.
The function returns a list of values where each value is a result from a confusion matrix.
:param data_frame: Pandas DataFrame.
:return:
tn, fp, fn, tp: The results from a confusion matrix.
"""
# create a clean data frame for the regression
cols_to_keep = ['conds', 'score']; data = data_frame[cols_to_keep]
data['intercept'] = 1.0 # manually add the intercept
train_cols = data.columns[1:]
# run regression
model = LogisticRegression().fit(data[train_cols], data['conds'])
data['pred_probs'] = model.predict_proba(data[train_cols])[:, 1]
# identify thresholds
fpr, tpr, thresholds = roc_curve(data['conds'], data['pred_probs'])
threshold = pd.Series(tpr - fpr, index=thresholds).idxmax()
# add threshold to data
data['pred'] = data['pred_probs'].map(lambda x: 1.0 if x > threshold else 0.0)
# calculate confusion matrix (*by 1.0 to add decimal point to numbers)
tn, fp, fn, tp = (confusion_matrix(data['conds'], data['pred']).ravel()) * 1.0
return [fpr, tpr], tn, fp, fn, tp
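# Note: the threshold above maximizes Youden's J statistic, J = sensitivity + specificity - 1 = TPR - FPR,
# over all ROC operating points. For example (toy numbers), a threshold with TPR = 0.80 and FPR = 0.20
# gives J = 0.60; the threshold with the largest J is used to dichotomize the predicted probabilities.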
def top_tfidf_feats(row_id, data, row, features, top_n=25):
"""Get top n tfidf values in row and return them with their corresponding feature names."""
topn_ids = np.argsort(row)[::-1][:top_n]
top_feats = [(features[i], row[i]) for i in topn_ids]
df = pd.DataFrame(top_feats)
df.columns = [str(data['pat_id'][row_id]) + "_" + str(data['pat_label'][row_id]), 'tfidf']
return df
def top_feats_in_doc(data, xtr, features, row_id, top_n=25):
"""Top tfidf features in specific document (matrix row) """
row = np.squeeze(xtr[row_id].toarray())
return top_tfidf_feats(row_id, data, row, features, top_n)
def top_mean_feats(xtr, features, grp_ids=None, min_tfidf=0.1, top_n=25):
"""Return the top n features that on average are most important amongst documents in rows identified by indices
in grp_ids."""
if grp_ids: d = xtr[grp_ids].toarray()
else: d = xtr.toarray()
d[d < min_tfidf] = 0
tfidf_means = np.mean(d, axis=0)
return top_tfidf_feats(tfidf_means, features, top_n)
def top_feats_by_class(xtr, y, features, min_tfidf=0.1, top_n=25):
"""Return a list of dfs, where each df holds top_n features and their mean tfidf value calculated across
documents with the same class label."""
dfs = []
labels = np.unique(y)
for label in labels:
ids = np.where(y == label)
feats_df = top_mean_feats(xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
feats_df.label = label
l = [label] * len(feats_df)
feats_df['label'] = l
dfs.append(feats_df)
return dfs
def plot_tfidf_classfeats_h(dfs):
"""Plot the data frames returned by the function plot_tfidf_classfeats()."""
fig = plt.figure(figsize=(50, 10))
x = np.arange(len(dfs[0]))
for i, df in enumerate(dfs):
ax = fig.add_subplot(1, len(dfs), i + 1)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_frame_on(False)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
ax.set_xlabel("Mean Tf-Idf Score", labelpad=16, fontsize=14)
ax.set_title("label = " + str(df.label), fontsize=16)
ax.ticklabel_format(axis='x', style='sci', scilimits=(-2, 2))
ax.barh(x, df.tfidf, align='center', color='#3F5D7D')
ax.set_yticks(x)
ax.set_ylim([-1, x[-1] + 1])
ax.set_yticklabels(df.feature, fontsize=10)
plt.subplots_adjust(bottom=0.09, right=0.97, left=0.15, top=0.95, wspace=0.52)
plt.show()
def label_remover(test_data, cut_list, col_list, extra_strings):
"""Function removes any codes or labels that should be excluded from the datasets."""
cut_list = [a for b in cut_list for a in b]
for row in range(len(test_data)):
for col in col_list:
clean_row = [word for word in test_data[col][row].split(' ') if word not in cut_list]
idx = [[i for i, item in enumerate(clean_row) if re.search(s.lower(), item.lower())] for s in extra_strings]
cleaned_row = [i for i in clean_row if i not in [clean_row[x] for y in idx for x in y]]
test_data.iloc[row, test_data.columns.get_loc(col)] = ' '.join(set(cleaned_row))
return test_data
def model_metrics(tn, fp, fn, tp):
"""Function calculates performance metrics."""
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
fpr = 1 - (tn / (tn + fp))
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1_score = (2 * tp / (2 * tp + fp + fn))
return specificity, recall, precision, fpr, accuracy, f1_score
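# Example (toy confusion-matrix counts, not study results): model_metrics(tn=50, fp=10, fn=5, tp=35)
# returns specificity 50/60 ≈ 0.83, recall 35/40 = 0.875, precision 35/45 ≈ 0.78, fpr ≈ 0.17,
# accuracy 85/100 = 0.85, and F1 70/85 ≈ 0.82.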
def main():
#####################################################################################################
#### Read in Clinical Data from GBQ Query ####
#####################################################################################################
# labels
case_labels = pickle.load(open('Queries/Test_Queries/QueryData/PatSim_RareDisease_Cases2', 'rb'))
rand_labels = pickle.load(open('Queries/Test_Queries/QueryData/PatSim_RareDisease_Ransom2', 'rb'))
# demographics
case_demo= pickle.load(open('Queries/Test_Queries/QueryData/PatSim_RareDisease_Cases_Demographics', 'rb'))
rand_demo = pickle.load(open('Queries/Test_Queries/QueryData/PatSim_RareDisease_Random_Demographics', 'rb'))
# conditions
case_cond_c = pickle.load(open('../Queries/Test_Queries/QueryData/PatSim_RareDisease_Cases2_Condition_Codes', 'rb'))
case_cond_txt = pickle.load(open('../Queries/Test_Queries/QueryData/PatSim_RareDisease_Cases2_Condition_Source', 'rb'))
rand_cond_c = pickle.load(open('Queries/Test_Queries/QueryData/PatSim_RareDisease_Random2_Condition_Codes', 'rb'))
rand_cond_txt = pickle.load(open('../Queries/Test_Queries/QueryData/PatSim_RareDisease_Random2_Condition_Source', 'rb'))
# medications
case_med_c = pickle.load(open('Queries/Test_Queries/QueryData/PatSim_RareDisease_Cases2_Medication_Codes', 'rb'))
case_med_txt = pickle.load(open('../Queries/Test_Queries/QueryData/PatSim_RareDisease_Cases2_Medication_Source', 'rb'))
rand_med_c = pickle.load(open('../Queries/Test_Queries/QueryData/PatSim_RareDisease_Random2_Medication_Codes', 'rb'))
rand_med_txt = pickle.load(open('../Queries/Test_Queries/QueryData/PatSim_RareDisease_Random2_Medication_Source', 'rb'))
# labs
case_lab_c = pickle.load(open('../Queries/Test_Queries/QueryData/PatSim_RareDisease_Cases2_Lab_Codes', 'rb'))
case_lab_txt = pickle.load(open('../Queries/Test_Queries/QueryData/PatSim_RareDisease_Cases2_Lab_Source', 'rb'))
rand_lab_c = pickle.load(open('Queries/Test_Queries/QueryData/PatSim_RareDisease_Random2_Lab_Codes', 'rb'))
rand_lab_txt = pickle.load(open('../Queries/Test_Queries/QueryData/PatSim_RareDisease_Random2_Lab_Source', 'rb'))
#####################################################################################################
#### Process Clinical Data ####
#####################################################################################################
### STEP 1: Identify duplicate patients (this needs to include identifying the min number of patients with each concept)
case_labels_dedup = patient_data.duplicate_identifier(case_labels, 'YES', 'YES')
rand_labels_dedup = patient_data.duplicate_identifier(rand_labels, 'YES')
# demographics
case_demo = patient_data.patient_demo(case_demo, case_labels_dedup)
rand_demo = patient_data.patient_demo(rand_demo, rand_labels_dedup)
### STEP 2: Process data for concept labels
# conditions
case_cond_text = patient_data.patient_concepts(case_cond_txt, case_labels_dedup, 'CASE', 'TEXT', 'COND')
rand_cond_text = patient_data.patient_concepts(rand_cond_txt, rand_labels_dedup, 'RAND', 'TEXT', 'COND')
# medications
case_med_txt = patient_data.patient_concepts(case_med_txt, case_labels_dedup, 'CASE', 'TEXT', 'MED')
rand_med_text = patient_data.patient_concepts(rand_med_txt, rand_labels_dedup, 'RAND', 'TEXT', 'MED')
# labs
case_lab_txt = patient_data.patient_concepts(case_lab_txt, case_labels_dedup, 'CASE', 'TEXT', 'LAB')
rand_lab_text = patient_data.patient_concepts(rand_lab_txt, rand_labels_dedup, 'RAND', 'TEXT', 'LAB')
## tidy, filter, and merge data
# CASES
# conditions
case_cond_text.concept_text = case_cond_text.concept_text.astype(str)
case_dx_visits = list(set(list(case_cond_text['visit_id'])))
cond_agg = case_cond_text.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
cond_agg = pd.DataFrame({'pat_id': cond_agg.index, 'pat_conds': cond_agg.values})
case_cond_txt = cond_agg.merge(case_cond_text[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_cond_txt.drop_duplicates(keep='first', inplace=True)
case_cond_txt = case_cond_txt.reset_index(drop=True)
case_cond_txt.groupby(by=['pat_label']).size()
# medications
case_med_txt.concept_text = case_med_txt.concept_text.astype(str)
med_filt = case_med_txt[case_med_txt['visit_id'].isin(case_dx_visits)]
med_agg = med_filt.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
med_agg = pd.DataFrame({'pat_id': med_agg.index, 'pat_conds': med_agg.values})
case_med_txt = med_agg.merge(med_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_med_txt.drop_duplicates(keep='first', inplace=True)
case_med_txt = case_med_txt.reset_index(drop=True)
case_med_txt.groupby(by=['pat_label']).size()
# labs
case_lab_txt.concept_text = case_lab_txt.concept_text.astype(str)
lab_filt = case_lab_txt[case_lab_txt['visit_id'].isin(case_dx_visits)]
lab_agg = lab_filt.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
lab_agg = pd.DataFrame({'pat_id': lab_agg.index, 'pat_conds': lab_agg.values})
case_lab_txt = lab_agg.merge(lab_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_lab_txt.drop_duplicates(keep='first', inplace=True)
case_lab_txt = case_lab_txt.reset_index(drop=True)
case_lab_txt.groupby(by=['pat_label']).size()
# combine data types
combo_txt = case_cond_txt.merge(case_lab_txt[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_txt.rename(columns={'pat_conds_x': 'cond_vecs', 'pat_conds_y': 'lab_vecs'}, inplace=True)
combo_txt = combo_txt.merge(case_med_txt[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_txt.rename(columns={'pat_conds': 'med_vecs'}, inplace=True)
combo_txt = combo_txt.replace(np.nan, '', regex=True)
# add demo info
combo_txt = combo_txt.merge(case_demo, left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
# combine conds into 1 column
combo_txt["pat_vecs"] = combo_txt["cond_vecs"] + " " + combo_txt["med_vecs"] + " " + combo_txt["lab_vecs"]
combo_txt["pat_vecs3"] = combo_txt['dob'] + " " + combo_txt['gender'] + " " + combo_txt['race'] + " " + combo_txt["cond_vecs"] + " " + combo_txt["med_vecs"] + " " + combo_txt["lab_vecs"]
combo_txt["pat_vecs2"] = combo_txt['gender'] + " " + combo_txt['race'] + " " + combo_txt["cond_vecs"] + " " + combo_txt["med_vecs"] + " " + combo_txt["lab_vecs"]
combo_txt["cond_vecs2"] = combo_txt['gender'] + " " + combo_txt['race'] + " " + combo_txt["cond_vecs"]
combo_txt["med_vecs2"] = combo_txt['gender'] + " " + combo_txt['race'] + " " + combo_txt["med_vecs"]
combo_txt["lab_vecs2"] = combo_txt['gender'] + " " + combo_txt['race'] + " " + combo_txt["lab_vecs"]
combo_txt["cond_vecs3"] = combo_txt['dob'] + " " + combo_txt['gender'] + " " + combo_txt['race'] + " " + combo_txt["cond_vecs"]
combo_txt["med_vecs3"] = combo_txt['dob'] + " " + combo_txt['gender'] + " " + combo_txt['race'] + " " + combo_txt["med_vecs"]
combo_txt["lab_vecs3"] = combo_txt['dob'] + " " + combo_txt['gender'] + " " + combo_txt['race'] + " " + combo_txt["lab_vecs"]
# save data
combo_txt.to_pickle("Data/PatSim_test/Rare_case_all_text_df")
# CONTROLS
# conditions
rand_cond_text.concept_text = rand_cond_text.concept_text.astype(str)
cond_agg = rand_cond_text.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
cond_agg = pd.DataFrame({'pat_id': cond_agg.index, 'pat_conds': cond_agg.values})
rand_cond_text = cond_agg.merge(rand_cond_text[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
rand_cond_text.drop_duplicates(keep='first', inplace=True)
rand_cond_text = rand_cond_text.reset_index(drop=True)
rand_cond_text.groupby(by=['pat_label']).size()
# medications
rand_med_text.concept_text = rand_med_text.concept_text.astype(str)
med_agg = rand_med_text.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
med_agg = pd.DataFrame({'pat_id': med_agg.index, 'pat_conds': med_agg.values})
rand_med_text = med_agg.merge(rand_med_text[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
rand_med_text.drop_duplicates(keep='first', inplace=True)
rand_med_text = rand_med_text.reset_index(drop=True)
rand_med_text.groupby(by=['pat_label']).size()
# labs
rand_lab_text.concept_text = rand_lab_text.concept_text.astype(str)
lab_agg = rand_lab_text.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
lab_agg = pd.DataFrame({'pat_id': lab_agg.index, 'pat_conds': lab_agg.values})
rand_lab_text = lab_agg.merge(rand_lab_text[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
rand_lab_text.drop_duplicates(keep='first', inplace=True)
# rand_lab_text = rand_lab_text.sample(n=10000, replace=False).reset_index(drop=True)
rand_lab_text.groupby(by=['pat_label']).size()
# add demo info
rand_all = rand_cond_text.merge(rand_med_text[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
rand_all.rename(columns={'pat_conds_x': 'cond_vecs', 'pat_conds_y': 'med_vecs'}, inplace=True)
rand_all = rand_all.merge(rand_lab_text[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
rand_all.rename(columns={'pat_conds': 'lab_vecs'}, inplace=True)
rand_all = rand_all.replace(np.nan, '', regex=True)
rand_all = rand_all.merge(rand_demo, left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
# combine conds into 1 column
rand_all["pat_vecs"] = rand_all["cond_vecs"] + " " + rand_all["med_vecs"] + " " + rand_all["lab_vecs"]
rand_all["pat_vecs2"] = rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["cond_vecs"] + " " + rand_all["med_vecs"] + " " + rand_all["lab_vecs"]
rand_all["cond_vecs2"] = rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["cond_vecs"]
rand_all["med_vecs2"] = rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["med_vecs"]
rand_all["lab_vecs2"] = rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["lab_vecs"]
rand_all["pat_vecs3"] = rand_all['dob'] + " " + rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["cond_vecs"] + " " + rand_all["med_vecs"] + " " + rand_all["lab_vecs"]
rand_all["cond_vecs3"] = rand_all['dob'] + " " + rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["cond_vecs"]
rand_all["med_vecs3"] = rand_all['dob'] + " " + rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["med_vecs"]
rand_all["lab_vecs3"] = rand_all['dob'] + " " + rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["lab_vecs"]
# combine and save data
combo_txt = pd.concat([combo_txt, rand_all.sample(n=10000, replace=False)]).reset_index(drop=True)
combo_txt.to_pickle("Data/PatSim_test/Rare_combo_all_text_df")
### STEP 2 B: Restrict to concepts recorded BEFORE the labeled dx occurred
c_grp = case_cond_txt.groupby(by=['pat_label'])
c_grp = c_grp.apply(lambda g: g[pd.to_datetime(g['cond_date']) <= pd.to_datetime(g['cond_start_date'])])
cond_before_dx = c_grp.reset_index(drop=True)
cond_before_dx.groupby(by=['pat_label']).size()
before_dx_visits = list(set(list(c_grp['visit_id'])))
# conditions
cond_before_dx.concept_text = cond_before_dx.concept_text.astype(str)
cond_agg = cond_before_dx.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
cond_agg = pd.DataFrame({'pat_id': cond_agg.index, 'pat_conds': cond_agg.values})
case_cond_txt_B_dx = cond_agg.merge(cond_before_dx[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_cond_txt_B_dx.drop_duplicates(keep='first', inplace=True)
case_cond_txt_B_dx = case_cond_txt_B_dx.reset_index(drop=True)
case_cond_txt_B_dx.groupby(by=['pat_label']).size()
# medications
case_med_txt.concept_text = case_med_txt.concept_text.astype(str)
med_filt = case_med_txt[case_med_txt['visit_id'].isin(before_dx_visits)]
med_agg = med_filt.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
med_agg = pd.DataFrame({'pat_id': med_agg.index, 'pat_conds': med_agg.values})
case_med_txt_B_dx = med_agg.merge(med_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_med_txt_B_dx.drop_duplicates(keep='first', inplace=True)
case_med_txt_B_dx = case_med_txt_B_dx.reset_index(drop=True)
case_med_txt_B_dx.groupby(by=['pat_label']).size()
# measurements
case_lab_txt.concept_text = case_lab_txt.concept_text.astype(str)
lab_filt = case_lab_txt[case_lab_txt['visit_id'].isin(before_dx_visits)]
lab_agg = lab_filt.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
lab_agg = pd.DataFrame({'pat_id': lab_agg.index, 'pat_conds': lab_agg.values})
case_lab_txt_B_dx = lab_agg.merge(lab_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_lab_txt_B_dx.drop_duplicates(keep='first', inplace=True)
case_lab_txt_B_dx = case_lab_txt_B_dx.reset_index(drop=True)
case_lab_txt_B_dx.groupby(by=['pat_label']).size()
### STEP 2 C: Restrict to concepts recorded AFTER the labeled dx occurred
c_grp = case_cond_txt.groupby(by=['pat_id'])
c_grp = c_grp.apply(lambda g: g[pd.to_datetime(g['cond_date']) >= pd.to_datetime(g['cond_start_date'])])
cond_after_dx = c_grp.reset_index(drop=True)
after_dx_visits = list(set(list(c_grp['visit_id'])))
# conditions
cond_after_dx.concept_text = cond_after_dx.concept_text.astype(str)
cond_agg = cond_after_dx.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
cond_agg = pd.DataFrame({'pat_id': cond_agg.index, 'pat_conds': cond_agg.values})
case_cond_txt_A_dx = cond_agg.merge(cond_after_dx[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_cond_txt_A_dx.drop_duplicates(keep='first', inplace=True)
case_cond_txt_A_dx = case_cond_txt_A_dx.reset_index(drop=True)
case_cond_txt_A_dx.groupby(by=['pat_label']).size()
# medications
case_med_txt.concept_text = case_med_txt.concept_text.astype(str)
med_filt = case_med_txt[case_med_txt['visit_id'].isin(after_dx_visits)]
med_agg = med_filt.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
med_agg = pd.DataFrame({'pat_id': med_agg.index, 'pat_conds': med_agg.values})
case_med_txt_A_dx = med_agg.merge(med_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_med_txt_A_dx.drop_duplicates(keep='first', inplace=True)
case_med_txt_A_dx = case_med_txt_A_dx.reset_index(drop=True)
case_med_txt_A_dx.groupby(by=['pat_label']).size()
# labs
case_lab_txt.concept_text = case_lab_txt.concept_text.astype(str)
lab_filt = case_lab_txt[case_lab_txt['visit_id'].isin(after_dx_visits)]
lab_agg = lab_filt.groupby('pat_id')['concept_text'].agg(lambda col: ' '.join(col))
lab_agg = pd.DataFrame({'pat_id': lab_agg.index, 'pat_conds': lab_agg.values})
case_lab_txt_A_dx = lab_agg.merge(lab_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_lab_txt_A_dx.drop_duplicates(keep='first', inplace=True)
case_lab_txt_A_dx = case_lab_txt_A_dx.reset_index(drop=True)
case_lab_txt_A_dx.groupby(by=['pat_label']).size()
### STEP 2 D: Combine concept sets
## combine data types - before
combo_txt_B_dx = case_cond_txt_B_dx.merge(case_lab_txt_B_dx[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_txt_B_dx.rename(columns={'pat_conds_x': 'cond_vecs', 'pat_conds_y': 'lab_vecs'}, inplace=True)
combo_txt_B_dx = combo_txt_B_dx.merge(case_med_txt_B_dx[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_txt_B_dx.rename(columns={'pat_conds': 'med_vecs'}, inplace=True)
combo_txt_B_dx = combo_txt_B_dx.replace(np.nan, '', regex=True)
# add demo info
combo_txt_B_dx = combo_txt_B_dx.merge(case_demo, left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
# combine conds into 1 column
combo_txt_B_dx["pat_vecs"] = combo_txt_B_dx["cond_vecs"] + " " + combo_txt_B_dx["med_vecs"] + " " + combo_txt_B_dx[ "lab_vecs"]
combo_txt_B_dx["pat_vecs2"] = combo_txt_B_dx['dob'] + " " + combo_txt_B_dx['gender'] + " " + combo_txt_B_dx['race'] + " " + combo_txt_B_dx["cond_vecs"] + " " + combo_txt_B_dx["med_vecs"] + " " + combo_txt_B_dx["lab_vecs"]
combo_txt_B_dx["cond_vecs2"] = combo_txt_B_dx['dob'] + " " + combo_txt_B_dx['gender'] + " " + combo_txt_B_dx['race'] + " " + combo_txt_B_dx["cond_vecs"]
combo_txt_B_dx["med_vecs2"] = combo_txt_B_dx['dob'] + " " + combo_txt_B_dx['gender'] + " " + combo_txt_B_dx['race'] + " " + combo_txt_B_dx["med_vecs"]
combo_txt_B_dx["lab_vecs2"] = combo_txt_B_dx['dob'] + " " + combo_txt_B_dx['gender'] + " " + combo_txt_B_dx['race'] + " " + combo_txt_B_dx["lab_vecs"]
combo_txt_B_dx.to_pickle("Data/PatSim_test/case_all_text_B_filtered_df")
# merge with controls
combo_txt_B_dx = pd.concat([combo_txt_B_dx, rand_all.sample(n=10000, replace=False)]).reset_index(drop=True)
combo_txt_B_dx.to_pickle("Data/PatSim_test/comb_all_text_B_filtered_df")
## combine data types - after
combo_txt_A_dx = case_cond_txt_A_dx.merge(case_lab_txt_A_dx[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_txt_A_dx.rename(columns={'pat_conds_x': 'cond_vecs', 'pat_conds_y': 'lab_vecs'}, inplace=True)
combo_txt_A_dx = combo_txt_A_dx.merge(case_med_txt_A_dx[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_txt_A_dx.rename(columns={'pat_conds': 'med_vecs'}, inplace=True)
combo_txt_A_dx = combo_txt_A_dx.replace(np.nan, '', regex=True)
# add demo info
combo_txt_A_dx = combo_txt_A_dx.merge(case_demo, left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
# combine conds into 1 column
combo_txt_A_dx["pat_vecs"] = combo_txt_A_dx["cond_vecs"] + " " + combo_txt_A_dx["med_vecs"] + " " + combo_txt_A_dx["lab_vecs"]
combo_txt_A_dx["pat_vecs2"] = combo_txt_A_dx['dob'] + " " + combo_txt_A_dx['gender'] + " " + combo_txt_A_dx['race'] + " " + combo_txt_A_dx["cond_vecs"] + " " + combo_txt_A_dx["med_vecs"] + " " + combo_txt_A_dx[ "lab_vecs"]
combo_txt_A_dx["cond_vecs2"] = combo_txt_A_dx['dob'] + " " + combo_txt_A_dx['gender'] + " " + combo_txt_A_dx['race'] + " " + combo_txt_A_dx["cond_vecs"]
combo_txt_A_dx["med_vecs2"] = combo_txt_A_dx['dob'] + " " + combo_txt_A_dx['gender'] + " " + combo_txt_A_dx['race'] + " " + combo_txt_A_dx["med_vecs"]
combo_txt_A_dx["lab_vecs2"] = combo_txt_A_dx['dob'] + " " + combo_txt_A_dx['gender'] + " " + combo_txt_A_dx['race'] + " " + combo_txt_A_dx["lab_vecs"]
combo_txt_A_dx.to_pickle("Data/PatSim_test/case_all_text_A_filtered_df")
# merge with controls
combo_txt_A_dx = pd.concat([combo_txt_A_dx, rand_all.sample(n=10000, replace=False)]).reset_index(drop=True)
combo_txt_A_dx.to_pickle("Data/PatSim_test/comb_all_text_A_filtered_df")
### STEP 2: Process data for concept codes
## tidy, filter, and merge data
# conditions
case_cond_c = patient_data.patient_concepts(case_cond_c, case_labels_dedup, 'CASE', 'CODE', 'COND')
rand_cond_c = patient_data.patient_concepts(rand_cond_c, rand_labels_dedup, 'RANDOM', 'CODE', 'COND')
# medications
case_med_c = patient_data.patient_concepts(case_med_c, case_labels_dedup, 'CASE', 'CODE', 'MED')
rand_med_c = patient_data.patient_concepts(rand_med_c, rand_labels_dedup, 'RANDOM', 'CODE', 'MED')
# labs
case_lab_c = patient_data.patient_concepts(case_lab_c, case_labels_dedup, 'CASE', 'CODE', 'LAB')
rand_lab_c = patient_data.patient_concepts(rand_lab_c, rand_labels_dedup, 'RANDOM', 'CODE', 'LAB')
# CASES
# conditions
case_cond_c.concept_code = case_cond_c.concept_code.astype(str)
case_dx_visits = list(set(list(case_cond_c['visit_id'])))
cond_agg = case_cond_c.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
cond_agg = pd.DataFrame({'pat_id': cond_agg.index, 'pat_conds': cond_agg.values})
case_cond_c = cond_agg.merge(case_cond_c[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_cond_c.drop_duplicates(keep='first', inplace=True)
case_cond_c = case_cond_c.reset_index(drop=True)
case_cond_c.groupby(by=['pat_label']).size()
# medications
case_med_c.concept_code = case_med_c.concept_code.astype(str)
med_filt = case_med_c[case_med_c['visit_id'].isin(case_dx_visits)]
med_agg = med_filt.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
med_agg = pd.DataFrame({'pat_id': med_agg.index, 'pat_conds': med_agg.values})
case_med_c = med_agg.merge(med_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_med_c.drop_duplicates(keep='first', inplace=True)
case_med_c = case_med_c.reset_index(drop=True)
case_med_c.groupby(by=['pat_label']).size()
# labs
case_lab_c.concept_code = case_lab_c.concept_code.astype(str)
lab_filt = case_lab_c[case_lab_c['visit_id'].isin(case_dx_visits)]
lab_agg = lab_filt.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
lab_agg = pd.DataFrame({'pat_id': lab_agg.index, 'pat_conds': lab_agg.values})
case_lab_c = lab_agg.merge(lab_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_lab_c.drop_duplicates(keep='first', inplace=True)
case_lab_c = case_lab_c.reset_index(drop=True)
case_lab_c.groupby(by=['pat_label']).size()
# combine data types - before
combo_c = case_cond_c.merge(case_lab_c[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_c.rename(columns={'pat_conds_x': 'cond_vecs', 'pat_conds_y': 'lab_vecs'}, inplace=True)
combo_c = combo_c.merge(case_med_c[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_c.rename(columns={'pat_conds': 'med_vecs'}, inplace=True)
combo_c = combo_c.replace(np.nan, '', regex=True)
# add demo info
combo_c = combo_c.merge(case_demo, left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
# combine conds into 1 column
combo_c["pat_vecs"] = combo_c["cond_vecs"] + " " + combo_c["med_vecs"] + " " + combo_c["lab_vecs"]
combo_c["pat_vecs2"] = combo_c['dob'] + " " + combo_c['gender'] + " " + combo_c['race'] + " " + combo_c["cond_vecs"] + " " + combo_c["med_vecs"] + " " + combo_c["lab_vecs"]
combo_c["cond_vecs2"] = combo_c['dob'] + " " + combo_c['gender'] + " " + combo_c['race'] + " " + combo_c["cond_vecs"]
combo_c["med_vecs2"] = combo_c['dob'] + " " + combo_c['gender'] + " " + combo_c['race'] + " " + combo_c["med_vecs"]
combo_c["lab_vecs2"] = combo_c['dob'] + " " + combo_c['gender'] + " " + combo_c['race'] + " " + combo_c["lab_vecs"]
# save data
combo_c.to_pickle("Data/PatSim_test/case_all_code_df")
# CONTROLS
# conditions
rand_cond_c.concept_code = rand_cond_c.concept_code.astype(str)
cond_agg = rand_cond_c.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
cond_agg = pd.DataFrame({'pat_id': cond_agg.index, 'pat_conds': cond_agg.values})
rand_cond_c = cond_agg.merge(rand_cond_c[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
rand_cond_c.drop_duplicates(keep='first', inplace=True)
rand_cond_c = rand_cond_c.reset_index(drop=True)
rand_cond_c.groupby(by=['pat_label']).size()
# medications
rand_med_c.concept_code = rand_med_c.concept_code.astype(str)
med_agg = rand_med_c.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
med_agg = pd.DataFrame({'pat_id': med_agg.index, 'pat_conds': med_agg.values})
rand_med_c = med_agg.merge(rand_med_c[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
rand_med_c.drop_duplicates(keep='first', inplace=True)
rand_med_c = rand_med_c.reset_index(drop=True)
rand_med_c.groupby(by=['pat_label']).size()
# labs
rand_lab_c.concept_code = rand_lab_c.concept_code.astype(str)
lab_agg = rand_lab_c.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
lab_agg = pd.DataFrame({'pat_id': lab_agg.index, 'pat_conds': lab_agg.values})
rand_lab_c = lab_agg.merge(rand_lab_c[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
rand_lab_c.drop_duplicates(keep='first', inplace=True)
rand_lab_c = rand_lab_c.reset_index(drop=True)
rand_lab_c.groupby(by=['pat_label']).size()
# add demo info
rand_all = rand_cond_c.merge(rand_med_c[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
rand_all.rename(columns={'pat_conds_x': 'cond_vecs', 'pat_conds_y': 'med_vecs'}, inplace=True)
rand_all = rand_all.merge(rand_lab_c[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
rand_all.rename(columns={'pat_conds': 'lab_vecs'}, inplace=True)
rand_all = rand_all.replace(np.nan, '', regex=True)
rand_all = rand_all.merge(rand_demo, left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
# combine conds into 1 column
rand_all["pat_vecs"] = rand_all["cond_vecs"] + " " + rand_all["med_vecs"] + " " + rand_all["lab_vecs"]
rand_all["pat_vecs2"] = rand_all['dob'] + " " + rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["cond_vecs"] + " " + rand_all["med_vecs"] + " " + rand_all["lab_vecs"]
rand_all["cond_vecs2"] = rand_all['dob'] + " " + rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["cond_vecs"]
rand_all["med_vecs2"] = rand_all['dob'] + " " + rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["med_vecs"]
rand_all["lab_vecs2"] = rand_all['dob'] + " " + rand_all['gender'] + " " + rand_all['race'] + " " + rand_all["lab_vecs"]
# combine data
combo_c = pd.concat([combo_c, rand_all.sample(n=10000, replace=False)]).reset_index(drop=True)
combo_c.to_pickle('Data/PatSim_test/comb_all_code_df')
### STEP 2 B: Restrict to concepts recorded BEFORE the labeled dx occurred
c_grp = case_cond_c.groupby(by=['pat_label'])
c_grp = c_grp.apply(lambda g: g[pd.to_datetime(g['cond_date']) <= pd.to_datetime(g['cond_start_date'])])
cond_before_dx = c_grp.reset_index(drop=True)
cond_before_dx.groupby(by=['pat_label']).size()
before_dx_visits = list(set(list(c_grp['visit_id'])))
# conditions
cond_before_dx.concept_code = cond_before_dx.concept_code.astype(str)
cond_agg = cond_before_dx.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
cond_agg = pd.DataFrame({'pat_id': cond_agg.index, 'pat_conds': cond_agg.values})
case_cond_c_B_dx = cond_agg.merge(cond_before_dx[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_cond_c_B_dx.drop_duplicates(keep='first', inplace=True)
case_cond_c_B_dx = case_cond_c_B_dx.reset_index(drop=True)
case_cond_c_B_dx.groupby(by=['pat_label']).size()
# medications
case_med_c.concept_code = case_med_c.concept_code.astype(str)
med_filt = case_med_c[case_med_c['visit_id'].isin(before_dx_visits)]
med_agg = med_filt.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
med_agg = pd.DataFrame({'pat_id': med_agg.index, 'pat_conds': med_agg.values})
case_med_c_B_dx = med_agg.merge(med_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_med_c_B_dx.drop_duplicates(keep='first', inplace=True)
case_med_c_B_dx = case_med_c_B_dx.reset_index(drop=True)
case_med_c_B_dx.groupby(by=['pat_label']).size()
# labs
case_lab_c.concept_code = case_lab_c.concept_code.astype(str)
lab_filt = case_lab_c[case_lab_c['visit_id'].isin(before_dx_visits)]
lab_agg = lab_filt.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
lab_agg = pd.DataFrame({'pat_id': lab_agg.index, 'pat_conds': lab_agg.values})
case_lab_c_B_dx = lab_agg.merge(lab_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_lab_c_B_dx.drop_duplicates(keep='first', inplace=True)
case_lab_c_B_dx = case_lab_c_B_dx.reset_index(drop=True)
case_lab_c_B_dx.groupby(by=['pat_label']).size()
### STEP 2 C: Restrict to concepts recorded AFTER the labeled dx occurred
c_grp = case_cond_c.groupby(by=['pat_id'])
c_grp = c_grp.apply(lambda g: g[pd.to_datetime(g['cond_date']) >= pd.to_datetime(g['cond_start_date'])])
cond_after_dx = c_grp.reset_index(drop=True)
after_dx_visits = list(set(list(c_grp['visit_id'])))
# conditions
cond_after_dx.concept_code = cond_after_dx.concept_code.astype(str)
cond_agg = cond_after_dx.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
cond_agg = pd.DataFrame({'pat_id': cond_agg.index, 'pat_conds': cond_agg.values})
case_cond_c_A_dx = cond_agg.merge(cond_after_dx[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_cond_c_A_dx.drop_duplicates(keep='first', inplace=True)
case_cond_c_A_dx = case_cond_c_A_dx.reset_index(drop=True)
case_cond_c_A_dx.groupby(by=['pat_label']).size()
# medications
case_med_c.concept_code = case_med_c.concept_code.astype(str)
med_filt = case_med_c[case_med_c['visit_id'].isin(after_dx_visits)]
med_agg = med_filt.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
med_agg = pd.DataFrame({'pat_id': med_agg.index, 'pat_conds': med_agg.values})
case_med_c_A_dx = med_agg.merge(med_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_med_c_A_dx.drop_duplicates(keep='first', inplace=True)
case_med_c_A_dx = case_med_c_A_dx.reset_index(drop=True)
case_med_c_A_dx.groupby(by=['pat_label']).size()
# labs
case_lab_c.concept_code = case_lab_c.concept_code.astype(str)
lab_filt = case_lab_c[case_lab_c['visit_id'].isin(after_dx_visits)]
lab_agg = lab_filt.groupby('pat_id')['concept_code'].agg(lambda col: ' '.join(col))
lab_agg = pd.DataFrame({'pat_id': lab_agg.index, 'pat_conds': lab_agg.values})
case_lab_c_A_dx = lab_agg.merge(lab_filt[['pat_id', 'pat_label', 'pat_lab_num']], left_on='pat_id', right_on='pat_id', how='left')
case_lab_c_A_dx.drop_duplicates(keep='first', inplace=True)
case_lab_c_A_dx = case_lab_c_A_dx.reset_index(drop=True)
case_lab_c_A_dx.groupby(by=['pat_label']).size()
### STEP 2 D: Combine concept sets
## combine data types - before
# conditions
combo_c_B_dx = case_cond_c_B_dx.merge(case_lab_c_B_dx[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_c_B_dx.rename(columns={'pat_conds_x': 'cond_vecs', 'pat_conds_y': 'lab_vecs'}, inplace=True)
combo_c_B_dx = combo_c_B_dx.merge(case_med_c_B_dx[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_c_B_dx.rename(columns={'pat_conds': 'med_vecs'}, inplace=True)
combo_c_B_dx = combo_c_B_dx.replace(np.nan, '', regex=True)
# add demo info
combo_c_B_dx = combo_c_B_dx.merge(case_demo, left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
# conditions
combo_c_B_dx["pat_vecs"] = combo_c_B_dx["cond_vecs"] + " " + combo_c_B_dx["med_vecs"] + " " + combo_c_B_dx["lab_vecs"]
combo_c_B_dx["pat_vecs2"] = combo_c_B_dx['dob'] + " " + combo_c_B_dx['gender'] + " " + combo_c_B_dx['race'] + " " + combo_c_B_dx["cond_vecs"] + " " + combo_c_B_dx["med_vecs"] + " " + combo_c_B_dx[ "lab_vecs"]
combo_c_B_dx["cond_vecs2"] = combo_c_B_dx['dob'] + " " + combo_c_B_dx['gender'] + " " + combo_c_B_dx['race'] + " " + combo_c_B_dx["cond_vecs"]
combo_c_B_dx["med_vecs2"] = combo_c_B_dx['dob'] + " " + combo_c_B_dx['gender'] + " " + combo_c_B_dx['race'] + " " + combo_c_B_dx["med_vecs"]
combo_c_B_dx["lab_vecs2"] = combo_c_B_dx['dob'] + " " + combo_c_B_dx['gender'] + " " + combo_c_B_dx['race'] + " " + combo_c_B_dx["lab_vecs"]
combo_c_B_dx.to_pickle('Data/PatSim_test/case_all_code_B_filtered_df')
# merge with random and write
combo_c_B_dx = pd.concat([combo_c_B_dx, rand_all.sample(n=10000, replace=False)]).reset_index(drop=True)
combo_c_B_dx.to_pickle('Data/PatSim_test/comb_all_code_B_filtered_df')
## combine data types - after
combo_c_A_dx = case_cond_c_A_dx.merge(case_lab_c_A_dx[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_c_A_dx.rename(columns={'pat_conds_x': 'cond_vecs', 'pat_conds_y': 'lab_vecs'}, inplace=True)
combo_c_A_dx = combo_c_A_dx.merge(case_med_c_A_dx[['pat_id', 'pat_conds']], left_on='pat_id', right_on='pat_id', how='left')
combo_c_A_dx.rename(columns={'pat_conds': 'med_vecs'}, inplace=True)
combo_c_A_dx = combo_c_A_dx.replace(np.nan, '', regex=True)
# add demo info
combo_c_A_dx = combo_c_A_dx.merge(case_demo, left_on='pat_id', right_on='pat_id', how='left').reset_index( drop=True)
# combine conds into 1 column
combo_c_A_dx["pat_vecs"] = combo_c_A_dx["cond_vecs"] + " " + combo_c_A_dx["med_vecs"] + " " + combo_c_A_dx["lab_vecs"]
combo_c_A_dx["pat_vecs2"] = combo_c_A_dx['dob'] + " " + combo_c_A_dx['gender'] + " " + combo_c_A_dx['race'] + " " + combo_c_A_dx["cond_vecs"] + " " + combo_c_A_dx["med_vecs"] + " " + combo_c_A_dx["lab_vecs"]
combo_c_A_dx["cond_vecs2"] = combo_c_A_dx['dob'] + " " + combo_c_A_dx['gender'] + " " + combo_c_A_dx['race'] + " " + combo_c_A_dx["cond_vecs"]
combo_c_A_dx["med_vecs2"] = combo_c_A_dx['dob'] + " " + combo_c_A_dx['gender'] + " " + combo_c_A_dx['race'] + " " + combo_c_A_dx["med_vecs"]
combo_c_A_dx["lab_vecs2"] = combo_c_A_dx['dob'] + " " + combo_c_A_dx['gender'] + " " + combo_c_A_dx['race'] + " " + combo_c_A_dx["lab_vecs"]
combo_c_A_dx.to_pickle('Data/PatSim_test/case_all_code_A_filtered_df')
# merge with random and save
combo_c_A_dx = pd.concat([combo_c_A_dx, rand_all.sample(n=10000, replace=False)]).reset_index(drop=True)
combo_c_A_dx.to_pickle('Data/PatSim_test/comb_all_code_A_filtered_df')
#####################################################################################################
#### Remove Labels and Code Use to Diagnosis Cases ####
#####################################################################################################
# CF patients
cf_cut_code = [str(x) for x in [254320, 194325, 441267, 434615, 193174, 40479565]]
cf_cut_txt = ['Cystic_fibrosis_with_meconium_ileus', 'Meconium_ileus_in_cystic_fibrosis', 'Cystic_fibrosis_with_other_manifestations', 'Cystic_fibrosis_with_other_intestinal_manifestations', 'Cystic_fibrosis_with_gastrointestinal_manifestations', 'Cystic_fibrosis_with_pulmonary_manifestations', 'Cystic_fibrosis_without_mention_of_meconium_ileus', 'Cystic_fibrosis_unspecified', 'Cystic_fibrosis', 'Cystic_fibrosis_carrier', 'Cystic_fibrosis_gene_carrier']
# SCD patients
sc_cut_code = [str(x) for x in [30683, 22281, 443726, 26942, 196943, 443721, 254062, 25518, 321263, 443738, 40485018]]
sc_cut_txt = ['Sickle_cell_disease_unspecified', 'Hb_SS_disease_with_acute_chest_syndrome', 'Sickle_cell_disease,_unspecified', 'Sickle_cell_disease', 'Sickle_cell_trait', 'Hb_SS_disease_with_crisis', 'Hb_SS_disease_with_crisis_unspecified', 'Other_sickle_cell_disorders_with_crisis_unspecified', 'Hb_SS_disease_with_splenic_sequestration', 'Other_sickle_cell_disease_with_crisis', 'Other_sickle_cell_disorders_with_acute_chest_syndrome', 'Sickle_cell_disease_without_crisis', 'Hb_SS_disease_without_crisis', 'Other_sickle_cell_disorders_without_crisis', 'Other_sickle_cell_disease_without_crisis', 'Sickle_cell_Hb_C_disease_with_splenic_sequestration',
'Sickle_cell_thalassemia_with_acute_chest_syndrome', 'Acute_chest_syndrome', 'Sickle_cell_Hb_C_disease_with_acute_chest_syndrome', 'Sickle_cell_thalassemia_without_crisis',
'Sickle_cell_Hb_C_disease_with_crisis_unspecified', 'Sickle_cell_Hb_C_disease_with_crisis', 'Sickle_cell_Hb_C_disease_without_crisis', 'Sickle_cell_thalassemia_with_crisis',
'Sickle_cell_thalassemia_with_crisis_unspecified', 'Other_sickle_cell_disorders_with_splenic_sequestration']
# CAH patients
ch_cut_code = [str(x) for x in [4314093, 4130017, 4081998]]
ch_cut_txt = ['Congenital_hypothyroidism_with_diffuse_goiter', 'Congenital_hypothyroidism_without_goiter', 'Congenital_hypothyroidism']
# PKU patients
pku_cut_code = [str(x) for x in [432872]]
pku_cut_txt = ['Phenylketonuria_PKU', 'Phenylketonuria', 'Classical_phenylketonuria', 'Screening_for_phenylketonuria_PKU']
# put lists together
cut_txt = [cf_cut_txt, sc_cut_txt, ch_cut_txt, pku_cut_txt]
cut_code = [cf_cut_code, sc_cut_code, ch_cut_code, pku_cut_code]
extra_strings = ['cystic_fibrosis', 'congenital_hypothyroidism', 'sickle_cell', 'hb_ss', 'phenylketonuria']
# read in processed data — assumes the data from the prior steps were saved
f_loc2s = ['Data/PatSim_test/case_all_code_df', 'Data/PatSim_test/comb_all_code_df', 'Data/PatSim_test/case_all_text_df', 'Data/PatSim_test/combo_all_text_df', 'Data/PatSim_test/case_all_code_B_filtered_df', 'Data/PatSim_test/comb_all_code_B_filtered_df', 'Data/PatSim_test/case_all_text_B_filtered_df', 'Data/PatSim_test/comb_all_text_B_filtered_df', 'Data/PatSim_test/case_all_code_A_filtered_df', 'Data/PatSim_test/comb_all_code_A_filtered_df', 'Data/PatSim_test/case_all_text_A_filtered_df', 'Data/PatSim_test/comb_all_text_A_filtered_df']
for x in f_loc2s:
print(x)
df = pd.read_pickle(x); col_list = ['cond_vecs', 'med_vecs', 'lab_vecs', 'pat_vecs']
if 'text' in x: df = label_remover(df, cut_txt, col_list, extra_strings)
else: df = label_remover(df, cut_code, col_list, extra_strings)
df.to_pickle(x)
#####################################################################################################
#### Perform Classification ####
#####################################################################################################
### STEP 1: Perform leave-one-patient-out classification
domain = ['cond', 'med', 'lab', 'all']
f_loc2 = ['Data/PatSim_test/case_all_code_df', 'Data/PatSim_test/comb_all_code_df', 'Data/PatSim_test/case_all_text_df', 'Data/PatSim_test/combo_all_text_df', 'Data/PatSim_test/case_all_code_B_filtered_df', 'Data/PatSim_test/comb_all_code_B_filtered_df', 'Data/PatSim_test/case_all_text_B_filtered_df', 'Data/PatSim_test/comb_all_text_B_filtered_df', 'Data/PatSim_test/case_all_code_A_filtered_df', 'Data/PatSim_test/comb_all_code_A_filtered_df', 'Data/PatSim_test/case_all_text_A_filtered_df', 'Data/PatSim_test/comb_all_text_A_filtered_df']
origin = 'Results/PatSim_Testing/dissertation/' # set location to write data to
for d_file in range(0, len(f_loc2)):
data_file = f_loc2[d_file]; df = pd.read_pickle(data_file)
print('\n' + 'FILE: {}\n'.format(data_file))
for dom in domain[1:]:
corpus = []
print('\n' + 'DOMAIN: {}'.format(dom))
if dom == 'all': sub_dir = 'Combined'; var = 'pat_vecs'
if dom == 'cond': sub_dir = 'Conditions'; var = dom + '_vecs'
if dom == 'med': sub_dir = 'Medications'; var = dom + '_vecs'
if dom == 'lab': sub_dir = 'Labs'; var = dom + '_vecs'
# files and titles
f_name = data_file.replace('all', dom).split('/')[-1]
f_type = origin + sub_dir + '/' + f_name + '_'; title_grp = ''
# update databases
test_data = df[['pat_id', 'pat_label', var]].drop_duplicates(keep='first', inplace=False)
test_data = test_data[test_data[var] != '']; test_data = test_data.reset_index(drop=True)
if 'comb' in data_file:
case = test_data[test_data['pat_label'] != 'Rand'].reset_index(drop=True)
ctrl = test_data[test_data['pat_label'] == 'Rand'].sample(n=1000, replace=False).reset_index(drop=True)
test_data = pd.concat([case, ctrl]).reset_index(drop=True)
print(test_data.groupby(by=['pat_label']).size())
# create corpus
for i in range(0, len(test_data)):
corpus.append((test_data['pat_id'][i], test_data[str(var)][i]))
matrix = patient_bow(corpus); test_matrix = matrix[1]; features = matrix[2]
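# NOTE: patient_bow is defined elsewhere in this Gist; based on its use here it is assumed to return a tuple of
# (vectorizer, document-term matrix, feature names). A minimal sketch of such a helper, assuming the scikit-learn
# BoW+TF-IDF approach described above (names and parameters are illustrative only, not the author's exact code):
#   from sklearn.feature_extraction.text import TfidfVectorizer
#   def patient_bow(corpus):
#       ids, docs = zip(*corpus)                      # corpus is a list of (pat_id, concept string) tuples
#       tf = TfidfVectorizer(lowercase=False)         # one "document" per patient; concepts are the tokens
#       matrix = tf.fit_transform(docs)               # sparse patients x concepts TF-IDF matrix
#       return tf, matrix, tf.get_feature_names()     # get_feature_names() matches scikit-learn v0.22
#   # usage: vec, test_matrix, features = patient_bow(corpus)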
print(test_matrix.shape); print('\n')
# perform classification
cf_counts = []; cf_plot = []; cf = [[] for _ in range(10)]
pku_counts = []; pku_plot = []; pku = [[] for _ in range(10)]
sc_counts = []; sc_plot = []; sc = [[] for _ in range(10)]
ch_counts = []; ch_plot = []; ch = [[] for _ in range(10)]
pat_lab = len(test_data['pat_label'][test_data.pat_label != 'Rand'])
for i in range(0, len(test_data)):
index = test_data['pat_id'][i]; cond = test_data['pat_label'][i]
print("*********************" * 5)
print("Running patient: " + str(i + 1) + "/" + str(pat_lab) + "\n")
# CF patients
if cond == 'CF':
sim_res = similarity_test(test_matrix, test_data, index, cond)
results_cf = youden_index(sim_res)
cf_counts.append(
[len([1.0 for x in sim_res['conds'][0:1] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:10] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:20] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:30] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:40] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:50] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:60] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:70] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:80] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:90] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:100] if x == 1.0])])
cf_plot.append(results_cf[0])
cf[0].append(results_cf[1]); cf[1].append(results_cf[2])
cf[2].append(results_cf[3]); cf[3].append(results_cf[4])
res = model_metrics(results_cf[1], results_cf[2], results_cf[3], results_cf[4])
cf[4].append(res[0]); cf[5].append(res[1]); cf[6].append(res[2]); cf[7].append(res[3])
cf[8].append(res[4]); cf[9].append(res[5])
# PKU patients
if cond == 'PKU':
sim_res = similarity_test(test_matrix, test_data, index, cond)
results_pku = youden_index(sim_res)
pku_counts.append(
[len([1.0 for x in sim_res['conds'][0:1] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:10] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:20] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:30] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:40] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:50] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:60] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:70] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:80] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:90] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:100] if x == 1.0])])
pku_plot.append(results_pku[0]); pku[0].append(results_pku[1]); pku[1].append(results_pku[2])
pku[2].append(results_pku[3]); pku[3].append(results_pku[4])
res = model_metrics(results_pku[1], results_pku[2], results_pku[3], results_pku[4])
pku[4].append(res[0]); pku[5].append(res[1]); pku[6].append(res[2])
pku[7].append(res[3]); pku[8].append(res[4]); pku[9].append(res[5])
# SCD patients
if cond == 'SC':
sim_res = similarity_test(test_matrix, test_data, index, cond)
results_sc = youden_index(sim_res)
sc_counts.append(
[len([1.0 for x in sim_res['conds'][0:1] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:10] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:20] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:30] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:40] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:50] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:60] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:70] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:80] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:90] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:100] if x == 1.0])])
sc_plot.append(results_sc[0]); sc[0].append(results_sc[1]); sc[1].append(results_sc[2])
sc[2].append(results_sc[3]); sc[3].append(results_sc[4])
res = model_metrics(results_sc[1], results_sc[2], results_sc[3], results_sc[4])
sc[4].append(res[0]); sc[5].append(res[1]); sc[6].append(res[2])
sc[7].append(res[3]); sc[8].append(res[4]); sc[9].append(res[5])
# CH (congenital hypothyroidism) patients
if cond == 'CH':
sim_res = similarity_test(test_matrix, test_data, index, cond)
results_ch = youden_index(sim_res)
ch_counts.append(
[len([1.0 for x in sim_res['conds'][0:1] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:10] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:20] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:30] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:40] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:50] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:60] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:70] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:80] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:90] if x == 1.0]),
len([1.0 for x in sim_res['conds'][0:100] if x == 1.0])])
ch_plot.append(results_ch[0]); ch[0].append(results_ch[1]); ch[1].append(results_ch[2])
ch[2].append(results_ch[3]); ch[3].append(results_ch[4])
res = model_metrics(results_ch[1], results_ch[2], results_ch[3], results_ch[4])
ch[4].append(res[0]); ch[5].append(res[1]); ch[6].append(res[2]); ch[7].append(res[3])
ch[8].append(res[4]); ch[9].append(res[5])
# write results to disk
pickle.dump(cf_counts, open(str(f_type) + "cf_counts.txt", "wb"))
pickle.dump(cf_plot, open(str(f_type) + "cf_plot.txt", "wb"))
pickle.dump(cf, open(str(f_type) + "cf_res.txt", "wb"))
pickle.dump(pku_counts, open(str(f_type) + "pku_counts.txt", "wb"))
pickle.dump(pku_plot, open(str(f_type) + "pku_plot.txt", "wb"))
pickle.dump(pku, open(str(f_type) + "pku_res.txt", "wb"))
pickle.dump(sc_counts, open(str(f_type) + "sc_counts.txt", "wb"))
pickle.dump(sc_plot, open(str(f_type) + "sc_plot.txt", "wb"))
pickle.dump(sc, open(str(f_type) + "sc_res.txt", "wb"))
pickle.dump(ch_counts, open(str(f_type) + "ch_counts.txt", "wb"))
pickle.dump(ch_plot, open(str(f_type) + "ch_plot.txt", "wb"))
pickle.dump(ch, open(str(f_type) + "ch_res.txt", "wb"))
### STEP 2: Calculate performance metrics
results = {}
for filename in glob.iglob(str(f_type) + '*.txt'):
if '_performance' not in filename:
var = '_'.join(filename.split('_')[-2:]).split('.')[0]
with open(str(filename), "rb") as fp:
results[var] = pickle.load(fp)
print("*" * 25 + '\n' + 'Generating Average Performance Results')
myfile = open(str(f_type) + 'model_performance.txt', 'w')
for dis in results:
if '_res' in dis:
tn = np.mean(results[dis][0]); fp = np.mean(results[dis][1])
fn = np.mean(results[dis][2]); tp = np.mean(results[dis][3])
tot = (tp + fp + fn + tn); obs_acc = (tp + tn) / tot
r = ((tp + fp) * (tp + fn)) / tot; l = ((fn + tn) * (fp + tn)) / tot  # expected chance agreement on positives (r) and negatives (l)
exp_acc = (r + l) / tot; Kappa = (obs_acc - exp_acc) / (1 - exp_acc)
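# Cohen's kappa: kappa = (obs_acc - exp_acc) / (1 - exp_acc), where exp_acc is the accuracy expected by chance,
# i.e. [(tp+fp)(tp+fn) + (fn+tn)(fp+tn)] / tot^2. Illustrative example (hypothetical counts): with tp=40, fp=10,
# fn=10, tn=40 -> obs_acc = 0.8, exp_acc = 0.5, and kappa = (0.8 - 0.5) / (1 - 0.5) = 0.6.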
myfile.write("GROUP: " + str(dis) + "\n")
myfile.write("Mean Specificity [tn/(tn+fp)]: " + str(tn / (tn + fp)) + "\n")
myfile.write("Mean TPR/Recall [tp/(tp+fn)]: " + str(tp / (tp + fn)) + "\n")
myfile.write("Mean Precision [tp/(tp_fp)]: " + str(tp / (tp + fp)) + "\n")
myfile.write("Mean FPR [1-(tn/(tn+fp))]: " + str(1 - (tn / (tn + fp))) + "\n")
myfile.write("Mean Accuracy [(tp+tn)/(tp+fp+fn+tn)]: " + str((tp + tn) / (tp + fp + fn + tn)) + "\n")
myfile.write("Mean F1 Score: " + str((2*tp / (2*tp + fp + fn))) + "\n")
myfile.write("Kappa Score: " + str((obs_acc- exp_acc) / (1 - exp_acc)) + "\n")
myfile.write("\n")
myfile.close()
### STEP 3: Create AUC plots
print("*" * 25 + '\n' + 'Generating AUC Plots')
fig = plt.figure(figsize=(12, 12))
grid = plt.GridSpec(2, 2, hspace=0.2, wspace=0.2)
ax_1 = fig.add_subplot(grid[0, 0])
ax_2 = fig.add_subplot(grid[0, 1], )
ax_3 = fig.add_subplot(grid[1, 0])
ax_4 = fig.add_subplot(grid[1, 1])
plt.suptitle(str(title_grp), fontsize=13)
fig.text(0.075, 0.5, 'True Positive Rate', va="center", rotation="vertical", fontsize=11)
fig.text(0.46, 0.06, 'False Positive Rate', va="center", fontsize=11)
w_file = str(f_type) + 'ROC_plot.png'
plt_ax = [ax_1, ax_2, ax_3, ax_4]; i = 0
for dis_plot in results.keys():
if '_plot' in dis_plot:
plots = results[dis_plot]; title = str(dis_plot.split('_')[0].upper())
tprs, aucs = [], []; base_fpr = np.linspace(0, 1, 101)
for res in plots:
fpr = res[0]; tpr = res[1]; roc_auc = auc(fpr, tpr); aucs.append(roc_auc)
tpr = interp(base_fpr, fpr, tpr); tpr[0] = 0.0; tprs.append(tpr)
tprs = np.array(tprs); mean_tprs = tprs.mean(axis=0); mean_auc = auc(base_fpr, mean_tprs)
std_auc = np.std(aucs); std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tprs + std_tpr, 1); tprs_lower = np.maximum(mean_tprs - std_tpr, 0)
plt_ax[i].plot(base_fpr, mean_tprs, 'b',
label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc), lw=2, alpha=.8)
plt_ax[i].fill_between(base_fpr, tprs_lower, tprs_upper, color='grey', alpha=0.3,
label=r'$\pm$ 1 std. dev.')
plt_ax[i].plot([0, 1], [0, 1], 'r--', label='Random Chance')
plt_ax[i].set_xlim([-0.01, 1.01])
plt_ax[i].set_ylim([-0.01, 1.01])
plt_ax[i].legend(loc="lower right", fontsize=10)
plt_ax[i].set_title(str(title), fontsize=10)
i += 1
plt.savefig(str(w_file), bbox_inches='tight')
plt.close()
### STEP 4: Create TPR plots
print("*" * 25 + '\n' + 'Generating TP Count Plots'); print("Generating ")
fig = plt.figure(figsize=(12, 12))
grid = plt.GridSpec(2, 2, hspace=0.2, wspace=0.2)
ax_1 = fig.add_subplot(grid[0, 0])
ax_2 = fig.add_subplot(grid[0, 1])
ax_3 = fig.add_subplot(grid[1, 0])
ax_4 = fig.add_subplot(grid[1, 1])
plt.suptitle("True Positives by Top N Similar Patients - " + str(title_grp), fontsize=13)
fig.text(0.075, 0.5, '% True Positive', va="center", rotation="vertical", fontsize=11)
fig.text(0.46, 0.06, 'Top N Similar Patients', va="center", fontsize=11)
w_file = str(f_type) + 'TP_counts.png'
plt_ax = [ax_1, ax_2, ax_3, ax_4]; j = 0
for dis_count in results.keys():
if '_counts' in dis_count:
dis = results[dis_count]; title = str(dis_count.split('_')[0].upper())
g10 = []; g20 = []; g30 = []; g40 = []; g50 = []; g60 = []; g70 = []; g80 = []; g90 = []; g100 = []
for person in dis:
g10.append(person[1] / 10.0); g20.append(person[2] / 20.0); g30.append(person[3] / 30.0)
g40.append(person[4] / 40.0); g50.append(person[5] / 50.0); g60.append(person[6] / 60.0)
g70.append(person[7] / 70.0); g80.append(person[8] / 80.0); g90.append(person[9] / 90.0)
g100.append(person[10] / 100.0)
res = g10 + g20 + g30 + g40 + g50 + g60 + g70 + g80 + g90 + g100
labels = [[x for _ in range(len(g10))] for x in [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]]
df_plot = pd.DataFrame({'data': res, 'group': [item for sublist in labels for item in sublist]})
grouped = df_plot.groupby('group'); names, vals, xs = [], [], []
# plot results
for i, (name, subdf) in enumerate(grouped):
names.append(name); vals.append(subdf['data'].tolist())
xs.append(np.random.normal(i + 1, 0.04, subdf.shape[0]))
ngroup = len(vals); clevels = np.linspace(0., 1., ngroup)
bp = plt_ax[j].boxplot(vals, labels=names, patch_artist=True)
for box in bp['boxes']:
box.set(color='mediumseagreen', linewidth=3) # change outline color
box.set(facecolor='mediumaquamarine', alpha=0.3) # change fill color
plt.setp(bp['whiskers'], color='black')
plt.setp(bp['fliers'], color='black')
plt.setp(bp['means'], color='black')
plt.setp(bp['caps'], color='black')
plt.setp(bp['medians'], color='darkcyan', linewidth=2.5)
# add scatter with jitter to plot
for x, val, clevel in zip(xs, vals, clevels): plt_ax[j].scatter(x, val, c='gray', alpha=0.5)
plt_ax[j].set_ylim([-0.01, 1.01]); plt_ax[j].set_title(str(title), fontsize=10); j += 1
plt.savefig(str(w_file), bbox_inches='tight')
plt.close()
#####################################################################################################
#### Generate t-SNE Plots ####
#####################################################################################################
domain = ['cond', 'med', 'lab', 'all']
f_loc2 = ['Data/PatSim_test/case_all_code_df', 'Data/PatSim_test/comb_all_code_df', 'Data/PatSim_test/case_all_text_df', 'Data/PatSim_test/combo_all_text_df', 'Data/PatSim_test/case_all_code_B_filtered_df', 'Data/PatSim_test/comb_all_code_B_filtered_df', 'Data/PatSim_test/case_all_text_B_filtered_df', 'Data/PatSim_test/comb_all_text_B_filtered_df', 'Data/PatSim_test/case_all_code_A_filtered_df', 'Data/PatSim_test/comb_all_code_A_filtered_df', 'Data/PatSim_test/case_all_text_A_filtered_df', 'Data/PatSim_test/comb_all_text_A_filtered_df']
origin = 'Results/PatSim_Testing/dissertation/'
for d_file in range(0, len(f_loc2)):
data_file = f_loc2[d_file]
print('\n' + 'FILE: {}'.format(data_file))
df = pd.read_pickle(data_file)
for dom in domain:
corpus = []
print('\n' + 'DOMAIN: {}'.format(dom))
### STEP1: Prepare data and file names
if dom == 'all': sub_dir = 'Combined'; var = 'pat_vecs'
if dom == 'cond': sub_dir = 'Conditions'; var = dom + '_vecs'
if dom == 'med': sub_dir = 'Medications'; var = dom + '_vecs'
if dom == 'lab': sub_dir = 'Labs'; var = dom + '_vecs'
# files and titles
f_name = data_file.replace('all', dom).split('/')[-1]
f_type = origin + sub_dir + '/' + f_name + '_'; title_grp = ''
# update databases
test_data = df[['pat_id', 'pat_label', 'pat_lab_num', var]].drop_duplicates(keep='first', inplace=False)
test_data = test_data[test_data[var] != '']; test_data = test_data.reset_index(drop=True)
# create corpus
for i in range(0, len(test_data)): corpus.append((test_data['pat_id'][i], test_data[str(var)][i]))
matrix = patient_bow(corpus); test_matrix = matrix[1]; features = matrix[2]
# format file names
tsne_file = origin + sub_dir + '/Data/' + f_name + '_' + 'tsne.npy'
tsne_title = ""; clustering_title = ""
plot_file = origin + sub_dir + '/Figures/' + f_name + '_' + 'tsne.png'
clust_k_file = origin + sub_dir + '/Clustering/' + f_name + '_' + 'kmeans_elbow.png'
clustering_file = origin + sub_dir + '/Figures/' + f_name + '_' + 'CLUSTERS.png'
clustering_metrics = origin + sub_dir + '/Clustering/' + f_name + '_' + 'CLUSTERING_Metrics.txt'
f_type_clust = origin + sub_dir + '/Clustering/' + f_name + '_'
### STEP 2: Reduce dimensions of matrices
# X_embedded = np.load(tsne_file)  # uncomment to reload previously saved embeddings instead of recomputing them
X_reduced = TruncatedSVD(n_components=50, random_state=1).fit_transform(test_matrix)
X_embedded = TSNE(n_components=2, random_state=1, verbose=True, perplexity=25.0).fit_transform(X_reduced)
np.save(tsne_file, X_embedded) # X_embedded = np.load("../Data/tSNE_embeddings.npy")
### STEP 3: Create t-SNE plots
colors = {0: '#00C0A3', 1: '#4B4453', 3: '#009EFA', 4: '#FF8066', 99: 'goldenrod'}
names = {0: 'Phenylketonuria', 1: 'Congenital Hypothyroidism', 3: 'Sickle Cell', 4: 'Cystic Fibrosis', 99: 'Control'}  # keyed on pat_lab_num, matching the colors dict above
pku = mpatches.Patch(color='#00C0A3', label='Phenylketonuria')
ch = mpatches.Patch(color='#4B4453', label='Congenital Hypothyroidism')
scd = mpatches.Patch(color='#009EFA', label='Sickle Cell')
cf = mpatches.Patch(color='#FF8066', label='Cystic Fibrosis')
rnd = mpatches.Patch(color='goldenrod', label='Control')
# create data frame that has the result of the t-SNE embedding plus the group numbers and titles
df_tsne = pd.DataFrame(dict(x=X_embedded[:, 0], y=X_embedded[:, 1], group=list(test_data['pat_lab_num'])))
groups = df_tsne.groupby('group'); fig, ax = plt.subplots(figsize=(12, 8))
for name, group in groups:
if name == 99: ax.plot(group.x, group.y, marker='o', linestyle='', ms=5, label=names[name], color=colors[name], mec='whitesmoke', alpha=0.9)
for name, group in groups:
if name != 99: ax.plot(group.x, group.y, marker='o', linestyle='', ms=6, label=names[name], color=colors[name], mec='whitesmoke', alpha=0.7)
if 'comb' not in data_file: plt.legend(handles=[cf, pku, ch, scd], fontsize=12, frameon=False, loc="lower center", ncol=4)
else: plt.legend(handles=[cf, pku, ch, scd, rnd], fontsize=10.5, frameon=False, loc="lower center", ncol=5)
ax.tick_params(labelsize=16)
if 'comb' not in data_file: n = 100; plt.ylim(-n, n); plt.xlim(-n, n)
else: n = 125; plt.ylim(-n, n); plt.xlim(-n, n)
plt.savefig(str(plot_file), bbox_inches='tight')
plt.show(); plt.close()
#####################################################################################################
#### Perform K-Means Clustering ####
#####################################################################################################
# https://www.tutorialspoint.com/scikit_learn/scikit_learn_clustering_performance_evaluation.htm
### STEP 1: Create list to store labels
labels_real = [int(x) for x in test_data['pat_lab_num']]; labels_true = []
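# remap the original disease label numbers (0=PKU, 1=CH, 3=SCD, 4=CF, 99=control) to the contiguous values 0-4
# expected by the clustering metrics below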
for l in labels_real:
if l == 0 or l == 1: labels_true.append(l)
if l == 3: labels_true.append(2)
if l == 4: labels_true.append(3)
if l == 99: labels_true.append(4)
# elbow method to help pick k
# scaled_data = scale(X_embedded, with_mean=False); cluster_range = range(1, 10); cluster_errors = []
# for num_clusters in cluster_range:
# clusters = KMeans(num_clusters).fit(scaled_data); cluster_errors.append(clusters.inertia_)
# clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors})
# plt.figure(figsize=(9, 6))
# plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker="o")
# plt.title('Elbow Method - Determining the Optimal K')
# plt.savefig(clust_k_file, bbox_inches='tight'); plt.show(); plt.close()
### STEP 2: Run K-Means
scaled_data = scale(X_embedded, with_mean=False)  # scale the t-SNE embeddings before clustering (mirrors the commented elbow-method step above)
k = 4 if 'comb' not in data_file else 5
# kmeans_model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=100).fit(scaled_data)
kmeans_model = KMeans(n_clusters=k, init='k-means++', max_iter=1000, random_state=0).fit(scaled_data)
labels = [int(x) for x in kmeans_model.labels_.tolist()]
### STEP 3: Derive clustering metrics
ar = round(adjusted_rand_score(labels_true, labels), 3)
anmi = round(adjusted_mutual_info_score(labels_true, labels), 3)
pur = round(purity_score(labels_true, labels), 3)
with open(clustering_metrics, 'w') as out:
out.write('CLUSTERING: ' + f_name + '\n\n')
out.write('Adjusted Rand Score: ' + str(ar) + '\n')
out.write('Adjusted Normalized Mutual Information: ' + str(anmi) + '\n')
out.write('Purity: ' + str(pur) + '\n')
print('Adjusted Rand Score: ' + str(ar))
print('Adjusted Normalized Mutual Information: ' + str(anmi))
print('Purity: ' + str(pur) + '\n')
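# NOTE: purity_score is defined elsewhere in this Gist. Purity assigns each cluster to its most frequent true label
# and reports the fraction of correctly assigned patients; a minimal sketch, assuming a contingency-matrix based
# implementation (names are illustrative only):
#   from sklearn.metrics import confusion_matrix
#   import numpy as np
#   def purity_score(y_true, y_pred):
#       contingency = confusion_matrix(y_true, y_pred)            # rows = true labels, columns = cluster labels
#       return np.sum(np.amax(contingency, axis=0)) / np.sum(contingency)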
# store clustering results in df
data = pd.DataFrame(dict(x=scaled_data[:, 0], y=scaled_data[:, 1], cluster=list(labels)))
### STEP 4: Plot clustering results
label1 = {0: 'red', 10: 'orchid', 3: 'lightcoral', 2: 'darkorange', 4: 'gray', 5: 'saddlebrown', 6: 'blue',
11: 'turquoise', 8: 'dodgerblue', 9: 'forestgreen', 1: 'lightseagreen', 7: 'darkmagenta',
12: 'deeppink', 13: 'rosybrown', 14: "darkviolet", 15: 'lime', 16: 'mediumvioletred'}
color = [label1[i] for i in labels]
plt.figure(figsize=(12, 8))
for i, label in enumerate(labels):
plt.scatter(x=data.loc[data['cluster'] == label, 'x'], y=data.loc[data['cluster'] == label, 'y'],
alpha=0.6, marker='o', color=color[i], s=40, edgecolor='whitesmoke')
plt.annotate(label, data.loc[data['cluster'] == label, ['x', 'y']].mean(),
horizontalalignment='center', verticalalignment='center', size=20, color='ghostwhite',
weight='bold', bbox=dict(boxstyle="round", fc=color[i], edgecolor='ghostwhite', alpha=0.4))
plt.ylim(-3.5, 3.5); plt.xlim(-3.5, 3.5); plt.tick_params(labelsize=16)
plt.savefig(str(clustering_file), bbox_inches='tight'); plt.show(); plt.close()
### STEP 5: Obtain patient- and concept-level information for each cluster
same_conds = {}
for pat in range(0, len(data)):
feats = top_feats_in_doc2(test_data, test_matrix, features, pat, top_n=10)
cluster = str(data.iloc[pat]['cluster'])
id = test_data.iloc[pat]['pat_id']; poc = test_data.iloc[pat]['pat_lab_num']
if poc == 0 or poc == 1: cond = poc
if poc == 3: cond = 2
if poc == 4: cond = 3
if poc == 99: cond = 4
features2 = list(feats[feats.columns[0]]); feat_count = list(feats[feats.columns[1]])
# filter out features with tf-idf of 0.000
idx = len([x for x in feat_count if x != 0.000000]); features2 = features2[0:idx]
if cluster in same_conds.keys():
same_conds[cluster]['id'].append(id); same_conds[cluster]['feature'] += features2
same_conds[cluster]['feat_count'] += feat_count; same_conds[cluster]['cond'].append(cond)
same_conds[cluster]['cluster_cond'].append(names[poc])
else:
same_conds[cluster] = {}; same_conds[cluster]['id'] = [id]
same_conds[cluster]['feature'] = features2; same_conds[cluster]['feat_count'] = feat_count
same_conds[cluster]['cond'] = [cond]; same_conds[cluster]['cluster_cond'] = [names[poc]]
### STEP 6: Output clustering information
np.save(f_type_clust + 'Clustering_results.txt', same_conds)
## write out statistics for all patient concepts by cluster
myfile = open(f_type_clust + 'Clustering.txt', 'w'); unq = {}
for cluster in same_conds.keys():
cnt = collections.Counter(same_conds[cluster]['feature'])
sort_cnt = sorted(cnt.items(), key=itemgetter(1), reverse=True)
unq[cluster] = sort_cnt
myfile.write("==" * 20 + "\n")
myfile.write("CLUSTER: " + str(cluster) + "\n")
myfile.write("Number of patients: " + str(len(same_conds[cluster]['id'])) + "\n")
myfile.write("Number of concepts: " + str(len(sort_cnt)) + "\n")
myfile.write("Patient Conditions:\n")
dx_list = same_conds[cluster]['cluster_cond']; clust_keys = set(dx_list); tot = len(dx_list)
for p in sorted(clust_keys):
frac = round((dx_list.count(p)/float(tot))*100, 2)
myfile.write("\t\t - {}: {} ({}%)\n".format(p, str(dx_list.count(p)), str(frac)))
myfile.write("*" * 10 + "\n")
for x in sort_cnt[0:40]:
myfile.write(' - '.join((x[0], str(x[1]))) + "\n")
myfile.write("==" * 20 + "\n")
myfile.close()
## write out statistics for only those concepts unique to each patient cluster
myfile2 = open(f_type_clust + 'Clustering_unique.txt', 'w')
for i in unq:
all = []; unq_res = []; start = {x[0]: x[1] for x in unq[i]}
for x in unq.keys():
if x != i:
for res in unq[x]:
all.append(res[0])
unq_res = sorted({a: start[a] for a in start.keys() if a not in all}.items(),
key=itemgetter(1), reverse=True)
myfile2.write("==" * 20 + "\n")
myfile2.write("CLUSTER: " + str(i) + "\n")
myfile2.write("Number of concepts: " + str(len(unq[i])) + "\n")
myfile2.write("Number of unique concepts: " + str(len(unq_res)) + "\n")
myfile2.write("*" * 10 + "\n")
for y in unq_res:
myfile2.write(' - '.join((y[0], str(y[1]))) + "\n")
myfile2.write("==" * 20 + "\n")
myfile2.close()
###########################################################################################
# Purpose: this script queries an OMOP database via Google BigQuery and returns de-duplicated patient data
# version 1.0.0
###########################################################################################
import datetime
import google_api
import os
import pandas as pd
import re
from progressbar import ProgressBar, FormatLabel, Percentage, Bar
def data_query(query, project=None):
"""
Function runs a query in the form of a string against an OMOP database via Google BigQuery and returns the
results as a list of lists.
:param query: A string that encodes a query.
:param project: A string specifying the project in Google Cloud Platform to use.
:return: data: A list of lists of results from a query against a Google BigQuery instance.
"""
if project is None: raise ValueError('Please provide a valid GCP Project')
# CHECK - file has data
if len(query) == 0: raise ValueError('input query is empty')
else: return google_api.GBQData(query, project)
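# example usage (hypothetical query and project name):
#   rows = data_query("SELECT person_id, condition_concept_id FROM condition_occurrence LIMIT 10",
#                     project='my-gcp-project')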
def duplicate_identifier(patient_list, decision, proc='NO'):
"""
Function takes a list of patients and their assigned concepts, builds a pandas DataFrame of patient ids and
condition labels, and, when requested, drops patients that carry more than one disease label.
:param patient_list: A list of patients and assigned concepts.
:param decision: A string ('YES'/'NO') indicating whether patients with co-occurring disease labels should be dropped.
:param proc: A string ('YES'/'NO') indicating whether condition start dates and age at diagnosis are included.
:return:
pat_labels: A pandas DataFrame with one row per patient containing identifiers and condition labels.
"""
if proc == 'YES':
pat_labels = pd.DataFrame(dict(pat_id=[id[0] for id in patient_list],
pat_vis_num=[id[1] for id in patient_list],
pat_label=[id[2] for id in patient_list],
pat_lab_num=[id[3] for id in patient_list],
cond_start=[id[4] for id in patient_list],
age_at_dx=[id[5] for id in patient_list],))
else:
pat_labels = pd.DataFrame(dict(pat_id=[id[0] for id in patient_list],
pat_vis_num=[id[1] for id in patient_list],
pat_label=[id[2] for id in patient_list],
pat_lab_num=[id[3] for id in patient_list]))
# identify and drop patients that have more than 1 disease label
if decision == 'YES':
pat_labels.drop_duplicates(subset=['pat_id'], keep=False, inplace=True)
pat_labels = pat_labels.reset_index(drop=True)
# CHECK - file has data
if len(pat_labels) == 0:
raise ValueError('There are 0 valid patient IDs detected')
else:
# print number of patients
print("There are " + str(len(pat_labels)) + " patients")
return pat_labels
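# example usage (hypothetical rows; each inner list is [pat_id, visit_count, disease_label, label_number]):
#   pat_labels = duplicate_identifier([[123, 4, 'CF', 4], [456, 2, 'PKU', 0]], decision='YES')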
def key_matcher(labels, labeled_data):
"""Function verifies that every code has a label."""
keys = list(set(labels['pat_id']).intersection(set(labeled_data[0])))
drop = [labels.index[labels.pat_id == i][0] for i in labels["pat_id"] if i not in keys]
labels.drop(labels.index[labels.index[drop]], inplace=True)
# ensure conditions has the same number of entries as labels
for i in labeled_data[0][:]:
if i not in keys: idx = labeled_data[0].index(i); del labeled_data[0][idx]; del labeled_data[1][idx]
if len(labels) != len(labeled_data[0]) or len(labels) != len(labeled_data[1]):
raise ValueError('The number of patients with labels and patient conditions is not the same')
else:
print('There are {} patients'.format(len(labels)))
labels = labels.reset_index(drop=True)
return labels, labeled_data
def patient_concepts2_new(concept_list, patients, unique, source, d_type = 'OTHER'):
"""
Function takes a list of lists, where each list represents the result of a GBQ query as well as a pandas data frame
of non-duplicated patient ids. By non-duplicated, we mean that only those identifiers that remain in the
non-duplicated list are eligible for being carried forward when added with additional patient data. Using this
information the function creates a dictionary where keys are patients and values are lists of patient dx codes
and a list of unique concept codes.
:param concept_list: A list of lists, where each list represents the result of a GBQ query.
:param patients: A pandas data frame of non-duplicated patient ids.
:param unique: A string ('YES'/'NO') indicating whether rows should be restricted to patients in the non-duplicated list.
:param source: A string ('YES'/'NO') indicating whether the source concept text ('YES') or the concept id ('NO') is used.
:param d_type: A string indicating the clinical data type ('COND', 'MED', 'LAB', or 'OTHER') used to parse source text.
:return:
patient_data: A dictionary where keys are patient ids and values are space-delimited strings of concepts.
vocab: A list of all concept values encountered (used to build the vocabulary).
patient_cond_date: A dictionary keyed by patient id containing the condition date returned by the query.
"""
# initialize progress bar
widgets = [Percentage(), Bar(), FormatLabel('(elapsed: %(elapsed)s)')]
pbar = ProgressBar(widgets=widgets, maxval=len(concept_list))
patient_data = {}; patient_cond_date = {}; vocab = []
for row in pbar(concept_list):
if unique == 'YES':
if row[0] in list(patients['pat_id']):
key = row[0]
patient_cond_date[key] = row[4]
if source == 'YES':
if d_type == 'MED': val = row[3].split(" | ")[-1].strip()
if d_type == 'COND': val = re.sub(r'(\W+)', '_', row[3].split('|')[0].strip()).strip('_')
if d_type == 'LAB':
f_grp = row[3].split('|')
if len(f_grp) == 3: val = re.sub(r'(\W+)', '_', f_grp[1].strip()).strip('_')
else: val = re.sub(r'(\W+)', '_', f_grp[0].strip()).strip('_')
if source == 'NO': val = row[1]
if key in patient_data.keys(): patient_data[key] = " ".join((patient_data[key], str(val)))
else: patient_data[key] = str(val)
vocab.append(str(val))
pbar.finish()  # close progress bar
print("There are {} patients and {} unique concepts".format(len(patient_data.keys()), len(set(vocab))))
return patient_data, vocab, patient_cond_date
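# example usage (hypothetical inputs): build a patient -> concept-string dictionary from a GBQ condition query
#   cond_data, cond_vocab, cond_dates = patient_concepts2_new(cond_rows, pat_labels, unique='YES',
#                                                             source='YES', d_type='COND')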
def patient_concepts3_rand(concept_list, patients, unique, d_type = 'OTHER'):
"""
Function takes a list of lists, where each list represents the result of a GBQ query as well as a pandas data frame
of non-duplicated patient ids. By non-duplicated, we mean that only those identifiers that remain in the
non-duplicated list are eligible for being carried forward when added with additional patient data. Using this
information the function creates a dictionary where keys are patients and values are lists of patient dx codes
and a list of unique concept codes.
:param concept_list: A list of lists, where each list represents the result of a GBQ query.
:param patients: A pandas data frame of non-duplicated patient ids.
:param unique: A string ('YES'/'NO') indicating whether rows should be restricted to patients in the non-duplicated list.
:param d_type: A string indicating the clinical data type ('COND', 'MED', 'LAB', or 'OTHER') used to parse source text.
:return:
patient_data: A dictionary where keys are patient ids and values are space-delimited strings of concepts.
vocab: A list of all concept values encountered (used to build the vocabulary).
"""
# initialize progress bar
widgets = [Percentage(), Bar(), FormatLabel('(elapsed: %(elapsed)s)')]
pbar = ProgressBar(widgets=widgets, maxval=len(concept_list))
patient_data = {}; vocab = []
for row in pbar(concept_list):
if unique == 'YES':
if row[0] in list(patients['pat_id']):
key = row[0]
if d_type == 'MED': val = re.sub(r'(\W+)', '_', row[3].split('|')[0].strip()).strip('_')
if d_type == 'COND' and row[3].split('|')[0] != " ": val = re.sub(r'(\W+)', '_', row[3].split('|')[0].strip()).strip('_')
if d_type == 'LAB':
f_grp = row[3].split('|')
if len(f_grp) == 3: val = re.sub(r'(\W+)', '_', f_grp[1].strip()).strip('_')
else: val = re.sub(r'(\W+)', '_', f_grp[0].strip()).strip('_')
if row[3].split('|')[0] != " ":
if key in patient_data.keys(): patient_data[key] = " ".join((patient_data[key], str(val)))
else: patient_data[key] = str(val)
vocab.append(str(val))
pbar.finish()  # close progress bar
print("There are {} patients and {} unique concepts".format(len(patient_data.keys()), len(set(vocab))))
return patient_data, vocab
def patient_concepts(concept_list, patients, case, code, d_type):
"""
Function takes output from GBQ and turns it into a pandas data frame.
:param concept_list: A list of lists, where each list represents the result of a GBQ query.
:param patients: A pandas data frame of non-duplicated patient ids.
:param case: A string ('CASE' or 'RAND') indicating whether the patients are labeled cases or random controls.
:param code: A string ('CODE' or 'TEXT') indicating whether concept codes or source concept text should be returned.
:param d_type: A string indicating the clinical data type ('COND', 'MED', or 'LAB') used to parse source text.
:return:
dat: A Pandas DataFrame containing processed results.
"""
widgets = [Percentage(), Bar(), FormatLabel('(elapsed: %(elapsed)s)')]
pbar = ProgressBar(widgets=widgets, maxval=len(concept_list))
codes=[];source=[];pats=[];cond_start=[];cond_date=[];visit=[]
# LABELED PATIENTS - CONDITIONS
if case == 'CASE' and code == 'TEXT' and d_type == 'COND':
for row in pbar(concept_list):
if row[0] in list(patients['pat_id']) and (row[6] is not None) and (row[3].split('|')[0] != " "):
pats.append(row[0]); cond_start.append(row[4]); cond_date.append(row[5]); visit.append(row[6])
if d_type == 'MED':
source.append(re.sub(r'(\W+)', '_', row[3].split('|')[0].strip()).strip('_'))
if d_type == 'COND' and row[3].split('|')[0] != " ":
source.append(re.sub(r'(\W+)', '_', row[3].split('|')[0].strip()).strip('_'))
if d_type == 'LAB':
if len(row[3].split('|')) == 3:
f_grp = ' '.join(row[3].split('|')[1:])
source.append(re.sub(r'(\W+)', '_', f_grp.strip()).strip('_'))
else:
f_grp = ' '.join(row[3].split('|'))
source.append(re.sub(r'(\W+)', '_', f_grp.strip()).strip('_'))
pat_data = pd.DataFrame(dict(pat_id=pats, concept_text=source, cond_start_date=cond_start, cond_date=cond_date, visit_id=visit))
if case == 'CASE' and code == 'CODE' and d_type == 'COND':
for row in pbar(concept_list):
if row[0] in list(patients['pat_id']) and row[6] is not None:
pats.append(row[0]); cond_start.append(row[4]); cond_date.append(row[5]); visit.append(row[6])
codes.append(str(row[1]))
pat_data = pd.DataFrame(dict(pat_id=pats, concept_code=codes, cond_start_date=cond_start, cond_date=cond_date, visit_id=visit))
# LABELED PATIENTS - NOT CONDITIONS
if case == 'CASE' and code == 'TEXT' and d_type != 'COND':
for row in pbar(concept_list):
if row[0] in list(patients['pat_id']) and (row[4] is not None) and (row[3].split('|')[0] != " "):
pats.append(row[0]); visit.append(row[4])
if d_type == 'MED': source.append(re.sub(r'(\W+)', '_', row[3].split('|')[0].strip()).strip('_'))
if d_type == 'COND' and row[3].split('|')[0] != " ": source.append(re.sub(r'(\W+)', '_', row[3].split('|')[0].strip()).strip('_'))
if d_type == 'LAB':
if len(row[3].split('|')) == 3:
f_grp = ' '.join(row[3].split('|')[1:])
source.append(re.sub(r'(\W+)', '_', f_grp.strip()).strip('_'))
else:
f_grp = ' '.join(row[3].split('|'))
source.append(re.sub(r'(\W+)', '_', f_grp.strip()).strip('_'))
pat_data = pd.DataFrame(dict(pat_id=pats, concept_text=source, visit_id=visit))
if case == 'CASE' and code == 'CODE' and d_type != 'COND':
for row in pbar(concept_list):
if row[0] in list(patients['pat_id']) and row[4] is not None:
pats.append(row[0]); visit.append(row[4])
codes.append(str(row[1]))
pat_data = pd.DataFrame(dict(pat_id=pats, concept_code=codes, visit_id=visit))
# RANDOM PATIENTS
if case == 'RAND' and code == 'TEXT':
for row in pbar(concept_list):
if row[0] in list(patients['pat_id']) and (row[3].split('|')[0] != " "):
pats.append(row[0])
if d_type == 'MED':
source.append(re.sub(r'(\W+)', '_', row[3].split('|')[0].strip()).strip('_'))
if d_type == 'COND' and row[3].split('|')[0] != " ":
source.append(re.sub(r'(\W+)', '_', row[3].split('|')[0].strip()).strip('_'))
if d_type == 'LAB':
if len(row[3].split('|')) == 3:
f_grp = ' '.join(row[3].split('|')[1:])
source.append(re.sub(r'(\W+)', '_', f_grp.strip()).strip('_'))
else:
f_grp = ' '.join(row[3].split('|'))
source.append(re.sub(r'(\W+)', '_', f_grp.strip()).strip('_'))
pat_data = pd.DataFrame(dict(pat_id=pats, concept_text=source))
if case == 'RAND' and code == 'CODE':
for row in pbar(concept_list):
if row[0] in list(patients['pat_id']): pats.append(row[0]); codes.append(str(row[1]))
pat_data = pd.DataFrame(dict(pat_id=pats, concept_code=codes))
dat = pat_data.merge(patients, left_on='pat_id', right_on='pat_id', how='left').reset_index(drop=True)
pbar.finish()  # close progress bar
return dat
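# example usage (hypothetical inputs): build a per-visit concept DataFrame for labeled (case) patients
#   cond_df = patient_concepts(cond_rows, pat_labels, case='CASE', code='TEXT', d_type='COND')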
def patient_demo(concept_list, patients):
"""Function takes output from GBQ and turns it into a Pandas DataFrame."""
widgets = [Percentage(), Bar(), FormatLabel('(elapsed: %(elapsed)s)')]
pbar = ProgressBar(widgets=widgets, maxval=len(concept_list))
dob=[];gender=[];pats=[];race=[]
for row in pbar(concept_list):
# d = re.sub(' ', '_', re.sub(',', ' ', str(row[1].strftime('%b %d,%Y'))))
d = str(row[1].strftime('%b %d,%Y')).split(',')[-1]
if row[0] in list(patients['pat_id']):
pats.append(row[0])
dob.append(d)
gender.append(str(row[2]).lower())
race.append(re.sub(' ', '_', str(row[4]).lower()))
pat_data = pd.DataFrame(dict(pat_id=pats, dob=dob, gender=gender, race=race))
pbar.finish()  # close progress bar
return pat_data