N3C machine learning introduction

This machine learning example was developed for the paper "The National COVID Cohort Collaborative: Clinical Characterization and Early Severity Prediction" currently available as a preprint.

Although the code is available in the N3C Unite Palantir platform, for faster code sharing, I have made the following publicly accessible exports of the workbooks:

  • Build ML Dataset - Pulls in many variables curated as part of the Cohort paper and does additional curation.
  • scikit-ML - Splits the dataset into inputs and outcomes, with seasonal versions
  • xgboost-ML - Grid search over XGBoost hyperparameters
  • Unsupervised ML - UMAP and PCA analysis of data
  • cohort-ML - Final training and testing results for paper

Overview

The following text is an excerpt from our preprint paper linked above.

To demonstrate the utility of the N3C cohort for analytics, we used machine learning (ML) to predict clinical severity and risk factors over time. Using 64 inputs available on the first hospital day, we predicted a severe clinical course (death, discharge to hospice, invasive ventilation, or extracorporeal membrane oxygenation) using random forest and XGBoost models (AUROC 0.86 and 0.87 respectively) that were stable over time. The most powerful predictors in these models are patient age and widely available vital sign and laboratory values.

We developed models to predict patient-specific maximum clinical severity: hospitalization with death, discharge to hospice, invasive mechanical ventilation, or extracorporeal membrane oxygenation (ECMO) versus hospitalization without any of those. To avoid immortal time bias, we only included patients with at least one hospital overnight. We split the hospitalized laboratory-confirmed positive cohort into randomly selected 70% training and 30% testing cohorts stratified by outcome proportions and held out the testing set. We chose a broad set of potential predictors present for at least 15% of the training set (Supplemental Table 2). The input variables are the most abnormal value on the first calendar day of the hospital encounter. When patients did not have a laboratory test value on the first calendar day, we imputed normal values for specialized labs (e.g. ferritin, procalcitonin) and the median cohort value for common labs (e.g. sodium, albumin) (Supplemental Table 2). We compared several analytical approaches with varying flexibility and interpretability: logistic regression +/- L1 and L2 penalty, random forest, support vector machines, and XGBoost (github.com/dmlc/xgboost).
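As a rough illustration of the split and imputation steps described above (this is not the exact workbook transform; column names and imputation values are placeholders), the logic looks roughly like this with pandas and scikit-learn:

```python
# Sketch of the 70/30 stratified split and simple imputation described above.
# Column names ("outcome", "ferritin", "sodium", ...) are illustrative placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_impute(df: pd.DataFrame):
    X = df.drop(columns=["outcome"])
    y = df["outcome"]

    # 70% training / 30% testing, stratified so outcome proportions match
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=42
    )

    # Specialized labs: impute a clinically normal value (example values only)
    normal_values = {"ferritin": 100.0, "procalcitonin": 0.05}
    # Common labs: impute the training-set median
    median_labs = ["sodium", "albumin"]

    X_train = X_train.fillna(normal_values)
    X_test = X_test.fillna(normal_values)
    medians = X_train[median_labs].median()
    X_train[median_labs] = X_train[median_labs].fillna(medians)
    X_test[median_labs] = X_test[median_labs].fillna(medians)

    return X_train, X_test, y_train, y_test
```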

We internally validated models and limited overfitting using 5-fold cross-validation and evaluated models using the testing set and area under the receiver operator characteristic (AUROC) as the primary metric. Secondary metrics included precision/positive predictive value, recall/sensitivity, specificity, and F1-measure. Because SARS-CoV-2 outcomes have improved over time, we evaluated model performance overall and for March-May 2020 and June-October 2020. See Supplemental Methods.
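A minimal sketch of how these metrics can be computed with scikit-learn, assuming a fitted `model` and the held-out `X_test`/`y_test` from the split above:

```python
# AUROC as the primary metric, plus precision, recall/sensitivity,
# specificity, and F1 as secondary metrics.
from sklearn.metrics import (
    roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix
)

def evaluate(model, X_test, y_test):
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        "auroc": roc_auc_score(y_test, y_prob),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),   # sensitivity
        "specificity": tn / (tn + fp),
        "f1": f1_score(y_test, y_pred),
    }
```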

We developed several models that accurately predict a severe clinical course using data from the first hospital calendar day. The models with the best discrimination of severe versus non-severe clinical course were built using XGBoost (AUROC 0.87) and random forest (AUROC 0.86). Both are flexible nonlinear tree-based models that provide interpretability with a variable importance metric. Importantly, discrimination by the two models was stable over time (March-May 2020 and June-October 2020, Supplemental Table 6). This indicates that the models did not train on health care processes only typical during the pandemic’s chaotic first wave. Commonly collected variables (age, SpO2, RR, blood urea nitrogen, systolic blood pressure, and aspartate aminotransferase) were among the inputs with the highest variable importance for both models.

Categorical variables were converted to k-1 dummy variables using Pandas’ get_dummies (one-hot encoding). For logistic regression and support vector machines, numeric variables were centered to mean zero with unit variance using scikit-learn’s StandardScaler. Optimal model-specific hyperparameters were selected with a grid search performed using scikit-learn’s GridSearchCV with 5-fold cross-validation on the training set and AUROC as the scoring metric. Each grid search ran over multiple iterations: categorical settings such as the solver, first coarse settings for numeric parameters on a logarithmic scale, and then finer settings around the values found to perform best.
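A sketch of this coarse-then-fine grid search, using XGBoost as the example estimator; the parameter names and values below are illustrative, not the grids actually used in the workbooks:

```python
# Two-stage grid search: coarse logarithmic grid first, then a finer grid
# centered on the best coarse values. Uses X_train/y_train from the split above.
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# First pass: coarse, roughly logarithmic grid (values are illustrative)
coarse_grid = {
    "max_depth": [3, 6, 9],
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "n_estimators": [100, 300, 1000],
}
search = GridSearchCV(
    XGBClassifier(),
    coarse_grid,
    scoring="roc_auc",   # AUROC as the scoring metric
    cv=5,                # 5-fold cross-validation on the training set
    n_jobs=-1,
)
search.fit(X_train, y_train)

# Second pass: finer grid around the best coarse learning rate
best_lr = search.best_params_["learning_rate"]
fine_grid = {"learning_rate": [best_lr / 2, best_lr, best_lr * 2]}
fine_search = GridSearchCV(search.best_estimator_, fine_grid,
                           scoring="roc_auc", cv=5, n_jobs=-1)
fine_search.fit(X_train, y_train)
```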

Variables

This table shows the 42 categories of 64 input variables for the machine learning models. The worst value for each variable on the first calendar day of hospital admission was used. We defined the worst value as the lowest value for diastolic blood pressure, hemoglobin, pH, platelet count, SpO2, and systolic blood pressure; for the remainder, we used the highest value. NTproBNP = N-Terminal-prohormone B-type Natriuretic Peptide. See the "missing_data_info_all_cols" transform for the code that performs this calculation.
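As a simplified pandas sketch of this worst-value logic (the actual implementation lives in the "missing_data_info_all_cols" transform; the column and variable names below are placeholders):

```python
# Pick the "worst" day-1 value per patient and variable: the minimum for a
# small set of variables, the maximum for everything else.
import pandas as pd

LOWEST_IS_WORST = {"diastolic_bp", "hemoglobin", "ph", "platelet_count",
                   "spo2", "systolic_bp"}

def worst_values(day1: pd.DataFrame) -> pd.DataFrame:
    """day1: one row per (person_id, variable, value) measured on hospital day 1."""
    agg = (day1.groupby(["person_id", "variable"])["value"]
               .agg(["min", "max"])
               .reset_index())
    agg["worst"] = agg.apply(
        lambda r: r["min"] if r["variable"] in LOWEST_IS_WORST else r["max"], axis=1
    )
    # Pivot to one row per patient, one column per input variable
    return agg.pivot(index="person_id", columns="variable", values="worst")
```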

Variable Summary

Feature importance by model: Scikit-learn provides a common API to extract feature importance from a model, and each ML method has its own algorithm for determining it. For XGBoost we used the "gain" importance type, the average gain across all splits in which a feature is used; for RandomForest we used Gini importance; for the logistic regression methods (no penalty, L1, L2) we reported the ordered absolute values of the coefficients (all input data were centered to mean 0 and scaled to unit variance). Features from the L1-regularized model with a coefficient of 0 are equally unimportant, so they share the same rank in the table.
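A sketch of how feature importance might be pulled from each fitted model type as described above; the dispatch logic and `feature_names` argument are illustrative, not the workbook's exact code:

```python
# Build a ranked feature-importance table for a fitted model.
import numpy as np
import pandas as pd

def importance_table(model, feature_names):
    if hasattr(model, "get_booster"):
        # XGBoost: "gain" importance (keys match column names when fitted on a DataFrame)
        gain = model.get_booster().get_score(importance_type="gain")
        values = [gain.get(f, 0.0) for f in feature_names]
    elif hasattr(model, "feature_importances_"):
        # RandomForest: Gini importance
        values = model.feature_importances_
    else:
        # Logistic regression: absolute value of coefficients on scaled inputs
        values = np.abs(model.coef_).ravel()
    return (pd.DataFrame({"feature": feature_names, "importance": values})
              .sort_values("importance", ascending=False))
```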

Feature Importance

Summary of model results

This table shows performance metrics for each machine learning model type over Inpatient stays ending between January 2020 and November 2020. Mar-May = March to May 2020. Jun-Oct = June to October 2020. AUROC = area under the receiver operator characteristic curve.

Model Performance Summary

Code Workbook Overview

To aid in understanding the workbooks, we generally follow this convention:

  • Red cells are input datasets
  • Blue cells are review/QA checks
  • Green cells are part of the ML pipeline

All of the following workbooks are composed of Python and SQL transforms, with some Python code in the global code space.

The cohort for the ML process is an inpatient population with COVID where at least part of the stay occurred in 2020. See the "inpatients" transform for most of the relevant cohort logic.

Severe outcome is any of the following:

  • Extracorporeal membrane oxygenation (ECMO)
  • Invasive Ventilation
  • Died

Other information included:

  • Charlson Comorbidities
  • Payer
  • Labs
  • Demographics

Categorical variables are then dummy encoded using Pandas' get_dummies(), followed by a simple imputation (see above). Lastly, all model inputs are scaled using scikit-learn's StandardScaler().
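A condensed sketch of this preprocessing order (dummy encoding, imputation, scaling); `df` and the categorical column names are placeholders:

```python
# Dummy encode, impute, then scale to mean 0 / unit variance.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# k-1 dummy variables for categorical inputs (drop_first gives k-1 columns)
encoded = pd.get_dummies(df, columns=["race", "gender", "payer"], drop_first=True)

# Simple imputation (see above for the normal-value vs. median rule)
encoded = encoded.fillna(encoded.median(numeric_only=True))

# Center and scale all model inputs
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(encoded),
                      columns=encoded.columns, index=encoded.index)
```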

Workbook: scikit-ML

In the final models for the cohort paper, we compared outcomes across the early (March to May 2020) vs later (June to October 2020) COVID outbreaks as well as for the whole period. This workbook is primarily used for that step of splitting data into coarse time periods. Note: At the time of this work, November data was only partially available. As of May 2021 all of 2020 and several months of 2021 are available.

The other task performed in this workbook is a grid search over model hyperparameters. Some of the grid search transforms take a very long time to run (days).

Lastly, not part of the final output but present in this workbook, are some sequential feature selection transforms. These transforms explore the effect of including or excluding some of the input variables on model performance. Some code comes from scikit-learn 0.24.x (previously in beta, now released), and some from Raschka and Mirjalili's "Python Machine Learning, 3rd Edition".
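For reference, a minimal sketch of forward sequential feature selection with scikit-learn 0.24's SequentialFeatureSelector; the estimator, target feature count, and data names are illustrative:

```python
# Forward sequential feature selection scored by AUROC with 5-fold CV.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=42),
    n_features_to_select=20,   # illustrative target size
    direction="forward",       # "backward" explores exclusion instead
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
sfs.fit(X_train, y_train)
selected = X_train.columns[sfs.get_support()]
```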

Workbook: xgboost-ML

Due to significant load times for workbooks with custom package selections, the XGBoost hyperparameter exploration code is saved as a separate workbook. In general, a cold start of a workbook with a customized package selection takes about 1 hour. A subsequent start (for example, on the same day the workbook was previously started) can be as fast as 15 minutes. I've found that minimizing the package selection helps somewhat with start times, so I recommend removing everything not required (such as all R-related packages).

Workbook: cohort-ML

This workbook takes the data from the "Build ML Dataset" and "scikit-ML" workbooks, along with the hyperparameters from the "scikit-ML" and "xgboost-ML" workbooks, trains 8 different models, and reports results. A simple ROC curve is generated for 6 of the models.
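A minimal sketch of generating one of those ROC curves, assuming a fitted `model` and the held-out test set from the earlier split:

```python
# Plot a single ROC curve with the AUROC in the legend.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.plot(fpr, tpr, label=f"AUROC = {roc_auc_score(y_test, y_prob):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```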

Workbook: Unsupervised ML

Transforms in this workbook are not part of the Cohort paper and are not used in the machine learning pipeline. There are many transforms that explore the use of Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). We created both 2D and 3D images for visual inspection. 3D images do not work in the Unite platform, but do work when created with Plotly and exported.
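A rough sketch of the kind of PCA and UMAP projections explored here, assuming `scaled` is the scaled model-input matrix from earlier; exporting the Plotly figure to HTML is one way around the 3D display limitation:

```python
# 2D PCA and UMAP projections of the scaled inputs, with an exportable Plotly figure.
from sklearn.decomposition import PCA
import umap                      # umap-learn package
import plotly.express as px

pca_2d = PCA(n_components=2).fit_transform(scaled)
umap_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(scaled)

# Plotly figures can be written to standalone HTML for viewing outside the platform
fig = px.scatter(x=umap_2d[:, 0], y=umap_2d[:, 1], title="UMAP projection")
fig.write_html("umap_2d.html")
```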

Areas for improvement

  • More advanced imputation will likely improve results.
  • Feature selection takes a long time to run - multiple days per model type. We either need to budget that time or find methods for improving performance.
  • Would be interesting to compare results when running this process against more current data.

Questions

  • What can be done in the area of reproducibility?
  • What information from this example would help with the machine learning tasks people are working on?
  • What difficulties are people having in generating their own ML workflows?