@bmmalone
Created April 15, 2021 08:41
Pseudocode for cross-validation with embedding models
given labeled_training_indices (e.g., maybe there are 20 labeled training instances)
given labeled_test_indices (there are always ~3000 of these due to the split created by Harutyunyan et al.)

train_fold, val_fold <- stratified_split(labeled_training_indices, train=70%, "test"=30%)  # "test" is really the validation set here
# for example, if we have 20 labeled training instances, then we have 14 instances for training and 6 for validation
# ... so we really don't have a lot when the number of labeled training instances is small

hp_grid = ParameterGrid({
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, ...],
    'embedding_epoch': [1, 11, 21, ...],
    ... other hyperparameters ...
})

best_model <- None

for each hp in hp_grid:
    load embeddings for 'embedding_epoch'
    train logistic regression model on train_fold using embeddings and other hps
    evaluate model on val_fold
    if model is better than best_model:  # "model" includes the embedding epoch
        best_model <- model

evaluate best_model (including the embedding epoch) on labeled_test_indices
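Below is a minimal runnable sketch of the pseudocode above, assuming scikit-learn. The `load_embeddings` helper, the `labels` array, the AUROC selection metric, the random seed, and the concrete `C` / `embedding_epoch` values are placeholders for illustration, not part of the original setup; swap in the real embedding checkpoints, labels, and evaluation metric.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def load_embeddings(epoch, indices):
    """Hypothetical helper: return the embedding matrix saved at the given
    checkpoint epoch, restricted to the given instance indices."""
    raise NotImplementedError


def run_hp_search(labeled_training_indices, labeled_test_indices, labels):
    # stratified 70/30 split of the (few) labeled training instances
    train_fold, val_fold = train_test_split(
        labeled_training_indices,
        test_size=0.3,
        stratify=labels[labeled_training_indices],
        random_state=0,  # placeholder seed
    )

    hp_grid = ParameterGrid({
        'penalty': ['l1', 'l2'],
        'C': [0.001, 0.01, 0.1, 1.0],
        'embedding_epoch': [1, 11, 21],
    })

    best_score, best_model, best_hp = -np.inf, None, None

    for hp in hp_grid:
        # the embedding epoch is itself a hyperparameter, so reload the features
        X_train = load_embeddings(hp['embedding_epoch'], train_fold)
        X_val = load_embeddings(hp['embedding_epoch'], val_fold)

        model = LogisticRegression(
            penalty=hp['penalty'], C=hp['C'], solver='liblinear'
        )
        model.fit(X_train, labels[train_fold])

        # model selection on the validation fold (AUROC used here as an example)
        score = roc_auc_score(labels[val_fold], model.predict_proba(X_val)[:, 1])
        if score > best_score:
            best_score, best_model, best_hp = score, model, hp

    # final evaluation on the held-out Harutyunyan et al. test split,
    # using the embedding epoch selected on the validation fold
    X_test = load_embeddings(best_hp['embedding_epoch'], labeled_test_indices)
    test_score = roc_auc_score(
        labels[labeled_test_indices], best_model.predict_proba(X_test)[:, 1]
    )
    return best_hp, test_score
```

Because the embedding epoch is treated as just another hyperparameter, the checkpoint is chosen on the validation fold alongside the regularization settings and only the selected model ever touches the ~3000 test instances.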
@bmmalone
Author

~3000 of these due to the split created by Harutyunyan et al.

This comment is specifically related to the MIMIC-III dataset splits defined in this paper: Multitask learning and benchmarking with clinical time series data by Harutyunyan et al.
