@shaabhishek
Last active January 31, 2022 16:33
Important Methods/Attributes - Psych Project

DatasetAllocator:

  1. Helps create the flattened representation of the timeseries tensor from the raw files

DatasetAllocator.record: 1.


extract_timeseries_dataset.main: 1.


TimeseriesDataset:

  1. Time series tensor representation of the dataset with dimensions (num_patients, num_timesteps, num_codes)
  2. Each entry of the tensor stores the sum of a patient's code counts, binned into 2-week intervals

TimeseriesDataset.fact_counts / DatasetAllocator.fact_counts / fact_counts.npy:

  1. Stores the actual count data in a flattened representation (i.e. it's a 1D array).
  2. It is created by iterating through the patients and saving the counts for each (timestep, code) observed (sorted by time)
  3. To reconstruct the counts, we would also need the fact_codes and fact_steps arrays.

TimeseriesDataset.fact_codes / DatasetAllocator.fact_codes / fact_codes.npy:

  1. Stores the fact codes corresponding to the counts in fact_counts array.
  2. It has values in [0, total_concepts] = [0, ~107k]. These values are mapped to the original code names using the ConceptDefs.concept_idx method
  3. Shape is the same as fact_counts

TimeseriesDataset.fact_steps / DatasetAllocator.fact_steps / fact_steps.npy:

  1. Stores the time steps corresponding to the counts in fact_counts array.
  2. It has values in [-2, 4] years = [-51,104] fortnights (0 corresponds to fortnight of first MDD diagnosis for the patient)
  3. Shape is the same as fact_counts
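
The relationship between the three parallel arrays can be illustrated with a minimal sketch for a single patient. The helper `flatten_patient` and the toy matrix are hypothetical, not part of the project code; the sketch only assumes that row 0 of the dense matrix corresponds to step -51.

```python
import numpy as np

def flatten_patient(dense, first_step=-51):
    """Flatten one patient's (num_timesteps, num_codes) count matrix.

    Hypothetical helper: returns parallel 1D arrays (counts, codes, steps).
    `first_step` is the step value assigned to row 0 (assumed to be -51).
    """
    # np.nonzero returns indices in row-major order, so facts come out sorted by time.
    rows, cols = np.nonzero(dense)
    counts = dense[rows, cols]
    steps = rows + first_step
    return counts, cols, steps

# Toy patient: 156 fortnights, 4 codes.
dense = np.zeros((156, 4), dtype=np.int64)
dense[51, 2] = 3   # step 0, code 2
dense[0, 1] = 1    # step -51, code 1
dense[53, 0] = 2   # step 2, code 0

fact_counts, fact_codes, fact_steps = flatten_patient(dense)
```

Only the observed (timestep, code) pairs are stored, which is why all three arrays share one shape and none of them is useful without the other two.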

TimeseriesDataset.as_csr:

  1. Create a sparse 2D matrix representing all patient facts w/ caching.
  2. Really should be a 3D matrix of (patient,code,time), but scipy.sparse only supports 2D, so instead it's (patient x time,code)
  3. Contiguous blocks of {rows_per_patient = 6 years = 156} rows represent neighboring fortnights for individual patients
  4. Uses the 1D arrays [fact_counts, fact_codes, fact_steps] to know which entry to populate with what
  5. Internally, it:
    1. Remaps step values from [-51, 104] to [0, 155].
    2. Offsets them to match the patient's block of rows in the 2D matrix.
    3. Uses the remapped steps and the codes as rows and cols for the sparse matrix.
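
The row arithmetic behind this layout can be sketched as follows. The per-fact patient index `fact_patients` is a hypothetical array introduced for illustration (the notes do not say how patient boundaries are stored), and a small dense array stands in for the sparse matrix so the result is easy to inspect.

```python
import numpy as np

ROWS_PER_PATIENT = 156   # 6 years of fortnights
MIN_STEP = -51           # earliest value stored in fact_steps

def fact_rows(fact_steps, fact_patients):
    """Row index in the (patient x time, code) matrix for every fact."""
    local = fact_steps - MIN_STEP                     # [-51, 104] -> [0, 155]
    return fact_patients * ROWS_PER_PATIENT + local   # shift into the patient's block

# Toy data: two patients, three codes.
fact_counts = np.array([2, 1, 4])
fact_codes = np.array([0, 2, 1])
fact_steps = np.array([-51, 0, 104])
fact_patients = np.array([0, 0, 1])   # hypothetical: patient index of each fact

rows = fact_rows(fact_steps, fact_patients)

# Dense stand-in for the sparse matrix: accumulate counts at (row, code).
dense = np.zeros((2 * ROWS_PER_PATIENT, 3), dtype=np.int64)
np.add.at(dense, (rows, fact_codes), fact_counts)
```

The same triplet could be handed to `scipy.sparse.csr_matrix((fact_counts, (rows, fact_codes)), shape=dense.shape)` to obtain the sparse version.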

TimeseriesDataset.two_week_antidepressants:

  1. Subsets the timeseries tensor to 1. only [2 years after first MDD diagnosis ~= 60 fortnights] and 2. only antidepressant indices
  2. Returns a 3D tensor of shape [n_patients, 60, n_antidepressant_codes]
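
A minimal sketch of this subsetting, assuming the dense tensor uses the remapped step index (so step 0 sits at index 51) and that the window starts at the diagnosis fortnight. The function name and the `ad_idx` array are hypothetical.

```python
import numpy as np

DIAGNOSIS_INDEX = 51   # step 0 after remapping [-51, 104] -> [0, 155] (assumption)
N_FORTNIGHTS = 60      # window length given in the notes

def antidepressant_window(tensor, ad_idx):
    """Keep 60 fortnights from the diagnosis onward and only the AD code columns."""
    window = tensor[:, DIAGNOSIS_INDEX:DIAGNOSIS_INDEX + N_FORTNIGHTS, :]
    return window[:, :, ad_idx]

# Toy tensor: 2 patients, 156 fortnights, 5 codes; codes 1 and 3 are antidepressants.
tensor = np.arange(2 * 156 * 5).reshape(2, 156, 5)
ad_idx = np.array([1, 3])
two_week = antidepressant_window(tensor, ad_idx)
```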

TimeseriesDataset.three_month_antidepressants:

  1. Same subset as above (60 fortnights + antidepressants) but the time window for counts is now [3 months = 6 fortnights] wide (resulting in 10 three-month time blocks in 2 years).
  2. Returns a 3D tensor of shape [n_patients, 10, n_antidepressant_codes]

TimeseriesDataset.six_month_antidepressants:

  1. Same subset as above (60 fortnights + antidepressants) but the time window for counts is now [6 months = 12 fortnights] wide (resulting in 5 six-month time blocks in 2 years).
  2. Returns a 3D tensor of shape [n_patients, 5, n_antidepressant_codes]
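
Both coarser views can be derived from the two-week tensor by grouping consecutive fortnights and summing within each group. The helper `rebin` below is a hypothetical sketch of that idea, not the project's implementation.

```python
import numpy as np

def rebin(two_week, fortnights_per_block):
    """Sum consecutive fortnights into wider time blocks."""
    n_patients, n_steps, n_codes = two_week.shape
    if n_steps % fortnights_per_block != 0:
        raise ValueError("number of fortnights must be a multiple of the block width")
    n_blocks = n_steps // fortnights_per_block
    # Split the time axis into (block, fortnight-within-block), then sum the inner axis.
    blocks = two_week.reshape(n_patients, n_blocks, fortnights_per_block, n_codes)
    return blocks.sum(axis=2)

two_week = np.ones((3, 60, 4), dtype=np.int64)
three_month = rebin(two_week, 6)    # shape (3, 10, 4)
six_month = rebin(two_week, 12)     # shape (3, 5, 4)
```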

TimeseriesDataset.count_representation:

  1. Subsets the timeseries tensor to [2 years before first MDD diagnosis = 51 fortnights] and returns total counts of codes for this period.
  2. Returns a 2D tensor of shape [n_patients, n_codes]
  3. I feel this is confusing notation
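
As a rough sketch, the count representation amounts to summing the pre-diagnosis slice of the tensor over the time axis. This assumes the first 51 rows of the time axis are the steps before the diagnosis (steps -51 to -1); the function below is illustrative only.

```python
import numpy as np

N_STEPS_BEFORE = 51   # fortnights before the first MDD diagnosis (steps -51..-1)

def count_representation(tensor, n_steps_before=N_STEPS_BEFORE):
    """Total count of every code over the pre-diagnosis period."""
    return tensor[:, :n_steps_before, :].sum(axis=1)

# Toy tensor: 2 patients, 156 fortnights, 3 codes.
tensor = np.zeros((2, 156, 3), dtype=np.int64)
tensor[0, 0, 1] = 2    # step -51, counted
tensor[0, 50, 1] = 1   # step -1, counted
tensor[0, 51, 1] = 5   # step 0, outside the pre-diagnosis period
counts = count_representation(tensor)
```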

TimeseriesDataset.count_representation_6_mos_before:

  1. Same as TimeseriesDataset.count_representation but now restricted to [6 months before first MDD diagnosis].
  2. Returns a 2D tensor of shape [n_patients, n_codes]

TimeseriesDataset._counts_and_dems_without_antidepressants:

  1. Subsets the TimeseriesDataset.count_representation matrix by excluding the antidepressant codes, and appends patient demographic info to it.
  2. Returns a 2D tensor of shape [n_patients, n_codes - n_antidepressant_codes + n_dem_features]

TimeseriesDataset._counts_dems_and_ages_without_antidepressants:

  1. Same as TimeseriesDataset._counts_and_dems_without_antidepressants but it also appends patient age as a feature
  2. Same shape as TimeseriesDataset._counts_and_dems_without_antidepressants except that it has 1 more col
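
The column bookkeeping for these feature matrices can be sketched as below. The function name, the `dems` and `ages` arrays, and the toy values are hypothetical; the sketch only shows how the stated shapes arise.

```python
import numpy as np

def counts_and_dems_without_ad(counts, ad_idx, dems, ages=None):
    """Drop antidepressant columns, then append demographics (and optionally age)."""
    keep = np.ones(counts.shape[1], dtype=bool)
    keep[ad_idx] = False                      # mask out antidepressant codes
    parts = [counts[:, keep], dems]
    if ages is not None:
        parts.append(ages.reshape(-1, 1))     # age becomes one extra column
    return np.hstack(parts)

counts = np.arange(12).reshape(2, 6)          # [n_patients, n_codes]
ad_idx = np.array([1, 4])                     # antidepressant code columns
dems = np.array([[1, 0], [0, 1]])             # [n_patients, n_dem_features]
ages = np.array([34, 58])

features = counts_and_dems_without_ad(counts, ad_idx, dems)
features_with_age = counts_and_dems_without_ad(counts, ad_idx, dems, ages)
```

With 6 codes, 2 antidepressant codes, and 2 demographic features, the first matrix has 6 - 2 + 2 = 6 columns and the second has 7.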

TimeseriesDataset._counts_dems_and_ages_without_antidepressants_6_mos_before:

  1. Analogous function to TimeseriesDataset._counts_dems_and_ages_without_antidepressants but uses TimeseriesDataset.count_representation_6_mos_before

TimeseriesDatasetExtended.two_week_non_AD_collapsed_codes:

  1. Analogous to TimeseriesDataset.two_week_antidepressants except:
    1. This excludes antidepressants.
    2. Sums the counts of all the included codes for each time block
  2. Returns an (expanded) 3D tensor of shape [n_patients, 60, 1]
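
A minimal sketch of the collapse, assuming the input is already restricted to the 60-fortnight window. The helper name and `ad_idx` are hypothetical; `keepdims=True` is what preserves the trailing axis of size 1.

```python
import numpy as np

def non_ad_collapsed(window, ad_idx):
    """Sum all non-antidepressant codes per time block, keeping a code axis of size 1."""
    keep = np.ones(window.shape[2], dtype=bool)
    keep[ad_idx] = False
    return window[:, :, keep].sum(axis=2, keepdims=True)

window = np.ones((3, 60, 5), dtype=np.int64)   # [n_patients, 60, n_codes]
collapsed = non_ad_collapsed(window, np.array([0, 2]))
```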

TimeseriesDatasetExtended.two_week_psych_codes:

  1. Analogous to TimeseriesDataset.two_week_antidepressants except that this includes psych codes using the ConceptDefs.is_psych_code filter.
  2. Returns a 3D tensor of shape [n_patients, 60, n_psych_codes]

TimeseriesDatasetExtended.build_outcomes:

  1. This method defines several outcomes for each patient (all are 1D arrays of shape [n_patients]).
  2. The count data used is TimeseriesDataset.two_week_antidepressants, TimeseriesDatasetExtended.two_week_non_AD_collapsed_codes and TimeseriesDatasetExtended.two_week_psych_codes.
  3. These counts all correspond to 2 years from the index prescription. A prediction_time_window_in_fortnights variable is also defined as [3 months = 6 fortnights] (called W in the discussion below).
  4. The outcomes defined are:
    1. switched_treatment: The logic is to identify whether an AD treatment changed during the first W (i.e. 3 months). It does so by first finding timesteps where a unique treatment was provided (some code count > 0), and returns True if there is more than one such timestep.
    2. antidepressant_afterwards: Identifies whether an AD treatment was provided to the patient in the [W, 2W] window (the 3-to-6-month period). Does so by summing all counts at each step and checking whether any timestep with count > 0 exists.
    3. same_treatment_afterwards: Computes unique AD treatment timesteps in [0, W] and [W, 2W] windows and checks if these two match. One thing to note is that this method doesn't check if only one treatment is provided in the window.
    4. remains_in_care: If any non-AD code exists in [W, 2W] window, return True.
    5. psych_afterwards: If any psych code exists in [W, 2W] window, return True.
    6. stable_treatment: Computed as (not switched_treatment) and (same_treatment_afterwards), i.e. the patient sticks to the same treatment in the [0, 2W] window.
    7. dropped_treatment: Computed as (not antidepressant_afterwards) and (remains_in_care and (not psych_afterwards)), i.e. the patient received neither an AD code nor a psych code but did receive a non-AD code in the [W, 2W] window (which means they dropped treatment but remained in care).
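
The outcome definitions above can be sketched in NumPy as follows. This is one possible reading, not the project's code: it treats the windows as half-open ([0, W) and [W, 2W)), and it interprets a switch as more than one distinct AD code appearing in the first window. The argument names are hypothetical stand-ins for the three count tensors.

```python
import numpy as np

W = 6  # prediction_time_window_in_fortnights (3 months)

def build_outcomes(ad, non_ad, psych, w=W):
    """ad: [n, T, n_ad]; non_ad: [n, T, 1]; psych: [n, T, n_psych]."""
    # Which AD codes appear at least once in each window, per patient.
    ad_first = (ad[:, :w] > 0).any(axis=1)           # [n, n_ad]
    ad_second = (ad[:, w:2 * w] > 0).any(axis=1)     # [n, n_ad]

    switched = ad_first.sum(axis=1) > 1              # assumption: >1 distinct AD code
    ad_after = ad_second.any(axis=1)
    same = (ad_first == ad_second).all(axis=1)       # same set of codes in both windows
    remains = (non_ad[:, w:2 * w] > 0).any(axis=(1, 2))
    psych_after = (psych[:, w:2 * w] > 0).any(axis=(1, 2))
    return {
        "switched_treatment": switched,
        "antidepressant_afterwards": ad_after,
        "same_treatment_afterwards": same,
        "remains_in_care": remains,
        "psych_afterwards": psych_after,
        "stable_treatment": ~switched & same,
        "dropped_treatment": ~ad_after & (remains & ~psych_after),
    }

# Toy data: 3 patients, 60 fortnights, 2 AD codes, 1 psych code.
ad = np.zeros((3, 60, 2), dtype=np.int64)
non_ad = np.zeros((3, 60, 1), dtype=np.int64)
psych = np.zeros((3, 60, 1), dtype=np.int64)
ad[0, 1, 0] = ad[0, 7, 0] = 1      # patient 0: same AD in both windows
non_ad[0, 8, 0] = 1
ad[1, 0, 0] = ad[1, 3, 1] = 1      # patient 1: two ADs early, none afterwards
non_ad[1, 9, 0] = 1
ad[2, 2, 1] = 1                    # patient 2: one AD early, psych code afterwards
non_ad[2, 10, 0] = psych[2, 10, 0] = 1

outcomes = build_outcomes(ad, non_ad, psych)
```

In this toy data, patient 0 is stable, patient 1 dropped treatment while remaining in care, and patient 2 is neither, because a psych code appears in the second window.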
