@jamesqo
Last active August 31, 2023 20:30

Comments & assumptions made during curation

General

  • Study is updated once every 3 months with latest data from ISB-CGC BigQuery tables
  • Reference genome used: hg38
  • Only tumor sample data is included (no normal samples)

Clinical data

  • Patient data: Retrieved from BigQuery table isb-cgc-bq.TCGA.clinical_gdc_current

  • Sample data: Retrieved from BigQuery table isb-cgc-bq.TCGA.biospecimen_gdc_current

  • DFS_STATUS and DFS_MONTHS are unavailable from BigQuery, so instead they're pulled from existing TCGA studies in datahub.

    • First, the corresponding pancan study is checked. If the patient ID is not found there, then the value from the legacy TCGA study is used.
  • Transformations

    • AGE is clipped to the range 18 to 89.
    • OS_MONTHS is converted from demo__days_to_death when that value is present. If the patient is still alive, it is converted from diag__days_to_last_follow_up.
  • Remapped columns: TODO
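
The clinical rules above can be sketched in a few lines of Python. This is only an illustration, not the curation pipeline's actual code: the function names, the `DAYS_PER_MONTH` constant (30.4375, an average month length), the rounding to two decimals, and the dict-based lookup for DFS values are all assumptions made for the example.

```python
from typing import Any, Dict, Optional

# Assumption: average month length used to turn days into months.
DAYS_PER_MONTH = 30.4375


def clip_age(age: Optional[float], low: int = 18, high: int = 89) -> Optional[float]:
    """Clamp AGE into [low, high]; missing values pass through unchanged."""
    if age is None:
        return None
    return max(low, min(high, age))


def os_months(days_to_death: Optional[float],
              days_to_last_follow_up: Optional[float]) -> Optional[float]:
    """Prefer days_to_death; fall back to days_to_last_follow_up."""
    days = days_to_death if days_to_death is not None else days_to_last_follow_up
    if days is None:
        return None
    return round(days / DAYS_PER_MONTH, 2)


def dfs_fields(patient_id: str,
               pancan: Dict[str, Any],
               legacy: Dict[str, Any]) -> Optional[Any]:
    """Check the pancan study first, then the legacy TCGA study."""
    if patient_id in pancan:
        return pancan[patient_id]
    return legacy.get(patient_id)
```

Here `pancan` and `legacy` stand in for per-patient DFS_STATUS / DFS_MONTHS values already loaded from the datahub studies; how those are loaded is outside the scope of the sketch.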

CNA data

  • Retrieved from BigQuery table isb-cgc-bq.TCGA.copy_number_gene_level_hg38_gdc_current
  • Transformations
    • Ensembl gene IDs are mapped to Entrez IDs using the Genome Nexus hg38 canonical transcript file.
    • If a sample has multiple aliquots, it is "reduced" to a single aliquot chosen to represent the entire sample.
      • This is done by choosing the aliquot ID with the highest sort value (e.g. highest plate number). This follows the same policy that GDAC uses to reduce aliquot data in their studies.
    • Copy number values from the BigQuery tables are converted from ASCAT to GISTIC 2.0.
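
A minimal sketch of the aliquot reduction step, under two assumptions that are not stated in the notes: the sample is identified by the first 16 characters of the TCGA aliquot barcode, and "highest sort value" means the lexicographically greatest aliquot barcode. Note that a plain string comparison also looks at the portion and analyte fields before it reaches the plate number, so it only approximates "highest plate number". The function names and the barcodes in the usage note are illustrative.

```python
from typing import Dict, Iterable


def sample_barcode(aliquot_barcode: str) -> str:
    """Assumption: the first 16 characters (e.g. TCGA-02-0001-01C) identify the sample."""
    return aliquot_barcode[:16]


def reduce_aliquots(aliquot_barcodes: Iterable[str]) -> Dict[str, str]:
    """Map each sample to the single aliquot with the greatest barcode string."""
    chosen: Dict[str, str] = {}
    for aliquot in aliquot_barcodes:
        sample = sample_barcode(aliquot)
        # Keep the aliquot that sorts highest for this sample.
        if sample not in chosen or aliquot > chosen[sample]:
            chosen[sample] = aliquot
    return chosen
```

For example, given the made-up barcodes `TCGA-02-0001-01C-01D-0182-01` and `TCGA-02-0001-01C-01D-0185-01`, the sketch keeps the second one for sample `TCGA-02-0001-01C`.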

Segment data

  • Retrieved from BigQuery table isb-cgc-bq.TCGA.copy_number_segment_masked_hg38_gdc_current
  • Remapped columns:
| Original       | cBioPortal |
|----------------|------------|
| sample_barcode | ID         |
| chromosome     | chrom      |
| start_pos      | loc.start  |
| end_pos        | loc.end    |
| num_probes     | num.mark   |
| segment_mean   | seg.mean   |
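
The segment column remapping can be expressed as a simple lookup. The sketch below uses the column names listed above; the names `SEGMENT_COLUMN_MAP` and `remap_row`, the choice to drop any column that is not in the mapping, and the sample values in the test are assumptions for illustration only.

```python
from typing import Any, Dict

# BigQuery column name -> cBioPortal segment file column name.
SEGMENT_COLUMN_MAP: Dict[str, str] = {
    "sample_barcode": "ID",
    "chromosome": "chrom",
    "start_pos": "loc.start",
    "end_pos": "loc.end",
    "num_probes": "num.mark",
    "segment_mean": "seg.mean",
}


def remap_row(row: Dict[str, Any], column_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename mapped columns in mapping order; unmapped columns are dropped (an assumption)."""
    return {new: row[old] for old, new in column_map.items() if old in row}
```

The same helper would work for the mutation columns by passing a different mapping.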

Mutation data

  • Retrieved from BigQuery table isb-cgc-bq.TCGA.masked_somatic_mutation_hg38_gdc_current
  • Remapped columns:
| Original              | cBioPortal                  |
|-----------------------|-----------------------------|
| sample_barcode_tumor  | Tumor_Sample_Barcode        |
| sample_barcode_normal | Matched_Norm_Sample_Barcode |