
@jamesqo
Created July 14, 2023 16:50
  • Is there a schema describing how the data in each table links to the other tables?

    • Case barcode: covers 90% of linkages
    • Sample barcode / aliquot barcode: cover the rest
    • Aliquot: one sample is split into multiple test tubes, and each tube is an aliquot
    • DNA methylation tables: separate tables for different chromosomes / different human genomes
  • Significance of version numbers on tables

    • Versioned tables are kept around so that code that references one version of the data can still be run in the future without breaking
  • What’s the use case for the REST API?

    • It's meant for the website
  • Within just TCGA + CPTAC, are there any data inconsistencies that the CDA team has already discovered that we need to watch out for?

    • Contact the CDA team about inconsistencies as they come up
  • submitter_id

    • case_barcode and submitter_id are generally the same (from GDC)
    • submitter_id can sometimes be ambiguous
    • GDC barcodes are hierarchical: each begins with the case barcode, with 4 more characters appended for the sample and further characters appended for the aliquot
  • Methylation tables are split to lower query costs for end users

  • Tables are kept flat - can be represented in a spreadsheet

  • mRNA-seq: one file per case per experimental run

  • For a billing account for external users, contact Fabian

  • The metadata associated with a sample is always kept current in the _current tables

    • We should use the tables that are marked _current
    • Data in the _current tables changes when GDC runs a new harmonization pipeline
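
The barcode linkage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes TCGA-style barcodes, where the case barcode is the first 12 characters and the sample barcode is the first 16 characters of a longer barcode (CPTAC identifiers may not follow this layout). The helper names, the `aliquot_barcode` column, and the rows in the usage note are hypothetical and do not come from any real table.

```python
# Assumption: TCGA-style barcode layout.
#   case barcode   = first 12 characters, e.g. TCGA-02-0001
#   sample barcode = first 16 characters, e.g. TCGA-02-0001-01C
CASE_LEN = 12
SAMPLE_LEN = 16


def case_barcode(barcode: str) -> str:
    """Return the case-level prefix of a case, sample, or aliquot barcode."""
    if len(barcode) < CASE_LEN:
        raise ValueError(f"barcode too short for a case barcode: {barcode!r}")
    return barcode[:CASE_LEN]


def sample_barcode(barcode: str) -> str:
    """Return the sample-level prefix of a sample or aliquot barcode."""
    if len(barcode) < SAMPLE_LEN:
        raise ValueError(f"barcode too short for a sample barcode: {barcode!r}")
    return barcode[:SAMPLE_LEN]


def link_to_cases(case_rows, aliquot_rows):
    """Attach case-level fields to aliquot-level rows via the shared case barcode.

    Both inputs are flat lists of dicts (one dict per spreadsheet-style row).
    Aliquot rows whose case barcode is not present in case_rows are skipped.
    """
    cases_by_barcode = {row["case_barcode"]: row for row in case_rows}
    linked = []
    for row in aliquot_rows:
        case = cases_by_barcode.get(case_barcode(row["aliquot_barcode"]))
        if case is not None:
            # Merge the two flat rows; aliquot fields win on a name clash.
            linked.append({**case, **row})
    return linked
```

For example, with made-up rows, `link_to_cases([{"case_barcode": "TCGA-02-0001", "project": "example"}], [{"aliquot_barcode": "TCGA-02-0001-01C-01D-0182-01", "value": 1.5}])` returns a single merged row that carries both `project` and `value`. Because the tables are flat, the same idea carries over to a SQL join on the case barcode column.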