-
Is there a schema describing how the data in each table links to the others?
- case barcode: covers 90% of linkages between tables
- sample barcode / aliquot barcode: cover the rest
- aliquot: one portion of a sample that has been split into multiple test tubes
- DNA methylation tables: separate tables for different chromosomes / different human genomes
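To make the linkage idea concrete, here is a minimal sketch of joining two flat tables on their shared case barcode. Only the `case_barcode` and `sample_barcode` column names come from these notes; the rows, the `primary_site` column, and the `join_on` helper are hypothetical and purely for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for two flat tables; the barcodes and the
# primary_site column are made up for this example.
clinical_rows = [
    {"case_barcode": "CASE-A", "primary_site": "Lung"},
    {"case_barcode": "CASE-B", "primary_site": "Breast"},
]
sample_rows = [
    {"case_barcode": "CASE-A", "sample_barcode": "CASE-A-01A"},
    {"case_barcode": "CASE-A", "sample_barcode": "CASE-A-11A"},
    {"case_barcode": "CASE-C", "sample_barcode": "CASE-C-01A"},
]


def join_on(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    by_key = defaultdict(list)
    for row in left:
        by_key[row[key]].append(row)
    # .get() avoids creating empty entries for keys missing on the left side
    return [{**l, **r} for r in right for l in by_key.get(r[key], [])]


linked = join_on(clinical_rows, sample_rows, "case_barcode")
for row in linked:
    print(row)
```

Rows whose case barcode appears in only one table (CASE-B and CASE-C here) drop out of the inner join, which is one quick way to spot linkage gaps.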
-
Significance of version numbers on tables
- Versioned tables are kept around so that code referencing a specific version of the data can still be run in the future without breaking
-
What’s the use case for the REST API?
- It's meant for the website
-
Within just TCGA + CPTAC, are there any data inconsistencies that the CDA team has already discovered that we need to watch out for?
- Contact the team as inconsistencies pop up
-
submitter_id
- case_barcode and submitter_id are generally the same (from GDC)
- submitter_id can sometimes be ambiguous
- GDC barcodes: begin with case barcode, +4 letters for sample, +4 for aliquot
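Because each barcode begins with its parent's barcode, the case can be recovered from a sample or aliquot barcode by prefix. The sketch below assumes a hyphen-delimited, TCGA-style layout in which the case barcode is the first three fields; the helper names are hypothetical, the barcodes are illustrative, and other programs (e.g. CPTAC) may format identifiers differently.

```python
def case_barcode_of(barcode: str) -> str:
    """Return the first three hyphen-separated fields of a barcode.

    Assumes a TCGA-style layout where those three fields form the case
    barcode; this is an assumption, not a rule stated in the notes.
    """
    return "-".join(barcode.split("-")[:3])


def is_derived_from(child: str, parent: str) -> bool:
    """True if `child` extends `parent` with at least one more field."""
    return child.startswith(parent + "-")


aliquot = "TCGA-02-0001-01C-01D-0182-01"  # illustrative TCGA-style barcode
sample = "TCGA-02-0001-01C"
case = case_barcode_of(aliquot)

print(case)
print(is_derived_from(sample, case), is_derived_from(aliquot, sample))
```

Matching on `parent + "-"` rather than the bare prefix keeps a barcode from matching itself or a longer, unrelated case ID.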
-
Methylation tables are split to lower query costs for end users
-
Tables are kept flat - can be represented in a spreadsheet
-
mRNA-seq: one file per case per experimental run
-
Billing account for external users: contact Fabian
-
The metadata associated with a sample is always kept current in the _current tables
- We should use the tables that are marked _current
- Data in the _current tables changes when GDC runs a new harmonization pipeline
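The trade-off between `_current` and versioned tables can be sketched as a small helper that either follows the latest data or pins a release for reproducibility. Only the `_current` suffix comes from these notes; the `table_name` helper, the `clinical` base name, the `r37` release label, and the `<base>_<version>` pattern for versioned tables are all assumptions made for illustration.

```python
from typing import Optional


def table_name(base: str, version: Optional[str] = None) -> str:
    """Build a table name: `<base>_current`, or `<base>_<version>` when pinned.

    The `<base>_<version>` pattern is a hypothetical naming scheme; check
    the actual dataset for how versioned tables are named.
    """
    suffix = version if version is not None else "current"
    return f"{base}_{suffix}"


latest = table_name("clinical")         # contents change with new harmonization runs
pinned = table_name("clinical", "r37")  # hypothetical release label, stays fixed

print(latest)
print(pinned)
```

Day-to-day analysis would use the `_current` form, while code that must reproduce an earlier result would pin a version.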
Created July 14, 2023 16:50