jamesqo/cda-meeting-notes-07-14.md

## cda-meeting-notes-07-14.md

      
Is there a schema describing how the tables link to one another?

case barcode - covers 90% of linkages
sample barcode / aliquot barcode cover the rest
aliquot - one portion of a sample after it is split into multiple test tubes
DNA methylation tables: Different chromosomes / different human genomes
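
A minimal sketch of linking two flat tables on the case barcode, as described above. The column name `case_barcode` comes from these notes; the other column names, the row values, and the helper `join_on_case_barcode` are made up for illustration and are not the real table schema.

```python
def join_on_case_barcode(clinical_rows, sample_rows):
    """Attach each sample row to the clinical row sharing its case_barcode.

    Sample rows whose case has no clinical row are dropped (inner join).
    """
    clinical_by_case = {row["case_barcode"]: row for row in clinical_rows}
    joined = []
    for sample in sample_rows:
        clinical = clinical_by_case.get(sample["case_barcode"])
        if clinical is not None:
            # Merge the two flat rows into one flat row.
            joined.append({**clinical, **sample})
    return joined


# Made-up rows; real barcodes and columns will differ.
clinical = [{"case_barcode": "CASE-A", "primary_site": "Brain"}]
samples = [
    {"case_barcode": "CASE-A", "sample_barcode": "CASE-A-S1"},
    {"case_barcode": "CASE-B", "sample_barcode": "CASE-B-S1"},
]
merged = join_on_case_barcode(clinical, samples)
```

Because the tables are flat, the same join would presumably be a single `JOIN ... ON case_barcode` in SQL.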


Significance of version numbers on tables

Versioned tables are kept around so that code that references one version of the data can be run in the future w/o breaking


What’s the use case for the REST API?

It's meant for the website


Within just TCGA + CPTAC, are there any data inconsistencies that the CDA team has already discovered that we need to watch out for?

Contact the CDA team as inconsistencies pop up


submitter_id

case_barcode and submitter_id are generally the same (from GDC)
submitter_id can sometimes be ambiguous
GDC barcodes: begin with case barcode, +4 letters for sample, +4 for aliquot
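
A small sketch of the nesting described above, where a sample barcode begins with its case barcode and an aliquot barcode begins with its sample barcode. It checks prefixes only and assumes nothing about exact suffix lengths. The function name `barcodes_nest` and the example barcodes are hypothetical.

```python
def barcodes_nest(case_barcode, sample_barcode, aliquot_barcode):
    """Return True if each barcode strictly extends the one above it."""
    return (
        sample_barcode.startswith(case_barcode)
        and len(sample_barcode) > len(case_barcode)
        and aliquot_barcode.startswith(sample_barcode)
        and len(aliquot_barcode) > len(sample_barcode)
    )


# Made-up barcodes for illustration only.
ok = barcodes_nest("CASE-0001", "CASE-0001-01A", "CASE-0001-01A-01D")
mismatch = barcodes_nest("CASE-0001", "CASE-0002-01A", "CASE-0002-01A-01D")
```

A check like this could flag rows where the barcodes do not link up, for example when submitter_id is ambiguous.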


Methylation tables are split to lower query costs for end users


Tables are kept flat - can be represented in a spreadsheet


mRNA seq - one file per case per experimental run


Billing account for external users: contact Fabian


The metadata associated with a sample is always kept current in the _current tables

We should use the tables that are marked _current
Data in the _current tables changes when GDC runs a new harmonization pipeline
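
A sketch of choosing between a `_current` table and a pinned versioned table, which may help keep the trade-off explicit in our own code. Only the `_current` suffix comes from these notes. The versioned suffix format, the base name `clinical`, the version label `r1`, and the helper `table_name` are assumptions for illustration.

```python
def table_name(base, version=None):
    """Build a table name: latest metadata by default, pinned if a version is given."""
    if version is None:
        # Follows new harmonization runs, so results can change over time.
        return f"{base}_current"
    # Assumed naming for versioned tables; stays fixed for reproducible runs.
    return f"{base}_{version}"


latest = table_name("clinical")
pinned = table_name("clinical", version="r1")
```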