jamesqo/tcga-readme.md

## tcga-readme.md

      
            
          
    Comments & assumptions made during curation

General


The study is updated once every 3 months with the latest data from the ISB-CGC BigQuery tables
Reference genome used: hg38
Only tumor sample data is included (no normal samples)

Clinical data


Patient data: Retrieved from BigQuery table isb-cgc-bq.TCGA.clinical_gdc_current


Sample data: Retrieved from BigQuery table isb-cgc-bq.TCGA.biospecimen_gdc_current


DFS_STATUS and DFS_MONTHS are unavailable from BigQuery, so instead they're pulled from existing TCGA studies in datahub.

First, the corresponding pancan study is checked. If the patient ID is not found there, then the value from the legacy TCGA study is used.
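
The lookup order described above can be sketched as follows. This is a minimal illustration, not the curation script itself: the function name `lookup_dfs`, the dictionary layout, and the patient IDs and values are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# (DFS_STATUS, DFS_MONTHS) keyed by patient ID; this layout is an assumption.
DfsTable = Dict[str, Tuple[str, float]]


def lookup_dfs(patient_id: str, pancan: DfsTable, legacy: DfsTable) -> Optional[Tuple[str, float]]:
    """Prefer the pancan study; fall back to the legacy study if the patient is absent."""
    if patient_id in pancan:
        return pancan[patient_id]
    # Returns None when the patient is in neither study.
    return legacy.get(patient_id)


# Illustrative data only (hypothetical patients and values).
pancan = {"TCGA-AA-0001": ("0:DiseaseFree", 12.5)}
legacy = {
    "TCGA-AA-0001": ("1:Recurred/Progressed", 3.0),
    "TCGA-AA-0002": ("0:DiseaseFree", 40.1),
}

in_pancan = lookup_dfs("TCGA-AA-0001", pancan, legacy)    # taken from pancan
only_legacy = lookup_dfs("TCGA-AA-0002", pancan, legacy)  # falls back to legacy
missing = lookup_dfs("TCGA-AA-0003", pancan, legacy)      # found in neither
```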


Transformations

AGE is clipped to the range 18 to 89.
OS_MONTHS is converted from demo__days_to_death when that value is present. If the patient is still alive, it is converted from diag__days_to_last_follow_up instead.
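
A minimal sketch of the two transformations above. The days-to-months factor is not stated in this document, so `DAYS_PER_MONTH` below is an assumption, as are the function names and the rounding to two decimals.

```python
from typing import Optional

# Assumption: average month length; the actual conversion factor is not documented here.
DAYS_PER_MONTH = 365.25 / 12


def clip_age(age: Optional[float]) -> Optional[float]:
    """Clip AGE to the range 18 to 89, leaving missing values untouched."""
    if age is None:
        return None
    return max(18, min(89, age))


def os_months(days_to_death: Optional[float],
              days_to_last_follow_up: Optional[float]) -> Optional[float]:
    """Use demo__days_to_death when present, else diag__days_to_last_follow_up."""
    days = days_to_death if days_to_death is not None else days_to_last_follow_up
    if days is None:
        return None
    return round(days / DAYS_PER_MONTH, 2)
```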


Remapped columns: TODO


CNA data


Retrieved from BigQuery table isb-cgc-bq.TCGA.copy_number_gene_level_hg38_gdc_current
Transformations

Ensembl gene IDs are mapped to Entrez IDs using the Genome Nexus hg38 canonical transcript file.
If a sample has multiple aliquots, it is "reduced" to a single aliquot chosen to represent the entire sample.

This is done by choosing the aliquot ID with the highest sort value (e.g. the highest plate number). This follows the same policy GDAC uses to reduce aliquot data in their studies.


Copy number values from the BigQuery tables are converted from ASCAT to GISTIC 2.0.
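
The aliquot reduction step can be illustrated roughly as below. Two assumptions are made for the sketch: the sample ID is taken to be the first 15 characters of the aliquot barcode, and "highest sort value" is read as the lexicographically greatest barcode. The barcodes and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def sample_id(aliquot_barcode: str) -> str:
    # Assumption: the sample ID is the first 15 characters, e.g. TCGA-AA-0001-01.
    return aliquot_barcode[:15]


def reduce_aliquots(aliquot_barcodes: Iterable[str]) -> Dict[str, str]:
    """Map each sample to the single aliquot with the highest sort value."""
    by_sample: Dict[str, List[str]] = defaultdict(list)
    for barcode in aliquot_barcodes:
        by_sample[sample_id(barcode)].append(barcode)
    # Assumption: plain string comparison stands in for the sort order.
    return {sample: max(barcodes) for sample, barcodes in by_sample.items()}


# Hypothetical barcodes: the first two differ only in plate number (0182 vs 0275).
chosen = reduce_aliquots([
    "TCGA-AA-0001-01A-01D-0182-01",
    "TCGA-AA-0001-01A-01D-0275-01",
    "TCGA-AA-0002-01A-01D-0182-01",
])
```

With these inputs, the aliquot on plate 0275 represents sample TCGA-AA-0001-01, since it sorts after the one on plate 0182.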


Segment data


Retrieved from BigQuery table isb-cgc-bq.TCGA.copy_number_segment_masked_hg38_gdc_current
Remapped columns:


| Original | cBioPortal |
| --- | --- |
| sample_barcode | ID |
| chromosome | chrom |
| start_pos | loc.start |
| end_pos | loc.end |
| num_probes | num.mark |
| segment_mean | seg.mean |
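
The column remapping can be expressed as a simple rename over each row. The sketch below uses the segment mapping listed above; the helper name `remap_columns` and the example row values are hypothetical, and columns missing from the map are passed through unchanged (an assumption).

```python
from typing import Any, Dict

# BigQuery column name -> cBioPortal column name, as listed for segment data.
SEG_COLUMN_MAP = {
    "sample_barcode": "ID",
    "chromosome": "chrom",
    "start_pos": "loc.start",
    "end_pos": "loc.end",
    "num_probes": "num.mark",
    "segment_mean": "seg.mean",
}


def remap_columns(row: Dict[str, Any], column_map: Dict[str, str]) -> Dict[str, Any]:
    """Rename keys found in column_map; keep all other keys as they are."""
    return {column_map.get(name, name): value for name, value in row.items()}


# Illustrative row only (hypothetical values).
row = {
    "sample_barcode": "TCGA-AA-0001-01A",
    "chromosome": "1",
    "start_pos": 1000,
    "end_pos": 2000,
    "num_probes": 12,
    "segment_mean": -0.25,
}
remapped = remap_columns(row, SEG_COLUMN_MAP)
```

The same helper would work for the mutation data by passing a map of its two remapped columns.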


Mutation data


Retrieved from BigQuery table isb-cgc-bq.TCGA.masked_somatic_mutation_hg38_gdc_current
Remapped columns:


| Original | cBioPortal |
| --- | --- |
| sample_barcode_tumor | Tumor_Sample_Barcode |
| sample_barcode_normal | Matched_Norm_Sample_Barcode |