  • Understand the different types of clinical data available from GDC
    • Expected timeframe: 1-2 weeks
    • Deliverable: Google Doc / Markdown doc explaining all of them
  • Decide which of these clinical data fields we'd like to fetch
    • Expected timeframe: 1 week
    • Deliverable: Updated Google doc / Markdown
  • Understand the genomic data present in GDC and decide which ones to fetch
    • Expected timeframe: 2 weeks
    • Deliverable: Google Doc / Markdown
  • Familiarize self with CDA
  • Is there a schema describing how data in each table links to each other?
    • case barcode - covers 90% of linkages
    • sample barcode / aliquot barcode cover rest
    • aliquot - one sample split into multiple test tubes
    • DNA methylation tables: Different chromosomes / different human genomes
  • Significance of version numbers on tables
    • Versioned tables are kept so that code referencing a specific version of the data can still run in the future without breaking
  • What’s the use case for the REST API?
    • It's meant for the website
  • Within just TCGA + CPTAC, are there any data inconsistencies that the CDA team has already discovered and that we need to watch out for?
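
To make the barcode linkage above concrete, here is a minimal sketch of deriving the case and sample barcodes from a TCGA aliquot barcode. It assumes the standard hyphen-separated TCGA layout, where the first three fields identify the case and the fourth adds the sample type and vial. The function name `split_tcga_barcode` and the aliquot suffix in the example are made up for illustration, and CPTAC identifiers follow a different layout that this sketch does not handle.

```python
def split_tcga_barcode(aliquot_barcode: str) -> dict:
    """Derive case and sample barcodes from a TCGA aliquot barcode.

    Assumes the hyphen-separated TCGA layout:
    TCGA-<site>-<participant>-<sample+vial>-<portion+analyte>-<plate>-<center>
    """
    parts = aliquot_barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA sample/aliquot barcode: {aliquot_barcode!r}")
    return {
        # First three fields: the case (patient) barcode.
        "case_barcode": "-".join(parts[:3]),
        # First four fields: the sample barcode.
        "sample_barcode": "-".join(parts[:4]),
        "aliquot_barcode": aliquot_barcode,
    }


# Hypothetical aliquot barcode; only the case part appears in these notes.
example = split_tcga_barcode("TCGA-2F-A9KO-01A-11D-A38G-08")
print(example["case_barcode"], example["sample_barcode"])
```

Truncating to the case barcode this way is what allows an aliquot-level table to be joined to a case-level table when no explicit case column is present.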
OTHER_PATIENT_ID PATIENT_ID GENDER RACE ETHNICITY PRIMARY_SITE_PATIENT CANCER_TYPE AGE_IN_DAYS OS_STATUS DFS_STATUS OS_MONTHS DFS_MONTHS
dd3a6357-9087-44b2-9956-f981e0de6f1c TCGA-2F-A9KO male WHITE NOT HISPANIC OR LATINO Bladder blca 23323 Dead not reported 24.113009198423125
65d1eaec-28db-4a41-a4db-1a710fcb24ad TCGA-2F-A9KP male WHITE NOT HISPANIC OR LATINO Bladder blca 24428 Dead not reported 11.957950065703022
25eaf3f9-c364-423c-aaae-925e7b393afc TCGA-2F-A9KQ male WHITE NOT HISPANIC OR LATINO Bladder blca 25259 Alive not reported 94.80946123521682 94.88115199
f6d916b0-8e4c-49cb-a0d4-883908f3284f TCGA-2F-A9KR female NOT REPORTED NOT REPORTED Bladder blca 21848 Dead not reported 104.56636005256242 101.4564224
bc6c516b-591e-4950-b6b0-decafa666f4e TCGA-2F-A9KT male WHITE NOT HISPANIC OR LATINO Bladder blca 30520 Alive not reported 77.26675427069645 108.9522307
3b464065-b2e9-4fb2-a7da-09e963fd43b3 TCGA-2F-A9KW female WHITE NOT HISPANIC OR LATINO Bladder blca 24703 Dead not reported 8.344283837056505
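
The OS_MONTHS values in the rows above look like day counts divided by 30.44 (for example, 734 / 30.44 gives the 24.113... in the first row). That divisor and the day counts are inferred from the numbers and are not stated anywhere in these notes, so treat the sketch below as an assumption. The helper name `days_to_months` is hypothetical.

```python
from typing import Optional

# Assumed average month length, inferred from the OS_MONTHS values above.
DAYS_PER_MONTH = 30.44


def days_to_months(days: Optional[float]) -> Optional[float]:
    """Convert a day count to months; pass missing values through as None."""
    if days is None:
        return None
    return days / DAYS_PER_MONTH


# 734 and 364 are inferred day counts, not values taken from the source data.
for days in (734, 364, None):
    print(days, days_to_months(days))
```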
1ecfdf16
ENST00000250823: [9084, 353513]
ENST00000250831: [159163, 378951]
ENST00000251595: [3039, 3040]
ENST00000272298: [801, 805]
ENST00000272395: [129868, 653192]
ENST00000289488: [220074, 120356739]
ENST00000300258: [140290, 110091775]
ENST00000301408: [1082, 93659]
ENST00000303910: [3012, 8335]
ENST00000304270: [389852, 729201]
ENSG00000083622
ENSG00000093100
ENSG00000100101
ENSG00000103200
ENSG00000106540
ENSG00000108516
ENSG00000108958
ENSG00000111780
ENSG00000112096
ENSG00000121388

Comments & assumptions made during curation

General

  • Study is updated once every 3 months with latest data from ISB-CGC BigQuery tables
  • Reference genome used: hg38
  • Only tumor sample data is included (no normal samples)

Clinical data

-- Empty patient queries
SELECT *
FROM `isb-cgc-bq.TARGET.clinical_gdc_current`
WHERE submitter_id IN (
  SELECT DISTINCT case_barcode
  FROM `isb-cgc-bq.TARGET.biospecimen_gdc_current`
  WHERE project_short_name IN ('TARGET-ALL-P1', 'TARGET-ALL-P2')
    AND sample_type IN ('01', '02', '06')
)
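
A query like the one above could be run from Python with the project and sample-type filters passed as parameters, for example to re-run it on each data refresh. This is only a sketch: it assumes the `google-cloud-bigquery` package (plus pandas for `to_dataframe`) is installed and that default credentials are configured. The function name `fetch_clinical` and the constant names are hypothetical.

```python
CLINICAL_TABLE = "isb-cgc-bq.TARGET.clinical_gdc_current"
BIOSPECIMEN_TABLE = "isb-cgc-bq.TARGET.biospecimen_gdc_current"

# Same query as above, with the literal lists replaced by array parameters.
QUERY = f"""
SELECT *
FROM `{CLINICAL_TABLE}`
WHERE submitter_id IN (
  SELECT DISTINCT case_barcode
  FROM `{BIOSPECIMEN_TABLE}`
  WHERE project_short_name IN UNNEST(@projects)
    AND sample_type IN UNNEST(@sample_types)
)
"""


def fetch_clinical(projects, sample_types):
    """Run QUERY and return the result as a pandas DataFrame."""
    # Imported lazily so the module loads without the BigQuery client installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ArrayQueryParameter("projects", "STRING", list(projects)),
            bigquery.ArrayQueryParameter("sample_types", "STRING", list(sample_types)),
        ]
    )
    return client.query(QUERY, job_config=job_config).to_dataframe()


# Usage (needs network access and credentials):
# df = fetch_clinical(["TARGET-ALL-P1", "TARGET-ALL-P2"], ["01", "02", "06"])
```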
import pandas as pd
import numpy as np
import requests


def main():
    df = pd.read_csv('clinical_data_mappings.tsv', sep='\t')
    attrib_names = df["cBioPortal CDD Attribute"]
    sheet_display_names = df["Display Name"]
    sheet_descs = df["Description"]
{
  "variant": "6:g.37282286_37282287del",
  "originalVariantQuery": "6:g.37282286_37282287del",
  "hgvsg": "6:g.37282286_37282287del",
  "id": "6:g.37282286_37282287del",
  "assembly_name": "GRCh37",
  "seq_region_name": "6",
  "start": 37282286,
  "end": 37282287,
  "allele_string": "AA/-",
Hugo_Symbol Primary_Site Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer HGVSp_Short t_ref_count t_alt_count n_alt_count n_ref_count
CYP4B1 Brain 1580 BI GRCh38 chr1 46810818 46810818 + Missense_Mutation SNP C C T rs200083913 C3L-00104-01 C3L-00104-31 Somatic Illumina HiSeq 4000 p.T64M 122 93
HFM1 Brain 164045 BI GRCh38 chr1 91375532 91375532 + Missense_Mutation SNP G G A rs1557483909 C3L-00104-01 C3L-00104-31 Somatic Illumina HiSeq 4000 p.L531F 144 90
INTS3 Brain 65123 BI GRCh38 chr1 153741306 153741306 +