- Understand the different types of clinical data available from GDC
  - Expected timeframe: 1-2 weeks
  - Deliverable: Google Doc / Markdown doc explaining all of them
- Decide which of these clinical data fields we'd like to fetch
  - Expected timeframe: 1 week
  - Deliverable: Updated Google Doc / Markdown
- Understand the genomic data present in GDC and decide which ones to fetch
  - Expected timeframe: 2 weeks
  - Deliverable: Google Doc / Markdown
- Familiarize self with CDA
  - Is there a schema describing how the data in each table links to the others?
    - Case barcode: covers 90% of linkages
    - Sample barcode / aliquot barcode cover the rest
    - Aliquot: one sample split into multiple test tubes
  - DNA methylation tables: different chromosomes / different human genomes
  - Significance of version numbers on tables
    - Versioned tables are kept around so that code referencing one version of the data can still be run in the future without breaking
  - What’s the use case for the REST API?
    - It's meant for the website
  - Within just TCGA + CPTAC, are there any data inconsistencies that the CDA team has already discovered that we need to watch out for?
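The barcode linkage noted above can be illustrated for TCGA-style identifiers. This is a minimal sketch, assuming the standard TCGA barcode layout (seven hyphen-separated fields, where the first three identify the case and the first four identify the sample). `tcga_barcode_levels` is a hypothetical helper, not part of any GDC or CDA API, and CPTAC identifiers follow a different scheme that it does not handle.

```python
def tcga_barcode_levels(aliquot_barcode: str) -> dict:
    """Split a full TCGA aliquot barcode into its case/sample/aliquot levels.

    Assumption: the barcode has the usual seven hyphen-separated fields,
    e.g. TCGA-02-0001-01C-01D-0182-01.
    """
    parts = aliquot_barcode.split("-")
    if len(parts) != 7 or parts[0] != "TCGA":
        raise ValueError(f"not a full TCGA aliquot barcode: {aliquot_barcode!r}")
    return {
        # Project, tissue source site, participant: the case-level join key.
        "case_barcode": "-".join(parts[:3]),
        # Adds sample type and vial: the sample-level join key.
        "sample_barcode": "-".join(parts[:4]),
        # The full string identifies one aliquot (one tube) of that sample.
        "aliquot_barcode": aliquot_barcode,
    }


levels = tcga_barcode_levels("TCGA-02-0001-01C-01D-0182-01")
print(levels["case_barcode"])    # TCGA-02-0001
print(levels["sample_barcode"])  # TCGA-02-0001-01C
```

Because every aliquot barcode contains its sample and case barcodes as prefixes, a table keyed at the aliquot level can be rolled up to the case level before joining to clinical data.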
OTHER_PATIENT_ID PATIENT_ID GENDER RACE ETHNICITY PRIMARY_SITE_PATIENT CANCER_TYPE AGE_IN_DAYS OS_STATUS DFS_STATUS OS_MONTHS DFS_MONTHS
dd3a6357-9087-44b2-9956-f981e0de6f1c TCGA-2F-A9KO male WHITE NOT HISPANIC OR LATINO Bladder blca 23323 Dead not reported 24.113009198423125
65d1eaec-28db-4a41-a4db-1a710fcb24ad TCGA-2F-A9KP male WHITE NOT HISPANIC OR LATINO Bladder blca 24428 Dead not reported 11.957950065703022
25eaf3f9-c364-423c-aaae-925e7b393afc TCGA-2F-A9KQ male WHITE NOT HISPANIC OR LATINO Bladder blca 25259 Alive not reported 94.80946123521682 94.88115199
f6d916b0-8e4c-49cb-a0d4-883908f3284f TCGA-2F-A9KR female NOT REPORTED NOT REPORTED Bladder blca 21848 Dead not reported 104.56636005256242 101.4564224
bc6c516b-591e-4950-b6b0-decafa666f4e TCGA-2F-A9KT male WHITE NOT HISPANIC OR LATINO Bladder blca 30520 Alive not reported 77.26675427069645 108.9522307
3b464065-b2e9-4fb2-a7da-09e963fd43b3 TCGA-2F-A9KW female WHITE NOT HISPANIC OR LATINO Bladder blca 24703 Dead not reported 8.344283837056505
1ecfdf16
ENST00000250823: [9084, 353513]
ENST00000250831: [159163, 378951]
ENST00000251595: [3039, 3040]
ENST00000272298: [801, 805]
ENST00000272395: [129868, 653192]
ENST00000289488: [220074, 120356739]
ENST00000300258: [140290, 110091775]
ENST00000301408: [1082, 93659]
ENST00000303910: [3012, 8335]
ENST00000304270: [389852, 729201]
ENSG00000083622
ENSG00000093100
ENSG00000100101
ENSG00000103200
ENSG00000106540
ENSG00000108516
ENSG00000108958
ENSG00000111780
ENSG00000112096
ENSG00000121388
-- Empty patient queries
SELECT *
FROM `isb-cgc-bq.TARGET.clinical_gdc_current`
WHERE submitter_id IN (
  SELECT DISTINCT case_barcode
  FROM `isb-cgc-bq.TARGET.biospecimen_gdc_current`
  WHERE project_short_name IN ('TARGET-ALL-P1', 'TARGET-ALL-P2')
    AND sample_type IN ('01', '02', '06')
)
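The IN-subquery pattern in the query above can be tried locally on toy rows. This is a minimal sketch using Python's built-in `sqlite3`, not BigQuery: the table names `clinical` and `biospecimen` are simplified stand-ins, and the `CASE-A`/`CASE-B`/`CASE-C` rows are made-up illustration data, not real TARGET records.

```python
import sqlite3

# In-memory database with two toy tables standing in for the BigQuery ones.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clinical (submitter_id TEXT);
    CREATE TABLE biospecimen (
        case_barcode TEXT, project_short_name TEXT, sample_type TEXT
    );
    INSERT INTO clinical VALUES ('CASE-A'), ('CASE-B'), ('CASE-C');
    INSERT INTO biospecimen VALUES
        ('CASE-A', 'TARGET-ALL-P2', '01'),  -- right project, wanted sample type
        ('CASE-A', 'TARGET-ALL-P2', '10'),  -- same case, other sample type
        ('CASE-B', 'TARGET-ALL-P2', '10'),  -- right project, no wanted sample type
        ('CASE-C', 'TARGET-AML',    '01');  -- wanted sample type, other project
    """
)

# Same shape as the original query: keep clinical rows whose case has at
# least one biospecimen row passing both filters.
rows = conn.execute(
    """
    SELECT submitter_id
    FROM clinical
    WHERE submitter_id IN (
        SELECT DISTINCT case_barcode
        FROM biospecimen
        WHERE project_short_name IN ('TARGET-ALL-P1', 'TARGET-ALL-P2')
          AND sample_type IN ('01', '02', '06')
    )
    """
).fetchall()

matched = [row[0] for row in rows]
print(matched)  # ['CASE-A']
```

A case is kept as soon as any one of its samples passes both filters, so a case with mixed sample types still appears once in the result.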
import pandas as pd
import numpy as np
import requests


def main():
    # Load the tab-separated sheet of clinical attribute mappings.
    df = pd.read_csv('clinical_data_mappings.tsv', sep='\t')
    # Pull out the columns of interest by their header names.
    attrib_names = df["cBioPortal CDD Attribute"]
    sheet_display_names = df["Display Name"]
    sheet_descs = df["Description"]
{
  "variant": "6:g.37282286_37282287del",
  "originalVariantQuery": "6:g.37282286_37282287del",
  "hgvsg": "6:g.37282286_37282287del",
  "id": "6:g.37282286_37282287del",
  "assembly_name": "GRCh37",
  "seq_region_name": "6",
  "start": 37282286,
  "end": 37282287,
  "allele_string": "AA/-",
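The `hgvsg` string in the record above encodes the same chromosome and coordinates as the `seq_region_name`, `start`, and `end` fields. This is a minimal sketch of checking that consistency, assuming only simple deletions of the form `chrom:g.start_enddel` or `chrom:g.posdel`. `parse_hgvsg_deletion` is a hypothetical helper, not a full HGVS parser and not part of any annotation service's API.

```python
import re
from typing import Tuple

# Matches e.g. "6:g.37282286_37282287del" or a single-base "6:g.123del".
_HGVSG_DELETION = re.compile(
    r"^(?P<chrom>[^:]+):g\.(?P<start>\d+)(?:_(?P<end>\d+))?del$"
)


def parse_hgvsg_deletion(hgvsg: str) -> Tuple[str, int, int]:
    """Return (chromosome, start, end) for a simple HGVSg deletion string."""
    match = _HGVSG_DELETION.match(hgvsg)
    if match is None:
        raise ValueError(f"not a simple HGVSg deletion: {hgvsg!r}")
    start = int(match.group("start"))
    # A single-position deletion has no "_end" part, so end equals start.
    end = int(match.group("end") or start)
    return match.group("chrom"), start, end


# Fields copied from the record shown above.
record = {
    "hgvsg": "6:g.37282286_37282287del",
    "seq_region_name": "6",
    "start": 37282286,
    "end": 37282287,
}

chrom, start, end = parse_hgvsg_deletion(record["hgvsg"])
consistent = (chrom, start, end) == (
    record["seq_region_name"], record["start"], record["end"]
)
print(consistent)  # True
```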
Hugo_Symbol Primary_Site Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer HGVSp_Short t_ref_count t_alt_count n_alt_count n_ref_count
CYP4B1 Brain 1580 BI GRCh38 chr1 46810818 46810818 + Missense_Mutation SNP C C T rs200083913 C3L-00104-01 C3L-00104-31 Somatic Illumina HiSeq 4000 p.T64M 122 93
HFM1 Brain 164045 BI GRCh38 chr1 91375532 91375532 + Missense_Mutation SNP G G A rs1557483909 C3L-00104-01 C3L-00104-31 Somatic Illumina HiSeq 4000 p.L531F 144 90
INTS3 Brain 65123 BI GRCh38 chr1 153741306 153741306 +
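The `t_ref_count` and `t_alt_count` columns in the table above are commonly used to compute a tumor variant allele fraction. This is a minimal sketch, assuming the last two populated values in the CYP4B1 row (122 and 93) are `t_ref_count` and `t_alt_count`. Because the preview has lost its tab separators, that column assignment is an assumption. `variant_allele_fraction` is a hypothetical helper name.

```python
def variant_allele_fraction(t_ref_count: int, t_alt_count: int) -> float:
    """Fraction of tumor reads supporting the alternate allele.

    Only reference and alternate read counts are considered; reads
    supporting any other allele are ignored in this simple version.
    """
    depth = t_ref_count + t_alt_count
    if depth == 0:
        raise ValueError("no reads cover this position")
    return t_alt_count / depth


# Counts assumed from the CYP4B1 row: 122 reference reads, 93 alternate reads.
vaf = variant_allele_fraction(t_ref_count=122, t_alt_count=93)
print(f"{vaf:.3f}")  # 0.433
```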