James Ko jamesqo

## msk-tentative-roadmap.md

      
Last active: June 22, 2023 04:32

Understand the different types of clinical data available from GDC

Expected timeframe: 1-2 weeks
Deliverable: Google Doc / Markdown doc explaining all of them


Decide which of these clinical data fields we'd like to fetch

Expected timeframe: 1 week
Deliverable: Updated Google doc / Markdown


Understand the genomic data present in GDC and decide which ones to fetch

Expected timeframe: 2 weeks
Deliverable: Google Doc / Markdown


Familiarize self with CDA


## cda-meeting-notes-07-14.md

      
Created: July 14, 2023 16:50

Is there a schema describing how data in each table links to each other?

Case barcode: covers 90% of linkages
Sample barcode / aliquot barcode: cover the rest
Aliquot: one portion of a sample that has been split across multiple test tubes
DNA methylation tables: Different chromosomes / different human genomes
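
A minimal sketch of what linking on the case barcode could look like, assuming both tables expose a `case_barcode` column (as the biospecimen query elsewhere in these notes suggests). The rows, barcodes, and the `link_by_case` helper are made up for illustration and are not part of CDA or ISB-CGC.

```python
from collections import defaultdict

# Toy rows standing in for two tables; barcodes and field values are made up.
clinical_rows = [
    {"case_barcode": "CASE-A", "primary_site": "Bladder"},
    {"case_barcode": "CASE-B", "primary_site": "Brain"},
]
sample_rows = [
    {"case_barcode": "CASE-A", "sample_barcode": "CASE-A-01"},
    {"case_barcode": "CASE-A", "sample_barcode": "CASE-A-02"},
    {"case_barcode": "CASE-B", "sample_barcode": "CASE-B-01"},
]


def link_by_case(clinical, samples):
    """Attach each case's sample barcodes to its clinical row (one-to-many)."""
    samples_by_case = defaultdict(list)
    for row in samples:
        samples_by_case[row["case_barcode"]].append(row["sample_barcode"])
    # Cases without any sample get an empty list rather than being dropped.
    return [
        {**row, "sample_barcodes": samples_by_case.get(row["case_barcode"], [])}
        for row in clinical
    ]


linked = link_by_case(clinical_rows, sample_rows)
for row in linked:
    print(row["case_barcode"], row["sample_barcodes"])
```

Tables that only carry a sample or aliquot barcode would need the same grouping keyed on that column instead.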


Significance of version numbers on tables

Versioned tables are kept so that code referencing a specific version of the data can still run in the future without breaking


What's the use case for the REST API?

It's meant for the website


Within just TCGA + CPTAC, are there any data inconsistencies that the CDA team has already discovered and that we need to watch out for?


## data_clinical_patient.txt
OTHER_PATIENT_ID	PATIENT_ID	GENDER	RACE	ETHNICITY	PRIMARY_SITE_PATIENT	CANCER_TYPE	AGE_IN_DAYS	OS_STATUS	DFS_STATUS	OS_MONTHS	DFS_MONTHS
dd3a6357-9087-44b2-9956-f981e0de6f1c	TCGA-2F-A9KO	male	WHITE	NOT HISPANIC OR LATINO	Bladder	blca	23323	Dead	not reported	24.113009198423125
65d1eaec-28db-4a41-a4db-1a710fcb24ad	TCGA-2F-A9KP	male	WHITE	NOT HISPANIC OR LATINO	Bladder	blca	24428	Dead	not reported	11.957950065703022
25eaf3f9-c364-423c-aaae-925e7b393afc	TCGA-2F-A9KQ	male	WHITE	NOT HISPANIC OR LATINO	Bladder	blca	25259	Alive	not reported	94.80946123521682	94.88115199
f6d916b0-8e4c-49cb-a0d4-883908f3284f	TCGA-2F-A9KR	female	NOT REPORTED	NOT REPORTED	Bladder	blca	21848	Dead	not reported	104.56636005256242	101.4564224
bc6c516b-591e-4950-b6b0-decafa666f4e	TCGA-2F-A9KT	male	WHITE	NOT HISPANIC OR LATINO	Bladder	blca	30520	Alive	not reported	77.26675427069645	108.9522307
3b464065-b2e9-4fb2-a7da-09e963fd43b3	TCGA-2F-A9KW	female	WHITE	NOT HISPANIC OR LATINO	Bladder	blca	24703	Dead	not reported	8.344283837056505
1ecfdf16
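
A rough sketch of reading rows like the ones above, assuming the file is tab-separated and that a missing trailing `DFS_MONTHS` field means "no value". Two rows are copied from the excerpt. The `to_float` helper and the conversion of `AGE_IN_DAYS` to years with 365.25 days per year are illustrative assumptions, not part of the original pipeline.

```python
import csv
import io

HEADER = [
    "OTHER_PATIENT_ID", "PATIENT_ID", "GENDER", "RACE", "ETHNICITY",
    "PRIMARY_SITE_PATIENT", "CANCER_TYPE", "AGE_IN_DAYS", "OS_STATUS",
    "DFS_STATUS", "OS_MONTHS", "DFS_MONTHS",
]
ROWS = [
    # This row has no DFS_MONTHS value (11 fields instead of 12).
    ["dd3a6357-9087-44b2-9956-f981e0de6f1c", "TCGA-2F-A9KO", "male", "WHITE",
     "NOT HISPANIC OR LATINO", "Bladder", "blca", "23323", "Dead",
     "not reported", "24.113009198423125"],
    ["25eaf3f9-c364-423c-aaae-925e7b393afc", "TCGA-2F-A9KQ", "male", "WHITE",
     "NOT HISPANIC OR LATINO", "Bladder", "blca", "25259", "Alive",
     "not reported", "94.80946123521682", "94.88115199"],
]
text = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)


def to_float(value):
    """Return a float, or None when the field is absent or empty."""
    return float(value) if value not in (None, "") else None


patients = []
# DictReader fills fields missing from short rows with None.
for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
    patients.append({
        "patient_id": row["PATIENT_ID"],
        "age_years": int(row["AGE_IN_DAYS"]) / 365.25,  # assumed conversion
        "os_months": to_float(row["OS_MONTHS"]),
        "dfs_months": to_float(row["DFS_MONTHS"]),
    })

for patient in patients:
    print(patient)
```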

## ambiguous_ensembl_ids.txt
ENST00000250823: [9084, 353513]
ENST00000250831: [159163, 378951]
ENST00000251595: [3039, 3040]
ENST00000272298: [801, 805]
ENST00000272395: [129868, 653192]
ENST00000289488: [220074, 120356739]
ENST00000300258: [140290, 110091775]
ENST00000301408: [1082, 93659]
ENST00000303910: [3012, 8335]
ENST00000304270: [389852, 729201]
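
A small sketch for loading this file into a dictionary, assuming every line has the form `ENST…: [id, id, …]` as in the excerpt. The numbers are presumably candidate gene IDs that one transcript maps to, but the file itself does not say so. `parse_ambiguous_ids` is a hypothetical helper name.

```python
import re

# One transcript ID, a colon, then a bracketed comma-separated list of integers.
LINE_RE = re.compile(r"^(ENST\d+):\s*\[([\d,\s]+)\]$")


def parse_ambiguous_ids(text):
    """Map each transcript ID to its list of candidate integer IDs."""
    mapping = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or malformed lines
        ids = [int(part) for part in match.group(2).split(",") if part.strip()]
        mapping[match.group(1)] = ids
    return mapping


sample = """ENST00000250823: [9084, 353513]
ENST00000251595: [3039, 3040]"""
ambiguous = parse_ambiguous_ids(sample)
print(ambiguous)
```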

## missing_ensembl_ids_tcga_acc.txt
ENSG00000083622
ENSG00000093100
ENSG00000100101
ENSG00000103200
ENSG00000106540
ENSG00000108516
ENSG00000108958
ENSG00000111780
ENSG00000112096
ENSG00000121388

## tcga-readme.md

      
Last active: August 31, 2023 20:30

Comments & assumptions made during curation

General


The study is updated every 3 months with the latest data from the ISB-CGC BigQuery tables
Reference genome used: hg38
Only tumor sample data is included (no normal samples)
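
One way the tumor-only rule could be checked for TCGA sample barcodes, as a sketch: in the TCGA barcode scheme the fourth dash-separated field starts with a two-digit sample type code, where 01-09 are tumor types and 10-19 are normal types. The helper names and the example barcodes (a patient ID from this study with an illustrative sample suffix) are assumptions, and this is not necessarily how the curation scripts do the filtering.

```python
def sample_type_code(barcode):
    """Return the two-digit sample type code from a TCGA sample barcode."""
    parts = barcode.split("-")
    if len(parts) < 4 or not parts[3][:2].isdigit() or len(parts[3]) < 2:
        raise ValueError(f"not a TCGA sample barcode: {barcode!r}")
    return int(parts[3][:2])


def is_tumor_sample(barcode):
    """True for tumor sample types (codes 01-09), False otherwise."""
    return 1 <= sample_type_code(barcode) <= 9


# Illustrative barcodes: the "-01A" / "-10A" suffixes are made up.
barcodes = ["TCGA-2F-A9KO-01A", "TCGA-2F-A9KO-10A"]
tumor_only = [b for b in barcodes if is_tumor_sample(b)]
print(tumor_only)
```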

Clinical data


## queries.sql
-- Empty patient queries

SELECT *
FROM `isb-cgc-bq.TARGET.clinical_gdc_current`
WHERE submitter_id IN (
    SELECT DISTINCT case_barcode
    FROM `isb-cgc-bq.TARGET.biospecimen_gdc_current`
    WHERE project_short_name IN ('TARGET-ALL-P1', 'TARGET-ALL-P2')
      AND sample_type IN ('01', '02', '06')
)

## find_attribs_missing_from_cdd.py
import pandas as pd
import numpy as np
import requests

def main():
    df = pd.read_csv('clinical_data_mappings.tsv', sep='\t')
    attrib_names = df["cBioPortal CDD Attribute"]
    sheet_display_names = df["Display Name"]
    sheet_descs = df["Description"]

## gn_response.json
{
  "variant": "6:g.37282286_37282287del",
  "originalVariantQuery": "6:g.37282286_37282287del",
  "hgvsg": "6:g.37282286_37282287del",
  "id": "6:g.37282286_37282287del",
  "assembly_name": "GRCh37",
  "seq_region_name": "6",
  "start": 37282286,
  "end": 37282287,
  "allele_string": "AA/-",
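
The `seq_region_name`, `start`, and `end` fields in the (truncated) response above line up with the pieces of the `hgvsg` string. A minimal sketch of pulling those pieces out of a genomic deletion string, assuming only the simple `chrom:g.start_enddel` or `chrom:g.posdel` forms. `parse_hgvsg_deletion` is a hypothetical helper, not a Genome Nexus function, and it does not cover the full HGVS grammar.

```python
import re

# Matches e.g. "6:g.37282286_37282287del" or "6:g.37282286del".
HGVSG_DEL_RE = re.compile(
    r"^(?P<chrom>[^:]+):g\.(?P<start>\d+)(?:_(?P<end>\d+))?del$"
)


def parse_hgvsg_deletion(hgvsg):
    """Split a simple genomic deletion string into chromosome, start, and end."""
    match = HGVSG_DEL_RE.match(hgvsg)
    if match is None:
        raise ValueError(f"unsupported HGVS g. deletion: {hgvsg!r}")
    start = int(match.group("start"))
    # A single-base deletion has no "_end" part, so end equals start.
    end = int(match.group("end")) if match.group("end") else start
    return {"chrom": match.group("chrom"), "start": start, "end": end}


parsed = parse_hgvsg_deletion("6:g.37282286_37282287del")
print(parsed)
```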

## data_mutations__brain_cptac_gdc.txt
Hugo_Symbol	Primary_Site	Entrez_Gene_Id	Center	NCBI_Build	Chromosome	Start_Position	End_Position	Strand	Variant_Classification	Variant_Type	Reference_Allele	Tumor_Seq_Allele1	Tumor_Seq_Allele2	dbSNP_RS	dbSNP_Val_Status	Tumor_Sample_Barcode	Matched_Norm_Sample_Barcode	Match_Norm_Seq_Allele1	Match_Norm_Seq_Allele2	Tumor_Validation_Allele1	Tumor_Validation_Allele2	Match_Norm_Validation_Allele1	Match_Norm_Validation_Allele2	Verification_Status	Validation_Status	Mutation_Status	Sequencing_Phase	Sequence_Source	Validation_Method	Score	BAM_File	Sequencer	HGVSp_Short	t_ref_count	t_alt_count	n_alt_count	n_ref_count
CYP4B1	Brain	1580	BI	GRCh38	chr1	46810818	46810818	+	Missense_Mutation	SNP	C	C	T	rs200083913		C3L-00104-01	C3L-00104-31									Somatic						Illumina HiSeq 4000	p.T64M	122	93
HFM1	Brain	164045	BI	GRCh38	chr1	91375532	91375532	+	Missense_Mutation	SNP	G	G	A	rs1557483909		C3L-00104-01	C3L-00104-31									Somatic						Illumina HiSeq 4000	p.L531F	144	90
INTS3	Brain	65123	BI	GRCh38	chr1	153741306	153741306	+