Skip to content

Instantly share code, notes, and snippets.

@smrgit
smrgit / use_protein_mRNA_corr_to_score_pathways.sql
Created May 17, 2017 19:23
use the wiki pathways table to "score" pathways using the correlation between the protein quantification and the mRNA-seq based expression estimates
WITH
-- first we get the 77 samples that passed the QC tests
qcSet AS (
SELECT
TCGA_case_ID AS case_barcode
FROM
`isb-cgc.hg19_data_previews.TCGA_Breast_SuppTable01`
WHERE
QC_Status="pass" ),
--
@smrgit
smrgit / joinCorr_CPTAC_RPPA_mRNAseq.sql
Created May 16, 2017 22:58
join correlations between CPTAC data with RPPA and mRNA-Seq data all in one go!
WITH
-- first we get the 77 samples that passed the QC tests
qcSet AS (
SELECT
TCGA_case_ID AS case_barcode
FROM
`isb-cgc.hg19_data_previews.TCGA_Breast_SuppTable01`
WHERE
QC_Status="pass" ),
--
@smrgit
smrgit / case_images_join.sql
Created April 28, 2017 00:56
Find radiology and diagnostic images for selected types of TCGA cases
WITH
cases AS (
SELECT
case_barcode
FROM
`isb-cgc.TCGA_bioclin_v0.Clinical`
WHERE
project_short_name="TCGA-LUSC"
OR project_short_name="TCGA-LUAD" ),
radImg AS (
@smrgit
smrgit / GDC_data_not_in_CloudStorage.sql
Last active September 15, 2017 17:16
Take a look at the types of open-access TXT and TSV data files that exist at the GDC but are *not* available in Google Cloud Storage.
WITH
t1 AS (
SELECT
dbName,
file_id AS file_gdc_id,
access,
cases__project__program__name AS program_name,
cases__project__project_id AS project_short_name,
experimental_strategy,
data_category,
@smrgit
smrgit / GDC_data_in_CloudStorage.sql
Last active September 15, 2017 17:05
This query joins two ISB-CGC GDC_metadata tables to get a summary of the GDC data that is available in ISB-CGC buckets in Google Cloud Storage.
WITH
t1 AS (
SELECT
dbName,
file_id AS file_gdc_id,
access,
cases__project__program__name AS program_name,
cases__project__project_id AS project_short_name,
experimental_strategy,
data_category,
@smrgit
smrgit / high_var_mirna.sql
Last active April 16, 2017 22:55
find most variable miRNAs (by MIMAT accession)
WITH
-- first we just extract all of the rows corresponding to mature miR's
d1 AS (
SELECT
sample_barcode,
aliquot_barcode,
chromosome,
start_pos,
end_pos,
strand,
@smrgit
smrgit / bcgsc_gdc_delta_isoforms.sql
Created March 24, 2017 03:46
join BCGSC and GDC hg38 miRNA isoform-level data and compute read_count deltas
WITH
aList AS (
SELECT
aliquot_barcode AS abarcode
FROM
`isb-cgc-04-0010.draft_new_data.bcgsc_hg38_isoforms`
GROUP BY
abarcode ),
gdcData AS (
SELECT
@smrgit
smrgit / GO_scoring_v0.sql
Created March 22, 2017 23:51
GO_scoring_v0
WITH
--
-- we start by translating the correlations that we got to ranks,
-- based on sorting the genes on corrByGene "DESC"
-- this will result in the highest positive correlation getting
-- rank #1, etc
-- we also lightly filter the genes by excluding any with near-zero
-- or negative correlation coefficients, and the result is a list
-- of approx 9000 genes with symbol, correlation, and rank
geneScoresT AS (
@smrgit
smrgit / miRNA_hg19_hg38_stemloop_corr.sql
Last active April 14, 2017 17:44
GQ GDC hg19 vs hg38 miRNA correlation analysis -- at the stemloop expression level
WITH
hg38_d1 AS (
-- we start with a table at the aliquot level, in case there are multiple aliquots
-- for a single sample;
SELECT
sample_barcode,
aliquot_barcode,
mirna_id,
reads_per_million_miRNA_mapped AS RPM
FROM
@smrgit
smrgit / BRCA_CPTAC_RPPA_corr.sql
Last active March 22, 2017 15:42
BRCA CPTAC-RPPA correlation
WITH
-- first we get the 77 samples that passed the QC tests
qcSet AS (
SELECT
TCGA_case_ID AS case_barcode
FROM
`isb-cgc.hg19_data_previews.TCGA_Breast_SuppTable01`
WHERE
QC_Status="pass" ),
--