Created
March 22, 2017 23:51
-
-
Save smrgit/5b154bdb4642d40d8c04122c870eb7b0 to your computer and use it in GitHub Desktop.
GO_scoring_v0
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
WITH | |
-- | |
-- we start by translating the correlations that we got to ranks, | |
-- based on sorting the genes on corrByGene "DESC" | |
-- this will result in the highest positive correlation getting | |
-- rank #1, etc | |
-- we also lightly filter the genes by excluding any with near-zero | |
-- or negative correlation coefficients, and the result is a list | |
-- of approx 9000 genes with symbol, correlation, and rank | |
geneScoresT AS ( | |
SELECT | |
gene, | |
corrByGene, | |
-- here we do get a DENSE ranking based on ORDERing the | |
-- correlations from most positive downward | |
## DENSE_RANK() OVER(ORDER BY corrByGene ASC) AS gene_score | |
-- rather than doing a ranking, we could also just try using the | |
-- correlations directly, after inverting them so that a | |
-- a smaller value is "better" | |
(1-corrByGene) AS gene_score | |
FROM | |
`isb-cgc.smr_scratch.CPTAC_mRNAseq_corrs` | |
WHERE | |
corrByGene>0.10 ), | |
-- | |
-- next we JOIN the ranks to GO terms based on gene symbol; | |
-- basically what we are doing is annotating the ranking with the | |
-- GO_ID, GO_Name and Evidence category: | |
-- | |
annotScoresT AS ( | |
SELECT | |
gene, | |
corrByGene, | |
gene_score, | |
GO_ID, | |
GO_Name, | |
Evidence | |
FROM | |
geneScoresT AS a | |
JOIN | |
`isb-cgc-02-0001.Daves_working_area.go_slim_hgnc` AS b | |
ON | |
a.gene = b.Symbol ), | |
-- | |
-- Now we're going to "score" the GO gene-sets by grouping on | |
-- GO_ID and GO_Name. For each ID/Name, we sum the ranks, count | |
-- the number of genes, and then define the score as the sum | |
-- of the ranks divided by the number of genes | |
-- | |
goScoresT AS ( | |
SELECT | |
GO_ID, | |
GO_Name, | |
SUM(gene_score) AS sumGenesScore, | |
COUNT(gene) AS numGenes, | |
SUM(gene_score) / COUNT(gene) AS normScore | |
FROM | |
annotScoresT | |
WHERE | |
( Evidence="EXP" | |
OR Evidence="IDA" | |
OR Evidence="IPI" | |
OR Evidence="IMP" | |
OR Evidence="IGI" | |
OR Evidence="IEP" ) | |
GROUP BY | |
GO_ID, | |
GO_Name ) | |
-- | |
-- Finally, we should have our scored GO categories! | |
-- remember that the lower the score, the better, | |
-- so we sort ASCending | |
-- | |
SELECT | |
* | |
FROM | |
goScoresT | |
WHERE | |
numGenes >= 20 | |
ORDER BY | |
normScore ASC |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment