Skip to content

Instantly share code, notes, and snippets.

@bjo
Last active October 20, 2015 20:04
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save bjo/0dd6bd0cc0252667a1dc to your computer and use it in GitHub Desktop.
Save bjo/0dd6bd0cc0252667a1dc to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"name": "",
"signature": "sha256:8adc23cefdc7354083f9e14db98c3c5755e1afd151c4b82baeb79b948ccec95e"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"GTEx eQTL detection: cis- pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is the master notebook for the cis-pipeline. There are separate notebooks for:\n",
"\n",
"misc. and meta analyses: http://nbviewer.ipython.org/gist/bjo/646278c49a10b6307bd9\n",
"eQTLBMA part of the pipeline: http://nbviewer.ipython.org/gist/bjo/250135a17a004b014434\n",
"Collection of relevant plots: http://nbviewer.ipython.org/gist/bjo/633dddf85ad9dc54c07b\n",
"\n",
"The notebook is organized in a following fashion:\n",
"\n",
"### 0. GC content (Ideally already taken care of in the pre-processing, to be fixed)\n",
"### 1. Running PEER in a tissue-specific manner\n",
"### 2. \n",
"\n",
"Now that the tissue-level quantification has been finished for GTEx Phase 1, we can move on to the cis-eQTL pipeline. First, we can take the TPM-converted matrices (view Ian's consortium notebook for more details) and control for the GC content, which is one of the standard practices in eQTL studies, and an example is shown in this work:\n",
"http://www.nature.com/nature/journal/v464/n7289/full/nature08872.html\n",
"\n",
"For now, we'll use the 'certain biotypes', and use FeatureCounts TPM data as an example, and we'll only take expression from individuals for which we have known genotypes. The correction of other matrices will be conducted later in the notebook. The TPM-converted matrices are located at:\n",
"\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes\n",
"\n",
"\n",
"### Note: For the details of the pre-processing pipeline, consult Ian's notebook. Here are some brief remarks that might be of interest:\n",
"\n",
"#### 1. Threshold for MAF has been set at 0.05.\n",
"#### 2. This pipeline only looks at gene-level transcripts for 'certain biotypes', but we can move on to uncertain biotypes and exon-level quantifications later.\n",
"#### 3. The matrices are initially corrected for gene/transcript length, sequencing depth, and GC content.\n",
"#### 4. Separate analysis for protein-coding genes and non-coding genes.\n",
"#### 5. For the purposes of this pipeline, we will define cis- as being within 100Kb of the gene. (100Kb from TSS for eQTLBMA in particular, seems like the TES implementation doesn't work).\n",
"#### 6. The pipeline will correct for top 3 genotype PCs along with varying number of PEER factors (optimized tissue-specifically) for eQTL detection.\n",
"#### 7. The matrices are quantified with featurecounts, htseq and RSEM.\n",
"#### 8. eQTL detection is done with MatrixEQTL, eQTLBMA, and snptest.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 0. GC content (Ideally already taken care of in the pre-processing, to be fixed)\n",
"\n",
"First, we need to calculate GC content for each transcripts, and we'll use the ENCODE v.19 annotations, as they were also used for the quantification. (This portion is very similar to the normalization step from Ian's notebook on RNA-seq mapping)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"cd /tigress/BEE/gtex/external_sources/hg19/\n",
"wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.pc_transcripts.fa.gz\n",
"wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.lncRNA_transcripts.fa.gz"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the longest transcripts per gene"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"cd /tigress/BEE/gtex/external_sources/hg19/\n",
"gzip -cd gencode.v19.lncRNA_transcripts.fa.gz\n",
"gzip -cd gencode.v19.pc_transcripts.fa.gz\n",
"\n",
"mkdir /tigress/BEE/gtex/data/auxiliary_bj5\n",
"mkdir /tigress/BEE/gtex/data/auxiliary_bj5/normalization\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/isolate_longest_sequences_in_transcriptomes.py \\\n",
"/tigress/BEE/gtex/external_sources/hg19/gencode.v19.pc_transcripts.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.longest.fa\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/isolate_longest_sequences_in_transcriptomes.py \\\n",
"/tigress/BEE/gtex/external_sources/hg19/gencode.v19.lncRNA_transcripts.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.longest.fa"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute sequence lengths"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"# Transcripts\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_sequence_length.py \\\n",
"/tigress/BEE/gtex/external_sources/hg19/gencode.v19.pc_transcripts.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.seq_length.txt\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_sequence_length.py \\\n",
"/tigress/BEE/gtex/external_sources/hg19/gencode.v19.lncRNA_transcripts.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.seq_length.txt\n",
"\n",
"cat /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.seq_length.txt \\\n",
"> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.transcripts.seq_length.txt \n",
"tail -n+2 /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.seq_length.txt \\\n",
">> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.transcripts.seq_length.txt \n",
"\n",
"# Genes\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_sequence_length.py \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.longest.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.seq_length.txt\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_sequence_length.py \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.longest.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.seq_length.txt\n",
"\n",
"cat /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.seq_length.txt \\\n",
"> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.genes.seq_length.txt \n",
"tail -n+2 /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.seq_length.txt \\\n",
">> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.genes.seq_length.txt "
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute GC content"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"# Transcripts\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_GC_content.py \\\n",
"/tigress/BEE/gtex/external_sources/hg19/gencode.v19.pc_transcripts.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.gc_content.txt\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_GC_content.py \\\n",
"/tigress/BEE/gtex/external_sources/hg19/gencode.v19.lncRNA_transcripts.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.gc_content.txt\n",
"\n",
"cat /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.gc_content.txt \\\n",
"> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.transcripts.gc_content.txt \n",
"tail -n+2 /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.gc_content.txt \\\n",
">> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.transcripts.gc_content.txt \n",
"\n",
"# Genes\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_GC_content.py \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.longest.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.gc_content.txt\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_GC_content.py \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.longest.fa \\\n",
"/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.gc_content.txt\n",
"\n",
"cat /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.gc_content.txt \\\n",
"> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.genes.gc_content.txt \n",
"tail -n+2 /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.gc_content.txt \\\n",
">> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.genes.gc_content.txt "
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's set up the jobs for GC-correction. Before that, we need to point the R library to the correction location:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"echo 'export R_LIBS=/tigress/BEE/tools/R_libs' >> ~/.bashrc\n",
"source ~/.bashrc"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: This may throw an error depending on the situation of the R library and the R version. Another alternative that works is to install the required packages manually and point the R library to the user's folder, namely:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# In R:\n",
"\n",
"source(\"http://bioconductor.org/biocLite.R\")\n",
"biocLite(\"EDASeq\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"mkdir /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/\n",
"mkdir /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/\n",
"mkdir /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_with_genotypes/\n",
"\n",
"python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC_wrapper.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes/ _certain_biotypes_TPM.txt \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_with_genotypes/\n",
"\n",
"sh /tigress/BEE/gtex/tmp/low_level_scripts/normalize_only_GC_wrapper.sh"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC_wrapper.py\n",
"##################################################\n",
"# normalize_only_GC.py\n",
"#\n",
"#\n",
"##################################################\n",
"import glob\n",
"from sys import argv\n",
"import os.path\n",
"\n",
"in_path = argv[1]\n",
"in_suffix = argv[2]\n",
"in_path = in_path + '*' + in_suffix\n",
"out_path = argv[3]\n",
"\n",
"# in_path = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes/*_certain_biotypes_TPM.txt'\n",
"# out_path = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_with_genotypes/'\n",
"\n",
"matrices = glob.glob(in_path + '*')\n",
"\n",
"master_script = \"/tigress/BEE/gtex/tmp/low_level_scripts/normalize_only_GC_wrapper.sh\"\n",
"master_handle=open(master_script, 'w')\n",
"master_handle.write(\"#!/bin/bash\\n\\n\")\n",
"\n",
"seq_lengths, gc_content = {}, {}\n",
"\n",
"seq_lengths['pc_genes'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.seq_length.txt'\n",
"seq_lengths['pc_transcripts'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.seq_length.txt'\n",
"seq_lengths['lncRNA_genes'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.seq_length.txt'\n",
"seq_lengths['lncRNA_transcripts'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.seq_length.txt'\n",
"\n",
"gc_content['pc_genes'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.gc_content.txt'\n",
"gc_content['pc_transcripts'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.gc_content.txt'\n",
"gc_content['lncRNA_genes'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.gc_content.txt'\n",
"gc_content['lncRNA_transcripts'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.gc_content.txt'\n",
"\n",
"k = 0\n",
"for i, matrix in enumerate(matrices):\n",
" out_matrix = out_path + matrix.split('/')[-1].replace('_TPM.txt','_TPM_GC.txt')\n",
" print out_matrix\n",
" if os.path.isfile(out_matrix):\n",
" continue\n",
" k+=1\n",
" genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
" sbatchfile = \"/tigress/BEE/gtex/tmp/low_level_scripts/norm_GC_\" + str(i) + \".slurm\"\n",
" sbatchhandle=open(sbatchfile, 'w')\n",
" cmd=r\"\"\"#!/bin/bash\n",
"#SBATCH -J norm_%s # job name\n",
"#SBATCH --mem=12000 # 12 GB requested\n",
"#SBATCH -t 01:00:00 # to be placed in the test queue\n",
"\n",
"/usr/bin/Rscript /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC.R \\\n",
"%s %s %s %s %s %s\n",
"\n",
"\"\"\"%(i, matrix, seq_lengths['pc_'+genes_or_transcripts], seq_lengths['lncRNA_'+genes_or_transcripts], \\\n",
" gc_content['pc_'+genes_or_transcripts], gc_content['lncRNA_'+genes_or_transcripts], out_matrix)\n",
" sbatchhandle.write(cmd)\n",
" sbatchhandle.close()\n",
" master_handle.write(\"sbatch \" + sbatchfile + \" \\n\")\n",
"\n",
"master_handle.close()\n",
"\n",
"print 'sh %s'%(master_script)\n",
"print 'Number of jobs:', k+1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC_wrapper.py\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC.R\n",
"##################################################\n",
"# normalize_only_GC.R\n",
"#\n",
"# Use EDASeq R package to normalize gene expression within-lane by GC-content\n",
"# \n",
"# See:\n",
"# Risso, Davide, et al. \"GC-content normalization for RNA-Seq data.\" BMC bioinformatics 12.1 (2011): 480.\n",
"##################################################\n",
"#!/usr/bin/Rscript\n",
"\n",
"## Collect arguments\n",
"args <-commandArgs(TRUE)\n",
"\n",
"## Default setting when no arguments passed\n",
"if (length(args) != 6 ){\n",
" args <- c(\"--help\")\n",
"}\n",
"## Help section\n",
"if(\"--help\" %in% args) {\n",
" cat(\"\n",
" normalize_only_GC.R\n",
" \n",
" Usage:\n",
" ./normalize_only_GC.R /path/to/expression_matrix.txt /path/to/seq_length_protein_coding.txt \\\n",
" /path/to/seq_length_lncRNA.txt /path/to/GC_content_protein_coding.txt \\\n",
" /path/to/GC_content_lncRNA.txt /path/to/normalized_expression_matrix.txt\\n\")\n",
" q(save=\"no\")\n",
"}\n",
"\n",
"\n",
"library(EDASeq)\n",
"\n",
"expr = args[1]\n",
"seq_length_pc = args[2]\n",
"seq_length_lncRNA = args[3]\n",
"GC_content_pc = args[4]\n",
"GC_content_lncRNA = args[5]\n",
"\n",
"norm_expr_out = args[6]\n",
"\n",
"# expr = \"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_TPM.txt\"\n",
"# seq_length_pc = \"/tigress/BEE/gtex/data/auxiliary/normalization/gencode.v19.pc_genes.seq_length.txt\"\n",
"# seq_length_lncRNA = \"/tigress/BEE/gtex/data/auxiliary/normalization/gencode.v19.lncRNA_genes.seq_length.txt\"\n",
"# GC_content_pc = \"/tigress/BEE/gtex/data/auxiliary/normalization/gencode.v19.pc_genes.gc_content.txt\"\n",
"# GC_content_lncRNA = \"/tigress/BEE/gtex/data/auxiliary/normalization/gencode.v19.lncRNA_genes.gc_content.txt\"\n",
"\n",
"# norm_expr_out = \"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_with_genotypes/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_TPM_GC.txt\"\n",
"\n",
"# import data\n",
"expr = read.table(file=expr, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
"seq_length_pc = read.table(file=seq_length_pc, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
"seq_length_lncRNA = read.table(file=seq_length_lncRNA, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
"GC_content_pc = read.table(file=GC_content_pc, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
"GC_content_lncRNA = read.table(file=GC_content_lncRNA, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
"\n",
"# turn covariates into named vectors to be used by normalization functions\n",
"df_to_named_vector <- function(df, col){\n",
" tmp <- as.numeric(df[,col])\n",
" names(tmp) <- row.names(df)\n",
" return(tmp)\n",
" }\n",
"\n",
"seq_length_pc <- df_to_named_vector(seq_length_pc, 'length')\n",
"seq_length_lncRNA <- df_to_named_vector(seq_length_lncRNA, 'length')\n",
"GC_content_pc <- df_to_named_vector(GC_content_pc, 'gccontent')\n",
"GC_content_lncRNA <- df_to_named_vector(GC_content_lncRNA, 'gccontent')\n",
"\n",
"seqs_lncRNA <- intersect(rownames(expr), names(seq_length_lncRNA))\n",
"seqs_lncRNA <- intersect(seqs_lncRNA, names(GC_content_lncRNA))\n",
"expr_lncRNA <- as.matrix(expr[seqs_lncRNA, , drop=FALSE ])\n",
"\n",
"seqs_pc <- intersect(rownames(expr), names(seq_length_pc))\n",
"seqs_pc <- intersect(seqs_pc, names(GC_content_pc))\n",
"expr_pc <- as.matrix(expr[seqs_pc, , drop=FALSE ])\n",
"\n",
"# reduce expression matrices to only those genes/transcripts that are expressed (non-zero) in\n",
"# at least 10% of samples for this particular tissue\n",
"expr_lncRNA = as.matrix(expr_lncRNA[apply(expr_lncRNA, 1, function(x) length(x[x != 0])/length(x) >= 0.10),])\n",
"expr_pc = as.matrix(expr_pc[apply(expr_pc, 1, function(x) length(x[x != 0])/length(x) >= 0.10),])\n",
" \n",
"# create eSet objects\n",
"data_lncRNA <- newSeqExpressionSet(expr_lncRNA, featureData=AnnotatedDataFrame(data.frame(gc=GC_content_lncRNA[row.names(expr_lncRNA)])), round=FALSE)\n",
"data_pc <- newSeqExpressionSet(expr_pc, featureData=AnnotatedDataFrame(data.frame(gc=GC_content_pc[row.names(expr_pc)])), round=FALSE)\n",
"\n",
"# Perform within-lane normalizations separately for lincRNA and mRNA because we would expect\n",
"# different distributions\n",
"\n",
"# Within-lane normalization for GC-content\n",
"norm_lncRNA <- normCounts(withinLaneNormalization(data_lncRNA, \"gc\", which=\"full\", offset=FALSE, round=FALSE))\n",
"norm_pc <- normCounts(withinLaneNormalization(data_pc, \"gc\", which=\"full\", offset=FALSE, round=FALSE))\n",
"\n",
"# bind the pc and lncRNA dataframes together\n",
"norm <- rbind(norm_pc,norm_lncRNA)\n",
"\n",
"write.table(norm, file=norm_expr_out, quote=FALSE, sep='\\t', row.names=T, col.names=NA)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC.R\n"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##1. Running PEER in a tissue-specific manner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let us set up the directory for our cis mapping pipeline:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/LFA"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have obtained the matrices, we can begin the process of explicitly modeling hidden covaraites through PEER:\n",
"\n",
"http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3398141/\n",
"\n",
"We first need to install PEER:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# In R:\n",
"\n",
"source(\"http://bioconductor.org/biocLite.R\")\n",
"biocLite(\"peer\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###1.1 Genotype PCs\n",
"\n",
"In the pipeline, we will take the top 3 genotype PCs given by the GTEx consortium. The genotype PCs have been calculated by the GTEx consortium:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"cp GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_PostImput_20genotPCs.txt \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# In R, run the following:\n",
"\n",
"genotype_PCs = read.table('GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_PostImput_20genotPCs.txt', sep = '\\t', stringsAsFactors=FALSE, header=TRUE)\n",
"rownames(genotype_PCs) = sapply(c(1:dim(genotype_PCs)[1]), function(x) {paste(strsplit(genotype_PCs$FID[x], split = \"-\")[[1]][1], strsplit(genotype_PCs$FID[x], split = \"-\")[[1]][2], sep = \".\")})\n",
"genotype_PCs_top3 = genotype_PCs[,c(\"C1\",\"C2\",\"C3\")]\n",
"write.table(genotype_PCs_top3, file = \"GTEx_Genotype_PCs_top3.txt\", sep = '\\t')\n",
"\n",
"load(\"subSample1MLFA1.RData\")\n",
"row.names(subSample1MLFA) = sapply(row.names(subSample1MLFA), function(x) {paste(\"GTEX.\", x, sep = \"\")})\n",
"genotype_PCs_top3_LFA = subSample1MLFA[,c(1:3)]\n",
"colnames(genotype_PCs_top3_LFA) = c(\"C1\",\"C2\",\"C3\")\n",
"write.table(genotype_PCs_top3_LFA, file = \"LFA_Genotype_PCs_top3.txt\", sep = \"\\t\")\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###1.2 Alternate - Genotype PCs\n",
"\n",
"We can alternatively take take the top three genotype PCs that have been generated by the known genotypes and using LFA:\n",
"http://arxiv.org/abs/1312.2041 and \n",
"https://github.com/StoreyLab/lfa"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# In R:\n",
"\n",
"install.packages(\"devtools\")\n",
"library(\"devtools\")\n",
"install_github(\"Storeylab/lfa\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/LFA/lfaWrapper.sh\n",
"\n",
"#!/usr/bin/Rscript\n",
"# sbatch -t 24:00:00 -N 1 --ntasks-per-node=1 --mail-type=begin --mail-type=end --mail-user=bj5@princeton.edu --mem=125000 ./lfaWrapper.sh\n",
"library('lfa')\n",
"library('Matrix')\n",
"library('R.utils')\n",
"\n",
"NUM_SUBJECTS = 450\n",
"NUM_CHRS = 22\n",
"\n",
"# Take in subject names\n",
"tempRowNames = read.table('/tigress/BEE/gtex/data/genotype/imputed_genotypes/allelic_dosage/GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_allchr_genot_imput_info04_maf05_HWEp1E6_dbSNPIDs.chr22.allelic_dosage.txt', header = TRUE)\n",
"indNames = colnames(tempRowNames)\n",
"indNames = indNames[2:NUM_SUBJECTS+1]\n",
"remove(tempRowNames)\n",
"indNames = substr(indNames,6,10)\n",
"\n",
"# Tidy the subject IDs\n",
"for(i in seq(1,length(indNames))){\n",
" if(substr(indNames[i],5,5)=='.'){\n",
" indNames[i] = substr(indNames[i],1,4)\n",
" }\n",
"}\n",
"\n",
"numRow = 0\n",
"for(i in seq(1,NUM_CHRS)){\n",
" numRow = numRow + countLines(paste(\"/tigress/BEE/gtex/data/genotype/imputed_genotypes/plink/GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_allchr_genot_imput_info04_maf05_HWEp1E6_dbSNPIDs.chr\",i,\".bim\",sep=''))\n",
"}\n",
"print(numRow)\n",
"genotypeHolder = matrix(nrow=numRow,ncol=NUM_SUBJECTS)\n",
"genoHolderIndex = 1\n",
"for(i in seq(1,NUM_CHRS)){\n",
" dataLength = countLines(paste('/tigress/BEE/gtex/data/genotype/imputed_genotypes/plink/GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_allchr_genot_imput_info04_maf05_HWEp1E6_dbSNPIDs.chr',i,'.bim',sep=\"\"))\n",
" temp = read.bed(paste('/tigress/BEE/gtex/data/genotype/imputed_genotypes/plink/GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_allchr_genot_imput_info04_maf05_HWEp1E6_dbSNPIDs.chr',i,sep=\"\"))\n",
" genotypeHolder[genoHolderIndex:(genoHolderIndex + dataLength - 1),] = temp\n",
" genoHolderIndex = genoHolderIndex + dataLength\n",
" remove(temp)\n",
"}\n",
"print(object.size(genotypeHolder))\n",
"\n",
"# Sample 1M snps to form genotype PCs\n",
"for(i in seq(1,10)){\n",
" indices = sample(1:numRow,1000000,replace = FALSE)\n",
" genotypeSample = genotypeHolder[indices,]\n",
"\n",
" subSample1MLFA = lfa(genotypeSample,d=51)\n",
" rownames(subSample1MLFA) = indNames\n",
" save(subSample1MLFA, file=paste('subSample1MLFA',i,'.RData',sep=''))\n",
"}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/LFA/lfaWrapper.sh\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running the following code results in files subSample1MLFA1.RData through subSample1MLFA10.RData, and all ten iterations correlate well with each other in for the first 3 PCs."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###1.3 Scripts for running PEER on expression matrices\n",
"\n",
"Now let's prepare the scripts that set up tissue-specific PEER factor estimation; we'll start with featurecounts and PEER factors from 10 to 75 in increments of 5."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_wrapper.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/ \\\n",
"_genes_certain_biotypes_counts_normalized.txt \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt \\\n",
"featurecounts /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/PEER/ 10 15 20 25 30 35 40 45 50 55 60 \n",
"\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_10_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_15_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_20_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_25_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_30_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_35_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_40_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_45_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_50_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_55_cis_peer_correction_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_60_cis_peer_correction_chr11_wrapper.sh\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_wrapper.py\n",
"##################################################\n",
"# cis_peer_correction_chr11_wrapper.py\n",
"#\n",
"#\n",
"##################################################\n",
"\n",
"import glob\n",
"from sys import argv\n",
"import os\n",
"\n",
"in_path = argv[1]\n",
"in_suffix = argv[2]\n",
"genotype_pcs = argv[3]\n",
"gene_metadata = argv[4]\n",
"method = argv[5]\n",
"# in_path = in_path + method + '/TPM_with_genotypes/' + '*' + '_' + method + in_suffix\n",
"# out_path = argv[6] + \"to be appended later\"\n",
"joblog_dir = argv[7]\n",
"\n",
"# Array of PEER factors that are given in the input\n",
"peer_factors = argv[8:len(argv)]\n",
"\n",
"# Examples\n",
"# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/'\n",
"# or\n",
"# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/'\n",
"# argv[2] = '_genes_certain_biotypes_TPM.txt'\n",
"# or \n",
"# argv[2] = '_genes_certain_biotypes_counts_normalized.txt'\n",
"# argv[3] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt'\n",
"# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
"# argv[5] = 'featurecounts'\n",
"# argv[6] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
"# argv[7] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/PEER'\n",
"# argv[8:len(argv)] = ['10','20','30','40','50']\n",
"\n",
"# Make the job log directories\n",
"if not os.path.exists(joblog_dir):\n",
" os.makedirs(joblog_dir)\n",
"if not os.path.exists(joblog_dir + 'err'):\n",
" os.makedirs(joblog_dir + 'err')\n",
"if not os.path.exists(joblog_dir + 'out'):\n",
" os.makedirs(joblog_dir + 'out')\n",
"\n",
"matrices = glob.glob(in_path + '*' + in_suffix)\n",
"\n",
"for peer_factor in peer_factors:\n",
" out_path = argv[6] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/'\n",
" if not os.path.exists(out_path):\n",
" os.makedirs(out_path)\n",
"\n",
" master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + \"_\" + peer_factor + \"_cis_peer_correction_chr11_wrapper.sh\"\n",
" master_handle=open(master_script, 'w')\n",
" master_handle.write(\"#!/bin/bash\\n\\n\")\n",
"\n",
" # Examples\n",
" # peer_factor = 10\n",
" # residual_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000/Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM.txt'\n",
" # tissue_name = 'Adipose-Subcutaneous'\n",
" factor_file_directory = out_path + 'PEER_factors'\n",
" if not os.path.exists(factor_file_directory):\n",
" os.makedirs(factor_file_directory)\n",
"\n",
" k = 0\n",
" for i, matrix in enumerate(matrices):\n",
" filename = matrix.split('/')[-1]\n",
" tissue_name = str.split(filename, '_'+method)[0]\n",
" out_matrix = out_path + filename\n",
" # out_matrix = out_path + filename.replace('_TPM.txt','_TPM_GC.txt')\n",
" if os.path.isfile(out_matrix):\n",
" continue\n",
" k+=1\n",
" genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
" outfilename = method+\"_\"+tissue_name+\"_\"+peer_factor\n",
" sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/PEER/\" + method + '_TPM_GC_3GenPC_' + peer_factor + 'PEER1000_' + tissue_name + \".slurm\"\n",
" sbatchhandle=open(sbatchfile, 'w')\n",
" cmd=r\"\"\"#!/bin/bash\n",
"#SBATCH -J PEER_%s_%s # job name\n",
"#SBATCH --mem=6000 # 6 GB requested\n",
"#SBATCH -t 04:00:00 \n",
"#SBATCH -e %s # err output directory\n",
"#SBATCH -o %s # out output directory\n",
"\n",
"/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11.R \\\n",
"%s %s %s %s %s %s %s\n",
"\n",
" \"\"\"%(tissue_name, peer_factor, joblog_dir+'err/'+outfilename+'.err', joblog_dir+'out/'+outfilename+'.out', matrix, genotype_pcs, gene_metadata, peer_factor, out_matrix, tissue_name, factor_file_directory)\n",
" sbatchhandle.write(cmd)\n",
" sbatchhandle.close()\n",
" master_handle.write(\"sbatch \" + sbatchfile + \" \\n\")\n",
"\n",
" master_handle.close()\n",
"\n",
" print 'sh %s'%(master_script)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_wrapper.py\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11.R\n",
"####################################################\n",
"# cis_peer_correction_chr11.R\n",
"# Exploratory: Testing with chr 11.\n",
"#\n",
"####################################################\n",
"\n",
"#!/usr/bin/Rscript\n",
"\n",
"## Collect arguments\n",
"args <-commandArgs(TRUE)\n",
"\n",
"## Default setting when no arguments passed\n",
"if (length(args) != 7 ){\n",
" args <- c(\"--help\")\n",
"}\n",
"## Help section\n",
"if(\"--help\" %in% args) {\n",
" cat(\"\n",
" cis_peer_correction.R\n",
" \n",
" Usage:\n",
" ./cis_peer_correction.R /path/to/expression_matrix.txt /path/to/genotype_pcs.RData \\\n",
" /path/to/gene_metadata/ num_of_peer_factors \\\n",
" /path/to/normalized_expression_matrix.txt tissue_name /path/to/factor_files\\n\")\n",
" q(save=\"no\")\n",
"}\n",
"\n",
"expr_matrix = args[1]\n",
"genotype_pcs = args[2]\n",
"gene_metadata = args[3]\n",
"peer_factors = as.numeric(args[4])\n",
"residual_matrix = args[5]\n",
"tissue_name = args[6]\n",
"factor_file_directory = args[7]\n",
"\n",
"# expr_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_TPM.txt'\n",
"# expr_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_counts_normalized.txt'\n",
"# genotype_pcs = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt'\n",
"# gene_metadata = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
"# peer_factors = '10'\n",
"# residual_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_TPM_GC.txt'\n",
"# tissue_name = 'Brain-Hypothalamus'\n",
"# factor_file_directory = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000/PEER_factors'\n",
"\n",
"library(peer)\n",
"library(dplyr)\n",
"\n",
"# Load in the genotype PCs (top 3 will be used for this pipeline), saved as samSample1MLFA\n",
"genotype_PCs_top3 = read.table(genotype_pcs, stringsAsFactors=FALSE, header=TRUE, sep='\\t')\n",
"gene_metadata = read.table(gene_metadata, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1)\n",
"\n",
"# Load in the expression matrices\n",
"# Also correct for covariates using PEER with K factors and N steps\n",
"\n",
"expr_matrix = read.table(file=expr_matrix, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1)\n",
"# The samples included are already sorted for samples with genotypes.\n",
"\n",
"# Data massaging: left_join is similar to a databaes left join, matching entries in first table to the second\n",
"expr_matrix$gene_id = row.names(expr_matrix)\n",
"gene_metadata$gene_id = row.names(gene_metadata)\n",
"joined_matrix = left_join(expr_matrix, gene_metadata, by=\"gene_id\")\n",
"# Take only chr 11 for now.\n",
"joined_matrix = filter(joined_matrix, chr == 11)\n",
"joined_matrix = arrange(joined_matrix, start)\n",
"row.names(joined_matrix) = joined_matrix$gene_id\n",
"joined_matrix = select(joined_matrix, 1:(dim(joined_matrix)[2]-6))\n",
"\n",
"# Optional: remove genes that are expressed in less than 10% of individuals.\n",
"joined_matrix = joined_matrix[sapply(1:dim(joined_matrix)[1], function(x) {(sum(joined_matrix[x,]==0)/dim(joined_matrix)[2])<0.1}),]\n",
"\n",
"# PEER takes in an inverted matrix as input\n",
"inverted_matrix = as.data.frame(t(data.matrix(joined_matrix)))\n",
"genotype_PCs_top3_df = as.data.frame(genotype_PCs_top3[,1:3])\n",
"inverted_matrix$subject_id = row.names(inverted_matrix)\n",
"genotype_PCs_top3_df$subject_id = row.names(genotype_PCs_top3_df)\n",
"inverted_joined = left_join(inverted_matrix, genotype_PCs_top3_df, by=\"subject_id\")\n",
"subject_list = inverted_joined$subject_id\n",
"\n",
"# Top 3 PCs, can be changed for future tasks\n",
"genotype_top3 = as.matrix(inverted_joined[,(dim(inverted_joined)[2]-2):dim(inverted_joined)[2]])\n",
"inverted_joined = as.matrix(select(inverted_joined, c(1:(dim(inverted_joined)[2]-4))))\n",
"\n",
"# Now set up PEER for detection of hidden covariates\n",
"model = PEER()\n",
"\n",
"PEER_setPhenoMean(model,inverted_joined)\n",
"PEER_setCovariates(model, genotype_top3)\n",
"PEER_setNk(model,peer_factors)\n",
"PEER_update(model)\n",
"\n",
"# Now that PEER has been set up, we can obtain the residuals and save the resulting matrix:\n",
"residuals = PEER_getResiduals(model)\n",
"colnames(residuals) = colnames(inverted_joined)\n",
"rownames(residuals) = subject_list\n",
"\n",
"write.table(as.data.frame(t(data.matrix(residuals))), file=residual_matrix, quote = FALSE, sep = \"\\t\")\n",
"\n",
"# Also save the factors, weights, and alpha values:\n",
"write.table(PEER_getX(model), file=paste(factor_file_directory, '/', tissue_name, '_Factors.txt', sep=\"\"), row.names=FALSE, col.names=FALSE)\n",
"write.table(PEER_getW(model), file=paste(factor_file_directory, '/', tissue_name, '_Loadings.txt', sep=\"\"), row.names=FALSE, col.names=FALSE)\n",
"write.table(PEER_getAlpha(model), file=paste(factor_file_directory, '/', tissue_name, '_Alphas.txt', sep=\"\"), row.names=FALSE, col.names=FALSE)\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11.R\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is that we have the expression matrices that are the residuals of K peer factors and 3 genotype PCs, saved under:\n",
"\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000\n",
"\n",
"as well as the corresponding peer factors, loadings and alphas.\n",
"\n",
"We can also repeat the procedure for htseq and rsem.\n",
"\n",
"\n",
"### 1.4 Checkpoint: Check if the PEER jobs have run successfully, and rerun the missing files\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_rerun_wrapper.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/ \\\n",
"_genes_certain_biotypes_counts_normalized.txt \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt \\\n",
"featurecounts /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/PEER/ 10 15 20 25 30 35 40 45 50 55 60 \n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"10\n",
"20\n",
"30\n",
"Missing: Vagina\n",
"Missing: SmallIntestine-TerminalIleum\n",
"Missing: Cells-EBV-transformedlymphocytes\n",
"40\n",
"Missing: Vagina\n",
"Missing: Ovary\n",
"Missing: SmallIntestine-TerminalIleum\n",
"50\n",
"Missing: Pituitary\n",
"Missing: Brain-Hypothalamus\n",
"Missing: Vagina\n",
"Missing: Prostate\n",
"Missing: Brain-CerebellarHemisphere\n",
"Missing: Ovary\n",
"Missing: Testis\n",
"Missing: Cells-EBV-transformedlymphocytes\n",
"Missing: Brain-Putamen_basalganglia_\n",
"Missing: Kidney-Cortex\n",
"Missing: Colon-Transverse\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_cis_peer_correction_chr11_rerun_wrapper.sh\n",
"Number of jobs: 17\n"
]
}
],
"prompt_number": 32
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_rerun_wrapper.py\n",
"##################################################\n",
"# cis_peer_correction_chr11_rerun_wrapper.py\n",
"# provide jobs that have failed with more time and memory\n",
"#\n",
"##################################################\n",
"\n",
"import glob\n",
"from sys import argv\n",
"import os\n",
"\n",
"in_path = argv[1]\n",
"in_suffix = argv[2]\n",
"genotype_pcs = argv[3]\n",
"gene_metadata = argv[4]\n",
"method = argv[5]\n",
"# in_path = in_path + method + '/TPM_with_genotypes/' + '*' + '_' + method + in_suffix\n",
"# out_path = argv[6] + \"to be appended later\"\n",
"joblog_dir = argv[7]\n",
"\n",
"# Array of PEER factors that are given in the input\n",
"peer_factors = argv[8:len(argv)]\n",
"\n",
"# Examples\n",
"# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/'\n",
"# or\n",
"# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/'\n",
"# argv[2] = '_genes_certain_biotypes_TPM.txt'\n",
"# or \n",
"# argv[2] = '_genes_certain_biotypes_counts_normalized.txt'\n",
"# argv[3] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt'\n",
"# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
"# argv[5] = 'featurecounts'\n",
"# argv[6] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
"# argv[7] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/PEER'\n",
"# argv[8:len(argv)] = ['10','20','30','40','50']\n",
"\n",
"# Create full tissue_list.\n",
"tissue_list = []\n",
"matrices = glob.glob(in_path + '*' + in_suffix)\n",
"for i, matrix in enumerate(matrices):\n",
" filename = matrix.split('/')[-1]\n",
" tissue_name = str.split(filename, '_'+method)[0]\n",
" tissue_list.append(tissue_name)\n",
"\n",
"master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + \"_cis_peer_correction_chr11_rerun_wrapper.sh\"\n",
"master_handle=open(master_script, 'w')\n",
"master_handle.write(\"#!/bin/bash\\n\\n\")\n",
"\n",
"k = 0\n",
"for peer_factor in peer_factors:\n",
" print(peer_factor)\n",
" out_path = argv[6] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/'\n",
"\n",
" # Examples\n",
" # peer_factor = 10\n",
" # residual_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000/Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM.txt'\n",
" # tissue_name = 'Adipose-Subcutaneous'\n",
" factor_file_directory = out_path + 'PEER_factors'\n",
"\n",
" for tissue in tissue_list:\n",
" out_matrix = out_path + tissue + \"_\" + method + in_suffix\n",
" # out_matrix = out_path + tissue + \"_\" + method + in_suffix.replace('_TPM.txt','_TPM_GC.txt')\n",
" if os.path.isfile(out_matrix):\n",
" continue\n",
" print(\"Missing: \" + tissue)\n",
" genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
" outfilename = method+\"_\"+tissue+\"_\"+peer_factor\n",
" sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/PEER/\" + method + '_TPM_GC_3GenPC_' + peer_factor + 'PEER1000_' + tissue + \".slurm\"\n",
" sbatchhandle=open(sbatchfile, 'w')\n",
" cmd=r\"\"\"#!/bin/bash\n",
"#SBATCH -J PEER_%s_%s # job name\n",
"#SBATCH --mem=12000 # 12 GB requested\n",
"#SBATCH -t 08:00:00 \n",
"#SBATCH -e %s # err output directory\n",
"#SBATCH -o %s # out output directory\n",
"\n",
"/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11.R \\\n",
"%s %s %s %s %s %s %s\n",
"\n",
" \"\"\"%(tissue, peer_factor, joblog_dir+'err/'+outfilename+'.err', joblog_dir+'out/'+outfilename+'.out', matrix, genotype_pcs, gene_metadata, peer_factor, out_matrix, tissue, factor_file_directory)\n",
" sbatchhandle.write(cmd)\n",
" sbatchhandle.close()\n",
" master_handle.write(\"sbatch \" + sbatchfile + \" \\n\")\n",
" k+=1\n",
"\n",
"master_handle.close()\n",
"\n",
"print 'sh %s'%(master_script)\n",
"print 'Number of jobs:', k"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_rerun_wrapper.py\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##2. Running MatrixEQTL for optimization\n",
"\n",
"With the expression matrices, and other necessary inputs generated in the accompanying notebook misc_meta_analysis.ipynb, we can go ahead and perform univariate eQTL detection using MatrixEQTL: \n",
"\n",
"http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/\n",
"\n",
"###2.1 Running MatrixEQTL with the output of PEER factors, only for Chromosome 11\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"# Set up the MatrixEQTL env.\n",
"\n",
"cd /tigress/BEE/gtex/analyses/user_specific/cis-mapping\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts\n",
"\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_10PEER1000\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_20PEER1000\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_30PEER1000\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_40PEER1000\n",
"mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_50PEER1000\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stderr",
"text": [
"mkdir: cannot create directory `/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL': File exists\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"\n",
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
"featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 0 \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis 10 15 20 25 30 35 40 45 50 55 60\n",
"\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_10_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_15_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_20_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_25_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_30_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_35_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_40_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_45_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_50_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_55_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_60_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"\n",
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
"featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 1 \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis 10 15 20 25 30 35 40 45 50 55 60\n",
"\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_10_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_15_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_20_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_25_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_30_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_35_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_40_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_45_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_50_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_55_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_60_Imput_cis_matrix_eqtl_chr11_wrapper.sh"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper.py\n",
"##################################################\n",
"# cis_matrix_eqtl_wrapper.py\n",
"# Wrapper for the cis_matrix_eqtl.R file and submitting the batch jobs\n",
"#\n",
"##################################################\n",
"\n",
"import glob\n",
"from sys import argv\n",
"import os.path\n",
"\n",
"in_path = argv[1]\n",
"in_suffix = argv[2]\n",
"method = argv[3]\n",
"mapping_dir = argv[4]\n",
"joblog_dir = argv[6]\n",
"incl_imput = argv[7]\n",
"analysis_dir = argv[8]\n",
"\n",
"# Array of PEER factors that are given in the input\n",
"peer_factors = argv[9:len(argv)]\n",
"\n",
"# Examples\n",
"# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
"# argv[2] = '_genes_certain_biotypes_TPM_GC.txt'\n",
"# argv[3] = 'featurecounts'\n",
"# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping'\n",
"# argv[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/'\n",
"# argv[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/'\n",
"# argv[7] = '0' or '1'\n",
"# argv[8] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis'\n",
"# argv[9:len(argv)] = ['10','20','30','40','50']\n",
"\n",
"# Make the job log directories\n",
"if not os.path.exists(joblog_dir):\n",
" os.makedirs(joblog_dir)\n",
"if not os.path.exists(joblog_dir + 'err'):\n",
" os.makedirs(joblog_dir + 'err')\n",
"if not os.path.exists(joblog_dir + 'out'):\n",
" os.makedirs(joblog_dir + 'out')\n",
"\n",
"if incl_imput == \"0\":\n",
" imput = \"Noimput\"\n",
"else:\n",
" imput = \"Imput\"\n",
"\n",
"for peer_factor in peer_factors:\n",
" print(peer_factor)\n",
"\n",
" in_path = argv[1] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/' + '*' + '_' + method + in_suffix\n",
" out_path = argv[5] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/'\n",
"\n",
" if not os.path.exists(out_path):\n",
" os.makedirs(out_path)\n",
"\n",
" matrices = glob.glob(in_path + '*')\n",
" print(\"Number of matrices: \" + str(len(matrices)))\n",
"\n",
" master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + \"_\" + peer_factor + '_' + imput + \"_cis_matrix_eqtl_chr11_wrapper.sh\"\n",
" master_handle=open(master_script, 'w')\n",
" master_handle.write(\"#!/bin/bash\\n\\n\")\n",
"\n",
" # MatrixEQTL parameters\n",
" incl_trans = '0' # Don't include trans\n",
" chr_number = '11' # chromosome number\n",
" cis_dist = '1e5' # cis window\n",
"\n",
" k = 0\n",
" for i, matrix in enumerate(matrices):\n",
" filename = matrix.split('/')[-1]\n",
" tissue_name = str.split(filename, '_'+method)[0]\n",
" out_file = out_path + filename.replace('.txt','_MatrixEQTL')\n",
" if os.path.isfile(out_file):\n",
" continue\n",
" k+=1\n",
" genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
" sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/MatrixEQTL/\" + method + '_TPM_GC_3GenPC_' + peer_factor + '_MatrixEQTL_chr' + chr_number + '_' + tissue_name + '_' + imput + \".slurm\"\n",
" outfilename = method+\"_\"+tissue_name+\"_\"+peer_factor\n",
" sbatchhandle=open(sbatchfile, 'w')\n",
" cmd=r\"\"\"#!/bin/bash\n",
"#SBATCH -J meqtl_%s_%s # job name\n",
"#SBATCH --mem=6000 # 6 GB requested\n",
"#SBATCH -t 04:00:00 \n",
"#SBATCH -e %s # err output directory\n",
"#SBATCH -o %s # out output directory\n",
"\n",
"/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R \\\n",
"%s %s %s %s %s %s %s %s %s %s\n",
"\n",
"\"\"\"%(tissue_name, peer_factor, joblog_dir+'err/'+outfilename+'_'+imput+'.err', joblog_dir+'out/'+outfilename+'_'+imput+'.out', matrix, incl_trans, incl_imput, chr_number, cis_dist, peer_factor, mapping_dir, out_file, method, analysis_dir)\n",
" sbatchhandle.write(cmd)\n",
" sbatchhandle.close()\n",
" master_handle.write(\"sbatch \" + sbatchfile + \" \\n\")\n",
"\n",
" master_handle.close()\n",
"\n",
" print 'sh %s'%(master_script)\n",
" print 'Number of jobs:', k"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper.py\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R\n",
"####################################################\n",
"# cis_matrix_eqtl.R\n",
"# Performs univariate cis-eQTL analysis using Matrix eQTL\n",
"#\n",
"####################################################\n",
"\n",
"#!/usr/bin/Rscript\n",
"\n",
"# Collect arguments\n",
"args <-commandArgs(TRUE)\n",
"\n",
"## Default setting when no arguments passed\n",
"if (length(args) != 10 ){\n",
" args <- c(\"--help\")\n",
"}\n",
"## Help section\n",
"if(\"--help\" %in% args) {\n",
" cat(\"\n",
" cis_peer_correction.R\n",
" \n",
" Usage:\n",
" ./cis_matrix_eqtl.R /path/to/expression_matrix.txt include_trans_binary \\\n",
" include_imputed_binary chromosome_number cis_disance_def peer_factor \\\n",
" /path/to/mapping/directory /path/to/output/files method /path/to/analysis/dir\\n\")\n",
" q(save=\"no\")\n",
"}\n",
"\n",
"expression_file_location = args[1]\n",
"incl_trans = args[2]\n",
"incl_imput = args[3]\n",
"chr_number = args[4]\n",
"cis_dist = as.numeric(args[5])\n",
"peer_factor = args[6]\n",
"\n",
"# Example\n",
"# args[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_45PEER1000/Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM_GC.txt'\n",
"# args[2] = '0'\n",
"# args[3] = '0'\n",
"# args[4] = '11'\n",
"# args[5] = '1e5'\n",
"# args[6] = '10'\n",
"\n",
"imput = 'Imput'\n",
"if (incl_imput == '0') {imput = 'Noimput'}\n",
"SNP_file_name = paste(args[7], '/genotype/genotypesChr', chr_number, '_', imput, '.txt', sep=\"\")\n",
"covariates_file_name = character()\n",
"SNP_position_file = paste(args[7], '/references/SNP_positions/SNP_positions_Chr', chr_number, '.txt', sep=\"\")\n",
"gene_position_file = paste(args[7], '/references/gene_positions/gene_positions_Chr', chr_number, '.txt', sep=\"\")\n",
"\n",
"output_file_name_cis = paste(args[8], '_Chr', chr_number, '_', imput, '_cis.txt', sep=\"\")\n",
"output_file_name_trans = paste(args[8], '_Chr', chr_number, '_', imput, '_trans.txt', sep=\"\")\n",
"\n",
"method = paste(args[9])\n",
"expression_file_name = strsplit(expression_file_location, split=\"/\")[[1]][length(strsplit(expression_file_location, split=\"/\")[[1]])]\n",
"tissue_name = strsplit(expression_file_name, split = paste('_', method, sep=\"\"))[[1]][1] \n",
"analysis_dir = paste(args[10], \"/\", tissue_name, sep=\"\")\n",
"\n",
"# Example\n",
"# args[7] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping'\n",
"# args[8] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/\n",
"# ... TPM_GC_3GenPC_30PEER1000/Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM_GC_MatrixEQTL'\n",
"# args[9] = 'featurecounts'\n",
"# args[10] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis'\n",
"\n",
"# Need to check that the subjects read in from expression matrices match the subjects in genotype data\n",
"library(dplyr)\n",
"\n",
"expression_matrix = read.csv(file = expression_file_location, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
"genotype_matrix = read.csv(file = SNP_file_name, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
"row.names(genotype_matrix) = genotype_matrix$SNP\n",
"\n",
"genotype_matrix = genotype_matrix[,sapply(colnames(expression_matrix), function(x) {match(x, colnames(genotype_matrix))})]\n",
"\n",
"# Also obtain genes and snps that are in the expression and genotype files, respectively.\n",
"gene_positions = read.table(file = gene_position_file, header = FALSE, sep = '\\t', stringsAsFactors = FALSE)\n",
"snp_positions = read.table(file = SNP_position_file, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
"\n",
"colnames(gene_positions) = c(\"gene_ID\", \"chr_num\", \"start\", \"end\")\n",
"gene_positions = mutate(gene_positions, chr = paste(\"chr\", chr_num, sep = \"\"))\n",
"gene_positions = select(gene_positions, c(gene_ID, chr, start, end))\n",
"# Make sure the list of genes and their ordering are the same\n",
"expression_matrix = expression_matrix[rownames(expression_matrix) %in% gene_positions$gene_ID,]\n",
"gene_positions = gene_positions[sapply(rownames(expression_matrix), function(x) {match(x, gene_positions$gene_ID)}),]\n",
"\n",
"# Import MatrixEQTL\n",
"library(MatrixEQTL)\n",
"\n",
"# Use linear model with p threshold of 0.01\n",
"useModel = modelLINEAR;\n",
"pvOutputThreshold_cis = 1e-2\n",
"pvOutputThreshold_trans = 0\n",
"if (incl_trans == '1') {pvOutputThreshold_trans = 1e-2}\n",
"\n",
"snps = SlicedData$new()\n",
"snps$CreateFromMatrix(as.matrix(genotype_matrix));\n",
"\n",
"gene = SlicedData$new()\n",
"gene$CreateFromMatrix(as.matrix(expression_matrix));\n",
"\n",
"gene_qnorm = gene$copy();\n",
"\n",
"# Quantile-normalize the expression values - ties broken by averaging\n",
"for( sl in 1:length(gene_qnorm) ) {\n",
" mat = gene_qnorm[[sl]];\n",
" mat = t(apply(mat, 1, rank, ties.method = \"average\"));\n",
" mat = qnorm(mat / (ncol(gene_qnorm)+1));\n",
" gene_qnorm[[sl]] = mat;\n",
"}\n",
"rm(sl, mat);\n",
"\n",
"# me = Matrix_eQTL_main(\n",
"# snps = snps,\n",
"# gene = gene,\n",
"# output_file_name = output_file_trans,\n",
"# pvOutputThreshold = pvOutputThreshold,\n",
"# useModel = useModel, \n",
"# verbose = TRUE,\n",
"# pvalue.hist = \"qqplot\",\n",
"# min.pv.by.genesnp = FALSE,\n",
"# noFDRsaveMemory = FALSE);\n",
"\n",
"# Only for generating the P-value hist:\n",
"me = Matrix_eQTL_main(\n",
" snps = snps,\n",
" gene = gene_qnorm,\n",
" output_file_name = output_file_name_trans,\n",
" pvOutputThreshold = pvOutputThreshold_trans,\n",
" useModel = useModel, \n",
" verbose = TRUE,\n",
" output_file_name.cis = output_file_name_cis,\n",
" pvOutputThreshold.cis = pvOutputThreshold_cis,\n",
" snpspos = snp_positions, \n",
" genepos = gene_positions,\n",
" cisDist = cis_dist,\n",
" pvalue.hist = 100,\n",
" min.pv.by.genesnp = FALSE,\n",
" noFDRsaveMemory = FALSE)\n",
"\n",
"# Save the histogram\n",
"output_file_name_hist = paste(analysis_dir, '/', tissue_name, '_', imput, '_Chr', chr_number, '_', peer_factor, '_hist.pdf', sep=\"\")\n",
"print(paste(\"Saving to:\", output_file_name_hist))\n",
"pdf(output_file_name_hist)\n",
"plot(me)\n",
"dev.off()\n",
"\n",
"# repeat for the qqplot:\n",
"me_q = Matrix_eQTL_main(\n",
" snps = snps,\n",
" gene = gene_qnorm,\n",
" output_file_name = output_file_name_trans,\n",
" pvOutputThreshold = pvOutputThreshold_trans,\n",
" useModel = useModel, \n",
" verbose = TRUE,\n",
" output_file_name.cis = output_file_name_cis,\n",
" pvOutputThreshold.cis = pvOutputThreshold_cis,\n",
" snpspos = snp_positions, \n",
" genepos = gene_positions,\n",
" cisDist = cis_dist,\n",
" pvalue.hist = \"qqplot\",\n",
" min.pv.by.genesnp = FALSE,\n",
" noFDRsaveMemory = FALSE)\n",
"\n",
"# Save the qqplot\n",
"output_file_name_qqplot = paste(analysis_dir, '/', tissue_name, '_', imput, '_Chr', chr_number, '_', peer_factor, '_qqplot.pdf', sep=\"\")\n",
"print(paste(\"Saving to:\", output_file_name_hist))\n",
"pdf(output_file_name_qqplot)\n",
"plot(me_q)\n",
"dev.off()\n",
"\n",
"# Save the Matrix EQTL object as RData\n",
"save(gene_qnorm, me, snps, file=paste(args[8], '_Chr', chr_number, '_', imput, '_me.RData', sep=\"\"))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2 Checkpoint: Check if the MatrixEQTL jobs have run successfully, and rerun the missing files\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_rerun_wrapper.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
"featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 0 \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis 10 15 20 25 30 35 40 45 50 55 60\n",
"\n",
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_rerun_wrapper.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
"featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 1 \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis 10 15 20 25 30 35 40 45 50 55 60\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_rerun_wrapper.py\n",
"##################################################\n",
"# cis_matrix_eqtl_rerun_wrapper.py\n",
"# provide jobs that have failed with more time and memory\n",
"#\n",
"##################################################\n",
"\n",
"import glob\n",
"from sys import argv\n",
"import os.path\n",
"\n",
"in_path = argv[1]\n",
"in_suffix = argv[2]\n",
"method = argv[3]\n",
"mapping_dir = argv[4]\n",
"joblog_dir = argv[6]\n",
"incl_imput = argv[7]\n",
"analysis_dir = argv[8]\n",
"\n",
"# Array of PEER factors that are given in the input\n",
"peer_factors = argv[9:len(argv)]\n",
"\n",
"# Examples\n",
"# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
"# argv[2] = '_genes_certain_biotypes_TPM_GC.txt'\n",
"# argv[3] = 'featurecounts'\n",
"# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping'\n",
"# argv[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/'\n",
"# argv[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/'\n",
"# argv[7] = '0' or '1'\n",
"# argv[8] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis'\n",
"# argv[9:len(argv)] = ['10','20','30','40','50']\n",
"\n",
"# Make tissue list:\n",
"tissue_list = []\n",
"# Use 'TPM_GC_3GenPC_10PEER1000' to build tissue list\n",
"in_path = in_path + method + '/TPM_GC_3GenPC_10PEER1000/'\n",
"matrices = glob.glob(in_path + '*' + in_suffix)\n",
"for i, matrix in enumerate(matrices):\n",
" filename = matrix.split('/')[-1]\n",
" tissue_name = str.split(filename, '_'+method)[0]\n",
" tissue_list.append(tissue_name)\n",
"\n",
"if incl_imput == \"0\":\n",
" imput = \"Noimput\"\n",
"else:\n",
" imput = \"Imput\"\n",
"\n",
"master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + '_' + imput + \"_cis_matrix_eqtl_chr11_rerun_wrapper.sh\"\n",
"master_handle=open(master_script, 'w')\n",
"master_handle.write(\"#!/bin/bash\\n\\n\")\n",
"\n",
"k = 0\n",
"for peer_factor in peer_factors:\n",
" print(peer_factor)\n",
" out_path = argv[5] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/'\n",
"\n",
" # MatrixEQTL parameters\n",
" incl_trans = '0' # Don't include trans\n",
" chr_number = '11' # chromosome number\n",
" cis_dist = '1e5' # cis window\n",
"\n",
" for tissue in tissue_list:\n",
" out_file = out_path + tissue + \"_\" + method + in_suffix.replace('.txt','_MatrixEQTL')\n",
" check_file = out_path + tissue + \"_\" + method + in_suffix.replace('.txt','_MatrixEQTL_Chr') + chr_number + '_' + imput + '_me.RData'\n",
" if os.path.isfile(check_file):\n",
" continue\n",
" k+=1\n",
" print \"Missing: \" + check_file\n",
" genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
" sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/MatrixEQTL/\" + method + '_TPM_GC_3GenPC_' + peer_factor + '_MatrixEQTL_chr' + chr_number + '_' + tissue + \"_\" + imput + \".slurm\"\n",
" outfilename = method+\"_\"+tissue+\"_\"+peer_factor\n",
" sbatchhandle=open(sbatchfile, 'w')\n",
" cmd=r\"\"\"#!/bin/bash\n",
"#SBATCH -J meqtl_%s_%s # job name\n",
"#SBATCH --mem=12000 # 12 GB requested\n",
"#SBATCH -t 08:00:00 # to be placed in the test queue\n",
"#SBATCH -e %s # err output directory\n",
"#SBATCH -o %s # out output directory\n",
"\n",
"/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R \\\n",
"%s %s %s %s %s %s %s %s %s %s\n",
"\n",
"\"\"\"%(tissue, peer_factor, joblog_dir+'err/'+outfilename+'_'+imput+'.err', joblog_dir+'out/'+outfilename+'_'+imput+'.out', matrix, incl_trans, incl_imput, chr_number, cis_dist, peer_factor, mapping_dir, out_file, method, analysis_dir)\n",
" sbatchhandle.write(cmd)\n",
" sbatchhandle.close()\n",
" master_handle.write(\"sbatch \" + sbatchfile + \" \\n\")\n",
"\n",
"master_handle.close()\n",
"\n",
"print 'sh %s'%(master_script)\n",
"print 'Number of jobs:', k"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_rerun_wrapper.py\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3 Number of cis-eQTLs as as a function of the number of PEER factors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"MatrixEQTL outputs the following metric: effect size, p-value, FDR, t-statistic, as well as the list of significant SNP-gene pairs.\n",
"\n",
"Since we have a lot of statistics, let's try to construct plots that will hopefully help with the analyses. But we need to create the necessary directories first:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/create_dirs.py\n",
"\n",
"##################################################\n",
"# create_dirs.py\n",
"# create necessary directories\n",
"#\n",
"##################################################\n",
"\n",
"import glob\n",
"from sys import argv\n",
"import os.path\n",
"\n",
"in_path = argv[1]\n",
"in_suffix = argv[2]\n",
"method = argv[3]\n",
"out_path = argv[4] + method + '/'\n",
"\n",
"in_path = in_path + method + '/TPM_GC_3GenPC_10PEER1000/' + '*' + '_' + method + in_suffix\n",
"\n",
"# in_path = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/'\n",
"# in_suffix = '_genes_certain_biotypes_TPM.txt'\n",
"# method = 'featurecounts'\n",
"# out_path = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/\"\n",
"\n",
"directory = out_path + 'analysis'\n",
"if not os.path.exists(directory):\n",
" os.makedirs(directory)\n",
"\n",
"matrices = glob.glob(in_path + '*')\n",
"print(len(matrices))\n",
"\n",
"for i, matrix in enumerate(matrices):\n",
" filename = matrix.split('/')[-1]\n",
" tissue_name = str.split(filename, '_'+method)[0]\n",
" \n",
" if not os.path.exists(directory + '/' + tissue_name):\n",
" os.makedirs(directory + '/' + tissue_name)\n",
" if not os.path.exists(directory + '/' + tissue_name + '/Examples/'):\n",
" os.makedirs(directory + '/' + tissue_name + '/Examples')\n",
" "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/create_dirs.py\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash \n",
"\n",
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/create_dirs.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
"featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's actually create the plots:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# In R:\n",
"\n",
"install.packages(\"ggplot2\")"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_post_wrapper.py \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ featurecounts 1e5 \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/SNP_positions/SNP_positions_Chr11.txt \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_number_of_samples.txt 100"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_cis_matrix_eqtl_post_wrapper.sh\n"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_post_wrapper.py\n",
"##################################################\n",
"# cis_matrix_eqtl_post_wrapper.py\n",
"# \n",
"#\n",
"##################################################\n",
"\n",
"import glob\n",
"from sys import argv\n",
"import os.path\n",
"\n",
"in_path = argv[1]\n",
"method = argv[2]\n",
"cis_cutoff = argv[3]\n",
"gene_name_file = argv[4]\n",
"snp_positions_file = argv[5]\n",
"tissue_list_file = argv[6]\n",
"num_indiv_cutoff = int(argv[7])\n",
"\n",
"# Examples\n",
"# argv[1] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/'\n",
"# argv[2] = 'featurecounts'\n",
"# argv[3] = '1e5'\n",
"# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
"# argv[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/SNP_positions/SNP_positions_Chr11.txt'\n",
"# argv[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_number_of_samples.txt'\n",
"# argv[7] = '100'\n",
"\n",
"master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + \"_cis_matrix_eqtl_post_wrapper.sh\"\n",
"master_handle=open(master_script, 'w')\n",
"master_handle.write(\"#!/bin/bash\\n\\n\")\n",
"\n",
"# Make tissue list:\n",
"f = open(tissue_list_file)\n",
"\n",
"for line in f.readlines():\n",
" entry = str.split(line.strip(),'\\t')\n",
" # If the number of individuals is greater than the cutoff:\n",
" if int(entry[1]) >= num_indiv_cutoff:\n",
" tissue = entry[0]\n",
" sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/\" + method + '_' + tissue + \"_cis_matrix_eqtl_post.slurm\"\n",
" sbatchhandle=open(sbatchfile, 'w')\n",
" cmd=r\"\"\"#!/bin/bash\n",
"#SBATCH -J %s_Matrix_post # job name\n",
"#SBATCH --mem=4000 # 4 GB requested\n",
"#SBATCH -e /tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/err/Matrix_%s_post.err # err output directory\n",
"#SBATCH -o /tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/out/Matrix_%s_post.out # out output directory\n",
"#SBATCH -t 5:00:00 \n",
"\n",
"/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL.R \\\n",
"%s %s %s %s \\\n",
"%s \\\n",
"%s \\\n",
"10 15 20 25 30 35 40 45 50 55 60\n",
"\"\"\"%(tissue, tissue, tissue, in_path, method, cis_cutoff, tissue, gene_name_file, snp_positions_file)\n",
" sbatchhandle.write(cmd)\n",
" sbatchhandle.close()\n",
" master_handle.write(\"sbatch \" + sbatchfile + \" \\n\")\n",
"\n",
"master_handle.close()\n",
"print 'sh %s'%(master_script)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_post_wrapper.py\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL.R \n",
"####################################################\n",
"# post_analyses_MatrixEQTL.R \n",
"# Iterates through K PEER factors for every tissue and produces interesting graphs.\n",
"#\n",
"####################################################\n",
"\n",
"#!/usr/bin/Rscript\n",
"\n",
"# Collect arguments\n",
"args <-commandArgs(TRUE)\n",
"\n",
"library(dplyr)\n",
"library(MatrixEQTL)\n",
"library(ggplot2)\n",
"\n",
"# Load in data\n",
"# load(\"Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM_GC_MatrixEQTL_Chr11_Imput_me.RData\")\n",
"# file_name = Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM_GC_MatrixEQTL_Chr11_Imput_me.RData\n",
"\n",
"in_dir = args[1]\n",
"method = args[2]\n",
"cis_dist = as.numeric(args[3])\n",
"in_dir = paste(in_dir, method, \"/\", sep='')\n",
"tissue = args[4]\n",
"gene_name_file = args[5]\n",
"snp_position_file = args[6]\n",
"peer_factors = args[7:length(args)]\n",
"\n",
"# args[1] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/'\n",
"# args[2] = 'featurecounts'\n",
"# args[3] = '1e5'\n",
"# args[4] = 'WholeBlood'\n",
"# args[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
"# args[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/SNP_positions/SNP_positions_Chr11.txt'\n",
"# args[7] = c(10,15,20,25,30,35,40,45,50,55,60)\n",
"\n",
"# Save the numer of significant pairs and unique genes detected at each level.\n",
"Counts_by_tissue = data.frame(tissue = character(0), control = character(0), imputed = character(0), peer_K = numeric(0), pairs = numeric(0), genes = numeric(0), stringsAsFactors = FALSE)\n",
"#Genes_snps_by_tissue = data.frame(tissue = character(0), imputed = character(0), gene_list = character(0), snp_list = character(0))\n",
"\n",
"gene_names = read.csv(gene_name_file, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
"row.names(gene_names) = gene_names$gene_id\n",
"print(dim(gene_names))\n",
"snp_positions = read.csv(snp_position_file, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
"row.names(snp_positions) = snp_positions$rsID\n",
"print(dim(snp_positions))\n",
"print(tissue)\n",
"\n",
"for (peer in peer_factors) {\n",
" print(peer)\n",
"\n",
" # First iterate through nonimputed:\n",
" result_path = paste(in_dir, \"TPM_GC_3GenPC_\", peer, \"PEER1000/\", sep='')\n",
" analysis_path = paste(in_dir, \"analysis/\", tissue, '/', sep='')\n",
" # Match a regular expression pattern:\n",
" result_file <- dir(result_path, pattern = paste(tissue,\".*\\\\Noimput_me.RData\",sep=''))\n",
"\n",
" # The result file of MatrixEQTL exists but the summary file (output of this) doesn't\n",
" if (length(result_file) > 0) {\n",
"\n",
" load(paste(result_path, result_file[1], sep=''))\n",
"\n",
" # Plot top 5 eQTL - unique gene pairs, if they exist.\n",
" temp = data.frame(me$cis$eqtls)\n",
" temp$snps = as.character(temp$snps)\n",
" temp$gene = as.character(temp$gene)\n",
" \n",
" top_genes = head(unique(temp$gene), n=5)\n",
" # Only proceed if there are at least 5 genes.\n",
" if (length(top_genes) == 5) {\n",
"\n",
" temp$TSS = gene_names[temp$gene,][,\"start\"]\n",
" temp$TES = gene_names[temp$gene,][,\"end\"]\n",
" temp$gene_name = gene_names[temp$gene,][,\"gene_name\"]\n",
" temp$SNP_pos = snp_positions[temp$snps,][,\"pos\"]\n",
" # Difference mapping: -1e5 to 0 if before TSS\n",
" # scale from 1 to 1e5 if in gene\n",
" # 1e5 to 2e5 if after TES\n",
" temp$diff = sapply(c(1:dim(temp)[1]), function(x) {\n",
" if ((temp$SNP_pos[x] - temp$TSS[x])<0) {\n",
" temp$SNP_pos[x] - temp$TSS[x]\n",
" } else if ((temp$SNP_pos[x] - temp$TES[x])>0) {\n",
" temp$SNP_pos[x] - temp$TES[x] + cis_dist\n",
" } else {\n",
" cis_dist * (temp$SNP_pos[x] - temp$TSS[x]) / (temp$TES[x] - temp$TSS[x])\n",
" }})\n",
"\n",
" tryCatch({\n",
" g = ggplot(temp[temp$FDR < 0.01,], aes(x = diff)) + geom_histogram(binwidth = 10000) + ggtitle('Distribution of distances to TSS') + xlab('Distance') + scale_x_discrete(breaks=c(0, 1e5), labels=c(\"TSS\", \"TES\"))\n",
" print(paste(\"Trying to save: \", analysis_path, tissue, \"_Noimput_\", peer, \"_Distance_distribution.pdf\", sep=''))\n",
" ggsave(filename = paste(analysis_path, tissue, \"_Noimput_\", peer, \"_Distance_distribution.pdf\", sep=''), plot=g, width = 7, height = 7)\n",
" }, error = function(e) {\n",
" print(paste(\"Could not save: \", analysis_path, tissue, \"_Noimput_\", peer, \"_Distance_distribution.pdf\", sep=''))\n",
" })\n",
"\n",
" for (j in 1:5) {\n",
" index = which(temp$gene == top_genes[j])[1]\n",
" test_snp = snps$FindRow(temp[index,\"snps\"])\n",
" test_gene_q = gene_qnorm$FindRow(temp[index,\"gene\"])\n",
"\n",
" test.df = data.frame(snp = as.numeric(test_snp$row), gene = as.numeric(test_gene_q$row))\n",
" row.names(test.df) = colnames(snps)\n",
" test.df = test.df[!is.na(test.df$snp),]\n",
" test.df$snp_chr = as.character(test.df$snp)\n",
" test_name = as.character(temp[index,\"gene_name\"])\n",
" test_rsID = as.character(temp[index,\"snps\"])\n",
"\n",
" tryCatch({\n",
" g = ggplot(test.df, aes(x = snp_chr, y = gene, color=snp_chr)) + geom_violin() + geom_jitter() + ggtitle(paste(test_name, test_rsID, sep='_')) + xlab(test_rsID) + ylab(test_name)\n",
" print(paste(\"Trying to save: \", analysis_path, \"Examples/\", tissue, \"_Noimput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''))\n",
" ggsave(filename = paste(analysis_path, \"Examples/\", tissue, \"_Noimput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''), plot = g, width = 7, height = 7)\n",
" }, error = function(e) {\n",
" print(paste(\"Could not save: \", analysis_path, \"Examples/\", tissue, \"_Noimput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''))\n",
" })\n",
" }\n",
" }\n",
" \n",
" # Take all\n",
" n_pairs = dim(me$cis$eqtls)[1]\n",
" n_genes = length(unique(me$cis$eqtls$gene))\n",
" Counts_by_tissue = rbind(Counts_by_tissue, data.frame(tissue = tissue, control = \"pval\", imputed = \"Noimput\", peer_K = peer, pairs = n_pairs, genes = n_genes, stringsAsFactors = FALSE))\n",
"\n",
" # Take FDR-controlled at 5%\n",
" n_pairs = dim(filter(me$cis$eqtls, FDR <= 0.01))[1]\n",
" n_genes = length(unique(filter(me$cis$eqtls, FDR <= 0.01)$gene))\n",
" Counts_by_tissue = rbind(Counts_by_tissue, data.frame(tissue = tissue, control = \"FDR\", imputed = \"Noimput\", peer_K = peer, pairs = n_pairs, genes = n_genes, stringsAsFactors = FALSE))\n",
"}\n",
"}\n",
"\n",
"if (dim(Counts_by_tissue)[1] != 0) {\n",
" print(paste(\"Save: \", analysis_path, tissue, \"_Summary_Noimput.RData\", sep=\"\"))\n",
" save(Counts_by_tissue, file=paste(analysis_path, tissue, \"_Summary_Noimput.RData\", sep=\"\"))\n",
"}\n",
"\n",
"Counts_by_tissue = data.frame(tissue = character(0), control = character(0), imputed = character(0), peer_K = numeric(0), pairs = numeric(0), genes = numeric(0), stringsAsFactors = FALSE)\n",
"\n",
"for (peer in peer_factors) {\n",
" print(peer)\n",
"\n",
" # Now iterate through imputed:\n",
" # First iterate through nonimputed:\n",
" result_path = paste(in_dir, \"TPM_GC_3GenPC_\", peer, \"PEER1000/\", sep='')\n",
" analysis_path = paste(in_dir, \"analysis/\", tissue, '/', sep='')\n",
" # Match a regular expression pattern:\n",
" result_file <- dir(result_path, pattern = paste(tissue,\".*\\\\Imput_me.RData\",sep=''))\n",
"\n",
" if (length(result_file) > 0) {\n",
"\n",
" load(paste(result_path, result_file[1], sep=''))\n",
"\n",
" temp = data.frame(me$cis$eqtls)\n",
" temp$snps = as.character(temp$snps)\n",
" temp$gene = as.character(temp$gene)\n",
" \n",
" top_genes = head(unique(temp$gene), n=5)\n",
" # Only proceed if there is at least 5 genes.\n",
" if (length(top_genes) == 5) {\n",
"\n",
" temp$TSS = gene_names[temp$gene,][,\"start\"]\n",
" temp$TES = gene_names[temp$gene,][,\"end\"]\n",
" temp$gene_name = gene_names[temp$gene,][,\"gene_name\"]\n",
" temp$SNP_pos = snp_positions[temp$snps,][,\"pos\"]\n",
" # Difference mapping: -1e5 to 0 if before TSS\n",
" # scale from 1 to 1e5 if in gene\n",
" # 1e5 to 2e5 if after TES\n",
" temp$diff = sapply(c(1:dim(temp)[1]), function(x) {\n",
" if ((temp$SNP_pos[x] - temp$TSS[x])<0) {\n",
" temp$SNP_pos[x] - temp$TSS[x]\n",
" } else if ((temp$SNP_pos[x] - temp$TES[x])>0) {\n",
" temp$SNP_pos[x] - temp$TES[x] + cis_dist\n",
" } else {\n",
" cis_dist * (temp$SNP_pos[x] - temp$TSS[x]) / (temp$TES[x] - temp$TSS[x])\n",
" }})\n",
"\n",
" tryCatch({\n",
" g = ggplot(temp[temp$FDR < 0.01,], aes(x = diff)) + geom_histogram(binwidth = 10000) + ggtitle('Distribution of distances to TSS') + xlab('Distance') + scale_x_discrete(breaks=c(0, 1e5), labels=c(\"TSS\", \"TES\"))\n",
" print(paste(\"Trying to save: \", analysis_path, tissue, \"_Imput_\", peer, \"_Distance_distribution.pdf\", sep=''))\n",
" ggsave(filename = paste(analysis_path, tissue, \"_Imput_\", peer, \"_Distance_distribution.pdf\", sep=''), plot=g, width = 7, height = 7)\n",
" }, error = function(e) {\n",
" print(paste(\"Could not save: \", analysis_path, tissue, \"_Imput_\", peer, \"_Distance_distribution.pdf\", sep=''))\n",
" })\n",
"\n",
" for (j in 1:5) {\n",
" index = which(temp$gene == top_genes[j])[1]\n",
" test_snp = snps$FindRow(temp[index,\"snps\"])\n",
" test_gene_q = gene_qnorm$FindRow(temp[index,\"gene\"])\n",
"\n",
" test.df = data.frame(snp = as.numeric(test_snp$row), gene = as.numeric(test_gene_q$row))\n",
" row.names(test.df) = colnames(snps)\n",
" test.df = test.df[!is.na(test.df$snp),]\n",
" test_name = as.character(temp[index,\"gene_name\"])\n",
" test_rsID = as.character(temp[index,\"snps\"])\n",
"\n",
" tryCatch({\n",
" g = ggplot(test.df, aes(x = snp, y = gene, color=snp)) + geom_jitter() + ggtitle(paste(test_name, test_rsID, sep='_')) + xlab(test_rsID) + ylab(test_name)\n",
" print(paste(\"Trying to save: \", analysis_path, \"Examples/\", tissue, \"_Imput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''))\n",
" ggsave(filename = paste(analysis_path, \"Examples/\", tissue, \"_Imput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''), plot = g, width = 7, height = 7)\n",
" }, error = function(e) {\n",
" print(paste(\"Could not save: \", analysis_path, \"Examples/\", tissue, \"_Imput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''))\n",
" })\n",
" }\n",
" }\n",
"\n",
" # Take all\n",
" n_pairs = dim(me$cis$eqtls)[1]\n",
" n_genes = length(unique(me$cis$eqtls$gene))\n",
" Counts_by_tissue = rbind(Counts_by_tissue, data.frame(tissue = tissue, control = \"pval\", imputed = \"Imput\", peer_K = peer, pairs = n_pairs, genes = n_genes, stringsAsFactors = FALSE))\n",
"\n",
" # Take FDR-controlled at 1%\n",
" n_pairs = dim(filter(me$cis$eqtls, FDR <= 0.01))[1]\n",
" n_genes = length(unique(filter(me$cis$eqtls, FDR <= 0.01)$gene))\n",
" Counts_by_tissue = rbind(Counts_by_tissue, data.frame(tissue = tissue, control = \"FDR\", imputed = \"Imput\", peer_K = peer, pairs = n_pairs, genes = n_genes, stringsAsFactors = FALSE))\n",
"\n",
"}\n",
"}\n",
"\n",
"if (dim(Counts_by_tissue)[1] != 0) {\n",
" print(paste(\"Save: \", analysis_path, tissue, \"_Summary_Imput.RData\", sep=\"\"))\n",
" save(Counts_by_tissue, file=paste(analysis_path, tissue, \"_Summary_Imput.RData\", sep=\"\"))\n",
"}"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL.R\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TODO: Upload the plots of pair sas function PEER factors in this section"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4 Tissue-specific optimization for PEER factors\n",
"\n",
"For the actual cis-eQTL detection, we will use the optimal number of PEER factors for each tissue, and feed the matrix into eQTLBMA."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL_aggregate.R \n",
"\n",
"####################################################\n",
"# post_analyses_MatrixEQTL_aggregate.R \n",
"# Aggregates all results and optimizes for K PEER factors in a tissue-specific manner.\n",
"#\n",
"####################################################\n",
"\n",
"#!/usr/bin/Rscript\n",
"\n",
"library(ggplot2)\n",
"library(dplyr)\n",
"\n",
"analysis_path = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis/'\n",
"tissue_list_file = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_number_of_samples.txt'\n",
"num_indiv_cutoff = 100\n",
"\n",
"tissue_list = read.table(tissue_list_file, header=FALSE, sep=\"\\t\", stringsAsFactors=FALSE)\n",
"colnames(tissue_list) = c(\"Tissue\", \"Count\")\n",
"rownames(tissue_list) = tissue_list[\"Tissue\"][[1]]\n",
"tissue_list = tissue_list[tissue_list[\"Count\"]>=num_indiv_cutoff,]\n",
"\n",
"Counts_by_tissue_aggregate = data.frame(tissue = character(0), control = character(0), imputed = character(0), peer_K = numeric(0), pairs = numeric(0), genes = numeric(0), stringsAsFactors = FALSE)\n",
"\n",
"for (i in c(1:dim(tissue_list)[1])) {\n",
" tissue = tissue_list[\"Tissue\"][[1]][i]\n",
" load(paste(analysis_path, tissue, \"/\", tissue, \"_Summary_Imput.RData\", sep=\"\"))\n",
" Counts_by_tissue_aggregate = rbind(Counts_by_tissue_aggregate, Counts_by_tissue)\n",
" load(paste(analysis_path, tissue, \"/\", tissue, \"_Summary_Noimput.RData\", sep=\"\"))\n",
" Counts_by_tissue_aggregate = rbind(Counts_by_tissue_aggregate, Counts_by_tissue)\n",
"}\n",
"\n",
"for (i in c(1:dim(tissue_list)[1])) {\n",
" tissue = tissue_list[\"Tissue\"][[1]][i]\n",
" print(tissue)\n",
" print(sum(Counts_by_tissue_aggregate[\"tissue\"] == tissue))\n",
"}\n",
"\n",
"# We have the data frame with all counts.\n",
"save(Counts_by_tissue_aggregate, file=paste(analysis_path, \"All_Counts.RData\", sep=\"\"))\n",
"\n",
"# Save the plots of number for all tissues:\n",
"g = ggplot(data = filter(Counts_by_tissue_aggregate, control == \"pval\"), aes(x = peer_K, y = pairs, color = tissue)) + geom_line(aes(group = tissue)) + facet_grid(. ~ imputed) + theme(legend.position = \"none\")\n",
"ggsave(filename = paste(analysis_path, \"All_tissues_pval_pairs.pdf\", sep=''), plot = g, width = 10, height = 10)\n",
"\n",
"g = ggplot(data = filter(Counts_by_tissue_aggregate, control == \"FDR\"), aes(x = peer_K, y = pairs, color = tissue)) + geom_line(aes(group = tissue)) + facet_grid(. ~ imputed) + theme(legend.position = \"none\")\n",
"ggsave(filename = paste(analysis_path, \"All_tissues_FDR_pairs.pdf\", sep=''), plot = g, width = 10, height = 10)\n",
"\n",
"g = ggplot(data = filter(Counts_by_tissue_aggregate, control == \"pval\"), aes(x = peer_K, y = genes, color = tissue)) + geom_line(aes(group = tissue)) + facet_grid(. ~ imputed) + theme(legend.position = \"none\")\n",
"ggsave(filename = paste(analysis_path, \"All_tissues_pval_genes.pdf\", sep=''), plot = g, width = 10, height = 10)\n",
"\n",
"g = ggplot(data = filter(Counts_by_tissue_aggregate, control == \"FDR\"), aes(x = peer_K, y = genes, color = tissue)) + geom_line(aes(group = tissue)) + facet_grid(. ~ imputed) + theme(legend.position = \"none\")\n",
"ggsave(filename = paste(analysis_path, \"All_tissues_FDR_genes.pdf\", sep=''), plot = g, width = 10, height = 10)\n",
"\n",
"\n",
"# Optimize on the number of K PEER factors to use for each tissue:\n",
"# Criteria 1: when it decreases\n",
"peer_K_to_iterate = c(15, 20, 25, 30, 35, 40, 45, 50, 55, 60)\n",
"tissue_list[\"optimal_dec\"] = 10\n",
"\n",
"for (i in c(1:dim(tissue_list)[1])) {\n",
" tissue_name = tissue_list[\"Tissue\"][[1]][i]\n",
" current_opt = 10\n",
" current_val = filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\", peer_K == 10)[\"pairs\"][[1]][1]\n",
" for (j in peer_K_to_iterate) {\n",
" next_val = filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\", peer_K == j)[\"pairs\"][[1]][1]\n",
" if (next_val <= current_val) {\n",
" tissue_list[tissue_name, \"optimal_dec\"] = current_opt\n",
" break\n",
" }\n",
" current_opt = j\n",
" current_val = next_val\n",
" }\n",
" if (current_opt == peer_K_to_iterate[length(peer_K_to_iterate)]) {\n",
" tissue_list[tissue_name, \"optimal_dec\"] = current_opt\n",
" }\n",
"}\n",
"\n",
"# Another Criteria: when it reaches 5% of maximal\n",
"# Tunable alpha, when it reaches N% of maximal\n",
"alpha = 0.05\n",
"tissue_list[\"optimal_95\"] = 10\n",
"\n",
"for (i in c(1:dim(tissue_list)[1])) {\n",
" tissue_name = tissue_list[\"Tissue\"][[1]][i]\n",
" current_opt = 10\n",
" thresh = max(filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\")[\"pairs\"]) * (1-alpha)\n",
" for (j in peer_K_to_iterate) {\n",
" next_val = filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\", peer_K == j)[\"pairs\"][[1]][1]\n",
" if (next_val >= thresh) {\n",
" tissue_list[tissue_name, \"optimal_95\"] = current_opt\n",
" break\n",
" }\n",
" current_opt = j\n",
" }\n",
" if (current_opt == peer_K_to_iterate[length(peer_K_to_iterate)]) {\n",
" tissue_list[tissue_name, \"optimal_95\"] = current_opt\n",
" }\n",
"}\n",
"\n",
"# Another Criteria: when it reaches 1% of maximal\n",
"# Tunable alpha, when it reaches N% of maximal\n",
"alpha = 0.01\n",
"tissue_list[\"optimal_99\"] = 10\n",
"\n",
"for (i in c(1:dim(tissue_list)[1])) {\n",
" tissue_name = tissue_list[\"Tissue\"][[1]][i]\n",
" current_opt = 10\n",
" thresh = max(filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\")[\"pairs\"]) * (1-alpha)\n",
" for (j in peer_K_to_iterate) {\n",
" next_val = filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\", peer_K == j)[\"pairs\"][[1]][1]\n",
" if (next_val >= thresh) {\n",
" tissue_list[tissue_name, \"optimal_99\"] = current_opt\n",
" break\n",
" }\n",
" current_opt = j\n",
" }\n",
" if (current_opt == peer_K_to_iterate[length(peer_K_to_iterate)]) {\n",
" tissue_list[tissue_name, \"optimal_99\"] = current_opt\n",
" }\n",
"}\n",
"\n",
"# Save the optimal K for future use:\n",
"\n",
"write.table(tissue_list, file='/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_optimal_PEER_k.txt', row.names=FALSE, quote=FALSE, sep=\"\\t\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL_aggregate.R\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we have the optimal number of PEER factors for each tissue, for tissues that have the number of experiments above 100:\n",
"\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_optimal_PEER_k.txt\n",
"\n",
"We will use the matrices chosen by this metric to perform cis-eQTL detection, both with MatrixEQTL and eQTLBMA."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Workflow summary so far\n",
"\n",
"The following block summarizes what we have so far, and run the pipeline for all autosomal chromosomes (1-22)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## cis-eQTLs with MatrixEQTL\n",
"\n",
"For the actual cis-eQTL detection, we will use the optimal number of PEER factors for each tissue, and feed the matrix into eQTLBMA."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper_all.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
"featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/ \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 1 \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/featurecounts/analysis \\\n",
"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_optimal_PEER_k.txt\n",
"\n",
"python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/create_dirs.py \\\n",
"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
"featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_Imput_cis_matrix_eqtl_wrapper_all.sh\n",
"Number of jobs: 616\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper_all.py\n",
"##################################################\n",
"# cis_matrix_eqtl_wrapper_all.py\n",
"# Wrapper for the cis_matrix_eqtl.R file and submitting the batch jobs\n",
"#\n",
"##################################################\n",
"\n",
"import glob\n",
"from sys import argv\n",
"import os.path\n",
"\n",
"in_path = argv[1]\n",
"in_suffix = argv[2]\n",
"method = argv[3]\n",
"mapping_dir = argv[4]\n",
"joblog_dir = argv[6]\n",
"incl_imput = argv[7]\n",
"analysis_dir = argv[8]\n",
"peer_factors_file = argv[9]\n",
"\n",
"# Examples\n",
"# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
"# argv[2] = '_genes_certain_biotypes_counts_normalized.txt'\n",
"# argv[3] = 'featurecounts'\n",
"# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping'\n",
"# argv[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/'\n",
"# argv[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/'\n",
"# argv[7] = '0' or '1'\n",
"# argv[8] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/featurecounts/analysis'\n",
"# argv[9] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_optimal_PEER_k.txt'\n",
"\n",
"# Make the job log directories\n",
"if not os.path.exists(joblog_dir):\n",
" os.makedirs(joblog_dir)\n",
"if not os.path.exists(joblog_dir + 'err'):\n",
" os.makedirs(joblog_dir + 'err')\n",
"if not os.path.exists(joblog_dir + 'out'):\n",
" os.makedirs(joblog_dir + 'out')\n",
"if not os.path.exists(analysis_dir):\n",
" os.makedirs(analysis_dir)\n",
"\n",
"if incl_imput == \"0\":\n",
" imput = \"Noimput\"\n",
"else:\n",
" imput = \"Imput\"\n",
"\n",
"# Create the array of jobs to run:\n",
"tissue_list = []\n",
"peer_factors = []\n",
"f = open(peer_factors_file)\n",
"# First line is a header:\n",
"f.readline()\n",
"for line in f.readlines():\n",
" entry = str.split(line.strip(),'\\t')\n",
" tissue_list.append(entry[0])\n",
" # We use the optimal_99 critera:\n",
" peer_factors.append(entry[-1])\n",
"\n",
"out_path = argv[5] + method + '/'\n",
"\n",
"if not os.path.exists(out_path):\n",
" os.makedirs(out_path)\n",
"\n",
"master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + '_' + imput + \"_cis_matrix_eqtl_wrapper_all.sh\"\n",
"master_handle=open(master_script, 'w')\n",
"master_handle.write(\"#!/bin/bash\\n\\n\")\n",
"\n",
"# MatrixEQTL parameters\n",
"incl_trans = '0' # Don't include trans\n",
"cis_dist = '1e5' # cis window\n",
"\n",
"k = 0\n",
"for i in range(len(tissue_list)):\n",
" tissue_name = tissue_list[i]\n",
" peer_factor = peer_factors[i]\n",
" filename = tissue_name + '_' + method + in_suffix\n",
" matrix = argv[1] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/' + filename\n",
" out_file = out_path + filename.replace('.txt','_MatrixEQTL')\n",
" # Iterate through 22 chromosomes\n",
" for j in range(22):\n",
" chr_number = str(j+1)\n",
" sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/MatrixEQTL/\" + method + '_TPM_GC_3GenPC_MatrixEQTL_chr' + chr_number + '_' + tissue_name + '_' + imput + \".slurm\"\n",
" job_outfile = method+\"_\"+tissue_name+\"_cis_all\"\n",
" sbatchhandle=open(sbatchfile, 'w')\n",
" cmd=r\"\"\"#!/bin/bash\n",
"#SBATCH -J meqtl_%s # job name\n",
"#SBATCH --mem=8000 # 8 GB requested\n",
"#SBATCH -t 06:00:00 # to be placed in the test queue\n",
"#SBATCH -e %s # err output directory\n",
"#SBATCH -o %s # out output directory\n",
"\n",
"/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R \\\n",
"%s %s %s %s %s %s %s %s %s %s\n",
"\"\"\"%(tissue_name, joblog_dir+'err/'+job_outfile+'_'+imput+'.err', joblog_dir+'out/'+job_outfile+'_'+imput+'.out', matrix, incl_trans, incl_imput, chr_number, cis_dist, peer_factor, mapping_dir, out_file, method, analysis_dir)\n",
" sbatchhandle.write(cmd)\n",
" sbatchhandle.close()\n",
" master_handle.write(\"sbatch \" + sbatchfile + \" \\n\")\n",
" k += 1\n",
"\n",
"master_handle.close()\n",
"\n",
"print 'sh %s'%(master_script)\n",
"print 'Number of jobs:', k"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper_all.py\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Results: cis-eQTLS:\n",
"\n",
"We first look through MatrixEQTL results."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%bash\n",
"\n",
"python /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control_wrapper.py"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control_wrapper.py\n",
"####################################################\n",
"# MatrixEQTL_cis_FDR_control_wrapper.py\n",
"# \n",
"#\n",
"####################################################\n",
"tissue_list_file = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_number_of_samples.txt'\n",
"f = open(tissue_list_file)\n",
"tissue_list = []\n",
"for line in f.readlines():\n",
" tissue_list.append(str.split(line.strip(), '\\t')[0])\n",
"f.close()\n",
"\n",
"original_dir = \"/tigress/bj5/MatrixEQTL/cis/featurecounts/\"\n",
"write_dir = \"/tigress/bj5/cis_meta/MatrixEQTL/\"\n",
"\n",
"master_script = \"/tigress/bj5/cis_meta/scripts/MatrixEQTL_FDR_control.sh\"\n",
"master_handle=open(master_script, 'w')\n",
"master_handle.write(\"#!/bin/bash\\n\\n\")\n",
"\n",
"for item in tissue_list:\n",
" sbatchfile = \"/tigress/bj5/cis_meta/scripts/batch/MatrixEQTL_FDR_control_\" + item + \".slurm\"\n",
" sbatchhandle=open(sbatchfile, 'w')\n",
" cmd=r\"\"\"#!/bin/bash\n",
"#SBATCH -J Mat_%s_FDR # job name\n",
"#SBATCH --mem=4000 # 8 GB requested\n",
"#SBATCH -t 01:00:00 # to be placed in the test queue\n",
"#SBATCH -e %s # err output directory\n",
"#SBATCH -o %s # out output directory\n",
"\n",
"/usr/bin/Rscript /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control.R \\\n",
"%s %s %s\n",
"\"\"\"%(item, write_dir + item + \".err\", write_dir + item + \".out\", original_dir, item, write_dir)\n",
" sbatchhandle.write(cmd)\n",
" sbatchhandle.close()\n",
" master_handle.write(\"sbatch \" + sbatchfile + \" \\n\")\n",
"\n",
"master_handle.close()\n",
"\n",
"print 'sh %s'%(master_script)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control_wrapper.py\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%writefile /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control.R\n",
"####################################################\n",
"# MatrixEQTL_cis_FDR_control.R\n",
"# \n",
"#\n",
"####################################################\n",
"\n",
"#!/usr/bin/Rscript\n",
"\n",
"# Collect arguments\n",
"args <-commandArgs(TRUE)\n",
"\n",
"# FDR Control of p-values from the null distribution\n",
"original_dir = '/tigress/bj5/MatrixEQTL/cis/featurecounts/'\n",
"tissue = 'Adipose-Subcutaneous'\n",
"method = 'featurecounts'\n",
"file_attr = '_genes_certain_biotypes_counts_normalized_MatrixEQTL_Chr'\n",
"imputed = 'Imput'\n",
"suffix = '_me.RData'\n",
"write_dir = '/tigress/bj5/cis_meta/MatrixEQTL/'\n",
"\n",
"original_dir = args[1]\n",
"tissue = args[2]\n",
"write_dir = args[3]\n",
"\n",
"snps_list = character()\n",
"genes_list = character()\n",
"pvals_list = numeric()\n",
"betas_list = numeric()\n",
"statistic_list = numeric()\n",
"total_tests = 0\n",
"total_called_eqtls = 0\n",
"\n",
"for (i in c(1:22)) {\n",
" print(i)\n",
" filename = paste(original_dir, tissue, '_', method, file_attr, i, '_', imputed, suffix, sep='')\n",
" load(filename)\n",
"\n",
" snps_list = append(snps_list, sapply(me$cis$eqtls$snps, as.character))\n",
" genes_list = append(genes_list, sapply(me$cis$eqtls$gene, as.character))\n",
" pvals_list = append(pvals_list, me$cis$eqtls$pvalue)\n",
" betas_list = append(betas_list, me$cis$eqtls$beta)\n",
" statistic_list = append(statistic_list, me$cis$eqtls$statistic)\n",
" total_tests = total_tests + me$cis$ntests\n",
" total_called_eqtls = total_called_eqtls + me$cis$neqtls\n",
"}\n",
"\n",
"ordering = order(pvals_list)\n",
"ordered_pvals_list = pvals_list[ordering]\n",
"ordered_snps_list = snps_list[ordering]\n",
"ordered_genes_list = genes_list[ordering]\n",
"ordered_betas_list = betas_list[ordering]\n",
"ordered_statistic_list = statistic_list[ordering]\n",
"\n",
"# Perform Benjamini-Hochberg\n",
"alpha_hat = sapply(c(1:total_called_eqtls), function(x) {ordered_pvals_list[x] * total_tests / x})\n",
"alpha_hat[(alpha_hat >= 1.0)] = 1.0\n",
"cur_index = 1\n",
"\n",
"alpha = numeric()\n",
"ptm <- proc.time()\n",
"while (cur_index <= length(alpha_hat)) {\n",
" temp = alpha_hat[cur_index:length(alpha_hat)]\n",
" min_index = which.min(temp)\n",
" min_alpha = min(temp)\n",
" alpha = append(alpha, rep(min_alpha, min_index))\n",
" cur_index = cur_index + min_index\n",
"}\n",
"print(\"FDR calculation time: \")\n",
"proc.time() - ptm\n",
"\n",
"num_eqtls = sum(alpha <= 0.05)\n",
"result = data.frame(snp = ordered_snps_list[1:num_eqtls], gene = ordered_genes_list[1:num_eqtls], pval = ordered_pvals_list[1:num_eqtls], beta = ordered_betas_list[1:num_eqtls], statistic = ordered_statistic_list[1:num_eqtls], FDR = alpha[1:num_eqtls])\n",
"write.table(result, file=paste(write_dir, tissue, \"_FDR_5percent.txt\", sep=\"\"), quote = FALSE, sep = \"\\t\", row.names = FALSE)\n",
"\n",
"print(paste(\"Total number of tests:\", total_tests))\n",
"print(paste(\"Total eQTLs called at 5% FDR:\", num_eqtls))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Overwriting /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control.R\n"
]
}
],
"prompt_number": 13
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment