bjo/eQTL_cis.ipynb Secret

## eQTL_cis.ipynb
{
 "metadata": {
  "name": "",
  "signature": "sha256:8adc23cefdc7354083f9e14db98c3c5755e1afd151c4b82baeb79b948ccec95e"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "GTEx eQTL detection: cis- pipeline"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This is the master notebook for the cis-pipeline. There are separate notebooks for:\n",
      "\n",
      "misc. and meta analyses: http://nbviewer.ipython.org/gist/bjo/646278c49a10b6307bd9\n",
      "eQTLBMA part of the pipeline: http://nbviewer.ipython.org/gist/bjo/250135a17a004b014434\n",
      "Collection of relevant plots: http://nbviewer.ipython.org/gist/bjo/633dddf85ad9dc54c07b\n",
      "\n",
      "The notebook is organized in a following fashion:\n",
      "\n",
      "### 0. GC content (Ideally already taken care of in the pre-processing, to be fixed)\n",
      "### 1. Running PEER in a tissue-specific manner\n",
      "### 2. \n",
      "\n",
      "Now that the tissue-level quantification has been finished for GTEx Phase 1, we can move on to the cis-eQTL pipeline. First, we can take the TPM-converted matrices (view Ian's consortium notebook for more details) and control for the GC content, which is one of the standard practices in eQTL studies, and an example is shown in this work:\n",
      "http://www.nature.com/nature/journal/v464/n7289/full/nature08872.html\n",
      "\n",
      "For now, we'll use the 'certain biotypes', and use FeatureCounts TPM data as an example, and we'll only take expression from individuals for which we have known genotypes. The correction of other matrices will be conducted later in the notebook. The TPM-converted matrices are located at:\n",
      "\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes\n",
      "\n",
      "\n",
      "### Note: For the details of the pre-processing pipeline, consult Ian's notebook. Here are some brief remarks that might be of interest:\n",
      "\n",
      "#### 1. Threshold for MAF has been set at 0.05.\n",
      "#### 2. This pipeline only looks at gene-level transcripts for 'certain biotypes', but we can move on to uncertain biotypes and exon-level quantifications later.\n",
      "#### 3. The matrices are initially corrected for gene/transcript length, sequencing depth, and GC content.\n",
      "#### 4. Separate analysis for protein-coding genes and non-coding genes.\n",
      "#### 5. For the purposes of this pipeline, we will define cis- as being within 100Kb of the gene. (100Kb from TSS for eQTLBMA in particular, seems like the TES implementation doesn't work).\n",
      "#### 6. The pipeline will correct for top 3 genotype PCs along with varying number of PEER factors (optimized tissue-specifically) for eQTL detection.\n",
      "#### 7. The matrices are quantified with featurecounts, htseq and RSEM.\n",
      "#### 8. eQTL detection is done with MatrixEQTL, eQTLBMA, and snptest.\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 0. GC content (Ideally already taken care of in the pre-processing, to be fixed)\n",
      "\n",
      "First, we need to calculate GC content for each transcripts, and we'll use the ENCODE v.19 annotations, as they were also used for the quantification. (This portion is very similar to the normalization step from Ian's notebook on RNA-seq mapping)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "cd /tigress/BEE/gtex/external_sources/hg19/\n",
      "wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.pc_transcripts.fa.gz\n",
      "wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.lncRNA_transcripts.fa.gz"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Find the longest transcripts per gene"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "cd /tigress/BEE/gtex/external_sources/hg19/\n",
      "gzip -cd gencode.v19.lncRNA_transcripts.fa.gz\n",
      "gzip -cd gencode.v19.pc_transcripts.fa.gz\n",
      "\n",
      "mkdir /tigress/BEE/gtex/data/auxiliary_bj5\n",
      "mkdir /tigress/BEE/gtex/data/auxiliary_bj5/normalization\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/isolate_longest_sequences_in_transcriptomes.py \\\n",
      "/tigress/BEE/gtex/external_sources/hg19/gencode.v19.pc_transcripts.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.longest.fa\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/isolate_longest_sequences_in_transcriptomes.py \\\n",
      "/tigress/BEE/gtex/external_sources/hg19/gencode.v19.lncRNA_transcripts.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.longest.fa"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Compute sequence lengths"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "# Transcripts\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_sequence_length.py \\\n",
      "/tigress/BEE/gtex/external_sources/hg19/gencode.v19.pc_transcripts.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.seq_length.txt\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_sequence_length.py \\\n",
      "/tigress/BEE/gtex/external_sources/hg19/gencode.v19.lncRNA_transcripts.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.seq_length.txt\n",
      "\n",
      "cat /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.seq_length.txt \\\n",
      "> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.transcripts.seq_length.txt \n",
      "tail -n+2 /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.seq_length.txt \\\n",
      ">> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.transcripts.seq_length.txt \n",
      "\n",
      "# Genes\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_sequence_length.py \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.longest.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.seq_length.txt\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_sequence_length.py \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.longest.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.seq_length.txt\n",
      "\n",
      "cat /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.seq_length.txt \\\n",
      "> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.genes.seq_length.txt \n",
      "tail -n+2 /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.seq_length.txt \\\n",
      ">> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.genes.seq_length.txt "
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Compute GC content"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "# Transcripts\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_GC_content.py \\\n",
      "/tigress/BEE/gtex/external_sources/hg19/gencode.v19.pc_transcripts.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.gc_content.txt\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_GC_content.py \\\n",
      "/tigress/BEE/gtex/external_sources/hg19/gencode.v19.lncRNA_transcripts.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.gc_content.txt\n",
      "\n",
      "cat /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.gc_content.txt \\\n",
      "> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.transcripts.gc_content.txt \n",
      "tail -n+2 /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.gc_content.txt \\\n",
      ">> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.transcripts.gc_content.txt \n",
      "\n",
      "# Genes\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_GC_content.py \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.longest.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.gc_content.txt\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/compute_GC_content.py \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.longest.fa \\\n",
      "/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.gc_content.txt\n",
      "\n",
      "cat /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.gc_content.txt \\\n",
      "> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.genes.gc_content.txt \n",
      "tail -n+2 /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.gc_content.txt \\\n",
      ">> /tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.genes.gc_content.txt "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's set up the jobs for GC-correction. Before that, we need to point the R library to the correction location:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "echo 'export R_LIBS=/tigress/BEE/tools/R_libs' >> ~/.bashrc\n",
      "source ~/.bashrc"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 20
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note: This may throw an error depending on the situation of the R library and the R version. Another alternative that works is to install the required packages manually and point the R library to the user's folder, namely:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# In R:\n",
      "\n",
      "source(\"http://bioconductor.org/biocLite.R\")\n",
      "biocLite(\"EDASeq\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "mkdir /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/\n",
      "mkdir /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/\n",
      "mkdir /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_with_genotypes/\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC_wrapper.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes/ _certain_biotypes_TPM.txt \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_with_genotypes/\n",
      "\n",
      "sh /tigress/BEE/gtex/tmp/low_level_scripts/normalize_only_GC_wrapper.sh"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC_wrapper.py\n",
      "##################################################\n",
      "#  normalize_only_GC.py\n",
      "#\n",
      "#\n",
      "##################################################\n",
      "import glob\n",
      "from sys import argv\n",
      "import os.path\n",
      "\n",
      "in_path = argv[1]\n",
      "in_suffix = argv[2]\n",
      "in_path = in_path + '*' + in_suffix\n",
      "out_path = argv[3]\n",
      "\n",
      "# in_path = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes/*_certain_biotypes_TPM.txt'\n",
      "# out_path = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_with_genotypes/'\n",
      "\n",
      "matrices = glob.glob(in_path + '*')\n",
      "\n",
      "master_script = \"/tigress/BEE/gtex/tmp/low_level_scripts/normalize_only_GC_wrapper.sh\"\n",
      "master_handle=open(master_script, 'w')\n",
      "master_handle.write(\"#!/bin/bash\\n\\n\")\n",
      "\n",
      "seq_lengths, gc_content = {}, {}\n",
      "\n",
      "seq_lengths['pc_genes'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.seq_length.txt'\n",
      "seq_lengths['pc_transcripts'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.seq_length.txt'\n",
      "seq_lengths['lncRNA_genes'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.seq_length.txt'\n",
      "seq_lengths['lncRNA_transcripts'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.seq_length.txt'\n",
      "\n",
      "gc_content['pc_genes'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_genes.gc_content.txt'\n",
      "gc_content['pc_transcripts'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.pc_transcripts.gc_content.txt'\n",
      "gc_content['lncRNA_genes'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_genes.gc_content.txt'\n",
      "gc_content['lncRNA_transcripts'] = '/tigress/BEE/gtex/data/auxiliary_bj5/normalization/gencode.v19.lncRNA_transcripts.gc_content.txt'\n",
      "\n",
      "k = 0\n",
      "for i, matrix in enumerate(matrices):\n",
      "    out_matrix = out_path + matrix.split('/')[-1].replace('_TPM.txt','_TPM_GC.txt')\n",
      "    print out_matrix\n",
      "    if os.path.isfile(out_matrix):\n",
      "        continue\n",
      "    k+=1\n",
      "    genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
      "    sbatchfile = \"/tigress/BEE/gtex/tmp/low_level_scripts/norm_GC_\" + str(i) + \".slurm\"\n",
      "    sbatchhandle=open(sbatchfile, 'w')\n",
      "    cmd=r\"\"\"#!/bin/bash\n",
      "#SBATCH -J norm_%s      # job name\n",
      "#SBATCH --mem=12000           # 12 GB requested\n",
      "#SBATCH -t 01:00:00           # to be placed in the test queue\n",
      "\n",
      "/usr/bin/Rscript /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC.R \\\n",
      "%s %s %s %s %s %s\n",
      "\n",
      "\"\"\"%(i, matrix, seq_lengths['pc_'+genes_or_transcripts], seq_lengths['lncRNA_'+genes_or_transcripts], \\\n",
      "     gc_content['pc_'+genes_or_transcripts], gc_content['lncRNA_'+genes_or_transcripts], out_matrix)\n",
      "    sbatchhandle.write(cmd)\n",
      "    sbatchhandle.close()\n",
      "    master_handle.write(\"sbatch \" + sbatchfile  + \" \\n\")\n",
      "\n",
      "master_handle.close()\n",
      "\n",
      "print 'sh %s'%(master_script)\n",
      "print 'Number of jobs:', k+1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC_wrapper.py\n"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC.R\n",
      "##################################################\n",
      "#  normalize_only_GC.R\n",
      "#\n",
      "#  Use EDASeq R package to normalize gene expression within-lane by GC-content\n",
      "# \n",
      "#  See:\n",
      "#  Risso, Davide, et al. \"GC-content normalization for RNA-Seq data.\" BMC bioinformatics 12.1 (2011): 480.\n",
      "##################################################\n",
      "#!/usr/bin/Rscript\n",
      "\n",
      "## Collect arguments\n",
      "args <-commandArgs(TRUE)\n",
      "\n",
      "## Default setting when no arguments passed\n",
      "if (length(args) != 6 ){\n",
      "  args <- c(\"--help\")\n",
      "}\n",
      "## Help section\n",
      "if(\"--help\" %in% args) {\n",
      "  cat(\"\n",
      "      normalize_only_GC.R\n",
      " \n",
      "      Usage:\n",
      "      ./normalize_only_GC.R /path/to/expression_matrix.txt /path/to/seq_length_protein_coding.txt \\\n",
      "      /path/to/seq_length_lncRNA.txt /path/to/GC_content_protein_coding.txt \\\n",
      "      /path/to/GC_content_lncRNA.txt /path/to/normalized_expression_matrix.txt\\n\")\n",
      "  q(save=\"no\")\n",
      "}\n",
      "\n",
      "\n",
      "library(EDASeq)\n",
      "\n",
      "expr = args[1]\n",
      "seq_length_pc = args[2]\n",
      "seq_length_lncRNA = args[3]\n",
      "GC_content_pc = args[4]\n",
      "GC_content_lncRNA  = args[5]\n",
      "\n",
      "norm_expr_out  = args[6]\n",
      "\n",
      "# expr = \"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_TPM.txt\"\n",
      "# seq_length_pc = \"/tigress/BEE/gtex/data/auxiliary/normalization/gencode.v19.pc_genes.seq_length.txt\"\n",
      "# seq_length_lncRNA = \"/tigress/BEE/gtex/data/auxiliary/normalization/gencode.v19.lncRNA_genes.seq_length.txt\"\n",
      "# GC_content_pc = \"/tigress/BEE/gtex/data/auxiliary/normalization/gencode.v19.pc_genes.gc_content.txt\"\n",
      "# GC_content_lncRNA = \"/tigress/BEE/gtex/data/auxiliary/normalization/gencode.v19.lncRNA_genes.gc_content.txt\"\n",
      "\n",
      "# norm_expr_out = \"/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_with_genotypes/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_TPM_GC.txt\"\n",
      "\n",
      "# import data\n",
      "expr = read.table(file=expr, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
      "seq_length_pc = read.table(file=seq_length_pc, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
      "seq_length_lncRNA = read.table(file=seq_length_lncRNA, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
      "GC_content_pc = read.table(file=GC_content_pc, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
      "GC_content_lncRNA = read.table(file=GC_content_lncRNA, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1) \n",
      "\n",
      "# turn covariates into named vectors to be used by normalization functions\n",
      "df_to_named_vector <- function(df, col){\n",
      "    tmp <- as.numeric(df[,col])\n",
      "    names(tmp) <- row.names(df)\n",
      "    return(tmp)\n",
      "    }\n",
      "\n",
      "seq_length_pc <- df_to_named_vector(seq_length_pc, 'length')\n",
      "seq_length_lncRNA <- df_to_named_vector(seq_length_lncRNA, 'length')\n",
      "GC_content_pc <- df_to_named_vector(GC_content_pc, 'gccontent')\n",
      "GC_content_lncRNA <- df_to_named_vector(GC_content_lncRNA, 'gccontent')\n",
      "\n",
      "seqs_lncRNA <- intersect(rownames(expr), names(seq_length_lncRNA))\n",
      "seqs_lncRNA <- intersect(seqs_lncRNA, names(GC_content_lncRNA))\n",
      "expr_lncRNA <- as.matrix(expr[seqs_lncRNA, , drop=FALSE ])\n",
      "\n",
      "seqs_pc <- intersect(rownames(expr), names(seq_length_pc))\n",
      "seqs_pc <- intersect(seqs_pc, names(GC_content_pc))\n",
      "expr_pc <- as.matrix(expr[seqs_pc, , drop=FALSE ])\n",
      "\n",
      "# reduce expression matrices to only those genes/transcripts that are expressed (non-zero) in\n",
      "# at least 10% of samples for this particular tissue\n",
      "expr_lncRNA = as.matrix(expr_lncRNA[apply(expr_lncRNA, 1, function(x) length(x[x != 0])/length(x) >= 0.10),])\n",
      "expr_pc = as.matrix(expr_pc[apply(expr_pc, 1, function(x) length(x[x != 0])/length(x) >= 0.10),])\n",
      "    \n",
      "# create eSet objects\n",
      "data_lncRNA <- newSeqExpressionSet(expr_lncRNA, featureData=AnnotatedDataFrame(data.frame(gc=GC_content_lncRNA[row.names(expr_lncRNA)])), round=FALSE)\n",
      "data_pc <- newSeqExpressionSet(expr_pc, featureData=AnnotatedDataFrame(data.frame(gc=GC_content_pc[row.names(expr_pc)])), round=FALSE)\n",
      "\n",
      "# Perform within-lane normalizations separately for lincRNA and mRNA because we would expect\n",
      "# different distributions\n",
      "\n",
      "# Within-lane normalization for GC-content\n",
      "norm_lncRNA <- normCounts(withinLaneNormalization(data_lncRNA, \"gc\", which=\"full\", offset=FALSE, round=FALSE))\n",
      "norm_pc <- normCounts(withinLaneNormalization(data_pc, \"gc\", which=\"full\", offset=FALSE, round=FALSE))\n",
      "\n",
      "# bind the pc and lncRNA dataframes together\n",
      "norm <- rbind(norm_pc,norm_lncRNA)\n",
      "\n",
      "write.table(norm, file=norm_expr_out, quote=FALSE, sep='\\t', row.names=T, col.names=NA)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/group_general/top_level_scripts/normalize_only_GC.R\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##1. Running PEER in a tissue-specific manner"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, let us set up the directory for our cis mapping pipeline:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/LFA"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Once we have obtained the matrices, we can begin the process of explicitly modeling hidden covaraites through PEER:\n",
      "\n",
      "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3398141/\n",
      "\n",
      "We first need to install PEER:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# In R:\n",
      "\n",
      "source(\"http://bioconductor.org/biocLite.R\")\n",
      "biocLite(\"peer\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###1.1 Genotype PCs\n",
      "\n",
      "In the pipeline, we will take the top 3 genotype PCs given by the GTEx consortium. The genotype PCs have been calculated by the GTEx consortium:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "cp GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_PostImput_20genotPCs.txt \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# In R, run the following:\n",
      "\n",
      "genotype_PCs = read.table('GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_PostImput_20genotPCs.txt', sep = '\\t', stringsAsFactors=FALSE, header=TRUE)\n",
      "rownames(genotype_PCs) = sapply(c(1:dim(genotype_PCs)[1]), function(x) {paste(strsplit(genotype_PCs$FID[x], split = \"-\")[[1]][1], strsplit(genotype_PCs$FID[x], split = \"-\")[[1]][2], sep = \".\")})\n",
      "genotype_PCs_top3 = genotype_PCs[,c(\"C1\",\"C2\",\"C3\")]\n",
      "write.table(genotype_PCs_top3, file = \"GTEx_Genotype_PCs_top3.txt\", sep = '\\t')\n",
      "\n",
      "load(\"subSample1MLFA1.RData\")\n",
      "row.names(subSample1MLFA) = sapply(row.names(subSample1MLFA), function(x) {paste(\"GTEX.\", x, sep = \"\")})\n",
      "genotype_PCs_top3_LFA = subSample1MLFA[,c(1:3)]\n",
      "colnames(genotype_PCs_top3_LFA) = c(\"C1\",\"C2\",\"C3\")\n",
      "write.table(genotype_PCs_top3_LFA, file = \"LFA_Genotype_PCs_top3.txt\", sep = \"\\t\")\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###1.2 Alternate - Genotype PCs\n",
      "\n",
      "We can alternatively take take the top three genotype PCs that have been generated by the known genotypes and using LFA:\n",
      "http://arxiv.org/abs/1312.2041 and \n",
      "https://github.com/StoreyLab/lfa"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# In R:\n",
      "\n",
      "install.packages(\"devtools\")\n",
      "library(\"devtools\")\n",
      "install_github(\"Storeylab/lfa\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/LFA/lfaWrapper.sh\n",
      "\n",
      "#!/usr/bin/Rscript\n",
      "# sbatch -t 24:00:00 -N 1 --ntasks-per-node=1 --mail-type=begin --mail-type=end --mail-user=bj5@princeton.edu --mem=125000 ./lfaWrapper.sh\n",
      "library('lfa')\n",
      "library('Matrix')\n",
      "library('R.utils')\n",
      "\n",
      "NUM_SUBJECTS = 450\n",
      "NUM_CHRS = 22\n",
      "\n",
      "# Take in subject names\n",
      "tempRowNames = read.table('/tigress/BEE/gtex/data/genotype/imputed_genotypes/allelic_dosage/GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_allchr_genot_imput_info04_maf05_HWEp1E6_dbSNPIDs.chr22.allelic_dosage.txt', header = TRUE)\n",
      "indNames = colnames(tempRowNames)\n",
      "indNames = indNames[2:NUM_SUBJECTS+1]\n",
      "remove(tempRowNames)\n",
      "indNames = substr(indNames,6,10)\n",
      "\n",
      "# Tidy the subject IDs\n",
      "for(i in seq(1,length(indNames))){\n",
      "    if(substr(indNames[i],5,5)=='.'){\n",
      "        indNames[i] = substr(indNames[i],1,4)\n",
      "    }\n",
      "}\n",
      "\n",
      "numRow = 0\n",
      "for(i in seq(1,NUM_CHRS)){\n",
      "    numRow = numRow + countLines(paste(\"/tigress/BEE/gtex/data/genotype/imputed_genotypes/plink/GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_allchr_genot_imput_info04_maf05_HWEp1E6_dbSNPIDs.chr\",i,\".bim\",sep=''))\n",
      "}\n",
      "print(numRow)\n",
      "genotypeHolder = matrix(nrow=numRow,ncol=NUM_SUBJECTS)\n",
      "genoHolderIndex = 1\n",
      "for(i in seq(1,NUM_CHRS)){\n",
      "    dataLength = countLines(paste('/tigress/BEE/gtex/data/genotype/imputed_genotypes/plink/GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_allchr_genot_imput_info04_maf05_HWEp1E6_dbSNPIDs.chr',i,'.bim',sep=\"\"))\n",
      "    temp = read.bed(paste('/tigress/BEE/gtex/data/genotype/imputed_genotypes/plink/GTEx_Analysis_2015-01-12_OMNI_2.5M_5M_450Indiv_allchr_genot_imput_info04_maf05_HWEp1E6_dbSNPIDs.chr',i,sep=\"\"))\n",
      "    genotypeHolder[genoHolderIndex:(genoHolderIndex + dataLength - 1),] = temp\n",
      "    genoHolderIndex = genoHolderIndex + dataLength\n",
      "    remove(temp)\n",
      "}\n",
      "print(object.size(genotypeHolder))\n",
      "\n",
      "# Sample 1M snps to form genotype PCs\n",
      "for(i in seq(1,10)){\n",
      "    indices = sample(1:numRow,1000000,replace = FALSE)\n",
      "    genotypeSample = genotypeHolder[indices,]\n",
      "\n",
      "    subSample1MLFA = lfa(genotypeSample,d=51)\n",
      "    rownames(subSample1MLFA) = indNames\n",
      "    save(subSample1MLFA, file=paste('subSample1MLFA',i,'.RData',sep=''))\n",
      "}"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/LFA/lfaWrapper.sh\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Running the following code results in files subSample1MLFA1.RData through subSample1MLFA10.RData, and all ten iterations correlate well with each other in for the first 3 PCs."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###1.3 Scripts for running PEER on expression matrices\n",
      "\n",
      "Now let's prepare the scripts that set up tissue-specific PEER factor estimation; we'll start with featurecounts and PEER factors from 10 to 75 in increments of 5."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_wrapper.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/ \\\n",
      "_genes_certain_biotypes_counts_normalized.txt \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt \\\n",
      "featurecounts /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/PEER/ 10 15 20 25 30 35 40 45 50 55 60 \n",
      "\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_10_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_15_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_20_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_25_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_30_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_35_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_40_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_45_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_50_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_55_cis_peer_correction_chr11_wrapper.sh\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_60_cis_peer_correction_chr11_wrapper.sh\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_wrapper.py\n",
      "##################################################\n",
      "#  cis_peer_correction_chr11_wrapper.py\n",
      "#\n",
      "#\n",
      "##################################################\n",
      "\n",
      "import glob\n",
      "from sys import argv\n",
      "import os\n",
      "\n",
      "in_path = argv[1]\n",
      "in_suffix = argv[2]\n",
      "genotype_pcs = argv[3]\n",
      "gene_metadata = argv[4]\n",
      "method = argv[5]\n",
      "# in_path = in_path + method + '/TPM_with_genotypes/' + '*' + '_' + method + in_suffix\n",
      "# out_path = argv[6] + \"to be appended later\"\n",
      "joblog_dir = argv[7]\n",
      "\n",
      "# Array of PEER factors that are given in the input\n",
      "peer_factors = argv[8:len(argv)]\n",
      "\n",
      "# Examples\n",
      "# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/'\n",
      "# or\n",
      "# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/'\n",
      "# argv[2] = '_genes_certain_biotypes_TPM.txt'\n",
      "# or \n",
      "# argv[2] = '_genes_certain_biotypes_counts_normalized.txt'\n",
      "# argv[3] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt'\n",
      "# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
      "# argv[5] = 'featurecounts'\n",
      "# argv[6] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
      "# argv[7] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/PEER'\n",
      "# argv[8:len(argv)] = ['10','20','30','40','50']\n",
      "\n",
      "# Make the job log directories\n",
      "if not os.path.exists(joblog_dir):\n",
      "    os.makedirs(joblog_dir)\n",
      "if not os.path.exists(joblog_dir + 'err'):\n",
      "    os.makedirs(joblog_dir + 'err')\n",
      "if not os.path.exists(joblog_dir + 'out'):\n",
      "    os.makedirs(joblog_dir + 'out')\n",
      "\n",
      "matrices = glob.glob(in_path + '*' + in_suffix)\n",
      "\n",
      "for peer_factor in peer_factors:\n",
      "    out_path = argv[6] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/'\n",
      "    if not os.path.exists(out_path):\n",
      "        os.makedirs(out_path)\n",
      "\n",
      "    master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + \"_\" + peer_factor + \"_cis_peer_correction_chr11_wrapper.sh\"\n",
      "    master_handle=open(master_script, 'w')\n",
      "    master_handle.write(\"#!/bin/bash\\n\\n\")\n",
      "\n",
      "    # Examples\n",
      "    # peer_factor = 10\n",
      "    # residual_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000/Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM.txt'\n",
      "    # tissue_name = 'Adipose-Subcutaneous'\n",
      "    factor_file_directory = out_path + 'PEER_factors'\n",
      "    if not os.path.exists(factor_file_directory):\n",
      "        os.makedirs(factor_file_directory)\n",
      "\n",
      "    k = 0\n",
      "    for i, matrix in enumerate(matrices):\n",
      "        filename = matrix.split('/')[-1]\n",
      "        tissue_name = str.split(filename, '_'+method)[0]\n",
      "        out_matrix = out_path + filename\n",
      "        # out_matrix = out_path + filename.replace('_TPM.txt','_TPM_GC.txt')\n",
      "        if os.path.isfile(out_matrix):\n",
      "            continue\n",
      "        k+=1\n",
      "        genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
      "        outfilename = method+\"_\"+tissue_name+\"_\"+peer_factor\n",
      "        sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/PEER/\" + method + '_TPM_GC_3GenPC_' + peer_factor + 'PEER1000_' + tissue_name + \".slurm\"\n",
      "        sbatchhandle=open(sbatchfile, 'w')\n",
      "        cmd=r\"\"\"#!/bin/bash\n",
      "#SBATCH -J PEER_%s_%s   # job name\n",
      "#SBATCH --mem=6000     # 6 GB requested\n",
      "#SBATCH -t 04:00:00     \n",
      "#SBATCH -e %s           # err output directory\n",
      "#SBATCH -o %s           # out output directory\n",
      "\n",
      "/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11.R \\\n",
      "%s %s %s %s %s %s %s\n",
      "\n",
      "    \"\"\"%(tissue_name, peer_factor, joblog_dir+'err/'+outfilename+'.err', joblog_dir+'out/'+outfilename+'.out', matrix, genotype_pcs, gene_metadata, peer_factor, out_matrix, tissue_name, factor_file_directory)\n",
      "        sbatchhandle.write(cmd)\n",
      "        sbatchhandle.close()\n",
      "        master_handle.write(\"sbatch \" + sbatchfile  + \" \\n\")\n",
      "\n",
      "    master_handle.close()\n",
      "\n",
      "    print 'sh %s'%(master_script)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_wrapper.py\n"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11.R\n",
      "####################################################\n",
      "# cis_peer_correction_chr11.R\n",
      "# Exploratory: Testing with chr 11.\n",
      "#\n",
      "####################################################\n",
      "\n",
      "#!/usr/bin/Rscript\n",
      "\n",
      "## Collect arguments\n",
      "args <-commandArgs(TRUE)\n",
      "\n",
      "## Default setting when no arguments passed\n",
      "if (length(args) != 7 ){\n",
      "  args <- c(\"--help\")\n",
      "}\n",
      "## Help section\n",
      "if(\"--help\" %in% args) {\n",
      "  cat(\"\n",
      "      cis_peer_correction.R\n",
      " \n",
      "      Usage:\n",
      "      ./cis_peer_correction.R /path/to/expression_matrix.txt /path/to/genotype_pcs.RData \\\n",
      "      /path/to/gene_metadata/ num_of_peer_factors \\\n",
      "      /path/to/normalized_expression_matrix.txt tissue_name /path/to/factor_files\\n\")\n",
      "  q(save=\"no\")\n",
      "}\n",
      "\n",
      "expr_matrix = args[1]\n",
      "genotype_pcs = args[2]\n",
      "gene_metadata = args[3]\n",
      "peer_factors = as.numeric(args[4])\n",
      "residual_matrix = args[5]\n",
      "tissue_name = args[6]\n",
      "factor_file_directory = args[7]\n",
      "\n",
      "# expr_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/featurecounts/TPM_with_genotypes/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_TPM.txt'\n",
      "# expr_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_counts_normalized.txt'\n",
      "# genotype_pcs = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt'\n",
      "# gene_metadata = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
      "# peer_factors = '10'\n",
      "# residual_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000/Brain-Hypothalamus_featurecounts_genes_certain_biotypes_TPM_GC.txt'\n",
      "# tissue_name = 'Brain-Hypothalamus'\n",
      "# factor_file_directory = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000/PEER_factors'\n",
      "\n",
      "library(peer)\n",
      "library(dplyr)\n",
      "\n",
      "# Load in the genotype PCs (top 3 will be used for this pipeline), saved as samSample1MLFA\n",
      "genotype_PCs_top3 = read.table(genotype_pcs, stringsAsFactors=FALSE, header=TRUE, sep='\\t')\n",
      "gene_metadata = read.table(gene_metadata, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1)\n",
      "\n",
      "# Load in the expression matrices\n",
      "# Also correct for covariates using PEER with K factors and N steps\n",
      "\n",
      "expr_matrix = read.table(file=expr_matrix, sep='\\t', header=TRUE, stringsAsFactors=FALSE, row.names=1)\n",
      "# The samples included are already sorted for samples with genotypes.\n",
      "\n",
      "# Data massaging: left_join is similar to a databaes left join, matching entries in first table to the second\n",
      "expr_matrix$gene_id = row.names(expr_matrix)\n",
      "gene_metadata$gene_id = row.names(gene_metadata)\n",
      "joined_matrix = left_join(expr_matrix, gene_metadata, by=\"gene_id\")\n",
      "# Take only chr 11 for now.\n",
      "joined_matrix = filter(joined_matrix, chr == 11)\n",
      "joined_matrix = arrange(joined_matrix, start)\n",
      "row.names(joined_matrix) = joined_matrix$gene_id\n",
      "joined_matrix = select(joined_matrix, 1:(dim(joined_matrix)[2]-6))\n",
      "\n",
      "# Optional: remove genes that are expressed in less than 10% of individuals.\n",
      "joined_matrix = joined_matrix[sapply(1:dim(joined_matrix)[1], function(x) {(sum(joined_matrix[x,]==0)/dim(joined_matrix)[2])<0.1}),]\n",
      "\n",
      "# PEER takes in an inverted matrix as input\n",
      "inverted_matrix = as.data.frame(t(data.matrix(joined_matrix)))\n",
      "genotype_PCs_top3_df = as.data.frame(genotype_PCs_top3[,1:3])\n",
      "inverted_matrix$subject_id = row.names(inverted_matrix)\n",
      "genotype_PCs_top3_df$subject_id = row.names(genotype_PCs_top3_df)\n",
      "inverted_joined = left_join(inverted_matrix, genotype_PCs_top3_df, by=\"subject_id\")\n",
      "subject_list = inverted_joined$subject_id\n",
      "\n",
      "# Top 3 PCs, can be changed for future tasks\n",
      "genotype_top3 = as.matrix(inverted_joined[,(dim(inverted_joined)[2]-2):dim(inverted_joined)[2]])\n",
      "inverted_joined = as.matrix(select(inverted_joined, c(1:(dim(inverted_joined)[2]-4))))\n",
      "\n",
      "# Now set up PEER for detection of hidden covariates\n",
      "model = PEER()\n",
      "\n",
      "PEER_setPhenoMean(model,inverted_joined)\n",
      "PEER_setCovariates(model, genotype_top3)\n",
      "PEER_setNk(model,peer_factors)\n",
      "PEER_update(model)\n",
      "\n",
      "# Now that PEER has been set up, we can obtain the residuals and save the resulting matrix:\n",
      "residuals = PEER_getResiduals(model)\n",
      "colnames(residuals) = colnames(inverted_joined)\n",
      "rownames(residuals) = subject_list\n",
      "\n",
      "write.table(as.data.frame(t(data.matrix(residuals))), file=residual_matrix, quote = FALSE, sep = \"\\t\")\n",
      "\n",
      "# Also save the factors, weights, and alpha values:\n",
      "write.table(PEER_getX(model), file=paste(factor_file_directory, '/', tissue_name, '_Factors.txt', sep=\"\"), row.names=FALSE, col.names=FALSE)\n",
      "write.table(PEER_getW(model), file=paste(factor_file_directory, '/', tissue_name, '_Loadings.txt', sep=\"\"), row.names=FALSE, col.names=FALSE)\n",
      "write.table(PEER_getAlpha(model), file=paste(factor_file_directory, '/', tissue_name, '_Alphas.txt', sep=\"\"), row.names=FALSE, col.names=FALSE)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11.R\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The result is that we have the expression matrices that are the residuals of K peer factors and 3 genotype PCs, saved under:\n",
      "\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000\n",
      "\n",
      "as well as the corresponding peer factors, loadings and alphas.\n",
      "\n",
      "We can also repeat the procedure for htseq and rsem.\n",
      "\n",
      "\n",
      "### 1.4 Checkpoint: Check if the PEER jobs have run successfully, and rerun the missing files\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_rerun_wrapper.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/ \\\n",
      "_genes_certain_biotypes_counts_normalized.txt \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt \\\n",
      "featurecounts /tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/PEER/ 10 15 20 25 30 35 40 45 50 55 60 \n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "10\n",
        "20\n",
        "30\n",
        "Missing: Vagina\n",
        "Missing: SmallIntestine-TerminalIleum\n",
        "Missing: Cells-EBV-transformedlymphocytes\n",
        "40\n",
        "Missing: Vagina\n",
        "Missing: Ovary\n",
        "Missing: SmallIntestine-TerminalIleum\n",
        "50\n",
        "Missing: Pituitary\n",
        "Missing: Brain-Hypothalamus\n",
        "Missing: Vagina\n",
        "Missing: Prostate\n",
        "Missing: Brain-CerebellarHemisphere\n",
        "Missing: Ovary\n",
        "Missing: Testis\n",
        "Missing: Cells-EBV-transformedlymphocytes\n",
        "Missing: Brain-Putamen_basalganglia_\n",
        "Missing: Kidney-Cortex\n",
        "Missing: Colon-Transverse\n",
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_cis_peer_correction_chr11_rerun_wrapper.sh\n",
        "Number of jobs: 17\n"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_rerun_wrapper.py\n",
      "##################################################\n",
      "#  cis_peer_correction_chr11_rerun_wrapper.py\n",
      "#  provide jobs that have failed with more time and memory\n",
      "#\n",
      "##################################################\n",
      "\n",
      "import glob\n",
      "from sys import argv\n",
      "import os\n",
      "\n",
      "in_path = argv[1]\n",
      "in_suffix = argv[2]\n",
      "genotype_pcs = argv[3]\n",
      "gene_metadata = argv[4]\n",
      "method = argv[5]\n",
      "# in_path = in_path + method + '/TPM_with_genotypes/' + '*' + '_' + method + in_suffix\n",
      "# out_path = argv[6] + \"to be appended later\"\n",
      "joblog_dir = argv[7]\n",
      "\n",
      "# Array of PEER factors that are given in the input\n",
      "peer_factors = argv[8:len(argv)]\n",
      "\n",
      "# Examples\n",
      "# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/'\n",
      "# or\n",
      "# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/normalized_by_length_GC_and_depth_hg19/'\n",
      "# argv[2] = '_genes_certain_biotypes_TPM.txt'\n",
      "# or \n",
      "# argv[2] = '_genes_certain_biotypes_counts_normalized.txt'\n",
      "# argv[3] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/GTEx_Genotype_PCs_top3.txt'\n",
      "# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
      "# argv[5] = 'featurecounts'\n",
      "# argv[6] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
      "# argv[7] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/PEER'\n",
      "# argv[8:len(argv)] = ['10','20','30','40','50']\n",
      "\n",
      "# Create full tissue_list.\n",
      "tissue_list = []\n",
      "matrices = glob.glob(in_path + '*' + in_suffix)\n",
      "for i, matrix in enumerate(matrices):\n",
      "    filename = matrix.split('/')[-1]\n",
      "    tissue_name = str.split(filename, '_'+method)[0]\n",
      "    tissue_list.append(tissue_name)\n",
      "\n",
      "master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + \"_cis_peer_correction_chr11_rerun_wrapper.sh\"\n",
      "master_handle=open(master_script, 'w')\n",
      "master_handle.write(\"#!/bin/bash\\n\\n\")\n",
      "\n",
      "k = 0\n",
      "for peer_factor in peer_factors:\n",
      "    print(peer_factor)\n",
      "    out_path = argv[6] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/'\n",
      "\n",
      "    # Examples\n",
      "    # peer_factor = 10\n",
      "    # residual_matrix = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_10PEER1000/Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM.txt'\n",
      "    # tissue_name = 'Adipose-Subcutaneous'\n",
      "    factor_file_directory = out_path + 'PEER_factors'\n",
      "\n",
      "    for tissue in tissue_list:\n",
      "        out_matrix = out_path + tissue + \"_\" + method + in_suffix\n",
      "        # out_matrix = out_path + tissue + \"_\" + method + in_suffix.replace('_TPM.txt','_TPM_GC.txt')\n",
      "        if os.path.isfile(out_matrix):\n",
      "            continue\n",
      "        print(\"Missing: \" + tissue)\n",
      "        genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
      "        outfilename = method+\"_\"+tissue+\"_\"+peer_factor\n",
      "        sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/PEER/\" + method + '_TPM_GC_3GenPC_' + peer_factor + 'PEER1000_' + tissue + \".slurm\"\n",
      "        sbatchhandle=open(sbatchfile, 'w')\n",
      "        cmd=r\"\"\"#!/bin/bash\n",
      "#SBATCH -J PEER_%s_%s      # job name\n",
      "#SBATCH --mem=12000     # 12 GB requested\n",
      "#SBATCH -t 08:00:00     \n",
      "#SBATCH -e %s           # err output directory\n",
      "#SBATCH -o %s           # out output directory\n",
      "\n",
      "/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11.R \\\n",
      "%s %s %s %s %s %s %s\n",
      "\n",
      "    \"\"\"%(tissue, peer_factor, joblog_dir+'err/'+outfilename+'.err', joblog_dir+'out/'+outfilename+'.out', matrix, genotype_pcs, gene_metadata, peer_factor, out_matrix, tissue, factor_file_directory)\n",
      "        sbatchhandle.write(cmd)\n",
      "        sbatchhandle.close()\n",
      "        master_handle.write(\"sbatch \" + sbatchfile  + \" \\n\")\n",
      "        k+=1\n",
      "\n",
      "master_handle.close()\n",
      "\n",
      "print 'sh %s'%(master_script)\n",
      "print 'Number of jobs:', k"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_peer_correction_chr11_rerun_wrapper.py\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##2. Running MatrixEQTL for optimization\n",
      "\n",
      "With the expression matrices, and other necessary inputs generated in the accompanying notebook misc_meta_analysis.ipynb, we can go ahead and perform univariate eQTL detection using MatrixEQTL: \n",
      "\n",
      "http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/\n",
      "\n",
      "###2.1 Running MatrixEQTL with the output of PEER factors, only for Chromosome 11\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "# Set up the MatrixEQTL env.\n",
      "\n",
      "cd /tigress/BEE/gtex/analyses/user_specific/cis-mapping\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts\n",
      "\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_10PEER1000\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_20PEER1000\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_30PEER1000\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_40PEER1000\n",
      "mkdir /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/TPM_GC_3GenPC_50PEER1000\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "mkdir: cannot create directory `/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL': File exists\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash \n",
      "\n",
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
      "featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 0 \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis 10 15 20 25 30 35 40 45 50 55 60\n",
      "\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_10_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_15_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_20_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_25_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_30_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_35_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_40_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_45_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_50_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_55_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_60_Noimput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
      "featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 1 \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis 10 15 20 25 30 35 40 45 50 55 60\n",
      "\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_10_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_15_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_20_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_25_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_30_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_35_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_40_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_45_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_50_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_55_Imput_cis_matrix_eqtl_chr11_wrapper.sh\n",
      "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_60_Imput_cis_matrix_eqtl_chr11_wrapper.sh"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper.py\n",
      "##################################################\n",
      "#  cis_matrix_eqtl_wrapper.py\n",
      "#  Wrapper for the cis_matrix_eqtl.R file and submitting the batch jobs\n",
      "#\n",
      "##################################################\n",
      "\n",
      "import glob\n",
      "from sys import argv\n",
      "import os.path\n",
      "\n",
      "in_path = argv[1]\n",
      "in_suffix = argv[2]\n",
      "method = argv[3]\n",
      "mapping_dir = argv[4]\n",
      "joblog_dir = argv[6]\n",
      "incl_imput = argv[7]\n",
      "analysis_dir = argv[8]\n",
      "\n",
      "# Array of PEER factors that are given in the input\n",
      "peer_factors = argv[9:len(argv)]\n",
      "\n",
      "# Examples\n",
      "# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
      "# argv[2] = '_genes_certain_biotypes_TPM_GC.txt'\n",
      "# argv[3] = 'featurecounts'\n",
      "# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping'\n",
      "# argv[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/'\n",
      "# argv[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/'\n",
      "# argv[7] = '0' or '1'\n",
      "# argv[8] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis'\n",
      "# argv[9:len(argv)] = ['10','20','30','40','50']\n",
      "\n",
      "# Make the job log directories\n",
      "if not os.path.exists(joblog_dir):\n",
      "    os.makedirs(joblog_dir)\n",
      "if not os.path.exists(joblog_dir + 'err'):\n",
      "    os.makedirs(joblog_dir + 'err')\n",
      "if not os.path.exists(joblog_dir + 'out'):\n",
      "    os.makedirs(joblog_dir + 'out')\n",
      "\n",
      "if incl_imput == \"0\":\n",
      "    imput = \"Noimput\"\n",
      "else:\n",
      "    imput = \"Imput\"\n",
      "\n",
      "for peer_factor in peer_factors:\n",
      "    print(peer_factor)\n",
      "\n",
      "    in_path = argv[1] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/' + '*' + '_' + method + in_suffix\n",
      "    out_path = argv[5] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/'\n",
      "\n",
      "    if not os.path.exists(out_path):\n",
      "        os.makedirs(out_path)\n",
      "\n",
      "    matrices = glob.glob(in_path + '*')\n",
      "    print(\"Number of matrices: \" + str(len(matrices)))\n",
      "\n",
      "    master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + \"_\" + peer_factor + '_' + imput + \"_cis_matrix_eqtl_chr11_wrapper.sh\"\n",
      "    master_handle=open(master_script, 'w')\n",
      "    master_handle.write(\"#!/bin/bash\\n\\n\")\n",
      "\n",
      "    # MatrixEQTL parameters\n",
      "    incl_trans = '0'    # Don't include trans\n",
      "    chr_number = '11'   # chromosome number\n",
      "    cis_dist = '1e5'    # cis window\n",
      "\n",
      "    k = 0\n",
      "    for i, matrix in enumerate(matrices):\n",
      "        filename = matrix.split('/')[-1]\n",
      "        tissue_name = str.split(filename, '_'+method)[0]\n",
      "        out_file = out_path + filename.replace('.txt','_MatrixEQTL')\n",
      "        if os.path.isfile(out_file):\n",
      "            continue\n",
      "        k+=1\n",
      "        genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
      "        sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/MatrixEQTL/\" + method + '_TPM_GC_3GenPC_' + peer_factor + '_MatrixEQTL_chr' + chr_number + '_' + tissue_name + '_' + imput + \".slurm\"\n",
      "        outfilename = method+\"_\"+tissue_name+\"_\"+peer_factor\n",
      "        sbatchhandle=open(sbatchfile, 'w')\n",
      "        cmd=r\"\"\"#!/bin/bash\n",
      "#SBATCH -J meqtl_%s_%s      # job name\n",
      "#SBATCH --mem=6000           # 6 GB requested\n",
      "#SBATCH -t 04:00:00           \n",
      "#SBATCH -e %s           # err output directory\n",
      "#SBATCH -o %s           # out output directory\n",
      "\n",
      "/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R \\\n",
      "%s %s %s %s %s %s %s %s %s %s\n",
      "\n",
      "\"\"\"%(tissue_name, peer_factor, joblog_dir+'err/'+outfilename+'_'+imput+'.err', joblog_dir+'out/'+outfilename+'_'+imput+'.out', matrix, incl_trans, incl_imput, chr_number, cis_dist, peer_factor, mapping_dir, out_file, method, analysis_dir)\n",
      "        sbatchhandle.write(cmd)\n",
      "        sbatchhandle.close()\n",
      "        master_handle.write(\"sbatch \" + sbatchfile  + \" \\n\")\n",
      "\n",
      "    master_handle.close()\n",
      "\n",
      "    print 'sh %s'%(master_script)\n",
      "    print 'Number of jobs:', k"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper.py\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R\n",
      "####################################################\n",
      "# cis_matrix_eqtl.R\n",
      "# Performs univariate cis-eQTL analysis using Matrix eQTL\n",
      "#\n",
      "####################################################\n",
      "\n",
      "#!/usr/bin/Rscript\n",
      "\n",
      "# Collect arguments\n",
      "args <-commandArgs(TRUE)\n",
      "\n",
      "## Default setting when no arguments passed\n",
      "if (length(args) != 10 ){\n",
      "  args <- c(\"--help\")\n",
      "}\n",
      "## Help section\n",
      "if(\"--help\" %in% args) {\n",
      "  cat(\"\n",
      "      cis_peer_correction.R\n",
      " \n",
      "      Usage:\n",
      "      ./cis_matrix_eqtl.R /path/to/expression_matrix.txt include_trans_binary \\\n",
      "      include_imputed_binary chromosome_number cis_disance_def peer_factor  \\\n",
      "      /path/to/mapping/directory /path/to/output/files method /path/to/analysis/dir\\n\")\n",
      "  q(save=\"no\")\n",
      "}\n",
      "\n",
      "expression_file_location = args[1]\n",
      "incl_trans = args[2]\n",
      "incl_imput = args[3]\n",
      "chr_number = args[4]\n",
      "cis_dist = as.numeric(args[5])\n",
      "peer_factor = args[6]\n",
      "\n",
      "# Example\n",
      "# args[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/featurecounts/TPM_GC_3GenPC_45PEER1000/Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM_GC.txt'\n",
      "# args[2] = '0'\n",
      "# args[3] = '0'\n",
      "# args[4] = '11'\n",
      "# args[5] = '1e5'\n",
      "# args[6] = '10'\n",
      "\n",
      "imput = 'Imput'\n",
      "if (incl_imput == '0') {imput = 'Noimput'}\n",
      "SNP_file_name = paste(args[7], '/genotype/genotypesChr', chr_number, '_', imput, '.txt', sep=\"\")\n",
      "covariates_file_name = character()\n",
      "SNP_position_file = paste(args[7], '/references/SNP_positions/SNP_positions_Chr', chr_number, '.txt', sep=\"\")\n",
      "gene_position_file = paste(args[7], '/references/gene_positions/gene_positions_Chr', chr_number, '.txt', sep=\"\")\n",
      "\n",
      "output_file_name_cis = paste(args[8], '_Chr', chr_number, '_', imput, '_cis.txt', sep=\"\")\n",
      "output_file_name_trans = paste(args[8], '_Chr', chr_number, '_', imput, '_trans.txt', sep=\"\")\n",
      "\n",
      "method = paste(args[9])\n",
      "expression_file_name = strsplit(expression_file_location, split=\"/\")[[1]][length(strsplit(expression_file_location, split=\"/\")[[1]])]\n",
      "tissue_name = strsplit(expression_file_name, split = paste('_', method, sep=\"\"))[[1]][1] \n",
      "analysis_dir = paste(args[10], \"/\", tissue_name, sep=\"\")\n",
      "\n",
      "# Example\n",
      "# args[7] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping'\n",
      "# args[8] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/\n",
      "# ... TPM_GC_3GenPC_30PEER1000/Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM_GC_MatrixEQTL'\n",
      "# args[9] = 'featurecounts'\n",
      "# args[10] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis'\n",
      "\n",
      "# Need to check that the subjects read in from expression matrices match the subjects in genotype data\n",
      "library(dplyr)\n",
      "\n",
      "expression_matrix = read.csv(file = expression_file_location, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
      "genotype_matrix = read.csv(file = SNP_file_name, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
      "row.names(genotype_matrix) = genotype_matrix$SNP\n",
      "\n",
      "genotype_matrix = genotype_matrix[,sapply(colnames(expression_matrix), function(x) {match(x, colnames(genotype_matrix))})]\n",
      "\n",
      "# Also obtain genes and snps that are in the expression and genotype files, respectively.\n",
      "gene_positions = read.table(file = gene_position_file, header = FALSE, sep = '\\t', stringsAsFactors = FALSE)\n",
      "snp_positions = read.table(file = SNP_position_file, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
      "\n",
      "colnames(gene_positions) = c(\"gene_ID\", \"chr_num\", \"start\", \"end\")\n",
      "gene_positions = mutate(gene_positions, chr = paste(\"chr\", chr_num, sep = \"\"))\n",
      "gene_positions = select(gene_positions, c(gene_ID, chr, start, end))\n",
      "# Make sure the list of genes and their ordering are the same\n",
      "expression_matrix = expression_matrix[rownames(expression_matrix) %in% gene_positions$gene_ID,]\n",
      "gene_positions = gene_positions[sapply(rownames(expression_matrix), function(x) {match(x, gene_positions$gene_ID)}),]\n",
      "\n",
      "# Import MatrixEQTL\n",
      "library(MatrixEQTL)\n",
      "\n",
      "# Use linear model with p threshold of 0.01\n",
      "useModel = modelLINEAR;\n",
      "pvOutputThreshold_cis = 1e-2\n",
      "pvOutputThreshold_trans = 0\n",
      "if (incl_trans == '1') {pvOutputThreshold_trans = 1e-2}\n",
      "\n",
      "snps = SlicedData$new()\n",
      "snps$CreateFromMatrix(as.matrix(genotype_matrix));\n",
      "\n",
      "gene = SlicedData$new()\n",
      "gene$CreateFromMatrix(as.matrix(expression_matrix));\n",
      "\n",
      "gene_qnorm = gene$copy();\n",
      "\n",
      "# Quantile-normalize the expression values - ties broken by averaging\n",
      "for( sl in 1:length(gene_qnorm) ) {\n",
      "  mat = gene_qnorm[[sl]];\n",
      "  mat = t(apply(mat, 1, rank, ties.method = \"average\"));\n",
      "  mat = qnorm(mat / (ncol(gene_qnorm)+1));\n",
      "  gene_qnorm[[sl]] = mat;\n",
      "}\n",
      "rm(sl, mat);\n",
      "\n",
      "# me = Matrix_eQTL_main(\n",
      "#     snps = snps,\n",
      "#     gene = gene,\n",
      "#     output_file_name = output_file_trans,\n",
      "#     pvOutputThreshold = pvOutputThreshold,\n",
      "#     useModel = useModel, \n",
      "#     verbose = TRUE,\n",
      "#     pvalue.hist = \"qqplot\",\n",
      "#     min.pv.by.genesnp = FALSE,\n",
      "#     noFDRsaveMemory = FALSE);\n",
      "\n",
      "# Only for generating the P-value hist:\n",
      "me = Matrix_eQTL_main(\n",
      "    snps = snps,\n",
      "    gene = gene_qnorm,\n",
      "    output_file_name = output_file_name_trans,\n",
      "    pvOutputThreshold = pvOutputThreshold_trans,\n",
      "    useModel = useModel, \n",
      "    verbose = TRUE,\n",
      "    output_file_name.cis = output_file_name_cis,\n",
      "    pvOutputThreshold.cis = pvOutputThreshold_cis,\n",
      "    snpspos = snp_positions, \n",
      "    genepos = gene_positions,\n",
      "    cisDist = cis_dist,\n",
      "    pvalue.hist = 100,\n",
      "    min.pv.by.genesnp = FALSE,\n",
      "    noFDRsaveMemory = FALSE)\n",
      "\n",
      "# Save the histogram\n",
      "output_file_name_hist = paste(analysis_dir, '/', tissue_name, '_', imput, '_Chr', chr_number, '_', peer_factor, '_hist.pdf', sep=\"\")\n",
      "print(paste(\"Saving to:\", output_file_name_hist))\n",
      "pdf(output_file_name_hist)\n",
      "plot(me)\n",
      "dev.off()\n",
      "\n",
      "# repeat for the qqplot:\n",
      "me_q = Matrix_eQTL_main(\n",
      "  snps = snps,\n",
      "  gene = gene_qnorm,\n",
      "  output_file_name = output_file_name_trans,\n",
      "  pvOutputThreshold = pvOutputThreshold_trans,\n",
      "  useModel = useModel, \n",
      "  verbose = TRUE,\n",
      "  output_file_name.cis = output_file_name_cis,\n",
      "  pvOutputThreshold.cis = pvOutputThreshold_cis,\n",
      "  snpspos = snp_positions, \n",
      "  genepos = gene_positions,\n",
      "  cisDist = cis_dist,\n",
      "  pvalue.hist = \"qqplot\",\n",
      "  min.pv.by.genesnp = FALSE,\n",
      "  noFDRsaveMemory = FALSE)\n",
      "\n",
      "# Save the qqplot\n",
      "output_file_name_qqplot = paste(analysis_dir, '/', tissue_name, '_', imput, '_Chr', chr_number, '_', peer_factor, '_qqplot.pdf', sep=\"\")\n",
      "print(paste(\"Saving to:\", output_file_name_hist))\n",
      "pdf(output_file_name_qqplot)\n",
      "plot(me_q)\n",
      "dev.off()\n",
      "\n",
      "# Save the Matrix EQTL object as RData\n",
      "save(gene_qnorm, me, snps, file=paste(args[8], '_Chr', chr_number, '_', imput, '_me.RData', sep=\"\"))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### 2.2 Checkpoint: Check if the MatrixEQTL jobs have run successfully, and rerun the missing files\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_rerun_wrapper.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
      "featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 0 \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis 10 15 20 25 30 35 40 45 50 55 60\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_rerun_wrapper.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
      "featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 1 \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis 10 15 20 25 30 35 40 45 50 55 60\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_rerun_wrapper.py\n",
      "##################################################\n",
      "#  cis_matrix_eqtl_rerun_wrapper.py\n",
      "#  provide jobs that have failed with more time and memory\n",
      "#\n",
      "##################################################\n",
      "\n",
      "import glob\n",
      "from sys import argv\n",
      "import os.path\n",
      "\n",
      "in_path = argv[1]\n",
      "in_suffix = argv[2]\n",
      "method = argv[3]\n",
      "mapping_dir = argv[4]\n",
      "joblog_dir = argv[6]\n",
      "incl_imput = argv[7]\n",
      "analysis_dir = argv[8]\n",
      "\n",
      "# Array of PEER factors that are given in the input\n",
      "peer_factors = argv[9:len(argv)]\n",
      "\n",
      "# Examples\n",
      "# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
      "# argv[2] = '_genes_certain_biotypes_TPM_GC.txt'\n",
      "# argv[3] = 'featurecounts'\n",
      "# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping'\n",
      "# argv[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/'\n",
      "# argv[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/'\n",
      "# argv[7] = '0' or '1'\n",
      "# argv[8] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis'\n",
      "# argv[9:len(argv)] = ['10','20','30','40','50']\n",
      "\n",
      "# Make tissue list:\n",
      "tissue_list = []\n",
      "# Use 'TPM_GC_3GenPC_10PEER1000' to build tissue list\n",
      "in_path = in_path + method + '/TPM_GC_3GenPC_10PEER1000/'\n",
      "matrices = glob.glob(in_path + '*' + in_suffix)\n",
      "for i, matrix in enumerate(matrices):\n",
      "    filename = matrix.split('/')[-1]\n",
      "    tissue_name = str.split(filename, '_'+method)[0]\n",
      "    tissue_list.append(tissue_name)\n",
      "\n",
      "if incl_imput == \"0\":\n",
      "    imput = \"Noimput\"\n",
      "else:\n",
      "    imput = \"Imput\"\n",
      "\n",
      "master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + '_' + imput + \"_cis_matrix_eqtl_chr11_rerun_wrapper.sh\"\n",
      "master_handle=open(master_script, 'w')\n",
      "master_handle.write(\"#!/bin/bash\\n\\n\")\n",
      "\n",
      "k = 0\n",
      "for peer_factor in peer_factors:\n",
      "    print(peer_factor)\n",
      "    out_path = argv[5] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/'\n",
      "\n",
      "    # MatrixEQTL parameters\n",
      "    incl_trans = '0'    # Don't include trans\n",
      "    chr_number = '11'   # chromosome number\n",
      "    cis_dist = '1e5'    # cis window\n",
      "\n",
      "    for tissue in tissue_list:\n",
      "        out_file = out_path + tissue + \"_\" + method + in_suffix.replace('.txt','_MatrixEQTL')\n",
      "        check_file = out_path + tissue + \"_\" + method + in_suffix.replace('.txt','_MatrixEQTL_Chr') + chr_number + '_' + imput + '_me.RData'\n",
      "        if os.path.isfile(check_file):\n",
      "            continue\n",
      "        k+=1\n",
      "        print \"Missing: \" + check_file\n",
      "        genes_or_transcripts = ['genes' if 'genes' in matrix else 'transcripts'][0]\n",
      "        sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/MatrixEQTL/\" + method + '_TPM_GC_3GenPC_' + peer_factor + '_MatrixEQTL_chr' + chr_number + '_' + tissue + \"_\" + imput + \".slurm\"\n",
      "        outfilename = method+\"_\"+tissue+\"_\"+peer_factor\n",
      "        sbatchhandle=open(sbatchfile, 'w')\n",
      "        cmd=r\"\"\"#!/bin/bash\n",
      "#SBATCH -J meqtl_%s_%s      # job name\n",
      "#SBATCH --mem=12000           # 12 GB requested\n",
      "#SBATCH -t 08:00:00           # to be placed in the test queue\n",
      "#SBATCH -e %s           # err output directory\n",
      "#SBATCH -o %s           # out output directory\n",
      "\n",
      "/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R \\\n",
      "%s %s %s %s %s %s %s %s %s %s\n",
      "\n",
      "\"\"\"%(tissue, peer_factor, joblog_dir+'err/'+outfilename+'_'+imput+'.err', joblog_dir+'out/'+outfilename+'_'+imput+'.out', matrix, incl_trans, incl_imput, chr_number, cis_dist, peer_factor, mapping_dir, out_file, method, analysis_dir)\n",
      "        sbatchhandle.write(cmd)\n",
      "        sbatchhandle.close()\n",
      "        master_handle.write(\"sbatch \" + sbatchfile  + \" \\n\")\n",
      "\n",
      "master_handle.close()\n",
      "\n",
      "print 'sh %s'%(master_script)\n",
      "print 'Number of jobs:', k"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_rerun_wrapper.py\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### 2.3 Number of cis-eQTLs as as a function of the number of PEER factors"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "MatrixEQTL outputs the following metric: effect size, p-value, FDR, t-statistic, as well as the list of significant SNP-gene pairs.\n",
      "\n",
      "Since we have a lot of statistics, let's try to construct plots that will hopefully help with the analyses. But we need to create the necessary directories first:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/create_dirs.py\n",
      "\n",
      "##################################################\n",
      "#  create_dirs.py\n",
      "#  create necessary directories\n",
      "#\n",
      "##################################################\n",
      "\n",
      "import glob\n",
      "from sys import argv\n",
      "import os.path\n",
      "\n",
      "in_path = argv[1]\n",
      "in_suffix = argv[2]\n",
      "method = argv[3]\n",
      "out_path = argv[4] + method + '/'\n",
      "\n",
      "in_path = in_path + method + '/TPM_GC_3GenPC_10PEER1000/' + '*' + '_' + method + in_suffix\n",
      "\n",
      "# in_path = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/expression_matrices/'\n",
      "# in_suffix = '_genes_certain_biotypes_TPM.txt'\n",
      "# method = 'featurecounts'\n",
      "# out_path = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/\"\n",
      "\n",
      "directory = out_path + 'analysis'\n",
      "if not os.path.exists(directory):\n",
      "    os.makedirs(directory)\n",
      "\n",
      "matrices = glob.glob(in_path + '*')\n",
      "print(len(matrices))\n",
      "\n",
      "for i, matrix in enumerate(matrices):\n",
      "    filename = matrix.split('/')[-1]\n",
      "    tissue_name = str.split(filename, '_'+method)[0]\n",
      "    \n",
      "    if not os.path.exists(directory + '/' + tissue_name):\n",
      "        os.makedirs(directory + '/' + tissue_name)\n",
      "    if not os.path.exists(directory + '/' + tissue_name + '/Examples/'):\n",
      "        os.makedirs(directory + '/' + tissue_name + '/Examples')\n",
      "        "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/create_dirs.py\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash \n",
      "\n",
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/create_dirs.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
      "featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now let's actually create the plots:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# In R:\n",
      "\n",
      "install.packages(\"ggplot2\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_post_wrapper.py \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/ featurecounts 1e5 \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/SNP_positions/SNP_positions_Chr11.txt \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_number_of_samples.txt 100"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_cis_matrix_eqtl_post_wrapper.sh\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_post_wrapper.py\n",
      "##################################################\n",
      "#  cis_matrix_eqtl_post_wrapper.py\n",
      "#  \n",
      "#\n",
      "##################################################\n",
      "\n",
      "import glob\n",
      "from sys import argv\n",
      "import os.path\n",
      "\n",
      "in_path = argv[1]\n",
      "method = argv[2]\n",
      "cis_cutoff = argv[3]\n",
      "gene_name_file = argv[4]\n",
      "snp_positions_file = argv[5]\n",
      "tissue_list_file = argv[6]\n",
      "num_indiv_cutoff = int(argv[7])\n",
      "\n",
      "# Examples\n",
      "# argv[1] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/'\n",
      "# argv[2] = 'featurecounts'\n",
      "# argv[3] = '1e5'\n",
      "# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
      "# argv[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/SNP_positions/SNP_positions_Chr11.txt'\n",
      "# argv[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_number_of_samples.txt'\n",
      "# argv[7] = '100'\n",
      "\n",
      "master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + \"_cis_matrix_eqtl_post_wrapper.sh\"\n",
      "master_handle=open(master_script, 'w')\n",
      "master_handle.write(\"#!/bin/bash\\n\\n\")\n",
      "\n",
      "# Make tissue list:\n",
      "f = open(tissue_list_file)\n",
      "\n",
      "for line in f.readlines():\n",
      "    entry = str.split(line.strip(),'\\t')\n",
      "    # If the number of individuals is greater than the cutoff:\n",
      "    if int(entry[1]) >= num_indiv_cutoff:\n",
      "        tissue = entry[0]\n",
      "        sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/\" + method + '_' + tissue + \"_cis_matrix_eqtl_post.slurm\"\n",
      "        sbatchhandle=open(sbatchfile, 'w')\n",
      "        cmd=r\"\"\"#!/bin/bash\n",
      "#SBATCH -J %s_Matrix_post      # job name\n",
      "#SBATCH --mem=4000           # 4 GB requested\n",
      "#SBATCH -e /tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/err/Matrix_%s_post.err           # err output directory\n",
      "#SBATCH -o /tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/out/Matrix_%s_post.out           # out output directory\n",
      "#SBATCH -t 5:00:00           \n",
      "\n",
      "/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL.R \\\n",
      "%s %s %s %s \\\n",
      "%s \\\n",
      "%s \\\n",
      "10 15 20 25 30 35 40 45 50 55 60\n",
      "\"\"\"%(tissue, tissue, tissue, in_path, method, cis_cutoff, tissue, gene_name_file, snp_positions_file)\n",
      "        sbatchhandle.write(cmd)\n",
      "        sbatchhandle.close()\n",
      "        master_handle.write(\"sbatch \" + sbatchfile  + \" \\n\")\n",
      "\n",
      "master_handle.close()\n",
      "print 'sh %s'%(master_script)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_post_wrapper.py\n"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL.R \n",
      "####################################################\n",
      "# post_analyses_MatrixEQTL.R \n",
      "# Iterates through K PEER factors for every tissue and produces interesting graphs.\n",
      "#\n",
      "####################################################\n",
      "\n",
      "#!/usr/bin/Rscript\n",
      "\n",
      "# Collect arguments\n",
      "args <-commandArgs(TRUE)\n",
      "\n",
      "library(dplyr)\n",
      "library(MatrixEQTL)\n",
      "library(ggplot2)\n",
      "\n",
      "# Load in data\n",
      "# load(\"Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM_GC_MatrixEQTL_Chr11_Imput_me.RData\")\n",
      "# file_name = Adipose-Subcutaneous_featurecounts_genes_certain_biotypes_TPM_GC_MatrixEQTL_Chr11_Imput_me.RData\n",
      "\n",
      "in_dir = args[1]\n",
      "method = args[2]\n",
      "cis_dist = as.numeric(args[3])\n",
      "in_dir = paste(in_dir, method, \"/\", sep='')\n",
      "tissue = args[4]\n",
      "gene_name_file = args[5]\n",
      "snp_position_file = args[6]\n",
      "peer_factors = args[7:length(args)]\n",
      "\n",
      "# args[1] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/'\n",
      "# args[2] = 'featurecounts'\n",
      "# args[3] = '1e5'\n",
      "# args[4] = 'WholeBlood'\n",
      "# args[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/gene_names_positions.txt'\n",
      "# args[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/SNP_positions/SNP_positions_Chr11.txt'\n",
      "# args[7] = c(10,15,20,25,30,35,40,45,50,55,60)\n",
      "\n",
      "# Save the numer of significant pairs and unique genes detected at each level.\n",
      "Counts_by_tissue = data.frame(tissue = character(0), control = character(0), imputed = character(0), peer_K = numeric(0), pairs = numeric(0), genes = numeric(0), stringsAsFactors = FALSE)\n",
      "#Genes_snps_by_tissue = data.frame(tissue = character(0), imputed = character(0), gene_list = character(0), snp_list = character(0))\n",
      "\n",
      "gene_names = read.csv(gene_name_file, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
      "row.names(gene_names) = gene_names$gene_id\n",
      "print(dim(gene_names))\n",
      "snp_positions = read.csv(snp_position_file, header = TRUE, sep = '\\t', stringsAsFactors = FALSE)\n",
      "row.names(snp_positions) = snp_positions$rsID\n",
      "print(dim(snp_positions))\n",
      "print(tissue)\n",
      "\n",
      "for (peer in peer_factors) {\n",
      "    print(peer)\n",
      "\n",
      "    # First iterate through nonimputed:\n",
      "    result_path = paste(in_dir, \"TPM_GC_3GenPC_\", peer, \"PEER1000/\", sep='')\n",
      "    analysis_path = paste(in_dir, \"analysis/\", tissue, '/', sep='')\n",
      "    # Match a regular expression pattern:\n",
      "    result_file <- dir(result_path, pattern = paste(tissue,\".*\\\\Noimput_me.RData\",sep=''))\n",
      "\n",
      "    # The result file of MatrixEQTL exists but the summary file (output of this) doesn't\n",
      "    if (length(result_file) > 0) {\n",
      "\n",
      "    load(paste(result_path, result_file[1], sep=''))\n",
      "\n",
      "    # Plot top 5 eQTL - unique gene pairs, if they exist.\n",
      "    temp = data.frame(me$cis$eqtls)\n",
      "    temp$snps = as.character(temp$snps)\n",
      "    temp$gene = as.character(temp$gene)\n",
      "    \n",
      "    top_genes = head(unique(temp$gene), n=5)\n",
      "    # Only proceed if there are at least 5 genes.\n",
      "    if (length(top_genes) == 5) {\n",
      "\n",
      "        temp$TSS = gene_names[temp$gene,][,\"start\"]\n",
      "        temp$TES = gene_names[temp$gene,][,\"end\"]\n",
      "        temp$gene_name = gene_names[temp$gene,][,\"gene_name\"]\n",
      "        temp$SNP_pos = snp_positions[temp$snps,][,\"pos\"]\n",
      "        # Difference mapping: -1e5 to 0 if before TSS\n",
      "        # scale from 1 to 1e5 if in gene\n",
      "        # 1e5 to 2e5 if after TES\n",
      "        temp$diff = sapply(c(1:dim(temp)[1]), function(x) {\n",
      "            if ((temp$SNP_pos[x] - temp$TSS[x])<0) {\n",
      "                temp$SNP_pos[x] - temp$TSS[x]\n",
      "            } else if ((temp$SNP_pos[x] - temp$TES[x])>0) {\n",
      "                temp$SNP_pos[x] - temp$TES[x] + cis_dist\n",
      "            } else {\n",
      "                cis_dist * (temp$SNP_pos[x] - temp$TSS[x]) / (temp$TES[x] - temp$TSS[x])\n",
      "            }})\n",
      "\n",
      "        tryCatch({\n",
      "            g = ggplot(temp[temp$FDR < 0.01,], aes(x = diff)) + geom_histogram(binwidth = 10000) + ggtitle('Distribution of distances to TSS') + xlab('Distance') + scale_x_discrete(breaks=c(0, 1e5), labels=c(\"TSS\", \"TES\"))\n",
      "            print(paste(\"Trying to save: \", analysis_path, tissue, \"_Noimput_\", peer, \"_Distance_distribution.pdf\", sep=''))\n",
      "            ggsave(filename = paste(analysis_path, tissue, \"_Noimput_\", peer, \"_Distance_distribution.pdf\", sep=''), plot=g, width = 7, height = 7)\n",
      "        }, error = function(e) {\n",
      "            print(paste(\"Could not save: \", analysis_path, tissue, \"_Noimput_\", peer, \"_Distance_distribution.pdf\", sep=''))\n",
      "        })\n",
      "\n",
      "        for (j in 1:5) {\n",
      "            index = which(temp$gene == top_genes[j])[1]\n",
      "            test_snp = snps$FindRow(temp[index,\"snps\"])\n",
      "            test_gene_q = gene_qnorm$FindRow(temp[index,\"gene\"])\n",
      "\n",
      "            test.df = data.frame(snp = as.numeric(test_snp$row), gene = as.numeric(test_gene_q$row))\n",
      "            row.names(test.df) = colnames(snps)\n",
      "            test.df = test.df[!is.na(test.df$snp),]\n",
      "            test.df$snp_chr = as.character(test.df$snp)\n",
      "            test_name = as.character(temp[index,\"gene_name\"])\n",
      "            test_rsID = as.character(temp[index,\"snps\"])\n",
      "\n",
      "            tryCatch({\n",
      "                g = ggplot(test.df, aes(x = snp_chr, y = gene, color=snp_chr)) + geom_violin() + geom_jitter() + ggtitle(paste(test_name, test_rsID, sep='_')) + xlab(test_rsID) + ylab(test_name)\n",
      "                print(paste(\"Trying to save: \", analysis_path, \"Examples/\", tissue, \"_Noimput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''))\n",
      "                ggsave(filename = paste(analysis_path, \"Examples/\", tissue, \"_Noimput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''), plot = g, width = 7, height = 7)\n",
      "            }, error = function(e) {\n",
      "                print(paste(\"Could not save: \", analysis_path, \"Examples/\", tissue, \"_Noimput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''))\n",
      "            })\n",
      "        }\n",
      "    }\n",
      "    \n",
      "    # Take all\n",
      "    n_pairs = dim(me$cis$eqtls)[1]\n",
      "    n_genes = length(unique(me$cis$eqtls$gene))\n",
      "    Counts_by_tissue = rbind(Counts_by_tissue, data.frame(tissue = tissue, control = \"pval\", imputed = \"Noimput\", peer_K = peer, pairs = n_pairs, genes = n_genes, stringsAsFactors = FALSE))\n",
      "\n",
      "    # Take FDR-controlled at 5%\n",
      "    n_pairs = dim(filter(me$cis$eqtls, FDR <= 0.01))[1]\n",
      "    n_genes = length(unique(filter(me$cis$eqtls, FDR <= 0.01)$gene))\n",
      "    Counts_by_tissue = rbind(Counts_by_tissue, data.frame(tissue = tissue, control = \"FDR\", imputed = \"Noimput\", peer_K = peer, pairs = n_pairs, genes = n_genes, stringsAsFactors = FALSE))\n",
      "}\n",
      "}\n",
      "\n",
      "if (dim(Counts_by_tissue)[1] != 0) {\n",
      "    print(paste(\"Save: \", analysis_path, tissue, \"_Summary_Noimput.RData\", sep=\"\"))\n",
      "    save(Counts_by_tissue, file=paste(analysis_path, tissue, \"_Summary_Noimput.RData\", sep=\"\"))\n",
      "}\n",
      "\n",
      "Counts_by_tissue = data.frame(tissue = character(0), control = character(0), imputed = character(0), peer_K = numeric(0), pairs = numeric(0), genes = numeric(0), stringsAsFactors = FALSE)\n",
      "\n",
      "for (peer in peer_factors) {\n",
      "    print(peer)\n",
      "\n",
      "    # Now iterate through imputed:\n",
      "    # First iterate through nonimputed:\n",
      "    result_path = paste(in_dir, \"TPM_GC_3GenPC_\", peer, \"PEER1000/\", sep='')\n",
      "    analysis_path = paste(in_dir, \"analysis/\", tissue, '/', sep='')\n",
      "    # Match a regular expression pattern:\n",
      "    result_file <- dir(result_path, pattern = paste(tissue,\".*\\\\Imput_me.RData\",sep=''))\n",
      "\n",
      "    if (length(result_file) > 0) {\n",
      "\n",
      "    load(paste(result_path, result_file[1], sep=''))\n",
      "\n",
      "    temp = data.frame(me$cis$eqtls)\n",
      "    temp$snps = as.character(temp$snps)\n",
      "    temp$gene = as.character(temp$gene)\n",
      "    \n",
      "    top_genes = head(unique(temp$gene), n=5)\n",
      "    # Only proceed if there is at least 5 genes.\n",
      "    if (length(top_genes) == 5) {\n",
      "\n",
      "        temp$TSS = gene_names[temp$gene,][,\"start\"]\n",
      "        temp$TES = gene_names[temp$gene,][,\"end\"]\n",
      "        temp$gene_name = gene_names[temp$gene,][,\"gene_name\"]\n",
      "        temp$SNP_pos = snp_positions[temp$snps,][,\"pos\"]\n",
      "        # Difference mapping: -1e5 to 0 if before TSS\n",
      "        # scale from 1 to 1e5 if in gene\n",
      "        # 1e5 to 2e5 if after TES\n",
      "        temp$diff = sapply(c(1:dim(temp)[1]), function(x) {\n",
      "            if ((temp$SNP_pos[x] - temp$TSS[x])<0) {\n",
      "                temp$SNP_pos[x] - temp$TSS[x]\n",
      "            } else if ((temp$SNP_pos[x] - temp$TES[x])>0) {\n",
      "                temp$SNP_pos[x] - temp$TES[x] + cis_dist\n",
      "            } else {\n",
      "                cis_dist * (temp$SNP_pos[x] - temp$TSS[x]) / (temp$TES[x] - temp$TSS[x])\n",
      "            }})\n",
      "\n",
      "        tryCatch({\n",
      "            g = ggplot(temp[temp$FDR < 0.01,], aes(x = diff)) + geom_histogram(binwidth = 10000) + ggtitle('Distribution of distances to TSS') + xlab('Distance') + scale_x_discrete(breaks=c(0, 1e5), labels=c(\"TSS\", \"TES\"))\n",
      "            print(paste(\"Trying to save: \", analysis_path, tissue, \"_Imput_\", peer, \"_Distance_distribution.pdf\", sep=''))\n",
      "            ggsave(filename = paste(analysis_path, tissue, \"_Imput_\", peer, \"_Distance_distribution.pdf\", sep=''), plot=g, width = 7, height = 7)\n",
      "        }, error = function(e) {\n",
      "            print(paste(\"Could not save: \", analysis_path, tissue, \"_Imput_\", peer, \"_Distance_distribution.pdf\", sep=''))\n",
      "       })\n",
      "\n",
      "        for (j in 1:5) {\n",
      "            index = which(temp$gene == top_genes[j])[1]\n",
      "            test_snp = snps$FindRow(temp[index,\"snps\"])\n",
      "            test_gene_q = gene_qnorm$FindRow(temp[index,\"gene\"])\n",
      "\n",
      "            test.df = data.frame(snp = as.numeric(test_snp$row), gene = as.numeric(test_gene_q$row))\n",
      "            row.names(test.df) = colnames(snps)\n",
      "            test.df = test.df[!is.na(test.df$snp),]\n",
      "            test_name = as.character(temp[index,\"gene_name\"])\n",
      "            test_rsID = as.character(temp[index,\"snps\"])\n",
      "\n",
      "            tryCatch({\n",
      "                g = ggplot(test.df, aes(x = snp, y = gene, color=snp)) + geom_jitter() + ggtitle(paste(test_name, test_rsID, sep='_')) + xlab(test_rsID) + ylab(test_name)\n",
      "                print(paste(\"Trying to save: \", analysis_path, \"Examples/\", tissue, \"_Imput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''))\n",
      "                ggsave(filename = paste(analysis_path, \"Examples/\", tissue, \"_Imput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''), plot = g, width = 7, height = 7)\n",
      "            }, error = function(e) {\n",
      "                print(paste(\"Could not save: \", analysis_path, \"Examples/\", tissue, \"_Imput_\", peer, \"_\", test_name, \"_\", test_rsID, \".pdf\", sep=''))\n",
      "           })\n",
      "        }\n",
      "    }\n",
      "\n",
      "    # Take all\n",
      "    n_pairs = dim(me$cis$eqtls)[1]\n",
      "    n_genes = length(unique(me$cis$eqtls$gene))\n",
      "    Counts_by_tissue = rbind(Counts_by_tissue, data.frame(tissue = tissue, control = \"pval\", imputed = \"Imput\", peer_K = peer, pairs = n_pairs, genes = n_genes, stringsAsFactors = FALSE))\n",
      "\n",
      "    # Take FDR-controlled at 1%\n",
      "    n_pairs = dim(filter(me$cis$eqtls, FDR <= 0.01))[1]\n",
      "    n_genes = length(unique(filter(me$cis$eqtls, FDR <= 0.01)$gene))\n",
      "    Counts_by_tissue = rbind(Counts_by_tissue, data.frame(tissue = tissue, control = \"FDR\", imputed = \"Imput\", peer_K = peer, pairs = n_pairs, genes = n_genes, stringsAsFactors = FALSE))\n",
      "\n",
      "}\n",
      "}\n",
      "\n",
      "if (dim(Counts_by_tissue)[1] != 0) {\n",
      "    print(paste(\"Save: \", analysis_path, tissue, \"_Summary_Imput.RData\", sep=\"\"))\n",
      "    save(Counts_by_tissue, file=paste(analysis_path, tissue, \"_Summary_Imput.RData\", sep=\"\"))\n",
      "}"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL.R\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "TODO: Upload the plots of pair sas function PEER factors in this section"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### 2.4 Tissue-specific optimization for PEER factors\n",
      "\n",
      "For the actual cis-eQTL detection, we will use the optimal number of PEER factors for each tissue, and feed the matrix into eQTLBMA."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL_aggregate.R \n",
      "\n",
      "####################################################\n",
      "# post_analyses_MatrixEQTL_aggregate.R \n",
      "# Aggregates all results and optimizes for K PEER factors in a tissue-specific manner.\n",
      "#\n",
      "####################################################\n",
      "\n",
      "#!/usr/bin/Rscript\n",
      "\n",
      "library(ggplot2)\n",
      "library(dplyr)\n",
      "\n",
      "analysis_path = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/chr11_test/featurecounts/analysis/'\n",
      "tissue_list_file = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_number_of_samples.txt'\n",
      "num_indiv_cutoff = 100\n",
      "\n",
      "tissue_list = read.table(tissue_list_file, header=FALSE, sep=\"\\t\", stringsAsFactors=FALSE)\n",
      "colnames(tissue_list) = c(\"Tissue\", \"Count\")\n",
      "rownames(tissue_list) = tissue_list[\"Tissue\"][[1]]\n",
      "tissue_list = tissue_list[tissue_list[\"Count\"]>=num_indiv_cutoff,]\n",
      "\n",
      "Counts_by_tissue_aggregate = data.frame(tissue = character(0), control = character(0), imputed = character(0), peer_K = numeric(0), pairs = numeric(0), genes = numeric(0), stringsAsFactors = FALSE)\n",
      "\n",
      "for (i in c(1:dim(tissue_list)[1])) {\n",
      "    tissue = tissue_list[\"Tissue\"][[1]][i]\n",
      "    load(paste(analysis_path, tissue, \"/\", tissue, \"_Summary_Imput.RData\", sep=\"\"))\n",
      "    Counts_by_tissue_aggregate = rbind(Counts_by_tissue_aggregate, Counts_by_tissue)\n",
      "    load(paste(analysis_path, tissue, \"/\", tissue, \"_Summary_Noimput.RData\", sep=\"\"))\n",
      "    Counts_by_tissue_aggregate = rbind(Counts_by_tissue_aggregate, Counts_by_tissue)\n",
      "}\n",
      "\n",
      "for (i in c(1:dim(tissue_list)[1])) {\n",
      "    tissue = tissue_list[\"Tissue\"][[1]][i]\n",
      "    print(tissue)\n",
      "    print(sum(Counts_by_tissue_aggregate[\"tissue\"] == tissue))\n",
      "}\n",
      "\n",
      "# We have the data frame with all counts.\n",
      "save(Counts_by_tissue_aggregate, file=paste(analysis_path, \"All_Counts.RData\", sep=\"\"))\n",
      "\n",
      "# Save the plots of number for all tissues:\n",
      "g = ggplot(data = filter(Counts_by_tissue_aggregate, control == \"pval\"), aes(x = peer_K, y = pairs, color = tissue)) + geom_line(aes(group = tissue)) + facet_grid(. ~ imputed) + theme(legend.position = \"none\")\n",
      "ggsave(filename = paste(analysis_path, \"All_tissues_pval_pairs.pdf\", sep=''), plot = g, width = 10, height = 10)\n",
      "\n",
      "g = ggplot(data = filter(Counts_by_tissue_aggregate, control == \"FDR\"), aes(x = peer_K, y = pairs, color = tissue)) + geom_line(aes(group = tissue)) + facet_grid(. ~ imputed) + theme(legend.position = \"none\")\n",
      "ggsave(filename = paste(analysis_path, \"All_tissues_FDR_pairs.pdf\", sep=''), plot = g, width = 10, height = 10)\n",
      "\n",
      "g = ggplot(data = filter(Counts_by_tissue_aggregate, control == \"pval\"), aes(x = peer_K, y = genes, color = tissue)) + geom_line(aes(group = tissue)) + facet_grid(. ~ imputed) + theme(legend.position = \"none\")\n",
      "ggsave(filename = paste(analysis_path, \"All_tissues_pval_genes.pdf\", sep=''), plot = g, width = 10, height = 10)\n",
      "\n",
      "g = ggplot(data = filter(Counts_by_tissue_aggregate, control == \"FDR\"), aes(x = peer_K, y = genes, color = tissue)) + geom_line(aes(group = tissue)) + facet_grid(. ~ imputed) + theme(legend.position = \"none\")\n",
      "ggsave(filename = paste(analysis_path, \"All_tissues_FDR_genes.pdf\", sep=''), plot = g, width = 10, height = 10)\n",
      "\n",
      "\n",
      "# Optimize on the number of K PEER factors to use for each tissue:\n",
      "# Criteria 1: when it decreases\n",
      "peer_K_to_iterate = c(15, 20, 25, 30, 35, 40, 45, 50, 55, 60)\n",
      "tissue_list[\"optimal_dec\"] = 10\n",
      "\n",
      "for (i in c(1:dim(tissue_list)[1])) {\n",
      "    tissue_name = tissue_list[\"Tissue\"][[1]][i]\n",
      "    current_opt = 10\n",
      "    current_val = filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\", peer_K == 10)[\"pairs\"][[1]][1]\n",
      "    for (j in peer_K_to_iterate) {\n",
      "        next_val = filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\", peer_K == j)[\"pairs\"][[1]][1]\n",
      "        if (next_val <= current_val) {\n",
      "            tissue_list[tissue_name, \"optimal_dec\"] = current_opt\n",
      "            break\n",
      "        }\n",
      "        current_opt = j\n",
      "        current_val = next_val\n",
      "    }\n",
      "    if (current_opt == peer_K_to_iterate[length(peer_K_to_iterate)]) {\n",
      "        tissue_list[tissue_name, \"optimal_dec\"] = current_opt\n",
      "    }\n",
      "}\n",
      "\n",
      "# Another Criteria: when it reaches 5% of maximal\n",
      "# Tunable alpha, when it reaches N% of maximal\n",
      "alpha = 0.05\n",
      "tissue_list[\"optimal_95\"] = 10\n",
      "\n",
      "for (i in c(1:dim(tissue_list)[1])) {\n",
      "    tissue_name = tissue_list[\"Tissue\"][[1]][i]\n",
      "    current_opt = 10\n",
      "    thresh = max(filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\")[\"pairs\"]) * (1-alpha)\n",
      "        for (j in peer_K_to_iterate) {\n",
      "        next_val = filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\", peer_K == j)[\"pairs\"][[1]][1]\n",
      "        if (next_val >= thresh) {\n",
      "            tissue_list[tissue_name, \"optimal_95\"] = current_opt\n",
      "            break\n",
      "        }\n",
      "        current_opt = j\n",
      "    }\n",
      "    if (current_opt == peer_K_to_iterate[length(peer_K_to_iterate)]) {\n",
      "        tissue_list[tissue_name, \"optimal_95\"] = current_opt\n",
      "    }\n",
      "}\n",
      "\n",
      "# Another Criteria: when it reaches 1% of maximal\n",
      "# Tunable alpha, when it reaches N% of maximal\n",
      "alpha = 0.01\n",
      "tissue_list[\"optimal_99\"] = 10\n",
      "\n",
      "for (i in c(1:dim(tissue_list)[1])) {\n",
      "    tissue_name = tissue_list[\"Tissue\"][[1]][i]\n",
      "    current_opt = 10\n",
      "    thresh = max(filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\")[\"pairs\"]) * (1-alpha)\n",
      "        for (j in peer_K_to_iterate) {\n",
      "        next_val = filter(Counts_by_tissue_aggregate, tissue==tissue_name, control==\"FDR\", imputed==\"Imput\", peer_K == j)[\"pairs\"][[1]][1]\n",
      "        if (next_val >= thresh) {\n",
      "            tissue_list[tissue_name, \"optimal_99\"] = current_opt\n",
      "            break\n",
      "        }\n",
      "        current_opt = j\n",
      "    }\n",
      "    if (current_opt == peer_K_to_iterate[length(peer_K_to_iterate)]) {\n",
      "        tissue_list[tissue_name, \"optimal_99\"] = current_opt\n",
      "    }\n",
      "}\n",
      "\n",
      "# Save the optimal K for future use:\n",
      "\n",
      "write.table(tissue_list, file='/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_optimal_PEER_k.txt', row.names=FALSE, quote=FALSE, sep=\"\\t\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/post_analyses_MatrixEQTL_aggregate.R\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we have the optimal number of PEER factors for each tissue, for tissues that have the number of experiments above 100:\n",
      "\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_optimal_PEER_k.txt\n",
      "\n",
      "We will use the matrices chosen by this metric to perform cis-eQTL detection, both with MatrixEQTL and eQTLBMA."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## 3. Workflow summary so far\n",
      "\n",
      "The following block summarizes what we have so far, and run the pipeline for all autosomal chromosomes (1-22)\n",
      "\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## cis-eQTLs with MatrixEQTL\n",
      "\n",
      "For the actual cis-eQTL detection, we will use the optimal number of PEER factors for each tissue, and feed the matrix into eQTLBMA."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper_all.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
      "featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/ \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/ 1 \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/featurecounts/analysis \\\n",
      "/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_optimal_PEER_k.txt\n",
      "\n",
      "python /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/create_dirs.py \\\n",
      "/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/ _genes_certain_biotypes_counts_normalized.txt \\\n",
      "featurecounts /tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "sh /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/featurecounts_Imput_cis_matrix_eqtl_wrapper_all.sh\n",
        "Number of jobs: 616\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper_all.py\n",
      "##################################################\n",
      "#  cis_matrix_eqtl_wrapper_all.py\n",
      "#  Wrapper for the cis_matrix_eqtl.R file and submitting the batch jobs\n",
      "#\n",
      "##################################################\n",
      "\n",
      "import glob\n",
      "from sys import argv\n",
      "import os.path\n",
      "\n",
      "in_path = argv[1]\n",
      "in_suffix = argv[2]\n",
      "method = argv[3]\n",
      "mapping_dir = argv[4]\n",
      "joblog_dir = argv[6]\n",
      "incl_imput = argv[7]\n",
      "analysis_dir = argv[8]\n",
      "peer_factors_file = argv[9]\n",
      "\n",
      "# Examples\n",
      "# argv[1] = '/tigress/BEE/gtex/data/phenotype/expression/expression_matrices/normalized/hg19/'\n",
      "# argv[2] = '_genes_certain_biotypes_counts_normalized.txt'\n",
      "# argv[3] = 'featurecounts'\n",
      "# argv[4] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping'\n",
      "# argv[5] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/'\n",
      "# argv[6] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/joblogs/MatrixEQTL/'\n",
      "# argv[7] = '0' or '1'\n",
      "# argv[8] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/MatrixEQTL/featurecounts/analysis'\n",
      "# argv[9] = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_optimal_PEER_k.txt'\n",
      "\n",
      "# Make the job log directories\n",
      "if not os.path.exists(joblog_dir):\n",
      "    os.makedirs(joblog_dir)\n",
      "if not os.path.exists(joblog_dir + 'err'):\n",
      "    os.makedirs(joblog_dir + 'err')\n",
      "if not os.path.exists(joblog_dir + 'out'):\n",
      "    os.makedirs(joblog_dir + 'out')\n",
      "if not os.path.exists(analysis_dir):\n",
      "    os.makedirs(analysis_dir)\n",
      "\n",
      "if incl_imput == \"0\":\n",
      "    imput = \"Noimput\"\n",
      "else:\n",
      "    imput = \"Imput\"\n",
      "\n",
      "# Create the array of jobs to run:\n",
      "tissue_list = []\n",
      "peer_factors = []\n",
      "f = open(peer_factors_file)\n",
      "# First line is a header:\n",
      "f.readline()\n",
      "for line in f.readlines():\n",
      "    entry = str.split(line.strip(),'\\t')\n",
      "    tissue_list.append(entry[0])\n",
      "    # We use the optimal_99 critera:\n",
      "    peer_factors.append(entry[-1])\n",
      "\n",
      "out_path = argv[5] + method + '/'\n",
      "\n",
      "if not os.path.exists(out_path):\n",
      "    os.makedirs(out_path)\n",
      "\n",
      "master_script = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/\" + method + '_' + imput + \"_cis_matrix_eqtl_wrapper_all.sh\"\n",
      "master_handle=open(master_script, 'w')\n",
      "master_handle.write(\"#!/bin/bash\\n\\n\")\n",
      "\n",
      "# MatrixEQTL parameters\n",
      "incl_trans = '0'    # Don't include trans\n",
      "cis_dist = '1e5'    # cis window\n",
      "\n",
      "k = 0\n",
      "for i in range(len(tissue_list)):\n",
      "    tissue_name = tissue_list[i]\n",
      "    peer_factor = peer_factors[i]\n",
      "    filename = tissue_name + '_' + method + in_suffix\n",
      "    matrix = argv[1] + method + '/TPM_GC_3GenPC_' + peer_factor + 'PEER1000/' + filename\n",
      "    out_file = out_path + filename.replace('.txt','_MatrixEQTL')\n",
      "    # Iterate through 22 chromosomes\n",
      "    for j in range(22):\n",
      "        chr_number = str(j+1)\n",
      "        sbatchfile = \"/tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/batch/MatrixEQTL/\" + method + '_TPM_GC_3GenPC_MatrixEQTL_chr' + chr_number + '_' + tissue_name + '_' + imput + \".slurm\"\n",
      "        job_outfile = method+\"_\"+tissue_name+\"_cis_all\"\n",
      "        sbatchhandle=open(sbatchfile, 'w')\n",
      "        cmd=r\"\"\"#!/bin/bash\n",
      "#SBATCH -J meqtl_%s     # job name\n",
      "#SBATCH --mem=8000           # 8 GB requested\n",
      "#SBATCH -t 06:00:00           # to be placed in the test queue\n",
      "#SBATCH -e %s           # err output directory\n",
      "#SBATCH -o %s           # out output directory\n",
      "\n",
      "/usr/bin/Rscript /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl.R \\\n",
      "%s %s %s %s %s %s %s %s %s %s\n",
      "\"\"\"%(tissue_name, joblog_dir+'err/'+job_outfile+'_'+imput+'.err', joblog_dir+'out/'+job_outfile+'_'+imput+'.out', matrix, incl_trans, incl_imput, chr_number, cis_dist, peer_factor, mapping_dir, out_file, method, analysis_dir)\n",
      "        sbatchhandle.write(cmd)\n",
      "        sbatchhandle.close()\n",
      "        master_handle.write(\"sbatch \" + sbatchfile  + \" \\n\")\n",
      "        k += 1\n",
      "\n",
      "master_handle.close()\n",
      "\n",
      "print 'sh %s'%(master_script)\n",
      "print 'Number of jobs:', k"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/BEE/gtex/analyses/user_specific/cis-mapping/scripts/cis_matrix_eqtl_wrapper_all.py\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Results: cis-eQTLS:\n",
      "\n",
      "We first look through MatrixEQTL results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "python /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control_wrapper.py"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control_wrapper.py\n",
      "####################################################\n",
      "# MatrixEQTL_cis_FDR_control_wrapper.py\n",
      "# \n",
      "#\n",
      "####################################################\n",
      "tissue_list_file = '/tigress/BEE/gtex/analyses/user_specific/cis-mapping/references/tissues_by_number_of_samples.txt'\n",
      "f = open(tissue_list_file)\n",
      "tissue_list = []\n",
      "for line in f.readlines():\n",
      "    tissue_list.append(str.split(line.strip(), '\\t')[0])\n",
      "f.close()\n",
      "\n",
      "original_dir = \"/tigress/bj5/MatrixEQTL/cis/featurecounts/\"\n",
      "write_dir = \"/tigress/bj5/cis_meta/MatrixEQTL/\"\n",
      "\n",
      "master_script = \"/tigress/bj5/cis_meta/scripts/MatrixEQTL_FDR_control.sh\"\n",
      "master_handle=open(master_script, 'w')\n",
      "master_handle.write(\"#!/bin/bash\\n\\n\")\n",
      "\n",
      "for item in tissue_list:\n",
      "    sbatchfile = \"/tigress/bj5/cis_meta/scripts/batch/MatrixEQTL_FDR_control_\" + item + \".slurm\"\n",
      "    sbatchhandle=open(sbatchfile, 'w')\n",
      "    cmd=r\"\"\"#!/bin/bash\n",
      "#SBATCH -J Mat_%s_FDR     # job name\n",
      "#SBATCH --mem=4000           # 8 GB requested\n",
      "#SBATCH -t 01:00:00           # to be placed in the test queue\n",
      "#SBATCH -e %s           # err output directory\n",
      "#SBATCH -o %s           # out output directory\n",
      "\n",
      "/usr/bin/Rscript /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control.R \\\n",
      "%s %s %s\n",
      "\"\"\"%(item, write_dir + item + \".err\", write_dir + item + \".out\", original_dir, item, write_dir)\n",
      "    sbatchhandle.write(cmd)\n",
      "    sbatchhandle.close()\n",
      "    master_handle.write(\"sbatch \" + sbatchfile  + \" \\n\")\n",
      "\n",
      "master_handle.close()\n",
      "\n",
      "print 'sh %s'%(master_script)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control_wrapper.py\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%writefile /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control.R\n",
      "####################################################\n",
      "# MatrixEQTL_cis_FDR_control.R\n",
      "# \n",
      "#\n",
      "####################################################\n",
      "\n",
      "#!/usr/bin/Rscript\n",
      "\n",
      "# Collect arguments\n",
      "args <-commandArgs(TRUE)\n",
      "\n",
      "# FDR Control of p-values from the null distribution\n",
      "original_dir = '/tigress/bj5/MatrixEQTL/cis/featurecounts/'\n",
      "tissue = 'Adipose-Subcutaneous'\n",
      "method = 'featurecounts'\n",
      "file_attr = '_genes_certain_biotypes_counts_normalized_MatrixEQTL_Chr'\n",
      "imputed = 'Imput'\n",
      "suffix = '_me.RData'\n",
      "write_dir = '/tigress/bj5/cis_meta/MatrixEQTL/'\n",
      "\n",
      "original_dir = args[1]\n",
      "tissue = args[2]\n",
      "write_dir = args[3]\n",
      "\n",
      "snps_list = character()\n",
      "genes_list = character()\n",
      "pvals_list = numeric()\n",
      "betas_list = numeric()\n",
      "statistic_list = numeric()\n",
      "total_tests = 0\n",
      "total_called_eqtls = 0\n",
      "\n",
      "for (i in c(1:22)) {\n",
      "    print(i)\n",
      "    filename = paste(original_dir, tissue, '_', method, file_attr, i, '_', imputed, suffix, sep='')\n",
      "    load(filename)\n",
      "\n",
      "    snps_list = append(snps_list, sapply(me$cis$eqtls$snps, as.character))\n",
      "    genes_list = append(genes_list, sapply(me$cis$eqtls$gene, as.character))\n",
      "    pvals_list = append(pvals_list, me$cis$eqtls$pvalue)\n",
      "    betas_list = append(betas_list, me$cis$eqtls$beta)\n",
      "    statistic_list = append(statistic_list, me$cis$eqtls$statistic)\n",
      "    total_tests = total_tests + me$cis$ntests\n",
      "    total_called_eqtls = total_called_eqtls + me$cis$neqtls\n",
      "}\n",
      "\n",
      "ordering = order(pvals_list)\n",
      "ordered_pvals_list = pvals_list[ordering]\n",
      "ordered_snps_list = snps_list[ordering]\n",
      "ordered_genes_list = genes_list[ordering]\n",
      "ordered_betas_list = betas_list[ordering]\n",
      "ordered_statistic_list = statistic_list[ordering]\n",
      "\n",
      "# Perform Benjamini-Hochberg\n",
      "alpha_hat = sapply(c(1:total_called_eqtls), function(x) {ordered_pvals_list[x] * total_tests / x})\n",
      "alpha_hat[(alpha_hat >= 1.0)] = 1.0\n",
      "cur_index = 1\n",
      "\n",
      "alpha = numeric()\n",
      "ptm <- proc.time()\n",
      "while (cur_index <= length(alpha_hat)) {\n",
      "    temp = alpha_hat[cur_index:length(alpha_hat)]\n",
      "    min_index = which.min(temp)\n",
      "    min_alpha = min(temp)\n",
      "    alpha = append(alpha, rep(min_alpha, min_index))\n",
      "    cur_index = cur_index + min_index\n",
      "}\n",
      "print(\"FDR calculation time: \")\n",
      "proc.time() - ptm\n",
      "\n",
      "num_eqtls = sum(alpha <= 0.05)\n",
      "result = data.frame(snp = ordered_snps_list[1:num_eqtls], gene = ordered_genes_list[1:num_eqtls], pval = ordered_pvals_list[1:num_eqtls], beta = ordered_betas_list[1:num_eqtls], statistic = ordered_statistic_list[1:num_eqtls], FDR = alpha[1:num_eqtls])\n",
      "write.table(result, file=paste(write_dir, tissue, \"_FDR_5percent.txt\", sep=\"\"), quote = FALSE, sep = \"\\t\", row.names = FALSE)\n",
      "\n",
      "print(paste(\"Total number of tests:\", total_tests))\n",
      "print(paste(\"Total eQTLs called at 5% FDR:\", num_eqtls))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Overwriting /tigress/bj5/cis_meta/scripts/MatrixEQTL_cis_FDR_control.R\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}