
@ipurusho
ipurusho / select.model
Last active August 29, 2015 14:07
Identify working design model among numerous covariates
#Identify working design model among numerous covariates
# Returns design that can be used directly by DESeq2
#Usage:
# select.model(covariates,main,region)
# covariates: character list of covariates in meta data matrix (column names of meta data matrix)
# main: main factor formula (can include covariates)
# region: brain region (temporarily hard-coded for the immediate analysis)
#Example:
# select.model(c("Antidepressant","Alcool","History.of.Abuse","Cause.of.death","PMI"),
# "RIN+Age.+Gender+Phenotype+Gender:Phenotype",
ipurusho / RRHO_Modified.R
Created November 12, 2014 17:29
A slightly modified version of RRHO (http://www.bioconductor.org/packages/release/bioc/html/RRHO.html) that outputs a common scale and enables four-list RRHO comparisons.
generate.rrho <- function(pval.data, logfc.data, list, outdir){
  max.scale <- list()
  for(i in 1:nrow(list)){
    list1 <- cbind(rownames(pval.data), -1*log10(pval.data[,as.character(list[i,1])])*sign(logfc.data[,as.character(list[i,1])]))
    list2 <- cbind(rownames(pval.data), -1*log10(pval.data[,as.character(list[i,2])])*sign(logfc.data[,as.character(list[i,2])]))
    print(head(list1))
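The ranking metric built in the two `cbind` calls above (signed significance: `-log10(p)` scaled by the sign of the fold change) can be sketched as a standalone function. This is an illustrative Python translation, not part of the gist; the function name is mine.

```python
import math

def signed_log10_p(pvalue, log_fc):
    """Rank metric combining significance and direction:
    -log10(p) gives the magnitude, sign(logFC) gives the direction."""
    sign = (log_fc > 0) - (log_fc < 0)  # sign() without numpy
    return -math.log10(pvalue) * sign

# Genes with a small p-value and positive fold change rank high;
# equally significant genes with a negative fold change rank low.
```

RRHO then compares two such ranked lists, so keeping both lists on this common scale is what the modification above is for.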
ipurusho / analyze.proteomic.data.R
Last active September 22, 2017 10:22
Impute missing values and run t test for proteomic data
analyze.proteomic.data <- function(intensities, meta_data, condA, condB){
  # parse conditions to prepare for t test with replicates
  colnames(intensities) <- meta_data
  condA.regex <- paste("^", condA, "$", sep="")
  condB.regex <- paste("^", condB, "$", sep="")
  condA.indices <- grep(condA.regex, colnames(intensities))
  condB.indices <- grep(condB.regex, colnames(intensities))
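The impute-then-test workflow this gist describes can be sketched with the standard library alone. This is a minimal illustration, not the gist's implementation: minimum-value imputation is one common proteomics heuristic (the gist's actual imputation strategy is not shown in the preview), and all function names here are mine.

```python
import math
from statistics import mean, variance

def impute_missing(values, all_values):
    """Replace None (missing intensities) with the observed minimum,
    a simple stand-in for below-detection-limit proteins."""
    floor = min(v for v in all_values if v is not None)
    return [floor if v is None else v for v in values]

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))

cond_a = [10.0, None, 11.0]        # one missing replicate
cond_b = [5.0, 5.5, 4.5]
row = impute_missing(cond_a, cond_a + cond_b)
t = welch_t(row, cond_b)           # positive: condition A is higher
```

In practice this would be applied per protein row, with p-values from the t distribution and multiple-testing correction afterwards.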
ipurusho / sort_clusters.sh
Created December 22, 2014 17:16
Sorts gene clusters after clusterbed is performed.
#!/bin/bash
filename=$1
awk -F "\t" '{split($9,a,";"); print a[1], $10}' "$filename" | cut -d ' ' -f2,3 | sort -u | sort -k2 -n
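The pipeline above splits the semicolon-delimited attribute column (column 9) to pull out the first attribute's value alongside the cluster id appended by clusterBed (column 10), then sorts numerically by cluster. The same extraction can be sketched in Python; the line layout is assumed to be GFF-like with the cluster id in a tenth column, and the function name is mine.

```python
def cluster_pairs(lines):
    """Pull the first attribute value from column 9 (semicolon-delimited)
    and the numeric cluster id from column 10, unique and cluster-sorted."""
    pairs = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        first_attr = cols[8].split(";")[0]        # e.g. 'gene_id "GENE1"'
        value = first_attr.split(" ")[1].strip('"')
        pairs.add((value, int(cols[9])))
    return sorted(pairs, key=lambda p: p[1])      # numeric sort by cluster
```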
library(biomaRt)
library(stringr)
#######get counts #########
setwd("/path/to/count/directory")
data <- list.files(pattern = "htseq_counts.txt") # detect count files based on file type
count_list <- lapply(data, read.table, header = FALSE, sep = "\t", row.names = 1) # read files in batch, save to list
counts <- do.call(cbind, count_list) # create count data frame
colnames(counts) <- data # set column names
colnames(counts) <- str_replace(colnames(counts), fixed(".htseq_counts.txt"), "") # remove filename extension (fixed() so the dot is literal)
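The batch-read-and-cbind pattern above (one two-column htseq-count table per sample, merged into a single count matrix keyed by gene) can be sketched in Python. This is an illustrative stdlib version, not part of the gist; it takes the per-sample lines directly, and the function name is mine.

```python
import csv

def merge_counts(files):
    """Merge per-sample htseq-count tables ({sample: iterable of
    'gene<TAB>count' lines}) into {gene: {sample: count}},
    mirroring the lapply + cbind pattern."""
    counts = {}
    for sample, lines in files.items():
        for gene, value in csv.reader(lines, delimiter="\t"):
            counts.setdefault(gene, {})[sample] = int(value)
    return counts

# In practice the per-sample lines would come from reading each
# list.files()-style match such as "*.htseq_counts.txt".
```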
ipurusho / GC_calc.py
Last active April 2, 2022 13:35
A simple introduction to Apache Spark and pyspark. Calculate GC content of sequences in a FASTA file
"""GC_calc.py"""
import sys
from pyspark import SparkContext
import re
#turns Fasta file into a list of sequences (for current understanding of pyspark SparkContext input)
fastaFile = sys.argv[1]
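The quantity being parallelized here is simple: GC content is the fraction of G and C bases in a sequence. A plain-Python version of that per-sequence computation (without Spark) looks like this; the function name is mine.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

# gc_content("ATGC") -> 0.5
```

In the pyspark version, this function would be mapped over an RDD of sequences built from the FASTA file.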
ipurusho / serial_gc.py
Created May 7, 2015 00:58
Serial version of GC_calc.py for speedup testing purposes
def FASTA(filename):
    try:
        f = open(filename)  # Python 3: open() instead of the removed file()
    except IOError:
        print("The file, %s, does not exist" % filename)
        return
    sequences = {}
ipurusho / featureCoverage.scala
Last active August 29, 2015 14:22
calculate coverage of features in a Feature RDD using binning
def featureCoverage(reference: RDD[Feature], reads: RDD[AlignmentRecord], bins: Int): RDD[(String, Iterable[Double])] = {
  val getBinsForward = for {
    feature <- reference
    bin = bins
    interval = (((feature.getEnd.toDouble - feature.getStart.toDouble) / bin).ceil).toInt
    strand = feature.getFeatureType.toString
    start = feature.getStart.toInt
    end = feature.getEnd.toInt
    refName = feature.getContig.getContigName.toString
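The core idea above (divide each feature into a fixed number of equal-width intervals and tally the reads landing in each) can be sketched serially. This is an illustrative analogue of the per-feature step, not the RDD implementation; it bins read start positions only, and the names are mine.

```python
import math

def binned_coverage(start, end, read_starts, bins):
    """Split [start, end) into `bins` equal-width intervals and count
    how many read start positions fall into each."""
    width = math.ceil((end - start) / bins)   # same ceil as `interval` above
    counts = [0] * bins
    for pos in read_starts:
        if start <= pos < end:
            counts[(pos - start) // width] += 1
    return counts
```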
ipurusho / evaluate_covariates.R
Last active August 29, 2015 14:22
Evaluate the percent contribution of categorical and continuous covariates using dimensionally reduced data from read counts
evaluate.covariates <- function(x, pc.percents, continuous, categorical){
  covariate.contribution <- function(x, continuous, categorical){
    # asinh transform continuous covariates
    asinh.continuous <- lapply(continuous, asinh)
    asinh.continuous <- as.data.frame(do.call(cbind, asinh.continuous))
    # discretize categorical covariates to perform lm
import scala.collection.mutable.ArrayBuffer

def binFromRange(start: Int, end: Int): ArrayBuffer[Int] = {
  val bin_offsets = Array(512+64+8+1, 64+8+1, 8+1, 0)
  val binFirstShift = 17
  val binNextShift = 3
  var startBin = start >> binFirstShift
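The constants here (offsets 585/73/9/0, first shift 17, next shift 3) match the UCSC Genome Browser's hierarchical binning scheme, where the smallest bins span 2^17 = 128 kb and each level is 8x wider. Assuming the truncated function enumerates the bins overlapping a range, a Python sketch of that scheme looks like this (the function name is mine):

```python
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 0]  # same constants as above
BIN_FIRST_SHIFT = 17   # smallest bin spans 2^17 = 128 kb
BIN_NEXT_SHIFT = 3     # each level is 8x wider than the last

def bins_from_range(start, end):
    """All bin ids overlapping [start, end), UCSC-style binning."""
    bins = []
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins
```

Looking up features then only requires scanning the handful of bins a query range can touch, rather than every feature on the contig.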