Vince Buffalo vsbuffalo

# Factors are more memory efficient (if labels are longer than a few bytes), since redundant
# multi-byte labels are stored once in memory (as attributes), and integers keep the mapping. E.g.:
a = sample(paste0("chrom", c(1:22, "X", "Y")), 1e8, replace=TRUE)
object.size(a)
# 800001192 bytes
object.size(factor(a))
# 400001744 bytes
# For long character vectors of repeating values, this *really* pays off.
@vsbuffalo
vsbuffalo / tweets.R
Created May 22, 2014 20:30
Visualize your mentions over time
library(ggplot2)
library(lubridate)
library(dplyr)
library(reshape2)
myname <- "@vsbuffalo" # for removing later
d <- read.csv("tweets.csv", header=TRUE, stringsAsFactors=FALSE)
extractMentions <- function(x) {
gsub("[^@]*(@[a-zA-Z0-9_]+).*", "\\1", x, perl=TRUE)
@vsbuffalo
vsbuffalo / bds-toc.md
Last active August 29, 2015 13:58
Bioinformatics Data Skills ToC

Bioinformatics Data Skills Table of Contents

This may change due to length considerations. Parts in bold are available for early release from O'Reilly.

Part I. Ideology: Data Skills, Robust and Reproducible Bioinformatics

  • How to Learn Robust and Reproducible Bioinformatics

Part II. Prerequisites: Setting up a Project, Working with Unix, Version Control, and Data

@vsbuffalo
vsbuffalo / summarizeByTile.R
Created November 8, 2013 22:54
Example of GenomicRanges's tileGenome, which I think demonstrates its power. This might be a bit faster as a custom script in Python or C, but (1) that would take longer to write, (2) this is much more interactive, and (3) on real data, it's actually pretty fast. Stuff like this is why Bioconductor should be in every bioinformatician's toolkit.
library(GenomicRanges)
summarizeByTile <-
# Given a GRanges (or some sort of ranged data) object `x` and a
# *corresponding* vector of values `y` to summarize (these *must*
# correspond), calculate the summary per tile with the function `fun`.
# Note: this is still beta; wider tests coming, use with caution.
function(x, y, tiles, fun, mcol_name="y") {
stopifnot(length(x) == length(y))
@vsbuffalo
vsbuffalo / entropy_class.py
Created September 26, 2013 18:16
Version of entropy function we wrote in class
from __future__ import division
from collections import Counter
from math import log
def entropy(seq, unit="bit"):
"""
Returns entropy of DNA sequence.
The entropy formula is:
entropy = -sum_i (log(p_i) * p_i)
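
The formula in the docstring above can be sketched as follows. This is a minimal Python 3 illustration, not the version written in class: the name `shannon_entropy` and the `base` parameter are assumptions made for the example (the gist's own signature uses `unit="bit"`, and its body is not shown in this preview).

```python
from collections import Counter
from math import log


def shannon_entropy(seq, base=2):
    """Return -sum_i p_i * log(p_i), where p_i is the frequency of symbol i in seq.

    `shannon_entropy` and `base` are hypothetical names for this sketch;
    base=2 gives the result in bits.
    """
    if not seq:
        raise ValueError("seq must be non-empty")
    counts = Counter(seq)   # symbol -> number of occurrences
    total = len(seq)
    return -sum((n / total) * log(n / total, base) for n in counts.values())


# Four equally frequent bases carry log2(4) = 2 bits per symbol.
print(shannon_entropy("ACGT"))
```

A sequence made of a single repeated symbol has zero entropy under this definition, since its only frequency is 1 and log(1) = 0.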
@vsbuffalo
vsbuffalo / entropy_vince.py
Created September 26, 2013 17:57
Vince's version of entropy in Python
"""
entropy.py
Calculate entropy of a given list.
"""
from math import log, log10
from collections import Counter
import pdb
def entropy(x, logfun=lambda x: log(x, 2)):
@vsbuffalo
vsbuffalo / naive_nshared.py
Created September 5, 2013 23:33
Calculate number of minor alleles (not in consensus sequence).
import sys
from readfq import readfq
from itertools import combinations
from datetime import datetime
def num_shared(seq_a, seq_b, consensus_seq):
"""
Given two alignment sequences in multiple alignment FASTA format,
calculate the number of shared SNPs (for minor alleles only, not
in consensus).
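
One way to read the description above is sketched below. It is only an illustration under stated assumptions, since the gist's body is not shown in this preview: the name `count_shared_minor` is hypothetical, the three sequences are assumed to be aligned to the same length, a "shared SNP" is taken to mean a column where both sequences carry the same base and that base differs from the consensus, and gap characters (`-`) are assumed not to count.

```python
def count_shared_minor(seq_a, seq_b, consensus_seq, gap="-"):
    """Count alignment columns where seq_a and seq_b share a non-consensus base.

    Hypothetical helper for illustration; the gap handling is an assumption.
    """
    if not (len(seq_a) == len(seq_b) == len(consensus_seq)):
        raise ValueError("all sequences must have the same aligned length")
    return sum(
        1
        for a, b, c in zip(seq_a, seq_b, consensus_seq)
        if a == b and a != c and a != gap
    )


# Only the third column (G, G vs. consensus C) is a shared minor allele.
print(count_shared_minor("ACGT", "ACGA", "ACCT"))
```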
import numpy as np
from itertools import combinations
from collections import Counter
import datetime as dt
np.random.seed(0)
def repeat_mutation_sim(G, N, L, mu=3e-8):
"""
Generate N repeats of length L mutating at rate
@vsbuffalo
vsbuffalo / .tmux
Created August 19, 2013 01:27
My tmux configuration
# use GNU screen's C-a binding, since it's programmed in my brain
set-option -g prefix C-a
unbind C-b
# use GNU screen's C-a C-a for last window
bind-key C-a last-window
# use 1-based indexing, since the 1 key is closer than 0
set -g base-index 1
@vsbuffalo
vsbuffalo / trim.sh
Created August 8, 2013 05:17
generic, slightly insane paired-end quality trimming script
#!/bin/bash
# trim.sh - generic, slightly insane paired-end quality trimming script
# Vince Buffalo <vsbuffaloAAAAAA@gmail.com> (sans poly-A)
set -e
set -u
## pre-config
ADAPTERS=illumina_adapters.fa
SAMPLE_NAME=some_sample_name
IN1=in1.fastq