Ryan Schubert RyanSchu

@RyanSchu
RyanSchu / Email Match.md
Last active November 6, 2022 22:46
Matching an email - Regex tutorial

Regular Expression Tutorial: Matching an Email

Many strings have a structure, pattern, or logic that can be used to identify and validate data. Regular expressions (regex) are a way of identifying strings that fit such a structure. This tutorial walks through a regex that identifies strings with a valid email structure.

Summary

/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
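To try the pattern from a terminal, here is a minimal sketch using grep -E. It is an adaptation, not the exact expression above: POSIX extended regex has no \d, so it is written as 0-9, and the dots inside the brackets are left unescaped. The function name is_email and the test addresses are made up for illustration.

```shell
# Return success (exit status 0) when the argument matches the email pattern.
# Adapted for POSIX ERE: \d becomes 0-9, and "." inside [...] needs no backslash.
is_email() {
    printf '%s\n' "$1" | grep -Eq '^([a-z0-9_.-]+)@([0-9a-z.-]+)\.([a-z.]{2,6})$'
}

# Illustrative addresses only.
is_email 'ryan.s@example.com' && echo valid
is_email 'not-an-email' || echo invalid
```

The three groups correspond to the user name, the domain name, and the top-level domain.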
RyanSchu / Qsub_dependencies.md
Last active May 17, 2019 22:57
Creating dependencies in qsub

Hi guys, lately we have all been using a lot of cores to run things. We want to make sure we always leave at least one core open, but it can be a pain to wait for jobs to finish before submitting new ones with qsub. This gist shows the basics of how to check memory/CPU usage and create dependencies, so you can submit all your jobs up front without taking up all the cores at once. We can do this using the -W flag in qsub.
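As a rough sketch of what a dependency looks like, assuming a PBS/Torque-style qsub where -W accepts depend=afterok:<jobid> and where qsub prints the new job's ID when it submits. The helper name depend_arg and the script names step1.sh and step2.sh are hypothetical.

```shell
# Build the value passed to qsub -W from one or more job IDs.
# Multiple IDs are joined with ":" so the new job waits for all of them.
depend_arg() {
    local IFS=:
    echo "depend=afterok:$*"
}

# Hypothetical usage on the cluster (left commented out because it needs qsub):
#   first=$(qsub step1.sh)
#   qsub -W "$(depend_arg "$first")" step2.sh
```

With afterok, the second job starts only once the first has finished without error, so it does not occupy a core while it waits.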

Checking memory

If the CLI seems to be running slowly, check the system memory with the command free -h. This will display the following columns:

total        used        free      shared  buff/cache   available
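To check the free column from a script instead of reading it by eye, one option is the sketch below. It assumes the Linux free command with the column layout shown above, and uses free -g rather than free -h so the value is a whole number of GiB. The helper names free_gib and check_free are hypothetical, and the threshold is whatever you pass in.

```shell
# Print the "free" column (4th field of the Mem: row) from free -g output on stdin.
free_gib() {
    awk '/^Mem:/ {print $4}'
}

# Report whether free memory is below a threshold given in GiB (default 80).
check_free() {
    local threshold=${1:-80}
    local free_now
    free_now=$(free_gib)
    if [ "$free_now" -lt "$threshold" ]; then
        echo "low: ${free_now}G free"
    else
        echo "ok: ${free_now}G free"
    fi
}

# Usage on a Linux machine (commented out because free is Linux-only):
#   free -g | check_free 80
```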

We care about free memory. If the free memory drops too low (say, less than 80GB), someone with sudo privileges (Ryan or Dr. Wheeler) can clear out the buff/cache memory with the following commands

About Awk

Awk is a text processing language that comes standard with most distributions of Linux/Unix. In my personal experience, awk is faster at parsing, filtering, and writing text files than either python or R with few exceptions. This cheat sheet goes over the basic awk commands that I use the most.

How does awk work

Awk processes a text file line by line and applies some condition to each line based on its contents. I have found it most useful on text files of large matrices (that is, text files with distinct, consistent columns) or on text that has clear, consistent delimiters. Awk interprets each column in your line and stores it as a variable from $1 to $n, where n is the number of columns you have. Say you have a file that looks like this:

ID  gene_name type  start stop Chr
ENSG0 C1orf22 protein_coding 178965 183312 chr2
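Using the two example lines above, here is a small sketch of how the column variables are used. The function names gene_and_chr and ids_starting_after are made up for illustration, and the example writes the two lines to a temporary file so it can run on its own.

```shell
# Print gene name (column 2) and chromosome (column 6), skipping the header line.
gene_and_chr() {
    awk 'NR > 1 {print $2, $6}' "$1"
}

# Print the ID (column 1) of rows whose start position (column 4) exceeds a cutoff.
ids_starting_after() {
    awk -v cutoff="$2" 'NR > 1 && $4 > cutoff {print $1}' "$1"
}

# Recreate the example lines in a temporary file and run both functions.
example=$(mktemp)
printf '%s\n' 'ID  gene_name type  start stop Chr' \
              'ENSG0 C1orf22 protein_coding 178965 183312 chr2' > "$example"
gene_and_chr "$example"                 # prints: C1orf22 chr2
ids_starting_after "$example" 100000    # prints: ENSG0
```

By default awk splits on any run of spaces or tabs, which is why the double spaces in the header line do not create empty columns.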
RyanSchu / Welcome to the Wheeler Lab.md
Last active September 12, 2019 18:57
a primer for new members

Greetings! If you're reading this, you've been welcomed into the Wheeler Lab for the semester. Congratulations! This collection of documents will serve as a guide to some of the tools you'll be using this semester. It is by no means comprehensive, but the hope is that it will serve as a directory pointing you towards more useful resources, including tutorials, cheat sheets, papers, Twitter threads, SOPs, and manual pages. Most of the lab is geared towards independent problem solvers. Feel free to message any of the senior members for help, but you learn the most by just trying. Good luck and get to work!

Things you'll probably use

Everything on this list is something you are likely to use. It has been divided by programming language/interface and ordered by how useful I find each item, though many of these rankings are arbitrary, as I use most of these tools every day.

command line/bash

  • awk x (Also see my [awk cheat sheet]
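# Collect the unique gene names listed across all introns.
gene_list = []
with open('/home/ryan/multi_coding_subset.txt', 'r') as assoc:
    for line in assoc:
        # Each line holds two tab-separated fields: an intron and a gene vector.
        intron, gene_vec = line.split('\t')
        # Strip what looks like an R-style c("A", "B") wrapper, quotes, spaces, and the newline.
        # Note that replace('c', '') removes every lowercase c, not only the wrapper.
        gene_vec = gene_vec.replace('c','').replace('\"','').replace('(','').replace(')','').replace(' ','').replace('\n','')
        newvec = gene_vec.split(',')
        for i in newvec:
            if i not in gene_list:
                gene_list.append(str(i))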
RyanSchu / QQplotHowTo.md
Last active February 19, 2019 21:24
QQplot_Tophits.R

Here is an R script that generates a qqplot and comes with five basic flags. It is designed to be run from the command line, using these flags as necessary. The script can generate either partial or complete qqplots, with partial qqplots displaying only the top proportion of hits according to user input. The script is designed to take in a variety of input types, described as follows:

  • Data with or without a header (Default is with)

By default, the script assumes that the input file has a header. If the input file does not have a header, the user can signal this by supplying the --noheader flag at runtime.

  • Data in .gz format

Because the script reads its input with the fread function from data.table, it decides whether the data is in .gz format based on whether the file name ends with .gz.

  • Single or multi column data
library(dplyr)
library(ggplot2)
library(tidyr)
library(data.table)
METS<-as.data.frame(fread("zcat ~/mets_analysis/meqtl/combined_pop/METS_FDR0.05_PC0_PF0.txt.gz", colClasses = "character"))
colnames(METS)<-c("snps", "gene", "statistic", "pvalue", "FDR", "beta")
gemma<-as.data.frame(fread("zcat ~/mets_analysis/for_gemma/gemma_whole_genome_FDR0.05.txt.gz", header = F, sep = "\t"))
colnames(gemma)<-c("chr", "rs", "ps", "n_miss", "allele1", "allele0", "af", "beta", "se", "l_remle", "l_mle", "p_wald", "p_lrt", "p_score", "gene_inf", "FDR")
info<-separate(gemma, gene_inf, into = c("gene1","gene2",NA,NA,NA,NA), sep = "\\.")
library(data.table)
library(dplyr)
library(ggplot2)
library(qvalue)
"%&%" = function(a,b) paste(a,b,sep="")
#List METS and MESA pops
discovery_cohorts<-c("METS_FDR0.05_PC0_PF0.txt.gz","METS_FDR0.05_PC0_PF10.txt.gz","METS_FDR0.05_PC0_PF20.txt.gz","METS_FDR0.05_PC10_PF0.txt.gz","METS_FDR0.05_PC10_PF10.txt.gz","METS_FDR0.05_PC10_PF20.txt.gz","METS_FDR0.05_PC3_PF0.txt.gz","METS_FDR0.05_PC3_PF10.txt.gz","METS_FDR0.05_PC3_PF20.txt.gz")
replication_cohorts<-as.matrix(read.table(file="~/mets_analysis/meqtl/replication_pops/pop_list"))
discovery_dir<-"~/mets_analysis/meqtl/combined_pop/"

About Aliases

Aliases are a way to create shortcuts for a given command. While their scope is limited, they are useful for saving time on tedious typing. For example, the bash command less has the useful flag -S, which prevents text from wrapping on the display. This is quite useful for readability, but it can be tedious to write less -S every time. Instead, you can use alias to create a shortcut for that specific command.

alias less='less -S'

Now whenever you type less, it performs less -S by default. However, this alias will go away once you leave the session. If you want the alias to be permanent, you will need to work with two hidden files.
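To confirm what an alias currently expands to, or to drop it before the session ends, the alias and unalias builtins can be used as in this short sketch. It assumes a bash-like shell; the exact format of the printed definition varies between shells.

```shell
# Define the alias for the current session only.
alias less='less -S'

# Print the current definition of the alias.
alias less

# To remove it again before the session ends, run:
#   unalias less
```

Running alias with no arguments lists every alias defined in the session.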

Making a permanent alias

First, let's create a hidden file that will contain all the aliases you want. Hidden files follow the syntax .file_name. To create a hidden file do

Preimputation steps

Read the Sanger How-to page to make sure your vcf file meets all the requirements for Sanger imputation.

The preimputation steps for UMichigan and Sanger are relatively similar. The main difference is that VCF files are not split by chromosome for Sanger imputation. I recommend carrying out the Michigan preimputation steps anyway. If you have already carried out the preimputation steps for UMichigan imputation, then the only other requirement is to merge the chr.vcf files into one VCF file and sort the result.

#merge vcf files from list of files
bcftools concat --file-list ~/preimputation/vcf_list.txt -o merged.vcf
gzip merged.vcf
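
The file passed to --file-list holds one VCF path per line, and bcftools concat joins the files in the order listed. A minimal sketch for building that list follows, assuming the per-chromosome files are named chr1.vcf through chr22.vcf. That naming scheme and the helper name make_vcf_list are hypothetical, so adjust them to match your own files.

```shell
# Print one VCF path per line for chr1..chr22, in chromosome order.
# Assumes files are named chrN.vcf inside the given directory (hypothetical naming).
make_vcf_list() {
    local dir=$1
    local chr
    for chr in $(seq 1 22); do
        echo "${dir}/chr${chr}.vcf"
    done
}

# Usage (commented out because it writes into your home directory):
#   make_vcf_list ~/preimputation > ~/preimputation/vcf_list.txt
```

Generating the list in a loop keeps the chromosomes in numeric order, which a plain ls would not (chr10 would sort before chr2).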