Skip to content

Instantly share code, notes, and snippets.

View RyanSchu's full-sized avatar

Ryan Schubert RyanSchu

View GitHub Profile
@RyanSchu
RyanSchu / dbgap.md
Last active January 13, 2022 22:34
Downloading Data from dbgap

Install the aspera cli client required by dbgap

aspera installation manual download page

wget https://download.asperasoft.com/download/sw/cli/3.7.7/aspera-cli-3.7.7.608.927cce8-linux-64-release.sh

update permissions of the installation executable and execute

chmod a+x ./aspera-cli-3.7.7.608.927cce8-linux-64-release.sh
./aspera-cli-3.7.7.608.927cce8-linux-64-release.sh
@RyanSchu
RyanSchu / GwasQCExample.md
Last active September 21, 2018 20:57
GWAS QC PIPELINE EXAMPLE

GWAS QC

Each and every gwas is different so there is no perfect standard of how to run GWAS. However many of the steps we must take are predictable. Most commonly, we find ourselves filtering by missingness, removing related individuals and removing snps that are in linkage disequilibrium with each other. Finally, we usually want to run PCA on our population to validate that they have been adequately filtered. This pipeline combines these general steps and more into three coherent scripts to easily perform gwas qc.

Running the Pipeline

The pipeline is broken into three scripts main scripts that performs one of the steps mentioned above. Each has a handful of options and produces multiple output files for the user. I would encourage you to look at the project wiki and get an understanding of what each option does and what files are produced. While the pipeline produces predicatble results, it is largely up to the user to give direction and decide on wh

@RyanSchu
RyanSchu / PCA.md
Last active September 24, 2018 17:30
PCA w/out hapmap

These scripts are meant to be run from the command line. Save the eigenplot.R script attached. You can then run it from the command line as such

Rscript eigenplot.R --val Path/to/pca.eigencal --vec Path/to/pca.eigenvec --o /directory/to/output

This will make the pdf file pca_plots in the specified output directory.

@RyanSchu
RyanSchu / Illumina to Lgen.md
Last active February 8, 2024 01:38
Convert FinalReport.txt files to plink lgen format

About

This gist details how to convert illumina final report genotype data to an input useable by PLINK. From there the genotype can be quality controlled and the end results exported to a .vcf file or otherwise parsed.

Lgen/PLINK format

lgen from the PLINK documentation "A text file with no header line, and one line per genotype call (or just not-homozygous-major calls if 'lgen-ref' was invoked) usually with the following five fields:

Family ID

#origianlly written by Brianne Coffey Github@breecoffey
#modified by Ryan Schubert
import argparse
import gzip
import re
from os.path import expanduser
current = expanduser("~")
parser = argparse.ArgumentParser(description='Input & Output Files') #create the argument parser

Preimputation steps

Read the Sanger How-to page to make sure your vcf file meets all the requirements for Sanger imputation.

The preimputation steps for UMichigan and Sanger are relatively similar. The main difference is that VCF files are not split by chromosome under sanger imputation. I recommend carrying out the Michigan preimputation steps anyways. If you have already carried out the preimputation steps for U Mich imputation then the only other requirement will be to merge the chr.vcf files into one vcf file and sort the result.

#merge vcf files from list of files
bcftools concat --file-list ~/preimputation/vcf_list.txt -o merged.vcf
gzip merged.vcf

About Aliases

aliases are a way to create shortcuts for a given command. While their scope is limited they are useful for saving time on tedious typing. For example, using the bash command less there is the useful flag -S that prevents text from wrapping on the display. This is quite useful for readability but it can be tedious to write less -S every time. Instead what you can do is use alias to create a shortcut for that specific command.

alias less='less -S'

Now whenever you ype less it instead performs less -S by default. However this alias will go away once you leave the session. If you want the alias to be permanent then we will need to work with two hidden files

Making a permanent alias

first lets create a hidden file that will contain all the aliases you want. hidden files follow the syntax .file_name. To create a hidden file do

library(data.table)
library(dplyr)
library(ggplot2)
library(qvalue)
"%&%" = function(a,b) paste(a,b,sep="")
#List METS and MESA pops
discovery_cohorts<-c("METS_FDR0.05_PC0_PF0.txt.gz","METS_FDR0.05_PC0_PF10.txt.gz","METS_FDR0.05_PC0_PF20.txt.gz","METS_FDR0.05_PC10_PF0.txt.gz","METS_FDR0.05_PC10_PF10.txt.gz","METS_FDR0.05_PC10_PF20.txt.gz","METS_FDR0.05_PC3_PF0.txt.gz","METS_FDR0.05_PC3_PF10.txt.gz","METS_FDR0.05_PC3_PF20.txt.gz")
replication_cohorts<-as.matrix(read.table(file="~/mets_analysis/meqtl/replication_pops/pop_list"))
discovery_dir<-"~/mets_analysis/meqtl/combined_pop/"
library(dplyr)
library(ggplot2)
library(tidyr)
library(data.table)
METS<-as.data.frame(fread("zcat ~/mets_analysis/meqtl/combined_pop/METS_FDR0.05_PC0_PF0.txt.gz", colClasses = "character"))
colnames(METS)<-c("snps", "gene", "statistic", "pvalue", "FDR", "beta")
gemma<-as.data.frame(fread("zcat ~/mets_analysis/for_gemma/gemma_whole_genome_FDR0.05.txt.gz", header = F, sep = "\t"))
colnames(gemma)<-c("chr", "rs", "ps", "n_miss", "allele1", "allele0", "af", "beta", "se", "l_remle", "l_mle", "p_wald", "p_lrt", "p_score", "gene_inf", "FDR")
info<-separate(gemma, gene_inf, into = c("gene1","gene2",NA,NA,NA,NA), sep = "\\.")
@RyanSchu
RyanSchu / QQplotHowTo.md
Last active February 19, 2019 21:24
QQplot_Tophits.R

Here is an rscript to generate a qqplot that comes with five basic flags. It is designed to be run through the command line using these flags as necessary. This script can be used to generate either partial or complete qqplots, with partial qqplots displaying only the top proportion of hits according to user input. The script is designed to take in a variety of types described as follows:

  • Data with or without a header (Default is with)

The script assumes by default that the input file has a header. If the input file does not have a header, then the user may signal this by supplying the --noheader flag at runtime

  • Data in .gz format

The script in utilizing the fread function from data.table interprets whether or not the data is in .gz format based on the file name ending with .gz or not

  • Single or multi column data