Daren Card darencard

darencard / efficient_repeatmodeler.md
Last active March 6, 2018 04:32
Description of changes to make RepeatModeler run more efficiently for sample sequencing data analysis

Running RepeatModeler More Efficiently

RepeatModeler is poorly suited to sample sequencing data: it takes a long time and creates copious intermediate data files. It was not designed for the small fragments and reads that sample sequencing produces. The main difficulties are:

  1. The subsampling steps for each round take a long time (hours in later rounds) and are done using a single core, which is wasteful and inefficient. However, the full script depends on this subsetting to run properly, so there isn't really a way around this.
  2. Parallelization occurs during the RECON analyses of rounds 2 to N, so overall, it makes little sense to parallelize heavily since a major bottleneck is the subsetting step (see #1).
  3. Huge amounts of intermediate files are produced, which grow rapidly with each round. Most of these are the batch-* files that are used for parallelization during the RECON rounds. In later rounds (5+, the output size inflates to over 200GB, m
darencard / basespace_quickstart.md
Created October 3, 2018 16:40
Installing, authenticating, and downloading using BaseSpace CLI


Installation on Mac computer

  1. Install BaseSpace CLI to $HOME/bin directory and make executable.
wget "https://api.bintray.com/content/basespace/BaseSpaceCLI-EarlyAccess-BIN/latest/\$latest/amd64-osx/bs?bt_package=latest" \
-O $HOME/bin/bs
chmod u+x $HOME/bin/bs
darencard / scrolling_DNA.md
Last active January 12, 2021 12:24
Creating a scrolling DNA sequence visualization


Are you a scientist working in genomics? Have you ever given an interview to your university communications staff or local press? Did you ever find yourself wishing you could have a continuously scrolling nucleotide sequence running in the background during one of these interviews? Well look no further, because here is exactly what you need to create such an effect, which will really wow those who watch your interview.

We will create two scripts: one that generates a random string of nucleotides and one that gives each nucleotide a different color. Here is the first script, in Python, named random_seq.py.

#!/usr/bin/env python

import random
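
A minimal sketch of how the two steps (random sequence generation and per-base coloring) could work, assuming uniform base frequencies and a terminal that understands ANSI color codes. The names `random_nucleotides` and `colorize`, the `seed` parameter, and the color palette are illustrative assumptions and are not taken from random_seq.py.

```python
import random

NUCLEOTIDES = "ACGT"

# ANSI color codes per base; this palette is an arbitrary, illustrative choice.
ANSI_COLORS = {"A": "\033[32m", "C": "\033[34m", "G": "\033[33m", "T": "\033[31m"}
ANSI_RESET = "\033[0m"


def random_nucleotides(length, seed=None):
    """Return a random DNA string of the given length (uniform base frequencies)."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))


def colorize(sequence):
    """Wrap each base in an ANSI color code so a terminal renders it in color."""
    return "".join(ANSI_COLORS[base] + base + ANSI_RESET for base in sequence)


if __name__ == "__main__":
    # Print one 80-base line of colored sequence.
    print(colorize(random_nucleotides(80)))
```

Calling the two functions in a loop would give a continuous stream of colored lines, which is one way to get a scrolling effect.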

NGSadmix (Skotte et al. 2013)

NGSadmix genotype matrices include a header line and begin with columns (with headers) giving the marker ID (scaffold and position) and the reference and alternative alleles (all sites must be biallelic). Three genotype likelihoods are given for each sample and marker in a standardized format (they sum to 1.0) and correspond to the likelihood of increasingly fewer reference alleles (homozygous reference, heterozygous, homozygous alternative). All values are space-delimited, and missing data are coded as 0.000 across all three allele combinations. Here is an example with three samples at two markers:

Marker Ref. Alt. Sample1 Sample1 Sample1 Sample2 Sample2 Sample2 Sample3 Sample3 Sample3
scaffold1_100 A C 1.000 0.000 0.000 0.333 0.333 0.333 0.250 0.750 0.000
scaffold2_1000 G T 0.000 0.000 0.000 0.500 0.500 0.000 0.010 0.990 0.000
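
As a sketch of how a matrix in this layout can be read, the following Python parses the example into per-sample likelihood triplets and flags missing genotypes. The function names `parse_genotype_likelihoods` and `is_missing` are hypothetical, and the code assumes exactly the column layout described above (three leading columns, then three likelihoods per sample).

```python
def parse_genotype_likelihoods(lines):
    """Parse a space-delimited matrix into {marker: {sample: (p_ref, p_het, p_alt)}}."""
    header = lines[0].split()
    # Columns 0-2 are marker, ref, and alt; each sample name then repeats three times.
    samples = header[3::3]
    matrix = {}
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        values = [float(v) for v in fields[3:]]
        matrix[fields[0]] = {
            sample: tuple(values[3 * i : 3 * i + 3])
            for i, sample in enumerate(samples)
        }
    return matrix


def is_missing(likelihoods):
    """Missing data are coded as 0.000 for all three genotypes."""
    return all(value == 0.0 for value in likelihoods)


example = [
    "Marker Ref. Alt. Sample1 Sample1 Sample1 Sample2 Sample2 Sample2 Sample3 Sample3 Sample3",
    "scaffold1_100 A C 1.000 0.000 0.000 0.333 0.333 0.333 0.250 0.750 0.000",
    "scaffold2_1000 G T 0.000 0.000 0.000 0.500 0.500 0.000 0.010 0.990 0.000",
]
matrix = parse_genotype_likelihoods(example)
print(is_missing(matrix["scaffold2_1000"]["Sample1"]))
```

Here Sample1 at scaffold2_1000 has all three likelihoods set to zero, so it is treated as missing.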

The following RADpipe command creates a genotype matrix in this format from a filtered VCF:

python genotypes_from_VCF.py --samplesheet <samplesheet.txt> --fi
darencard / phastcons_conserved_extraction.md
Created October 10, 2018 20:45
Commands for extracting conserved regions from PhastCons conservation tracks

Commands for extracting conserved regions from 100-way PhastCons conservation tracks. BLAST can then be used to extract orthologous regions from the genome of interest for as many regions as possible.

# download the human gene annotations
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz

# convert human gene annotations to GTF file format
zcat refGene.txt.gz | cut -f 2- | genePredToGtf -utr file stdin stdout > refGene.gtf

# extract exon features from GTF file
darencard / install_run_provean_notes.md
Created October 10, 2018 20:47
Notes on installing and running Provean

Notes from installing and running Provean to predict the protein impact of variants. Provean input files were produced from VEP output using the commands below. Some trial runs were completed to gauge how quickly Provean can be run in parallel to work through all annotated genes.

# running PROVEAN

# installation & dependencies
# 1. checked that blast was installed and also reinstalled cd-hit to avoid issue with certain version
# 2. installed the NCBI nr database
sudo mkdir /opt/ncbi_blast_nr_db_2018-01-29
sudo chmod 775 /opt/ncbi_blast_nr_db_2018-01-29
darencard / gdrive_download
Created August 1, 2017 18:58
Script to download files from Google Drive using Bash
#!/usr/bin/env bash
# gdrive_download
#
# script to download Google Drive files from command line
# not guaranteed to work indefinitely
# taken from Stack Overflow answer:
# http://stackoverflow.com/a/38937732/7002068
gURL=$1
darencard / gist:785015b8e2cb3c5fbc7d
Last active November 25, 2022 01:43
Calculating the mean, minimum, and maximum of a column using Awk (vary $3 to reflect desired column of data)
cat <input.txt> | awk '{if(min==""){min=max=$3}; if($3>max) {max=$3}; if($3< min) {min=$3}; total+=$3; count+=1} END {print "mean =", total/count, "\nminimum =", min, "\nmaximum =", max}'
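
For comparison, the same calculation can be sketched in Python. The function name `column_stats` and its 1-based `column` parameter are hypothetical, and the sketch assumes whitespace-delimited input in which every non-blank row has a numeric value in the requested column.

```python
def column_stats(lines, column=3):
    """Return (mean, minimum, maximum) of a 1-based whitespace-delimited column."""
    # Blank lines are skipped; a row lacking the column raises IndexError.
    values = [float(line.split()[column - 1]) for line in lines if line.strip()]
    if not values:
        raise ValueError("no data rows found")
    return sum(values) / len(values), min(values), max(values)


rows = ["a b 1", "c d 5", "e f 3"]
mean, minimum, maximum = column_stats(rows, column=3)
print("mean =", mean)
print("minimum =", minimum)
print("maximum =", maximum)
```

Unlike the Awk one-liner, this version raises an error on empty input instead of dividing by zero.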
darencard / orthomcl_tutorial.md
Last active March 15, 2023 13:31
Running OrthoMCL on a set of protein annotations

Running OrthoMCL on a set of protein annotations from various species

OrthoMCL is the leading piece of software for inferring orthologs across several organisms. In this tutorial I will provide detailed instructions for running a set of protein annotations through OrthoMCL.

Software and Data

  1. OrthoMCL, and its dependencies, must be installed. Detailed information on this tool and its installation can be found here. I actually used a slightly modified version of OrthoMCL that was made available by the author of the orthomcl-pipeline (see below). There aren't many details on how this differs from the existing OrthoMCL, but it is available here.
  2. orthomcl-pipeline must also be installed, as this is how we will automate the OrthoMCL process. Detailed information on this tool and its installation can be found [here](https
darencard / gene_structure_stats.md
Last active April 18, 2023 13:35
Script to produce estimates of gene structure

Please see the most up-to-date version of this protocol on my blog at https://darencard.net/blog/.

Inferring the structure of gene annotations

When annotating genomes, it is often desirable to know the overall structure of genes, including exon and intron lengths, among other metrics. Here is a program, genestats, that calculates such measures.

#!/usr/bin/env bash

usage()