Daren Card darencard

## efficient_repeatmodeler.md

      
Last active March 6, 2018 04:32

Description of changes to make RepeatModeler run more efficiently for sample sequencing data analysis

Running RepeatModeler More Efficiently

RepeatModeler isn't well suited to sample sequencing data: it takes a long time and creates copious intermediate files. It wasn't designed for the small fragments and reads that sample sequencing produces. Here are the main difficulties.

1. The subsampling steps for each round take a long time (hours in later rounds) and run on a single core, which is wasteful. However, the full script depends on this subsetting to run properly, so there isn't really a way around it.
2. Parallelization occurs during the RECON analyses of rounds 2 to N, but because the single-core subsetting step (see #1) is a major bottleneck, it makes little sense to parallelize heavily.
3. Huge amounts of intermediate files are produced, and they grow rapidly with each round. Most of these are the batch-* files used for parallelization during the RECON rounds. In later rounds (5+, the output size inflates to over 200GB, m


## basespace_quickstart.md

      
Created October 3, 2018 16:40

Installing, authenticating, and downloading using BaseSpace CLI

Installation on Mac computer


Install BaseSpace CLI to $HOME/bin directory and make executable.

wget "https://api.bintray.com/content/basespace/BaseSpaceCLI-EarlyAccess-BIN/latest/\$latest/amd64-osx/bs?bt_package=latest" \
-O $HOME/bin/bs
chmod u+x $HOME/bin/bs

  
## scrolling_DNA.md

      
Last active January 12, 2021 12:24

Creating a scrolling DNA sequence visualization

Are you a scientist working in genomics? Have you ever given an interview to your university communications staff or local press? Did you ever find yourself wishing you could have a continuously scrolling nucleotide sequence running in the background during one of these interviews? Well, look no further, because here is exactly what you need to create such an effect, which will really wow those who watch your interview.
We will create two scripts: one that creates a random string of nucleotides and one that gives each nucleotide its own color. Here is the first script, in Python, named random_seq.py.
#!/usr/bin/env python

import random
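
A minimal sketch of the two steps described above (a random nucleotide string, then per-base coloring), combined into one Python file for illustration. The function names, the uniform base frequencies, and the ANSI color assigned to each base are assumptions made here, not details taken from random_seq.py.

```python
import random

# ANSI foreground color codes; this base-to-color mapping is an arbitrary assumption.
ANSI_COLORS = {"A": "\033[32m", "C": "\033[34m", "G": "\033[33m", "T": "\033[31m"}
ANSI_RESET = "\033[0m"


def random_sequence(length, rng=random):
    """Return a random nucleotide string, drawing each base uniformly."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def colorize(sequence):
    """Wrap each nucleotide in the ANSI color code assigned to it."""
    return "".join(ANSI_COLORS[base] + base + ANSI_RESET for base in sequence)


if __name__ == "__main__":
    # Print one 80-base line of colored random sequence.
    print(colorize(random_sequence(80)))
```

Printing such lines in a loop with a short delay would give a scrolling effect in any terminal that understands ANSI escape codes.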

  
## genotype_matrix_format.md

      
Last active July 30, 2021 21:22

NGSadmix (Skotte et al. 2013)

NGSadmix genotype matrices include a header line and three leading columns (with headers): the marker ID (scaffold and position), the reference allele, and the alternative allele (all sites must be biallelic). Three genotype likelihoods are given for each sample and marker in a standardized format (they sum to 1.0) and correspond to genotypes with increasingly fewer reference alleles (homozygous reference, heterozygous, homozygous alternative). All values are space-delimited, and missing data are coded as 0.000 across all three genotypes. Here is an example with three samples at two markers:
Marker Ref. Alt. Sample1 Sample1 Sample1 Sample2 Sample2 Sample2 Sample3 Sample3 Sample3
scaffold1_100 A C 1.000 0.000 0.000 0.333 0.333 0.333 0.250 0.750 0.000
scaffold2_1000 G T 0.000 0.000 0.000 0.500 0.500 0.000 0.010 0.990 0.000
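
To make the layout concrete, here is a minimal Python sketch that reads a matrix in the format described above and groups the likelihoods into one triple per sample. The function names and the tolerance used for the sum-to-1.0 check are assumptions for illustration; this is not part of NGSadmix or RADpipe.

```python
EXAMPLE = """\
Marker Ref. Alt. Sample1 Sample1 Sample1 Sample2 Sample2 Sample2 Sample3 Sample3 Sample3
scaffold1_100 A C 1.000 0.000 0.000 0.333 0.333 0.333 0.250 0.750 0.000
scaffold2_1000 G T 0.000 0.000 0.000 0.500 0.500 0.000 0.010 0.990 0.000
"""


def parse_genotype_matrix(lines):
    """Map each marker to its alleles and per-sample likelihood triples."""
    rows = [line.split() for line in lines if line.strip()]
    header, records = rows[0], rows[1:]
    # Each sample name appears three times in the header; keep every third one.
    samples = header[3::3]
    matrix = {}
    for fields in records:
        marker, ref, alt = fields[:3]
        values = [float(x) for x in fields[3:]]
        if len(values) != 3 * len(samples):
            raise ValueError(f"{marker}: expected {3 * len(samples)} likelihoods")
        # Order within a triple: homozygous ref, heterozygous, homozygous alt.
        likelihoods = {
            sample: tuple(values[3 * i:3 * i + 3])
            for i, sample in enumerate(samples)
        }
        matrix[marker] = {"ref": ref, "alt": alt, "likelihoods": likelihoods}
    return matrix


def is_missing(triple):
    """Missing data is coded as 0.000 for all three genotypes."""
    return all(value == 0.0 for value in triple)


def is_normalized(triple, tol=0.01):
    """Check that a triple sums to 1.0 within a rounding tolerance."""
    return abs(sum(triple) - 1.0) <= tol


if __name__ == "__main__":
    for marker, record in parse_genotype_matrix(EXAMPLE.splitlines()).items():
        for sample, triple in record["likelihoods"].items():
            status = "missing" if is_missing(triple) else "ok"
            print(marker, sample, triple, status)
```

The tolerance is needed because values are rounded to three decimals, so a triple such as 0.333 0.333 0.333 sums to 0.999 rather than exactly 1.0.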

The following RADpipe command will create this as output from a filtered VCF:
python genotypes_from_VCF.py --samplesheet <samplesheet.txt> --fi


## phastcons_conserved_extraction.md

      
Created October 10, 2018 20:45

Commands for extracting conserved regions from 100-way PhastCons conservation tracks. Then BLAST can be used to extract orthologous regions from genome of interest for as many regions as possible.
# download the human gene annotations
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz

# convert human gene annotations to GTF file format
zcat refGene.txt.gz | cut -f 2- | genePredToGtf -utr file stdin stdout > refGene.gtf

# extract exon features from GTF file
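
A minimal sketch of the exon-extraction step named in the last comment, written in Python. It assumes the standard nine-column, tab-delimited GTF layout with the feature type in column 3; the function name is hypothetical, and this is an illustration rather than the original command.

```python
def exon_features(gtf_lines):
    """Yield (seqname, start, end, strand, attributes) for each exon record."""
    for line in gtf_lines:
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # A well-formed GTF record has exactly nine tab-separated columns.
        if len(fields) != 9:
            continue
        seqname, _source, feature, start, end, _score, strand, _frame, attributes = fields
        if feature == "exon":
            yield seqname, int(start), int(end), strand, attributes


if __name__ == "__main__":
    # Assumes refGene.gtf was produced by the conversion step above.
    with open("refGene.gtf") as handle:
        for record in exon_features(handle):
            print(*record, sep="\t")
```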

  
## install_run_provean_notes.md

      
Created October 10, 2018 20:47

Notes from work installing and running Provean to predict the protein impact of variants. Provean input files were produced from VEP output using the commands below. Some trial runs were completed using a computer to understand how quickly Provean can be run in parallel to work through all annotated genes.
# running PROVEAN

# installation & dependencies
# 1. checked that blast was installed and also reinstalled cd-hit to avoid issue with certain version
# 2. installed the NCBI nr database
sudo mkdir /opt/ncbi_blast_nr_db_2018-01-29
sudo chmod 775 /opt/ncbi_blast_nr_db_2018-01-29

  
## gdrive_download
#!/usr/bin/env bash

# gdrive_download
#
# script to download Google Drive files from command line
# not guaranteed to work indefinitely
# taken from Stack Overflow answer:
# http://stackoverflow.com/a/38937732/7002068

gURL=$1

## gist:785015b8e2cb3c5fbc7d
# print the mean, minimum, and maximum of column 3; replace <input.txt> with your file
awk '{if(min==""){min=max=$3}; if($3>max) {max=$3}; if($3< min) {min=$3}; total+=$3; count+=1} END {print "mean =", total/count, "\nminimum =", min, "\nmaximum =", max}' <input.txt>
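
The one-liner above tracks a running minimum, maximum, total, and count over the third whitespace-delimited column, then prints the mean and extremes at the end. A rough Python equivalent is sketched below, assuming every non-blank line has a numeric value in the chosen column; the function name and the `column` parameter are hypothetical. Unlike awk, which treats a missing field as 0, this version raises an error on short lines or empty input.

```python
def summarize_column(lines, column=3):
    """Return (mean, minimum, maximum) of a 1-based whitespace-delimited column."""
    values = [float(line.split()[column - 1]) for line in lines if line.strip()]
    if not values:
        raise ValueError("no data rows to summarize")
    return sum(values) / len(values), min(values), max(values)


if __name__ == "__main__":
    import sys

    # Read rows from standard input, e.g. python summarize.py < input.txt
    mean, minimum, maximum = summarize_column(sys.stdin)
    print("mean =", mean)
    print("minimum =", minimum)
    print("maximum =", maximum)
```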

## orthomcl_tutorial.md

      
Last active March 15, 2023 13:31

Running OrthoMCL on a set of protein annotations from various species

OrthoMCL is the leading piece of software for inferring orthologs across several organisms. In this tutorial I will provide detailed instructions for running a set of protein annotations through OrthoMCL.
Software and Data


OrthoMCL, and its dependencies, must be installed. Detailed information on this tool and its installation can be found here. I actually used a slightly modified version of OrthoMCL that was made available by the author of the orthomcl-pipeline (see below). There aren't many details on how it differs from the standard OrthoMCL, but it is available here.
orthomcl-pipeline must also be installed, as this is how we will automate the OrthoMCL process. Detailed information on this tool and its installation can be found [here](https


## gene_structure_stats.md

      
Last active April 18, 2023 13:35

Script to produce estimates of gene structure

Please see the most up-to-date version of this protocol on my blog at https://darencard.net/blog/.
Inferring the structure of gene annotations

When annotating genomes, it is often desirable to know the overall structure of genes, including exon and intron lengths, among other metrics. Here is a program, genestats, that calculates such measures.
#!/usr/bin/env bash

usage()