Skip to content

Instantly share code, notes, and snippets.

Daren Card darencard

Block or report user

Report or block darencard

Hide content and notifications from this user.

Learn more about blocking users

Contact Support about this user’s behavior.

Learn more about reporting abuse

Report abuse
View GitHub Profile
View genotype_matrix_format.md

NGSadmix (Skotte et al. 2013)

NGSadmix genotype matrices include a header line and two beginning columns (with headers) with the marker ID (scaffold and position) and the reference and alternative allele (all sites must be biallelic). Three genotype likelihoods are given for each sample and marker in a standardized format (sum to 1.0) and correspond to the likelihood of increasingless less reference alleles (homozygous reference, heterozygous, homozygous alternative). All values are space-delimited and missing data is coded as 0.000 across all three allele combinations. Here is an example with three samples at two markers:

Marker Ref. Alt. Sample1 Sample1 Sample1 Sample2 Sample2 Sample2 Sample3 Sample3 Sample3
scaffold1_100 A C 1.000 0.000 0.000 0.333 0.333 0.333 0.250 0.750 0.000
scaffold2_1000 G T 0.000 0.000 0.000 0.500 0.500 0.000 0.010 0.990 0.000

The following RADpipe command will create this as output from a filtered VCF:

python genotypes_from_VCF.py --samplesheet <samplesheet.txt> --fi
@darencard
darencard / gist:785015b8e2cb3c5fbc7d
Last active Jan 10, 2016
Calculating the mean, minimum, and maximum of a column using Awk (vary $3 to reflect desired column of data)
View gist:785015b8e2cb3c5fbc7d
cat <input.txt> | awk '{if(min==""){min=max=$3}; if($3>max) {max=$3}; if($3< min) {min=$3}; total+=$3; count+=1} END {print "mean =", total/count, "\nminimum =", min, "\nmaximum =", max}'
@darencard
darencard / missing_from_vcf.md
Last active Jan 12, 2017
Extract the proportion of missing data per sample in a VCF/BCF file
View missing_from_vcf.md

Simply replace <<FILE>> with your properly formated VCF/BCF file name (2 places). Required bcftools v. 1.2+.

paste \
<(bcftools query -f '[%SAMPLE\t]\n' <<FILE>> | head -1 | tr '\t' '\n') \
<(bcftools query -f '[%GT\t]\n' <<FILE>> | awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n | cut -f 2)
@darencard
darencard / filter_high_missing_samples.md
Created Jan 12, 2017
Filter away samples from a VCF/BCF that have high amounts of missing data
View filter_high_missing_samples.md

Simply replace <<INPUT>>, <<OUTPUT>>, and <<PROP>> with the input file name, output file name, and proportion missing data at which points samples begin to get excluded, repectively. For example, 0.75 means that samples with greater than 75% missing data are filtered away. Requires bcftools v. 1.2+.

bcftools view -S ^<(paste <(bcftools query -f '[%SAMPLE\t]\n' <<INPUT>> | head -1 | tr '\t' '\n') <(bcftools query -f '[%GT\t]\n' <<INPUT>> | awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n | cut -f 2) | awk '{ if ($2 > <<PROP>>) print $1 }') <<INPUT>> | bgzip > <<OUTPUT>>
@darencard
darencard / parse_ncbi_python.md
Last active Feb 15, 2017
shell one-liner that parses the fasta headers from the NCBI python genome. will likely work on other genomes from NCBI as well.
View parse_ncbi_python.md

shell one-liner that parses the fasta headers from the NCBI python genome. will likely work on other genomes from NCBI as well.

output fields:

  1. transcript ID
  2. full transcript ID w/ version (.1, .2, etc.)
  3. full gene identifier (watch out for spaces and weird symbols)
  4. gene symbol
  5. transcript variant (watch out for spaces), with NA meaning none
  6. type of transcript (mRNA, ncNRA, etc.)
@darencard
darencard / parse_ncbi_transcripts.md
Last active Feb 21, 2017
parse transcripts out of NCBI GFF based on gene ids
View parse_ncbi_transcripts.md

Script to parse a NCBI GFF based on transcript IDs (e.g., XM_000..). These transcript IDs must not include the version suffix (.1, .2, etc.).

Columns returned:

  1. chromosome/scaffold
  2. start position of transcript
  3. end position of transcript
  4. transcript number
  5. gene number
  6. gene ID (NCBI)
@darencard
darencard / imagej_macros_scale_png.md
Last active Mar 2, 2017
ImageJ macro to automate scale bar addition using command line
View imagej_macros_scale_png.md

Overview:

Below is a ImageJ macro that will read a user-provided image file, add a scale bar, and then output a PNG image. The user must specify two arguments, the input file and the output file, as one quote-enclosed argument (see example below). The scale bar characteristics have been hard coded and can be changed by hand, or the script can be modified to allow argument specifications.

This macro is designed to be called from the command line using the ImageJ executable. With my Mac OSX computer running Fiji, the path is /Applications/Fiji.app/Contents/MacOS/ImageJ-macosx. This has not been tested elsewhere and may not work without some effort. It relies on the Bio-Formats plugin to read the file and was written to convert from Zeiss's .czi files, so no guarantee that it works with others as desired. It is especially important to note that this does not set the scale, but infers it based on the metadata stored in the .czi files. Therefore, it will probably not work well with other file types.

If y

@darencard
darencard / image_dimensions.md
Created Mar 3, 2017
Python script that determines image dimensions using ImageJ
View image_dimensions.md

Overview:

Below is a ImageJ Python script that will read a user-provided image file and output the width and height, with units, to stdout. This macro is designed to be called from the command line using the ImageJ executable. With my Mac OSX computer running Fiji, the path is /Applications/Fiji.app/Contents/MacOS/ImageJ-macosx. This has not been tested elsewhere and may not work without some effort. It relies on the Bio-Formats plugin to read the file and was written to convert from Zeiss's .czi files, so no guarantee that it works with others as desired. It is especially important to note that this does not set the scale, but infers it based on the metadata stored in the .czi files. Therefore, it will probably not work well with other file types.

Usage:

/Applications/Fiji.app/Contents/MacOS/ImageJ-macosx --headless get_image_dims.py input

Python script:

View plotly_tutorial_offline_beers.ipynb
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
@darencard
darencard / popstats_from_vcf.Md
Created Jul 17, 2017
Calculating population genetic statistics from VCF files using BCFtools
View popstats_from_vcf.Md

Useful Oneliners for Calculating Population Genetic Statistics from VCF files

The following commands require non-standard software like BCFtools and VCFtools.

thin variants to prevent linkage biases and output the number of sampled alleles and the allele frequency for the reference allele

vcftools --thin 10000 --recode --recode-INFO-all --stdout --gzvcf <my_variants.vcf.gz> | \
  bcftools query -f '%CHROM\t%POS[\t%GT]\n' - | \
  awk -v OFS="\t" '{ miss=0; hom_ref=0; hom_alt=0; het=0; \
You can’t perform that action at this time.