Daren Card darencard

## genotype_matrix_format.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              1 star
            
          
                darencard
                / genotype_matrix_format.md
            
            
              Last active
              July 30, 2021 21:22
            
          
    NGSadmix (Skotte et al. 2013)

NGSadmix genotype matrices include a header line and two beginning columns (with headers) with the marker ID (scaffold and position) and the reference and alternative allele (all sites must be biallelic). Three genotype likelihoods are given for each sample and marker in a standardized format (sum to 1.0) and correspond to the likelihood of increasingless less reference alleles (homozygous reference, heterozygous, homozygous alternative). All values are space-delimited and missing data is coded as 0.000 across all three allele combinations. Here is an example with three samples at two markers:
Marker Ref. Alt. Sample1 Sample1 Sample1 Sample2 Sample2 Sample2 Sample3 Sample3 Sample3
scaffold1_100 A C 1.000 0.000 0.000 0.333 0.333 0.333 0.250 0.750 0.000
scaffold2_1000 G T 0.000 0.000 0.000 0.500 0.500 0.000 0.010 0.990 0.000

The following RADpipe command will create this as output from a filtered VCF:
python genotypes_from_VCF.py --samplesheet <samplesheet.txt> --fi


## gist:785015b8e2cb3c5fbc7d
cat <input.txt> | awk '{if(min==""){min=max=$3}; if($3>max) {max=$3}; if($3< min) {min=$3}; total+=$3; count+=1} END {print "mean =", total/count, "\nminimum =", min, "\nmaximum =", max}'

## missing_from_vcf.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                darencard
                / missing_from_vcf.md
            
            
              Last active
              January 12, 2017 16:26
            
              
                Extract the proportion of missing data per sample in a VCF/BCF file
              
          
    Simply replace <<FILE>> with your properly formated VCF/BCF file name (2 places).
Required bcftools v. 1.2+.
paste \
<(bcftools query -f '[%SAMPLE\t]\n' <<FILE>> | head -1 | tr '\t' '\n') \
<(bcftools query -f '[%GT\t]\n' <<FILE>> | awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n | cut -f 2)

  
## filter_high_missing_samples.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                darencard
                / filter_high_missing_samples.md
            
            
              Created
              January 12, 2017 17:14
            
              
                Filter away samples from a VCF/BCF that have high amounts of missing data
              
          
    Simply replace <<INPUT>>, <<OUTPUT>>, and <<PROP>> with the input file name, output file name, and proportion missing data at which points samples begin to get excluded, repectively. For example, 0.75 means that samples with greater than 75% missing data are filtered away. Requires bcftools v. 1.2+.
bcftools view -S ^<(paste <(bcftools query -f '[%SAMPLE\t]\n' <<INPUT>> | head -1 | tr '\t' '\n') <(bcftools query -f '[%GT\t]\n' <<INPUT>> | awk -v OFS="\t" '{for (i=1;i<=NF;i++) if ($i == "./.") sum[i]+=1 } END {for (i in sum) print i, sum[i] / NR }' | sort -k1,1n | cut -f 2) | awk '{ if ($2 > <<PROP>>) print $1 }') <<INPUT>> | bgzip > <<OUTPUT>>

  
## parse_ncbi_python.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                darencard
                / parse_ncbi_python.md
            
            
              Last active
              February 15, 2017 14:52
            
              
                shell one-liner that parses the fasta headers from the NCBI python genome. will likely work on other genomes from NCBI as well.
              
          
    shell one-liner that parses the fasta headers from the NCBI python genome. will likely work on other genomes from NCBI as well.
output fields:

transcript ID
full transcript ID w/ version (.1, .2, etc.)
full gene identifier (watch out for spaces and weird symbols)
gene symbol
transcript variant (watch out for spaces), with NA meaning none
type of transcript (mRNA, ncNRA, etc.)


## parse_ncbi_transcripts.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                darencard
                / parse_ncbi_transcripts.md
            
            
              Last active
              February 21, 2017 15:50
            
              
                parse transcripts out of NCBI GFF based on gene ids
              
          
    Script to parse a NCBI GFF based on transcript IDs (e.g., XM_000..). These transcript IDs must not include the version suffix (.1, .2, etc.).
Columns returned:

chromosome/scaffold
start position of transcript
end position of transcript
transcript number
gene number
gene ID (NCBI)


## imagej_macros_scale_png.md

      
              1 file
            
          
              0 forks
            
          
              1 comment
            
          
              2 stars
            
          
                darencard
                / imagej_macros_scale_png.md
            
            
              Last active
              February 26, 2024 17:53
            
              
                ImageJ macro to automate scale bar addition using command line
              
          
    Overview:

Below is a ImageJ macro that will read a user-provided image file, add a scale bar, and then output a PNG image. The user must specify two arguments, the input file and the output file, as one quote-enclosed argument (see example below). The scale bar characteristics have been hard coded and can be changed by hand, or the script can be modified to allow argument specifications.
This macro is designed to be called from the command line using the ImageJ executable. With my Mac OSX computer running Fiji, the path is /Applications/Fiji.app/Contents/MacOS/ImageJ-macosx. This has not been tested elsewhere and may not work without some effort. It relies on the Bio-Formats plugin to read the file and was written to convert from Zeiss's .czi files, so no guarantee that it works with others as desired. It is especially important to note that this does not set the scale, but infers it based on the metadata stored in the .czi files. Therefore, it will probably not work well with other file types.
If y

  
## image_dimensions.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                darencard
                / image_dimensions.md
            
            
              Created
              March 3, 2017 16:39
            
              
                Python script that determines image dimensions using ImageJ
              
          
    Overview:

Below is a ImageJ Python script that will read a user-provided image file and output the width and height, with units, to stdout. This macro is designed to be called from the command line using the ImageJ executable. With my Mac OSX computer running Fiji, the path is /Applications/Fiji.app/Contents/MacOS/ImageJ-macosx. This has not been tested elsewhere and may not work without some effort. It relies on the Bio-Formats plugin to read the file and was written to convert from Zeiss's .czi files, so no guarantee that it works with others as desired. It is especially important to note that this does not set the scale, but infers it based on the metadata stored in the .czi files. Therefore, it will probably not work well with other file types.
Usage:

/Applications/Fiji.app/Contents/MacOS/ImageJ-macosx --headless get_image_dims.py input
Python script:


## efficient_repeatmodeler.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                darencard
                / efficient_repeatmodeler.md
            
            
              Last active
              March 6, 2018 04:32
            
              
                Description of changes to make RepeatModeler run more efficiently for sample sequencing data analysis
              
          
    Running RepeatModeler More Efficiently

RepeatModeler isn't very well suited for sample sequencing data, taking a long time and creating copious amounts of intermediate data files. It obviously wasn't designed for small fragments and reads, which are what we get with sample sequencing data, and here are the main difficulties.

The subsampling steps for each round take a long time (hours in later rounds) and are done using a single core, which is wasteful and inefficient. However, the full script depends on this subsetting to run properly, so there isn't really a way around this.
Parallelization occurs during the RECON analyses of rounds 2 to N, so overall, it makes little sense to parallelize heavily since a major bottleneck is the subsetting step (see #1).
Huge amounts of intermediate files are produced, which grow rapidly with each round. Most of these are the batch-* files that are used for parallelization during the RECON rounds. In later rounds (5+, the output size inflates to over 200GB, m


## genome_oneliners.md

      
              1 file
            
          
              7 forks
            
          
              1 comment
            
          
              35 stars
            
          
                darencard
                / genome_oneliners.md
            
            
              Last active
              March 12, 2024 17:40
            
              
                Quick one-liner commands that are useful for genomics
              
          
    Useful Genomics Oneliners

The following commands sometimes require non-standard software like bioawk and seqtk.
rename scaffold headers with sequential numbers and lengths ("scaffold-N ")
bioawk -c fastx '{ print ">scaffold-" ++i" "length($seq)"\n"$seq }' < genome.fasta > new_genome.fasta
make association table of old and renamed scaffold names after above renaming command