Daniel Standage standage
@standage
standage / taxtrav.c
Last active December 17, 2015 08:09
Traverse the taxonomy to determine the classification of a given taxon at a particular taxonomic rank.
/*
Copyright (c) 2013-2014, Daniel S. Standage <daniel.standage@gmail.com>
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
@standage
standage / sample-gc.pl
Last active December 17, 2015 09:19
Given a list of Fasta files containing genomic sequences, sample random subsequences and calculate their GC content.
#!/usr/bin/env perl
# Copyright (c) 2013, Daniel S. Standage <daniel.standage@gmail.com>
#
# Permission to use, copy, modify, and/or distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
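The preview above stops inside the license header, so the script's actual logic is not visible here. As a rough illustration of the idea in the description (sample random subsequences and compute their GC content), here is a minimal Python sketch. The function names `gc_content` and `sample_gc`, the choice to ignore ambiguous bases such as `N`, and uniform sampling of start positions are all assumptions, not the author's implementation.

```python
import random

def gc_content(seq):
    """Fraction of G/C among unambiguous A/C/G/T bases (assumption: N etc. are ignored)."""
    seq = seq.upper()
    gc_count = seq.count("G") + seq.count("C")
    at_count = seq.count("A") + seq.count("T")
    total = gc_count + at_count
    return float(gc_count) / total if total else 0.0

def sample_gc(seq, length, n, rng=None):
    """Draw n random subsequences of the given length and return their GC contents."""
    rng = rng or random.Random()
    if length <= 0 or length > len(seq):
        raise ValueError("length must be between 1 and len(seq)")
    values = []
    for _ in range(n):
        # Hypothetical sampling scheme: uniform start position, fixed window length.
        start = rng.randrange(len(seq) - length + 1)
        values.append(gc_content(seq[start:start + length]))
    return values
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which helps when comparing samples across several genomes.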
@standage
standage / gsq-select.pl
Created June 14, 2013 03:30
Given a set of transcript or protein alignments in GeneSeqer or GenomeThreader format, retrieve all sequences that map to the specified genomic region.
#!/usr/bin/env perl
use strict;
use Bio::SeqIO;
use Getopt::Long;
# Given a set of transcript or protein alignments in GeneSeqer or GenomeThreader
# format, retrieve all sequences that map to the specified genomic region
#
# Example: perl gsq-select.pl Chr1:100001-200000 tair-ests.fa est-alignments.gsq
@standage
standage / fasta-validate.pl
Last active June 8, 2018 15:15
NCBI provides specifications for several Fasta defline identifier formats/conventions at ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.html. This script will automatically detect the convention used sequence-by-sequence, and convert all deflines to the requested format.
#!/usr/bin/env perl
# Copyright (c) 2012-2013, Daniel S. Standage <daniel.standage@gmail.com>
#
# Permission to use, copy, modify, and/or distribute this software for any
# purpose with or without fee is hereby granted, provided that the above
# copyright notice and this permission notice appear in all copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
# WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
@standage
standage / gsq-to-gff3.pl
Created July 8, 2013 17:43
Convert alignments in GeneSeqer (or GenomeThreader) output to GFF3 format.
#!/usr/bin/env perl
use strict;
my %alignments;
my @exon_scores;
my $align_score;
while(my $line = <STDIN>)
{
  if($line =~ m/ Exon +\d+.+; score: (.+)/)
  {
@standage
standage / map-collapse.pl
Created August 5, 2013 20:19
Given a mapping (in the form of a tab-delimited text file), collapse the mapping so that each key occupies a single line.
#!/usr/bin/env perl
# Copyright (c) 2013, Daniel S. Standage <daniel.standage@gmail.com>
#
# input: two columns of data in tab-delimited format, mapping from a key
# (column 1) to a value (column 2);
# output: also two columns of data in tab delimited format, but all values
# sharing the same key are printed on a single line; that is, each key
# corresponds to a comma-separated list of associated values
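The input/output contract described in the comment above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the Perl script itself: the function name `collapse_mapping` is hypothetical, and keeping keys and values in first-seen order is an assumption the preview does not confirm.

```python
from collections import OrderedDict

def collapse_mapping(lines):
    """Collapse tab-delimited key/value lines so each key appears on one line."""
    collapsed = OrderedDict()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        # Assumption: values are kept in the order they were first seen.
        collapsed.setdefault(key, []).append(value)
    return ["%s\t%s" % (key, ",".join(values)) for key, values in collapsed.items()]
```

For example, the three input lines `a<TAB>1`, `b<TAB>2`, `a<TAB>3` would collapse to `a<TAB>1,3` and `b<TAB>2`.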
@standage
standage / gt-bug-115-desired-output.gff3
Last active December 22, 2015 20:58
My attempt to reproduce the behavior I described in https://github.com/genometools/genometools/issues/115.
##gff-version 3
##sequence-region PdomScaf0001 252075 255869
PdomScaf0001 maker mRNA 252075 255869 . - . ID=PdomMRNAr1.1-00022.1;_AED=0.29;_eAED=0.29;_QI=0|0.5|0.33|1|1|1|3|893|923
PdomScaf0001 maker exon 252075 255370 118 - . Parent=PdomMRNAr1.1-00022.1
PdomScaf0001 maker stop_codon 252968 252970 . - . Parent=PdomMRNAr1.1-00022.1
PdomScaf0001 . CDS 252968 255370 . - . ID=myCDS;Parent=PdomMRNAr1.1-00022.1
PdomScaf0001 maker exon 255453 255790 35.2 - . Parent=PdomMRNAr1.1-00022.1
PdomScaf0001 . CDS 255453 255790 . - . ID=myCDS;Parent=PdomMRNAr1.1-00022.1
PdomScaf0001 maker exon 255839 255869 9.8 - . Parent=PdomMRNAr1.1-00022.1
PdomScaf0001 . CDS 255839 255869 . - . ID=myCDS;Parent=PdomMRNAr1.1-00022.1
@standage
standage / defline-test.fasta
Created September 16, 2013 16:46
Bogus data file used for debugging Fasta parsing scripts, libraries, etc.
>gi|94449065|gb|DQ473580.1| Ricinus communis plastid Tic40 mRNA, complete cds; nuclear gene for plastid product
ACGT
>gi|94449065|gb|DQ473580.1|abcdefghijk Ricinus communis plastid Tic40 mRNA, complete cds; nuclear gene for plastid product
ACGT
>gb|DQ473580.1| Ricinus communis plastid Tic40 mRNA, complete cds; nuclear gene for plastid product
ACGT
>gb|DQ473580.1|abcdefghijk Ricinus communis plastid Tic40 mRNA, complete cds; nuclear gene for plastid product
ACGT
>gi|61556715|ref|NM_001013027.1| Danio rerio ba1 globin, like (ba1l), mRNA
ACGT
@standage
standage / aed-analysis.py
Created September 16, 2013 16:52
Script for analyzing Maker output and comparing AED vs eAED scores
#!/usr/bin/env python
import sys
import re
aedgt = 0
eaedgt = 0
perfect = 0
nosupport = 0
for line in sys.stdin:
    if "\tmRNA\t" in line:
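The preview ends before the loop body, so how the counters are updated is not shown. One plausible way to compare the two scores, assuming Maker writes them as `_AED=` and `_eAED=` attributes on mRNA features (as in the GFF3 sample elsewhere on this page), is sketched below. The function name `compare_aed` and the three tally categories are illustrative assumptions and do not reproduce the script's `perfect` or `nosupport` counters.

```python
import re

# "_AED=" cannot match inside "_eAED=" because the underscore there is followed by "e".
AED_RE = re.compile(r"_AED=([^;\s]+)")
EAED_RE = re.compile(r"_eAED=([^;\s]+)")

def compare_aed(lines):
    """Tally how often AED exceeds eAED, eAED exceeds AED, or the two are equal."""
    tally = {"aed_greater": 0, "eaed_greater": 0, "equal": 0}
    for line in lines:
        if "\tmRNA\t" not in line:
            continue  # only mRNA features carry the scores of interest
        aed_match = AED_RE.search(line)
        eaed_match = EAED_RE.search(line)
        if not (aed_match and eaed_match):
            continue  # skip features missing either attribute
        aed = float(aed_match.group(1))
        eaed = float(eaed_match.group(1))
        if aed > eaed:
            tally["aed_greater"] += 1
        elif eaed > aed:
            tally["eaed_greater"] += 1
        else:
            tally["equal"] += 1
    return tally
```

Calling `compare_aed(sys.stdin)` would mirror the stdin-driven loop in the original preview.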
@standage
standage / count-alpha-chars-in-stdin.c
Created September 16, 2013 17:19
Count alphabetic characters from standard input.
#include <stdio.h>
int main()
{
  int frequency[26];
  int ch;
  for (ch = 0; ch < 26; ch++)
    frequency[ch] = 0;
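The C preview is cut off right after the 26-element array is zeroed, so the counting loop itself is not visible. The following Python sketch shows one way such a per-letter tally could work. Folding upper and lower case together and ignoring non-ASCII letters are assumptions, and `letter_frequencies` is a hypothetical name.

```python
import string
from collections import Counter

def letter_frequencies(text):
    """Return 26 counts, one per letter a-z, ignoring case and non-letters."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    # Index 0 corresponds to 'a', index 25 to 'z', like the frequency[26] array.
    return [counts[letter] for letter in string.ascii_lowercase]
```

Summing the returned list gives the total number of alphabetic characters, which matches the one-line description of the gist.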