Skip to content

Instantly share code, notes, and snippets.

@standage
Created April 30, 2013 17:50
Show Gist options
  • Save standage/5490485 to your computer and use it in GitHub Desktop.
Save standage/5490485 to your computer and use it in GitHub Desktop.
Given a set of DNA sequences (in Fasta format) and a set of coordinates (in "seqid,start,end" format), extract the given subsequences.
#!/usr/bin/env perl
use strict;
use Bio::SeqIO;
my $usage = "perl $0 seqs.fasta < coords.csv > subseqs.fasta # coords.csv file is 3 comma-delimited values: seqid, start, and end";
my $seqfile = shift(@ARGV) or die("Usage: $usage");
# Load sequences into memory
my %seqs;
my $seqinput = Bio::SeqIO->new( "-file" => $seqfile, "-format" => "Fasta" );
while(my $seq = $seqinput->next_seq)
{
$seqs{ $seq->id } = $seq;
}
# Read in coordinates, write output
my $seqoutput = Bio::SeqIO->new( "-fh" => \*STDOUT, "-format" => "Fasta" );
my $gene_count = 1;
while(my $line = <STDIN>)
{
chomp($line);
next if($line =~ m/^$/);
my($seqid, $start, $end) = split(/,/, $line);
my $seq = $seqs{ $seqid } or die("error: could not find sequence '$seqid' $!");
my $subseq = Bio::PrimarySeq->new( "-id" => "gene$gene_count",
"-desc" => "$seqid:$start-$end",
"-seq" => $seq->subseq($start, $end) );
$seqoutput->write_seq($subseq);
$gene_count++;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment