@craic
Created April 11, 2020 15:51
Simple script to clean up a FASTA file from the GISAID SARS-CoV-2 database
#!/usr/bin/env ruby
# cleanup_gisaid.rb
# Robert Jones jones@craic.com
# Freely distributed under the MIT licence
# This script takes a fasta format file from GISAID and fixes various data quality issues
# The vast majority of the sequences are fine but the few anomalies can cause problems downstream
# Print a sequence in lower case, wrapped at 100 characters per line
def output_sequence(sequence)
  sequence.downcase!
  # strip everything that is not a letter (digits, whitespace, punctuation)
  sequence.gsub!(/[^a-z]+/, '')
  linelength = 100
  i = 0
  seqlen = sequence.length
  while i < seqlen
    line = sequence[i, linelength]
    puts line
    i += linelength
  end
end
abort "Usage: #{$0} <fasta file>" unless ARGV.length == 1
header = ""
sequence = ""
# File.open with a block closes the file afterwards; unlike Kernel#open,
# it never treats a leading '|' in the argument as a command to run
File.open(ARGV[0], 'rb') do |file|
  file.each_line do |line|
    line.chomp!
    # strip ^M and any other control characters
    line.gsub!(/[[:cntrl:]]/, '')
    if line =~ /^>/
      if sequence != ""
        # output prior sequence
        output_sequence(sequence)
        sequence = ""
      end
      header = line
      # some records have no line break after the label
      # >BetaCoV/Nonthaburi/61/2020|EPI_ISL_403962ATACCT...
      if line.length > 100
        if line =~ /^(>.*?EPI_ISL_\d+)(\w+)$/
          header = $1
          sequence = $2
        end
      end
      # some lines have a space between '>' and header
      header.sub!(/^>\s+/, '>')
      # some lines have a space between two words like 'Hong' and 'Kong'
      header.gsub!(/\s+/, '_')
      puts header
    else
      # sequence
      sequence << line.downcase
    end
  end
end
if sequence != ""
  # output last sequence
  output_sequence(sequence)
end
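
The fused-header case is the least obvious fix in the script, so here is a minimal standalone sketch of it. The helper name `split_fused_header` and its `max_header_length` parameter are hypothetical and not part of the original script. The regular expression and the 100-character threshold are the script's own, and the sample record is built from the truncated example in the script's comment by repeating `ATACCT`, so it is illustrative data rather than a real GISAID entry.

```ruby
# Hypothetical helper (not in the original script): split a header line that
# has sequence data fused directly onto the end of the EPI_ISL accession.
# Returns [header, sequence]; sequence is "" when nothing was fused.
def split_fused_header(line, max_header_length = 100)
  # Same heuristic as the script: only unusually long header lines are suspect
  if line.length > max_header_length && line =~ /^(>.*?EPI_ISL_\d+)(\w+)$/
    [$1, $2]
  else
    [line, ""]
  end
end

# Illustrative input: the label from the script's comment plus made-up bases
fused = ">BetaCoV/Nonthaburi/61/2020|EPI_ISL_403962" + "ATACCT" * 20
header, sequence = split_fused_header(fused)
puts header           # the label, up to and including the accession digits
puts sequence.length  # number of bases recovered from the header line
```

Because `\d+` is greedy, every digit after `EPI_ISL_` is assigned to the accession, and the remainder must consist only of word characters. A header that carries further `|`-separated fields after the accession will not match and is returned unchanged.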