@jasongallant
Forked from iracooke/README.md
Created March 26, 2021 01:24
NCBI TSA Submission Guide

Steps to submit to TSA

If you have a transcriptome that has been assembled from shotgun reads, the TSA (Transcriptome Shotgun Assembly) database is a good place to deposit it so that it can be widely accessed.

This guide assumes that you simply want to submit the assembled sequences from your transcriptome without annotations. NCBI sets a high bar for the inclusion of annotations, so for most non-model organisms they are unlikely to meet the criteria.

To create a TSA submission, take a look at the NCBI guidelines. This gist is based on those guidelines.

Register BioProject

Start your TSA submission by creating a BioProject. When you create this BioProject you should select the project type Assembly. If you have previously created a BioProject for your raw data with a different project type, you will need to create a new one for the TSA submission. Ensure that you have uploaded all of the raw data that was used to create the assembly to SRA.

Cleaning

Probably the trickiest part of the process is cleaning the data so that it passes NCBI's automated checks. This gist contains a simple Ruby script that performs some very basic cleaning, but you will almost certainly need to do more than this after your first attempt at submission.

First, clean the assembly so that:

- No sequence is shorter than 200 nucleotides
- No sequence has more than 14 consecutive Ns

This can be done with the ncbi_ise.rb script:

	./ncbi_ise.rb transcriptome.fasta > cleaned_transcriptome.fasta 
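The basic cleaning rules can be sketched as a single Ruby predicate (a simplified illustration only; the full ncbi_ise.rb script below also trims flagged intervals, removes duplicates, and checks overall N content):

```ruby
# Minimal sketch of the basic cleaning rules: keep a sequence only if
# it is at least 200 nucleotides long and contains no run of 14 or
# more consecutive Ns (mirroring the regex check in ncbi_ise.rb).
def passes_basic_checks?(seq)
  seq.length >= 200 && seq !~ /[nN]{14}/
end
```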

Template

Fill out the template form at https://submit.ncbi.nlm.nih.gov/templates/ to generate your template.sbt file.

Make SQN file

To do this you need the tbl2asn tool installed. This tool is distributed by NCBI but can also be installed with Homebrew:

	brew tap homebrew/science
	brew install tbl2asn

Once you have the tbl2asn utility installed you can use it to create the final .sqn file for upload.

Be sure to edit the command below to use the correct organism Genus and species:

	tbl2asn -t template.sbt -w assembly.cmt -M t -i cleaned_transcriptome.fasta -j "[organism=Genus species][moltype=transcribed_RNA][tech=TSA]"

This .sqn file can be uploaded directly to the NCBI website, where it will be subjected to further checks. Most likely these will identify all kinds of other problems. If problems are found immediately, the site will display a series of red error lines like this one:

ERROR fhd.sqn FhD07614 1..34 VECTOR_MATCH File: fhd.sqn, Code(VECTOR_MATCH), Sequence-id: FhD07614, Interval: 1..34, This sequence has a Strong match on the following UniVec vector: gnl|uv|NGB00309.1:1-48 Multimer of Pharmacia EcoRI adaptor used in I.M.A.G.E. library NCI_CGAP_CLL1 and other libraries
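Each such error line can be parsed with a regular expression to recover the sequence id and the interval to trim, which is essentially what ncbi_ise.rb does with the -r option (the pattern below is a slightly tightened sketch):

```ruby
# Extract the Sequence-id and Interval fields from a VecScreen-style
# error line, converting the 1-based interval to 0-based coordinates.
line = "ERROR fhd.sqn FhD07614 1..34 VECTOR_MATCH File: fhd.sqn, " \
       "Code(VECTOR_MATCH), Sequence-id: FhD07614, Interval: 1..34, " \
       "This sequence has a Strong match on a UniVec vector"
if line =~ /Sequence-id: (\S+), Interval: (\d+\.\.\d+),/
  id = $1
  vstart, vend = $2.split("..").map { |e| e.to_i - 1 }
end
```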

You should download all these errors to a file (e.g. errors.txt) and then provide them as input to another round of cleaning with the ncbi_ise.rb script like this:

	./ncbi_ise.rb -r errors.txt cleaned_transcriptome.fasta > cleaned_transcriptome2.fasta 

After doing this, rerun the tbl2asn command on the newly cleaned fasta file to regenerate the .sqn file and try the NCBI upload again. If it keeps failing, look at the errors it produces and see if you can fix them manually, or edit the ncbi_ise.rb script to improve the cleaning process.

Eventually you will satisfy NCBI and your file will go through, but the process is most likely not over. NCBI may get back to you with further errors, which come in a file called FCSreport.txt. The ncbi_ise.rb script can also use this file to perform further cleaning, like this:

	./ncbi_ise.rb -f FCSreport.txt cleaned_transcriptome2.fasta > cleaned_transcriptome3.fasta 
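FCSreport.txt trim lines are tab separated: an id, the sequence length, an interval, and a reason. A sketch of how ncbi_ise.rb parses them:

```ruby
# Parse a tab-separated FCS report trim line of the form:
#   <id>\t<length>\t<from>..<to>\t<reason>
# converting the 1-based interval to 0-based coordinates.
line = "FhD01419\t1316\t75..203\tvector/etc"
parts = line.split("\t")
if parts.length == 4 && parts[2] =~ /\d+\.\.\d+/
  id = parts[0]
  vstart, vend = parts[2].split("..").map { |e| e.to_i - 1 }
end
```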

Once you've done this, regenerate the .sqn file and resubmit.

#!/usr/bin/env ruby
#
# Clean up transcriptome sequences to conform with NCBI requirements
#
# Usage: ./ncbi_ise.rb transcriptome.fasta > cleaned_transcriptome.fasta
#
require 'bio'
require 'optparse'
require 'set'
options = {:report => nil, :fcsreport => nil}

parser = OptionParser.new do |opts|
  opts.banner = "Usage: ncbi_ise.rb [options] input.fasta"

  opts.on('-r', '--report file', 'Report file generated when uploading a .sqn to TSA during submission') do |repfile|
    options[:report] = repfile
  end

  opts.on('-f', '--fcs-report file', 'Report file generated after a TSA submission is processed but still has errors') do |repfile|
    options[:fcsreport] = repfile
  end

  opts.on('-h', '--help', 'Displays Help') do
    puts opts
    exit
  end
end

parser.parse!

if ARGV[0].nil?
  $stderr.write "You must supply an input file\n"
  puts parser
  exit
end
input_fasta = Bio::FlatFile.auto(ARGV[0])

# First pass: record the length of each sequence (used later when
# choosing which member of a duplicate group to keep)
sequence_lengths = {}
input_fasta.each do |entry|
  eid = entry.entry_id
  sequence_lengths[eid] = entry.naseq.length
end

# Re-open the input file for the main pass
input_fasta = Bio::FlatFile.auto(ARGV[0])

# Parse an NCBI error report file to find sequences for trimming
#
# For example we are parsing lines like this
#
# ERROR fhd.sqn FhD07614 35..35 VECTOR_MATCH File: fhd.sqn, Code(VECTOR_MATCH), Sequence-id: FhD07614, Interval: 35..35, ...
#
# And our aim is to extract the id FhD07614 and trim interval 35..35
#
trim_sequences = {}
unless options[:report].nil?
  File.open(options[:report], "r").each do |line|
    if line !~ /Sequence-id: (.*), Interval: (.*),/
      $stderr.write "Unrecognised VecScreen output \n#{line}"
    else
      id, interval = line.match(/Sequence-id: (.*), Interval: (.*),/).captures
      vstart, vend = interval.split("..").collect { |e| e.to_i - 1 }
      trim_sequences[id] = [vstart, vend]
    end
  end
end
# Parse an NCBI FCS report to find more sequences for trimming or deletion
#
# For example we are parsing lines like this for trimming
#
# FhD01419 1316 75..203 vector/etc
#
duplicated_sequences = Set.new
excluded_sequences = Set.new
unless options[:fcsreport].nil?
  File.open(options[:fcsreport], "r").each do |line|
    parts = line.split("\t")

    # Match and parse lines specifying parts of sequences to trim
    if (parts.length == 4) && (parts[2] =~ /\d+\.\.\d+/)
      id, interval = parts[0], parts[2]
      vstart, vend = interval.split("..").collect { |e| e.to_i - 1 }
      trim_sequences[id] = [vstart, vend]
    end

    # Match lines specifying duplicated sequences. Keep the longest
    # member of each duplicate group and mark the rest for removal
    if line =~ /^lcl\|/
      dups = line.to_enum(:scan, /lcl\|([^~]*)/).map { Regexp.last_match.captures[0] }
      dlens = dups.collect { |dp| sequence_lengths[dp] }
      rep = dups[dlens.index(dlens.max)]
      dups.delete(rep)
      duplicated_sequences.merge(dups)
    end

    # Match lines specifying whole sequences to exclude, like this
    # comp34504_c0_seq1 206 vector/etc
    if (parts.length == 3) && (parts[1] =~ /^\d+$/)
      excluded_sequences.add(parts[0])
    end
  end
end
too_short = 0
toomany_n = 0
trim_n = 0
dup_n = 0
excluded_n = 0

input_fasta.each do |entry|
  eid = entry.entry_id

  # Replace X's with n's and strip leading/trailing n's
  seq = entry.naseq.gsub(/[Xx]/, "n")
  seq.sub!(/^n*/, "")
  seq.sub!(/n*$/, "")

  skip = false

  # Trim intervals flagged by NCBI. Note that slice! takes a range here;
  # passing (start, end) as two arguments would be read as (start, length)
  if trim_sequences.has_key?(eid)
    seq.slice!(trim_sequences[eid][0]..trim_sequences[eid][1])
    trim_n += 1
  end

  # Skip sequences that are too short or have too many N's
  if seq.length < 200
    too_short += 1
    skip = true
  elsif seq =~ /[nN]{14}/ ||                        # Run of 14+ N's anywhere
        seq[0..19] =~ /[nN]{10}/ ||                 # Run of 10+ N's near the start ...
        seq[-20..-1] =~ /[nN]{10}/ ||               # ... or near the end
        (seq.count('nN').to_f / seq.length) > 0.1   # More than 10% N overall
    toomany_n += 1
    skip = true
  end

  # Skip duplicates
  if duplicated_sequences.member?(eid)
    dup_n += 1
    skip = true
  end

  # Skip sequences identified for exclusion
  if excluded_sequences.member?(eid)
    excluded_n += 1
    skip = true
  end

  $stdout.write ">#{entry.entry_id}\n#{seq.upcase}\n" unless skip
end

$stderr.write "Skipped #{too_short} sequences because they were too short\n"
$stderr.write "Skipped #{toomany_n} sequences because they had too many Ns\n"
$stderr.write "Trimmed #{trim_n} sequences\n"
$stderr.write "Skipped #{dup_n} sequences with too much redundancy\n"
$stderr.write "Skipped #{excluded_n} sequences identified for exclusion\n"