@ShaiberAlon
Created October 8, 2018 17:58
Split a FASTA file into multiple FASTA files with a maximum number of sequences per file
#!/usr/bin/env python
'''
Split a FASTA file into multiple smaller FASTA files.
Use like this:
    python SPLIT-FASTA.py fasta-name.fa output-prefix SIZE
where SIZE is the maximum number of sequences per output file.
So if your input FASTA was contigs.fa and had 190 sequences, then:
    python SPLIT-FASTA.py contigs.fa contigs-mini 50
would result in four output files (three with 50 sequences, one with 40):
contigs-mini_0.fa
contigs-mini_1.fa
contigs-mini_2.fa
contigs-mini_3.fa
'''
import sys
import anvio.fastalib as f
input_file_name = sys.argv[1]
output_file_prefix = sys.argv[2]
max_number_per_file = int(sys.argv[3])
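# Read the input FASTA (exposes .ids and .sequences)
c = f.ReadFasta(f_name=input_file_name)
n = 0  # number of sequences written to the current output file
m = 0  # index of the current output file
output_file_name = '%s_%s.fa' % (output_file_prefix, m)
output_fasta = f.FastaOutput(output_file_path=output_file_name)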
for header, seq in zip(c.ids, c.sequences):
    if n == max_number_per_file:
        # Current file is full: close it and start the next one
        n = 0
        m += 1
        output_fasta.close()
        output_file_name = '%s_%s.fa' % (output_file_prefix, m)
        output_fasta = f.FastaOutput(output_file_path=output_file_name)
    n += 1
    output_fasta.write_id(header)
    output_fasta.write_seq(seq)
output_fasta.close()
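
For readers without anvi'o installed, the same splitting logic can be sketched with the standard library alone. This is not the script above and does not use `anvio.fastalib`; the helper names `read_fasta` and `split_fasta` are hypothetical, and the parser assumes a plain-text FASTA file in which every record starts with a `>` header line.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                # A new header ends the previous record, if there was one
                if header is not None:
                    yield header, ''.join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def split_fasta(input_path, output_prefix, max_per_file):
    """Write at most max_per_file records per file; return the paths written."""
    if max_per_file < 1:
        raise ValueError('max_per_file must be at least 1')
    paths, out = [], None
    try:
        for i, (header, seq) in enumerate(read_fasta(input_path)):
            if i % max_per_file == 0:
                # Start a new output file every max_per_file records
                if out is not None:
                    out.close()
                paths.append('%s_%d.fa' % (output_prefix, len(paths)))
                out = open(paths[-1], 'w')
            out.write('>%s\n%s\n' % (header, seq))
    finally:
        if out is not None:
            out.close()
    return paths
```

Calling `split_fasta('contigs.fa', 'contigs-mini', 50)` on a 190-sequence file would write `contigs-mini_0.fa` through `contigs-mini_3.fa`, matching the example in the docstring. Unlike the script above, this sketch writes each sequence on a single line and creates no output file when the input is empty.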