Skip to content

Instantly share code, notes, and snippets.

@sinamajidian
Last active August 4, 2020 08:42
Show Gist options
  • Save sinamajidian/afb163c72973b85c8f5de99e79df3bb7 to your computer and use it in GitHub Desktop.
Save sinamajidian/afb163c72973b85c8f5de99e79df3bb7 to your computer and use it in GitHub Desktop.
Extract parts of fasta file for all scaffolds

We have the fasta files of some viral genomes. We want to aligend all of them and extract only one gene of them.

cat *.fasta > all.fasta
cat gene.fasta all.fasta > all_with_gene.fasta

The I ran mafft on them

mafft --auto all_with_gene.fasta > aligned.fasta

The first record in aligned.fasta shows the position of interest. By openning the file in an editor atom.

------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------------------------------atgtttgtgtttctggtgctgctgccgct
ggtgagcagccagtgcgtgaacctgaccacccgcacccagctgccgccggcgtataccaa
cagctttacccgcggcgtgtattatccggataaagtgtttcgcagcagcgtgctgcatag
cacccaggatctgtttctgccgttttttagcaacgtgacctggtttcatgcgattcatgt
gagcggcaccaacggcaccaaacgctttgataacccggtgctgccgtttaacgatggcgt
..
gctgtatcaggatgtgaactgcaccgaagtgccggtggcgattcatgcggatcagctgac
cccgacctggcgcgtgtatagcaccggcagcaacgtgtttcagacccgcgcgggctgcct
gattggcgcggaacatgtgaacaacagctatgaatgcgatattccgattggcgcgggcat
ttgcgcgagctatcagacccagaccaacagcccgcgccgcgcgcgc--------------
------------------------------------------------------------
------------------------------------------------------------

I found out the line number, the start and the end which was 361 and 395, respectively. So, for 0-index python list, we should have

start_line=359
end_line=393

The code extracts these lines from all scaffolds in the fasta files.

The code can be improved for extracting the line numbers, which can also be done in python by checking the first scaffold (as the reference, the gene ), and find the first line containing a/t/c/g rather than complete '-'.

ref_genome=genomes[1]
for line in ref_genome:
    num_dash=0
    for letter in line:
        if letter=='-': num_dash+=1
    first_n
    if num_dash!=60:
        print(line)
        
# a code for extracting a gene from multiple sequence alignemnt output
start_line=359
end_line=393
file_address="aligned_336.fasta"
file_fasta = open(file_address, 'r')
genomes=[]
genome=[]
genomes_name=[]
first_fasta=True # the reference of intrested gene
for line in file_fasta:
line_strip=line.strip()
if line_strip.startswith('>'): # fasta
genomes_name.append(line_strip)
genomes.append(genome)
genome=[]
else:
genome.append(line_strip)
genomes.append(genome)
ref_genome=genomes[1]
line=ref_genome[0]
genomes_trim=[]
for genome in genomes:
genome_trim=genome[start_line:end_line]
genomes_trim.append(genome_trim)
file_fasta_out = open(file_address+"_trimmed.fasta", 'w')
for idx, genome_name in enumerate(genomes_name):
file_fasta_out.write(genome_name+'\n')
genome=genomes_trim[idx+1]
for line in genome:
file_fasta_out.write(line+'\n')
file_fasta_out.close()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment