sinamajidian/README.md

## README.md

      
    Raw
  

              README.md
            
          
    We have the fasta files of some viral genomes. We want to aligend all of them and extract only one gene of them.
cat *.fasta > all.fasta
cat gene.fasta all.fasta > all_with_gene.fasta

The I ran mafft on them
mafft --auto all_with_gene.fasta > aligned.fasta

The first record in aligned.fasta shows the position of interest.
By openning the file in an editor atom.
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------------------------------atgtttgtgtttctggtgctgctgccgct
ggtgagcagccagtgcgtgaacctgaccacccgcacccagctgccgccggcgtataccaa
cagctttacccgcggcgtgtattatccggataaagtgtttcgcagcagcgtgctgcatag
cacccaggatctgtttctgccgttttttagcaacgtgacctggtttcatgcgattcatgt
gagcggcaccaacggcaccaaacgctttgataacccggtgctgccgtttaacgatggcgt
..
gctgtatcaggatgtgaactgcaccgaagtgccggtggcgattcatgcggatcagctgac
cccgacctggcgcgtgtatagcaccggcagcaacgtgtttcagacccgcgcgggctgcct
gattggcgcggaacatgtgaacaacagctatgaatgcgatattccgattggcgcgggcat
ttgcgcgagctatcagacccagaccaacagcccgcgccgcgcgcgc--------------
------------------------------------------------------------
------------------------------------------------------------

I found out the line number, the start and the end which was 361 and 395, respectively. So, for 0-index python list,  we should have
start_line=359
end_line=393

The code extracts these lines from all scaffolds in the fasta files.
The code can be improved for extracting the line numbers,  which can also be done in python by checking the first scaffold (as the reference, the gene ), and find the first line containing a/t/c/g rather than complete '-'.
ref_genome=genomes[1]
for line in ref_genome:
    num_dash=0
    for letter in line:
        if letter=='-': num_dash+=1
    first_n
    if num_dash!=60:
        print(line)
        

## extract_fasta.py


# a code for extracting a gene from multiple sequence alignemnt output


start_line=359
end_line=393
file_address="aligned_336.fasta"


file_fasta = open(file_address, 'r')

genomes=[]
genome=[]
genomes_name=[]
first_fasta=True    # the reference of intrested gene
for line in file_fasta:
    line_strip=line.strip()

    if line_strip.startswith('>'):  # fasta
        genomes_name.append(line_strip)
        genomes.append(genome)
        genome=[]

    else:
        genome.append(line_strip)

genomes.append(genome)


ref_genome=genomes[1]
line=ref_genome[0]


genomes_trim=[]
for genome in genomes:
    genome_trim=genome[start_line:end_line]
    genomes_trim.append(genome_trim)


file_fasta_out = open(file_address+"_trimmed.fasta", 'w')

for idx, genome_name in enumerate(genomes_name):
    file_fasta_out.write(genome_name+'\n')
    genome=genomes_trim[idx+1]
    for line in genome:
        file_fasta_out.write(line+'\n')

file_fasta_out.close()



	# a code for extracting a gene from multiple sequence alignemnt output


	start_line=359
	end_line=393
	file_address="aligned_336.fasta"



	file_fasta = open(file_address, 'r')

	genomes=[]
	genome=[]
	genomes_name=[]
	first_fasta=True # the reference of intrested gene
	for line in file_fasta:
	line_strip=line.strip()

	if line_strip.startswith('>'): # fasta
	genomes_name.append(line_strip)
	genomes.append(genome)
	genome=[]

	else:
	genome.append(line_strip)

	genomes.append(genome)



	ref_genome=genomes[1]
	line=ref_genome[0]


	genomes_trim=[]
	for genome in genomes:
	genome_trim=genome[start_line:end_line]
	genomes_trim.append(genome_trim)



	file_fasta_out = open(file_address+"_trimmed.fasta", 'w')

	for idx, genome_name in enumerate(genomes_name):
	file_fasta_out.write(genome_name+'\n')
	genome=genomes_trim[idx+1]
	for line in genome:
	file_fasta_out.write(line+'\n')

	file_fasta_out.close()