- Understand Tajima's D formula and calculate the number of segregating sites from an aligned FASTA file
Source code
lst = [1, 2, 3, 2, 3, 4, 5, 1, 5] # Initialize a list
print("list =", lst) # show the list
print("set =", set(lst)) # Convert the list into a set object
print(len(set(lst))) # Show the number of element of the set
Result
$ python3 day2_3_example1.py 15-05-10 17:58 list = [1, 2, 3, 2, 3, 4, 5, 1, 5] set = {1, 2, 3, 4, 5} 5
Note
- Set object does not allow duplicate elements and the elements are stored in random order.
- Set object uses braces {} instead of square brakets []
- The duplicate objects are discarded when a list object is converted into a set object, because each element in a set object must be unique.
- Compare two sequences and detect segregating sites
Source coe
seq1 = "ATGC" # First sequence
seq2 = "ATAT" # Second sequence
sequences = [seq1, seq2] # Define a sequences list object having two sequences
for i in range(0, len(seq1)): # Iteration with i that indicates the index of nucleotide position
alleles = [] # Initialize alleles list in which allele data will be stored
for seq in sequences: # Check the two sequences with iteration
alleles.append(seq[i]) # Each nucleotide at the same position is added in the alleles list
print(alleles, end="\t")
if len(set(alleles)) > 1: # Check the number of unique element in the alleles
print("segregated") # in the case of more than two alleles are detected
else:
print("not segregated") # in the case of only one kind of nucleotide is detected at the position
Result
$ python3 day2_3_example2.py ['A', 'A'] not segregated ['T', 'T'] not segregated ['G', 'A'] segregated ['C', 'T'] segregated
Note
- Segregating site is the position of single nucleotide polymorphism (SNP).
- Please be careful to look at the double loop structure, namely the sequences list is like a two dimensional structure
- The first iteration is a process of selecting a sequence, and the second iteration is a process of selecting each position of the selected sequence
- Calculate the total number of segregating sites in an aligned sequences loading from a FASTA file
Template
import sys # import sys module for open/close methods
sequences = []
f = open(sys.argv[1])
for line in f:
if not line[0] == ">":
sequences.append(line.rstrip())
f.close()
total_seg_sites = 0 # Initialize a variable to store the total number of segregating sites
#
# Implement a process here
#
print("total segregating sites =", total_seg_sites) # Show the total number of segregating sites
Expected result
$ python3 day2_3_exercise1.py input.fa ['G'] not segregating ['A'] not segregating ['A'] not segregating ['G'] not segregating ['A'] not segregating ['C', 'T'] segregating ['T'] not segregating ['A', 'T'] segregating ['G'] not segregating ['T'] not segregating Total segregating sites = 2
Note
- tab space is inserted before "not segregating" and "segregating"
- Define two functions
- A function to load a FASTA file
- A function to calculate segregated sites
- The function interface (arguments and return value) should be as follows:
import sys # Import sys module
def load_fasta(fasta): # Define load_fasta function, fasta is an argument data file
sequences = [] # Initialize seuqnces list
#
# Your code here
#
return sequences # Return the sequences list
def count_seg_sites(sequences): # Define a function to count total number of segregating sites from sequences list object
total_seg_sites = 0 # Initialize a variable to keep the total number ofsegregating sites
#
# Your code here
#
return total_seg_sites # Return the total number of segregaed sites
sequences = load_fasta(sys.argv[1]) # Call load_fasta function with a fasta file path and load sequence data
total_seg_sites = count_seg_sites(sequences) # Calculate the total number of segregating sites
print("Total segregating sites =", total_seg_sites) # Show the result
Note
- The command line arguments and the result should be the same as the exercise1
- Calculate MAF (minor allele frequency) in each segregating site
MAF (minor allele frequency):
- Allele frequency is the ratio of each allele
- Minor allele frequency is the one which is less than the other allele frequency
For example
ATGC ATGC ATGC TTGC
- At the first position of the sequences, allele frequency A:T=3/4:1/4
- In this case, MAF = 1/4
Expected result
$ python3 day2_3_advanced2.py input.fa ['G'] not segregating ['A'] not segregating ['A'] not segregating ['G'] not segregating ['A'] not segregating ['C', 'T'] segregating, MAF= 0.214 ['T'] not segregating ['A', 'T'] segregating, MAF= 0.214 ['G'] not segregating ['T'] not segregating Total segregating sites = 2
Note
- tab space is inserted before "not segregating" and "segregating"
- Implement a Python code to calculate the Derived Allele Frequency (DAF) from given FASTA files.
-
Definition of Derived Allele Frequency (DAF):
- The Derived Allele Frequency (DAF) is the frequency of the allele that has arisen as a mutation in the population of interest, relative to an outgroup (reference) population. The derived allele is the one that is different from the allele present in the common ancestor, which can be inferred using the outgroup.
-
Task:
- Write a Python script that reads a FASTA file representing the sequences of a population (ingroup) and another FASTA file representing the sequences of an outgroup (reference).
- The script should determine the derived allele at each segregating site by comparing the alleles in the ingroup with the outgroup.
- Calculate the Derived Allele Frequency (DAF) for each site.
- Print the position, DAF, and the derived allele for each site.
-
If you do not understand the definition of DAF:
- Use an LLM (Language Language Model) tool to look up and understand the definition and calculation of DAF before proceeding with the implementation.
Input:
input.fa
: A FASTA file containing the sequences of the ingroup.outgroup.fa
: A FASTA file containing the sequences of the outgroup.
Output:
- For each site, output the position, DAF, and the derived allele.
- Output the list of DAFs for all sites.