masaomi/bio334_day2_part3.md Secret

## bio334_day2_part3.md

      
    Raw
  

              bio334_day2_part3.md
            
          
    Day2 Part3 11-12

Goal


Understand Tajima's D formula and calculate the number of segregating sites from an aligned FASTA file

Source code
lst = [1, 2, 3, 2, 3, 4, 5, 1, 5] # Initialize a list
                                              
print("list =", lst)              # show the list
print("set =", set(lst))          # Convert the list into a set object
print(len(set(lst)))              # Show the number of element of the set
Result
$ python3 day2_3_example1.py                                                                                               15-05-10 17:58
list = [1, 2, 3, 2, 3, 4, 5, 1, 5]
set = {1, 2, 3, 4, 5}
5

Note

Set object does not allow duplicate elements and the elements are stored in random order.
Set object uses braces {} instead of square brakets []
The duplicate objects are discarded when a list object is converted into a set object, because each element in a set object must be unique.

Example2


Compare two sequences and detect segregating sites

Source coe
seq1 = "ATGC"                   # First sequence
seq2 = "ATAT"                   # Second sequence 

sequences = [seq1, seq2]        # Define a sequences list object having two sequences

for i in range(0, len(seq1)):   # Iteration with i that indicates the index of nucleotide position
    alleles = []                # Initialize alleles list in which allele data will be stored
    for seq in sequences:       # Check the two sequences with iteration
        alleles.append(seq[i])  # Each nucleotide at the same position is added in the alleles list
    print(alleles, end="\t")
    if len(set(alleles)) > 1:   # Check the number of unique element in the alleles 
        print("segregated")     # in the case of more than two alleles are detected
    else:
        print("not segregated")   # in the case of only one kind of nucleotide is detected at the position
Result
$ python3 day2_3_example2.py
['A', 'A']	not segregated
['T', 'T']	not segregated
['G', 'A']	segregated
['C', 'T']	segregated

Note

Segregating site is the position of single nucleotide polymorphism (SNP).
Please be careful to look at the double loop structure, namely the sequences list is like a two dimensional structure
The first iteration is a process of selecting a sequence, and the second iteration is a process of selecting each position of the selected sequence

Exercise1


Calculate the total number of segregating sites in an aligned sequences loading from a FASTA file

Template
import sys                                  # import sys module for open/close methods

sequences = []
f = open(sys.argv[1])
for line in f:
    if not line[0] == ">":
        sequences.append(line.rstrip())
f.close()

total_seg_sites = 0                         # Initialize a variable to store the total number of segregating sites

#
# Implement a process here
#            
 
print("total segregating sites =", total_seg_sites)  # Show the total number of segregating sites
Expected result
$ python3 day2_3_exercise1.py input.fa
['G']	not segregating
['A']	not segregating
['A']	not segregating
['G']	not segregating
['A']	not segregating
['C', 'T']	segregating
['T']	not segregating
['A', 'T']	segregating
['G']	not segregating
['T']	not segregating
Total segregating sites = 2

Note

tab space is inserted before "not segregating" and "segregating"

Advanced exercise1


Define two functions
A function to load a FASTA file
A function to calculate segregated sites
The function interface (arguments and return value) should be as follows:

import sys                  # Import sys module

def load_fasta(fasta):      # Define load_fasta function, fasta is an argument data file
    sequences = []          # Initialize seuqnces list

    #
    # Your code here
    #

    return sequences        # Return the sequences list
    
def count_seg_sites(sequences):             # Define a function to count total number of segregating sites from sequences list object
    total_seg_sites = 0                     # Initialize a variable to keep the total number ofsegregating sites 

    #
    # Your code here
    #

    return total_seg_sites                  # Return the total number of segregaed sites
          
sequences = load_fasta(sys.argv[1])                 # Call load_fasta function with a fasta file path and load sequence data
total_seg_sites = count_seg_sites(sequences)        # Calculate the total number of segregating sites
print("Total segregating sites =", total_seg_sites)  # Show the result
Note

The command line arguments and the result should be the same as the exercise1

Advanced exercise2


Calculate MAF (minor allele frequency) in each segregating site

MAF (minor allele frequency):

Allele frequency is the ratio of each allele
Minor allele frequency is the one which is less than the other allele frequency

For example
ATGC
ATGC
ATGC
TTGC


At the first position of the sequences, allele frequency A:T=3/4:1/4
In this case, MAF = 1/4

Expected result
$ python3 day2_3_advanced2.py input.fa
['G']	not segregating
['A']	not segregating
['A']	not segregating
['G']	not segregating
['A']	not segregating
['C', 'T']	segregating, MAF= 0.214
['T']	not segregating
['A', 'T']	segregating, MAF= 0.214
['G']	not segregating
['T']	not segregating
Total segregating sites = 2

Note

tab space is inserted before "not segregating" and "segregating"

Advanced exercise3 (LLM)


Implement a Python code to calculate the Derived Allele Frequency (DAF) from given FASTA files.


Definition of Derived Allele Frequency (DAF):

The Derived Allele Frequency (DAF) is the frequency of the allele that has arisen as a mutation in the population of interest, relative to an outgroup (reference) population. The derived allele is the one that is different from the allele present in the common ancestor, which can be inferred using the outgroup.


Task:

Write a Python script that reads a FASTA file representing the sequences of a population (ingroup) and another FASTA file representing the sequences of an outgroup (reference).
The script should determine the derived allele at each segregating site by comparing the alleles in the ingroup with the outgroup.
Calculate the Derived Allele Frequency (DAF) for each site.
Print the position, DAF, and the derived allele for each site.


If you do not understand the definition of DAF:

Use an LLM (Language Language Model) tool to look up and understand the definition and calculation of DAF before proceeding with the implementation.


Input:

input.fa: A FASTA file containing the sequences of the ingroup.
outgroup.fa: A FASTA file containing the sequences of the outgroup.

Output:

For each site, output the position, DAF, and the derived allele.
Output the list of DAFs for all sites.