Skip to content

Instantly share code, notes, and snippets.

@masaomi
Last active May 21, 2024 15:59
Show Gist options
  • Save masaomi/c5a11879a9deefccd40c57c1ba975b3a to your computer and use it in GitHub Desktop.
Save masaomi/c5a11879a9deefccd40c57c1ba975b3a to your computer and use it in GitHub Desktop.

Day2 Part3 11-12

Goal

  • Understand Tajima's D formula and calculate the number of segregating sites from an aligned FASTA file

Source code

lst = [1, 2, 3, 2, 3, 4, 5, 1, 5] # Initialize a list
                                              
print("list =", lst)              # show the list
print("set =", set(lst))          # Convert the list into a set object
print(len(set(lst)))              # Show the number of element of the set

Result

$ python3 day2_3_example1.py                                                                                               15-05-10 17:58
list = [1, 2, 3, 2, 3, 4, 5, 1, 5]
set = {1, 2, 3, 4, 5}
5

Note

  • Set object does not allow duplicate elements and the elements are stored in random order.
  • Set object uses braces {} instead of square brakets []
  • The duplicate objects are discarded when a list object is converted into a set object, because each element in a set object must be unique.

Example2

  • Compare two sequences and detect segregating sites

Source coe

seq1 = "ATGC"                   # First sequence
seq2 = "ATAT"                   # Second sequence 

sequences = [seq1, seq2]        # Define a sequences list object having two sequences

for i in range(0, len(seq1)):   # Iteration with i that indicates the index of nucleotide position
    alleles = []                # Initialize alleles list in which allele data will be stored
    for seq in sequences:       # Check the two sequences with iteration
        alleles.append(seq[i])  # Each nucleotide at the same position is added in the alleles list
    print(alleles, end="\t")
    if len(set(alleles)) > 1:   # Check the number of unique element in the alleles 
        print("segregated")     # in the case of more than two alleles are detected
    else:
        print("not segregated")   # in the case of only one kind of nucleotide is detected at the position

Result

$ python3 day2_3_example2.py
['A', 'A']	not segregated
['T', 'T']	not segregated
['G', 'A']	segregated
['C', 'T']	segregated

Note

  • Segregating site is the position of single nucleotide polymorphism (SNP).
  • Please be careful to look at the double loop structure, namely the sequences list is like a two dimensional structure
  • The first iteration is a process of selecting a sequence, and the second iteration is a process of selecting each position of the selected sequence

Exercise1

  • Calculate the total number of segregating sites in an aligned sequences loading from a FASTA file

Template

import sys                                  # import sys module for open/close methods

sequences = []
f = open(sys.argv[1])
for line in f:
    if not line[0] == ">":
        sequences.append(line.rstrip())
f.close()

total_seg_sites = 0                         # Initialize a variable to store the total number of segregating sites

#
# Implement a process here
#            
 
print("total segregating sites =", total_seg_sites)  # Show the total number of segregating sites

Expected result

$ python3 day2_3_exercise1.py input.fa
['G']	not segregating
['A']	not segregating
['A']	not segregating
['G']	not segregating
['A']	not segregating
['C', 'T']	segregating
['T']	not segregating
['A', 'T']	segregating
['G']	not segregating
['T']	not segregating
Total segregating sites = 2

Note

  • tab space is inserted before "not segregating" and "segregating"

Advanced exercise1

  • Define two functions
  • A function to load a FASTA file
  • A function to calculate segregated sites
  • The function interface (arguments and return value) should be as follows:
import sys                  # Import sys module

def load_fasta(fasta):      # Define load_fasta function, fasta is an argument data file
    sequences = []          # Initialize seuqnces list

    #
    # Your code here
    #

    return sequences        # Return the sequences list
    
def count_seg_sites(sequences):             # Define a function to count total number of segregating sites from sequences list object
    total_seg_sites = 0                     # Initialize a variable to keep the total number ofsegregating sites 

    #
    # Your code here
    #

    return total_seg_sites                  # Return the total number of segregaed sites
          
sequences = load_fasta(sys.argv[1])                 # Call load_fasta function with a fasta file path and load sequence data
total_seg_sites = count_seg_sites(sequences)        # Calculate the total number of segregating sites
print("Total segregating sites =", total_seg_sites)  # Show the result

Note

  • The command line arguments and the result should be the same as the exercise1

Advanced exercise2

  • Calculate MAF (minor allele frequency) in each segregating site

MAF (minor allele frequency):

  • Allele frequency is the ratio of each allele
  • Minor allele frequency is the one which is less than the other allele frequency

For example

ATGC
ATGC
ATGC
TTGC
  • At the first position of the sequences, allele frequency A:T=3/4:1/4
  • In this case, MAF = 1/4

Expected result

$ python3 day2_3_advanced2.py input.fa
['G']	not segregating
['A']	not segregating
['A']	not segregating
['G']	not segregating
['A']	not segregating
['C', 'T']	segregating, MAF= 0.214
['T']	not segregating
['A', 'T']	segregating, MAF= 0.214
['G']	not segregating
['T']	not segregating
Total segregating sites = 2

Note

  • tab space is inserted before "not segregating" and "segregating"

Advanced exercise3 (LLM)

  • Implement a Python code to calculate the Derived Allele Frequency (DAF) from given FASTA files.
  1. Definition of Derived Allele Frequency (DAF):

    • The Derived Allele Frequency (DAF) is the frequency of the allele that has arisen as a mutation in the population of interest, relative to an outgroup (reference) population. The derived allele is the one that is different from the allele present in the common ancestor, which can be inferred using the outgroup.
  2. Task:

    • Write a Python script that reads a FASTA file representing the sequences of a population (ingroup) and another FASTA file representing the sequences of an outgroup (reference).
    • The script should determine the derived allele at each segregating site by comparing the alleles in the ingroup with the outgroup.
    • Calculate the Derived Allele Frequency (DAF) for each site.
    • Print the position, DAF, and the derived allele for each site.
  3. If you do not understand the definition of DAF:

    • Use an LLM (Language Language Model) tool to look up and understand the definition and calculation of DAF before proceeding with the implementation.

Input:

  • input.fa: A FASTA file containing the sequences of the ingroup.
  • outgroup.fa: A FASTA file containing the sequences of the outgroup.

Output:

  • For each site, output the position, DAF, and the derived allele.
  • Output the list of DAFs for all sites.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment