Skip to content

Instantly share code, notes, and snippets.

@masaomi
Last active May 24, 2024 12:38
Show Gist options
  • Save masaomi/1397a32c4b870f7ab7e92f479770788d to your computer and use it in GitHub Desktop.
Save masaomi/1397a32c4b870f7ab7e92f479770788d to your computer and use it in GitHub Desktop.

Day3 Final

Goal

  • Understand VCF file format and process the SNP data by using Dictionary data structure

Example1

  • Load VCF files in a current directory and keep both position and nucleotide in a list of dictionary objects

day3_example1.py

import glob

samples = []
for file_name in glob.glob("*.vcf"):
    f = open(file_name)
    snps = {}
    for line in f:
        pos = line.split()[2]
        nuc = line.split()[4]
        snps[pos] = nuc
    f.close()
    samples.append(snps)

print(samples)

Result

$ python3 day3_example1.py
[{'1': 'A', '3': 'G', '4': 'C'}, {'1': 'A', '4': 'T'}, {'1': 'A', '3': 'G', '4': 'C'}]

Question

  • What sort of things do you find out from this information?

Note

  • From this information, you can infer the following result:
Position sample1  sample2  sample3 SNP?
1 A A A No
2 N N N No
3 G N G Yes
4 C T C Yes
  • N means that a SNP is not detected, in other words, we do not know whether it is a SNP site or not, but simply we assume that it is a the same nucleotide with the reference sequence in this exercise, though we do not know what it is (A, T, G, or C).
    • Actually, we do not know whether the 3rd position is a SNP or not, but simply we assume it is a SNP in this exercise
  • so, we can know the following information even without the reference sequence under the assumptions above
    • the number of SNPs == 2
    • sample1 and sample3 are exactly the same sequence

Example2

  • Calculate the number of segregating sites from two SNP data in dictionary

day3_example2.py

snps1 = {1: 'A', 3: 'G', 4: 'C'}
snps2 = {1: 'A', 4: 'T'}

print("Sample1: ", snps1)
print("Sample2: ", snps2)
pos1 = set(snps1.keys())
pos2 = set(snps2.keys())
common_pos = pos1 & pos2
exclor_pos = pos1 ^ pos2

diff = len(exclor_pos)
for pos in common_pos:
    if not snps1[pos] == snps2[pos]:
        diff += 1

print("The number of segregating sites = ", diff)

Result

$ python3 day3_example2.py
Sample1:  {1: 'A', 3: 'G', 4: 'C'}
Sample2:  {1: 'A', 4: 'T'}
The number of segregating sites =  2

Note

  • There are two cases of a different site in two SNP data:
    1. SNP information in either sample at the same position
    2. SNP information in both samples at the same position and the SNPs are different

Exercise1

  • Calculate the nucleotide diversity by loading VCF files in the current directory (assuming that the length of reference == 4)

Hint

  • You may need to calculate the total number of pair-wise differences
  • The result should be same to the result in Day1 Part3 Advanced exercise (day1_3_advanced.py)
  • Let's think why it becomes same!!

Expected result (day3_exercise1.py)

$ python3 day3_exercise1.py
sample1.vcf :  [('1', 'A'), ('3', 'G'), ('4', 'C')]
sample2.vcf :  [('1', 'A'), ('4', 'T')]
sample3.vcf :  [('1', 'A'), ('3', 'G'), ('4', 'C')]
compare: sample1 & sample2: diff=2
compare: sample1 & sample3: diff=0
compare: sample2 & sample3: diff=2
pi = 0.333333

Exercise2

  • Calculate the nucleotide diversity of the five variants in chromosome1 of Arabidopsis thaliana

Reference

Data

  • SNP data (Athaliana_SNPs/*.txt)
  • A. thaliana chromosome1 reference sequence (TAIR10_chr1.fas, the length of Chr1: 30,427,671 [bp])

Hint

  • Each file of SNP data contains only the SNP information (alternative nucleotide and position against the reference)
  • There might be several ways to calculate them
  • The simplest way would be as follows:
    • Prepare multiple aligned FASTA files from SNP data and reference sequence in chromosome 1 (TAIR10_chr1.fas)
    • Then apply your Python code (e.g. day2_4_exercise1.py) to them
  • As another solution, you can directly apply the exercise code above (day3_exercise1.py) to it without making multialigned fasta file.

SNP data

  • e.g. quality_variant_108_TAIR10_Chr1.txt
108 Chr1  83  T C 40  41  1 1
  • The first column is the line number
  • The third column is the SNP position
  • The fourth is the reference nucleotide
  • The fifth is the alternative nucleotide
  • This file includes also SNP information in non-coding region

Expected result (day3_exercise2.py)

$ python3 day3_exercise2.py Athaliana_SNPs
Loading  Athaliana_SNPs/quality_variant_108_TAIR10_Chr1.txt
Loading  Athaliana_SNPs/quality_variant_139_TAIR10_Chr1.txt
Loading  Athaliana_SNPs/quality_variant_159_TAIR10_Chr1.txt
Loading  Athaliana_SNPs/quality_variant_265_TAIR10_Chr1.txt
Loading  Athaliana_SNPs/quality_variant_350_TAIR10_Chr1.txt
compare: 0, 1 diff=49723
compare: 0, 2 diff=268146
compare: 0, 3 diff=280398
compare: 0, 4 diff=253099
compare: 1, 2 diff=264772
compare: 1, 3 diff=277197
compare: 1, 4 diff=249935
compare: 2, 3 diff=275780
compare: 2, 4 diff=241002
compare: 3, 4 diff=274920
pi = 0.008002

Advanced exercise1

  • Calculate Nucleotide diversity of each gene in Arabidopsis thaliana chromosome1
  • Plot them as well as the day2 part5 exercise1 (A. kamchatica homeologs)

Data

  • SNP data (Athaliana_SNPs/*.txt)
  • A. thaliana chromosome1 reference sequence (TAIR10_chr1.fas)
  • A. thaliana chromosome1 gene annotation data (TAIR10_chr1.gff)

Hints

  • The simplest way would be as follows:
    • Make multiple aligned FASTA files from SNP data and reference sequence and annotation data in each gene of chromosome 1
    • Then you can apply your Python code that you made in the previous exercises to them
  • A batch shell script may be useful to run the scripts for all the genes
  • It would be great if you could directly calculate the nucleotide diversity from the VCF files without making the multiple aligned FASTA files (which would be much faster to calculate, though the implementation might cost more)

GFF (TAIR10_chr1.gff)

Chr1  TAIR10  gene  3631  5899  . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
  • The third column (separated by space) contains the annotation type
  • The fourth column is the starting position, and the fifth is the end position of the annotation on the reference sequence, TAIR10_chr1.fas.

Note

  • It can search the candidate regions under selecting by comparing the nucleotide diversities (it is expected that the naturally selected regions (genes) have lower nucleotide diversity than others).

Expected result (Athaliana_chr1_genes_pi.png) Athaliana_chr1_genes_pi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment