masaomi/bio334_day3_final.md Secret

## bio334_day3_final.md

      
    Raw
  

              bio334_day3_final.md
            
          
    Day3 Final

Goal


Understand VCF file format and process the SNP data by using Dictionary data structure

Example1


Load VCF files in a current directory and keep both position and nucleotide in a list of dictionary objects

day3_example1.py
import glob

samples = []
for file_name in glob.glob("*.vcf"):
    f = open(file_name)
    snps = {}
    for line in f:
        pos = line.split()[2]
        nuc = line.split()[4]
        snps[pos] = nuc
    f.close()
    samples.append(snps)

print(samples)
Result
$ python3 day3_example1.py
[{'1': 'A', '3': 'G', '4': 'C'}, {'1': 'A', '4': 'T'}, {'1': 'A', '3': 'G', '4': 'C'}]
Question

What sort of things do you find out from this information?

Note

From this information, you can infer the following result:


Position
sample1
 sample2
 sample3
SNP?


1
A
A
A
No


2
N
N
N
No


3
G
N
G
Yes


4
C
T
C
Yes


N means that a SNP is not detected, in other words, we do not know whether it is a SNP site or not, but simply we assume that it is a the same nucleotide with the reference sequence in this exercise, though we do not know what it is (A, T, G, or C).

Actually, we do not know whether the 3rd position is a SNP or not, but simply we assume it is a SNP in this exercise


so, we can know the following information even without the reference sequence under the assumptions above

the number of SNPs == 2
sample1 and sample3 are exactly the same sequence


Example2


Calculate the number of segregating sites from two SNP data in dictionary

day3_example2.py
snps1 = {1: 'A', 3: 'G', 4: 'C'}
snps2 = {1: 'A', 4: 'T'}

print("Sample1: ", snps1)
print("Sample2: ", snps2)
pos1 = set(snps1.keys())
pos2 = set(snps2.keys())
common_pos = pos1 & pos2
exclor_pos = pos1 ^ pos2

diff = len(exclor_pos)
for pos in common_pos:
    if not snps1[pos] == snps2[pos]:
        diff += 1

print("The number of segregating sites = ", diff)
Result
$ python3 day3_example2.py
Sample1:  {1: 'A', 3: 'G', 4: 'C'}
Sample2:  {1: 'A', 4: 'T'}
The number of segregating sites =  2
Note

There are two cases of a different site in two SNP data:

SNP information in either sample at the same position
SNP information in both samples at the same position and the SNPs are different


Exercise1


Calculate the nucleotide diversity by loading VCF files in the current directory (assuming that the length of reference == 4)

Hint

You may need to calculate the total number of pair-wise differences
The result should be same to the result in Day1 Part3 Advanced exercise (day1_3_advanced.py)
Let's think why it becomes same!!

Expected result (day3_exercise1.py)
$ python3 day3_exercise1.py
sample1.vcf :  [('1', 'A'), ('3', 'G'), ('4', 'C')]
sample2.vcf :  [('1', 'A'), ('4', 'T')]
sample3.vcf :  [('1', 'A'), ('3', 'G'), ('4', 'C')]
compare: sample1 & sample2: diff=2
compare: sample1 & sample3: diff=0
compare: sample2 & sample3: diff=2
pi = 0.333333
Exercise2


Calculate the nucleotide diversity of the five variants in chromosome1 of Arabidopsis thaliana

Reference

http://1001genomes.org/
"The 1001 Genomes Consortium, 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana, Cell, 2016":http://www.cell.com/cell/pdfExtended/S0092-8674(16)30667-5

Data

SNP data (Athaliana_SNPs/*.txt)
A. thaliana chromosome1 reference sequence (TAIR10_chr1.fas, the length of Chr1: 30,427,671 [bp])

Hint

Each file of SNP data contains only the SNP information (alternative nucleotide and position against the reference)
There might be several ways to calculate them
The simplest way would be as follows:

Prepare multiple aligned FASTA files from SNP data and reference sequence in chromosome 1 (TAIR10_chr1.fas)
Then apply your Python code (e.g. day2_4_exercise1.py) to them


As another solution, you can directly apply the exercise code above (day3_exercise1.py) to it without making multialigned fasta file.

SNP data

e.g. quality_variant_108_TAIR10_Chr1.txt

108 Chr1  83  T C 40  41  1 1


The first column is the line number
The third column is the SNP position
The fourth is the reference nucleotide
The fifth is the alternative nucleotide
This file includes also SNP information in non-coding region

Expected result (day3_exercise2.py)
$ python3 day3_exercise2.py Athaliana_SNPs
Loading  Athaliana_SNPs/quality_variant_108_TAIR10_Chr1.txt
Loading  Athaliana_SNPs/quality_variant_139_TAIR10_Chr1.txt
Loading  Athaliana_SNPs/quality_variant_159_TAIR10_Chr1.txt
Loading  Athaliana_SNPs/quality_variant_265_TAIR10_Chr1.txt
Loading  Athaliana_SNPs/quality_variant_350_TAIR10_Chr1.txt
compare: 0, 1 diff=49723
compare: 0, 2 diff=268146
compare: 0, 3 diff=280398
compare: 0, 4 diff=253099
compare: 1, 2 diff=264772
compare: 1, 3 diff=277197
compare: 1, 4 diff=249935
compare: 2, 3 diff=275780
compare: 2, 4 diff=241002
compare: 3, 4 diff=274920
pi = 0.008002

Advanced exercise1


Calculate Nucleotide diversity of each gene in Arabidopsis thaliana chromosome1
Plot them as well as the day2 part5 exercise1 (A. kamchatica homeologs)

Data

SNP data (Athaliana_SNPs/*.txt)
A. thaliana chromosome1 reference sequence (TAIR10_chr1.fas)
A. thaliana chromosome1 gene annotation data (TAIR10_chr1.gff)

Hints

The simplest way would be as follows:

Make multiple aligned FASTA files from SNP data and reference sequence and annotation data in each gene of chromosome 1
Then you can apply your Python code that you made in the previous exercises to them


A batch shell script may be useful to run the scripts for all the genes
It would be great if you could directly calculate the nucleotide diversity from the VCF files without making the multiple aligned FASTA files (which would be much faster to calculate, though the implementation might cost more)

GFF (TAIR10_chr1.gff)
Chr1  TAIR10  gene  3631  5899  . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010


The third column (separated by space) contains the annotation type
The fourth column is the starting position, and the fifth is the end position of the annotation on the reference sequence, TAIR10_chr1.fas.

Note

It can search the candidate regions under selecting by comparing the nucleotide diversities (it is expected that the naturally selected regions (genes) have lower nucleotide diversity than others).

Expected result (Athaliana_chr1_genes_pi.png)