Last active
June 10, 2016 18:15
-
-
Save walterst/802b37ee4ec8f4ea3cbf64c60a637c38 to your computer and use it in GitHub Desktop.
Description of process and scripts used to count nucleotide differences within target genera
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
We want to ask the question of how different sequences are within certain genera. In this case, I was looking at Prevotella, | |
Bacteroides, and Porphyromonas genera within Bacteroidetes, and the distance between sequences are a count of nucleotide differences | |
divided by the length of the sequence considered. | |
To do this, I used the 99% OTUs (16S only) from the SILVA 123 release, available here: | |
http://www.arb-silva.de/no_cache/download/archive/qiime/ | |
We want to minimize the number of sequences included that may erroneously be labeled as the target taxa, but fall on other parts of | |
the Bacteroidetes tree with other taxa, rather than grouped with the target genus. My goal is to find a node within a Bacteroidetes | |
tree whose descendents are all or mostly the target genus while retaining the most possible tips that contain the | |
target genus. | |
First, all Bacteroidetes IDs were parsed from the 99% OTUs taxonomy file, using a grep command looking for the string: | |
"Bacteria;Bacteroidetes;" and these were written to a text file. An outgroup Dictyglomi sequence (CP001146.624121.625660) was added | |
to this file as an outgroup. QIIME's (version 1.9.1) filter_fasta.py script was called using this text file as --seq_id_fp input to | |
create a fasta file with only Bacteroidetes + the single group sequences in it. | |
This fasta file was filtered to remove sequences shorter than 1300 base pairs. Custom script to filter short reads here: | |
https://gist.github.com/walterst/edde79da8912759285567f6f4cb50c8c | |
These filtered reads were then aligned using the default settings for QIIME's align_seqs.py script, using Greengenes | |
core_set_aligned.imputed file as the template for the alignment. Next the alignment was filtered using these settings to | |
filter high entropy/gapped positions: filter_alignment.py --entropy_threshold 0.10 --allowed_gap_frac 0.80 | |
and a tree was built using the make_phylogeny.py with default settings. | |
To step through the resulting tree's internal nodes and find the optimal node (descending tips contain the most possible | |
target genus reads with <1% nontarget genera). The total tree branch length and branch length descending from this | |
node are also recorded by this script. Some cleanup of taxonomy was used to group the target genera taxa strings together. | |
Script location: https://gist.github.com/walterst/bd0801224dbf93608afaf77a9f55bc5f | |
The resulting output files of the script were gone through by hand to find the largest target genus counts while retaining | |
the most specificity possible. breakdown of the results and choices are below (bracketed lists show the non-target taxa | |
in the group that was chosen): | |
Total tree branch lengths 843.43 | |
porphyromonas (able to get all 553 tips) | |
porphyromonas tips Other tips Fraction porphyromonas Branch length | |
553 1 1.00 8.26 | |
[('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Porphyromonas', 553), | |
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;ratAN060301C', 1)] | |
prevotella (total 5630 tips, chose result with 4713 for specificity) | |
4713 71 0.99 55.94 | |
[('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella', 4713), | |
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;uncultured', 69), | |
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidales S24-7 group', 2)] | |
Next Prevotella group dropped to 3363, so chose this a cutoff. | |
Bacteroides (total 3820 tips, seems to be largely mixed with other taxa) | |
1487 53 0.97 16.24 | |
[('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides', 1487), | |
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidales S24-7 group', 47), | |
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidales Incertae Sedis;Phocaeicola', 2), | |
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella', 2), | |
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Barnesiella', 1), | |
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;uncultured', 1)] | |
From these resulting text files, the target IDs for the optimal portion of the tree selection were then selected, and | |
copied into new text files, and used as --seq_id_fp parameters with filter_fasta.py to create selected fasta files | |
for Bacteroides, Prevotella, and Porphyromonas. Each of these were aligned using same align_seqs.py approach described above. | |
For each of these alignments, I want to take each aligned read and count the nucleotide differences between each read, | |
except for self-self. As I don't want to consider differences in sequence length at the ends as actual differences | |
between the reads, I only consider a slice of the alignment in this calculation that removes all differences in overhangs | |
at the ends. The script for getting these counts | |
is here: https://gist.github.com/walterst/a46b774444d882d96bd2e1f72d7c179e | |
The output of this script for each of the three alignments was used for generating histograms of the distances within each | |
genus. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment