walterst/workflow_genus_distances.txt

## workflow_genus_distances.txt
We want to ask the question of how different sequences are within certain genera. In this case, I was looking at Prevotella,
Bacteroides, and Porphyromonas genera within Bacteroidetes, and the distance between sequences are a count of nucleotide differences
divided by the length of the sequence considered.

To do this, I used the 99% OTUs (16S only) from the SILVA 123 release, available here:
http://www.arb-silva.de/no_cache/download/archive/qiime/

We want to minimize the number of sequences included that may erroneously be labeled as the target taxa, but fall on other parts of
the Bacteroidetes tree with other taxa, rather than grouped with the target genus. My goal is to find a node within a Bacteroidetes
tree whose descendents are all or mostly the target genus while retaining the most possible tips that contain the
target genus.

First, all Bacteroidetes IDs were parsed from the 99% OTUs taxonomy file, using a grep command looking for the string:
"Bacteria;Bacteroidetes;" and these were written to a text file. An outgroup Dictyglomi sequence (CP001146.624121.625660) was added
to this file as an outgroup. QIIME's (version 1.9.1) filter_fasta.py script was called using this text file as --seq_id_fp input to
create a fasta file with only Bacteroidetes + the single group sequences in it.

This fasta file was filtered to remove sequences shorter than 1300 base pairs. Custom script to filter short reads here:
https://gist.github.com/walterst/edde79da8912759285567f6f4cb50c8c

These filtered reads were then aligned using the default settings for QIIME's align_seqs.py script, using Greengenes
core_set_aligned.imputed file as the template for the alignment. Next the alignment was filtered using these settings to
filter high entropy/gapped positions: filter_alignment.py --entropy_threshold 0.10 --allowed_gap_frac 0.80
and a tree was built using the make_phylogeny.py with default settings.

To step through the resulting tree's internal nodes and find the optimal node (descending tips contain the most possible
target genus reads with <1% nontarget genera). The total tree branch length and branch length descending from this
node are also recorded by this script. Some cleanup of taxonomy was used to group the target genera taxa strings together.
Script location: https://gist.github.com/walterst/bd0801224dbf93608afaf77a9f55bc5f

The resulting output files of the script were gone through by hand to find the largest target genus counts while retaining
the most specificity possible. breakdown of the results and choices are below (bracketed lists show the non-target taxa
in the group that was chosen):
Total tree branch lengths 843.43

porphyromonas (able to get all 553 tips)
porphyromonas tips	Other tips	Fraction porphyromonas	Branch length
553	1	1.00	8.26
[('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Porphyromonas', 553),
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;ratAN060301C', 1)]
prevotella (total 5630 tips, chose result with 4713 for specificity)
4713	71	0.99	55.94
[('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella', 4713),
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;uncultured', 69),
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidales S24-7 group', 2)]
Next Prevotella group dropped to 3363, so chose this a cutoff.

Bacteroides (total 3820 tips, seems to be largely mixed with other taxa)
1487	53	0.97	16.24
[('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides', 1487),
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidales S24-7 group', 47),
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidales Incertae Sedis;Phocaeicola', 2),
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella', 2),
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Barnesiella', 1),
('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;uncultured', 1)]

From these resulting text files, the target IDs for the optimal portion of the tree selection were then selected, and
copied into new text files, and used as --seq_id_fp parameters with filter_fasta.py to create selected fasta files
for Bacteroides, Prevotella, and Porphyromonas. Each of these were aligned using same align_seqs.py approach described above.

For each of these alignments, I want to take each aligned read and count the nucleotide differences between each read,
except for self-self. As I don't want to consider differences in sequence length at the ends as actual differences
between the reads, I only consider a slice of the alignment in this calculation that removes all differences in overhangs
at the ends. The script for getting these counts
is here: https://gist.github.com/walterst/a46b774444d882d96bd2e1f72d7c179e
The output of this script for each of the three alignments was used for generating histograms of the distances within each
genus.
	We want to ask the question of how different sequences are within certain genera. In this case, I was looking at Prevotella,
	Bacteroides, and Porphyromonas genera within Bacteroidetes, and the distance between sequences are a count of nucleotide differences
	divided by the length of the sequence considered.

	To do this, I used the 99% OTUs (16S only) from the SILVA 123 release, available here:
	http://www.arb-silva.de/no_cache/download/archive/qiime/

	We want to minimize the number of sequences included that may erroneously be labeled as the target taxa, but fall on other parts of
	the Bacteroidetes tree with other taxa, rather than grouped with the target genus. My goal is to find a node within a Bacteroidetes
	tree whose descendents are all or mostly the target genus while retaining the most possible tips that contain the
	target genus.

	First, all Bacteroidetes IDs were parsed from the 99% OTUs taxonomy file, using a grep command looking for the string:
	"Bacteria;Bacteroidetes;" and these were written to a text file. An outgroup Dictyglomi sequence (CP001146.624121.625660) was added
	to this file as an outgroup. QIIME's (version 1.9.1) filter_fasta.py script was called using this text file as --seq_id_fp input to
	create a fasta file with only Bacteroidetes + the single group sequences in it.

	This fasta file was filtered to remove sequences shorter than 1300 base pairs. Custom script to filter short reads here:
	https://gist.github.com/walterst/edde79da8912759285567f6f4cb50c8c

	These filtered reads were then aligned using the default settings for QIIME's align_seqs.py script, using Greengenes
	core_set_aligned.imputed file as the template for the alignment. Next the alignment was filtered using these settings to
	filter high entropy/gapped positions: filter_alignment.py --entropy_threshold 0.10 --allowed_gap_frac 0.80
	and a tree was built using the make_phylogeny.py with default settings.

	To step through the resulting tree's internal nodes and find the optimal node (descending tips contain the most possible
	target genus reads with <1% nontarget genera). The total tree branch length and branch length descending from this
	node are also recorded by this script. Some cleanup of taxonomy was used to group the target genera taxa strings together.
	Script location: https://gist.github.com/walterst/bd0801224dbf93608afaf77a9f55bc5f

	The resulting output files of the script were gone through by hand to find the largest target genus counts while retaining
	the most specificity possible. breakdown of the results and choices are below (bracketed lists show the non-target taxa
	in the group that was chosen):
	Total tree branch lengths 843.43

	porphyromonas (able to get all 553 tips)
	porphyromonas tips Other tips Fraction porphyromonas Branch length
	553 1 1.00 8.26
	[('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Porphyromonas', 553),
	('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;ratAN060301C', 1)]
	prevotella (total 5630 tips, chose result with 4713 for specificity)
	4713 71 0.99 55.94
	[('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella', 4713),
	('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;uncultured', 69),
	('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidales S24-7 group', 2)]
	Next Prevotella group dropped to 3363, so chose this a cutoff.

	Bacteroides (total 3820 tips, seems to be largely mixed with other taxa)
	1487 53 0.97 16.24
	[('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides', 1487),
	('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidales S24-7 group', 47),
	('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidales Incertae Sedis;Phocaeicola', 2),
	('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella', 2),
	('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Barnesiella', 1),
	('Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;uncultured', 1)]

	From these resulting text files, the target IDs for the optimal portion of the tree selection were then selected, and
	copied into new text files, and used as --seq_id_fp parameters with filter_fasta.py to create selected fasta files
	for Bacteroides, Prevotella, and Porphyromonas. Each of these were aligned using same align_seqs.py approach described above.

	For each of these alignments, I want to take each aligned read and count the nucleotide differences between each read,
	except for self-self. As I don't want to consider differences in sequence length at the ends as actual differences
	between the reads, I only consider a slice of the alignment in this calculation that removes all differences in overhangs
	at the ends. The script for getting these counts
	is here: https://gist.github.com/walterst/a46b774444d882d96bd2e1f72d7c179e
	The output of this script for each of the three alignments was used for generating histograms of the distances within each
	genus.