An R script to regroup genes to ontologies while maintaining compositionality
Same default output of the humann2 untility script regroup_table.py while addresses the many-to-one mapping problem. The only difference is that each gene abundance from your gene table is divided by the number of times it maps to an ontology. This ensures that your output table is truely compositional (componenets sum to 1).
"Each component should never give more than itself..." ~ Thom Quinn
$ Rscript regroup_table_compositional.R --input <joined_gene_table.tsv> --map <path/to/utility_mapping_file.txt.gz> --out <ontology_table.tsv>
## ## -i, --input (required) ## Path to the unstratified gene output table from humann2 (format tsv). Or any tab ## delim table with rows as genes and columns as samples. No empty rows permitted ## ## -m, --map (required) ## Path to utility mapping file ## ## -o --out (required) ## Output file name ## ## -h, --help ## Show this help message and exit
- Must have dplyr R packages installed.
- Input table must be an unstratified gene table. Use humann2's code
humann2_split_stratified_table --input $TABLE --outputto generate the unstratified table. The future version will include an option to include stratified or the original table (includes both stratified and unstratified). Any tab delimited gene list with abundances should also work as long as you have a mapping file that correspond with it. Works with a normalized gene abundance table and with a single sample or joined table.
- humann2 utility mapping file or a mapping file for that maps your genes to the ontology of choice. You can install humann2's utility scripts using their code:
humann2_databases --download utility_mapping full $DIR
Humann2 program should be in your path. The actual script is used in the last step.
Run humann2 on all fastq files
#!/bin/bash for fastq in $(ls /path/to/fastq/files) do humann2 --input $fastq --output results/ --metaphlan ~/metaphlan/ done
Download mapping files to regroup gene families into other functional categories
$ humann2_databases --download utility_mapping full ~/biobakery_databases/humann2
Join all gene family tables (for each sample)
$ humann2_join_tables --input results/ --output results/humann2_genefamilies.tsv --file_name genefamilies
Split a tables into two files (one stratified and one unstratified)
$ humann2_split_stratified_table --input results/humann2_genefamilies.tsv --output results/
Download regroup_table_compositional.R and run
$ curl -o regroup_table_compositional.R “https://gist.githubusercontent.com/aimirza/555570cc009fe8989343d68ecc0408ac/raw/2074c17f417e4afb5b51038bc038fc4728802118/regroup_table_compositional.R”
$ Rscript --vanilla regroup_table_compositional.R \
-i results/humann2_genefamilies_unstratified.tsv \
-m ~/biobakery_databases/humann2/utility_mapping/map_level4ec_uniref90.txt.gz \
Regrouping uniref90 gene joined table with 40 columns (samples) to ECs took ~ 2 minutes and max memory used was 3.4G. Regrouping to informative GOs or Kos took less than 6 minutes but used > 20 G memory.
Please let me know how to reduce memory usage in the comments below!
Many thanks to the Biobakery team for developing super useful software.
Thank you Thom Quinn for the guidance.
Date: May 22st, 2020