aimirza/README.md

## README.md

      
    Raw
  

              README.md
            
          
    An R script to regroup genes to ontologies while maintaining compositionality

Same default output of the humann2 untility script regroup_table.py while addresses the many-to-one mapping problem. The only difference is that each gene abundance from your gene table is divided by the number of times it maps to an ontology. This ensures that your output table is truely compositional (componenets sum to 1).
"Each component should never give more than itself..." ~ Thom Quinn
Command

$ Rscript regroup_table_compositional.R --input <joined_gene_table.tsv> --map <path/to/utility_mapping_file.txt.gz>  --out <ontology_table.tsv>
Options:

##
##  -i, --input (required)
##      Path to the unstratified gene output table from humann2 (format tsv). Or any tab
##      delim table with rows as genes and columns as samples. No empty rows permitted 
##
##  -m, --map (required)
##      Path to utility mapping file
##      
##  -o --out (required)
##      Output file name
##
##  -h, --help
##      Show this help message and exit

Requirements


Must have dplyr R packages installed.
Input table must be an unstratified gene table. Use humann2's code humann2_split_stratified_table --input $TABLE --output to generate the unstratified table. The future version will include an option to include stratified or the original table (includes both stratified and unstratified). Any tab delimited gene list with abundances should also work as long as you have a mapping file that correspond with it. Works with a normalized gene abundance table and with a single sample or joined table.
humann2 utility mapping file or a mapping file for that maps your genes to the ontology of choice. You can install humann2's utility scripts using their code: humann2_databases --download utility_mapping full $DIR

Example pipeline

Humann2 program should be in your path. The actual script is used in the last step.
Run humann2 on all fastq files
#!/bin/bash
for fastq in $(ls /path/to/fastq/files)
do
    humann2 --input $fastq --output results/ --metaphlan ~/metaphlan/
done

Download mapping files to regroup gene families into other functional categories
$ humann2_databases --download utility_mapping full ~/biobakery_databases/humann2
Join all gene family tables (for each sample)
$ humann2_join_tables --input results/ --output results/humann2_genefamilies.tsv --file_name genefamilies
Split a tables into two files (one stratified and one unstratified)
$ humann2_split_stratified_table --input results/humann2_genefamilies.tsv --output results/
Download regroup_table_compositional.R and run
$ curl -o regroup_table_compositional.R “https://gist.githubusercontent.com/aimirza/555570cc009fe8989343d68ecc0408ac/raw/2074c17f417e4afb5b51038bc038fc4728802118/regroup_table_compositional.R”
$ Rscript --vanilla regroup_table_compositional.R \

-i results/humann2_genefamilies_unstratified.tsv \

-m ~/biobakery_databases/humann2/utility_mapping/map_level4ec_uniref90.txt.gz \

-o results/humann2_level4ec_unstratified_compositional.tsv
Regrouping uniref90 gene joined table with 40 columns (samples) to ECs took ~ 2 minutes and max memory used was 3.4G. Regrouping to informative GOs or Kos took less than 6 minutes but used > 20 G memory.

Please let me know how to reduce memory usage in the comments below!
Acknowledgments

Many thanks to the Biobakery team for developing super useful software.

Thank you Thom Quinn for the guidance.
Date: May 22st, 2020

  
## regroup_table_compositional.R
# Same default output of the humann2 untility script regroup_table.py
# while addresses the many-to-one mapping problem. Each gene abundance
# from your gene table is divided by the number of times it maps to a feature.
# This ensures that your output table is truely compositional (sums to 1).
# "Each component should never give more than itself..." ~ Thom Quinn

# To Run:
# $ Rscript --vanilla regroup_table_compositional.R -i <joined_gene_table.tsv> -m <path/to/utility_mapping_file.txt.gz>  -o <ontology_table.tsv>

# Date: May 21st, 2020

## Options:
##
##  -i, --input (required)
##      Path to the gene output table from humann2 (format tsv). Or any
##      tab delim table with rows as genes and columns as samples
##      Ensure that there are now empty rows. Table must be unstratified
##
##  -m, --map (required)
##      Path to utility mapping file
##
##  -o --out (required)
##      Output file name
##
##  -h, --help
##      Show this help message and exit


# Script is based on Thom Quinns example to transforms the data from 'gene space' into 'pathway space':

# devtools::install_github('tpq/amalgam')
# library(amalgam)
# set.seed(1)
# A <- randAcomp(5, 6) # 5 samples, 6 components
#
# pathway <- sample(0:1, size = 3*6, replace = TRUE) # rule to turn 6 components into 3 pathways
# pathway <- matrix(pathway, 6, 3)

# Each component should never give more than itself...
# in other words, rows should sum to 1 -- this addresses the many-to-one mapping problem
# (note: I add a tiny number, 1e-8, to avoid divide by zero error)
# > pathway <- sweep(pathway, 1, rowSums(pathway + 1e-8), "/")

# The dot product between the data and the normalized pathway matrix
# transforms the data from 'gene space' into 'pathway space'
# newdata <- A %*% pathway

start_time <- Sys.time()

#library("optparse", quietly = TRUE)
#library("dplyr", quietly = TRUE)
#library("profmem")
suppressPackageStartupMessages(library("optparse"))
suppressPackageStartupMessages(library("dplyr"))


option_list = list(
  make_option(
    c("-i", "--input"),
    type="character", default=NULL,
              help="Joined output table (format tsv)", metavar="character"),
  make_option(
    c("-m", "--map"),
    type="character", default=NULL,
              help="Path to mapping file", metavar="character"),
  make_option(
    c("-o", "--out"),
    type="character", default=NULL,
              help="Output file name", metavar="character")
);


opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);


message(sprintf("loading files. May take a while \n"))

# Read in the joined output table
genefamilies <- read.table(opt$input,
                           header=TRUE, check.names = FALSE, fill = TRUE, row.names = 1, stringsAsFactors = FALSE, sep="\t", comment.char = "")


# Read in the unitly mapping file
map_ontology <- scan(opt$map, what="", sep="\n", quiet = TRUE)
# Separate elements by one or more whitepace
map_ontology <- strsplit(map_ontology, "[[:space:]]+")
# Extract the first vector element and set it as the list element name
names(map_ontology) <- sapply(map_ontology, `[[`, 1)
# Remove the first vector element from each list element
map_ontology <- lapply(map_ontology, `[`, -1)
map_ontology = lapply(map_ontology, unlist)

message(sprintf("Loading files complete \n"))


# Get a list of all of the unique genes found in the mapping file
uniref_genes_all <- unlist(map_ontology, use.names = FALSE) %>% unique() %>% sort()


# Get genes only genes found in both the inout table and the mapping file
common_genes <- Reduce(intersect, list(row.names(genefamilies), uniref_genes_all)) %>% sort()

# Subset input genes found in the common gene list
genefamilies_common <- genefamilies[common_genes,]
rm(genefamilies)

# Subset the mapping file genes found in the common gene list
map_ontology_common <- sapply(map_ontology, function(x) x[x %in% common_genes])
rm(map_ontology)

#remove empty features from feature map
map_ontology_common <- map_ontology_common[lapply(map_ontology_common,length)>0]


# make a table with rows as genes and columns as features
mapping_rule_matrix <- matrix(0, nrow = length(common_genes), ncol = length(names(map_ontology_common)))
mapping_rule_matrix <- as.data.frame(mapping_rule_matrix, row.names=common_genes) # convert matrix to df and set row.names as genes
names(mapping_rule_matrix) <- names(map_ontology_common) # set column names as features

# Create gene to ontology mapping rules
for (feature in names(map_ontology_common)){
  mapping_rule_matrix[unlist(map_ontology_common[feature]), feature] = 1
}

mapping_rule_matrix_norm <- sweep(mapping_rule_matrix, 1, rowSums(mapping_rule_matrix), "/")
# You can remove this line above to get the same ontology values produced from regroup_table.py
rm(mapping_rule_matrix)

# The dot product between the data and the normalized ontology matrix
# Transforms the data from "gene space" to "ontology space"
A <- t(genefamilies_common[common_genes,]) %>% as.matrix()
B <- mapping_rule_matrix_norm[common_genes,] %>% as.matrix()
regrouped_table_norm <- t(A %*% B)

# Round to a precision of 3
regrouped_table_norm <- round(regrouped_table_norm, 3)

# Move row names to column
regrouped_table_norm <- tibble::rownames_to_column(as.data.frame(regrouped_table_norm), "# Gene Family")

# save new table
write.table(regrouped_table_norm, file=opt$out, row.names=FALSE, sep = "\t", quote = FALSE)

# calculate time elapsed
end_time <- Sys.time()
elapsed_time <- as.numeric(end_time - start_time, units = "mins")
message(sprintf("Script ran in %s minutes (%s secs) \n", round(elapsed_time, 1), round(elapsed_time*60)))

# Print final message
message(sprintf("File created:\n%s\n", opt$out))
	# Same default output of the humann2 untility script regroup_table.py
	# while addresses the many-to-one mapping problem. Each gene abundance
	# from your gene table is divided by the number of times it maps to a feature.
	# This ensures that your output table is truely compositional (sums to 1).
	# "Each component should never give more than itself..." ~ Thom Quinn

	# To Run:
	# $ Rscript --vanilla regroup_table_compositional.R -i <joined_gene_table.tsv> -m <path/to/utility_mapping_file.txt.gz> -o <ontology_table.tsv>

	# Date: May 21st, 2020

	## Options:
	##
	## -i, --input (required)
	## Path to the gene output table from humann2 (format tsv). Or any
	## tab delim table with rows as genes and columns as samples
	## Ensure that there are now empty rows. Table must be unstratified
	##
	## -m, --map (required)
	## Path to utility mapping file
	##
	## -o --out (required)
	## Output file name
	##
	## -h, --help
	## Show this help message and exit


	# Script is based on Thom Quinns example to transforms the data from 'gene space' into 'pathway space':

	# devtools::install_github('tpq/amalgam')
	# library(amalgam)
	# set.seed(1)
	# A <- randAcomp(5, 6) # 5 samples, 6 components
	#
	# pathway <- sample(0:1, size = 3*6, replace = TRUE) # rule to turn 6 components into 3 pathways
	# pathway <- matrix(pathway, 6, 3)

	# Each component should never give more than itself...
	# in other words, rows should sum to 1 -- this addresses the many-to-one mapping problem
	# (note: I add a tiny number, 1e-8, to avoid divide by zero error)
	# > pathway <- sweep(pathway, 1, rowSums(pathway + 1e-8), "/")

	# The dot product between the data and the normalized pathway matrix
	# transforms the data from 'gene space' into 'pathway space'
	# newdata <- A %*% pathway

	start_time <- Sys.time()

	#library("optparse", quietly = TRUE)
	#library("dplyr", quietly = TRUE)
	#library("profmem")
	suppressPackageStartupMessages(library("optparse"))
	suppressPackageStartupMessages(library("dplyr"))


	option_list = list(
	make_option(
	c("-i", "--input"),
	type="character", default=NULL,
	help="Joined output table (format tsv)", metavar="character"),
	make_option(
	c("-m", "--map"),
	type="character", default=NULL,
	help="Path to mapping file", metavar="character"),
	make_option(
	c("-o", "--out"),
	type="character", default=NULL,
	help="Output file name", metavar="character")
	);



	opt_parser = OptionParser(option_list=option_list);
	opt = parse_args(opt_parser);


	message(sprintf("loading files. May take a while \n"))

	# Read in the joined output table
	genefamilies <- read.table(opt$input,
	header=TRUE, check.names = FALSE, fill = TRUE, row.names = 1, stringsAsFactors = FALSE, sep="\t", comment.char = "")


	# Read in the unitly mapping file
	map_ontology <- scan(opt$map, what="", sep="\n", quiet = TRUE)
	# Separate elements by one or more whitepace
	map_ontology <- strsplit(map_ontology, "[[:space:]]+")
	# Extract the first vector element and set it as the list element name
	names(map_ontology) <- sapply(map_ontology, `[[`, 1)
	# Remove the first vector element from each list element
	map_ontology <- lapply(map_ontology, `[`, -1)
	map_ontology = lapply(map_ontology, unlist)

	message(sprintf("Loading files complete \n"))


	# Get a list of all of the unique genes found in the mapping file
	uniref_genes_all <- unlist(map_ontology, use.names = FALSE) %>% unique() %>% sort()


	# Get genes only genes found in both the inout table and the mapping file
	common_genes <- Reduce(intersect, list(row.names(genefamilies), uniref_genes_all)) %>% sort()

	# Subset input genes found in the common gene list
	genefamilies_common <- genefamilies[common_genes,]
	rm(genefamilies)

	# Subset the mapping file genes found in the common gene list
	map_ontology_common <- sapply(map_ontology, function(x) x[x %in% common_genes])
	rm(map_ontology)

	#remove empty features from feature map
	map_ontology_common <- map_ontology_common[lapply(map_ontology_common,length)>0]


	# make a table with rows as genes and columns as features
	mapping_rule_matrix <- matrix(0, nrow = length(common_genes), ncol = length(names(map_ontology_common)))
	mapping_rule_matrix <- as.data.frame(mapping_rule_matrix, row.names=common_genes) # convert matrix to df and set row.names as genes
	names(mapping_rule_matrix) <- names(map_ontology_common) # set column names as features

	# Create gene to ontology mapping rules
	for (feature in names(map_ontology_common)){
	mapping_rule_matrix[unlist(map_ontology_common[feature]), feature] = 1
	}

	mapping_rule_matrix_norm <- sweep(mapping_rule_matrix, 1, rowSums(mapping_rule_matrix), "/")
	# You can remove this line above to get the same ontology values produced from regroup_table.py
	rm(mapping_rule_matrix)

	# The dot product between the data and the normalized ontology matrix
	# Transforms the data from "gene space" to "ontology space"
	A <- t(genefamilies_common[common_genes,]) %>% as.matrix()
	B <- mapping_rule_matrix_norm[common_genes,] %>% as.matrix()
	regrouped_table_norm <- t(A %*% B)

	# Round to a precision of 3
	regrouped_table_norm <- round(regrouped_table_norm, 3)

	# Move row names to column
	regrouped_table_norm <- tibble::rownames_to_column(as.data.frame(regrouped_table_norm), "# Gene Family")

	# save new table
	write.table(regrouped_table_norm, file=opt$out, row.names=FALSE, sep = "\t", quote = FALSE)

	# calculate time elapsed
	end_time <- Sys.time()
	elapsed_time <- as.numeric(end_time - start_time, units = "mins")
	message(sprintf("Script ran in %s minutes (%s secs) \n", round(elapsed_time, 1), round(elapsed_time*60)))

	# Print final message
	message(sprintf("File created:\n%s\n", opt$out))