onertipaday/usecase03.Rmd

## usecase03.Rmd
---
title: "VESPUCCI Use Case 03"
author: "Paolo Sonego"
date: "06/24/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# VESPUCCI use case 03

[VESPUCCI](https://vespucci.readthedocs.io/) is the gene expression database for grapevine and we can access it via its GraphQL interface, called [COMPASS](https://compass-.readthedocs.io/). The [rCOMPASS](https://onertipaday.github.io/rcompass/) package is a R package that wraps some functionalities to simplify communication with the COMPASS interface.

In this use case, we have queried the database with a short list of genes putatively involved in pollen development with the final goal of retrieving additional genes with the same function. First we have identified the conditions where those genes are correlated and modulated (using limma normalization). Then we have expanded this module by adding new conditions of interest (flower-related samples) and new genes (the most highly correlated to the ones initially selected). To assess the composition of the extended module in comparison with the original one, we have performed an enrichment analysis for Gene and Plant ontology terms. Finally, using TPM normalized data we have evaluated the gene expression levels in five different tissues in order to identify the most likely highly expressed genes in flower. Based on their annotation we have detected some new genes with a putative role in pollen development.

```{r}
devtools::install_github("onertipaday/rcompass")
library(rcompass)
```

## Genes

To establish an initial set of genes putatively involved in grapevine pollen development we have used the outcome of the comparison of two independent parthenocarpic somatic variants with their original seeded cultivars ([Nwafor et al. 2014](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-15-1030), [Royo et al. 2016](https://academic.oup.com/jxb/article/67/1/259/2885121)). In particular, we have manually selected genes that satisfy two criteria:
1. they are differentially expressed between mutant and wild-type developing flowers in both experiments (given that both seedless accessions are pollen sterile)
2. they are homologues to Arabidopsis genes implicated in [tapetal devolopment](https://link.springer.com/article/10.1007/s11103-013-0085-5), [microspore separation](http://www.plantphysiol.org/content/142/3/1004) and [pollen wall formation](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2409014/).

This list includes the transcription factors *MS1* (VIT_01s0011g06390), *MYB35*/*TDF1* (VIT_17s0000g05400), *MYB80* (VIT_19s0015g01280) with its direct target *GLOX1* (VIT_11s0016g03490), the pectin degrading enzymes *QRT1* (VIT_19s0015g00150), *QRT3* (VIT_01s0011g01300), the sporopollenin biosynthetic enzymes or transporters *ABCG26* (VIT_19s0015g00960), *ACOS5* (VIT_01s0010g03720), *CYP703A2* (VIT_15s0046g00330), *CYP704B1* (VIT_01s0026g02700), *LAP5* (VIT_03s0038g01460), *LAP6* (VIT_15s0021g02170), *MS2* (VIT_08s0007g07100), *SWEET5* (VIT_17s0000g08110), *TKPR1* (VIT_03s0038g04220) and *TKPR2* (VIT_01s0011g03480).


```{r}
gene_list <- c("VIT_01s0010g03720","VIT_01s0011g01300","VIT_01s0011g03480","VIT_01s0011g06390","VIT_01s0026g02700","VIT_01s0137g00030","VIT_03s0038g01460","VIT_03s0038g04220","VIT_08s0007g07100","VIT_11s0016g03490","VIT_15s0021g02170","VIT_15s0046g00330","VIT_17s0000g05400","VIT_17s0000g08110","VIT_19s0015g00960","VIT_19s0015g00150","VIT_19s0015g01280")
```

We can query the compendium with the list of gene names and get a list of BiologicalFeature objects that represents our genes of interest. The easiest way to know which are the valid filter values to be used is to use the autocompletion (ALT + SPACEBAR) in the COMPASS GraphQL interface at [http://compass.fmach.it/graphql](http://compass.fmach.it/graphql).


```{r}
bf_ids <- get_biofeature_id(compendium = "vespucci", id_In = gene_list, useIds = FALSE)
knitr::kable(bf_ids)
```

## Create a Module
We are now ready to create our first module. If we pass a list of genes (BiologicalFeature objects) it will automatically retrieve the "best" conditions (SampleSet objects).

```{r module_1_creation}
module <- create_module_bf(biofeaturesNames = bf_ids$id,
                             normalization = "limma",
                             top_n = 50,
                             useIds = TRUE)
```

## Plot a module
The easiest way to visualize a module is using a heatmap. The Plot object wraps a module and can return different plot type with different format (the Plotly JSON format or HTML). To see an heatmap embedded in rstudio we can retrieve the HTML code of the Plot (that uses Plotly) and display it directly into the viewer.

```{r plot_module_heatmap_html, echo = TRUE, eval = FALSE}
view_plot(module = module,
          normalization = "limma",
          plotType = "module_heatmap_expression",
          alternativeColoring = FALSE,
          min = -6, max = 6)
```

## Extend module part 1
Annotations will be used to selet all flower-related samples using a SPARQL query. The term [NCIT_C19157](http://purl.obolibrary.org/obo/NCIT_C19157) indicates ***specimen*** while [PO_0009046](http://purl.obolibrary.org/obo/PO_0009046) is the Plant Ontology term for flower.

```{r}
sparql = "SELECT ?s ?p ?o WHERE { ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/NCIT_C19157> . ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/PO_0009046>}"

flower_samples <- get_sparql_annotation_triples(target = "sample",
                                              query = sparql,
                                              normalization = "limma")[,1]
```

Once the samples have been selected, we can obtain the corresponding SampleSet objects relative to the chosen annotation and create a new module with the extra conditions.

```{r}
flower_ss <- get_sampleset_by_sampleid(normalization = "limma", samples = flower_samples)
extended_samplesets <- c(flower_ss$id, colnames(module))
extended_module <- create_module(biofeaturesNames = bf_ids$id,
                                 samplesetNames = unique(extended_samplesets),
                                 useIds = TRUE,
                                 normalization = "limma")
```

We are now ready to plot the extended module's heatmap

```{r plot_heatmap_extended_module, echo = TRUE, eval = FALSE}
view_plot(module = extended_module,
          normalization = "limma",
          alternativeColoring = FALSE,
          plotType = "module_heatmap_expression",
          min=-6, max=6)
```

## Extend module part 2
In order to extend the module in the other dimension, that is by adding new genes, we could look at the distribution of gene's correlation and add those genes with highest correlation with the current module.

```{r plot_module_distribution html, echo = TRUE, eval = FALSE}
view_plot(module = extended_module, normalization = "limma",
          plotType = "biological_features_uncentered_correlation_distribution")
```

We will select the first 50 genes (not yet present in our original module) with a correlation coefficient of at least 0.7

```{r}
top_bf <- plot_module_distribution(module = extended_module,
                                   normalization = "limma",
                                   plotType = "biological_features_uncentered_correlation_distribution",
                                   getRank = TRUE)
```

Once the BiologicalFeature ids have been selected we can obtain the correspondig objects and add the new BiologicalFeature to the module

```{r}
top_bf_df <- as.data.frame(top_bf)
no_bf_ids_df <- top_bf_df[!top_bf$id %in% bf_ids$id,]
top_50_bf_df <- no_bf_ids_df[no_bf_ids_df$value>=0.7,][1:50,]
extended_module_plus_top_50_bf <- create_module(biofeaturesNames = c(bf_ids$id,top_50_bf_df$id),
                                 samplesetNames = unique(extended_samplesets),
                                 useIds = TRUE,
                                 normalization = "limma")

```

As always we can have a look at the heatmap

```{r plot_heatmap_extended_module_top_50_bf, echo = TRUE, eval = FALSE}
view_plot(module = extended_module_plus_top_50_bf,
          normalization = "limma",
          alternativeColoring = FALSE,
          plotType = "module_heatmap_expression",
          min=-6, max=6)
```

# Enrichment
To have a better understanding of the composition of our extended module, we can have a look at the enrichment and see if it is any different from one we started from.

### Original module enrichment

```{r}
(module_enrichment <- enrich_module(module = module, normalization = "limma"))
```

### Extended module enrichment

```{r}
(extended_module_enrichment <- enrich_module(module = extended_module_plus_top_50_bf, normalization = "limma"))
```

The most evident differences between the two enrichments concern the Plant ontology term __flower__ and the Gene ontology term __cell wall organization__, which are over-represented in the extended module compared to the original one. The extended module’s enrichment in the term __flower__ was expected and can be easily explained by the fact that we have expanded the original module by adding the SampleSet objects with samples annotated as __cell wall organization__ (see Extend module part 1). On the other side, the over-representation of the term “cell wall organization” in the extended module is likely due to the fact that by introducing new correlated BiologicalFeatures objects (see Extend module part 2) we have gained genes with a putative role in pollen development and in particular in pollen wall formation.


# Tissue-specific gene expression
Using the TPM-normalized data it is possible to know whether specific genes are expressed or not. We will collect samples coming from 5 different tissues, root, stem, leaf, flower and fruit and check gene expression levels. Samples are collected using specific Plant ontology terms used to annotate the different tissues. Together with basic terms (like fruit) also more specific terms are used (such as berry fruit).
```{r}
root <- get_OntologyNode(normalization="tpm",originalId="PO_0009005",ontology_Name="Plant ontology")
stem <- get_OntologyNode(normalization="tpm",originalId="PO_0009047",ontology_Name="Plant ontology")
leaf <- get_OntologyNode(normalization="tpm",originalId="PO_0025034",ontology_Name="Plant ontology")
flower <- get_OntologyNode(normalization="tpm",originalId="PO_0009046",ontology_Name="Plant ontology")
fruit <- get_OntologyNode(normalization="tpm",originalId="PO_0009001",ontology_Name="Plant ontology")
```

Now that we have all the terms for the 5 tissues we can build the SPARQL queries to obtain only those samples we are interested in. We will filter out all samples with Inoculation and Treatment terms [NCIT_C68774](http://purl.obolibrary.org/obo/NCIT_C68774), [AGRO_00000322](http://purl.obolibrary.org/obo/AGRO_00000322).
```{r}
root_terms <-  get_OntologyNode(normalization = "tpm", descendantOf = root$id)
stem_terms <-  get_OntologyNode(normalization = "tpm", descendantOf = stem$id)
leaf_terms <-  get_OntologyNode(normalization = "tpm",descendantOf = leaf$id)
flower_terms <-  get_OntologyNode(normalization = "tpm", descendantOf = flower$id)
fruit_terms <- get_OntologyNode(normalization = "tpm", descendantOf = fruit$id)
# header and filter
sparql_header <-  "SELECT ?s ?p ?o WHERE { ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/NCIT_C19157> ."
sparql_filter <- "FILTER NOT EXISTS { VALUES ?p1 { <http://purl.obolibrary.org/obo/NCIT_C68774> <http://purl.obolibrary.org/obo/AGRO_00000322> }  ?s ?p1 ?o1 }  }"
# tissues terms
sparql_root_terms <- glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{root_terms$originalId}>} UNION", .open = "{{")
sparql_stem_terms <- glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{stem_terms$originalId}>} UNION", .open = "{{")
sparql_leaf_terms <- glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{leaf_terms$originalId}>} UNION", .open = "{{")
sparql_flower_terms <- glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{flower_terms$originalId}>} UNION", .open = "{{")
sparql_fruit_terms <- glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{fruit_terms$originalId}>} UNION", .open = "{{")
#queries
sparql_root_query <- paste0(sparql_header,
                       paste0(sparql_root_terms, collapse = ""),
                       glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{root$originalId}>} ", .open = "{{"),
                       sparql_filter)
sparql_stem_query <- paste0(sparql_header,
                       paste0(sparql_stem_terms, collapse = ""),
                       glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{stem$originalId}>} ", .open = "{{"),
                       sparql_filter)
sparql_leaf_query <- paste0(sparql_header,
                       paste0(sparql_leaf_terms, collapse = ""),
                       glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{leaf$originalId}>} ", .open = "{{"),
                       sparql_filter)
sparql_flower_query <- paste0(sparql_header,
                       paste0(sparql_flower_terms, collapse = ""),
                       glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{flower$originalId}>} ", .open = "{{"),
                       sparql_filter)
sparql_fruit_query <- paste0(sparql_header,
                       paste0(sparql_fruit_terms, collapse = ""),
                       glue::glue(" {?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.obolibrary.org/obo/{{fruit$originalId}>} ", .open = "{{"),
                       sparql_filter)
```

retrieve samples ids with normalization "tpm".
```{r}
root_samples <- get_sparql_annotation_triples(target = "sample",
                                              query = sparql_root_query,
                                              normalization = "tpm")

stem_samples <- get_sparql_annotation_triples(target = "sample",
                                              query = sparql_stem_query,
                                              normalization = "tpm")

leaf_samples <- get_sparql_annotation_triples(target = "sample",
                                              query = sparql_leaf_query,
                                              normalization = "tpm")

flower_samples <- get_sparql_annotation_triples(target = "sample",
                                              query = sparql_flower_query,
                                              normalization = "tpm")
fruit_samples <- get_sparql_annotation_triples(target = "sample",
                                              query = sparql_fruit_query,
                                              normalization = "tpm")
```

retrieve sampleSets ids from samples ids normalization "tpm"
```{r}
root_ss <- get_sampleset_by_sampleid(samples = root_samples[,1], normalization = "tpm")
stem_ss <- get_sampleset_by_sampleid(samples = stem_samples[,1], normalization = "tpm")
leaf_ss <- get_sampleset_by_sampleid(samples = leaf_samples[,1], normalization = "tpm")
flower_ss <- get_sampleset_by_sampleid(samples = flower_samples[,1], normalization = "tpm")
fruit_ss <- get_sampleset_by_sampleid(samples = fruit_samples[,1], normalization = "tpm")
```

Remove previous objects for returning memory to the operating system
```{r eval = FALSE}
rm(module, extended_module)
rm(top_bf_df, no_bf_ids_df, top_50_bf_df)
rm(sparql_root_query, sparql_stem_query, sparql_leaf_query, sparql_flower_query, sparql_fruit_query)
rm(root_samples, stem_samples, leaf_samples, flower_samples, fruit_samples )
```

Since getting all TPM data for thousand of samples and genes might take a while, we prepared an `.rds` (a binary) file with all the data we will need. In order to see how to create such file just follow the procedure in the [How to obtain the complete TPM module] section of this notebook. Here we load the full saved TPM matrix.
```{r module_full, message = FALSE, eval = FALSE}
 download.file(url="https://www.dropbox.com/s/23dflc1az97an93/tpm_module_full.rds?dl=0","/tmp/tpm_module_full_ranked_max.rds", method="wget")
tpm_module_full <- readRDS("/tmp/tpm_module_full_ranked_max.rds")
```

With the TPM values we will create a DataFrame by ranking all genes within each sample considering missing values as not-expressed.
```{r module_rank, eval = FALSE }
suppressPackageStartupMessages(library(matrixStats))
library("imputeTS")
# tpm_na <- na.replace(tpm_module_full, 0)
tpm_na <- na_replace(tpm_module_full, 0)
# tpm_na <- tpm_module_full
# tpm_na[is.na(tpm_na)] <- 0
my_na_bool <- is.na(tpm_module_full) # mi dice dove sono gli na
module_rank_zero <- colRanks(tpm_na, ties.method = "max", preserveShape = TRUE, na.last = TRUE)
rownames(module_rank_zero) <- rownames(tpm_module_full)
colnames(module_rank_zero) <- colnames(tpm_module_full)
module_rank_zero[is.na(tpm_module_full)] <- NA
module_rank <- module_rank_zero
# saveRDS(module_rank,file = "~/Sandbox/tpm_module_full_ranked_max.rds", compress = T)
```

load the matrix without rebuilding
```{r module_rank, include = FALSE, message = FALSE}
 download.file(url="https://www.dropbox.com/s/att302ebrdq4lbj/tpm_module_full_ranked_max.rds?dl=0","/tmp/tpm_module_full_ranked_max.rds", method="wget")
module_rank <- readRDS("/tmp/tpm_module_full_ranked_max.rds")
```

Given a gene we are now able to calculate 95% confidence intervals of rank mean by random resampling from all ranks of the gene across all conditions and check whether or not our specific sample of ranks (coming from just flower tissue for example) is outside that interval.
```{r}
cat("Selected gene:",gene_name <- gene_list[length(gene_list)])
gene_id <- get_biofeature_id(id_In = gene_name, useIds = F)$id
gene_ranks <- module_rank[rownames(module_rank) == gene_id,]
tissues <- colnames(module_rank)[colnames(module_rank) %in% flower_ss$id]
cat("\nSample size:", size <- length(tissues), "\n")
# gene_ranks_samples_boot <- boot(data=gene_ranks, statistic=mean, R=1000)
gene_ranks_samples <- lapply(1:15000, function(i) sample(gene_ranks, size=size, replace = FALSE))
# gene_ranks_samples <- sample(gene_ranks, size=size, replace = FALSE)
mean_dist <- sapply(gene_ranks_samples, mean)
mean_sample <- mean(module_rank[rownames(module_rank) == gene_id,
                           colnames(module_rank) %in% flower_ss$id],na.rm = TRUE)
cat('Flower samples rank mean:', mean_sample, "\n")
cat('Confidence interval:', quantile(mean_dist, c(2.5,97.5)/100,  na.rm = TRUE), "\n")

```

```{r}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(plotly))
data.frame(mean_dist=mean_dist) %>%
  ggplot(aes(x = mean_dist)) +
  geom_histogram(aes(mean_dist,y = ..density..), binwidth=density(mean_dist, na.rm = TRUE)$bw) +
  geom_density(fill="#B3CDE3" , alpha = 0.6) +
  geom_rug(col = "#B3CDE3") + theme_light() + xlab("ranking") + ylab("Prob")-> gg
plotly::ggplotly(gg)
```
We can calculate the same thing for all genes in our module following the below procedure or loading them directly from file (see the end of the procedure)

# Mean_dist and Mean_sample calculation

```{r}
n <- 15000
genes_ann <- get_biofeature_id(id_In = rownames(extended_module_plus_top_50_bf), useIds = T)
genes_ranks <- module_rank[rownames(module_rank) %in% genes_ann$id,]

```

# Root
```{r}
tissues_root <- colnames(module_rank)[colnames(module_rank) %in% root_ss$id]
# gene_ranks_samples_boot <- boot(data=gene_ranks, statistic=mean, R=1000)
cat("\nSample size root:", size <- length(tissues_root), "\n")
genes_ranks_samples <- list()
for (g in 1:dim(genes_ranks)[[1]]){
  genes_ranks_samples[[g]] <- lapply(1:n, function(i) sample(genes_ranks[g,],
                                                                 size=size,
                                                                 replace = FALSE))
}
mean_dist_root <- list()
for(i in 1:dim(genes_ranks)[[1]]){
  mean_dist_root[[i]] <- sapply(genes_ranks_samples[[i]], mean, na.rm = TRUE)
}
mean_sample_root <- list()
for(i in 1:length(mean_dist_root)){
  mean_sample_root[[i]] <- mean(module_rank[rownames(module_rank) == genes_ann$id[i],
                           colnames(module_rank) %in% root_ss$id], na.rm = TRUE)

}
mean_sample_root <- unlist(mean_sample_root)
names(mean_sample_root) <- genes_ann$name
```

# Stem
```{r}
tissues_stem <- colnames(module_rank)[colnames(module_rank) %in% stem_ss$id]
cat("\nSample size stem:", size <- length(tissues_stem), "\n")
genes_ranks_samples <- list()
for (g in 1:dim(genes_ranks)[[1]]){
  genes_ranks_samples[[g]] <- lapply(1:n, function(i) sample(genes_ranks[g,],
                                                                 size=size,
                                                                 replace = FALSE))
}
mean_dist_stem <- list()
for(i in 1:length(genes_ranks_samples)){
  mean_dist_stem[[i]] <- sapply(genes_ranks_samples[[i]], mean, na.rm = TRUE)
}
mean_sample_stem <- list()
for(i in 1:length(mean_dist_stem)){
  mean_sample_stem[[i]] <- mean(module_rank[rownames(module_rank) == genes_ann$id[i],
                           colnames(module_rank) %in% stem_ss$id], na.rm = TRUE)

}
mean_sample_stem <- unlist(mean_sample_stem)
names(mean_sample_stem) <- genes_ann$name
```

# Leaf
```{r}
tissues_leaf <- colnames(module_rank)[colnames(module_rank) %in% leaf_ss$id]
# gene_ranks_samples_boot <- boot(data=gene_ranks, statistic=mean, R=1000)
cat("\nSample size leaf:", size <- length(tissues_leaf), "\n")
genes_ranks_samples <- list()
for (g in 1:dim(genes_ranks)[[1]]){
  genes_ranks_samples[[g]] <- lapply(1:n, function(i) sample(genes_ranks[g,],
                                                                 size=size,
                                                                 replace = FALSE))
}
mean_dist_leaf <- list()
for(i in 1:dim(genes_ranks)[[1]]){
  mean_dist_leaf[[i]] <- sapply(genes_ranks_samples[[i]], mean, na.rm = TRUE)
}
mean_sample_leaf <- list()
for(i in 1:length(mean_dist_leaf)){
  mean_sample_leaf[[i]] <- mean(module_rank[rownames(module_rank) == genes_ann$id[i],
                           colnames(module_rank) %in% leaf_ss$id], na.rm = TRUE)

}
mean_sample_leaf <- unlist(mean_sample_leaf)
names(mean_sample_leaf) <- genes_ann$name
```

# Flower
```{r}
tissues_flower <- colnames(module_rank)[colnames(module_rank) %in% flower_ss$id]
cat("\nSample size flower:", size <- length(tissues_flower), "\n")
genes_ranks_samples <- list()
for (g in 1:dim(genes_ranks)[[1]]){
  genes_ranks_samples[[g]] <- lapply(1:n, function(i) sample(genes_ranks[g,],
                                                                 size=size,
                                                                 replace = FALSE))
}
mean_dist_flower <- list()
for(i in 1:dim(genes_ranks)[[1]]){
  mean_dist_flower[[i]] <- sapply(genes_ranks_samples[[i]], mean, na.rm = TRUE)
}
mean_sample_flower <- list()
for(i in 1:length(mean_dist_flower)){
  mean_sample_flower[[i]] <- mean(module_rank[rownames(module_rank) == genes_ann$id[i],
                           colnames(module_rank) %in% flower_ss$id], na.rm = TRUE)

}
mean_sample_flower <- unlist(mean_sample_flower)
names(mean_sample_flower) <- genes_ann$name
```

# Fruit
```{r}
tissues_fruit <- colnames(module_rank)[colnames(module_rank) %in% fruit_ss$id]
# gene_ranks_samples_boot <- boot(data=gene_ranks, statistic=mean, R=1000)
cat("\nSample size fruit:", size <- length(tissues_fruit), "\n")
genes_ranks_samples <- list()
for (g in 1:dim(genes_ranks)[[1]]){
  genes_ranks_samples[[g]] <- lapply(1:n, function(i) sample(genes_ranks[g,],
                                                                 size=size,
                                                                 replace = FALSE))
}
mean_dist_fruit <- list()
for(i in 1:dim(genes_ranks)[[1]]){
  mean_dist_fruit[[i]] <- sapply(genes_ranks_samples[[i]], mean, na.rm = TRUE)
}
mean_sample_fruit <- list()
for(i in 1:length(mean_dist_fruit)){
  mean_sample_fruit[[i]] <- mean(module_rank[rownames(module_rank) == genes_ann$id[i],
                           colnames(module_rank) %in% fruit_ss$id], na.rm = TRUE)

}
mean_sample_fruit <- unlist(mean_sample_fruit)
names(mean_sample_fruit) <- genes_ann$name
```

# save the mean_dist and mean_sample for all tissues

```{r}
save(mean_dist_root, mean_dist_stem, mean_dist_leaf, mean_dist_flower, mean_dist_fruit, file="~/Sandbox/mean_dist_tissues.RData", compress = TRUE)
save(mean_sample_root, mean_sample_stem, mean_sample_leaf, mean_sample_flower, mean_sample_fruit, file="~/Sandbox/mean_sample_tissues.RData", compress = TRUE)
```

```{r}
module_tissue_expression <- data.frame(root = mean_sample_root,
                                       stem = mean_sample_stem,
                                       leaf = mean_sample_leaf,
                                       flower = mean_sample_flower,
                                       fruit = mean_sample_fruit)
knitr::kable(t(module_tissue_expression))
```

By inspecting the above data.frame we can see that for genes of the extended module the sample of ranks coming from flower is outside (with a percentile of score > 97.5) the 95% confidence interval of rank mean. This means that a large proportion (70%) of the genes in our extended module are more likely to be highly expressed in flower (which is even more significant if we consider that this tissue is not highly represented in the database). Nevertheless, in a number of cases those genes are also highly expressed in other tissues, for example fruit. Therefore, to narrow down the number of new candidate genes with a potential role in pollen development we may be interested in selecting the ones that are more likely to be highly expressed in flower but not expressed in fruit (and not present in our original module).

Which genes are more likely to be highly expressed in flower but not expressed in fruit?
```{r USETHIS}
percentile.ranked <- function(a.vector, value) {
  numerator <- length(sort(a.vector)[a.vector < value])
  denominator <- length(a.vector)
  round(numerator/denominator,3)*100
}

# root
perc_sample_root <- numeric()
for (i in 1:length(mean_sample_root)){
  perc_sample_root[i] <- percentile.ranked(mean_dist_root[[i]], mean_sample_root[i])
}
names(perc_sample_root) <- genes_ann$name
# per verificare la distribuzione
percentile_dist_root <- list()
for (i in 1:length(mean_sample_root)){
  percentile_dist_root[[i]] <- quantile(mean_dist_root[[i]],probs = seq(0, 1, by = .005),na.rm=T)
}
# stem
perc_sample_stem <- numeric()
for (i in 1:length(mean_sample_stem)){
  perc_sample_stem[i] <- percentile.ranked(mean_dist_stem[[i]], mean_sample_stem[i])
}
names(perc_sample_stem) <- genes_ann$name
# leaf
perc_sample_leaf <- numeric()
for (i in 1:length(mean_sample_leaf)){
  perc_sample_leaf[i] <- percentile.ranked(mean_dist_leaf[[i]], mean_sample_leaf[i])
}
names(perc_sample_leaf) <- genes_ann$name
# flower
perc_sample_flower <- numeric()
for (i in 1:length(mean_sample_flower)){
  perc_sample_flower[i] <- percentile.ranked(mean_dist_flower[[i]], mean_sample_flower[i])
}
names(perc_sample_flower) <- genes_ann$name
# fruit
perc_sample_fruit <- numeric()
for (i in 1:length(mean_sample_fruit)){
  perc_sample_fruit[i] <- percentile.ranked(mean_dist_fruit[[i]], mean_sample_fruit[i])
}
names(perc_sample_fruit) <- genes_ann$name
mean_sample_root[17]
perc_sample_root[17]
```

```{r}
module_tissue_expression_perc <- data.frame(root = perc_sample_root,
                                       stem = perc_sample_stem,
                                       leaf = perc_sample_leaf,
                                       flower = perc_sample_flower,
                                       fruit = perc_sample_fruit)
#saveRDS(module_tissue_expression_perc, file = "~/Sandbox/module_tissue_expression_perc.rds", compress = TRUE)
module_tissue_expression_perc <- readRDS("~/Sandbox/module_tissue_expression_perc.rds")
knitr::kable(t(module_tissue_expression_perc))
```


Alternatively you can directly load the mean_dist and mean_sample for all tissues

# load the mean_dist and mean_sample for all tissues

```{r}
download.file(url="https://www.dropbox.com/s/knelrtnw5xca9fx/mean_dist_tissues.RData?dl=0","/tmp/mean_dist_tissues.RData", method="wget")
load("/tmp/mean_dist_tissues.RData")

download.file(url="https://www.dropbox.com/s/pd4myr0ejg571xi/mean_sample_tissues.RData?dl=0","/tmp/mean_sample_tissues.RData", method="wget")
load("/tmp/mean_sample_tissues.RData")

download.file(url="https://www.dropbox.com/s/gcubf4lx9ovjxmq/module_tissue_expression_perc.rds?dl=0","/tmp/mean_sample_tissues.RData", method="wget")
module_tissue_expression_perc <- readRDS("~/Sandbox/module_tissue_expression_perc.rds")
```


```{r}
(rownames(module_tissue_expression_perc)[ module_tissue_expression_perc$flower > 97.5 & module_tissue_expression_perc$fruit < 2.5] -> selected_genes)
```
Based on the annotation of the extracted genes we can focus on the most interesting ones. For example, VIT_12s0059g00700 encodes a MYB transcription factor with a function in anther and pollen development, VIT_12s0057g00370 a pectinase involved in pollen wall formation, VIT_04s0008g02020 an endo-β-1,4-D-glucanase, VIT_06s0004g06900 a pyruvate decarboxylase (see annotation example below) and VIT_18s0001g07310 a Rho GTPase-activating protein, all of which are known for playing a role in pollen tube growth in model species.


```{r}
get_biofeature_annotations(bioFeature_Name_In = "VIT_06s0004g06900")$bioFeatureId %>%
  unique() %>%
  get_annotation_triples(ids = .)
```


Gene name can be obtained from an external source like UNIPROT through its SPARQL interface
```{r UNIPROT_SPARQL}
suppressPackageStartupMessages(library(SPARQL))
endpoint <- "https://sparql.uniprot.org/sparql"
# options <- "output=xml"
query_h <- "PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
    PREFIX gramene: <http://purl.uniprot.org/gramene/>

    SELECT ?protein ?desc ?see_gramene
    WHERE
    {
        ?protein a up:Protein .
        ?protein up:encodedBy ?gene .
        ?protein up:organism taxon:29760 .
        ?protein <http://www.w3.org/2000/01/rdf-schema#seeAlso> ?see_gramene .
        ?protein <http://www.w3.org/2000/01/rdf-schema#label> ?desc .
    "
sparql_UNIPROT_query <- glue::glue(" {?see_gramene <http://purl.uniprot.org/core/transcribedFrom> gramene:{{selected_genes}}} UNION ", .open = "{{")
query_b <- stringi::stri_replace_last_coll(paste0(sparql_UNIPROT_query, collapse = ""), "UNION", "")
query = paste0(query_h, query_b,"}")
(res <- SPARQL::SPARQL(endpoint,query)$results)
```

# Using the UniProt.ws package

```{r}
suppressPackageStartupMessages(library(UniProt.ws))
availableUniprotSpecies(pattern="vinifera")
#   taxon ID   Species name
# 1    29760 Vitis vinifera
(up <- UniProt.ws(taxId = 29760))

head(keytypes(up))
head(columns(up))
selected_uniprot_ids <- res$protein %>%  stringr::str_replace("<http://purl.uniprot.org/uniprot/","") %>% str_replace(">","")
# table(up@taxIdUniprots =="F6HBS8")
(res_uniprot <- select(up,
              keys = selected_uniprot_ids,
              columns = c("GENES","GENEID","GI_NUMBER*","REFSEQ_PROTEIN","REFSEQ_NUCLEOTIDE"),
              keytype = "UNIPROTKB"))

```
Alternatively directly from [UNIPROT](https://www.uniprot.org/uploadlists/) you can:
- select `Gene name for` for the `2. Select options - From` option and `Vitis vinifera` as organism:
- run the below R code (i.e. copy the previously selected gene names)
- paste (CTRL + V) into the `1.Provide your identifiers` form.

```{r}
library("clipr")
write_clip(selected_genes)
```

---

# How to obtain the complete TPM module

First, we retrieve all biofeatures and samplesets ids
```{r build_full_tpm_compendium_module, echo = FALSE, eval = FALSE}
all_bf <- get_biofeature_id()
all_samplesets <- get_sampleset_by_sampleid(samples = NULL, normalization = "tpm")
```


```{r, echo = FALSE, eval = FALSE}
tpm_module_1 <- create_module(biofeaturesNames = all_bf$id,
                              samplesetNames = all_samplesets$id[1:1000],
                              normalization = "tpm",
                              useIds = TRUE)
write.table(tpm_module_1, file="~/Sandbox/tpm_module_1.tsv", sep="\t", col.names=NA, row.names=T)
```

```{r, echo = FALSE, eval = FALSE}
tpm_module_2 <- create_module(biofeaturesNames = all_bf$id,
                              samplesetNames = all_samplesets$id[1001:2000],
                              normalization = "tpm",
                              useIds = TRUE)
write.table(tpm_module_2, file="~/Sandbox/tpm_module_2.tsv", sep="\t", col.names=NA, row.names=T)
```

```{r, echo = FALSE, eval = FALSE}
tpm_module_3 <- create_module(biofeaturesNames = all_bf$id,
                              samplesetNames = all_samplesets$id[2001:3000],
                              normalization = "tpm",
                              useIds = TRUE)
write.table(tpm_module_3, file="~/Sandbox/tpm_module_3.tsv", sep="\t", col.names=NA, row.names=T)
```

```{r, echo = FALSE, eval = FALSE}
tpm_module_4 <- create_module(biofeaturesNames = all_bf$id,
                              samplesetNames = all_samplesets$id[3001:length(all_samplesets$id)],
                              normalization = "tpm",
                              useIds = TRUE)
write.table(tpm_module_4, file="~/Sandbox/tpm_module_4.tsv", sep="\t", col.names=NA, row.names=T)
```

```{r, echo = FALSE, eval = FALSE}
tpm_module_full <- cbind(tpm_module_1, tpm_module_2, tpm_module_3, tpm_module_4)
saveRDS(tpm_module_full, file = "~/Sandbox/tmp_module_full.rds", compress = TRUE)
# h5write(tpm_module_full, file= "~/Sandbox/tpm_module_full.h5", "tmp_module")
```


# Session information
```{r session information, echo=FALSE}
print(sessionInfo(), locale = FALSE)
```