---
title: "Batch-query all platform evidence associated with a gene/target list (R)"
output:
  md_document:
    variant: markdown_github
---
How to batch-access information related to a list of targets from the Open Targets Platform is a recurrent question. Here, I provide an example of how to access all target-disease evidence for a set of IFN-gamma-signalling-related proteins. I will then reduce the evidence to focus on the coding and non-coding variants clinically associated with the gene list of interest. I used R and sparklyr, but a Python implementation would be very similar. The Platform documentation and the community space have very similar examples.
## Spark connection and loading datasets
To interact with the Parquet-format datasets most efficiently, we will use sparklyr in R. It is not the only way to access the information, as JSON files are also available.
```{r, Rsetup, message=FALSE, warning=FALSE, results='hide'}
library(dplyr)
library(sparklyr)
library(sparklyr.nested)
```
### Option 1: Work with local datasets
The easiest way to interact with the datasets is to download the latest Platform release in Parquet format directly to your computer. Datasets vary in size, and they are all partitioned into multiple chunks to optimise I/O. See the [documentation](https://platform-docs.opentargets.org/data-access/datasets) for the available ways of downloading the information.
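For instance, the documentation describes an rsync mirror of each release. A minimal sketch of wrapping that call from R is shown below; the release (`21.11`), the exact remote path and the destination directory are assumptions you should adapt from the data-access documentation.
```{r, downloadDatasets, eval = FALSE}
## Sketch: mirror the 21.11 evidence dataset into ~/datasets/
## (remote path and release are assumptions; check the documentation
## for the release you actually want)
system(paste(
  "rsync -rpltvz --delete",
  "rsync.ebi.ac.uk::pub/databases/opentargets/platform/21.11/output/etl/parquet/evidence",
  "~/datasets/"
))
```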
The first step is to create a connection to your local Spark instance.
```{r, SparkConnectionLocal, eval = FALSE}
sc <- spark_connect(master = "local")
```
For the purpose of this exercise we will use three datasets: [target](https://platform-docs.opentargets.org/target), [disease](https://platform-docs.opentargets.org/disease-or-phenotype) and [evidence](https://platform-docs.opentargets.org/evidence). See their respective documentation pages for more information. To load a dataset, you just need to provide the top-level directory containing all of its chunks (e.g. `~/datasets/evidence`); Spark will then read all the partitioned chunks in that directory.
```{r, loadPlatformEvidenceLocally, eval = FALSE}
evidencePath <- "<local-path-evidence>"
targetPath <- "<local-path-target>"
diseasePath <- "<local-path-disease>"
## read datasets
evd <- spark_read_parquet(sc, path = evidencePath)
target <- spark_read_parquet(sc, path = targetPath)
disease <- spark_read_parquet(sc, path = diseasePath)
```
### Option 2: Directly in Google Cloud Platform (Advanced)
Alternatively, you can set up a Spark configuration that reads directly from Google Cloud Platform. This is a more complex setup, as it runs in a Dataproc cluster that accesses the datasets directly. It will probably be most useful to power users of the Open Targets Platform.
```{r, SparkConnectionGCP}
conf <- spark_config()
conf$`spark.hadoop.fs.gs.requester.pays.mode` <- "AUTO"
conf$`spark.hadoop.fs.gs.requester.pays.project.id` <- "open-targets-eu-dev"
sc <- spark_connect(master = "yarn", config = conf)
```
With this slightly modified Spark configuration you can access the Google Cloud datasets directly. Here is an example of how to tap into the `21.11` data.
```{r, loadPlatformEvidenceGPC}
evidencePath <- "gs://open-targets-data-releases/21.11/output/etl/parquet/evidence"
targetPath <- "gs://open-targets-data-releases/21.11/output/etl/parquet/targets"
diseasePath <- "gs://open-targets-data-releases/21.11/output/etl/parquet/diseases"
## read datasets
evd <- spark_read_parquet(sc, path = evidencePath)
target <- spark_read_parquet(sc, path = targetPath)
disease <- spark_read_parquet(sc, path = diseasePath)
```
## Browse the schema to find fields of interest
The best way to understand the information available in the datasets is to browse their schemas. Below you can see the available columns in the `evidence` dataset. Evidence is a composite of objects from multiple data sources, so the schema is a compromise between multiple elements. This means that if you are only interested in a subset of the data sources, many of the columns in the dataset may be irrelevant to you.
```{r, browseEvidenceSchema}
## Browse the evidence schema
columns <- evd %>%
  sdf_schema() %>%
  lapply(function(x) do.call(tibble, x)) %>%
  bind_rows()
```
```{r, schemaTable, echo = FALSE, results = 'asis'}
knitr::kable(head(columns), "pipe")
```
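As an optional aside, the already-loaded `sparklyr.nested` package also ships an interactive widget that makes the nested fields of the schema easier to explore:
```{r, schemaViewer, eval = FALSE}
## Optional: interactive widget to browse the nested evidence schema
sdf_schema_viewer(evd)
```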
## Batch-searching datasets information (data joining)
In this exercise, we want to look at information related to a list of IFN-gamma-signalling genes/proteins. We will take the list of `approvedSymbol` values of interest and load it into Spark.
```{r, targetList}
IFNgsignalling <- data.frame(
  approvedSymbol = c(
    "JAK1",
    "JAK2",
    "IFNGR1",
    "IFNGR2",
    "STAT1",
    "B2M",
    "IRF1",
    "SOCS1"
  )
)
IFNgsignalling <- copy_to(sc, IFNgsignalling, "IFNgsignalling")
```
The next step is to join the multiple pieces of information that we have. We will sequentially:
1. Subset the target dataset, which contains 60k+ genes, using the genes of interest. We will only keep the approved symbol and the Ensembl ID, but there is a large amount of gene metadata in that dataset.
2. Join in all the evidence related to the targets. This is a one-to-many relationship, as there can be multiple pieces of evidence for the same gene.
3. Use the disease dataset to resolve the disease names. In the evidence dataset, diseases are only represented by their EFO IDs, so we use this dataset to obtain their names. It also contains rich information about each disease that could be leveraged for more complex hypotheses.
```{r, datajoining}
IFNgEvd <- target %>%
  inner_join(IFNgsignalling, by = "approvedSymbol") %>%
  select(targetId = id, approvedSymbol) %>%
  # join target-disease evidence
  inner_join(evd, by = "targetId") %>%
  # join disease names
  inner_join(
    disease %>%
      select(
        diseaseId = id,
        diseaseName = name
      ),
    by = c("diseaseId")
  ) %>%
  sdf_persist()
```
The resulting `IFNgEvd` object contains all the Platform target-disease evidence for the targets of interest.
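As a quick sanity check, we can summarise the persisted table before going further, for example by counting the evidence per target and data source (an illustrative aside; `datasourceId` is one of the standard evidence columns):
```{r, evidenceSummary, eval = FALSE}
## Illustrative check: how much evidence per target and data source
IFNgEvd %>%
  count(approvedSymbol, datasourceId) %>%
  arrange(desc(n)) %>%
  collect()
```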
## Extract variant-related information
For the purpose of this exercise, we want to filter the dataset down to the evidence that contains variant information. This reduces the set to the few data sources that report causal variants to the Open Targets Platform, which in turn means that a number of columns become irrelevant to the reduced dataset.
Next, we filter the information down to the data of interest and show how to post-process columns that contain nested information. For example, this is the case for the ClinVar `clinicalSignificances` field, which we will explode into multiple rows for this use case. Finally, we order the dataset by the Platform score to prioritise the variants with a more likely causal link to the disease.
```{r, variantInfo}
# helper: drop every column that contains only NULL values
dropAllNullColumns <- function(sparkdf) {
  null_counts <- sparkdf %>%
    summarise_all(~ sum(as.numeric(!is.na(.)), na.rm = TRUE)) %>%
    collect()
  notNullColumns <- null_counts %>%
    select_if(~ (. != 0)) %>%
    colnames()
  out <- sparkdf %>%
    select(one_of(notNullColumns))
  return(out)
}
IFNgVariantInfo <- IFNgEvd %>%
  # keep only variant-based information
  filter(!is.na(variantRsId)) %>%
  # drop all columns without information
  dropAllNullColumns() %>%
  # explode the nested field into one row per value
  sdf_explode(clinicalSignificances) %>%
  # sort by the Platform score
  arrange(desc(score))
```
In some cases, data might need to be moved to the R workspace (outside Spark) for further analysis.
```{r, export}
df <- IFNgVariantInfo %>%
  collect()
```
Careful interpretation of the data is always required. For example, this table contains variants that have been catalogued as benign in ClinVar, so no truly causal link is reported for them. Similarly, Open Targets Genetics Portal evidence with very low L2G scores might have a very weak causal link to the gene. Some of these might result in false-positive assignments.
```{r, evidenceTable, echo = FALSE, results = 'asis'}
knitr::kable(head(df), "pipe")
```
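As a follow-up, one might exclude the ClinVar records flagged as benign before any downstream analysis. A minimal sketch, assuming the exploded `clinicalSignificances` values use ClinVar's lowercase labels (inspect `unique(df$clinicalSignificances)` to confirm):
```{r, filterBenign, eval = FALSE}
## Sketch: drop benign/likely benign ClinVar assignments
## (the label spelling is an assumption; check the actual values first)
dfFiltered <- df %>%
  filter(!clinicalSignificances %in% c("benign", "likely benign"))
```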