David Ochoa (d0choa)
👨‍💻 Open Targets Platform Coordinator
#!/bin/bash
# Job requirements
# Submit this script with: sbatch thefilename
# For more details about each parameter, see the SLURM sbatch documentation: https://slurm.schedmd.com/sbatch.html
#SBATCH --time=8:00:00        # walltime
#SBATCH --ntasks=1            # number of tasks
#SBATCH --cpus-per-task=16    # number of CPUs per task, i.e. if your code is multi-threaded
#SBATCH --nodes=1             # number of nodes
#SBATCH -p datamover          # partition(s)
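The preview cuts off inside the #SBATCH preamble. A hypothetical continuation in the same style might look like the following; the memory request, job name, and job command are illustrative assumptions, not part of the original gist:

#SBATCH --mem=64G             # assumed memory request; the original value is not shown
#SBATCH --job-name=datamove   # assumed job name

# The actual job body is not visible in the preview; srun simply launches the
# command with the resources requested above.
srun my_command --threads "${SLURM_CPUS_PER_TASK}"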
@d0choa
d0choa / distance_clump_v2.py
Created July 4, 2023 20:30
Distance clumping based on nested structures and DenseVectors
"""Prototype of distance based clumping."""
from typing import TYPE_CHECKING
import numpy as np
import pyspark.ml.functions as fml
import pyspark.sql.functions as f
from pyspark.ml.linalg import DenseVector, Vectors, VectorUDT
from pyspark.sql import SparkSession
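The preview stops at the imports. As a minimal sketch of the idea in the title, assuming hypothetical column names (studyId, chromosome, position) and a hypothetical 250 kb threshold, one can collect positions into a nested array per study/chromosome and split clumps wherever the gap between consecutive variants exceeds the threshold; the gist's DenseVector machinery is replaced here with plain arrays to keep the example short:

# Sketch only: not the gist's actual implementation.
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

WINDOW = 250_000  # assumed clumping distance (bp)

@f.udf(t.ArrayType(t.IntegerType()))
def clump_ids(positions):
    """Assign a clump index to each sorted position; a new clump starts
    whenever the gap to the previous position exceeds WINDOW."""
    ids, current = [], 0
    for i, pos in enumerate(positions):
        if i > 0 and pos - positions[i - 1] > WINDOW:
            current += 1
        ids.append(current)
    return ids

df = spark.createDataFrame(
    [("s1", "chr1", 3), ("s1", "chr1", 4), ("s1", "chr1", 500_000)],
    ["studyId", "chromosome", "position"],
)

clumped = (
    df.groupBy("studyId", "chromosome")
    .agg(f.sort_array(f.collect_list("position")).alias("positions"))
    .withColumn("clumpId", clump_ids("positions"))
    .withColumn("pair", f.explode(f.arrays_zip("positions", "clumpId")))
    .select(
        "studyId",
        "chromosome",
        f.col("pair.positions").alias("position"),
        f.col("pair.clumpId").alias("clumpId"),
    )
)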
@d0choa
d0choa / distance_clump.py
Created June 29, 2023 21:14
Experiment to implement distance-based clumping
"""Prototype of distance based clumping."""
import pyspark.sql.functions as f
from pyspark.sql import Column, SparkSession, Window
spark = SparkSession.builder.getOrCreate()
data = [
("s1", "chr1", 3, 2.0, False),
("s1", "chr1", 4, 3.0, False),
"""
Compute all vs all Bayesian colocalisation analysis for all Genetics Portal
This script calculates posterior probabilities of different causal variants
configurations under the assumption of a single causal variant for each trait.
Logic reproduced from: https://github.com/chr1swallace/coloc/blob/main/R/claudia.R
"""
from functools import reduce
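The docstring points at coloc's claudia.R for the underlying logic. Under the single-causal-variant assumption, the hypothesis posteriors follow from per-SNP log approximate Bayes factors (logABFs) for the two traits; a minimal numpy sketch of that arithmetic, using coloc's default priors, is:

# Sketch of the coloc hypothesis posteriors; p1, p2, p12 are coloc's default priors.
import numpy as np
from scipy.special import logsumexp

def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """labf1/labf2: per-SNP logABFs for traits 1 and 2 over the same SNPs."""
    labf1, labf2 = np.asarray(labf1), np.asarray(labf2)
    lh0 = 0.0                                        # H0: no association
    lh1 = np.log(p1) + logsumexp(labf1)              # H1: trait 1 only
    lh2 = np.log(p2) + logsumexp(labf2)              # H2: trait 2 only
    # H3: two distinct causal variants; sum over SNP pairs i != j
    off_diag = ~np.eye(len(labf1), dtype=bool)
    lh3 = np.log(p1) + np.log(p2) + logsumexp(np.add.outer(labf1, labf2)[off_diag])
    lh4 = np.log(p12) + logsumexp(labf1 + labf2)     # H4: shared causal variant
    lh = np.array([lh0, lh1, lh2, lh3, lh4])
    return np.exp(lh - logsumexp(lh))                # normalised PP(H0..H4)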
@d0choa
d0choa / estimateLogABF.py
Last active April 1, 2022 20:10
Estimate logABF from credible set
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import VectorUDT, Vectors
import pyspark.sql.types as T
import pyspark.sql.functions as F
sparkConf = SparkConf()
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.mode', 'AUTO')
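The preview stops after the Spark configuration. The quantity being estimated, the log approximate Bayes factor, comes from Wakefield's approximation, which needs only a beta/standard-error pair per variant; a sketch of that formula as a PySpark column expression, with coloc's default prior standard deviation of 0.15 for quantitative traits, is:

# Sketch: Wakefield's log approximate Bayes factor, lABF = 0.5 * (log(1 - r) + r * z^2),
# where r = W / (W + V), W is the prior variance and V the variance of the estimate.
import pyspark.sql.functions as F
from pyspark.sql import Column

def log_abf(beta: Column, se: Column, sd_prior: float = 0.15) -> Column:
    z2 = (beta / se) ** 2            # squared z-score
    v = se ** 2                      # variance of the effect estimate
    r = sd_prior ** 2 / (sd_prior ** 2 + v)
    return 0.5 * (F.log1p(-r) + r * z2)

# e.g. df.withColumn("logABF", log_abf(F.col("beta"), F.col("se")))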
@d0choa
d0choa / Coloc_normalisation.py
Last active March 30, 2022 18:36
Experimenting with coloc in PySpark
import pyspark.sql.functions as F
from pyspark import SparkConf
from pyspark.sql import SparkSession
from functools import reduce
sparkConf = SparkConf()
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.mode', 'AUTO')
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.project.id',
'open-targets-eu-dev')
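The preview shows only the requester-pays configuration. The normalisation the filename refers to is presumably the conversion of the five log-hypothesis terms into posterior probabilities; a sketch of that step as pure DataFrame arithmetic, using the usual max-subtraction trick for numerical stability (the lH0..lH4 column names are assumptions), is:

# Sketch only: normalise log-evidence columns into posterior probabilities.
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.0, 2.3, 1.8, 4.2, 9.7)], ["lH0", "lH1", "lH2", "lH3", "lH4"])

cols = ["lH0", "lH1", "lH2", "lH3", "lH4"]
lmax = F.greatest(*[F.col(c) for c in cols])
# logsumexp over the five columns: log(sum(exp(lHi - max))) + max
lsum = F.log(reduce(lambda a, b: a + b, [F.exp(F.col(c) - lmax) for c in cols])) + lmax
posteriors = df.select(
    "*", *[F.exp(F.col(c) - lsum).alias(c.replace("lH", "PP_H")) for c in cols]
)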
@d0choa
d0choa / potentialNewVariantsInIIndex.py
Last active March 16, 2022 16:01
List of potential new variants in variant index (derived from other datasets)
from os import sep
import pyspark.sql.functions as F
from pyspark import SparkConf
from pyspark.sql import SparkSession
sparkConf = SparkConf()
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.mode', 'AUTO')
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.project.id',
'open-targets-eu-dev')
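After the configuration boilerplate, the core of a "potential new variants" diagnostic is a left-anti join of variant IDs referenced elsewhere against the variant index; a minimal sketch with stand-in data (the real gist reads the actual datasets from GCS):

# Sketch only: tiny stand-in DataFrames instead of the real GCS datasets.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
variant_index = spark.createDataFrame([("1_100_A_T",)], ["variantId"])
other_datasets = spark.createDataFrame([("1_100_A_T",), ("2_200_G_C",)], ["variantId"])

# Variants referenced by other datasets but absent from the index.
potential_new = other_datasets.distinct().join(variant_index, "variantId", "left_anti")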
@d0choa
d0choa / missingTopLoci.py
Last active March 3, 2022 15:17
Diagnostic script to find and explain missing top loci from the V2D dataset
import pyspark.sql.functions as F
from pyspark import SparkConf
from pyspark.sql import SparkSession
sparkConf = SparkConf()
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.mode', 'AUTO')
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.project.id',
'open-targets-eu-dev')
# Establish Spark connection
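The diagnostic pattern differs from a plain anti-join in that it keeps every top locus and attaches a flag explaining whether it resolved against the variant index; a minimal sketch with stand-in data and assumed column names:

# Sketch only: explain missing top loci via a flagged left join.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
top_loci = spark.createDataFrame([("1_100_A_T",), ("2_200_G_C",)], ["variantId"])
variant_index = spark.createDataFrame([("1_100_A_T",)], ["variantId"])

explained = (
    top_loci.join(
        variant_index.withColumn("inVariantIndex", F.lit(True)),
        on="variantId",
        how="left",
    )
    .withColumn("inVariantIndex", F.coalesce("inVariantIndex", F.lit(False)))
)
# Top loci that could not be matched to the variant index.
missing = explained.filter(~F.col("inVariantIndex"))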
@d0choa
d0choa / 2021_approvals.R
Last active July 12, 2022 01:48
Supporting evidence on 2021 FDA approvals
library("tidyverse")
library("sparklyr")
library("sparklyr.nested")
library("cowplot")
library("ggsci")
# Spark config
config <- spark_config()
# Allow access to GCP datasets
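The preview ends at the configuration comment. A hypothetical continuation in the same sparklyr style, mirroring the requester-pays settings used in the Python gists above (the connection details are assumptions):

# Hypothetical continuation, not shown in the preview:
config$spark.hadoop.fs.gs.requester.pays.mode <- "AUTO"
config$spark.hadoop.fs.gs.requester.pays.project.id <- "open-targets-eu-dev"
sc <- spark_connect(master = "local", config = config)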
@d0choa
d0choa / all_variants_for_genelist.Rmd
Last active November 23, 2022 20:01
All platform variants associated with a list of genes in R
---
title: "Batch-query all platform evidence associated with a gene/target list (R)"
output:
  md_document:
    variant: markdown_github
---
How to batch-access information related to a list of targets from the Open Targets Platform is a recurrent question. Here, I provide an example of how to access all target-disease evidence for a set of IFN-gamma signalling-related proteins. I then reduce the evidence to focus on the coding and non-coding variants clinically associated with the gene list of interest. I use R and sparklyr, but a Python implementation would be very similar. The platform documentation and the community space contain closely related examples.
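For reference, a minimal PySpark sketch of the equivalent Python query; the parquet path, the example Ensembl ID, and the variant filter are illustrative assumptions rather than the tutorial's exact values:

# Sketch only: read platform evidence, keep the targets of interest, and
# restrict to evidence tied to a specific variant.
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

targets = ["ENSG00000111537"]  # example Ensembl gene ID (IFNG)

evidence = spark.read.parquet("gs://<release-bucket>/evidence")  # placeholder path

clinical_variants = (
    evidence
    .filter(f.col("targetId").isin(targets))
    .filter(f.col("variantId").isNotNull())  # assumed column for variant-level evidence
)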