David Ochoa (d0choa)
👨‍💻 Open Targets Platform Coordinator
#!/bin/bash
# Job requirements
# Submit this script with: sbatch thefilename
# For more details about each parameter, see the SLURM sbatch documentation: https://slurm.schedmd.com/sbatch.html
#SBATCH --time=8:00:00        # walltime
#SBATCH --ntasks=1            # number of tasks
#SBATCH --cpus-per-task=16    # number of CPUs per task, i.e. if your code is multi-threaded
#SBATCH --nodes=1             # number of nodes
#SBATCH -p datamover          # partition(s)
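The preview cuts off inside the #SBATCH preamble. A hypothetical continuation in the same style might look like the following; the memory request, job name, and job command are illustrative assumptions, not part of the original gist:

#SBATCH --mem=64G             # assumed memory request; the original value is not shown
#SBATCH --job-name=datamove   # assumed job name

# The actual job body is not visible in the preview; srun simply launches the
# command with the resources requested above.
srun my_command --threads "${SLURM_CPUS_PER_TASK}"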
@d0choa
d0choa / distance_clump_v2.py
Created July 4, 2023 20:30
Distance clumping based on nested structures and DenseVectors
"""Prototype of distance based clumping."""
from typing import TYPE_CHECKING
import numpy as np
import pyspark.ml.functions as fml
import pyspark.sql.functions as f
from pyspark.ml.linalg import DenseVector, Vectors, VectorUDT
from pyspark.sql import SparkSession
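The preview stops at the imports. As a minimal sketch of the idea in the title, assuming hypothetical column names (studyId, chromosome, position) and a hypothetical 250 kb threshold, one can collect positions into a nested array per study/chromosome and split clumps wherever the gap between consecutive variants exceeds the threshold; the gist's DenseVector machinery is replaced here with plain arrays to keep the example short:

# Sketch only: not the gist's actual implementation.
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

WINDOW = 250_000  # assumed clumping distance (bp)

@f.udf(t.ArrayType(t.IntegerType()))
def clump_ids(positions):
    """Assign a clump index to each sorted position; a new clump starts
    whenever the gap to the previous position exceeds WINDOW."""
    ids, current = [], 0
    for i, pos in enumerate(positions):
        if i > 0 and pos - positions[i - 1] > WINDOW:
            current += 1
        ids.append(current)
    return ids

df = spark.createDataFrame(
    [("s1", "chr1", 3), ("s1", "chr1", 4), ("s1", "chr1", 500_000)],
    ["studyId", "chromosome", "position"],
)

clumped = (
    df.groupBy("studyId", "chromosome")
    .agg(f.sort_array(f.collect_list("position")).alias("positions"))
    .withColumn("clumpId", clump_ids("positions"))
    .withColumn("pair", f.explode(f.arrays_zip("positions", "clumpId")))
    .select(
        "studyId",
        "chromosome",
        f.col("pair.positions").alias("position"),
        f.col("pair.clumpId").alias("clumpId"),
    )
)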
@d0choa
d0choa / distance_clump.py
Created June 29, 2023 21:14
Experiment to implement distance-based clumping
"""Prototype of distance based clumping."""
import pyspark.sql.functions as f
from pyspark.sql import Column, SparkSession, Window
spark = SparkSession.builder.getOrCreate()
data = [
("s1", "chr1", 3, 2.0, False),
("s1", "chr1", 4, 3.0, False),
"""
Compute all vs all Bayesian colocalisation analysis for all Genetics Portal
This script calculates posterior probabilities of different causal variants
configurations under the assumption of a single causal variant for each trait.
Logic reproduced from: https://github.com/chr1swallace/coloc/blob/main/R/claudia.R
"""
from functools import reduce
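The docstring points at coloc's claudia.R for the underlying logic. Under the single-causal-variant assumption, the hypothesis posteriors follow from per-SNP log approximate Bayes factors (logABFs) for the two traits; a minimal numpy sketch of that arithmetic, using coloc's default priors, is:

# Sketch of the coloc hypothesis posteriors; p1, p2, p12 are coloc's default priors.
import numpy as np
from scipy.special import logsumexp

def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """labf1/labf2: per-SNP logABFs for traits 1 and 2 over the same SNPs."""
    labf1, labf2 = np.asarray(labf1), np.asarray(labf2)
    lh0 = 0.0                                        # H0: no association
    lh1 = np.log(p1) + logsumexp(labf1)              # H1: trait 1 only
    lh2 = np.log(p2) + logsumexp(labf2)              # H2: trait 2 only
    # H3: two distinct causal variants; sum over SNP pairs i != j
    off_diag = ~np.eye(len(labf1), dtype=bool)
    lh3 = np.log(p1) + np.log(p2) + logsumexp(np.add.outer(labf1, labf2)[off_diag])
    lh4 = np.log(p12) + logsumexp(labf1 + labf2)     # H4: shared causal variant
    lh = np.array([lh0, lh1, lh2, lh3, lh4])
    return np.exp(lh - logsumexp(lh))                # normalised PP(H0..H4)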
@d0choa
d0choa / estimateLogABF.py
Last active April 1, 2022 20:10
Estimate logABF from credible set
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import VectorUDT, Vectors
import pyspark.sql.types as T
import pyspark.sql.functions as F
sparkConf = SparkConf()
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.mode', 'AUTO')
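The preview stops after the Spark configuration. The quantity being estimated, the log approximate Bayes factor, comes from Wakefield's approximation, which needs only a beta/standard-error pair per variant; a sketch of that formula as a PySpark column expression, with coloc's default prior standard deviation of 0.15 for quantitative traits, is:

# Sketch: Wakefield's log approximate Bayes factor, lABF = 0.5 * (log(1 - r) + r * z^2),
# where r = W / (W + V), W is the prior variance and V the variance of the estimate.
import pyspark.sql.functions as F
from pyspark.sql import Column

def log_abf(beta: Column, se: Column, sd_prior: float = 0.15) -> Column:
    z2 = (beta / se) ** 2            # squared z-score
    v = se ** 2                      # variance of the effect estimate
    r = sd_prior ** 2 / (sd_prior ** 2 + v)
    return 0.5 * (F.log1p(-r) + r * z2)

# e.g. df.withColumn("logABF", log_abf(F.col("beta"), F.col("se")))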
@d0choa
d0choa / Coloc_normalisation.py
Last active March 30, 2022 18:36
Experimenting with coloc in PySpark
import pyspark.sql.functions as F
from pyspark import SparkConf
from pyspark.sql import SparkSession
from functools import reduce
sparkConf = SparkConf()
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.mode', 'AUTO')
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.project.id',
'open-targets-eu-dev')
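The preview shows only the requester-pays configuration. The normalisation the filename refers to is presumably the conversion of the five log-hypothesis terms into posterior probabilities; a sketch of that step as pure DataFrame arithmetic, using the usual max-subtraction trick for numerical stability (the lH0..lH4 column names are assumptions), is:

# Sketch only: normalise log-evidence columns into posterior probabilities.
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.0, 2.3, 1.8, 4.2, 9.7)], ["lH0", "lH1", "lH2", "lH3", "lH4"])

cols = ["lH0", "lH1", "lH2", "lH3", "lH4"]
lmax = F.greatest(*[F.col(c) for c in cols])
# logsumexp over the five columns: log(sum(exp(lHi - max))) + max
lsum = F.log(reduce(lambda a, b: a + b, [F.exp(F.col(c) - lmax) for c in cols])) + lmax
posteriors = df.select(
    "*", *[F.exp(F.col(c) - lsum).alias(c.replace("lH", "PP_H")) for c in cols]
)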
@d0choa
d0choa / potentialNewVariantsInIIndex.py
Last active March 16, 2022 16:01
List of potential new variants in variant index (derived from other datasets)
from os import sep
import pyspark.sql.functions as F
from pyspark import SparkConf
from pyspark.sql import SparkSession
sparkConf = SparkConf()
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.mode', 'AUTO')
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.project.id',
'open-targets-eu-dev')
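After the configuration boilerplate, the core of a "potential new variants" diagnostic is a left-anti join of variant IDs referenced elsewhere against the variant index; a minimal sketch with stand-in data (the real gist reads the actual datasets from GCS):

# Sketch only: tiny stand-in DataFrames instead of the real GCS datasets.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
variant_index = spark.createDataFrame([("1_100_A_T",)], ["variantId"])
other_datasets = spark.createDataFrame([("1_100_A_T",), ("2_200_G_C",)], ["variantId"])

# Variants referenced by other datasets but absent from the index.
potential_new = other_datasets.distinct().join(variant_index, "variantId", "left_anti")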
@d0choa
d0choa / missingTopLoci.py
Last active March 3, 2022 15:17
Diagnostic script to find and explain missing top loci from the V2D dataset
import pyspark.sql.functions as F
from pyspark import SparkConf
from pyspark.sql import SparkSession
sparkConf = SparkConf()
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.mode', 'AUTO')
sparkConf = sparkConf.set('spark.hadoop.fs.gs.requester.pays.project.id',
'open-targets-eu-dev')
# Establish Spark connection
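The diagnostic pattern differs from a plain anti-join in that it keeps every top locus and attaches a flag explaining whether it resolved against the variant index; a minimal sketch with stand-in data and assumed column names:

# Sketch only: explain missing top loci via a flagged left join.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
top_loci = spark.createDataFrame([("1_100_A_T",), ("2_200_G_C",)], ["variantId"])
variant_index = spark.createDataFrame([("1_100_A_T",)], ["variantId"])

explained = (
    top_loci.join(
        variant_index.withColumn("inVariantIndex", F.lit(True)),
        on="variantId",
        how="left",
    )
    .withColumn("inVariantIndex", F.coalesce("inVariantIndex", F.lit(False)))
)
# Top loci that could not be matched to the variant index.
missing = explained.filter(~F.col("inVariantIndex"))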
@d0choa
d0choa / 2021_approvals.R
Last active July 12, 2022 01:48
Supporting evidence on 2021 FDA approvals
library("tidyverse")
library("sparklyr")
library("sparklyr.nested")
library("cowplot")
library("ggsci")
# Spark config
config <- spark_config()
# Allow access to GCP datasets
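The preview ends at the configuration comment. A hypothetical continuation in the same sparklyr style, mirroring the requester-pays settings used in the Python gists above (the connection details are assumptions):

# Hypothetical continuation, not shown in the preview:
config$spark.hadoop.fs.gs.requester.pays.mode <- "AUTO"
config$spark.hadoop.fs.gs.requester.pays.project.id <- "open-targets-eu-dev"
sc <- spark_connect(master = "local", config = config)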
@d0choa
d0choa / all_variants_for_genelist.Rmd
Last active November 23, 2022 20:01
All platform variants associated with a list of genes in R
---
title: "Batch-query all platform evidence associated with a gene/target list (R)"
output:
  md_document:
    variant: markdown_github
---
How to batch-access information related to a list of targets from the Open Targets Platform is a recurrent question. Here, I provide an example of how to access all target-disease evidence for a set of IFN-gamma signalling-related proteins. I then reduce the evidence to focus on the coding and non-coding variants clinically associated with the gene list of interest. I use R and sparklyr, but a Python implementation would be very similar. The platform documentation and the community space contain closely related examples.
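For reference, a minimal PySpark sketch of the equivalent Python query; the parquet path, the example Ensembl ID, and the variant filter are illustrative assumptions rather than the tutorial's exact values:

# Sketch only: read platform evidence, keep the targets of interest, and
# restrict to evidence tied to a specific variant.
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

targets = ["ENSG00000111537"]  # example Ensembl gene ID (IFNG)

evidence = spark.read.parquet("gs://<release-bucket>/evidence")  # placeholder path

clinical_variants = (
    evidence
    .filter(f.col("targetId").isin(targets))
    .filter(f.col("variantId").isNotNull())  # assumed column for variant-level evidence
)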