Cyriac Kandoth (ckandoth)

@ckandoth
ckandoth / gnomad_vcf_prep.txt
Created June 8, 2022 21:39
Smaller gnomAD 3.1.2 VCF
# Fetch the WGS gnomAD 3.1.2 per-chrom VCFs (the large size is mostly due to INFO fields):
mkdir gnomad
gsutil -m cp gs://gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr*.vcf.bgz gnomad
gsutil -m cp gs://gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr*.vcf.bgz.tbi gnomad
# Shortlist INFO fields we want to keep when merging these into a single VCF of reduced file size:
bcftools view -h gnomad/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz | grep ^##INFO | cut -f3- -d= | grep -Ev "controls|non_cancer|non_neuro|non_topmed|non_v2|vep" | sort | less -S
cadd_phred
cadd_raw_score
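# (Preview cuts off above.) Sketch of how such a shortlist might be applied; the fields below are an example, not the gist's final selection:
bcftools annotate -x "^INFO/AC,INFO/AF,INFO/AN,INFO/nhomalt" -Oz -o gnomad/gnomad.small.chr21.vcf.gz gnomad/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz
tabix -p vcf gnomad/gnomad.small.chr21.vcf.gz
# A bcftools concat across the per-chromosome outputs would then yield the single reduced-size VCF named in the description.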
@ckandoth
ckandoth / test_az_sdk_blob_upload.py
Created May 10, 2022 00:38
Test upload to Azure blob using Python SDK and MSAL tokens
#!/usr/bin/env python
# Prereqs: Run "az login" to get a refresh token at "~/.azure/msal_token_cache.json" which expires only if unused for 90 days
# Depends: pip install azure-identity azure-storage-blob
# Sources: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/storage/azure-storage-blob/samples/blob_samples_containers.py
STORAGE_ACCOUNT_URL = "https://blahdiblahdiblah.blob.core.windows.net"
CONTAINER_NAME = "mdlhot"
# Use the MSAL refresh token to get a temporary access token for use with blob storage libraries
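# The upload code itself is cut off in this preview; as a shell-side check of the same setup (using the placeholder account and container names above), the Azure CLI can do an equivalent authenticated upload:
# az storage blob upload --account-name blahdiblahdiblah --container-name mdlhot --name hello.txt --file hello.txt --auth-mode login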
@ckandoth
ckandoth / ensembl_vep_106_with_offline_cache.md
Created April 12, 2022 20:34
Install Ensembl's VEP v106 with local cache for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene (as detailed here), for its CLIA-compliant HGVS variant format, and for its Sequence Ontology nomenclature for variant effects.

Instead of the official instructions, we will use mamba (conda, but faster) to install VEP and its dependencies. If you don't already have mamba, use these steps to download and install it into $HOME/mambaforge, then run a script that adds it to your $PATH:

curl -L https://github.com/conda-forge/miniforge/releases/download/4.12.0-0/Mambaforge-Linux-x86_64.sh -o /tmp/mambaforge.sh
sh /tmp/mambaforge.sh -bfp $HOME/mambaforge && rm -f /tmp/mambaforge.sh
. $HOME/mambaforge/etc/profile.d/conda.sh
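
The preview stops after mambaforge is set up; the step that logically follows (the exact bioconda version string here is an assumption) is to install VEP 106 into its own environment:

mamba create -y -n vep -c conda-forge -c bioconda ensembl-vep==106.1
conda activate vep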
@ckandoth
ckandoth / prep_grch38_ref.txt
Created September 24, 2021 23:53
Download and prepare GRCh38 reference data useful in NGS analyses
# Prepare a conda environment with tools we will need:
mamba create -y -n ref; conda activate ref
mamba install -y -c bioconda htslib==1.13 bcftools==1.13 samtools==1.13 picard-slim==2.26.2 bwa-mem2==2.2.1 bwa==0.7.17 gsutil==4.68
# Fetch the alignment-ready human reference FASTA and index:
gsutil -m cp gs://genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa{,.fai} .
# Index the reference FASTA for use with various tools:
picard CreateSequenceDictionary -R GRCh38_Verily_v1.genome.fa
bwa-mem2 index GRCh38_Verily_v1.genome.fa
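# The preview ends with the bwa-mem2 index; a sketch of other common indexes, using the tools installed above:
bwa index GRCh38_Verily_v1.genome.fa
samtools faidx GRCh38_Verily_v1.genome.fa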
@ckandoth
ckandoth / install_nextflow_singularity.md
Last active March 22, 2024 06:58
Install conda and use it to install nextflow and singularity

This guide will show you how to install conda and then use it to install nextflow and singularity for executing popular bioinformatics workflows. Unfortunately, singularity is not available on Windows or macOS, so this guide only targets Linux environments. If you have to use Windows 10, try WSL2. If you have to use macOS, try a virtual machine.

Download the Miniconda3 installer for Linux environments:

curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh

Install into a folder named miniconda3 under your home directory, and delete the installer:

bash miniconda.sh -bup $HOME/miniconda3 && rm -f miniconda.sh
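
The preview stops after Miniconda3 is installed; as a sketch of the step the guide is building toward (environment name and channel order are assumptions), nextflow and singularity can then go into a dedicated environment:

$HOME/miniconda3/bin/conda create -y -n nf -c conda-forge -c bioconda nextflow singularity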
@ckandoth
ckandoth / ngs_test_data.sh
Last active March 1, 2023 19:43
Create a small test dataset for CI/CD of cancer bioinformatics tools/pipelines
# GOAL: Create a small test dataset for CI/CD of cancer bioinformatics tools/pipelines
# Prerequisites: A clean Linux environment with a working internet connection
# Download and install mambaforge into a folder under your home directory:
curl -L https://github.com/conda-forge/miniforge/releases/download/4.14.0-0/Mambaforge-Linux-x86_64.sh -o /tmp/mambaforge.sh
sh /tmp/mambaforge.sh -bfp $HOME/mambaforge && rm -f /tmp/mambaforge.sh
# Add the following to your ~/.bashrc file to activate the base environment whenever you log in:
if [ -f "$HOME/mambaforge/etc/profile.d/conda.sh" ]; then
    . $HOME/mambaforge/etc/profile.d/conda.sh
fi
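# Beyond this preview, the gist builds the test dataset itself; a sketch of the tool setup that would come next (this package list is an assumption, not the gist's actual one):
mamba install -y -c conda-forge -c bioconda samtools bcftools bwa-mem2 gsutil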
@ckandoth
ckandoth / base_image_benchmark.md
Last active January 24, 2022 16:47
Compare the speed of containerized bwa-mem using various base images

On a Linux VM or Workstation with docker installed, fetch the GRCh38 FASTA, its index, and a pair of FASTQs:

wget -P /hot/ref https://storage.googleapis.com/genomics-public-data/references/GRCh38_Verily/GRCh38_Verily_v1.genome.fa{,.fai}
wget -P /hot/reads/test https://storage.googleapis.com/data.cyri.ac/test_L001_R{1,2}_001.fastq.gz

If on a Slurm cluster, here is an example of wrapping a docker run command in an sbatch request:

sbatch --chdir=/hot --output=ref/std.out --error=ref/std.err --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --mem=30G --time=4:00:00 --wrap="docker run --help"
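
The --wrap above only runs docker run --help as a placeholder; a sketch of the actual benchmark command, where the image tag my-bwa:ubuntu22 and the /hot/aln output folder are hypothetical stand-ins for whichever base image is under test, might look like:

sbatch --chdir=/hot --output=aln/std.out --error=aln/std.err --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --mem=30G --time=4:00:00 --wrap="docker run --rm -v /hot:/hot my-bwa:ubuntu22 bwa mem -t 8 /hot/ref/GRCh38_Verily_v1.genome.fa /hot/reads/test/test_L001_R1_001.fastq.gz /hot/reads/test/test_L001_R2_001.fastq.gz > /hot/aln/test.sam"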
@ckandoth
ckandoth / single_machine_slurm_on_ubuntu.md
Last active December 13, 2023 20:12
Install Slurm 19.05 on a standalone machine running Ubuntu 20.04

Use apt to install the necessary packages:

sudo apt install -y slurm-wlm slurm-wlm-doc

Load file:///usr/share/doc/slurm-wlm/html/configurator.html in a browser (or file://wsl%24/Ubuntu/usr/share/doc/slurm-wlm/html/configurator.html on WSL2), and (see the sketch after this list for installing the result):

  1. Set your machine's hostname in SlurmctldHost and NodeName.
  2. Set CPUs as appropriate, and optionally Sockets, CoresPerSocket, and ThreadsPerCore. Use command lscpu to find what you have.
  3. Set RealMemory to the number of megabytes you want to allocate to Slurm jobs.
  4. Set StateSaveLocation to /var/spool/slurm-llnl.
  5. Set ProctrackType to linuxproc because processes are less likely to escape Slurm control on a single machine config.
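
Once the configurator form is submitted, a sketch of installing the generated file and starting the daemons (paths follow Ubuntu's slurm-wlm packaging) would be:

sudo cp slurm.conf /etc/slurm-llnl/slurm.conf
sudo mkdir -p /var/spool/slurm-llnl && sudo chown slurm:slurm /var/spool/slurm-llnl
sudo systemctl restart slurmctld slurmd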
@ckandoth
ckandoth / ensembl_vep_102_with_offline_cache.md
Last active November 7, 2023 14:32
Install Ensembl's VEP v102 with local cache for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene (as detailed here), for its CLIA-compliant HGVS variant format, and for its Sequence Ontology nomenclature for variant effects.

Instead of the official instructions, we will use conda to install VEP and its dependencies. If you don't already have conda, install it into $HOME/miniconda3 as follows:

curl -sL https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh -o /tmp/miniconda.sh
sh /tmp/miniconda.sh -bfp $HOME/miniconda3

Add the conda bin folder to your $PATH so that all installed tools are accessible from the command line. You can also add this line to your ~/.bashrc.
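
For example (one common way, given the default install path above):

export PATH="$HOME/miniconda3/bin:$PATH"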

@ckandoth
ckandoth / ensembl_vep_95_with_offline_cache.md
Last active October 4, 2022 21:49
Install Ensembl's VEP v95 with various caches for running offline

Ensembl's VEP (Variant Effect Predictor) is popular for how it picks a single effect per gene (as detailed here), for its CLIA-compliant HGVS variant format, and for its Sequence Ontology nomenclature for variant effects.

To follow these instructions, we'll assume you have these essential packages installed:

## For Debian/Ubuntu system admins ##
sudo apt-get install -y build-essential git libncurses-dev

## For RHEL/CentOS system admins ##
sudo yum groupinstall -y 'Development Tools'
sudo yum install -y git ncurses-devel
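
Beyond these prerequisites, the offline cache itself has to be downloaded; as a sketch (the URL follows Ensembl's FTP layout for release 95, and ~/.vep is VEP's default cache folder), the GRCh38 indexed cache can be fetched and unpacked like this:

## Large download; unpacks into the default VEP cache folder ##
mkdir -p ~/.vep
curl -L http://ftp.ensembl.org/pub/release-95/variation/indexed_vep_cache/homo_sapiens_vep_95_GRCh38.tar.gz | tar -zxf - -C ~/.vep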