Creating a gVCF file for all samples in the 1000 Genomes project

Table of Contents

  • Introduction
  • Requirements
  • Setup
  • Installing Python
  • Installing StarCluster
  • Configuring StarCluster
  • Setting up a VPC
  • Setting up an AMI
  • Creating a Persistent Disk
  • Testing your Setup
  • Running the Analysis

Introduction

The 1000 Genomes Project provides one of the most comprehensive datasets of variation across diverse human populations. Because of its completeness, 1000 Genomes Project data has been used in many other projects, and the papers describing the main findings have been cited over 9,000 times. Data from the project are available in FASTQ, BAM, and VCF formats.

The variant callsets produced by the 1000 Genomes Project are frequently used as reference data for other genome projects. These callsets are highly refined: raw calls from many different variant callers were combined and then heavily filtered. Unfortunately, this filtering discarded useful information, since variants that are false positives in the 1000 Genomes data are likely to be false positives in other projects as well.

To address this, we propose to create a gVCF for all of the low-coverage samples analyzed by the 1000 Genomes Project.

Requirements

  • A computer running Linux.
  • An internet connection.
  • An Amazon Web Services account (know your account number, Access key ID, and Secret Access Key). See the AWS page and AWS IAM.
  • A Sentieon license file and the Sentieon toolkit executable or the GATK JAR file.

The Sentieon license and toolkit, or the GATK JAR file, are expected to already be on your Linux machine. See the GATK website and the Sentieon website for more information.

Setup

Assuming a user's home directory is /home/user/, let's make a subdirectory to do all of our work in.

mkdir /home/user/1kgenomes_tutorial

Let's move into that directory.

cd /home/user/1kgenomes_tutorial

Installing Python

We will install Python through the Anaconda distribution. We can start by downloading the installer.

wget https://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh

Then run the installer script. Agree to the license when prompted by typing yes. When prompted, change the install directory to /home/user/1kgenomes_tutorial/anaconda2. When prompted to edit your .bashrc file, answer no.

bash Anaconda2-4.2.0-Linux-x86_64.sh

You can now access the installed Python with /home/user/1kgenomes_tutorial/anaconda2/bin/python. You can access pip through /home/user/1kgenomes_tutorial/anaconda2/bin/pip.
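
To confirm the installation, you can ask both programs for their versions (the exact version strings will depend on the installer you downloaded).

anaconda2/bin/python --version
anaconda2/bin/pip --version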

Installing StarCluster

Let's install StarCluster. We can clone the source code from GitHub.

git clone https://github.com/jtriley/StarCluster.git

Now let's install StarCluster. We'll put the executable in our Anaconda directory to keep things tidy, using the pip tool distributed with Anaconda.

anaconda2/bin/pip install -e StarCluster/

We can now access StarCluster with /home/user/1kgenomes_tutorial/anaconda2/bin/starcluster.

Configuring StarCluster

StarCluster requires some configuration before it will work. First, we can have StarCluster create a template config file for us: if we run any StarCluster command that requires a config file and no config file is found, StarCluster will offer to create one. Run the command below and select option 2.

anaconda2/bin/starcluster shi

This will create a config file at ~/.starcluster/config (/home/user/.starcluster/config in this example). Now we can add our AWS information to this configuration file. Let's say that our account information is as shown below (the AWS_USER_ID is our account number):

AWS_ACCESS_KEY_ID = AKIAIUITQJ4NCRISZQ
AWS_SECRET_ACCESS_KEY = f+PJ+oufQofbIXk8Lhcxdj++uxROAOAJbNJWt1mi
AWS_USER_ID = 012345678901

We can add this information to our config file.

sed -i 's/#your_aws_access_key_id/AKIAIUITQJ4NCRISZQ/' ~/.starcluster/config
sed -i 's/#your_secret_access_key/f+PJ+oufQofbIXk8Lhcxdj++uxROAOAJbNJWt1mi/' ~/.starcluster/config
sed -i 's/#your userid/012345678901/' ~/.starcluster/config

Now when we run the following command, it should return pricing information on the AWS spot market to us.

anaconda2/bin/starcluster shi c3.xlarge

If that command returns an error like !!! ERROR - AuthFailure: AWS was not able to validate the provided access credentials, fix your credentials and try again.
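
The credentials live in the [aws info] section of ~/.starcluster/config; if the sed commands above did not match (for example, because the template wording differs between StarCluster versions), you can inspect that section directly and correct it in a text editor.

grep -A 3 '\[aws info\]' ~/.starcluster/config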

Setting up a VPC

StarCluster requires that your virtual machines reside within the same Virtual Private Cloud (VPC) so that they can communicate with each other over a secure network. We must set up the VPC before we can go any further. We can do this from the Python interpreter (commands run in the interpreter are marked with >>>).

First, let's install boto3.

anaconda2/bin/pip install boto3

Now let's start the interpreter.

anaconda2/bin/python

Import boto3, random, and time.

>>> import boto3
>>> import random
>>> import time

We need to generate a new private IP range (CIDR block) for the VPC; we will leave 16 bits for addresses within the VPC. When starting the boto3 session, use the same credentials you used in the StarCluster config file.

>>> session = boto3.session.Session(aws_access_key_id="AKIAIUITQJ4NCRISZQ", aws_secret_access_key="f+PJ+oufQofbIXk8Lhcxdj++uxROAOAJbNJWt1mi", region_name="us-east-1")
>>> ec2 = session.resource("ec2")
>>> mask_bits = (1<<32)-1 ^ (1<<16)-1
>>> ip = (10 << 24 | random.randrange(0, 1<<24)) & mask_bits
>>> ip2 = []
>>> for i in range(2, -1, -1):
...   ip2.append(ip >> i * 8 & 0xff)
...
>>> ip = '.'.join([str(x) for x in [10] + ip2]) + '/16'

The VPC IP should look something like this.

>>> ip
'10.81.0.0/16'

Now, let's create the new VPC.

>>> vpc = ec2.create_vpc(CidrBlock=ip)
>>> vpc
ec2.Vpc(id='vpc-f74cf391')

Wait until the VPC state is no longer 'pending'. If this takes longer than a couple of minutes, something is wrong.

>>> while vpc.state == "pending":
...   time.sleep(5)
...   vpc.reload()
...

Now we can create an internet gateway, attach it to the VPC, and add a default route to the VPC's route table.

>>> response = vpc.meta.client.create_internet_gateway()
>>> vpc.attach_internet_gateway(InternetGatewayId=response["InternetGateway"]["InternetGatewayId"])
>>> route_table = list(vpc.route_tables.all())[0]
>>> route_table.create_route(DestinationCidrBlock="0.0.0.0/0", GatewayId=response["InternetGateway"]["InternetGatewayId"])

Let's create a subnet CIDR within the VPC, leaving 12 bits for machine addresses within the subnet (a /20 block).

>>> vpc_ip = ip.split('/')[0]
>>> subnet = random.randrange(0, 1<<4)
>>> _vpc_ip, shift = 0, 0
>>> for ip_byte in reversed(vpc_ip.split('.')):
...   _vpc_ip |= int(ip_byte) << shift
...   shift += 8
...
>>> subnet_ip = _vpc_ip | (subnet << 12)
>>> ip2 = []
>>> for i in range(3, -1, -1):
...   ip2.append(subnet_ip >> i * 8 & 0xff)
...
>>> subnet_ip = '.'.join([str(x) for x in ip2]) + '/20'

Now we can create a new subnet.

>>> subnet = ec2.create_subnet(VpcId=vpc.vpc_id, CidrBlock=subnet_ip)

Wait until the subnet is not pending.

>>> while subnet.state == "pending":
...   time.sleep(5)
...   subnet.reload()
...

We can modify the subnet so that instances launched in it receive public IP addresses, and we are finished.

>>> subnet.meta.client.modify_subnet_attribute(SubnetId=subnet.id, MapPublicIpOnLaunch={"Value":True})

Keep track of the subnet ID; we will need it later. One way to note it down is sketched after the block below.

>>> subnet.id
'subnet-3bc4f211'
>>> quit()
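
Several later commands need IDs generated along the way: this subnet ID, and later an AMI ID and a volume ID. It can help to jot them into a small notes file as you go; the file name below is arbitrary.

echo "subnet_id=subnet-3bc4f211" >> /home/user/1kgenomes_tutorial/aws_ids.txt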

Setting up an AMI

Now that we have our VPC and subnet set up, we can move on to creating an Amazon Machine Image (AMI) that holds all of the software we need in our analysis.

The default StarCluster config expects an SSH key at ~/.ssh/mykey.rsa. We can create a new key and update the config file to use it.

anaconda2/bin/starcluster createkey -o ~/.ssh/testkey.rsa testkey
sed -i 's/\[key mykey\]/[key testkey]/' ~/.starcluster/config
sed -i 's|~/.ssh/mykey.rsa|~/.ssh/testkey.rsa|' ~/.starcluster/config
sed -i 's/mykey/testkey/' ~/.starcluster/config

To set up the AMI, we can start a small, single-node cluster. We can use the -N option to specify the subnet we created in the last step.

anaconda2/bin/starcluster start -s 1 -u onekg -m ami-52a0c53b -I t2.micro -N subnet-3bc4f211 ami_setup

SSH into the AWS cluster.

anaconda2/bin/starcluster sm ami_setup

Let's update the installed packages. When asked to update some configuration files (such as for grub-pc or cloud-init), choose to keep the current version.

sudo apt-get update
sudo apt-get upgrade

We can make some room on the machine by removing some unnecessary files.

rm /usr/lib/jvm/java-6-openjdk-amd64/jre/lib/amd64/server/classes.jsa /usr/local/cuda-5.0/doc/pdf/NPP_Library.pdf /usr/local/cuda-5.0/doc/pdf/CUDA_Toolkit_Reference_Manual.pdf /usr/local/src/cuda_5.0.35_linux_64_ubuntu11.10-1.run /usr/local/src/gridscheduler-scbuild.tar.gz /var/cache/apt/archives/python2.7-dev_2.7.3-0ubuntu3.9_amd64.deb /var/cache/apt/archives/emacs23-common_23.3+1-1ubuntu9.2_all.deb

Now we can install the software required for the analysis of the 1000 Genomes samples. The required software is Python 3 with the retrying, awscli, and boto3 packages; mdadm; bwa; samblaster; samtools; Java 8; the GATK JAR file; and the Sentieon toolkit.

Let's start by installing the mdadm (software RAID) package. Run the following commands and leave the postfix configuration unchanged.

apt-get install mdadm
apt-get autoremove

Now we will install Java 8.

add-apt-repository ppa:webupd8team/java
apt-get update
apt-get install oracle-java8-installer

Next, we can install Python 3 through the Anaconda3 bundle. Select no when asked whether to prepend the install location to /root/.bashrc; we will do this ourselves.

Download the install script.

wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh

Install the Anaconda3 Python distribution at /usr/local/bin/anaconda3 when prompted for a prefix. After Anaconda is installed, we will install the Python packages awscli, retrying, and boto3.

bash Anaconda3-4.2.0-Linux-x86_64.sh
rm Anaconda3-4.2.0-Linux-x86_64.sh
echo 'export PATH=/usr/local/bin/anaconda3/bin:$PATH' >> /root/.bashrc
echo 'export PATH=/usr/local/bin/anaconda3/bin:$PATH' >> /home/onekg/.bashrc
source /root/.bashrc
pip install awscli retrying boto3

Now, we can install bwa, samblaster and samtools.

# bwa
wget https://github.com/lh3/bwa/releases/download/v0.7.15/bwa-0.7.15.tar.bz2
tar -jxf bwa-0.7.15.tar.bz2
cd bwa-0.7.15/
make
cp bwa /usr/local/bin/
cd ..

# samblaster
wget https://github.com/GregoryFaust/samblaster/releases/download/v.0.1.24/samblaster-v.0.1.24.tar.gz
tar -zxf samblaster-v.0.1.24.tar.gz
cd samblaster-v.0.1.24/
make
cp samblaster /usr/local/bin/
cd ..

# samtools
wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
tar -jxf samtools-1.3.1.tar.bz2
cd samtools-1.3.1/
./configure && make install
cd ..

rm -r bwa-0.7.15.tar.bz2 samblaster-v.0.1.24.tar.gz samtools-1.3.1.tar.bz2 bwa-0.7.15 samblaster-v.0.1.24 samtools-1.3.1
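
As a quick sanity check that the builds landed on the PATH, you can query the installed tools (the version output will vary with the releases above).

which bwa samblaster samtools
samtools --version | head -n 1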

The rest of the software (the Sentieon executable and the GATK JAR file) should be on your local Linux machine. We can use StarCluster to transfer those files to our AWS cluster.

First, we will terminate our SSH connection with the AWS cluster.

hostname
# master
exit
hostname
# mylabhost.school.edu

Now that we are back at our local host, we can upload the rest of the software. Assuming that the software is in our home directory, we can run the following command.

anaconda2/bin/starcluster put ami_setup GenomeAnalysisTK-3.7.tar.bz2 sentieon-genomics-201611.tar.gz /root/

We can SSH back into our Amazon cluster and install the software we just transferred.

anaconda2/bin/starcluster sm ami_setup
tar -jxf GenomeAnalysisTK-3.7.tar.bz2
mv GenomeAnalysisTK.jar /usr/local/bin/
tar -zxf sentieon-genomics-201611.tar.gz
mv sentieon-genomics-201611/ /usr/local/src/
echo 'export PATH=/usr/local/src/sentieon-genomics-201611/bin/:$PATH' >> /home/onekg/.bashrc
echo 'export PATH=/usr/local/src/sentieon-genomics-201611/bin/:$PATH' >> /root/.bashrc
source ~/.bashrc

At this point we have installed all of the software we will need for the analysis. We can now make an EBS-backed image (AMI) of the instance. To do this, we need to find the instance ID of the master node, which we can find with the listclusters (lc) command. Remember to save the AMI ID when it appears on the terminal; we will need it later.

exit
anaconda2/bin/starcluster lc
# StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
# Software Tools for Academics and Researchers (STAR)
# Please submit bug reports to starcluster@mit.edu
# 
# -----------------------------------------
# ami_setup (security group: @sc-ami_setup)
# -----------------------------------------
# Launch time: 20xx-xx-xx xx:xx:xx
# Uptime: 0 days, xx:xx:xx
# VPC: xxx
# Subnet: xxx
# Zone: us-east-1a
# Keypair: testkey
# EBS volumes: N/A
# Cluster nodes:
#      master running i-03ac5ecc37e339411 ec2-54-89-177-15.compute-1.amazonaws.com
# Total nodes: 1
anaconda2/bin/starcluster ebsimage i-03ac5ecc37e339411 onekg-test-1

Now that we have the image, we can terminate the cluster. Remember to terminate the cluster after you are finished because you will be charged by the hour.

anaconda2/bin/starcluster terminate ami_setup

Creating a Persistent Disk

Now that we have our AMI set up, the final step before we can start to process the data is to create a persistent EBS disk that we can use to store our scripts, log files, and the human reference genome. We can create an EBS disk with StarCluster using the following commands. For the -m option, use the AMI ID that you created in the previous step of the tutorial.

anaconda2/bin/starcluster start -s 1 -u onekg -m ami-0d4da51b -I t2.micro -N subnet-3bc4f211 vol_setup

Find the instance ID with the listclusters command, as before. 40 GB should be plenty of space for the human reference genome, some scripts, and some log files. As with the subnet ID and the AMI ID, keep track of the ID of the volume that is created.

anaconda2/bin/starcluster listclusters
# StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
# Software Tools for Academics and Researchers (STAR)
# Please submit bug reports to starcluster@mit.edu
#
# -----------------------------------------
# vol_setup (security group: @sc-vol_setup)
# -----------------------------------------
# Launch time: 20xx-xx-xx xx:xx:xx
# Uptime: 0 days, xx:xx:xx
# VPC: xxx
# Subnet: xxx
# Zone: us-east-1a
# Keypair: testkey
# EBS volumes: N/A
# Cluster nodes:
#      master running i-0c4d57843a9b9a6af ec2-54-208-45-129.compute-1.amazonaws.com
# Total nodes: 1
anaconda2/bin/starcluster createvolume -n onekgData -H i-0c4d57843a9b9a6af 40 us-east-1a

Now we can set up our volume. The commands for downloading and indexing the reference genome are shown below. However, transferring prebuilt index files from another computer will be much faster than recreating them; if you have the index files elsewhere, it is recommended that you transfer them using starcluster put, as sketched after the block below.

# SSH in and create some directories
anaconda2/bin/starcluster sm vol_setup
mkdir -p /data
mount /dev/xvdz /data/
mkdir -p /data/Logs /data/Reference /data/Scripts

# Download the reference genome
cd /data/Reference
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
gzip -d hs37d5.fa.gz

# Index the reference genome
samtools faidx hs37d5.fa
samtools dict -s "Homo sapiens" -o hs37d5.dict hs37d5.fa
bwa index hs37d5.fa
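
If you already have the index files, a transfer along the following lines will be much faster than rebuilding them. This is only a sketch: run these commands from the local host (not from inside the SSH session), and adjust the local paths to wherever your copies of the standard samtools faidx, samtools dict, and bwa index outputs live.

# Sketch only: upload prebuilt index files instead of recreating them.
anaconda2/bin/starcluster put vol_setup hs37d5.fa.fai /data/Reference/
anaconda2/bin/starcluster put vol_setup hs37d5.dict /data/Reference/
anaconda2/bin/starcluster put vol_setup hs37d5.fa.amb /data/Reference/
anaconda2/bin/starcluster put vol_setup hs37d5.fa.ann /data/Reference/
anaconda2/bin/starcluster put vol_setup hs37d5.fa.bwt /data/Reference/
anaconda2/bin/starcluster put vol_setup hs37d5.fa.pac /data/Reference/
anaconda2/bin/starcluster put vol_setup hs37d5.fa.sa /data/Reference/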

In the scripts directory, we can download the scripts we will use for data processing.

cd /data/Scripts/
git clone https://github.com/DonFreed/1kgenomes-gvcfs.git

We can close our SSH connection to the instance and upload our Sentieon license file to the EBS volume. Putting the license on the EBS volume will allow it to be accessed by all of the worker nodes on our cluster.

exit
hostname
# mylabhost.school.edu
anaconda2/bin/starcluster put vol_setup Sentieon_License.lic /data/

Now, we can terminate this cluster and modify our configuration file so that future clusters will mount the EBS disk.

anaconda2/bin/starcluster terminate vol_setup
sed -i 's/#VOLUMES = oceandata, biodata/VOLUMES = onekgData/' ~/.starcluster/config
echo -e "\n[volume onekgData]\nVOLUME_ID = vol-079dc042663b95643\nMOUNT_PATH = /data\n" >> ~/.starcluster/config

Testing your Setup

Now that we have everything in place, we can test our setup by running a few samples through the analysis pipeline. We will go through the testing process using the Sentieon software below, but the process is very similar with the GATK. We will start by creating an AWS S3 bucket to store the results of our analysis.

anaconda2/bin/python
>>> import boto3
>>> session = boto3.session.Session(aws_access_key_id="AKIAIUITQJ4NCRISZQ", aws_secret_access_key="f+PJ+oufQofbIXk8Lhcxdj++uxROAOAJbNJWt1mi", region_name="us-east-1")
>>> s3 = session.resource('s3')
>>> s3.create_bucket(Bucket="onekg-gvcfs-test")
>>> quit()

Now we can run our analysis. We can start up a small (four-node) cluster for testing. We will use a t2.micro instance for the head node and m3.2xlarge instances for the worker nodes, which our testing has shown offer the best performance on this task at the lowest cost. Also note that we are using the -b option to set a bid price on the spot market. If the market price exceeds our bid, our worker nodes may be terminated, but that is OK; we can just restart them later.

anaconda2/bin/starcluster start -b 0.3 -s 4 -m ami-0d4da51b -n ami-0d4da51b -u onekg -I t2.micro -i m3.2xlarge -N subnet-3bc4f211 test_onekg
anaconda2/bin/starcluster sm test_onekg

Now that we are in the cluster, the first step is to run the setup_master.sh script. This script will handle configuration of all of the worker nodes and the scheduler.

cd /data/Scripts/1kgenomes-gvcfs
bash setup_master.sh

To check that the script worked correctly, run the following command. The complex value "ephemeral" should be very large.

qconf -se node001
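
To look at just that attribute, you can filter the output; the exact value will depend on the instance type's ephemeral storage, and the attribute appears on the complex_values line of the exec host configuration.

qconf -se node001 | grep complex_values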

We also need to modify our SGE submission script to set the SENTIEON_LICENSE environment variable.

sed -i 's|$@|SENTIEON_LICENSE=/data/Sentieon_License.lic\n$@|' general.sh

Now we are ready to run the analysis. Specify the destination bucket in AWS S3 (onekg-gvcfs-test), your AWS access key (AKIAIUITQJ4NCRISZQ), and your secret key (f+PJ+oufQofbIXk8Lhcxdj++uxROAOAJbNJWt1mi), and the data processing scripts should take care of the rest. For testing, be sure to set the --n_to_run argument to a low number; 6 or 10 should be sufficient.

python3 run_analysis.py --n_to_run 10 --sentieon onekg-gvcfs-test AKIAIUITQJ4NCRISZQ f+PJ+oufQofbIXk8Lhcxdj++uxROAOAJbNJWt1mi

Running the Analysis

The procedure for running the full analysis is very similar to the testing procedure. For the full analysis, more worker nodes are recommended. Also, remove the --n_to_run option from the last command shown above so that all of the data will be processed.
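
For example, a full run might look something like the following. This is only an illustration: the cluster size and the cluster name full_onekg are arbitrary, and the AMI, subnet, bucket, and key values are the same placeholders used earlier in this tutorial.

# From the local host: start a larger spot cluster (20 nodes here is arbitrary; scale to your budget).
anaconda2/bin/starcluster start -b 0.3 -s 20 -m ami-0d4da51b -n ami-0d4da51b -u onekg -I t2.micro -i m3.2xlarge -N subnet-3bc4f211 full_onekg
anaconda2/bin/starcluster sm full_onekg

# On the master node: configure the workers and scheduler, then process all samples (no --n_to_run).
cd /data/Scripts/1kgenomes-gvcfs
bash setup_master.sh
sed -i 's|$@|SENTIEON_LICENSE=/data/Sentieon_License.lic\n$@|' general.sh
python3 run_analysis.py --sentieon onekg-gvcfs-test AKIAIUITQJ4NCRISZQ f+PJ+oufQofbIXk8Lhcxdj++uxROAOAJbNJWt1mi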
