tomsing1/bcbio_installation_on_aws.md

## bcbio_installation_on_aws.md

      
            
          
bcbio

A Python toolkit providing best-practice pipelines for fully automated high-throughput sequencing analysis. You write a high-level configuration file specifying your inputs and analysis parameters. This input drives a parallel pipeline that handles distributed execution, idempotent processing restarts and safe transactional steps.


bcbio documentation
bcbio GitHub page

Installation

This document contains notes about how to install bcbio for use on AWS starting from:

AMI: Ubuntu Server 14.04
instance type: t2.medium (to allow indexing the human genome with hisat2)
basic prerequisites installed using apt-get:

sudo apt-get update
sudo apt-get install -y curl wget git unzip tar gzip bzip2 xz-utils pigz
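
Before moving on, it can help to confirm that the prerequisite tools are actually on the PATH. The following is a minimal sketch, not part of the original notes: the helper name `missing_tools` is hypothetical, and it assumes that `xz` and `pigz` are the binaries provided by the `xz-utils` and `pigz` packages.

```bash
# Print the name of every listed command that is not found on PATH.
# (missing_tools is a hypothetical helper name used for illustration.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool names follow the apt-get line above.
missing=$(missing_tools curl wget git unzip tar gzip bzip2 xz pigz)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All prerequisite tools found."
fi
```

If anything is reported as missing, re-running the apt-get commands above is the first thing to try.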
I Standard bcbio installation with tools and data stored on the filesystem

The following steps install bcbio and the specified genome data on the instance's filesystem. This is useful to

create an AMI for use on a single instance (e.g. with many cores)

1. change permissions for the installation directories

sudo chmod 777 /usr/local
sudo chmod 777 /usr/local/share
sudo chmod 777 /usr/local/bin
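
To confirm that the permission change took effect for the current user, a quick check along these lines may be useful. This is only a sketch; `check_writable` is a hypothetical helper name, and the directory list simply mirrors the chmod commands above.

```bash
# Report whether each directory given as an argument exists and is
# writable by the current user. Returns non-zero if any check fails.
# (check_writable is a hypothetical helper name used for illustration.)
check_writable() {
  rc=0
  for dir in "$@"; do
    if [ -d "$dir" ] && [ -w "$dir" ]; then
      echo "ok: $dir"
    else
      echo "not writable: $dir"
      rc=1
    fi
  done
  return "$rc"
}

check_writable /usr/local /usr/local/share /usr/local/bin \
  || echo "Fix permissions before running the installer."
```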
2. install bcbio

wget https://raw.github.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir=/usr/local \
  --genomes GRCh37 --aligners bwa --aligners bowtie2
II Dockerized bcbio installation with data stored on the same filesystem

This installation mode retrieves a single Docker image containing all bcbio tools (rather than installing the dependencies separately via bioconda) and stores it, together with the specified genome data, on the local machine.
This is useful to

generate a bcbio AMI for use on a single instance (e.g. with multiple cores)

1. Install bcbio-vm within an isolated conda installation

wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh -b -p ~/install/bcbio-vm/anaconda
~/install/bcbio-vm/anaconda/bin/conda install --yes -c bioconda bcbio-nextgen-vm
sudo ln -s ~/install/bcbio-vm/anaconda/bin/bcbio_vm.py /usr/local/bin/bcbio_vm.py
sudo ln -s ~/install/bcbio-vm/anaconda/bin/conda /usr/local/bin/bcbiovm_conda
2. install docker and git

sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates git
sudo apt-key adv \
  --keyserver hkp://ha.pool.sks-keyservers.net:80 \
  --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | \
  sudo tee /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install -y docker-engine
sudo service docker start
3. add the user to the docker group

USERNAME=ubuntu
sudo gpasswd -a ${USERNAME} docker
newgrp docker
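
Group changes only apply to new login sessions (or to the shell started by `newgrp`), so it is worth checking that the current session really has the docker group before continuing. A minimal sketch, with the hypothetical helper name `session_has_group`:

```bash
# Succeed if the current shell session is running with the named group.
# `id -nG` without a user argument reports the groups of this process,
# which is what matters for access to the Docker socket.
# (session_has_group is a hypothetical helper name used for illustration.)
session_has_group() {
  id -nG | tr ' ' '\n' | grep -qx "$1"
}

if session_has_group docker; then
  echo "This session has the docker group."
else
  echo "This session does not have the docker group yet; log in again or run 'newgrp docker'."
fi
```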
4. Ensure the driver scripts have the right permissions

sudo chgrp docker /usr/local/bin/bcbio_vm.py
sudo chmod g+s /usr/local/bin/bcbio_vm.py
5. Install a dockerized bcbio-nextgen

Get the latest bcbio Docker image with software and tools, and also download the genome data.
The Docker image is stored in Docker's default location (/var/lib/docker); the data is stored in the specified data directory.

# ${HOME} is used because the shell does not expand ~ inside --datadir=~/...
bcbio_vm.py --datadir=${HOME}/install/bcbio-vm/data saveconfig

bcbio_vm.py \
  install \
  --data \
  --tools \
  --genomes hg38 \
  --genomes mm10 \
  --aligners hisat2

If you have an existing bcbio-nextgen installation and want to avoid re-installing
existing genome data, omit the --data argument from the bcbio_vm.py call.
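
Because the Docker image and the genome data can be large, checking the free space on the relevant filesystems before starting the download may save a failed install. The sketch below only reports free space; it does not assume any particular size requirement. The helper name `free_gib` is hypothetical, and `$HOME` stands in for the data directory, which may not exist yet.

```bash
# Print the free space, in whole GiB (rounded down), of the filesystem
# that holds the given path. Uses the POSIX output format of df.
# (free_gib is a hypothetical helper name used for illustration.)
free_gib() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# /var/lib/docker holds the image; the data directory lives under $HOME.
for path in /var/lib/docker "$HOME"; do
  if [ -e "$path" ]; then
    echo "$path: $(free_gib "$path") GiB free"
  fi
done
```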

III Standard bcbio installation on a new persistent volume

All of bcbio is installed on the /shared volume, which can be detached and reattached in another context.
This is useful to

reuse the bcbio installation and data, e.g. by making it available to all nodes on a cluster

The following steps

download and install bcbio and its dependencies on the /shared volume
download the specified data on the /shared volume

1. update / install system packages

sudo apt-get update
sudo apt-get install -y curl wget git unzip tar gzip bzip2 xz-utils pigz
2. initialize and mount shared directory

⚠️ This command will erase any preexisting data on /dev/xvdb!
sudo mkfs -t ext4 /dev/xvdb
sudo mkdir /shared
sudo mount /dev/xvdb /shared
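
Since `mkfs` is destructive, a small guard that refuses to proceed when the device is already mounted can prevent accidents. The following is a sketch under the assumption of a Linux system with /proc/mounts; `is_mounted` is a hypothetical helper name, and its optional second argument exists only to make the function easy to test. Running `lsblk -f` is another way to see whether the device already carries a filesystem.

```bash
# Succeed if DEVICE ($1) appears as a mounted source in a mounts table.
# The table defaults to /proc/mounts; $2 can point to another file.
# (is_mounted is a hypothetical helper name used for illustration.)
is_mounted() {
  awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

if [ -r /proc/mounts ] && is_mounted /dev/xvdb; then
  echo "/dev/xvdb is already mounted; do not run mkfs on it."
fi
```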
3. Create folders for your bcbio projects

sudo mkdir /shared/bcbio
sudo chown ${USER} /shared/bcbio
4. Install bioconda tools, but no data

wget https://raw.github.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py \
  /shared/bcbio \
  --nodata \
  --isolate \
  --tooldir=/shared/bcbio
export PATH=/shared/bcbio/bin:$PATH

### 5. Add genome files

```bash
# upgrade the installation to add data
bcbio_nextgen.py upgrade \
  -u stable \
  --genomes mm10 \
  --genomes hg38
```

6. Add aligners / indices

bcbio_nextgen.py upgrade \
  -u stable \
  --aligners hisat2
7. Edit the system configuration to match your instance

The system configuration /shared/bcbio/galaxy/bcbio_system.yaml is initialized when bcbio is first installed. If your instance or cluster has changed since then (for example, a different number of cores or amount of memory), update the file to match.
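
To decide which values to put into the configuration, it helps to look at what the current instance actually provides. The sketch below prints the core count and the memory available per core; it assumes a Linux system with `nproc` and /proc/meminfo, and the helper name `mem_per_core` is hypothetical. How these numbers map onto the `cores` and `memory` entries of bcbio_system.yaml should be checked against the bcbio documentation.

```bash
# Print "<cores> <GiB per core>" for a core count ($1) and a total
# memory size in KiB ($2). Integer division, so the result rounds down.
# (mem_per_core is a hypothetical helper name used for illustration.)
mem_per_core() {
  echo "$1 $(( $2 / 1024 / 1024 / $1 ))"
}

if command -v nproc >/dev/null 2>&1 && [ -r /proc/meminfo ]; then
  n_cores=$(nproc)
  total_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
  mem_per_core "$n_cores" "$total_kib"
fi
```

For example, `mem_per_core 16 67108864` (16 cores, 64 GiB) prints `16 4`.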
8. Do an analysis in /shared/your-project

Add your samples, create a project configuration and then run the analysis from a work directory.
mkdir -p /shared/your-project/work
cd /shared/your-project/work
bcbio_nextgen.py ../config/your-project.yaml -n 16
Note: You can do the bcbio installation and project setup on a smaller instance,
then shut that instance down and start a larger one that matches your project's needs
for the actual run.
Additional documentation:

bcbio: simplified bcbio cloud usage
IV Dockerized bcbio installation with data stored on a separate volume

This installation mode retrieves a Docker image with all tools and stores it on the local machine, which needs sufficient disk space to accommodate it. The data can be stored on a different volume, e.g. /shared.
The steps below are useful to

download the genomic data and store it on a separate volume for reuse
generate a bcbio AMI (that needs to mount the data volume to be useful)

1. Install bcbio-vm in an isolated conda installation within the user's home directory

wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh -b -p ~/install/bcbio-vm/anaconda
~/install/bcbio-vm/anaconda/bin/conda install --yes -c bioconda bcbio-nextgen-vm
sudo ln -s ~/install/bcbio-vm/anaconda/bin/bcbio_vm.py /usr/local/bin/bcbio_vm.py
sudo ln -s ~/install/bcbio-vm/anaconda/bin/conda /usr/local/bin/bcbiovm_conda
2. install docker and git

sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates git
sudo apt-key adv \
  --keyserver hkp://ha.pool.sks-keyservers.net:80 \
  --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | \
  sudo tee /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install -y docker-engine
sudo service docker start
3. add the user to the docker group

USERNAME=ubuntu
sudo gpasswd -a ${USERNAME} docker
newgrp docker
4. Ensure the driver scripts have the right permissions

sudo chgrp docker /usr/local/bin/bcbio_vm.py
sudo chmod g+s /usr/local/bin/bcbio_vm.py
5. Install a dockerized bcbio-nextgen

Get the latest bcbio docker image with software and tools,
and also download genome data:
sudo chown ${USERNAME} /shared
DATADIR=/shared/bcbio/data
mkdir -p ${DATADIR}
bcbio_vm.py --datadir=${DATADIR} saveconfig

bcbio_vm.py \
  --datadir=${DATADIR} \
  install \
  --data \
  --tools \
  --genomes hg38 \
  --genomes mm10 \
  --aligners hisat2 \
  --datatarget rnaseq \
  --datatarget variation

If you have an existing bcbio-nextgen installation and want to avoid re-installing
existing genome data, omit the --data argument from the bcbio_vm.py call.