A Python toolkit providing best-practice pipelines for fully automated high-throughput sequencing analysis. You write a high-level configuration file specifying your inputs and analysis parameters. This input drives a parallel pipeline that handles distributed execution, idempotent processing restarts and safe transactional steps.
This document contains notes about how to install bcbio for use on AWS starting from:
- AMI: Ubuntu Server 14.04
- instance type: t2.medium (to allow indexing the human genome with hisat2)
- basic prerequisites installed using apt-get:
```bash
sudo apt-get update
sudo apt-get install -y curl wget git unzip tar gzip bzip2 xz-utils pigz
```
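After installing the prerequisites, a quick check can confirm that the commands are actually on the `PATH`. The sketch below is an optional addition; `missing_tools` is a hypothetical helper name, and `xz` is assumed to be the command provided by the `xz-utils` package.

```bash
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools curl wget git unzip tar gzip bzip2 xz pigz)"
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All prerequisites found."
fi
```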
The following steps install bcbio and the specified genome data on the filesystem of an instance. This is useful to:
- create an AMI for use on a single instance (e.g. with many cores)
```bash
sudo chmod 777 /usr/local
sudo chmod 777 /usr/local/share
sudo chmod 777 /usr/local/bin
wget https://raw.github.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir=/usr/local \
  --genomes GRCh37 --aligners bwa --aligners bowtie2
```
This installation mode will retrieve a Docker image with all tools and store it together with the specified data on the local machine.
The steps below will install bcbio tools (via a single Docker image rather than retrieving the dependencies separately via bioconda) and download the specified genome data.
This is useful to:
- generate a bcbio AMI for use on a single instance (e.g. with multiple cores)
```bash
# Install bcbio-vm into its own Miniconda directory
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh -b -p ~/install/bcbio-vm/anaconda
~/install/bcbio-vm/anaconda/bin/conda install --yes -c bioconda bcbio-nextgen-vm
sudo ln -s ~/install/bcbio-vm/anaconda/bin/bcbio_vm.py /usr/local/bin/bcbio_vm.py
sudo ln -s ~/install/bcbio-vm/anaconda/bin/conda /usr/local/bin/bcbiovm_conda

# Install Docker from the Docker apt repository
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates git
sudo apt-key adv \
  --keyserver hkp://ha.pool.sks-keyservers.net:80 \
  --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | \
  sudo tee /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install -y docker-engine
sudo service docker start

# Add the login user to the docker group
USERNAME=ubuntu
sudo gpasswd -a ${USERNAME} docker
newgrp docker  # starts a new shell with the docker group active

# Assign bcbio_vm.py to the docker group and set its setgid bit
sudo chgrp docker /usr/local/bin/bcbio_vm.py
sudo chmod g+s /usr/local/bin/bcbio_vm.py
```
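Group changes only take effect in a new login session or after `newgrp`, so it can help to confirm that the current shell really has the `docker` group before continuing. A minimal sketch, where `in_group` is a hypothetical helper:

```bash
# Return success if the current shell process belongs to group $1.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

if in_group docker; then
    echo "this shell has the docker group"
else
    echo "docker group not active here; log in again or run: newgrp docker"
fi
```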
Get the latest bcbio Docker image with software and tools, and also download the genome data.
The Docker image is stored in the system's default location (`/var/lib/docker`). The data is in the specified data directory.
```bash
# $HOME is used because bash does not expand ~ after --datadir=
bcbio_vm.py --datadir=$HOME/install/bcbio-vm/data saveconfig
bcbio_vm.py \
  install \
  --data \
  --tools \
  --genomes hg38 \
  --genomes mm10 \
  --aligners hisat2
```
- If you have an existing bcbio-nextgen installation and want to avoid re-installing existing genome data, omit the `--data` argument from the `bcbio_vm.py` call.
All of bcbio is installed on the `/shared` volume, which can be detached and reattached in another context.
This is useful to:
- reuse the bcbio installation and data, e.g. by making it available to all nodes on a cluster

The following steps:
- download and install bcbio and its dependencies on the `/shared` volume
- download the specified data on the `/shared` volume
```bash
sudo apt-get update
sudo apt-get install -y curl wget git unzip tar gzip bzip2 xz-utils pigz
```
The next commands format the volume attached as `/dev/xvdb`, so make sure this is the right device!
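Because formatting destroys any existing data, a small guard can be run first. The sketch below is an optional addition: `safe_to_format` is a hypothetical helper, and it relies on `lsblk` reporting the filesystem type, which may be empty in some environments, so treat the result as a hint rather than a guarantee.

```bash
# Succeed only if $1 is a block device on which lsblk reports no filesystem.
# safe_to_format is a hypothetical helper for illustration.
safe_to_format() {
    local dev="$1"
    if [ ! -b "$dev" ]; then
        echo "$dev is not a block device" >&2
        return 1
    fi
    local fstype
    fstype="$(lsblk -no FSTYPE "$dev" 2>/dev/null | tr -d '[:space:]')"
    if [ -n "$fstype" ]; then
        echo "$dev already holds a $fstype filesystem" >&2
        return 1
    fi
}

if safe_to_format /dev/xvdb; then
    echo "/dev/xvdb is a block device with no detected filesystem"
else
    echo "check /dev/xvdb before formatting it"
fi
```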
```bash
# Format and mount the volume
sudo mkfs -t ext4 /dev/xvdb
sudo mkdir /shared
sudo mount /dev/xvdb /shared
sudo mkdir /shared/bcbio
sudo chown ${USER} /shared/bcbio

# Install bcbio and its tools, without genome data
wget https://raw.github.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py \
  /shared/bcbio \
  --nodata \
  --isolate \
  --tooldir=/shared/bcbio
export PATH=/shared/bcbio/bin:$PATH
```
### 5. Add genome files
```bash
# upgrade the installation to add data
bcbio_nextgen.py upgrade \
  -u stable \
  --genomes mm10 \
  --genomes hg38
# add the hisat2 aligner
bcbio_nextgen.py upgrade \
  -u stable \
  --aligners hisat2
```
The system configuration `/shared/bcbio/galaxy/bcbio_system.yaml` is initialized when bcbio is first installed, but your instance or cluster may have changed since then. Check that its core and memory settings still match the machine you run on.
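To compare the configuration with the current machine, the instance's cores and memory can be printed next to the matching configuration entries. This is a minimal sketch for Linux; it assumes the file uses `cores:` and `memory:` keys, so adjust the pattern if yours differs.

```bash
# Report the cores and memory of the current instance.
cores="$(nproc)"
# MemTotal is reported in kB; convert it to whole GB.
mem_gb="$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)"
echo "instance: ${cores} cores, ${mem_gb} GB memory"

# Show the matching entries in the bcbio system configuration, if present.
config=/shared/bcbio/galaxy/bcbio_system.yaml
if [ -f "$config" ]; then
    grep -n -E '^[[:space:]]*(cores|memory):' "$config" \
        || echo "no cores/memory entries found in $config"
fi
```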
Add your samples, create a project configuration and then run the analysis from a work directory.
```bash
# -p also creates /shared/your-project if it does not exist yet
mkdir -p /shared/your-project/work
cd /shared/your-project/work
bcbio_nextgen.py ../config/your-project.yaml -n 16
```
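The project layout used above (a `config` directory next to `work`) can be prepared with a short helper. This is a sketch under assumptions: `init_project` is a hypothetical function, the input file paths and the sample name are placeholders, and the YAML shows only a minimal RNA-seq sample entry, so check the bcbio documentation for the options your analysis needs.

```bash
# Create config/ and work/ under a project directory and write a minimal
# sample configuration. init_project is a hypothetical helper.
init_project() {
    local project="$1"
    mkdir -p "$project/config" "$project/work"
    cat > "$project/config/your-project.yaml" <<'EOF'
# Placeholder values: replace the files and description with your own.
details:
  - description: sample1
    files: [/shared/your-project/input/sample1_R1.fastq.gz, /shared/your-project/input/sample1_R2.fastq.gz]
    genome_build: hg38
    analysis: RNA-seq
    algorithm:
      aligner: hisat2
upload:
  dir: ../final
EOF
}
```

Calling `init_project /shared/your-project` would then create the directories and the configuration file that the run command expects.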
Note: You can do all the bcbio installation and project setup on a smaller instance, shut that instance down, and spin up a larger one that matches your project needs for the actual run.
bcbio: simplified bcbio cloud usage
This installation mode will retrieve a Docker image with all tools and store it on the local machine, which needs to have sufficient disk space to accommodate it. The data can be stored on a different volume, e.g. `/shared`.
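Free space on the local machine can be checked before pulling the image. A minimal sketch, where `free_gb` is a hypothetical helper; it falls back to `/` when `/var/lib/docker` does not exist yet.

```bash
# Print the free space, in whole GB, of the filesystem that holds path $1.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 {printf "%d\n", $4 / 1024 / 1024}'
}

# /var/lib/docker may not exist before Docker is installed, so fall back to /.
docker_dir=/var/lib/docker
[ -d "$docker_dir" ] || docker_dir=/
echo "free space for Docker images: $(free_gb "$docker_dir") GB"
```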
The steps below are useful to:
- download the genomic data and store it on a separate volume for reuse
- generate a bcbio AMI (which needs to mount the data volume to be useful)
```bash
# Install bcbio-vm into its own Miniconda directory
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh -b -p ~/install/bcbio-vm/anaconda
~/install/bcbio-vm/anaconda/bin/conda install --yes -c bioconda bcbio-nextgen-vm
sudo ln -s ~/install/bcbio-vm/anaconda/bin/bcbio_vm.py /usr/local/bin/bcbio_vm.py
sudo ln -s ~/install/bcbio-vm/anaconda/bin/conda /usr/local/bin/bcbiovm_conda

# Install Docker from the Docker apt repository
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates git
sudo apt-key adv \
  --keyserver hkp://ha.pool.sks-keyservers.net:80 \
  --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | \
  sudo tee /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install -y docker-engine
sudo service docker start

# Add the login user to the docker group
USERNAME=ubuntu
sudo gpasswd -a ${USERNAME} docker
newgrp docker  # starts a new shell with the docker group active

# Assign bcbio_vm.py to the docker group and set its setgid bit
sudo chgrp docker /usr/local/bin/bcbio_vm.py
sudo chmod g+s /usr/local/bin/bcbio_vm.py
```
Get the latest bcbio docker image with software and tools, and also download genome data:
```bash
USERNAME=ubuntu  # set again: newgrp above started a new shell
sudo chown ${USERNAME} /shared
DATADIR=/shared/bcbio/data
mkdir -p ${DATADIR}
bcbio_vm.py --datadir=${DATADIR} saveconfig
bcbio_vm.py \
  --datadir=${DATADIR} \
  install \
  --data \
  --tools \
  --genomes hg38 \
  --genomes mm10 \
  --aligners hisat2 \
  --datatarget rnaseq \
  --datatarget variation
```
- If you have an existing bcbio-nextgen installation and want to avoid re-installing existing genome data, omit the `--data` argument from the `bcbio_vm.py` call.